today i take a deep tour of IPFS (an 8-year-old distributed filesystem built on - and the origin of - libp2p).
in one sentence, IPFS is:
a suite of protocols and specs for organizing|representing, and transferring|routing data through content addressing, which is mapped to peer addresses through a distributed hash table (DHT).
the IPFS protocol is an alternative to location-based addressing (e.g., a CDN serving HTTP requests), providing low-latency access to replicated data retrieved from multiple locations.
being a hacker for over two decades, i regret losing some of my work | history along the way because it lived on some random cloud | third-party servers that are no longer available. when you create, you transfer part of your humanity | life-time to your creation, so losing that feels like losing old photographs (here is a guide on tor and security i wrote a decade ago that is still up 🙂).
i also see a cyberpunk future where IPFS is a main backbone of dweb(s), securing individuals’ data sovereignty. i invite you to be part of it.
finally, this post was inspired by this tweet:
00. the ipfs protocol
00.0000. UnixFS, MFS, IPLD
00.0001. blocks, CID, multiformats
00.0010. distributed hash table
00.0011. bitswap
00.0100. IPNS
00.0101. gateways
01. running a node
01.0110. install && setup
01.0111. bootstrapping
01.1000. adding objects
01.1001. pinning objects
01.1010. downloading objects
01.1011. security && privacy
10. final thoughts
10.1100. the filecoin blockchain
10.1101. deploy a website in IPFS in 5 minutes
10.1110. resources && open questions
you have some data you want to upload to IPFS: how does it become content addressed?
larger files are split into 256 KB chunks, called IPFS objects, each addressed by the 32-byte digest of its data (from the default SHA2-256, encoded in base32, read on…); a root object then links the chunks back together into the original data.
using standards from the unix file system (UnixFS), IPFS can chunk and link data too big to fit in a single block, and use the chunked representation to store and manage the data.
in addition, the mutable file system (MFS) is a built-in tool that lets you address files as in a regular name-based filesystem, i.e., it abstracts away the work of updating links and hashes when you edit, create, or remove objects.
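to get a feel for MFS, here is a minimal sketch with the IPFS CLI (assuming a running node; notes.txt and the CID are placeholders):

ipfs add notes.txt                          # content-address a local file → <CID>
ipfs files mkdir /docs                      # create a directory in your MFS root
ipfs files cp /ipfs/<CID> /docs/notes.txt   # reference the content by a human-friendly path
ipfs files ls /docs                         # browse by name, like a regular filesystem
ipfs files stat /docs                       # note how the directory’s own CID updates on every change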
💡 UnixFS is a protocol-buffers (protobuf) based format for describing objects (i.e., files, directories, and symlinks) in IPFS.
this is what the UnixFSv1 data structure looks like:
message Data {
  enum DataType {
    Raw = 0;
    Directory = 1;
    File = 2;
    Metadata = 3;
    Symlink = 4;
    HAMTShard = 5;
  }

  required DataType Type = 1;
  optional bytes Data = 2;
  optional uint64 filesize = 3;
  repeated uint64 blocksizes = 4;
  optional uint64 hashType = 5;
  optional uint64 fanout = 6;
  optional uint32 mode = 7;
  optional UnixTime mtime = 8;
}

message Metadata {
  optional string MimeType = 1;
}

message UnixTime {
  required int64 Seconds = 1;
  optional fixed32 FractionalNanoseconds = 2;
}
in this paradigm, where “client/server” no longer holds up, an object divided into blocks by a chunker is arranged in a tree-like structure using link nodes to tie them together.
through the interplanetary linked data (IPLD), a meta-format for encoding and decoding merkle-linked data, IPFS represents relationships between objects’ content-addressed data (such as file directories and symlinks).
💡 a merkle tree is a tree in which every leaf node is labeled with the cryptographic hash of a data block, and every node that is not a leaf (branch or inode) is labeled with the hash of the labels of its child nodes.
more specifically, the chunked representation is used to store and manage data on a CID-based directed acyclic graph called a merkle DAG (CID is IPFS’s content ID, an absolute pointer to content, explained in the next section).
💡 merkle DAGs are similar to merkle trees, but there are no balance requirements, and every node can carry a payload.
the returned CID is the hash of the root node in the DAG:
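to see this in practice, here is a rough sketch with the CLI (assuming a running node; big.iso and the root CID are placeholders):

ipfs add --chunker=size-262144 big.iso   # chunk into 256 KB blocks (this is also the default chunker)
ipfs ls <root-CID>                       # list the root node’s links, i.e., the child blocks
ipfs dag get <root-CID>                  # dump the root node of the merkle DAG (dag-pb / UnixFS)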
in IPFS, a block is a single unit of data associated with a unique and immutable content identifier representing the data itself (the hash + the codec of the data) as an address string.
each block’s CID is calculated using multiformats, which combine the multihash of the data (info on the hash algorithm) with its codec.
speaking code, this is the structure that represents the above:
type DecodedMultihash struct {
	Code   uint64 // hash algorithm identifier (e.g., 0x12 for SHA2-256)
	Name   string // hash algorithm name (e.g., sha2-256)
	Length int    // digest length (e.g., 32 bytes)
	Digest []byte // the raw hash digest
}
and here are the specs for codec versions:
CID v0 codec: base58-encoded SHA2-256 (i.e., a 46-character string starting with Qm).

💡 base58btc was developed for bitcoin, with reading-friendly properties: for instance, the digit zero and the capital letter O are not included.

CID v1 codec (flag ipfs add --cid-version 1)
brings multibase prefixes (info on how the hashed data is encoded), such as:
b - base32 --> CIDs start with ba
z - base58
f - base16
along with IPLD multicodec prefixes (info on how to interpret the hashed data after it has been fetched - here is the complete list), such as:
0x55 - raw (raw binary)
0x70 - dag-pb (merkleDAG protobuf)
0x71 - dag-cbor (merkleDAG cbor)
0x78 - git object
0x90 - eth-block (ethereum block)
0x93 - eth-tx (ethereum transaction)
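putting the two versions together, a quick CLI sketch (the file name and CIDs are placeholders):

ipfs add file.txt                   # → CIDv0: base58, starts with Qm...
ipfs add --cid-version 1 file.txt   # → CIDv1: base32 multibase, starts with b...
ipfs cid base32 <CIDv0>             # convert an existing CIDv0 to its CIDv1 base32 form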
now, let’s talk a bit about how IPFS finds a given CID on the network through peer routing (each peer is identified by a public peerID, the IPFS node’s unique multihash identifier derived from the peer’s public key and linked to its IP and port).
the IPFS protocol is backed by a big (libp2p) distributed hash table (DHT) called kademlia, which maps CIDs to peerIDs so that the data chunks can be accessed by:
an IPFS path ipfs://<CID>,
a gateway url https://ipfs.io/ipfs/<CID>,
or your node’s cli, etc.
💡 libp2p is a networking framework for the development of p2p applications, consisting of a collection of protocols, specifications, and libraries for flexible addressing, transport agnosticism, customizable security, peer identity and routing, content discovery, and NAT traversal.
🫱🏻🫲🏽 when a new object is uploaded to IPFS: the peer announces to the network that it has new content by adding an entry to the DHT that maps from CID to its IP address.
🫱🏻🫲🏽 when a user wants to download new data: they would look up its CID in the DHT, find peers’ IP addresses, and download the data directly from them.
💡 DHT is said to be “distributed” because no single node (peer) in the network holds the entire table. each node (peer) stores a subset of the table and the information on where to find the rest.
the DHT’s distance metric, i.e., the distance between two peerIDs, is a logical distance used to classify the nearest peers. this distance function is computed by applying an XOR operation to the ids and then converting the result to a ranking integer.
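for intuition, with toy 4-bit ids: XOR(0110, 0011) = 0101 = 5, while XOR(0110, 0111) = 0001 = 1, so the peer with id 0111 is “closer” to 0110 than the peer with id 0011 - a logical closeness that has nothing to do with geography.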
💡 if an IPFS node does not implement the DHT, delegated content routing to another server or process over HTTP(S) can be configured.
💡 bitswap is IPFS’s message-based (libp2p) protocol for data transfer (as opposed to, for example, HTTP’s request-response paradigm).
with bitswap, messages (e.g., “who has this CID?”) are transmitted to all connected nodes (peers) as an alternative to traversing the DHT.
the node can then decide if the message should be stored (e.g., in a wantlist), processed, or discarded.
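you can peek at this machinery on a running node (a sketch; the output depends on what is being transferred):

ipfs bitswap wantlist               # blocks your node is currently asking peers for
ipfs bitswap wantlist -p <peerID>   # what a specific connected peer wants from you
ipfs bitswap stat                   # blocks sent/received, wantlist size, partners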
when an object gets updated, and a new CID is generated (changing the IPFS URI), how does the node inform peers?
IPNS is a protocol that maps the hash of the peer’s public key to its (mutable) content through a mutable pointer.
IPNS records contain the content path they link to (i.e., /ipfs/<CID>) plus metadata (such as the expiration, the version number, and a cryptographic signature signed by the corresponding private key). new IPNS records can be signed and published at any point by the private key holder.
by the way, IPNS is the third type of key-value pairing updated and found using the same DHT protocol (i.e., kademlia):
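in practice, publishing under a dedicated key looks roughly like this (the key name, CID, and resulting /ipns/ hash are placeholders):

ipfs key gen mysite                          # create a new keypair for this record
ipfs name publish --key=mysite /ipfs/<CID>   # sign and publish an IPNS record → /ipns/<key-hash>
ipfs name resolve /ipns/<key-hash>           # follow the mutable pointer back to /ipfs/<CID>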
IPFS gateways allow applications that do not support IPFS to fetch data from the IPFS network through an HTTP interface.
for instance, to make a website hosted on IPFS more accessible, you can put it inside a directory and create a DNSLink record for its CID (e.g., see cloudflare’s integration). DNSLink uses DNS TXT records to map a DNS name to an IPFS address, so you can always point the name to the latest version of an IPFS object (e.g., dnslink=/ipfs/<CID>).
end-users can then make requests to a universal gateway URL such as https://cf-ipfs.com/ipns/en.wikipedia-on-ipfs.org/ and have their requests translated to the correct CID in the background.
💡 while a gateway with a DNSLink record (a restricted gateway) is restricted to a particular piece of content (either a specific CID or an IPNS hostname), a universal gateway allows users to access any content hosted on the IPFS network.
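for example, you can inspect wikipedia’s DNSLink record and fetch the mirror through a gateway yourself (a quick sketch; the exact CID changes whenever the mirror is updated):

dig +short TXT _dnslink.en.wikipedia-on-ipfs.org        # → dnslink=/ipfs/<CID>
curl -L https://ipfs.io/ipns/en.wikipedia-on-ipfs.org/  # the gateway resolves the name and serves the content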
install the IPFS CLI using these instructions (note that you also have the option to run a GUI or inside a Docker container).
start your local node with:
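assuming the IPFS CLI from the instructions above, this is the init step (the daemon itself is started in the bootstrapping section below):

ipfs init    # generates your node’s identity keypair and local repository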
this command creates the directory ~/.ipfs holding the information of your IPFS node (config, keys, and local datastore):
your node is ready to get started. if you plan to run it intermittently, you will want to configure the ports in your firewall (likewise if it runs on a VPS).
additionally, you might want to set up your node’s HTTP gateway (or an nginx proxy + CORS) to allow HTTP requests to access the content (i.e., allowing fetches to something like http://<your_ip>:8080/ipfs/<CID>).
a useful CLI command for this is ipfs config show.
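for instance, to check and change the addresses your node listens on (a sketch using the standard config keys of the reference go implementation; restart the daemon after editing):

ipfs config Addresses.Gateway                          # default: /ip4/127.0.0.1/tcp/8080
ipfs config Addresses.Gateway /ip4/0.0.0.0/tcp/8080    # expose the HTTP gateway on all interfaces
ipfs config show | grep -A 5 Addresses                 # sanity-check the whole Addresses block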
💡 nodes running on residential WiFi tend to have long wait times or high rates of failure because the rest of the network can find it difficult to connect through their NAT (i.e., the internet router, or how the modem knows which private IP / device is acting as a server). this could be ameliorated by setting up port forwarding on the router, directing external connections to port 4001 (or by moving the node to a hosted server).
for completeness, here is what the IPFS desktop GUI looks like, if you choose to go that direction instead:
when you first connect to the internet with a new client or server (by starting the IPFS daemon), your node will be bootstrapped with a collection of IP addresses to get you started (and you can also define your own bootstrap nodes):
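a sketch of what that looks like from the CLI:

ipfs daemon &          # start the node; it dials the default bootstrap peers
ipfs bootstrap list    # the hardcoded multiaddresses used to join the network
ipfs swarm peers       # the peers you are currently connected to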
if you are serious about running your node, you might want to run the IPFS daemon in the background with some process manager (such as supervisord in linux).
at this point, the add, pin, and cat commands are the most significant IPFS functions, but here is more fun stuff:
the IPFS add command creates a merkle DAG for the objects (following the UnixFS data format):
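for example (file and directory names are placeholders; -r adds a directory recursively and --only-hash computes CIDs without storing the blocks):

ipfs add notes.txt               # chunk + hash → "added <CID> notes.txt"
ipfs add -r website/             # add a whole directory; the last CID printed is the DAG’s root
ipfs add --only-hash notes.txt   # just compute the CID, don’t write to the local datastore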
💡 pinning is the process of specifying what data is to be persisted on one or more IPFS nodes. this ensures that data is accessible indefinitely and will not be removed during the IPFS garbage collection.
with add, the content does not automatically replicate across the network. you need to pin the CID to your local DHT, associating it with your IP address/port (or accessible endpoints obtained from NAT traversal):
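a minimal sketch (the CID is a placeholder):

ipfs pin add <CID>               # pin recursively (the default), so GC won’t remove the blocks
ipfs pin ls --type=recursive     # list what your node is committed to keeping
ipfs pin rm <CID>                # unpin; the blocks become eligible for GC again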
💡 once your CID is uploaded to other nodes’ DHT (i.e., once another node downloads the content), it can’t be removed from the network. however, if you are the only node hosting the content, you can unpin it and remove it from your DHT.
pinning allows the node to advertise that it has the CID (a continuous process that repeats every ~12 hours).
💡 if many minutes have passed since objects were uploaded to an IPFS node and they’re still not discoverable by other gateways, it’s possible the node is having trouble announcing the objects to the rest of the network.
you can make sure the content is pinned by running:
ipfs pin add <CID>
or you can force the announcement by running:
ipfs dht provide -rv <CID>
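and to check that the record actually reached the DHT, you can ask who is providing the CID (in newer versions of the CLI this command lives under ipfs routing findprovs):

ipfs dht findprovs <CID>    # lists the peerIDs currently announcing this CID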
finally, when working with production and/or large amounts of data, you might want to use a (paid) pinning service that runs many IPFS nodes. here are some examples: pinata, infura, nft.storage, fleek, and filebase.
when you request an object (e.g., by ipfs get <CID>, or by typing ipfs://<CID> in your brave browser or at an IPFS gateway), the IPFS client:
consults its local DHT table to see where this CID is located and gets back a collection of IP addresses/ports, or
asks connected peers over bitswap, or
makes an HTTP call to a delegated routing server.
once the mapping is retrieved, block fetching is started. the client downloads chunks of the content from each peer. the client also verifies the hashes for each block.
once all the blocks are retrieved, the local node is able to reconstruct the merkle DAG and replicate the requested content.
once the client has the content, and if it supports acting as a DHT server, it will update its local DHT table with the CID. this update is then propagated across the peers (even if the content is not explicitly pinned).
as a public network, participating IPFS nodes (peers) use CIDs to store and advertise data through publicly viewable protocols, without protecting the information about CIDs.
traffic and metadata can be monitored to infer information about the network and the users. while IPFS traffic between nodes is encrypted, the metadata the nodes publish to the DHT is public, including:
CIDs of data that they're providing or retrieving, and
their peerIDs: it’s possible to do a DHT lookup on a peerID and, particularly if a node is running from a static location, reveal its IP address.
although adding built-in privacy layers to decentralized networks is a complex topic and usually conflicts with the modular design of the network, some basic security measures can help to make your node more private:
disable CID re-providing, which is on by default (as your node temporarily becomes a host for the content until its local cache is cleared; see the config sketch after this list),
encrypt data at rest (content-encryption), and
use an external IP address that is not connected to anything else in your life.
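for the first item, the relevant knob is the reprovider settings in the node config (a sketch assuming the reference go implementation’s config layout; double-check the docs for your version):

ipfs config Reprovider.Strategy pinned   # only announce content you explicitly pinned
ipfs config Reprovider.Interval 0        # or stop periodic re-announcements entirely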
in addition, if you are not running a local node but, instead, loading IPFS through a public gateway (e.g., with brave or from this list), the gateway can still see your IPFS browsing activity (including logging your IP address and browser).
if of interest, you can also check some encryption-based projects on top of IPFS, such as ceramic and orbitDB.
finally, as a dweb hacker, i find the subject of privacy and security in decentralized webs fascinating. at some point, i would like to write more about threat models and poc-ing nodes’ metadata analysis. for now, however, stay tuned for a soon-to-be-published post where i will be talking more in-depth about a similar topic: my research on ethereum validator privacy.
for long-term storage, a step further from pinning services is the filecoin blockchain, a decentralized storage network built on top of IPFS where providers rent their storage space to clients (through a deal on how much data will be stored, for how long, and at what cost).
the verifiable storage in filecoin works under a general consensus named “proof-of-storage”, which is divided into “proof-of-replication” and “proof-of-spacetime” and is worth its own mirror post.
since the retrieval process might be slower than an IPFS pinning service and there is a minimum accepted file size, combined IPFS + filecoin solutions are also available. examples: estuary, web3.storage, and chainsafe storage.
for completeness, here are some alternative solutions with a more specific focus or use of a specific data storage mechanism: arweave, bittorrent, and swarm.
for anyone who somehow made it to the end of this article only wanting some sort of “front-end” IPFS deployment how-to guide, here we go…
let’s use fleek to deploy a website hosted on github to IPFS in 111 visual steps:
if you enjoyed IPFS, i invite you to check out these further resources:
the library genesis and some other enjoyable IPFS archives:
and to think about these:
IPFS lacks a fully functional search engine. what could be a workaround?
hydras and ipfs: a decentralized playground for malware, arxiv:1905.11880, IPFS: The New Hotbed of Phishing by trustwave, and cyber criminal adoption of IPFS for phishing, malware campaigns by talos.
in a following post, i will be diving more into peer-to-peer connectivity && libp2p. stay tuned.