today i take a deep tour of IPFS (an 8-year-old distributed filesystem built on - and the origin of - libp2p).
in one sentence, IPFS is:
a suite of protocols and specs for organizing|representing, and transferring|routing data through content addressing, which is mapped to peer addresses through a distributed hash table (DHT).
the IPFS protocol is an alternative to location-based addressing (e.g., a CDN serving HTTP requests), providing low-latency access to replicated data retrieved from multiple locations.
being a hacker for over two decades, i regret losing some of my work | history along the way because it lived on some random cloud | third-party servers that are no longer available. when you create, you transfer part of your humanity | life-time to your creation, so losing that feels like losing old photographs (here is a guide on tor and security i wrote a decade ago that is still up 🙂).
i also see a cyberpunk future where IPFS is a main backbone of dweb(s), securing individuals’ data sovereignty. i invite you to be part of it.
finally, this post was inspired by this tweet:
00. the ipfs protocol
00.0000. UnixFS, MFS, IPLD
00.0001. blocks, CID, multiformats
00.0010. distributed hash table
00.0011. bitswap
00.0100. IPNS
00.0101. gateways
01. running a node
01.0110. install && setup
01.0111. bootstrapping
01.1000. adding objects
01.1001. pinning objects
01.1010. downloading objects
01.1011. security && privacy
10. final thoughts
10.1100. the filecoin blockchain
10.1101. deploy a website in IPFS in 5 minutes
10.1110. resources && open questions
you have some data you want to upload to IPFS: how does it become content addressed?
larger files are split into 256 KB chunks, called IPFS objects, each addressed by the 32-byte digest of its data (from the default SHA2-256, encoded in base32, read on…); a root object then links the chunks back together into the original data.
using standards from the unix file system (UnixFS), IPFS can chunk and link data too big to fit in a single block, and use the chunked representation to store and manage the data.
in addition, the mutable file system (MFS) is a built-in tool that lets you address files as in a regular name-based filesystem, i.e., it abstracts away the work of updating links and hashes when you edit, create, or remove objects.
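to get a feel for MFS, here is a minimal sketch with the IPFS CLI (assuming a running node; notes.txt and the CID are placeholders):

ipfs add notes.txt                          # content-address a local file → <CID>
ipfs files mkdir /docs                      # create a directory in your MFS root
ipfs files cp /ipfs/<CID> /docs/notes.txt   # reference the content by a human-friendly path
ipfs files ls /docs                         # browse by name, like a regular filesystem
ipfs files stat /docs                       # note how the directory’s own CID updates on every change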
💡 UnixFS is a protocol-buffers (protobuf) based format for describing objects (i.e., files, directories, and symlinks) in IPFS.
this is what the UnixFSv1 data structure looks like:
message Data {
  enum DataType {
    Raw = 0;
    Directory = 1;
    File = 2;
    Metadata = 3;
    Symlink = 4;
    HAMTShard = 5;
  }

  required DataType Type = 1;
  optional bytes Data = 2;
  optional uint64 filesize = 3;
  repeated uint64 blocksizes = 4;
  optional uint64 hashType = 5;
  optional uint64 fanout = 6;
  optional uint32 mode = 7;
  optional UnixTime mtime = 8;
}

message Metadata {
  optional string MimeType = 1;
}

message UnixTime {
  required int64 Seconds = 1;
  optional fixed32 FractionalNanoseconds = 2;
}
in this paradigm, where “client/server” no longer holds up, an object divided into blocks by a chunker is arranged in a tree-like structure using link nodes to tie them together.
through the interplanetary linked data (IPLD), a meta-format for encoding and decoding merkle-linked data, IPFS represents relationships between objects’ content-addressed data (such as file directories and symlinks).
💡 a merkle tree is a tree in which every leaf node is labeled with the cryptographic hash of a data block, and every node that is not a leaf (branch or inode) is labeled with the hash of the labels of its child nodes.
more specifically, the chunked representation is used to store and manage data on a CID-based directed acyclic graph called a merkle DAG (CID is IPFS’s content ID, an absolute pointer to content, explained in the next section).
💡 merkle DAGs are similar to merkle trees, but there are no balance requirements, and every node can carry a payload.
the returned CID is the hash of the root node in the DAG:
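to see this in practice, here is a rough sketch with the CLI (assuming a running node; big.iso and the root CID are placeholders):

ipfs add --chunker=size-262144 big.iso   # chunk into 256 KB blocks (this is also the default chunker)
ipfs ls <root-CID>                       # list the root node’s links, i.e., the child blocks
ipfs dag get <root-CID>                  # dump the root node of the merkle DAG (dag-pb / UnixFS)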
in IPFS, a block is a single unit of data associated with a unique and immutable content identifier representing the data itself (the hash + the codec of the data) as an address string.
each block’s CID is calculated using multiformats, which combine the multihash of the data (info on the hash algorithm) with its codec.
speaking code, this is the structure that represents the above:
type DecodedMultihash struct {
	Code   uint64 // hash algorithm identifier (e.g., 0x12 for SHA2-256)
	Name   string // hash algorithm name (e.g., sha2-256)
	Length int    // digest length (e.g., 32 bytes)
	Digest []byte // the raw hash digest
}
and here are the specs for codec versions:
CID v0 codec: base58-encoded SHA2-256 (i.e., a 46-character string starting with Qm).

💡 base58btc was developed for bitcoin, with reading-friendly properties: for instance, the digit zero and the capital letter O are not included.

CID v1 codec (flag ipfs add --cid-version 1)
brings multibase prefixes (info on how the hashed data is encoded), such as:
b - base32 --> CIDs start with ba
z - base58
f - base16
along with IPLD multicodec prefixes (info on how to interpret the hashed data after it has been fetched - here is the complete list), such as:
0x55 - raw (raw binary)
0x70 - dag-pb (merkleDAG protobuf)
0x71 - dag-cbor (merkleDAG cbor)
0x78 - git object
0x90 - eth-block (ethereum block)
0x93 - eth-tx (ethereum transaction)
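putting the two versions together, a quick CLI sketch (the file name and CIDs are placeholders):

ipfs add file.txt                   # → CIDv0: base58, starts with Qm...
ipfs add --cid-version 1 file.txt   # → CIDv1: base32 multibase, starts with b...
ipfs cid base32 <CIDv0>             # convert an existing CIDv0 to its CIDv1 base32 form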
now, let’s talk a bit about how IPFS finds a given CID on the network through peer routing (each peer is identified by a public peerID, the IPFS node’s unique multihash identifier derived from the peer’s public key and linked to its IP and port).
the IPFS protocol is backed by a big (libp2p) distributed hash table (DHT) called kademlia, which maps CIDs to peerIDs so that the data chunks can be accessed by:
an IPFS path ipfs://<CID>,
a gateway url https://ipfs.io/ipfs/<CID>,
or your node’s cli, etc.
💡 libp2p is a networking framework for the development of p2p applications, consisting of a collection of protocols, specifications, and libraries for flexible addressing, transport agnosticism, customizable security, peer identity and routing, content discovery, and NAT traversal.
🫱🏻🫲🏽 when a new object is uploaded to IPFS: the peer announces to the network that it has new content by adding an entry to the DHT that maps from CID to its IP address.
🫱🏻🫲🏽 when a user wants to download new data: they would look up its CID in the DHT, find peers’ IP addresses, and download the data directly from them.
💡 DHT is said to be “distributed” because no single node (peer) in the network holds the entire table. each node (peer) stores a subset of the table and the information on where to find the rest.
the DHT’s distance metric, i.e., the distance between two peerIDs, is a logical distance used to classify the nearest peers. this distance function is computed by applying an XOR operation to the ids and then converting the result to a ranking integer.
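for intuition, with toy 4-bit ids: XOR(0110, 0011) = 0101 = 5, while XOR(0110, 0111) = 0001 = 1, so the peer with id 0111 is “closer” to 0110 than the peer with id 0011 - a logical closeness that has nothing to do with geography.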
💡 if an IPFS node does not implement the DHT, delegated content routing to another server or process over HTTP(S) can be configured.
💡 bitswap is IPFS’s message-based (libp2p) protocol for data transfer (as opposed to, for example, HTTP’s request-response paradigm).
with bitswap, messages (e.g., “who has this CID?”) are transmitted to all connected nodes (peers) as an alternative to traversing the DHT.
the node can then decide if the message should be stored (e.g., in a wantlist), processed, or discarded.
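you can peek at this machinery on a running node (a sketch; the output depends on what is being transferred):

ipfs bitswap wantlist               # blocks your node is currently asking peers for
ipfs bitswap wantlist -p <peerID>   # what a specific connected peer wants from you
ipfs bitswap stat                   # blocks sent/received, wantlist size, partners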
when an object gets updated, and a new CID is generated (changing the IPFS URI), how does the node inform peers?
IPNS is a protocol that maps the hash of the peer’s public key to its (mutable) content through a mutable pointer.
IPNS records contain the content path they link to (i.e., /ipfs/<CID>) plus metadata (such as the expiration, the version number, and a cryptographic signature signed by the corresponding private key). new IPNS records can be signed and published at any point by the private key holder.
by the way, IPNS is the third type of key-value pairing updated and found using the same DHT protocol (i.e., kademlia):
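in practice, publishing under a dedicated key looks roughly like this (the key name, CID, and resulting /ipns/ hash are placeholders):

ipfs key gen mysite                          # create a new keypair for this record
ipfs name publish --key=mysite /ipfs/<CID>   # sign and publish an IPNS record → /ipns/<key-hash>
ipfs name resolve /ipns/<key-hash>           # follow the mutable pointer back to /ipfs/<CID>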
IPFS gateways allow applications that do not support IPFS to fetch data from the IPFS network through an HTTP interface.
for instance, to make a website hosted on IPFS more accessible, you can put it inside a directory and create a DNSLink record for its CID (e.g., see cloudflare’s integration). DNSLink uses DNS TXT records to map a DNS name to an IPFS address, so you can always point the name to the latest version of an IPFS object (e.g., dnslink=/ipfs/<CID>).
end-users can then make requests to a universal gateway URL such as https://cf-ipfs.com/ipns/en.wikipedia-on-ipfs.org/ and have their requests translated to the correct CID in the background.
💡 while a gateway with a DNSLink record (a restricted gateway) is restricted to a particular piece of content (either a specific CID or an IPNS hostname), a universal gateway allows users to access any content hosted on the IPFS network.
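for example, you can inspect wikipedia’s DNSLink record and fetch the mirror through a gateway yourself (a quick sketch; the exact CID changes whenever the mirror is updated):

dig +short TXT _dnslink.en.wikipedia-on-ipfs.org        # → dnslink=/ipfs/<CID>
curl -L https://ipfs.io/ipns/en.wikipedia-on-ipfs.org/  # the gateway resolves the name and serves the content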
install the IPFS CLI using these instructions (note that you also have the option to run a GUI or inside a Docker container).
start your local node with:
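assuming the IPFS CLI from the instructions above, this is the init step (the daemon itself is started in the bootstrapping section below):

ipfs init    # generates your node’s identity keypair and local repository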
this command creates the directory ~/.ipfs holding the information of your IPFS node (config, keys, and local datastore):
your node is ready to get started. if you plan to run it intermittently, you will want to configure the ports in your firewall (likewise if it runs on a VPS).
additionally, you might want to set up your node’s HTTP gateway (or an nginx proxy + CORS) to allow HTTP requests to access the content (i.e., allowing fetches to something like http://<your_ip>:8080/ipfs/<CID>).
a useful CLI command for this is ipfs config show.
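for instance, to check and change the addresses your node listens on (a sketch using the standard config keys of the reference go implementation; restart the daemon after editing):

ipfs config Addresses.Gateway                          # default: /ip4/127.0.0.1/tcp/8080
ipfs config Addresses.Gateway /ip4/0.0.0.0/tcp/8080    # expose the HTTP gateway on all interfaces
ipfs config show | grep -A 5 Addresses                 # sanity-check the whole Addresses block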
💡 nodes running on residential WiFi tend to have long wait times or high rates of failure because the rest of the network can find it difficult to connect through their NAT (i.e., the internet router, or how the modem knows which private IP / device is acting as a server). this could be ameliorated by setting up port forwarding on the router, directing external connections to port 4001 (or by moving the node to a hosted server).
for completeness, here is what the IPFS desktop GUI looks like, if you choose to go that direction instead:
when you first connect to the internet with a new client or server (by starting the IPFS daemon), your node will be bootstrapped with a collection of IP addresses to get you started (and you can also define your own bootstrap nodes):
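a sketch of what that looks like from the CLI:

ipfs daemon &          # start the node; it dials the default bootstrap peers
ipfs bootstrap list    # the hardcoded multiaddresses used to join the network
ipfs swarm peers       # the peers you are currently connected to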
if you are serious about running your node, you might want to run the IPFS daemon in the background with some process manager (such as supervisord in linux).
at this point, the add, pin, and cat commands are the most significant IPFS functions, but here is more fun stuff:
the IPFS add command creates a merkle DAG for the objects (following the UnixFS data format):
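for example (file and directory names are placeholders; -r adds a directory recursively and --only-hash computes CIDs without storing the blocks):

ipfs add notes.txt               # chunk + hash → "added <CID> notes.txt"
ipfs add -r website/             # add a whole directory; the last CID printed is the DAG’s root
ipfs add --only-hash notes.txt   # just compute the CID, don’t write to the local datastore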
💡 pinning is the process of specifying what data is to be persisted on one or more IPFS nodes. this ensures that data is accessible indefinitely and will not be removed during the IPFS garbage collection.
with add, the content does not automatically replicate across the network. you need to pin the CID to your local DHT, associating it with your IP address/port (or accessible endpoints obtained from NAT traversal):
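a minimal sketch (the CID is a placeholder):

ipfs pin add <CID>               # pin recursively (the default), so GC won’t remove the blocks
ipfs pin ls --type=recursive     # list what your node is committed to keeping
ipfs pin rm <CID>                # unpin; the blocks become eligible for GC again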
💡 once your CID is uploaded to other nodes’ DHT (i.e., once another node downloads the content), it can’t be removed from the network. however, if you are the only node hosting the content, you can unpin it and remove it from your DHT.
pinning allows the node to advertise that it has the CID (a continuous process that repeats every ~12 hours).
💡 if many minutes have passed since objects were uploaded to an IPFS node and they’re still not discoverable by other gateways, it’s possible the node is having trouble announcing the objects to the rest of the network.
you can make sure the content is pinned by running:
ipfs pin add <CID>
or you can force the announcement by running:
ipfs dht provide -rv <CID>
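and to check that the record actually reached the DHT, you can ask who is providing the CID (in newer versions of the CLI this command lives under ipfs routing findprovs):

ipfs dht findprovs <CID>    # lists the peerIDs currently announcing this CID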
finally, when working with production and/or large amounts of data, you might want to use a (paid) pinning service that runs many IPFS nodes. here are some examples: pinata, infura, nft.storage, fleek, and filebase.
when you request an object (e.g., by ipfs get <CID>, or by typing ipfs://<CID> in your brave browser or at an IPFS gateway), the IPFS client:
consults its local DHT table to see where this CID is located and gets back a collection of IP addresses/ports, or
asks connected peers over bitswap, or
makes an HTTP call to a delegated routing server.
once the mapping is retrieved, block fetching is started. the client downloads chunks of the content from each peer. the client also verifies the hashes for each block.
once all the blocks are retrieved, the local node is able to reconstruct the merkle DAG and replicate the requested content.
once the client has the content, and if it supports acting as a DHT server, it will update its local DHT table with the CID. this update is then propagated across the peers (even if the content is not explicitly pinned).
as a public network, participating IPFS nodes (peers) use CIDs to store and advertise data through publicly viewable protocols, without protecting the information about CIDs.
traffic and metadata can be monitored to infer information about the network and the users. while IPFS traffic between nodes is encrypted, the metadata the nodes publish to the DHT is public, including:
CIDs of data that they're providing or retrieving, and
their peerIDs: it’s possible to do a DHT lookup on a peerID and, particularly if a node is running from a static location, reveal its IP address.
although adding built-in privacy layers to decentralized networks is a complex topic and usually conflicts with the modular design of the network, some basic security measures can help to make your node more private:
disable CID re-providing, which is on by default (as your node temporarily becomes a host for the content until its local cache is cleared; see the config sketch after this list),
encrypt data at rest (content-encryption), and
use an external IP address that is not connected to anything else in your life.
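for the first item, the relevant knob is the reprovider settings in the node config (a sketch assuming the reference go implementation’s config layout; double-check the docs for your version):

ipfs config Reprovider.Strategy pinned   # only announce content you explicitly pinned
ipfs config Reprovider.Interval 0        # or stop periodic re-announcements entirely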
in addition, if you are not running a local node but, instead, loading IPFS through a public gateway (e.g., with brave or from this list), the gateway can still see your IPFS browsing activity (including logging your IP address and browser).
if of interest, you can also check some encryption-based projects on top of IPFS, such as ceramic and orbitDB.
finally, as a dweb hacker, i find the subject of privacy and security in decentralized webs fascinating. at some point, i would like to write more about threat models and poc-ing nodes’ metadata analysis. for now, however, stay tuned for a soon-to-be-published post where i will be talking more in-depth about a similar topic: my research on ethereum validator privacy.
for long-term storage, a step further from pinning services is the filecoin blockchain, a decentralized storage network built on top of IPFS where providers rent their storage space to clients (through a deal on how much data will be stored, for how long, and at what cost).
the verifiable storage in filecoin works under a general consensus named “proof-of-storage”, which is divided into “proof-of-replication” and “proof-of-spacetime” and is worth its own mirror post.
since the retrieval process might be slower than an IPFS pinning service and there is a minimum accepted file size, combined IPFS + filecoin solutions are also available. examples: estuary, web3.storage, and chainsafe storage.
for completeness, here are some alternative solutions with a more specific focus or use of a specific data storage mechanism: arweave, bittorrent, and swarm.
for anyone who somehow made it to the end of this article only wanting some sort of “front-end” IPFS deployment how-to guide, here we go…
let’s use fleek to deploy a website hosted on github to IPFS in 111 visual steps:
if you enjoyed IPFS, i invite you to check out these further resources:
the library genesis and some other enjoyable IPFS archives:
and to think about these:
IPFS lacks a fully functional search engine. what could be a workaround?
hydras and ipfs: a decentralized playground for malware, arxiv:1905.11880, IPFS: The New Hotbed of Phishing by trustwave, and cyber criminal adoption of IPFS for phishing, malware campaigns by talos.
in a following post, i will be diving more into peer-to-peer connectivity && libp2p. stay tuned.