on my rusty sparse merkle tree experiment

tl; dr

today i go over my implementation of a simple library for authenticated data structures and sparse merkle trees.

the source code, in rust, can be found here.


🎵 today’s mood

“the first half of life is devoted to forming a healthy ego, the second half is going inward and letting go of it.” - carl jung

“the ego refuses to be distressed by the provocations of reality, to let itself be compelled to suffer. it insists that it cannot be affected by the traumas of the external world; it shows, in fact, that such traumas are no more than occasions for it to gain pleasure.” - sigmund freud


001. authenticated data structures

an authenticated data structure (ADS) is an advanced data structure on which an untrusted prover can query for an entry, receiving a result and a proof so that the response can be efficiently checked for authenticity.

you can think of ADSs as cryptographic upgrades of the classic algorithms we are used to (such as hash maps, binary trees, or tries), with an extra operation added to the good an old insert(), lookup(), delete(), update(): the commit().

in other words, we can utilize well-known cryptographic hash functions (such as SHA-256 or SHA-3) to calculate a collision-free data structure representation.

we call this hash representation string the commitment (e.g., a small and unique cryptographic representation of the data structure). commitments must uniquely determine the data structure's value (i.e., if a.commit() == b.commit(), then a == b).

because we trust our hash function, every query should have a valid collision-free proof + a valid proof should imply that the result is correct. this means that proofs for queries must be complete and sound, where completeness means that every valid query result has a valid proof and soundness means that valid proofs imply that the query results are correct.

💡 according to seminal Miller et al, "an authenticated data structure (ADS) is a data structure whose operations can be carried out by an untrusted prover, the results of which a verifier can efficiently check as authentic. this is done by having the prover produce a compact proof that the verifier can check along with each operation’s result. ADSs thus support outsourcing data maintenance and processing tasks to untrusted servers without loss of integrity."

cryptographic hash functions

a cryptographic hash function H is a special function that takes an arbitrarily long sequence of bytes and returns some fixed-size "digest" of that sequence. cryptographic hash functions have two special properties for our purposes:

  • preimage resistance: given a digest d, it's infeasible to calculate a string s such that H(s) = d.

  • collision resistance: it's infeasible to find two strings s1 and s2 such that H(s1) == H(s2).

for this library, we will be using the SHA-256 hash function, which has a 256-bit digest.


010. authenticated (sorted) key-value stores

an authenticated key-value store is an ADS of an "associative array" or "map".

the methods of the data structure are described in src/kv_trait.rs of my code:

fn new() -> Self;
fn commit(&self) -> Self::Commitment;
fn check_proof(key: Self::K, res: Option<Self::V>, pf: &Self::LookupProof,
        comm: &Self::Commitment) -> Option<()>;
fn insert(self, key: Self::K, value: Self::V) -> Self;
fn get(&self, key: Self::K) -> (Option<Self::V>,Self::LookupProof);
fn remove(self, key: Self::K) -> Self;

note that insert(), get(), and remove() behave like the same methods in std::collections::HashMap, except that insert() and remove() return copies of the ADS rather than taking &mut self.

you can check the full implementation of a sorted key-value store class, represented by a vectorial abstraction of a (unbalanced) binary tree (i.e., the underlying data structure is an array), at src/sorted_kv.rs.

although the key sorts the structure, this is not a binary search tree, as it allows repeated entries. rather, the structure's commitment is an overall hash calculated recursively as a binary tree (this digest is obtained by mapping the merkle hash of each pair - named "merkle mountain range").


011. merkle trees

merkle trees are the canonical and original example of authenticated data structures, designed for easy inclusion proofs but not easy exclusive proofs (e.g., a prover can efficiently convince the verifier that a particular entry is present but not absent - you will understand this better in the next session).

for example, we could use a vectorial abstraction of a binary tree to represent a collection of keys and values at nodes given by an index i.

although there are multiple representations of an associative array, a general rule for this representation is that:

  • a right sibling could be found through 2 * ix + 2, so (i % 2 == 0) == true

  • a left sibling could be found through 2 * ix + 1, so (i % 2 == 1) == true

the path from a (k, v) node to the root would be calculated as the overall digest hash of the tree from the (k, v) node pointer, with all the sibling hashes.

in other words, the calculation consists of iterating each sibling in the array of siblings' digests, attributing the hashes for left and right sub-trees, and then iteratively hashing them. in this logic, the root hash is the entire commitment of the tree.

💡 since inserting at a specific position cannot be done efficiently (as the whole tree would need to be recomputed), these trees are unsuitable for authenticated key-value maps.


100. sparse merkle trees

now, let’s talk about something even more awesome: sparse merkle trees, which provide efficient exclusion proofs by design (try to figure out why 🕵🏻‍♀️!).

in a sparse merkle tree, a particular (k, v) is represented by a leaf node such that its path from the root is encoded by one of all possible hash digests of the chosen hash function.

so, if the chosen hash function is represented by N bits, the tree's height is N, and the paths down to the 2^N leaf nodes are represented by (N-nodes-deep) bit-strings.

in the case of SHA-256, the tree has a height of 256 and 2^{256} leaf nodes represented by 256-bit paths.
in the case of SHA-256, the tree has a height of 256 and 2^{256} leaf nodes represented by 256-bit paths.

🕵🏻‍♀️ when the (k, v) entry is not present in the map, an empty leaf is assigned (for instance, with an all-0 digest).

in other words, each possible (k, v) entry corresponds to a unique leaf node and is uniquely linked to its position in the tree. on the other hand, each leaf is a unique representation of the 2^N possibilities presented by the digest size N of the cryptographic hash function.

how about the branch nodes?

branches have left and right subtrees given by the digest of its child subtrees digest (something like hash_branch(left_digest_until_here, right_digest_until_here), a concatenated hash of its child nodes).

so any parent branch node can be obtained by hashing together two child nodes recursively until the top to the merkle root of the tree.

💡 note that different hash functions should be used for leaf and branch nodes to prevent preimage attacks.

each child subtree position is found by looking at the bits of hash_function(k). a left branch is denoted as 0 and a right branch as 1, so the most left leaf's key is 0x000..00, the next is 0x00..0, the most right key is 0x11..11, and so on.

in addition, we see that most leaf nodes are empty, and this is cool because the hashes of empty nodes are identical.

the same is true for interior nodes whose children are all empty: subtrees with no leaf nodes are represented as an empty subtree with a digest such as the darling all-0 digest (and hashes of hashes of hashes of (…) of empty nodes are all predictable).

💡 contrary to simple merkle trees, sparse merkle trees are a great choice for authenticated key-value maps due to the history independence of the merkle root from element insertion order.

“well, 2^N with a loooot of zeroes sounds like waste”

yes, to naively create a sparse merkle tree, one would generate all possible 2^N outputs of the hash function as a leaf in the tree and initialize them to the empty value. this generates a nearly unbounded number of data.

however, since required storage is only proportional to the number of items inserted into the tree, some optimizations are possible:

  • items on the leaf node are initially None and, therefore, do not need to be stored.

  • subtrees with no leaf nodes are represented as "empty subtree", with an all-0 digest.

  • internal nodes all have fixed predictable values that can be re-computed as hashes of None values.

therefore, a sparse merkle tree can simply store a set of leaf nodes for the (k, v) pairs plus a set of empty hashes representing the sparse areas of the tree.

and because a sparse merkle is nearly balanced, if there are N entries in the tree, a particular entry could be theoretically reached in log2(N) steps!

rust is the best language, period

as we implement a sparse merkle tree in rust (using SHA-256), we can think of two different types of nodes: SparseMerkleTreeNodeLeaf and SparseMerkleTreeNodeBranch (the full source code for this class is available at src/sparse_merkle_tree.rs).

a very simple lookup function could look like this:

finally, a commit and checking proof could use the following path process:


101. unit tests and fuzzing

to conclude this article, let’s go over some of the testing approaches in this library. first of all, i use nix to set up my virtual environments and you should too.

rust differentiates between "unit" tests and "integration" tests, as described by its documentation. in my code, unit tests are annotated with #[test] macro, and integration tests are annotated with#[cfg(test)] macro.

alternatively, the test hash_btree_insert_get() is an example of a "simulation test", generating scenarios where two different data structures (HashMap and BTreeMap) are tested to behave "the same" for insert() and get() through this check:

assert_eq!(hmap.get(&k), bmap.get(&k));

we extended this concept to compare the behavior of SortedKV and SparseMerkleTree against HashMap through, for example, hash_sortedkv_insert_get() (making sure we test for SortedKV::check_proof()):

we then use quickcheck for property testing, which creates several randomly generated (non-deterministic) inputs, such as in:

fun.


◻️♄

Subscribe to go outside labs
Receive the latest updates directly to your inbox.
Mint this entry as an NFT to add it to your collection.
Verification
This entry has been permanently stored onchain and signed by its creator.