The Quest for a Content Addressable SQLite

Happy Birthday SQLite! SQLite 1.0 was released 23 years ago today, so to celebrate this momentous occasion, join us on an exciting quest as we blend the old with the new, the familiar with the cutting-edge, and dare to dream of a future where edge databases and content addressable storage work together.

Our destination? An experimental project to combine SQLite's features with the high performance, scalability, and deduplication capabilities of Content Addressable Storage (CAS). We’ll probe new directions for scaling replicated SQLite databases like those on edge computing platforms, and even build a working prototype along the way.

In a nutshell, our experiment will leverage a novel SQLite Virtual File System (VFS) written in Rust, which will store database pages in a content addressable storage layer using a content-addressed Array Mapped Trie as our underlying persistent data structure. To top it all off, we’ll use off-the-shelf crates and libraries from the Filecoin ecosystem for maximum data interoperability!

SQLite is 🔥 Right Now

SQLite is an in-process library that implements a self-contained, serverless, zero-configuration, and transactional SQL database engine. Although it's not new, SQLite has been gathering attention in the developer community, and for good reason.

In the current landscape of edge computing platforms and edge networks like Cloudflare Workers and Fly.io, developers are seeking lightweight and flexible database solutions that are easy to set up, deploy, and run. SQLite, with its ability to run as an embedded engine, fits this requirement perfectly. We’re talking about a complete, compliant SQL database with all the bells and whistles, contained within a single binary 🤯.

For these reasons and more, SQLite is seeing a lot of interest from the developer world, with SQLite-specific products like Cloudflare’s D1 and turso.tech leveraging the Lite in SQLite to simplify replication and improve query latencies on the edge, while folks like Fly.io are doubling down on server-side SQLite offerings.

Exploring the Benefits of Content Addressing

Content Addressable Storage, or CAS, is a unique storage mechanism that allows data retrieval based on content, rather than storage location. CAS can be thought of as a scalable, high-performance library that never allows two identical books to be stored. Projects like IPFS, Git, the Google File System, blockchains, and even Amazon’s S3 use CAS concepts under the hood.

CAS works by assigning a unique identifier (a cryptographic hash) to each piece of information, facilitating rapid data retrieval and ensuring data immutability. This immutability feature guarantees a high level of data integrity. The unique id is often called a content id, due to it being derived from the content itself. Also, since the id is unique and content-based, CAS storage systems get deduplication super powers pretty much for free!

A CID (content identifier) is simply a hash-based label used to link to some data; it is going to be the CA in our CAS. CIDs are a practical implementation of a content id, and you can read all about them in the IPFS documentation.
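
For a taste of what that looks like in code, here is a minimal sketch of deriving a CID from raw bytes using the cid and multihash crates from the same ecosystem (the exact calls shift a bit between crate versions, so treat this as illustrative):

use cid::Cid;
use multihash::{Code, MultihashDigest};

// Multicodec code for raw bytes.
const RAW: u64 = 0x55;

fn cid_for(content: &[u8]) -> Cid {
    // Hash the content itself; identical content always yields the same digest...
    let digest = Code::Sha2_256.digest(content);
    // ...and therefore the same CID, which is where the deduplication
    // superpowers come from.
    Cid::new_v1(RAW, digest)
}

fn main() {
    let a = cid_for(b"hello world");
    let b = cid_for(b"hello world");
    assert_eq!(a, b); // same content, same address
    println!("{}", a);
}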

An Experimental Database for the Edge

Imagine we have a database running on an edge computing platform (or even IoT/mobile devices), where we have extremely high throughput, limited compute/network resources, and a need to sync state between a number of replicas. Ideally, we want these replicas close to the users that are interacting with the data. But we don’t want the replicas to diverge too much (or at all really), and we don’t want one replica to have a totally different view from another. Here’s a great discussion of some of the tradeoffs with putting data on the edge like this. At the end of the day, we’re going to have to decide at what layer of our system we’d like to deal with data consistency.

We initially explored this concept in the context of blockchain-based state machine replication (SMR) across a set of decentralized nodes operating in a Byzantine Fault-Tolerant (BFT) environment. In that context, throughput is low. However, decentralized compute networks conceptually share many threat models and issues with traditional edge compute networks. Here’s a POC of an actor (smart contract)-based implementation that uses the Filecoin VM to maintain consensus on database state.

So let’s try an experiment. Instead of writing to a single database file (or WAL file), let’s see if we can have SQLite store database pages in our content addressable storage layer instead! Then we can let the storage layer handle the hard part of data consistency. To test this idea out, we built a proof of concept (POC) application, which consists of a SQLite Virtual File System (VFS) that leverages the power of a CAS backend.

Cloning or forking databases also becomes as easy as copying a tiny cryptographic hash (like with Git). We also get extremely high concurrency because our SQLite database is now a persistent, immutable data structure. When a modification is made, a new (extremely lightweight) version of the database itself is created that shares most of the structure with the original (and related) databases, and only the changes are added to CAS. This is the essence of structural sharing and persistent data structures!

Virtual File System Design

Now comes the fun part! We are going to implement this POC in Rust to help illustrate the concept. Why Rust? Why not? Also, our team has been exploring Rust more and more lately, especially as we try out some cool experiments with extending SQLite. So first things first, let’s sketch the design diagrammatically before we start building.

The easiest way to extend SQLite in this way is to implement a Virtual File System or VFS. You can think of the VFS as SQLite’s de facto portability layer. It is the module at the bottom of the SQLite implementation stack that provides portability across operating systems. SQLite ships with a number of default VFSes, but we’re going to implement our own to leverage CAS, rather than any native filesystem. Here’s roughly how we can conceptualize it:

[Figure: Custom VFS]

We want the normal SQLite application to communicate with the standard SQLite engine. At this layer, we want zero customization. From here, the default SQLite engine interacts with its internal B-Tree data structure, and uses a “Pager” to decide which pages to write out to our custom storage layer. Our custom SQLite VFS layer communicates with the underlying CAS, providing all the features of CAS to the SQLite engine. Nice.
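
To make the role of that custom layer a bit more concrete, here is a deliberately simplified, hypothetical trait for page-oriented storage. It is not the actual sqlite_vfs crate interface (which also deals with byte offsets, locking, and file management), just a sketch of the surface area our VFS needs to cover on top of the CAS:

use std::io;

/// Hypothetical, simplified storage interface for illustration only; the real
/// POC plugs into the sqlite_vfs crate's traits and translates SQLite's
/// offset-based reads and writes into whole-page operations like these.
trait PageStorage {
    /// Fetch the page at `page_id`, if it has ever been written.
    fn read_page(&self, page_id: u64) -> io::Result<Option<Vec<u8>>>;
    /// Write (or overwrite) the page at `page_id`.
    fn write_page(&mut self, page_id: u64, data: &[u8]) -> io::Result<()>;
    /// Number of pages currently in the database image.
    fn page_count(&self) -> io::Result<u64>;
}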

B-Trees and Pages

Ok, what do I mean by b-trees and pages? Well, if you didn’t know already, the SQLite file format is a stable, well-defined format for specifying the structure and content of a SQLite database. In practice, it is a relatively simple binary format that describes a series of b-tree pages. It is portable, efficient, and easy to implement. You can read more about some of the internals in this great blog post from our friends at Fly.io, but here’s a quick summary: “SQLite groups rows together into […] chunks called ‘pages’.” These pages are 4 KB by default, but can be any power of two between 512 and 65536 bytes inclusive. Why 4 KB by default? That's what file systems typically use as their page size…
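
As a concrete example, the page size lives right in SQLite’s 100-byte database header: a big-endian u16 at byte offset 16, where the special value 1 means 65536. Here is a minimal sketch of reading it, which is roughly how a VFS can adapt to whatever page size an incoming database image uses:

/// Read the page size from the first 100 bytes of a SQLite database file.
/// Per the SQLite file format, bytes 16..18 hold the page size as a
/// big-endian u16, where the special value 1 represents 65536.
fn page_size(header: &[u8]) -> Option<u32> {
    let raw = u16::from_be_bytes([*header.get(16)?, *header.get(17)?]);
    Some(if raw == 1 { 65_536 } else { raw as u32 })
}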

So, we’re going to store b-tree pages, and we’ll just take whatever the Pager layer gives us, and stick these “atomic units” into our CAS. But the CAS is based on content addresses, and we want to be able to fetch them later using only their location address (page id). So we need some mapping between page id and content addresses, and we want this mapping to be pretty lightweight and efficient so it doesn’t add a ton of overhead to our CAS layer.
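
In its most naive form, that mapping is just a lookup table from page index to CID: write the page bytes into the CAS, remember the resulting CID, and resolve it again on reads. Here is a hypothetical sketch of that idea against the fvm_ipld_blockstore Blockstore trait (the NaivePages type is made up for illustration, and the POC replaces the HashMap with the AMT described in the next section):

use std::collections::HashMap;

use cid::Cid;
use fvm_ipld_blockstore::Blockstore;
use multihash::{Code, MultihashDigest};

const RAW: u64 = 0x55; // multicodec code for raw bytes

/// Hypothetical, naive page map: each page is stored in the CAS under its own
/// CID, and a plain HashMap remembers which CID lives at which page index.
struct NaivePages<B: Blockstore> {
    store: B,
    index: HashMap<u64, Cid>,
}

impl<B: Blockstore> NaivePages<B> {
    fn write_page(&mut self, page_id: u64, data: &[u8]) -> anyhow::Result<()> {
        let cid = Cid::new_v1(RAW, Code::Sha2_256.digest(data));
        self.store.put_keyed(&cid, data)?;
        self.index.insert(page_id, cid);
        Ok(())
    }

    fn read_page(&self, page_id: u64) -> anyhow::Result<Option<Vec<u8>>> {
        match self.index.get(&page_id) {
            Some(cid) => self.store.get(cid),
            None => Ok(None),
        }
    }
}

The problem with a plain HashMap is that the mapping itself lives outside the CAS, so it can’t be fingerprinted, shared, or diffed. That is exactly what the authenticated data structure described next fixes.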

Authenticated Data Structures

This is where we can leverage content-address-aware data structures to help us out! We’ll use an Array Mapped Trie (AMT) that allows for the insertion and persistence of data and is itself serializable to a CID. This is a data structure that looks and feels like an array, but actually stores the underlying data and its own structure within an underlying CAS system.

What about something other than an AMT? There are certainly other options for representing the array structure. In fact, one really nice optimization that Sander Pick used in our early FVM POC was to build a deterministic B-tree-like structure over the pages, grouping them into “buckets” of equal size as we move from the bottom up.

This structure is an authenticated data structure (ADS), with efficient digests (fingerprinting) and a mechanism for efficient inclusion (or related) proofs. Merkle Trees provide a very useful starting point to talk about ADS, and we’re going to leverage a Merkle-DAG based AMT data structure for our experiment. If you aren’t familiar with Merkle trees and DAGs, here’s a pretty nice primer to get you started. The digest is going to be in the form of a CID, so that we can reference an entire database structure with a single reference/link. A generic Merkle structure looks something like this:

[Figure: Basic Merkle Tree]

As you can see in the above figure, the pages are hashed and grouped to form interior nodes, which are in turn hashed and grouped, and so on up to a root Merkle DAG node that can be used to represent the entire structure. This is perfect for “mapping” array indices to CIDs, and the structure itself can also be encoded and mapped to a CID. This gives us pretty much everything we need to turn our SQLite database pages into a persistent data structure that supports structural sharing across a shared CAS layer.
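
To make that a little more concrete, here is a rough sketch of the mapping using the same fvm_ipld_amt and fvm_ipld_blockstore crates the test module below relies on (the set/get calls are assumed from the crate’s API, so treat this as illustrative rather than copy-paste ready):

use fvm_ipld_amt::Amt;
use fvm_ipld_blockstore::MemoryBlockstore;

fn main() {
    let store = MemoryBlockstore::default();

    // An AMT mapping page index -> page bytes, persisted into the blockstore.
    let mut pages: Amt<Vec<u8>, _> = Amt::new(&store);
    pages.set(0, vec![0u8; 4096]).unwrap(); // page 0
    pages.set(1, vec![1u8; 4096]).unwrap(); // page 1

    // Flushing writes the trie nodes into the CAS and returns the root CID:
    // a single content address that identifies the entire database image.
    let root = pages.flush().unwrap();
    println!("database root: {}", root);

    // Later (or on another replica with access to the same blockstore), the
    // whole structure can be reloaded from nothing more than the root CID.
    let loaded: Amt<Vec<u8>, _> = Amt::load(&root, &store).unwrap();
    assert_eq!(loaded.get(1).unwrap().map(|v| v.len()), Some(4096));
}

Because the structure is deterministic, two replicas that apply the same pages end up with the same root CID, which is what makes the hard-coded CID assertion (and the diff) in the test below possible.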

Implementation

The “full” (lightweight example version) implementation is available in this gist, and borrows heavily from this example. The details are a bit involved for a blog post (and this one is getting pretty long as it is), but here’s the test module to give you an idea of how you’d interact with this CAS-based SQLite VFS via the very popular rusqlite library:

#[cfg(test)]
mod tests {
    use crate::PagesVfs;
    // We'll use this to actually diff the db!
    use fvm_ipld_amt::{diff, Amt};
    use fvm_ipld_blockstore::MemoryBlockstore;
    use rusqlite::{Connection, OpenFlags};
    use sqlite_vfs::register;

    pub const SQLITE_PAGE_SIZE: usize = 4096;

    #[test]
    fn basic_get_set() {
        let mem = MemoryBlockstore::default();
        let initial_bytes: &'static [u8] = include_bytes!("file.db");
        let vfs = PagesVfs::<_>::new(mem, initial_bytes);
        let other = vfs.clone();

        register("vfs", vfs, true).unwrap();
        let mut conn = Connection::open_with_flags_and_vfs(
            "main.db",
            OpenFlags::SQLITE_OPEN_READ_WRITE
                | OpenFlags::SQLITE_OPEN_CREATE
                | OpenFlags::SQLITE_OPEN_NO_MUTEX,
            "vfs",
        )
        .unwrap();

        // If we set this to memory, then we don't need the "cache" cursor,
        // but then we also don't get the "delete" hook to work with...
        let journal_mode: String = conn
            .query_row("pragma journal_mode = delete", [], |row| row.get(0))
            .unwrap();

        assert_eq!(journal_mode, "delete");

        // We don't need to enforce this, our VFS will adjust based on
        // the initial_bytes
        let page_size: usize = conn
            .query_row("PRAGMA page_size", [], |row| row.get(0))
            .unwrap();
        assert_eq!(page_size, SQLITE_PAGE_SIZE);

        let c = other.root().unwrap();
        println!("{:?}", c.to_string().as_str());

        let tx = conn.transaction().unwrap();
        tx.execute(
            "create table my_table(id integer primary key, msg text);",
            [],
        )
        .unwrap();
        tx.execute("insert into my_table(msg) values('hello');", [])
            .unwrap();
        tx.execute("insert into my_table(msg) values('world');", [])
            .unwrap();
        tx.commit().unwrap();

        let c = other.root().unwrap();

        let mut stmt = conn.prepare("select * from my_table;").unwrap();
        let mut rows = stmt.query([]).unwrap();
        while let Some(r) = rows.next().unwrap() {
            let id: i32 = r.get(0).unwrap();
            let msg: String = r.get(1).unwrap();
            println!("found in db: {} = {}", id, msg);
        }

        let db = MemoryBlockstore::new();
        let new_amt: Amt<Vec<u8>, _> = Amt::new(&db);
        // This actually computes a diff of the two structures
        let results = diff(&other.state.lock().unwrap(), &new_amt).unwrap();
        // We'd expect there to be two pages that are different
        assert_eq!(results.len(), 2);
        // Deterministic structure here should lead to this cid
        assert_eq!(
            c.to_string().as_str(),
            "bafy2bzaced3vtba4kcg4w5q4dnfdstxfom6t4cv72uuq4jgvkeav24ujdcgtg"
        );
    }
}

It is actually pretty simple. We bring our own CAS system to the VFS implementation; here it is a shared MemoryBlockstore, but it could be anything that implements the Blockstore trait provided by fvm_ipld_blockstore. Then we just register the VFS, and from there we essentially use SQLite as normal. Nothing fancy to think about. The only difference is that we can now reference the root hash of the database at any time. What’s cool about that is that an entire database can be defined by a single CID and access to a (possibly remote) blockstore!

We don’t really do much in terms of serious locking infrastructure, or even write-ahead logging or rollback journals. With our setup, we don’t really need them, because if something goes wrong, we can just roll back the underlying AMT to the previous root hash and go from there. Additionally, “forking” or otherwise bootstrapping a new database from our existing one is as easy as calling its clone method. The only things we really need to clone are the references to the blockstore and the root CID. This works because “loading” an AMT from an existing root and blockstore is extremely lightweight:

let mut new_amt: Amt<Vec<u8>, _> = Amt::load(&cid, &mem).unwrap();
assert_eq!(
    new_amt.flush().unwrap().to_string().as_str(),
    "bafy2bzaced3vtba4kcg4w5q4dnfdstxfom6t4cv72uuq4jgvkeav24ujdcgtg"
);

Another nice feature is that, unlike a normal SQLite database setup, our CAS can be networked, so we don’t need any real local storage if we don’t want it. This makes it quite easy to use in a shared (or multi-tenant) storage environment like Amazon's S3 or Cloudflare's R2.

Stay Tuned…

So that’s pretty much it for today!

Here’s a summary so far: We presented an experimental project to support SQLite databases on edge computing platforms, combining SQLite's features with the scalability and deduplication capabilities of Content Addressable Storage (CAS). We detailed the process of creating a SQLite Virtual File System (VFS) in Rust, which stores database pages in a CAS layer using a content addressable Array Mapped Trie as our underlying persistent data structure. To top it all off, we used off-the-shelf crates and libraries from the Filecoin ecosystem for the ultimate in data interoperability!

In a future post: We’ll be doing some benchmarking and measuring the potential benefits and tradeoffs in terms of storage and network traffic. After that, stay tuned for more of these types of posts as we continue to explore the design space around databases, peer-to-peer networks, blockchain, and a whole lot more.

In the meantime: if you liked this post, and/or if you are a Rust person and know how to improve the above code/design, we’re all ears! Let us know either way in our Discord, via email, Twitter/X, etc.
