The Data Availability Problem: A Brief History and Bright Future

The data availability problem:

In the simplest of terms, blockchains are a set of virtual computers agreeing on the same set of actions and following the same rules. The rules they follow are commonly referred to as a consensus algorithm, while the validity of those actions is determined by a virtual machine.

Inevitably, this leads to a situation in which blockchains and the data they store are siloed from the outside world, so any interaction they have with an external data source introduces additional trust assumptions. The data that is natively available on-chain is held entirely by the full nodes ‘running’ the blockchain. Whereas archival nodes are expected to store historical blockchain data and historical state changes, full nodes manage the current state of the blockchain. In addition, full nodes serve chain data upon request to agents that wish to interact with the blockchain.

This additional responsibility of full nodes is the central source of the “data availability” problem. In brief, when a new block is posted to the chain, how can full nodes be certain that the data is:

1) Complete (every transaction in the proposed block is available for download); and

2) Correct (the transactions are not fraudulent).

Therefore, the data availability problem raises a key question: how can we be sure that the block proposer is not acting maliciously?

Currently, the only way to ensure that a data withholding attack is not occurring is to download the full block. As Satoshi noted in the Bitcoin whitepaper, “the only way to confirm the absence of a transaction is to be aware of all transactions.” However, this creates a bottleneck for scalability, as nodes cannot verify larger blocks without significantly increasing their hardware requirements. Higher TPS requires larger blocks, since each transaction consumes data and processing power (to validate and execute). Thus, a solution that allows for larger blocks without increasing hardware requirements is imperative for scalability.
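To make that concrete, here is a minimal, hypothetical sketch of what “download and check everything” amounts to for a full node; the Merkle-root construction and the `is_valid_tx` callback are illustrative assumptions rather than any specific client’s implementation.

```python
import hashlib

def merkle_root(leaves):
    """Binary Merkle root over raw transaction bytes (duplicating the last node on odd levels)."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

def verify_block(committed_tx_root, transactions, is_valid_tx):
    # 1) Completeness: the downloaded data must hash to the root committed in the header.
    if merkle_root(transactions) != committed_tx_root:
        return False
    # 2) Correctness: every transaction must pass the chain's validity rules.
    return all(is_valid_tx(tx) for tx in transactions)
```

The cost of this check grows linearly with block size, which is exactly the bottleneck described above.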

To date, increasing block sizes has been considered the only viable solution for increasing throughput, but the purpose of this article is to outline the evolution of blockchain architectures that have unlocked new answers to the data availability problem without sacrificing decentralisation or user-sovereignty.

Four key blockchain components

To understand this further and tackle the question holistically, we must first understand the four core functions of a blockchain: Execution, Settlement, Consensus, and Data Availability.

  • Execution is the layer where applications live and state is updated.
  • Settlement is where execution layers verify proofs, resolve disputes, and transfer assets between execution layers.
  • Consensus is where nodes agree on the order of transactions included in a block.
  • Data availability is making sure that the transaction data is available to users and block producers (consensus nodes).

Monolithic vs. Modular architecture blockchains

A visual representation of Monolithic vs. Modular blockchain architecture

Traditionally, blockchain architecture has relied on these four functions coalescing on the same layer, in what has been termed a monolithic architecture. This architecture will always have the fewest trust assumptions, a positive outcome for users wanting to transact securely. However, it imposes tremendous limitations on scalability, hardware requirements, and transaction efficiency, because a single chain is responsible for all four components and cannot make software optimisations that specialise any one of them. This is where a new paradigm has been introduced: modular architecture.

In contrast to monolithic architecture, modular architecture separates the four core functions of Execution, Settlement, Consensus, and Data Availability, and in doing so allows the blockchain’s capabilities to scale far further. Separating the four layers allows each to be specialised for its intended purpose, which increases efficiency whilst still keeping trust assumptions minimal.
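As a purely illustrative sketch of what this separation looks like in practice, the four functions can be thought of as independent, swappable interfaces; the names and method signatures below are hypothetical and do not correspond to any particular protocol’s API.

```python
# Purely illustrative: the four blockchain functions as swappable interfaces.
# Names and signatures are hypothetical, not any real protocol's API.
from typing import Protocol, Sequence

class ExecutionLayer(Protocol):
    def apply(self, state: dict, transactions: Sequence[bytes]) -> dict:
        """Run transactions and return the updated application state."""

class SettlementLayer(Protocol):
    def verify_proof(self, proof: bytes) -> bool:
        """Verify a validity or fraud proof posted by an execution layer."""

class ConsensusLayer(Protocol):
    def order(self, transactions: Sequence[bytes]) -> Sequence[bytes]:
        """Agree on a canonical ordering of transactions for the next block."""

class DataAvailabilityLayer(Protocol):
    def publish(self, block_data: bytes) -> None:
        """Make the ordered block data retrievable by anyone who requests it."""
```

A monolithic chain implements all four behind a single node, whereas a modular stack lets each interface be served by a different, specialised layer.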

The narrative around, and wide-ranging support for, modular architectures gained traction over the first half of 2022. While rollups, both Optimistic (Arbitrum, Optimism and Metis) and Zero-Knowledge (Starkware, Loopring, AZTEC and zkSync), are in fact modular execution environments, they have mostly not been marketed as such. The recent uptick in support has largely been catalysed by Celestia, which this article will explore in detail. However, it is important to note that the modular architecture narrative has long been foundational to the data availability discourse.

A brief history:

Bitcoin was the first blockchain and, thus, the first to encounter the data availability problem. Mastercoin, a Bitcoin sidechain that held its ICO in 2013, attempted to use Bitcoin as a data availability layer (DAL) and consensus layer (CL). Rather than needing to bootstrap its own secure consensus via Proof of Work (Proof of Stake had not yet been developed), Mastercoin utilised Bitcoin’s already extensive support network of users and their full nodes. Hailed as the Meta-Protocol, Mastercoin hoped to provide the first proof of concept for expanding Bitcoin’s capabilities by bringing features such as token swaps and betting markets to Bitcoin. At the technical level, this was achieved by allowing transactions to be executed on the Mastercoin sidechain, flagged, and then sent back to Bitcoin, which was used as the data availability layer. This is the first example of a sidechain/L2 with a modular philosophy: extensible features that do not require L1 governance or hard forks. Mastercoin was dubbed the original altcoin, but ultimately it failed.

The reason for Mastercoin’s failure remains a matter of debate. One often-cited reason is flaws in the protocol’s architecture, such as low liquidity for tokens issued via the platform due to the absence of AMMs. Hostility from Bitcoin Core developers, which led to an increase in the price of posting data, is also often cited as a contributor to Mastercoin’s downfall.

One thing that is clear, however, is that Mastercoin was difficult to run a light client on. It required users to run full nodes of both Mastercoin and Bitcoin in order to verify the integrity of Mastercoin blocks, and thus the validity of Mastercoin transactions. This immediately isolated Mastercoin users from the established ecosystem of light clients and wallets that Bitcoin had developed by that stage.

Ethereum was born out of the idea of a world computer, evolving the initial idea of Bitcoin by being Turing complete and thereby allowing decentralised applications to be built on top. The four core functions were coupled (a monolithic architecture) and inextricably linked to ensure maximum security for users. The downside, however, is that this coupling contributed heavily to the enormous spike in Ethereum gas fees as user adoption increased. Rollups offered a paradigm shift as modular execution layers, radically reducing gas fees for their users (though fees will not stay low over the long term without further innovation). Ethereum’s long-term roadmap plans to implement data shards to reduce gas fees while maintaining its superlative security characteristics. The optimal end state of these shards remains debated by the Ethereum community, but there is general agreement on what the first step will look like.

A brief look at the future of blockchain architecture:

Proto-danksharding

The first step toward data shards will most likely come via EIP-4844, proto-danksharding, which was introduced by Ethereum Foundation researcher Dankrad Feist. Proto-danksharding implements most of the logic (transaction formats, verification rules) that makes up the full sharding spec; however, it does not yet allow for actual sharding. The main feature being introduced is blob-carrying transactions. Blobs are extremely large (~125 kB) and can be much cheaper than calldata (the benefit of blobs, compared to simply making calldata cheaper, relates to average load versus worst-case load).
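As rough, illustrative arithmetic (not the EIP-4844 fee formula), the sketch below shows why carrying that much data as calldata is painful; it assumes the standard 16 gas per non-zero calldata byte and an approximately 30M block gas limit.

```python
# Back-of-the-envelope comparison: posting one ~125 kB blob's worth of data as
# ordinary calldata. Assumes 16 gas per non-zero byte and a ~30M block gas limit.
BLOB_SIZE_BYTES = 125_000
CALLDATA_GAS_PER_BYTE = 16          # worst case: all bytes non-zero
BLOCK_GAS_LIMIT = 30_000_000

calldata_gas = BLOB_SIZE_BYTES * CALLDATA_GAS_PER_BYTE
print(f"~{calldata_gas:,} gas as calldata "
      f"({calldata_gas / BLOCK_GAS_LIMIT:.0%} of one block's gas limit)")
```

Blob data instead lives in its own fee market with its own per-block target, so heavy rollup data load no longer competes directly with ordinary execution gas, which is where the average-load versus worst-case-load benefit comes from.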

If you’re interested in the development and nature of sharding on Ethereum, read this:

A new future for modular blockchains?

Thus far, this article has focused on more traditional blockchain systems in order to provide adequate context for the varied architectural frameworks currently in use. However, it is worth noting that much of the recent conversation around modularity and data availability can be credited to the following that Celestia has attracted over the course of 2022.

Celestia is a modular blockchain that provides data availability and consensus. By being responsible for only these two functions, Celestia can make design choices that optimise for them, which leads to a more scalable DAL and CL. Essentially, Celestia is responsible for two things: making sure the data is available, and making sure it is in the correct order.

Celestia enables data availability via novel proofs known as data availability proofs (DAPs). These proofs use erasure codes, a method of data protection that ensures that if you lose a piece of data, you will be able to recover it. The proofs require each light node to sample a very small number of random chunks from each block in the chain, and they only work if there are enough clients in the network to ensure the entire block is sampled. This means the more clients you have, the larger the block size you can secure. In addition, the elegance of this design is that it is incredibly easy to operate a light client on Celestia; so easy, in fact, that you could run one on your phone, making Celestia incredibly hardware friendly. By vastly lowering the barrier to entry for running a light client, Celestia becomes hyper-scalable, as scalability = TPS / cost to check the block, not just TPS.
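To illustrate why a handful of random samples gives such strong guarantees, here is a small sketch; it assumes the commonly cited property of the 2D erasure-coded block used in data availability sampling, namely that making a block unrecoverable requires withholding at least roughly 25% of the extended shares. The numbers are illustrative, not Celestia’s specification.

```python
# Minimal sketch of data availability sampling confidence. Assumes a malicious
# producer must withhold at least ~25% of the erasure-coded shares to make the
# block unrecoverable, so each uniform random sample hits withheld data with
# probability >= 0.25 in the worst meaningful case.
def detection_probability(num_samples: int, withheld_fraction: float = 0.25) -> float:
    """Chance that at least one of `num_samples` random samples lands on withheld data."""
    return 1.0 - (1.0 - withheld_fraction) ** num_samples

for k in (10, 20, 30):
    print(f"{k} samples -> {detection_probability(k):.2%} chance of catching withholding")
```

Roughly 94% confidence at 10 samples and over 99.6% at 20, and because many light nodes sample independently, the network collectively covers the whole block.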

So Celestia is a layer for data availability and consensus; what about the other components of the blockchain? For the most part, three different types of rollups will plug in to Celestia. The first is the Celestium, an Ethereum rollup that uses Celestia for data availability. Celestiums can handle higher transaction throughput, but it is worth noting that doing so involves slight to moderate security trade-offs. The second kind is the sovereign rollup, which is mostly like the rollups we see on Ethereum today, except that a sovereign rollup does not need to gain permission to fork and push upgrades. The last type is the settlement rollup, an environment highly optimised for execution layers to post their transactions to before those transactions settle and are sent down to Celestia.

To learn more about Celestia, start here: https://celestia.org/learn/

Conclusion

Data availability has long been an issue limiting the scalability of blockchains. By breaking down the traditional blockchain stack and becoming more modular, DALs are able to optimise their function such that they can provide maximal security without introducing additional trust assumptions. The future is bright for both Ethereum and Celestia, who among many others are introducing new mechanisms to solve data availability whilst still ensuring users retain security, sovereignty, and inexpensive transaction costs. Modular blockchains are an innovative frontier for the industry, and we look forward to supporting those ecosystems as they continue to grow.
