Data availability refers to the guarantee that the full set of transaction data included in a block is available to all participants in a blockchain network. This concept is critical for maintaining security, especially as blockchain systems scale to higher transaction volumes.
New scaling approaches such as sharding and rollups distribute transaction processing across shards or rollup chains rather than having every node process everything. Spreading the work out this way allows higher throughput, but a consequence is that no single node sees all the data anymore. Individual nodes can therefore no longer fully verify every transaction, nor can they generate fraud/validity proofs if some transaction data is missing or withheld. Light clients, which never download transaction data in the first place, are especially vulnerable if data availability is not guaranteed.
Thus, guaranteeing accessibility of necessary data has become a key challenge in blockchain scaling. A variety of techniques are emerging to provide this assurance without excessive redundancy overhead.
In traditional proof-of-work blockchains like Bitcoin and Ethereum (before its transition to proof of stake), each block contains a header with metadata and a list of transactions. Full nodes in these networks download and validate every single transaction in each block by independently executing the transactions and checking that they are valid according to the blockchain's protocol rules. This independent execution allows full nodes to compute the current state required to verify and process the next block. Because they perform this execution and verification themselves, full nodes enforce critical transaction validity rules and prevent miners or block producers from including invalid transactions in blocks.
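In pseudocode terms, a full node's per-block work boils down to re-executing each transaction against the parent state and checking the result against the header's commitment. The toy sketch below uses simple balance transfers and a hash of the balance map as a stand-in state root; it is illustrative only and does not reflect any real client's data structures:

```python
import hashlib

# Toy full-node validation: re-execute every transaction against the parent
# state and check the result against the header's committed state root.
# Block/transaction shapes here are illustrative, not any real client format.

def state_root(balances):
    return hashlib.sha256(repr(sorted(balances.items())).encode()).hexdigest()

def validate_block(block, parent_balances):
    balances = dict(parent_balances)
    for tx in block["transactions"]:            # tx = (sender, receiver, amount)
        sender, receiver, amount = tx
        if balances.get(sender, 0) < amount:    # protocol rule: no overspending
            raise ValueError(f"invalid transaction {tx}")
        balances[sender] -= amount
        balances[receiver] = balances.get(receiver, 0) + amount
    if state_root(balances) != block["header"]["state_root"]:
        raise ValueError("header commits to a state the transactions do not produce")
    return balances   # becomes the parent state for the next block
```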
Lightweight clients, also known as SPV (Simplified Payment Verification) clients, take a different approach from full nodes in order to conserve bandwidth and storage. SPV clients only download and verify block headers; they do not execute or validate any transactions. Instead, they rely on the assumption that the chain favored by the blockchain's consensus algorithm, e.g. the longest (most-work) chain in Bitcoin, contains only valid blocks that properly follow protocol rules. This allows SPV clients to outsource transaction execution and verification to the blockchain's consensus mechanism itself.
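By contrast, an SPV client's verification reduces to checking the header chain itself. The sketch below assumes a simplified header format (previous-header hash, transaction root, nonce) and a fixed proof-of-work target; it is a rough illustration, not Bitcoin's actual header rules:

```python
import hashlib

# Toy SPV-style check: only headers are verified (hash links and proof of work);
# the transactions inside each block are never downloaded or executed.

def header_hash(header):
    encoded = f"{header['prev_hash']}{header['tx_root']}{header['nonce']}".encode()
    return hashlib.sha256(encoded).hexdigest()

def verify_header_chain(headers, target):
    for prev, current in zip(headers, headers[1:]):
        if current["prev_hash"] != header_hash(prev):
            return False                      # broken hash link
        if int(header_hash(current), 16) >= target:
            return False                      # insufficient proof of work
    # Transaction validity is assumed, delegated to the honest majority of miners.
    return True
```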
The security model for SPV clients fundamentally depends on having an honest majority of consensus participants, for example miners in proof-of-work blockchains, that correctly apply transaction validity rules and reject any invalid blocks proposed by the minority. If a dishonest majority of miners or block producers colludes, they could coordinate to create blocks with illegal state transitions that create tokens out of thin air, violate conservation of assets, or enable other forms of theft or exploitation. SPV nodes would not be able to detect this malicious behavior on their own because they do not actually validate transactions. In contrast, full nodes enforce all protocol rules regardless of the consensus mechanism, so they would immediately reject such invalid blocks created by a dishonest majority.
To improve the security assumptions for SPV clients, fraud/validity proofs give full nodes a way to produce cryptographic evidence about a block's correctness: a fraud proof demonstrates to light clients that a given block definitively contains an invalid state transition, while a validity proof attests that it does not. After receiving a valid fraud proof, light clients can reject the invalid block even if the consensus mechanism incorrectly accepted it.
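To make the idea concrete, the toy sketch below shows a fraud proof for a single disputed transfer, reusing the illustrative transfer rules from the full-node sketch above. The proof carries the relevant pre-state (in a real system, Merkle witnesses) plus the producer's claimed post-state root, so anyone holding that data can re-check the disputed step:

```python
import hashlib

# Toy fraud proof: re-execute one disputed transaction against the witnessed
# pre-state and compare with the block producer's committed post-state root.
# The data shapes are illustrative, not any production protocol's proof format.

def state_root(balances):
    return hashlib.sha256(repr(sorted(balances.items())).encode()).hexdigest()

def is_fraudulent(pre_balances, tx, claimed_post_root):
    sender, receiver, amount = tx
    balances = dict(pre_balances)
    if balances.get(sender, 0) < amount:
        return True                            # an invalid tx was included at all
    balances[sender] -= amount
    balances[receiver] = balances.get(receiver, 0) + amount
    # Fraud if honest re-execution disagrees with what the producer committed to.
    return state_root(balances) != claimed_post_root

# Example: the producer credits the receiver without debiting the sender.
pre = {"alice": 5, "bob": 0}
bogus_root = state_root({"alice": 5, "bob": 5})
assert is_fraudulent(pre, ("alice", "bob", 5), bogus_root)
```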
However, fraud/validity proofs fundamentally require full nodes that create them to have access to the full set of transaction data referenced in a block in order to re-execute the transactions and identify any invalid state changes. If block producers selectively release only the block headers and withhold the full transaction dataset for a given block, full nodes will not have the information they need to construct fraud/validity proofs. This situation where transaction data is unavailable to the network is known as the "data availability problem".
Without guaranteed data availability, light clients are once again forced to simply trust that block producers are behaving honestly. This complete reliance on trust defeats the purpose of fraud/validity proofs and undermines the security benefits of light client models. For this reason, data availability is critical for maintaining the expected security and effectiveness of fraud/validity proofs in blockchain networks, especially as they scale to higher transaction volumes.
Beyond its importance in existing networks, data availability becomes even more critical in the context of new scaling solutions like sharding and rollups that aim to increase transaction throughput. Initiatives and projects such as proto-danksharding (EIP-4844), Celestia, EigenDA and Avail have made substantial progress toward providing efficient and affordable DA for rollups.
In a sharded blockchain architecture, the single validator set is split into smaller groups, or "shards", that each process and validate only a subset of transactions. Since shards do not process or validate transactions originating from other shards, individual shard nodes only ever have access to transaction data for their own shard.
In rollups, transaction execution occurs off-chain in an optimized environment that allows for greatly increased transaction throughput. Only compressed, summarized transaction data is periodically posted to the layer 1 main chain by the rollup operator. This approach reduces fees and congestion on layer 1 compared to executing all transactions directly on layer 1.
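Conceptually, the operator's publishing step amounts to compressing the batch and posting it to layer 1 together with a commitment. The sketch below is purely illustrative; `post_to_l1` is a stand-in for a real contract call or blob transaction:

```python
import hashlib
import json
import zlib

# Illustrative only: the operator compresses a batch of executed transactions
# and posts the compressed bytes plus a commitment to layer 1.

def publish_batch(transactions, post_to_l1):
    payload = zlib.compress(json.dumps(transactions).encode())
    commitment = hashlib.sha256(payload).hexdigest()
    # Fees drop because layer 1 stores and orders this payload instead of
    # re-executing every transaction, but the payload must remain retrievable.
    post_to_l1(commitment=commitment, data=payload)
    return commitment
```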
In both sharding and rollups, no single node validates or even observes the full set of transactions across the entire system anymore. The previous data availability assumptions that held for traditional monolithic blockchains are broken. If a sequencer operator withholds the full transaction dataset for a rollup block, or a malicious group of colluding validators produces an invalid block in a shard, the full nodes in other shards or on layer 1 will not have access to the missing data. Without this data, they cannot generate fraud/validity proofs to signal invalid state transitions because the data required to identify the issue is unavailable.
Unless new robust methods are introduced to guarantee data availability, bad actors could exploit these new scaling models to selectively hide invalid transactions while maintaining enough visible block validity to avoid detection. Users are forced to simply trust that shard nodes and rollup operators will act honestly at all times, but trusting a large distributed set of actors to be consistently honest is risky and precisely what blockchains aim to avoid through incentive mechanisms, decentralization and cryptography.
Maintaining the expected security benefits of light client models and effective fraud/validity proofs in the context of cross-shard transactions and layer 2 solutions requires much stronger assurances that the full set of transaction data remains available somewhere in the network upon request. The data itself does not need to be downloaded by all nodes across all shards, but it must at least be readily accessible if participants wish to verify blocks and generate fraud/validity proofs about potential issues.
A number of approaches have been proposed and explored to provide data availability without requiring all nodes in a sharded or layer 2 network to redundantly download and store the full transaction dataset:
Data availability sampling refers to a class of techniques that allow light clients to probabilistically check whether transaction data is available by downloading only small random fragments of the overall transaction dataset. Initiatives like proto-danksharding, Celestia, EigenDA and Avail have applied newer techniques such as KZG polynomial commitments and validity proofs to make this sampling more efficient and verifiable.
Typically, data availability sampling schemes rely on erasure coding, a method that takes the full transaction dataset and mathematically transforms it into a longer coded dataset by adding calculated redundancy. As long as a sufficient subset of the encoded fragments is available, the original data can be reconstructed by inverting the transform.
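The sketch below illustrates the k-of-n recovery property with a systematic Reed-Solomon-style code over a prime field. Production data availability layers use heavily optimized codes and two-dimensional extensions, so treat this only as a toy model of the underlying math:

```python
# Minimal erasure-coding sketch: k data chunks are extended to n coded chunks,
# and any k of them suffice to reconstruct the original data.

P = 2**61 - 1  # Mersenne prime field; each chunk is stored as one field element

def _lagrange_eval(points, x):
    """Evaluate the unique degree-(k-1) polynomial through `points` at x (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def encode(chunks, n):
    """Systematic encoding: positions 0..k-1 hold the data, k..n-1 the extension."""
    k = len(chunks)
    points = list(enumerate(chunks))
    return [chunks[x] if x < k else _lagrange_eval(points, x) for x in range(n)]

def decode(available, k):
    """Recover the original k chunks from ANY k available (index, value) pairs."""
    assert len(available) >= k, "not enough fragments to reconstruct"
    points = available[:k]
    return [_lagrange_eval(points, x) for x in range(k)]

# Example: 4 data chunks extended to 8; any 4 coded chunks reconstruct them.
data = [7, 11, 13, 17]
coded = encode(data, 8)
assert decode([(5, coded[5]), (1, coded[1]), (7, coded[7]), (2, coded[2])], 4) == data
```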
Light clients fetch and verify small random pieces of the erasure-coded data. If any of the sampled fragments is missing or unavailable, this suggests that the full erasure-coded dataset is likely unavailable to the network as a whole. The more samples a client collects from random parts of the dataset, the higher its likelihood of detecting any missing data. Erasure coding parameters can be tuned so that only a very small percentage of total fragments, on the order of 1%, needs to be randomly sampled by a light client to verify availability of the complete dataset with extremely high statistical confidence.
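As a rough illustration of why so few samples suffice, assume (for this sketch only) a 2x extension, where any k of the 2k coded fragments reconstruct the data. A producer hiding data must then withhold more than half of the fragments, so the chance that s random samples all miss the withheld portion falls off roughly as (1/2)^s:

```python
from math import comb

# Rough model of DAS detection power for an n = 2k erasure-coded block.
# To make the data unrecoverable, a producer must withhold at least k + 1 of
# the 2k fragments; the chance that s uniform samples (without replacement)
# all land on the remaining available fragments is hypergeometric.

def miss_probability(k, samples):
    n = 2 * k                 # total coded fragments (2x extension)
    available = k - 1         # the most the adversary can leave available
    return comb(available, samples) / comb(n, samples)

for s in (5, 10, 20, 30):
    print(s, miss_probability(k=512, samples=s))
# With ~30 samples the probability of missing withheld data drops below 1e-9,
# even though 30 fragments are a small fraction of the 1024 total.
```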
This general approach allows light clients to very efficiently check the availability of even very large transaction datasets without needing to actually download the entire dataset. The samples are also shared with full nodes on the network to help reconstruct any missing pieces of data and recover unavailable blocks when necessary.
Committee-based data availability schemes assign the responsibility for transaction data availability verification to a relatively small group of trusted nodes called a Data Availability Committee (DAC). The committee nodes store full copies of transaction data from blocks and signal that the data is indeed fully available by posting cryptographic signatures on the main chain. Light clients can then cheaply verify these signatures to gain confidence that the data is available to the committee nodes without actually processing or storing the data themselves.
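A light client's check against a DAC might conceptually look like the sketch below; the committee set, signature threshold, and `verify_sig` helper are assumptions of this example rather than any specific implementation's API:

```python
# Sketch of a DAC attestation check: availability is accepted only if enough
# distinct committee members have signed the data commitment. `verify_sig`
# stands in for a real signature check (e.g. BLS or ECDSA), and the committee
# and threshold are assumed to come from the chain's configuration.

def data_attested(commitment, signatures, committee, threshold, verify_sig):
    signers = {
        member
        for member, sig in signatures.items()
        if member in committee and verify_sig(member, commitment, sig)
    }
    # The client trusts availability only if >= threshold distinct members signed.
    return len(signers) >= threshold
```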
The fundamental tradeoff with Data Availability Committees is that light clients must ultimately trust the committee nodes to correctly signal data availability. Relying on a centralized and permissioned committee introduces some degree of centralization risks and single points of failure into the network. However, techniques like using a DAC consisting of Proof-of-Stake validators with slashing penalties for misbehavior can reduce, but not completely eliminate, trust requirements for light clients.
In data sharding schemes, transaction data is split across multiple shards, and light clients probabilistically sample data from all shards in order to verify data availability across the entire system as a whole. However, implementing cross-shard sampling typically adds considerable complexity to data availability protocols and may require a more complex networking topology to prevent single points of failure.
Emerging cryptographic techniques such as zero-knowledge validity proofs (zk-SNARKs, for instance) can potentially prove the validity of state transitions in a block without revealing any of the underlying transaction data. For example, a validity proof can establish that a rollup block transition is fully valid without exposing the private transaction data used in the rollup itself.
However, data still fundamentally needs to be available somewhere for full nodes to properly update their local states. If the underlying transaction data for a block is completely withheld by the block producer, full nodes cannot reconstruct the latest state or verify account balances. Succinct proofs guarantee the validity of state changes, but not the availability of the underlying data driving those changes.
Data availability is a critical challenge that must be addressed as blockchains scale transaction volumes and transition to advanced architectures like shards and rollups. Encouragingly, multiple viable pathways exist to prevent data availability from becoming a barrier that permanently restricts the scalability and censorship resistance of decentralized blockchain networks as they grow.