Data availability (DA) is crucial to the functionality and security of blockchains. In traditional monolithic blockchain architectures, full nodes retrieve blocks from peers to verify the integrity of the entire history of their chain based on predefined rules. This process relies on the assumption that data on the blockchain remains consistently "available" for access by all nodes in the network. This principle holds not just for layer-1 blockchains but also for layer-2 blockchains, which rely on data availability to inherit the security properties of their underlying layer-1 chains. For instance, optimistic rollups require DA to construct fraud proofs, while ZK-rollups depend on DA so that users can keep track of their balances and so that the full rollup state remains recoverable, allowing users to withdraw their funds to L1 without relying on the rollup operator.
In the blockchain space, there is an approach called modular architecture, which breaks a monolithic blockchain into multiple function-specific chains. A data availability (DA) layer is one of them. Projects like Celestia, EigenDA, and Espresso's Tiramisu DA are at the forefront of creating these DA layers, setting a foundation for rollups and future modular blockchains. Similarly, Ethereum aims to enhance its rollups' data availability by introducing data blobs.
This article delves into the key concepts presented in the paper Fraud and Data Availability Proofs. The paper introduced the data availability problem, the key idea behind Celestia and other DA layers, as well as the data availability sampling used in Danksharding, part of Ethereum's rollup-centric roadmap. Interestingly, the approach from the paper shows that nodes can probabilistically verify data availability without downloading an entire block. Such an approach minimizes hardware requirements, allowing even light nodes to contribute to the network's security. Paired with fraud proofs, light clients can be highly confident in the integrity of the blocks they receive.
Light clients allow users to interact with and access blockchain data (e.g., account balances and transaction lists) securely without fully trusting the data provider. Instead of syncing and validating the entire blockchain as full nodes do, light clients operate on block headers. They rely on full nodes or centralized RPC services to provide the necessary data and its associated proof, allowing them to verify the information without downloading the complete block content.
However, efficiency doesn't come without trade-offs. Light clients are unable to verify entire blocks independently. They require a mechanism to ensure that a block they receive has been agreed upon by the majority of validators, which is normally achieved by having validators sign the block header they vote on. Unfortunately, a light client can't directly verify all validators' signatures, especially in blockchains composed of thousands of validators. Ethereum, for example, tackles this problem by employing a rotating set of 512 validators—known as the "sync committee"—which changes every 1.1 days to attest to block headers. While this allows light clients to maintain their efficiency, they are not the ones validating the block themselves. Instead, they place their trust in either the majority of validators or, more riskily, the majority of a subset of validators (the committee). This poses a vulnerability: in situations where the majority is dishonest, light clients may unintentionally align with a corrupted consensus, lacking the evidence to reject it.
In contrast, full nodes download and validate the entire blockchain, giving them the ability to reject any incorrect block, even one produced by a dishonest majority of validators. This raises an interesting question: Can honest full nodes, even when they are in the minority, warn light clients to reject invalid blocks, eliminating the need for an honest-majority assumption? The answer is yes, and that is where fraud proofs come in!
The idea of fraud proofs is to allow a full node to generate a proof showing the invalidity of a block (for example, an invalid state transition, a wrong state root, or invalid transactions). This proof can then be relayed to light clients, convincing them that the block is invalid. A fraud proof should contain only the parts of the block required to check the portion suspected to be invalid, enabling light clients to process just these fragments, rather than the entire block, to reach a conclusion on its validity.
Next, we'll provide an overview of how fraud proofs can be effectively integrated into a block, enabling a blockchain to natively support fraud proofs, based on the insights from the paper. It's important to note that ensuring data availability emerges as a natural prerequisite for generating fraud proofs efficiently.
Rather than storing only a list of transactions, the paper introduces snapshots of state transitions called traces. Traces act like checkpoints within a block, representing intermediate blockchain states at fixed transaction intervals, and are interspersed among the transactions in the block. As visualized in Figure 1,
trace1 is the intermediate state achieved by applying transactions txn1 to txn3 to the final state of the previous block (i-1).
trace2 is the intermediate state resulting from applying transactions txn4 to txn6 to trace1, and so on.
This structure allows full nodes to construct fraud proofs efficiently by taking an invalid intermediate state together with its corresponding transactions, and letting light clients re-execute only this small subset to judge the block's validity instead of executing the block's full contents.
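To make this concrete, here is a minimal Python sketch of how a light client might check such a fraud proof. The `FraudProof` structure and the `apply_transaction` state-transition function are hypothetical stand-ins for illustration, not the paper's exact construction; a real verifier would also check Merkle proofs tying each piece to the block header.

```python
from dataclasses import dataclass

@dataclass
class FraudProof:
    """Hypothetical fraud proof: a pre-state trace, the transactions applied to it,
    and the (allegedly wrong) post-state trace committed to in the block."""
    pre_trace: dict           # intermediate state before the disputed segment
    transactions: list        # the few transactions between the two traces
    claimed_post_trace: dict  # the trace the block producer committed to

def apply_transaction(state: dict, txn) -> dict:
    """Stand-in state-transition function: a simple balance transfer."""
    sender, receiver, amount = txn
    new_state = dict(state)
    new_state[sender] = new_state.get(sender, 0) - amount
    new_state[receiver] = new_state.get(receiver, 0) + amount
    return new_state

def verify_fraud_proof(proof: FraudProof) -> bool:
    """Re-execute only the disputed segment and compare against the claimed trace.
    Returns True if the claimed trace is wrong, i.e. the fraud proof is valid."""
    state = proof.pre_trace
    for txn in proof.transactions:
        state = apply_transaction(state, txn)
    return state != proof.claimed_post_trace
```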
When a light client initiates a data request, it can ask the network to provide a fraud proof for the corresponding block. If the light client doesn't receive any fraud proof within a predetermined timeout, it can be highly confident that the block is valid. Nevertheless, this approach assumes that the light client is connected to at least one honest full node capable of providing the necessary fraud proof.
Although fraud proofs offer an enhanced layer of security, raising the security of light clients to nearly that of full nodes, a light client must maintain a connection with at least one honest full node that can supply those proofs. A further requirement of this setup is data availability: full nodes need access to the block data in order to construct fraud proofs.
Data availability remains crucial even if we shift from fraud-proof schemes to validity-proof schemes such as Zero-Knowledge Proofs (ZKP). The absence of data can prevent nodes from reconstructing the state of the blockchain. As a result, users and nodes may find themselves unable to interact with the blockchain despite being aware that the chain state transitions are correct. This brings forth the data availability problem. Ensuring that data is both published and accessible to nodes is crucial. But how can light clients be sure that data is available without downloading a full block?
A straightforward approach is to have each light client independently sample small random parts of a block. Through repeated sampling, it can ascertain, with a high degree of confidence, that the block is indeed accessible. However, naive probabilistic sampling alone isn't enough: a malicious block producer only needs to withhold a small portion of the block, and there is a real chance that this portion is never sampled, leading light clients to believe the block is available when, in fact, it isn't.
To address this issue, the paper utilizes erasure codes to extend the data, providing redundancy and the ability to recover the original data in the event of data loss. Erasure coding coupled with random sampling yields a very efficient scheme: to make a block unrecoverable, a block producer must withhold a large fraction of the extended data, and such large-scale withholding is detected by even a modest number of samples with very high probability.
An erasure code is a method that extends a message of length $k$ into a longer message of length $n > k$, allowing the original message to be reconstructed from any subset of $k$ elements of the extended message. For instance, consider a message $m$ of length $k$, denoted as $m = (m_1, m_2, \dots, m_k)$. Using erasure codes, this message can be extended to length $2k$, resulting in $(m_1, \dots, m_k, e_1, \dots, e_k)$, called the extended message $m'$. Given the properties of erasure codes, any subset of $k$ unique elements from $m'$ can be used to restore both the original message $m$ and the extension.
Reed-Solomon coding is a method of encoding a message using polynomials. By interpreting the message $m = (m_1, \dots, m_k)$ as a polynomial $Q(x)$, it is possible to transform $m$ into a longer message as follows:
Define $Q(x)$, a polynomial of degree $k-1$, such that $Q(i) = m_i$ for all $i \in \{1, \dots, k\}$.
Extend the message to length $2k$ by evaluating $Q(x)$ at $x \in \{k+1, \dots, 2k\}$, setting $e_j = Q(k+j)$. The result is $m' = (m_1, \dots, m_k, e_1, \dots, e_k)$, where $e_j = Q(k+j)$ for $j \in \{1, \dots, k\}$.
If you remember your high school math class, given two points $(x_1, y_1)$ and $(x_2, y_2)$ on the coordinate plane, one can create a unique line passing through those points. Think of erasure codes similarly. Given an initial two-element message, represented as the points $(1, m_1)$ and $(2, m_2)$, you can draw a line passing through these points. This line represents the Reed-Solomon code of the initial message (specifically, further points on that line form the extended message produced by the Reed-Solomon code). Note that, in the case of errors rather than missing elements, Reed-Solomon codes can correct up to $\frac{n-k}{2}$ errors, not $n-k$. Explaining how errors are corrected can get quite complicated, so those interested should refer to source [8] for a more detailed explanation.
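For a concrete feel, here is a minimal Python sketch of this rate-1/2 extension using Lagrange interpolation over a prime field. The field size and the mapping of chunks to field elements are simplifying assumptions for illustration, not the parameters used by any production system.

```python
# Minimal sketch of a rate-1/2 Reed-Solomon extension via Lagrange interpolation.
P = 2**31 - 1  # a prime modulus, chosen only for simplicity

def lagrange_eval(points, x):
    """Evaluate the unique polynomial passing through `points` at `x` (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def rs_extend(message):
    """Extend (m_1, ..., m_k) to length 2k by appending e_j = Q(k + j)."""
    k = len(message)
    points = [(i + 1, m) for i, m in enumerate(message)]           # Q(i) = m_i
    return message + [lagrange_eval(points, k + j + 1) for j in range(k)]

def rs_recover(known_points, k):
    """Recover the original k symbols from any k known (index, value) pairs."""
    return [lagrange_eval(known_points, i + 1) for i in range(k)]

msg = [7, 13, 42, 99]                        # k = 4 original chunks (as field elements)
ext = rs_extend(msg)                         # extended message of length 2k = 8
survivors = [(i + 1, ext[i]) for i in (1, 3, 5, 6)]   # any 4 of the 8 symbols survive
assert rs_recover(survivors, len(msg)) == msg         # ...and suffice to rebuild m
```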
Returning to the paper, the data we're encoding consists of the sequence of transactions and intermediate state traces discussed in Section 3 (see Figure 1). We interpret this data as a message and then divide it into small, equal-sized chunks, with each chunk treated as an element of the message. As illustrated in Figure 2, we partition block i into four chunks: chunk 1 contains txn1 to txn3 along with trace1; chunk 2 comprises txn4 to txn6 along with trace2; and so on. Finally, we utilize Reed-Solomon coding to encode this message.
However, 1D encoding is not optimal. As the message lengthens, the degree of the polynomial increases, leading to higher computational complexity. Moreover, if a 1D encoding is generated incorrectly, it can corrupt the entire dataset, resulting in an unrecoverable block. A solution to this issue is to encode the data in two or more dimensions. With this multi-dimensional encoding approach, incorrectly generated codes may only affect a subset of dimensions, rather than the entire dataset, as in the 1D case.
The proof size for an incorrectly generated code in a single dimension is $O(n)$. In contrast, when using $d$ dimensions, this size decreases significantly to approximately $O(n^{1/d})$. Conceptually, this involves converting a 1D message of length $n$ into a $d$-dimensional hypercube, where the length of each side is equal to $n^{1/d}$. By adopting this structure, proofs can be generated along specific axes, effectively reducing the proof size to $O(n^{1/d})$.
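For a rough sense of scale (the chunk count below is an arbitrary illustration, not a figure from the paper): with $n = 2^{20}$ chunks, a 1D proof on the order of $O(n)$ spans roughly a million chunks, while a 2D proof on the order of $O(\sqrt{n})$ covers only a single row or column:

$$n = 2^{20}: \qquad O(n) \approx 1{,}048{,}576 \ \text{chunks (1D)} \qquad \text{vs.} \qquad O(\sqrt{n}) \approx 1{,}024 \ \text{chunks (2D)}.$$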
To utilize 2D Reed-Solomon encoding, we first arrange the message into a square matrix rather than a vector before encoding. Here's how to apply 2D Reed-Solomon encoding to a message of length n (a short code sketch follows the steps below):
Reshape the Matrix: Convert the message of length $n$ into a square matrix of dimensions $\sqrt{n} \times \sqrt{n}$, as depicted in Figure 3. If $n$ is not a perfect square, pad the message's length to fit the matrix appropriately.
Row and Column Encoding: Apply 1D Reed-Solomon encoding to each row and column of the square matrix, as shown in the upper right and lower left matrices of Figure 4.
Vertical Extension Encoding: From the results of step 2, apply the 1D Reed-Solomon encoding again, but this time only to each row of the vertically extended message (the lower left matrix). This step produces the final part of the 2D extended data (shown in the lower right matrix).
Final Matrix Assembly: Construct the final matrix by combining the original square matrix with the three extended matrices, as illustrated in Figure 5.
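Below is a minimal Python sketch of these four steps, reusing the `rs_extend` helper from the earlier sketch. The quadrant layout follows the description above and is an illustration rather than a production encoder.

```python
import math

def rs_extend_2d(message):
    """2D extension sketch: reshape to a k x k matrix, extend rows, extend
    columns, then extend the rows of the column parity (steps 1-4 above)."""
    k = math.isqrt(len(message))
    assert k * k == len(message), "pad the message to a perfect square first"
    original = [message[r * k:(r + 1) * k] for r in range(k)]

    # Step 2 (rows): extend each row; keep the parity half -> upper-right quadrant.
    row_parity = [rs_extend(row)[k:] for row in original]

    # Step 2 (columns): extend each column; parity half -> lower-left quadrant.
    columns = [[original[r][c] for r in range(k)] for c in range(k)]
    col_parity_by_col = [rs_extend(col)[k:] for col in columns]
    col_parity = [[col_parity_by_col[c][r] for c in range(k)] for r in range(k)]

    # Step 3: extend each row of the lower-left quadrant -> lower-right quadrant.
    both_parity = [rs_extend(row)[k:] for row in col_parity]

    # Step 4: assemble the final 2k x 2k matrix from the four quadrants.
    top = [original[r] + row_parity[r] for r in range(k)]
    bottom = [col_parity[r] + both_parity[r] for r in range(k)]
    return top + bottom

extended = rs_extend_2d([7, 13, 42, 99])   # a k = 2 toy message -> 4 x 4 matrix
```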
Understanding the minimum amount of data a malicious block producer must withhold to make a block unrecoverable is crucial. As an example, let's define the matrix of original data as a $k \times k$ matrix, which is then extended to a $2k \times 2k$ matrix, as depicted in Figure 6. Intuitively, one might assume that the encoding can withstand missing data up to half of the extended matrix, equivalent to $2k^2$ out of $4k^2$ elements. Given that the matrix expands from $k^2$ elements to $4k^2$ (a four-fold increase over the original), it might even seem that a 75% data loss would be tolerable. For instance, with 50% of the data missing, if both the upper left and upper right quarters are withheld, it appears plausible to use the lower left quarter to reconstruct them.
Unfortunately, a malicious block producer actually needs to withhold only $(k+1)^2$ out of the $(2k)^2$ elements of the data to prevent an honest full node from reconstructing the entire block. This amount is less than half of the total data, contradicting our initial intuition. Let's delve into this more systematically:
Case (a): Suppose a malicious actor withholds the upper left quarter of the extended data, marked red in Figure 7(a). While the block remains recoverable using either the upper right or lower left quarters, this is just the beginning.
Case (b): The malicious actor could try harder by also withholding the first column of the upper right quarter, as seen in Figure 7(b). Despite this, the block can still be reconstructed using the lower quarters.
Case (c): Taking this a step further, if a malicious actor withholds both the first row of the lower left quarter and the first element of the lower right quarter, as depicted in Figure 7(c), the block becomes unrecoverable. From case (b), we already know that the upper right quarter cannot be used to recover the upper left quarter. Similarly, due to its symmetry with the upper right, the same limitation applies to the lower left quarter. Our last hope would be to use the lower right quarter to reconstruct the lower left, then the upper left, and finally, the upper right quarters to recover the entire encoded matrix. However, with the first element of the lower right matrix missing, the first row of the lower left quarter becomes unrecoverable.
It's important to note that case (c) represents the worst possible scenario. In this case, a malicious block producer needs to withhold only $(k+1)^2$ out of $(2k)^2$ elements to make the block unrecoverable. However, to create this scenario, a malicious actor must selectively withhold specific pieces of block data.
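To put this threshold in perspective, the required fraction approaches a quarter of the extended matrix as $k$ grows:

$$\frac{(k+1)^2}{(2k)^2} = \frac{k^2 + 2k + 1}{4k^2} = \frac{1}{4} + \frac{1}{2k} + \frac{1}{4k^2} \;\longrightarrow\; \frac{1}{4} \quad (k \to \infty),$$

i.e., the adversary must withhold just over 25% of the extended data, and, as noted, at very specific positions.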
When a light client obtains a block header, the paper suggests that it should randomly sample the contents of the block. This step mitigates the risk of a malicious block producer selectively revealing the block contents. The client continues sampling until it becomes confident in the availability of the data for an honest full node. This assurance is crucial as it ensures that the full node can reconstruct the block and produce a fraud proof if necessary. Therefore, the security of the light client depends on this process of data availability sampling, which enables an honest full node to generate a fraud proof for the light client.
However, one might wonder: How many samples are required for a light client to be convinced (with high probability) that the data behind the received block header is available?
Building upon the idea from the previous section, a malicious block producer would have to withhold at least $(k+1)^2$ out of the $(2k)^2$ elements of the extended matrix to cause an unrecoverable block. The aim is to determine the probability that a light client, while sampling the contents of this matrix (a $2k \times 2k$ matrix with $(k+1)^2$ missing elements), encounters at least one unavailable element. Assuming a light client samples the extended matrix $s$ times (without replacement), there are $(2k)^2 - (k+1)^2$ available elements within the matrix. Here, $(k+1)^2$ denotes the number of unavailable elements.
Recalling our high school math, when selecting $r$ elements from $n$ elements without replacement, the number of possible ways is represented as "$n$ choose $r$," or $\binom{n}{r}$. Consequently, the chance that a light client samples only available elements, or, in other words, never encounters an unavailable element, can be denoted as:

$$P(X = 0) = \frac{\binom{(2k)^2 - (k+1)^2}{s}}{\binom{(2k)^2}{s}}.$$

Therefore, the probability that a light client encounters at least one unavailable element is given by:

$$P(X \geq 1) = 1 - P(X = 0),$$

or

$$P(X \geq 1) = 1 - \frac{\binom{(2k)^2 - (k+1)^2}{s}}{\binom{(2k)^2}{s}}.$$
To obtain a cleaner form of the above equation, one could rearrange it with a few lines of algebra, but instead I will copy the simplified expression from the paper, which can be written as:

$$P(X \geq 1) = 1 - \prod_{i=0}^{s-1}\left(1 - \frac{(k+1)^2}{(2k)^2 - i}\right).$$
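This expression is easy to evaluate directly. The short Python sketch below computes it for a given $k$ and number of samples $s$; the example values in the final line are just an illustration.

```python
def prob_detect(k: int, s: int) -> float:
    """Probability that at least one of s distinct samples from the (2k)^2
    extended matrix lands on one of the (k+1)^2 withheld elements."""
    total = (2 * k) ** 2          # size of the extended matrix
    withheld = (k + 1) ** 2       # minimum withheld to make the block unrecoverable
    p_all_available = 1.0
    for i in range(s):
        p_all_available *= 1 - withheld / (total - i)
    return 1 - p_all_available

# For example, with k = 64, sixteen samples already exceed a 99% detection probability.
print(prob_detect(64, 16))
```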
Let's plot the above expression to see how the probability behaves; the graph will help us understand the security of a light client more easily.
In Figure 8, it's evident that for smaller values of $k$ the probability approaches 1 much faster than for higher values of $k$. Interestingly, once $k$ is large enough, the probability becomes nearly independent of $k$. This raises an important question: Why choose a greater value of $k$ when it requires the light client to sample more matrix elements to achieve the same level of confidence? The answer lies in the definition of $k$. As explained in Sections 4.1 and 4.2, a message is divided into small, equal chunks arranged in a $k \times k$ matrix, so the number of chunks is $k^2$. Consequently, as $k$ increases, the size of each chunk decreases, reducing the total amount of data a light client needs to download.
At first glance, it might appear counterintuitive that as the chunk size diminishes (with an increasing $k$), the light client has to take more samples (a larger $s$) to maintain a similar level of confidence. Considering that the total data a light client downloads is roughly $s$ times the chunk size, one might wonder whether these two factors simply offset each other. However, the situation isn't quite so straightforward. Given that the probability becomes nearly independent of $k$ for sufficiently large values, we can keep $s$ constant while increasing $k$ to reduce the chunk size. By optimizing $k$ in this manner, the bandwidth required for light clients can be decreased while keeping the confidence level (almost) unchanged. This approach minimizes the network bandwidth needed to operate a light client effectively.
As a result, a light client can achieve security nearly comparable to that of a full node while requiring only slightly more bandwidth for sampling than a conventional light client, which downloads only the block header. As shown in Figure 9, the number of samples a light client needs to reach a 99% probability of encountering at least one unavailable element stabilizes once $k$ becomes sufficiently large. Therefore, even as $k$ continues to increase, which in turn reduces the chunk size, the overall data that a light client needs to download decreases.
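As a quick sanity check on this flattening behaviour, the sketch below (reusing `prob_detect` from the earlier snippet) searches for the smallest $s$ that reaches a 99% target for a few values of $k$; under these assumptions, the required number of samples settles in the mid-teens once $k$ grows.

```python
def samples_for_target(k: int, target: float = 0.99) -> int:
    """Smallest number of samples s such that prob_detect(k, s) >= target."""
    s = 0
    while prob_detect(k, s) < target:
        s += 1
    return s

for k in (4, 16, 64, 256):
    print(f"k = {k:>3}: {samples_for_target(k)} samples needed")
```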
So far, we have explained the paper (https://arxiv.org/abs/1809.09044), which pushes the security of a light client as close as possible to that of a full node by introducing a fraud-proof scheme. This scheme allows a blockchain to natively support fraud proofs by interleaving execution traces (snapshots of intermediate state) between transactions within a block. Thus, when an error or fraud is found, an honest full node connected to a light client can generate a fraud proof covering only the relevant subset of the computation instead of the whole block. As a result, the light client can check the proof efficiently and reject an invalid block even in the face of a dishonest majority. Note that the security of the light client only comes close to, and does not equal, that of a full node, because fraud proofs require the light client to be connected to at least one honest full node.
While the concept of fraud proofs is elegant, it brings forth a new challenge known as the data availability problem. When a light client obtains a block header, it must ensure the block data is available so that an honest full node can generate a fraud proof if needed. Thankfully, as demonstrated, the bandwidth required for this data availability sampling is minimal when the block is divided into a sufficiently large number of chunks. Our discussion so far has focused on a single light client's sampling. Now, imagine a large network of full nodes and light clients. This structure paves the way for a more robust and decentralized blockchain design and also leads to a new paradigm called modular blockchains.
[1] https://arxiv.org/abs/1809.09044
[2] https://github.com/ethereum/research/wiki/A-note-on-data-availability-and-erasure-coding
[3] https://www.parity.io/blog/what-is-a-light-client/
[4] https://a16zcrypto.com/posts/article/an-introduction-to-light-clients/
[6] https://arxiv.org/abs/1905.09274
[7] https://notes.ethereum.org/@dankrad/danksharding_encoding
[8] https://tomverbeure.github.io/2022/08/07/Reed-Solomon.html