Understanding Data Availability Sampling

Thanks to François Thiré for feedback on early versions of this work.

Introduction

Data availability sampling is crucial for scaling Ethereum. However, relative to its importance, it receives little research and attention. I believe this is due to a gap in understanding of its inner workings and challenges. This post aims to bridge that gap by explaining, in layman's terms:

  • What the data availability problem is.

  • Various solutions to the problem.

  • Why data availability sampling (DAS) is needed.

  • Various DAS-based solutions (Celestia, Danksharding).

What is the Data Availability Problem?

Rollups scale blockchains by offloading execution to L2s while still posting the transaction data to the L1. Because this data must be posted on-chain, the amount of data the L1 can absorb becomes the bottleneck. This necessitates scaling L1 data throughput, and in doing so, we encounter the data availability problem.

The setup of the data availability problem is illustrated below:

  1. A publisher publishes some data corresponding to a commitment C (a Merkle root, KZG commitment, etc.).

  2. The commitment C is distributed among the validators in the network.

  3. Validators check if the published data is available by running an isAvailable function. isAvailable is a function that takes the commitment C and returns true or false.
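To make this interface concrete, here is a minimal sketch in Python. The names and types are purely illustrative assumptions, not taken from any implementation:

```python
# Minimal sketch of the interface discussed in this post. The names and types
# are hypothetical and chosen for illustration only.
from typing import Protocol

Commitment = bytes  # e.g., a Merkle root or a KZG commitment


class AvailabilityChecker(Protocol):
    def is_available(self, commitment: Commitment) -> bool:
        """Return True iff the data behind `commitment` is deemed available."""
        ...
```

Each solution in the rest of this post is, in effect, a different implementation of this function with a different trust assumption.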

Following the formulations introduced in the original DAS paper (Al-Bassam et al., 2018), we define two desired properties of the isAvailable function. The first is availability:

Definition 1 (Availability)

If isAvailable(C)=true, then the validator could have downloaded the whole data if it wanted to.

Note the "if the node wanted to" part. It means the node doesn't necessarily download the full data. Indeed, as you see in the following sections, the critical question is how a validator can know it could have downloaded the complete data without actually doing so.

The next is agreement:

Definition 2 (Agreement)

If one honest validator gets isAvailable(C)=true, then all honest validators get isAvailable(C)=true within some time frame.

Using these definitions, we can state the data availability problem as follows:

Definition 3 (The Data Availability Problem)

How do we design an isAvailable function that satisfies both availability and agreement?

Before delving into the various data availability solutions, let’s review the network assumptions underpinning this article.

Network Assumption

We assume the following:

Assumption 1 (Network)

All honest validators are either directly or indirectly connected.

In other words, malicious actors cannot isolate an honest validator from the network. So when an honest validator submits a message to the network, it will eventually be received by all other honest validators.

Next, we will explore a range of solutions to the data availability problem and their security guarantees.

Naive Data Availability

First, we examine the most straightforward solution: Every node in the network downloads all the data in full, as illustrated below:

The definition of isAvailable is:

Definition 4 (isAvailable for naive DA)

isAvailable(C) = Download the full data, return true if successful and false otherwise.
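As a rough sketch, assuming hypothetical download_full_data and verify helpers that are not part of any specific client, the naive check looks like this:

```python
# Sketch of the naive check. `download_full_data` and `verify` are hypothetical
# helpers: the first fetches the complete data from peers (None on failure),
# the second checks the data against the commitment C.
def is_available_naive(commitment, download_full_data, verify):
    data = download_full_data(commitment)
    return data is not None and verify(commitment, data)
```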

Next, let's see if availability and agreement hold for this definition of isAvailable.

Availability holds trustlessly.

Here we say trustlessly because a node can check availability without relying on any other entity.

Agreement also holds.

Proof: If one honest validator can download the full data, it can propagate it to all other honest validators by the network assumption (Assumption 1), so they, too, will eventually hold the full data and return true.

This is what we have in Bitcoin and current Ethereum. It is ideal in terms of security, as availability and agreement hold with minimal trust assumptions.

However, there is one problem: It does not scale. Each node has to download all the data in its entirety. The question then is, how do we design a scalable isAvailable function that satisfies both availability and agreement?

IPFS

The next solution that might come to mind is to store the data in an external storage system, such as IPFS, and only publish the data's commitment to the network, as illustrated below. This method is scalable because it eliminates the need for the validator to download all the data.

The isAvailable function is defined as:

Definition 5 (isAvailable for IPFS solution)

isAvailable(C) = Download the commitment, return true if successful and false otherwise.

However, the problem with this approach becomes clear when we look at the security guarantee:

Availability holds if we trust the publisher

We fully rely on a single entity, the publisher, to upload the data to IPFS. The system's availability is compromised if the publisher decides not to upload the data and just publishes the data's commitment to the network.

Trusting a single entity is the worst trust assumption. How can we do better?

Data Availability Committee

The issue with the prior approach was its reliance on trusting a single entity, the publisher, to make the data available. Then, what if we introduce a set of entities tasked with checking and attesting that the data has indeed been made available? This is the idea behind data availability committees, and it works as follows:

  1. The publisher sends the full data to some committee.

  2. Committee members sign the commitment to the data if they can download the data.

  3. The committee's signatures are aggregated into a certificate and propagated to the validators.

Here, the validators run the following isAvailable function:

Definition 6 (isAvailable for data availability committee)

isAvailable(C) = Download the certificate and check that it contains signatures from some threshold number of committee members.

The trust assumption for this function is:

Availability holds if we trust some threshold number of committee members.

For example, if the threshold is 8 and the committee size is 10, then availability holds as long as 8 out of 10 committee members are honest.
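As a sketch, the certificate check might look like the following, assuming a hypothetical certificate object that carries (public key, signature) pairs and a caller-supplied verify_sig function:

```python
# Hypothetical sketch of the committee-certificate check (not an actual DAC API).
def is_available_dac(commitment, certificate, committee_pubkeys, verify_sig,
                     threshold=8):
    """Accept iff the certificate holds valid signatures over the commitment
    from at least `threshold` distinct committee members (e.g., 8 of 10)."""
    signers = {
        pk
        for pk, sig in certificate.signatures  # assumed [(pubkey, signature), ...]
        if pk in committee_pubkeys and verify_sig(pk, commitment, sig)
    }
    return len(signers) >= threshold
```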

For agreement, since the same certificate is downloaded and propagated to all validators:

Agreement holds.

Now, the next big question becomes: Is there a solution that does not rely on trusting some external committee? To achieve this, we introduce a powerful tool to our toolbox: Erasure coding.

Background Knowledge: Erasure Coding

Erasure coding is a technique that introduces redundancy to data to make the data resistant to partial data loss. It works as follows:

  1. The original data is split into n parts.

  2. The parts are expanded to a larger number, for example, to 2n parts.

The original data can be reconstructed from any n of the 2n parts.
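To make the "any n of 2n" property concrete, here is a toy Reed-Solomon-style sketch over a prime field. It is purely illustrative; production systems use optimized libraries and different parameters:

```python
# Toy Reed-Solomon-style erasure code over a prime field (illustration only).
P = 2**31 - 1  # field modulus (a Mersenne prime)

def lagrange_eval(samples, x):
    """Evaluate the unique polynomial through the (x_i, y_i) samples at x."""
    total = 0
    for j, (xj, yj) in enumerate(samples):
        num, den = 1, 1
        for m, (xm, _) in enumerate(samples):
            if m != j:
                num = num * (x - xm) % P
                den = den * (xj - xm) % P
        total = (total + yj * num * pow(den, P - 2, P)) % P
    return total

def encode(chunks, expansion=2):
    """Treat the n chunks as evaluations at x = 0..n-1 and extend to 2n points."""
    pts = list(enumerate(chunks))
    return [lagrange_eval(pts, x) for x in range(expansion * len(chunks))]

def reconstruct(samples, n):
    """Recover the original n chunks from any n (index, value) pairs."""
    assert len(samples) >= n
    return [lagrange_eval(samples[:n], x) for x in range(n)]

data = [11, 22, 33, 44]                           # n = 4 original chunks
coded = encode(data)                              # 2n = 8 parts
subset = [(i, coded[i]) for i in (6, 1, 3, 5)]    # any 4 parts suffice
assert reconstruct(subset, 4) == data
```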

We cover two solutions that use erasure coding: Information dispersal algorithm and data availability sampling.

Information Dispersal Algorithm

Using erasure coding, we can design a solution that:

  • Does not rely on an external committee.

  • Is scalable in the sense that no node needs to download all the data.

It works as follows:

  1. The publisher erasure-codes the data and disperses the erasure-coded parts among the validators.

  2. Each validator downloads a single pre-assigned part.

  3. The validators run a consensus algorithm. They vote “yes” if they can download their part and “no” otherwise.

The idea of dispersing erasure-coded parts to ensure data availability has been researched under the term Information Dispersal Algorithms (IDA) (M. Rabin, 1989) in the past, so we use this terminology in this post.

The isAvailable function is:

Definition 7 (isAvailable for information dispersal algorithms)

isAvailable(C) = if consensus says “yes” then true else false
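A rough sketch of the two sides of this scheme, where download_part and consensus are hypothetical stand-ins for the chain's actual dispersal and consensus machinery:

```python
# Hypothetical sketch of the IDA flow; `download_part` and `consensus` are
# assumed stand-ins, not real APIs.
def ida_vote(commitment, my_index, download_part, consensus):
    """Each validator votes 'yes' iff its single pre-assigned part is downloadable."""
    part = download_part(commitment, my_index)
    consensus.vote(commitment, part is not None)


def is_available_ida(commitment, consensus):
    """isAvailable simply reports the consensus outcome."""
    return consensus.decided(commitment)  # True iff consensus decided "yes"
```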

Next, let’s look at the security guarantee of this approach. Availability holds under an honest-majority trust assumption:

Availability holds with honest majority trust.

The intuition behind this is that if the majority of validators are dishonest, they can force honest validators to accept unavailable data by voting “yes” when their part is not available. On the other hand, if the majority of validators are honest, the original data can be reconstructed using the parts held by the honest validators, even if the publisher makes some parts unavailable or if some dishonest validators withhold their parts.

Since all validators agree on the result of the consensus, we have:

Agreement holds.

Projects like Polkadot, NEAR, and EigenDA use this solution to ensure data availability.

The Holy Grail of Data Availability

At first glance, a system with an honest majority assumption for data availability might seem sufficient, especially since many L1s seemingly operate under a similar trust assumption (the so-called “honest majority” assumption). However, there is a crucial difference between the honest majority assumption in the L1 security context and the one in the data availability context.

A data withholding attack, where the network is tricked into agreeing that unavailable data is available, is really bad. Suppose a single transaction is withheld, and suppose it is an invalid transaction that mints 99999999 tokens out of thin air. Since nobody can download the transaction, nobody can notice the malicious behavior and detect its invalidity. This is different from the so-called “51% attacks” on blockchains. In a 51% attack, the attacker can only reorder or censor transactions but cannot force honest validators to accept an invalid transaction. This is because every honest validator downloads and executes every transaction and will reject any block containing an unavailable or invalid one.

In other words, Bitcoin and Ethereum do not rely on trusting a majority to reject invalid transactions. However, if we introduce an honest majority assumption for data availability, we have an honest majority assumption for L2 transaction validity. This is a strictly worse safety guarantee than L1, and we can no longer say the L2s inherit the full security of the L1.

This insight leads us to the “holy grail” of data availability:

A scalable system with trustless availability and agreement.

In other words, what we want is a system that has the same security guarantee as naive data availability but does not require all nodes to download everything.

Data availability sampling (DAS) aims to achieve this holy grail.

Data Availability Sampling

Data availability sampling works as follows:

  • The publisher publishes erasure-coded data.

  • Validators randomly sample the erasure-coded data by attempting to download random parts of the data. If all the sampling succeeds, the data is accepted as available.

DAS can detect data withholding. Suppose a publisher wants to conduct a data-withholding attack. If they make more than half of the data available, the original data can be reconstructed using the available data. Hence for the publisher to withhold the data, they would need to withhold more than half of it. Now, suppose the validator checks if the data is available via DAS, as illustrated below:

If the validator makes n sampling requests, the probability of detecting the data withholding (i.e., the probability of hitting a withheld part) is at least 1 - (1/2)^n. With n = 20, validators detect data withholding with ~99.9999% probability.
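A quick back-of-the-envelope check of this number, assuming exactly half of the erasure-coded parts are withheld:

```python
# Back-of-the-envelope check of the detection probability 1 - (1/2)^n when
# half of the erasure-coded parts are withheld (illustration only).
for n in (5, 10, 20):
    p_all_samples_available = 0.5 ** n  # every sample lands on an available part
    print(f"n={n:2d}  detection probability = {1 - p_all_samples_available:.7f}")
# n=20 -> 0.9999990..., i.e. roughly a 99.9999% chance of hitting a withheld part
```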

Strawman DAS

The most straightforward DAS construction is to have the publisher erasure code the data, and then have all validators run DAS against the publisher, as illustrated below:

Here, the isAvailable function is:

Definition 8 (isAvailable for strawman DAS)

isAvailable(C) = Run DAS against the publisher, return true if it succeeds and false otherwise.
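A minimal sketch of this check, assuming a hypothetical publisher.get_part(i) request that returns None for withheld parts:

```python
import random

# Hypothetical sketch of the strawman check; `publisher.get_part(i)` is an
# assumed sampling interface, not a real API.
def is_available_strawman(publisher, total_parts, n_samples=20):
    """Sample random parts directly from the publisher; accept only if every
    sampled part is actually served."""
    for _ in range(n_samples):
        if publisher.get_part(random.randrange(total_parts)) is None:
            return False
    return True
```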

This gives us:

Availability holds trustlessly (with high probability).

This is because with DAS, the validator can know, with very high probability, that more than 50% of the data is available. If more than 50% of the data is available, the validator could have gotten 50% of the parts through the sampling interface and reconstructed the full data if it wanted to.

However, this simple setup has the following problem:

Agreement holds if we trust the publisher

This is because a malicious publisher can withhold small parts of the data and cause disagreement in the DAS results among the validators. For example, suppose the publisher withholds a single part, as shown below:

For most validators, DAS will succeed, and they will get isAvailable(C)=true. However, the validator that happened to sample the withheld part will fail its DAS request and get isAvailable(C)=false.

Not having agreement is bad; in the context of blockchains, it leads to reorgs or liveness issues.

So, how do we achieve agreement? To accomplish this, we need a reconstruction protocol.

The Reconstruction Protocol

The intuition of the reconstruction protocol is: If there's enough data available to reconstruct the full data, why not just reconstruct it completely? This way, there wouldn't be any missing parts, and all validators would have their DAS succeed.

We will refer to the process of reconstructing data to reach agreement as the reconstruction protocol, which is illustrated below:

If we had DAS with such a reconstruction protocol, we would have trustless availability and agreement - the holy grail of data availability.

When we talk about DAS, what we actually need is DAS + reconstruction. The reconstruction protocol is at least as important as the DAS protocol but is often overlooked. And the reconstruction protocol is where the core design challenges of constructing a DAS system lie.

So, how would we implement such a reconstruction protocol? More specifically, who will do the reconstruction? We will cover two approaches:

  • Celestia approach

  • Ethereum Danksharding approach

Celestia

Celestia splits the nodes in its network into full nodes and light nodes. Full nodes download all blocks in full, as done in the naive DA solution:

Celestia scales by introducing large blocks. Downloading and propagating such large blocks requires full nodes to have powerful hardware and bandwidth.

To minimize trust in such powerful full nodes, Celestia introduces light nodes that run DAS against the full nodes to monitor data availability. This setup is illustrated below:

Light nodes, through the use of DAS, can verify if data is available without having to rely on full nodes.

Since there are now two types of nodes in the network, we will need to define two isAvailable functions. For full nodes:

Definition 9 (isAvailable for Celestia full nodes)

isAvailable(C) = Download full blocks.

For light nodes:

Definition 10 (isAvailable for Celestia light nodes)

isAvailable(C) = Run DAS against the full nodes.

Since full nodes download the full data (as in naive DA) and light nodes run DAS (as in strawman DAS), for both full nodes and light nodes:

Availability holds trustlessly.

Then what about agreement?

Reconstruction in Celestia

What happens if full nodes conduct a data-withholding attack?

In the diagram below, some malicious full nodes are conducting a data-withholding attack. These malicious nodes share the full data among themselves and respond to DAS requests from light nodes but withhold the data from other honest full nodes:

Notice that agreement does not hold as a result of this attack.

To get agreement, we need a reconstruction protocol. In Celestia, reconstruction is done by contributions from the light nodes in the following steps.

First, the light nodes run DAS against the malicious full nodes and store the obtained samples locally:

Next, the light nodes provide the samples to all full nodes to which they are connected. At least one of these full nodes should be honest, by the network assumption:

The honest full node that collected enough samples will reconstruct the full data:

The honest full node will propagate the data to all honest full nodes:

Now agreement holds!
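Putting these steps together, here is a rough sketch of the honest full node's role, with incoming_samples, reconstruct, and gossip as hypothetical stand-ins for the light-node forwarding, erasure decoding, and gossip layers:

```python
# Rough sketch of the honest full node's side of Celestia-style reconstruction.
# `incoming_samples()` yields (index, part) pairs forwarded by light nodes,
# `reconstruct(samples, n)` is an erasure decoder (e.g., the earlier sketch),
# and `gossip(data)` propagates the result to peers. All three are assumptions.
def collect_and_reconstruct(n, incoming_samples, reconstruct, gossip):
    """Gather forwarded samples until any n distinct parts are known,
    then rebuild the full data and gossip it to the other full nodes."""
    samples = {}
    for index, part in incoming_samples():
        samples[index] = part
        if len(samples) >= n:                     # enough parts to decode
            data = reconstruct(sorted(samples.items()), n)
            gossip(data)                          # restores agreement
            return data
    return None  # not enough samples were ever received
```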

Note that agreement was possible because we had:

  • Enough light nodes in the network to collectively sample enough parts to reconstruct the full data.

  • At least one honest full node to reconstruct the data.

Hence:

Agreement holds if there are a sufficient number of light nodes and at least one honest full node.

This is because:

  • We need a sufficient number of honest light nodes to sample enough data for reconstruction collectively.

  • We need a single honest full node to gather samples and conduct the reconstruction.

Celestia's core innovation is empowering light nodes to participate in the reconstruction protocol. Light nodes are no longer "free riders" of the network that passively check availability for their own good. They actively contribute to the network by circulating the samples obtained via DAS.

Furthermore, the more light nodes there are, the more throughput the network can support. This is because more light nodes means a larger number of samples routed from malicious full nodes to honest ones. As a result, larger data can be reconstructed from the samples, so the network can increase the block size while still maintaining agreement.

Celestia's design requires powerful full nodes to handle the reconstruction of data. This makes the design hard to incorporate into L1s that want to maintain a large set of validators by keeping the hardware requirements of full nodes low. So the question becomes:

Can we design a DAS + reconstruction scheme without relying on powerful full nodes?

Danksharding aims to achieve this by enabling partial reconstruction.

Danksharding

With partial reconstruction, instead of relying on powerful full nodes to handle the reconstruction, many low-resource nodes collectively contribute to the reconstruction in a distributed way.

Then how can we implement such partial reconstruction? Danksharding utilizes 2D erasure coding for this.

2D Erasure Coding

In 1D erasure coding, we erasure code the data in the horizontal direction as below:

The challenge with 1D erasure coding is that data reconstruction requires downloading and running the reconstruction algorithm on the full data in one go. Therefore, only powerful nodes with enough bandwidth and computing power can execute the reconstruction algorithm.

The core idea of 2D erasure coding is to further erasure code the data in the vertical direction:

By applying erasure coding to the data both horizontally and vertically, we enable partial reconstruction on a per row/column basis:

Furthermore, each cell in this 2D matrix can be reconstructed either in the row direction or the column direction:

As a result, rows that are majority unavailable can still be restored by reconstructing the columns (and vice-versa):
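To make this concrete, here is a toy sketch of the 2D extension and per-cell recovery, using the same prime-field Lagrange idea as the earlier erasure-coding sketch. It is illustration only; Danksharding itself uses Reed-Solomon extensions committed to with KZG, not this toy construction:

```python
# Toy 2D extension sketch (illustration only, same field as the 1D sketch).
P = 2**31 - 1

def lagrange_eval(samples, x):
    """Evaluate the unique polynomial through the (x_i, y_i) samples at x."""
    total = 0
    for j, (xj, yj) in enumerate(samples):
        num, den = 1, 1
        for m, (xm, _) in enumerate(samples):
            if m != j:
                num = num * (x - xm) % P
                den = den * (xj - xm) % P
        total = (total + yj * num * pow(den, P - 2, P)) % P
    return total

def extend(values):
    """Extend k values (evaluations at 0..k-1) to 2k evaluations."""
    k, pts = len(values), list(enumerate(values))
    return [lagrange_eval(pts, x) for x in range(2 * k)]

def extend_2d(block):
    """Extend a k x k block to 2k x 2k: rows first, then every column."""
    k = len(block)
    rows = [extend(row) for row in block]                       # k x 2k
    cols = [extend([rows[r][c] for r in range(k)]) for c in range(2 * k)]
    return [[cols[c][r] for c in range(2 * k)] for r in range(2 * k)]

def recover_cell(matrix, r, c, k):
    """Rebuild cell (r, c) from any k known cells in its row, or failing that,
    any k known cells in its column (None marks a withheld cell)."""
    row = [(x, v) for x, v in enumerate(matrix[r]) if v is not None]
    if len(row) >= k:
        return lagrange_eval(row[:k], c)
    col = [(x, matrix[x][c]) for x in range(2 * k) if matrix[x][c] is not None]
    if len(col) >= k:
        return lagrange_eval(col[:k], r)
    return None  # not enough cells in either direction
```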

What is Danksharding

Danksharding enables distributed data reconstruction by leveraging 2D erasure coding to the full.

First, the publisher publishes 2D erasure-coded data to the network. Then, nodes in the network check the availability of this data by doing two things:

  • Run DAS against the data.

  • Download a few pre-assigned rows/columns.

This process is illustrated below:

We have the following isAvailable function:

Definition 11 (isAvailable for Danksharding)

isAvailable(C) = Return true if both DAS and downloading the assigned rows/columns succeed, and false otherwise.
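A minimal sketch of this combined check, assuming hypothetical sample(row, col) and download_line(kind, index) network helpers that return None on failure:

```python
import random

# Hypothetical sketch of a Danksharding-style availability check; `sample` and
# `download_line` are assumed network helpers, not real APIs.
def is_available_danksharding(size, n_samples, assigned_rows, assigned_cols,
                              sample, download_line):
    """Accept only if random sampling AND all assigned row/column downloads succeed."""
    for _ in range(n_samples):
        if sample(random.randrange(size), random.randrange(size)) is None:
            return False
    if any(download_line("row", r) is None for r in assigned_rows):
        return False
    if any(download_line("col", c) is None for c in assigned_cols):
        return False
    return True
```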

Thanks to DAS, we have:

Availability holds trustlessly

The downloading of rows/columns comes into play in achieving agreement.

Reconstruction in Danksharding

Suppose the publisher withholds a small part of the erasure-coded data, as below:

As a result of this data withholding, some nodes will have the availability check fail due to the DAS failing or the row/column downloading failing.

Now, how do we reach agreement from here?

First, the node assigned to the row containing the unavailable part can reconstruct the row using the available parts, as shown below:

Then, the node that failed its DAS attempt can now sample the missing part from the reconstructed row, as shown below:

Now, agreement holds!

Danksharding enables a decentralized reconstruction protocol that does not rely on powerful nodes.

On the other hand, the difficulty in Danksharding lies in the arrow below:

This significantly complicates the P2P networking design. We no longer have powerful entities that provide the entire set of samples. Instead, a large number of validators collectively supply them, each validator contributing a small portion. When a validator attempts to sample a specific cell in the 2D matrix, it must find and connect to the specific validator storing that sample from among a vast number of validators.

Also, the trust analysis of agreement becomes more complex. If the publisher conducts a data withholding attack, we must ensure that each cell can be reconstructed. This requires at least one honest attester per cell, in either the row direction or the column direction. Furthermore, the trust assumption will depend on the specifics of the P2P design.

Note that we can get agreement with 1-of-N trust if we rely on altruistic “super nodes” to reconstruct and provide all samples. One interesting direction would be to utilize the powerful “heavy tier” proposers in designs such as rainbow staking and appointed execution proposers to handle the sample reconstruction/propagation. In general, there seems to be an open design space in the intersection of attester proposer separation and data availability sampling.

Summary

Below is a table summarizing the solutions we covered in this post using the “X-of-Y trust” notation.

Conclusion

Despite its importance, DAS receives far less attention than more “fancy” topics like MEV and PBS.

However, there are still many challenging and interesting research questions that need to be addressed to develop a secure and stable DAS system. I hope this post helps readers gain a better understanding of these challenges and inspires some to start working toward finding solutions.

References

Al-Bassam, M., Sonnino, A., Buterin, V. (2018). Fraud and Data Availability Proofs: Maximising Light Client Security and Scaling Blockchains with Dishonest Majorities. arXiv:1809.09044.

Rabin, M. O. (1989). Efficient Dispersal of Information for Security, Load Balancing, and Fault Tolerance. Journal of the ACM, 36(2).
