Client diversity on Ethereum's consensus layer

February 8th, 2022

This is a detailed accompaniment to an introductory article I wrote at ethereum.org.

Ethereum has multiple interoperable clients developed and maintained in different languages by independent teams. This is a major achievement and can provide resilience to the network by limiting the effects of a bug or attack to only the portion of the network running the affected client. However, this strength is only realized if users distribute roughly evenly across the available clients. At present, the vast majority of Ethereum nodes run a single client, inviting unnecessary risk to the network.

Ethereum will soon undergo one of the most significant upgrades to its architecture since its inception - the merge from proof-of-work (PoW) to proof-of-stake (PoS). This will fundamentally change both the way the network comes to consensus about the true state of the blockchain and the way network security is maintained. This new architecture brings security, scalability and sustainability benefits, but at the same time amplifies the risks associated with single-client dominance. This article will explore why…

The Beacon Chain

The Beacon Chain is a proof-of-stake (PoS) blockchain. It currently runs in parallel to the Ethereum mainnet but the two will soon be "merged" together. The existing mainnet clients ("execution clients") will continue to host the Ethereum Virtual Machine (EVM) and validate and broadcast transactions but will stop participating in proof-of-work (PoW) mining and relinquish responsibility for coming to consensus on the head of the blockchain. Instead, consensus will become the responsibility of “consensus clients” that bundle transactions from execution clients together with information required for consensus into “Beacon Blocks” which then form the Beacon Chain. Miners will be replaced by "validators" who deposit ether into an Ethereum smart contract ("staking"). This ether acts as collateral incentivizing good behavior. Inactivity or malicious behavior results in the burning of some portion of that staked ether. On the other hand, if a validator behaves appropriately, they are rewarded with ether payouts.

Validator Duties

Good behavior for a validator means participating in validating Beacon Blocks that they receive from peers and voting on their view of the head of the chain. If the blocks they receive are valid, the validator "attests" to them, effectively voting for them to be added to the blockchain. Occasionally, a validator will be required to propose a new block, which other validators can attest to. Where there are multiple forks of the blockchain, the one with the greatest accumulation of attestations over its history is identified as the correct one.
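The fork choice described above can be illustrated with a toy model. This is only a sketch: the block names and stake weights are made up, and the real LMD GHOST rule counts only each validator's latest message, which this simplification ignores.

```python
def ghost_head(children, votes, root):
    """Walk from `root`, at every fork following the child whose whole
    subtree has accumulated the most attesting stake."""
    def subtree_weight(block):
        own = votes.get(block, 0)
        return own + sum(subtree_weight(c) for c in children.get(block, []))

    head = root
    while children.get(head):
        head = max(children[head], key=subtree_weight)
    return head

# Hypothetical block tree: genesis forks into A and B.
children = {"genesis": ["A", "B"], "A": ["A1"], "B": ["B1", "B2"]}
# Hypothetical attesting stake per block (arbitrary units).
votes = {"A": 10, "A1": 30, "B": 5, "B1": 20, "B2": 25}

head = ghost_head(children, votes, "genesis")
print(head)  # B2
```

Note that block A1 has the most votes of any single block, yet the B branch wins because its subtree carries more weight in total (50 versus 40).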

Occasionally, a validator will participate in a sync committee. A sync committee is a group of 512 randomly chosen validators that sign block headers so that light clients can retrieve validated blocks without having to access the full historical chain or the full validator set.

Justification and Finality

The Beacon Chain sets the rhythm for the network. This rhythm is organized into two units of time: slots and epochs. Slots are opportunities for blocks to be added to the Beacon Chain and they occur once every 12 seconds. Slots can go unfilled, but when the system is running optimally, blocks are added in every available slot. Epochs are units of 32 slots.
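As a small illustration of the slot and epoch arithmetic, the sketch below maps a slot number to its epoch and to its offset from genesis. The function names are hypothetical; only the two constants come from the text.

```python
SECONDS_PER_SLOT = 12   # one slot every 12 seconds
SLOTS_PER_EPOCH = 32    # one epoch is 32 slots

def epoch_of(slot):
    """Epoch that contains the given slot."""
    return slot // SLOTS_PER_EPOCH

def is_epoch_start(slot):
    """True if the slot is the first slot of its epoch."""
    return slot % SLOTS_PER_EPOCH == 0

def seconds_since_genesis(slot):
    """Time offset at which the slot begins, counted from slot 0."""
    return slot * SECONDS_PER_SLOT

print(epoch_of(2048001))        # 64000
print(is_epoch_start(2048000))  # True
print(is_epoch_start(2048001))  # False
```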

In each epoch, the block in the first slot is a checkpoint. These checkpoints are important because they are used to make sections of the blockchain permanent and irreversible. This is a two-stage process. First, if at least 2/3 of the total staked ether balance of all the active validators ("supermajority") attest to the most recent pair of checkpoints (the current “target” and previous “source” checkpoints) then that section of the chain is “justified”. Justification is the first step towards permanent inclusion on the canonical blockchain. Second, once a justified checkpoint has another checkpoint justified on top of it, it is "finalized", making it permanent and irreversible.
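The two-stage process can be sketched as follows. This is a deliberately simplified model, not the protocol's actual rule set: it assumes one checkpoint per epoch and treats a checkpoint as finalized only when the very next checkpoint is also justified. The stake figures are invented for illustration.

```python
from fractions import Fraction

SUPERMAJORITY = Fraction(2, 3)

def is_justified(attesting_stake, total_stake):
    """A checkpoint is justified when at least 2/3 of active stake attests."""
    return Fraction(attesting_stake, total_stake) >= SUPERMAJORITY

def finalized_epochs(justified):
    """Simplified rule: epoch e is finalized if e and e + 1 are both justified."""
    return [e for e in range(len(justified) - 1)
            if justified[e] and justified[e + 1]]

# Hypothetical attesting stake per epoch, out of 100 units of total stake.
attesting = [70, 68, 60, 90]
justified = [is_justified(a, 100) for a in attesting]

print(justified)                    # [True, True, False, True]
print(finalized_epochs(justified))  # [0]
```

Epoch 1 is justified but not finalized here, because participation in epoch 2 dropped below the supermajority.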

This process of justification and finalization requires that validator attestations are actually a little more complex than previously suggested. There are two types of attestation. One is the LMD GHOST vote which attests to the head of the chain (LMD GHOST is the fork choice algorithm). The second is an FFG vote that attests to pairs of checkpoints (FFG is the name of the "finality gadget" that justifies and finalizes the chain). All validators make FFG votes for each checkpoint, while a randomly chosen subset make LMD GHOST votes in each slot.

Staking rewards, penalties and slashing

Rewards

Staked ether acts as collateral incentivizing honest behavior of validators. This staked ether grows over time as validators are rewarded for their participation in securing the network. Validators receive attestation rewards when they make LMD-GHOST and FFG votes consistent with the majority of other validators. When validators are selected to be block proposers they get rewarded if their proposed block gets finalized. Block proposers can also increase their reward by including evidence of misbehavior by other validators in their proposed block. These rewards are the "carrots" that encourage validator honesty.

Penalties

The "sticks" take the form of various mechanisms for burning a small portion of a validator's staked ether. Attestation penalties are applied when a validator fails to submit an FFG vote, submits it late, or submits an incorrect vote. There is no penalty for missing LMD-GHOST votes except for the opportunity cost of missing the head vote reward. The validator balance is reduced by the same amount as they would be rewarded for a correct attestation. This means an honest but "lazy" validator that is maximally penalized for missed attestations loses 3/4 of the amount they would gain if they attested perfectly. When validators are assigned to sync committees they receive rewards for each slot they sign off. When validators in a sync committee fail to sign blocks they are penalized exactly the value of ether they would have received for signing successfully.
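The 3/4 figure can be reproduced with a little arithmetic. The sketch below assumes the attestation reward weights from the Altair version of the consensus specification (source 14, target 26, head 14); these numbers are not given in this article and are stated here as an assumption.

```python
from fractions import Fraction

# Assumed Altair attestation weights (not taken from this article).
WEIGHTS = {"source": 14, "target": 26, "head": 14}

# Missing the head vote carries no penalty, only a lost reward.
PENALIZED = {"source", "target"}

max_reward = sum(WEIGHTS.values())                     # perfect attestation
max_penalty = sum(WEIGHTS[k] for k in PENALIZED)       # every FFG vote missed

ratio = Fraction(max_penalty, max_reward)
print(ratio, float(ratio))  # 20/27, roughly 0.74
```

Under these assumed weights, the maximum penalty is about 74% of the maximum attestation reward, which matches the "3/4" stated above.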

Overall these penalties are mild and amount to a very slow bleed of staked ether for continued inactivity.

Slashing

Slashing is a more severe action that results in the forceful removal of a validator from the network and an associated loss of their staked ether. There are three ways a validator can be slashed, all of which amount to the dishonest proposal or attestation of blocks:

By proposing and signing two different blocks for the same slot
By attesting to a block that "surrounds" another one (effectively changing history)
By "double voting" by attesting to two candidates for the same block
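The two attestation-related conditions can be expressed as simple comparisons on the source and target epochs of a pair of votes. The sketch below uses a hypothetical `Vote` type; the field names are illustrative and not taken from any client.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Vote:
    source_epoch: int   # epoch of the source checkpoint
    target_epoch: int   # epoch of the target checkpoint
    block_root: str     # identifier of the block being voted for

def is_double_vote(a, b):
    """Two different votes with the same target epoch."""
    return a != b and a.target_epoch == b.target_epoch

def surrounds(a, b):
    """Vote `a` surrounds vote `b`: it starts earlier and ends later."""
    return a.source_epoch < b.source_epoch and b.target_epoch < a.target_epoch

def is_slashable(a, b):
    return is_double_vote(a, b) or surrounds(a, b) or surrounds(b, a)

print(is_slashable(Vote(2, 3, "x"), Vote(2, 3, "y")))  # True  (double vote)
print(is_slashable(Vote(1, 5, "x"), Vote(2, 4, "y")))  # True  (surround vote)
print(is_slashable(Vote(1, 2, "x"), Vote(2, 3, "y")))  # False (consistent)
```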

If these actions are detected, the validator is slashed. This means that 1/64th of their staked ether (up to a maximum of 0.5 ether) is immediately burned, then a 36 day removal period begins. During this removal period the validator's stake gradually bleeds away. At the mid-point (Day 18) an additional penalty is applied whose magnitude scales with the total staked ether of all slashed validators in the 36 days prior to the slashing event. This means that when more validators are slashed, the magnitude of the slash increases. The maximum slash is the full effective balance of all slashed validators (i.e. if there are lots of validators being slashed they could lose their entire stake). On the other hand, a single, isolated slashing event only burns a small portion of the validator's stake. This midpoint penalty that scales with the number of slashed validators is called the "correlation penalty".
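A rough sketch of how the two penalties behave is given below. The initial penalty follows the 1/64 figure above. The correlation penalty formula and its `multiplier` are assumptions for illustration: the multiplier is a protocol parameter that has changed between upgrades, and the value 2 used here is not taken from this article.

```python
def initial_penalty(effective_balance, quotient=64):
    """Amount burned immediately when a validator is slashed."""
    return effective_balance / quotient

def correlation_penalty(effective_balance, slashed_stake, total_stake,
                        multiplier=2):
    """Assumed form: the penalty grows in proportion to the share of total
    stake slashed in the surrounding window, capped at the full balance."""
    scaled = min(multiplier * slashed_stake, total_stake)
    return effective_balance * scaled / total_stake

# Hypothetical network with 9.6 million ether staked.
TOTAL = 9_600_000

print(initial_penalty(32))                          # 0.5
print(correlation_penalty(32, 32, TOTAL))           # tiny: isolated slashing
print(correlation_penalty(32, TOTAL // 2, TOTAL))   # 32.0: entire stake lost
```

The contrast between the last two lines is the point: the same offence costs almost nothing in isolation but the whole stake when half the network commits it together, for example because of a bug in a dominant client.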

Inactivity Leak

If the Beacon Chain has gone more than four epochs without finalizing, an emergency protocol called the "inactivity leak" is activated. The ultimate aim of the inactivity leak is to create the conditions required for the chain to recover finality. As explained above, finality requires a 2/3 majority of the total staked ether to agree on source and target checkpoints. If validators representing more than 1/3 of the total staked ether go offline or fail to submit correct attestations then it is not possible for a 2/3 supermajority to finalize checkpoints. The inactivity leak lets the stake belonging to the inactive validators gradually bleed away until they control less than 1/3 of the total stake, allowing the remaining active validators to finalize the chain. However large the pool of inactive validators, the remaining active validators will eventually control >2/3 of the stake. The loss of stake is a strong incentive for inactive validators to reactivate as soon as possible!
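How much stake must leak away before finality can resume follows from the 1/3 threshold alone. If inactive validators hold a share p of the stake and keep a fraction x of it, finality needs p·x / (p·x + 1 − p) < 1/3, which rearranges to x < (1 − p) / (2p). The sketch below computes the corresponding loss, assuming the active validators' balances stay constant during the leak; it says nothing about how long the leak takes.

```python
from fractions import Fraction

def required_burn_fraction(inactive_share):
    """Fraction of their own stake that inactive validators must lose for
    their share of total stake to fall to 1/3 (finality needs just below)."""
    p = Fraction(inactive_share)
    if p <= Fraction(1, 3):
        return Fraction(0)          # finality is not blocked in the first place
    return (3 * p - 1) / (2 * p)

for share in (Fraction(2, 5), Fraction(1, 2), Fraction(9, 10)):
    print(share, required_burn_fraction(share))
# 2/5 -> 1/4, 1/2 -> 1/2, 9/10 -> 17/18
```

A client with half the stake would therefore need to see about half of that stake burned before the rest of the network could finalize without it.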

The reward, penalty and slashing design of the Beacon Chain encourages individual validators to behave correctly. However, from these design choices emerges a system that strongly incentivizes equal distribution of validators across multiple clients, and should strongly disincentivize single-client dominance. This arises because the supermajority is so fundamental to Beacon Chain consensus. A single bad validator is fairly benign, but a large group of bad validators can wreak havoc. Let’s examine some potential scenarios…

Client Diversity Risk Scenarios

What incentivizes consensus client diversity is risk. With even distribution of validators across multiple clients, the consequences of attacks or bugs that exploit specific clients are drastically reduced, whereas single-client dominance acts as a risk multiplier. This risk multiplication effect scales with the network share of the dominant client. We can get more intuition for this by examining some hypothetical (but realistic) scenarios. Let's assume a bug is accidentally introduced into a consensus client. This bug can either directly lead to incorrect attestations, or expose a vulnerability that allows a malicious attacker to force a client to make incorrect attestations. How does client diversity influence the consequences of such a bug?

Scenario 1: corrupted client has less than 1/3 total staked ether

This scenario confers maximum resilience to the Beacon Chain because 2/3 of the total staked ether is still making correct attestations allowing the Beacon Chain to finalize as normal. Therefore, from the network perspective the consequences are negligible. The affected validators suffer inactivity penalties because they submit incorrect attestations. These penalties are relatively minor and the affected validators can simply wait for the client to be fixed or switch to an alternative client. Either way, the validator can resume making correct attestations with minimal financial consequences and no disruption to the Beacon Chain.

Scenario 2: corrupted client has > 1/3 total staked ether

This scenario is far more problematic because less than 2/3 of the total staked ether is making correct attestations - there can be no supermajority. This means the Beacon Chain cannot finalize and the inactivity leak will be activated. The bug now has consequences for the network as a whole. Finality is critical for exchanges and apps built on top of Ethereum - without it there is no guarantee that transactions are permanent and irreversible. For individual validators using the affected client, the penalties are much more severe because of the inactivity leak - their stake is burned until the affected client controls < 1/3 of the total staked ether. Only then can the Beacon Chain start finalizing again. The ether burn can actually continue for some time after the Beacon Chain recovers, providing a buffer against small changes in validator numbers flipping the state of the Beacon Chain from "able to finalize" to "unable to finalize". Only when an affected client has more than 1/3 share of the total staked ether is Beacon Chain finalization in jeopardy.

Validators running correctly-functioning alternative clients receive no rewards during the inactivity leak. This is a security mechanism to prevent attackers from deliberately initiating the inactivity leak in order to raise the total rewards available to their correctly-operating validators. These are small penalties, but the point is that no-one escapes negative consequences from a consensus bug in a client with more than 1/3 of the total staked ether.

Scenario 3: corrupted client has 1/2 total staked ether

This scenario potentially leads to an unrecoverable fork in the Beacon Chain. If the client with a consensus bug forks off onto its own chain, neither the original nor the new fork would be able to finalize because both would be missing about half of their validators and would both activate the inactivity leak. The staked ether of the missing validators on each chain would burn until it amounted to < 1/3 of the total staked ether, at which point some validators on each chain could start finalizing again. This would take about the same time on both forks because the amount of ether burning required to restore finalization would be about equal. Both forks would finalize independently with a different set of finalized checkpoints. The two forks could never be merged together into a single canonical chain. To remedy this would require social consensus from the Ethereum community about which is the canonical chain - a process sure to be politically awkward and divisive and leading to financial losses for about half the community as they switch chains (not including the likely devaluing of ether that could result from the market pricing in the drama). Perhaps worse, the community could simply stay divided (with similarities to the DAO fork that produced Ethereum Classic).

To avoid a permanent split in the Beacon Chain, validators using the corrupted client would have to race the inactivity leak to switch or fix their client before the chain starts finalizing. There would probably be 3-4 weeks available, during which time developers would be rushing to save Ethereum. There is no escaping significant financial consequences for a large set of validators in this scenario.

Scenario 4: corrupted client has > 2/3 total staked ether

This is the nightmare scenario for the Beacon Chain because the corrupted client has a supermajority and is able to finalize its own chain. Incorrect information would then likely be cemented into Ethereum's history forever. There would be only about 13 minutes for client teams to identify the bug, fix it and broadcast updates to the affected validators before the chain begins to finalize corrupted blocks.
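The "about 13 minutes" figure is consistent with the slot and epoch timings given earlier, on the assumption that finalization takes two epochs (one to justify a checkpoint and one more to justify the next checkpoint on top of it). A quick check:

```python
SECONDS_PER_SLOT = 12
SLOTS_PER_EPOCH = 32
EPOCHS_TO_FINALITY = 2   # assumption: justify, then justify again on top

def minutes_to_finality(epochs=EPOCHS_TO_FINALITY):
    """Wall-clock minutes spanned by the given number of epochs."""
    return epochs * SLOTS_PER_EPOCH * SECONDS_PER_SLOT / 60

print(minutes_to_finality())  # 12.8
```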

The only viable mitigation to this situation is for the affected validators to withdraw their stakes and exit the chain. If the affected validators try to rejoin the correct chain after applying a fix they would be slashed with the maximum correlation penalty because they would now be attesting to checkpoints that contradict their previous attestations, and doing so en masse. The inactivity leak would be activated by the large validator exodus, meaning the affected validators would be continuously losing their staked ether while they are waiting in the exit queue. The large number of validators would make the queue long, slow and expensive.

The only other option is for the remaining non-affected clients to accept the bug, join the new chain and agree that the bug becomes the expected behavior of Ethereum's consensus layer from then on. This would run contrary to core principles of the staking community and would be extremely divisive. Those minority clients would then be subject to inactivity penalties on the new chain even though they acted properly. Neither of these is a good option. The former is extremely expensive for the affected validators and logistically awkward to correct. The latter would deeply undermine trust in Ethereum and cause us to accept a permanently tarnished chain.
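The four scenarios can be summarized as a function of the faulty client's share of staked ether. The sketch below is only a rough classifier: the scenarios above do not define exact boundaries, so treating "exactly 1/3" as safe and "1/2 up to 2/3" as the fork scenario are assumptions made here for illustration.

```python
from fractions import Fraction

ONE_THIRD, ONE_HALF, TWO_THIRDS = Fraction(1, 3), Fraction(1, 2), Fraction(2, 3)

def classify(share):
    """Map a faulty client's stake share to the scenario it falls under."""
    share = Fraction(share)
    if share > TWO_THIRDS:
        return "scenario 4: faulty chain can finalize"
    if share >= ONE_HALF:
        return "scenario 3: risk of two independently finalizing forks"
    if share > ONE_THIRD:
        return "scenario 2: finality halts, inactivity leak"
    return "scenario 1: chain finalizes as normal"

# Hypothetical shares chosen to land in each band.
for name, pct in [("client A", 14), ("client B", 40), ("client C", 60), ("client D", 70)]:
    print(name, "->", classify(Fraction(pct, 100)))
```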

Other risks

Reverting Finality

Control of >2/3 of the total staked ether gives power to the developers of a single client to choose which version of history is the right one. For example, if the developers turned malevolent they could spend some ether (cash it out via an exchange or bridge to another chain, for example) then collectively vote to replace the existing finalized chain with an alternative version that does not include their spend transaction. This is a "double-spend" made possible by the client's supermajority that allows it to revert finality and overwrite history. Meanwhile the honest minority would be punished for their inconsistent attestations. A malevolent supermajority could also just threaten such action and hold the network to ransom. Even with just 1/3 of the stake a malevolent team could threaten to halt finality and activate the inactivity leak.

Shared responsibility

The previous point took a somewhat pessimistic view of the client development teams, not because it was justified but because nefarious behavior is possible and therefore needs defending against. However, those same developers are most likely good actors, and they themselves require protection against single-client dominance, not only because they are likely to be Ethereum users (and ether holders/stakers) but because responsibility for the security of the network should not be concentrated on the shoulders of one small team. There is a real cost in the form of stress and mental health for developers whose actions bear outsized consequences on the health of Ethereum as a whole. Client diversity protects against this by sharing the responsibility across multiple independent teams.

Centralization

Even when the development team comprises entirely well-intentioned actors, they still retain excessive power over the functioning of Ethereum when they control the majority of the staked ether. Decentralization is a core principle for Ethereum, and this must include developers as well as users and custodians. Decentralization of development teams across multiple clients that share equal proportions of the staked ether limits the power of a single team to make key decisions about, for example, the content and timing of forks, limiting their influence on the philosophical direction of Ethereum. Client diversity protects decentralized decision making at the developer level.

Politics

Social recovery of an honest chain is an issue fraught with politics. Ethereum's consensus mechanism should finalize based on the rules coded into its clients - that is its primary aim. Intervening in that process is likely to lead to schisms in the Ethereum community, where various forks benefit or punish different pockets of the community, and users are likely to hold differing views about the most philosophically, ethically and technically acceptable mitigation of a consensus bug/attack in a majority client. Governance decisions would be awkward, disruptive and likely too slow to be maximally effective.

Real world examples

The scenarios outlined above have relatively low probability of occurrence. Developers are meticulous in researching and testing each and every update to their software and there is no reason to doubt the integrity of any client teams - far from it. However, neither are the scenarios purely hypothetical. There have already been examples of client diversity rescuing Ethereum's mainnet from permanent damage, and examples of consensus bugs disrupting Ethereum testnets. Some of these examples are described below.

Shanghai

In September 2016, during the Shanghai DevCon conference, hackers were able to attack Ethereum, exploiting several vulnerabilities in the client software causing the network to slow down dramatically. The attacker was persistent, rapidly deploying new similar attacks as client developers raced to reverse-engineer and patch them. Eventually the attacker found a vulnerability in Geth that could not be patched, necessitating a hard fork. Even after the hard fork, the attacker still found a denial-of-service vulnerability that used the bloated state hanging over from their previous attacks to force clients to make tens of thousands of slow disk i/o operations in each block. Client diversity won the day because, while developers fought to fix the vulnerabilities in Geth, Ethereum was able to continue using the alternative Parity clients, which did not suffer the same vulnerability.

The Shanghai attack was recoverable because there were multiple clients, but the situation could have been very different had a similar bug affected a majority consensus client. If a consensus client had the same dominance that Geth had at the time of the attack, the chain would not have been able to finalize as the vast majority of the validators would not have been attesting to blocks. The inactivity leak would have been activated because < 1/3 of the total staked ether was available for attesting.

Insecura

The viability of a "long range attack" was recently demonstrated on the Pyrmont testnet. The idea was to establish a set of validators attesting to an alternate blockchain history. These validators were then used to trick new validators into joining the dishonest “Insecura” chain, gradually growing the population of compromised validators, eventually to the point of interrupting finality, activating the inactivity leak and draining the stake of the honest majority. Ultimately this could lead to the corrupted clients finalizing their version of the chain. As explained in this thread, the investment of time and money required makes this an unlikely attack vector, but similar dynamics could lead a bug in a majority consensus client to infect a large proportion of the network and become finalized.

Medalla

The Medalla testnet suffered a sudden drop in active validators due to an issue relating to the Prysm client's clocks. The chain was not able to finalize because so many validators dropped off the network that a 2/3 majority of staked ether was no longer available for attesting. Recovery was gradual, as it relied on validators switching clients from Prysm to minority clients. Later, real time caught up with the erroneous Prysm clock time and the previously invalid attestations suddenly became valid. This caused Prysm to stall while Teku and Lighthouse clients suffered massive state bloat as they processed the sudden glut of attestations. Had Prysm been the only client, the entire network would have stalled. Had Prysm had <1/3 of the total staked ether, a lot of the chaos could have been averted.

Prysm deposit root bug

Early in 2021 Prysm suffered a bug related to its validation of Eth1 deposit roots. Prysm clients were able to generate an invalid deposit root and pass it to other Prysm nodes. Because Prysm had such a large validator share, this invalid root spread quickly through the network, accelerated by the way Prysm clients followed the majority vote rather than explicitly validating the deposit root in each block. The consequences of this bug were very minor - there was no interruption to the Beacon Chain finality nor any significant financial penalties to validators. However, the incident demonstrates the importance of client diversity in two ways. First, a smaller validator share would have limited the spread of the bug through the network, reducing its impact. Second, the post mortem describes how alternative client implementations were used as benchmarks, helping the developers identify and fix the bug quickly. Of course, this would not be possible without multiple, actively maintained clients.

Client diversity today

The two pie charts above show snapshots of the current client diversity for the execution and consensus layers (at time of writing in January 2022). The execution layer is overwhelmingly dominated by Geth, with Open Ethereum a distant second, Erigon third and Nethermind fourth, with other clients comprising less than 1% of the network. The most commonly used client on the consensus layer - Prysm - is not as dominant as Geth but still represents over 60% of the network. Lighthouse and Teku make up about 20% and 14% respectively, and other clients are rarely used.

The execution layer data were obtained from Ethernodes on 23/01/22. Data for consensus clients was obtained from Michael Sproul. Consensus client data is more difficult to obtain because the Beacon Chain clients do not always have unambiguous traces that can be used to identify them. The data was generated using a classification algorithm that sometimes confuses some of the minority clients (see here for more details). In the diagram above, these ambiguous classifications are treated with an either/or label (e.g. Nimbus/Teku). Nevertheless, it is clear that the majority of the network is running Prysm. The data is a snapshot over a fixed set of blocks (in this case Beacon blocks in slots 2048001 to 2164916) and Prysm's dominance has sometimes been higher, exceeding 68%. Despite only being snapshots, the values in the diagram provide a good general sense of the current state of client diversity.

Execution layer diversity is included here because a bug affecting the execution clients can also propagate through to the consensus layer since, after the merge, the two will be coupled together with the execution payload generated by the execution clients being a core component of Beacon Blocks.

Up to date client diversity data for the consensus layer is now available at https://clientdiversity.org/.

Individual stakers and staking pools

Tackling the imbalance in client distribution requires action from the major exchanges and staking pools. However, individual stakers can still do their part by choosing to run non Geth/Prysm client combinations. Instructions for setting up minority clients can be found at clientdiversity.org.

For stakers who have less than 32 ether or who do not wish to take on responsibility for running a validator, there are staking services available. Several of the major centralized exchanges offer ether staking, but the client distribution in their staking pools is often hidden, and there are limits on the tradeability of the staked ether tokens those exchanges offer. For these (and other) reasons, these centralized services are not recommended. The better option is to use a more decentralized liquid staking service such as Lido or Rocketpool. These services stake ether and provide a token in return whose value increases over time as the pool’s validators accrue rewards. Those tokens can be traded or used to earn DeFi yields. These liquid staking platforms are more transparent about their client distribution too, with Lido publishing quarterly updates and Rocketpool now reporting theirs too. For users unable or unwilling to run their own validator, these services are a route to contributing to better client diversity.

Summary

Client diversity is directly incentivized by the Beacon Chain's reward and penalty protocols. Single-client dominance is a hidden threat to Ethereum, invisible while the dominant client behaves faultlessly but potentially catastrophic when a consensus bug rears its head. Having multiple clients is a unique strength of Ethereum and a testament to the diligence of the developer community. However, that good work is undermined when one client controls a majority of the stake. The ideal scenario is equal distribution of staked ether across at least 4 clients, giving a maximum of 1/4 of the staked ether to each client. This is easily possible with the production-ready clients available today.