The opinions in this are mine, I’m not writing this on behalf of anyone, and ahm.. any other disclaimers you think should be here probably are… So anyway…
There was a network issue on Mainnet, and the Prysm team wrote up a great post-mortem.
From a Teku perspective, we got a release out relatively quickly, largely because of the initial event the day before: we’d been working on a fix ‘in case it happened again’, and I’m glad we took that approach.
It took time to understand the issue, and circling back on it recently and reviewing the post-mortem written by the Prysm team, I still had niggles with the fix we had produced. It did fix our issue, but ultimately we shouldn’t have needed nearly as much state in memory as we did in the first place.
While digging, I noticed something subtle but interesting and added some metrics. Based on those, I was later able to produce a fix for the actual problem that Teku was experiencing during that time - more on that after some context.
To understand the issue, and how subtle it is, I’ll try to give a little background.
Attestations are a crucial part of our network. They get produced every epoch, by every active validator, and they get produced and propagated at basically the same time around the network - 4 seconds into the slot.
Every node sees these Attestations depending on what they’re listening to, and each node reads them and decides whether to pass them on.
These Attestations have a relatively simple structure (spec). Your vote includes:
source: the validator’s opinion of the best currently justified checkpoint
target: block at the start of the current epoch
beacon_block_root: your current head on your chain.
slot: the slot you produced the attestation at.
This is all wrapped up and signed.
A better description can be found in the eth2book.
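To make the shape concrete, here is a minimal Python sketch of the fields listed above. It is an illustration rather than Teku's code: the class names mirror the spec's containers, the roots are plain placeholder bytes, and the committee index that the spec's container also carries is left out because it doesn't matter for this story.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    epoch: int
    root: bytes  # root of the block the checkpoint refers to


@dataclass(frozen=True)
class AttestationData:
    slot: int                 # slot the attestation was produced at
    beacon_block_root: bytes  # the attester's current head
    source: Checkpoint        # best currently justified checkpoint
    target: Checkpoint        # block at the start of the current epoch


# Placeholder roots, purely for illustration.
example = AttestationData(
    slot=70,
    beacon_block_root=b"\x01" * 32,
    source=Checkpoint(epoch=1, root=b"\x02" * 32),
    target=Checkpoint(epoch=2, root=b"\x03" * 32),
)
```

Note that the target is a checkpoint (an epoch plus a root), so nothing in the structure says the attestation's slot has to be the slot of the target block.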
A number of attestations were arriving that had some interesting properties. The slot was current, but the target was an epoch old, and the head was also old. Typically they were on the canonical chain.
When validating attestation data, a certain amount of information is needed from a relatively recent state, and a logic process is followed to determine the ideal state to use.
If the head matches, this is ideal, and we can just use our head state - that covers over 90% of the cases that are seen for validation. After this we start a process of deciding what’s going to be the best state.
Given that we have the target, we can attempt to use that to find a state, and we can also attempt to use things such as the finalized state as a common ancestor.
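As a rough sketch of that ordering, assuming a simple in-memory map from block root to state: the function name, its parameters and the returned labels are all hypothetical, and the real selector in Teku has more cases than this.

```python
from typing import Dict, Optional, Tuple


def select_validation_state(
    attestation_head: str,
    attestation_target: str,
    chain_head: str,
    finalized: str,
    cached_states: Dict[str, object],
) -> Optional[Tuple[str, object]]:
    """Pick a state to validate against, cheapest candidate first."""
    # Ideal case: the attestation votes for our own head.
    if attestation_head == chain_head and chain_head in cached_states:
        return ("head", cached_states[chain_head])
    # Next, try the state for the target checkpoint's block.
    if attestation_target in cached_states:
        return ("target", cached_states[attestation_target])
    # Fall back to the finalized state as a common ancestor.
    if finalized in cached_states:
        return ("finalized", cached_states[finalized])
    # Nothing in memory: the caller would have to regenerate a state.
    return None


states = {"head": "head-state", "fin": "finalized-state"}
choice = select_validation_state("old-head", "old-target", "head", "fin", states)
```

In this example the attestation's head and target are both unknown to the cache, so the sketch falls through to the finalized state. Returning `None` is the expensive path that we want to hit as rarely as possible.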
Our fix straight after the incident was basically to ignore attestations to a target prior to the current justified epoch if the target of the attestation is on the canonical chain, and we can’t find the state in memory. This fix was in the 23.5.0 Teku release, and was very effective at avoiding the problems seen during non-finality.
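Written as a predicate, the rule looks roughly like this. It is a sketch of the condition as described above, not the actual Teku code, and the function and parameter names are made up for illustration.

```python
def should_ignore_attestation(
    target_epoch: int,
    justified_epoch: int,
    target_is_canonical: bool,
    state_in_memory: bool,
) -> bool:
    """True when all three conditions from the description above hold."""
    return (
        target_epoch < justified_epoch
        and target_is_canonical
        and not state_in_memory
    )


# Old canonical target with no cached state: dropped rather than regenerated.
ignored = should_ignore_attestation(10, 12, True, False)
# Same attestation, but the state happens to be in memory: still validated.
validated = not should_ignore_attestation(10, 12, True, True)
```

The point of the rule is that it only gives up on attestations that would otherwise force a state regeneration.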
One cool thing about this community is the sense of ‘team’ and ‘us’ - it’s actually very supportive and inclusive.
This is a super power, because it lets us do things that would never be possible if it was always a huge competitive field.
So we spun up a testing network, a big one… 600k validators. The devops people we have in the group are amazing, and made it super easy for us. We ran with a similar setup to mainnet in terms of ratios of clients, and proved we could break it in the way that was understood to be the issue. That got us almost to the issue, so we upped the ante, and things went boom. This is huge. Replicating the circumstances shows you understand what actually happened.
Afterwards we spun up the new versions of clients put out to rectify the issue (23.5.0 for Teku), and saw the problem was no longer present. Then we pushed past the understood circumstances, and gained confidence that even given a significantly worse set of circumstances, the network would be healthy.
This was a huge community win. I can’t speak highly enough of this group of people and their willingness to extrapolate and test more.
After 23.5.0, it was a lingering thought that we still needed to look up states without a great explanation of why we needed them, and I did end up revisiting this area.
The metrics I added during the investigation were very effective at showing the percentages of our state selection choices, and I noticed a peculiarity: the targets in these problematic attestations were still ultimately canonical, and should have been verifiable very cheaply (ancestor of head).
Prior to our initial fix, after we discounted the chain head as a candidate state, we would attempt to find the state where we processed the block that’s referenced as the target of the attestation. Unfortunately we made the assumption that the slot of the attestation would match the target slot. While this does occur, it’s actually relatively common for the target slot to differ from the attestation slot. If you think about it, only 1 slot in 32 is the start of any given epoch. This was the core of the issue that caused the chaos for us in May, as we weren’t able to determine a good match here and went further down into trying to match scenarios, ran out of things to try, and regenerated states.
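The arithmetic is easy to check. The two helpers below follow the spec functions of the same names, with the mainnet value of 32 slots per epoch. The third function and the example slots are just for illustration.

```python
SLOTS_PER_EPOCH = 32  # mainnet value


def compute_epoch_at_slot(slot: int) -> int:
    return slot // SLOTS_PER_EPOCH


def compute_start_slot_at_epoch(epoch: int) -> int:
    return epoch * SLOTS_PER_EPOCH


def attestation_slot_is_epoch_start(slot: int) -> bool:
    """The old assumption only held when this returned True."""
    return slot == compute_start_slot_at_epoch(compute_epoch_at_slot(slot))


# Every slot in epoch 2, i.e. slots 64 to 95 inclusive.
epoch_two_slots = range(64, 96)
matches = [s for s in epoch_two_slots if attestation_slot_is_epoch_start(s)]

# An attestation produced at slot 70 targets the checkpoint at slot 64.
target_start_slot = compute_start_slot_at_epoch(compute_epoch_at_slot(70))
```

Only slot 64 satisfies the old assumption, so a lookup keyed on the attestation's slot misses for the other 31 slots of the epoch. If the first slot of the epoch was empty, the target block is older still, which is one more reason to look up the slot of the target itself.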
In 23.6.0, the root cause of the bug that was present during non-finality has been removed, by looking up the slot of the target rather than using the slot of the attestation.
We also extended the theory of using a chain head for validation. Since we generally have head states present at all times, they’re usually a cheap way to validate, if we can locate them. Our selector now also checks non-canonical chains: if the attestation references an ancestor of any of those chains, it uses that chain head to verify against in that scenario.
While we’ve come a long way in reducing the number of states accessed during gossip validation, there are other cases we still need to investigate.
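The idea can be sketched with a toy block tree. Everything here is hypothetical (the parent map, the function names, the single-letter roots) and Teku's fork choice structures look nothing like a plain dict, but the walk from a head back towards genesis is the same idea.

```python
from typing import Dict, Iterable, Optional


def is_ancestor(parents: Dict[str, Optional[str]], ancestor: str, descendant: str) -> bool:
    """Walk parent links from descendant; a block counts as its own ancestor."""
    current: Optional[str] = descendant
    while current is not None:
        if current == ancestor:
            return True
        current = parents.get(current)
    return False


def find_chain_head_for(
    parents: Dict[str, Optional[str]],
    chain_heads: Iterable[str],
    block_root: str,
) -> Optional[str]:
    """Return the first chain head that descends from block_root, if any."""
    for head in chain_heads:
        if is_ancestor(parents, block_root, head):
            return head
    return None


# genesis <- a <- b (canonical head), with a fork a <- c (non-canonical head)
parents = {"genesis": None, "a": "genesis", "b": "a", "c": "a"}
heads = ["b", "c"]  # canonical head listed first

fork_choice = find_chain_head_for(parents, heads, "c")
shared_choice = find_chain_head_for(parents, heads, "a")
```

Here an attestation for `c` is validated against the fork's head state, while one for the shared ancestor `a` resolves to the canonical head `b` because it is checked first. Either way we use a head state we already hold instead of regenerating one.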
One desire I’ve had for a while is to bring memory under control. We’ve reduced state access a lot, but there are still regenerations happening when we drop the heap size down, so there are no changes to our recommended minimum at this stage, as lowering it may impact validating nodes. If we can rectify another couple of accesses, we should be able to actually reduce our heap requirements, and that would be a huge improvement.
The other investigation that is probably notable is why state regeneration takes such a long time, because that impacts tick processing, which can have flow-on effects.