Synchronised functions vs Locks
December 21st, 2023

Cover photo by Life Of Pix: https://www.pexels.com/photo/brass-colored-metal-padlock-with-chain-4291/

The year is winding down, so why not take a moment to revisit a bugbear.

We have a really simple concept in Teku around slashing protection: we ensure a validator can’t be slashed using quite a small amount of functionality and data. It all comes down to 1 validator, 1 file, 3 numbers.

This works really well. It’s backed by code that uses a synchronized function, which means only one thread can run that function at any one time. It’s a good decision in a lot of ways and scales fairly nicely.
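As a rough illustration of that pattern, here’s a minimal sketch with hypothetical names (not Teku’s actual classes): a single synchronized method guards the read-check-update of a validator’s record, so only one signing check can be in flight at a time across all validators.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the synchronized approach; names and structure are
// illustrative, not Teku's real implementation.
public class SlashingProtector {
  // One record per validator - in the post's terms, the "1 file, 3 numbers".
  private final Map<String, ProtectionRecord> records = new HashMap<>();

  // synchronized means only one thread can be inside this method at a time,
  // regardless of which validator it is checking - effectively one global lock.
  public synchronized boolean maySignAttestation(
      final String validatorKey, final long sourceEpoch, final long targetEpoch) {
    final ProtectionRecord record =
        records.computeIfAbsent(validatorKey, key -> new ProtectionRecord());
    // Simplified rule for the sketch: the source must not go backwards and the
    // target must advance, otherwise refuse to sign.
    if (sourceEpoch < record.sourceEpoch || targetEpoch <= record.targetEpoch) {
      return false;
    }
    record.sourceEpoch = sourceEpoch;
    record.targetEpoch = targetEpoch;
    // ... persist the updated record to the validator's file before signing ...
    return true;
  }

  private static final class ProtectionRecord {
    long blockSlot; // third of the three numbers, used for block signing checks
    long sourceEpoch;
    long targetEpoch;
  }
}
```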

32 validators, minimal network - profiler. This looks pretty good at this point

I noticed an issue, though, back when I was doing performance testing: if we scale up enough, this synchronized function does start to hurt our timings. Visually, the profiler shows it taking up a large portion of a code path, and practically, some of our metrics show increased delays in attestation production times.

4096 validators, minimal network - profiler. Time is ramping up a lot!
16k validators, minimal network - profiler

On Holesky, during some testing, I was running with 20k validators. Admittedly there are more slots there, so the signing load does get spread out, but this becomes a real issue as we continue to scale up, because attestation production delays can in some cases reduce our overall effectiveness.

To see why, you could think of our slashing protection files as a database. It’s a stretch, but in real terms we have a ‘table’ of slashing protection records, with effectively 1 row per validator key.

The synchronized function means that updating any one of those validator records is effectively like locking the entire table. This is a fairly common concept (table-level locking), so I won’t go deeply into it, but the synchronized function is fairly analogous. With 20k validators we’re looking at about 625 attestations per slot on average (20,000 validators each attesting once per epoch, spread across 32 slots), and all of those checks need to happen one at a time.

To find a solution, we can look to some common database principles. If table-level locking is effectively the problem, how do relational databases often solve it? Row-level locking…

That would be a fairly fundamental change, and with fundamental change comes a level of risk.

This would mean maintaining a lock for each validator, so that we’re only restricting access at the validator level rather than across the entire system - you can see how that might improve things… The advantage is that those 625 attestations could be verified concurrently over several threads, as long as they’re not for the same validator (which they shouldn’t be!). It’s unlikely we’d want that many concurrent threads, but we’d be able to make better use of whatever we had available.

I did a fairly tiny PR to attempt to implement that concept. We often use feature flags for things like this - it’s a way of making sure we don’t push something bad to master - and that’s basically the pattern I followed. The critical changes were around using ConcurrentHashMap, with a new validatorLocks member that allows a lock per validator (row-level locking, effectively).
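As a sketch of what that looks like (only the ConcurrentHashMap and validatorLocks ideas come from the PR description; the surrounding class and method names are hypothetical): each validator key maps lazily to its own lock, so checks for different validators can run in parallel while checks for the same validator still serialize.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of per-validator ("row level") locking. Only the ConcurrentHashMap /
// validatorLocks idea comes from the PR; everything else here is illustrative.
public class LockingSlashingProtector {
  // One lock per validator key, created lazily. computeIfAbsent guarantees only
  // one lock object is ever created for a given key, even under contention.
  private final Map<String, ReentrantLock> validatorLocks = new ConcurrentHashMap<>();

  public boolean maySignAttestation(
      final String validatorKey, final long sourceEpoch, final long targetEpoch) {
    final ReentrantLock lock =
        validatorLocks.computeIfAbsent(validatorKey, key -> new ReentrantLock());
    lock.lock();
    try {
      // Same check-and-update as before, but only this validator's record is
      // guarded - attestations for other validators aren't blocked behind it.
      return checkAndUpdateRecord(validatorKey, sourceEpoch, targetEpoch);
    } finally {
      lock.unlock();
    }
  }

  private boolean checkAndUpdateRecord(
      final String validatorKey, final long sourceEpoch, final long targetEpoch) {
    // read the record, apply the slashing protection rules, persist the file...
    return true; // placeholder for the sketch
  }
}
```

With this shape, the 625-ish attestations in a slot only contend with each other when two of them are for the same validator, which (barring bugs) shouldn’t happen.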

Let’s see what happens with that same example above with 16k validators…

16k validators, minimal network, row locks - profiler

The time has gone up a fair amount, so that’s interesting - both sample runs were roughly 6 minutes, and this one shows more than double the previous time as an overall measurement. Interestingly (not shown), the call stack changes drastically, but there’s no easy way to present that readably. This is likely explained by signing no longer being the choke point; it’s now likely that a lot more functions find their way into this kind of pattern.

Let’s look at the Grafana output (Teku Detailed) from a Holesky node running 20k validators:

Feature switched on roughly in the middle

When the feature flag was toggled, all of a sudden the vast majority of attestation delays dropped into the 0-1ms bucket… That’s interesting! You can also roughly see in the graph above that our delays were sometimes significant before the feature was enabled, which is why this area initially drew my eye. This is likely more of a ‘worst case’ scenario than what you’d see in the wild, as 20k validators on one node is a lot.

Let’s look at some other panels on the Grafana board:

Same timescale, feature came on roughly in the middle

Inclusion distance (max) improved a lot, as did the Attestation inclusion detail. This is significant!

I’m pretty happy with this improvement. There’s likely more to be done, and possibly better solutions to this problem, but it’s surprising how often we find a bottleneck because of a change in how we’re using something, and how first principles let us implement a relatively simple solution.

I still have questions about some specifics, but the idea definitely has merit, and now it’s just about soak time (testing) and making sure there are no weird corner cases. I’m not sure we’ll have this on by default in the next release, but it’s certainly tempting.

The thing with bottlenecks is that now we get to find the next one :)
