Teku has always made heavy use of caching, which is one mechanism for making sure the data we need is available without having to load it from disk.
The upside of caching is fast data access. Within reasonable bounds it’s a great trade-off, but there comes a time when things seem to grow unbounded, and we need to re-assess what we’re storing.
Since the May non-finality issue, we’ve been acutely aware of the heavy reliance we have on our state cache.
This cache gave us fast access to all of the hot states. In fact, one of the answers we most often give on socials when people have memory issues is to look at the ‘Objects in memory’ graph on the Teku detailed Grafana board, which should look like a saw-tooth pattern.
The objects being graphed there are the beacon states available to load from our cache.
Our primary fix for the May non-finality was to stop accessing this cache each time we needed to validate attestation gossip from the network. Afterwards, we still accessed this large cache periodically, so it remained an unknown, and we left it alone.
With the validator set size increasing at break-neck speed, we were going to have to revisit this decision, and in the last couple of weeks we’ve done just that. It turns out that we consistently access 3 states, and somewhat unsurprisingly they’re the epoch transition states and the head state.
A new cache was added for epoch transition states. This can be a fairly small cache with a little room to grow in case of forks. The intent of this cache is to ensure we can access the finalized and justified states, and ideally the epoch transition states of any forks, without having to regenerate a state. Regenerations tend to be expensive, and it’d be nice if they only occurred very rarely…
The main complication in tracking epoch states is knowing which states will be accessed as justified and finalized, but it’s a problem we’ve solved elsewhere, so the challenge was basically to apply that same management to this new scenario.
The size of the cache will grow and shrink: periods of non-finality, forks and so on may influence what gets tracked in this cache. The minimum number of items required is 2 (finalized and justified), but the cache limit can comfortably be more like 6 or so, giving room for when different network events happen.
The other state we track in this cache is the current epoch transition, which will typically become the justified epoch, so caching it saves computing it in the majority case when everything happens as expected.
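To make the shape of this concrete, here’s a minimal sketch of such an epoch-states cache. This is not Teku’s actual implementation: the class and method names are hypothetical, the capacity of 6 matches the figure above, and the ‘protected’ epochs stand in for the finalized and justified states that must never be evicted.

```java
import java.util.LinkedHashMap;
import java.util.Set;

/** Hypothetical sketch of a tiny epoch-states cache capped at a handful of entries. */
public class EpochStatesCache {
  private static final int MAX_SIZE = 6; // room beyond the 2 required (finalized, justified)

  // Insertion-ordered, so the oldest entry comes first in iteration order.
  private final LinkedHashMap<Long, String> statesByEpoch = new LinkedHashMap<>();

  /** Adds an epoch-transition state, evicting the oldest unprotected entry when full. */
  public void put(long epoch, String state, Set<Long> protectedEpochs) {
    statesByEpoch.remove(epoch); // re-inserting refreshes its position
    statesByEpoch.put(epoch, state);
    if (statesByEpoch.size() > MAX_SIZE) {
      statesByEpoch.keySet().stream()
          .filter(e -> !protectedEpochs.contains(e))
          .findFirst()
          .ifPresent(statesByEpoch::remove);
    }
  }

  public String get(long epoch) {
    return statesByEpoch.get(epoch);
  }

  public int size() {
    return statesByEpoch.size();
  }
}
```

The key point is the eviction filter: however full the cache gets, the finalized and justified states always survive, which is why the minimum useful size is 2.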
Looking at the graph (Mainnet), we see that 3 is indeed the typical number, and over a few hours it sometimes grows to 4. On Goerli it sometimes goes higher, particularly during some instability over the last little while, so it’s worth having spare capacity in this cache in case it’s required.
The states cache doesn’t really need to be large. Its tuning is somewhat dependent on how many forks are in flight, so it may need tweaking if things are going poorly on a given network.
The main task for this cache is to ensure we can get the head state of any chain, to avoid having to replay to track the head of that chain.
The other thing the states cache is handy for is when people reference a lot of non-finalized state data as part of their operations. That really means this option may want to be a visible command line option, so that people can freely adjust it depending on their requirements.
This cache will basically always be full at lower sizes, but that’s OK. The cache self-manages its contents by getting rid of the oldest items, and if it’s very small, that eviction becomes a self-managing limit. The reason for the old ‘saw-tooth’ pattern was that this cache got cleaned up by transaction updates, which explicitly removed items to reduce the size of the cache before it hit its capacity limit, whereas if the cache is very small, we may never need to remove stale items from it manually.
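That oldest-item eviction behaviour can be sketched with Java’s built-in LinkedHashMap. This is a hypothetical illustration of the self-managing pattern, not Teku’s actual cache class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: a small self-managing states cache that evicts the eldest entry. */
public class StatesCache<K, V> extends LinkedHashMap<K, V> {
  private final int maxSize;

  public StatesCache(int maxSize) {
    // accessOrder=true: iteration order follows last access,
    // so the eldest entry is the least recently used one.
    super(16, 0.75f, true);
    this.maxSize = maxSize;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    // Returning true evicts the stale entry automatically on insert;
    // no explicit pruning by transaction updates, hence no saw-tooth.
    return size() > maxSize;
  }
}
```

Because eviction happens on every insert once the limit is reached, memory usage stays flat at the cap instead of ramping up and being trimmed back down.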
All of these cache changes resulted in a reduction of the amount of memory required, but it actually wasn’t huge. In situations like this we can sometimes look for an ‘easy win’.
In Java, it’s fairly easy to take a dump of the heap (that Xmx setting people are so familiar with is the heap space) and look over it with a memory analysis tool.
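For reference, a heap dump can be triggered from inside the JVM via the HotSpot diagnostic MBean, which does the same thing as running `jmap -dump:live,format=b,file=heap.hprof <pid>` from the command line; the resulting .hprof file can then be opened in a tool such as Eclipse MAT or VisualVM. A minimal sketch, assuming a HotSpot-based JVM:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
  public static void main(String[] args) throws Exception {
    HotSpotDiagnosticMXBean diagnosticBean =
        ManagementFactory.newPlatformMXBeanProxy(
            ManagementFactory.getPlatformMBeanServer(),
            "com.sun.management:type=HotSpotDiagnostic",
            HotSpotDiagnosticMXBean.class);
    // live=true dumps only reachable objects, which keeps the file smaller
    // and is what you want when hunting duplicated live data.
    diagnosticBean.dumpHeap("teku-heap.hprof", true);
  }
}
```

Note that `dumpHeap` fails if the target file already exists, so pick a fresh filename each time.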
Interestingly, we had a lot of duplicated objects. Looking at the data, we basically had a lot of duplication of a couple of numbers: 32_000_000_000 (32 ETH in Gwei) and -1 (which is max uint64 when read as unsigned). In all, there was over 200MB of cached data on Goerli referencing these numbers, and it’s actually a fairly easy thing to clean up.
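The fix is essentially interning: returning a shared canonical instance for values that appear over and over, rather than allocating a fresh wrapper each time (the same idea as `Long.valueOf` caching small values). Below is a minimal sketch; `UintValue` is a hypothetical stand-in for the actual numeric wrapper type, not Teku’s real class.

```java
import java.util.Map;

/** Hypothetical sketch: canonicalising frequently repeated numeric wrapper objects. */
public final class UintValue {
  // Values seen duplicated en masse in the heap dump: 32 ETH in Gwei,
  // and -1 (max uint64 when the bits are read as unsigned).
  private static final Map<Long, UintValue> CANONICAL =
      Map.of(
          32_000_000_000L, new UintValue(32_000_000_000L),
          -1L, new UintValue(-1L));

  private final long bits;

  private UintValue(long bits) {
    this.bits = bits;
  }

  /** Returns a shared instance for well-known values instead of allocating a new one. */
  public static UintValue valueOf(long bits) {
    UintValue canonical = CANONICAL.get(bits);
    return canonical != null ? canonical : new UintValue(bits);
  }

  public long bits() {
    return bits;
  }
}
```

With immutable wrappers this is safe to do globally: millions of references collapse onto a couple of shared objects, which is where the 200MB of savings comes from.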
After making that little change, I pushed it to some of our non-validating nodes, and the heap space usage came down quite a lot! Below is the graph of heap space usage on one of the nodes I’ve been testing on (that Xmx setting; we were on a 5g max initially).
The first ‘peaky’ part of the graph is our old states-based cache with no epoch states: basically the heap space when we’re seeing that saw-tooth kind of pattern described above.
The change to using fewer states and the addition of the epoch states (6 in the states cache and 6 in the epoch-states cache) gave us the middle section, which is already a big improvement on part one.
The final section, from the 23rd, is after fixing the numeric duplication we had, and though it’s not a huge change, it’s definitely a worthwhile ‘easy win’.
The result was being able to comfortably run on a 3g heap (-Xmx3g) without the need for frequent garbage collection, down from 5g previously.
The next challenge will be rolling out these changes without causing too much chaos, and finding the next conversation to have when people are struggling with their memory, as this has fairly substantially changed that landscape. These changes aren’t small, but it’s something I’d like to get in general use relatively soon due to the impact.
The numeric ‘easy win’ change has already gone in and will definitely be in the next release.
Regarding the cache sizings, I’m still toying with how best to roll this out. The one complication is for those who access non-finalized states a lot, as they may find they need to tweak the settings, and we’ll need to document that scenario.
The next release contains Java 17, so that’s a pretty huge change for some people that may be on older images.
This change would mean people are able to reduce some memory pain, so it may be worth including as well, but the discussion is ongoing.
The command line arguments are present, so it’s possible that we just document how to make the changes, but it will definitely become default settings very soon.
Image by rawpixel.com on Freepik