Deep Dive into MakerDAO Spells Test Suite Part 1: Execution Costs

Introduction

As of last May, with the official winding-down of the Protocol Engineering Core Unit, Dewiz has taken on the Herculean task of maintaining and providing emergency updates to the Maker Protocol, alongside a few other Ecosystem Actors.

In Maker dialect, we refer to those protocol changes as "spells". A spell is simply a Solidity smart contract containing risk parameter changes, the onboarding of new vault types, delegate payments, team compensation vesting, etc. The process of creating such contracts is called "spell crafting".

Do not be misled by this simple description, though. Due to the recurrent nature and tight schedule of spells, they do not fit well with the auditing paradigm. This cannot be overstated: we are talking about multiple unaudited contracts over time, each of which gets root access to the entire Maker Protocol. A mistake has the potential to simply implode a protocol whose TVL is 4.6 billion USD (at the time of this writing).

The current iteration of the process is resource-heavy, both in terms of people — 1 spell "crafter" and 2 reviewers are mandatory — and computation, through a thorough and comprehensive test suite that ensures the protocol state is correct after the spell is cast.

The test suite is executed against a fork of the current state of the relevant chain. In such an environment, we can simulate the spell casting (execution) and run assertions against the would-be state of the chain. Not only that, we can also simulate hypothetical future conditions to ensure the system will behave as expected, even under extreme, but unlikely, scenarios.

Current State of Affairs

Before diving into what this test suite does exactly and breaking it down — trust me, we have done that, but it is harder to write about it than it sounds — we have a more pressing pain point.

Apparently, most Protocol Engineering Core Unit members used to run their own nodes locally. With a local node, the test suite executes in less than 1 second. Not bad at all!

For that reason, I used to run both Goerli and Mainnet local nodes on my work laptop. Until I ran out of space on my 2 TB SSD drive. Then I dropped the Mainnet node and kept running only the Goerli one. Everything was fine and dandy until a couple of weeks ago, when my computer simply would not boot anymore. It turns out that running a local node can fry an SSD drive in less than 2 years.

It might have been silly of me to try to run Ethereum nodes on SSDs made for personal use; however, this only reinforces the argument that running your own node is becoming increasingly hard. Not only do you have to spend time maintaining it, but you also need to buy server-grade SSDs that can cope with the required throughput.

With the further decentralization of engineering resources at MakerDAO, it is even harder to ask that everyone involved in the process run their own nodes. People have things to do with their lives other than figuring out why the latest release of Erigon broke their node entirely, or why Geth is making their computer sound like it is going to take off at any moment. Also, not everyone can or will buy a NUC and carry it with them everywhere they go just for the sake of having a functioning node.

Long story short: we need to rely on cloud Ethereum JSON RPC providers such as Alchemy, Infura or QuickNode. However, this comes with a pitfall: the spell test suite now takes a lot of time to run. The actual execution time depends on where in the world you are. In our CI pipeline, tests usually run in about 10 minutes, while if you are away from the US, a run can take up to 2,000 seconds. That is a 2,000x performance hit compared to the sub-second runs against a local node!

An example of execution which took almost 1,000 seconds to run

An attentive reader might be thinking:

Why do you care if it takes 16 minutes to run? Seems like a very small amount of time to help secure a protocol worth billions!

The real issue is that those tests need to be run several times during the spell crafting process. Also, the actual set of tests being executed changes every time. There are no tests for the tests themselves, so the only way to validate a test we just wrote is by running it.

Notice that this is not a unit test suite; it is something in between an integration and an end-to-end testing suite, which means test isolation cannot be achieved and sometimes is not even desired. Therefore, more often than not, when you need to re-run a test due to a failure, you also need to re-run others that might be intrinsically related to it, to make sure you have not introduced a regression.

In that sense, 30 minutes (or even 15 minutes) is a long feedback loop. It is barely bearable in a regular spell crafting and review cycle, and it is even worse if you are unlucky enough to have to debug an unexpectedly failing spell during the process.

I can hear some people from the audience interjecting:

Just get a paid account with the provider, your cheap bas&@#%$!

We hear you! In fact, we are already doing that. The problem seems to be that the JSON RPC providers are not used to dealing with workloads like ours. You can subscribe to a plan with a more or less constant, high rate of usage, as most front-end apps have. The MakerDAO test suite, on the other hand, does not consume any data from the node most of the time, except when it is run and makes a ton of requests in a short window.

Being realistic, from the provider's perspective it is way easier to deal with more or less constant usage with occasional spikes than with sudden surges in requests. Add to that the fact that most of their clients fit the former profile, and they have no reason to build a super-efficient, highly elastic infrastructure that could accommodate spell crafting needs.

We consulted with the major node providers, but right now our only alternative seems to be throwing an enormous amount of money at the problem to have some outstanding capacity ready for us at any time, even if it remained completely idle 90% of the time.

Profiling the Requests

The first step in understanding the workload is to know what kind of requests it makes.

My initial idea was the nuclear option: install Wireshark and record the requests to the JSON RPC endpoint that I chose for this test. It looked great in theory, but in practice it does not work, since we connect to providers through HTTPS. By the time Wireshark gets hold of a request, it is already TLS-encrypted. I needed to intercept the requests at an earlier stage!

So I decided to perform a man-in-the-middle attack on myself, using this little tool called Man-in-the-Middle Proxy (or mitmproxy to those in the know). For brevity's sake, I will skip the nitty-gritty details on how to configure it, but the documentation is quite comprehensive in case readers want to try it themselves. Just keep in mind that you will be required to install mitmproxy's custom certificates and brush up on your Python skills to transform the data into a friendlier format.
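A minimal sketch of such an addon could look like the following. The file name, CSV columns and the reverse-proxy setup are illustrative choices rather than the exact script used here, so treat it as a starting point:

```python
# rpc_logger.py: minimal mitmproxy addon that logs one CSV row per JSON RPC call.
# Run with: mitmdump --mode reverse:https://<your-provider-endpoint> -s rpc_logger.py
import csv
import json

from mitmproxy import http

OUT_FILE = "rpc_requests.csv"

# Write the CSV header once, when the addon is loaded.
with open(OUT_FILE, "w", newline="") as f:
    csv.writer(f).writerow(["method", "duration_ms", "response_bytes"])


def response(flow: http.HTTPFlow) -> None:
    """Called by mitmproxy for every completed request/response pair."""
    try:
        # The JSON RPC method lives in the POST body, not in the URL.
        body = json.loads(flow.request.get_text())
    except (TypeError, ValueError):
        return  # not a JSON RPC call, ignore it

    # Batched requests arrive as a list; normalize both cases to a list.
    calls = body if isinstance(body, list) else [body]

    duration_ms = (flow.response.timestamp_end - flow.request.timestamp_start) * 1000
    size = len(flow.response.content or b"")

    with open(OUT_FILE, "a", newline="") as f:
        writer = csv.writer(f)
        for call in calls:
            writer.writerow([call.get("method", "?"), round(duration_ms, 1), size])
```

With mitmproxy running in reverse proxy mode like this, the only change on the Foundry side is pointing the fork URL at the local proxy (http://127.0.0.1:8080 by default) instead of the provider.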

I ran the test suite from the spells-mainnet repository for the purpose of this analysis, against an Alchemy endpoint using a "Growth Plan" API key owned by Dewiz.

After some back and forth with the Python script, I finally managed to extract the relevant info from a full test suite run into a CSV file. Keep in mind that Ethereum JSON RPC services expose one single endpoint; the actual query is included in the payload of the request, so you need to parse it in order to understand what data is being requested.

Example of a JSON RPC request body
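For readers without the screenshot handy, a typical payload for the most frequent method looks roughly like this. The address, storage slot and endpoint below are illustrative placeholders, not values from the actual capture:

```python
import json

import requests  # pip install requests

# Illustrative JSON RPC payload: every query hits the same endpoint,
# and the method plus its parameters travel inside the POST body.
payload = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "eth_getStorageAt",
    "params": [
        "0x0000000000000000000000000000000000000000",  # contract address (placeholder)
        "0x0",                                          # storage slot (placeholder)
        "latest",                                       # block tag
    ],
}

RPC_URL = "https://example-provider.invalid/v2/<api-key>"  # placeholder endpoint
response = requests.post(RPC_URL, json=payload, timeout=30)
print(json.dumps(response.json(), indent=2))
```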

Analyzing the Data

Keep in mind that the spell test suite is not exactly the same every time, as some tests are enabled/disabled or added/removed depending on the specific spell contents.

The first thing I learned was that the full test suite execution performs around 10k requests (!) to the JSON RPC endpoint.

Wow! No wonder why it can take such a long time to complete.

- Me, after the first finding.

However, another interesting finding was that all query responses combined amount to no more than 7 MB of data. This is peanuts by modern internet connection standards, so bandwidth cannot be the reason for such a long execution time.

If connection bandwidth is not the problem, then the cause must be that the providers are throttling the requests at some point. But what is the actual impact of this throttling policy? Let’s dig deeper!
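A quick back-of-the-envelope check makes the point clearer. The numbers below are rounded, the ~150 ms round trip is roughly the fastest response observed in the data (more on that later), and Foundry does issue some requests concurrently, so a real run lands somewhere between the two extremes:

```python
# Back-of-the-envelope: latency vs. bandwidth for a full run.

request_count = 10_000        # total JSON RPC requests in a full run
round_trip_s = 0.150          # an optimistic ~150 ms per request
total_payload_mb = 7          # all responses combined
link_speed_mbit_s = 100       # a fairly ordinary connection

transfer_time_s = total_payload_mb * 8 / link_speed_mbit_s
serial_wait_s = request_count * round_trip_s

print(f"Moving the data:        {transfer_time_s:.1f} s")
print(f"Waiting on round trips: {serial_wait_s:.0f} s (~{serial_wait_s / 60:.0f} min)")
```

Even under these generous assumptions, moving the data takes well under a second, while waiting on round trips alone accounts for roughly 25 minutes. The time goes into waiting, not downloading.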

Summoning the data-analyst-that-never-was in me, I built a pivot table in Google Sheets with the request data. I split the analysis by JSON RPC method to understand which one is the biggest offender:

Different JSON RPC methods vs the amount of calls made with them

And the winner is eth_getStorageAt, with 86% of all requests. In hindsight, it makes sense, since we need to load the state of almost the entire MCD protocol in order to run tests for spells. Foundry even tries to help here, by caching the results to prevent duplicate requests.

In a distant second place come eth_getTransactionCount, eth_getCode and eth_getBalance with 4.6% each. While they are still relevant, they do not even come close to the main offender.

The rest of the calls are irrelevant to this analysis, since they seem to be called only during the initialization of the test suite.
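If you want to reproduce this breakdown from your own capture, a few lines of pandas are enough. The column names below match the CSV sketched earlier, so adjust them to whatever your export uses:

```python
import pandas as pd

# "method" and "duration_ms" are the column names from the earlier sketch;
# rename them to match your own CSV export.
df = pd.read_csv("rpc_requests.csv")

by_method = (
    df.groupby("method")
      .size()
      .sort_values(ascending=False)
      .to_frame("calls")
)
by_method["share_%"] = (by_method["calls"] / len(df) * 100).round(1)

print(by_method)
```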

Next, I wanted to get deeper insight into the request durations. How fast is the fastest request? How slow is the slowest one? How many of them are slow? How are the percentiles distributed? Let's get to more data crunching.

I’ll have some more data for breakfast, please! Some columns of the “Grand Total” line are off because Google Spreadsheet doesn’t let me customize them!

First, I set an arbitrary threshold of 500 ms for "slow" requests. They correspond to 1% of all eth_getStorageAt requests and 4.5% to 6.9% of the remaining relevant ones. Overall, 1.8% of requests can be deemed slow.

For minimum and maximum duration, there is a whopping discrepancy. While the minimums are always close to 150 ms, we have some requests taking almost 40 secs to return a result! I guess that is the power of exponential back-off.

Looking at the percentiles, the median (P50) and P90 — meaning 90% of the requests finish before the time in the column — are really close for all relevant methods. Specifically for eth_getStorageAt, even P95 is pretty close. P99 sits around 500-600 ms for all of them as well, so the conclusion is that very few requests (<1%) are responsible for the huge execution time.
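The same kind of summary can be squeezed out of the CSV directly, without a pivot table. Again, the column names are the ones from my export, and the 500 ms cut-off is the same arbitrary threshold used above:

```python
import pandas as pd

SLOW_THRESHOLD_MS = 500  # the same arbitrary cut-off used above

df = pd.read_csv("rpc_requests.csv")
grouped = df.groupby("method")["duration_ms"]

# count, min, max and the percentiles discussed above, per JSON RPC method
stats = grouped.describe(percentiles=[0.5, 0.9, 0.95, 0.99])

# share of requests slower than the threshold, per method
stats["slow_%"] = grouped.apply(
    lambda s: (s > SLOW_THRESHOLD_MS).mean() * 100
).round(1)

print(stats.round(1))
```

Changing SLOW_THRESHOLD_MS to 1000 is all it takes to redo the analysis with the stricter definition of "slow" used below.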

Inspecting the data, I noticed that as soon as a request takes more than 1,000 ms, the throttling effect becomes more visible. If I change the definition of a "slow" request to "requests taking more than 1,000 ms", we have the following scenario:

Redefining “slow” requests… Please don’t mind the glitch in the “Grand Total” line… It’s Google Spreadsheets fault!

Notice how the number for eth_getStorageAt drops to half of what it was before, when using 500 ms as the reference. Also, the other relevant methods have practically no requests that slow.

Not All Tests Are Born Equal

The analysis above gave me a rough idea of what is happening. The sheer number of requests for blockchain state is slowing down the entire execution. More than that, it is only a handful of requests that take far too long, because of providers throttling requests and the exponential back-off that follows.

The next step was to understand which tests cause the most trouble. Unfortunately, there is no easy way to tell which test case is executing each request. Given our experience with spell testing, I had a few suspects in mind. To confirm that, I used the --match-test argument of forge test, which allows us to execute only tests whose name matches a specific regex. I could then simply repeat the same steps as for the full suite, modifying the command to match a single test.
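The spells repository has its own scripts for running tests, so the snippet below is only a generic illustration of the forge flags involved, wrapped in a small hypothetical helper so the capture can be repeated per test case:

```python
import subprocess

def profile_test(pattern: str, fork_url: str) -> None:
    """Run only the tests whose name matches `pattern`, forking from `fork_url`."""
    subprocess.run(
        ["forge", "test", "--match-test", pattern, "--fork-url", fork_url],
        check=True,
    )

# With the mitmproxy reverse proxy from earlier still running locally:
profile_test("testGeneral", "http://127.0.0.1:8080")
```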

testGeneral

There is a test case called testGeneral. No, this is not yet another way to solve the Byzantine generals coordination problem, but a comprehensive test of the general state of the system after the execution of the spell. It covers most, but not all, of the changes being made.

Results when executing only the testGeneral test case

Notice how this is roughly 20% of the number of requests of the full suite. Truth be told, there is some overhead in the setup function, which needs to run regardless of whether 1 or 30 tests are being executed, so the actual number is a bit lower when there are other tests to split this cost with.

In general this does not look too bad. The slowest request took 10.5 secs to complete, while there are only 45 requests (~1.6%) taking more than 500 ms. Percentile data is in line with the full suite as well.

This was a bit surprising to me, as I believed this would be the "heaviest" test in the suite. That took me to the second suspect on my list.

testAuth

Security is priority #1 at MakerDAO. We are talking about the OG DeFi protocol, which has been alive and kicking since 2017 without any major hacking incidents. 6 years (and counting) is a very long time in crypto! You gotta respect that!

Those who came before us defined a set of rules for smart contracts in MCD that are still followed to the letter:

  1. Ownership of every "ownable" smart contract is controlled by a mapping(address => uint) wards.

  2. wards permissions are granted through rely(address who) and revoked through deny(address who), functions present in those contracts. Only a ward can call them.

  3. The deployer of a contract is relyed automatically in the deployment transaction, granting ward access to themselves.

  4. Before integrating the contract into the Maker Protocol, the deployer MUST rely the main governance execution contract — called MCD_PAUSE_PROXY — and deny themselves. Failing to do so will block such a contract from being part of the system. No negotiations!

This prevents dangling permissions in sensitive contracts, which could lead to attacks if any of the deployer keys were compromised. On rekt.news alone, a simple Google search shows more than 90 results for the term "compromised keys" at the time of this writing.

To enforce this rule, the testAuth test case goes nuclear: it scans every contract in the protocol chainlog and checks that none of the wallets of all deployers in the history of MCD is still a ward in any of them. The list of deployers needs to be kept up to date in the spells repository at all times. This is enforced by the current spell crafting and reviewing process, but it is not bulletproof.
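To make the shape of that check concrete, here is a rough Python rendition of the same idea. The real thing is a Solidity test inside the spells repository; the chainlog address, the minimal ABIs and the deployer list below are written from memory or invented for illustration, so double-check them before reusing anything:

```python
from web3 import Web3  # pip install web3

RPC_URL = "https://example-provider.invalid/v2/<api-key>"  # placeholder endpoint
w3 = Web3(Web3.HTTPProvider(RPC_URL))

# MCD chainlog: the on-chain registry of every contract in the protocol.
# Address and minimal ABI reproduced from memory; verify before relying on them.
CHAINLOG_ADDRESS = "0xdA0Ab1e0017DEbCd72Be8599041a2aa3bA7e740F"
CHAINLOG_ABI = [
    {"name": "list", "type": "function", "stateMutability": "view",
     "inputs": [], "outputs": [{"name": "", "type": "bytes32[]"}]},
    {"name": "getAddress", "type": "function", "stateMutability": "view",
     "inputs": [{"name": "_key", "type": "bytes32"}],
     "outputs": [{"name": "", "type": "address"}]},
]
WARDS_ABI = [
    {"name": "wards", "type": "function", "stateMutability": "view",
     "inputs": [{"name": "", "type": "address"}],
     "outputs": [{"name": "", "type": "uint256"}]},
]

# Illustrative only: the real list of historical deployers lives in spells-mainnet.
DEPLOYERS = ["0x0000000000000000000000000000000000000001"]

chainlog = w3.eth.contract(address=CHAINLOG_ADDRESS, abi=CHAINLOG_ABI)

# The nested loop: every chainlog contract times every historical deployer.
for key in chainlog.functions.list().call():
    addr = chainlog.functions.getAddress(key).call()
    target = w3.eth.contract(address=addr, abi=WARDS_ABI)
    for deployer in DEPLOYERS:
        try:
            is_ward = target.functions.wards(deployer).call()
        except Exception:
            # Permissionless contracts do not expose wards() at all; skip them.
            break
        assert is_ward == 0, f"dangling ward {deployer} on {addr}"
```

Every wards() lookup in that inner loop is one more call hitting the provider, which is exactly why this test dominates the request count.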

The attentive reader might have noticed where all this is going:

Well, looks like we’ve found our culprit!

Wow! Even taking the setup overhead into consideration, that is a lot. Let's break it down.

The slowest request took 10.8 secs to complete, while there are only 127 requests (~1.6%) taking more than 500 ms. Percentile data is a bit different compared to the other results, with eth_getStorageAt performing a little bit better and eth_getBalance a little bit worse.

In hindsight, this is a nested-loop test — list all contracts in the system and check them against all deployers that ever existed. At the time of this writing, chainlog v1.15.0 had 411 contracts (!) and 22 deployer wallets, which would require 9,042 individual checks to complete the test. Luckily, some of those contracts are permissionless, so there are no wards to check, which reduces the number to what you see above.

In a future post, we will discuss suggestions on how to improve this check.

testAuthInSources

testAuth has a little brother called testAuthInSources, which has the same intent, but is used in the context of Oracle Security Modules (OSMs). OSMs have a src method that points to the actual oracle contract, which must be checked for wards as well.

Bratty little brother doing his thing.

This also surprised me a little bit. There are far fewer OSMs in the system, yet this test is even heavier than testGeneral.

The slowest request took 16 secs to complete, while there are only 41 requests (~1.4%) taking more than 500 ms. Percentile data is in line with the full suite execution.

Excluding the Heavy Tests

I could have continued examining tests one by one, but at this point I felt that most of the problem was caused by the 3 test cases above. I decided to run the entire test suite, excluding only those.

Oh! It feels so good to drop all that weight!

The slowest request took less than 2.9 secs to complete, while there are only 33 requests (~1.9%) taking more than 500 ms. Percentile data is a bit different compared to the other results, with eth_getStorageAt performing a little bit better, while everything else is in line with the full suite.

Not surprisingly, the tests completed their execution in about 1/3 of the time when compared to the full suite.

There is Hope

As this post was being written, LlamaCorp launched Llama Nodes, a new competitor in the space alongside Alchemy, Infura, QuickNode and others. Among the new features they bring, there are 3 that fit well with our use case:

  1. They have a pay-as-you-go model instead of a subscription, meaning you only pay for the requests you actually make. This is great for running spell tests, because we only need to do it a dozen times a month.

  2. They support crypto payments (yay!). You can top up your account with USDC and use it until it runs out. Great! No credit cards needed.

  3. They allegedly cache frequent requests for better response times. This can help speed up subsequent runs of the same spell tests.

We noticed a 3x faster execution time when using a Llama Nodes endpoint (330 secs) versus Alchemy (1,000 secs). However, since this is a new product, it still has some outstanding issues: sometimes we notice a 502 Gateway Timeout error:

Error during the spell tests execution

Even so, we still perceive a reasonable improvement in the developer experience.

If you want to give it a try, you can use Dewiz’s referral code: 01H6P3V582694HM7J732C3JQM0 for Llama Nodes.

Full disclosure: according to LlamaCorp, if you use our referral code, Dewiz's account would receive 10% of the amount you spend on Llama Nodes, at LlamaCorp's own expense, meaning you would not lose anything. Other than that, Dewiz has no affiliation with LlamaCorp or any of their products. LlamaCorp has not paid and will not pay Dewiz because of this blog post.

Conclusion

With this analysis, we could finally better understand why the spells test suite takes such a long time to run on commercial-grade remote JSON RPC nodes. The providers' common practice of throttling requests makes a very small percentage of the requests take a really long time. In case the kind reader is interested in the actual data, you can find it in this spreadsheet.

While the time spent running tests to help secure a protocol with billions of USD in TVL is not a big problem per se, the spell crafting and reviewing cycle is hindered by the long feedback loop. This would not be a problem if most of the people involved in the process ran nodes locally, but nowadays that is a heavy requirement to put on an increasingly decentralized workforce.

In the next post, we will dive deep into understanding what each test case does and what value it brings, individually and collectively, and maybe suggest some changes that would improve the suite's performance without sacrificing even a bit of security.

See you soon!
