Learning to Work With Ethereum Storage and State Data

If you’re completely new to SQL and Ethereum I recommend starting with my full beginners’ overview or breaking down an Ethereum transaction first. Feel free to support my work by buying a Data Landscape edition at the end of the article, or sending a tip to ilemi.eth!

Most explanations of storage and state on Ethereum are diagram heavy and don’t get into the meat and bones of how to access or work with that data. I’m going to avoid the diagrams and theoretical explanations, but you should still have some understanding of data tries so skim this and this if you know absolutely nothing about how data gets into blocks.

Essentially, what you need to know is that every time you make a transaction with a contract involved, there will be state and storage changes saved to the block. State changes involve things like an addresses’ Ether balance or Nonce, storage changes involve storage slots on contracts such as your balance of ERC20 tokens in a contract.

To access these changes from an Ethereum node, you’ll need to use a replay VM - meaning you have to replay a transaction or block to get the raw diffs in state and storage. This is easily done using Infura or Alchemy, without running your own node. You can go to Alchemy’s API composer to play with the trace_replayTransaction endpoint (here). If you want to read more about how traces works, read the standards here - but I’ll run through one example.

Let’s use this ERC20 transfer, whose contract code includes the following:

function transfer(address to, uint256 value) external returns (bool) {
        _transfer(msg.sender, to, value);
        return true;
    }

function _transfer(
        address from,
        address to,
        uint256 value
    ) private {
        balanceOf[from] = balanceOf[from] - value;
        balanceOf[to] = balanceOf[to] + value;
        emit Transfer(from, to, value);
    }

You can see that there are two storage changes, the balance of the from address and the balance of the to address. Here’s your JSON rpc request:

{
	"jsonrpc":"2.0"
	"id":0
	"method":"trace_replayTransaction"
	"params":[
		0:"0x8e5cd4d97881b558b7c4a120c41193b944db0612bc45807ae056d5488b82700e"
		1:[
			0:"trace"
			1:"stateDiff"
		]
	]
}

And here’s the response you’d get from Alchemy, I’ll explain it briefly here:

  • state diffs for 0x320d83769eb64096ea74b686eb586e197997f930 are balance changes in the mapping of balanceOf
  • state diffs for 0x52bc44d5378309ee2abf1539bf71de1b7d7be3b5 are addition to balance of gas fees being paid to miner
  • state diffs for 0xdad3fd6c9fb0c2b56228e58ae191b62bfb1bec83 are subtraction from balance of gas fees being paid by the sender for this transfer
  • the first trace action is the actual transfer() call
  • the second trace action is a delegatecall because a proxy pattern is used for this contract, and the logic is stored somewhere else (like forwarding an email).
{
  "jsonrpc": "2.0",
  "id": 0,
  "result": {
    "output": "0x0000000000000000000000000000000000000000000000000000000000000001",
    "stateDiff": {
      "0x320d83769eb64096ea74b686eb586e197997f930": {
        "balance": "=",
        "code": "=",
        "nonce": "=",
        "storage": {
          "0x108f03599a018b41696aad07e2100369d10122c70bf27d8e9fe3e425885a246b": {
            "*": {
              "from": "0x0000000000000000000000000000000000000000000000357bdcd5b903c454ce",
              "to": "0x0000000000000000000000000000000000000000000000000000000000000000"
            }
          },
          "0xdc7fafdc41998a74ecacb8f8bd877011aba1f1d03a3a0d37a2e7879a393b1d6a": {
            "*": {
              "from": "0x000000000000000000000000000000000000000000000215d691e79c0df07a6f",
              "to": "0x00000000000000000000000000000000000000000000024b526ebd5511b4cf3d"
            }
          }
        }
      },
      "0x52bc44d5378309ee2abf1539bf71de1b7d7be3b5": {
        "balance": {
          "*": {
            "from": "0xfdc72883cded567d93",
            "to": "0xfdc728a1a08db20db3"
          }
        },
        "code": "=",
        "nonce": "=",
        "storage": {}
      },
      "0xdad3fd6c9fb0c2b56228e58ae191b62bfb1bec83": {
        "balance": {
          "*": {
            "from": "0x59e417e4aa86a74",
            "to": "0x599f683b630a274"
          }
        },
        "code": "=",
        "nonce": {
          "*": {
            "from": "0x159",
            "to": "0x15a"
          }
        },
        "storage": {}
      }
    },
    "trace": [
      {
        "action": {
          "from": "0xdad3fd6c9fb0c2b56228e58ae191b62bfb1bec83",
          "callType": "call",
          "gas": "0x938c",
          "input": "0xa9059cbb000000000000000000000000000000000000000000000000000000000000dead0000000000000000000000000000000000000000000000357bdcd5b903c454ce",
          "to": "0x320d83769eb64096ea74b686eb586e197997f930",
          "value": "0x0"
        },
        "result": {
          "gasUsed": "0x45d0",
          "output": "0x0000000000000000000000000000000000000000000000000000000000000001"
        },
        "subtraces": 1,
        "traceAddress": [],
        "type": "call"
      },
      {
        "action": {
          "from": "0x320d83769eb64096ea74b686eb586e197997f930",
          "callType": "delegatecall",
          "gas": "0x7e24",
          "input": "0xa9059cbb000000000000000000000000000000000000000000000000000000000000dead0000000000000000000000000000000000000000000000357bdcd5b903c454ce",
          "to": "0x5b5746f6f5e2db8bf5e260829ca7a004c876b167",
          "value": "0x0"
        },
        "result": {
          "gasUsed": "0x323b",
          "output": "0x0000000000000000000000000000000000000000000000000000000000000001"
        },
        "subtraces": 0,
        "traceAddress": [
          0
        ],
        "type": "call"
      }
    ],
    "vmTrace": null
  }
}

vmTrace is a deeper trace, which tells you the order in which opcodes were executed by the EVM. I won’t cover that in this article (mainly since I am not familiar enough with those yet lol)

You can check the balanceOf changes using hex to num here, remember to div by 1e18. If you’re curious about how subtraces work, they are nested function calls:

  • subtraces = number of sub-calls by function

  • type = call (normal call), staticcall (doesn’t mod state), delegatecall (proxies calling logic), create (contract deploys)

    traceAddress = order of call execution, see example below from open Ethereum:

    A
      CALLs B
        CALLs G
      CALLs C
        CALLs G
    

    then it should look something like:

    [ {A: []}, {B: [0]}, {G: [0, 0]}, {C: [1]}, {G: [1, 0]} ]

A good tool to use is the Versatile block explorer, where you can check the “Invocation Flow” for a more detailed view of traces than Etherscan provides.

Don’t worry too much if that doesn’t all make sense, we’ll dive into the data here using a data warehouse instead of RPC calls from here on out.

Accessing State and Storage Data using Ethereum Data Warehouse (EDW)

EDW is events, calls, transactions, state, and storage data in a Snowflake shared database. You can get setup here with a custom app or just start querying on Flipside by changing the schema to tokenflow_eth. There are some example queries here. I’m assuming you’ve already used something like Dune Analytics or Flipside, so I’ll only cover the new tables.

  • state_diffs: use this for tracking ETH balances. here are the main state fields and reasons they change:
    • I like to use this table for wallet balances analysis - much faster than trying to aggregate on ETH calls from traces.
  • storage_diffs: tells you current and last value for different storage slots, with mappings nicely nested already in dict formats.
    • Contract storage layout comes from solc compiler. you can use a hardhat plugin to do this to get the following on each compile (or the raw JSON output):
  • This table is useful for tracking variables on contracts, most commonly would be stuff like historical balances.
  • storage_reads: this tells you when a certain slot of storage was read during a transaction.
    • useful for assessing contract interoperability/dependencies, when joined on call_id and tx_hash with traces (calls)
    • exploring what variables are read when a certain function is called, when joined on transactions with a selected funcsig in call_data.
    • sometimes it's easier to use calls table (traces) when the value you want is in a function output, instead of the storage variables.

Example Queries: Looking at On-chain Price Oracles

I used the data outputted from these queries in further analysis, you can read and play with the code in the repo here.

Let’s run through a harder example than just balances and transfers: looking at price oracles on-chain. If you have no familiarity with Oracles, check out this explainer and this repo of older oracles.

tldr; Chainlink pushes prices from the real world on-chain so smart contracts can access it, Uniswap V2 and V3 have methods for calculating the token price purely on-chain. There are likely differences in prices at any given moment for a variety of reasons.

All the queries I’m going through below are public here, you can run them once you have EDW set up - just remember to change EDW_SNOWFLAKE_SECURE_SHARE_1645732350028 to whatever your secure share instance name is! I also set up udfs on a new database in Snowflake, since you can’t create udfs in a shared db.

Let’s start with Chainlink data feeds! The contract is set up such that the proxy itself doesn’t store historical prices - it calls aggregator contracts instead. There have been five such aggregators used for the ETH/USDC price feed, which I queried for here:

SELECT * --use curr_value for aggregator addresses to use in the next query
FROM EDW_SNOWFLAKE_SECURE_SHARE_1645732350028.raw.storage_diffs 
WHERE contract = lower('0x5f4eC3Df9cbd43714FE2740f5E3616155c5b8419') --ETH/USDC proxy aggregator, ref https://data.chain.link/popular 
AND location LIKE '4[%].0'
ORDER BY timestamp ASC;

I queried for any address in slot 4 mapping. Mappings in EDW are stored as slot[value].keys where keys apply to structs in mappings. Here there is no struct, so we take .0 instead. The variable this tracks is:

"4": [
    {
      "from": 0,
      "name": "phaseAggregators",
      "to": 31,
      "type": "mapping(uint16, address)"
    }
  ],

Then, I plugged the five addresses manually into a query over the calls table, so I could get the values from each time the proxy contract called an aggregator.

SELECT 
    distinct 
    timestamp, 
    block,
    udf.public.hextoint(return_value)/1e8 as eth_price
FROM EDW_SNOWFLAKE_SECURE_SHARE_1645732350028.raw.calls 
WHERE from_address= lower('0x5f4eC3Df9cbd43714FE2740f5E3616155c5b8419') --ETH/USDC proxy aggregator
AND to_address IN ('0x37bc7498f4ff12c19678ee8fe19d713b87f6a9e6',    '0xd3fcd40153e56110e6eeae13e12530e26c9cb4fd', '0xb103ede8acd6f0c106b7a5772e9d24e34f5ebc2c', '0xf79d6afbb6da890132f9d7c355e3015f15f3406f', '0xc7a37b03690fb9f41b5c5af8131735c7275446') --historical phaseaggregators for ETH/USDC as of 3/1/2022
AND substring(call_data,1,10) = '0x50d25bcd' --funcsig for getting latestRoundData
AND block >= 12381343
AND status = 1
ORDER BY 1 ASC;

Chainlink oracle usage queries are just an aggregation change, so I’m not including it here.

Uniswap V2 Query

Uniswap V2 pairs track token prices separately, in variable price1CumulativeLast and price0CumulativeLast for token1 and token0 respectively (and in storage slots 10 and 9 respectively)

For this pair, I queried storage at slot 10 and parsed curr_value into an integer:

SELECT 
    block, 
    timestamp, 
    -- location, 
    -- tx_hash, 
    udf.public.hextoint(curr_value) as cumulative_last
FROM EDW_SNOWFLAKE_SECURE_SHARE_1645732350028.raw.storage_diffs 
WHERE contract = lower('0xB4e16d0168e52d35CaCD2c6185b44281Ec28C9Dc')
AND location = '10' -- token1 price slot, which is WETH
AND block >= 12381343
AND status = 1
ORDER BY block, timestamp, order_index;

The rest of the data processing to get price was done in Python (see repo from top of section). V2 Oracle usage queries are basically the same as in v3, so I won’t go over them here.

Uniswap V3 Query

Uniswap V3 is much more complex, because of two reasons:

  • prices are tracked in ticks now
  • historical prices are stored in the contract now to make accessing them easier, but this means dealing with a struct array. This is the observations[] array, created using the Oracle.sol library and implemented on the pair contract.

Because the array is the last slot initialized in the contract (the 9th slot), we know the index of the array is slot - 8. Structs are stored in reverse and tightly packed within each slot, such that for the struct:

struct Observation {
        uint32 blockTimestamp;
        int56 tickCumulative;
        uint160 secondsPerLiquidityCumulativeX128;
        bool initialized;
    }

The slot can be decoded into: first two bytes have initialized then 40 for secondsPerLiquidityCumulativeX128, 14 for tickCumulative and 8 for blockTimestamp. The query I used was:

SELECT timestamp, 
       location::integer - 8 as index, 
       -- substr(curr_value, 3, 2) = '0x01' as initialized,
       -- hextoint(substr(curr_value, 5, 40)) as secondsPerLiquidityCumulativeX128, 
       hextoint(substr(curr_value, 45, 14)) as tickCumulative
       -- hextoint(substr(curr_value, 59, 8)) as blockTimestamp
FROM EDW_SNOWFLAKE_SECURE_SHARE_1645732350028.raw.storage_diffs 
WHERE contract = lower('0x8ad599c3A0ff1De082011EFDDc58f1908eb6e6D8')
      and location not like '_[%' and location::integer > 8
      AND block >= 12381343 
      AND status = 1
ORDER BY block asc, order_index asc;

The rest of the data price processing was done in Python.

I also made a query to get daily reads/unique contracts reliant on the Uniswap V3 oracle, found here. First, I had to filter out transactions where there was a call to mint, burn, swap, or flash in the pool. Then, I joined the storage_reads table with the calls table on call_id and tx_hash to figure out which contracts were reading the tickCumulative data values, presuming they did so to get ETH prices.

WITH 
    --ignore calls that affect observation state as that doesn't count as oracle usage
    ignore_interactions as (
        SELECT 
            *
            --distinct tx_hash
        FROM EDW_SNOWFLAKE_SECURE_SHARE_1645732350028.raw.calls
        WHERE to_address = lower('0xB4e16d0168e52d35CaCD2c6185b44281Ec28C9Dc')
        AND substring(call_data,1,10) IN ('0x6a627842','0x89afcb44','0x022c0d9f') --mint, burn, swap
    )

-- query top contracts using calls instead of txs, we count distinct tx_hash since otherwise reads might be cheating with multiple reads per oracle use. 
SELECT 
    call.from_address , 
    COUNT(distinct call.tx_hash) as total_times_queried
FROM EDW_SNOWFLAKE_SECURE_SHARE_1645732350028.raw.storage_reads read 
LEFT JOIN EDW_SNOWFLAKE_SECURE_SHARE_1645732350028.raw.calls call 
    ON call.tx_hash = read.tx_hash AND call.call_id = read.call_id
WHERE contract = lower('0xB4e16d0168e52d35CaCD2c6185b44281Ec28C9Dc')
    AND location = '10'
    AND read.status = 1
    AND read.block >= 12381343
    AND NOT EXISTS (SELECT 1 FROM ignore_interactions ii --ignore non-oracle interactions
               WHERE ii.tx_hash = read.tx_hash)
GROUP BY 1
ORDER BY total_times_queried DESC
LIMIT 100;

*I highly encourage you to check out the longer deep dive into oracle prices here to get a sense of how this data is then applied in real analysis.*

Concluding Thoughts

I believe working with storage and state data is harder than events/function call data at the start because you need more familiarity with Solidity. Many variables are encoding in weird ways to make manipulation easy between smart contracts - so they really aren’t human interpretable at all. There are also a bunch of different storage patterns that people use, which take time to get used to.

However, this data does provide two primary benefits:

  • allow you to avoid a ton of expensive event/call aggregations, which will become increasingly important for wallet analysis in the future.
  • it captures cross-contract interoperability and dependencies. I think this enables us to do much deeper ecosystem analysis and risk analysis, as well as more accurate simulations.

I can’t wait for the production release of this product, huge kudos to the TokenFlow insights team (and thanks a bunch for answering all my questions! 🙂 )

(here’s the edition I mentioned at the start!)

Subscribe to Andrew Hong
Receive the latest updates directly to your inbox.
Verification
This entry has been permanently stored onchain and signed by its creator.