Historical blockchain data poses challenges for analysis. Although it is public in principle, obtaining and analyzing it has long been hindered by paywalls and restrictions imposed by node service providers. Setting up a personal archive node is also non-trivial, adding extra steps before data analysis becomes feasible.
By leveraging Cryo to extract historical data and Merkle's free RPC with archive node support, researchers can now access Ethereum's history at no cost. This gives quantitative researchers production-grade access, leaving more time for data exploration and less time spent building data pipelines.
Cryo, a recent addition to the data-extraction toolbox, was announced in July 2023. It uses ethers.rs for JSON-RPC requests, making it compatible with many chains, including Ethereum, Optimism, Arbitrum, Polygon, BNB, and Avalanche. Since Cryo is built in Rust and querying data is embarrassingly parallel, it is fast by default; so fast, in fact, that out of the box it overwhelms most node providers.
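When an endpoint does push back, Cryo can pace itself; the CLI exposes a `--requests-per-second` option for this. Here is a minimal sketch via the Python bindings, assuming keyword arguments mirror the CLI flags (verify the exact names against your Cryo version):

```python
import cryo

# Sketch: fetch a small block range while pacing requests so a
# rate-limited endpoint can keep up. requests_per_second is assumed
# to mirror the CLI's --requests-per-second flag.
df = cryo.collect(
    "blocks",
    blocks=["18000000:18000100"],
    rpc="https://eth.merkle.io",   # the endpoint used later in this post
    requests_per_second=100,       # assumed kwarg; verify for your version
)
print(df.head())
```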
When extracting data from a historical node, a common challenge is preprocessing raw blockchain data into a human-usable form. Cryo takes care of this and standardizes the output schema across a wide variety of datasets. By default, data is saved as Parquet files, Apache's free, universal, open-source, column-oriented storage format, compressed with lz4 (modifiable with the --compression flag).
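Because the output is standard Parquet, any dataframe tool can read it directly. For example, with polars (the filename below is illustrative; Cryo names files by dataset and block range):

```python
import polars as pl

# Inspect one of Cryo's parquet outputs; lz4-compressed parquet is
# read natively, so no extra options are needed.
df = pl.read_parquet("blocks__18039828_to_18251969.parquet")
print(df.schema)  # column names and types for the blocks dataset
print(df.head())
```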
Some of the datasets already available in Cryo:
balance_diffs
balances
blocks
erc20_balances
erc20_supplies
erc20_transfers
erc721_transfers
eth_calls
geth_calls
geth_code_diffs
geth_balance_diffs
geth_opcodes
logs (alias = events)
native_transfers
slots (alias = storages)
storage_diffs (alias = slot_diffs)
storage_reads (alias = slot_reads)
traces
trace_calls
transactions (alias = txs)
Cryo is user-friendly and accessible through either the CLI or Python bindings, which significantly simplifies extracting and curating historical data for research. The Cryo GitHub readme provides a set of starter commands. Cryo is also idempotent, so researchers can resume interrupted pipelines without duplicating already-queried data.
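As a taste of the Python bindings, here is a sketch based on the README's collect() interface; keyword arguments are assumed to mirror the CLI flags, so double-check them for your version:

```python
import cryo

# Sketch: pull ERC-20 transfer events for one token straight into a
# dataframe. The contract keyword is assumed to mirror the CLI's
# --contract flag; the address is USDC on mainnet.
usdc = "0xA0b86991c6218b36c1d19D4a2e9Eb0cE3606eB48"
transfers = cryo.collect(
    "erc20_transfers",
    blocks=["18000000:18000100"],
    contract=usdc,
    rpc="https://eth.merkle.io",
)
print(transfers.head())
```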
Cryo's efficiency is ultimately limited by rate limits and throttling from node endpoints, or by personal node hardware. Merkle removes this bottleneck by offering a free RPC with no throttling and unlimited requests; the endpoint is https://eth.merkle.io. How is this possible?
Merkle is a private mempool provider and operates a fleet of Reth nodes, which saves them roughly $250,000 per year (source) and improves performance compared to other mempool services like Kolibrio and Bloxroute. Just as important, their cloud provider, OVH, grants them unlimited incoming/outgoing bandwidth, so don't feel guilty about using the node!
While Cryo already supports BSC and Polygon, Reth does not yet. Merkle plans to offer similar public endpoints for BSC and Polygon once Reth supports them.
Here is an example of how I use the Cryo CLI to build a dataset of blocks, transactions, and intra-block balance changes. Two commands download ~50 GB of historical data covering the month of September; the limiting factor was my internet speed! Because Cryo is idempotent, the pipeline largely manages itself: all I had to do was create a data folder, and Cryo handles the subfolder layout via --subdirs datatype.
`cryo blocks_and_transactions -b 18039828:18251969 -o /home/evan/Documents/blockspace/data/cryo_september/ --rpc "https://eth.merkle.io" --subdirs datatype --hex`
`cryo balance_diffs -b 18039828:18251969 -o /home/evan/Documents/blockspace/data/cryo_september/ --rpc "https://eth.merkle.io" --subdirs datatype --hex`
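Once the download finishes, the dataset can be queried lazily without loading all ~50 GB into memory. A sketch with a recent polars (the path follows the --subdirs datatype layout from the commands above, and block_number is a column in Cryo's transactions schema):

```python
import polars as pl

# Lazily scan every transactions chunk Cryo wrote for September...
txs = pl.scan_parquet(
    "/home/evan/Documents/blockspace/data/cryo_september/transactions/*.parquet"
)

# ...and count transactions per block; only the needed columns are read.
tx_counts = (
    txs.group_by("block_number")
       .agg(pl.len().alias("tx_count"))
       .sort("block_number")
       .collect()
)
print(tx_counts.head())
```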