Free Historical Blockchain Extraction with Cryo + Merkle Reth Nodes

Introduction

Historical blockchain data is public, yet obtaining it in a form suitable for analysis has long been hindered by paywalls and restrictions imposed by node service providers. Setting up a personal archive node is also a non-trivial task, adding extra steps before data analysis becomes feasible.

By leveraging Cryo to extract historical data and utilizing the free Merkle RPC with archive node support, researchers can now easily access Ethereum historical data at no cost. This breakthrough provides quantitative researchers with production-grade access, allowing more time for data exploration and less time spent on building data pipelines.

Cryo

Cryo is a recent addition to the data extraction toolbox, announced in July 2023. It uses ethers.rs for JSON-RPC requests, which makes it compatible with various chains, including Ethereum, Optimism, Arbitrum, Polygon, BNB, and Avalanche. Cryo is built in Rust, and because each block range can be queried independently, the workload is embarrassingly parallel. In fact, with its default settings Cryo is fast enough to run into the rate limits of most node providers.
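To make the JSON-RPC layer concrete, here is a minimal Python sketch of the kind of request body a client sends to fetch a single block. `eth_getBlockByNumber` is a standard Ethereum JSON-RPC method; the helper name `block_request` is made up for illustration, and this is not Cryo's internal code. It only builds the payloads and sends nothing over the network.

```python
import json


def block_request(block_number: int, request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 body asking for one block (hypothetical helper)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_getBlockByNumber",
        # Block numbers are hex-encoded; False asks for transaction hashes only.
        "params": [hex(block_number), False],
    }


# One independent request per block: no request depends on another block's result.
batch = [block_request(n, request_id=i) for i, n in enumerate(range(18039828, 18039831))]
print(json.dumps(batch[0]))
```

Because the requests are independent, a client can spread them across as many workers as the endpoint tolerates, which is why the node, not the tool, tends to be the bottleneck.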

When extracting data from a historical node, a common challenge is preprocessing raw blockchain data to make it human-usable. Cryo takes care of this and standardizes the output across a wide variety of datasets. By default, data is saved as Parquet files, Apache's free, universal, and open-source column-oriented storage format. These files use lz4 compression by default (configurable with the --compression flag).

Some of the datasets already available in Cryo:

  • balance_diffs

  • balances

  • blocks

  • erc20_balances

  • erc20_supplies

  • erc20_transfers

  • erc721_transfers

  • eth_calls

  • geth_calls

  • geth_code_diffs

  • geth_balance_diffs

  • geth_opcodes

  • logs (alias = events)

  • native_transfers

  • slots (alias = storages)

  • storage_diffs (alias = slot_diffs)

  • storage_reads (alias = slot_reads)

  • traces

  • trace_calls

  • transactions (alias = txs)

Cryo is user-friendly, accessible through the CLI or Python bindings, significantly simplifying the process of extracting and curating historical data for research. The Cryo GitHub readme provides a set of starter commands. Additionally, Cryo is idempotent, enabling researchers to resume interrupted pipelines without duplicating queried data.
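The resume behavior can be pictured with a short Python sketch: split a block range into fixed-size chunks, name one output file per chunk, and skip any chunk whose file already exists. The chunk size, the file-name pattern, and the function names below are assumptions for illustration only, not Cryo's actual layout or implementation.

```python
def chunk_ranges(start: int, end: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split [start, end) into consecutive half-open chunks."""
    return [(lo, min(lo + chunk_size, end)) for lo in range(start, end, chunk_size)]


def chunk_filename(dataset: str, lo: int, hi: int) -> str:
    # Hypothetical naming scheme; the last block in the chunk is hi - 1.
    return f"{dataset}__{lo}_to_{hi - 1}.parquet"


def pending_chunks(dataset, start, end, chunk_size, existing_files):
    """Return only the chunks whose output file is not already present."""
    return [
        (lo, hi)
        for lo, hi in chunk_ranges(start, end, chunk_size)
        if chunk_filename(dataset, lo, hi) not in existing_files
    ]


# Pretend the first chunk was written before the run was interrupted.
done = {chunk_filename("blocks", 18039828, 18040828)}
todo = pending_chunks("blocks", 18039828, 18042828, 1000, done)
print(todo)
```

In a real pipeline `existing_files` would come from listing the output directory, so rerunning the same command after an interruption only fetches what is missing.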

Merkle RPC

Cryo's throughput is limited by rate limits and throttling from node endpoints, or by personal node hardware. Merkle addresses this bottleneck by offering a free RPC with no throttling and unlimited requests. The Ethereum endpoint is https://eth.merkle.io. How is this possible?

Merkle is a private mempool provider and operates a group of Reth nodes, allowing them to save ~$250,000 annually on expenses (source) and improve performance compared to other mempool services like Kolibrio and bloXroute. Just as important, their cloud provider, OVH, grants them unlimited incoming and outgoing bandwidth, so don’t feel guilty using the node!

While Cryo currently supports BSC and Polygon, Reth does not. However, Merkle plans to offer similar public endpoints for BSC and Polygon once Reth supports them.

Short Example

Here is an example of how I use the Cryo CLI to build a dataset of blocks, transactions, and intra-block balance changes. Two commands download ~50 GB of historical data for the month of September. The limiting factor was my internet speed! The pipeline is largely self-managed because of Cryo's idempotent nature, so the only thing I needed to do was create a data folder; Cryo takes care of subfolder management with --subdirs datatype.

`cryo blocks_and_transactions -b 18039828:18251969 -o /home/evan/Documents/blockspace/data/cryo_september/ --rpc "https://eth.merkle.io" --subdirs datatype --hex`

`cryo balance_diffs -b 18039828:18251969 -o /home/evan/Documents/blockspace/data/cryo_september/ --rpc "https://eth.merkle.io" --subdirs datatype --hex`
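
Since the two commands differ only in the dataset name, they can also be generated from shared settings. The Python sketch below only builds the argument lists from the flags shown above and prints them; it does not run Cryo. The helper name `cryo_command` is made up for illustration, and handing each list to `subprocess.run` assumes the `cryo` binary is on your PATH.

```python
import shlex

START_BLOCK, END_BLOCK = 18039828, 18251969
OUTPUT_DIR = "/home/evan/Documents/blockspace/data/cryo_september/"
RPC_URL = "https://eth.merkle.io"


def cryo_command(dataset: str) -> list[str]:
    """Assemble one cryo invocation with the same flags as the commands above."""
    return [
        "cryo", dataset,
        "-b", f"{START_BLOCK}:{END_BLOCK}",
        "-o", OUTPUT_DIR,
        "--rpc", RPC_URL,
        "--subdirs", "datatype",
        "--hex",
    ]


commands = [cryo_command(d) for d in ("blocks_and_transactions", "balance_diffs")]
for cmd in commands:
    # Print instead of executing; use subprocess.run(cmd, check=True) to run it.
    print(shlex.join(cmd))
```

Keeping the block range, output folder, and RPC URL in one place makes it harder for the two datasets to drift apart when the range is changed for another month.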