While Dune Analytics has rightly gotten a lot of positive attention recently for providing an easy-to-use platform for doing analytics over Ethereum datasets, you may not know that Google also provides a comprehensive dataset of Ethereum data on its large-scale analytics querying platform, BigQuery.
Not only are the datasets comprehensive - even including a table of call execution traces for every Ethereum transaction - they also benefit from BigQuery’s distributed nature, meaning that even large and complex queries can be executed quickly. I used BigQuery for both efficiently calculating the ENS airdrop amounts and for our ENS dashboard, and I thought people might find a few tips on how to put the data to work useful.
The Ethereum data can be found in two public projects in the BigQuery console. The first is the
crypto_ethereum dataset under the
bigquery-public-data project, which contains general Ethereum-wide datasets: Tables of blocks, logs, transactions, balances, etc.
The second project is
blockchain-etl, which contains numerous public datasets for Ethereum projects. For example,
blockchain-etl.ethereum_ens contains tables of events for most major contracts in ENS. If you’re wanting to produce analytics or do data mining for a specific project, look here first.
If you’ve used another analytics product like Dune Analytics, this will mostly be pretty familiar; BigQuery uses an SQL dialect just like Dune. For example, if you wanted to get a list of every address used as an ENS resolver, you could do it like this:
SELECT resolver, COUNT(*) AS count FROM `blockchain-etl.ethereum_ens.ENSRegistryWithFallback_event_NewResolver` GROUP BY resolver ORDER BY count DESC;
Likewise for the
crypto_ethereum dataset. Want a leaderboard of accounts that have spent the most on transaction fees? Here you go:
SELECT from_address, SUM(CAST(receipt_gas_used AS numeric) * CAST(COALESCE(gas_price, receipt_effective_gas_price) AS numeric)) / 1e18 AS total_fees FROM `bigquery-public-data.crypto_ethereum.transactions` GROUP BY from_address ORDER BY total_fees DESC LIMIT 100;
A lot of Ethereum event data has a common format: each event describes a change to some underlying state. Often what you want, however, is the state itself, rather than the delta. For example, ERC20 transfers are logged as a
Transfer event, which includes the from address, to address, and amount transferred, but often what you want is a balance sheet of all account balances. For this sort of query, simple aggregates generally suffice. This query, for example, returns a balance sheet for $ENS:
SELECT address, SUM(value) AS balance FROM ( SELECT token_address, from_address AS address, -CAST(value AS bigdecimal) / 1e18 AS value FROM `bigquery-public-data.crypto_ethereum.token_transfers` UNION ALL SELECT token_address, to_address AS address, CAST(value AS bigdecimal) / 1e18 AS value FROM `bigquery-public-data.crypto_ethereum.token_transfers` ) WHERE token_address = '0xc18360217d8f7ab5e7c516566761ea12ce7f9d72' GROUP BY address ORDER BY balance DESC;
Note we’re including the
token_transfers table here twice, once with negative values for the sender, and once with positive values for the recipient.
What if you want a balance history - the balance of an address after each transaction? This is where window functions (also called analytic functions) come in incredibly useful:
SELECT block_timestamp AS timestamp, address, SUM(value) OVER (PARTITION BY address ORDER BY block_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS balance FROM ( SELECT block_timestamp, token_address, from_address AS address, -CAST(value AS bigdecimal) / 1e18 AS value FROM `bigquery-public-data.crypto_ethereum.token_transfers` UNION ALL SELECT block_timestamp, token_address, to_address AS address, CAST(value AS bigdecimal) / 1e18 AS value FROM `bigquery-public-data.crypto_ethereum.token_transfers` ) WHERE token_address = '0xc18360217d8f7ab5e7c516566761ea12ce7f9d72' ORDER BY balance DESC;
Note the similarities to the previous query - but now we’re no longer grouping the results. Here, the SUM function isn’t aggregating records, it’s summing over other rows in the input set, totalling up the values of all previous transactions for that account. Note how the OVER clause specifies how to partition the rows (by address) and how to order them (by timestamp) and which rows to sum (from the start to the current row) - all resulting in a correct total for our use-case. The result will be one row for every token transaction, but with the balance after the transaction rather than the amount transacted.
What if you want to do the reverse - you have the new value, but instead you want the delta? We see this with ENS registration and renewal events; the contract outputs the expiry time, and we may want to know how long the name was registered or renewed for. Analytics queries come in useful here too, this time with the LAG function, which references a previous row in the input set:
SELECT block_timestamp, name, CAST(expires AS int64) - COALESCE(LAG(CAST(expires AS int64), 1) OVER(PARTITION BY name ORDER BY block_timestamp), UNIX_SECONDS(block_timestamp)) AS duration FROM ( SELECT block_timestamp, name, expires FROM `blockchain-etl.ethereum_ens.ETHRegistrarController3_event_NameRegistered` UNION ALL SELECT block_timestamp, name, expires FROM `blockchain-etl.ethereum_ens.ETHRegistrarController3_event_NameRenewed`)
LAG takes the expression to retrieve and how many rows back to retrieve it - and the OVER expression specifies how the rows should be partitioned and ordered. We also use COALESCE to provide a dynamic value for when a row is the first in the result set (eg, it’s being registered for the first time).
It’s worth noting that this won’t give totally accurate results; if a name expires and is reregistered, it will show the previous registration taking over from when the name expired, not from when it was reregistered. Fixing this is left as an exercise for the reader.
CREATE OR REPLACE FUNCTION `project.dataset.int_str_to_hash`(data STRING) RETURNS STRING LANGUAGE js OPTIONS (library=["gs://blockchain-etl-bigquery/ethers.js"]) AS R""" return '0x' + ethers.utils.bigNumberify(data).toHexString().slice(2).padStart(64, '0'); """;
If you want your own hosted version of the Ethereum ETL data, you’re in luck: The team has posted the source publicly on GitHub, so you can reproduce the datasets yourself: