While Dune Analytics has rightly gotten a lot of positive attention recently for providing an easy-to-use platform for doing analytics over Ethereum datasets, you may not know that Google also provides a comprehensive dataset of Ethereum data on its large-scale analytics querying platform, BigQuery.
Not only are the datasets comprehensive - even including a table of call execution traces for every Ethereum transaction - they also benefit from BigQuery’s distributed nature, meaning that even large and complex queries can be executed quickly. I used BigQuery for both efficiently calculating the ENS airdrop amounts and for our ENS dashboard, and I thought people might find a few tips on how to put the data to work useful.
The Ethereum data can be found in two public projects in the BigQuery console. The first is the crypto_ethereum
dataset under the bigquery-public-data
project, which contains general Ethereum-wide datasets: Tables of blocks, logs, transactions, balances, etc.
The second project is blockchain-etl
, which contains numerous public datasets for Ethereum projects. For example, blockchain-etl.ethereum_ens
contains tables of events for most major contracts in ENS. If you’re wanting to produce analytics or do data mining for a specific project, look here first.
If you’ve used another analytics product like Dune Analytics, this will mostly be pretty familiar; BigQuery uses an SQL dialect just like Dune. For example, if you wanted to get a list of every address used as an ENS resolver, you could do it like this:
SELECT
resolver,
COUNT(*) AS count
FROM
`blockchain-etl.ethereum_ens.ENSRegistryWithFallback_event_NewResolver`
GROUP BY
resolver
ORDER BY
count DESC;
Likewise for the crypto_ethereum
dataset. Want a leaderboard of accounts that have spent the most on transaction fees? Here you go:
SELECT
from_address,
SUM(CAST(receipt_gas_used AS numeric) * CAST(COALESCE(gas_price, receipt_effective_gas_price) AS numeric)) / 1e18 AS total_fees
FROM
`bigquery-public-data.crypto_ethereum.transactions`
GROUP BY
from_address
ORDER BY
total_fees DESC
LIMIT
100;
A lot of Ethereum event data has a common format: each event describes a change to some underlying state. Often what you want, however, is the state itself, rather than the delta. For example, ERC20 transfers are logged as a Transfer
event, which includes the from address, to address, and amount transferred, but often what you want is a balance sheet of all account balances. For this sort of query, simple aggregates generally suffice. This query, for example, returns a balance sheet for $ENS:
SELECT
address,
SUM(value) AS balance
FROM (
SELECT
token_address,
from_address AS address,
-CAST(value AS bigdecimal) / 1e18 AS value
FROM
`bigquery-public-data.crypto_ethereum.token_transfers`
UNION ALL
SELECT
token_address,
to_address AS address,
CAST(value AS bigdecimal) / 1e18 AS value
FROM
`bigquery-public-data.crypto_ethereum.token_transfers` )
WHERE
token_address = '0xc18360217d8f7ab5e7c516566761ea12ce7f9d72'
GROUP BY
address
ORDER BY
balance DESC;
Note we’re including the token_transfers
table here twice, once with negative values for the sender, and once with positive values for the recipient.
What if you want a balance history - the balance of an address after each transaction? This is where window functions (also called analytic functions) come in incredibly useful:
SELECT
block_timestamp AS timestamp,
address,
SUM(value) OVER (PARTITION BY address ORDER BY block_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS balance
FROM (
SELECT
block_timestamp,
token_address,
from_address AS address,
-CAST(value AS bigdecimal) / 1e18 AS value
FROM
`bigquery-public-data.crypto_ethereum.token_transfers`
UNION ALL
SELECT
block_timestamp,
token_address,
to_address AS address,
CAST(value AS bigdecimal) / 1e18 AS value
FROM
`bigquery-public-data.crypto_ethereum.token_transfers` )
WHERE
token_address = '0xc18360217d8f7ab5e7c516566761ea12ce7f9d72'
ORDER BY
balance DESC;
Note the similarities to the previous query - but now we’re no longer grouping the results. Here, the SUM function isn’t aggregating records, it’s summing over other rows in the input set, totalling up the values of all previous transactions for that account. Note how the OVER clause specifies how to partition the rows (by address) and how to order them (by timestamp) and which rows to sum (from the start to the current row) - all resulting in a correct total for our use-case. The result will be one row for every token transaction, but with the balance after the transaction rather than the amount transacted.
What if you want to do the reverse - you have the new value, but instead you want the delta? We see this with ENS registration and renewal events; the contract outputs the expiry time, and we may want to know how long the name was registered or renewed for. Analytics queries come in useful here too, this time with the LAG function, which references a previous row in the input set:
SELECT
block_timestamp,
name,
CAST(expires AS int64) - COALESCE(LAG(CAST(expires AS int64), 1) OVER(PARTITION BY name ORDER BY block_timestamp), UNIX_SECONDS(block_timestamp)) AS duration
FROM (
SELECT
block_timestamp,
name,
expires
FROM
`blockchain-etl.ethereum_ens.ETHRegistrarController3_event_NameRegistered`
UNION ALL
SELECT
block_timestamp,
name,
expires
FROM
`blockchain-etl.ethereum_ens.ETHRegistrarController3_event_NameRenewed`)
LAG takes the expression to retrieve and how many rows back to retrieve it - and the OVER expression specifies how the rows should be partitioned and ordered. We also use COALESCE to provide a dynamic value for when a row is the first in the result set (eg, it’s being registered for the first time).
It’s worth noting that this won’t give totally accurate results; if a name expires and is reregistered, it will show the previous registration taking over from when the name expired, not from when it was reregistered. Fixing this is left as an exercise for the reader.
Finally, what if you want to do something there isn’t a built-in function for - such as converting between a 256 bit int and its hex representation? Fortunately, BigQuery has you covered, with support for JavaScript user-defined functions, which can even import libraries:
CREATE OR REPLACE FUNCTION `project.dataset.int_str_to_hash`(data STRING) RETURNS STRING LANGUAGE js
OPTIONS (library=["gs://blockchain-etl-bigquery/ethers.js"]) AS R"""
return '0x' + ethers.utils.bigNumberify(data).toHexString().slice(2).padStart(64, '0');
""";
All of this is great, but what if you want to do analytics on events that haven’t been conveniently extracted to a table and decoded by the Blockchain ETL team? If your dataset is widely useful, you may want to follow the instructions in this post, to get your dataset added to the blockchain_etl project. If you don’t have time to wait, or you’re working on something more bespoke, however, it’s possible to extract this data yourself using JavaScript UDFs - something I’ll go into in detail in my next post.
If you want your own hosted version of the Ethereum ETL data, you’re in luck: The team has posted the source publicly on GitHub, so you can reproduce the datasets yourself: