Ethereum analytics with BigQuery
0xb8c2
November 27th, 2021

While Dune Analytics has rightly gotten a lot of positive attention recently for providing an easy-to-use platform for doing analytics over Ethereum datasets, you may not know that Google also provides a comprehensive dataset of Ethereum data on its large-scale analytics querying platform, BigQuery.

Not only are the datasets comprehensive - even including a table of call execution traces for every Ethereum transaction - they also benefit from BigQuery’s distributed nature, meaning that even large and complex queries can be executed quickly. I used BigQuery for both efficiently calculating the ENS airdrop amounts and for our ENS dashboard, and I thought people might find a few tips on how to put the data to work useful.

Where to find it

The Ethereum data can be found in two public projects in the BigQuery console. The first is the crypto_ethereum dataset under the bigquery-public-data project, which contains general Ethereum-wide datasets: Tables of blocks, logs, transactions, balances, etc.

The second project is blockchain-etl, which contains numerous public datasets for Ethereum projects. For example, blockchain-etl.ethereum_ens contains tables of events for most major contracts in ENS. If you’re wanting to produce analytics or do data mining for a specific project, look here first.

How to query it

If you’ve used another analytics product like Dune Analytics, this will mostly be pretty familiar; BigQuery uses an SQL dialect just like Dune. For example, if you wanted to get a list of every address used as an ENS resolver, you could do it like this:

SELECT
  resolver,
  COUNT(*) AS count
FROM
  `blockchain-etl.ethereum_ens.ENSRegistryWithFallback_event_NewResolver`
GROUP BY
  resolver
ORDER BY
  count DESC;

Likewise for the crypto_ethereum dataset. Want a leaderboard of accounts that have spent the most on transaction fees? Here you go:

SELECT
  from_address,
  SUM(CAST(receipt_gas_used AS numeric) * CAST(COALESCE(gas_price, receipt_effective_gas_price) AS numeric)) / 1e18 AS total_fees
FROM
  `bigquery-public-data.crypto_ethereum.transactions`
GROUP BY
  from_address
ORDER BY
  total_fees DESC
LIMIT
  100;

Tips

Aggregating Deltas

A lot of Ethereum event data has a common format: each event describes a change to some underlying state. Often what you want, however, is the state itself, rather than the delta. For example, ERC20 transfers are logged as a Transfer event, which includes the from address, to address, and amount transferred, but often what you want is a balance sheet of all account balances. For this sort of query, simple aggregates generally suffice. This query, for example, returns a balance sheet for $ENS:

SELECT
  address,
  SUM(value) AS balance
FROM (
  SELECT
    token_address,
    from_address AS address,
    -CAST(value AS bigdecimal) / 1e18 AS value
  FROM
    `bigquery-public-data.crypto_ethereum.token_transfers`
  UNION ALL
  SELECT
    token_address,
    to_address AS address,
    CAST(value AS bigdecimal) / 1e18 AS value
  FROM
    `bigquery-public-data.crypto_ethereum.token_transfers` )
WHERE
  token_address = '0xc18360217d8f7ab5e7c516566761ea12ce7f9d72'
GROUP BY
  address
ORDER BY
  balance DESC;

Note we’re including the token_transfers table here twice, once with negative values for the sender, and once with positive values for the recipient.

Creating a time-series from deltas

What if you want a balance history - the balance of an address after each transaction? This is where window functions (also called analytic functions) come in incredibly useful:

SELECT
  block_timestamp AS timestamp,
  address,
  SUM(value) OVER (PARTITION BY address ORDER BY block_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS balance
FROM (
  SELECT
    block_timestamp,
    token_address,
    from_address AS address,
    -CAST(value AS bigdecimal) / 1e18 AS value
  FROM
    `bigquery-public-data.crypto_ethereum.token_transfers`
  UNION ALL
  SELECT
    block_timestamp,
    token_address,
    to_address AS address,
    CAST(value AS bigdecimal) / 1e18 AS value
  FROM
    `bigquery-public-data.crypto_ethereum.token_transfers` )
WHERE
  token_address = '0xc18360217d8f7ab5e7c516566761ea12ce7f9d72'
ORDER BY
  balance DESC;

Note the similarities to the previous query - but now we’re no longer grouping the results. Here, the SUM function isn’t aggregating records, it’s summing over other rows in the input set, totalling up the values of all previous transactions for that account. Note how the OVER clause specifies how to partition the rows (by address) and how to order them (by timestamp) and which rows to sum (from the start to the current row) - all resulting in a correct total for our use-case. The result will be one row for every token transaction, but with the balance after the transaction rather than the amount transacted.

Creating deltas from a time series

What if you want to do the reverse - you have the new value, but instead you want the delta? We see this with ENS registration and renewal events; the contract outputs the expiry time, and we may want to know how long the name was registered or renewed for. Analytics queries come in useful here too, this time with the LAG function, which references a previous row in the input set:

SELECT
  block_timestamp,
  name,
  CAST(expires AS int64) - COALESCE(LAG(CAST(expires AS int64), 1) OVER(PARTITION BY name ORDER BY block_timestamp), UNIX_SECONDS(block_timestamp)) AS duration
FROM (
  SELECT
    block_timestamp,
    name,
    expires
  FROM
    `blockchain-etl.ethereum_ens.ETHRegistrarController3_event_NameRegistered`
  UNION ALL
  SELECT
    block_timestamp,
    name,
    expires
  FROM
    `blockchain-etl.ethereum_ens.ETHRegistrarController3_event_NameRenewed`)

LAG takes the expression to retrieve and how many rows back to retrieve it - and the OVER expression specifies how the rows should be partitioned and ordered. We also use COALESCE to provide a dynamic value for when a row is the first in the result set (eg, it’s being registered for the first time).

It’s worth noting that this won’t give totally accurate results; if a name expires and is reregistered, it will show the previous registration taking over from when the name expired, not from when it was reregistered. Fixing this is left as an exercise for the reader.

JavaScript user-defined functions

Finally, what if you want to do something there isn’t a built-in function for - such as converting between a 256 bit int and its hex representation? Fortunately, BigQuery has you covered, with support for JavaScript user-defined functions, which can even import libraries:

CREATE OR REPLACE FUNCTION `project.dataset.int_str_to_hash`(data STRING) RETURNS STRING LANGUAGE js
OPTIONS (library=["gs://blockchain-etl-bigquery/ethers.js"]) AS R"""
return '0x' + ethers.utils.bigNumberify(data).toHexString().slice(2).padStart(64, '0');
""";

Extracting your own event data

All of this is great, but what if you want to do analytics on events that haven’t been conveniently extracted to a table and decoded by the Blockchain ETL team? If your dataset is widely useful, you may want to follow the instructions in this post, to get your dataset added to the blockchain_etl project. If you don’t have time to wait, or you’re working on something more bespoke, however, it’s possible to extract this data yourself using JavaScript UDFs - something I’ll go into in detail in my next post.

DIY

If you want your own hosted version of the Ethereum ETL data, you’re in luck: The team has posted the source publicly on GitHub, so you can reproduce the datasets yourself:

Arweave TX
xPxBkQKDOSKZVIWqtipOWGzhNXIyllQIsCGQVZk8nG0
Ethereum Address
0xb8c2C29ee19D8307cb7255e1Cd9CbDE883A267d5
Content Digest
INhEmxgxoyoa8kPZ3rjYNZXoyfGsReLgx42MdDvn4SM