Getting Into Web3 as a Data Scientist & Machine Learning Engineer
Miguel Padriñán (Pexels)

After weeks of hearing about Web3 (randomly scrolling through my Twitter feed and finding the term everywhere), I finally gave in. I hate buzzwords and trending terms, but this one felt like it was here to stay. So, I reluctantly did a deep dive on it over the weekend. What caught my eye is the idea of decentralizing the control and ownership that currently rest with only a few companies.

I won't bore you with a long definition of Web3; if there's anything I've learned from my research, it's that the definition is ever-evolving. Still, it's important to know what Web3 is, so you know how you fit.

Web 3.0 technology is simply a fair and transparent network where people will interact without the fear of loss of security or privacy.

So imagine a network where users control their own information and grant tokens of access to the businesses that want to use that data. This is set to be powered by AI and peer-to-peer technologies like blockchain. Blockchain's cryptographic and distributed design ensures both security and data privacy. But to qualify as Web 3.0, this must hold for all user data.

Blockchain reinvents the way data is stored and managed. It provides a unique set of data (a universal state layer) that is collectively managed. This unique state layer for the first time enables a value settlement layer for the Internet. It allows us to send files in a copy-protected way, enabling true P2P transactions without intermediaries.

What Does This Mean For Data Scientists & Machine Learning Engineers?

Web3 is centered on user autonomy, achieved by distributing user data across blockchain-enabled storage technologies. Web applications (or dApps, as they're called) are distributed across these same blockchain platforms, so users can opt to grant those apps access to their data, creating richer, more relevant experiences. In contrast to traditional data sources (e.g., centrally controlled corporate databases), users no longer need to request their data from businesses, because they already control it themselves, stored on the blockchain.

Since data is now stored in a distributed fashion across the entirety of the internet, AI can be deployed to understand user needs more fully, developing language models with semantic understanding because queries are tied to user interactions. By design, blockchains provide several benefits that are important for data science applications:

  • Traceability: The consensus protocol is designed so that the network can collectively remember preceding events and user interactions. Bitcoin thereby resolved the double-spending problem by providing a single source of reference for who received what, and when. Moreover, most public blockchains have “explorers” — websites where anyone can examine any record ever generated on the respective blockchain (see, for example, the Bitcoin, Ethereum, and Ripple explorers).
  • Built-in anonymity: Blockchains do not require their users to provide any personal information, which is important in a world where keeping one’s privacy has become a real issue. From a Data Scientist’s perspective, this helps to overcome the headaches associated with some of the regulations (e.g., GDPR in Europe) that require personal data to be anonymized before processing.
  • High data quality: Data on a blockchain is typically well structured, and its schema is well documented. All new records also go through a rigorous, blockchain-specific validation process powered by one of the many “consensus protocols”. Once validated and approved, these records become immutable — no one can modify them for any purpose, good or malicious. This makes the life of a Data Scientist who works with such data much easier and more predictable.
  • Large data volumes: Many Machine Learning algorithms require large amounts of data to train models. This is not a problem in mature blockchains, which offer tons of data.
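The traceability and immutability described above both come down to records being hash-linked. Here is a toy sketch (a hash chain only, not a real consensus protocol) of why tampering with any past record is immediately detectable:

```python
import hashlib
import json

def block_hash(record: dict, prev_hash: str) -> str:
    """Hash a record together with the previous block's hash."""
    payload = json.dumps(record, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode()).hexdigest()

def build_chain(records):
    """Link records into a minimal hash chain."""
    chain, prev = [], "0" * 64  # genesis placeholder
    for rec in records:
        h = block_hash(rec, prev)
        chain.append({"record": rec, "prev_hash": prev, "hash": h})
        prev = h
    return chain

def verify(chain) -> bool:
    """Recompute every hash; any tampered record breaks the chain."""
    prev = "0" * 64
    for block in chain:
        if block["prev_hash"] != prev or block_hash(block["record"], prev) != block["hash"]:
            return False
        prev = block["hash"]
    return True

chain = build_chain([{"from": "a", "to": "b", "amount": 5},
                     {"from": "b", "to": "c", "amount": 2}])
assert verify(chain)
chain[0]["record"]["amount"] = 500  # tamper with history
assert not verify(chain)
```

Because each block's hash covers the previous block's hash, editing one record invalidates every block after it — which is exactly what makes blockchain data so predictable to work with as a Data Scientist.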

How Do We Then Collect Data From Blockchains For Web3?

Tom Fisk (Pexels)

Data collection is the first hurdle most data scientists like me are likely to encounter in blockchain or Web3 related projects. It is easy to examine individual blockchain records using the aforementioned explorer websites; however, automating the collection of larger datasets suitable for Data Science purposes can be a daunting task that may require specialized skills, software, and financial resources. Nevertheless, here are four main options one could consider.

1. Web3 Data Marketplaces

With the advent of Web3, businesses are providing marketplaces where data owners and data scientists can come together to buy and sell data assets in a decentralized framework. One company I am following that is making significant strides here is Ocean Protocol.

Ocean Protocol enables private businesses to sell their data assets on the marketplace without the data ever leaving their firewalls. It employs a “Compute-to-Data” orchestration that allows AI models to train on private data.

Do you know how exciting that is? Imagine being able to train a disease model using data from multiple major hospital networks without ever having to access the data itself, just the metadata.
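The gist of the pattern is worth spelling out. The sketch below is a conceptual illustration of Compute-to-Data, not Ocean Protocol's actual API: the provider runs the consumer's algorithm next to the data, and only the computed result ever leaves the firewall.

```python
# Conceptual sketch of the Compute-to-Data pattern (NOT Ocean Protocol's real
# API): the raw dataset never leaves the provider; the consumer submits an
# algorithm and receives only its output back.

PRIVATE_DATA = [  # lives behind the provider's firewall (made-up rows)
    {"age": 42, "sick": 1},
    {"age": 25, "sick": 0},
    {"age": 61, "sick": 1},
]

def compute_to_data(algorithm):
    """Provider side: run the consumer's algorithm locally, return only its output."""
    return algorithm(PRIVATE_DATA)

def mean_age_of_sick(rows):
    """Consumer side: a training/aggregation job; never sees PRIVATE_DATA directly."""
    sick_ages = [r["age"] for r in rows if r["sick"]]
    return sum(sick_ages) / len(sick_ages)

result = compute_to_data(mean_age_of_sick)
print(result)  # → 51.5 — an aggregate, not the raw records
```

In the real protocol, the "algorithm" would typically be a containerized model-training job and the output a set of fitted weights, but the privacy boundary is the same: data in, results out.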

Another exciting opportunity for data scientists and ML engineers with this new data protocol is the chance to buy data, blend it with other data, enhance it with machine learning models, and sell it back in its enhanced form.

2. BigQuery Public Datasets & Other Static Datasets

As part of its BigQuery Public Datasets program, Google Cloud provides full transaction histories for Bitcoin, Dash, Dogecoin, Ethereum, Ethereum Classic, Litecoin, Zcash, and others. These datasets can be easily queried using SQL, and the results can be exported for further analysis and modeling. Conveniently, most of these datasets use the same schema, making it easier to reuse SQL queries across chains.
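As a minimal sketch, here is how one might pull daily Bitcoin transaction counts. The table name is a real BigQuery public table; actually running the query requires the `google-cloud-bigquery` package and GCP credentials, so the client call is wrapped in a function rather than executed here.

```python
# Daily Bitcoin transaction counts from Google's public dataset (sketch).

DAILY_TX_QUERY = """
SELECT DATE(block_timestamp) AS day, COUNT(*) AS tx_count
FROM `bigquery-public-data.crypto_bitcoin.transactions`
WHERE block_timestamp >= TIMESTAMP('2023-01-01')
GROUP BY day
ORDER BY day
"""

def fetch_daily_tx_counts(query: str = DAILY_TX_QUERY):
    """Run the query and return rows as dicts; needs GCP credentials."""
    from google.cloud import bigquery  # pip install google-cloud-bigquery
    client = bigquery.Client()
    return [dict(row) for row in client.query(query).result()]
```

Because the Ethereum, Litecoin, etc. datasets share much of the same schema, swapping `crypto_bitcoin` for `crypto_ethereum` in the table name is often all a comparative analysis needs.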

I found an excellent account by Evgeny Medvedev with tutorials on how to use and format these data. There also exist static blockchain datasets that one could use for research and development purposes; a quick search will turn up several.

3. Use a Blockchain-Specific API or ETL Tool

It's understandable that BigQuery Public Datasets cover the major blockchain projects, but what if the blockchain of interest is not among them? One good way to collect data is to use an API or ETL tool. Most blockchains offer a way to automate interactions with their networks via REST and/or WebSocket APIs. See, for example, the APIs to query Bitcoin, Ethereum, EOS, NEM, NEO, Nxt, Ripple, Stellar, and Tezos.

You can even find existing, convenient client libraries that abstract away the intricacies of specific APIs and allow Data Scientists to work in their preferred language — Python or R. Examples of such libraries for Python include bitcoin (Bitcoin), trinity (Ethereum), blockcypher (Bitcoin, Litecoin, Dogecoin, Dash), tronpy (TRON), litecoin-utils (Litecoin), etc. R packages are fewer, but they exist: Rbitcoin (Bitcoin), ether (Ethereum), tronr (TRON).
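Even with the standard library alone, consuming such an API is straightforward. In this sketch, the sample payload and its field names are hypothetical (illustrative of the typical JSON shape of block endpoints); check the docs of the chain you target before relying on any field.

```python
import json
import urllib.request

def fetch_json(url: str) -> dict:
    """GET a JSON document from a REST endpoint (needs network, so not called here)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

def summarize_block(block: dict) -> dict:
    """Keep only the fields a downstream model might care about."""
    return {
        "height": block["height"],
        "timestamp": block["time"],
        "n_tx": len(block.get("tx", [])),
    }

# Hand-made sample response (hypothetical values, not from a live API):
sample = {"height": 800000, "time": 1690168629, "tx": ["a", "b", "c"]}
print(summarize_block(sample))  # → {'height': 800000, 'timestamp': 1690168629, 'n_tx': 3}
```

The client libraries listed above do essentially this, plus authentication, pagination, and retries, which is why they are usually worth the dependency.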

In addition to APIs, one could also consider using dedicated ETL tools to gather data from blockchains. One prominent open-source project in this space is “Blockchain ETL”, a collection of Python scripts. In fact, these are the very scripts that feed data into the aforementioned BigQuery public datasets.

Although native blockchain APIs and open-source ETL applications give Data Scientists a lot of flexibility, using them in practice may require additional effort and data engineering skills: setting up and maintaining a local or cloud-based blockchain node, a runtime environment to execute scripts, a database to store the retrieved data, etc. The associated infrastructure requirements may also incur substantial costs.

4. Commercial Solutions

To save time, efforts, and infrastructure-related costs, one can also opt for commercial solutions for blockchain data collection. Such tools typically provide data via an API or a SQL-enabled interface using a schema that is unified across several blockchains (see, for example, Anyblock Analytics, Bitquery, BlockCypher, Coin Metrics, Crypto APIs, Dune Analytics, Flipside Crypto). This facilitates various comparative analyses and, at least in theory, makes it possible to develop Data Science applications that are interoperable across blockchains.
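Why a unified schema matters can be shown in a few lines: the same analysis code runs unchanged over every chain the provider covers. The rows below are made-up illustrative data, not output from any of the services named above.

```python
# Illustrative rows in a unified cross-chain schema (hypothetical values).
UNIFIED_ROWS = [
    {"chain": "bitcoin",  "day": "2024-01-01", "tx_count": 350_000},
    {"chain": "ethereum", "day": "2024-01-01", "tx_count": 1_100_000},
    {"chain": "bitcoin",  "day": "2024-01-02", "tx_count": 365_000},
    {"chain": "ethereum", "day": "2024-01-02", "tx_count": 1_150_000},
]

def totals_by_chain(rows):
    """Aggregate transaction counts per chain -- identical logic for every chain."""
    totals = {}
    for r in rows:
        totals[r["chain"]] = totals.get(r["chain"], 0) + r["tx_count"]
    return totals

print(totals_by_chain(UNIFIED_ROWS))  # → {'bitcoin': 715000, 'ethereum': 2250000}
```

With a per-chain schema, this one function would need a branch per blockchain; with a unified one, comparative analyses and cross-chain applications come almost for free.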

Conclusion: if you want to become a Web3 Data Scientist, now is the perfect time

I am still learning, and I am certain I have missed something, or may even have misinterpreted something along the way. Web 3.0 is still relatively new, and lots of changes are certain to come. I will continue to follow and participate in this new framework. It has the potential to transform many industries and business processes (I will probably write about the use cases and real-world implementations next).

But it's important to note that developments in Web3 will require an army of experts who are capable of “making data useful” — that is, Data Scientists. The range of interesting and unsolved blockchain Data Science problems is enormous. On top of that, many of these problems are yet to even be formulated. Thus, if you are thinking about entering the exciting world of Web3 as a Data Scientist, the timing could not be better. Many of the companies mentioned in this article already have open positions for Data Scientists — do check out the “Careers” sections on their websites.
