After many weeks of hearing about Web3, which is to say, randomly scrolling through my Twitter feed and finding the term everywhere, I reluctantly did a deep dive on it over the weekend. I hate buzzwords and trending terms, but this one felt like it was here to stay. What caught my eye is the idea of decentralizing the control and ownership that currently rest with only a few companies.
I won’t bore you with a long definition of Web3; if there’s anything I’ve learned from my research, it’s that the definition is ever-evolving. Still, it’s important to know roughly what Web3 is, so you know where you fit in.
Web 3.0 is, simply put, a fair and transparent network where people interact without fear of losing security or privacy.
So imagine a network where users control their own information and grant access tokens to the businesses that want to use that data. This vision is set to be powered by AI and by peer-to-peer technologies like blockchain. Blockchain’s cryptographic, distributed design ensures both security and data privacy. But for a network to qualify as Web 3.0, that protection must cover all user data.
Blockchain reinvents the way data is stored and managed. It provides a unique set of data (a universal state layer) that is collectively managed. For the first time, this state layer enables a value settlement layer for the Internet: it allows us to send files in a copy-protected way, enabling true P2P transactions without intermediaries.
Web3 is centered on user autonomy, which is achieved by distributing user data across blockchain-enabled storage technologies. Web applications (or dApps, as they’re called) are distributed across these same blockchain platforms, so users can opt to grant those apps access to their data, creating richer, more relevant experiences. In contrast to traditional data sources (e.g., centrally controlled corporate databases), users no longer need to request their data from businesses, because they already control it and it lives on the blockchain.
Since data is now stored in a distributed fashion across the entirety of the internet, AI can be deployed to understand user needs more fully, developing language models with semantic understanding because queries are tied to user interactions. By design, blockchains provide several properties that are important for data science applications.
Data collection is the first hurdle most data scientists like me are likely to encounter in their blockchain or Web3 related projects. It is easy to examine individual blockchain records using explorer websites, but automating the collection of larger datasets suitable for Data Science purposes can be a daunting task that may require specialized skills, software, and financial resources. Nevertheless, here are four main options one could consider.
1. Decentralized data marketplaces

With the advent of Web3, businesses are providing marketplaces where data aggregators, such as businesses and data scientists, can come together to buy and sell data assets in a decentralized framework. One company I am following that is making significant strides here is Ocean Protocol.
Ocean Protocol enables private businesses to sell their data assets on the marketplace without the data ever leaving their firewalls. It does this through a “Compute-to-Data” orchestration that allows AI models to train on private data remotely.
Do you know how exciting that is? Imagine training a disease model on data from multiple major hospital networks without ever accessing the data itself, only the metadata.
Another exciting opportunity this new data protocol offers data scientists and ML engineers is the chance to buy data, blend it with other data, enhance it with machine learning models, and sell it back in its enhanced form.
2. BigQuery public datasets & other static datasets
As part of its BigQuery Public Datasets program, Google Cloud provides full transaction histories for Bitcoin, Dash, Dogecoin, Ethereum, Ethereum Classic, Litecoin, Zcash, etc. These datasets can be easily queried using SQL, and the results can be exported for further analysis and modeling. Conveniently, most of these datasets use the same schema, making it easy to reuse SQL queries across chains.
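To give a feel for what this looks like in practice, here is a minimal sketch that queries daily Bitcoin transaction counts from the public dataset with the `google-cloud-bigquery` client. The table name is the one Google publishes for the Bitcoin dataset; the `run` helper assumes you have the client library installed and Google Cloud credentials configured, which is why it is kept separate from the query builder.

```python
def build_daily_tx_query(days=30):
    """Build a SQL query counting Bitcoin transactions per day
    over the last `days` days, against BigQuery's public dataset."""
    return f"""
        SELECT DATE(block_timestamp) AS day, COUNT(*) AS tx_count
        FROM `bigquery-public-data.crypto_bitcoin.transactions`
        WHERE block_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {days} DAY)
        GROUP BY day
        ORDER BY day
    """

def run(days=30):
    """Execute the query and return a DataFrame.
    Requires `pip install google-cloud-bigquery` and configured credentials."""
    from google.cloud import bigquery
    client = bigquery.Client()
    return client.query(build_daily_tx_query(days)).to_dataframe()
```

From here the resulting DataFrame can go straight into pandas for plotting or feature engineering.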
I found an excellent account, Evgeny Medvedev’s, with tutorials on how to use and format these data. There also exist static blockchain datasets that one could use for research and development purposes; a quick search will turn up plenty of examples.
3. Use a blockchain-specific API or ETL tool
It’s understandable that BigQuery Public Datasets cover only the major blockchain projects, but what if the blockchain of interest is not among them? One good option is to use an API or ETL tool. Most blockchains offer a way to automate interactions with their networks via REST and/or WebSocket APIs. See, for example, the APIs to query Bitcoin, Ethereum, EOS, NEM, NEO, Nxt, Ripple, Stellar, and Tezos.
For data scientists, there are also convenient client libraries that abstract away the intricacies of specific APIs and let you work in your preferred language, Python or R. Examples for Python include bitcoin (Bitcoin), trinity and web3.py (Ethereum), blockcypher (Bitcoin, Litecoin, Dogecoin, Dash), tronpy (TRON), and litecoin-utils (Litecoin). R packages are fewer, but they exist: Rbitcoin (Bitcoin), ether (Ethereum), and tronr (TRON).
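Whichever library you use, the raw records come back as nested JSON that needs flattening before analysis. Here is a library-agnostic toy sketch: a Bitcoin-style transaction as a plain dict (the hash and values are invented for illustration), reduced to the kind of flat record you would feed into pandas or R.

```python
SATOSHIS_PER_BTC = 100_000_000

# Hypothetical transaction, loosely following Bitcoin's JSON shape;
# all values here are made up for illustration.
sample_tx = {
    "hash": "abc123...",
    "block_number": 170,
    "inputs": [{"value": 5_000_000_000}],
    "outputs": [{"value": 1_000_000_000}, {"value": 4_000_000_000}],
}

def flatten_tx(tx):
    """Reduce a raw transaction dict to a flat, analysis-ready record."""
    in_total = sum(i["value"] for i in tx["inputs"])
    out_total = sum(o["value"] for o in tx["outputs"])
    return {
        "hash": tx["hash"],
        "block": tx["block_number"],
        "btc_in": in_total / SATOSHIS_PER_BTC,
        "btc_out": out_total / SATOSHIS_PER_BTC,
        "fee_btc": (in_total - out_total) / SATOSHIS_PER_BTC,
    }
```

A list of such records drops directly into `pandas.DataFrame(records)` for modeling.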
In addition to APIs, one could also consider using dedicated ETL tools to gather data from blockchains. One prominent open-source project in this space is “Blockchain ETL”, a collection of Python scripts developed by Nansen.ai. In fact, these are the very scripts that feed data into the aforementioned BigQuery public datasets.
Although native blockchain APIs and open-source ETL applications give Data Scientists a lot of flexibility, using them in practice may require additional effort and data engineering skills: setting up and maintaining a local or cloud-based blockchain node, a runtime environment to execute scripts, a database to store the retrieved data, etc. The associated infrastructural requirements may also incur substantial costs.
4. Commercial Solutions
To save time, efforts, and infrastructure-related costs, one can also opt for commercial solutions for blockchain data collection. Such tools typically provide data via an API or a SQL-enabled interface using a schema that is unified across several blockchains (see, for example, Anyblock Analytics, Bitquery, BlockCypher, Coin Metrics, Crypto APIs, Dune Analytics, Flipside Crypto). This facilitates various comparative analyses and, at least in theory, makes it possible to develop Data Science applications that are interoperable across blockchains.
I am still learning, and I am certain I have missed something or may even have misinterpreted something along the way. Web 3.0 is still relatively new, and lots of changes are certain to come. I will continue to follow and participate in this new framework. It has the potential to transform many industries and business processes (I will probably write about the use cases and real-world implementations next).
But it’s important to note that developments in Web3 will require an army of experts who are capable of “making data useful”, that is, Data Scientists. The range of interesting and unsolved blockchain Data Science problems is enormous. On top of that, many of these problems are yet to be even formulated. Thus, if you are thinking about entering the exciting world of Web3 as a Data Scientist, the timing could not be better. Many of the companies mentioned in this article already have open positions for Data Scientists, so do check out the “Careers” sections on their websites, or check this!