Synthetic foundations: the future of data in web3

In 2012, Chris Dixon wrote that just as oil powered the Industrial Age, data is the fuel of the Information Age. I believe that by the end of 2024, we will all realize that synthetic data is the clean energy.

Understanding synthetic data

Synthetic data is created by first training a model on real data to learn its distribution. Then, the model's generative capabilities are used to produce new, synthetic data. This synthetic data can then train other models to learn new functions or abilities.
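
To make this loop concrete, here is a minimal sketch in Python: a generative model (a Gaussian mixture, standing in for whatever generator fits the domain) is fit on real data, sampled to produce synthetic data, and a downstream classifier is then trained on the synthetic samples alone. The dataset, features, and labeling rule are illustrative assumptions, not a prescription.

```python
# Minimal sketch of the synthetic-data loop described above (illustrative only).
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# 1. Stand-in "real" data: two features and a binary label.
X_real = rng.normal(size=(500, 2))
y_real = (X_real[:, 0] + X_real[:, 1] > 0).astype(int)

# 2. Learn the distribution of the real features with a generative model.
generator = GaussianMixture(n_components=4, random_state=0).fit(X_real)

# 3. Sample new, synthetic features from the learned distribution.
X_synth, _ = generator.sample(5_000)

# 4. Label the synthetic data (here with a simple rule; in practice a trained
#    labeler, a simulator, or the generator itself supplies labels).
y_synth = (X_synth[:, 0] + X_synth[:, 1] > 0).astype(int)

# 5. Train a downstream model on synthetic data only, then check it on real data.
clf = LogisticRegression().fit(X_synth, y_synth)
print("accuracy on real data:", clf.score(X_real, y_real))
```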

source: https://tinyurl.com/yymtp4hh

In shadowed veil 'neath star-crossed sky, our tale begins with whisper'd sigh.
GPT-4, 2024

We've all seen the impressive capabilities of Large Language Models (LLMs). For instance, GPT-4 can effortlessly produce new lines of Shakespearean dialogue, despite being trained on fewer than 1,000,000 original words from Shakespeare's works. Synthetic data harnesses these generative capabilities to produce new data that imitates the characteristics of real data. Synthetic data will unlock a scale and breadth of data that may be impossible to collect in the real world.

Synthetic data is more than just 'fake' data; it is crafted to mimic the complexity of real-world data. However, because the data is synthetic, it overcomes the ethical and logistical limitations associated with traditional data collection and sharing. Moreover, it can be produced faster and cheaper than real data, and it can arrive more complete (pre-labeled, free of sampling biases, etc.). Synthetic data promises a future of accelerated innovation, allowing us to train enhanced and entirely new models with data that would otherwise be unattainable.

Synthetic data in the real world

Generally, the arc of technology has been about reducing randomness and increasing our control over the world. At some point in the next century, we are going to have the most randomness ever injected into the system.
Sam Altman, 2015

In 2024, synthetic data is set to become more accessible and broadly applicable across many fields. Use cases for synthetic data are starting to proliferate, from text-to-image pipelines for creating domain-specific image libraries, to generating datasets for hate speech detection, to formulating coding challenges for model training.
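
As a rough illustration of the text-to-image case, the sketch below uses the open-source diffusers library to generate a tiny domain-specific image set, each image pre-labeled by the prompt that produced it. The checkpoint name, prompts, and file names are assumptions for the example, not a recommendation of any particular model.

```python
# Hypothetical sketch: build a small domain-specific image library with a
# text-to-image model (assumes the `diffusers` package, a GPU, and a
# Stable Diffusion checkpoint are available).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed checkpoint; swap in your own
    torch_dtype=torch.float16,
).to("cuda")

# Illustrative domain: road-damage imagery for an inspection model, generated
# without a field photography campaign.
prompts = [
    "dashcam photo of a pothole on a wet asphalt road, overcast lighting",
    "dashcam photo of a cracked road surface near a crosswalk, daytime",
]

for i, prompt in enumerate(prompts):
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"synthetic_road_damage_{i}.png")  # the prompt doubles as its label
```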

Let’s look at some examples of how synthetic data is being used in the real world.

  • Medical researchers are using synthetic data, such as images produced by GANs (Generative Adversarial Networks), to help detect diseases without using real patient data. This approach lets them train AI models to spot conditions such as eye diseases from realistic but artificial images, keeping patient information private. It speeds up research while complying with privacy laws, pushing medical advancements forward. See here for a discussion of the implications and concerns: https://www.nature.com/articles/s41551-021-00751-8.
  • One project generated over 100,000 synthetic faces to enhance facial recognition technologies, then used those synthetic faces to generate synthetic eye motion, aiming to train even better eye-tracking models, as demonstrated in their presentation. In short: create fake faces, create fake eye movement, and produce all the labels for both datasets automatically; then train models to do facial labeling and eye tracking on real-world faces. Notable here is the scale and quality of the automated data labeling generated for both the face and eye-tracking datasets.
  • NVIDIA is building the Omniverse, a platform that facilitates the creation of extensive synthetic datasets from simulated environments governed by real-world physics. The vast amount of synthetic data being produced, and the nuance in the parameters that can be varied, is wild.

It's clear the impact of synthetic data extends far beyond just the creation of random data. Here are some key advantages of synthetic data in today's data-driven world:

  • Enhanced data: Synthetic data enables the creation of expansive, training-ready datasets, significantly alleviating the time and financial burdens associated with traditional data collection methods.

  • Scalable data: Because synthetic data arrives pre-labeled, it removes much of the dependence on real-world training data and substantially reduces the need for labor-intensive human annotation.

  • Impossible data: The simulation of otherwise inaccessible data, providing solutions to challenges posed by biases, incompleteness, scale, accessibility, or security constraints inherent in real-world data collection.

  • Sensitive data: Synthetic data facilitates the sharing or publishing of data for training purposes, circumventing the privacy and sensitivity concerns tied to actual data.

The web3 opportunity

We are approaching a significant change in how data is created. Synthetic data introduces both opportunities and challenges for all of us. On the one hand, synthetic data is still tricky to get right (and, when privacy is involved, comes with serious pitfalls). On the other hand, synthetic data will rapidly grow more valuable and easier to create. Web3 needs to look at the concrete opportunities a synthetic data future brings. Let's explore some of the problem areas where web3 and synthetic data can grow together.

Bootstrapping data economies

The path to establishing data economies within web3 ecosystems (or any ecosystem) comes with challenges, particularly in bootstrapping the real-world data needed to fuel these new economies. Here, synthetic data can serve as a bridge over the initial scarcity of real-world data. Consider these examples of how synthetic data could help bootstrap web3 data economies:

  • Rapidly scale DePIN data: Synthetic data can improve analysis and prediction models for decentralized sensor networks. A project like WeatherXM could use it to detect bad data faster than it appears in the wild, enhance its reward models, and build classification models for high-value events (e.g., floods, heat waves) not yet seen at meaningful levels in real data (a minimal sketch of this idea appears after this list).

  • Privacy-conscious data sharing: Projects like Dimo could generate synthetic trip datasets, allowing developers to enhance applications without compromising user privacy. While real trip data on the network is sensitive and private to only the drivers and the apps they share it with, synthetic data can be shared (with attention to risks) for all developers to train new models and build novel applications to run in their ecosystem.

  • Reputation and future threat detection: Projects like Gitcoin Passport could utilize synthetic data to refine models that identify fraudulent behavior, bolstering security across web3 platforms and helping to expand beyond heuristic approaches to threat detection toward advanced models that may perform better at distinguishing attacks from real people.

  • Generating more map data: Synthetic data derived from dash cams is helping to create an open dataset of accurate, high-detail map data, crucial for navigation and geographic information systems. On a network like Hivemapper, synthetic data would help developers create innovative models (e.g., new categories of object detection) to extract greater amounts of data from images. See here for examples of how to use synthetic data for real-world object detection in imagery, and what a single developer can achieve.

  • Decentralized ride-share & delivery predictions: Bootstrapping the sharing economy has well-documented challenges, among them the gap between the data incumbents hold and the data any upstart can access. Synthetic data facilitates the creation of a comprehensive training dataset to optimize driver dispatching and forecast demand across regions, enhancing service efficiency and user satisfaction. These capabilities could allow these networks to compete with the incumbents more quickly.

  • DeSci for healthcare research and drug discovery: DeSci projects often involve disease research, where sharing the original data on open networks is restricted or prohibited. This limitation can be overcome with synthetic patient data that conforms to privacy regulations, accelerating clinical research and personalized medicine and driving healthcare innovation forward.
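
As referenced in the DePIN bullet above, here is a hedged sketch of the rare-event idea: real sensor readings contain almost no flood-level rainfall, so synthetic positives are generated and mixed in before training an event classifier. The features, thresholds, and synthetic generator are all illustrative assumptions, not WeatherXM's actual pipeline.

```python
# Illustrative rare-event augmentation for a decentralized sensor network.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# "Real" station readings: (hourly rainfall in mm, relative humidity in %).
X_real = np.column_stack([rng.gamma(1.0, 2.0, 10_000), rng.uniform(30, 90, 10_000)])
y_real = (X_real[:, 0] > 30).astype(int)  # flood-level events: essentially absent

# Synthetic rare events: plausible extreme readings not yet observed at scale.
X_synth = np.column_stack([rng.uniform(30, 80, 2_000), rng.uniform(70, 100, 2_000)])
y_synth = np.ones(2_000, dtype=int)

# Train on real + synthetic so the model has actually seen the rare class.
X_train = np.vstack([X_real, X_synth])
y_train = np.concatenate([y_real, y_synth])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print(clf.predict([[45.0, 95.0]]))  # flags a flood-level reading as an event
```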

While generating synthetic data isn’t trivial, and any example above would take real investment, the potential of synthetic data in web3 can’t be overstated. I believe it could become a cornerstone of the web3 data landscape.

Maximize access & democratize innovation

The issue of figuring out which people are best to work on a problem is totally different from the issue of figuring out which problems to solve.
Munn, 2024

By publishing data (e.g., benchmark data for competitions, or standardized data in place of private data for composability) on web3 platforms and leveraging crypto-based incentives, we can power more open innovation. Open, permissively shared data encourages collaboration and iteration, and (through incentives) can also create new models for rewarding innovation. Taken together, this combination can propel the development of novel solutions to big problems.

Here are two examples of how synthetic data might expand innovation on open networks in web3.

  • Tailored forecasting solutions: The weather prediction and event classification needs of a farmer in Brazil are different from those of a city dweller in India. By offering customizable, region-specific synthetic datasets, WeatherXM and other networks could empower local developers and researchers to rapidly innovate and deploy weather prediction and classification tools tuned to the unique environmental challenges and requirements of their communities. This innovation, and the demonstration of its value, can potentially happen faster than hardware installation can be bootstrapped to fully cover those regions.

  • Collaborative research breakthroughs: A similar promise exists within DeSci, where interested parties can pool resources to solve specific challenges or unlock new resources not always addressed by institutions or larger industry players. Synthetic patient data bypasses the privacy and sharing constraints of traditional research data, allowing innovation on the data to begin before regulatory challenges are solved, or even before you know who you are collaborating with.

Provenance and verifiability

… AI needs data and blockchains are good for storing and tracking data.
Vitalik, 2024

Synthetic data is going to exacerbate every problem around trusting and using data on the Internet. It will grow increasingly difficult for both humans and machines to discern real data from synthetic data. I have no doubt that costly mistakes will be made by processing synthetic data in place of real data (or vice versa). This is an area where blockchain can help.

Certifying data origins is crucial as the distinction between synthetic and real data becomes blurred. With every piece of data, be it a dataset for training machine learning models or records for transactional purposes, blockchain can provide a verifiable trail that confirms its source, modifications, and ownership over time. This is true for both public and private data.

Comprehensive traceability will be vital for compliance with regulatory standards and privacy laws (e.g., synthetic PII). By creating an ecosystem where data's origins are transparent and verifiable, we can mitigate risks associated with data misuse and enhance the reliability of data.

Advanced data structures, proof-carrying formats, inclusion proofs, and other cryptographic tools can further strengthen the integrity and verifiability of data. These mechanisms could verify any data's accuracy and completeness, often without exposing the underlying data.
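
As one deliberately simplified illustration of an inclusion proof, the sketch below commits to a small dataset with a Merkle root (which could be anchored onchain) and proves that a single record belongs to the committed dataset without revealing the rest. The hashing scheme and record format are assumptions for demonstration; a production system would rely on an established library and a real anchoring mechanism.

```python
# Toy Merkle tree with inclusion proofs over dataset records (illustrative).
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def _next_level(level: list[bytes]) -> list[bytes]:
    if len(level) % 2:                 # duplicate the last node if the level is odd
        level = level + [level[-1]]
    return [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]

def merkle_root(records: list[bytes]) -> bytes:
    level = [h(r) for r in records]
    while len(level) > 1:
        level = _next_level(level)
    return level[0]

def inclusion_proof(records: list[bytes], index: int) -> list[tuple[bytes, bool]]:
    """Sibling hashes from leaf to root, each tagged with 'sibling is on the right'."""
    level = [h(r) for r in records]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling > index))
        level = _next_level(level)
        index //= 2
    return proof

def verify(record: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    node = h(record)
    for sibling, sibling_is_right in proof:
        node = h(node + sibling) if sibling_is_right else h(sibling + node)
    return node == root

records = [b"record-0", b"record-1", b"record-2", b"record-3", b"record-4"]
root = merkle_root(records)               # publish or anchor this commitment
proof = inclusion_proof(records, 2)
print(verify(b"record-2", proof, root))   # True: record-2 is in the committed set
```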

The future of data usage and sharing, particularly in sensitive and regulated domains, will increasingly rely on such cryptographic verification methods to maintain trust, security, and privacy in our increasingly data-driven world.

Conclusions

source: http://tinyurl.com/5ajfppbc

The gains made in a future GPT-5 won't come from newly crawled Internet content as much as they will from innovations in synthetic data. Likewise, our entire field should be looking at how to capture and incorporate synthetic data into our open networks.

Web3’s opportunities with synthetic data are big, from certifying data origins to nurturing burgeoning data economies that thrive on open innovation and collaboration. But synthetic data isn’t magical: creating it isn’t easy, and it will take a lot of effort to get it right. Large, open questions remain: how can we capture its full potential and value? How can we ensure responsible development and equitable access? Our future is not just for AI, but for collaboration, discovery, and innovation itself.

I’m excited for where we can go and to see how we use the current wave of technology to build bigger, more valuable, and more decentralized networks of innovative data products. I hope and believe we can play an important role in catalyzing and supporting this path. For our part, we aim to build a joint, shared protocol for connecting and sharing high-value synthetic datasets for all. Our latest work brings together storage networks, attribution & provenance, and data delivery to solve many challenges that will become more pressing in the synthetic data future. Connect with me online or at EthDenver next week if you want to talk more about synthetic data or our solutions.
