Data and the Train-to-Earn Economy
February 25th, 2024

A speculative and theoretical thesis on the future decentralized data economy.

A personal AI assistant [1] [analyzes] the feeds from [a user’s] wearable devices and logs [their] quality of sleep to monitor and predict [the user’s] conditions. A … System AI managing urban infrastructures and services processes real-time information from commercial and [decentralized IoT] cameras installed on the road networks to support the decisions and actions of local service providers. A [user] envisions a kind of AI agent designed to act for communicating, engaging, negotiating, coordinating its actions with other humans and artificial agents. Antonini, Lupi

The nexus of artificial intelligence and blockchain technology continues to evolve with the increased interest and development of AI-blockchain applications. Decentralized compute, open source Large Language Models (LLMs), model marketplaces, and early examples of AI assistants and autonomous agents are emerging throughout Web3. Currently, builders are focused on decentralizing compute to horizontally scale and democratize model infrastructure. This will increase consumer access to LLMs and contribute to a multimodal world with thousands of open-source general and fine-tuned LLMs.

Data Scarcity

Although horizontally scaling LLM infrastructure through decentralized compute is extremely important for the democratization of AI, the pressing issue for both centralized and decentralized AI applications is the enormous amount of data needed to pre-train, fine-tune, and enhance LLMs. In fact, AI firms may run out of "high-quality, natural data sources" as early as 2026, and they may run out of lower-quality text and image data "between 2030 and 2060" Pardo.

One solution to data scarcity is decentralized data collection through train-to-earn incentive models: train-to-earn models incentivize users to collect, contribute, and create data for training AI models in exchange for rewards. These incentives may help create large-scale decentralized datasets. Such datasets will support model pre-training, fine-tuning via Reinforcement Learning from Human Feedback (RLHF) to align model outputs with human preferences, and model optimization through real-world predictions enabled during inference by Retrieval-Augmented Generation (RAG) or (in the future) long context models.

  • Retrieval-Augmented Generation (RAG) enhances a model's responses by fetching relevant content from external datasets at generation time, combining retrieval with response creation (a minimal sketch follows this list).

  • Reinforcement Learning from Human Feedback (RLHF) improves AI responses by using human feedback: people rate or rank model answers, and that feedback guides models toward outputs considered good or helpful.

  • Long Context Models retain large amounts of contextual information from user inputs. They are often multimodal, handling diverse data formats such as code, text, audio, and video within a single context window; Google’s Gemini 1.5, for example, can analyze all of these formats to produce contextually relevant outputs.
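To make the RAG pattern above concrete, here is a minimal Python sketch: documents are embedded, the most relevant ones are retrieved for a query, and the retrieved text is prepended to the prompt before generation. The `embed` function and prompt format are illustrative placeholders, not any particular provider's API.

```python
import numpy as np

# Placeholder embedding: a real system would call an embedding model here.
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(8)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k corpus documents most similar to the query."""
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]

def rag_prompt(query: str, corpus: list[str]) -> str:
    """Prepend retrieved context to the user's question before generation."""
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "Road segment A was remapped by dashcam contributors last week.",
    "Average sleep quality for user 42 improved this month.",
    "Intersection B received a new traffic signal yesterday.",
]
print(rag_prompt("What changed on the road network recently?", corpus))
```

An LLM then completes the returned prompt; the quality of its answer depends directly on the freshness and coverage of the retrieved dataset, which is exactly where train-to-earn data contributions fit.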

Data Collection, Contribution, and Creation in the Train-to-Earn Economy

Here, I will outline my thesis on the decentralized data economy by examining train-to-earn economic incentive models. I will theorize on how these models will incentivize users to provide data as well as how this data will be used for Large Language Model (LLM) pre-training, fine-tuning, inference, and optimization.

Data Collection, Contribution and Creation Pyramid

1. Internet Of Things (IoT) Device Data Collection

Decentralized physical infrastructure networks (DePIN) incentivize users to install, monitor, and operate Internet of Things (IoT) devices to collect data from their physical environment. Data collected from IoT devices allows for precise monitoring, analysis, and decision-making based on granular, real-time information. This real-time data, when applied to LLMs, whether through RLHF fine-tuning, RAG optimization, or long context models, will allow for autonomous AI responses to real-time events, creating a new autonomous service layer.
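As a sketch of what a single DePIN contribution could look like in this pipeline, here is one hypothetical way a contributor-operated device's reading might be packaged into a text document that a retrieval index (or a long-context prompt) can consume. The schema and field names are illustrative assumptions, not those of any existing network.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class IoTReading:
    """One observation from a contributor-operated device (fields are illustrative)."""
    device_id: str
    sensor: str
    value: float
    lat: float
    lon: float
    timestamp: str

def to_document(reading: IoTReading) -> str:
    """Serialize a reading into a text document a retrieval index can store."""
    return json.dumps(asdict(reading))

reading = IoTReading(
    device_id="dashcam-7f3a",
    sensor="road_surface_quality",
    value=0.82,
    lat=40.7128,
    lon=-74.0060,
    timestamp=datetime.now(timezone.utc).isoformat(),
)

# In a train-to-earn network, this document would be signed by the device,
# the contributor rewarded per accepted reading, and the record added to a
# shared index that models query at inference time.
print(to_document(reading))
```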

Decentralized large-scale data collection will outperform centralized methods of granular data collection in both capital efficiency and scale. For example, Google Street View imagery is currently “outdated by ten years,” whereas Hivemapper data contributors (decentralized mapping through dashboard cameras) “map places that have not been mapped or [areas] where Google Street View needs to be updated.” This is one of the first examples of a decentralized IoT network beating a centralized provider on data freshness. Although just one example, Hivemapper provides the base case for real-time, decentralized IoT data collection.

2. Social-AI, Data Contribution, and Synthetic Data Creation

A Personal AI is the [realization] of the "invisible computer" [2], seamless technology embedded in the user's material and digital sphere (e.g. through a wearable, home assistant, smartphone). Antonini, Lupi

Social-AI incentivizes users to train hyper-personalized AI models that help them manage the complexity of digital ecosystems. These models will generate living intelligence from user participation and personal IoT device data. This differs from passive IoT data collection: Social-AI involves direct user engagement and collects personal rather than purely environmental data.

Social-AI data contributions will be further incentivized through gamified social quests interwoven with the real world; this may take the form of IoT device data contributions from daily labor tasks. Moreover, coordinated and collective forms of IoT data contributions will serve as the basis for generating group human feedback data (Metaverse, AR, Pokemon Go). From this data, Social-AI models may generate synthetic datasets that function as anonymous representations of collective human feedback data. This data may be valuable to researchers, businesses, and institutions.

The future of data creation via natural real-time contributions and synthetic generation is nascent and novel. We’ve yet to conceive of how these data-creation structures will alter society and culture.

3. Decentralized Marketplaces for RLHF Labor

Another category of data contribution is data labor for Reinforcement Learning from Human Feedback (RLHF); RLHF fine-tunes LLMs by using human feedback to curate model responses and is considered a "significant advancement in the field of Natural Language Processing (NLP)" Abideen. However, scaling RLHF is costly because it requires large amounts of human labor. Decentralized train-to-earn networks may provide a means to scale RLHF economically and to decentralize the routine data labor behind model pre-training and fine-tuning.

Decentralized marketplaces for RLHF labor will match model builders, via open-source model training networks (TAO, NetMind.AI, Arbius), with data contributors willing to perform manual data aggregation and collection: data annotation, surveys, event details, user feedback, and other routine data-labor tasks.
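To picture the unit of work such a marketplace would trade in, here is a sketch of a preference-pair record, the basic data point RLHF fine-tuning consumes: a contributor compares two model responses and is rewarded for the labeled judgment. All field names and values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One unit of RLHF labor: a human judgment between two model responses."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str     # "a" or "b", chosen by the contributor
    contributor: str   # identity or wallet the reward is paid to
    reward: float      # tokens paid for this labeled comparison

pair = PreferencePair(
    prompt="Summarize today's traffic report for the city.",
    response_a="Traffic is heavy downtown due to construction on 5th Avenue.",
    response_b="Traffic exists.",
    preferred="a",
    contributor="contributor-042",
    reward=0.5,
)

# Batches of such pairs are what a reward model is trained on during RLHF.
print(pair)
```

The marketplace's role is the matching itself: routing prompts from model builders to contributors, and routing verified, labeled pairs (plus rewards) back.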

Marketplace for Model Pre-Training and Fine-Tuning Data Flow

4. Data Contribution DAOs

Train-to-earn incentives will be used to collect proprietary data from specific expert groups known as Data Contribution DAOs. The private and expert data from these DAOs will contribute to open-source inference and model training networks, either for RAG model optimization or as context for long context models.

RAG combines generalized models with an “authoritative knowledge base” (the Contribution DAO) to optimize model results, whereas long context models, such as Google’s Gemini 1.5, accept diverse data formats (code, text, audio, and video) directly as context. In both cases, exclusive, proprietary data from Data Contribution DAOs will optimize models and produce highly specific, expertise-aggregating responses for legal research, healthcare, financial analysis, cross-lingual applications, data extraction, historical data analysis, and much more.

For Data Contribution DAOs to succeed, we will need a provably fair system for distributing token incentives based on the authenticity and quality of each contributor's data. Moreover, privacy and reputational concerns will need to be addressed to encourage widespread participation among professionals.

Example: A system of data-contribution delegates, responsible for verifying data authenticity and distributing rewards weighted by the quality of each DAO member's contribution, may be one solution to the privacy and reputational concerns among participants (Modulus, EZKL, GIZA, DataOS).
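A toy sketch of the reward-weighting idea in that example: delegates assign each contribution a quality score, and a fixed token pool is split in proportion to those scores. How the scores are produced and proven (for instance with the zero-knowledge tooling named above) is left abstract here.

```python
def distribute_rewards(pool: float, quality_scores: dict[str, float]) -> dict[str, float]:
    """Split a token pool among contributors in proportion to their quality scores."""
    total = sum(quality_scores.values())
    if total == 0:
        return {member: 0.0 for member in quality_scores}
    return {member: pool * score / total for member, score in quality_scores.items()}

# Hypothetical quality scores assigned by data-contribution delegates.
scores = {"alice": 0.9, "bob": 0.4, "carol": 0.7}
print(distribute_rewards(1000.0, scores))
# -> {'alice': 450.0, 'bob': 200.0, 'carol': 350.0}
```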

I envision Data Contribution DAOs as a combination of both public and private institutions. Tokenized, programmable data may act as equity.

Decentralized Data Contributions and LLMs

Conclusion

I theorize that the future of data collection, contribution, and creation will be decentralized and global. I believe that to incentivize high levels of user participation in this new data economy, innovative token incentive models, along with privacy and data-authenticity guarantees, will need to be developed at scale.

Contributors whose data is used by LLMs, AI agents, and eventually robots to solve world problems or generate significant social and economic value will be directly attributed and, hopefully, fairly compensated. This attribution will enable fair compensation for intellectual micro-contributions to the services and achievements that AI technology makes possible.

The impending problem facing artificial intelligence development is the scarcity of natural and synthetic data for training useful AI models. Train-to-earn, as an economic incentive model, provides the basis for a new decentralized data economy in which data collection, contribution, and creation can be accelerated through global participation.

Thanks for reading!

I’m a writer researching the nexus of Artificial Intelligence and Blockchain. Please follow me @NFTSWIMM3R for similar content.

References and Inspiration:
