Exploring the Synergy Between Data Labeling and Blockchain
June 26th, 2024

Bruce | June-27-2024

Introduction

Since the emergence of ChatGPT in November 2022, a multitude of tech magnates have entered the AI competition, and numerous models have been introduced ever since. ChatGPT achieved 100 million users within two months, making it the fastest-growing product in the market. While ChatGPT exemplifies a highly functional and profitable business model for the AI industry, several product lines from tech giants and ambitious start-ups have been gaining prominence.

This rapid industry advancement has led to increased demand for AI training data. Acquiring reliable data can be challenging due to the need for a standardized format and professional collection methods. In contrast, data labeling is a more accessible task that can be performed from home with just basic computer knowledge, a computer, and an internet connection. This is why numerous companies with AI product lines opt to outsource their data labeling to online platforms or professional annotation companies.

This study aims to determine the suitability of data labeling for the crypto industry, the business model, and other factors to consider when building a data labeling project.

Data Labeling Market

The AI market is expected to grow from $241.8 billion in 2023 to nearly $740 billion in 2030, with a compound annual growth rate (CAGR) of 17.3%[1]. The increasing adoption of AI across various industries drives the widespread use of data collection and labeling. The demand for highly accurate and well-labeled datasets for training AI and ML models is fueling market growth. For instance, the healthcare sector relies heavily on image and voice data to develop and train machine-learning models for diagnostic automation, gene sequencing, and treatment prediction. Companies like DefinedCrowd provide data labeling services that create highly accurate training data for voice recognition models.

Additionally, the rise of e-commerce and automated driving has led to increased data collection for annotation, further driving market growth. This global market for data labeling/annotation solutions and services is projected to reach $46.9 billion in 2030, with a CAGR of 19.5%[2]. As the AI market continues its rapid evolution, there is a growing emphasis on data labeling, anticipated to bolster operations across automotive, retail, and healthcare sectors, thus stimulating demand for data labeling services. Nevertheless, the significant expenses linked to the manual annotation of intricate images and videos are expected to limit market expansion. Conversely, the advent of cloud-based platforms for data collection and labeling is emerging as a notable trend, facilitating remote data collection and labeling processes. This development is poised to offer enhanced flexibility and scalability for businesses.
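The AI market growth rate quoted above can be sanity-checked with the standard compound annual growth rate formula. The snippet below is only an illustrative check, assuming the "nearly $740 billion" endpoint is taken as exactly 740 and the period is the seven years from 2023 to 2030; the function name `cagr` is made up for this example.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values over a number of years."""
    return (end_value / start_value) ** (1 / years) - 1


# Assumed inputs: $241.8B in 2023 growing to roughly $740B in 2030.
rate = cagr(241.8, 740.0, 2030 - 2023)
print(f"{rate:.1%}")  # 17.3%
```

The result matches the 17.3% figure cited for the AI market. The same check cannot be run for the data labeling market, because the text gives only the 2030 value and not a base-year figure.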

Crowdsource Platforms

In practice, engineers or companies can create a qualified dataset for AI model training by setting up a project on a crowdsourcing platform, where thousands of people can complete the tasks online. Currently, a range of platforms allows individuals to provide data labeling services and generate income.

  • Amazon Mechanical Turk

The leading outsourcing platform, Amazon Mechanical Turk (“MTurk”), offers a diverse range of tasks, from software development to short surveys. Data labeling assignments, known on the platform as Human Intelligence Tasks (HITs), typically pay $0.03 to $0.50 each, depending on the difficulty of the task. A job can bundle any number of HITs. For example, one listed job comprises 888 HITs, each valued at $0.03, while a second job, offering the same reward per HIT, consists of 699 HITs. Both must be completed within a 60-minute timeframe. Completing them in full yields rewards of $26.64 and $20.97, respectively.

  • Upwork

Upwork is another well-known platform for this purpose. Data annotators there typically charge $5 to $15 per hour, with the platform retaining a 10% service fee from their earnings. Some annotators have earned thousands of dollars through data labeling on this platform[3].

  • Appen

Another significant player in the market is Appen. In addition to providing data labeling jobs, the platform offers resources to help beginners learn about AI and improve their data annotation skills. Companies post long-term data labeling jobs on the platform, with earning potential similar to that of MTurk and Upwork, where labelers can expect to earn $6 to $12 per hour.

These earnings may not be substantial for US or European workers, given that the US minimum wage in 2024 ranges from $7.25[4] to $15.50 per hour depending on the state, while the minimum wage in major European countries ranges from $7.1 to $14.6[5]. However, given the large labor pools in developing countries across Southeast Asia, South America, and Africa, this income can be significant there.
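The payout figures quoted for MTurk and Upwork follow from simple arithmetic, which the short sketch below reproduces. The helper names `batch_payout` and `net_hourly` are hypothetical, and the calculation assumes every HIT in a job is completed and approved, and that Upwork's 10% fee is taken from the quoted hourly rate.

```python
def batch_payout(hits: int, reward_per_hit: float) -> float:
    """Total reward for completing every HIT in a job, rounded to cents."""
    return round(hits * reward_per_hit, 2)


def net_hourly(rate: float, fee: float = 0.10) -> float:
    """Hourly rate left to the annotator after the platform's service fee."""
    return round(rate * (1 - fee), 2)


print(batch_payout(888, 0.03))        # 26.64
print(batch_payout(699, 0.03))        # 20.97
print(net_hourly(5), net_hourly(15))  # 4.5 13.5
```

Under these assumptions, an Upwork annotator charging $5 to $15 per hour keeps $4.50 to $13.50, which falls below the US federal minimum wage at the low end.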

Adoption Thesis

When looking closely at the data labeling jobs on those crowdsourcing platforms, one may find some obvious obstacles to this business model.

  1. The crowdsourcing platforms previously mentioned usually ask data labelers to provide their tax information and billing address, possibly leading to fewer labelers being available in the market and blocking a wide range of labelers outside the US.

  2. AI companies and engineers usually need to verify the labeled data themselves because the platforms do not have a built-in validation system. This often requires them to assign the same task to multiple labelers, which increases costs and decreases the rewards for labelers.

  3. The compensation for data labelers in the US is not significant enough to attract participation. As a result, only a few full-time data labelers actively operate on the platform.

These obstacles may seem difficult to resolve with traditional methods, but if you’re familiar with tokenization and the Proof of Work (PoW) mechanism, you may find that data labeling is a potentially suitable business for a crypto project.

  1. The majority of crypto projects don’t require customers’ personal information. Since all payments are settled on the blockchain in tokens, there is no need for tax information or a billing address. This approach gives data labeling customers and service providers all over the world instant access to the platform.

  2. There are several ways a crypto protocol can enhance data accuracy. Firstly, the protocol can integrate a validator layer specifically tasked with checking the accuracy of labelers. These validators would be compensated from the protocol's earnings, not directly by the customer. Secondly, it can employ sample inspection, where a piece of data is reviewed by several labelers to ensure correctness. The third method involves a classic staking and slashing mechanism. This holds the labelers' rewards for a specified duration, and upon the customer's approval of the data, the reward is then distributed to the labeler. If the labeler has been found to be malicious or inaccurate, their reward is deducted, with a penalty applied to their staked assets. By implementing the PoW mechanism, we can decrease the customer's expenses while simultaneously boosting the rewards for individuals labeling the data.

  3. Increasing the rewards for labelers may not sufficiently compensate for the lack of scale in the platform. Consider a scenario where a labeler is only able to secure 2-3 tasks per week, with each task offering a compensation of $20-$40. Such inadequate income is likely to lead labelers to leave the platform. Additionally, if the platform fails to retain a sufficient number of qualified labelers, it risks losing its customer base due to a decrease in the quantity and availability of services, which in turn further reduces demand. However, PoW protocols, particularly during their initial stages, employ an alternative approach to remunerate workers: they offer compensation in the form of the platform's native tokens. This strategy not only incentivizes participation from workers but also significantly lowers the cost burden on customers. A pertinent example of this model is Filecoin, a protocol that lets customers pay a fee to store data on a decentralized network. Data storage providers, or “workers,” are rewarded with the protocol tokens that customers pay. Filecoin additionally rewards workers while they are providing services. In the early stage, these additional rewards are so significant that workers are satisfied with the additional reward alone, resulting in free service for the customers. This strategy provides a strong initial boost to the platform by attracting early customers and workers and establishing a foundation upon which to build a more sustainable economic model.
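The sample-inspection and staking-and-slashing ideas from the second point can be sketched in a few lines. This is a minimal off-chain illustration, not any protocol's actual implementation: the `Labeler` class, the `settle_task` function, the 10% slash rate, and the use of a simple majority vote in place of customer approval are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Labeler:
    address: str          # wallet address identifying the labeler
    stake: float          # tokens locked as collateral
    balance: float = 0.0  # rewards paid out so far


def consensus_label(votes: dict[str, str]) -> str:
    """Most common label among redundant submissions for one data item.

    Ties resolve to the label that was submitted first (an arbitrary
    simplification for this sketch).
    """
    return Counter(votes.values()).most_common(1)[0][0]


def settle_task(labelers: dict[str, Labeler], votes: dict[str, str],
                reward: float, slash_rate: float) -> str:
    """Pay labelers who match the consensus and slash the stake of the rest."""
    winner = consensus_label(votes)
    for address, label in votes.items():
        labeler = labelers[address]
        if label == winner:
            labeler.balance += reward                    # release the held reward
        else:
            labeler.stake -= labeler.stake * slash_rate  # penalty on staked assets
    return winner


# Three labelers review the same image; one disagrees with the majority.
labelers = {name: Labeler(address=name, stake=100.0)
            for name in ("alice", "bob", "carol")}
votes = {"alice": "cat", "bob": "cat", "carol": "dog"}
winner = settle_task(labelers, votes, reward=0.03, slash_rate=0.10)
print(winner)  # cat
```

A real protocol would also need dispute handling, since a majority can be wrong and a lone dissenter right. That is one reason the text proposes a separate validator layer and a holding period before rewards are released.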

Beyond the borderless workforce, better pay, and better accuracy, the sharing-economy model native to crypto has some additional perks. Crypto integration makes it possible for labelers to receive a micropayment for each task they complete. Decentralization can improve governance by letting participants vote on the fee standard. It can also help with data security by storing datasets on a decentralized network.

Relevant Protocols

  • Sapien

Sapien[6] is a data labeling protocol that raised $5 million in April 2024, backed by Primitive Ventures, Animoca, Artichoke Capital, and Yield Guild Games[7]. On its website, Sapien lists some well-known Chinese tech companies, such as Alibaba and Baidu, as customers. The details of these collaborations are unclear at this moment.

Sapien is still at an early stage. The network went online in May 2024. The product functions as a basic data labeling tool; labelers only need to choose the category that best describes each image. The workload is flexible: each labeled image is rewarded with points immediately, which allows labelers to earn in spare moments and rest as desired. Right now, data labeling jobs only reward points on the platform, which can be converted to the platform’s native token only after its launch. Sapien uses a multiplier system designed to motivate labelers to keep using the same wallet address and build a reputation on it. The time allocated for each labeling task is sufficient; all data on the platform are from similar categories, likely produced by the protocol team for test purposes.

  • Alaya AI

Another GameFi data labeling project worth mentioning is Alaya AI[8]. The project launched on the web in July 2023 and on Google Play in February 2024, allowing labelers to earn conveniently from anywhere at any time. The product adds a GameFi layer to the labeling jobs: labelers can purchase and upgrade an NFT. This NFT heavily affects the labeler’s earning efficiency and capacity, much like Sneakers on STEPN.

According to Alaya's data, it has over 400,000 registered users, 20,000 daily active users, and more than 2,500 daily on-chain transactions, with over 50,000 downloads on Google Play.

The tokenomics of Alaya AI is straightforward. The platform's native token, $AGT, has six utilities[9]:

  1. $AGT is rewarded for completing training tasks.

  2. $AGT is needed to upgrade NFTs at specific levels.

  3. $AGT staking is required to access advanced tasks.

  4. Customers can request data labeling services using $AGT.

  5. Revenue is proportionally distributed to $AGT stakers on the platform.

  6. $AGT is required for DAO governance.

  • AIT Protocol

AIT Protocol[10] is currently the largest crypto project focused on data labeling. Founded in 2023, its token was launched in December 2023. As of now, AIT Protocol boasts a $136.36 million FDV and an $11 million market cap.

The test version of its data labeling platform is scheduled to launch in July 2024. Right now, there is no demo or preview of the product, so it’s difficult to tell the full range of data labeling tasks it will support. The project's integrated system includes an AI auto-checking feature and employs data scientists to create standardized input forms for labelers. This system aims to enhance data annotation and validate the accuracy of labeled data after annotation.

Conclusion

Data labeling is a novel concept in the crypto space, with most relevant projects still in development or having just launched their open beta versions. None of these projects has undergone extensive market testing, making it difficult to estimate labelers’ income at this stage. However, as mentioned previously, the protocol’s rewards for labelers are intended to attract more participants to the network at the early stage and should not be considered the primary source of labelers’ income in the long run.

The supported data type for labeling is currently limited to a very basic form; only pictures can be labeled, and labelers can choose only from standard categories. This is lagging behind market demand, as platforms like MTurk support labeling for various data types, including pictures, voice recordings, and videos, as well as more complex annotations such as written descriptions and polygon drawings.

Other than that, crypto adoption offers a concrete solution to the limitations related to labelers’ nationalities and reduces data verification costs by leveraging blockchain’s anonymous nature and distributed network sharing economy.

Building actual demand in the crypto space can be challenging. Early-stage incentives from the protocol should be substantial enough to attract labelers and to onboard data labeling customers through nearly free services. However, these incentives should not be excessively generous, as this could result in a network flooded with fake orders. Additionally, labelers’ income could drop significantly once the protocol incentives are depleted.

The labeling tool should be the primary focus of the data labeling project, as it directly impacts the quality and quantity of labeled data, which is essential for attracting data label customers. There is no concrete evidence that incorporating gamification elements, such as NFT upgrading, can effectively balance increasing labelers’ income with establishing thresholds for labelers. Ultimately, the market will determine the viability of this approach.

In conclusion, the data labeling market in the crypto space is not a get-rich-overnight scheme but a business driven by genuine market demands that can be scaled globally. It efficiently redeploys the workforce from developing countries to perform tasks that are under-compensated in developed countries.

If, as anticipated, AI remains one of the hottest and fastest-growing industries in the coming years or even decades, the demand for AI training data labeling will also skyrocket. Traditional data labeling services have already shown significant deficiencies, and tokenizing this service with blockchain technology can address some of the current bottlenecks of conventional data labeling methods. Consequently, the annotation of AI training data in the cryptocurrency space could become a promising market direction with strong potential for widespread adoption.

[1]Market size and revenue comparison for artificial intelligence worldwide from 2018 to 2030

[2]Global Data Labeling Solution And Services Market Size

[3]Upwork Talent Page for Data Annotation

[4]U.S. Department of Labor: Minimum Wage

[5]Minimum wages in Europe: How do salaries compare across the continent?

[6]Sapien Official Website

[7]Sapien raises $5 million for data labeling with AI applications

[8]Alaya AI Official Website

[9]Alaya AI Token System

[10]AIT Protocol Official Website
