Authors: Lesley and Shelly, Footprint Analytics Researchers
Key Takeaways:
Figure 1: History of AI
On November 30, 2022, the debut of ChatGPT showcased for the first time the potential for AI to interact with humans in a user-friendly and efficient way. ChatGPT sparked a broader conversation about artificial intelligence, reshaping how we interact with AI to make it more efficient, intuitive, and human-centric. This shift also increased interest in various generative AI models, including those from Anthropic (Amazon), DeepMind (Google), Llama, and others that have since gained prominence. At the same time, professionals in various industries have begun actively exploring how AI can drive advances in their respective fields. Some are using the combination of AI technologies to differentiate themselves in their industries, further accelerating the convergence of AI across domains.
Web3’s vision begins with transforming the financial system to empower users and even drive the change of modern politics and culture. Blockchain technology serves as a robust technology to achieve this goal, not only by reshaping the transmission of value and incentives but also by facilitating resource allocation and decentralization.
Figure 2: History of Web3
As early as 2020, blockchain investment firm Fourth Revolution Capital (4RC) foresaw integrating blockchain technology with AI, envisioning a decentralized transformation of global sectors such as finance, healthcare, e-commerce, and entertainment.
The convergence of AI and Web3 revolves around two key aspects:
Some of the following directions for exploring the combination of AI and Web3 are currently available on the market:
Figure 3: The Convergence of AI with Web3 Overview
In this article, we focus on using AI technology to improve the processing productivity and user experience of Web3 data.
As the fundamental element of AI, Web3 and Web2 data have significant differences. These differences are mainly due to the application architectures of Web2 and Web3, resulting in different features of the data they generate.
Figure 4: Comparison of Web2 & Web3 Application Architectures
In the Web2 architecture, web pages or application control is typically centralized within a single entity, often a company. This entity has absolute authority over the content it develops, determines access privileges to server content and logic, defines user rights, and dictates the lifespan of online content. There are many instances where Internet companies have the power to change platform rules or terminate services without allowing users to retain the value they’ve contributed.
In contrast, the Web3 architecture leverages the concept of the Universal State Layer, which places some or all of the content and logic on a public blockchain, allowing users to control that content and logic directly. Unlike in Web2, users do not need authorized accounts or API keys to interact with what is on the blockchain, except for certain administrative operations.
Figure 5: Comparison of Web2 and Web3 Data Features
Web2 data is typically characterized by its closed and highly restricted nature, with complex permission controls, a high level of sophistication, multiple data formats, strict adherence to industry standards, and intricate business logic abstractions. Although this data is vast, its interoperability is relatively limited. It is typically stored on centralized servers, lacks privacy awareness, and often involves non-anonymous interactions.
In contrast, Web3 data is more open and widely accessible, albeit at a lower level of maturity. It consists primarily of unstructured data, with little standardization and relatively simplified business logic abstractions. While Web3 data is smaller in scale than Web2 data, it offers high interoperability, for example through EVM compatibility. Data can be stored in either a decentralized or a centralized manner, with a strong emphasis on user privacy. Users often interact anonymously on blockchains.
In the Web2 era, data was as precious as “oil reserves,” and accessing and acquiring large amounts of data has always been a significant challenge. In Web3, the openness and sharing of data have made it seem like “oil is everywhere,” allowing easier access to more training data for AI models, which is critical to improving model performance and intelligence. However, challenges remain in processing this Web3 “new oil,” primarily as follows:
Processing on-chain data involves time-consuming indexing processes that require significant effort from developers and analysts to adapt to the data differences across blockchains and projects. The Web3 data industry lacks harmonized production and processing standards, except for blockchain ledger entries. Data such as events, logs, and traces is primarily defined and produced by individual projects. This complexity makes it difficult for non-professional traders to identify accurate and trustworthy data, complicating on-chain trading and investment decisions. For example, decentralized exchanges such as Uniswap and Pancakeswap may use different data processing methods and metric definitions, and verifying and standardizing those definitions adds to the complexity of data processing.
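To make the standardization problem concrete, here is a minimal sketch of mapping two differently shaped swap records into one schema. The field names follow the shapes of the publicly documented Uniswap V2 and V3 `Swap` events (separate in/out amounts versus signed pool deltas). The sample values, the `NormalizedSwap` class, and the helper functions are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedSwap:
    """One common schema for swaps, regardless of which protocol emitted them."""
    source: str
    token0_in: int
    token1_in: int
    token0_out: int
    token1_out: int


def from_v2_style(event: dict) -> NormalizedSwap:
    # V2-style events already report inputs and outputs as separate unsigned fields.
    return NormalizedSwap(
        source="v2_style",
        token0_in=event["amount0In"],
        token1_in=event["amount1In"],
        token0_out=event["amount0Out"],
        token1_out=event["amount1Out"],
    )


def from_v3_style(event: dict) -> NormalizedSwap:
    # V3-style events report signed pool balance deltas:
    # positive = the pool received tokens, negative = the pool paid tokens out.
    a0, a1 = event["amount0"], event["amount1"]
    return NormalizedSwap(
        source="v3_style",
        token0_in=max(a0, 0),
        token1_in=max(a1, 0),
        token0_out=max(-a0, 0),
        token1_out=max(-a1, 0),
    )


# Hypothetical sample records describing the same economic action.
v2_event = {"amount0In": 1000, "amount1In": 0, "amount0Out": 0, "amount1Out": 495}
v3_event = {"amount0": 1000, "amount1": -495}

normalized = [from_v2_style(v2_event), from_v3_style(v3_event)]
```

Once both records share a schema, downstream metrics such as volume can be computed with a single definition instead of one per protocol.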
The dynamic nature of blockchain, with updates occurring in seconds or even milliseconds, underscores the importance of automated processing. However, the Web3 data industry is still in its infancy. The proliferation of new contracts and iterative updates, coupled with a lack of standards and a wide variety of data formats, adds to the complexity of data processing.
On-chain data often lacks sufficient information to uniquely identify each address, making it difficult to correlate on-chain data with off-chain economic, social, or legal movements. Nevertheless, understanding how on-chain activity correlates with specific individuals or entities in the real world remains critical for specific scenarios.
With the discussion of productivity changes brought about by LLMs, the ability to leverage AI to address these challenges has become one of the central focuses in the Web3 industry.
When it comes to model training, traditional AI models are typically modest in size. The number of parameters ranges from tens of thousands to millions. However, ensuring the accuracy of output results requires a significant amount of manually labeled data. In part, LLM’s formidable strength lies in its use of massive corpora to calibrate parameters numbering in the tens or hundreds of billions. This greatly enhances its understanding of natural language, but at the same time requires a significant increase in the amount of data for training, resulting in particularly high training costs.
In terms of capabilities and modes of operation, traditional AI excels at tasks within specific domains, providing relatively precise and specialized answers. LLMs, on the other hand, are better suited to general tasks but are prone to hallucinations. This means that in certain scenarios, its answers may lack the required precision or specialization, or even be completely wrong. Consequently, for results that require objectivity, reliability, and traceability, multiple checks, repeated training, or the introduction of additional error correction mechanisms and frameworks may be necessary.
Figure 6: Differences between Traditional AI and LLM
Traditional AI has proven its importance in the blockchain data industry, bringing greater innovation and efficiency to the field. For example, the 0xScope team uses AI techniques to develop a graph-based cluster analysis algorithm. This algorithm accurately identifies related addresses among users by assigning weights to various rules. The application of deep learning algorithms improves the accuracy of address clustering, providing a more precise tool for data analysis. Nansen uses AI for NFT price prediction, providing insights into NFT market trends through data analysis and natural language processing. Trusta Labs uses a machine learning approach based on asset graph exploration and user behavior sequence analysis to strengthen the reliability and stability of its Sybil detection solution, contributing to the blockchain network’s overall security. Goplus strategically integrates traditional AI into its operations to improve the security and efficiency of decentralized applications (dApps). Their approach involves collecting and analyzing security information from dApps to provide rapid risk alerts, thereby mitigating risk exposure for these platforms. This includes identifying risks in dApp host contracts by assessing factors such as open source status and potential malicious behavior. In addition, Goplus compiles detailed audit information, including audit firm credentials, audit times, and links to audit reports. Footprint Analytics uses AI to generate code that produces structured data, facilitating the analysis of NFT transactions, wash trading activity, and bot account screening.
However, traditional AI is constrained by its limited information and focuses on performing predefined tasks using pre-defined algorithms and rules. In contrast, Large Language Models (LLM) capture and generate natural language by learning from rich natural language data, making them better suited for processing complex and large textual data. With the remarkable progress of LLMs, new considerations and explorations have emerged regarding integrating AI with Web3 data.
LLMs boast several advantages compared to traditional AI:
LLMs demonstrate remarkable scalability, efficiently managing large volumes of data and user interactions. This capability makes it exceptionally well-suited for tasks that require extensive information processing, such as text analysis or large-scale data cleansing. Its robust data processing capabilities offer the blockchain data industry tremendous analytical and practical potential.
With outstanding adaptability, an LLM can be fine-tuned for specific tasks or integrated into industry-specific or private databases. This feature allows it to quickly learn and adapt to the subtle differences between different domains. An LLM is an ideal choice for addressing diverse, multi-purpose challenges and providing comprehensive support for blockchain applications.
LLM’s high efficiency significantly streamlines operations within the blockchain data industry. It automates tasks that traditionally require significant manual effort and resources, increasing productivity and reducing costs. LLM can generate large amounts of text, analyze massive datasets, or perform various repetitive tasks within seconds, minimizing wait times and processing times to improve overall efficiency of blockchain data processing.
LLM agents can generate detailed plans for specific tasks, breaking down complex tasks into manageable steps. This feature proves to be highly beneficial when working with extensive blockchain data and performing complex data analysis tasks. By breaking down large jobs into smaller tasks, LLM skillfully manages data processing flows and ensures the delivery of high-quality analytics.
LLM’s accessibility simplifies interactions between users and data, promoting a more user-friendly experience. By leveraging natural language, LLM facilitates easier access and interaction with data and systems, eliminating the need for users to understand complex technical terms or specific commands such as SQL, R, Python, etc. for data acquisition and analysis. This feature broadens the user base of blockchain applications, allowing more people, regardless of technical sophistication, to access and use Web3 applications and services. As a result, it promotes the development and widespread adoption of the blockchain data industry.
Figure 7: Convergence of LLM with Web3 Data
Training LLMs relies on large amounts of data, with patterns within the data serving as the model’s foundation. The interaction and behavioral patterns embedded in blockchain data serve as the driving force for LLM learning. The quantity and quality of data also directly impact the effectiveness of the LLM.
Data isn’t just a resource that an LLM consumes. LLMs also contribute to data production and can even provide feedback. For example, LLMs can assist data analysts by contributing to data preprocessing, such as data cleansing and labelling, or by generating structured data that removes noise and highlights valuable information.
ChatGPT not only demonstrates the solid problem-solving capabilities of LLM but also sparks a global exploration of integrating external capabilities into its general capabilities. This includes enhancing generic capabilities (such as context length, complex reasoning, mathematics, code, multimodality, etc.) and extending external capabilities (handling unstructured data, using more advanced tools, interacting with the physical world, etc.). Integrating domain-specific knowledge from the crypto field and personalized private data into the general capabilities of large models is a key technical challenge for the commercial application of LLM in the crypto field.
Currently, most applications focus on Retrieval-Augmented Generation (RAG), using prompt engineering and embedding techniques. Existing agent tools primarily aim to improve the efficiency and accuracy of RAG. The main architectures on the market for application stacks based on LLM technologies include the following:
Figure 8: Prompt Engineering
Most practitioners use basic prompt engineering solutions when building applications. In this method, specific prompts are designed to quickly modify the model’s inputs to meet the needs of a particular application. However, basic prompt engineering has limitations, such as delayed database updates, content redundancy, context length constraints, and limits on multiple rounds of conversation.
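As an illustration of both the method and its context-length limitation, the sketch below fills a fixed prompt template and keeps only as many context snippets as fit a budget. The template wording, the table names, and the character budget (a crude stand-in for a real token limit) are all assumptions made for this example.

```python
PROMPT_TEMPLATE = (
    "You are a blockchain data analyst.\n"
    "Answer using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
)


def build_prompt(question: str, snippets: list[str], max_context_chars: int = 200) -> str:
    """Fill the template, dropping snippets once the context budget is used up."""
    kept = []
    used = 0
    for snippet in snippets:
        if used + len(snippet) > max_context_chars:
            break  # everything after this point is silently lost to the model
        kept.append(snippet)
        used += len(snippet)
    context = "\n".join(f"- {s}" for s in kept)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Hypothetical schema descriptions used as context.
snippets = [
    "Table dex_swaps has columns: block_time, pool, amount_usd.",
    "Table nft_trades has columns: block_time, collection, price_usd.",
    "Table wallet_labels has columns: address, label.",
]

prompt = build_prompt("What was yesterday's swap volume?", snippets, max_context_chars=130)
```

With this budget the third snippet does not fit, which shows how relevant information can fall outside the prompt when the context window is small.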
As a result, the industry is exploring more advanced solutions, including embedding and fine-tuning.
Embedding is a widely used mathematical representation of a set of data points in a lower-dimensional space that efficiently captures their underlying relationships and patterns. By mapping object attributes to vectors, embedding can quickly identify the most likely correct answer by analyzing the relationships between vectors. Embeddings can be built on top of an LLM to exploit the rich linguistic knowledge gained from large corpora. Embedding techniques introduce task-specific or domain-specific information into a large pre-trained model, making it more specialized and adaptable to a particular task while retaining the generality of the underlying model.
In simple terms, embedding is like giving a fully trained student a reference book of knowledge relevant to a specific task and allowing them to consult the book as needed to solve specific problems.
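The "reference book" lookup can be sketched as a nearest-neighbor search over vectors. In this minimal example the three-dimensional vectors are hand-made placeholders, not the output of a real embedding model, and the document titles are hypothetical. Cosine similarity is used as the closeness measure, which is a common choice but not the only one.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "reference book": document title -> made-up embedding vector.
DOCS = {
    "ERC-20 transfer events": [0.9, 0.1, 0.0],
    "NFT marketplace trades": [0.1, 0.9, 0.1],
    "Validator staking rewards": [0.0, 0.2, 0.9],
}


def retrieve(query_vec: list[float], docs: dict, k: int = 1) -> list[str]:
    """Return the k document titles whose vectors are closest to the query."""
    ranked = sorted(
        docs,
        key=lambda title: cosine_similarity(query_vec, docs[title]),
        reverse=True,
    )
    return ranked[:k]


# A made-up query vector that points mostly along the second dimension.
query = [0.2, 0.8, 0.0]
top = retrieve(query, DOCS)
```

In a retrieval-augmented setup, the retrieved documents would then be placed into the prompt so the model can consult them when answering.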
Figure 9: Fine-Tuning
Unlike embedding, fine-tuning adapts a pre-trained language model to a specific task by adjusting its parameters and internal representations. This approach allows the model to exhibit improved performance in a particular task while maintaining its generality. The core idea of fine-tuning is to adjust model parameters to capture specific patterns and relationships relevant to the target task. However, the upper limit of model generalizability through fine-tuning is still constrained by the base model itself.
In simpler terms, fine-tuning is like providing a broadly educated college student with specialized knowledge courses that allow them to acquire specialized knowledge in addition to broad skills and solve problems in specialized domains independently.
Although current LLMs are powerful, they do not meet all the requirements. Retraining the LLM is a highly customized solution that introduces new datasets and adjusts model weights to improve adaptability to specific tasks, needs, or domains. However, this approach requires significant computational and data resources, and managing and maintaining the retrained model is also challenging.
Figure 10: Agent Model
The Agent Model is an approach to building intelligent agents that uses LLM as the core controller. This system includes several key components to provide more comprehensive intelligence.
The AI agent model has robust language comprehension and generation capabilities, enabling it to address generic problems, perform task decomposition, and engage in self-reflection. This gives it broad potential in various applications. However, agent models also face limitations such as context length constraints, challenges in long-term planning and task decomposition, and unstable reliability of output content. Addressing these limitations requires continuous research and innovation to further expand the application of agent models in various domains.
The various techniques mentioned above are not mutually exclusive. They can be used together in the process of training and refining the same model. Developers can fully exploit the potential of existing LLMs and experiment with different approaches to meet the needs of increasingly complex applications. This integrated approach not only improves model performance but also drives rapid innovation and advances in Web3.
However, we believe that although existing LLMs have played a critical role in the rapid development of Web3, these existing models (e.g., OpenAI’s models, Llama 2, and other open-source LLMs) have not yet been fully explored. Until they have, it is prudent to proceed from shallow to deep: start with RAG strategies such as prompt engineering and embedding, and consider fine-tuning and retraining the base model only with caution.
In today’s blockchain landscape, developers are increasingly realizing the value of data. This value spans multiple domains, such as operational monitoring, predictive modeling, recommender systems, and data-driven applications. Despite this growing awareness, the critical role of data processing — as the indispensable bridge from data collection to application — is often underestimated.
Figure 12: Blockchain Data Processing Flow
Each transaction or event on the blockchain generates events or logs, typically in an unstructured format. While this step serves as the initial gateway to the data, further processing is required to extract valuable insights and form structured raw data. This involves organizing the data, handling exceptions, and transforming it into a standardized format.
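As a minimal sketch of this first step, the example below turns one raw ERC-20 `Transfer` log into a structured record. The topic hash is the standard Keccak-256 signature of `Transfer(address,address,uint256)`. The sample log values and the helper names are made up for illustration, and real pipelines typically rely on contract ABIs rather than hand-written decoders.

```python
# Keccak-256 hash of the event signature "Transfer(address,address,uint256)".
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"


def topic_to_address(topic: str) -> str:
    """Indexed addresses are left-padded to 32 bytes; keep the last 20 bytes."""
    return "0x" + topic[-40:]


def decode_transfer(log: dict) -> dict:
    """Turn a raw ERC-20 Transfer log into a structured record."""
    topics = log["topics"]
    # ERC-721 transfers share the same signature hash but carry a fourth topic,
    # so the topic count is checked as well.
    if topics[0] != TRANSFER_TOPIC or len(topics) != 3:
        raise ValueError("not an ERC-20 Transfer log")
    return {
        "token": log["address"],
        "from": topic_to_address(topics[1]),
        "to": topic_to_address(topics[2]),
        "value": int(log["data"], 16),  # raw units, before applying token decimals
    }


# Hypothetical raw log with placeholder addresses.
raw_log = {
    "address": "0x" + "aa" * 20,
    "topics": [
        TRANSFER_TOPIC,
        "0x" + "00" * 12 + "11" * 20,
        "0x" + "00" * 12 + "22" * 20,
    ],
    "data": "0x" + format(1_500_000, "064x"),
}

decoded = decode_transfer(raw_log)
```

Exception handling matters here: logs that do not match the expected shape are rejected rather than decoded into misleading records.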
Once structured raw data is produced, additional steps are required for business abstraction. This involves mapping the data to business entities and metrics, such as transaction volume and number of users, ultimately transforming raw data into information relevant to business operations and decision-making.
With abstracted business data, subsequent calculations can provide important derived metrics, such as the monthly growth rate of total transactions and the user retention rate. These metrics, implemented using tools such as SQL and Python, are critical for monitoring business health, understanding user behavior, and identifying trends, which in turn supports decision-making and strategic planning.
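The two metrics named above can be sketched in a few lines of Python. The monthly figures and addresses are invented for illustration, and the retention definition used here (share of last month's active users who are active again this month) is one common convention among several.

```python
def monthly_growth_rate(previous: int, current: int) -> float:
    """Relative change in a monthly total, e.g. 0.25 means +25%."""
    if previous == 0:
        raise ValueError("growth rate is undefined when the previous month is zero")
    return (current - previous) / previous


def retention_rate(previous_users: set, current_users: set) -> float:
    """Share of last month's active users who are active again this month."""
    if not previous_users:
        raise ValueError("retention is undefined without previous users")
    return len(previous_users & current_users) / len(previous_users)


# Hypothetical abstracted business data.
tx_count = {"2023-09": 1200, "2023-10": 1500}
active_users = {
    "2023-09": {"0xa", "0xb", "0xc", "0xd"},
    "2023-10": {"0xb", "0xc", "0xe"},
}

growth = monthly_growth_rate(tx_count["2023-09"], tx_count["2023-10"])
retention = retention_rate(active_users["2023-09"], active_users["2023-10"])
```

Because different teams may define retention differently, the definition should be stated alongside the number whenever the metric is reported.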
LLM addresses multiple blockchain data processing challenges, including but not limited to:
Process unstructured data:
Business Abstraction:
Interpret Data by Natural Language:
Leveraging LLM’s technological and product experience advantages, it finds versatile applications in various on-chain data scenarios, which can be broadly classified into four categories based on technical complexity:
Figure 11: LLM Application Scenarios
Despite significant progress in the area of Web3 data, several challenges remain.
Established applications:
Ongoing research and challenges:
As a language model, LLM is best suited for scenarios that require high fluency, but achieving precision may require further model adjustments. The following framework provides valuable insights when applying LLM to the blockchain data industry.
Figure 13: Fluency, Accuracy, and Risks in the Blockchain Data Industry
When evaluating the applicability of LLMs in various scenarios, the focus on fluency and accuracy becomes paramount. Fluency measures the naturalness and coherence of the model’s output, while accuracy determines the precision of its responses. These dimensions have different requirements in different application contexts.
For tasks that emphasize fluency, such as natural language generation and creative writing, LLM typically excels due to its robust natural language processing performance, which enables the generation of fluent text.
Blockchain data presents multifaceted challenges involving data parsing, processing, and application. LLM’s exceptional language understanding and reasoning capabilities make it an ideal tool for interacting with, organizing, and summarizing blockchain data. However, LLM does not comprehensively solve all blockchain data problems.
Regarding data processing, LLM is better suited for fast, iterative, and exploratory processing of on-chain data, where continuous experimentation with new processing methods is essential. However, LLM has limitations in tasks such as detailed matching in a production environment. In addition, unstable answers to the same prompt affect downstream tasks, leading to unstable success rates and inefficiency when performing high-volume tasks.
The processing of content by an LLM may cause hallucination problems. The estimated probability of hallucination in ChatGPT is approximately 15% to 20%, and the opaque nature of its processing makes many errors challenging to detect. Therefore, establishing a robust framework coupled with expert knowledge is crucial. In addition, combining LLM with on-chain data presents numerous challenges:
LLMs are typically pre-trained on large amounts of textual data, making them naturally adept at processing diverse unstructured textual information. However, various industries already possess substantial amounts of structured data, particularly in the Web3 field where data has been parsed. Effectively leveraging this data to improve LLM has become a hot industry research topic.
For LLMs, structured data continues to offer several advantages:
There are some imaginative views in the current market that suggest that LLMs have exceptional capabilities in handling both textual and unstructured information. According to this perspective, achieving the desired result is as simple as importing raw and unstructured data into an LLM.
This notion is similar to expecting a general-purpose LLM to solve mathematical problems: without a specialized model for mathematical skills, most LLMs might make mistakes even on elementary school addition and subtraction problems. Instead, building a vertical LLM specialized for crypto, similar to dedicated models for mathematical ability and image generation, proves to be a more practical approach to applying LLMs in the crypto world.
While an LLM can extract information from textual sources such as news and social media, insights derived directly from on-chain data remain critical for the following reasons:
The analysis of on-chain data remains essential, and LLM is a complementary tool for extracting information from text. However, it cannot replace the direct analysis of on-chain data. Optimal results are achieved by leveraging the strengths of both approaches.
Tools such as LangChain and LlamaIndex provide a convenient way to build custom and simple LLM applications, enabling rapid development. However, successfully deploying these tools in the real world presents additional challenges. Building an LLM application with sustained high quality and efficiency is a complex task that requires a deep understanding of both blockchain technology and how AI tools work, as well as the effective integration of the two. This is proving to be a significant yet challenging undertaking for the blockchain data industry.
Throughout this process, recognizing the unique characteristics of blockchain data is critical. It requires a high level of accuracy and verifiability through repeatable checks. Once data is processed and analyzed through LLM, users have high expectations for its accuracy and reliability. However, there is a potential conflict between these expectations and the fuzzy fault tolerance of LLM. Therefore, when constructing blockchain data solutions, it is necessary to carefully balance these two aspects in order to meet users’ expectations.
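One way to reconcile these two aspects is to treat any figure produced by an LLM as a claim and check it against a deterministic recomputation from the underlying data. The sketch below assumes the claimed number has already been extracted from the model's answer. The function names and sample transfers are hypothetical.

```python
def recompute_total(transfers: list[dict]) -> int:
    """Deterministic, repeatable calculation straight from structured data."""
    return sum(t["value"] for t in transfers)


def check_claim(claimed_total: int, transfers: list[dict]) -> dict:
    """Compare a figure claimed by an LLM with the recomputed ground truth."""
    actual = recompute_total(transfers)
    return {
        "claimed": claimed_total,
        "actual": actual,
        "verified": claimed_total == actual,
    }


# Hypothetical structured transfers and two hypothetical LLM claims.
transfers = [{"value": 100}, {"value": 250}, {"value": 50}]

good = check_claim(400, transfers)  # the claim matches the data
bad = check_claim(410, transfers)   # a plausible-looking but wrong figure
```

Running the same check twice on the same data always yields the same verdict, which is the repeatability that the LLM output alone does not guarantee.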
In the current market, despite the availability of some basic tools, the field continues to evolve rapidly. The Web2 world evolved from the early days of PHP to more mature and scalable solutions such as Java, Ruby, Python, JavaScript, and Node.js, and on to emerging technologies such as Go and Rust. Similarly, AI tools are changing dynamically, with emerging frameworks and models such as AutoGPT, Microsoft AutoGen, and OpenAI’s recently announced GPT-4 Turbo representing only a fraction of the possibilities. This underscores the fact that there is ample room for growth in both the blockchain data industry and AI technology, which requires continuous effort and innovation.
When applying LLM, there are two pitfalls that require special attention:
While LLM holds immense potential in various domains, developers and researchers must exercise caution and maintain an open-minded exploration approach when applying an LLM. This approach ensures the discovery of more suitable application scenarios and maximizes the advantages of LLMs.
This article is jointly published by Footprint Analytics, Future3 Campus, and HashKey Capital.
Footprint Analytics is a blockchain data solutions provider. We leverage cutting-edge AI technology to help analysts, builders, and investors turn blockchain data, combined with Web2 data, into insights, using accessible visualization tools and a powerful multi-chain API across 30+ chains for NFTs, GameFi, wallet profiles, and money flow data.
Footprint Website: https://www.footprint.network
Discord: https://discord.gg/3HYaR6USM7
Twitter: https://twitter.com/Footprint_Data
Telegram: https://t.me/Footprint_Analytics
Future3 Campus is a Web3.0 innovation incubation platform jointly initiated by Wanxiang Blockchain Labs and HashKey Capital. It focuses on three major tracks: Web3.0 Massive Adoption, DePIN, and AI. The main incubation bases are in Shanghai, the Guangdong-Hong Kong-Macao Greater Bay Area, and Singapore, radiating across the global Web3.0 ecosystem. Additionally, Future3 Campus will launch its first $50 million seed fund for incubating Web3.0 projects, truly serving the innovation and entrepreneurship in the Web3.0 domain.
**HashKey Capital** is an asset management institution focusing on investments in blockchain technology and digital assets, currently managing over $1 billion in assets. As one of Asia’s largest and most influential blockchain investment institutions and also an early institutional investor in Ethereum, HashKey Capital plays a leading role, bridging Web2 and Web3. Collaborating with entrepreneurs, investors, communities, and regulatory bodies, HashKey Capital is committed to building a sustainable blockchain ecosystem. The company is based in Hong Kong, Singapore, Japan, the United States, and other locations. It has taken the lead in deploying investments across more than 500 global enterprises in tracks spanning Layer 1, protocols, Crypto Finance, Web3 infrastructure, applications, NFTs, the Metaverse, and more. Representative investment projects include Cosmos, Coinlist, Aztec, Blockdaemon, dYdX, imToken, Animoca Brands, Falcon X, Space and Time, Mask Network, Polkadot, Moonbeam, and Galxe (formerly Project Galaxy).