Data is the lifeblood of artificial intelligence. As LLMs grow rapidly in size, with newer models like GPT-4 reportedly trained on roughly 13 trillion tokens, the need for training data keeps growing with them. The "world model" we possess so effortlessly as humans may still be out of reach for current AI systems, whether due to a lack of high-quality data, architectures, training strategies, or some combination of these. Even so, researchers remain fixated on the end form of AI: artificial general intelligence.
This pursuit of data has catalyzed advances across AI, notably in natural language processing and image recognition. LLMs stand out in this field for their ability to decode and replicate seemingly natural human communication. These models are built on transformer architectures, which encode the context and meaning of words into high-dimensional vector spaces, transform those vectors through successive layers, and decode them back into human-readable language. Models like these perform as a function of the quantity and quality of the training data they are given. It is no wonder that GPT-4's reported consumption of more than ten trillion tokens has seemingly exhausted the internet's textual resources. These data appetites underscore the models' dependency on expansive corpora to refine their linguistic mimicry.
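The encode-transform-decode loop described above can be made concrete with a minimal sketch. Since GPT-4 itself is not publicly available, the example below uses the open GPT-2 model from Hugging Face's `transformers` library as a stand-in, and assumes the `transformers` and `torch` packages are installed; it is an illustration of the general pipeline, not the authors' specific setup.

```python
# Minimal sketch: text -> token ids -> vectors -> transformed vectors -> text.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Data is the lifeblood of artificial intelligence"
tokens = tokenizer(text, return_tensors="pt")  # encode text into token ids

with torch.no_grad():
    outputs = model(**tokens, output_hidden_states=True)

# Each token is now a point in a high-dimensional vector space
# (768 dimensions for GPT-2), reshaped layer by layer by the transformer.
print(outputs.hidden_states[-1].shape)  # (1, num_tokens, 768)

# Decode: project the final hidden state onto the vocabulary and turn the
# most likely next token back into human-readable text.
next_id = int(outputs.logits[0, -1].argmax())
print(tokenizer.decode([next_id]))
```

The quality of the continuation printed at the end depends entirely on what the model saw during training, which is the dependency on data quantity and quality that the rest of this piece is concerned with.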