The cost of inference is trending towards zero
Token throughput is trending towards infinity
Context window sizes are getting larger
Companies are spending more on training despite improvements in compute and cost efficiency
Models are quickly becoming commoditized
Compute is quickly becoming commoditized
We’re sharing our notes on trends that we wrote about back in December of last year (and updated in February of 2024). This document has been sitting in our team Notion workspace for almost half a year now. So we figured we may as well put it out there rather than letting it collect dust. While some of the observations are dated, others are holding up pretty well. And that’s pretty exciting because at the time we were just having fun speculating about the near future. Note that there is no particular structure to this document since it was just something we threw together. We hope you find it entertaining!
Note that these trends are focused on transformer models.
Moore's Law
The doubling of transistor density every two years will lead to faster and more cost-effective computing performance, enhancing the efficiency of model training and inference over time. In practice, that price-performance is improving at a rate of roughly 1.35x per year.
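For a quick gut check, here is a tiny back-of-envelope sketch (assuming simple compounding) comparing the classic "doubling every two years" rate with the 1.35x/year figure above:

```python
# Toy comparison of compounding rates: classic Moore's Law (2x every 2 years,
# i.e. ~1.41x/year) vs. the ~1.35x/year figure cited above.
moore_per_year = 2 ** (1 / 2)   # ~1.414x per year
cited_per_year = 1.35

for years in (1, 5, 10):
    print(
        f"{years:>2} yr: "
        f"Moore ~{moore_per_year ** years:6.1f}x, "
        f"cited ~{cited_per_year ** years:6.1f}x"
    )
# At 10 years: Moore ~32x, cited ~20x (roughly)
```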
Jevons’ Paradox
When the cost of using a resource decreases due to increased efficiency, it becomes more attractive for consumers and industries to use it. That’s why, when the internal combustion engine became more efficient, fuel consumption, and as a consequence greenhouse gas emissions, went up. In software development the same phenomenon is described by Wirth’s Law: devs always figure out how to bloat software faster than hardware can keep up. Or said simply, we have more resources, so we do more things.
Price Competition
In addition to Moore's Law, competitive pricing among compute providers is further driving down the cost of processing and generating tokens. Cheaper inference increases accessibility, governed by Jevons' Paradox, where increased efficiency leads to higher overall consumption. This results in unlocks such as larger context windows, more sophisticated planning (agent) workflows, and (arguably) excessive inferencing for things like generative web components (see Wirth’s Law). Maybe ‘generative everything’ is what leads us to the Dead Internet, e.g. AITube. To really drive it home, you would expect demand for token consumption to increase proportionally as the cost of inference decreases. But here's a surprising fact: while inference costs are dropping by a factor of 15x each year, the demand for processing and generating more tokens is increasing significantly faster. And we can use context window size as a proxy for estimating just how much, especially since it is the most significant driver of token processing consumption. The answer? Context windows have grown 1,250x each year since 2022.
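Here is the same arithmetic as a rough sketch, taking the 15x/year cost decline and the context window growth figure above at face value (they are quoted figures, not measurements):

```python
# Back-of-envelope using the figures quoted above (assumptions, not
# measurements): price per token falls ~15x/year, while the tokens we want to
# process (using context window size as a proxy) grow far faster.
cost_decline_per_year = 15       # price per token shrinks by this factor
context_growth_per_year = 1_250  # token-demand proxy grows by this factor

net_spend_growth = context_growth_per_year / cost_decline_per_year
print(f"Spend per maxed-out request grows ~{net_spend_growth:.0f}x per year")
# Even with a 15x/year price drop, total spend can still climb: Jevons' Paradox.
```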
More Inference
We’re finding more places to run models too. For example, Georgi Gerganov’s llama.cpp offloads token processing and generation to the CPU. So now any server or consumer device can serve a model using CPU clock cycles instead of relying on GPUs alone. And there seems to be a lot of work being done on getting around memory constraints so that even memory-bound devices can run inference on larger models. Quantization is the obvious one here, but also techniques like offloading and distributed inference (see Petals), just to run the gamut. WebAssembly might also play a role because it enables inferencing from the browser. Meaning that smaller models (which are also cheaper to inference) can be used as a sort of ‘worker’ for low-IQ tasks (e.g. reasoning assists) without running up the cloud bill.
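As a minimal sketch of what CPU-only serving can look like, here is quantized GGUF inference through the llama-cpp-python bindings; the model path, thread count, and prompt below are placeholders, not recommendations:

```python
# Minimal CPU-only inference with a quantized GGUF model via the
# llama-cpp-python bindings (pip install llama-cpp-python).
# The model path is a placeholder; any quantized GGUF file works.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-7b.Q4_K_M.gguf",  # 4-bit quantized weights
    n_ctx=2048,    # context window to allocate
    n_threads=8,   # CPU threads; no GPU required
)

out = llm("Q: Summarize Jevons' Paradox in one sentence. A:", max_tokens=64)
print(out["choices"][0]["text"])
```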
Wirth’s Law for Training
Algorithmic optimizations result in a 3x per year decline in the physical compute required to run a training cycle. Yet these efficiencies are offset by the USD cost of the most expensive training run growing 3.1x every year since 2009, another example of Jevons’ Paradox (a.k.a. Wirth’s Law).
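A quick back-of-envelope sketch, taking both rates above as given, of what that implies for effective training compute:

```python
# Combining the two rates above (taken as given, not measured here):
# algorithmic progress cuts the physical compute needed for a given result
# ~3x/year, while spend on the largest run grows ~3.1x/year.
algo_efficiency_per_year = 3.0  # same result for a third of the FLOPs each year
spend_growth_per_year = 3.1     # USD cost of the biggest run grows this much/year

effective_growth = algo_efficiency_per_year * spend_growth_per_year
print(f"Effective training compute grows ~{effective_growth:.1f}x per year")
# And that is before counting hardware price-performance gains (~1.35x/year above).
```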
GPU > CPU
The general trend is that hyperscalers are running the Apple playbook and vertically integrating, from bare metal to the web interface, going from compute aggregators to end-to-end clock cycle providers. Let’s assume for a moment, given all of the trends, that every clock cycle in the near future will go towards some form of token generation: site rendering and site copy, porn, video games, ads, etc.
By that measure, the future of the compute market will be defined by the metric of serving floating point operations per second (FLOP/s). The demand for cost-effective, high-performance compute will skyrocket (commoditizing hardware) and naturally, everyone is going to want to go after NVIDIA’s market share.
Groq's Tensor Streaming Processor and Language Processing Unit (LPU)
Bitmain's custom Tensor Processing Units (TPUs)
Google's TPUs
AWS Trainium and Inferentia silicon
Apple’s M-Series chips (pray they make enterprise versions)
Some believe this will ultimately lead to a decline in the enterprise values of chip designers and manufacturers, similar to what Cisco experienced in the early 2000s.
Kurzweil’s Law
Evolution applies positive feedback in that the more capable methods resulting from one stage of evolutionary progress are used to create the next stage. Each epoch of evolution has progressed more rapidly by building on the products of the previous stage.
It’s likely that once last-gen models get good enough they will be able to aid in the development, one way or another, of the next-gen model. A straightforward example is how data labeling becomes more efficient as processing and token generation costs go down and as last-gen models get better. This cost reduction also makes it ever more viable to continue integrating modalities into tokens as a unified representation of information, which expands data labeling from just language, to images, to the next modality, and so on. This makes sense since token representations across modalities all share the same form as language tokens anyway. See Meta’s ImageBind.
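As a toy illustration of the unified-token idea (this is not ImageBind’s actual architecture, and every dimension below is an arbitrary placeholder): text tokens get an embedding lookup, image patches get a linear projection, and both land in the same d-dimensional space so one transformer can attend over the combined sequence.

```python
# Toy illustration (not ImageBind's actual method): text tokens and image
# patches both become vectors in one shared d-dimensional space, so a single
# transformer can attend over the concatenated sequence.
import torch
import torch.nn as nn

d_model, vocab_size, patch_dim = 512, 32_000, 16 * 16 * 3  # arbitrary toy sizes

text_embed = nn.Embedding(vocab_size, d_model)  # token id -> vector
patch_proj = nn.Linear(patch_dim, d_model)      # flattened image patch -> vector

text_ids = torch.randint(0, vocab_size, (1, 10))  # 10 text tokens
patches = torch.randn(1, 64, patch_dim)           # 64 image patches

sequence = torch.cat([text_embed(text_ids), patch_proj(patches)], dim=1)
print(sequence.shape)  # torch.Size([1, 74, 512]) -- one unified token sequence
```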
It’s also likely that multi-modal models will outperform specialist models because they just have more knowledge to work with. And they can think and ‘reason’ across a broader spectrum. Something like what Feynman said about John Tukey, who could keep time by picturing a clock whereas Feynman had to ‘hear’ himself count in his head.
Open Source
Open models are lagging behind proprietary models but are improving at a faster rate. This is likely due to the sheer frequency of iteration available to open research and development. All of this is explained much better in the (Google) memo titled ‘We have no moat, and neither does OpenAI’.
Research institutions all over the world are building on each other’s work, exploring the solution space in a breadth-first way that far outstrips [Google’s] own capacity. We can try to hold tightly to our secrets while outside innovation dilutes their value, or we can try to learn from each other.
Energy
It’s obvious that this will just boil down to an energy game (always has been, but now more than ever). That leaves us with a few questions.
Where do solar, coal, gas, nuclear, lithium, and fusion stand? For example, gas plants can be ramped up and down almost on demand. Whereas coal plants can’t because of thermal inertia. What other factors need to be taken into consideration?
With that said, what are the geopolitical implications? There’s a paper titled ‘Effects of Energy Consumption on GDP: New Evidence of 24 Countries on Their Natural Resources and Production of Electricity’ that supports the idea that energy consumption drives GDP. But it also suggests a ‘complex relationship.’ Doesn’t the relationship become more straightforward? More energy → more compute → more intelligence → more innovation. And it’s no longer about reproduction.
Does the energy demand for AI training and inference undermine that of crypto?
How fast are we making improvements in performance (FLOP/s) per watt? What is the physical limit?
Data Centers and Supply Chain
We’ll assume that the current trends hold for the next decade or so, and that this doesn’t end up being like the dot-com bubble.
What is being overlooked? Who makes the uninterruptible power supply systems? Flywheel backups? Battery backups (like saltwater batteries)? The transfer switches?
What companies maintain the HVAC systems to cool down these centers? What is the ideal climate to build a data center in? As centers upgrade to liquid cooled systems, who supplies/manufactures/maintains those components? Do cities progressively reorganize around data centers instead of ports and waterways?
What does the power profile of a data center look like? Who is contracted to build out the utility substations? What company names (suppliers) pop up as you move your finger along the electrical schematic(s) of a data center?
Across the entire datacenter supply chain, which components are hardest to scale up?
Some data centers are located in remote locations. Who services the employees that work there? What about security detail? The White House AI Executive Order requires that training runs using more than 1e26 FLOPs of compute be reported to the U.S. government. Who handles the reporting? The order also emphasizes the importance of both the AI systems (including models) and the infrastructure supporting them (such as data centers) in terms of national security, economic security, and public health and safety. Do these get nationalized? Public-Private Partnership’ed?
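For a rough sense of scale of that threshold (the sustained per-accelerator throughput below is an illustrative assumption, not any particular chip’s spec):

```python
# Rough sense of scale for the 1e26-FLOP reporting threshold. The sustained
# per-accelerator throughput (1e15 FLOP/s) is an illustrative assumption.
threshold_flops = 1e26
flops_per_accelerator_per_s = 1e15
n_accelerators = 10_000

seconds = threshold_flops / (flops_per_accelerator_per_s * n_accelerators)
print(f"~{seconds / 86_400:.0f} days on {n_accelerators:,} accelerators")
# => ~116 days: the threshold targets only the very largest training runs.
```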
What happens to these companies?
Education
What degrees or fields of study are susceptible to becoming inference tokens?
When can we expect models to work alongside (and eventually replace) humans doing research?
Is there a rapidly closing window of opportunity for certain STEM degrees, where the skills and knowledge taught today will no longer be economically viable for humans by the time X cohort of freshmen graduate? And if so, what fields of study are most likely to fall outside the ‘Overton window’ of viable career paths first?
This all feels like what happened to the mechanical watch industry when Seiko introduced the quartz watch. A lot of Swiss brands died, but a few, namely Rolex and Omega (among others), pivoted to luxury. People buy mechanical watches because they are beautiful. What skills or professions become Rolex?
Does the government prop up ‘bullshit jobs’ like it subsidizes corn, soy, and wheat?
Real Estate
Let’s assume models keep getting better and better. To the point where they become economically viable as substitutes for humans that take up keyboard and mouse jobs. This means that knowledge capital can be deployed and scaled anywhere in the world.
Because information technology transcends the tyranny of place, it will automatically expose jurisdictions everywhere to de facto global competition on the basis of quality and price…Leading nation-states with their predatory, redistributive tax regimes and heavy-handed regulations, will no longer be jurisdictions of choice. Seen dispassionately, they offer poor-quality protection and diminished economic opportunity at monopoly prices….The leading welfare states will lose their most talented citizens through desertion.
Ethics
At what point do these models become sentient? It likely doesn’t even matter whether they are conscious or sentient as long as the average person thinks they are or feels a certain way about them. For example, environmentalists care about the earth even though it is not sentient. So when does that happen?
People don’t even have to care. Maybe it becomes a form of virtue signaling?