The cost of inference is trending towards zero
Token throughput is trending towards infinity
Context window sizes are getting larger
Companies are spending more on training despite improvements in compute and cost efficiency
Models are quickly becoming commoditized
Compute is quickly becoming commoditized
We’re sharing our notes on trends that we wrote about back in December of last year (and updated in February of 2024). This document has been sitting in our team Notion workspace for almost half a year now. So we figured we may as well put it out there rather than letting it collect dust. While some of the observations are dated, others are holding up pretty well. And that’s pretty exciting because at the time we were just having fun speculating about the near future. Note that there is no particular structure to this document since it was just something we threw together. We hope you find it entertaining!
Note that these trends are focused on transformer models.
Moore's Law
The doubling of transistor density every two years will lead to faster and more cost-effective computing performance, enhancing the efficiency of model training and inference over time. In practice, that price-performance is improving at a rate of roughly 1.35x per year.
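For a quick gut check, here is a tiny back-of-envelope sketch (assuming simple compounding) comparing the classic "doubling every two years" rate with the 1.35x/year figure above:

```python
# Toy comparison of compounding rates: classic Moore's Law (2x every 2 years,
# i.e. ~1.41x/year) vs. the ~1.35x/year figure cited above.
moore_per_year = 2 ** (1 / 2)   # ~1.414x per year
cited_per_year = 1.35

for years in (1, 5, 10):
    print(
        f"{years:>2} yr: "
        f"Moore ~{moore_per_year ** years:6.1f}x, "
        f"cited ~{cited_per_year ** years:6.1f}x"
    )
# At 10 years: Moore ~32x, cited ~20x (roughly)
```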
Jevons’ Paradox
When the cost of using a resource decreases due to increased efficiency, it becomes more attractive for consumers and industries to use it. That’s why, when the internal combustion engine became more efficient, fuel consumption, and as a consequence greenhouse gas emissions, went up. In software development the same phenomenon is described by Wirth’s Law: devs always figure out how to bloat software faster than hardware can keep up. Or said simply, we have more resources, so we do more things.
Price Competition
In addition to Moore's Law, competitive pricing among compute providers is further driving down the cost of processing and generating tokens. Cheaper inference increases accessibility, governed by Jevons' Paradox, where increased efficiency leads to higher overall consumption. This results in unlocks such as larger context windows, more sophisticated planning (agent) workflows, and (arguably) excessive inferencing for things like generative web components (see Wirth’s Law). Maybe ‘generative everything’ is what leads us to the Dead Internet, e.g. AITube. To really drive it home, you would expect demand for token consumption to increase proportionally as the cost of inference decreases. But here's a surprising fact: while inference costs are dropping by a factor of 15x each year, the demand for processing and generating more tokens is increasing significantly faster. And we can use context window size as a proxy for estimating just how much, especially since it is the most significant driver of token processing consumption. The answer? Context windows have grown 1,250x each year since 2022.
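Here is the same arithmetic as a rough sketch, taking the 15x/year cost decline and the context window growth figure above at face value (they are quoted figures, not measurements):

```python
# Back-of-envelope using the figures quoted above (assumptions, not
# measurements): price per token falls ~15x/year, while the tokens we want to
# process (using context window size as a proxy) grow far faster.
cost_decline_per_year = 15       # price per token shrinks by this factor
context_growth_per_year = 1_250  # token-demand proxy grows by this factor

net_spend_growth = context_growth_per_year / cost_decline_per_year
print(f"Spend per maxed-out request grows ~{net_spend_growth:.0f}x per year")
# Even with a 15x/year price drop, total spend can still climb: Jevons' Paradox.
```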
More Inference
We’re finding more places to run models too. For example, Georgi Gerganov’s llama.cpp offloads token processing and generation to the CPU. So now any server or consumer device can serve a model using CPU clock cycles instead of relying on GPUs alone. And there seems to be a lot of work being done on getting around memory constraints so that even memory-bound devices can run inference on larger models. Quantization is the obvious one here, but also techniques like offloading and distributed inference (see Petals), just to run the gamut. WebAssembly might also play a role because it enables inferencing from the browser. Meaning that smaller models (which are also cheaper to inference) can be used as a sort of ‘worker’ for low-IQ tasks (e.g. reasoning assists) without running up the cloud bill.
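As a minimal sketch of what CPU-only serving can look like, here is quantized GGUF inference through the llama-cpp-python bindings; the model path, thread count, and prompt below are placeholders, not recommendations:

```python
# Minimal CPU-only inference with a quantized GGUF model via the
# llama-cpp-python bindings (pip install llama-cpp-python).
# The model path is a placeholder; any quantized GGUF file works.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-7b.Q4_K_M.gguf",  # 4-bit quantized weights
    n_ctx=2048,    # context window to allocate
    n_threads=8,   # CPU threads; no GPU required
)

out = llm("Q: Summarize Jevons' Paradox in one sentence. A:", max_tokens=64)
print(out["choices"][0]["text"])
```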
Wirth’s Law for Training
Algorithmic optimizations result in a 3x per year decline in the physical compute required to run a training cycle. Yet these efficiencies are offset by the USD cost of the most expensive training run growing 3.1x every year since 2009, another example of Jevons’ Paradox (a.k.a. Wirth’s Law).
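A quick back-of-envelope sketch, taking both rates above as given, of what that implies for effective training compute:

```python
# Combining the two rates above (taken as given, not measured here):
# algorithmic progress cuts the physical compute needed for a given result
# ~3x/year, while spend on the largest run grows ~3.1x/year.
algo_efficiency_per_year = 3.0  # same result for a third of the FLOPs each year
spend_growth_per_year = 3.1     # USD cost of the biggest run grows this much/year

effective_growth = algo_efficiency_per_year * spend_growth_per_year
print(f"Effective training compute grows ~{effective_growth:.1f}x per year")
# And that is before counting hardware price-performance gains (~1.35x/year above).
```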
GPU > CPU
The general trend is that hyperscalers are running the Apple playbook and vertically integrating, from bare metal to the web interface, going from compute aggregators to end-to-end clock cycle providers. Let’s assume for a moment, given all of the trends, that every clock cycle in the near future will go towards some form of token generation: site rendering and site copy, porn, video games, ads, etc.
By that measure, the future of the compute market will be defined by the metric of serving floating point operations per second (FLOP/s). The demand for cost-effective, high-performance compute will skyrocket (commoditizing hardware) and naturally, everyone is going to want to go after NVIDIA’s market share.
Groq's Tensor Streaming Processor and Language Processing Unit (LPU)
Bitmain's custom Tensor Processing Units (TPUs)
Google's TPUs
AWS Trainium and Inferentia silicon
Apple’s M-Series chips (pray they make enterprise versions)
Some believe this will ultimately lead to a decline in the enterprise values of chip designers and manufacturers, similar to what Cisco experienced in the early 2000s.
Kurzweil’s Law
Evolution applies positive feedback in that the more capable methods resulting from one stage of evolutionary progress are used to create the next stage. Each epoch of evolution has progressed more rapidly by building on the products of the previous stage.
It’s likely that once last-gen models get good enough they will be able to aid in the development, one way or another, of the next-gen model. A straightforward example is how data labeling becomes more efficient as processing and token generation costs go down and as last-gen models get better. This cost reduction also makes it ever more viable to continue integrating modalities into tokens as a unified representation of information, which expands data labeling from just language, to images, to the next modality, and so on. This makes sense since token representations across modalities all share the same form as language tokens anyway. See Meta’s ImageBind.
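As a toy illustration of the unified-token idea (this is not ImageBind’s actual architecture, and every dimension below is an arbitrary placeholder): text tokens get an embedding lookup, image patches get a linear projection, and both land in the same d-dimensional space so one transformer can attend over the combined sequence.

```python
# Toy illustration (not ImageBind's actual method): text tokens and image
# patches both become vectors in one shared d-dimensional space, so a single
# transformer can attend over the concatenated sequence.
import torch
import torch.nn as nn

d_model, vocab_size, patch_dim = 512, 32_000, 16 * 16 * 3  # arbitrary toy sizes

text_embed = nn.Embedding(vocab_size, d_model)  # token id -> vector
patch_proj = nn.Linear(patch_dim, d_model)      # flattened image patch -> vector

text_ids = torch.randint(0, vocab_size, (1, 10))  # 10 text tokens
patches = torch.randn(1, 64, patch_dim)           # 64 image patches

sequence = torch.cat([text_embed(text_ids), patch_proj(patches)], dim=1)
print(sequence.shape)  # torch.Size([1, 74, 512]) -- one unified token sequence
```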
It’s also likely that multi-modal models will outperform specialist models because they just have more knowledge to work with. And they can think and ‘reason’ across a broader spectrum. Something like what Feynman said about John Tukey, who could keep time by picturing a clock whereas Feynman had to ‘hear’ himself count in his head.
Open Source
Open models are lagging behind proprietary models but are improving at a faster rate. This is likely due to the sheer frequency of iteration available to open research and development. All of this is explained much better in the (Google) memo titled ‘We have no moat, and neither does OpenAI’.
Research institutions all over the world are building on each other’s work, exploring the solution space in a breadth-first way that far outstrips [Google’s] own capacity. We can try to hold tightly to our secrets while outside innovation dilutes their value, or we can try to learn from each other.
Energy
It’s obvious that this will just boil down to an energy game (always has been, but now more than ever). That leaves us with a few questions.
Where do solar, coal, gas, nuclear, lithium, and fusion stand? For example, gas plants can be ramped up and down almost on demand. Whereas coal plants can’t because of thermal inertia. What other factors need to be taken into consideration?
With that said, what are the geopolitical implications? There’s a paper titled ‘Effects of Energy Consumption on GDP: New Evidence of 24 Countries on Their Natural Resources and Production of Electricity’ that supports the idea that energy consumption drives GDP. But it also suggests a ‘complex relationship.’ Doesn’t the relationship become more straightforward? More energy → more compute → more intelligence → more innovation. And it’s no longer about reproduction.
Does the energy demand for AI training and inference undermine that of crypto?
How fast are we making improvements in performance (FLOP/s) per watt? What is the physical limit?
Data Centers and Supply Chain
We’ll assume that the current trends hold for the next decade or so, and that this doesn’t end up being like the dot-com bubble.
What is being overlooked? Who makes the uninterruptible power supply systems? Flywheel backups? Battery backups (like saltwater batteries)? The transfer switches?
What companies maintain the HVAC systems to cool down these centers? What is the ideal climate to build a data center in? As centers upgrade to liquid cooled systems, who supplies/manufactures/maintains those components? Do cities progressively reorganize around data centers instead of ports and waterways?
What does the power profile of a data center look like? Who is contracted to build out the utility substations? What company names (suppliers) pop up as you move your finger along the electrical schematic(s) of a data center?
Across the entire datacenter supply chain, which components are hardest to scale up?
Some data centers are located in remote locations. Who services the employees that work there? What about security detail? The White House AI Executive Order requires that training runs using more than 1e26 FLOPs of compute be reported to the U.S. government. Who handles the reporting? The order also emphasizes the importance of both the AI systems (including models) and the infrastructure supporting them (such as data centers) in terms of national security, economic security, and public health and safety. Do these get nationalized? Public-Private Partnership’ed?
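For a rough sense of scale of that threshold (the sustained per-accelerator throughput below is an illustrative assumption, not any particular chip’s spec):

```python
# Rough sense of scale for the 1e26-FLOP reporting threshold. The sustained
# per-accelerator throughput (1e15 FLOP/s) is an illustrative assumption.
threshold_flops = 1e26
flops_per_accelerator_per_s = 1e15
n_accelerators = 10_000

seconds = threshold_flops / (flops_per_accelerator_per_s * n_accelerators)
print(f"~{seconds / 86_400:.0f} days on {n_accelerators:,} accelerators")
# => ~116 days: the threshold targets only the very largest training runs.
```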
What happens to these companies?
Education
What degrees or fields of study are susceptible to becoming inference tokens?
When can we expect models to work alongside (and eventually replace) humans doing research?
Is there a rapidly closing window of opportunity for certain STEM degrees, where the skills and knowledge taught today will no longer be economically viable for humans by the time X cohort of freshmen graduate? And if so, what fields of study are most likely to fall outside the ‘Overton window’ of viable career paths first?
This all feels like what happened to the mechanical watch industry when Seiko introduced the quartz watch. A lot of Swiss brands died, but a few, namely Rolex and Omega (among others), pivoted to luxury. People buy mechanical watches because they are beautiful. What skills or professions become Rolex?
Does the government prop up ‘bullshit jobs’ like it subsidizes corn, soy, and wheat?
Real Estate
Let’s assume models keep getting better and better. To the point where they become economically viable as substitutes for humans that take up keyboard and mouse jobs. This means that knowledge capital can be deployed and scaled anywhere in the world.
Because information technology transcends the tyranny of place, it will automatically expose jurisdictions everywhere to de facto global competition on the basis of quality and price…Leading nation-states with their predatory, redistributive tax regimes and heavy-handed regulations, will no longer be jurisdictions of choice. Seen dispassionately, they offer poor-quality protection and diminished economic opportunity at monopoly prices….The leading welfare states will lose their most talented citizens through desertion.
Ethics
At what point do these models become sentient? It likely doesn’t even matter whether they are conscious or sentient as long as the average person thinks they are or feels a certain way about them. For example, environmentalists care about the earth even though it is not sentient. So when does that happen?
People don’t even have to care. Maybe it becomes a form of virtue signaling?