Deep Learning’s New Infrastructure

0x9753
March 7th, 2022

Deep learning is changing the human experience but there is a battle raging for control of the underlying infrastructure.

Meta, the company behind Facebook, recently revealed its own AI supercomputer, a cluster of approximately 6,000 GPUs. This makes it the ~5th fastest supercomputer in the world and underscores their ambition to own the infrastructure on which the world’s artificial intelligence is developed.

As we’ve seen in the past decade, what’s good for Meta and other oligopolists generally hasn’t been good for us; the ease of use of their products typically trades off against price gouging and censorship/anti-competitive behaviour.

Thus far, most of the fight against Big Tech centralisation has been on age-old issues, principally freedom of speech. In the coming decades, as AI becomes inextricably integrated into society, a more fundamental concern will arise: freedom of compute.

As we discuss below, limits on and control over compute have existential implications, so let's rewind a bit, unpack what's really at stake here, and examine a radically new way to think about the problem.

Supercomputer: Facebook announces a GPU cluster, planned to scale to 16k GPUs, exclusively for model training.

Deep learning is everywhere in your life

Every face you see on a video call and all the audio you hear is manipulated. To improve call quality, neural networks selectively adjust the resolution in Zoom and suppress background noise in Microsoft Teams. More recent advances even see lower resolution video ‘dreamed’ into a higher resolution.

Your video calls aren’t what they seem: Nvidia’s AI video compression technology can convincingly recreate your face from low resolution video.

Neural networks are the models used in the deep learning branch of artificial intelligence. They are loosely based on the structure of the human brain and have myriad applications, perhaps ultimately creating human level artificial intelligence. Bigger models generally yield better results, and the hardware required for state-of-the-art development is doubling every three months.

This explosion in development has made deep learning a fundamental part of the modern human experience. In 2020, a neural network operated the radar on a US spy plane, language models now write better scam emails than humans, and self-driving car algorithms outperform humans in many environments.

Hands free: A Tesla autonomously drives through California.

In the next decade, deep learning will become even more fundamental as jumps in model development become more frequent and also as hardware advances create greater sensory immersion. For example, Brain Machine Interfaces (BMI) use artificial intelligence to decode brain signals. BMIs, like Neuralink, hold the near-term promise of allowing someone to access and browse the internet with their thoughts.

Brain Pong: Neuralink shows a real monkey playing Pong. (L) shows the monkey moving the joystick in its left hand. (R) shows the joystick unplugged - with the monkey still moving the joystick - but with control of the game happening entirely through the monkey’s thoughts via BMI.

Current deep learning progress is astounding but only scratches the surface

If you’re sceptical of grand visions for deep learning then you might have well-founded concerns. The space has been anthropomorphised and overhyped from the moment it entered the public eye.

Bold claims: In a debut July 1958 NYT article we were told neural networks would become self-aware.

Yet in the past decade, some of the human-like promise of AI has indeed come to fruition. In 2016, AlphaGo, a neural network developed by DeepMind (an Alphabet subsidiary), toppled Lee Sedol, the world’s Go Champion. It accomplished this by analysing games played by Go masters and then playing itself to improve. By DeepMind’s estimate, Go is around 10^100 (a googol) times more complex than chess.

Less well-known is the fact that the following year, a new version of AlphaGo learned to play Go without ever being shown games from the Go masters. Having never seen a real game of Go, it ended up beating the previous version that had beaten Sedol.

Lonely at the top: Lee Sedol battles AlphaGo, and loses (L). A year later, a new version learns purely by playing itself, with zero human guidance (R).

However, despite enormous gains in narrow areas like Go, the most basic and innate concepts of the human experience are the hardest to replicate: self-awareness, morality, and ‘gut instinct’. This shortcoming was famously captured in Blade Runner’s Voight-Kampff test, loosely based on the Turing Test, in which AI replicants were asked a series of morally ambiguous questions to determine their humanity.

Do you like our owl? Rick Deckard tests the humanity of a suspected replicant in Blade Runner (1982).

For a system to convincingly answer these questions, it would probably need to satisfy a version of Artificial General Intelligence (AGI), aka ‘Strong AI’. An AGI is a system that matches average human intelligence and has a sense of consciousness. Achieving this state is an area of passionate research and fundamental uncertainty (there is no universally accepted definition of consciousness, for example). Estimates for achieving AGI vary wildly; Blade Runner might have been set in 2019, but rough estimates for AGI creation span from 2029 to 2220!

The three factors that get us to AGI are being fiercely centralised

If you’ve noticed that most of the above examples are produced by the same set of companies, that’s because the deep learning industry currently looks like a game of Monopoly between Big Tech companies. At the state level, too, it often looks like a trade and talent war between China and the United States. These forces are resulting in huge centralisation of the key resources that get us to AGI: compute power, knowledge, and data.

Compute power: access to superior processors enables increasingly large/complex models to be trained. In the past decade, transistor density gains and advances in memory access speed/parallelisation have dramatically reduced training times for large models. Virtual access to this hardware, via cloud giants like AWS and Alibaba, has simultaneously widened adoption.

Look closely: Intel increased the density of transistors in 2011 with their 22nm Tri-Gate design. State-of-the-art process nodes are now ~5nm, approaching the physical limits of what can be built. For context, the width of a human hair is ~100,000 nm.

Accordingly, there is strong state interest in acquiring the means to produce state-of-the-art processors. Mainland China does not yet have the end-to-end capability to produce state-of-the-art semiconductors (namely, silicon wafers), an essential component in processors. They need to import these, particularly from TSMC (Taiwan Semiconductor Manufacturing Company). Chip vendors also attempt to block out other customers from accessing chip manufacturers by buying up supply. At the state level, the US has been aggressively blocking any move by Chinese companies to acquire this technology.

Chip war: US-aligned countries dominate the sub-22nm chip production process.

Further up the tech stack, some companies have gone as far as creating their own deep learning specific hardware, like Google’s TPU clusters. These outperform standard GPUs at deep learning and aren’t available for sale, only for rent.

Pod race: A single Google TPU v3 pod has 100 petaFLOPS of performance, one billion times more than the best supercomputer of the 1960s.

Knowledge: many of the most public breakthroughs have stemmed from new model architectures developed by researchers, but there is a battle over the underlying IP and talent. The US has historically captured over 50% (!) of the talent emerging from China, and the companies that develop models with this talent are increasingly making the technology less accessible. GPT-3 by OpenAI was (as the name suggests) meant to be openly available. But, as of today, it controversially sits behind an API with only Microsoft having access to the source code.

Brain drain: historically, China has lost over 50% of its best AI researchers to the US.

Data: deep learning models require huge volumes of data, both labelled and unlabelled, and generally improve as data quantity increases. GPT-3 was trained on roughly 300 billion tokens (words and word fragments). Labelled data is particularly important, and the industry has been steadily accruing it for years. A clandestine example: every time you solve a reCAPTCHA to access a website you may be labelling training data to improve Google Maps.

Big data: (L) you have likely labelled data for Google, (R) GPT-3 models make fewer mistakes as they are exposed to more words.

Decentralising compute improves scale and access

The internet might have been born of the US Government in the 1960s, but by the 1990s it was an anarchic web of creativity, individualism, and opportunity. Well before Google was stockpiling TPUs, projects like SETI@home attempted to discover alien life by crowdsourcing decentralised compute power. By the year 2000, SETI@home had a processing rate of 17 teraflops, which is over double the performance of the best supercomputer at the time, the IBM ASCI White. This period of time is generally named ‘web1’, a moment before the hegemony of large platforms like Google or Amazon (web2).

Lonely planet: A SETI@home user 22 hours into an analysis of data from Puerto Rico’s old Arecibo Radio Observatory.

The shift from web1 and the relative obscurity of SETI@home might have been due to some of the issues with decentralised infrastructure. For one, all of the data must be distributed, increasing bandwidth requirements. Equally important is how fast you can complete tasks that must be performed in sequence. For example, if you want to analyse the background radiation of the universe (and maybe find alien life) you can divide the sky into small parts and distribute them to everyone (the radiation in one part of the sky can be analysed independently of the other parts). This allows for perfect parallelisation of work; it also means that it’s relatively trivial to check if the work has been done correctly. After asking a third party to perform a computation, you could randomly select units of submitted work, recompute them yourself, and check that the results match.
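
The random spot-check described above can be sketched in a few lines of Python. This is a toy illustration rather than how SETI@home actually validated work: `analyse_patch`, the worker functions, and the synthetic "sky" are all hypothetical stand-ins, and the sample size is arbitrary.

```python
import random

def analyse_patch(patch):
    """Stand-in for real signal analysis; each patch is processed independently."""
    return sum(x * x for x in patch)

def honest_worker(patch):
    return analyse_patch(patch)

def lazy_worker(patch):
    # Skips the work entirely and returns a made-up answer.
    return 0.0

def spot_check(patches, results, sample_size, rng):
    """Recompute a random sample of work units and compare with the submissions."""
    sampled = rng.sample(range(len(patches)), sample_size)
    return all(analyse_patch(patches[i]) == results[i] for i in sampled)

def detection_probability(cheat_fraction, sample_size):
    """Approximate chance of sampling at least one bad unit when a fraction
    of units is faked (treats samples as independent draws)."""
    return 1 - (1 - cheat_fraction) ** sample_size

rng = random.Random(0)
# Hypothetical "sky": 1,000 independent patches of 16 readings each.
sky = [[rng.random() for _ in range(16)] for _ in range(1000)]

honest_results = [honest_worker(p) for p in sky]
lazy_results = [lazy_worker(p) for p in sky]

honest_ok = spot_check(sky, honest_results, sample_size=20, rng=rng)
lazy_ok = spot_check(sky, lazy_results, sample_size=20, rng=rng)
print(honest_ok, lazy_ok, round(detection_probability(0.1, 20), 3))
```

Because each patch is independent, the checker only pays for the units it samples. Even a worker who fakes just 10% of the units is likely to be caught by a modest sample.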

In contrast, if you want to calculate the next best move in a game of chess (or Go), you must have access to the positions of all the pieces on the board. This makes it very difficult to efficiently decentralise the computation in chess and Go engines.

“Easy” vs “Hard” problems: You can take any slice of the night sky and analyse it to determine results: this enables perfect parallelisation (L). However, you cannot reliably determine the next best move on a chessboard without seeing the whole board (R)

However, the current centralisation of web infrastructure into huge web2 platforms creates its own arguably larger issues:

  1. Costs increase: AWS’ gross margin is an estimated 61% - effectively a money printing licence.
  2. Control: AWS turned off the infrastructure of popular right-wing social media platform Parler with one day’s notice following the Jan 6th 2021 Capitol Riot. Many agreed with this decision, but the precedent is dangerous when AWS hosts 42% of the top 10,000 sites on the internet. The propensity for government tyranny also increases: for instance, the CCP blocks Google (although it’s still, illegally, accessible via a bridge on the decentralised Tor network).

There is a third way, however. Web3 can be thought of as a combination of the decentralised components of web1 and the capitalist components of web2. For example, decentralising compute with a blockchain and buying/selling processor cycles with tokens would circumvent the above issues with web2:

  1. Low cost scale: Unlike web2 organisations, which extract value through margins, there are low (or no) margins in web3. Instead, as demand for the token increases (or supply deflates) the token price increases and investors realise a return. As long as the cost per processor cycle is dynamically priced, the value of the token increases without affecting the cost of purchasing compute. This instantly removes the AWS-style margin applied on compute. Just as importantly, blockchains are a more scalable option as long as the blockchain has, at worst, a linear relationship between work verification time and work.
  2. Free access: For blockchains that are trustless, there is generally no oversight as to who can and cannot access the network. This circumvents state and private sector restrictions and enables users to build where they might not have otherwise been able to.
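
The pricing argument in point 1 is easier to see as arithmetic. The sketch below assumes compute is quoted in a stable unit (USD per GPU-hour) and settled in tokens at the spot price. The function names and all the numbers are hypothetical, chosen only to illustrate the mechanism.

```python
def tokens_needed(gpu_hours, usd_per_gpu_hour, usd_per_token):
    """Tokens a buyer pays when compute is quoted in USD but settled in tokens."""
    return gpu_hours * usd_per_gpu_hour / usd_per_token

def usd_cost(gpu_hours, usd_per_gpu_hour, usd_per_token):
    """Buyer's cost in USD: tokens paid multiplied by the token's USD price."""
    return tokens_needed(gpu_hours, usd_per_gpu_hour, usd_per_token) * usd_per_token

job_hours = 100   # hypothetical job size
rate = 0.50       # hypothetical USD price per GPU-hour

# Token price quadruples between the two purchases.
tokens_before = tokens_needed(job_hours, rate, usd_per_token=1.0)
tokens_after = tokens_needed(job_hours, rate, usd_per_token=4.0)
cost_before = usd_cost(job_hours, rate, usd_per_token=1.0)
cost_after = usd_cost(job_hours, rate, usd_per_token=4.0)
print(tokens_before, tokens_after, cost_before, cost_after)
```

Under these assumptions the buyer hands over fewer tokens as the token appreciates, but the cost of the job in USD is unchanged. Token holders gain from the price rise without a margin being added to compute.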

Training deep learning models across decentralised hardware is difficult due to the verification problem

It’s clear that decentralising compute creates a cheaper and freer base from which to research and develop artificial intelligence. But the fundamental blocker to the decentralisation of deep learning training has been verification of work. Essentially, how do you know that another party has completed the computation that you requested? The two factors driving this blocker are:

State dependency: neural networks are more like the chess board than the night sky. That’s because, generally, each layer in the network is connected to all the nodes in the layer before it. This means it requires the state of the previous layer (literally, ‘state dependent’). Worse still, all the weights in every layer are determined by the previous time step. So if you want to verify that someone has trained a model (say, by picking a random point in the network and seeing if you get the same state) you need to train the model all the way up to that point, which is very computationally expensive.
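
To make the cost of state dependency concrete, here is a minimal sketch using a one-parameter model trained by gradient descent. The model, the data, and the `naive_verify` helper are hypothetical; the point is only that checking the weights claimed at step t means replaying all t steps.

```python
def train_step(w, x, y, lr=0.1):
    """One gradient-descent step for the one-parameter model y_hat = w * x."""
    grad = 2 * (w * x - y) * x
    return w - lr * grad

def train(w0, data, steps):
    """Return the weight after `steps` updates, cycling through `data`.
    Each update depends on the weight produced by the previous one."""
    w = w0
    for t in range(steps):
        x, y = data[t % len(data)]
        w = train_step(w, x, y)
    return w

def naive_verify(w0, data, step, claimed_w):
    """Check a claimed weight at `step` by replaying training from the start.
    Returns (is_valid, number_of_steps_recomputed)."""
    return train(w0, data, step) == claimed_w, step

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # hypothetical samples of y = 2x

claimed = train(0.0, data, steps=500)
valid, replayed = naive_verify(0.0, data, 500, claimed)
tampered_valid, _ = naive_verify(0.0, data, 500, claimed + 0.01)
print(valid, replayed, tampered_valid)
```

Unlike the night-sky case, there is no independent unit to sample here. The verifier redoes as much work as the trainer did up to the point being checked.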

Learning to read: a simple neural network uses the MNIST handwriting dataset to learn the relationship between handwritten numbers and their true values. Each layer has nodes and each node is connected to all the nodes in the previous layer, creating state dependence.

High computational expense: It cost ~$12m in 2020 for a single training run of GPT-3, >270x more than the estimated ~$43k for the 2019 training of GPT-2. In general, model complexity (size) of the best neural networks is currently doubling every three months. If neural networks were less expensive, and/or if the training represented less of the model development process, then perhaps the verification overhead stemming from state dependency would be acceptable.
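
The figures quoted above can be checked with a quick calculation. The snippet uses only the numbers stated in this section and, as a simplifying assumption, compounds the three-month doubling at a constant rate.

```python
gpt3_cost_usd = 12_000_000   # ~$12m, single GPT-3 training run (2020)
gpt2_cost_usd = 43_000       # ~$43k, estimated GPT-2 training run (2019)
cost_ratio = gpt3_cost_usd / gpt2_cost_usd

doubling_period_months = 3
growth_per_year = 2 ** (12 / doubling_period_months)
growth_per_two_years = growth_per_year ** 2

print(f"GPT-3 vs GPT-2 cost: ~{cost_ratio:.0f}x")
print(f"Model size growth: {growth_per_year:.0f}x per year, "
      f"{growth_per_two_years:.0f}x over two years")
```

A three-month doubling compounds to 16x per year. Any verification scheme whose overhead grows faster than the training itself would therefore quickly become untenable.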

Hold on tight: state-of-the-art neural networks are growing in size at an exponential rate.

If we want to lower the price, and decentralise the control, of deep learning training, we need a system that trustlessly manages state dependent verification whilst also being inexpensive in terms of overhead and rewarding to those who contribute compute.

Gensyn is the future of deep learning training

The Gensyn protocol trustlessly trains neural networks at hyperscale and low cost. The protocol achieves the lower prices and higher scale by combining two things:

  1. Novel verification system: a verification system which efficiently solves the state dependency problem in neural network training at any scale. The system combines model training checkpoints with probabilistic checks that terminate on-chain. It does all of this trustlessly, and the overhead scales linearly with model size (keeping verification cost a constant proportion of training cost). If you’d like to go deeper into how this is technically possible, read our Litepaper.
  2. New supply: leveraging underutilised and unoptimised compute sources. These range from presently unused gaming GPUs to sophisticated Eth1 mining pools about to detach from the Ethereum network. Better still, the protocol’s decentralised nature means it will ultimately be majority community governed and cannot be ‘turned off’ without community consent; this makes it censorship resistant, unlike its web2 counterparts.

High scale + low cost: the Gensyn protocol provides a cost similar to an owned GPU in a data centre at a scale which can surpass AWS. (Prices as at Nov 2021).
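
The general idea of pairing checkpoints with probabilistic checks can be illustrated with a toy example. This is not the Gensyn protocol's actual mechanism (the Litepaper is the reference for that): the one-parameter model, the hash-based `digest`, and the sampling scheme are all assumptions made for illustration, and nothing here touches a blockchain.

```python
import hashlib
import random

def train_step(w, x, y, lr=0.1):
    """One gradient-descent step for the one-parameter model y_hat = w * x."""
    grad = 2 * (w * x - y) * x
    return w - lr * grad

def run_segment(w, data, start, end):
    """Replay training steps start..end-1 beginning from weight w."""
    for t in range(start, end):
        x, y = data[t % len(data)]
        w = train_step(w, x, y)
    return w

def digest(w):
    """Hypothetical commitment to a checkpoint: a hash of the weight's repr."""
    return hashlib.sha256(repr(w).encode()).hexdigest()

def train_with_checkpoints(w0, data, steps, every):
    """Train and record the weight at step 0 and after every `every` steps."""
    checkpoints = [w0]
    w = w0
    for start in range(0, steps, every):
        w = run_segment(w, data, start, min(start + every, steps))
        checkpoints.append(w)
    return checkpoints

def spot_check(checkpoints, data, steps, every, samples, rng):
    """Replay randomly chosen segments, each from its preceding checkpoint,
    so no replay from step 0 is needed. Returns (is_valid, steps_replayed)."""
    replayed = 0
    for i in rng.sample(range(len(checkpoints) - 1), samples):
        start, end = i * every, min((i + 1) * every, steps)
        w = run_segment(checkpoints[i], data, start, end)
        replayed += end - start
        if digest(w) != digest(checkpoints[i + 1]):
            return False, replayed
    return True, replayed

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # hypothetical samples of y = 2x
steps, every = 30, 5
rng = random.Random(0)

honest = train_with_checkpoints(0.0, data, steps, every)
honest_ok, honest_replayed = spot_check(honest, data, steps, every, 2, rng)

forged = list(honest)
forged[3] += 0.5  # tamper with one checkpoint
forged_ok, _ = spot_check(forged, data, steps, every, len(forged) - 1, rng)
print(honest_ok, honest_replayed, forged_ok)
```

In this sketch the verifier replays only the sampled segments (10 of 30 steps in the honest case) rather than the whole run. The trade-off is that a forged segment is only caught if it happens to be sampled, so the sample rate sets the chance of detection.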

Deep learning’s Cambrian Explosion

Vastly increasing the scale of accessible compute, whilst simultaneously reducing its unit cost, opens the door to a completely new paradigm for deep learning for both research and industrial communities.

Improvements in scale and cost allow the protocol to build up a set of already-proven, pre-trained, base models–also known as Foundation Models–in a similar way to the model zoos of popular frameworks. This allows researchers and engineers to openly research and train superior models over huge open datasets, in a similar fashion to the Eleuther project. These models will solve some of humanity’s fundamental problems without centralised ownership or censorship.

Cryptography, particularly Functional Encryption, will allow the protocol to be leveraged over private data on-demand. Huge foundation models can then be fine-tuned by anyone using a proprietary dataset, maintaining the value/privacy in that data but still sharing collective knowledge in model design and research.

Join us at the frontiers of artificial intelligence

With the Gensyn protocol, we finally have hyperscale, cost-efficient, deep learning training that isn’t marshalled by institutions who play god with who gets access. The first version of the protocol, our testnet, will be deployed later this year.

If you’d like to use Gensyn to train models, get a yield on your GPU, or solve some hard engineering problems: join our Discord, join our team, or follow us on Twitter!

Finally, thank you to everyone in the Gensyn community who helped with this piece!
