Part 3 of Planetary-Scale Computation: An industry primer on the hyperscale CSP oligopoly (AWS/Azure/GCP):
Table of Contents for Primer on the Economics of Cloud Computing:
Techne vs Metis.
The disciplines of economics and finance are typically taught by oscillating between:
This process is typical for the teaching of any discipline but [I believe that] economics and finance are a special case in that the delta between the disciplines’ Platonic ideals and the reality of the disciplines’ practices (aka techne vs metis, respectively) is the largest of any discipline and remains large even as one learns more about them. Whereas students of fields like theoretical physics, mathematics, philosophy, etc. tend to gravitate towards techne as they further the development of their theories, and students of agriculture, engineering, medicine, marketing, accounting, etc. tend to gravitate towards metis as they begin practicing their disciplines, neither techne nor metis seems to be an exclusive, stable attractor for students of economics and finance.
In finance, metis without techne results in the “I use technical analysis exclusively. What do you mean automation? Python?” Davey daytrader archetype who underfits reality, while techne without metis can contribute to global financial collapse through model overfitting. The ability to properly synthesize practice and theory can lead to profitable opportunities, as was the case for oil traders in 2020 who were quick to switch from Black/Black-Scholes-based options pricing models to the Bachelier model when oil futures went negative for the first time in history.
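To make the model-switch point concrete, here’s a minimal sketch (with a made-up strike and vol, not anyone’s actual trading setup) of why only one of the two formulas survives contact with a negative futures price:

```python
from math import exp, log, sqrt
from statistics import NormalDist

Phi, phi = NormalDist().cdf, NormalDist().pdf   # standard normal CDF and PDF

def black76_call(F, K, T, sigma, r=0.0):
    """Black-76: assumes a lognormal futures price, so it needs F > 0 (and K > 0)."""
    d1 = (log(F / K) + 0.5 * sigma ** 2 * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return exp(-r * T) * (F * Phi(d1) - K * Phi(d2))

def bachelier_call(F, K, T, sigma, r=0.0):
    """Bachelier: assumes a normal futures price; sigma is an absolute ($) vol, so F can be < 0."""
    d = (F - K) / (sigma * sqrt(T))
    return exp(-r * T) * ((F - K) * Phi(d) + sigma * sqrt(T) * phi(d))

# WTI settled at -$37.63 on 2020-04-20; the strike and vol here are made up for illustration.
print(bachelier_call(F=-37.63, K=-50.0, T=30 / 365, sigma=25.0))   # ~12.5, prices without complaint
# black76_call(F=-37.63, K=20.0, T=30 / 365, sigma=1.5)            # math domain error: log(F/K) of a negative
```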
That this gap between the ideal and the real exists makes sense given that the practice of business and commerce exists to solve real problems that the disciplines of business/economics/accounting/finance only attempt to systematize the analysis and practice of after the fact. The job to be done (JTBD) for the healthcare industry is to improve peoples’ health; the JTBD for the airline industry is to get people where they want to go; the JTBD for a bakery is to bake bread. While principal-agent problems and regulatory capture can eventually lead to market distortions that pervert the industry’s JTBD, these distortions are usually born after the fact — the metis of the business of breadmaking eventually leads to the techne of managing the business and finances of the bakery (metis → techne). Not so with the cloud computing industry (techne → metis).
The modern cloud computing industry (as well as its computer timesharing predecessor) was initially born of an attempt to capitalize on the solution to an internal financial problem rather than to meet an already existing, external demand.
In a sentence, the basic idea behind cloud computing is to aggregate demand for computing resources at scale in order to diversify the timing of resource use, thereby maximizing asset (i.e., computer/server) utilization and exploiting various economies of scale. Cloud computing was initially a financial innovation that only subsequently enabled and benefited from (and still enables and benefits from, present tense) technological innovation. That Jeff Bezos was a financial analyst (no less at a quant firm that heavily utilized computational resources) prior to founding Amazon should come as no surprise to anyone who learns this fact.
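A toy simulation makes the intuition tangible (the workload shape and every number below are made up): if each firm provisions for its own 99.9th-percentile hour, the sum of those standalone capacities far exceeds what a single provider needs in order to cover the same service level for the pooled demand:

```python
import random

random.seed(0)
N_WORKLOADS, HOURS, SERVICE_LEVEL = 100, 10_000, 0.999

# Each workload: a quiet baseline plus rare, large spikes (shape and numbers are hypothetical).
def hourly_demand():
    return max(0.0, random.gauss(10, 2) + (random.random() < 0.02) * random.uniform(50, 100))

workloads = [[hourly_demand() for _ in range(HOURS)] for _ in range(N_WORKLOADS)]

def capacity_for(series, q=SERVICE_LEVEL):
    """Capacity needed to fully serve a q fraction of hours (the q-th percentile of demand)."""
    return sorted(series)[int(q * (len(series) - 1))]

solo = sum(capacity_for(w) for w in workloads)                  # every firm provisions for itself
pooled = capacity_for([sum(hour) for hour in zip(*workloads)])  # one provider serves them all

print(f"sum of standalone capacities:         {solo:8,.0f}")
print(f"pooled capacity, same service level:  {pooled:8,.0f}")
print(f"capacity saved by aggregating demand: {1 - pooled / solo:.0%}")
```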
The grounding of the modern cloud computing industry in economic and financial theory means that the underlying economic and financial concepts governing the industry’s economics are as salient today as they were over a decade ago, with no rationale for a fundamental paradigm shift in cloud economics until we’re able to achieve feats like the quantum entanglement of qubits or actual clouds of solar-powered smart dust. The sale of computing resources is structured around on-demand vs reserved vs spot prices (analogous to spot vs futures prices in hard/soft commodities markets) and there are even “Cloud Economist” roles inside and outside of hyperscaler cloud companies. The business model’s “immaculate conception” (with respect to economic/financial grounding) is what makes it “perfect” — the sale of cloud-based infrastructure services is as perfect as it gets for real-world business models (maybe we’ll find a more perfect business model in the Metaverse, but even that will be running on the Cloud) in terms of minimizing the delta between economic theory and business practice. It’s for this reason that the primary documents which I’ll be referencing to outline the economics of cloud computing (everything after this subsection is largely just me refactoring concepts from these older documents) can be close to a decade old without diminishment in relevance.
Prior to the advent of cloud-based infrastructure, businesses would usually buy and manage their own servers for internal (employee-facing) and external (customer/client-facing) use. This necessitated an upfront expenditure of capital (i.e., CapEx) both in order to procure the space (rackspace, rooms, buildings) and equipment (servers, cooling, racks, networking, etc.) as well as continuous operating expenditure (i.e., OpEx) in order to run the whole thing (sysadmins, networking engineers, and other flavors of “IT guys”). The main problem with this approach was that upfront prediction of computing needs necessarily resulted in either:
Overprovisioning meant that your servers were underutilized and you allocated company capital that could have been put to better use elsewhere (e.g., You spent $10 million on servers that could have been spent on sales and marketing). Underprovisioning meant that your servers were overutilized and you didn’t buy enough IT equipment, potentially causing a loss in sales and/or reputation (e.g., You spent $10 million on a marketing campaign but your campaign was too successful and now your servers can’t handle all the traffic requests). The chunkiness of traditional IT meant that a single firm either had too little capacity to meet sudden bursts in demand for storage and compute or had too much server capacity relative to what could effectively be utilized. It was this dual problem that the first wave of Cloud customers sought to avoid.
The lumpy, discrete nature of IT CapEx [as opposed to fluid/continuous; companies can’t, like, buy 2 extra servers on Thursday because they’re expecting 1% more traffic next Tuesday] meant that companies were always either under or overutilizing their servers’ memory and compute resources. Furthermore, the increasingly “viral” nature of the Internet meant that requests for any particular website’s services might spike out of nowhere but businesses without enough capacity would have to reject potential new customers (i.e., “You just lost customers” and “Unfulfilled Demand”).
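In code, the dilemma looks something like this (all figures hypothetical): whatever fleet size you commit to up front, you either strand capital in idle servers or turn away traffic during spikes:

```python
# A stylized look at the over/under-provisioning dilemma (all numbers are made up).
SERVER_COST = 10_000                  # upfront cost per server
REVENUE_PER_SERVER_PERIOD = 400       # revenue a fully utilized server generates per period

monthly_demand = [40, 55, 70, 260, 65, 50, 45, 60, 80, 300, 55, 50]  # servers needed each month

def outcome(servers_bought):
    capex = servers_bought * SERVER_COST
    idle = sum(max(0, servers_bought - d) for d in monthly_demand)      # stranded capacity
    unmet = sum(max(0, d - servers_bought) for d in monthly_demand)     # rejected demand
    return capex, idle, unmet * REVENUE_PER_SERVER_PERIOD

for n in (60, 150, 300):   # under-, mid-, and over-provisioned fleets
    capex, idle, lost = outcome(n)
    print(f"{n:>3} servers: capex ${capex:>9,}, idle server-months {idle:>5}, "
          f"revenue lost to unmet demand ${lost:>9,}")
```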
This was the position that Amazon initially found itself in due to the cyclical nature of their e-commerce business nearly two decades ago and is the same position that Alibaba found itself in about a decade ago — either these companies had to find a way to outsource (the “buy” in buy vs build) extra IT capacity during high demand days/months/seasons or they could integrate backwards, internalize the cost and sell the excess, thereby transforming a cost line into a revenue line. While this primer focuses on hyperscalers ex-China, the reason why both the Big 3 hyperscalers and the Chinese hyperscalers invested as heavily as they did in cloud CapEx was that they had a strong incentive to — they were receiving demand at a sufficient scale to justify investments in CapEx and the idea to sell the excess came naturally. Attempted entrants like Oracle, IBM, and HP were unsuccessful because, among other reasons, they didn’t already have an existing consumer-facing business that necessitated traffic at scale and so they never had organic internal mandates to begin an “incremental cloud CapEx → internally developed pools of engineering and sales expertise → more cloud CapEx → more expertise” flywheel.
Cloud service providers (CSPs) like AWS essentially take on the role of capitalizing servers in the aggregate and recouping the investment through selling the use of their equipment. Users pay for “compute time” (i.e., CSP X charges you for Y seconds of instance Z, where “instance Z” is the particular processor being utilized) and storage (i.e., CSP X charges you for Y minutes/days/months of storage at the Z-th level of accessibility, where accessibility determines the retrieval time of your data). CSPs have abstracted the market for computers/servers into a marketplace for compute time and memory. For businesses, a server is useful only insofar as it serves as a provider of compute time and storage capacity — CSPs subsume the need to acquire and manage IT equipment by directly offering the compute and storage that businesses actually care about.
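Reduced to code, the product being sold looks roughly like the sketch below; the unit prices are placeholders for illustration, not actual AWS/Azure/GCP rates:

```python
# Hypothetical (not actual) unit prices, to illustrate the compute-seconds / storage-months abstraction.
PRICE_PER_INSTANCE_SECOND = {"small": 0.000006, "large": 0.000096}   # $ per second of compute
PRICE_PER_GB_MONTH = {"hot": 0.023, "archive": 0.001}                # $ per GB-month of storage

def compute_bill(instance_type, seconds):
    return PRICE_PER_INSTANCE_SECOND[instance_type] * seconds

def storage_bill(tier, gb, months):
    return PRICE_PER_GB_MONTH[tier] * gb * months

# e.g., a month of one "large" instance plus 500 GB of hot data and 10 TB of archived data:
total = (compute_bill("large", 30 * 24 * 3600)
         + storage_bill("hot", 500, 1)
         + storage_bill("archive", 10_000, 1))
print(f"${total:,.2f}")
```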
The firms that ended up becoming natural suppliers of cloud infrastructure services (i.e., Amazon, Microsoft, and Google) benefited from multiple economies of scale and expertise flywheels that continue spinning to this day.
Put simply, the hyperscalers first began selling cloud services because they were the natural sellers of underutilized resources (compute, storage) that they already had reason to procure en masse and continued to sell cloud services because the business had exhibited multiple economies of scale. On the cost side, having scale meant lower unit costs through better negotiation leverage when buying hardware and electricity as well as having a larger base over which to amortize semi-fixed costs like labor, land, and facilities (the DC’s “shell”). Furthermore, aggregating compute demand lets scaled players diversify away variability in order to maximize asset utilization:
And for a more recent articulation of this idea from a 2021 AWS re:Invent keynote by Peter DeSantis [11:55 to 12:20 — these 25 seconds are worth watching for the intuitive visualization of workload demand aggregation that Peter shows]:
Scale economies and the large amounts of capital expenditure and expertise required to manage cloud infrastructure meant that scaled players quickly grew moats, with in-house expertise and steady process improvements continually raising barriers for would-be entrants. Furthermore, in the process of scaling up their cloud offerings the hyperscalers were able to build comprehensive profiles of their customers’ demand curves and discover the price elasticities of their suite of cloud services. The nature of selling IaaS means that all the consumption information is logged without uncertainty (relative to, say, General Mills selling cereal wholesale and relying on various distributors for sales vs price info within various, disparate geographies) and feedback delay between price-setting of on-demand/spot instances and customer demand is non-existent.
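As a sketch of what “discovering price elasticities” can look like when every transaction is logged with no feedback delay (the observations below are invented), fitting a constant-elasticity demand curve Q = A·P^e reduces to least squares in log-log space:

```python
import math

# Hypothetical logged observations of (spot price per instance-hour, instance-hours sold).
observations = [(0.10, 9_800), (0.09, 11_000), (0.08, 12_700), (0.07, 14_900), (0.06, 17_800)]

# Constant-elasticity demand Q = A * P^e  =>  ln Q = ln A + e * ln P,
# so the elasticity e is just the slope of an ordinary least-squares fit in log-log space.
xs = [math.log(p) for p, _ in observations]
ys = [math.log(q) for _, q in observations]
n = len(observations)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
elasticity = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
              / sum((x - x_bar) ** 2 for x in xs))
print(f"estimated price elasticity of demand: {elasticity:.2f}")   # ~ -1.2 for this invented data
```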
Discretizing the cost of a cloud.
Cloud economics in practice requires that we oscillate away from an implicitly virtual, dematerialized view of the cloud and rematerialize these “instances”, “workloads”, and “endless long tail of AWS products” into the assemblages of atoms that actually comprise these abstractions. “Cloud economics” [that is, from the POV of the hyperscale CSPs; the term has an entirely different meaning if considered from the POV of cloud customers] are really just “networked data center economics” and “networked data center economics” are the interdependent economics of concerns including but not limited to ...
... etc, etc, etc.
In other words, we don’t really have full access to the metis of cloud economics because the economics and return profiles of these projects (and their interdependencies [i.e., product cannibalization, revenue/cost synergies, strategic tradeoffs]) are internal information that no one expects to be made transparent for either investors or the general public. That being said, it’s obviously still valuable for us to map out the contours of whatever is made available to us from this complex, planetary-scale system. Let’s Get Physical, (Cyber)Physical!: Flows of Atoms, Flows of Electrons tries to reconcile the cloud’s virtuality and materiality (i.e., its cyberphysicality) through an extensive exploration of the Cloud through the perspective of the electrons and atoms that flow through it. However, here, we’ll be limiting our analysis of the Cloud to the level of the datacenter, a level of analysis that is complex enough in and of itself.
A data center is a factory that transforms and stores bits. The Cloud is the name we give to the network(s) of these bit-transforming-and-storage factories. The economics of a data center are wholly concerned with the physics of this bit transformation and storage process (whereas the economics of the Cloud [defined here as a network of datacenters] concerns itself with bit transformation, storage, AND distribution across networks) — all of a datacenter’s costs and revenues have to do with how efficiently and effectively it’s able to securely process and store information. If fifth-dimensional beings gifted Earthlings a shiny, four-dimensional hypercube that violated the laws of physics and exhibited the capability for infinite compute/storage capacity, instant data transmission via distance-independent quantum entanglement, and all at zero energy cost, then there wouldn’t be a need for any data centers and the cloud infrastructure industry wouldn’t need to exist.
Since that day has yet to come, the laws of physics and material realities remain the primary governors of data center economics. The speed of light is what makes multiple cloud “Regions” spread across the globe a competitive necessity (versus, say, a single GIANT data center in Antarctica). The potential for earthquakes, tornados, floods, bombings, fires, and other un/natural incidents is why datacenters exhibit diseconomies of scale after a certain size (the tradeoff between intra-DC latency and disaster risk provides the imperative for geographic distribution) and why hyperscalers introduce redundancy to their fiber optic routes. The first law of thermodynamics is what necessitates cooling and heat exchange equipment within datacenters and what makes cooler climates near water relatively attractive sites for placing datacenters. Energy is used in bit transformation and storage, with excess heat energy itself requiring energy to remove from the bit factory to limit accelerated equipment depreciation and prevent cooking alive the meat-based employees on-site.
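A back-of-envelope calculation shows how binding the speed-of-light constraint is: light in fiber propagates at roughly 200,000 km/s, so a single giant Antarctic data center would impose large round-trip latencies on most of the planet before a single request is even processed (distances below are rough great-circle figures from the South Pole):

```python
SPEED_OF_LIGHT_IN_FIBER_KM_S = 200_000   # ~2/3 of c in vacuum, due to fiber's refractive index

def best_case_rtt_ms(distance_km):
    """Lower bound on round-trip time over fiber; real routes add distance, hops, and queuing."""
    return 2 * distance_km / SPEED_OF_LIGHT_IN_FIBER_KM_S * 1000

# Rough great-circle distances from a hypothetical single data center near the South Pole.
for city, km in [("Sydney", 6_200), ("São Paulo", 7_400), ("Frankfurt", 15_500)]:
    print(f"{city:>10}: >= {best_case_rtt_ms(km):5.1f} ms round trip, at best")
```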
The best place to start thinking about the cost side of data center economics is James Hamilton’s canonical research blog posts on large-scale data center infrastructure. Although many of these posts (*Overall Data Center Costs*, *Cost of Power in Large-Scale Data Centers*, *Annual Fully Burdened Cost of Power*) are over a decade old at this point and Hamilton himself has attested to the obsolescence of many of the input assumptions [due to technological advancement and shifts in hyperscaler buy vs build decisions, among other things] in these posts, the utility of the underlying frameworks remains evergreen.
From a Microsoft Research paper co-authored by Hamilton titled The Cost of a Cloud: Research Problems in Data Center Networks (2009):
This line of analysis is elaborated upon and decomposed in both *Overall Data Center Costs* and *Cost of Power in Large-Scale Data Centers*, where Hamilton provides us with Excel files of the dependent variables and his working assumptions. From Hamilton’s open-sourced model in *Overall Data Center Costs* [I’ve re-colored assumptions to be blue]:
For clarity, the assumptions in Hamilton’s model can be [imperfectly and provisionally] grouped into three categories — infrastructure (server and non-server) assumptions, power cost/efficiency assumptions, and amortization assumptions:
While the appropriate inputs for hyperscale DCs have changed with over a decade’s worth of technological innovation and embedded experience, Hamilton’s breakdown remains relevant — upfront infrastructure and equipment costs are amortized depending on their estimated useful life and variable energy costs are calculated after factoring in efficiency (PUE) and slack (avg critical load usage). As for the evolution of the input assumptions, Hamilton gives us some hints in a response to a question on his blog:
The most important area of change in DCs between 2009 and now [and what will continue to be the most important as DCs adapt for higher mixes of AI/ML-based workloads] is the shift away from the low-cost, commodity servers (presumably designed and manufactured by 3P providers) that suited the compute workloads overindexed in 2009, toward hardware better matched to the more diverse set of workloads that exists now. Educated guesses about the evolution of DC cost breakdowns are possible for the motivated and diligent analyst but, for the time being, that analyst is not me.
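To make the skeleton concrete, here’s a minimal re-implementation of the shape of Hamilton’s model: amortize each class of CapEx over its estimated useful life at some cost of money, turn the facility’s critical load into a monthly electricity bill via average utilization and PUE, and compare the buckets. Every input below is a placeholder chosen for illustration, not Hamilton’s (or any hyperscaler’s) actual figure:

```python
# A skeletal version of the "everything amortized to a monthly cost" framework.
# All inputs are placeholders for illustration, not Hamilton's (or anyone's) actual numbers.
ANNUAL_RATE = 0.05                       # cost of capital used to annualize CapEx

def monthly_amortization(capex, years, rate=ANNUAL_RATE):
    """Level monthly payment that repays `capex` over `years` at `rate` (standard annuity formula)."""
    r, n = rate / 12, years * 12
    return capex * r / (1 - (1 + r) ** -n)

servers    = monthly_amortization(capex=45_000_000, years=3)    # servers amortize fastest
power_infra = monthly_amortization(capex=20_000_000, years=10)  # power distribution & cooling gear
facility   = monthly_amortization(capex=15_000_000, years=15)   # the building "shell"

CRITICAL_LOAD_MW, AVG_UTIL, PUE, PRICE_PER_KWH = 8, 0.8, 1.2, 0.07
power = CRITICAL_LOAD_MW * 1_000 * AVG_UTIL * PUE * PRICE_PER_KWH * 730   # ~730 hours per month

total = servers + power_infra + facility + power
for name, cost in [("servers", servers), ("power/cooling infra", power_infra),
                   ("facility shell", facility), ("electricity", power)]:
    print(f"{name:>20}: ${cost:>10,.0f}/month ({cost / total:.0%})")
```

Even with made-up inputs, the shape of the output echoes Hamilton’s point: amortized servers dominate the monthly bill, with power and power-related infrastructure forming the next largest buckets.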
In keeping with the theme of providing you a framework without specific numbers [anyone who truly needs access to quantified cost/revenue/margin breakdown estimations of AWS/Azure/GCP probably already has access through their own channels], we can square our skeleton datacenter cost breakdown with the simple observation that CSPs sell different cloud-based services and each of these services contributes some percentage to their top and bottom lines. The rule of thumb was that margins get higher the higher up the stack the services you’re selling are: IaaS therefore provides the lowest margins and SaaS the highest — this logic has since bifurcated, with PaaS increasingly serving as a low-margin, commoditized complement to IaaS and SaaS, of which the latter earns higher margins than the former. That being said, the “low” [gross] margins of cloud infrastructure are only low by high growth software standards [and are considerably higher than those of software companies when we consider operating margin, given the negative margins of many high growth software companies] — Bernstein estimates from 2013 [from the Bernstein Blackbook on AWS that has proven itself to be evergreen, if only not optimistic enough given AWS’s extraordinary growth] pegged EC2 and S3 gross margins at around 50%, margin profiles which analysts estimate to have stayed consistent, or even expanded, over nearly a decade of price cuts thanks to offsetting cost savings from both industry-wide (e.g., Moore’s Law) and hyperscaler-specific (e.g., custom architectures and hardware) tech and process improvements.
While AMZN doesn’t give product/service-level breakdowns for AWS (GOOG only started breaking out GCP revenues in 2018 whereas MSFT gives you three numbers for Azure/Cloud to intentionally frustrate analysts [okay, probably not, but it feels that way sometimes; up until maybe 2021 you could only download .docx files from their IR portal so you had to convert to .pdf yourself, like ??? why ???]), various estimates about revenue and margin breakdowns exist in the public domain and every research house publishes their own estimates.
Timothy Prickett Morgan of TheNextPlatform published an illustration of his revenue breakdown estimate for AWS in Navigating the Revenue Streams and Profit Pools of AWS (2018) that pegged the revenue breakdown of Compute, Storage, Networking, and Software at 20-30% of AWS overall each for Q4’17:
While Morgan doesn’t specify his methodology in his breakdown, what’s clear is that it doesn’t jibe well with other breakdown estimates. Bernstein estimates from 2013 pegged EC2 at 70% of revenues and Corey Quinn, per this 2021 NBC article, estimates that over 50% of AWS revenue comes from EC2. A shift in revenue mix [and certainly margin contribution] away from increasingly “commoditized” EC2 (and S3) over time makes intuitive sense. The dream for hyperscalers (certainly AWS and Azure, less so for GCP) is that they commoditize the complements (i.e., their basic compute and storage offerings) to higher-margin cloud offerings (analytics, AI/ML, software in general) and lock in customers so that they can service them from both ends. A persistent, fundamental question of the cloud computing industry is whether or not non-hyperscale players (i.e., ISVs) can modularize the cloud infrastructure of hyperscalers and sell higher-margin software offerings on top of “commoditized” infrastructure — answering this question will be one of the focuses of Three-Body: Competitive Dynamics in the Hyperscale Oligopoly.
While the evolution of the industry is an open question [that we explore thoroughly later], NBC’s How Amazon’s cloud business generates billions in profit provides us with some useful benchmarks to help fill out the revenue and margin profiles of our skeleton (emphasis mine):
The article provides more details and estimated margins but the general idea is this — basic cloud computing services can be sold at gross margins around 50%, services higher up the stack sell at gross margins above 50%, and blended IaaS gross margins sit at around 60%, with OpEx bringing that down to 25-30% operating margins. OpEx is split between SG&A and R&D in what has historically been estimated to be an equal split, but the mix has [I’m assuming] shifted towards R&D given the semi-fixed nature of SG&A and the higher intensity of R&D for proprietary hardware design. Beyond the energy and the D&A allocations previously discussed, other contributors to the bottom line of cloud compute are server utilization rates (higher = better, up to the point where overutilization risks breaching SLAs), embedded costs of using x86-based chip architectures in servers (vs lower-cost ARM, or open source RISC-V; hyperscalers have been using ARM in their proprietary chip designs), and costs of licensing if your server instances’ OSes aren’t open source [AWS has to pay MSFT when their VMs utilize Windows-based instances, which helps explain why 90+% of AWS EC2 instances run Linux; AWS has recently been promoting their new Apple M1-based EC2 instances which they ostensibly pay Apple licensing fees to use].
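Stitching those benchmark percentages together (with revenue indexed to a hypothetical 100 and the OpEx split per the rough historical estimate above):

```python
# Tying the benchmark percentages together; the revenue base is hypothetical, indexed to 100.
revenue = 100.0
gross_margin = 0.60                   # blended IaaS gross margin per the estimates above
sga_pct, rnd_pct = 0.16, 0.16         # roughly equal SG&A and R&D shares of revenue (historical estimate)

gross_profit = revenue * gross_margin
operating_income = gross_profit - revenue * (sga_pct + rnd_pct)
print(f"operating margin: {operating_income / revenue:.0%}")   # lands inside the 25-30% band
```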