The acceleration and widespread adoption of AI has led to an unprecedented demand for computational resources, particularly GPUs. This demand has given rise to GPU marketplaces and aggregators - platforms that amass vast GPU resources and offer them as a service to power the compute-intensive requirements of AI model training and inference. GPU platforms such as Akash, io.net, primeintellect, vast.ai, and major cloud providers have democratized access to GPUs, striving to make AI more accessible to all.
However, the surge in GPU marketplaces does not magically resolve all the challenges associated with AI workloads, particularly in terms of data transfer efficiency, memory usage, compute capacity, and networking.
For data scientists, machine learning engineers, and AI enthusiasts fine-tuning models with proprietary data, improved accessibility is only part of the story: simply having more GPUs doesn't guarantee better AI performance. Significant obstacles persist in efficiently harnessing GPUs for these resource-intensive AI tasks.
The performance of AI workloads can be significantly influenced by several factors - data transfer rates, memory capacity and bandwidth, compute throughput, and networking overhead - as summarized in the table below:
The goal of this article is to explain the importance of GPU virtualization and the considerations necessary when building a high-performance computing network for AI.
We will explore the following:
AI training and inference systems: hardware and software requirements, parallelization techniques, and challenges with complex models and large datasets.
Bottlenecks in AI workloads: data transfer rates, memory limitations, networking overhead, and factors to consider when building distributed systems and networks.
GPU virtualization for optimizing AI workloads: NVIDIA's Multi-Instance GPU (MIG) and vGPU technologies for efficient resource utilization and sharing.
Constructing a GPU server for AI: selecting appropriate hardware (e.g., NVIDIA A100, H100), PCIe generation, and optimizing data transfer using NVLink or Infinity Fabric. (Covered in the next chapter of the article)
Let’s dive in!
AI training and inference are intricate processes with specific hardware and software needs. Training deep learning models, for instance, is a task characterized by its parallelizable nature, which benefits greatly from the architecture of GPUs. However, the notion that "more GPUs always mean faster training" is an oversimplification. In practice, the effective use of GPUs for AI requires understanding the subtleties of data transfers, memory usage, compute capacity, and networking.
For example, the performance gains from using multiple GPUs can diminish if the data isn't efficiently transferred between the GPUs or if the model doesn't scale well. This is evident in complex models like GPT-3, where the interconnect bandwidth and latency can become bottlenecks. Interconnect bandwidth refers to the amount of data that can be transferred between GPUs or devices per unit of time, while latency is the time delay in the data transfer. Optimizing these factors is crucial for the performance of distributed AI systems in decentralized applications.
The training process also becomes more complex when considering model parallelism and data parallelism to manage larger models or datasets that do not fit into a single GPU's memory. Model parallelism involves splitting a neural network model across multiple GPUs or devices to distribute the computational workload, with each device processing a portion of the model. Data parallelism, on the other hand, distributes the training data across multiple devices, with each device processing a subset of the data and the model being replicated on each device.
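To make the distinction concrete, here is a minimal PyTorch sketch contrasting the two approaches. It assumes at least two CUDA devices are visible, and the layer sizes are arbitrary; production training would typically use DistributedDataParallel rather than DataParallel.

```python
import torch
import torch.nn as nn

# Data parallelism: the same model is replicated on every visible GPU and each
# replica processes a different slice of the batch.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
data_parallel_model = nn.DataParallel(model).cuda()        # replicas on cuda:0, cuda:1, ...
out = data_parallel_model(torch.randn(64, 1024).cuda())    # batch of 64 is split across GPUs

# Model parallelism: the model itself is split across devices because it is too
# large for one GPU; activations move between GPUs between stages.
class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(1024, 4096).to("cuda:0")   # first half on GPU 0
        self.stage2 = nn.Linear(4096, 10).to("cuda:1")     # second half on GPU 1

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))
        return self.stage2(x.to("cuda:1"))                 # activation hops over the interconnect

print(TwoStageModel()(torch.randn(64, 1024)).shape)
```

In the model-parallel case, every forward pass moves activations across the GPU interconnect, which is exactly where bandwidth and latency start to matter.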
Inference, i.e., the deployment of trained models, brings another set of challenges. In decentralized GPU marketplaces, GPUs are often shared among multiple users, potentially leading to variability in performance due to resource contention. Moreover, not all AI models are inference-friendly on all GPU types; some may run more efficiently on specialized inference hardware like NVIDIA's Tensor Core GPUs, such as the A100 and H100 PCIe/SXM. Tensor Cores are specialized hardware units designed to accelerate AI workloads, particularly matrix multiplication operations, which are fundamental to deep learning inference. These GPUs offer high performance and efficiency for inference tasks compared to general-purpose GPUs.
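As a rough illustration of how reduced precision engages Tensor Cores, the following PyTorch sketch runs a linear layer under autocast. It assumes a CUDA GPU with Tensor Cores (Volta or newer); the layer and batch sizes are arbitrary.

```python
import torch

# Tensor Cores are engaged when matrix multiplications run in reduced precision
# (FP16/BF16, or TF32 on Ampere and newer). autocast leaves the model code
# unchanged while casting eligible ops to half precision for inference.
model = torch.nn.Linear(4096, 4096).cuda().eval()
x = torch.randn(32, 4096, device="cuda")

with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)   # the underlying GEMM can be dispatched to Tensor Cores

print(y.dtype)     # torch.float16
```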
In modern computing, a large share of time is spent simply moving data from one place to another; workloads often spend longer moving data to and from memory than processing it. As the resource requirements of high-performance AI computing increase, interconnect bandwidth and latency quickly become the bottleneck.
GPU memory bandwidth is the rate at which a GPU can read from and write to its own onboard memory (such as HBM), while interconnect bandwidth is the speed of data transfer between the GPU and the rest of the system across a bus such as PCI Express (PCIe) - more on this later.
Latency is the delay before a transfer of data begins following an instruction for its transfer.
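To make these two quantities concrete, here is a rough PyTorch sketch that times a host-to-GPU copy over PCIe against a copy within GPU memory. It assumes a CUDA-capable machine, and the measured numbers will vary with the PCIe generation and GPU model.

```python
import time
import torch

def time_copy(fn, warmup=3, iters=10):
    """Time a copy operation, synchronizing the GPU so transfers are fully counted."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

size_gib = 1
host_buf = torch.empty(size_gib * 1024**3, dtype=torch.uint8, pin_memory=True)
gpu_buf = torch.empty_like(host_buf, device="cuda")
scratch_buf = torch.empty_like(gpu_buf)

pcie_s = time_copy(lambda: gpu_buf.copy_(host_buf, non_blocking=True))  # host -> GPU over PCIe
hbm_s = time_copy(lambda: scratch_buf.copy_(gpu_buf))                   # within GPU memory (HBM)

print(f"Host->GPU (PCIe): {size_gib / pcie_s:.1f} GiB/s")
print(f"GPU->GPU  (HBM) : {2 * size_gib / hbm_s:.1f} GiB/s (read + write)")
```

On a typical A100 PCIe server the first number lands in the tens of GiB/s while the second is an order of magnitude higher, which is the gap the rest of this section is about.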
Example AI Inference Using NVIDIA A100 and Llama 3 Model:
GPU Specs:
Model: NVIDIA A100 80GB PCIe
Memory: 80GB HBM2e
Memory Bandwidth: 1,935 GB/s
Compute Performance: 19.5 TFLOPS (FP32)
Llama 3 Model Specs:
Parameters: 8 billion
Model Size (as loaded for this example): ~4.7 GB (roughly the footprint of 4-bit quantized weights; full FP16 weights for 8 billion parameters would be ~16 GB)
When running the Llama 3 model, we utilize 4.7 GB out of the A100’s 80 GB of memory for the model’s parameters. The inference process requires additional memory for input and output data, but the exact amount depends on the sequence lengths and batch size. For this scenario, we'll allocate an extra 5.3 GB for a total of 10 GB per inference instance.
Performance Considerations:
Given the A100 can perform a significant number of mathematical operations per second (19.5 TFLOPS), it can theoretically support numerous inference passes per second.
One inference pass for Llama 3, assuming optimized use of the A100, might take approximately 2 ms. This translates to 500 inference passes per second.
At 10 GB needed per pass, we would require 5000 GB/s, which exceeds the A100's memory bandwidth of 1,935 GB/s.
Memory Bandwidth Analysis:
However, not all 10 GB needs to be transferred every pass since the model parameters remain constant and only the data being processed changes.
The memory bandwidth will mainly be utilized for reading input data and writing output data, alongside accessing intermediate data during computation.
Effective memory usage patterns and caching can significantly reduce the memory bandwidth requirements per inference pass.
PCIe Bandwidth:
The PCIe Gen4 x16 interface on the A100 provides roughly 64 GB/s of aggregate bidirectional bandwidth (about 32 GB/s in each direction).
To avoid bottlenecks, the entire Llama 3 model should be preloaded onto the GPU's memory before inference begins.
Data transfers over PCIe would primarily involve input data from the host and output data back to the host. These transfers must be managed to ensure they do not become the bottleneck.
Using the A100 for AI inference with the Llama 3 model is practically feasible, as the GPU has sufficient memory to hold the model and offers high memory bandwidth to accommodate the flow of data during inference. However, to maintain efficiency, it is crucial to optimize data transfer patterns and preload the model to minimize reliance on PCIe bandwidth during runtime. The back-of-envelope arithmetic behind these numbers is sketched in code below.
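Here is the same back-of-envelope estimate as a small Python sketch; the 2 ms pass latency, the 10 GB working set, and the per-pass I/O payload are illustrative assumptions, not measurements.

```python
# Back-of-envelope check of the A100 / Llama 3 inference example above.
# All figures are illustrative assumptions from the example, not measurements.

GPU_MEMORY_GB = 80            # A100 80GB PCIe
HBM_BANDWIDTH_GBPS = 1935     # A100 memory bandwidth
PCIE_GEN4_X16_GBPS = 64       # host <-> GPU link (aggregate bidirectional)

WORKING_SET_GB = 10.0         # weights + activations / KV cache per instance (assumed)
PASS_LATENCY_S = 0.002        # ~2 ms per inference pass (assumed)

passes_per_second = 1 / PASS_LATENCY_S                      # 500
naive_bandwidth_need = passes_per_second * WORKING_SET_GB   # 5000 GB/s if everything moved each pass

print(f"Instances that fit in GPU memory: {int(GPU_MEMORY_GB // WORKING_SET_GB)}")
print(f"Naive bandwidth demand: {naive_bandwidth_need:.0f} GB/s "
      f"vs HBM {HBM_BANDWIDTH_GBPS} GB/s -> weights must stay resident")

# Only inputs and outputs actually cross PCIe once the model is preloaded.
io_per_pass_mb = 2            # assumed request + response payload per pass
pcie_demand = passes_per_second * io_per_pass_mb / 1024     # GB/s
print(f"PCIe demand for I/O only: {pcie_demand:.2f} GB/s of {PCIE_GEN4_X16_GBPS} GB/s")
```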
Recognizing the significant bottlenecks in data transfer and compute capacity inherent in AI workloads, particularly when scaling across multiple GPUs, leads us to consider solutions that can mitigate these challenges. While GPUs have significantly accelerated the training process due to their parallel processing capabilities, they are not immune to challenges. In large-scale training tasks where multiple GPUs are networked together, bandwidth and latency in data transfer between GPUs can be a significant hurdle.
When bandwidth is insufficient, GPUs cannot receive data as fast as they can process it. High latency can delay synchronization between GPUs, which is especially critical during tasks like model parallelism, where different parts of a neural network model are processed on a set of clustered GPUs. These limitations can severely impede the scalability of AI systems and the efficiency of training complex models like GPT, where rapid, continuous data exchange is crucial. And when multiple AI tasks run concurrently, they often contend for GPU resources, further increasing latency and inefficiency.
To bridge the gap and address these challenges, GPU resources need to be shared and utilized efficiently. GPU virtualization abstracts GPU resources from the physical hardware to a virtual environment, where resource allocation can be dynamically managed and multiple virtual machines or containers can share physical GPUs. This not only optimizes hardware utilization and the scheduling of computational tasks but also mitigates bandwidth and latency issues by managing data flows more efficiently in the virtualized environment.
The objective of this article is to dissect the intricacies of GPU virtualization, with a particular emphasis on the pivotal roles played by PCI Express (PCIe) interconnect and NVIDIA’s SXM (Server PCI Express Module) form factor. PCIe is a high-speed serial bus standard that has become the backbone of internal data transfers in modern computing. It connects GPUs to the central processing unit (CPU) and to other devices, enabling the rapid data transfer rates required for AI computations.
We’ll explore the technical nuances of PCIe in GPU virtualization, addressing how PCIe’s bandwidth and latency characteristics can affect AI workload performance.
3.1 PCIe and SXM: Key Interconnects for GPU Virtualization
PCI Express (PCIe) is a widely used interface for connecting GPUs to the rest of the computing system. It allows multiple GPUs to be connected to a single computer, and the division of GPU resources into virtual segments that can be allocated to different VMs is made possible by technologies such as NVIDIA's vGPU and Multi-Instance GPU (MIG).
NVIDIA vGPU allows multiple virtual machines (VMs) to share a single physical GPU, while MIG enables the creation of multiple isolated GPU instances within a single GPU.
SXM, on the other hand, was introduced by NVIDIA as part of its high-end computing platforms to overcome some of the PCIe bottlenecks. It mounts GPUs directly onto the server board and pairs them with NVLink, offering higher bandwidth and lower latency than PCIe. NVIDIA NVLink, a high-speed, bidirectional interconnect, is designed to scale AI training workloads across multiple GPUs, offering several times the bandwidth of PCIe (the exact factor depends on the NVLink and PCIe generations).
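One practical way to see which interconnect links the GPUs in a given server is nvidia-smi's topology matrix; the minimal Python wrapper below assumes nvidia-smi is installed and on the PATH.

```python
import subprocess

# Print the GPU interconnect topology matrix. Entries such as "NV4" indicate
# NVLink connections (and how many links), while "PIX"/"PHB"/"SYS" indicate
# paths that traverse a PCIe switch, the host bridge, or the CPU interconnect.
topology = subprocess.run(
    ["nvidia-smi", "topo", "-m"],
    capture_output=True, text=True, check=True,
)
print(topology.stdout)
```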
3.2 How GPU virtualization works:
GPU virtualization is a technology that allows multiple virtual machines (VMs) or containers to share a single physical GPU. It abstracts the physical GPU resources and presents them as virtual GPUs (vGPUs) to each VM or container. This enables efficient utilization of GPU resources and facilitates the creation of isolated GPU instances within a single GPU.
In the context of AI workloads, GPU virtualization plays a crucial role, as these workloads often require significant computational power. However, dedicating a single physical GPU to each AI workload can lead to underutilization of resources and increased costs. GPU virtualization addresses this challenge by allowing multiple AI tasks to run concurrently on a single GPU, thereby maximizing resource utilization and cost efficiency.
Moreover, GPU virtualization enables secure and isolated AI training environments. In a decentralized network, multiple parties may contribute their GPU resources to participate in collaborative AI training. However, ensuring the security and privacy of sensitive training data and models is of utmost importance. GPU virtualization technologies, such as NVIDIA vGPU and Multi-Instance GPU (MIG), provide hardware-level isolation between different GPU instances. Each virtual GPU operates within its own isolated environment, with dedicated memory and compute resources. This isolation prevents unauthorized access or leakage of sensitive information between different AI tasks running on the same physical GPU.
The security benefits of GPU virtualization extend beyond data privacy. It also enhances the overall resilience and fault tolerance of the decentralized AI training network. If one virtual GPU instance encounters an issue or fails, it does not impact the other instances running on the same physical GPU. This compartmentalization ensures that the failure of a single task does not bring down the entire network, thereby improving the reliability and availability of the decentralized AI infrastructure.
For instance, a virtualized GPU could simultaneously support an engineer developing new AI models while also running an operational model that serves real-time inference requests.
In another example, a compute provider might use GPU virtualization to serve multiple customers on the same physical hardware, each with their own virtualized GPU instances tailored to their specific needs.
3.3 How does it work?
There are a few ways to virtualize a GPU:
NVIDIA's Multi-Instance GPU (MIG) technology allows a single GPU to be split into multiple instances, each with its own dedicated resources such as memory and compute cores. This is different from traditional GPU deployments, where a single task could monopolize an entire GPU and potentially leave a lot of resources unused. MIG enables more efficient utilization of the GPU by allowing multiple VMs and tasks to run in parallel on a single GPU, while preserving the isolation guarantees that vGPU provides. (A minimal sketch of partitioning a GPU with MIG follows the figure breakdown below.)
With MIG, each instance's processors have separate and isolated paths through the entire memory system - the on-chip crossbar ports, L2 cache banks, memory controllers, and DRAM address busses are all assigned uniquely to an individual instance. This ensures that an individual user's workload can run with predictable throughput and latency, with the same L2 cache allocation and DRAM bandwidth, even if other tasks are thrashing their own caches or saturating their DRAM interfaces. MIG can partition available GPU compute resources (including streaming multiprocessors or SMs, and GPU engines such as copy engines or decoders), to provide a defined quality of service (QoS) with fault isolation for different clients such as VMs, containers or processes. MIG enables multiple GPU Instances to run in parallel on a single, physical NVIDIA Ampere GPU.
The figure above shows the MIG architecture, which includes:
CPU and system memory, at the left side of the diagram.
The USER box represents a virtual machine or container that requires GPU resources.
PCIe SW is the PCIe switch on the motherboard, used to connect multiple devices - including the GPUs in this context - over the PCIe bus.
Each GPU Instance box represents a partition of the physical GPU, allocated to a user. These instances allow multiple users to share the same physical GPU resources securely and efficiently.
SMs, Pipe, Control, L2, DRAM: These are components within each GPU instance.
SMs (Streaming Multiprocessors) are fundamental units within NVIDIA GPUs that carry out parallel processing tasks.
Pipes are data pipelines or the pathways through which data flows within the GPU.
Control logic manages the operations within each GPU instance.
L2 is cache memory that stores frequently accessed data to speed up retrieval.
DRAM (Dynamic Random-Access Memory) is the dedicated memory for each GPU instance.
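As a rough sketch of how MIG partitioning is driven in practice, the following Python wrapper around nvidia-smi enables MIG mode and carves an A100 into three instances. It assumes administrator privileges, a MIG-capable (Ampere or newer) GPU, and that the profile IDs used exist on the card; profile IDs and sizes differ between GPU models, so list them first.

```python
import subprocess

def run(cmd):
    """Run an nvidia-smi command and print its output (requires admin privileges)."""
    print("$", " ".join(cmd))
    print(subprocess.run(cmd, capture_output=True, text=True).stdout)

# Enable MIG mode on GPU 0 (may require a GPU reset before it takes effect).
run(["nvidia-smi", "-i", "0", "-mig", "1"])

# List the GPU instance profiles this card supports (sizes, memory, instance counts).
run(["nvidia-smi", "mig", "-lgip"])

# Create three 2g.20gb GPU instances with default compute instances.
# Profile ID 14 corresponds to 2g.20gb on an A100 80GB; IDs vary by GPU model.
run(["nvidia-smi", "mig", "-cgi", "14,14,14", "-C"])

# Confirm the resulting MIG devices; each appears with its own UUID, which is
# what gets handed to a container or VM (e.g., via CUDA_VISIBLE_DEVICES).
run(["nvidia-smi", "-L"])
```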
NVIDIA vGPU, on the other hand, presents a GPU instance to multiple users without physically partitioning the hardware: time-sliced access to the GPU is shared among different virtual machines. This can be ideal for scenarios where users require GPU resources but not necessarily the full capacity of a GPU all the time.
The GPU passthrough technique gives a virtual machine full and exclusive access to a GPU's resources. This can be beneficial for critical AI tasks that require direct control over GPU hardware without any intervening layers that could add overhead. With GPU passthrough, an AI training task can fully utilize the underlying GPU hardware, leading to faster model training times and more efficient resource utilization. Direct device assignment means a physical device can be dedicated to a guest VM; the VFIO (Virtual Function I/O) kernel framework is used to provide this direct physical GPU access.
The figure below shows a general architecture for a physical GPU that is split into three GPU instances of different sizes. Each VM is strictly isolated and can run different AI tasks with direct access to its physical GPU instance, which has dedicated memory and compute resources. The KVM hypervisor manages the VMs and can run multiple containers on top of them in parallel.
3.4 GPU Passthrough for Dedicated Resources
GPU passthrough in a virtualization environment involves directly assigning a physical GPU to a virtual machine (VM), bypassing the hypervisor layer, thus allowing the VM to interact with the GPU as if it were a physical machine. The process typically involves these technical steps:
IOMMU Grouping: The Input-Output Memory Management Unit (IOMMU) groups devices according to how they can be safely isolated and assigned for direct access. The IOMMU enables safe DMA (Direct Memory Access) remapping, which is essential for passthrough.
VFIO (Virtual Function I/O) is a kernel framework used to assign devices to VMs securely. It employs the IOMMU groups to ensure the device can only be accessed by the designated VM.
Hypervisors (like KVM or Xen) must be configured to allow passthrough for the specific GPU, involving setting up the VM with direct access to the GPU's physical address space.
For instance, developers using KVM can configure passthrough by using virsh to edit the VM’s configuration file, ensuring the hostdev section points to the correct PCI address of the GPU.
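As a hedged sketch of the same step done programmatically (via the libvirt Python bindings rather than virsh directly), the snippet below attaches a GPU to a VM's persistent configuration; the VM name and PCI address are hypothetical placeholders.

```python
import libvirt

# Illustrative PCI address of the GPU to pass through (find yours with `lspci | grep -i nvidia`).
BUS, SLOT, FUNC = "0x3b", "0x00", "0x0"

# libvirt <hostdev> definition: managed='yes' lets libvirt detach the device from
# its host driver and bind it to vfio-pci before handing it to the guest.
hostdev_xml = f"""
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='{BUS}' slot='{SLOT}' function='{FUNC}'/>
  </source>
</hostdev>
"""

conn = libvirt.open("qemu:///system")       # connect to the local KVM/QEMU hypervisor
dom = conn.lookupByName("ai-training-vm")   # hypothetical VM name
# Attach to the persistent config so the GPU is available on the VM's next boot.
dom.attachDeviceFlags(hostdev_xml, libvirt.VIR_DOMAIN_AFFECT_CONFIG)
conn.close()
```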
VM-level Isolation and Security for AI workloads:
Passthrough provides strong isolation because the VM has exclusive access to the GPU, eliminating "noisy neighbor" issues in shared environments.
The direct assignment model reduces the attack surface since there is no sharing or over-provisioning. Moreover, IOMMU's role in DMA protection helps prevent VMs from accessing unauthorized memory regions.
For example, if a developer is working on an AI model that handles sensitive data, GPU passthrough can ensure that this data is processed in a secure, isolated environment within the VM.
In another scenario, a data science team requires access to GPU resources for various projects with differing performance needs. They could configure a server with some GPUs dedicated to passthrough for high-performance workloads, while others are shared for tasks where peak performance isn't as critical.
NVIDIA supports various other GPU concurrency and sharing mechanisms for improved GPU utilization, alongside the CUDA programming model; more information is available on the NVIDIA developer blog.
How can we leverage these virtualization and hardware utilization techniques?
Example #1:
Consider the development and deployment of a real-time, AI-powered Collaborative Video Editor Studio. This app leverages advanced AI models to perform AI tasks for automatic video enhancement, content-aware editing, background noise reduction, and even complex video effects that were previously only achievable by professional editors like enhancing low-light scenes or stabilizing shaky footage. It's designed for use across various platforms, including smartphones, tablets, and PCs, catering to content creators ranging from casual users to semi-professionals.
In this scenario, let's walk through what happens when complex AI features are needed in high demand.
The app's AI features require significant computing power, traditionally reserved for high-end PCs or cloud solutions. These processes include frame-by-frame video analysis, object detection, and applying neural network models to enhance video quality in real time. The app also runs on a diverse range of devices, and consumers expect results instantaneously.
The app is very popular and demand can spike suddenly during big events. They need to be able to upload, edit, enhance, and stitch multiple video clips together in real time. Can they rely solely on traditional GPU acceleration?
There is a need for the company to operationalize AI workloads and AI inference servers efficiently. Some of the challenges in scaling and performance could be:
The app running on high-end devices may handle tasks efficiently, while lower-end devices struggle, leading to inconsistent user experience. Without the ability to program GPUs to optimize for specific tasks, performance can vary across devices.
When concurrent tasks are performed - such as rendering video while also applying AI-generated enhancements - they compete for GPU resources, potentially leading to degraded performance. Traditional GPU utilization can lead to inefficiencies without programmable access to resource allocation.
The app's popularity also increases the demand for more sophisticated AI capabilities. Without the ability to dynamically allocate and optimize GPU resources based on demand and workload type, scaling the app to accommodate more complex AI models and a larger user base becomes challenging.
If the processing times are longer for AI-driven enhancements due to inefficient resource utilization, the real-time editing process could become bottlenecked.
If users are on battery-powered devices like tablets and smartphones, inefficient GPU utilization can lead to excessive energy consumption, reducing battery life and negatively affecting the user experience.
Example #2:
AI-Powered Open Trading Network:
Imagine a trading platform that uses AI agents to analyze financial markets in real-time. This platform needs to process vast amounts of market data - prices, volumes, news updates, and economic indicators - to make accurate trading decisions.
Once the AI agents are developed, they are deployed across various computational environments that vary in capability - from robust data centers to edge devices used by individual traders. These AI agents need to analyze, perform complex calculations, and act on market data in milliseconds to capitalize on trading opportunities.
Over time, the network is used by many more customized AI agents built by sophisticated AI scientists, and challenges arise:
Even minor delays in data processing can result in significant financial losses, and rapid responses become crucial during periods of high market volatility.
Different environments may offer varying levels of computational power. AI agents might perform well in controlled test environments but underperform on live systems with actual workloads, leading to inconsistent trading strategies and potential financial risk.
Volatile market events can trigger massive spikes in data volume that need to be processed, leading to resource contention. Traditional systems might not scale quickly or efficiently enough, causing slowdowns and missed opportunities.
There is a need for programmable GPU hardware and efficient operationalization. Vistara addresses these challenges:
By enabling virtualization and programmable access to hardware resources, Vistara allows the app developer to optimize their AI workloads.
By providing dynamic hardware resource allocation and autoscaling, Vistara’s hypervisor can hotplug provision more computational resources to AI agents during high market activity, ensuring that the data is processed as quickly as possible.
By abstracting the hardware layer, Vistara allows AI agents to operate independently of the underlying physical resources. This means they can perform consistently, regardless of the device or system they are running on, ensuring uniform experience across all platforms.
Through load balancing, Vistara ensures that no single node or AI agent overwhelms the system’s resources. Hardware layer isolation prevents tasks from interfering with each other, maintaining system integrity and security during critical operations.
Throughout this article, we've explored the significant role that GPU virtualization plays in optimizing AI workloads. We've seen how bottlenecks such as bandwidth and latency can hinder AI systems, and how technologies such as PCIe, SXM, NVIDIA's MIG, and GPU Passthrough help mitigate these challenges by maximizing hardware efficiency. As we move towards a future dominated by AI agents and networks, the need for sophisticated virtualization solutions becomes even more crucial. Vistara redefines how we think about and utilize hardware resources. It's not just about virtualization or orchestration; it's about transforming static hardware into a dynamic, scalable, and secure infrastructure that responds in real time to the demands of advanced applications and networks. By providing dynamic resource allocation, Vistara enables more consistent performance across varied computational tasks and empowers developers to push the boundaries of what's possible in open AI networks.
For developers and entrepreneurs ready to push the boundaries of what’s possible in AI, Vistara provides the foundational layer and tools to bring your most ambitious projects to life. Join our community on Discord.
For the next part of the post, we will explore how PCIe layers work, the virtualization lifecycle, and things to consider when constructing a GPU Server with PCIe and SXM.
References: