Intel graphics odyssey Pt. 2 – The quest for the chiplet way

In recent years, multi-chiplet GPU designs have become increasingly attractive as monolithic GPUs approach the reticle limit and it becomes ever clearer that Moore's Law's days are numbered. However, there are several problems in deciding how to disaggregate a GPU into multiple processor blocks.

First, GPU data traffic follows a very different pattern from that of CPUs: a many-to-few-to-many flow around the memory controllers that creates traffic bottlenecks. Second, while including a large number of blocks in a GPU can bring very significant benefits, each additional block can also increase the complexity of NoC routing at an exponential rate. Finally, one should also consider the power management issues of a multi-chiplet GPU: powering all blocks during operation can result in excessive energy consumption, in some cases much greater than that of the equivalent monolithic design, yet managing the energy consumption of each block separately is also highly complex.

Such design choices are of crucial importance because they directly affect the scaling performance of a multi-chiplet GPU, and all of them are part of the great odyssey that Intel graphics has challenged itself to face. So, before we get into a deeper discussion of the technical details of what Intel plans for its next-gen GPUs, it is worth examining how Intel intends to address these challenges. When we look closely at its latest patents, we can get an idea of how Intel will complete its quest.

How to make a mosaic

A recently published patent presents part of Intel's view on how it plans to solve the design problems of its future multi-chiplet GPUs, and clearly shows how its design proposal will evolve over the coming generations. The patent confirms Intel's intention to use multiple chiplets in its future GPU designs, interconnecting them with serial or parallel links and applying signal compression to reduce the overhead of communication between the blocks.

Fig. 1 - Proposed signal compression via serial/parallel links.
Fig. 2 - Architectural sharing of pins/wires.
Fig. 3 - Tile sub-cluster architecture.

In addition, the patent also suggests the architectural sharing of physical pins or wires between blocks to reduce fabrication cost and complexity, potentially shrinking the overall physical size of the GPU design. To make GPU scaling easier to control, a subcluster architecture is proposed, in which each subcluster can contain a variable number of blocks and each block within a subcluster can serve a different function.
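To make the sub-cluster idea concrete, the variable-size, mixed-function arrangement described above can be sketched as a simple data model. Every name, function label, and number here is invented for illustration; the patent does not define any such API.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the tile sub-cluster arrangement: each
# sub-cluster holds a variable number of tiles, and the tiles within
# one sub-cluster need not share a function.

@dataclass
class Tile:
    function: str     # e.g. "compute", "raster", "media" (illustrative labels)
    shared_pins: int  # pins/wires shared with neighbouring tiles

@dataclass
class SubCluster:
    tiles: list = field(default_factory=list)

@dataclass
class GPU:
    subclusters: list = field(default_factory=list)

    def tile_count(self) -> int:
        return sum(len(sc.tiles) for sc in self.subclusters)

# Build a small asymmetric configuration: sub-clusters of different
# sizes, with heterogeneous tiles inside each one.
sc0 = SubCluster([Tile("compute", 64), Tile("compute", 64), Tile("raster", 32)])
sc1 = SubCluster([Tile("media", 16)])
gpu = GPU([sc0, sc1])
print(gpu.tile_count())  # 4
```

The point of the sketch is only that nothing forces the two sub-clusters to be identical: scaling the design means adding or resizing sub-clusters, not replicating one fixed block.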

All the design approaches described above can already be found in Ponte Vecchio, and they will be the foundations of all of Intel's multi-chiplet GPU architectures in the coming years, for both gaming and HPC solutions. However, the patent goes even deeper, showing how Intel plans to apply a disruptive, heterogeneous approach to push GPU chiplet utilization to the limit.

Asymmetrical multi-chiplet GPU architecture

In a conventional multi-chiplet GPU design, it is common, for reasons of simplicity and fabrication cost, to subdivide the GPU into homogeneous functional blocks. An asymmetric architecture, however, can bring several benefits, both in manufacturing and in overall GPU performance. Intel therefore proposed in its patent the construction of an asymmetric GPU architecture, using chiplets of different sizes and specific functions in order to maximize its die recovery plan and opportunistically improve chip utilization across the GPU.

Fig. 4 - Asymmetric tile construction.

In terms of the manufacturing process, the patent proposes functional blocks of varying sizes, so that larger blocks have more resources than smaller ones. Such blocks may contain different hardware units and/or have part of their operational functions fused off and unavailable, as in die recovery, where a defective part of the block is deactivated. The patent itself cites as examples tensor processing units decoupled from vector processing units, and raster/texture processing units decoupled from processors. It is important for the reader to understand that fusing is not limited to defective dies: it can also be applied to other chiplets to achieve certain desired characteristics for the chiplets on the GPU.

In addition, the patent emphasizes that chiplets need not be perfectly rectangular, but can assume any shape without any difference in their final functionality. This observation is worth stressing, since products with the same designation may ship with chiplets of different shapes and in different numbers, which can cause some confusion when other tech analysts disassemble their GPU samples.

Fig. 5 - Asymmetrical Multi-Tile Architecture.

In terms of architectural design, the patent proposes that different blocks can be built to support different ISAs. The initial example provided by the patent is a composite-ISA architecture formed by combining blocks with single- and double-precision support. We can go further by exploring other patents, such as the inclusion of a micro-FPGA chiplet built from adaptive logic modules (ALMs) that can be programmed to accelerate ray tracing and deep learning, thus creating a more definitive heterogeneous gaming solution. All these blocks are designed to provide different performance characteristics, allowing the hardware scheduler to select a specific block for a specific workload and thereby trade performance against power. It is important for the reader to note that many current GPU workloads are non-uniform, with multiple wavefronts containing predicated-off threads. Such instructions still take up significant space in the pipeline, making even homogeneous multi-chiplet GPUs energy inefficient. By providing within each GPU chiplet a set of execution resources tailored to a range of execution profiles, this proposed asymmetric GPU architecture can handle irregular workloads more efficiently.
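The selection step the scheduler would perform can be sketched as follows. This is a minimal illustration, assuming each chiplet advertises the ISA features it supports plus rough performance and power figures; the chiplet names, feature labels, and numbers are all invented and do not come from the patent.

```python
# Illustrative sketch of a hardware scheduler picking among
# heterogeneous chiplets. Feature sets and perf/power figures
# are hypothetical placeholders.

CHIPLETS = [
    {"name": "fp32_tile", "features": {"fp32"},         "perf": 1.0, "power": 1.0},
    {"name": "fp64_tile", "features": {"fp32", "fp64"}, "perf": 0.6, "power": 1.4},
    {"name": "fpga_tile", "features": {"fp32", "rt"},   "perf": 0.8, "power": 0.7},
]

def select_chiplet(required_features, prefer_low_power=False):
    """Pick a chiplet that supports the workload's required features,
    trading performance against power depending on the current goal."""
    candidates = [c for c in CHIPLETS if required_features <= c["features"]]
    if not candidates:
        raise ValueError("no chiplet supports this workload")
    key = (lambda c: c["power"]) if prefer_low_power else (lambda c: -c["perf"])
    return min(candidates, key=key)

print(select_chiplet({"fp64"})["name"])                         # fp64_tile
print(select_chiplet({"fp32"}, prefer_low_power=True)["name"])  # fpga_tile
```

The same FP32 kernel lands on a different tile depending on whether the scheduler is chasing throughput or a power budget, which is exactly the performance/power tradeoff the patent describes.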

Fig. 6A - The programmable hardware accelerator included in the EU pipeline.
Fig. 6B - The micro FPGA Accelerator.
Fig. 7 - The Intel AI GPGPU architecture.

It is important to remember that the first part of this article presented a long breakdown of the Intel AI GPGPU architecture. With the information presented in this new patent, it is now possible to see more clearly how the new AI GPGPU pipeline will be integrated with the conventional pipeline by adding its multiple neural blocks to its composition.

Fig. 8A - Dynamic exclusive assignment of fixed functions to cores.
Fig. 8B - Process of dynamic exclusive assignment of fixed functions.

Given the great differentiation proposed between the chiplets that make up the multiple functions of the GPU, Intel will implement a method of dynamic exclusive assignment of cores to fixed-function units to increase the system's flexibility, mixing and matching resource tradeoffs to provide certain optimal power and performance characteristics in the GPU. The method is based on the performance scalability ratio, which relates the workload power/performance scalability between the fixed functions: one fixed-function unit may deliver fast performance but be expensive in terms of power, while another may be slower in operation yet consume less power. Implementing a process for dynamic exclusive assignment of cores to fixed-function units, enabling symmetric or asymmetric ISA support as needed to meet the GPU's power requirements, will lead, over time, to substantial improvements in GPU energy efficiency.
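The fast-but-hungry versus slow-but-frugal choice above can be reduced to a few lines. This is a sketch under assumed numbers, not the patent's actual mechanism: the two units and their perf/power figures are hypothetical.

```python
# Minimal sketch of dynamic exclusive assignment between two
# fixed-function units with opposite power/performance profiles.
# Names and figures are invented for illustration.

UNITS = [
    {"name": "ff_fast", "perf": 2.0, "power": 3.0},  # fast but power-hungry
    {"name": "ff_slow", "perf": 1.0, "power": 1.0},  # slower but frugal
]

def assign_unit(power_budget):
    """Exclusively assign the highest-performing fixed-function unit
    that still fits within the current power budget."""
    feasible = [u for u in UNITS if u["power"] <= power_budget]
    if not feasible:
        return None  # nothing fits: the workload must wait or be throttled
    return max(feasible, key=lambda u: u["perf"])

print(assign_unit(4.0)["name"])  # ff_fast
print(assign_unit(1.5)["name"])  # ff_slow
```

As the power budget shifts at runtime, the assignment flips between the two units, which is the flexibility the dynamic exclusive assignment scheme is meant to provide.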

The Intel chiplet quest

Adopting a chiplet-based architecture is definitely not an easy task. There are several choices and tradeoffs that Intel will have to make in its search for a development path for its new architectures. Throughout this silicon odyssey, one fact is already settled: there will be no future for anyone waiting for a Deus Ex Machina to happen.

The widespread use of chiplets, already foreseen by Moore himself, is the last breath of his Law before its ultimate end. However, even in these final moments, a huge research effort will still be needed to develop solutions that allow a greater level of system disaggregation, so that such multi-chiplet architectures can effectively bring an increase in computing power to the world.

Intel is putting a lot of effort into the development of its new GPUs, and many of the design choices it is making have great potential to move Intel into a leadership position in GPU technology development. In the end, while Intel bravely endures the long and arduous journey of its odyssey, Nvidia has been at its Sisyphean task for years, trying again and again to solve its monolithic problems, which, by the way, keep growing ever more gigantic.

***

Some references and further reading:

  • US20220107914 – Multi-tile architecture for graphics operations – Koker et al. – Intel [Link]

  • US10929134 – Execution unit accelerator – Sripada et al. – Intel [Link]

  • US11151769 – Graphics architecture including a neural network pipeline – Labbe et al. – Intel [Link]

  • Underfox – "Intel Graphics Odyssey Pt. 1 – The AI GPGPU is a game changer" – Coreteks – 2020 [Link]

  • Underfox – "The Alder Lake hardware scheduler – A brief overview" – Coreteks – 2021 [Link]

Special thanks

I would like to thank the anonymous donor for the donation of 0.05 XMR. The amount may be small, but I am very grateful that someone believes in my work.

Changelog

  • v1.0 - initial release [Link];
  • v1.1 (current) - Added some captions to improve accessibility and fix special thanks;

Donations

  • Monero address: 83VkjZk6LEsTxMKGJyAPPaSLVEfHQcVuJQk65MSZ6WpJ5Adqc6zgDuBiHAnw4YdLaBHEX1P9Pn4SQ67bFhhrTykh1oFQwBQ
  • Ethereum address: 0x32ACeF70521C76f21A3A4dA3b396D34a232bC283