Beyond Thread Director Pt. 1 - Streamed Threads

With the introduction of Thread Director in Alder Lake, Intel has finally delivered a heterogeneous processor architecture capable of fine-grained, dynamic task scheduling, substantially improving energy efficiency to meet the current demands of the desktop market. However, this remarkable advancement has also brought forth a new set of challenges that demand even more engineering effort.

A clear example of these challenges is the scheduling of parallel threads on a heterogeneous architecture. On a homogeneous processor, it is relatively simple to schedule multiple threads in parallel while managing processor resources efficiently. When a device combines different types of cores into a heterogeneous architecture, however, running parallel threads across those cores commonly results in unbalanced workload execution, leading to idle cores and inefficient resource utilization. Although Intel's Thread Director is very effective at arranging thread scheduling, this inefficiency is still clearly visible in many different types of workloads.

To address this recurring lack of granularity in performance scaling, MediaTek has in the past attempted to introduce new performance tiers, adding a cluster of mid-performance cores to its processors to achieve a better performance-to-power balance. The idea was that when a process was too heavy to scale efficiently to the small cluster, yet not demanding enough to require the full performance of the big cluster, it could be scheduled on an intermediate cluster with much greater efficiency. However, the thread scheduling mechanism MediaTek employed alongside this architectural approach proved very inefficient, and the tri-cluster architecture was eventually abandoned.

Seeking a definitive solution to this issue, Intel has recently filed a patent application for a new method complementary to its hardware scheduler, one that may solve this scheduling problem once and for all and reveal the true potential of its new heterogeneous architecture.

The basic principle of the new method proposed by Intel is to accelerate parallel instruction execution by dynamically breaking and regrouping parallel threads into smaller partitions (referred to as sub-threads in the patent) and scheduling those partitions across the P-cores and E-cores present in the device. By breaking a thread into two or more partitions, each partition can be scheduled to a core according to its complexity level and the core's characteristics, reducing the time required to complete threads and increasing execution efficiency by ensuring that no cores sit idle while others are working.
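In software terms, the idea might look something like the sketch below. This is a loose illustration with invented names (in the patent, this work is done in hardware rather than in code like this), but it captures the essential data model: a thread becomes a list of sub-threads, each carrying a complexity label that can be matched to a core type.

```python
from dataclasses import dataclass

@dataclass
class SubThread:
    thread_id: int
    index: int        # position of this partition within its thread
    complexity: str   # e.g. "performance" or "efficient"

def partition(thread_id: int, segment_labels: list[str]) -> list[SubThread]:
    """Break one thread into sub-threads, one per labeled segment."""
    return [SubThread(thread_id, i, label)
            for i, label in enumerate(segment_labels)]

def suggest_core(sub: SubThread) -> str:
    """Match a partition's complexity level to a core type."""
    return "P-core" if sub.complexity == "performance" else "E-core"

# A thread whose first segment is heavy (say, AVX) and second is light:
for sub in partition(0, ["performance", "efficient"]):
    print(f"thread {sub.thread_id}, partition {sub.index} -> {suggest_core(sub)}")
```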

___________________________________________________________________________________________

For a better understanding of how this method works, it is essential to grasp the operating principles of Thread Director. For this, I recommend reading my detailed article published in 2021. [Link]

___________________________________________________________________________________________

The Streamed Threading Circuitry

Once the hardware-guided scheduler (aka Thread Director) determines which segments of threads will be directed to P-cores and E-cores, thereby generating scheduling suggestions, the streamed threading circuitry comes into action. This circuitry implements the protocol that executes the partitioning method in a streamed manner, guiding the cores to store partially completed partition outputs in a streaming buffer in the cache at one or more points during execution. Another core can then access that partial information and begin executing the subsequent partition of the same thread before the initial partition completes, significantly increasing execution speed and efficiency.
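A toy analogue of this behavior, using ordinary software threads in place of cores and a plain queue in place of the in-cache streaming buffer (this is my own sketch, not Intel's implementation):

```python
import queue
import threading
import time

stream_buffer = queue.Queue()   # stands in for the streaming buffer in cache
DONE = object()                 # sentinel marking the end of partition N's output

def partition_n(chunks):
    """Producer: publishes partial output as soon as each piece is ready."""
    for chunk in chunks:
        time.sleep(0.01)        # pretend to compute one piece of the result
        stream_buffer.put(chunk)
    stream_buffer.put(DONE)

def partition_n_plus_1():
    """Consumer: starts on partition N's output before partition N finishes."""
    while (chunk := stream_buffer.get()) is not DONE:
        print(f"consuming {chunk!r} while the producer may still be running")

producer = threading.Thread(target=partition_n, args=(["a", "b", "c"],))
consumer = threading.Thread(target=partition_n_plus_1)
producer.start(); consumer.start()
producer.join(); consumer.join()
```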

The number of partitions may depend on the core configuration, thread characteristics, and user or manufacturer preferences. Furthermore, the thread processing circuitry can evaluate the complexity of these partitions, categorizing them as complex or non-complex. It determines the complexity of each partition, generating complexity levels that correspond to the different types of cores. For instance, if there is only one level of performance cores and one level of efficient cores, the thread processing circuitry identifies and labels each partition as either performance or efficient.

It is important to note, however, that the method does not prevent Intel from incorporating even more distinct levels of performance or efficient cores in the future. In that case, the thread processing circuitry would label partitions according to the corresponding performance and efficiency levels, allowing the scheduling circuitry to strive for alignment between each partition's complexity level and a core's performance level. This novel complementary scheduling method thus paves the way for the potential introduction of ultra-efficient or even ultra-high-performance cores.
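Illustratively, the labeling step could then generalize to any number of core levels. The level names and cost thresholds below are my own invention, since the patent only states that labels correspond to the available core types:

```python
def label(cost_estimate: float) -> str:
    """Map an estimated partition cost to a core level (illustrative only)."""
    levels = [               # (upper cost bound, level name), lowest first
        (0.5, "ultra-efficient"),
        (1.5, "efficient"),
        (3.0, "performance"),
    ]
    for bound, name in levels:
        if cost_estimate < bound:
            return name
    return "ultra-performance"   # anything heavier than the last bound

print(label(0.2), label(1.0), label(2.0), label(9.0))
```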

The stream threading protocol implemented by the streamed threading circuitry is the heart of the entire method, enabling pipelined partition execution. Each thread, and each partition of a thread, corresponds to a dedicated space in the cache. Streamed partitions are chained by stream queues as producer and consumer pairs: a streamed thread pushes entries, including position information and data-length information, to the corresponding stream queue, and the core scheduled to execute the succeeding partition pops entries in first-in, first-out order to obtain the next input stream address and range. The stream threading protocol thus enhances execution speed, effectively causing efficient cores to behave much like performance cores. According to the patent, the proposed protocol can speed up the execution of a thread by more than a factor of three compared to traditional techniques.
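The entry format and handshake might look roughly like the following. The field names are my guesses, since the patent only specifies that entries carry position and data-length information:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class StreamEntry:
    position: int   # offset into the partition's dedicated cache space
    length: int     # how much partial output is ready at that offset

# One FIFO stream queue chains partition N (producer) to partition N+1 (consumer).
stream_queue = deque()

# Producer side: partition N pushes an entry whenever output becomes available.
stream_queue.append(StreamEntry(position=0, length=256))
stream_queue.append(StreamEntry(position=256, length=128))

# Consumer side: partition N+1 pops in first-in, first-out order to learn
# the next input stream address and range.
while stream_queue:
    entry = stream_queue.popleft()
    print(f"consume {entry.length} bytes at offset {entry.position}")
```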

The Real-World Performance

The reader may naturally be wondering to what extent this method can deliver a performance improvement given the current state of the x86 instruction set. After all, what performance baseline should be expected from the next generations of Intel processors once this method is implemented? To make this easier to grasp, the patent itself provides two excellent examples of how the proposed method can significantly speed up real-world workloads on a heterogeneous processor. The figure below illustrates the first example, showing the time needed to run the three partitions of each of two threads on two cores (e.g., one P-core and one E-core) and how the method influences execution.

The first timing diagram (labeled 300) illustrates the traditional technique for scheduling parallel threads, as applied in Alder Lake, Raptor Lake, and Raptor Lake Refresh processors. As shown in the figure, the first thread is executed by the P-core and the second thread by the E-core. Because the E-core trades performance for efficiency, completing the second thread takes longer than completing the first. There is therefore wasted time during which the P-core sits unused while the E-core finishes execution, making the inefficiency of the traditional technique explicit.

In turn, the second timing diagram (labeled 302) illustrates the application of the stream threading protocol. In this case, the first and second threads are each split into three partitions. The first and second partitions of the first thread are executed on the P-core and the third partition on the E-core, so the order of the first thread is respected even though its partitions run on different cores. Meanwhile, the first partition of the second thread is executed on the E-core and its second and third partitions on the P-core. Scheduled this way, the two threads complete in less time than with the traditional technique.
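The patent gives no timings for this first example, but plugging in assumed numbers (say, 1ms per partition on the P-core and 2ms on the E-core; these figures are mine, not Intel's) makes the gain easy to check with a small simulation:

```python
cost = {"P": 1.0, "E": 2.0}   # assumed ms per partition; not from the patent

def makespan(plans):
    """plans: per-thread core assignments, e.g. ["P", "P", "E"].
    Greedy simulation: repeatedly start the pending partition that can begin
    earliest, respecting in-thread order and core availability."""
    core_free = {"P": 0.0, "E": 0.0}
    thread_ready = [0.0] * len(plans)  # earliest start of each thread's next partition
    next_part = [0] * len(plans)
    remaining = sum(len(plan) for plan in plans)
    end = 0.0
    while remaining:
        start, t, core = min(
            (max(core_free[plan[next_part[i]]], thread_ready[i]), i, plan[next_part[i]])
            for i, plan in enumerate(plans) if next_part[i] < len(plan)
        )
        stop = start + cost[core]
        core_free[core] = stop
        thread_ready[t] = stop
        next_part[t] += 1
        remaining -= 1
        end = max(end, stop)
    return end

# Diagram 300: thread 1 entirely on the P-core, thread 2 entirely on the E-core.
print(makespan([["P", "P", "P"], ["E", "E", "E"]]))  # 6.0 ms
# Diagram 302: the partitions redistributed as described above.
print(makespan([["P", "P", "E"], ["E", "P", "P"]]))  # 4.0 ms
```

Under these assumptions both threads finish at the 4ms mark instead of 6ms, and the streamed protocol's ability to overlap a partition with its predecessor's partial output would shrink that further.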

The figure below illustrates the second example proposed in the patent, which takes a more quantitative approach. It shows the time needed to run four threads on four cores (e.g., two P-cores and two E-cores).

In the first timing diagram (labeled 310), the first P-core (P0) runs a first thread that includes an AVX portion and an SSE portion, the second P-core (P1) executes a second thread with the same structure, and the first and second E-cores (E0 and E1) execute a third and a fourth thread, each likewise consisting of an AVX portion and an SSE portion.

Since AVX instructions are more complex, requiring more time and resources to execute than SSE instructions, the performance core P0 is able to complete the first thread in 2ms, with 1ms for the AVX portion and 1ms for the SSE portion. In the same way, the performance core P1 is able to complete the second thread in 2ms.

Due to the structure of the E-core and the complexity of the AVX and SSE portions, the efficient core E0 is only able to complete the third thread in 4.6ms, with 3ms to complete the AVX portion and 1.6ms to complete the SSE portion. In the same way, the efficient core E1 completes the fourth thread in 4.6ms. Therefore, the traditional scheduling technique takes 4.6ms to complete the four threads, with 2.6ms of idle time for core P0 and 2.6ms of idle time for core P1, which is considerably inefficient.

In turn, the second timing diagram (labeled 312) illustrates the application of the proposed method, breaking each of the four threads into two partitions: an AVX partition and an SSE partition. Once the instruction processing circuitry determines that the AVX partitions are more complex than the SSE partitions, it breaks the threads accordingly and tries to schedule most or all of the AVX partitions on the P-cores, while SSE partitions may be scheduled on either P-cores or E-cores, depending on the order of the partitions and on what best improves efficiency and/or completion time.

It is important for the reader to realize that although the instruction processing circuitry will always try to schedule complex partitions on performance cores, this is not imperative: complex partitions may be opportunistically scheduled on efficiency cores, depending on the ratio of efficient to performance cores and of higher-complexity to lower-complexity partitions.

In this way, as shown in the example, once the AVX partitions are all completed, the instruction processing circuitry can schedule the remaining SSE partitions of the third and fourth threads on the performance cores. The four threads thus complete in 3ms, with only 1.4ms of idle time for each efficient core. Consequently, the technique proposed by Intel yields a 1.6ms (roughly 35%) reduction in execution time and a 2.4ms reduction in total idle time.
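Using the figures from this example, the arithmetic is easy to verify; the schedule in the comments below mirrors diagram 312:

```python
avx = {"P": 1.0, "E": 3.0}   # ms per AVX partition, from the example above
sse = {"P": 1.0, "E": 1.6}   # ms per SSE partition

# Diagram 310 (traditional): each core runs one whole thread.
p_thread = avx["P"] + sse["P"]         # 2.0 ms per P-core thread
e_thread = avx["E"] + sse["E"]         # 4.6 ms per E-core thread
traditional = max(p_thread, e_thread)  # 4.6 ms overall
p_idle = traditional - p_thread        # 2.6 ms idle per P-core

# Diagram 312 (streamed): P0 and P1 each run two AVX partitions back to
# back (0-2 ms); E0 and E1 pick up an SSE partition as soon as its AVX
# predecessor is done (1-2.6 ms); the last two SSE partitions then run
# on the freed P-cores (2-3 ms).
streamed = 3.0
e_idle = streamed - sse["E"]           # 1.4 ms idle per E-core

print(f"traditional: {traditional:.1f} ms, {p_idle:.1f} ms idle per P-core")
print(f"streamed:    {streamed:.1f} ms, {e_idle:.1f} ms idle per E-core")
print(f"time saved:  {traditional - streamed:.1f} ms "
      f"({(traditional - streamed) / traditional:.0%})")
print(f"idle saved:  {2 * p_idle - 2 * e_idle:.1f} ms across all cores")
```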

From what has been presented, the reader can perhaps visualize the scale of the challenges Intel took on when diving into the development of this type of processor architecture. In fact, the industry as a whole has spent the last few decades researching ways to overcome precisely these obstacles in the development of high-performance heterogeneous CPU architectures. In addition to implementing this technique, Intel remains committed to refining its instruction set, modernizing it and gearing it toward better use in such architectures, in order to further improve the performance of its new scheduling method. Beyond what has been presented so far, there are other methods complementary to Thread Director that will also form the core of the improvements Intel will bring in the next generations of its processors, and these will be discussed in depth in the next parts. Therefore, stay tuned!

See you soon! 🦊

Some references and further reading:

  • US20230168898 - Methods and apparatus to schedule parallel instructions using hybrid cores - Chen et al. - Intel [Link]

Donations to Underfox

  • Monero address: 83VkjZk6LEsTxMKGJyAPPaSLVEfHQcVuJQk65MSZ6WpJ5Adqc6zgDuBiHAnw4YdLaBHEX1P9Pn4SQ67bFhhrTykh1oFQwBQ

  • Ethereum address: 0x32ACeF70521C76f21A3A4dA3b396D34a232bC283
