Dark Silicon (Toward Dark Silicon in Servers)

Jul 21, 2020

This report summarizes the technological trends that give rise to the phenomenon of dark silicon, its impact on servers, and an effort to curb it, based on the research paper [6] published in 2011 by Hardavellas et al. Server chips do not scale beyond a certain limit. As a result, an increasing portion of the chip, known as dark silicon, remains powered off because we cannot afford to power it. Specialized multicore processors can make use of the abundant, underutilized, and power-constrained die area by providing diverse application-specific heterogeneous cores to improve server performance and power efficiency.

1- DARK SILICON

Data is growing at an exponential rate, and processing it requires computational energy. It has been observed that data is growing faster than Moore’s Law [1]. Moore’s Law states that the number of transistors per chip doubles roughly every two years; in practice it has often been taken to imply similar growth in performance. An unprecedented amount of computational energy is required to cope with this challenge. To get an idea of the energy demands, consider that a 1000 m² datacenter draws 1.5 MW. Nowadays, multicore processors are used to process this data. It is commonly believed that the performance of a system is directly proportional to the number of available cores. However, this belief does not hold, because performance does not keep pace with Moore’s Law. In reality, performance grows much more slowly than expected due to physical constraints such as bandwidth, power, and thermal limits, as shown in figure 1.

Off-chip bandwidth grows slowly, so cores cannot be fed with data fast enough. Supply voltage does not scale down fast enough as transistor counts grow: a 10x increase in transistors was accompanied by only a 30% voltage drop in the last decade. Similarly, power is constrained by cooling limits, as cooling does not scale at all. To fuel the multicore revolution, the number of transistors on the chip is growing exponentially. However, operating all transistors simultaneously requires exponentially more power per chip, which is not possible due to the physical constraints explained above. As a result, an exponentially growing area of the chip is left unutilized, known as dark silicon.
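The arithmetic behind this argument can be sketched roughly as follows. This is only an illustration, assuming the textbook relation that dynamic power scales with the number of switching transistors times the square of the supply voltage, with per-transistor capacitance and frequency held fixed. That simplification, and the function name `usable_fraction`, are assumptions and are not taken from the paper.

```python
def usable_fraction(transistor_growth: float, voltage_drop: float) -> float:
    """Fraction of transistors that can switch under an unchanged power budget.

    Assumption (illustrative only): dynamic power ~ N * V^2, with
    per-transistor capacitance and clock frequency held constant.
    """
    relative_voltage = 1.0 - voltage_drop
    # Power needed, relative to the old chip, if every transistor were active.
    power_if_all_on = transistor_growth * relative_voltage ** 2
    return min(1.0, 1.0 / power_if_all_on)


# Figures quoted in the text: 10x more transistors, 30% lower voltage.
fraction_on = usable_fraction(transistor_growth=10.0, voltage_drop=0.30)
print(f"usable: {fraction_on:.0%}, dark: {1 - fraction_on:.0%}")
```

Under these simplified assumptions only about a fifth of the transistors could be active at once. Real chips also benefit from lower capacitance per transistor, so the sketch shows the direction of the trend rather than a prediction.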

The dark silicon area is growing exponentially, as shown by the trend line in figure 2. In this graph, the die size of the peak-performance design for different workloads is plotted over time. In simple words, we can only use a fraction of the transistors available on a large chip, and the rest of the transistors remain powered off.

Now a question arises: should we waste this large, unutilized dark area of the chip? Hardavellas et al. [6] repurposed dark silicon for chip multiprocessors (CMPs) by building a sea of specialized, heterogeneous, application-specific cores. Such a chip dynamically powers up only the few cores designed explicitly for the given workload. Most of the application-specific cores remain disabled (dark) when not in use.

Benefits of Specialized Cores: Specialized cores are better than conventional cores because they eliminate overheads. For example, accessing a piece of data from the local memory, the L2 cache, and the main memory requires 50 pJ, 256–1000 pJ, and nearly 16000 pJ of energy, respectively. These overheads are inherent to general-purpose computing, while a carefully designed specialized core can eliminate most of them. Specialized cores improve the aggregate performance and energy efficiency of server workloads by mitigating the effect of physical constraints.
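To see how quickly the memory hierarchy dominates energy use, here is a minimal sketch that combines the per-access energies quoted above into an average. The access mix (90% local, 9% L2, 1% main memory) and the function name are hypothetical and chosen only for illustration.

```python
# Per-access energies quoted in the text, in picojoules.
LOCAL_PJ = 50.0
L2_PJ_RANGE = (256.0, 1000.0)  # the text gives a range for the L2 cache
DRAM_PJ = 16000.0


def average_access_energy_pj(local_frac, l2_frac, dram_frac, l2_pj):
    """Weighted average energy per access for a given access mix."""
    if abs(local_frac + l2_frac + dram_frac - 1.0) > 1e-9:
        raise ValueError("access fractions must sum to 1")
    return local_frac * LOCAL_PJ + l2_frac * l2_pj + dram_frac * DRAM_PJ


# Hypothetical access mix (an assumption, not measured data).
mix = dict(local_frac=0.90, l2_frac=0.09, dram_frac=0.01)
low = average_access_energy_pj(**mix, l2_pj=L2_PJ_RANGE[0])
high = average_access_energy_pj(**mix, l2_pj=L2_PJ_RANGE[1])
print(f"average energy per access: {low:.1f} to {high:.1f} pJ")
```

Even with only 1% of accesses reaching main memory in this made-up mix, those accesses contribute 160 pJ to the average, more than three times the cost of a local access.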

1.1 Methodology

To assess the extent of dark silicon, it is crucial to jointly optimize a large number of design parameters to compose CMPs that are capable of attaining peak performance while staying within the physical constraints. Therefore, we develop first-order analytical models by optimizing the principal components of the processor, such as supply and threshold voltage, clock frequency, cache size, memory hierarchy, and core count. The goal of the analytical models is to derive peak-performance designs and describe the physical constraints of the processor. Detailed parameterized models are constructed according to ITRS* standards. These models help in exploring the design space of multicores. Note that these models do not propose the absolute number of cores or the cache size required to achieve peak performance. Instead, they capture the first-order effects of technology scaling to uncover the trends leading to dark silicon. Performance is measured in terms of aggregate server throughput, and heterogeneous computing is examined separately.

In order to construct such models, we have made some design configuration choices for hardware, bandwidth, technology, power, and area models, as described in the next section in detail.

2- DESIGN CHOICES

2.1 Hardware Model

CMPs are built from three types of cores: general-purpose (GPP), embedded (EMB), and specialized (SP). GPP cores are scalar, in-order, four-way multithreaded cores that provide high throughput in a server environment, achieving a 1.7x speedup over a single-threaded core [7]. EMB cores represent a power-conscious design paradigm and are similar to GPP cores in performance. SP cores contain specialized hardware, e.g., GPUs, digital signal processors, and field-programmable gate arrays. At any time, only the hardware components best suited to the given workload are powered up. SP cores outperform GPP cores by 20x while consuming 10x less power.

2.2 Technology Model

CMPs are modeled across 65nm, 45nm, 32nm, and 20nm fabrication technologies following ITRS projections. Transistors with a high threshold voltage Vth have lower leakage current. Therefore, high-Vth transistors are used to mitigate the effect of the power wall [3]. Three configurations are used to explore the characteristics and behavior of the model: high-performance (HP) transistors for the entire chip, HP transistors for the cores with low-operating-power (LOP) transistors for the cache, and LOP transistors for the entire chip.

2.3 Area Model

The model restricts the die area to 310 mm². Interconnect and system-on-chip components occupy 28% of the area, and the remaining 72% is for cores and cache. We estimate core areas by scaling existing designs for each type of core according to ITRS standards. The UltraSPARC T1 core is scaled for GPP cores and the ARM11 for EMB and SP cores.
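The area constraint can be sketched as a simple budget calculation. The die area and the 72% share come from the text. The core area and cache density used in the example are hypothetical placeholders, not the scaled UltraSPARC T1 or ARM11 figures.

```python
DIE_AREA_MM2 = 310.0     # die area limit from the text
CORE_CACHE_SHARE = 0.72  # share left for cores and cache (28% goes to interconnect/SoC)


def max_cores(cache_mb: float, core_area_mm2: float, cache_mm2_per_mb: float) -> int:
    """Number of cores that fit once the cache has taken its share of the die."""
    budget_mm2 = DIE_AREA_MM2 * CORE_CACHE_SHARE
    remaining_mm2 = budget_mm2 - cache_mb * cache_mm2_per_mb
    if remaining_mm2 <= 0:
        return 0
    return int(remaining_mm2 // core_area_mm2)


# Hypothetical values: 1.5 mm^2 per core, 2 mm^2 per MB of cache.
for cache_mb in (4, 16, 64):
    cores = max_cores(cache_mb, core_area_mm2=1.5, cache_mm2_per_mb=2.0)
    print(f"{cache_mb:>2} MB cache -> up to {cores} cores")
```

The trade-off is visible directly: every megabyte of cache added takes area away from cores.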

2.4 Performance Model

Amdahl’s Law [9] is the basis of the performance model, which assumes 99% application parallelism. The performance of a single core is computed by aggregating UIPC (user instructions committed per cycle). UIPC is computed in terms of the average memory access time, given by the following formula:

AverageMemoryAccessTime = HitTime + MissRate × MissPenalty

UIPC is proportional to the overall system throughput. Detailed formulas, derivations, and calculations of the performance model are available at [4][5].
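As a minimal sketch, the two ingredients named in this section can be written down directly. The Amdahl function uses the classical form of the law, which may differ from the exact variant used in the paper, and all numeric inputs below except the 99% parallelism are hypothetical.

```python
def average_memory_access_time(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """AverageMemoryAccessTime = HitTime + MissRate x MissPenalty."""
    return hit_time + miss_rate * miss_penalty


def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Classical Amdahl's Law: the serial part does not speed up with more cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical cache parameters: 2-cycle hit, 5% miss rate, 200-cycle penalty.
amat = average_memory_access_time(hit_time=2.0, miss_rate=0.05, miss_penalty=200.0)
print(f"average memory access time: {amat:.1f} cycles")

# 99% parallelism, as assumed by the performance model.
for n in (10, 100, 1000):
    print(f"{n:>4} cores -> speedup {amdahl_speedup(0.99, n):.1f}x")
```

Even with 99% parallelism, the serial 1% caps the speedup at 100x regardless of core count, which is one reason adding cores alone does not deliver proportional performance.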

2.5 L2 cache miss rate and data-set evolution models

Estimating the cache miss rate for the given workload is important because it plays a governing role in performance. Miss rates for L2 caches between 256KB and 64MB are estimated by curve-fitting empirical measurements. The x-shifted power law y = α (x + β)^γ provides the best fit for our data, with an average error rate of only 1.3%. Miss-rate scaling formulas are listed in detail in [4].
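To illustrate what fitting an x-shifted power law involves, here is a small sketch using only the standard library. It grid-searches β and, for each candidate, fits α and γ by least squares in log space. This fitting procedure is an assumption chosen for simplicity, not the method used in [4], and the data points are synthetic, generated from made-up parameters.

```python
import math


def fit_shifted_power_law(xs, ys, beta_candidates):
    """Fit y = alpha * (x + beta) ** gamma.

    For each candidate beta, log y = log(alpha) + gamma * log(x + beta) is a
    straight line, so alpha and gamma follow from ordinary least squares.
    The beta with the smallest squared error in log space is kept.
    """
    best = None
    log_ys = [math.log(y) for y in ys]
    n = len(xs)
    for beta in beta_candidates:
        us = [math.log(x + beta) for x in xs]
        mean_u = sum(us) / n
        mean_v = sum(log_ys) / n
        s_uu = sum((u - mean_u) ** 2 for u in us)
        s_uv = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, log_ys))
        gamma = s_uv / s_uu
        log_alpha = mean_v - gamma * mean_u
        sse = sum((log_alpha + gamma * u - v) ** 2 for u, v in zip(us, log_ys))
        if best is None or sse < best[0]:
            best = (sse, math.exp(log_alpha), beta, gamma)
    _, alpha, beta, gamma = best
    return alpha, beta, gamma


# Synthetic miss rates for cache sizes from 256KB to 64MB (sizes in MB).
# The generating parameters are hypothetical, not values from the paper.
sizes_mb = [0.25, 0.5, 1, 2, 4, 8, 16, 32, 64]
miss_rates = [0.1 * (x + 0.5) ** -0.5 for x in sizes_mb]

alpha, beta, gamma = fit_shifted_power_law(sizes_mb, miss_rates, [0.0, 0.25, 0.5, 0.75, 1.0])
print(f"alpha={alpha:.3f}, beta={beta:.2f}, gamma={gamma:.3f}")
```

On real measurements the grid for β would need to be finer, and a nonlinear least-squares routine would be a more natural choice.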

2.6 Off-chip bandwidth Model

Chip bandwidth requirements are modeled by estimating the off-chip activity rate, which depends on clock frequency and core performance. Off-chip bandwidth demand is proportional to the L2 miss rate, core count, and core activity. The maximum available bandwidth is given by the product of the number of pads and the maximum off-chip clock rate. In our model, we treat 3D-stacked memory as a large L3 cache due to its high capacity and high bandwidth. Each layer of 3D-stacked memory holds 8 Gbits at 45nm technology and consumes 3.7 W in the worst case. We model 8 layers with a total capacity of 8 GBytes, plus one extra layer for control logic. The addition of these 9 layers raises the chip temperature by 10°C, and we account for the corresponding power dissipation to counter this effect. We estimate that 3D stacking will improve memory access time by 32.5% because it makes communication between the cores and 3D memory very efficient.
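A rough sketch of the bandwidth check follows. It assumes that available bandwidth scales as the number of pads times the off-chip clock, with one bit per pad per cycle, and that each L2 miss transfers one 64-byte line. Those assumptions, the function names, and the example numbers are illustrative and not taken from the paper. The last helper just reproduces the stacked-memory capacity arithmetic from the text.

```python
def available_bandwidth_gbps(pads: int, offchip_clock_ghz: float) -> float:
    """Available off-chip bandwidth in Gbit/s (assumes 1 bit per pad per cycle)."""
    return pads * offchip_clock_ghz


def demanded_bandwidth_gbps(cores: int, l2_miss_rate: float,
                            accesses_per_ns: float, line_bytes: int = 64) -> float:
    """Off-chip demand in Gbit/s: grows with miss rate, core count, and activity.

    One bit per nanosecond equals one Gbit/s, so no extra unit conversion is needed.
    """
    return cores * l2_miss_rate * accesses_per_ns * line_bytes * 8


def stacked_memory_gbytes(layers: int = 8, gbits_per_layer: int = 8) -> float:
    """Capacity of the 3D-stacked memory: 8 layers of 8 Gbits each."""
    return layers * gbits_per_layer / 8


supply = available_bandwidth_gbps(pads=512, offchip_clock_ghz=1.0)  # hypothetical chip
for cores in (128, 256):
    demand = demanded_bandwidth_gbps(cores, l2_miss_rate=0.01, accesses_per_ns=0.5)
    verdict = "fits" if demand <= supply else "bandwidth-limited"
    print(f"{cores} cores: demand {demand:.1f} of {supply:.1f} Gbit/s -> {verdict}")

print(f"3D-stacked capacity: {stacked_memory_gbytes():.0f} GBytes")
```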

2.7 Power Model

Total chip power is calculated by adding the static and dynamic power of each component, such as the cores, cache, I/O, and interconnect. We use ITRS data to determine the maximum available power for air-cooled chips with heat sinks. Our model takes the maximum power limit as input and discards all CMP designs that exceed it. Liquid cooling technologies can increase the maximum power; however, thermal cooling methods have not yet been applied successfully to cores. The dynamic power of N cores and the L2 cache is computed using the formulas given in detail in the paper.
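The filtering step described here can be sketched as follows. The component names follow the text, while the `Component` class, the wattages, the design names, and the 130 W limit are hypothetical values chosen only to show the mechanism.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    static_w: float   # leakage power in watts
    dynamic_w: float  # switching power in watts


def total_chip_power(components) -> float:
    """Sum static and dynamic power over all components of one design."""
    return sum(c.static_w + c.dynamic_w for c in components)


def designs_within_budget(designs: dict, max_power_w: float) -> dict:
    """Discard every candidate design whose total power exceeds the limit."""
    return {name: parts for name, parts in designs.items()
            if total_chip_power(parts) <= max_power_w}


# Hypothetical candidate designs (all numbers are made up for illustration).
candidates = {
    "many-cores": [Component("cores", 30.0, 120.0), Component("cache", 5.0, 4.0),
                   Component("I/O", 1.0, 6.0), Component("interconnect", 2.0, 8.0)],
    "big-cache": [Component("cores", 12.0, 60.0), Component("cache", 15.0, 10.0),
                  Component("I/O", 1.0, 6.0), Component("interconnect", 2.0, 8.0)],
}

feasible = designs_within_budget(candidates, max_power_w=130.0)
print(sorted(feasible))
```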

*https://en.wikipedia.org/wiki/International_Technology_Roadmap_for_Semiconductors

**Figure 3: Performance of general-purpose (GPP) chip multiprocessors**

3- ANALYSIS

With the models in place, we now demonstrate their use. We explore the peak-performance designs of general-purpose and specialized multicore processors in the next two subsections. We also evaluate the core counts for these designs and conclude with a comparative analysis.

3.1 General purpose multicore processors

We begin by explaining the progression of our peak-performance design-space exploration algorithm using the results shown in figure 3. Figure 3a represents the performance of 20nm GPP CMPs running Apache, using high-performance (HP) transistors for both cores and cache. The graph plots aggregate chip performance as a function of the L2 cache size; that is, a fraction of the die area is dedicated to the L2 cache (represented in MB on the x-axis).

The Area curve shows the performance of a design with unlimited power and off-chip bandwidth but a constrained on-chip die area. The larger the cache, the fewer the cores. Even though only a few cores fit in the remaining die area, each core performs at its best due to the high hit rate of the bigger cache. Performance improves as the L2 cache grows up to 64MB. Beyond this point, the benefit is outweighed by the cost of further reducing the number of cores.

The Power curve shows the performance of a design running at the maximum frequency with power limited by the air-cooling constraint but with unlimited off-chip bandwidth and area. The power constraint restricts aggregate chip performance because running the cores at the maximum frequency requires so much power that the design is limited to very few cores.

The Bandwidth curve represents the performance of a design with unlimited power and die area but limited off-chip bandwidth. In such a design, a larger cache reduces the off-chip bandwidth pressure and improves performance.

The Area+Power curve represents the performance of a design limited in power and area but with unlimited off-chip bandwidth. Such a design jointly optimizes the frequency and voltage of the cores by selecting the peak-performance design for each L2 cache size.

The Peak Performance curve represents the multicore design that respects all the physical constraints. Performance is limited by off-chip bandwidth at first, but beyond 24MB power becomes the main performance limiter. The peak-performance design is achieved at the intersection of the power and bandwidth curves. A large gap between the peak-performance and area curves indicates that a vast area of the silicon in GPP designs cannot be used for more cores because of power constraints.
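The way the constraint curves combine can be sketched as taking, for each cache size, the lowest of the individual limits and then picking the cache size where that envelope is highest. This is a simplification of the exploration described above (it ignores the joint voltage and frequency optimization), and every number in the curves below is invented for illustration rather than read from figure 3.

```python
def peak_design(curves: dict):
    """Return (cache_mb, performance, limiting_constraint) of the best design.

    curves maps a constraint name to {cache size in MB: achievable performance}.
    At each cache size the design can do no better than its tightest constraint.
    """
    cache_sizes = sorted(next(iter(curves.values())))
    envelope = {mb: min(curve[mb] for curve in curves.values()) for mb in cache_sizes}
    best_mb = max(envelope, key=envelope.get)
    limiter = min(curves, key=lambda name: curves[name][best_mb])
    return best_mb, envelope[best_mb], limiter


# Hypothetical curves: bandwidth-limited performance rises with cache size,
# power-limited performance falls, and the area limit sits far above both.
sizes = [1, 2, 4, 8, 16, 24, 32, 64]
curves = {
    "area": dict(zip(sizes, [200, 198, 195, 190, 180, 170, 160, 120])),
    "power": dict(zip(sizes, [60, 58, 56, 53, 50, 48, 44, 36])),
    "bandwidth": dict(zip(sizes, [10, 14, 20, 28, 40, 47, 55, 70])),
}

cache_mb, performance, limiter = peak_design(curves)
print(f"peak at {cache_mb} MB: performance {performance}, limited by {limiter}")
print(f"gap to area curve at the peak: {curves['area'][cache_mb] - performance}")
```

In this made-up data the peak sits where the rising bandwidth curve meets the falling power curve, and the distance to the area curve at that point is the performance left on the table by silicon that cannot be powered.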

Figure 3b represents the performance of designs that use high-performance (HP) transistors for the cores and low-operating-power (LOP) transistors for the cache. Similarly, figure 3c represents the performance of designs with LOP transistors for both the cores and the cache. Designs using HP transistors can power up only 20% of the cores that fit in the die area at 20nm. On the other hand, designs using LOP transistors for the cache (figure 3b) yield higher performance than designs using only HP transistors because they enable larger caches, which support approximately double the number of cores, i.e., 35–40% of the cores in our case. LOP devices are suitable for implementing both the cores and the cache (figure 3c) and yield higher power efficiency.

Hence, we can conclude that the peak-performance design offered by general-purpose multicore processors results in a large area of dark silicon when cores and caches are built with HP transistors. However, using LOP transistors reduces the dark area to some extent, as explained earlier and shown in figure 3.

Core Counts Analysis: To analyze the number of cores actually used, figure 4a plots the theoretical number of cores that can fit on the specified die area at each technology node, along with the core counts of the peak-performance designs. Due to chip power limits, HP-based designs became impossible after 2013. Although LOP-based designs provided a way forward, the wide gap between the die area limit and the LOP designs indicates that an increasing fraction of the die area will remain dark because of underutilized cores.

3.2 Specialized multicore processors

We now demonstrate the peak-performance designs using GPP, embedded (EMB), and specialized (SP) cores built with LOP transistors at the 20nm technology node.

An extreme application of SP cores is evaluated by considering a specialized computing environment in which a multicore chip contains hundreds of diverse application-specific cores. Only the cores that are most useful for the running application are activated, and the rest of the on-chip cores remain powered off. The SP design delivers high performance with fewer but more powerful cores. SP cores are highly power-efficient and significantly outperform GPP and EMB cores.

Core Counts Analysis: Figure 4b shows the comparative analysis of core counts for the peak-performance designs across the mentioned core types. Peak-performance SP designs employ only 16–32 cores, and the cache occupies a large portion of the die area. Low-core-count SP designs outperform other designs with 99.9% parallelism. The high-performance characteristics of SP cores push the power envelope further than is possible with other core designs. SP multicores attain a 2x to 12x speedup over EMB and GPP multicore designs and are ultimately constrained by the limited off-chip bandwidth. 3D-stacked memory is used to mitigate the bandwidth constraint. It pushes the bandwidth limit out and leads to a high-performance, power-constrained design (figure 4c). Eliminating the off-chip bandwidth bottleneck takes us back to the power-limited regime with an underutilized die area (figure 4b). Reducing off-chip bandwidth pressure by combining 3D memory with specialized cores improves the speedup by 3x at 20nm and reduces the pressure on the on-chip cache size. On the other hand, GPP and EMB chip multiprocessors attain less than 35 percent performance improvement.

4- CURRENT STATE-OF-THE-ART

The phenomenon of dark silicon emerged around 2005, when processor designers started increasing the core count to exploit Moore’s Law scaling rather than improving single-core performance. It then became apparent that Moore’s Law and Dennard scaling had diverged: transistor counts kept growing while power density stopped holding constant. Dennard scaling states that as transistors shrink, their power density stays constant, so power use stays in proportion to area [2]. Initially, the tasks of the processor were divided among different areas to achieve efficient processing and minimize the impact of dark silicon. This division led to concepts such as floating-point units, and later it was realized that dividing and distributing the processor’s tasks among specialized modules could also help alleviate the problem of dark silicon. These specialized modules resulted in a smaller processor area with efficient task execution, which made it possible to turn off one group of transistors before starting another. Using a few transistors efficiently for one task leaves power available to keep transistors working in another part of the processor. These concepts advanced to System on Chip (SoC) and System in Chip (SiC) processors. Transistors in Intel processors also turn on and off according to the workload. However, the specialized multicore design discussed in this report requires further research to understand its impact on other SoC and SiC multicore processors that have different requirements for bandwidth and temperature.

5- RELATED WORK

In this section, we will discuss other strategies, techniques, or trends proposed in the literature about the phenomenon of dark silicon.

Jörg Henkel et al. introduced new trends in dark silicon in 2015. Their paper focuses on the thermal aspects of dark silicon. Extensive experiments show that a chip’s total power budget is not the only reason behind dark silicon; power density and related thermal effects also play a major role. Therefore, they propose Thermal Safe Power (TSP) as a more efficient power budget. One proposed trend states that considering the peak temperature constraint reduces the dark area of the silicon. Moreover, the use of Dynamic Voltage Frequency Scaling increases overall system performance and decreases dark silicon [8].

Rahmani et al. presented a run-time resource management system in 2018 known as adBoost. It employs a dark-silicon-aware run-time application mapping strategy to achieve thermal-aware performance boosting in multicore processors. It benefits from patterning (PAT) of dark silicon. PAT is a mapping strategy that evenly distributes the temperature across the chip to enhance the utilizable power budget. It offers lower temperatures and a higher power budget, and it sustains longer periods of boosting. Experiments show that it yields 37 percent better throughput in comparison with other state-of-the-art performance boosters [11].

Li et al. proposed a thermal model in 2018 to solve the fundamental problem of determining whether an on-chip multiprocessor system can run the desired job while maintaining its reliability and keeping every core within a safe temperature range. The proposed thermal model is used for quick chip temperature prediction. It finds the optimal task-to-core assignment by predicting the minimum chip peak temperature. If the minimum chip peak temperature exceeds the safe temperature limit, a newly proposed heuristic algorithm known as temperature constrained task selection (TCTS) optimizes system performance within the chip's safe temperature limit. The optimality of the TCTS algorithm is formally proved, and extensive performance evaluations show that this model reduces the chip peak temperature by 10°C compared to traditional techniques. Overall system performance is improved by 19.8% under the safe temperature limit. Finally, a real case study is conducted to prove the feasibility of this systematic technique [10].

6- CONCLUSION

Continuous scaling of multicore processors is limited by power, temperature, and bandwidth constraints. These constraints prevent conventional multicore designs from scaling beyond a few tens to low hundreds of cores. As a result, a large portion of a processor chip must stay powered off so that the rest of the chip can keep working. We have discussed a technique to repurpose the unused die area (dark silicon) by constructing specialized multicores. Specialized (SP) multicores implement a large number of workload-specific cores and power up only those cores that closely match the requirements of the executing workload. A detailed first-order model is proposed to analyze the design of SP multicores while considering all the physical constraints. Extensive workload experiments are performed to compare the model against general-purpose multicores. SP multicores outperform other designs by 2x to 12x. Although SP multicores are an appealing design, modern workloads must be characterized to identify the computational segments that are candidates for off-loading to specialized cores. Moreover, software infrastructure and a runtime environment are also required to facilitate code migration at the appropriate granularity.

REFERENCES

[1] 1965. Moore’s Law. https://en.wikipedia.org/wiki/Moore%27s_law

[2] 1974. Dennard Scaling. https://en.wikipedia.org/wiki/Dennard_scaling

[3] Pradip Bose. 2011. Power Wall. Springer US, Boston, MA, 1593–1608. https://doi.org/10.1007/978-0-387-09766-4_499

[4] Nikolaos Hardavellas. 2009. Chip multiprocessors for server workloads. Supervisors: Babak Falsafi and Anastasia Ailamaki.

[5] Nikolaos Hardavellas, Michael Ferdman, Anastasia Ailamaki, and Babak Falsafi. 2010. Power scaling: the ultimate obstacle to 1k-core chips. (2010).

[6] Nikos Hardavellas, Michael Ferdman, Babak Falsafi, and Anastasia Ailamaki. 2011. Toward dark silicon in servers. IEEE Micro 31, 4 (2011), 6–15.

[7] Nikos Hardavellas, Ippokratis Pandis, Ryan Johnson, Naju Mancheril, Anastassia Ailamaki, and Babak Falsafi. 2007. Database Servers on Chip Multiprocessors: Limitations and Opportunities. In CIDR, Vol. 7. Citeseer, 79–87.

[8] Jörg Henkel, Heba Khdr, Santiago Pagani, and Muhammad Shafique. 2015. New trends in dark silicon. In 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC). IEEE, 1–6.

[9] Mark D Hill and Michael R Marty. 2008. Amdahl’s Law in the multicore era. Computer 41, 7 (2008), 33–38.

[10] Mengquan Li, Weichen Liu, Lei Yang, Peng Chen, and Chao Chen. 2018. Chip temperature optimization for dark silicon many-core systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 5 (2018), 941–953.

[11] Amir M Rahmani, Muhammad Shafique, Axel Jantsch, Pasi Liljeberg, et al. 2018. adBoost: Thermal Aware Performance Boosting through Dark Silicon Patterning. IEEE Trans. Comput. 67, 8 (2018), 1062–1077.