Computer System Evaluation
Sources:
- UWashington: CSE378, Lecture13
- John L. Hennessy & David A. Patterson. (2019). Chapter 1. Fundamentals of Quantitative Design and Analysis. Computer Architecture: A Quantitative Approach (6th ed.). Elsevier Inc.
Clock cycle time
- One "cycle" is the minimum time it takes the CPU to do any work1.
- The clock cycle time or clock period is just the length of a cycle.
- The clock rate, or frequency, is the reciprocal of the cycle time.
- Generally, a higher frequency is better. Some examples illustrate some typical frequencies.
- A 500MHz processor has a cycle time of 2ns.
- A 2GHz (2000MHz) CPU has a cycle time of just 0.5ns (500ps).
Moore’s Law
Moore’s Law: The number of transistor on a chip doubles every period of time (2, 1.5, or 2.5 years).
It is enabled by Dennard Scaling.
Dennard Scaling: As you reduce the size of the transistors, energy/power goes down proportionally.
- However, Dennard Scaling started breaking down around 2003.
Energy Consumption in CMOS Devices
The energy consumption in CMOS devices primarily occurs during the switching of transistors from one state to another (from ON to OFF or vice versa).
Total Power Comsumption = Dynamic Power Consumption + Static Power Consumption
Dynamic Power Consumption
Dynamic power is consumed when transistors switch states. It is given by the equation: \[ P_{\text {dynamic }}=\alpha \cdot C_L \cdot V_{D D}^2 \cdot f, \]
where: - \(P_{\text {dynamic }}\) is the dynamic power consumption,
\(\alpha\) is the activity factor (fraction of the time the circuit is switching, usually equals to 1),
\(C_L\) is the load capacitance,
\(V_{D D}\) is the supply voltage, and
\(f\) is the switching frequency.
The energy consumed per switching event is: \[ E_{\text {switch }}=C_L \cdot V_{D D}^2 \] ## Static Power Consumption:
Static power is consumed due to leakage currents when the transistors are in a non-switching (idle) state. It can be expressed as: \[ P_{\text {static }}=I_{\text {leak }} \cdot V_{D D} \] where \(I_{\text {leak }}\) is the leakage current.
Total power consumption
\[ \begin{aligned} P_\text{total} &= P_{\text {dynamic }} + P_{\text {static }} \\ &= \alpha \cdot C_L \cdot V_{D D}^2 \cdot f + I_{\text {leak }} \cdot V_{D D} \end{aligned} \]
How to reduce power consumption?
- Lower the voltage
- Lower the frequency
1 | def total_power_consumption(voltage, frequency, capacitive_load, leakage_current, alpha=1.0, name=None, description=None): |
Latency and throughput
- Latency: Time a single fixed task costs to finish.
- For example, the execution time of a benchmark is the latency.
- Throughput: Number of works/operations done in a fixed period of time.
CPI
The average of Cycles Per Instruction in a given process (CPI) is defined by \[ \mathrm{CPI}=\sum_{i=1}^n \mathrm{CPI}_i \times f_i \quad \text { where } f_i=\frac{\mathrm{I}_i}{\text { Instruction Count }} \]
\(f_i=\): Fraction of instructions of type \(i\).
CPI is a measure related to the latency.
For example, for the multi-cycle MIPS:
Load: 5 cycles Store: 4 cycles R-type: 4 cycles Branch: 3 cycles Jump: 3 cycles
If a program has
50% R-type instructions 10% load instructions 20% store instructions 8% branch instructions 2% jump instructions
then what is the CPI? \[ \mathrm{CPI}=(4 \times 50+5 \times 10+4 \times 20+3 \times 8+3 \times 2) / 100=3.6 \]
Iron Law
The performance of a processor is the time it takes to execute a program: \(\frac{\text { Time }}{\text { Program }}\). This can be further broken down into three factors: \({ }^{4]}\) \[ \frac{\text { Time }}{\text { Program }} = \frac{\text { Instructions }}{\text { Program }} \times \frac{\text { ClockCycles }}{\text { Instruction }} \times \frac{\text { Time }}{\text { ClockCycles }} \] We have following dependencies:
- the number of instructions depends on architecture.
- the number of cycles per instruction (CPI) depends on micro-architecture.
- time per cycle depends on technology.
Meanwhile, the frequency of the processor is \[ \text { frequency } = \frac{\text { ClockCycles }}{\text { Time }} \]
1 | # Given data |
Amdahl's Law
"the overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used." \[ \text { Speedup }_{\text {overall }}= \frac {\text { Old Time }} {\text {New Time}} = \frac{1}{(1-P)+\frac{P}{S}} \] where: - \(P\) is the proportion of the execution time that the improvement affects, and - \(S\) is the speedup of the improved part (usually equals to the number of the CPU/PGPU cores). - Here we hyphothesize \(t_{\text {old }}\) and \(t_{\text {new }}\) have the same frequency.
The proof of this equation is as follows:
By the definition of CPI, we have \(t = I \times C P I \times \frac{1}{f}\), where \(t\) is the total time costing.
We also know that, due to the effect of parallelism, we obtain \(C P I_{\text {new }}=C P I_{\text {old }} \times\left(1-P+\frac{P}{S}\right)\).
Here we hyphothesize \(t_{\text {old }}\) and \(t_{\text {new }}\) have the same frequency, so that we have \(t_{\text {new }}=t_{\text {old }} \times\left(1-p+\frac{p}{c}\right)\), the derivation is: \[ \begin{aligned} & C P I_{\text {new }}=C P I_{\text {old }} \times\left(1-P+\frac{P}{S}\right) \\ & I \times C P I_{\text {new }} \times \frac{1}{f}=I \times C P I_{\text {old }} \times\left(1-P+\frac{P}{S}\right) \times \frac{1}{f} \\ & t_{\text {new }}=t_{\text {old }} \times\left(1-P+\frac{P}{S}\right) \\ & \text { Speedup }_{\text {overall }}= \frac {t_{\text {old }}} {t_{\text {new }}} = \frac{1}{(1-P)+\frac{P}{S}}. \end{aligned} \]
$$
Example 1
There is a program in which 89% can be parallelized.
When you run this program on 4 cores, what is the speedup compared to the serial version?
Solution:
1/ (0.11 + 0.89/4) = 3.007518796992481
Example 2
Assume that you have a program which has a kernel, or the "main" part of the program, which can be accelerated by a GPU. The kernel makes up 95% of the program's execution time on a CPU system. There are two different GPU systems you could run this code on. System A has 48 GPU cores and provides a speedup of 40x for the kernel compared to the CPU. System B has 96 GPU cores and provides a speedup of 50x for the kernel compared to the CPU. What is the overall speedup for the entire program on System B compared the System A?
Solution:
Given: - The kernel makes up \(95 \%\) of the program's execution time \((P=0.95)\). - System A provides a 40x speedup for the kernel. - System B provides a 50x speedup for the kernel.
First, let's calculate the execution time of the program on both systems in terms of the original CPU execution time ( \(T_{\mathrm{CPU}}\) ).
System A Speedup Speedup \(_{\mathrm{A}}=\frac{1}{(1-0.95)+\frac{0.95}{40}}\) System B Speedup Speedup \(_B=\frac{1}{(1-0.95)+\frac{0.95}{50}}\) Let's calculate these:
System A \[ \begin{aligned} & \operatorname{Speedup}_A=\frac{1}{0.05+\frac{0.95}{40}} \\ & \text { Speedup }_A=\frac{1}{0.05+0.02375} \\ & \text { Speedup }_A=\frac{1}{0.07375} \approx 13.56 \end{aligned} \]
System B \[ \begin{aligned} & \text { Speedup }_B=\frac{1}{0.05+\frac{0.95}{50}} \\ & \text { Speedup }_B=\frac{1}{0.05+0.019} \\ & \text { Speedup }_B=\frac{1}{0.069} \approx 14.49 \end{aligned} \]
Now, to find the overall speedup of System B compared to System A, we divide the speedup of System B by the speedup of System A: \[ \begin{aligned} & \text { Overall Speedup }=\frac{\text { Speedup }_B}{\text { Speedup }_A} \\ & \text { Overall Speedup }=\frac{14.49}{13.56} \approx 1.069 \end{aligned} \]
So, the overall speedup for the entire program on System B compared to System A is approximately 1.069.
Example 3
There is a program in which 55% can be parallelized.
What is the maximum possible speedup you can achieve through parallelization?
Solution:
Considering \[ \text { Speedup }_{\text {overall }}=\frac{1}{(1-P)+\frac{P}{S}} , \]
if \(S \rightarrow + \infty\), then \(\frac{P}{S} \rightarrow0\), so the maximum speedup is \(\frac{1}{(1-P)}\) = 1 / 0.45 = 2.222....
Example 4
Assume you have an application where 75.0% is parallelizable.
You have three systems, a one core system and a two core system.
Cores | Capacitance | Voltage | Frequency |
---|---|---|---|
1 | 1 * c | 1V | 2 GHz |
2 | 1.8 * c | 0.9V | 1.5GHz |
Given the Time, Power and Energy of the one core system:
- Time = 1 * t
- Power = 2 * c
- Energy = 2 ct
calculate the Time, Power and Energy of the two core system:
Solution:
Using \[ t_{\text {new }}=t_{\text {old }} \times\left(1-P+\frac{P}{S}\right) \times \frac{f_{\text {old }}}{f_{\text {new }}}, \] we can compute the Time to be \(t_2=t_1 \times\left(0.25+\frac{0.75}{2}\right) \times \frac{2 G H z}{1.5 G H z}\) = 0.83 * t.
Using \[ P_{\text {dynamic }}=\alpha \cdot C_L \cdot V_{D D}^2 \cdot f , \] we can compute the Power to be 1 * 1.8^2 * 0.9 = 2.19 * c.
So the Energy is: 0.83 * t * 2.19 * c = 1.82 * c * t.
Roofline model
- TPU v2
- Roofline
- 128x128 matrix = 16384 * 2 (cores/chip) * .7 (700 MHz) = 22937.6 (note: fp16)
- 700 GB/s
- 1 byte per op: 700 GFLOPS
- 2:1.4, 4:2.8, 8:6, 16:12, 32:24
- FLOPS/W = 23000/280=82.1429
- Roofline
- CPU (AMD 7950X)
- 16 cores * 4.2 GHz * 16 (AVX-512) * 2 (FMA) = 2.1 TFLOP/s
- Bandwidth: 80 GB/s
- If 1 byte per op, then 80 GFLOP/s
- 2:160, 4:320, 8:640, 16:1280, 32:2560(2.1)
- Power: ~200W
- FLOPS/W = 2100/200=10.5
- GPU H100
- 144 SMs * 128 cores * 1.8 * 2 (FMA) = 60 TFLOP/s
- Bandwidth 3 TB/s
- If 1 byte per op, then 3 TFLOP/s
- 2:6, 4:12, 8:24, 16:48, 32:96 (60)
- Power: 700W
- FLOPS/W = 60000/700=85.7143
In pipelined CPU design, because all stages proceed at the same time, the length of a processor cycle is determined by the time required for the slowest pipe stage.↩︎