01 Moore's Law and Dennard Scaling
2 / 98
Objective
●
Understand the intricacies of advanced IC
design and development
RTL code
module dff(input clk, input d, output reg q);
  always @(posedge clk) q <= d;
endmodule
...
IC layout
4 / 98
PD Contents
5 / 98
Evaluation
●
Presentation on related research work
– PechaKucha format: 20 slides, 20 seconds/slide
●
List of papers posted soon in PD web site
– http://jarnau.site.ac.upc.edu/PD/
●
20% of final mark
6 / 98
Practical approach
●
Theory sessions are useful for lab assignments
– How to write effective test benches
– How to verify your design
– How to synthesize, place and route your design
– How to estimate max freq and area of your design
●
EDA tools for digital synthesis
– Qflow: http://opencircuitdesign.com/qflow/
7 / 98
PD Contents
8 / 98
1. Moore’s Law and
Dennard Scaling
9 / 98
Moore’s Law
The number of transistors in a dense integrated circuit doubles about
every two years.
Dennard Scaling
As transistors shrink, they become faster, consume less power,
and are cheaper to manufacture.
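Moore's Law compounds quickly. As a toy illustration (the numbers are hypothetical, not from the slides), doubling every two years can be projected directly:

```python
def transistors(initial, years, doubling_period=2):
    """Project a transistor count forward assuming Moore's Law:
    the count doubles once per doubling_period years."""
    return initial * 2 ** (years / doubling_period)

# 20 years = 10 doublings: a 1024x increase
print(transistors(1_000, 20))  # 1024000.0
```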
11 / 98
MOS Transistors
●
Metal Oxide Semiconductor
– nMOS and pMOS devices, each with gate, source, and drain terminals
17 / 98
CMOS Logic
●
Inverter
– Input 0: pMOS ON, nMOS OFF → output 1
– Input 1: pMOS OFF, nMOS ON → output 0
27 / 98
CMOS Logic
●
NAND gate
– Inputs 0 and 1: a parallel pMOS conducts and pulls the output up to 1; the series nMOS path is broken
– Output is 0 only when both inputs are 1
39 / 98
CMOS Logic
●
Complementary Metal-Oxide Semiconductor
– pMOS pull-up network
– nMOS pull-down network
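The pull-up/pull-down duality can be sketched in Python (a model for intuition, not a circuit simulator; `cmos_nand` is a hypothetical helper name):

```python
def cmos_nand(a, b):
    """2-input NAND built from complementary networks.

    Pull-up: two pMOS in parallel -> conducts when a is 0 OR b is 0.
    Pull-down: two nMOS in series -> conducts when a is 1 AND b is 1.
    Exactly one network conducts for any input, so the output is
    always driven and never shorted.
    """
    pull_up = (a == 0) or (b == 0)     # pMOS network to VDD
    pull_down = (a == 1) and (b == 1)  # nMOS network to GND
    assert pull_up != pull_down        # complementary by construction
    return 1 if pull_up else 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", cmos_nand(a, b))
```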
40 / 98
CMOS Logic
●
Multiplexers
– Select 0: the transmission gate on the D0 path is ON and passes D0 to the output; the gate on the other path is OFF
46 / 98
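Behaviorally, the transmission-gate multiplexer reduces to a select (a sketch; `mux2` is a hypothetical name):

```python
def mux2(sel, d0, d1):
    """2:1 multiplexer: when sel == 0 the transmission gate on the D0
    path is ON (and the D1 gate is OFF), so D0 reaches the output;
    sel == 1 selects D1 instead."""
    return d0 if sel == 0 else d1

print(mux2(0, "D0", "D1"))  # D0
print(mux2(1, "D0", "D1"))  # D1
```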
CMOS Logic
●
Latches and flip-flops
D latch D flip-flop
47 / 98
Moore’s Law
●
The number of transistors in a dense integrated
circuit doubles about every two years
~2 years
50 / 98
Moore's Law
[Figure: Moore's original 1965 plot — log2 of the number of components per integrated function vs. year, 1959–1975]
51 / 98
Moore’s Law
52 / 98
Moore’s Law
Transistors per Square Millimeter by Year, 1971–2018. Logarithmic scale. Data from Wikipedia.
53 / 98
Technology
●
What does “7nm” or “10nm” technology mean?
54 / 98
Layout Design Rules
●
Define how small features can be and how closely
they can be reliably packed
●
Lambda-based rules
– C. Mead and L. Conway, Introduction to VLSI Systems,
Reading, MA: Addison-Wesley, 1980.
55 / 98
Dennard Scaling
●
As transistors shrink, they become faster, consume
less power, and are cheaper to manufacture.
– Miniaturization provides smaller and faster transistors
●
Transistor dimensions scaled by 30% (S → 0.7·S)
●
Delay reduced by 30%
●
Frequency increased by ~40%
[Figure: a square transistor of side S (area S²) shrinks to side 0.7·S (area 0.5·S²)]
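The per-generation factors above all follow from the 0.7x linear shrink; a quick numeric check:

```python
shrink = 0.7                 # linear dimensions scaled by 30% (S -> 0.7*S)
area = shrink ** 2           # ~0.5x area: twice the transistors per mm^2
delay = shrink               # gate delay tracks dimensions: reduced by 30%
frequency = 1 / delay        # ~1.43x: frequency increased by ~40%
print(round(area, 2), round(delay, 2), round(frequency, 2))  # 0.49 0.7 1.43
```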
61 / 98
Performance Scaling
●
Smaller transistors are faster
– ~1.4x higher performance per generation
●
More transistors allow more functionality
– Better branch predictors
– Larger caches
– ILP: superscalar, out-of-order execution
– DLP: vector unit
– Better memory controller
62 / 98
CPU Performance
63 / 98
Hennessy, J. L., & Patterson, D. A. (2011). Computer architecture: a quantitative approach. Elsevier.
CPU Frequency
64 / 98
CPU Frequency
Power wall!
65 / 98
CMOS Power
●
Dynamic Power
– Switching power: P_dyn = α·C·V²·f
●
Static Power (Leakage): P_sta = I_leak·V
– Power dissipated even
when not switching
– Around one third of the
total power in current
technology
P_total = P_dyn + P_sta
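The two components can be combined in a few lines (the values are illustrative, chosen so that static power is about one third of the total as the slide states; they are not measured data):

```python
def cmos_power(alpha, C, V, f, I_leak):
    """CMOS power: dynamic switching term alpha*C*V^2*f
    plus static leakage term I_leak*V."""
    p_dyn = alpha * C * V ** 2 * f
    p_sta = I_leak * V
    return p_dyn, p_sta

# activity 0.2, 20 nF switched capacitance, 1 V supply, 2 GHz, 4 A leakage
p_dyn, p_sta = cmos_power(alpha=0.2, C=20e-9, V=1.0, f=2e9, I_leak=4.0)
print(p_dyn, p_sta)  # ~8 W dynamic, 4 W static (one third of ~12 W total)
```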
66 / 98
CMOS Power
●
Static Power
– Gate leakage (Igate)
– Subthreshold leakage (Isub)
– Junction leakage (Ijunct)
– Depends on the temperature
67 / 98
CMOS Power
Im, Y., Cho, E. S., Choi, K., & Kang, S. (2007, March). Development of Junction
Temperature Decision (JTD) Map for Thermal Design of Nano-scale Devices
Considering Leakage Power. In Twenty-Third Annual IEEE Semiconductor Thermal
Measurement and Management Symposium (pp. 63-67). IEEE.
68 / 98
Frequency vs Power
69 / 98
Power Wall
70 / 98
Power Wall
71 / 98
Power Wall and ILP Limits
●
Cannot increase performance by raising the
frequency due to thermal constraints
●
Diminishing returns in ILP
– Computer architects optimized superscalar out-of-order CPUs for decades
●
But Moore’s law still provides more and more
transistors
●
What can we do with the extra transistors?
72 / 98
Multi-core Processors
https://www.guru3d.com/articles-summary/amd-ryzen-5-2400g-review,2.html 73 / 98
Multi-core Processors
●
Use extra transistors to include multiple cores in
the same chip
●
Exploit TLP (Thread-Level Parallelism)
●
Programmer has to manually extract parallelism
– The Free Lunch Is Over. A Fundamental Turn Toward
Concurrency in Software. Herb Sutter.
http://www.gotw.ca/publications/concurrency-ddj.htm
74 / 98
CPU Trends
75 / 98
End of Dennard Scaling
●
According to Dennard scaling, power density remains
constant
– Smaller transistors consume less power
– If frequency is not increased, total power remains the same
●
Dennard scaling did not take subthreshold leakage
into account
●
In current technology, more transistors result in more
power
– Power density increases even at the same frequency!
P = α·C·V²·f + I_leak·V
76 / 98
Power Density
John Hennessy and David Patterson. “A New Golden Age for Computer Architecture: Domain-Specific
Hardware/Software Co-Design, Enhanced Security, Open Instruction Sets, and Agile Chip Development
77 / 98
Dark Silicon
●
Dark Silicon and the End of Multicore Scaling.
Esmaeilzadeh, H., Blem, E., Amant, R. S.,
Sankaralingam, K., & Burger, D. ISCA 2011.
●
Part of the chip must be powered off due to
thermal constraints
– 22 nm: 21% of the chip powered off
– 8 nm: ~50% of the chip powered off
78 / 98
Multicore Limitations
●
Increasing the number of cores is no longer
an effective solution to improve performance
– Due to the dark silicon problem, not all the cores
can be powered at the same time
– Diminishing returns in TLP
●
But Moore’s law still provides more and more
transistors
●
What can we do with the extra transistors?
79 / 98
Hardware Accelerators
●
Welcome to the golden age of hardware
accelerators!
80 / 98
https://www.hotchips.org/hc30/1conf/1.05_AMD_APU_AMD_Raven_HotChips30_Final.pdf
Hardware Accelerators
Qualcomm Snapdragon 835 [1] NVIDIA “Parker” SoC [2]
1. https://www.notebookcheck.net/Qualcomm-Snapdragon-835-SoC-Benchmarks-and-Specs.207842.0.html 81 / 98
2. https://www.hotchips.org/wp-content/uploads/hc_archives/hc28/HC28.22-Monday-Epub/HC28.22.30-Low-Power-Epub/
HC28.22.322-Tegra-Parker-AndiSkende-NVIDIA-v01.pdf
ASIC
●
Application Specific Integrated Circuit
https://ngcodec.com/markets-cloud-transcoding/ 82 / 98
Software Overheads
83 / 98
Software Overheads
la $t0, v
la $t1, v+16
li $t2, 0x40000000
mtc1 $t2, $f2
for: lwc1 $f0, 0($t0)
mul.s $f0, $f0, $f2
swc1 $f0, 0($t0)
addiu $t0, $t0, 4
bne $t0, $t1, for
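For reference, the five-instruction loop above just scales a 4-element float vector in place by 2.0 (0x40000000 is the IEEE-754 single-precision encoding of 2.0f); in Python:

```python
# v spans 16 bytes: four 4-byte single-precision floats.
v = [1.0, 2.0, 3.0, 4.0]
for i in range(len(v)):   # one lwc1/mul.s/swc1/addiu/bne iteration each
    v[i] *= 2.0
print(v)  # [2.0, 4.0, 6.0, 8.0]
```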
84 / 98
https://en.wikibooks.org/wiki/Microprocessor_Design/Pipelined_Processors
Software Overheads
Setup: la $t0, v · la $t1, v+16 · li $t2, 0x40000000 · mtc1 $t2, $f2
Per-instruction accesses, in bytes:
Instruction            Read I$  R/W D$  Read RF  Write RF  ADD/SUB  MUL  R MM  W MM
lwc1  $f0, 0($t0)         4       4       4        4         4      –     16    –
mul.s $f0, $f0, $f2       4       –       8        4         –      4      –    –
swc1  $f0, 0($t0)         4       4       8        –         4      –      –   16
addiu $t0, $t0, 4         4       –       4        4         4      –      –    –
bne   $t0, $t1, for       4       –       8        –         4      –      –    –
TOTAL                    20       8      32       12        16      4     16   16
86 / 98
ASIC vs CPU
[Figure: ASIC datapath — read X from main memory, multiply by 2.0f, write X back]
ASIC totals (bytes): MUL 4, R MM 16, W MM 16
CPU totals (bytes): Read I$ 20, R/W D$ 8, Read RF 32, Write RF 12, ADD/SUB 16, MUL 4, R MM 16, W MM 16
ASIC does much less work, so it is faster and consumes less energy
90 / 98
Hardware Accelerators
91 / 98
Shao, Yakun Sophia, et al. "The aladdin approach to accelerator design and modeling."
IEEE Micro 35.3 (2015): 58-70.
Hardware Accelerators
92 / 98
Sea of Accelerators
Y. S. Shao, B. Reagen, G. Wei and D. Brooks, "Aladdin: A pre-RTL, power-performance accelerator simulator enabling
large design space exploration of customized architectures," 2014 ACM/IEEE 41st International Symposium on Computer
Architecture (ISCA), 2014, pp. 97-108.
93 / 98
Hardware Accelerators
●
ASICs provide higher performance and
energy-efficiency
●
But they are less programmable and have
large development costs
●
Only accelerate in hardware those applications
that are widely popular and compute-intensive
●
Example: machine learning accelerators
94 / 98
Google TPU
Jouppi, Norman P., et al. "In-datacenter performance analysis of a tensor processing unit."
Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on. IEEE,
2017.
●
92 TOPS
●
28 MB on-chip
●
15x – 30x speedup
●
30x – 80x higher Perf/W
95 / 98
Turing Lecture at ISCA 2018
https://www.acm.org/hennessy-patterson-turing-lecture 96 / 98
What about the future?
●
Hardware accelerators will start providing
diminishing returns in the future
●
Any solution?
– Quantum computing
– DNA computing
– Photonic computing
– Superconducting computing
97 / 98
Bibliography
●
Weste, N. H., & Harris, D. “CMOS VLSI
Design : A Circuits and Systems Perspective”.
4th Edition, 2010.
●
Kahng AB, Lienig J, Markov IL, Hu J. “VLSI
Physical Design: from Graph Partitioning to
Timing Closure”. Springer Science & Business
Media; 2011 Jan 27.
●
Mead, C., & Conway, L. (1980). Introduction
to VLSI systems (Vol. 1080). Reading, MA:
Addison-Wesley.
98 / 98