Download as pdf or txt
Download as pdf or txt
You are on page 1of 98

Processor Design

José María Arnau


1 / 98
Objective

Understand the intricacies of advanced IC
design and development

2 / 98
Objective

Understand the intricacies of advanced IC
design and development

RTL code
module dff(input clk, input d, output q);

always @(posedge clk)


q <= d;

endmodule

...

3 / 98
Objective

Understand the intricacies of advanced IC
design and development

RTL code
module dff(input clk, input d, output q);

always @(posedge clk)


q <= d;

endmodule

...

IC layout

4 / 98
PD Contents

1. Moore’s Law and Dennard Scaling


2. Hardware Design Cycle
3. Functional Verification
4. Circuit Design
5. Physical Design

5 / 98
Evaluation


Presentation on related research work
– PechaKucha format: 20 slides, 20 seconds/slide


List of papers posted soon in PD web site
– http://jarnau.site.ac.upc.edu/PD/


20% of final mark

6 / 98
Practical approach


Theory sessions are useful for lab assigments
– How to write effective test benches
– How to verify your design
– How to synthesize, place and route your design
– How to estimate max freq and area of your design


EDA tools for digital synthesis
– Qflow: http://opencircuitdesign.com/qflow/

7 / 98
PD Contents

1. Moore’s Law and Dennard Scaling


(a) Transistor basic physics
(b) Power wall
(c) Dark silicon
2. Hardware Design Cycle
3. Functional Verification
4. Circuit Design
5. Physical Design

8 / 98
1. Moore’s Law and
Dennard Scaling

9 / 98
Moore’s Law
The number of transistors in a dense integrated circuit doubles about
every two years.

1. Moore’s Law and


Dennard Scaling

10 / 98
Moore’s Law
The number of transistors in a dense integrated circuit doubles about
every two years.

Dennard Scaling
As transistors shrink, they become faster, consume less power,
and are cheaper to manufacture.

1. Moore’s Law and


Dennard Scaling

11 / 98
MOS Transistors

Metal Oxide Semiconductor

nMOS Transistor pMOS Transistor

12 / 98
MOS Transistors

drain

nMOS gate

source

drain

pMOS gate

source

13 / 98
MOS Transistors

drain drain

nMOS gate gate = 0 OFF

source source

drain

pMOS gate

source

14 / 98
MOS Transistors

drain drain drain

nMOS gate gate = 0 OFF gate = 1 ON

source source source

drain

pMOS gate

source

15 / 98
MOS Transistors

drain drain drain

nMOS gate gate = 0 OFF gate = 1 ON

source source source

drain drain

pMOS gate gate = 0 ON

source source

16 / 98
MOS Transistors

drain drain drain

nMOS gate gate = 0 OFF gate = 1 ON

source source source

drain drain drain

pMOS gate gate = 0 ON gate = 1 OFF

source source source

17 / 98
CMOS Logic


Inverter

18 / 98
CMOS Logic


Inverter

19 / 98
CMOS Logic


Inverter

ON

20 / 98
CMOS Logic


Inverter

ON

0 OFF

21 / 98
CMOS Logic


Inverter

ON
1
0 OFF

22 / 98
CMOS Logic


Inverter

ON
1
0 OFF

23 / 98
CMOS Logic


Inverter

ON
1
0 OFF 1

24 / 98
CMOS Logic


Inverter

ON OFF
1
0 OFF 1

25 / 98
CMOS Logic


Inverter

ON OFF
1
0 OFF 1 ON

26 / 98
CMOS Logic


Inverter

ON OFF
1 0
0 OFF 1 ON

27 / 98
CMOS Logic


NAND gate

28 / 98
CMOS Logic


NAND gate

0
1

29 / 98
CMOS Logic


NAND gate

ON

0
1

30 / 98
CMOS Logic


NAND gate

ON

0 OFF
1

31 / 98
CMOS Logic


NAND gate

OFF ON

0 OFF
1

32 / 98
CMOS Logic


NAND gate

OFF ON

0 OFF
1 ON

33 / 98
CMOS Logic


NAND gate

OFF ON
1
0 OFF
1 ON

34 / 98
CMOS Logic


NAND gate

OFF ON
1
0 OFF
1 ON

35 / 98
CMOS Logic


NAND gate

OFF ON
1
0 OFF 1
1 ON 1

36 / 98
CMOS Logic


NAND gate

OFF ON OFF OFF


1
0 OFF 1
1 ON 1

37 / 98
CMOS Logic


NAND gate

OFF ON OFF OFF


1
0 OFF 1 ON
1 ON 1 ON

38 / 98
CMOS Logic


NAND gate

OFF ON OFF OFF


1 0
0 OFF 1 ON
1 ON 1 ON

39 / 98
CMOS Logic


Complementary Metal-
Oxide Semicondutor
– pMOS pull-up network
– nMOS pull-down network

40 / 98
CMOS Logic


Multiplexers

41 / 98
CMOS Logic


Multiplexers

42 / 98
CMOS Logic


Multiplexers

43 / 98
CMOS Logic


Multiplexers

0
ON
ON
1

44 / 98
CMOS Logic


Multiplexers

0
ON
ON
1
OFF
OFF
0

45 / 98
CMOS Logic


Multiplexers

0
ON
ON D0
1
OFF
OFF
0

46 / 98
CMOS Logic


Latches and flip-flops

D latch D flip-flop

47 / 98
Moore’s Law


The number of transistors in a dense integrated
circuit doubles about every two years

48 / 98
Moore’s Law


The number of transistors in a dense integrated
circuit doubles about every two years

49 / 98
Moore’s Law


The number of transistors in a dense integrated
circuit doubles about every two years

~2 years

50 / 98
L O G2 O F T H E N U M B E R O F ●
C O M P O N E N TS P E R IN T E G R A T E D F U N C T IO N

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
1959
1960
1961
1962
1963
196 4
Moore’s Law

Electronics, April 19, 1965.


196 5
1966
1967
1968
1969
1970
1971
1972
1973
circuit doubles about every two years

1974
1975
The number of transistors in a dense integrated

51 / 98
Moore’s Law

52 / 98
Moore’s Law
Transistors per Square Millimeter by Year, 1971–2018. Logarithmic scale. Data from Wikipedia.

53 / 98
Technology


What do “7nm” or “10nm” technology mean?

54 / 98
Layout Design Rules

Define how small features can be and how closely
they can be reliably packed

Lambda based rules
– C. Mead and L. Conway, Introduction to VLSI Systems,
Reading, MA: Addison-Wesley, 1980.

55 / 98
Dennard Scaling


As transistors shrink, they become faster, consume
less power, and are cheaper to manufacture.
– Miniaturization provides smaller and faster transistors

56 / 98
Dennard Scaling


As transistors shrink, they become faster, consume
less power, and are cheaper to manufacture.
– Miniaturization provides smaller and faster transistors

S
Area = S2
57 / 98
Dennard Scaling


As transistors shrink, they become faster, consume
less power, and are cheaper to manufacture.
– Miniaturization provides smaller and faster transistors

0.7*S
S

0.7*S
Area = 0.5*S2
S
Area = S2
58 / 98
Dennard Scaling


As transistors shrink, they become faster, consume
less power, and are cheaper to manufacture.
– Miniaturization provides smaller and faster transistors


Transistor dimensions scaled by 30%

0.7*S
S

0.7*S
Area = 0.5*S2
S
Area = S2
59 / 98
Dennard Scaling


As transistors shrink, they become faster, consume
less power, and are cheaper to manufacture.
– Miniaturization provides smaller and faster transistors


Transistor dimensions scaled by 30%

0.7*S ●
Delay reduced by 30%
S

0.7*S
Area = 0.5*S2
S
Area = S2
60 / 98
Dennard Scaling


As transistors shrink, they become faster, consume
less power, and are cheaper to manufacture.
– Miniaturization provides smaller and faster transistors


Transistor dimensions scaled by 30%

0.7*S ●
Delay reduced by 30%
S

Frequency increased by ~40%

0.7*S
Area = 0.5*S2
S
Area = S2
61 / 98
Performance Scaling


Smaller transistors are faster
– ~1.4x higher performance per generation

More transistors allow more functionality
– Better branch predictors
– Larger caches
– ILP: superscalar, out-of-order execution
– DLP: vector unit
– Better memory controller

62 / 98
CPU Performance

63 / 98
Hennessy, J. L., & Patterson, D. A. (2011). Computer architecture: a quantitative approach. Elsevier.
CPU Frequency

64 / 98
CPU Frequency
Power wall!

65 / 98
CMOS Power


Dynamic Power
Switching power 2

P dyn= α C V f

Static Power (Leakage) P sta =I leak V
– Power dissipated even
when not switching
P total =P dyn + P sta
– Around one third of the
total power in current
technology

66 / 98
CMOS Power


Static Power
– Gate leakage (Igate)
– Subthreshold leakage (Isub)
– Junction leakage (Ijunct)
– Depends on the temperature

67 / 98
CMOS Power

Im, Y., Cho, E. S., Choi, K., & Kang, S. (2007, March). Development of Junction
Temperature Decision (JTD) Map for Thermal Design of Nano-scale Devices
Considering Leakage Power. In Twenty-Third Annual IEEE Semiconductor Thermal
Measurement and Management Symposium (pp. 63-67). IEEE.
68 / 98
Frequency vs Power

69 / 98
Power Wall

70 / 98
Power Wall

71 / 98
Power Wall and ILP Limits


Cannot increase performance by raising the
frequency due to thermal constraints

Diminishing returns in ILP
– Computer architects optimized superscalar out-of-
order CPUs for decades

But Moore’s law still provides more and more
transistors

What can we do with the extra transistors?

72 / 98
Multi-core Processors

https://www.guru3d.com/articles-summary/amd-ryzen-5-2400g-review,2.html 73 / 98
Multi-core Processors


Use extra transistors to include multiple cores in
the same chip

Exploit TLP (Thread-Level Parallelism)

Programmer has to manually extract parallelism
– The Free Lunch Is Over. A Fundamental Turn Toward
Concurrency in Software. Herb Sutter.
http://www.gotw.ca/publications/concurrency-ddj.htm

74 / 98
CPU Trends

75 / 98
End of Dennard Scaling

According to Dennard scaling, power density remains
constant
– Smaller transistors consume less power
– If frequency is not increased, total power remains the same

Dennard scaling did not take subthreshold leakage
into account

In current technology, more transistors result in more
power
– Power density increases even at the same frequency!

2
P= α C V f + I leak V
76 / 98
Power Density

John Hennessy and David Patterson. “A New Golden Age for Computer Architecture: Domain-Specific
Hardware/Software Co-Design, Enhanced Security, Open Instruction Sets, and Agile Chip Development

77 / 98
Dark Silicon


Dark Silicon and the End of Multicore Scaling.
Esmaeilzadeh, H., Blem, E., Amant, R. S.,
Sankaralingam, K., & Burger, D. ISCA 2011.

Part of the chip must be powered off due to
thermal constraints
– 22 nm: 21% of the chip powered off
– 8 nm: ~50% of the chip powered off

78 / 98
Multicore Limitations


Increasing the number of cores is no longer
an effective solution to improve performance
– Due to the dark silicon problem, not all the cores
can be powered at the same time
– Diminishing returns in TLP

But Moore’s law still provides more and more
transistors

What can we do with the extra transistors?

79 / 98
Hardware Accelerators

Welcome to the golden age of hardware
accelerators!

80 / 98
https://www.hotchips.org/hc30/1conf/1.05_AMD_APU_AMD_Raven_HotChips30_Final.pdf
Hardware Accelerators
Qualcomm Snapdragon 835 [1] NVIDIA “Parker” SoC [2]

1. https://www.notebookcheck.net/Qualcomm-Snapdragon-835-SoC-Benchmarks-and-Specs.207842.0.html 81 / 98
2. https://www.hotchips.org/wp-content/uploads/hc_archives/hc28/HC28.22-Monday-Epub/HC28.22.30-Low-Power-Epub/
HC28.22.322-Tegra-Parker-AndiSkende-NVIDIA-v01.pdf
ASIC

Application Specific Integrated Circuit

https://ngcodec.com/markets-cloud-transcoding/ 82 / 98
Software Overheads

for (i = 0; i < 4; i++)


v[i] *= 2.0f;

83 / 98
Software Overheads

for (i = 0; i < 4; i++)


v[i] *= 2.0f;

la $t0, v
la $t1, v+16
li $t2, 0x40000000
mtc1 $t2, $f2
for: lwc1 $f0, 0($t0)
mul.s $f0, $f0, $f2
swc1 $f0, 0($t0)
addiu $t0, $t0, 4
bne $t0, $t1, for

84 / 98
Software Overheads

for (i = 0; i < 4; i++)


v[i] *= 2.0f;

la $t0, v
la $t1, v+16
li $t2, 0x40000000
mtc1 $t2, $f2
for: lwc1 $f0, 0($t0)
mul.s $f0, $f0, $f2
swc1 $f0, 0($t0)
addiu $t0, $t0, 4
bne $t0, $t1, for

85 / 98
https://en.wikibooks.org/wiki/Microprocessor_Design/Pipelined_Processors
Software Overheads

for (i = 0; i < 4; i++)


v[i] *= 2.0f;

la $t0, v
la $t1, v+16
li $t2, 0x40000000 Read R/W Read Write ADD/ MUL Bytes Bytes
mtc1 $t2, $f2 I$ D$ RF RF SUB R MM W MM
for: lwc1 $f0, 0($t0) 4 4 4 4 4 16
mul.s $f0, $f0, $f2 4 8 4 4
swc1 $f0, 0($t0) 4 4 8 4 16
addiu $t0, $t0, 4 4 4 4 4
bne $t0, $t1, for 4 8 4
TOTAL: 20 8 32 12 16 4 16 16
86 / 98
https://en.wikibooks.org/wiki/Microprocessor_Design/Pipelined_Processors
Software Overheads

for (i = 0; i < 4; i++)


v[i] *= 2.0f; Accesses to TLB
TLB misses
ROB accesses
la $t0, v Issue queue...
la $t1, v+16
li $t2, 0x40000000 Read R/W Read Write ADD/ MUL Bytes Bytes
mtc1 $t2, $f2 I$ D$ RF RF SUB R MM W MM
for: lwc1 $f0, 0($t0) 4 4 4 4 4 16
mul.s $f0, $f0, $f2 4 8 4 4
swc1 $f0, 0($t0) 4 4 8 4 16
addiu $t0, $t0, 4 4 4 4 4
bne $t0, $t1, for 4 8 4
TOTAL: 20 8 32 12 16 4 16 16
87 / 98
https://en.wikibooks.org/wiki/Microprocessor_Design/Pipelined_Processors
ASIC vs CPU
Main
Main Memory
Memory

X
Read X Write
MEM MEM
X

2.0f X

MUL Bytes Bytes


R MM W MM
4 16 16

88 / 98
ASIC vs CPU
Main
Main Memory
Memory

X
Read X Write
MEM MEM
X

2.0f X

MUL Bytes Bytes Read R/W Read Write ADD/ MUL Bytes Bytes
R MM W MM I$ D$ RF RF SUB R MM W MM
4 16 16 20 8 32 12 16 4 16 16

89 / 98
ASIC vs CPU
Main
Main Memory
Memory

X
Read X Write
MEM MEM
X

2.0f X

MUL Bytes Bytes Read R/W Read Write ADD/ MUL Bytes Bytes
R MM W MM I$ D$ RF RF SUB R MM W MM
4 16 16 20 8 32 12 16 4 16 16

ASIC does much less work, so it is faster and consumes less energy
90 / 98
Hardware Accelerators

91 / 98
Shao, Yakun Sophia, et al. "The aladdin approach to accelerator design and modeling."
IEEE Micro 35.3 (2015): 58-70.
Hardware Accelerators

92 / 98
Shao, Yakun Sophia, et al. "The aladdin approach to accelerator design and modeling."
IEEE Micro 35.3 (2015): 58-70.
Sea of Accelerators

Y. S. Shao, B. Reagen, G. Wei and D. Brooks, "Aladdin: A pre-RTL, power-performance accelerator simulator enabling
large design space exploration of customized architectures," 2014 ACM/IEEE 41st International Symposium on Computer
Architecture (ISCA), 2014, pp. 97-108.

93 / 98
Hardware Accelerators


ASICs provide higher performance and
energy-efficiency

But they are less programmable and have
large development costs

Only accelerate in hardware applications that
are widely popular and compute intensive

Example: machine learning accelerators

94 / 98
Google TPU
Jouppi, Norman P., et al. "In-datacenter performance analysis of a tensor processing unit."
Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on. IEEE,
2017.


92 TOPS

28 MB on-chip

15x – 30x speedup

30x – 80x higher Perf/W

95 / 98
Turing Lecture at ISCA 2018

https://www.acm.org/hennessy-patterson-turing-lecture 96 / 98
What about the future?


Hardware accelerators will start providing
diminishing returns in the future

Any solution?
– Quantum computing
– DNA computing
– Photonic computing
– Superconducting computing

97 / 98
Bibliography


Weste, N. H., & Harris, D. “CMOS VLSI
Design : A Circuits and Systems Perspective”.
4th Edition, 2010.

Kahng AB, Lienig J, Markov IL, Hu J. “VLSI
Physical Design: from Graph Partitioning to
Timing Closure”. Springer Science & Business
Media; 2011 Jan 27.

Mead, C., & Conway, L. (1980). Introduction
to VLSI systems (Vol. 1080). Reading, MA:
Addison-Wesley.
98 / 98

You might also like