Processors
MICRO-ARCHITECTURES
Slides by: Pedro Tomás
Intel Silvermont
ARM big.LITTLE
Intel Sandy Bridge
Architecture Overview
3 Advanced Computer Architectures, 2014
Intel Haswell / SkyLake (?)
Architecture Overview
Instruction Fetch
µOp Scheduler
Intel SkyLake (?) Micro-Architecture
Front-End (in-order execution)
µOp Scheduler
[Diagram: 6 µOps per cycle feeding execution units EX Unit 0 to EX Unit 7]
Intel SkyLake (?) Micro-Architecture
Execution Unit (out-of-order execution)
Front-end to scheduler:
64-entry µOp queue per thread, with micro-op fusion
The front-end can deliver to the scheduler a flow of up to 6 µOps per clock cycle from one of the two threads
Supports the simultaneous execution of up to two threads, with most resources being shared between the two threads
In-order stages (6 µOps per cycle between stages):
Reorder Buffer (ROB): 224 entries
Register Alias Table: mapping of logical to physical registers
Register renaming: physical register file with 168 FP entries and 180 INT entries
Out-of-order stages:
µOp Scheduler: 97-entry fused reservation stations
Ports #0 to #7, each feeding one execution unit (EX Unit 0 to EX Unit 7)
Result buses: Integer CDB, INT SIMD CDB, X87 / FP SIMD CDB
Input dependencies:
Because of fused multiply-and-add (FMA) instructions, a higher number of dependencies had to be supported from Haswell
Currently (SkyLake), more instructions require support from three input dependencies (e.g., conditional moves)
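The mapping of logical to physical registers can be illustrated with a toy Register Alias Table. This is a minimal sketch, not Intel's implementation: the class name `RegisterAliasTable`, the method `rename`, the four logical registers, and the never-recycled free list are all assumptions made for illustration. Only the figure of 180 physical INT registers comes from the slide.

```python
from collections import deque


class RegisterAliasTable:
    """Toy RAT: maps logical register names to physical register indices."""

    def __init__(self, logical_regs, num_physical=180):
        if num_physical < len(logical_regs):
            raise ValueError("need at least one physical register per logical register")
        # Initially, logical register i lives in physical register i.
        self.mapping = {name: idx for idx, name in enumerate(logical_regs)}
        # Remaining physical registers are free (never recycled in this sketch).
        self.free = deque(range(len(logical_regs), num_physical))

    def rename(self, dest, sources):
        """Rename one µOp: read source mappings first, then allocate the destination."""
        phys_sources = [self.mapping[s] for s in sources]
        if not self.free:
            raise RuntimeError("no free physical register: renaming would stall")
        phys_dest = self.free.popleft()
        self.mapping[dest] = phys_dest
        return phys_dest, phys_sources


rat = RegisterAliasTable(["r0", "r1", "r2", "r3"])
# Three-source µOp, as with FMA: r0 = r0 * r1 + r2
first = rat.rename("r0", ["r0", "r1", "r2"])
# A later write to r0 gets a fresh physical register, so it does not
# conflict with the earlier write or with pending readers of the old value.
second = rat.rename("r0", ["r3"])
# A reader of r0 now sees the most recent mapping.
third = rat.rename("r1", ["r0"])
print(first, second, third)
```

Reading the sources before allocating the destination matters when a µOp reads and writes the same logical register, as the first FMA-style example does.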
Intel SkyLake (?) Micro-Architecture
Execution Unit (out-of-order execution)
There are several read/write buffers in order to avoid stalls due to structural hazards in load/store
#Read buffers: 72
#Write buffers: 56
[Diagram: Register Alias Table (mapping of logical to physical registers) and register renaming (physical register file with 168 FP entries and 180 INT entries) on the in-order side, passing 6 µOps per cycle; µOp Scheduler (97-entry fused reservation stations), ports #0 to #7, EX Units 0 to 7, and the Integer, INT SIMD and X87 / FP SIMD CDBs on the out-of-order side]
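Why more buffer entries mean fewer structural-hazard stalls can be seen in a toy model. This is a sketch under strong assumptions: one load is issued per cycle, every load holds a buffer entry for a fixed `latency`, and the function name `count_stall_cycles` is hypothetical. The value 72 is the read-buffer count from the slide; the latency of 50 cycles is made up.

```python
def count_stall_cycles(num_loads, buffer_entries, latency):
    """Issue one load per cycle if a buffer entry is free; otherwise stall.

    Each load holds its buffer entry for `latency` cycles.
    Returns the number of cycles in which issue was blocked.
    """
    release_times = []  # cycle at which each in-flight load frees its entry
    cycle = 0
    stalls = 0
    issued = 0
    while issued < num_loads:
        # Free the entries of loads that have completed by this cycle.
        release_times = [t for t in release_times if t > cycle]
        if len(release_times) < buffer_entries:
            release_times.append(cycle + latency)
            issued += 1
        else:
            stalls += 1  # structural hazard: no free buffer entry
        cycle += 1
    return stalls


small = count_stall_cycles(num_loads=16, buffer_entries=8, latency=50)
large = count_stall_cycles(num_loads=16, buffer_entries=72, latency=50)
print(small, large)
```

With only 8 entries the ninth load has to wait for the first one to complete, while 72 entries absorb the whole burst.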
Mobile Architectures…
Intel Silvermont Micro-Architecture
Architecture Overview
ARM big.LITTLE Micro-Architecture
Architecture Overview
Heterogeneous architecture
Composed of:
Low-power in-order A7 cores
High-performance out-of-order A15 cores
ARM big.LITTLE Micro-Architecture
Architecture of the A7 cores
Super-pipelined architecture
Non-symmetric dual issue
Pipeline length between 8 and 10 stages
ARM big.LITTLE Micro-Architecture
Architecture of the A15 cores
Out-of-order architecture
Sustained triple issue
Pipeline length between 15 and 24 stages
ARM big.LITTLE Micro-Architecture
Architecture Overview
Heterogeneous architecture
Composed of:
Low-power in-order A7 cores
High-performance out-of-order A15 cores
All processors implement the full ARMv7-A ISA
Challenges:
Should a task be scheduled to an A7 or an A15 core?
When should a thread migrate from one core to the other?
What is the cost of migrating a thread and the estimated performance on the other core: is it worth migrating?
How to maximize performance, or minimize energy, while sustaining a given Quality of Service (QoS) level?
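The challenges above can be framed as simple cost comparisons. The sketch below is one possible formulation, not the policy of any real big.LITTLE scheduler: the function names `worth_migrating` and `pick_core`, the abstract "work units", and all speed and power figures are hypothetical values chosen for illustration.

```python
def worth_migrating(remaining_work, speed_current, speed_target, migration_cost):
    """Return True if migrating finishes the remaining work sooner.

    remaining_work: abstract work units left in the thread
    speed_*: estimated work units per second on each core type
    migration_cost: seconds lost while moving the thread's state
    """
    time_if_staying = remaining_work / speed_current
    time_if_moving = migration_cost + remaining_work / speed_target
    return time_if_moving < time_if_staying


def pick_core(work, deadline, cores):
    """Pick the lowest-energy core that still meets the deadline (the QoS level).

    cores: dict of name -> (speed in work units/s, power in watts).
    Returns the chosen core name, or None if no core meets the deadline.
    """
    feasible = {}
    for name, (speed, power) in cores.items():
        runtime = work / speed
        if runtime <= deadline:
            feasible[name] = runtime * power  # energy = time * power
    return min(feasible, key=feasible.get) if feasible else None


# Made-up figures: the A15 is 3x faster but draws 7.5x more power.
cores = {"A7": (1.0, 0.2), "A15": (3.0, 1.5)}

long_task = worth_migrating(6.0, speed_current=1.0, speed_target=3.0, migration_cost=0.5)
short_task = worth_migrating(0.6, speed_current=1.0, speed_target=3.0, migration_cost=0.5)
relaxed = pick_core(work=6.0, deadline=10.0, cores=cores)
tight = pick_core(work=6.0, deadline=3.0, cores=cores)
print(long_task, short_task, relaxed, tight)
```

Under these assumed numbers, migration pays off only when enough work remains to amortize its cost, and the faster core is chosen only when the deadline rules out the slower one.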
Future architectures
What will happen in the future?
Let’s guess…
ISCA 2002 – Session I
We Had It All Figured Out
Oops… what about power?
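Power consumption:
$P = P_{\mathrm{static}} + P_{\mathrm{dynamic}}$
Static power consumption:
$P_{\mathrm{static}} = \alpha \cdot V_{dd} \cdot e^{\gamma \cdot V_{dd}}$
Dynamic power consumption:
$P_{\mathrm{dynamic}} = \beta \cdot C \cdot V_{dd}^{2} \cdot f$
Relation between voltage and frequency:
$f \propto V_{dd}$
Energy consumption:
$E = T \cdot P \propto f^{2}$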
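The step from the power formulas to the energy relation can be checked numerically. With the supply voltage proportional to the frequency, dynamic power grows with the cube of the frequency; for a fixed workload the execution time shrinks as 1/f, so dynamic energy grows with the square. The sketch below assumes that the static term is ignored, that all coefficients (alpha, beta, gamma, C, and the hypothetical proportionality constant `k`) equal 1, and that the workload is one unit; none of these values come from the slides.

```python
import math


def static_power(vdd, alpha=1.0, gamma=1.0):
    """P_static = alpha * Vdd * exp(gamma * Vdd)."""
    return alpha * vdd * math.exp(gamma * vdd)


def dynamic_power(vdd, f, beta=1.0, c=1.0):
    """P_dynamic = beta * C * Vdd^2 * f."""
    return beta * c * vdd ** 2 * f


def dynamic_energy(f, work=1.0, k=1.0):
    """Dynamic energy for a fixed workload, assuming Vdd = k * f and T = work / f."""
    vdd = k * f
    runtime = work / f
    return runtime * dynamic_power(vdd, f)


# Doubling the frequency (and with it the voltage):
ratio_power = dynamic_power(2.0, 2.0) / dynamic_power(1.0, 1.0)
ratio_energy = dynamic_energy(2.0) / dynamic_energy(1.0)
print(ratio_power, ratio_energy)
```

Under these assumptions, doubling the frequency multiplies dynamic power by 8 and dynamic energy by 4, which matches the cubic and quadratic dependence.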
2004: Santa Clara, we have a problem
[Chart annotation: SkyLake, 14 nm]
Widespread Assumption:
Microarchitecture was the cause of the power problem
Moore’s Law
In 1965 Moore observed: the number of transistors doubles every year
In 1975 he revised: the number of transistors doubles every two years
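Both versions of the observation can be written as the same exponential with a different doubling period. The symbols $N_0$ (transistor count at the reference year $t_0$) and $T_d$ (doubling period in years) are notation introduced here for illustration and do not appear in the slides.

```latex
N(t) = N_0 \cdot 2^{(t - t_0)/T_d},
\qquad
T_d = 1 \text{ year (1965)}, \quad T_d = 2 \text{ years (1975)}

% Growth over one decade under each version:
\frac{N(t_0 + 10)}{N_0} = 2^{10/1} = 1024
\qquad \text{versus} \qquad
\frac{N(t_0 + 10)}{N_0} = 2^{10/2} = 32
```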
Dennard Scaling
Performance per watt grows at roughly the same rate as Moore’s Law
Koomey's law: performance per watt doubles every 1.57 years
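To compare the two rates, a doubling period can be turned into a growth factor over a fixed time span. This is a minimal sketch: the function name `growth_factor` and the ten-year horizon are arbitrary choices for illustration, while the doubling periods of 1.57 and 2 years are the ones quoted in the slides.

```python
def growth_factor(years, doubling_period):
    """Factor by which a quantity grows in `years` if it doubles every `doubling_period` years."""
    return 2.0 ** (years / doubling_period)


koomey_decade = growth_factor(10, 1.57)  # performance per watt (Koomey's law)
moore_decade = growth_factor(10, 2.0)    # transistor count (Moore's 1975 revision)
print(f"Koomey: x{koomey_decade:.1f}, Moore: x{moore_decade:.1f}")
```

A two-year doubling period gives a factor of 32 per decade, and the shorter 1.57-year period compounds to a noticeably larger factor over the same span.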
The Scaling Promise of Multicore