OVERVIEW OF MODERN MICRO-ARCHITECTURES
Slides by: Pedro Tomás

ADVANCED COMPUTER ARCHITECTURES


ARQUITECTURAS AVANÇADAS DE COMPUTADORES (AAC)
Outline
 Intel Sandy Bridge/Haswell/SkyLake Micro-Architectures
 Intel Silvermont
 ARM big.LITTLE
Intel Sandy Bridge
Architecture Overview
Intel Haswell / SkyLake (?)
Architecture Overview

 Instructions are fetched from the L1 instruction cache
 Instructions are decoded into simpler µOps:
 CISC-to-RISC translation
 µOps are combined in order to increase execution efficiency
 A continuous flow of up to 6 µOps is issued simultaneously for execution
Intel SkyLake (?) Micro-Architecture
Front-End (in-order execution)

[Front-end block diagram: Instruction TLB (144-entry, 4-way) and 32KB L1 Instruction Cache (8-way, 64 sets, 64B lines), feeding the branch predictors, a 16B pre-decode and fetch buffer, a 2x20-entry x86 instruction queue with MacroOp fusion, the decoders (µcode engine plus 1 complex and 3 simple decoders), a 1.5K-µOp cache (8-way), a 64-entry per-thread µOp queue with MicroOp fusion / µOp cache rebuild, and the µOp scheduler.]

 Branch Prediction Unit: chooses the next block of code to execute from the program; instructions can be fetched from:
 the decoded µOp cache
 the L1 instruction cache
 L2/L3/memory

 Advanced branch prediction units:
 Loop counter: adds information to the BTB stating whether the branch/jump resembles a loop and what the loop count is
 Indirect branch predictor: allows saving multiple targets and adds a predictor for the target address; useful for case constructs and polymorphism in object-oriented programming
 Subroutine return predictor: adds a local stack inside the processor to correctly predict returns from routines
Intel SkyLake (?) Micro-Architecture
Front-End (in-order execution)

 Loaded instructions are decoded and placed in a queue

 MacroOp fusion allows grouping a test instruction with a status-flag conditional branch, for example:
   CMP R1,R2 ; BR.Z loop   →   BEQ R1,R2,loop
 Possible test instructions: CMP, TEST, INC, DEC, ADD, SUB, AND
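As an illustrative C sketch (not specific to any particular Intel implementation; function and variable names are made up), the loop below compiles to a compare immediately followed by a conditional branch, which is exactly the pattern macro-op fusion targets:

#include <stddef.h>

/* Illustrative only: a typical compiler lowers each loop condition below to a
 * CMP/TEST immediately followed by a conditional branch, which the decoders
 * can fuse into a single macro-op. */
long sum_until_zero(const long *v, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; i++) {   /* cmp i,n + jb   -> fusible pair */
        if (v[i] == 0)                 /* cmp v[i],0 + je -> fusible pair */
            break;
        acc += v[i];
    }
    return acc;
}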
Intel SkyLake (?) Micro-Architecture
Front-End (in-order execution)

 Depending on their complexity, instructions are sent to specialized decoders and converted into µOps

 MicroOp fusion allows grouping multiple simple µOps into a complex µOp (RISC-to-CISC translation), e.g.:
   LD R5,(R4) ; ADD R3,R5   →   ADD R3,(R4)

 Fused µOps are still dispatched for execution as many times as if they were not fused
 Saves bandwidth at the µOp queue output, reduces ROB occupancy, increases retirement throughput and reduces power
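As a hedged illustration (compiler output varies), a C statement that reads memory and immediately uses the value is typically emitted as a single memory-operand x86 instruction, which the front end can keep as one fused µOp:

/* Illustrative only: on x86 a load feeding an ALU operation is often emitted
 * as one memory-operand instruction (e.g. add rax, [rdi]); the front end can
 * keep it as a single fused uop and split it only when dispatching to ports. */
long add_from_memory(long acc, const long *p)
{
    acc += *p;   /* typically compiled to something like: add acc_reg, [p_reg] */
    return acc;
}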
Intel SkyLake (?) Micro-Architecture
Front-End (in-order execution)

 The Branch Prediction Unit (BPU) uses:
 A Branch Predict Table (BPT): likely holds up to 8-16K targets, divided into 2 levels (similar to L1 and L2 caches)
 Branch history (likely global + local): takes into consideration the path through which execution reached the branch instruction
 An Indirect Branch Target Array: stores jump addresses for control instructions such as JMP R5
 A Subroutine Return Stack Buffer: a local mirror of the stack that holds the return addresses of the 16 most recent calls

 Whenever a misprediction is detected, instruction decode does not need to wait for the pipeline to flush; it starts decoding the correct path immediately
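Below is a minimal, purely illustrative C model of two of the structures listed above: a table of 2-bit saturating counters and a 16-entry return-address stack. The sizes and indexing are assumptions made for the sketch and do not describe Intel's actual predictor organization.

#include <stdint.h>
#include <stdbool.h>

/* Toy model only (sizes and indexing are assumptions, not Intel's design). */
#define BPT_ENTRIES 4096   /* 2-bit saturating counters, indexed by branch PC */
#define RAS_DEPTH   16     /* return-address stack depth, as on the slide     */

static uint8_t  bpt[BPT_ENTRIES];   /* 0-1 predict not-taken, 2-3 predict taken */
static uint64_t ras[RAS_DEPTH];
static int      ras_top;

static bool predict_taken(uint64_t pc)
{
    return bpt[(pc >> 2) % BPT_ENTRIES] >= 2;
}

static void train(uint64_t pc, bool taken)
{
    uint8_t *ctr = &bpt[(pc >> 2) % BPT_ENTRIES];
    if (taken  && *ctr < 3) (*ctr)++;      /* strengthen towards taken     */
    if (!taken && *ctr > 0) (*ctr)--;      /* strengthen towards not-taken */
}

static void on_call(uint64_t return_address)
{
    ras[ras_top % RAS_DEPTH] = return_address;   /* circular: overwrite oldest */
    ras_top++;
}

static uint64_t predict_return(void)
{
    if (ras_top > 0)
        ras_top--;
    return ras[ras_top % RAS_DEPTH];
}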
Intel SkyLake (?) Micro-Architecture
Front-End (in-order execution)

 Whenever a loop is detected, instructions are fetched directly from the µOp Cache
 The µOp Cache stores decoded, fixed-length operations
 Organized as 8 ways x 32 sets x 6 µOps (1536, i.e. roughly 1.5K µOps)
 It allows reducing the effective pipeline length during loops
 A branch misprediction during the loop (e.g., in an inner loop or in an if statement) has a lower penalty
 Increases the µOp bandwidth to the out-of-order engine
 Allows reducing power consumption in the front-end

 Intel announces an average hit rate of over 80% for the µOp Cache
Intel SkyLake (?) Micro-Architecture
Front-End (in-order execution)

 A Loop Stream Detector (LSD) is able to identify small loops that fit into the µOp queue

 Whenever such a loop is found, the queue is locked and µOps are sent directly from the queue until a branch ends the loop

 This allows reading and decoding of instructions, from either the instruction cache or the µOp Cache, to be stopped

 In the Sandy Bridge architecture there are 2x28-entry queues, one for each active thread
 Ivy Bridge and Haswell use a single unified 56-entry queue, to make better use of the resources whenever a single thread is being executed
 SkyLake partitions the queue into two sets of 64 entries, one per active thread
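As a simple illustration (the exact number of µOps that fits is micro-architecture dependent and assumed here), a tight loop such as the one below decodes to only a handful of µOps, small enough for a Loop Stream Detector to replay it from the µOp queue without re-fetching or re-decoding every iteration:

/* Illustrative only: a short, branch-terminated loop body that an LSD-style
 * mechanism could stream from the uop queue instead of the decoders. */
void scale_in_place(float *v, int n, float k)
{
    for (int i = 0; i < n; i++)
        v[i] *= k;
}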
Intel SkyLake (?) Micro-Architecture
Execution Unit (out-of-order execution)
[Execution-engine block diagram: 64-entry per-thread µOp queue with MicroOp fusion → Reorder Buffer (224 entries) → Register Alias Table (mapping of logical to physical registers) and register renaming over a physical register file with 168 FP and 180 INT entries → µOp scheduler (97-entry fused reservation stations) → dispatch over ports #0-#7 to execution units 0-7. Everything up to renaming is in order; the scheduler and execution units operate out of order.]

 ROB size increased from 120 entries (Nehalem) to 192 entries (Haswell) and 224 entries (SkyLake)
 Constant increase of the instruction window size in order to exploit more ILP
 Dynamically divided between the two threads
 Can commit (retire) up to 4 fused µOps per clock cycle (Sandy Bridge and Haswell)
 In SkyLake the commit bandwidth has likely been increased

 Register renaming: the Register Alias Table allows saving bandwidth and power
 Avoids unnecessary data copies between the logical and physical register tables

 The processor tracks the actual set of registers that are being used, to reduce context-switch time
 The complete register file (including special registers) in Sandy Bridge uses over 700B of memory
 A thread using only the 16 GPRs can avoid moving roughly 600B of data on a context switch
Intel SkyLake (?) Micro-Architecture
Execution Unit (out-of-order execution)

 Register zeroing (e.g., through an XOR instruction) is performed directly in the renaming stage
 Uses register renaming of the target register
 Plus zeroing of the renamed register
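As an illustrative note (typical compiler behaviour, not mandated by the architecture), compilers usually zero a register with the xor-with-itself idiom, which the rename stage can then resolve without using an execution unit:

/* Illustrative only: returning 0 is commonly compiled to "xor eax, eax; ret".
 * The rename stage recognizes this zeroing idiom and handles it at rename time
 * instead of occupying an ALU. */
int always_zero(void)
{
    return 0;
}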
Intel SkyLake (?) Micro-Architecture
Execution Unit (out-of-order execution)

 The front-end can deliver to the scheduler a flow of up to 6 µOps per clock cycle from one of the two threads
 Supports the simultaneous execution of up to two threads, with most resources being shared between the two threads

 Renamed µOps remain in the scheduler until all of their operands are available
 Unified, centralized set of reservation stations shared by both threads

 The scheduler can dispatch out of order up to 8 µOps for execution (from any thread) per clock cycle
 Dispatches the oldest 8 µOps that are ready
 Sandy Bridge had only 6 execution ports
Intel SkyLake (?) Micro-Architecture
Execution Unit (out-of-order execution)

 There are three types of computational µOps, each corresponding to a different execution stack:
 Integer
 SIMD integer
 FP (scalar or SIMD)

 Each type of µOp has its own CDB (Integer CDB, INT SIMD CDB, X87/FP SIMD CDB) in order to minimize conflicts and management logic
 If there are cross-domain dependencies (e.g., integer to floating-point), a delay of one clock cycle is typically imposed
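As a hedged example (the exact penalty depends on the instructions involved), a value that moves from the integer stack to the floating-point stack, such as an int-to-double conversion, creates the kind of cross-domain dependency mentioned above:

/* Illustrative only: an int -> double conversion (e.g. cvtsi2sd) moves data
 * from the integer domain to the FP/SIMD domain, a cross-domain dependency
 * that typically costs an extra cycle of bypass latency. */
double int_to_fp(int x)
{
    return (double)x;
}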
Intel SkyLake (?) Micro-Architecture
Execution Unit (out-of-order execution)

 In Sandy Bridge, all the functional units on a given execution port had the same latency
 The exception is division and square root
 This implies that, in some cases, some functional units have their execution delayed
 This was likely done to help control functional-unit access to the CDBs

 In SkyLake each execution unit is composed of several functional units (FUs) with different pipeline lengths
Intel SkyLake (?) Micro-Architecture
Execution Unit (out-of-order execution)

 In Sandy Bridge (and also Ivy Bridge), each µOp could only support up to two source dependencies

 Because of fused multiply-add (FMA) instructions, a higher number of dependencies had to be supported from Haswell onwards

 Currently (SkyLake), more instructions require support for three input dependencies (e.g., conditional moves)
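For illustration, the standard C fma() function below maps to a fused multiply-add with three source operands, which is why the scheduler must track three input dependencies for such µOps:

#include <math.h>

/* fma(a, x, y) computes a*x + y in one operation: three source operands,
 * hence three input dependencies for the corresponding uop. */
double axpy_element(double a, double x, double y)
{
    return fma(a, x, y);   /* typically compiled to a single FMA instruction */
}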
Intel SkyLake (?) Micro-Architecture
Execution Unit (out-of-order execution)

Port | Operations                                        | Latency
  0  | Integer and vector arithmetic, logic and shift    | 1
     | Vector string operations                          | 3
     | Floating point add, multiply, FMA                 | 4
     | AES encryption                                    | 4
     | Integer vector multiplication                     | 5
     | Integer and floating point division, square root  | variable
     | Branch                                            | 1-2
  1  | Integer and vector arithmetic, logic and shift    | 1
     | Integer multiplication, bit scan                  | 3
     | Floating point add, multiply, FMA                 | 4
     | Integer vector multiplication                     | 5
  2  | Load, including address generation                |
  3  | Load, including address generation                |
  4  | Store, including address generation               |
  5  | Integer and vector arithmetic and logic           | 1
     | Vector permute                                    | 1/3
     | X87 floating point add, SADBW                     | 3
     | PCLMUL                                            | 7
  6  | Integer arithmetic, logic, shift                  | 1
     | Jump and branch                                   | 1-2
  7  | Load and store, including address generation      |
Intel SkyLake (?) Micro-Architecture
Execution Unit (out-of-order execution)

 Each read/write port has a width of 256 bits (in order to support AVX2 vector instructions)

 There are several read/write buffers in order to avoid stalls due to structural hazards on loads/stores:
 #Read buffers: 72
 #Write buffers: 56
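As a hedged illustration of what those 256-bit ports serve (AVX intrinsics shown; compile with AVX support, e.g. -mavx), a single 256-bit load or store moves one full vector register per access:

#include <immintrin.h>

/* Illustrative only: each 256-bit load/store below matches the 256-bit
 * data-port width mentioned above, so one port access moves a full vector. */
void vadd8(float *dst, const float *a, const float *b)
{
    __m256 va = _mm256_loadu_ps(a);                  /* 256-bit load  */
    __m256 vb = _mm256_loadu_ps(b);                  /* 256-bit load  */
    _mm256_storeu_ps(dst, _mm256_add_ps(va, vb));    /* 256-bit store */
}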
Mobile Architectures…
Intel Silvermont Micro-Architecture
Architecture Overview
ARM big.LITTLE Micro-Architecture
Architecture Overview

 Heterogeneous architecture
 Composed of:
 Low-power in-order A7 cores
 High-performance out-of-order A15 cores
 All processors implement the full ARMv7A ISA
ARM big.LITTLE Micro-Architecture
Architecture of the A7 cores

 Super-pipelined architecture
 Non-symmetric dual issue
 Pipeline length between 8 and 10 stages
ARM big.LITTLE Micro-Architecture
Architecture of the A15 cores

 Out-of-order architecture
 Sustained triple issue
 Pipeline length between 15 and 24 stages
ARM big.LITTLE Micro-Architecture
Architecture Overview

 Heterogeneous architecture
 Composed of:
 Low-power in-order A7 cores
 High-performance out-of-order A15 cores

 All processors implement the full ARMv7A ISA

            | Speed-up (A15 vs A7) | Energy efficiency (A7 vs A15)
Dhrystone   | 1.9x                 | 3.5x
FDCT        | 2.3x                 | 3.8x
IMDCT       | 3.0x                 | 3.0x
MemCopy L1  | 1.9x                 | 2.3x
MemCopy L2  | 1.9x                 | 3.4x
ARM big.LITTLE Micro-Architecture
Architecture Overview

 Heterogeneous architecture
 Composed of:
 Low-power in-order A7 cores
 High-performance out-of-order A15 cores
 All processors implement the full ARMv7A ISA

 Challenges:
 Should a task be scheduled to an A7 or an A15 core?
 When should a thread migrate from one core to the other?
 What is the cost of migrating a thread, and what is the estimated performance on the other core?
 Is it worth migrating?
 How to maximize performance, or minimize energy while
sustaining a given Quality of Service (QoS) level?
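The C sketch below is a purely hypothetical decision heuristic (the cost model, field names and the 1.5x margin are invented for illustration, not taken from any real scheduler) showing the kind of trade-off such a scheduler has to evaluate:

#include <stdbool.h>

/* Hypothetical cost model for illustration only. */
struct task_estimate {
    double time_on_current_core;   /* estimated remaining runtime here (ms)    */
    double time_on_other_core;     /* estimated remaining runtime if migrated  */
    double migration_cost;         /* state transfer + cold-cache penalty (ms) */
};

static bool worth_migrating(const struct task_estimate *t)
{
    /* Migrate only if the predicted saving clearly exceeds the migration cost. */
    double saving = t->time_on_current_core - t->time_on_other_core;
    return saving > 1.5 * t->migration_cost;   /* 1.5x margin: assumed value */
}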
Future architectures
What will happen in the future?

Let’s guess…
ISCA 2002 – Session I
We Had It All Figured Out

 The Optimum Pipeline Depth for a Microprocessor
 IBM → 22-36 pipeline stages

 The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays
 DEC/Compaq/HP → ~40 pipeline stages

 Increasing Processor Performance by Implementing Deeper Pipelines
 Intel → 50-60 pipeline stages
Oops… what about power!

 Power consumption:
   P = P_static + P_dynamic
 Static power consumption:
   P_static = α · V_dd · e^(γ·V_dd)
 Dynamic power consumption:
   P_dynamic = β · C · V_dd² · f
Oops… what about power!

 Power consumption:
   P = P_static + P_dynamic
 Static power consumption:
   P_static = α · V_dd · e^(γ·V_dd)
 Dynamic power consumption:
   P_dynamic = β · C · V_dd² · f
 Relation between voltage and frequency:
   f ∝ V_dd
 Energy consumption:
   E = T · P ∝ f²
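A small C sketch of this first-order model (constants normalized, static power ignored, and V_dd assumed proportional to f, so purely illustrative) makes the scaling explicit: dynamic power grows as f³ while the energy of a fixed workload grows as f²:

#include <stdio.h>

/* First-order illustration only: normalized constants, no static power,
 * and Vdd assumed proportional to f. */
int main(void)
{
    for (double f = 0.5; f <= 2.0; f *= 2) {
        double p_dyn  = f * f * f;   /* P_dynamic ∝ C * Vdd^2 * f, with Vdd ∝ f  */
        double t      = 1.0 / f;     /* execution time of a fixed workload ∝ 1/f */
        double energy = p_dyn * t;   /* E = P * T ∝ f^2                          */
        printf("f=%.2f  P_dyn=%.3f  T=%.3f  E=%.3f\n", f, p_dyn, t, energy);
    }
    return 0;
}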
2004: Santa Clara, we have a problem!

 More pipeline stages, less efficient, more power.

 Just can't remove > 100 watts without great expense on a desktop.

 All computing is now Low Power Computing!
2004: Santa Clara, we have a problem!

[Figure: SkyLake, 14 nm]
Widespread Assumption:
Microarchitecture was the cause of the power problem
Moore’s Law

 In 1965 Moore observed: the number of transistors doubles every year

 In 1975 he revised this: the number of transistors doubles every two years
Dennard Scaling

 Performance per watt grows at roughly the same rate as Moore's Law

 Koomey's law: performance per watt doubles every 1.57 years
The Scaling Promise of Multicore

              4 cores        8 cores        16 cores
Frequency     f              f              f
Technology    X              X/√2           X/2

 2x more cores per generation: each generation shrinks the linear feature size by √2, so the area per core roughly halves and twice as many cores fit in the same die area
 Flat or slowly increasing operating frequency
The End of Dennard Scaling

 Dennard scaling ignored the "leakage current" and "threshold voltage", which establish a baseline of power per transistor

 As transistors get smaller, power density increases because these do not scale with size

 These created a "Power Wall" that has limited practical processor frequency to around 4 GHz since 2006
Dark Silicon

              4 cores        8 cores        16 cores
Frequency     f              f              f
Technology    X              X/√2           X/2

 We can still put more transistors in the chip…
 However, what do we do with them?
How do we solve this issue?
Ideas?
The Four Horsemen

 The Shrinking Horseman
 Build smaller and smaller chips (useful for the IoT)

 The Dim Horseman
 Partial dimming of circuits
 Use of bigger caches
 Apply morphing/reconfiguration methodologies
 Use coarse-grained reconfigurable arrays
 Apply computational sprinting

 The Specialized Horseman
 Integrate many "static" cores, each specialized at different operations
 Only activate some of these cores at the same time

 The Deus Ex Machina Horseman
 New silicon devices or new technologies