OVERVIEW OF MODERN MICRO-ARCHITECTURES
Slides by: Pedro Tomás

ADVANCED COMPUTER ARCHITECTURES


ARQUITECTURAS AVANÇADAS DE COMPUTADORES (AAC)
Outline
 Intel Sandy Bridge/Haswell/SkyLake Micro-Architectures
 Intel Silvermont
 ARM big.LITTLE
Intel Sandy Bridge
Architecture Overview
Intel Haswell / SkyLake (?)
Architecture Overview

 Instructions are fetched from the L1 instruction cache
 Instructions are decoded into simpler µOps:
 CISC-to-RISC translation
 µOps are combined in order to increase execution efficiency
 A continuous flow of up to 6 µOps is issued simultaneously for execution
Intel SkyLake (?) Micro-Architecture
Front-End (in-order execution)

[Front-end block diagram: Instruction TLB (144-entry, 4-way) and 32KB L1 Instruction Cache (8-way, 64 sets, 64B lines), feeding the branch predictors, a 16B pre-decode and fetch buffer, a 2x20-entry x86 instruction queue with MacroOp fusion, the decoders (µcode engine plus 1 complex and 3 simple decoders), a 1.5K-µOp cache (8-way), a 64-entry per-thread µOp queue with MicroOp fusion / µOp cache rebuild, and the µOp scheduler.]

 Branch Prediction Unit: chooses the next block of code to execute from the program; instructions can be fetched from:
 the decoded µOp cache
 the L1 instruction cache
 L2/L3/memory

 Advanced branch prediction units:
 Loop counter: adds information to the BTB stating whether the branch/jump resembles a loop and what the loop count is
 Indirect branch predictor: allows saving multiple targets and adds a predictor for the target address; useful for case constructs and polymorphism in object-oriented programming
 Subroutine return predictor: adds a local stack inside the processor to correctly predict returns from routines
Intel SkyLake (?) Micro-Architecture
Front-End (in-order execution)

 Loaded instructions are decoded and placed in a queue

 MacroOp fusion allows grouping a test instruction with a status-flag conditional branch, for example:
   CMP R1,R2 ; BR.Z loop   →   BEQ R1,R2,loop
 Possible test instructions: CMP, TEST, INC, DEC, ADD, SUB, AND
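As an illustrative C sketch (not specific to any particular Intel implementation; function and variable names are made up), the loop below compiles to a compare immediately followed by a conditional branch, which is exactly the pattern macro-op fusion targets:

#include <stddef.h>

/* Illustrative only: a typical compiler lowers each loop condition below to a
 * CMP/TEST immediately followed by a conditional branch, which the decoders
 * can fuse into a single macro-op. */
long sum_until_zero(const long *v, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; i++) {   /* cmp i,n + jb   -> fusible pair */
        if (v[i] == 0)                 /* cmp v[i],0 + je -> fusible pair */
            break;
        acc += v[i];
    }
    return acc;
}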
Intel SkyLake (?) Micro-Architecture
Front-End (in-order execution)

 Depending on their complexity, instructions are sent to specialized decoders and converted into µOps

 MicroOp fusion allows grouping multiple simple µOps into a complex µOp (RISC-to-CISC translation), e.g.:
   LD R5,(R4) ; ADD R3,R5   →   ADD R3,(R4)

 Fused µOps are still dispatched for execution as many times as if they were not fused
 Saves bandwidth at the µOp queue output, reduces ROB occupancy, increases retirement throughput and reduces power
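As a hedged illustration (compiler output varies), a C statement that reads memory and immediately uses the value is typically emitted as a single memory-operand x86 instruction, which the front end can keep as one fused µOp:

/* Illustrative only: on x86 a load feeding an ALU operation is often emitted
 * as one memory-operand instruction (e.g. add rax, [rdi]); the front end can
 * keep it as a single fused uop and split it only when dispatching to ports. */
long add_from_memory(long acc, const long *p)
{
    acc += *p;   /* typically compiled to something like: add acc_reg, [p_reg] */
    return acc;
}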
Intel SkyLake (?) Micro-Architecture
Front-End (in-order execution)

 The Branch Prediction Unit (BPU) uses:
 A Branch Predict Table (BPT): likely holds up to 8-16K targets, divided into 2 levels (similar to L1 and L2 caches)
 Branch history (likely global + local): takes into consideration the path through which execution reached the branch instruction
 An Indirect Branch Target Array: stores jump addresses for control instructions such as JMP R5
 A Subroutine Return Stack Buffer: a local mirror of the stack that holds the return addresses of the 16 most recent calls

 Whenever a misprediction is detected, instruction decode does not need to wait for the pipeline to flush; it starts decoding the correct path immediately
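Below is a minimal, purely illustrative C model of two of the structures listed above: a table of 2-bit saturating counters and a 16-entry return-address stack. The sizes and indexing are assumptions made for the sketch and do not describe Intel's actual predictor organization.

#include <stdint.h>
#include <stdbool.h>

/* Toy model only (sizes and indexing are assumptions, not Intel's design). */
#define BPT_ENTRIES 4096   /* 2-bit saturating counters, indexed by branch PC */
#define RAS_DEPTH   16     /* return-address stack depth, as on the slide     */

static uint8_t  bpt[BPT_ENTRIES];   /* 0-1 predict not-taken, 2-3 predict taken */
static uint64_t ras[RAS_DEPTH];
static int      ras_top;

static bool predict_taken(uint64_t pc)
{
    return bpt[(pc >> 2) % BPT_ENTRIES] >= 2;
}

static void train(uint64_t pc, bool taken)
{
    uint8_t *ctr = &bpt[(pc >> 2) % BPT_ENTRIES];
    if (taken  && *ctr < 3) (*ctr)++;      /* strengthen towards taken     */
    if (!taken && *ctr > 0) (*ctr)--;      /* strengthen towards not-taken */
}

static void on_call(uint64_t return_address)
{
    ras[ras_top % RAS_DEPTH] = return_address;   /* circular: overwrite oldest */
    ras_top++;
}

static uint64_t predict_return(void)
{
    if (ras_top > 0)
        ras_top--;
    return ras[ras_top % RAS_DEPTH];
}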
Intel SkyLake (?) Micro-Architecture
Front-End (in-order execution)

 Whenever a loop is detected, instructions are fetched directly from the µOp Cache
 The µOp Cache stores decoded, fixed-length operations
 Organized as 8 ways x 32 sets x 6 µOps (1536, i.e. roughly 1.5K µOps)
 It allows reducing the effective pipeline length during loops
 A branch misprediction during the loop (e.g., in an inner loop or in an if statement) has a lower penalty
 Increases the µOp bandwidth to the out-of-order engine
 Allows reducing power consumption in the front-end

 Intel announces an average hit rate of over 80% for the µOp Cache
Intel SkyLake (?) Micro-Architecture
Front-End (in-order execution)

 A Loop Stream Detector (LSD) is able to identify small loops that fit into the µOp queue

 Whenever such a loop is found, the queue is locked and µOps are sent directly from the queue until a branch ends the loop

 This allows reading and decoding of instructions, from either the instruction cache or the µOp Cache, to be stopped

 In the Sandy Bridge architecture there are 2x28-entry queues, one for each active thread
 Ivy Bridge and Haswell use a single unified 56-entry queue, to make better use of the resources whenever a single thread is being executed
 SkyLake partitions the queue into two sets of 64 entries, one per active thread
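As a simple illustration (the exact number of µOps that fits is micro-architecture dependent and assumed here), a tight loop such as the one below decodes to only a handful of µOps, small enough for a Loop Stream Detector to replay it from the µOp queue without re-fetching or re-decoding every iteration:

/* Illustrative only: a short, branch-terminated loop body that an LSD-style
 * mechanism could stream from the uop queue instead of the decoders. */
void scale_in_place(float *v, int n, float k)
{
    for (int i = 0; i < n; i++)
        v[i] *= k;
}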
Intel SkyLake (?) Micro-Architecture
Execution Unit (out-of-order execution)
[Execution-engine block diagram: 64-entry per-thread µOp queue with MicroOp fusion → Reorder Buffer (224 entries) → Register Alias Table (mapping of logical to physical registers) and register renaming over a physical register file with 168 FP and 180 INT entries → µOp scheduler (97-entry fused reservation stations) → dispatch over ports #0-#7 to execution units 0-7. Everything up to renaming is in order; the scheduler and execution units operate out of order.]

 ROB size increased from 120 entries (Nehalem) to 192 entries (Haswell) and 224 entries (SkyLake)
 Constant increase of the instruction window size in order to exploit more ILP
 Dynamically divided between the two threads
 Can commit (retire) up to 4 fused µOps per clock cycle (Sandy Bridge and Haswell)
 In SkyLake the commit bandwidth has likely been increased

 Register renaming: the Register Alias Table allows saving bandwidth and power
 Avoids unnecessary data copies between the logical and physical register tables

 The processor tracks the actual set of registers that are being used, to reduce context-switch time
 The complete register file (including special registers) in Sandy Bridge uses over 700B of memory
 A thread using only the 16 GPRs can avoid moving roughly 600B of data on a context switch
Intel SkyLake (?) Micro-Architecture
Execution Unit (out-of-order execution)

 Register zeroing (e.g., through an XOR instruction) is performed directly in the renaming stage
 Uses register renaming of the target register
 Plus zeroing of the renamed register
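As an illustrative note (typical compiler behaviour, not mandated by the architecture), compilers usually zero a register with the xor-with-itself idiom, which the rename stage can then resolve without using an execution unit:

/* Illustrative only: returning 0 is commonly compiled to "xor eax, eax; ret".
 * The rename stage recognizes this zeroing idiom and handles it at rename time
 * instead of occupying an ALU. */
int always_zero(void)
{
    return 0;
}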
Intel SkyLake (?) Micro-Architecture
Execution Unit (out-of-order execution)

 The front-end can deliver to the scheduler a flow of up to 6 µOps per clock cycle from one of the two threads
 Supports the simultaneous execution of up to two threads, with most resources being shared between the two threads

 Renamed µOps remain in the scheduler until all of their operands are available
 Unified, centralized set of reservation stations shared by both threads

 The scheduler can dispatch out of order up to 8 µOps for execution (from any thread) per clock cycle
 Dispatches the oldest 8 µOps that are ready
 Sandy Bridge had only 6 execution ports
Intel SkyLake (?) Micro-Architecture
Execution Unit (out-of-order execution)

 There are three types of computational µOps, each corresponding to a different execution stack:
 Integer
 SIMD integer
 FP (scalar or SIMD)

 Each type of µOp has its own CDB (Integer CDB, INT SIMD CDB, X87/FP SIMD CDB) in order to minimize conflicts and management logic
 If there are cross-domain dependencies (e.g., integer to floating-point), a delay of one clock cycle is typically imposed
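As a hedged example (the exact penalty depends on the instructions involved), a value that moves from the integer stack to the floating-point stack, such as an int-to-double conversion, creates the kind of cross-domain dependency mentioned above:

/* Illustrative only: an int -> double conversion (e.g. cvtsi2sd) moves data
 * from the integer domain to the FP/SIMD domain, a cross-domain dependency
 * that typically costs an extra cycle of bypass latency. */
double int_to_fp(int x)
{
    return (double)x;
}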
Intel SkyLake (?) Micro-Architecture
Execution Unit (out-of-order execution)

 In Sandy Bridge, all the functional units on a given execution port had the same latency
 The exception is division and square root
 This implies that, in some cases, some functional units have their execution delayed
 This was likely done to help control functional-unit access to the CDBs

 In SkyLake each execution unit is composed of several functional units (FUs) with different pipeline lengths
Intel SkyLake (?) Micro-Architecture
Execution Unit (out-of-order execution)

 In Sandy Bridge (and also Ivy Bridge), each µOp could only support up to two source dependencies

 Because of fused multiply-add (FMA) instructions, a higher number of dependencies had to be supported from Haswell onwards

 Currently (SkyLake), more instructions require support for three input dependencies (e.g., conditional moves)
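For illustration, the standard C fma() function below maps to a fused multiply-add with three source operands, which is why the scheduler must track three input dependencies for such µOps:

#include <math.h>

/* fma(a, x, y) computes a*x + y in one operation: three source operands,
 * hence three input dependencies for the corresponding uop. */
double axpy_element(double a, double x, double y)
{
    return fma(a, x, y);   /* typically compiled to a single FMA instruction */
}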
Intel SkyLake (?) Micro-Architecture
Execution Unit (out-of-order execution)

Port | Operations                                        | Latency
  0  | Integer and vector arithmetic, logic and shift    | 1
     | Vector string operations                          | 3
     | Floating point add, multiply, FMA                 | 4
     | AES encryption                                    | 4
     | Integer vector multiplication                     | 5
     | Integer and floating point division, square root  | variable
     | Branch                                            | 1-2
  1  | Integer and vector arithmetic, logic and shift    | 1
     | Integer multiplication, bit scan                  | 3
     | Floating point add, multiply, FMA                 | 4
     | Integer vector multiplication                     | 5
  2  | Load, including address generation                |
  3  | Load, including address generation                |
  4  | Store, including address generation               |
  5  | Integer and vector arithmetic and logic           | 1
     | Vector permute                                    | 1/3
     | X87 floating point add, SADBW                     | 3
     | PCLMUL                                            | 7
  6  | Integer arithmetic, logic, shift                  | 1
     | Jump and branch                                   | 1-2
  7  | Load and store, including address generation      |
Intel SkyLake (?) Micro-Architecture
Execution Unit (out-of-order execution)

 Each read/write port has a width of 256 bits (in order to support AVX2 vector instructions)

 There are several read/write buffers in order to avoid stalls due to structural hazards on loads/stores:
 #Read buffers: 72
 #Write buffers: 56
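As a hedged illustration of what those 256-bit ports serve (AVX intrinsics shown; compile with AVX support, e.g. -mavx), a single 256-bit load or store moves one full vector register per access:

#include <immintrin.h>

/* Illustrative only: each 256-bit load/store below matches the 256-bit
 * data-port width mentioned above, so one port access moves a full vector. */
void vadd8(float *dst, const float *a, const float *b)
{
    __m256 va = _mm256_loadu_ps(a);                  /* 256-bit load  */
    __m256 vb = _mm256_loadu_ps(b);                  /* 256-bit load  */
    _mm256_storeu_ps(dst, _mm256_add_ps(va, vb));    /* 256-bit store */
}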
Mobile Architectures…
Intel Silvermont Micro-Architecture
Architecture Overview
ARM big.LITTLE Micro-Architecture
Architecture Overview

 Heterogeneous architecture
 Composed of:
 Low-power in-order A7 cores
 High-performance out-of-order A15 cores
 All processors implement the full ARMv7A ISA
ARM big.LITTLE Micro-Architecture
Architecture of the A7 cores

 Super-pipelined architecture
 Non-symmetric dual issue
 Pipeline length between 8 and 10 stages
ARM big.LITTLE Micro-Architecture
Architecture of the A15 cores

 Out-of-order architecture
 Sustained triple issue
 Pipeline length between 15 and 24 stages
ARM big.LITTLE Micro-Architecture
Architecture Overview

 Heterogeneous architecture
 Composed of:
 Low-power in-order A7 cores
 High-performance out-of-order A15 cores

 All processors implement the full ARMv7A ISA

            | Speed-up (A15 vs A7) | Energy efficiency (A7 vs A15)
Dhrystone   | 1.9x                 | 3.5x
FDCT        | 2.3x                 | 3.8x
IMDCT       | 3.0x                 | 3.0x
MemCopy L1  | 1.9x                 | 2.3x
MemCopy L2  | 1.9x                 | 3.4x
ARM big.LITTLE Micro-Architecture
Architecture Overview

 Heterogeneous architecture
 Composed of:
 Low-power in-order A7 cores
 High-performance out-of-order A15 cores
 All processors implement the full ARMv7A ISA

 Challenges:
 Should a task be scheduled to an A7 or an A15 core?
 When should a thread migrate from one core to the other?
 What is the cost of migrating a thread, and what is the estimated performance on the other core?
 Is it worth migrating?
 How to maximize performance, or minimize energy while
sustaining a given Quality of Service (QoS) level?
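The C sketch below is a purely hypothetical decision heuristic (the cost model, field names and the 1.5x margin are invented for illustration, not taken from any real scheduler) showing the kind of trade-off such a scheduler has to evaluate:

#include <stdbool.h>

/* Hypothetical cost model for illustration only. */
struct task_estimate {
    double time_on_current_core;   /* estimated remaining runtime here (ms)    */
    double time_on_other_core;     /* estimated remaining runtime if migrated  */
    double migration_cost;         /* state transfer + cold-cache penalty (ms) */
};

static bool worth_migrating(const struct task_estimate *t)
{
    /* Migrate only if the predicted saving clearly exceeds the migration cost. */
    double saving = t->time_on_current_core - t->time_on_other_core;
    return saving > 1.5 * t->migration_cost;   /* 1.5x margin: assumed value */
}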
Future architectures
What will happen in the future?

Let’s guess…
ISCA 2002 – Session I
We Had It All Figured Out

 The Optimum Pipeline Depth for a Microprocessor
 IBM → 22-36 pipeline stages

 The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays
 DEC/Compaq/HP → ~40 pipeline stages

 Increasing Processor Performance by Implementing Deeper Pipelines
 Intel → 50-60 pipeline stages
Oops… what about power!

 Power consumption:
   P = P_static + P_dynamic
 Static power consumption:
   P_static = α · V_dd · e^(γ·V_dd)
 Dynamic power consumption:
   P_dynamic = β · C · V_dd² · f
Oops… what about power!

 Power consumption:
   P = P_static + P_dynamic
 Static power consumption:
   P_static = α · V_dd · e^(γ·V_dd)
 Dynamic power consumption:
   P_dynamic = β · C · V_dd² · f
 Relation between voltage and frequency:
   f ∝ V_dd
 Energy consumption:
   E = T · P ∝ f²
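A small C sketch of this first-order model (constants normalized, static power ignored, and V_dd assumed proportional to f, so purely illustrative) makes the scaling explicit: dynamic power grows as f³ while the energy of a fixed workload grows as f²:

#include <stdio.h>

/* First-order illustration only: normalized constants, no static power,
 * and Vdd assumed proportional to f. */
int main(void)
{
    for (double f = 0.5; f <= 2.0; f *= 2) {
        double p_dyn  = f * f * f;   /* P_dynamic ∝ C * Vdd^2 * f, with Vdd ∝ f  */
        double t      = 1.0 / f;     /* execution time of a fixed workload ∝ 1/f */
        double energy = p_dyn * t;   /* E = P * T ∝ f^2                          */
        printf("f=%.2f  P_dyn=%.3f  T=%.3f  E=%.3f\n", f, p_dyn, t, energy);
    }
    return 0;
}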
2004: Santa Clara, we have a problem!

 More pipeline stages, less efficient, more power.

 Just can't remove > 100 watts without great expense on a desktop.

 All computing is now Low Power Computing!
2004: Santa Clara, we have a problem!

[Figure: SkyLake, 14 nm]
Widespread Assumption:
Microarchitecture was the cause of the power problem
Moore’s Law

 In 1965 Moore observed: the number of transistors doubles every year

 In 1975 he revised this: the number of transistors doubles every two years
Dennard Scaling

 Performance per watt grows at roughly the same rate as Moore's Law

 Koomey's law: performance per watt doubles every 1.57 years
The Scaling Promise of Multicore

              4 cores        8 cores        16 cores
Frequency     f              f              f
Technology    X              X/√2           X/2

 2x more cores per generation: each generation shrinks the linear feature size by √2, so the area per core roughly halves and twice as many cores fit in the same die area
 Flat or slowly increasing operating frequency
The End of Dennard Scaling

 Dennard scaling ignored the "leakage current" and "threshold voltage", which establish a baseline of power per transistor

 As transistors get smaller, power density increases because these do not scale with size

 These created a "Power Wall" that has limited practical processor frequency to around 4 GHz since 2006
Dark Silicon

              4 cores        8 cores        16 cores
Frequency     f              f              f
Technology    X              X/√2           X/2

 We can still put more transistors in the chip…
 However, what do we do with them?
How do we solve this issue?
Ideas?
The Four Horsemen

 The Shrinking Horseman
 Build smaller and smaller chips (useful for the IoT)

 The Dim Horseman
 Partial dimming of circuits
 Use of bigger caches
 Apply morphing/reconfiguration methodologies
 Use coarse-grained reconfigurable arrays
 Apply computational sprinting

 The Specialized Horseman
 Integrate many "static" cores, each specialized at different operations
 Only activate some of these cores at the same time

 The Deus Ex Machina Horseman
 New silicon devices or new technologies