15 - 370w18 - Pipeline Last

15.
Basic Processor Design –

Pipelining With Control Hazards
EECS 370 – Introduction to Computer Organization – Winter 2018
Mark Brehob, Reetu Das and Harry Davis
EECS Department
University of Michigan in Ann Arbor, USA
© Davis, Das, and Brehob, 2018

The material in this presentation cannot be
copied in any form without our written permission
Announcements
q  Homework 4 due today 3/8

q  Midterm re-grade request due by 3/12
•  Need detailed description of re-grade request, sign form to acknowledge
everything will be re-graded
The University of Michigan 2

Review: Control hazards
beq 1 1 10
add 3 4 5
time
beq fetch decode execute memory writeback
add fetch decode execute
EECS 370: Introduction to The University of Michigan 3

Computer Organization
Review: Approaches to handling control hazards
q  Avoid
•  Make sure there are no hazards in the code.
q  Detect and stall

•  Delay fetch until branch resolved.
q  Speculate and Squash-if-Wrong aka Branch Prediction

•  Go ahead and fetch more instructions in case it is correct, but stop them if
they shouldn’t have been executed.

Review: Branch Prediction
•  Predict the next fetch address (to be used in the next cycle)
•  Requires three things to be predicted at fetch stage:

•  Whether the fetched instruction is a branch
•  Branch direction (if conditional)
•  Branch target address (if direction is taken)
•  Observation: Target address remains the same for a conditional

direct branch across dynamic instances
•  Store the target address from previous instance and access it with the PC
•  Called Branch Target Buffer (BTB) or Branch Target Address Cache

Review: Fetch Stage with Branch Prediction
Direction predictor (2-bit counters)
taken?
PC + inst size Next Fetch

Address
Program
hit?
Counter
Address of the
current branch
target address
Cache of Target Addresses (BTB: Branch Target Buffer)

EECS 370: Introduction to The University of Michigan
Review: State Machine for 2-bit Saturating Counter
actually actually
taken pred !taken pred “weakly
taken taken taken”
11 actually 10
“strongly taken
taken”
actually actually
taken !taken
“strongly
“weakly actually !taken”
pred pred
!taken” !taken !taken !taken
01 00 actually
actually !taken
taken
7
Branch prediction
q  Predict not taken: ~50% accurate.

q  Predict backward taken: ~65% accurate.
q  Predict same as last time: ~80% accurate.
q  Pentium: ~85% accurate.

q  Pentium Pro: ~92% accurate.
q  Best paper designs: ~96% accurate.

Can We Do Better?
Last-time and 2BC predictors exploit “last-time” predictability
Realization 1: A branch’s outcome can be correlated with

other branches’ outcomes
Global branch correlation
Realization 2: A branch’s outcome can be correlated with past

outcomes of the same branch (other than the outcome of
the branch “last-time” it was executed)
Local branch correlation

Computer Organization 9
5 Global Branch Prediction
In the local branch prediction scheme, the only patterns considered are t
Global
branch.Branch
AnotherCorrelation
scheme proposed by Yeh and Patt[YP92] is able to take
recent branches to make a prediction. One implementation of such an
Recently
in Figureexecuted branch
6. A single shift outcomes in records
register GR, the execution path is
the direction taken by
correlated branches.
conditional with the outcome
Since theof the next
branch branch
history is global to all branch
called global branch prediction in this paper.
Global branch prediction is able to take advantage of two types
the direction take by the current branch may depend strongly on othe
If Consider the example
first branch below:
not taken, second also not taken
if (x<1)
if (x>1)
If firstUsing
branch taken,
global second
history definitely
prediction, we not takento base the predictio
are able
on the direction taken by the first if. If (x<1), we know that the sec
6
Two Level Global Branch Prediction
First level: Global branch history register (N bits)
The direction of last N branches
Second level: Table of saturating counters for each history entry
The direction the branch took the last time the same history was seen
Pattern History Table (PHT)
00 …. 00
1 1 ….. 1 0 00 …. 01
2 3
previous one 00 …. 10
GHR
(global index
history 0 1
register) 11 …. 11

Can We Do Better?
Last-time and 2BC predictors exploit “last-time” predictability
Realization 1: A branch’s outcome can be correlated with

other branches’ outcomes
Global branch correlation
Realization 2: A branch’s outcome can be correlated with past

outcomes of the same branch (other than the outcome of
the branch “last-time” it was executed)
Local branch correlation

Local Branch Correlation
McFarling, “Combining Branch Predictors,” DEC WRL TR 1993.

Two Level Local Branch Prediction
First level: A set of local history registers (N bits each)
Select the history register based on the PC of the branch
Second level: Table of saturating counters for each history entry
The direction the branch took the last time the same history was seen
Pattern History Table (PHT)
Local history registers
00 …. 00
1 1 ….. 1 0 00 …. 01
2 3
00 …. 10
index
0 1
11 …. 11

What Do Architects Do For Fun?
Alpha 21264 Tournament Predictor
•  Minimum branch penalty: 7 cycles

•  Typical branch penalty: 11+ cycles
•  48K bits of target addresses stored in I-cache
•  Predictor tables are reset on a context switch
What can go wrong?
q  Data hazards: since register reads occur in stage 2 and register writes
occur in stage 5 it is possible to read the wrong value if is about to be
written.
q  Control hazards: A branch instruction may change the PC, but not until
stage 4. What do we fetch before that?
q  Exceptions: How do you handle exceptions in a pipelined processor with 5
instructions in flight?

Exceptions
q  Exception: when something unexpected happens during
program execution.
•  Example: divide by zero.
q  More complex for pipelined implementations.
•  Multiple instructions executing at the same time.
•  Simple case: ALU overflow.
-  “flush the pipeline after the exception”.
-  “handle the exception”.
n  Identify address of instruction causing exception (PC+1 in ID/
EX pipeline register).
n  JALR to exception handler.

Early exceptions
q  What about an early (fetch) exception?

•  Maybe a mis-speculated fetch.
-  Branch is wrong, fetching “down the wrong path”.
•  Solution.
-  Delay the handling of an exception until it is known to
be a “real” problem.

Late exceptions
q  What about a late (WB) exception?

•  When does a sw modify state?
-  In the memory access stage which means it may have
completed the write before the exception (to the
instruction that logically precedes it) occurs.
•  Solution:
-  Delay the memory write until writeback.
-  What happens if the sw is followed by a lw?

Multiple exceptions
q  What about simultaneous exceptions?
•  div 10 0 5
•  badop 4 4 4
q  They will generate a divide by 0 and an invalid opcode at the
same cycle.
q  Solution: assign priority
•  Generally deepest pipeline exception is handled.
•  Later exceptions can usually be ignored.

Building with Pipelines
•  CPI for pipelining:
•  1 (ideal case -‐ no stalls)
•  > 1 (reality, depends on program)
•  What if we want to improve performance more?
•  Want CPI as low as possible – lower than 1
•  Use Parallelism
•  InstrucIon Level Parallelism (ILP) – Within task
•  Thread Level Parallelism (TLP) – Having many tasks
•  Data Level Parallelism (DLP) – Many tasks with
same instrucIons
Crea=ng more pipelines
q  InstrucIon Level Parallelism – Superscalar Pipeline

•  Have two or more pipelines in same processor
•  pipelines need to work in tandem to improve single program
performance
q  Thread Level Parallelism – MulI-‐core
•  Have two or more processors (Independent Pipelines)
•  Need more programs or a parallel program
•  does not improve single program performance
q  Data Level Parallelism – Single Inst. MulIple Data (SIMD)
•  Have two or more execuIon pipelines (ID-‐>WB)
•  Share the same fetch and control pipeline to save power (IF+cont.)
•  Similar to GPU’s

22

ILP Techniques: Superscalar
Write
Fetch Decode Execute Memory
back
A
L M
U U
X
PC
Instr. Register Data

Memory File Memory Write
Execute
back
A
L M
U U
X

23
Other Techniques for ILP: Out of Order Execu=on
q  EliminaIng stall condiIons decreases CPI
q  Reorder instrucIons to avoid stalls
q  Example (5-‐stage LC2K pipeline):

add 1 2 3 add 1 2 3
lw 3 2 16 lw 3 2 16
nand 2 6 7 add 4 5 1
add 4 5 1 nand 2 6 7

Why Use Out of Order Execu=on?
q  Some instrucIons take a long Ime to execute

•  FloaIng point operaIons
•  Some loads and stores (more when we talk about
memory hierarchy)
q  OpIons:
•  Increase cycle Ime
•  Increase number of pipeline stages
•  Execute other instrucIons while you wait

TLP Techniques: Mul=processors
Fetch Decode Execute Memory Write
back
A
PC Register L M
U
File U
X
Instr. Data
Memory Memory
Register A
L M
PC File U U
X

Other Techniques for TLP: Mul=-‐Threading
Fetch Decode Execute Memory Write
back
A
PC
PC
Instr. Register
Register L Data M
Memory File U Memory U
File X
q  Virtual MulIprocessor (MulI-‐Threading or HyperThreading)

•  Duplicate the state (PC, Registers) but Ime share hardware
•  User/OperaIng system see 2 cores, but only one execuIon
•  Used to hide long latencies (i.e. memory access to disk)

DLP Techniques: Single Instr. Mul=ple Data (SIMD)
Fetch Decode Execute Memory WB
Register
PC
Instr. File
Mem
Register
File

Data
Memory
Register
File

Register
File


Building a GPU
q  Add special funcIonal
RF
units in EX
PC

RF

RF q  Combine Techniques

•  SIMD + MP + MT= SIMT
Inst. Mem
RF
Data Mem

q  MT used to hide
RF memory latencies
PC

RF q  SIMD used to decrease

power of fetch/control
RF

q  MP used to improve
RF throughput

LC2k Pipeline Summary
Fetch Decode Execute Memory WB
PC read Need register values Branches resolved Register values produced
q  Data hazards
q  Hazard exists if producer-‐consumer of a register within a 2 instrucIon window
q Note for project, the window is 3 instrucIons
q  Detect and stall – insert enough noops to separate producer and consumer by >
2 instrucIons (> 3 instrucIons for project)
q  Detect and forward
q Handles all cases except LW-‐USE, need 1 noop here
q  Control hazards
q  Detect and stall – needs 3 noops inserted aler each branch
q  Predict and squash
q Zero noops if predict correctly
q 3 if predict incorrectly

Review: basic performance equa=on
q  ExecuIon Ime (Time/Program) =

•  # of instr (I/P) × CPI (C/I) × cycle Ime (T/C)

q  MulI-‐cycle decreases cycle Ime, but increases CPI.

q  Pipelining decreases CPI
•  Approaches 1.0 if no stalls (hazards that are fixed by
stalling).

Calcula=ng performance with no stalls

How many cycles does this
code take to execute?
add 1 2 3
nor 1 4 5
add 4 6 7
What value is wri]en to the
ALU result field of the Mem/WB
pipeline register at the end of cycle 5.

Calcula=ng performance with no stalls

How many cycles does this
code take to execute?
No stalls – Final WB @ cycle 7
add 1 2 3
nor 1 4 5
add 4 6 7
What value is wri]en to the
ALU result field of the Mem/WB
pipeline register at the end of cycle 5.
nor result

Calcula=ng performance with data hazards
(detect and stall)

How many data hazards are there
in this code?
add 1 2 3
nor 3 4 5
add 3 5 6 How many stall cycles if we use
detect and stall to handle the
hazards?
Time: 1 2 3 4 5 6 7 8 9 10 11
add 1 2 3 IF ID EX ME WB
nand 3 4 5
add 3 5 6

(detect and stall)

How many data hazards are there
in this code?
add 1 2 3
3 data hazards
nor 3 4 5
add 3 5 6 How many stall cycles if we use
detect and stall to handle the
hazards?
Time: 1 2 3 4 5 6 7 8 9 10 11
add 1 2 3 IF ID EX ME WB
Stall : 4 cycles nor 3 4 5 IF ID* ID* ID EX ME WB
Total : 11 cycles add 3 5 6 IF* IF* IF ID* ID* ID EX ME WB

(detect and forward)

add 1 2 3 Where do the values for the second add
nor 3 4 5 instruc=on come from?
add 3 5 6
lw 3 6 7
add 6 6 1
How many stall cycles on the
LC2K pipelined datapath with
data forwarding from lecture?

(detect and forward)

add 1 2 3 Where do the values for the second add
nor 3 4 5 instruc=on come from?
add 3 5 6 From Mem/WB and EX/Mem
lw 3 6 7
add 6 6 1
How many stall cycles on the
LC2K pipelined datapath with
data forwarding from lecture?
1 stall for lw ➜ add

Calcula=ng performance with control hazards
(speculate and squash)
q  How many cycles are saved if you perform speculate and squash for
the following code (assume that branches are predicted to be not
taken)? add 1 2 3
beq 1 5 1
nor 6 4 1
add 3 4 5
q  Assume the branch is taken: How many cycles to execute this code?

q  Assume the branch is not taken: How many cycles execute this code?

(speculate and squash)
q  How many cycles are saved if you perform speculate and squash for
the following code (assume that branches are predicted to be not
taken)? add 1 2 3
beq 1 5 1
nor 6 4 1
add 3 4 5
q  Assume the branch is taken: How many cycles to execute this code?
3 instr + 3 stalls + 4 to empty pipe = 10 cycles
q  Assume the branch is not taken: How many cycles execute this code?
4 instr + 4 to empty pipe = 8 cycles
Assume the first branch is taken 50% add 1 2 3
of the Ime and the loop iterates 100 beq 1 5 1
Imes, and forwarding for all data lw 6 4 1
hazards. add 3 4 5
1.  How many cycles does the code beq 5 7 -‐5
take assuming detect and stall for halt
control hazards?
Assume halt is resolved
in WB stage

hazards. add 3 4 5
take assuming detect and stall for halt
control hazards?
# InstrucIons = 100*(0.5*5 + 0.5*4) + 1 = 451

Time = 451 + load stalls + branch stalls + empty pipe
Time = 451 + 100*0.5*1 +
Time = 451 + 100*0.5*1 + (100*3 + 100*3) + 4
Time = 1105
Assume the first branch is taken add 1 2 3
50% of the Ime and the loop beq 1 5 1
iterates 100 Imes, and forwarding lw 6 4 1
for all data hazards. add 3 4 5
take assuming speculate and halt
squash where all branches are
predicted not taken?

hazards. add 3 4 5
squash where all branches are
predicted not taken?
# InstrucIons = 100*(0.5*5 + 0.5*4) + 1 = 451
Time = 451 + 100*0.5*1 + (100*0.5*3 + 99*3) + 4
Time = 952

Assume the first branch is taken add 1 2 3
50% of the Ime and the loop beq 1 5 1
iterates 100 Imes, and forwarding lw 6 4 1
for all data hazards. add 3 4 5
squash where backward
branches are predicted taken
and forward branches not
taken?

hazards. add 3 4 5
squash where backward
branches are predicted taken
and forward branches not taken,
andBTB has all entrys to start?
# InstrucIons = 100*(0.5*5 + 0.5*4) + 1 = 451
Time = 451 + 100*0.5*1 + (100*0.5*3 + 1*3) + 4
Time = 658
Assume the first branch has the pasern TTTN add 1 2 3

that repeats, and the loop is iterated 100 Imes beq 1 5 1
4.  How many cycles does the code take if a 2-‐ lw 6 4 1
bit counter BTB is used to predict each add 3 4 5
branch, how many cycles does the code beq 5 7 -‐5
take? Assume iniIal state of branch halt
predictor counter is “10” (WT)

Assume the first branch has the pasern TTTN add 1 2 3

that repeats, and the loop is iterated 100 Imes beq 1 5 1
4.  How many cycles does the code take if a 2-‐ lw 6 4 1
bit counter BTB is used to predict each add 3 4 5
branch, how many cycles does the code beq 5 7 -‐5
take? Assume iniIal state of branch halt
predictor counter is “10” (WT)
√ √ √ X √ √ √ X √ √ √ X beq 2 is correct 99 Imes,
beq 1 T T T N T T T N T T T N • • • then incorrect last iter
# InstrucIons = 100*(0.25*5 + 0.75*4) + 1 = 426
Time = 426 + 100*0.25*1 + 100*0.25*3 + 1*3 + 4
Time = 533
Classic performance problem
q  Program with following instrucIon breakdown:
lw 10%
sw 15%
beq 25%
R-‐type 50%
q  Speculate “always not-‐taken” and squash. 80% of branches not-‐taken
q  Full forwarding to execute stage. 20% of loads stall for 1 cycle
q  What is the CPI of the program?
q  What is the total execuIon Ime if cycle Ime is 100MHz?


Classic performance problem
q  Program with following instrucIon breakdown:
lw 10%
sw 15%
beq 25%
R-‐type 50%
q  Speculate “always not-‐taken” and squash. 80% of branches not-‐taken
q  Full forwarding to execute stage. 20% of loads stall for 1 cycle
q  What is the CPI of the program?
q  What is the total execuIon Ime if cycle Ime is 100MHz?

CPI = 1 + 0.10 (loads) * 0.20 (load use stall)*1
+ 0.25 (branch) * 0.20 (miss rate)*3
CPI = 1 + 0.02 + 0.15 = 1.17
Time = 1.17 * 10ns = 11.7ns

Classic performance problem (cont.)
q  Assume branches are resolved at Execute?

•  What is the CPI?
•  What happens to cycle Ime?
•  What is the total execuIon Ime?

q  What if branches are resolved at Decode?

Classic performance problem (cont.)
q  Assume branches are resolved at Execute?

•  What is the CPI?
•  What happens to cycle Ime?
•  What is the total execuIon Ime?

CPI = 1 + 0.10 (loads) *0.20 (load use stall)*1
+ 0.25 (branch) * 0.20 (miss rate)*2
CPI = 1 + 0.02 + 0.1 = 1.12
q  What if branches are resolved at Decode?

CPI = 1 + 0.10 (loads) *0.20 (load use stall)*1
+ 0.25 (branch) * 0.20 (miss rate)*1
CPI = 1 + 0.02 + 0.05 = 1.07

Performance with deeper pipelines
q  Assume the setup of the previous problem.

q  What if we have a 10 stage pipeline?
•  InstrucIons are fetched at stage 1.
•  Register file is read at stage 3.
•  ExecuIon begins at stage 5.
•  Branches are resolved at stage 7.
•  Memory access is complete in stage 9.
q  What’s the CPI of the program?
q  If the clock rate was doubled by doubling the pipeline
depth, is performance also doubled?

Performance with deeper pipelines
q  Assume the setup of the previous problem.

q  What if we have a 10 stage pipeline?
•  InstrucIons are fetched at stage 1.
•  Register file is read at stage 3.
•  ExecuIon begins at stage 5.
•  Branches are resolved at stage 7.
•  Memory access is complete in stage 9.
q  What’s the CPI of the program?
q  If the clock rate was doubled by doubling the pipeline
depth, is performance also doubled?
CPI = 1 + 0.10 (loads) *0.20 (load use stall)*4 + 0.25 (branch) * 0.20 (N stalls)*6
CPI = 1 + 0.08 + 0.30 = 1.38
Time = 1.38 * 5ns = 6.9 ns

15 - 370w18 - Pipeline Last

Uploaded by

Copyright:

Available Formats

You might also like

15 - 370w18 - Pipeline Last

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

15 - 370w18 - Pipeline Last

Uploaded by

Copyright:

Available Formats

15.

Basic Processor Design –

EECS 370 – Introduction to Computer Organization – Winter 2018

Mark Brehob, Reetu Das and Harry Davis

© Davis, Das, and Brehob, 2018

q Homework 4 due today 3/8

The University of Michigan 2

beq fetch decode execute memory writeback

add fetch decode execute

EECS 370: Introduction to The University of Michigan 3

q Detect and stall

q Speculate and Squash-if-Wrong aka Branch Prediction

EECS 370: Introduction to The University of Michigan 4

• Requires three things to be predicted at fetch stage:

• Observation: Target address remains the same for a conditional

EECS 370: Introduction to The University of Michigan 5

PC + inst size Next Fetch

Cache of Target Addresses (BTB: Branch Target Buffer)

q Predict not taken: ~50% accurate.

q Pentium: ~85% accurate.

EECS 370: Introduction to The University of Michigan 8

Last-time and 2BC predictors exploit “last-time” predictability

Realization 1: A branch’s outcome can be correlated with

Realization 2: A branch’s outcome can be correlated with past

EECS 370: Introduction to The University of Michigan

EECS 370: Introduction to The University of Michigan

Last-time and 2BC predictors exploit “last-time” predictability

Realization 1: A branch’s outcome can be correlated with

Realization 2: A branch’s outcome can be correlated with past

EECS 370: Introduction to The University of Michigan

McFarling, “Combining Branch Predictors,” DEC WRL TR 1993.

EECS 370: Introduction to The University of Michigan

EECS 370: Introduction to The University of Michigan

• Minimum branch penalty: 7 cycles

EECS 370: Introduction to The University of Michigan 16

EECS 370: Introduction to The University of Michigan 17

q What about an early (fetch) exception?

EECS 370: Introduction to The University of Michigan 18

q What about a late (WB) exception?

EECS 370: Introduction to The University of Michigan 19

EECS 370: Introduction to The University of Michigan 20

q InstrucIon Level Parallelism – Superscalar Pipeline

EECS 370: Introduction to The University of Michigan

Instr. Register Data

EECS 370: Introduction to The University of Michigan

EECS 370: Introduction to The University of Michigan

q Some instrucIons take a long Ime to execute

EECS 370: Introduction to The University of Michigan

EECS 370: Introduction to The University of Michigan

q Virtual MulIprocessor (MulI-­‐Threading or HyperThreading)

EECS 370: Introduction to The University of Michigan

EECS 370: Introduction to The University of Michigan 28

EECS 370: Introduction to The University of Michigan 30

q ExecuIon Ime (Time/Program) =

EECS 370: Introduction to The University of Michigan 31

EECS 370: Introduction to The University of Michigan 32

EECS 370: Introduction to The University of Michigan 33

EECS 370: Introduction to The University of Michigan 34

Stall : 4 cycles nor 3 4 5 IF ID* ID* ID EX ME WB

q  Homework 4 due today 3/8

q  Detect and stall

q  Speculate and Squash-if-Wrong aka Branch Prediction

•  Requires three things to be predicted at fetch stage:

•  Observation: Target address remains the same for a conditional

q  Predict not taken: ~50% accurate.

q  Pentium: ~85% accurate.

•  Minimum branch penalty: 7 cycles

q  What about an early (fetch) exception?

q  What about a late (WB) exception?

q  InstrucIon Level Parallelism – Superscalar Pipeline

q  Some instrucIons take a long Ime to execute

q  Virtual MulIprocessor (MulI-‐Threading or HyperThreading)

q  ExecuIon Ime (Time/Program) =

# InstrucIons = 100(0.55 + 0.5*4) + 1 = 451

q  Assume branches are resolved at Execute?

q  What if branches are resolved at Decode?

q  Assume branches are resolved at Execute?

q  What if branches are resolved at Decode?

q  Assume the setup of the previous problem.

q  Assume the setup of the previous problem.