Lec 04 Pipelined Processor
M S Bhat
Dept. of E&C,
NITK Surathkal
Presentation Outline
Pipelining versus Serial Execution
Pipeline Hazards
Control Hazards
2/22/2020 EC792 2
Pipelining Example
Laundry Example: Three Stages
Sequential Laundry
[Figure: sequential laundry from 6 PM to midnight; four loads, each needing three 30-minute stages, run back to back]
Pipelined Laundry: Start Load ASAP
[Figure: pipelined laundry starting at 6 PM; each load starts 30 minutes after the previous one, and all four loads finish by 9 PM]
Serial Execution versus Pipelining
Consider a task that can be divided into k subtasks
The k subtasks are executed on k different stages
Each subtask requires one time unit
The total execution time of the task is k time units
Pipelining overlaps the execution of the subtasks
The k stages work in parallel on k different tasks
Tasks enter/leave pipeline at the rate of one task per time unit
[Figure: space-time diagram; successive tasks each flow through stages 1, 2, …, k, offset by one time unit, so the k stages work on k different tasks at once]
Synchronous Pipeline
Uses clocked registers between stages
Upon arrival of a clock edge …
All registers hold the results of previous stages simultaneously
[Figure: stages S1, S2, …, Sk between input and output, separated by clocked registers that are all driven by a common clock]
Pipeline Performance
Let ti = time delay in stage Si
Clock cycle t = max(ti) is the maximum stage delay
Clock frequency f = 1/t = 1/max(ti)
A k-stage pipeline can process n tasks in k + n – 1 cycles
k cycles are needed to complete the first task
n – 1 cycles are needed to complete the remaining n – 1 tasks
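The cycle count and the resulting throughput speedup can be checked with a short sketch (illustrative values, not from the slides):

```python
# Sketch: cycle count and speedup for an ideal k-stage pipeline.

def pipeline_cycles(k: int, n: int) -> int:
    """Cycles for n tasks on a k-stage pipeline: k for the first task,
    then one more cycle per remaining task."""
    return k + n - 1

def speedup(k: int, n: int) -> float:
    """Serial execution takes n*k cycles; pipelined takes k + n - 1."""
    return (n * k) / pipeline_cycles(k, n)

print(pipeline_cycles(5, 100))       # 104 cycles
print(round(speedup(5, 100), 2))     # 4.81, approaching k = 5 for large n
```

For large n the speedup approaches k, the number of stages.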
MIPS Processor Pipeline
Five stages, one cycle per stage
1. IF: Instruction Fetch from instruction memory
2. ID: Instruction Decode, register read, and J/Br address
3. EX: Execute operation or calculate load/store address
4. MEM: Memory access for load and store
5. WB: Write Back result to register
Performance Example
Assume the following operation times for components:
Instruction and data memories: 200 ps
ALU and adders: 180 ps
Decode and Register file access (read or write): 150 ps
Ignore the delays in PC, mux, extender, and wires
Single-Cycle vs Multicycle Implementation
Break instruction execution into five steps
Instruction fetch
Instruction decode and register read
Execution, memory address calculation, or branch completion
Memory access or ALU instruction completion
Load instruction completion
Solution
Instruction   Instruction   Register   ALU         Data     Register
Class         Memory        Read       Operation   Memory   Write      Total
ALU           200           150        180         –        150        680 ps
Load          200           150        180         200      150        880 ps
Store         200           150        180         200      –          730 ps
Branch        200           150        180         –        –          530 ps
Jump          200           150 (decode and update PC)                 350 ps
Single-Cycle vs Pipelined Performance
Consider a 5-stage instruction execution in which …
Instruction fetch = Data memory access = 200 ps
ALU operation = 180 ps
Register read = register write = 150 ps
What is the clock cycle of the single-cycle processor?
What is the clock cycle of the pipelined processor?
What is the speedup factor of pipelined execution?
Solution
Single-Cycle Clock = 200+150+180+200+150 = 880 ps
[Figure: two instructions executed serially, each taking one 880 ps cycle through IF, Reg, ALU, MEM, Reg]
Single-Cycle versus Pipelined – cont’d
Pipelined clock cycle = max(200, 180, 150) = 200 ps
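A minimal sketch reproducing these numbers: the single-cycle clock is the sum of the stage delays, while the pipelined clock is the maximum stage delay.

```python
# Stage delays from the slides, in picoseconds.
stage_ps = {"IF": 200, "ID": 150, "EX": 180, "MEM": 200, "WB": 150}

single_cycle = sum(stage_ps.values())   # 880 ps per instruction
pipelined = max(stage_ps.values())      # 200 ps per cycle

n = 1_000_000                           # instruction count (illustrative)
t_single = n * single_cycle
t_pipe = (len(stage_ps) + n - 1) * pipelined   # k + n - 1 cycles

print(single_cycle, pipelined)          # 880 200
print(round(t_single / t_pipe, 2))      # 4.4 = 880/200 for large n
```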
Pipeline Performance Summary
Pipelining doesn’t improve latency of a single instruction
However, it improves throughput of entire workload
Instructions are initiated and completed at a higher rate
Pipeline Hazards
Control Hazards
Single-Cycle Datapath
Shown below is the single-cycle datapath
How to pipeline this single-cycle datapath?
Answer: Introduce pipeline register at end of each stage
IF = Instruction Fetch, ID = Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back
[Figure: single-cycle datapath with PC, Instruction Memory, Register File, ALU, and Data Memory, plus the jump/branch target address logic]
Pipelined Datapath
Pipeline registers are shown in green, including the PC
The same clock edge updates all pipeline registers, the register file, and the data memory (for a store instruction)
IF = Instruction Fetch, ID = Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back
[Figure: pipelined datapath; pipeline registers (shown in green), including the PC, separate the five stages]
Problem with Register Destination
Is there a problem with the register destination address?
The instruction in the ID stage differs from the one in the WB stage
The instruction in the WB stage does not write to its own destination register, but to the destination register of the instruction currently in the ID stage
[Figure: pipelined datapath; the register file's write address RW is driven by the Rd field of the instruction currently in ID, not by the instruction in WB]
Pipelining the Destination Register
Destination Register number should be pipelined
Destination register number is passed from ID to WB stage
The WB stage writes back data knowing the destination register
[Figure: pipelined datapath; the destination register number is carried through pipeline registers Rd2, Rd3, and Rd4, so the WB stage writes to the correct register]
Graphically Representing Pipelines
Multiple instruction execution over multiple clock cycles
Instructions are listed in execution order from top to bottom
Clock cycles move from left to right
Figure shows the use of resources at each stage and each cycle
[Figure: graphical pipeline representation; program execution order runs top to bottom, clock cycles CC1 to CC8 run left to right, and each box shows the resource used in that stage and cycle]
Instruction-Time Diagram
Instruction-Time Diagram shows:
Which instruction occupying what stage at each clock cycle
Instruction flow is pipelined over the 5 stages; up to five instructions can be in the pipeline during the same cycle (Instruction-Level Parallelism, ILP)
ALU instructions skip the MEM stage; store instructions skip the WB stage

                     CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9
lw  R15, 8(R19)      IF   ID   EX   MEM  WB
lw  R14, 8(R21)           IF   ID   EX   MEM  WB
ori R12, R19, 7                IF   ID   EX   –    WB
sub R13, R18, R11                   IF   ID   EX   –    WB
sw  R18, 10(R19)                         IF   ID   EX   MEM  –
Control Signals
IF ID EX MEM WB
[Figure: pipelined datapath annotated with control signals (J, Beq, Bne, PCSrc, ALU control, memory access, and write-back controls)]
Pipelined Control
[Figure: pipelined control; the Main & ALU Control unit in ID decodes Op and func, and the EX, MEM, and WB control signals travel down the pipeline with the instruction]
Pipelined Control – Cont'd
ID stage generates all the control signals
Pipeline the control signals as the instruction moves
Extend the pipeline registers to include the control signals
Control Signals Summary
Op   RegDst  ALUSrc  ExtOp  J  Beq  Bne  ALUCtrl  MemRd  MemWr  MemReg  RegWrite
j    x       x       x      1  0    0    x        0      0      x       0
Next . . .
Pipelining versus Serial Execution
Pipeline Hazards
Control Hazards
Pipeline Hazards
Hazards: situations that would cause incorrect execution if the next instruction were launched during its designated clock cycle
1. Structural hazards
Caused by resource contention
Using same resource by two instructions during the same cycle
2. Data hazards
An instruction may compute a result needed by next instruction
Hardware can detect dependencies between instructions
3. Control hazards
Caused by instructions that change control flow (branches/jumps)
Delays in changing the flow of control
Hazards complicate pipeline control and limit performance
Structural Hazards
Problem
Attempt to use the same hardware resource by two different
instructions during the same cycle
Structural Hazard
Example
Writing back the ALU result in stage 4 conflicts with writing load data in stage 5: two instructions attempt to write the register file during the same cycle
[Figure: instruction-time diagram CC1 to CC9 showing the register-file write conflict]
Resolving Structural Hazards
Serious Hazard:
Hazard cannot be ignored
Next . . .
Pipelining versus Serial Execution
Pipeline Hazards
Control Hazards
Data Hazards
Dependency between instructions causes a data hazard
The dependent instructions are close to each other
Pipelined execution might change the order of operand access
Example of a RAW Data Hazard
[Figure: instruction-time diagram CC1 to CC8; R18 holds 10 until it is written (becoming 20) in CC5, so add R20, R18, R13 and other nearby dependent instructions would read the stale value 10]
Solution 2: Forwarding ALU Result
The ALU result is forwarded (fed back) to the ALU input
No bubbles are inserted into the pipeline and no cycles are wasted
ALU result is forwarded from ALU, MEM, and WB stages
[Figure: with forwarding, the dependent instructions receive the new value of R18 (20) through the feedback paths; no cycles are wasted]
Implementing Forwarding
Two multiplexers added at the inputs of A & B registers
Data from ALU stage, MEM stage, and WB stage is fed back
Two signals: ForwardA and ForwardB control forwarding
[Figure: forwarding hardware; 4-to-1 muxes controlled by ForwardA and ForwardB at the inputs of the A and B registers select among the register file value, the ALU result, the MEM-stage result, and the WB data]
Forwarding Control Signals
Signal Explanation
ForwardA = 0 First ALU operand comes from register file = Value of (Rs)
ForwardB = 0 Second ALU operand comes from register file = Value of (Rt)
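A sketch of how the forwarding selects might be generated, assuming the mux encoding used in the forwarding example that follows (0 = register file, 1 = ALU/EX stage, 2 = MEM stage, 3 = WB stage); the argument names are hypothetical:

```python
# Sketch of ForwardA/ForwardB generation: pick the youngest in-flight
# producer of the source register; register R0 never forwards.

def forward_select(src_reg, ex_rd, ex_regwrite, mem_rd, mem_regwrite,
                   wb_rd, wb_regwrite):
    if src_reg != 0:
        if ex_regwrite and ex_rd == src_reg:
            return 1   # forward the ALU result from the EX stage
        if mem_regwrite and mem_rd == src_reg:
            return 2   # forward from the MEM stage
        if wb_regwrite and wb_rd == src_reg:
            return 3   # forward the write-back data
    return 0           # no hazard: read the register file

# sub R11, R12, R15 in ID, with ori (EX) writing R15 and lw (MEM) writing R12:
print(forward_select(12, 15, True, 12, True, 0, False))  # 2 (ForwardA)
print(forward_select(15, 15, True, 12, True, 0, False))  # 1 (ForwardB)
```

The two calls reproduce ForwardA = 2 and ForwardB = 1 from the example on the next slide.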
Forwarding Example
Instruction sequence:
  lw  R12, 4(R8)
  ori R15, R9, 2
  sub R11, R12, R15
When the sub instruction is in the decode stage, ori is in the ALU stage and lw is in the MEM stage
ForwardA = 2 (from MEM stage), ForwardB = 1 (from ALU stage)
[Figure: datapath snapshot with sub R11,R12,R15 in ID, ori R15,R9,2 in EX, and lw R12,4(R8) in MEM; the forwarding muxes select the MEM-stage result for A and the ALU result for B]
RAW Hazard Detection
The current instruction being decoded is in the Decode stage
The previous instruction is in the Execute stage
The second previous instruction is in the Memory stage
The third previous instruction is in the Write Back stage
Hazard Detect and Forward Logic
[Figure: hazard detect and forward logic; Rs and Rt of the instruction in ID are compared with the pipelined destination register numbers (Rd2, Rd3, Rd4) and their RegWrite signals to generate ForwardA and ForwardB]
Next . . .
Pipelining versus Serial Execution
Pipeline Hazards
Control Hazards
Load Delay
Unfortunately, not all data hazards can be forwarded
Load has a delay that cannot be eliminated by forwarding
In the example shown below …
The LW instruction does not read data until the end of CC4
Forwarding the data to ADD at the end of CC3 is therefore NOT possible
[Figure: lw R18, 20(R17) followed immediately by add R20, R18, R13; the load data is not available in time for the add]
However, the load can forward data to the 2nd next and later instructions
Detecting RAW Hazard after Load
Detecting a RAW hazard after a Load instruction:
The load instruction will be in the EX stage
Instruction that depends on the load data is in the decode stage
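The detection condition can be sketched as follows (field names are hypothetical); the stall is asserted when the load in EX writes a register that the instruction in ID reads:

```python
# Sketch of load-use hazard detection for the 5-stage pipeline.

def load_use_stall(ex_memread: bool, ex_rt: int, id_rs: int, id_rt: int) -> bool:
    """Stall IF/ID for one cycle so the load result can be forwarded
    from the MEM stage; loads into R0 never matter."""
    return ex_memread and ex_rt != 0 and ex_rt in (id_rs, id_rt)

# lw R18, 20(R17) in EX, add R20, R18, R13 in ID: must stall one cycle
print(load_use_stall(True, 18, 18, 13))   # True
# one bubble later the load is in MEM and its data can be forwarded
print(load_use_stall(True, 18, 20, 13))   # False
```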
Stall the Pipeline for one Cycle
ADD instruction depends on LW => stall at CC3
Allow Load instruction in ALU stage to proceed
Freeze PC and Instruction registers (NO instruction is fetched)
Introduce a bubble into the ALU stage (bubble is a NO-OP)
Load can forward data to next instruction after delaying it
[Figure: timeline CC1 to CC8 with a one-cycle bubble inserted between LW and ADD]
Showing Stall Cycles
Stall cycles can be shown on instruction-time diagram
Hazard is detected in the Decode stage
Stall indicates that instruction is delayed
Instruction fetching is also delayed after a stall
Example:
Data forwarding is shown using blue arrows
[Figure: instruction-time diagram CC1 to CC10 showing stall cycles and blue forwarding arrows]
Hazard Detect, Forward, and Stall
[Figure: hazard detect, forward, and stall unit; on a load-use hazard (MemRead in EX with a matching destination register) it disables the PC and Instruction registers and zeroes the control signals to insert a bubble]
Code Scheduling to Avoid Stalls
Compilers reorder code in a way to avoid load stalls
Consider the translation of the following statements:
A = B + C; D = E – F; // A thru F are in Memory
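A sketch of the effect of scheduling, assuming a standard MIPS translation of the two statements (the slide does not give one) and counting only load-use stalls under full forwarding:

```python
# Sketch: count load-use stalls in a straight-line instruction sequence.
# Only a load followed immediately by a consumer costs one stall cycle.

def load_use_stalls(seq):
    """seq: list of (op, dest, srcs); a lw result is usable two slots later."""
    stalls = 0
    for i in range(1, len(seq)):
        op_prev, dest_prev, _ = seq[i - 1]
        _, _, srcs = seq[i]
        if op_prev == "lw" and dest_prev in srcs:
            stalls += 1
    return stalls

# Naive translation of A = B + C; D = E - F (assumed, symbolic operands):
naive = [("lw", "B", []), ("lw", "C", []), ("add", "A", ["B", "C"]),
         ("sw", None, ["A"]), ("lw", "E", []), ("lw", "F", []),
         ("sub", "D", ["E", "F"]), ("sw", None, ["D"])]

# Scheduled: hoist the loads so no consumer immediately follows its load.
scheduled = [("lw", "B", []), ("lw", "C", []), ("lw", "E", []),
             ("add", "A", ["B", "C"]), ("lw", "F", []),
             ("sw", None, ["A"]), ("sub", "D", ["E", "F"]),
             ("sw", None, ["D"])]

print(load_use_stalls(naive))      # 2 stall cycles
print(load_use_stalls(scheduled))  # 0
```

Reordering removes both load-use stalls without changing the program's result.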
Name Dependence: Write After Read
Instruction J writes its result after it is read by I
Called anti-dependence by compiler writers
I: sub R12, R9, R11 # R9 is read
J: add R9, R10, R11 # R9 is written
Results from reuse of the name R9
NOT a data hazard in the 5-stage pipeline because:
Reads are always in stage 2
Writes are always in stage 5, and
Instructions are processed in order
Anti-dependence can be eliminated by renaming
Use a different destination register for add (e.g., R13)
Name Dependence: Write After Write
Same destination register is written by two instructions
Called output-dependence in compiler terminology
I: sub R9, R12, R11 # R9 is written
J: add R9, R10, R11 # R9 is written again
Not a data hazard in the 5-stage pipeline because:
All writes are ordered and always take place in stage 5
However, can be a hazard in more complex pipelines
If instructions are allowed to complete out of order, and
Instruction J completes and writes R9 before instruction I
Output dependence can be eliminated by renaming R9
Read After Read is NOT a name dependence
Hazards due to Loads and Stores
Consider the following statements:
sw R10, 0(R1)
lw R11, 6(R5)
Next . . .
Pipelining versus Serial Execution
Pipeline Hazards
Control Hazards
Control Hazards
Jump and Branch can cause great performance loss
Jump instruction needs only the jump target address
Branch instruction needs two things:
Branch Result: Taken or Not Taken
Branch Target Address:
Next PC = PC + 4 if branch is NOT taken
Next PC = PC + 4 + 4 × immediate if branch is taken
2-Cycle Branch Delay
Control logic detects a Branch instruction in the 2nd Stage
ALU computes the Branch outcome in the 3rd Stage
Next1 and Next2 instructions will be fetched anyway
Convert Next1 and Next2 into bubbles if branch is taken
[Figure: a branch in cc1 computes its outcome in cc3 (ALU stage); the target instruction at L1 is fetched in cc4, so the two instructions fetched in cc2 and cc3 become bubbles if the branch is taken]
Implementing Jump and Branch
[Figure: datapath implementing J, Beq, and Bne; the jump or branch target is selected by PCSrc, the control unit decodes Op and func, and the two instructions after a taken branch are converted into bubbles (control signals = 0); branch delay = 2 cycles]
Predict Branch NOT Taken
Branches can be predicted to be NOT taken
If branch outcome is NOT taken then
Next1 and Next2 instructions can be executed
Do not convert Next1 & Next2 into bubbles
No wasted cycles
Else, convert Next1 and Next2 into bubbles – 2 wasted cycles
Reducing the Delay of Branches
Branch delay can be reduced from 2 cycles to just 1 cycle
[Figure: branch delay reduced to 1 cycle; the register values are compared in the ID stage, and a reset signal converts the single instruction after a jump or taken branch into a bubble (control signals = 0)]
Branch Behavior in Programs
Based on SPEC benchmarks on DLX
Branches occur with a frequency of 14% to 16% in integer
programs and 3% to 12% in floating point programs.
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
Execute successor instructions in sequence
“Flush” instructions in pipeline if branch actually taken
Advantage of late pipeline state update
33% DLX branches not taken on average
PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
67% DLX branches taken on average
#4: Define branch to take place AFTER the n following instructions
(Delayed branch)
Next . . .
Pipelining versus Serial Execution
Pipeline Hazards
Control Hazards
Branch Hazard Alternatives
Predict Branch Not Taken (previously discussed)
Successor instruction is already fetched
Do NOT Flush instruction after branch if branch is NOT taken
Flush only instructions appearing after Jump or taken branch
Delayed Branch
Define branch to take place AFTER the next instruction
Compiler/assembler fills the branch delay slot (for 1 delay cycle)
Delayed Branch
Define branch to take place after the next instruction
Instruction in branch delay slot is always executed
Compiler (tries to) move a useful instruction into the delay slot
For a 1-cycle branch delay, we have one delay slot
From before the branch: always helpful when possible
Delayed Branch
From the Target: Helps when branch is taken. May duplicate instructions
Instructions between BEQ and SUB (in fall through) must not use R4.
Why is the instruction at L1 duplicated? Because L1 can be reached by another path!
What if R5 or R6 changed?
Delayed Branch
From Fall Through: Helps when branch is not taken
(default case)
Instructions at the target (L1 and after) must not use R4 until it is set (written) again.
Figure C.14 Scheduling the branch delay slot. The top box in each pair shows the code before scheduling; the bottom box shows
the scheduled code. In (a), the delay slot is scheduled with an independent instruction from before the branch. This is the best
choice. Strategies (b) and (c) are used when (a) is not possible. In the code sequences for (b) and (c), the use of R1 in the branch
condition prevents the DADD instruction (whose destination is R1) from being moved after the branch. In (b), the branch delay slot
is scheduled from the target of the branch; usually the target instruction will need to be copied because it can be reached by another
path. Strategy (b) is preferred when the branch is taken with high probability, such as a loop branch. Finally, the branch may be
scheduled from the not-taken fall-through as in (c). To make this optimization legal for (b) or (c), it must be OK to execute the
moved instruction when the branch goes in the unexpected direction. By OK we mean that the work is wasted, but the program will
still execute correctly. This is the case, for example, in (c) if R7 were an unused temporary register when the branch goes in the
unexpected direction.
Copyright © 2011, Elsevier Inc. All rights Reserved.
Filling delay slots
Compiler effectiveness for single branch delay slot:
Fills about 60% of branch delay slots
About 80% of instructions executed in branch delay slots
are useful in computation
About 48% (i.e., 60% × 80%) of slots usefully filled
Canceling branches or nullifying branches
Include a prediction of whether the branch is taken or not taken
If the prediction is correct, the instruction in the delay slot is
executed
If the prediction is incorrect, the instruction in the delay slot
is quashed.
Allow more slots to be filled from the target address or fall
through
Figure C.17 Misprediction rate on SPEC92 for a profile-based predictor varies widely but is generally better for the floating-
point programs, which have an average misprediction rate of 9% with a standard deviation of 4%, than for the integer
programs, which have an average misprediction rate of 15% with a standard deviation of 5%. The actual performance depends
on both the prediction accuracy and the branch frequency, which vary from 3% to 24%.
Compiler “Static” Prediction of
Taken/Untaken Branches
Two strategies examined
Backward branch predict taken, forward branch not taken
Profile-based prediction: record branch behavior, predict
branch based on prior run
[Figure: instructions executed per mispredicted branch (log scale, 1 to 100,000) for profile-based prediction on gcc, doduc, espresso, ora, compress, tomcatv, alvinn, hydro2d, swm256, and mdljsp2]
Zero-Delayed Branching
How to achieve zero delay for a jump or a taken branch?
Jump or branch target address is computed in the ID stage
Next instruction has already been fetched in the IF stage
Solution
Introduce a Branch Target Buffer (BTB) in the IF stage
Store the target address of recent branch and jump instructions
Branch Target Buffer
The branch target buffer is implemented as a small cache
Stores the target address of recent branches and jumps
We must also have prediction bits
To predict whether branches are taken or not taken
The prediction bits are dynamically determined by the hardware
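A toy model of such a buffer; a real BTB is a small cache indexed by low-order PC bits with a tag compare, so the direct-mapped dict and the fixed PC + 4 fall-through here are simplifications:

```python
# Sketch of a direct-mapped Branch Target Buffer consulted in the IF stage.

class BTB:
    def __init__(self, index_bits: int = 4):
        self.size = 1 << index_bits
        self.entries = {}  # index -> (branch_pc_tag, target, predict_taken)

    def lookup(self, pc: int) -> int:
        """If the PC hits an entry that predicts taken, fetch the target."""
        e = self.entries.get(pc % self.size)
        if e and e[0] == pc and e[2]:
            return e[1]          # predicted-taken target address
        return pc + 4            # otherwise fall through

    def update(self, pc: int, target: int, taken: bool) -> None:
        """After the branch resolves, record its target and outcome."""
        self.entries[pc % self.size] = (pc, target, taken)

btb = BTB()
btb.update(0x40, 0x100, taken=True)
print(hex(btb.lookup(0x40)))  # 0x100 (hit, predicted taken)
print(hex(btb.lookup(0x44)))  # 0x48  (miss: PC + 4)
```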
[Figure: Branch Target & Prediction Buffer; the low-order PC bits index a table of recent branch addresses, target addresses, and prediction bits; on an address match with predict_taken set, the target address is muxed into the PC instead of the incremented PC]
Dynamic Branch Prediction
Prediction of branches at runtime using prediction bits
Prediction bits are associated with each entry in the BTB
Prediction bits reflect the recent history of a branch instruction
Typically few prediction bits (1 or 2) are used per entry
We don’t know if the prediction is correct or not
If correct prediction …
Continue normal execution – no wasted cycles
If incorrect prediction (misprediction) …
Flush the instructions that were incorrectly fetched – wasted cycles
Update prediction bits and target address for future use
Dynamic Branch Prediction – Cont’d
[Flowchart: in IF, the PC addresses both the Instruction Memory and the Branch Target Buffer; if a BTB entry is found with predict-taken, PC = target address, otherwise the PC is incremented; a correct prediction continues normal execution with no stall cycles]
1-bit Prediction Scheme
Prediction is just a hint that is assumed to be correct
If incorrect then fetched instructions are flushed
1-bit prediction scheme is simplest to implement
1 bit per branch instruction (associated with BTB entry)
Record last outcome of a branch instruction (Taken/Not taken)
Use last outcome to predict future behavior of a branch
[State diagram: two states, Predict Taken and Predict Not Taken; a taken outcome moves or keeps the state at Predict Taken, and a not-taken outcome at Predict Not Taken]
1-Bit Predictor: Shortcoming
Inner loop branch mispredicted twice!
Mispredict as taken on last iteration of inner loop
Then mispredict as not taken on first iteration of inner
loop next time around
outer: …
…
inner: …
…
bne …, …, inner
…
bne …, …, outer
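The double misprediction can be reproduced with a sketch of a 1-bit predictor on the inner-loop branch (initial state assumed to be predict-taken; four inner iterations give the outcome pattern 1110 per pass):

```python
# Sketch: 1-bit predictor remembering only the last outcome of the branch.

def one_bit_mispredicts(outcomes):
    state, miss = 1, 0                 # 1 = predict taken (assumed start)
    for taken in outcomes:
        if (state == 1) != taken:
            miss += 1
        state = 1 if taken else 0      # remember only the last outcome
    return miss

inner = ([True] * 3 + [False]) * 4     # 4 passes of the inner loop: 1110...
print(one_bit_mispredicts(inner))      # 7 misses in 16: two per pass after warm-up
```

After the first pass, every pass mispredicts twice: once on the loop exit and once on re-entry.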
2-bit Prediction Scheme
1-bit prediction scheme has a performance shortcoming
2-bit prediction scheme works better and is often used
4 states: strong and weak predict taken / predict not taken
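The same outcome stream through a sketch of a 2-bit saturating counter (initial value 11 = strongly taken, an assumption; the MSB gives the prediction):

```python
# Sketch: 2-bit saturating counter; taken increments, not taken decrements.

def two_bit_mispredicts(outcomes, counter=3):
    miss = 0
    for taken in outcomes:
        if (counter >= 2) != taken:    # MSB of the counter is the prediction
            miss += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return counter, miss

outcomes = ([True] * 3 + [False]) * 4  # same 1110 pattern as the 1-bit case
_, miss = two_bit_mispredicts(outcomes)
print(miss)   # 4: only the loop-exit branch is mispredicted, once per pass
```

Compared with the 1-bit scheme's 7 misses on this pattern, a single not-taken outcome no longer flips the prediction.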
2-bit Prediction Scheme
Alternative state machine
1-Bit Branch History Table (BHT)
Example:
main()
{
    int i, j;
    while (1)
        for (i=1; i<=4; i++) {
            ......
            j = i++;
        }
}
'while' branch: always TAKEN → 11111111111111111…
'for' branch: TAKEN × 3, NOT TAKEN × 1 → 1110111011101110…
[Figure: the PC indexes the BHT; the selected entry's recorded branch outcomes feed an automaton that produces the prediction]
2-Bit BHT
2-bit scheme: change the prediction only after two consecutive mispredictions
MSB of the state symbol represents the prediction;
1: TAKEN, 0: NOT TAKEN
[State diagram: four states 11, 10, 01, 00; states 11 and 10 predict TAKEN, 01 and 00 predict NOT TAKEN; a taken outcome (T) moves the counter toward 11, a not-taken outcome (NT) toward 00]

Trace, assuming an initial counter value of 11 (O: correct, X: mispredict):
branch outcome       1  1  1  0  1  1  1  0  1  1  1  0 ...
counter value       11 11 11 11 10 11 11 11 10 11 11 11 ...
prediction           T  T  T  T  T  T  T  T  T  T  T  T ...
new counter value   11 11 11 10 11 11 11 10 11 11 11 10 ...
correctness          O  O  O  X  O  O  O  X  O  O  O  X ...
2-Level Self Prediction
• First level
– Record the previous few outcomes (history) of the branch itself
– Branch History Table (BHT): a shift register
• Second level
– Predictor Table (PT), indexed by the history pattern captured in the BHT
– Each entry is a 2-bit counter
2-Level Self Prediction
Algorithm
Using PC, read BHT
Using self history from BHT, read the
counter value from PT
If (MSB==1), predict T
else predict NT
After resolving branch outcome, bookkeeping
If mispredicted, discard all speculative executions
Shift right the new branch outcome into the entry read from BHT
Update PT: If(branch outcome==T)
increase counter value by 1
else decrease by 1
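The algorithm above can be sketched directly; the 4-bit history shifted right with the new outcome at the MSB and the initial counter value of 10 follow the slide's trace, while the lazily allocated table is an implementation convenience:

```python
# Sketch of the two-level self-history predictor described above.

HIST_BITS = 4

def two_level(outcomes):
    history, pt = 0, {}            # pt: history pattern -> 2-bit counter
    results = []                   # True where the prediction was correct
    for taken in outcomes:
        ctr = pt.get(history, 2)   # counters assumed initialized to 10
        results.append((ctr >= 2) == taken)   # predict T if MSB == 1
        pt[history] = min(ctr + 1, 3) if taken else max(ctr - 1, 0)
        # shift right; new outcome enters at the MSB, as in the trace
        history = (history >> 1) | (int(taken) << (HIST_BITS - 1))
    return results

res = two_level(([True] * 3 + [False]) * 6)   # 1110 repeated
print(res.count(False))        # 1: a single miss while the history warms up
print(all(res[4:]))            # True: the repeating pattern is fully learned
```

After one warm-up miss the predictor is 100% accurate on this periodic pattern, matching the trace on the slide.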
Prediction accuracy reaches 100% once the pattern is learned:
Branch outcome       1    1    1    0    1    1    1    0    1    1    1    0
Self history (BHT)  0000 1000 1100 1110 0111 1011 1101 1110 0111 1011 1101 1110
Counter value (PT)   10   10   10   10   10   10   10   01   11   11   11   00
Prediction           T    T    T    T    T    T    T    NT   T    T    T    NT
New BHT entry       1000 1100 1110 0111 1011 1101 1110 0111 1011 1101 1110 0111
New PT entry         11   11   11   01   11   11   11   00   11   11   11   00
Correctness          O    O    O    X    O    O    O    O    O    O    O    O
[Figure: an N-bit Branch History Register (a shift register holding the outcomes Rc-k … Rc-1 of recent branches) selects one of 2^N entries in a Pattern History Table; the indexed entry's FSM state gives the prediction and is updated with the actual branch outcome Rc]
Generalized correlated branch predictor
1st level keeps branch history in Branch History Register (BHR)
2nd level segregates pattern history in Pattern History Table (PHT)
Branch History Register
An N-bit shift register gives 2^N patterns in the PHT
Shift-in branch outcomes
1 taken
0 not taken
First-in First-Out
BHR can be
Global
Per-set
Local (Per-address)
Pattern History Table
2^N entries addressed by the N-bit BHR
Each entry keeps a counter (2-bit or more) for prediction
Counter update: the same as 2-bit counter
Can be initialized in alternate patterns (01, 10, 01, 10, ..)
Alias (or interference) problem
Two-Level Branch Prediction
[Example: PC = 0x4001000C selects the BHR for its set, holding 0110; the history indexes PHT entry 00110110, whose 2-bit counter is 10; MSB = 1, so predict Taken]
Predictor Update (Actually, Not Taken)
[Example continued: the branch is actually Not Taken; the PHT entry's counter is decremented from 10 to 01, and a 0 is shifted into the BHR, changing 0110 to 0011]
Correlated Branch Prediction
Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper n-bit branch history table
In general, an (m, n) predictor means recording the last m branches to select between 2^m history tables, each with n-bit counters
Thus, the old 2-bit BHT is a (0, 2) predictor
Global branch history: an m-bit shift register keeping the T/NT status of the last m branches
Each entry in the table has 2^m n-bit predictors
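A sketch of an (m, n) = (2, 2) predictor along these lines; the per-PC counter tables are allocated lazily in a dict, and the initial counter value is an assumption:

```python
# Sketch: (2, 2) correlating predictor; a 2-bit global history selects one
# of 2**M two-bit counters in each branch's table entry.

M, N = 2, 2
MAX = (1 << N) - 1                 # saturating counter ceiling

def correlating_predictor(trace):
    """trace: list of (branch_pc, taken). Returns the misprediction count."""
    ghr, table, miss = 0, {}, 0
    for pc, taken in trace:
        counters = table.setdefault(pc, [MAX // 2 + 1] * (1 << M))
        ctr = counters[ghr]        # global history picks the counter
        if (ctr > MAX // 2) != taken:
            miss += 1
        counters[ghr] = min(ctr + 1, MAX) if taken else max(ctr - 1, 0)
        ghr = ((ghr << 1) | int(taken)) & ((1 << M) - 1)
    return miss

# An alternating branch is perfectly predictable from 2 bits of history:
trace = [(0x40, t) for t in [True, False] * 12]
print(correlating_predictor(trace))   # 1: one warm-up miss, then perfect
```

A plain 2-bit counter would mispredict this alternating branch on every outcome; the correlation on global history removes the aliasing.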
Correlated Branch Predictor
[Figure: the (2,2) correlation scheme versus a plain 2-bit saturating-counter scheme; a 2-bit shift register of global branch history selects one of four 2-bit counters in each hashed Branch-PC entry; in an (M, N) scheme, M is the shift register size in bits and N is the counter width]
Correlating Branches
Tournament Predictors
[Figure: tournament predictor; a table of 2-bit saturating counters indexed by the Branch PC selects, through a mux, between a Local predictor and a Global predictor]
Tournament Predictors
Multilevel branch predictor
Selector for the Global and Local predictors of correlating branch prediction
Use n-bit saturating counter to choose between predictors
Usual choice between global and local predictors
Tournament Predictors
• Advantage of a tournament predictor is the ability to select the right predictor for a particular branch
[Figure, top: misprediction rate versus predictor size (0 to 128), comparing a correlating (2,2) scheme with a tournament predictor; bottom: misprediction rates on SPEC89 benchmarks (spice, gcc, fpppp, nasa7, doducd, matrix300, li, tomcatv, expresso, eqntott) for three predictors: 4,096 entries with 2 bits per entry, unlimited entries with 2 bits per entry, and 1,024 entries with a (2,2) scheme]