Download as pdf or txt
Download as pdf or txt
You are on page 1of 30

Pipeline Hazards

Next . . .
❖ Control Hazards

❖ Dynamic Branch Prediction


 Correlated branches

EC792 2
Correlating Branches
❖ Basic two-bit predictor schemes
addi R2, R0, 2
 use recent behavior of a branch bne R10, R2, B1
to predict its future behavior xor R10, R10, R10
j B3
❖ Improve the prediction accuracy
B1: bne R11, R2, B2
 look also at recent behavior
xor R11, R11, R11
of other branches
j B3

if (aa==2) // B1 B2: beq R10, R11, B3


aa = 0; …
if (bb==2) // B2 B3:

bb = 0; B3 is correlated with B1 and B2;


if (aa!=bb) { // B3
If B1 and B2 are both untaken in the
…….
above code, then B3 will be taken
}
=> Use correlating predictors or
two-level predictors.
EC792 3
Branch Correlation
B1
Code Snippet 0 (NT) 1 (T)

if (aa==2) // B1 B2 B2
aa = 0; 0 1 0 1
if (bb==2) // B2
bb = 0;
B3 B3 B3 B3
if (aa!=bb) { // B3
……. Path: A:0-0 B:0-1 C:1-0 D:1-1
} aa=0 aa=0 aa2 aa2
bb=0 bb2 bb=0 bb2
❖ Branch direction
 Not independent
 Correlated to the path taken
❖ Example: Decision of B3 can be surely known beforehand if
the path to B3 is 0-0
❖ Track path using a 2-bit register
4
Correlating Branches
Example: BNE R1, R0 B1 ; (B1)(d!=0)
if (d==0) ADDI R1,R1,#1; since d==0, make d=1
d=1; B1 : SUBI R3,R1,#1
if (d==1) BNE R3, R0 B2; (B2)(d!=1)
.....
B2 :

Initial value Value of d


of d d==0? B1 before B2 d==1? B2
0 Y NT 1 Y NT
1 N T 1 Y NT
2 N T 2 N T
If B1 is NT, then B2 is NT
1-bit predictor
initialized to NT B1 B1 New B1 B2 B2 New B2
d=? prediction action prediction prediction action prediction
1-bit self history 2 NT T T NT T T
predictor 0 T NT NT T NT NT
2 NT T T NT T T
Sequence of 0 T NT NT T NT NT
2,0,2,0,2,0,...
All branches are mispredicted
EC792 5
Correlating Branches
Example: BNE R1, R0 B1 ; (B1)(d!=0)
if (d==0) ADDI R1,R1,#1; since d==0, make d=1
d=1; B1 : SUBI R3,R1,#1
if (d==1) BNE R3, R0 B2; (B2)(d!=1)
.....
B2 :
Self Gloabal
Prediction Prediction, if last Prediction, if last
bits(XX) branch action was NT branch action was T
NT/NT NT NT
NT/T NT T
T/NT T NT
T/T T T

d=? B1 prediction B1 action new B1 prediction B2 prediction B2 action new B2 prediction


2 NT/NT T T/NT NT/NT T NT/T
0 T/NT NT T/NT NT/T NT NT/T
2 T/NT T T/NT NT/T T NT/T
0 T/NT NT T/NT NT/T NT NT/T
3 T/NT T T/NT NT/T T NT/T

Initial self prediction bits NT/NT and


Initial last branch was NT. Prediction used is shown in Red
Misprediction only in the first prediction !!
EC792 6
Local/Global Predictors
❖ Instead of maintaining a counter for each branch to
capture the common case,
 Maintain a counter for each branch and surrounding pattern
 If the surrounding pattern belongs to the branch being predicted,
the predictor is referred to as a local predictor
 If the surrounding pattern includes neighboring branches, the
predictor is referred to as a global predictor

EC792 7
Correlated Branch Prediction
❖ Idea: record m most recently executed branches as
taken or not taken, and use that pattern to select the
proper n-bit branch history table
❖ In general, (m,n) predictor means record last m branches
to select between 2m history tables, each with n-bit
counters
 Thus, old 2-bit BHT is a (0,2) predictor
❖ Global Branch History: m-bit shift register keeping T/NT
status of last m branches.
❖ Each entry in table has m n-bit predictors.

EC792 8
Correlated Branch Predictor
2-bit shift register
Subsequent (global branch history)
branch
direction
select
Branch PC Branch PC

X X X X
Prediction w Prediction
hash . hash . . . .
. . . . .
. . . . .
. . . . .
2w
2-bit 2-bit 2-bit 2-bit
2-bit counter counter counter counter
counter
(2,2) Correlation Scheme
2-bit Sat. Counter Scheme
❖ (M,N) correlation scheme
 M: shift register size (# bits)
 N: N-bit counter
EC792 9
Correlating Branches

(2,2) predictor Branch address


4
Behavior of recent 2-bits per branch predictor
branches selects
between four
predictions of next Prediction
branch, updating just
that prediction

2-bit global branch history

EC792 10
Tournament Predictors

• A local predictor might work well for some branches or


programs, while a global predictor might work well for others

• Provide one of each and maintain another predictor to


identify which predictor is best for each branch

Local
Predictor M
U
Global X
Predictor

Branch PC Tournament
Predictor
Table of 2-bit
saturating counters
EC792 11
Tournament Predictors
❖ Multilevel branch predictor
 Selector for the Global and Local predictors of correlating branch
prediction
❖ Use n-bit saturating counter to choose between predictors
❖ Usual choice between global and local predictors

EC792 12
Tournament Predictors

EC792 13
Tournament Predictors

EC792 14
Tournament Predictors
• Advantage of tournament predictor is the ability to select the right
predictor for a particular branch

• A typical tournament predictor selects global predictor 40% of the time


for SPEC integer benchmarks

• AMD Opteron and Phenom use tournament predictors

EC792 15
Accuracy v. Size (SPEC89)
10%
Conditional branch misprediction rate

9%

8%

7% Local - 2 bit counters


6%

5%
4%
Correlating - (2,2) scheme
3%

2% Tournament
1%

0%
0 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128

Total predictor size (Kbits)


EC792 16
Tournament Predictors (Intel Core i7)
• Based on predictors used in Core 2 Duo chip
• Combines three different predictors
• Two-bit
• Global history
• Loop exit predictor
• Uses a counter to predict the exact number of taken
branches (number of loop iterations) for a branch
that is detected as a loop branch
• Tournament: Tracks accuracy of each predictor
• Main problem of speculation:
• A mispredicted branch may lead to another branch
being mispredicted !

EC792 17
Accuracy of Different Schemes
20%

18% 4096 Entries 2-bit BHT


Unlimited Entries 2-bit BHT
Frequency of Mispredictions

16%

14% 1024 Entries (2,2) BHT


12%
11%
10%

8%
6% 6% 6%
6%
5% 5%
4%
4%

2%
1% 1%
0%
0%
spice

gcc
fpppp
nasa7

doducd

li
matrix300

expresso

eqntott
tomcatv

4,096 entries: 2-bits per entry Unlimited entries: 2-bits/entry 1,024 entries (2,2)

EC792 18
Branch Prediction is More Important Today
Conditional branches still comprise about 20% of instructions
Correct predictions are more important today - why?
• Pipelines deeper
Branch not resolved until more cycles from fetching - therefore the
misprediction penalty is greater
▪ Cycle times smaller - more emphasis on throughput
(performance)
▪ More functionality between fetch & execute
• Multiple instruction issue (superscalars & VLIW)
Branch occurs almost every cycle
▪ Flushing & refetching more instructions
• Object-oriented programming
More indirect branches - which are harder to predict
• Dual of Amdahl’s Law
Other forms of pipeline stalling are being addressed - so the portion
of CPI due to branch delays is relatively larger
All this means that the potential stalling due to branches is greater
EC792 19
Branch Prediction is More Important Today
On the other hand,
• Chips are denser so we can consider sophisticated
HW solutions
• Hardware cost is small compared to the performance
gain

EC792 20
Directions in Branch Prediction
1: Improve the prediction
▪ Correlated (2-level) predictor (Pentium III - 512 entries, 2-bit,
Pentium Pro - 4 history bits)
• Hybrid local/global predictor (Alpha 21264)
• Confidence predictors
2: Determine the target earlier
• Branch target buffer (Pentium Pro, IA-64 Itanium)
• Next address in I-cache (Alpha 21264, UltraSPARC)
• Return address stack (Alpha 21264, IA-64 Itanium, MIPS R10000,
Pentium Pro, UltraSPARC-3)
3: Reduce misprediction penalty
• Fetch both instruction streams (IBM mainframes, SuperSPARC)
4: Eliminate the branch
• Predicated execution (IA-64 Itanium, Alpha 21264)

EC792 21
Predicated Execution
❖ Predicated instructions execute conditionally
 Some other (previous) instruction sets a condition
 Predicated instruction tests the condition & executes if the condition is
true
 If the condition is false, predicated instruction isn’t executed
▪ i.e., instruction execution is predicated on the condition
 Eliminates conditional branch (expensive if mispredicted)
▪ changes a control hazard to a data hazard
 Fetch both true & false paths

EC792 22
Predicated Execution
❖ Example:
 Without predicated execution
SUB R10, R4, R5 ❖ Advantages
➢ No branch hazard,
BEQZ R10, Label
➢ Creates straight line code
SUB R2, R1, R6
➢ More independent instructions
Label: OR R3, R2, R7
❖ Disadvantages
 With predicated execution ➢ Instructions on both paths are
executed
SUB R10, R4, R5
➢ Hard to add predicated instructions
SUB R2, R1, R6, R10 to an existing instruction set;
OR R3, R2, R7 ➢ Additional register pressure
➢ Complex conditions if nested loops

EC792 23
Conditional Execution in ARM ISA
ARM ISA Example:
This is conditionally executed
; flags set by a previous based on the zero flag as,
instruction
; flags set by a previous
BNE Next instruction
LSL R0, R1, #24
ADD R0, R0, #2 LSLEQ R0, R1, #24
Next ADDEQ R0, R0, #2
;... ;...

By default, data processing operations do not affect the conditional flags


(apart from the comparisons where this is the only effect). To cause the
condition flags to be updated, the S bit of the instruction needs to be set by
postfixing the instruction by S

ADDS R0, R1, R2 ; R0 = R1 + R2


; ... and set flags

EC792 24
Pipelining Complications
❖ Exceptions: Events other than branches or jumps that change the
normal flow of instruction execution. Some types of exceptions:

 I/O Device request


 Invoking an OS service from user program
 Tracing Instruction execution
 Breakpoint (programmer requested interrupt)
 Integer arithmetic overflow
 FP arithmetic anomaly
 Page fault (page not in main memory)
 Misaligned memory accesses
 Memory protection violation
 Use of undefined instruction
 Hardware malfunction
 Power failure
EC792 25
Pipelining Complications
❖ Exceptions: Events other than branches or jumps
that change the normal flow of instruction execution.
❖ 5 instructions executing in 5 stage pipeline
❖ How to stop the pipeline?
 Who caused the interrupt?
 How to restart the pipeline?

Stage Problems causing the interrupts

IF Page fault on instruction fetch; misaligned


memory access; memory-protection violation
ID Undefined or illegal opcode
EX Arithmetic interrupt
MEM Page fault on data fetch; misaligned memory
access; memory-protection violation
EC792 26
Pipelining Complications
❖ Simultaneous exceptions in more than one pipeline stage,
e.g.,
 LOAD with data page fault in MEM stage
 ADD with instruction page fault in IF stage
❖ Solution #1
 Interrupt status vector per instruction
 Defer check until last stage, kill state update if exception
❖ Solution #2
 Interrupt ASAP
 Restart everything that is incomplete

EC792 27
Pipelining Complications
❖ Our MIPS pipeline only writes results at the end of the
instruction’s execution. Not all processors do this.
❖ Address modes: Auto-increment causes register change
during instruction execution
 Interrupts Need to restore register state
 Adds WAR and WAW hazards since writes happen not only in
last stage
❖ Memory-Memory Move Instructions
 Must be able to handle multiple page faults
 VAX and x86 store values temporarily in registers
❖ Condition Codes
 Need to detect the last instruction to change condition codes

EC792 28
Fallacies and Pitfalls
❖Pipelining is easy!
 The basic idea is easy
 The devil is in the details
▪ Detecting data hazards and stalling pipeline

❖Poor ISA design can make pipelining harder


 Complex instruction sets (Intel IA-32)
▪ Significant overhead to make pipelining work
▪ IA-32 micro-op approach
 Complex addressing modes
▪ Register update side effects, memory indirection

EC792 29
Pipeline Hazards Summary
❖ Three types of pipeline hazards
 Structural hazards: conflicts using a resource during same cycle
 Data hazards: due to data dependencies between instructions
 Control hazards: due to branch and jump instructions

❖ Hazards limit the performance and complicate the design


 Structural hazards: eliminated by careful design or more hardware
 Data hazards are eliminated by data forwarding
 However, load delay cannot be completely eliminated
 Delayed branching can be a solution for control hazards
 BTB with branch prediction can reduce branch delay to zero
 Branch misprediction should flush the wrongly fetched instructions
EC792 30

You might also like