Professional Documents
Culture Documents
L9 PipelineHazards 2
L9 PipelineHazards 2
Next . . .
❖ Control Hazards
❖ Delayed Branch
EC792 2
Control Hazards
❖ Jump and Branch can cause great performance loss
❖ Jump instruction needs only the jump target address
❖ Branch instruction needs two things:
Branch Result Taken or Not Taken
Branch Target Address
▪ PC + 4 If Branch is NOT taken
▪ PC + 4 + (4 × immediate) If Branch is Taken
Branch
L1: target instruction Target IF Reg ALU DM
Addr
EC792 4
Implementing Jump and Branch
NPC2
Jump or Branch Target
J
Next
Beq
PC
NPC
Bne
Im26
+1 Imm26
Imm16 zero
PCSrc 0
Instruction
Instruction
32
ALUout
1
Register File
BusA A
A
Memory Rs 5
RA 2
Instruction 3
L
Rt 5 U
0 RB BusB E 1
Address 0
PC
Rd 0
1 RW 1
D
Op
BusW 2 32
3
32
Rd3
0
Rd2
1
clk
func
Reg
Dst
Branch Delay = 2 cycles J, Beq, Bne
EX
Control
MEM
Bubble = 0 1
EC792 5
Predict Branch NOT Taken
❖ Branches can be predicted to be NOT taken
❖ If branch outcome is NOT taken then
Next1 and Next2 instructions can be executed
Do not convert Next1 & Next2 into bubbles
No wasted cycles
❖ Else, convert Next1 and Next2 into bubbles – 2 wasted cycles
EC792 6
Performance Impact
❖ Correct guess no penalty
❖ Incorrect guess 2 bubbles
❖ Assume
no data hazard stalls
20% control flow instructions
70% of control flow instructions are taken
New CPI is,
CPI = 1 + (0.20*0.7) * 2 = 1.28
IPC = 1 / [ 1 + 0.14 * 2 ] = 1 / 1.28 = 0.78
misprediction penalty
misprediction How to reduce the two penalty terms?
rate
EC792 7
Reducing the Delay of Branches
❖ Branch delay can be reduced from 2 cycles to just 1 cycle
Reset
Bne then compared
Im16
Imm16
PCSrc 0
Instruction
Instruction
32
ALUout
1
Register File
BusA A
A
Memory Rs 5
RA 2
Instruction 3
L
Rt 5 U
0 RB BusB E 1
Address 0
PC
Rd 0
1 RW 1
D
Op
BusW 2 32
3
32
Rd3
0
Rd2
1
clk
func
Reg
Reset signal converts Dst
J, Beq, Bne
next instruction after ALUCtrl
Control Signals
jump or taken branch Main & ALU 0
EX
Control
MEM
into a bubble Bubble = 0 1
EC792 9
Branch Behavior in Programs
❖ Based on SPEC benchmarks on MIPS
Branches occur with a frequency of 14% to 16% in integer
programs and 3% to 12% in floating point programs.
EC792 10
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
Execute successor instructions in sequence
“Flush” instructions in pipeline if branch actually taken
Advantage of late pipeline state update
33% MIPS branches not taken on average
PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
67% MIPS branches taken on average
#4: Define branch to take place AFTER n following instructions
(Delayed branch)
EC792 11
Next . . .
❖ Pipelining versus Serial Execution
❖ Pipeline Hazards
❖ Control Hazards
EC792 12
Branch Hazard Alternatives
❖ Predict Branch Not Taken
Successor instruction is already fetched
Do NOT Flush instruction after branch if branch is NOT taken
Flush only instructions appearing after Jump or taken branch
❖ Delayed Branch
Define branch to take place AFTER the next instruction
Compiler/assembler fills the branch delay slot (for 1 delay cycle)
EC792 13
Delayed Branch
❖ Define branch to take place after the next instruction
❖ Instruction in branch delay slot is always executed
❖ Compiler (tries to) move a useful instruction into delay slot.
From before the Branch: Always helpful when possible
label:
❖ For a 1-cycle branch delay, we have one delay slot
branch instruction . . .
EC792 14
Delayed Branch
From the Target: Helps when branch is taken. May duplicate instructions
Instructions between BEQ and SUB (in fall through) must not use R4.
Why is instruction at L1 duplicated? What if R5 or R6 changed?
(Because, L1 can be reached by another path !!)
EC792 15
Delayed Branch
From Fall Through: Helps when branch is not taken
(default case)
Instructions at target (L1 and after) must not use R4 till set (written)
again.
EC792 16
Figure C.14 Scheduling the branch delay slot. The top box in each pair shows the code before scheduling; the bottom box shows
the scheduled code. In (a), the delay slot is scheduled with an independent instruction from before the branch. This is the best
choice. Strategies (b) and (c) are used when (a) is not possible. In the code sequences for (b) and (c), the use of R1 in the branch
condition prevents the DADD instruction (whose destination is R1) from being moved after the branch. In (b), the branch delay slot
is scheduled from the target of the branch; usually the target instruction will need to be copied because it can be reached by another
path. Strategy (b) is preferred when the branch is taken with high probability, such as a loop branch. Finally, the branch may be
scheduled from the not-taken fall-through as in (c). To make this optimization legal for (b) or (c), it must be OK to execute the
moved instruction when the branch goes in the unexpected direction. By OK we mean that the work is wasted, but the program will
still execute correctly. This is the case, for example, in (c) if R7 were an unused temporary register when the branch goes in the
unexpected direction.
EC792
17
Filling delay slots
❖ Compiler effectiveness for single branch delay slot:
Fills about 60% of branch delay slots
About 80% of instructions executed in branch delay slots
are useful in computation
About 50% (i.e., 60% x 80%) of slots usefully filled
❖ Canceling branches or nullifying branches
Include a prediction of the branch is taken or not taken
If the prediction is correct, the instruction in the delay slot is
executed
If the prediction is incorrect, the instruction in the delay slot
is quashed.
Allow more slots to be filled from the target address or fall
through
EC792 18
Branch Behavior
Figure C.17 Misprediction rate on SPEC92 for a profile-based predictor varies widely but is
generally better for the floating-point programs, which have an average misprediction rate of
9% with a standard deviation of 4%, than for the integer programs, which have an average
misprediction rate of 15% with a standard deviation of 5%. The actual performance depends on
both the prediction accuracy and the branch frequency, which vary from 3% to 24%.
EC792 19
Drawback of Delayed Branching
❖ New meaning for branch instruction
Branching takes place after next instruction (Not immediately!)
EC792 20
In case you needed motivation
❖Some processors have deep pipelines
EC792 21