L9 PipelineHazards 2

Pipeline Hazards
Next . . .
❖ Control Hazards
❖ Delayed Branch
❖ Dynamic Branch Prediction
EC792 2
Control Hazards
❖ Jump and Branch can cause great performance loss
❖ Jump instruction needs only the jump target address
❖ Branch instruction needs two things:
 Branch Result Taken or Not Taken
 Branch Target Address
▪ PC + 4 If Branch is NOT taken
▪ PC + 4 + (4 × immediate) If Branch is Taken
❖ Jump and Branch targets are computed in the ID stage
 At which point a new instruction is already being fetched
 Jump Instruction: 1-cycle delay
 Branch: 2-cycle delay for branch result (taken or not taken is decided in EX stage)
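A minimal sketch of the branch target rule above (PC + 4 when not taken, PC + 4 + 4 × immediate when taken). The helper names `sign_extend_16` and `next_pc` are illustrative, and the sketch assumes a 16-bit signed immediate and 32-bit byte addresses.

```python
def sign_extend_16(imm16):
    """Interpret a 16-bit field as a signed two's-complement integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc(pc, imm16, taken):
    """Return the address of the instruction that follows a branch at `pc`."""
    fall_through = (pc + 4) & 0xFFFFFFFF
    if not taken:
        return fall_through
    # Taken: the immediate counts instructions (words), so scale it by 4.
    return (fall_through + 4 * sign_extend_16(imm16)) & 0xFFFFFFFF


print(hex(next_pc(0x1000, 3, taken=False)))      # fall-through address
print(hex(next_pc(0x1000, 3, taken=True)))       # forward branch
print(hex(next_pc(0x1000, 0xFFFE, taken=True)))  # backward branch (immediate = -2)
```

A negative immediate gives a backward branch, which is the usual end-of-loop case.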
2-Cycle Branch Delay
❖ Control logic detects a Branch instruction in the 2nd Stage
❖ ALU computes the Branch outcome in the 3rd Stage
❖ Next1 and Next2 instructions will be fetched anyway
❖ Convert Next1 and Next2 into bubbles if branch is taken
                         cc1  cc2  cc3  cc4     cc5     cc6     cc7
Beq R9,R10,L1            IF   Reg  ALU
Next1                         IF   Reg  Bubble  Bubble  Bubble
Next2                              IF   Bubble  Bubble  Bubble  Bubble
L1: target instruction                  IF      Reg     ALU     DM
(Branch target address is known at the end of cc3, so the target is fetched in cc4.)
Implementing Jump and Branch
[Figure: pipelined datapath with jump and branch support. Labeled elements include PC, NPC, NPC2, Instruction Memory, Register File (RA, RB, RW, BusA, BusB, BusW), the ALU, the PCSrc multiplexer selecting the Jump or Branch Target, and the J, Beq, Bne, and Bubble control signals.]
Branch Delay = 2 cycles
Branch target & outcome are computed in ALU stage
Predict Branch NOT Taken
❖ Branches can be predicted to be NOT taken
❖ If branch outcome is NOT taken then
 Next1 and Next2 instructions can be executed
 Do not convert Next1 & Next2 into bubbles
 No wasted cycles
❖ Else, convert Next1 and Next2 into bubbles – 2 wasted cycles
                  cc1  cc2  cc3  cc4  cc5  cc6  cc7
Beq R9,R10,L1     IF   Reg  ALU  (NOT taken)
Next1                  IF   Reg  ALU  DM   Reg
Next2                       IF   Reg  ALU  DM   Reg
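The timing charts can be checked with a small cycle count. This is a sketch, assuming a 5-stage pipeline, no data hazard stalls, predict-not-taken, and 2 bubbles per taken branch; `total_cycles` is an illustrative name.

```python
def total_cycles(outcomes, stages=5, taken_penalty=2):
    """Count cycles needed to run a sequence of executed instructions.

    `outcomes` has one entry per executed instruction:
    None for a non-branch, True for a taken branch, False for a not-taken branch.
    """
    # Each taken branch turns the wrongly fetched instructions into bubbles.
    bubbles = sum(taken_penalty for outcome in outcomes if outcome is True)
    # The first instruction needs `stages` cycles; each later one adds a cycle.
    return (stages - 1) + len(outcomes) + bubbles


# Beq (not taken), Next1, Next2: the last instruction finishes in cc7.
print(total_cycles([False, None, None]))
# Beq (taken), then the target instruction: the target finishes in cc8.
print(total_cycles([True, None]))
```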
Performance Impact
❖ Correct guess: no penalty
❖ Incorrect guess: 2 bubbles
❖ Assume
 no data hazard stalls
 20% control flow instructions
 70% of control flow instructions are taken
❖ New CPI under predict-not-taken:
 CPI = 1 + (misprediction rate × misprediction penalty) = 1 + (0.20 × 0.70) × 2 = 1.28
 IPC = 1 / [ 1 + 0.14 × 2 ] = 1 / 1.28 ≈ 0.78
❖ How to reduce the two penalty terms (misprediction rate and misprediction penalty)?
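The CPI arithmetic on this slide can be reproduced with a short sketch. The function name and parameters are illustrative; the numbers are the slide's assumptions.

```python
def cpi_with_branches(control_fraction, mispredict_fraction, penalty_cycles, base_cpi=1.0):
    """CPI = base CPI + (mispredictions per instruction) * (penalty per misprediction)."""
    mispredict_rate = control_fraction * mispredict_fraction
    return base_cpi + mispredict_rate * penalty_cycles


# 20% control flow instructions, 70% of them taken (mispredicted under
# predict-not-taken), 2 bubbles per misprediction.
cpi = cpi_with_branches(0.20, 0.70, 2)
ipc = 1 / cpi
print(f"CPI = {cpi:.2f}, IPC = {ipc:.2f}")  # CPI = 1.28, IPC = 0.78
```

Lowering either factor (the misprediction rate or the penalty) moves CPI back toward 1.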
Reducing the Delay of Branches
❖ Branch delay can be reduced from 2 cycles to just 1 cycle
❖ Branches can be determined earlier in the Decode stage
 A comparator is used in the Decode stage to determine the branch decision (taken or not taken)
 Because operands may have to be forwarded before they are compared, the delay of the second stage increases, which can also lengthen the clock cycle
❖ Only one instruction that follows the branch is fetched
❖ If the branch is taken then only one instruction is flushed
❖ We should insert a bubble after jump or taken branch
 This will convert the next instruction into a NOP
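A rough sketch of the trade-off between a shorter branch delay and a longer clock cycle. The 10% cycle stretch is a hypothetical figure chosen for illustration, and the 20% / 70% branch statistics are reused from the Performance Impact slide.

```python
def time_per_instruction(cpi, cycle_time):
    """Average time per instruction = CPI * clock cycle time."""
    return cpi * cycle_time


mispredict_rate = 0.20 * 0.70          # taken control-flow instructions per instruction

cpi_branch_in_ex = 1 + mispredict_rate * 2   # outcome known in EX: 2-cycle penalty
cpi_branch_in_id = 1 + mispredict_rate * 1   # outcome known in ID: 1-cycle penalty

# Hypothetical: the comparator plus forwarding stretches the cycle by 10%.
t_ex = time_per_instruction(cpi_branch_in_ex, cycle_time=1.00)
t_id = time_per_instruction(cpi_branch_in_id, cycle_time=1.10)

# Largest cycle stretch at which the 1-cycle design still breaks even.
break_even_stretch = cpi_branch_in_ex / cpi_branch_in_id

print(f"branch in EX: {t_ex:.3f}  branch in ID: {t_id:.3f}")
print(f"break-even cycle stretch: {break_even_stretch:.3f}")
```

Under these assumptions the early branch decision pays off only while the cycle grows by less than about 12%.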
Reducing Branch Delay to 1 Cycle
[Figure: datapath with the branch decision moved into the Decode stage. Register data is forwarded and then compared (= comparator, Zero output) to drive PCSrc for J, Beq, and Bne. The Reset signal converts the next instruction after a jump or taken branch into a bubble. The forward-then-compare path in Decode results in a longer cycle.]
Branch Behavior in Programs
❖ Based on SPEC benchmarks on MIPS
 Branches occur with a frequency of 14% to 16% in integer programs and 3% to 12% in floating point programs.
 About 75% of the branches are forward branches
 60% of forward branches are taken
 85% of backward branches are taken
 67% of all branches are taken
❖ Why are branches (especially backward branches) more likely to be taken than not taken?
 Forward Branches (if-then-else)
 Backward Branches (end-of-loop)
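The overall taken rate can be cross-checked from the forward and backward figures. This is a sketch using the slide's approximate percentages; it gives roughly 66%, which is consistent with the quoted 67% given that the inputs are rounded.

```python
forward_fraction = 0.75            # about 75% of branches are forward
forward_taken = 0.60               # 60% of forward branches are taken
backward_taken = 0.85              # 85% of backward branches are taken

# Weighted average over forward and backward branches.
overall_taken = (forward_fraction * forward_taken
                 + (1 - forward_fraction) * backward_taken)

print(f"overall taken rate = {overall_taken:.4f}")
```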
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
 Execute successor instructions in sequence
 “Flush” instructions in pipeline if branch actually taken
 Advantage of late pipeline state update
 33% MIPS branches not taken on average
 PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
 67% MIPS branches taken on average
#4: Define branch to take place AFTER n following instructions (Delayed branch)
Next . . .
❖ Pipelining versus Serial Execution
❖ Pipelined Datapath and Control
❖ Pipeline Hazards
❖ Data Hazards and Forwarding
❖ Load Delay, Hazard Detection, and Stall
❖ Control Hazards
❖ Delayed Branch and Dynamic Branch Prediction
Branch Hazard Alternatives
❖ Predict Branch Not Taken
 Successor instruction is already fetched
 Do NOT Flush instruction after branch if branch is NOT taken
 Flush only instructions appearing after Jump or taken branch
❖ Delayed Branch
 Define branch to take place AFTER the next instruction
 Compiler/assembler fills the branch delay slot (for 1 delay cycle)
❖ Dynamic Branch Prediction
 Loop branches are taken most of the time
 Must reduce branch delay to 0, but how?
 How to predict branch behavior at runtime?
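One widely used textbook answer to the runtime prediction question is a 2-bit saturating counter per branch. The sketch below is not taken from these slides: the class name, the initial state, and the loop pattern are all assumptions chosen for illustration.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=2):   # assumed start state: weakly taken
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(predictor, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# Hypothetical loop branch: taken 9 times, then falls through; the loop runs twice.
outcomes = ([True] * 9 + [False]) * 2
misses = count_mispredictions(TwoBitPredictor(), outcomes)
static_not_taken_misses = sum(outcomes)   # predict-not-taken misses every taken branch

print(misses, static_not_taken_misses)
```

On this pattern the counter mispredicts only the loop exits, while static predict-not-taken mispredicts every taken iteration.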
Delayed Branch
❖ Define branch to take place after the next instruction
❖ Instruction in branch delay slot is always executed
❖ Compiler (tries to) move a useful instruction into delay slot.
From before the Branch: Always helpful when possible
❖ For a 1-cycle branch delay, we have one delay slot
    branch instruction
    branch delay slot (next instruction)
    branch target (if branch taken)
❖ Compiler fills the branch delay slot
 By selecting an independent instruction
 From before the branch
❖ If no independent instruction is found
 Compiler fills delay slot with a NO-OP

Before scheduling:            After scheduling:
label:                        label:
    . . .                         . . .
    add R10,R11,R12               beq R17,R16,label
    beq R17,R16,label             add R10,R11,R12
    Delay Slot
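A simplified sketch of filling the delay slot from before the branch. It only considers the instruction immediately preceding the branch and treats it as independent when the branch does not read the register it writes; a real compiler pass checks more than this. The `Instr` type and `fill_delay_slot` name are illustrative.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]        # register written, or None
    srcs: Tuple[str, ...]      # registers read


NOP = Instr("nop", None, ())


def fill_delay_slot(block):
    """Take a list of Instr ending in a branch; return it with the delay slot filled."""
    *body, branch = block
    if body and body[-1].dest not in branch.srcs:
        # Independent: move the instruction from before the branch into the slot.
        return body[:-1] + [branch, body[-1]]
    # No independent instruction found: fall back to a NO-OP.
    return body + [branch, NOP]


add = Instr("add", "R10", ("R11", "R12"))
beq = Instr("beq", None, ("R17", "R16"))
print([i.op for i in fill_delay_slot([add, beq])])        # add moves into the slot

dependent = Instr("add", "R17", ("R11", "R12"))           # writes a branch operand
print([i.op for i in fill_delay_slot([dependent, beq])])  # slot gets a nop
```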
Delayed Branch
From the Target: Helps when branch is taken. May duplicate instructions
Before scheduling:        After scheduling:
ADD R2, R1, R3            ADD R2, R1, R3
BEQZ R2, L1               BEQZ R2, L2
DELAY SLOT                SUB R4, R5, R6
-                         -
L1: SUB R4, R5, R6        L1: SUB R4, R5, R6
L2:                       L2:
Instructions between BEQZ and SUB (in the fall-through path) must not use R4.
Why is the instruction at L1 duplicated? What if R5 or R6 changed?
(Because L1 can be reached by another path.)
Delayed Branch
From Fall Through: Helps when branch is not taken (default case)
Before scheduling:        After scheduling:
ADD R2, R1, R3            ADD R2, R1, R3
BEQZ R2, L1               BEQZ R2, L1
DELAY SLOT                SUB R4, R5, R6
SUB R4, R5, R6            -
-
L1:                       L1:
Instructions at the target (L1 and after) must not use R4 until it is written again.
Figure C.14 Scheduling the branch delay slot. The top box in each pair shows the code before scheduling; the bottom box shows
the scheduled code. In (a), the delay slot is scheduled with an independent instruction from before the branch. This is the best
choice. Strategies (b) and (c) are used when (a) is not possible. In the code sequences for (b) and (c), the use of R1 in the branch
condition prevents the DADD instruction (whose destination is R1) from being moved after the branch. In (b), the branch delay slot
is scheduled from the target of the branch; usually the target instruction will need to be copied because it can be reached by another
path. Strategy (b) is preferred when the branch is taken with high probability, such as a loop branch. Finally, the branch may be
scheduled from the not-taken fall-through as in (c). To make this optimization legal for (b) or (c), it must be OK to execute the
moved instruction when the branch goes in the unexpected direction. By OK we mean that the work is wasted, but the program will
still execute correctly. This is the case, for example, in (c) if R7 were an unused temporary register when the branch goes in the
unexpected direction.
Filling delay slots
❖ Compiler effectiveness for single branch delay slot:
 Fills about 60% of branch delay slots
 About 80% of instructions executed in branch delay slots are useful in computation
 About 50% (i.e., 60% × 80%) of slots usefully filled
❖ Canceling branches or nullifying branches
 Include a prediction of whether the branch is taken or not taken
 If the prediction is correct, the instruction in the delay slot is executed
 If the prediction is incorrect, the instruction in the delay slot is quashed
 Allow more slots to be filled from the target address or fall through
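A sketch of what the fill statistics mean for performance. The 60% and 80% figures are from the slide; the 20% branch frequency is an assumption borrowed from the Performance Impact slide, and the function names are illustrative.

```python
def useful_slot_fraction(fill_rate, useful_rate):
    """Fraction of delay slots that do useful work."""
    return fill_rate * useful_rate


def cpi_delayed_branch(branch_fraction, fill_rate, useful_rate, slots=1):
    """Each delay slot that is not usefully filled costs one wasted cycle."""
    wasted = 1 - useful_slot_fraction(fill_rate, useful_rate)
    return 1 + branch_fraction * wasted * slots


useful = useful_slot_fraction(0.60, 0.80)        # about half of the slots
cpi = cpi_delayed_branch(0.20, 0.60, 0.80)       # assumed 20% branch frequency

print(f"usefully filled: {useful:.2f}, CPI = {cpi:.3f}")
```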
Branch Behavior
Figure C.17 Misprediction rate on SPEC92 for a profile-based predictor varies widely but is
generally better for the floating-point programs, which have an average misprediction rate of
9% with a standard deviation of 4%, than for the integer programs, which have an average
misprediction rate of 15% with a standard deviation of 5%. The actual performance depends on
both the prediction accuracy and the branch frequency, which vary from 3% to 24%.
Drawback of Delayed Branching
❖ New meaning for branch instruction
 Branching takes place after next instruction (Not immediately!)
❖ Impacts software and compiler
 Compiler is responsible for filling the branch delay slot
 For a 1-cycle branch delay ➔ One branch delay slot
❖ However, modern processors are deeply pipelined
 Branch penalty is multiple cycles in deeper pipelines
 Multiple delay slots are difficult to fill with useful instructions
❖ MIPS used delayed branching in earlier pipelines
 However, delayed branching is not useful in recent processors
In case you needed motivation
❖ Some processors have deep pipelines
[The Microarchitecture of the Pentium 4 Processor, Intel Technology Journal, 2001]