2 Pipelining 2

CSC15107: Computer
Architecture
Lecture – 15, 16: Pipelined Processor – Data and Control
Hazards
07, 08 September 2020
Data Hazards
These occur when at any time, there are instructions
active that need to access the same data (memory or
register) locations.
Where there’s real trouble is when we have:
instruction A
instruction B
and B manipulates (reads or writes) data before A

does. This violates the order of the instructions, since
the architecture implies that A completes entirely before
B is executed.
Data Hazards
Read After Write (RAW)
Execution Order is:
InstrJ tries to read operand before InstrI writes it
InstrI
InstrJ
I: add r1,r2,r3
J: sub r4,r1,r3
• Caused by a “Dependence” (in compiler nomenclature). This hazard
results from an actual need for communication.
Data Hazards
Write After Read (WAR)
Execution Order is:
InstrJ tries to write operand before InstrI reads i
InstrI
– Gets wrong operand
InstrJ
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
– Called an “anti-dependence” by compiler writers.

This results from reuse of the name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:

– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
Data Hazards
Write After Write (WAW)
Execution Order is:
InstrJ tries to write operand before InstrI writes it
InstrI
– Leaves wrong result ( InstrI not InstrJ )
InstrJ
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “output dependence” by compiler writers

This also results from the reuse of name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:

– All instructions take 5 stages, and
– Writes are always in stage 5
• Will see WAR and WAW in later more complicated pipes

Data Hazards
Simple Solution to RAW
• Hardware detects RAW and stalls

• Assumes register written then read each cycle
+ low cost to implement, simple
-- reduces IPC
• Try to minimize stalls
Minimizing RAW stalls
• Bypass/forward/shortcircuit
• Use data before it is in the register
+ reduces/avoids stalls
-- complex
• Crucial for common RAW hazards
Data Hazards
Time (clock cycles)
IF ID/R F EX MEM WB
ALU
I add r1,r2,r3 Ifetch Reg DMem Reg
n
s
ALU
t sub r4,r1,r3 Ifetch Reg DMem Reg
r.
ALU
and r6,r1,r7 Ifetch Reg DMem Reg
O
r
ALU
d or r8,r1,r9 Ifetch Reg DMem Reg
e
r
ALU
xor r10,r1,r11 Ifetch Reg DMem Reg
The use of the result of the ADD instruction in the next three instructions causes a hazard,
since the register is not written until after those instructions read it.
Forwarding is the concept of making data
available to the input of the ALU for subsequent
Data Hazards instructions, even though the generating
instruction hasn’t gotten to WB in order to write
Forwarding To Avoid the memory or registers.
Data Hazard
Time (clock cycles)
I
add r1,r2,r3
ALU
n Ifetch Reg DMem Reg
s
t
ALU
r. sub r4,r1,r3 Ifetch Reg DMem Reg
ALU
r Reg
and r6,r1,r7 Ifetch Reg DMem
d
e
r
ALU
Reg
or r8,r1,r9 Ifetch Reg DMem
ALU
Reg
xor r10,r1,r11 Ifetch Reg DMem
The data isn’t loaded until after
Data Hazards the MEM stage.
Time (clock cycles)
lw r1, 0(r2)
ALU
I Ifetch Reg DMem Reg
n
s
t
ALU
sub r4,r1,r6 Ifetch Reg DMem Reg
r.
ALU
r and r6,r1,r7 Ifetch Reg DMem Reg
d
e
ALU
r or r8,r1,r9 Ifetch Reg DMem Reg
There are some instances where hazards occur, even with forwarding.
The stall is necessary as shown
Data Hazards here.
Time (clock cycles)
I
n
lw r1, 0(r2)
ALU
Ifetch Reg DMem Reg
s
t
r.
ALU
O sub r4,r1,r6 Ifetch Reg Bubble DMem Reg
r
d
ALU
e and r6,r1,r7 Ifetch Bubble Reg DMem Reg
ALU
Bubble Ifetch Reg DMem
or r8,r1,r9
There are some instances where hazards occur, even with forwarding.
Pipeline interlock – detects hazards and stalls.
This is another
representation of
Data Hazards the stall.
LW R1, 0(R2) IF ID EX MEM WB
SUB R4, R1, R5 IF ID EX MEM WB
AND R6, R1, R7 IF ID EX MEM WB
OR R8, R1, R9 IF ID EX MEM WB
LW R1, 0(R2) IF ID EX MEM WB
SUB R4, R1, R5 IF ID stall EX MEM WB
AND R6, R1, R7 IF stall ID EX MEM WB
OR R8, R1, R9 stall IF ID EX MEM WB

Data Hazards
Pipeline Scheduling
Instruction scheduled by compiler - move instruction in order to reduce stall.
lw Rb, b
lw Rc, c
Add Ra, Rb, Rc
sw a, Ra
lw Re, e
lw Rf, f
sub Rd, Re, Rf
sw d, Rd
Data Hazards
Pipeline Scheduling
Instruction scheduled by compiler - move instruction in order to reduce stall.
lw Rb, b code sequence for a = b+c before scheduling

lw Rc, c
Add Ra, Rb, Rc stall
sw a, Ra
lw Re, e code sequence for d = e+f before scheduling
lw Rf, f
sub Rd, Re, Rf stall
sw d, Rd
Arrangement of code after scheduling.
lw Rb, b
lw Rc, c
lw Re, e
Add Ra, Rb, Rc
lw Rf, f
sw a, Ra
sub Rd, Re, Rf
sw d, Rd
Pipeline Scheduling
Data Hazards
scheduled unscheduled
54%
gcc 31%
spice 42%
14%
tex 65%
25%
0% 20% 40% 60% 80%

% loads stalling pipeline
Control Hazards
A control hazard is when we need to find the destination of a branch, and

can’t fetch any new instructions until we know that destination.
Control Hazards Control Hazard on Branches
Three Stage Stall
10: beq r1,r3,36
ALU
Ifetch Reg DMem Reg
The PC is
not written
until the
end of 4th
ALU
14: and r2,r3,r5 Ifetch Reg DMem Reg
clock
cycle!
ALU
18: or r6,r1,r7 Ifetch Reg DMem Reg
ALU
22: add r8,r1,r9 Ifetch Reg DMem Reg
ALU
36: xor r10,r1,r11 Ifetch Reg DMem Reg
Control Hazards Branch Stall Impact
• If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!
(Whoa! How did we get that 1.9???)
• Two part solution to this dramatic increase:
– Determine branch taken or not sooner, AND
– Compute taken branch address earlier
• MIPS branch tests if register = 0 or ^ 0
• MIPS Solution:
– Move Zero test to ID/IF stage
– Adder to calculate new PC in ID/IF stage
●
must be fast
●
can't afford to subtract
●
compares with 0 are simple
●
Greater-than, Less-than test sign-bit, but not-equal must OR all bits
●
more general compares need ALU
– 1 clock cycle penalty for branch versus 3
1 clock cycle penalty
1 clock cycle penalty
Control Hazards Four Branch Hazard
Alternatives
#1: Stall until branch direction is clear: freeze or flush
#2: Predict Branch Not Taken

– Execute successor instructions in sequence
– “back out” instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– 47% MIPS branches not taken on average
– PC+4 already calculated, so use it to get next instruction
Predict Branch Not Taken
Alternatives
#3: Predict Branch Taken

– 53% MIPS branches taken on average
– But haven’t calculated branch target address in MIPS
●
MIPS still incurs 1 cycle branch penalty
●
Other machines: branch target known before outcome
Alternatives
#4: Delayed Branch
– Define branch to take place AFTER a following instruction
branch instruction
sequential successor1
sequential successor2 Branch delay of length n
........
sequential successorn
branch target if taken
– 1 slot delay allows proper decision and branch target address in 5 stage
pipeline
– MIPS uses this – branch delay slot
Delayed Branch
Control Hazards Delayed Branch
• Where to get instructions to fill branch delay slot?

– Before branch instruction
– From the target address: only valuable when branch taken
– From fall through: only valuable when branch not taken
– Cancelling branches allow more slots to be filled
• Compiler effectiveness for single branch delay slot:

– Fills about 60% of branch delay slots
– About 80% of instructions executed in branch delay slots useful in computation
– About 50% (60% x 80%) of slots usefully filled
• Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per
clock (superscalar)
Delay slot example
●
From before: ●
BEQZ $r2, label
ADD $r1, $r2, $r3 ADD $r1, $r2, $r3
BEQ $r2,$0, label
Delay slot
●
From target
●
From fall-through
Control Hazards Evaluating Branch
Alternatives
Pipeline speedup = Pipeline depth

1 +Branch frequencyBranch penalty
Scheduling Branch CPI speedup v. Speedup v.

scheme penalty unpipelined stall
Stall pipeline 3 ? ? ?
Predict taken 1 ? ? ?
Predict not taken 1 ? ? ?
Delayed branch 0.5 ? ? ?
Conditional & Unconditional = 14%, 65% change PC

Control Hazards Evaluating Branch
Alternatives
Pipeline speedup = Pipeline depth

1 +Branch frequencyBranch penalty
Scheduling Branch CPI speedup v. Speedup v.

scheme penalty unpipelined stall
Stall pipeline 3 1.42 3.5 1.0
Predict taken 1 1.14 4.4 1.26
Predict not taken 1 1.09 4.5 1.29
Delayed branch 0.5 1.07 4.6 1.31
Conditional & Unconditional = 14%, 65% change PC

Control Hazards: Branch Prediction
• Deeper pipelines : Previous techniques insufficient
• Branch Prediction:
– Static low-cost Branch Prediction: rely on information available at compile
time
– Dynamic Branch Prediction: based on program behavior.
Control Hazards
Compiler “Static” Prediction of

Taken/Untaken Branches
-depends both on the accuracy of the scheme
and the frequency of conditional branches
Control Hazards
●
Dynamic Branch Prediction: branch-prediction
buffer or branch history table (1-bit scheme)
– Small memory indexed by the lower portion of the
address of the branch instruction.
– A bit to indicate taken or not taken
– If the hint turns out to be wrong, the prediction bit is
inverted and stored back.
●
Even if a branch is almost always taken, we will likely predict
incorrectly twice, rather than once, when it is not taken
Control Hazards
●
Dynamic Branch Prediction: branch-prediction
buffer or branch history table (2-bit scheme)
Studies of n-bit predictors have shown that the 2-bit predictors do almost as well, thus most
systems rely on 2-bit branch predictors rather than the more general n-bit predictors.
Control Hazards
●
Dynamic Branch Prediction: branch-prediction buffer or
branch history table (2-bit scheme)
– A branch-prediction buffer can be implemented as a small, special
“cache” accessed with the instruction address during the IF pipe
stage, or
– as a pair of bits attached to each block in the instruction cache and
fetched with the instruction.
– If the instruction is decoded as a branch and if the branch is
predicted as taken, fetching begins from the target as soon as the
PC is known.
– Otherwise, sequential fetching and executing continue.
Control Hazards
4096 entry – 2-bit predictor
Improve accuracy of predictor - increasing the size of the buffer and by

increasing the accuracy of the scheme we use for each prediction

2 Pipelining 2

Uploaded by

Copyright:

Available Formats

You might also like

2 Pipelining 2

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

2 Pipelining 2

Uploaded by

Copyright:

Available Formats

CSC15107: Computer

Where there’s real trouble is when we have:

and B manipulates (reads or writes) data before A

– Called an “anti-dependence” by compiler writers.

• Can’t happen in MIPS 5 stage pipeline because:

• Called an “output dependence” by compiler writers

• Can’t happen in MIPS 5 stage pipeline because:

• Will see WAR and WAW in later more complicated pipes

• Hardware detects RAW and stalls

Minimizing RAW stalls

LW R1, 0(R2) IF ID EX MEM WB

SUB R4, R1, R5 IF ID EX MEM WB

AND R6, R1, R7 IF ID EX MEM WB

OR R8, R1, R9 IF ID EX MEM WB

LW R1, 0(R2) IF ID EX MEM WB

SUB R4, R1, R5 IF ID stall EX MEM WB

AND R6, R1, R7 IF stall ID EX MEM WB

OR R8, R1, R9 stall IF ID EX MEM WB

Instruction scheduled by compiler - move instruction in order to reduce stall.

Instruction scheduled by compiler - move instruction in order to reduce stall.

lw Rb, b ­code sequence for a = b+c before scheduling

Arrangement of code after scheduling.

0% 20% 40% 60% 80%

A control hazard is when we need to find the destination of a branch, and

10: beq r1,r3,36

• MIPS branch tests if register = 0 or ^ 0

#1: Stall until branch direction is clear: freeze or flush

#2: Predict Branch Not Taken

#3: Predict Branch Taken

• Where to get instructions to fill branch delay slot?

• Compiler effectiveness for single branch delay slot:

Pipeline speedup = Pipeline depth

Scheduling Branch CPI speedup v. Speedup v.

Conditional & Unconditional = 14%, 65% change PC

Pipeline speedup = Pipeline depth

Scheduling Branch CPI speedup v. Speedup v.

Conditional & Unconditional = 14%, 65% change PC

• Deeper pipelines : Previous techniques insufficient

Compiler “Static” Prediction of

4096 entry – 2-bit predictor

Improve accuracy of predictor - increasing the size of the buffer and by

You might also like

lw Rb, b code sequence for a = b+c before scheduling