
Pipeline Hazards


Structural Hazards: resource conflict
 Example: same cache/memory for instruction
and data

Data Hazards: same data item being
accessed/written in nearby instructions
 Example:

ADD R1, R2, R3

SUB R4, R1, R5

Control Hazards: branch instructions
Structural Hazards

Usually happen when a unit is not fully
pipelined
 That unit cannot churn out one instruction per
cycle

Or, when a resource has not been duplicated
enough
 Example: same I-cache and D-cache
 Example: single write-port for register-file

Usual solution: stall
 Also called pipeline bubble, or simply bubble
Stalling the Pipeline
      CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9   CC10
Load  IF    ID    EX    MEM   WB
I+1         IF    ID    EX    MEM   WB
I+2               IF    ID    EX    MEM   WB
I+3                     STALL IF    ID    EX    MEM   WB
I+4                                 IF    ID    EX    MEM   WB


What is the slowdown due to stalls caused by
such load instructions?
CPI without stalls = 1
CPI with stalls = 1 + F_load
Slowdown = 1 + F_load
(F_load: fraction of instructions that are loads)
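
As a quick check of the slowdown expression, the sketch below evaluates it for one load fraction. The function name and the 25% load fraction are illustrative assumptions, not values from the slides; it assumes an ideal CPI of 1 and one stall cycle per load.

```python
def structural_slowdown(load_fraction, stalls_per_load=1, base_cpi=1.0):
    """Slowdown when every load stalls the pipeline for a fixed number of cycles."""
    cpi_with_stalls = base_cpi + load_fraction * stalls_per_load
    return cpi_with_stalls / base_cpi


# Assumed workload: 25% of instructions are loads.
print(structural_slowdown(0.25))
```

With a quarter of the instructions being loads, the pipeline runs 25% slower than the stall-free case.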
Why Allow Structural Hazards?

Lower Cost:
 Less hardware ==> lower cost

Shorter latency of unpipelined unit
 May have other performance benefits
 Data hazards may introduce stalls anyway!

Suppose the FP unit is unpipelined, and the
other instructions have a 5-stage pipeline.
What percentage of instructions can be FP,
so that the CPI does not increase?
20% can be FP, assuming no clustering of FP instructions
Even if clustered, data hazards may introduce stalls anyway
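
The 20% figure can be checked with a small issue model. This is a sketch under stated assumptions: the unpipelined FP unit is taken to be busy for 5 cycles per FP instruction (matching the 5-stage pipeline), one instruction issues per cycle otherwise, and data hazards are ignored. The function name and the two instruction streams are hypothetical.

```python
def fp_stall_cycles(stream, fp_busy_cycles=5):
    """Count stall cycles caused by a single unpipelined FP unit.

    stream is a list of booleans: True for an FP instruction.
    """
    cycle = 0        # cycle in which the next instruction may issue
    fp_free_at = 0   # first cycle in which the FP unit is free again
    stalls = 0
    for is_fp in stream:
        if is_fp and cycle < fp_free_at:
            stalls += fp_free_at - cycle   # wait for the FP unit
            cycle = fp_free_at
        if is_fp:
            fp_free_at = cycle + fp_busy_cycles
        cycle += 1
    return stalls


spread = [i % 5 == 0 for i in range(100)]   # 20% FP, evenly spaced
clustered = [i < 20 for i in range(100)]    # 20% FP, all back to back

print(fp_stall_cycles(spread), fp_stall_cycles(clustered))
```

Evenly spaced FP instructions never find the unit busy, while the clustered stream stalls on every FP instruction after the first.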
Data Hazards

Example:

ADD R1, R2, R3

SUB R4, R1, R5

AND R6, R1, R7

OR R8, R1, R9

XOR R10, R1, R11

All instructions after ADD depend on R1

Stalling is a possibility
 Can we do better?
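
To make the dependences in this example explicit, the following sketch lists every read-after-write pair in a straight-line instruction sequence. The tuple encoding of instructions and the function name are assumptions made for illustration.

```python
def raw_dependences(program):
    """Return (producer_index, consumer_index, register) for every RAW pair.

    Each instruction is a tuple (opcode, dest, src1, src2).
    """
    deps = []
    last_writer = {}   # register -> index of the latest instruction writing it
    for i, (_, dest, *srcs) in enumerate(program):
        for reg in srcs:
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        last_writer[dest] = i
    return deps


program = [
    ("ADD", "R1", "R2", "R3"),
    ("SUB", "R4", "R1", "R5"),
    ("AND", "R6", "R1", "R7"),
    ("OR", "R8", "R1", "R9"),
    ("XOR", "R10", "R1", "R11"),
]
deps = raw_dependences(program)
for producer, consumer, reg in deps:
    print(program[consumer][0], "reads", reg, "written by", program[producer][0])
```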
Register File: Reads after Writes

ADD R1, R2, R3    IM   Reg  ALU  DM   Reg
SUB R4, R1, R5         IM   Reg  ALU  DM   Reg
AND R6, R1, R7              IM   Reg  ALU  DM   Reg
OR R8, R1, R9                    IM   Reg  ALU  DM
Minimizing Stalls via Forwarding

ADD R1, R2, R3    IM   Reg  ALU  DM   Reg
SUB R4, R1, R5         IM   Reg  ALU  DM   Reg
AND R6, R1, R7              IM   Reg  ALU  DM   Reg
OR R8, R1, R9                    IM   Reg  ALU  DM
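
The benefit of forwarding for this sequence can be estimated with a small in-order timing model. It is a sketch, not the slides' own method, and it assumes that all instructions are ALU operations, that the register file is written in the first half of a cycle and read in the second half, and that forwarding delivers an ALU result to the very next instruction. The function name and the instruction encoding are hypothetical.

```python
def count_stalls(program, forwarding):
    """Count stall cycles for ALU-only code in a 5-stage pipeline.

    program is a list of (dest, sources). Without forwarding, a consumer
    may be in ID no earlier than the producer's WB cycle (3 cycles after
    the producer's ID). With forwarding, 1 cycle after is enough.
    """
    gap = 1 if forwarding else 3
    id_cycle_of_writer = {}   # register -> ID cycle of its latest producer
    next_id = 0               # earliest ID cycle for the next instruction
    stalls = 0
    for dest, sources in program:
        earliest = next_id
        for reg in sources:
            if reg in id_cycle_of_writer:
                earliest = max(earliest, id_cycle_of_writer[reg] + gap)
        stalls += earliest - next_id
        id_cycle_of_writer[dest] = earliest
        next_id = earliest + 1
    return stalls


sequence = [
    ("R1", ("R2", "R3")),   # ADD R1, R2, R3
    ("R4", ("R1", "R5")),   # SUB R4, R1, R5
    ("R6", ("R1", "R7")),   # AND R6, R1, R7
    ("R8", ("R1", "R9")),   # OR  R8, R1, R9
]
print(count_stalls(sequence, forwarding=False), count_stalls(sequence, forwarding=True))
```

Under these assumptions the SUB waits two cycles without forwarding, which also delays the AND and OR enough that they need no further stalls; with forwarding no stall is needed at all.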
Data Forwarding for Stores

ADD R1, R2, R3    IM   Reg  ALU  DM   Reg
LW R4, 0(R1)           IM   Reg  ALU  DM   Reg
SW 12(R1), R4               IM   Reg  ALU  DM   Reg

Note: no data hazards on memory locations in DLX, since
memory references are always in order
Data Hazard Classification

Read after Write (RAW): use data
forwarding to overcome

Write after Write (WAW): arises only when
writes can happen in different pipeline stages

                CC1   CC2   CC3   CC4   CC5   CC6
LW R1, 0(R2)    IF    ID    EX    MEM1  MEM2  WB
ADD R1, R2, R3        IF    ID    EX    WB




 Has other problems as well: structural hazards



Write after Read (WAR): rare
                CC1   CC2   CC3   CC4   CC5   CC6
SW 0(R1), R2    IF    ID    EX    MEM1  MEM2  WB
ADD R2, R3, R4        IF    ID    EX    WB
Stalls due to Data Hazard
LW R1, 0(R2)      IM   Reg  ALU  DM   Reg
SUB R4, R1, R5         IM   Reg  ALU  DM   Reg
AND R6, R1, R7              IM   Reg  ALU  DM   Reg
OR R8, R1, R9                    IM   Reg  ALU  DM

A pipeline interlock is required to detect the hazard and stall


Avoiding such Stalls

Compiler scheduling:
 Example: a = b + c; d = e + f;

LW R1, b
LW R2, c
LW R10, e
ADD R4, R1, R2
LW R11, f
SW a, R4
ADD R12, R10, R11
SW d, R12

Without such scheduling, what is the slow-down?
 1 + F_(loads causing stalls)
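
The effect of the schedule can be checked by counting load-use stalls before and after reordering. This is a sketch: the unscheduled order shown is an assumed straightforward translation of the two statements (it does not appear on the slide), forwarding is assumed, and a stall is charged whenever the instruction directly after a load reads the loaded register.

```python
def load_use_stalls(program):
    """Count one stall for each instruction that directly follows a load
    and reads the register that load writes.

    Each instruction is (opcode, dest, sources); dest is None for stores.
    """
    stalls = 0
    prev = None
    for instr in program:
        _, _, sources = instr
        if prev is not None and prev[0] == "LW" and prev[1] in sources:
            stalls += 1
        prev = instr
    return stalls


# Assumed naive order: compute a, store it, then compute d.
unscheduled = [
    ("LW", "R1", ()), ("LW", "R2", ()),
    ("ADD", "R4", ("R1", "R2")), ("SW", None, ("R4",)),
    ("LW", "R10", ()), ("LW", "R11", ()),
    ("ADD", "R12", ("R10", "R11")), ("SW", None, ("R12",)),
]
# Order given on the slide.
scheduled = [
    ("LW", "R1", ()), ("LW", "R2", ()), ("LW", "R10", ()),
    ("ADD", "R4", ("R1", "R2")), ("LW", "R11", ()), ("SW", None, ("R4",)),
    ("ADD", "R12", ("R10", "R11")), ("SW", None, ("R12",)),
]
print(load_use_stalls(unscheduled), load_use_stalls(scheduled))
```

In the assumed naive order both ADDs directly follow the load of one of their operands; the scheduled order separates every load from its first use.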
Pipeline Interlock for “Load”
Opcode of ID/EX   Opcode of IF/ID                         Check for interlock
(ID/EX.IR0..5)    (IF/ID.IR0..5)
Load              Reg-Reg ALU                             ID/EX.IR11..15 == IF/ID.IR6..10
Load              Reg-Reg ALU                             ID/EX.IR11..15 == IF/ID.IR11..15
Load              Load, store, ALU immediate, or branch   ID/EX.IR11..15 == IF/ID.IR6..10
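
The three rows of the interlock table translate directly into a comparison function. The sketch below is illustrative only: the dictionary encoding of the latches and the names `op`, `rs` (bits 6..10) and `rt` (bits 11..15) are assumptions, not part of the slides.

```python
def load_interlock(id_ex, if_id):
    """Return True when the instruction in IF/ID must stall behind a load in ID/EX.

    id_ex and if_id are dicts with 'op' (instruction class),
    'rs' (IR bits 6..10) and 'rt' (IR bits 11..15).
    """
    if id_ex["op"] != "load":
        return False
    load_dest = id_ex["rt"]                       # ID/EX.IR11..15
    if if_id["op"] == "alu_rr":                   # both source fields matter
        return load_dest in (if_id["rs"], if_id["rt"])
    if if_id["op"] in ("load", "store", "alu_imm", "branch"):
        return load_dest == if_id["rs"]           # only IF/ID.IR6..10 is checked
    return False


lw_r1 = {"op": "load", "rs": 2, "rt": 1}          # LW  R1, 0(R2)
sub = {"op": "alu_rr", "rs": 1, "rt": 5}          # SUB R4, R1, R5
add = {"op": "alu_rr", "rs": 2, "rt": 3}          # ADD R4, R2, R3
lw_r4 = {"op": "load", "rs": 1, "rt": 4}          # LW  R4, 0(R1)
print(load_interlock(lw_r1, sub), load_interlock(lw_r1, add), load_interlock(lw_r1, lw_r4))
```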
Control Logic for Data-Forwarding

Data forwarding always happens
 From ALU or data-memory output
 To ALU input, data-memory input, or zero-detection unit

Which registers to compare?

Compare the destination register field in
EX/MEM and MEM/WB latches with the
source register fields of IR in ID/EX and
EX/MEM stages
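
The register comparison described here can be sketched as a selector for one ALU source operand. The latch encoding (`dest`, `writes_reg`) and the function name are hypothetical; the sketch also assumes the load-use case is already covered by the interlock, and it only models forwarding to the ALU input.

```python
def select_operand_source(src_reg, ex_mem, mem_wb):
    """Decide where an ALU source operand should be read from.

    ex_mem and mem_wb describe the instructions held in those latches:
    a dict with 'dest' (register number) and 'writes_reg' (bool),
    or None for a bubble.
    """
    if src_reg == 0:                     # R0 is hardwired to zero in DLX
        return "register file"
    for name, latch in (("EX/MEM", ex_mem), ("MEM/WB", mem_wb)):
        if latch and latch["writes_reg"] and latch["dest"] == src_reg:
            return name                  # the newest matching producer wins
    return "register file"


older = {"dest": 1, "writes_reg": True}
newer = {"dest": 1, "writes_reg": True}
print(select_operand_source(1, newer, older))   # both match: take EX/MEM
print(select_operand_source(1, None, older))    # only MEM/WB matches
print(select_operand_source(7, newer, older))   # no match
```

Checking EX/MEM before MEM/WB matters when two in-flight instructions write the same register: the consumer must see the more recent value.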
Control Hazard

Result of branch instruction not known until
end of MEM stage

Naïve solution: stall until result of branch
instruction is known
 That an instruction is a branch is known at the
end of its ID cycle
 Note: “IF” may have to be repeated
                 CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9
Branch           IF    ID    EX    MEM   WB
Branch succ            IF    STALL STALL IF    ID    EX    MEM   WB
Branch succ + 1                                IF    ID    EX    MEM
Reducing the Branch Delay

Three clock cycles wasted for every branch
==> significantly bad performance

Two things to speed up:
 Determine earlier, if branch is taken
 Compute PC earlier

Both can be done one cycle earlier

But, beware of data hazard
Branch Behaviour of Programs

Integer programs: 13% forward conditional,
3% backward conditional, 4% unconditional

FP programs: 7%, 2%, and 1% respectively

67% of branches are taken
 60% forward branches are taken
 85% backward branches are taken
Handling Control Hazards

Stall: Naïve solution

Predict untaken or Predict not-taken:
 Treat every branch as not taken
 Only slightly more complex
 Do not update machine state until branch
outcome is known
 Done by clearing the IF/ID register of the fetched
instruction
Predict Untaken Scheme
                    CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
I (Untaken branch)  IF    ID    EX    MEM   WB
I+1                       IF    ID    EX    MEM   WB
I+2                             IF    ID    EX    MEM   WB
I+3                                   IF    ID    EX    MEM   WB

                    CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
I (Taken branch)    IF    ID    EX    MEM   WB
I+1                       IF    Noop  Noop  Noop  Noop
Target                          IF    ID    EX    MEM   WB
Target + 1                            IF    ID    EX    MEM   WB
Target + 2                                  IF    ID    EX    MEM
More Ways to Reduce Control
Hazard Delays

Predict taken:
 Treat every branch as taken
 Not of any use in DLX since branch target is not
known before branch condition anyway

May be of use in other architectures

Delayed branch:
 Instruction(s) after branch are executed anyway!
 Sequential successors are called branch-delay slots
Delayed Branch
EITHER                OR                    CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
I (Untaken branch)    I (Taken branch)      IF    ID    EX    MEM   WB
I + 1 (Branch delay)  I + 1 (Branch delay)        IF    ID    EX    MEM   WB
I+2                   Target                            IF    ID    EX    MEM   WB
I+3                   Target + 1                              IF    ID    EX    MEM   WB
I+4                   Target + 2                                    IF    ID    EX    MEM


DLX has one delay-slot

Note: another branch instruction cannot be
put in delay-slot

Compiler has to fill the delay-slots
Filling the Delay-Slot:
Option 1 of 3
Before                  After
ADD R1, R2, R3          if (R2 == 0) then
if (R2 == 0) then       ADD R1, R2, R3
Delay slot


Fill the slot from before the branch instruction

Restriction: branch must not depend on result of the
filled instruction

Improves performance: always
Filling the Delay-Slot:
Option 2 of 3
Before                  After
SUB R4, R5, R6
ADD R1, R2, R3          ADD R1, R2, R3
if (R1 == 0) then       if (R1 == 0) then
Delay slot              SUB R4, R5, R6


Fill the slot from the target of the branch instruction

Restriction: should be OK to execute instruction
even if not taken

Improves performance: when branch is taken
Filling the Delay-Slot:
Option 3 of 3
Before                  After
ADD R1, R2, R3          ADD R1, R2, R3
if (R1 == 0) then       if (R1 == 0) then
Delay slot              SUB R4, R5, R6
SUB R4, R5, R6


Fill the slot from fall through of the branch

Restriction: should be OK to execute instruction
even if taken

Improves performance: when branch is not taken
Helping the Compiler

Encode the compiler prediction in the branch
instruction
 CPU knows whether branch was predicted taken
or not taken by compiler
 Cancel or nullify if prediction incorrect
 Known as canceling or nullifying branch

Options 2 and 3 can now be used without restrictions
Static Branch Prediction

Predict-taken

Predict-untaken

Prediction based on direction
(forward/backward)

Profile-based prediction
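
Direction-based prediction is simple enough to sketch in a few lines: backward branches are predicted taken and forward branches not taken. The branch trace below is made-up illustrative data, not a measurement, and the function names are hypothetical.

```python
def predict_taken(branch_pc, target_pc):
    """Static direction-based prediction: backward => taken, forward => not taken."""
    return target_pc < branch_pc


def misprediction_rate(trace):
    """trace is a list of (branch_pc, target_pc, actually_taken)."""
    wrong = sum(1 for pc, target, taken in trace if predict_taken(pc, target) != taken)
    return wrong / len(trace)


# Hypothetical trace: a loop branch at PC 100 and a forward branch at PC 200.
trace = [
    (100, 40, True), (100, 40, True), (100, 40, False),
    (200, 260, False), (200, 260, True),
]
print(misprediction_rate(trace))
```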
Static Misprediction Rates
[Bar chart: misprediction rate (y-axis, 0% to 22.5%) per benchmark (x-axis): Compress, Eqntott, Espresso, Gcc, Li, Doduc, Ear, Hydro2d, Mdljdp, Su2cor]
Some Remarks

Delayed branches are architecturally visible
 Strength as well as weakness
 Advantage: better performance
 Disadvantage: what if implementation changes?

Deeper pipeline ==> more branch delays ==>
delay-slots may no longer be useful
 More powerful dynamic branch prediction

Note: need to remember extra PC while
taking exceptions/interrupts

Slowdown due to mispredictions:
1 + Branch frequency × Misprediction rate × Penalty
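
A short numeric example of this slowdown expression follows. The 20% branch frequency is the sum of the integer-program figures quoted earlier (13% + 3% + 4%); the 10% misprediction rate and the 3-cycle penalty are assumed values chosen for illustration, and the function name is hypothetical.

```python
def misprediction_slowdown(branch_frequency, misprediction_rate, penalty_cycles):
    """Slowdown relative to an ideal CPI of 1."""
    return 1 + branch_frequency * misprediction_rate * penalty_cycles


slowdown = misprediction_slowdown(0.20, 0.10, 3)
print(f"{slowdown:.2f}")
```

Under these assumed numbers the pipeline is about 6% slower than one with perfect prediction.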
Exceptions and Pipelining

What are exceptions?

I/O interrupt

System call

Tracing instruction execution, breakpoint

Integer/FP anomaly

Page fault

Misaligned memory access

Memory protection violation

Undefined instruction

Hardware malfunction/Power failure

Also called interrupts or faults
Exceptions: The Nemesis of
Pipelining

While taking exceptions, ensure that machine
is in a "consistent" state

Exceptions can occur:
 In many pipeline stages
 Out of order
     CC1   CC2   CC3   CC4   CC5   CC6
LW   IF    ID    EX    MEM   WB
ADD        IF    ID    EX    MEM   WB
Classification of Exceptions

Synchronous vs. Asynchronous
 Asynchronous usually caused by devices
external to the processor
 Asynchronous ==> can be handled after current
instruction (easier)

User requested vs. Coerced
 User requested ==> can be handled after current
instruction
 Coerced ==> unpredictable

User maskable vs. Non-maskable
Classification of Exceptions
(continued)

Within vs. Between instructions
 Within ==> instruction cannot be completed,
usually synchronous (harder)

Resume vs. Terminate
 Terminate process ==> easier
Exception Classification
Exception type            Synchronous?  Coerced?  Maskable?  Within instn.?  Resume?
I/O request               No            Yes       No         No              Yes
Sys. call                 Yes           No        No         No              Yes
Tracing/Brk.pt.           Yes           No        Yes        No              Yes
ALU excpn.                Yes           Yes       Yes        Yes             Yes
Page fault                Yes           Yes       No         Yes             Yes
Misaligned mem. access    Yes           Yes       Yes        Yes             Yes
Protecn. violn.           Yes           Yes       No         Yes             Yes
Undefined instns.         Yes           Yes       No         Yes             No
H/w malfn./power failure  No            Yes       No         Yes             No
Restarting Execution

Restartable: take exception, save state,
restart without affecting execution

Restarting
 Force a trap instruction into pipeline
 Until trap, disable all writes for faulting instruction
and all subsequent ones
 Trap into exception handling routine (OS)
 Need to save more than one PC for delayed
branches

Precise Exceptions: all instructions prior to
faulting one completed, but not any other
Exceptions in DLX
     CC1   CC2   CC3   CC4   CC5   CC6
LW   IF    ID    EX    MEM   WB
ADD        IF    ID    EX    MEM   WB


Exceptions can occur:
 In same cycle, or even out-of-order

Cannot handle an exception at the moment it occurs
 Carry an instruction status in the pipeline latches
 In WB stage, exception corresponding to earliest
instruction will be handled
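
The idea of carrying exception status down the pipeline and acting on it only in WB can be sketched as follows. The specific exception kinds (a data page fault for LW in MEM, an instruction page fault for ADD in IF) are assumptions chosen to fit the LW/ADD diagram, and the function and variable names are hypothetical.

```python
def exception_handled_first(instructions):
    """Pick the exception that is handled first when status is checked in WB.

    instructions is in program order; each entry is
    (name, exception_or_None, cycle_raised). WB is reached in program
    order, so the first flagged instruction wins regardless of timing.
    """
    for name, exception, _cycle in instructions:
        if exception is not None:
            return name, exception
    return None


in_flight = [
    ("LW", "data page fault", 4),          # raised in MEM (CC4)
    ("ADD", "instruction page fault", 2),  # raised in IF (CC2), earlier in time
]
flagged = [entry for entry in in_flight if entry[1] is not None]
earliest_in_time = min(flagged, key=lambda entry: entry[2])[0]

print(earliest_in_time)                    # raised first in time
print(exception_handled_first(in_flight))  # handled first
```

Although the ADD's exception is raised two cycles earlier, the LW's exception is handled first because LW is earlier in program order.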
More Complications in
Pipelining

Multiple write stages

Or, changing processor state in the middle of
an instruction
 E.g., Auto-increment addressing mode in VAX

Updating memory state during instruction
 E.g., String copy instruction in VAX
More Complications in
Pipelining (continued)

Implicitly set condition codes
 Problems in scheduling the delay slot, and
during exceptions

Self-modifying code in 80x86!

Multi-cycle operations
MOVL R1, R2
ADDL3 42(R1), 56(R1)+, @(R1)
SUBL2 R2, R3
MOVC3 @(R1)(R2), 74(R2), R3
Data hazards very complicated to determine!
VAX pipelines micro-instructions
Pipelining Multi-cycle Opns.

Some operations take > 1 cycle (e.g. FP)

Handling multi-cycle opns. in the pipeline:
 Multiple EX stages
 Multiple functional units

An example:
 IF ID {EX1 | EX2 | EX3 | EX4} MEM WB
 EX1: Main ALU
 EX2: FP, Int mult.
 EX3: FP add, sub
 EX4: FP, Int div
Pipelining Multi-cycle Opns.
(continued)

Two things to consider:
 Different units may take different # cycles
 Some units may not be pipelined

Corresponding definitions:
 Latency: # cycles between an instn. & another
which can use its result
 Initiation/repeat interval: # cycles between issue
of two operations of the same type
The Multi-cycle Pipeline
[Figure: IF and ID feed four parallel execution paths (EX; M1 to M7; A1 to A4; DIVIDE), which all lead to MEM and WB]

Functional Unit   Latency   Initiation interval
Main ALU          0         1
Data memory       1         1
FP add, sub       3         1
FP, Int mul       6         1
FP div            14        15
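
The latency and initiation-interval columns can be turned into stall counts. The sketch below assumes the usual reading of the two definitions: a dependent instruction issued right after its producer stalls for `latency` cycles, and two back-to-back operations on the same unit are separated by `initiation interval - 1` stall cycles. The dictionary keys and function names are hypothetical.

```python
# (latency, initiation interval) per functional unit, taken from the table.
UNITS = {
    "alu": (0, 1),
    "mem": (1, 1),
    "fp_add": (3, 1),
    "fp_mul": (6, 1),
    "fp_div": (14, 15),
}


def stalls_for_dependent(unit, instructions_between=0):
    """Stalls seen by a consumer of a result produced on the given unit."""
    latency, _ = UNITS[unit]
    return max(0, latency - instructions_between)


def stalls_between_same_unit(unit):
    """Stalls between two back-to-back operations on the same unit."""
    _, interval = UNITS[unit]
    return interval - 1


print(stalls_for_dependent("fp_mul"))      # consumer right after a multiply
print(stalls_for_dependent("fp_add", 2))   # two independent instructions in between
print(stalls_between_same_unit("fp_div"))  # second divide right after the first
```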
Pipeline Timing: An Example
       CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9   CC10  CC11
MULTD  IF    ID    M1    M2    M3    M4    M5    M6    M7    MEM   WB
ADDD         IF    ID    A1    A2    A3    A4    MEM   WB
LD                 IF    ID    EX    MEM   WB


Additional details:
 We require more latches
 ID/EX register must be expanded
More Hazards!

Structural hazards:
 Divide unit is not pipelined
 Multiple writes possible in the same cycle

Data hazards:
 RAW is more frequent
 WAW is possible

Control hazards:
 Out-of-order completion ==> difficulty in handling
exceptions
Multiple Writes/Cycle: An
Example
                  CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9   CC10  CC11
MULTD F0, F4, F6  IF    ID    M1    M2    M3    M4    M5    M6    M7    MEM   WB
...                     IF    ID    EX    MEM   WB
...                           IF    ID    EX    MEM   WB
ADDD F2, F4, F6                     IF    ID    A1    A2    A3    A4    MEM   WB
...                                       IF    ID    EX    MEM   WB
Multiple Writes/Cycle: Solution

Provide multiple write-ports

Or, detect and stall; Two possibilities:
 Detect in ID stage

Instructions reserve the write port using a reservation
register

Reservation register is shifted one bit each clock
 Detect in MEM stage

Easier to check

Can also give priority to longer cycle operation

But, stall can now be in two places

Stall may trickle back
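
The reservation-register scheme for detection in ID can be sketched as a shift register with one bit per future cycle. The class and method names are hypothetical, and the cycle numbers in the example are read off the MULTD/ADDD diagram (MULTD in ID at CC2 and ADDD in ID at CC5, both due to write back in CC11).

```python
class WritePortReservation:
    """Bit k is set when the register write port is taken k cycles from now."""

    def __init__(self):
        self.bits = 0

    def try_reserve(self, cycles_until_wb):
        """Called in ID. Returns False when the instruction must stall."""
        mask = 1 << cycles_until_wb
        if self.bits & mask:
            return False
        self.bits |= mask
        return True

    def tick(self):
        """Advance one clock: every reservation moves one cycle closer."""
        self.bits >>= 1


port = WritePortReservation()
multd_ok = port.try_reserve(9)        # MULTD in ID at CC2, WB at CC11
for _ in range(3):                    # advance to CC5
    port.tick()
addd_ok = port.try_reserve(6)         # ADDD in ID at CC5 also wants CC11
port.tick()                           # ADDD stalls one cycle (now CC6)
addd_retry_ok = port.try_reserve(6)   # WB would now fall in CC12
print(multd_ok, addd_ok, addd_retry_ok)
```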
Data Hazards
                  CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9   CC10  CC11  CC12
LD F4, 0(R2)      IF    ID    EX    MEM   WB
MULTD F0, F4, F6        IF    ID    STL   M1    M2    M3    M4    M5    M6    M7    MEM

RAW hazards cause more stalls now

                  CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9   CC10  CC11
MULTD F0, F4, F6  IF    ID    M1    M2    M3    M4    M5    M6    M7    MEM   WB
...                     IF    ID    EX    MEM   WB
...                           IF    ID    EX    MEM   WB
ADDD F2, F4, F6                     IF    ID    A1    A2    A3    A4    MEM   WB
...                                       IF    ID    EX    MEM   WB
LD F2, 0(R2)                                    IF    ID    EX    MEM   WB

WAW hazard: an example


Handling WAW Hazards

Occurs only when the result of ADDD is
overwritten without any instruction using it!
 Otherwise, RAW hazard stall would have
occurred

Hazard can be detected in ID stage of latter
instruction

Two ways to handle:
 Delay issue of load until ADDD enters MEM
 Stamp out result of ADDD
Control Hazard Complications

An example:
 DIVF F0, F2, F4 // Finishes last; excepn.
 ADDF F10, F10, F8 // Finishes first
 SUBF F12, F12, F14 // Finishes second

Out-of-order completion causes problems!
 Precise exceptions are difficult to implement
Achieving Precise Exceptions

Approach 1: Ostrich algorithm
 Don't care
 Maybe provide a slower precise mode

Example: special instructions to check for FP
exceptions

Approach 2: allow instruction issue to
continue only if previous instructions will
complete without exception
 Stall to maintain precise exceptions
Achieving Precise Exceptions
(continued)

Approach 3: save state to undo
 Two possibilities

History file: keep track of original value of registers

Future file: keep track of current value; main register
file updated after all previous instructions are done
 More buffer space required
 Hazard checks and control become very
complex
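
A history file can be sketched as a log of overwritten register values that is replayed backwards when an exception is taken. This is a simplified illustration, not a hardware design: the class and method names are hypothetical, sequence numbers stand for program order, and WAW hazards are assumed to have been resolved already. The example uses the DIVF/ADDF/SUBF sequence from the control-hazard slide with made-up register values.

```python
class HistoryFile:
    """Log old register values so that out-of-order writes can be undone."""

    def __init__(self, registers):
        self.registers = registers
        self.log = []   # (sequence_number, register, old_value), in completion order

    def write(self, seq, reg, value):
        self.log.append((seq, reg, self.registers[reg]))
        self.registers[reg] = value

    def rollback(self, faulting_seq):
        """Undo every write made by the faulting instruction or a later one."""
        kept = []
        for seq, reg, old_value in reversed(self.log):
            if seq >= faulting_seq:
                self.registers[reg] = old_value
            else:
                kept.append((seq, reg, old_value))
        self.log = kept[::-1]


regs = {"F0": 0.0, "F10": 1.0, "F12": 5.0}   # made-up initial values
history = HistoryFile(regs)
history.write(1, "F10", 3.0)    # ADDF (second in program order) finishes first
history.write(2, "F12", 2.0)    # SUBF finishes second
history.rollback(0)             # DIVF (first in program order) raises an exception
print(regs)
```

After the rollback the register file again holds the values it had before DIVF, so the exception appears precise to the handler.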
Achieving Precise Exceptions
(continued)

Approach 4: imprecise, but keep enough
state for OS to recover
 Keep track of incomplete instructions
 OS then runs those instructions before returning
control
 Complicated to execute these instructions
properly!
