Pipeline Hazards: Structural Hazards: Resource Conflict

Pipeline Hazards

Structural Hazards: resource conflict
Example: same cache/memory for instruction
and data

Data Hazards: same data item being
accessed/written in nearby instructions
Example:

ADD R1, R2, R3

SUB R4, R1, R5

Control Hazards: branch instructions
Structural Hazards

Usually happen when a unit is not fully
pipelined
That unit cannot churn out one instruction per
cycle

Or, when a resource has not been duplicated
enough
Example: same I-cache and D-cache
Example: single write-port for register-file

Usual solution: stall
Also called pipeline bubble, or simply bubble
Stalling the Pipeline
        CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9   CC10
Load    IF    ID    EX    MEM   WB
I+1           IF    ID    EX    MEM   WB
I+2                 IF    ID    EX    MEM   WB
I+3                       STALL IF    ID    EX    MEM   WB
I+4                                   IF    ID    EX    MEM   WB

What is the slowdown due to stalls caused by
such load instructions?
CPI without stalls = 1
CPI with stalls = 1 + F_load (F_load: fraction of instructions that are such loads)
Slowdown = 1 + F_load
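
The slowdown formula can be checked with a few lines of Python. This is only a sketch: it assumes an ideal CPI of 1 and one stall cycle per load, as on the slide, and the 25% load fraction is a made-up workload figure.

```python
def cpi_with_load_stalls(load_fraction: float, stall_cycles: int = 1) -> float:
    """CPI of an ideal (CPI = 1) pipeline in which each load adds stall cycles."""
    if not 0.0 <= load_fraction <= 1.0:
        raise ValueError("load_fraction must be between 0 and 1")
    return 1.0 + load_fraction * stall_cycles


def slowdown(load_fraction: float) -> float:
    """Ratio of the CPI with stalls to the ideal CPI of 1."""
    return cpi_with_load_stalls(load_fraction) / 1.0


# Hypothetical workload: 25% of the instructions are loads.
print(slowdown(0.25))  # prints 1.25
```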
Why Allow Structural Hazards?

Lower Cost:
Less hardware ==> lower cost

Shorter latency of unpipelined unit
May have other performance benefits
Data hazards may introduce stalls anyway!

Suppose the FP unit is unpipelined, and the
other instructions have a 5-stage pipeline.
What percentage of instructions can be FP,
so that the CPI does not increase?
20% can be FP, assuming no clustering of FP instructions
Even if clustered, data hazards may introduce stalls anyway
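
The 20% figure and the effect of clustering can be illustrated with a small simulation. It is a sketch under two assumptions that are not stated on the slide: the unpipelined FP unit stays busy for 5 cycles per operation, and one instruction is issued per cycle unless the FP unit is busy.

```python
from typing import Sequence


def fp_structural_stalls(is_fp: Sequence[bool], busy_cycles: int = 5) -> int:
    """Count stall cycles caused by an unpipelined FP unit (assumed model)."""
    cycle = 0          # cycle in which the current instruction wants to issue
    unit_free_at = 0   # first cycle in which the FP unit is free again
    stalls = 0
    for fp in is_fp:
        if fp:
            if cycle < unit_free_at:
                stalls += unit_free_at - cycle  # wait for the FP unit
                cycle = unit_free_at
            unit_free_at = cycle + busy_cycles
        cycle += 1
    return stalls


# Both streams contain 20% FP instructions (2 out of 10).
spread = [True, False, False, False, False] * 2
clustered = [True, True] + [False] * 8

print(fp_structural_stalls(spread))     # prints 0
print(fp_structural_stalls(clustered))  # prints 4
```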
Data Hazards

Example:

ADD R1, R2, R3

SUB R4, R1, R5

AND R6, R1, R7

OR R8, R1, R9

XOR R10,R1, R11

All instructions after ADD depend on R1

Stalling is a possibility
Can we do better?
Register File: Reads after Writes
[Figure: pipeline diagram; each instruction below passes through IM, Reg, ALU, DM, Reg]
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
Minimizing Stalls via Forwarding

[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg stages) with forwarding paths between instructions]
Data Forwarding for Stores

[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg stages) for the sequence below, with forwarding into the store]
ADD R1, R2, R3
LW R4, 0(R1)
SW 12(R1), R4
Note: no data hazards on memory locations in DLX, since memory references are always in order
Data Hazard Classification

Read after Write (RAW): use data
forwarding to overcome

Write after Write (WAW): arises only when
writes can happen in different pipeline stages

                CC1   CC2   CC3   CC4   CC5   CC6
LW R1, 0(R2)    IF    ID    EX    MEM1  MEM2  WB
ADD R1, R2, R3        IF    ID    EX    WB

Has other problems as well: structural hazards

Write after Read (WAR): rare
SW 0(R1), R2    IF    ID    EX    MEM1  MEM2  WB
ADD R2, R3, R4        IF    ID    EX    WB
Stalls due to Data Hazard
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg stages) for the sequence below]
LW R1, 0(R2)
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
Pipeline interlock is required: to detect hazard and stall

Avoiding such Stalls

Compiler scheduling:
Example: a = b + c; d = e + f;

LW R1, b
LW R2, c
LW R10, e
ADD R4, R1, R2
LW R11, f
SW a, R4
ADD R12, R10, R11
SW d, R12

Without such scheduling, what is the slow-down?
1 + F_(loads causing stalls)
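
To see what scheduling buys, the sketch below counts load-use stalls in two orderings of the same code. The scheduled order is the one from the slide. The unscheduled order is an assumption (a straightforward statement-by-statement translation), as is the model: full forwarding, so only an instruction that needs a load result at the start of EX in the very next cycle stalls, and store data is assumed to be forwarded without a stall.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]
    ex_srcs: Tuple[str, ...]   # registers needed at the start of EX


def load_use_stalls(program: List[Instr]) -> int:
    """One stall for each instruction that uses the result of the load just before it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.op == "LW" and prev.dest in cur.ex_srcs:
            stalls += 1
    return stalls


# Assumed naive translation of a = b + c; d = e + f;
unscheduled = [
    Instr("LW", "R1", ()), Instr("LW", "R2", ()),
    Instr("ADD", "R4", ("R1", "R2")), Instr("SW", None, ()),
    Instr("LW", "R10", ()), Instr("LW", "R11", ()),
    Instr("ADD", "R12", ("R10", "R11")), Instr("SW", None, ()),
]
# Scheduled order from the slide.
scheduled = [
    Instr("LW", "R1", ()), Instr("LW", "R2", ()), Instr("LW", "R10", ()),
    Instr("ADD", "R4", ("R1", "R2")), Instr("LW", "R11", ()),
    Instr("SW", None, ()), Instr("ADD", "R12", ("R10", "R11")),
    Instr("SW", None, ()),
]

for name, prog in (("unscheduled", unscheduled), ("scheduled", scheduled)):
    stalls = load_use_stalls(prog)
    print(name, stalls, 1 + stalls / len(prog))
```

Under these assumptions the unscheduled order has two stalling loads out of eight instructions, which gives a slowdown of 1 + 2/8 = 1.25, while the scheduled order has none.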
Pipeline Interlock for “Load”
Opcode of ID/EX   Opcode of IF/ID                        Check for interlock
(ID/EX.IR0..5)    (IF/ID.IR0..5)
Load              Reg-Reg ALU                            ID/EX.IR11..15 == IF/ID.IR6..10
Load              Reg-Reg ALU                            ID/EX.IR11..15 == IF/ID.IR11..15
Load              Load, store, ALU immediate, or branch  ID/EX.IR11..15 == IF/ID.IR6..10
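
The interlock table translates almost directly into a comparison function. The sketch below is one possible rendering: the `Latch` class and its `kind` strings are illustrative names, and the two integer fields stand for the register numbers held in bits 6..10 and 11..15 of the instruction register.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Latch:
    kind: str        # "load", "rr_alu", "store", "alu_imm" or "branch" (assumed labels)
    ir_6_10: int     # register number in IR bits 6..10
    ir_11_15: int    # register number in IR bits 11..15


def load_interlock(id_ex: Latch, if_id: Latch) -> bool:
    """True if the instruction in IF/ID must stall behind a load in ID/EX."""
    if id_ex.kind != "load":
        return False
    load_dest = id_ex.ir_11_15
    if if_id.kind == "rr_alu":
        # Both source fields of a register-register ALU instruction are checked.
        return load_dest in (if_id.ir_6_10, if_id.ir_11_15)
    if if_id.kind in ("load", "store", "alu_imm", "branch"):
        # Only the field in bits 6..10 is checked for these instructions.
        return load_dest == if_id.ir_6_10
    return False


lw_r1 = Latch("load", ir_6_10=2, ir_11_15=1)     # LW R1, 0(R2)
add_r4 = Latch("rr_alu", ir_6_10=1, ir_11_15=2)  # ADD R4, R1, R2
lw_r4 = Latch("load", ir_6_10=1, ir_11_15=4)     # LW R4, 0(R1)
sw_r4 = Latch("store", ir_6_10=1, ir_11_15=4)    # SW 12(R1), R4

print(load_interlock(lw_r1, add_r4))  # True
print(load_interlock(lw_r4, sw_r4))   # False
```

The second case does not stall because the table compares only bits 6..10 (the base register) for a store.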
Control Logic for Data-Forwarding

Data forwarding always happens
From ALU or data-memory output
To ALU input, data-memory input, or zero-
detection unit

Which registers to compare?

Compare the destination register field in
EX/MEM and MEM/WB latches with the
source register fields of IR in ID/EX and
EX/MEM stages
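
The register comparison can be sketched as a small selection function for one ALU source operand. The function name and the returned strings are illustrative. When both latches hold a matching destination, the sketch picks the EX/MEM value because it comes from the more recent instruction.

```python
from typing import Optional


def forward_source(src_reg: int,
                   ex_mem_dest: Optional[int],
                   mem_wb_dest: Optional[int]) -> str:
    """Choose where an ALU operand comes from (sketch; None means no destination)."""
    if ex_mem_dest is not None and src_reg == ex_mem_dest:
        return "EX/MEM"          # result of the immediately preceding instruction
    if mem_wb_dest is not None and src_reg == mem_wb_dest:
        return "MEM/WB"          # result of the instruction two ahead
    return "register file"


# ADD R1, R2, R3 ; SUB R4, R1, R5 ; AND R6, R1, R7
print(forward_source(1, ex_mem_dest=1, mem_wb_dest=None))  # SUB reads R1: EX/MEM
print(forward_source(1, ex_mem_dest=4, mem_wb_dest=1))     # AND reads R1: MEM/WB
print(forward_source(7, ex_mem_dest=4, mem_wb_dest=1))     # AND reads R7: register file
```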
Control Hazard

Result of branch instruction not known until
end of MEM stage

Naïve solution: stall until result of branch
instruction is known
That an instruction is a branch is known at the
end of its ID cycle
Note: “IF” may have to be repeated
                 CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9
Branch           IF    ID    EX    MEM   WB
Branch succ            IF    STALL STALL IF    ID    EX    MEM   WB
Branch succ + 1                                IF    ID    EX    MEM
Reducing the Branch Delay

Three clock cycles wasted for every branch
==> significantly bad performance

Two things to speed up:
Determine earlier, if branch is taken
Compute PC earlier

Both can be done one cycle earlier

But, beware of data hazard
Branch Behaviour of Programs

Integer programs: 13% forward conditional,
3% backward conditional, 4% unconditional

FP programs: 7%, 2%, and 1% respectively

67% of branches are taken
60% of forward branches are taken
85% of backward branches are taken
Handling Control Hazards

Stall: Naïve solution

Predict untaken or Predict not-taken:
Treat every branch as not taken
Only slightly more complex
Do not update machine state until branch
outcome is known
Done by clearing the IF/ID register of the fetched
instruction
Predict Untaken Scheme
                    CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
I (Untaken branch)  IF    ID    EX    MEM   WB
I+1                       IF    ID    EX    MEM   WB
I+2                             IF    ID    EX    MEM   WB
I+3                                   IF    ID    EX    MEM   WB

                    CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
I (Taken branch)    IF    ID    EX    MEM   WB
I+1                       IF    Noop  Noop  Noop  Noop
Target                          IF    ID    EX    MEM   WB
Target + 1                            IF    ID    EX    MEM   WB
Target + 2                                  IF    ID    EX    MEM
More Ways to Reduce Control Hazard Delays

Predict taken:
Treat every branch as taken
Not of any use in DLX since branch target is not
known before branch condition anyway

May be of use in other architectures

Delayed branch:
Instruction(s) after branch are executed anyway!
Sequential successors are called branch-delay slots
Delayed Branch
EITHER              OR                  CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8
I (Untaken branch)  I (Taken branch)    IF    ID    EX    MEM   WB
I+1 (Branch delay)  I+1 (Branch delay)        IF    ID    EX    MEM   WB
I+2                 Target                          IF    ID    EX    MEM   WB
I+3                 Target + 1                            IF    ID    EX    MEM   WB
I+4                 Target + 2                                  IF    ID    EX    MEM

DLX has one delay-slot

Note: another branch instruction cannot be
put in delay-slot

Compiler has to fill the delay-slots
Filling the Delay-Slot:
Option 1 of 3
Before:                    After:
ADD R1, R2, R3
if (R2 == 0) then          if (R2 == 0) then
Delay slot                 ADD R1, R2, R3

Fill the slot from before the branch instruction

Restriction: branch must not depend on result of the
filled instruction

Improves performance: always
Option 2 of 3
Before:                    After:
SUB R4, R5, R6
ADD R1, R2, R3             ADD R1, R2, R3
Delay slot                 SUB R4, R5, R6

Fill the slot from the target of the branch instruction

Restriction: should be OK to execute instruction
even if not taken

Improves performance: when branch is taken
Option 3 of 3
Before:                    After:
ADD R1, R2, R3             ADD R1, R2, R3
Delay slot                 SUB R4, R5, R6
SUB R4, R5, R6

Fill the slot from fall through of the branch

Restriction: should be OK to execute instruction
even if taken

Improves performance: when branch is not taken
Helping the Compiler

Encode the compiler prediction in the branch
instruction
CPU knows whether branch was predicted taken
or not taken by compiler
Cancel or nullify if prediction incorrect
Known as canceling or nullifying branch

Options 2 and 3 can now be used without restrictions
Static Branch Prediction

Predict-taken

Predict-untaken

Prediction based on direction
(forward/backward)

Profile-based prediction
Static Misprediction Rates
[Figure: bar chart of misprediction rate (y-axis, 0% to 22.5%) by benchmark (x-axis): Compress, Eqntott, Espresso, Gcc, Li, Doduc, Ear, Hydro2d, Mdljdp, Su2cor]
Some Remarks

Delayed branches are architecturally visible
Strength as well as weakness
Advantage: better performance
Disadvantage: what if implementation changes?

Deeper pipeline ==> more branch delays ==>
delay-slots may no longer be useful
More powerful dynamic branch prediction

Note: need to remember extra PC while
taking exceptions/interrupts

Slowdown due to mispredictions:
1 + Branch frequency × Misprediction rate × Penalty
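
As a worked instance of the slowdown expression 1 + branch frequency × misprediction rate × penalty, the sketch below plugs in a 20% branch frequency (the integer-program total of 13% + 3% + 4% quoted earlier) and a 3-cycle penalty (the naive branch delay). The 10% misprediction rate is an assumed value chosen only for illustration.

```python
def misprediction_slowdown(branch_frequency: float,
                           misprediction_rate: float,
                           penalty_cycles: float) -> float:
    """Slowdown relative to an ideal CPI of 1."""
    return 1.0 + branch_frequency * misprediction_rate * penalty_cycles


# 20% branches, assumed 10% misprediction rate, 3-cycle penalty.
result = misprediction_slowdown(0.20, 0.10, 3)
print(round(result, 2))  # prints 1.06
```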
Exceptions and Pipelining

What are exceptions?

I/O interrupt

System call

Tracing instruction execution, breakpoint

Integer/FP anomaly

Page fault

Misaligned memory access

Memory protection violation

Undefined instruction

Hardware malfunction/Power failure

Also called interrupts or faults
Exceptions: The Nemesis of Pipelining

While taking exceptions, ensure that machine
is in a “consistent” state

Exceptions can occur:
In many pipeline stages
Out of order
LW   IF    ID    EX    MEM   WB
ADD        IF    ID    EX    MEM   WB
Classification of Exceptions

Synchronous vs. Asynchronous
Asynchronous usually caused by devices
external to the processor
Asynchronous ==> can be handled after current
instruction (easier)

User requested vs. Coerced
User requested ==> can be handled after current
instruction
Coerced ==> unpredictable

User maskable vs. Non-maskable
Classification of Exceptions (continued)

Within vs. Between instructions
Within ==> instruction cannot be completed,
usually synchronous (harder)

Resume vs. Terminate
Terminate process ==> easier
Exception Classification
Exception type            Synchronous?  Coerced?  Maskable?  Within instn.?  Resume?
I/O request               No            Yes       No         No              Yes
Sys. call                 Yes           No        No         No              Yes
Tracing/Brk.pt.           Yes           No        Yes        No              Yes
ALU excpn.                Yes           Yes       Yes        Yes             Yes
Page fault                Yes           Yes       No         Yes             Yes
Misaligned mem. access    Yes           Yes       Yes        Yes             Yes
Protecn. violn.           Yes           Yes       No         Yes             Yes
Undefined instns.         Yes           Yes       No         Yes             No
H/w malfn./power failure  No            Yes       No         Yes             No
Restarting Execution

Restartable: take exception, save state,
restart without affecting execution

Restarting
Force a trap instruction into pipeline
Until trap, disable all writes for faulting instruction
and all subsequent ones
Trap into exception handling routine (OS)
Need to save more than one PC for delayed
branches

Precise Exceptions: all instructions prior to
faulting one completed, but not any other
Exceptions in DLX
LW   IF    ID    EX    MEM   WB
ADD        IF    ID    EX    MEM   WB

Exceptions can occur:
In same cycle, or even out-of-order

Cannot handle an exception at the time
it occurs
Carry an instruction status in the pipeline latches
In WB stage, exception corresponding to earliest
instruction will be handled
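
The difference between the order in which exceptions are raised and the order in which they are handled can be sketched as follows. The model assumes the 5-stage pipeline with no stalls, with instruction i entering IF in cycle i + 1. The fault scenario (LW faulting in MEM, ADD faulting in IF) is a hypothetical example.

```python
from typing import Dict

STAGES = ("IF", "ID", "EX", "MEM", "WB")


def raise_cycle(instr_index: int, stage: str) -> int:
    """Clock cycle (1-based) in which instruction `instr_index` is in `stage`."""
    return instr_index + 1 + STAGES.index(stage)


def first_raised(faults: Dict[int, str]) -> int:
    """Instruction whose exception occurs earliest in time."""
    return min(faults, key=lambda i: raise_cycle(i, faults[i]))


def first_handled(faults: Dict[int, str]) -> int:
    """Instruction whose exception is handled first.

    The status travels with the instruction and is examined in WB, which
    instructions reach in program order, so the earliest instruction wins.
    """
    return min(faults)


# Instruction 0 is the LW (faults in MEM), instruction 1 is the ADD (faults in IF).
faults = {0: "MEM", 1: "IF"}
print(first_raised(faults))   # prints 1: the ADD faults in CC2, the LW in CC4
print(first_handled(faults))  # prints 0: the LW reaches WB first
```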
More Complications in Pipelining

Multiple write stages

Or, changing processor state in the middle of
an instruction
E.g., Auto-increment addressing mode in VAX

Updating memory state during instruction
E.g., String copy instruction in VAX
More Complications in Pipelining (continued)

Implicitly set condition codes
Problems in scheduling the delay slot, and
during exceptions

Self-modifying code in 80x86!

Multi-cycle operations
MOVL R1, R2
ADDL3 42(R1), 56(R1)+, @(R1)
SUBL2 R2, R3
MOVC3 @(R1)(R2), 74(R2), R3
Data hazards very complicated to determine!
VAX pipelines micro-instructions
Pipelining Multi-cycle Opns.

Some operations take > 1 cycle (e.g. FP)

Handling multi-cycle opns. in the pipeline:
Multiple EX stages
Multiple functional units

An example:
          EX1
IF  ID    EX2    MEM  WB
          EX3
          EX4
EX1: Main ALU
EX2: FP, Int mult.
EX3: FP add, sub
EX4: FP, Int div
Pipelining Multi-cycle Opns. (continued)

Two things to consider:
Different units may take different # cycles
Some units may not be pipelined

Corresponding definitions:
Latency: # cycles between an instn. & another
which can use its result
Initiation/repeat interval: # cycles between issue
of two operations of the same type
The Multi-cycle Pipeline
          EX
          M1 M2 M3 M4 M5 M6 M7
IF  ID                            MEM  WB
          A1 A2 A3 A4
          DIVIDE

Functional Unit   Latency   Initiation interval
Main ALU          0         1
Data memory       1         1
FP add, sub       3         1
FP, Int mul       6         1
FP div            14        15
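
The two columns of the table map onto two kinds of stall. The sketch below computes both, using the values from the table and assuming full forwarding. `distance` is how many instructions later the dependent (or same-type) instruction is issued, with 1 meaning immediately after. The function names are illustrative.

```python
LATENCY = {
    "Main ALU": 0, "Data memory": 1, "FP add, sub": 3,
    "FP, Int mul": 6, "FP div": 14,
}
INITIATION_INTERVAL = {
    "Main ALU": 1, "Data memory": 1, "FP add, sub": 1,
    "FP, Int mul": 1, "FP div": 15,
}


def raw_stalls(producer_unit: str, distance: int = 1) -> int:
    """Stall cycles for a consumer issued `distance` instructions after the producer."""
    return max(0, LATENCY[producer_unit] - (distance - 1))


def structural_stalls(unit: str, distance: int = 1) -> int:
    """Stall cycles for a second operation on the same unit, `distance` instructions later."""
    return max(0, INITIATION_INTERVAL[unit] - distance)


print(raw_stalls("Data memory"))        # prints 1: load followed by a dependent instruction
print(raw_stalls("FP, Int mul"))        # prints 6
print(raw_stalls("FP, Int mul", 3))     # prints 4: two independent instructions in between
print(structural_stalls("FP div"))      # prints 14: back-to-back divides
```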
Pipeline Timing: An Example
        CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9   CC10  CC11
MULTD   IF    ID    M1    M2    M3    M4    M5    M6    M7    MEM   WB
ADDD          IF    ID    A1    A2    A3    A4    MEM   WB
LD                  IF    ID    EX    MEM   WB

Additional details:
We require more latches
ID/EX register must be expanded
More Hazards!

Structural hazards:
Divide unit is not pipelined
Multiple writes possible in the same cycle

Data hazards:
RAW is more frequent
WAW is possible

Control hazards:
Out-of-order completion ==> difficulty in handling
exceptions
Multiple Writes/Cycle: An Example
                  CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9   CC10  CC11
MULTD F0, F4, F6  IF    ID    M1    M2    M3    M4    M5    M6    M7    MEM   WB
...                     IF    ID    EX    MEM   WB
...                           IF    ID    EX    MEM   WB
ADDD F2, F4, F6                     IF    ID    A1    A2    A3    A4    MEM   WB
...                                       IF    ID    EX    MEM   WB
Multiple Writes/Cycle: Solution

Provide multiple write-ports

Or, detect and stall; Two possibilities:
Detect in ID stage

Instructions reserve the write port using a reservation
register

Reservation register is shifted one bit each clock
Detect in MEM stage

Easier to check

Can also give priority to longer cycle operation

But, stall can now be in two places

Stall may trickle back
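
The reservation register used for detection in the ID stage can be sketched as a bit mask that is shifted once per clock. In this sketch, bit k set means the write port is already taken k cycles from now. The class and method names are illustrative, and the distances from ID to WB (9 cycles for MULTD, 6 for ADDD, 3 for an integer instruction) are read off the timing diagrams above.

```python
class WritePortReservation:
    """Shift-register model of write-port reservation (illustrative sketch)."""

    def __init__(self) -> None:
        self.bits = 0

    def tick(self) -> None:
        """Advance one clock: every reservation moves one cycle closer."""
        self.bits >>= 1

    def try_reserve(self, cycles_to_wb: int) -> bool:
        """Reserve the port `cycles_to_wb` cycles ahead, or return False to stall."""
        mask = 1 << cycles_to_wb
        if self.bits & mask:
            return False
        self.bits |= mask
        return True


port = WritePortReservation()
log = []
log.append(port.try_reserve(9))   # CC2: MULTD in ID, WB in CC11
port.tick()
log.append(port.try_reserve(3))   # CC3: integer instruction in ID
port.tick()
log.append(port.try_reserve(3))   # CC4: integer instruction in ID
port.tick()
log.append(port.try_reserve(6))   # CC5: ADDD in ID would also write in CC11
port.tick()
log.append(port.try_reserve(6))   # CC6: ADDD retries after one stall cycle
print(log)  # prints [True, True, True, False, True]
```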
Data Hazards
                  CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9   CC10  CC11  CC12
LD F4, 0(R2)      IF    ID    EX    MEM   WB
MULTD F0, F4, F6        IF    ID    STL   M1    M2    M3    M4    M5    M6    M7    MEM
RAW hazards cause more stalls now

                  CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9   CC10  CC11
MULTD F0, F4, F6  IF    ID    M1    M2    M3    M4    M5    M6    M7    MEM   WB
...                     IF    ID    EX    MEM   WB
...                           IF    ID    EX    MEM   WB
ADDD F2, F4, F6                     IF    ID    A1    A2    A3    A4    MEM   WB
...                                       IF    ID    EX    MEM   WB
LD F2, 0(R2)                                    IF    ID    EX    MEM   WB
WAW hazard: an example

Handling WAW Hazards

Occurs only when the result of ADDD is
overwritten without any instruction using it!
Otherwise, RAW hazard stall would have
occurred

Hazard can be detected in ID stage of latter
instruction

Two ways to handle:
Delay issue of load until ADDD enters MEM
Stamp out result of ADDD
Control Hazard Complications

An example:
DIVF F0, F2, F4 // Finishes last; excepn.
ADDF F10, F10, F8 // Finishes first
SUBF F12, F12, F14 // Finishes second

Out-of-order completion causes problems!
Precise exceptions are difficult to implement
Achieving Precise Exceptions

Approach 1: Ostrich algorithm
Don't care
Maybe provide a slower precise mode

Example: special instructions to check for FP
exceptions

Approach 2: allow instruction issue to
continue only if previous instructions will
complete without exception
Stall to maintain precise exceptions
Achieving Precise Exceptions (continued)

Approach 3: save state to undo
Two possibilities

History file: keep track of original value of registers

Future file: keep track of current value; main register
file updated after all previous instructions are done
More buffer space required
Hazard checks and control become very
complex
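
The history-file idea can be sketched in software as an undo log. This is only an illustration of the bookkeeping, not of the hardware: the class and method names are made up, and it assumes WAW hazards are already prevented, so writes to any one register happen in program order.

```python
from typing import Dict, List, Tuple


class HistoryFile:
    """Register file plus a log of overwritten values (illustrative sketch)."""

    def __init__(self, regs: Dict[str, float]) -> None:
        self.regs = dict(regs)
        self.log: List[Tuple[int, str, float]] = []  # (program order, register, old value)

    def write(self, seq: int, reg: str, value: float) -> None:
        """Record the original value, then update the register."""
        self.log.append((seq, reg, self.regs[reg]))
        self.regs[reg] = value

    def rollback(self, faulting_seq: int) -> None:
        """Undo, newest first, every write made by the faulting or a later instruction."""
        for seq, reg, old in reversed(self.log):
            if seq >= faulting_seq:
                self.regs[reg] = old
        self.log = [entry for entry in self.log if entry[0] < faulting_seq]


# DIVF (seq 0) raises an exception after ADDF (seq 1) and SUBF (seq 2) have completed.
initial = {"F0": 1.0, "F10": 2.0, "F12": 3.0}
hf = HistoryFile(initial)
hf.write(1, "F10", 9.0)   # ADDF finishes first
hf.write(2, "F12", 8.0)   # SUBF finishes second
hf.rollback(0)            # DIVF faults: restore the state before it
print(hf.regs == initial)  # prints True
```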
Achieving Precise Exceptions (continued)

Approach 4: imprecise, but keep enough
state for OS to recover
Keep track of incomplete instructions
OS then runs those instructions before returning
control
Complicated to execute these instructions
properly!
