Pipeline Hazards: Structural Hazards: Resource Conflict
Structural Hazards: resource conflict
Example: same cache/memory for instruction
and data
Data Hazards: same data item being
accessed/written in nearby instructions
Example:
ADD R1, R2, R3
SUB R4, R1, R5
Control Hazards: branch instructions
Structural Hazards
Usually happen when a unit is not fully
pipelined
That unit cannot churn out one instruction per
cycle
Or, when a resource has not been duplicated
enough
Example: one cache serving as both I-cache and D-cache
Example: single write-port for register-file
Usual solution: stall
Also called pipeline bubble, or simply bubble
Stalling the Pipeline
      CC1   CC2   CC3   CC4   CC5   CC6   CC7   CC8   CC9   CC10
Load  IF    ID    EX    MEM   WB
I+1         IF    ID    EX    MEM   WB
I+2               IF    ID    EX    MEM   WB
I+3                     STALL IF    ID    EX    MEM   WB
I+4                                 IF    ID    EX    MEM   WB
What is the slowdown due to stalls caused by
such load instructions?
CPI without stalls = 1
CPI with stalls = 1 + F_load
Slowdown = 1 + F_load
(F_load: fraction of instructions that are loads)
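The relation CPI with stalls = 1 + F_load can be checked numerically. The following is a minimal sketch, assuming an ideal CPI of 1 and one stall cycle per load; the function names and the 30% load fraction are illustrative assumptions, not values from the slides.

```python
def cpi_with_load_stalls(f_load, ideal_cpi=1.0, stall_cycles=1):
    """CPI when every load costs `stall_cycles` extra cycles."""
    return ideal_cpi + f_load * stall_cycles


def slowdown(f_load, ideal_cpi=1.0):
    """Ratio of the stalled CPI to the ideal CPI."""
    return cpi_with_load_stalls(f_load, ideal_cpi) / ideal_cpi


# Hypothetical workload: 30% of instructions are loads.
print(slowdown(0.30))
```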
Why Allow Structural Hazards?
Lower Cost:
Less hardware ==> lower cost
Shorter latency of unpipelined unit
May have other performance benefits
Data hazards may introduce stalls anyway!
Suppose the FP unit is unpipelined, and the
other instructions have a 5-stage pipeline.
What percentage of instructions can be FP,
so that the CPI does not increase?
20% can be FP, assuming no clustering of FP instructions
Even if clustered, data hazards may introduce stalls anyway
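The 20% figure and the clustering caveat can be illustrated with a small issue model. This is a sketch only: it assumes a single-issue, in-order machine in which each FP operation occupies the unpipelined unit for 5 cycles, and the instruction streams are made up for illustration.

```python
def stall_cycles(stream, fp_busy=5):
    """Count structural stalls for an in-order, single-issue stream.

    `stream` is a string of 'F' (FP op) and 'I' (any other op).  An FP
    op occupies the unpipelined unit for `fp_busy` cycles, so a second
    FP op has to wait until the unit is free again.
    """
    cycle = 0          # cycle in which the next instruction would issue
    unit_free_at = 0   # first cycle in which the FP unit is free
    stalls = 0
    for op in stream:
        if op == "F":
            if cycle < unit_free_at:
                stalls += unit_free_at - cycle
                cycle = unit_free_at
            unit_free_at = cycle + fp_busy
        cycle += 1
    return stalls


evenly_spread = "FIIII" * 4        # 20% FP, one every fifth instruction
clustered = "FFFF" + "I" * 16      # also 20% FP, but back to back
print(stall_cycles(evenly_spread), stall_cycles(clustered))
```

With these two streams the evenly spread case incurs no stalls, while the clustered case stalls even though the FP fraction is the same.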
Data Hazards
Example:
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10,R1, R11
All instructions after ADD depend on R1
Stalling is a possibility
Can we do better?
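The dependence on R1 in the example can be found mechanically. The sketch below assumes the three-operand format used on the slide (destination first, then two sources); the helper names are hypothetical, and it does not handle a later instruction redefining the register.

```python
def parse(line):
    """Split 'ADD R1, R2, R3' into (opcode, destination, sources)."""
    opcode, operands = line.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return opcode, regs[0], regs[1:]


def raw_dependents(program, producer_index=0):
    """Opcodes of later instructions that read the producer's destination."""
    _, dest, _ = parse(program[producer_index])
    return [parse(line)[0]
            for line in program[producer_index + 1:]
            if dest in parse(line)[2]]


program = [
    "ADD R1, R2, R3",
    "SUB R4, R1, R5",
    "AND R6, R1, R7",
    "OR R8, R1, R9",
    "XOR R10,R1, R11",
]
print(raw_dependents(program))
```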
Register File: Reads after Writes
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for SW 12(R1), R4]
DLX has one delay-slot
Note: another branch instruction cannot be
put in delay-slot
Compiler has to fill the delay-slots
Filling the Delay-Slot:
Option 1 of 3
Before                   After
ADD R1, R2, R3
if (R2 == 0) then        if (R2 == 0) then
  Delay slot               ADD R1, R2, R3
Fill the slot from before the branch instruction
Restriction: branch must not depend on result of the
filled instruction
Improves performance: always
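The restriction for filling from before the branch reduces to a single register check. The following sketch uses a hypothetical helper and models the branch condition as a set of source registers; it is one way a compiler pass might express the test, not a description of any particular compiler.

```python
def can_fill_from_before(candidate_dest, branch_sources):
    """True if the candidate may move from before the branch into the slot.

    The only check in this sketch is that the branch condition does not
    read the register that the candidate instruction writes.
    """
    return candidate_dest not in branch_sources


# ADD R1, R2, R3 followed by a branch on R2: ADD can move into the slot.
print(can_fill_from_before("R1", {"R2"}))
# ADD R1, R2, R3 followed by a branch on R1: ADD must stay before the branch.
print(can_fill_from_before("R1", {"R1"}))
```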
Filling the Delay-Slot:
Option 2 of 3
SUB R4, R5, R6
Fill the slot from the target of the branch instruction
Restriction: should be OK to execute instruction
even if branch is not taken
Improves performance: when branch is taken
Filling the Delay-Slot:
Option 3 of 3
Before                   After
ADD R1, R2, R3           ADD R1, R2, R3
if (R1 == 0) then        if (R1 == 0) then
  Delay slot               SUB R4, R5, R6
SUB R4, R5, R6
Fill the slot from fall through of the branch
Restriction: should be OK to execute instruction
even if branch is taken
Improves performance: when branch is not taken
Helping the Compiler
Encode the compiler prediction in the branch
instruction
CPU knows whether branch was predicted taken
or not taken by compiler
Cancel or nullify if prediction incorrect
Known as canceling or nullifying branch
Options 2 and 3 can now be used without restrictions
Static Branch Prediction
Predict-taken
Predict-untaken
Prediction based on direction
(forward/backward)
Profile-based prediction
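The first three schemes can be compared on a branch trace. This is a sketch under stated assumptions: direction-based prediction is taken to mean "backward taken, forward not taken", and the trace (addresses and outcomes) is invented for illustration rather than measured.

```python
def predict(strategy, pc, target):
    """Static prediction for a branch at `pc` jumping to `target`."""
    if strategy == "taken":
        return True
    if strategy == "untaken":
        return False
    if strategy == "direction":
        return target < pc   # assumed rule: backward branches are taken
    raise ValueError(strategy)


def misprediction_rate(strategy, trace):
    """Fraction of (pc, target, taken) records that are predicted wrongly."""
    wrong = sum(predict(strategy, pc, target) != taken
                for pc, target, taken in trace)
    return wrong / len(trace)


# Hypothetical trace: a loop-closing backward branch taken 9 times out of 10
# and a forward branch taken 2 times out of 10.
trace = ([(0x40, 0x20, True)] * 9 + [(0x40, 0x20, False)]
         + [(0x30, 0x50, True)] * 2 + [(0x30, 0x50, False)] * 8)

for strategy in ("taken", "untaken", "direction"):
    print(strategy, misprediction_rate(strategy, trace))
```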
Static Misprediction Rates
[Figure: bar chart of misprediction rate (y-axis, 0% to 22.5%) by benchmark (x-axis): Compress, Eqntott, Espresso, Gcc, Li, Doduc, Ear, Hydro2d, Mdljdp, Su2cor]
Some Remarks
Delayed branches are architecturally visible
Strength as well as weakness
Advantage: better performance
Disadvantage: what if implementation changes?
Deeper pipeline ==> more branch delays ==>
delay-slots may no longer be useful
More powerful dynamic branch prediction
Note: need to remember extra PC while
taking exceptions/interrupts
Slowdown due to mispredictions:
1 + Branch frequency × Misprediction rate × Penalty
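As a quick numeric check of the misprediction slowdown (1 plus the product of branch frequency, misprediction rate, and penalty), here is a minimal sketch. The input values are hypothetical, and an ideal CPI of 1 is assumed.

```python
def misprediction_slowdown(branch_freq, mispredict_rate, penalty_cycles):
    """Slowdown relative to an ideal CPI of 1."""
    return 1 + branch_freq * mispredict_rate * penalty_cycles


# Hypothetical: 20% branches, 10% mispredicted, 3-cycle penalty.
print(misprediction_slowdown(0.20, 0.10, 3))
```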
Exceptions and Pipelining
What are exceptions?
I/O interrupt
System call
Tracing instruction execution, breakpoint
Integer/FP anomaly
Page fault
Misaligned memory access
Memory protection violation
Undefined instruction
Hardware malfunction/Power failure
Also called interrupts or faults
Exceptions: The Nemesis of
Pipelining
While taking exceptions, ensure that machine
is in a "consistent" state
Exceptions can occur:
In many pipeline stages
Out of order
     CC1  CC2  CC3  CC4  CC5  CC6
LW   IF   ID   EX   MEM  WB
ADD       IF   ID   EX   MEM  WB
Classification of Exceptions
Synchronous vs. Asynchronous
Asynchronous usually caused by devices
external to the processor
Asynchronous ==> can be handled after current
instruction (easier)
User requested vs. Coerced
User requested ==> can be handled after current
instruction
Coerced ==> unpredictable
User maskable vs. Non-maskable
Classification of Exceptions
(continued)
Within vs. Between instructions
Within ==> instruction cannot be completed,
usually synchronous (harder)
Resume vs. Terminate
Terminate process ==> easier
Exception Classification
Exception type            Synchronous?  Coerced?  Maskable?  Within instn.?  Resume?
I/O request               No            Yes       No         No              Yes
Sys. call                 Yes           No        No         No              Yes
Tracing/Brk.pt.           Yes           No        Yes        No              Yes
ALU excpn.                Yes           Yes       Yes        Yes             Yes
Page fault                Yes           Yes       No         Yes             Yes
Misaligned mem. access    Yes           Yes       Yes        Yes             Yes
Undefined instns.         Yes           Yes       No         Yes             No
H/w malfn./power failure  No            Yes       No         Yes             No
Restarting Execution
Restartable: take exception, save state,
restart without affecting execution
Restarting
Force a trap instruction into pipeline
Until trap, disable all writes for faulting instruction
and all subsequent ones
Trap into exception handling routine (OS)
Need to save more than one PC for delayed
branches
Precise Exceptions: all instructions before the
faulting one complete, and no later instruction does
Exceptions in DLX
     CC1  CC2  CC3  CC4  CC5  CC6
LW   IF   ID   EX   MEM  WB
ADD       IF   ID   EX   MEM  WB
Exceptions can occur:
In same cycle, or even out-of-order
Cannot handle an exception at the moment
it occurs
Carry an instruction status in the pipeline latches
In WB stage, exception corresponding to earliest
instruction will be handled
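The effect of deferring exceptions to WB can be shown with the LW/ADD pair. In this sketch the specific faults (LW faulting in MEM, ADD faulting in IF) are assumed for illustration, and the function names are hypothetical.

```python
# An instruction fetched in cycle c reaches stage number s (0-based)
# in cycle c + s.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def raise_order(faults):
    """Order in which the faults are raised in time."""
    def raised_at(fault):
        _, fetch_cycle, stage = fault
        return fetch_cycle + STAGES.index(stage)
    return [name for name, _, _ in sorted(faults, key=raised_at)]


def handling_order(faults):
    """Order in which a WB-stage check would handle the faults.

    Each fault is only recorded in the instruction's status field and is
    acted on when that instruction reaches WB, so handling follows
    program order instead of the order in which faults were raised.
    """
    def wb_at(fault):
        _, fetch_cycle, _ = fault
        return fetch_cycle + STAGES.index("WB")
    return [name for name, _, _ in sorted(faults, key=wb_at)]


# LW is fetched in CC1 and faults in MEM (CC4);
# ADD is fetched in CC2 and faults in IF (CC2).
faults = [("LW", 1, "MEM"), ("ADD", 2, "IF")]
print(raise_order(faults))     # ADD's fault is raised first
print(handling_order(faults))  # LW's fault is handled first
```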
More Complications in
Pipelining
Multiple write stages
Or, changing processor state in the middle of
an instruction
E.g., Auto-increment addressing mode in VAX
Updating memory state during instruction
E.g., String copy instruction in VAX
More Complications in
Pipelining (continued)
Implicitly set condition codes
Problems in scheduling the delay slot, and
during exceptions
Self-modifying code in 80x86!
Multi-cycle operations
MOVL R1, R2
ADDL3 42(R1), 56(R1)+, @(R1)
SUBL2 R2, R3
MOVC3 @(R1)(R2), 74(R2), R3
Data hazards very complicated to determine!
VAX pipelines micro-instructions
Pipelining Multi-cycle Opns.
Some operations take > 1 cycle (e.g. FP)
Handling multi-cycle opns. in the pipeline:
Multiple EX stages
Multiple functional units
An example:
          EX1
          EX2
IF   ID         MEM  WB
          EX3
          EX4
EX1: Main ALU
EX2: FP, Int mult.
EX3: FP add, sub
EX4: FP, Int div
Pipelining Multi-cycle Opns.
(continued)
Two things to consider:
Different units may take different # cycles
Some units may not be pipelined
Corresponding definitions:
Latency: # cycles between an instn. & another
which can use its result
Initiation/repeat interval: # cycles between issue
of two operations of the same type
The Multi-cycle Pipeline
          EX
          M1 M2 M3 M4 M5 M6 M7
IF   ID                         MEM  WB
          A1 A2 A3 A4
          DIVIDE
Additional details:
We require more latches
ID/EX register must be expanded
More Hazards!
Structural hazards:
Divide unit is not pipelined
Multiple writes possible in the same cycle
Data hazards:
RAW is more frequent
WAW is possible
Control hazards:
Out-of-order completion ==> difficulty in handling
exceptions
Multiple Writes/Cycle: An
Example
                  CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9  CC10 CC11
MULTD F0, F4, F6  IF   ID   M1   M2   M3   M4   M5   M6   M7   MEM  WB
...                    IF   ID   EX   MEM  WB
...                         IF   ID   EX   MEM  WB
ADDD F2, F4, F6                  IF   ID   A1   A2   A3   A4   MEM  WB
...                                   IF   ID   EX   MEM  WB
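The write-port collision in this example can be found by computing each instruction's WB cycle. The sketch assumes every instruction passes through IF, ID, some number of execute stages, MEM, and WB with no stalls; the names and the tuple layout are illustrative.

```python
from collections import defaultdict


def wb_cycle(fetch_cycle, ex_stages):
    """Cycle of WB given IF in `fetch_cycle`, then ID, EX stages, MEM, WB."""
    return fetch_cycle + 1 + ex_stages + 1 + 1


def write_port_conflicts(instructions):
    """Map each WB cycle that has more than one writer to the writers."""
    by_cycle = defaultdict(list)
    for name, fetch_cycle, ex_stages in instructions:
        by_cycle[wb_cycle(fetch_cycle, ex_stages)].append(name)
    return {cycle: names for cycle, names in by_cycle.items()
            if len(names) > 1}


# (name, cycle of IF, number of execute stages), as in the diagram.
instructions = [
    ("MULTD F0, F4, F6", 1, 7),   # M1..M7
    ("other 1", 2, 1),
    ("other 2", 3, 1),
    ("ADDD F2, F4, F6", 4, 4),    # A1..A4
    ("other 3", 5, 1),
]
print(write_port_conflicts(instructions))
```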
Multiple Writes/Cycle: Solution
Provide multiple write-ports
Or, detect and stall; Two possibilities:
Detect in ID stage
Instructions reserve the write port using a reservation
register
Reservation register is shifted one bit each clock
Detect in MEM stage
Easier to check
Can also give priority to longer cycle operation
But, stall can now be in two places
Stall may trickle back
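The ID-stage option can be sketched as a shift register of reservation bits. The class below is a hypothetical model, not a description of any specific hardware: bit i set means the write port is already claimed i cycles from now.

```python
class WritePortReservation:
    """Shift register in which bit i marks the write port as taken in i cycles."""

    def __init__(self):
        self.bits = 0

    def tick(self):
        """Advance one clock: every reservation moves one cycle closer."""
        self.bits >>= 1

    def try_reserve(self, cycles_until_wb):
        """Claim the port if it is free; return False to signal a stall."""
        mask = 1 << cycles_until_wb
        if self.bits & mask:
            return False
        self.bits |= mask
        return True


port = WritePortReservation()
# MULTD is in ID in CC2 and would write back in CC11, 9 cycles later.
print(port.try_reserve(9))
for _ in range(3):          # advance to CC5
    port.tick()
# ADDD is in ID in CC5 and would also write back in CC11, 6 cycles later.
print(port.try_reserve(6))  # port already taken: ADDD stalls in ID
port.tick()                 # CC6
print(port.try_reserve(6))  # ADDD now writes back in CC12
```

Because the check happens in ID, the stall is inserted at the same place as other interlocks, at the cost of the extra register and the logic to compute each instruction's distance to WB.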
Data Hazards
                  CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9  CC10 CC11 CC12
LD F4, 0(R2)      IF   ID   EX   MEM  WB
MULTD F0, F4, F6       IF   ID   STL  M1   M2   M3   M4   M5   M6   M7   MEM
                  CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9  CC10 CC11
MULTD F0, F4, F6  IF   ID   M1   M2   M3   M4   M5   M6   M7   MEM  WB
...                    IF   ID   EX   MEM  WB
...                         IF   ID   EX   MEM  WB
ADDD F2, F4, F6                  IF   ID   A1   A2   A3   A4   MEM  WB
...                                   IF   ID   EX   MEM  WB
LD F2, 0(R2)                               IF   ID   EX   MEM  WB
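In this last diagram LD writes F2 in CC10, one cycle before the earlier ADDD writes F2 in CC11, which is a WAW hazard. The following sketch detects such pairs; it assumes no stalls and the same stage layout as the diagram, and the helper names are hypothetical.

```python
def wb_cycle(fetch_cycle, ex_stages):
    """Cycle of WB given IF in `fetch_cycle`, then ID, EX stages, MEM, WB."""
    return fetch_cycle + ex_stages + 3


def waw_hazards(instructions):
    """Pairs (earlier, later) where the later instruction writes the same
    register no later than the earlier one.

    `instructions` is a list of (name, dest, fetch_cycle, ex_stages)
    tuples in program order.
    """
    hazards = []
    for i, (name1, dest1, fetch1, ex1) in enumerate(instructions):
        for name2, dest2, fetch2, ex2 in instructions[i + 1:]:
            if dest1 == dest2 and wb_cycle(fetch2, ex2) <= wb_cycle(fetch1, ex1):
                hazards.append((name1, name2))
    return hazards


instructions = [
    ("MULTD", "F0", 1, 7),
    ("ADDD", "F2", 4, 4),
    ("LD", "F2", 6, 1),
]
print(waw_hazards(instructions))
```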