04 Pipeline

Computer Architectures
Pipelining
Giorgio Richelli
Computer Architecture
Instruction Execution
• 5 basic steps
1. fetch instruction (F)
I-Fetch
2. decode instruction/read registers (R)
3. execute (X)
4. access memory (M) Decode
5. store result (W)
Execute
Memory
Write
Result
Why Pipeline?
Time (clock cycles)
ALU
I
n Inst 0 Im Reg Dm Reg
ALU
t Inst 1 Im Reg Dm Reg
r.
ALU
O Inst 2 Im Reg Dm Reg
r
d
Inst 3
ALU
Im Reg Dm Reg
e
r
Inst 4
ALU
Im Reg Dm Reg
Slide courtesy of D. Patterson

Why Pipeline?
• Suppose we execute 100 instructions

• Single Cycle Machine
– 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
• Multicycle Machine
– 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
• Ideal pipelined machine
– 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
Slide courtesy of D. Patterson

Benefits of Pipelining
• Before pipelining:
– Throughput: 1 instruction per cycle
tclk = t F + t R + t X + t M + tW
– (or lower cycle time and CPI=5)
• After pipelining (multiple instructions in pipe at one time)

– Throughput: 1 instruction per cycle
tclk = max(t F , t R , t X , t M , tW ) + tlatch
Visualizing Pipelining
Time (clock cycles)
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7

I
ALU
n Ifetch Reg DMem Reg
s
t
r.
ALU
Ifetch Reg DMem Reg
O
r
ALU
Ifetch Reg DMem Reg
d
e
r
ALU
Ifetch Reg DMem Reg
Pipeline Hazards
• Hazards prevent next instruction from executing

during its designated clock cycle
– Structural hazards: HW cannot support this combination of
instructions (single person to fold and put clothes away)
– Data hazards: Instruction depends on result of prior
instruction still in the pipeline (missing sock)
– Control hazards: Caused by delay between the fetching of
instructions and decisions about changes in control flow
(branches and jumps).
Pipeline Hazards
• Data hazards
– an instruction uses the result of a previous instruction
ADD R1, R2, R3 or SW R1, 3(R2)
ADD R4, R1, R5 LW R3, 3(R2)
• Control hazards
– the location of an instruction depends on a previous instruction
JMP R25
…
LOOP: ADD R1, R2, R3
• Structural hazards
– two instructions need access to the same resource
• e.g., single unit shared for instruction fetch and load/store
• collision in reservation table
Data Hazards: RAW
• Read After Write (RAW)

InstrJ tries to read operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
• Caused by a “Dependence” (in compiler nomenclature).

• This hazard results from an actual need for
communication between pipeline stages.
Data Hazards: WAR
• Write After Read (WAR)

InstrJ writes operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers.

This results from reuse of the name “r1”.
• Can’t happen in the sample 5-stage pipeline because:

– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
Data Hazards: WAW
• Write After Write (WAW)

InstrJ writes operand before InstrI writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “output dependence” by compiler writers

This also results from the reuse of name “r1”.
Data Hazards (RAW)
Cycle
F R X M W
Write Data to R1 Here

Instruction
F R X M W
Read from R1 Here

ADD R1, R2, R3
ADD R4, R1, R5
Types of Data Hazards
• RAW (read after write)

– only hazard for ‘fixed’ F R A M W
pipelines
– later instruction must read F R A M W
after earlier instruction
writes
• WAW (write after write)
F R 1 2 3 4 W
– variable-length pipeline (e.g.
FP/int) F R A M W
– later instruction must write
after earlier instruction
writes
• WAR (write after read) F R 1 2 3 4 R 5 W
– pipelines with late read F R A M W
– later instruction must write
after earlier instruction reads
Datapath vs Control
Datapath Controller
Regs signals
A
L
U Control Points
• Datapath: Storage, FU, interconnect sufficient to perform the desired functions

– Inputs are Control Points
– Outputs are signals
• Controller: State machine to orchestrate operation on the data path
– Based on desired function and signals
Control Hazards
Cycle
F R X M W
Destination Available Here

Instruction
F R X M W
Need Destination Here

JR R25
...
XX: ADD ...
Control Hazards
Time (clock cycles)

I Load Ifetch
ALU
Reg DMem Reg
n
s
ALU
Ifetch Reg DMem Reg
Instr 1
t
r.
ALU
Ifetch Reg DMem Reg
Instr 2
O
r
ALU
Ifetch Reg DMem Reg
Instr 3
d
e
ALU
Ifetch Reg DMem Reg
r Instr 4
Resolving Hazards: Pipeline Stalls
• Can resolve any type of hazard

– data, control, or structural
• Detect the hazard
• Freeze the pipeline up to the dependent stage
until the hazard is resolved
Resolving Hazards
Time (clock cycles)

I Load Ifetch
ALU
Reg DMem Reg
n
s Instr 1
ALU
Ifetch Reg DMem Reg
t
r.
ALU
Ifetch Reg DMem Reg
Instr 2
O
Stall Bubble Bubble Bubble Bubble Bubble
r
d
e
ALU
Ifetch Reg DMem Reg
Instr 3
r
Example Pipeline Stall (Diagram)
Cycle
F R X M W
Write Data to R1 Here

Instruction
F Bubble R X M W
Read from R1 Here
ADD R1, R2, R3

ADD R4, R1, R5
Forwarding to Avoid Data Hazard
Time (clock cycles)

I
n
ALU
add r1,r2,r3 Ifetch Reg DMem Reg
s
t
r.
ALU
sub r4,r1,r3 Ifetch Reg DMem Reg
O
r
ALU
Ifetch Reg DMem Reg
d and r6,r1,r7
e
r
ALU
Ifetch Reg DMem Reg
or r8,r1,r9
ALU
Ifetch Reg DMem Reg
xor r10,r1,r11
Speedup Equation for Pipelining
CPIpipelined = Ideal CPI + Average Stall cycles per Inst
Ideal CPI × Pipeline depth Cycle Timeunpipelined

Speedup = ×
Ideal CPI + Pipeline stall CPI Cycle Timepipelined
For simple RISC pipeline, Ideal CPI = 1:
Pipeline depth Cycle Timeunpipelined

Speedup = ×
1 + Pipeline stall CPI Cycle Timepipelined
Example: Dual-port vs. Single-port
• Machine A: Dual-ported memory (“Harvard Architecture”)

• Machine B: Single-ported memory, but its pipelined
implementation has a 1.05 times faster clock rate
• Ideal CPI = 1 for both
• Loads are 40% of instructions executed
SpeedUpA = Pipeline Depth/(1 + 0) x (clockunpipe/clockunpipe)
= Pipeline Depth
SpeedUpB = Pipeline Depth/(1 + 0.4 x 1) x (clockunpipe/(clockunpipe / 1.05)
= (Pipeline Depth/1.4) x 1.05
= 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33
• Machine A is 1.33 times faster

Software Scheduling
to Avoid Load Hazards
Producing fast code for

a = b + c;
d = e – f;
assuming a, b, c, d ,e, and f in memory.
Faster code:
Slow code:
LW Rb,b
LW Rb,b LW Rc,c
LW Rc,c LW Re,e
ADD Ra,Rb,Rc ADD Ra,Rb,Rc
SW a,Ra LW Rf,f
LW Re,e SW a,Ra
LW Rf,f SUB Rd,Re,Rf
SUB Rd,Re,Rf SW d,Rd
SW d,Rd
Branch Stall Impact
• If CPI = 1, 30% branch,

Stall 3 cycles => new CPI = 1.9
• Solution:
– Determine branch taken or not sooner, AND
– Compute taken branch address earlier
Branch Hazard Alternatives
Stall until branch direction is clear
Predict Branch (Taken or Not Taken)

– Execute successor instructions in sequence (if not
taken) and “squash” instructions in pipeline if branch
mispredicted
– Must have already calculated branch target address
Branch Hazard Alternatives
Delayed Branch
– Define branch to take place AFTER a following instruction
branch instruction
sequential successor1
sequential successor2 Branch delay of length n
........
sequential successorn
branch target if taken
– MIPS, SPARC
– Experience has shown it has issues:
• Makes code difficult to maintain and debug
• Compiler has to find an instruction which is safe to
execute regarless if the branch is taken or not
• If the hardware changes, old programs may not work at all
Scheduling Branch Delay Slots
A. From before branch B. From branch target C. From fall through

add $1,$2,$3 sub $4,$5,$6 add $1,$2,$3
if $2=0 then if $1=0 then
delay slot delay slot
add $1,$2,$3
sub $4,$5,$6
if $1=0 then
delay slot
becomes becomes becomes

add $1,$2,$3
if $2=0 then if $1=0 then
add $1,$2,$3 sub $4,$5,$6
add $1,$2,$3
if $1=0 then
sub $4,$5,$6
• A is the best choice, fills delay slot & reduces instruction count (IC)
• In B, the sub instruction may need to be copied, increasing IC
• In B and C, must be okay to execute sub when branch fails
Delayed Branch
• Compiler effectiveness for single branch delay slot:

– Fills about 60% of branch delay slots
– About 80% of instructions executed in branch delay slots
useful in computation
– About 50% (60% x 80%) of slots usefully filled
• Delayed branching has lost popularity compared to
more expensive but more flexible dynamic
approaches
• Growth in available transistors has made dynamic
approaches relatively cheaper
Exceptions and Interrupts
• Exception: An unusual event happens to an

instruction during its execution
– Examples: divide by zero, undefined opcode
• Interrupt: Hardware signal to switch processor to

new instruction stream
– Example: sound card interrupts when it needs more audio
output samples (audio “click” happens if it is left waiting)
Exceptions and Interrupts
• Exception or interrupt must appear to happen

between 2 instructions (Ii and Ii+1)
• Interrupt (exception) handler either aborts
program or restarts at instruction Ii+1
Precise Exceptions (Sequential Processor)
• When interrupt occurs, state of interrupted process is saved,

including PC, registers, and memory
• Interrupt is precise if the following three conditions hold
– All instructions preceding u have been executed, and have modified the
state correctly
– All instructions following u are unexecuted, and have not modified the
state
– If the interrupt was caused by an instruction, it was caused by
instruction u, which is either completely executed (e.g.: overflow) or
completely unexecuted (e.g: VM page fault)
• Precise interrupts are desirable if software is to fix up error that
caused interrupt and execution has to be resumed
– Easy for external interrupts, could be complex and costly for internal
– Imperative for some interrupts (VM page faults, IEEE FP standard)

04 Pipeline

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

04 Pipeline

Uploaded by

Copyright:

Available Formats

Computer Architectures

Slide courtesy of D. Patterson

• Suppose we execute 100 instructions

Slide courtesy of D. Patterson

• After pipelining (multiple instructions in pipe at one time)

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7

• Hazards prevent next instruction from executing

• Read After Write (RAW)

• Caused by a “Dependence” (in compiler nomenclature).

• Write After Read (WAR)

• Called an “anti-dependence” by compiler writers.

• Can’t happen in the sample 5-stage pipeline because:

• Write After Write (WAW)

• Called an “output dependence” by compiler writers

Write Data to R1 Here

Read from R1 Here

• RAW (read after write)

• Datapath: Storage, FU, interconnect sufficient to perform the desired functions

Destination Available Here

Need Destination Here

Time (clock cycles)

• Can resolve any type of hazard

Time (clock cycles)

Write Data to R1 Here

Read from R1 Here

ADD R1, R2, R3

Time (clock cycles)

CPIpipelined = Ideal CPI + Average Stall cycles per Inst

Ideal CPI × Pipeline depth Cycle Timeunpipelined

For simple RISC pipeline, Ideal CPI = 1:

Pipeline depth Cycle Timeunpipelined

• Machine A: Dual-ported memory (“Harvard Architecture”)

• Machine A is 1.33 times faster

Producing fast code for

• If CPI = 1, 30% branch,

Stall until branch direction is clear

Predict Branch (Taken or Not Taken)

A. From before branch B. From branch target C. From fall through

becomes becomes becomes

• Compiler effectiveness for single branch delay slot:

• Exception: An unusual event happens to an

• Interrupt: Hardware signal to switch processor to

• Exception or interrupt must appear to happen

• When interrupt occurs, state of interrupted process is saved,

You might also like