Introduction to Pipelining
Pipelining
Pipelining: Daily Works!!!
Laundry Example
• A, B, C, D each have one load of
clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• Folder takes 20 minutes
Sequential Laundry
[Figure: timeline from 6 PM to midnight. Loads A, B, C, D run one after another in task order; each load takes 30 + 40 + 20 = 90 minutes.]
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?
Pipelined Laundry
[Figure: timeline from 6 PM to midnight. Loads A, B, C, D overlap in task order; the intervals along the time axis are 30, 40, 40, 40, 40, 20 minutes.]
Pipelined laundry takes 3.5 hours for 4 loads
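Both totals can be checked with a short calculation. The following is a minimal sketch, not part of the original slides: the function names are illustrative, and the simulation assumes each stage handles one load at a time and that a finished load may wait for the next stage to become free.

```python
# Stage times in minutes, taken from the laundry example.
WASH, DRY, FOLD = 30, 40, 20
STAGES = [WASH, DRY, FOLD]

def sequential_minutes(loads, stages=STAGES):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stages)

def pipelined_minutes(loads, stages=STAGES):
    """Loads flow through the stages; a stage serves one load at a time."""
    stage_free = [0] * len(stages)   # time at which each stage becomes free
    finish = 0
    for _ in range(loads):
        t = 0                        # every load is ready at time 0
        for i, duration in enumerate(stages):
            start = max(t, stage_free[i])   # wait until the stage is free
            t = start + duration
            stage_free[i] = t
        finish = t
    return finish

print(sequential_minutes(4) / 60)  # 6.0 hours
print(pipelined_minutes(4) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 × 40 + 20 = 210 minutes: the dryer, the slowest stage, sets the pace once the pipeline is full.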
Pipelining Lessons
• Pipelining doesn’t help latency of a single task; it
helps throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = Number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce
speedup
[Figure: pipelined laundry timeline for loads A, B, C, D with the "Filling" and "Draining" phases marked.]
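The effect of fill and drain time on speedup can be sketched for an idealized pipeline. The function below is an illustration under stated assumptions: all stages take exactly one time unit (perfectly balanced) and there are no stalls. The name `pipeline_speedup` is hypothetical.

```python
def pipeline_speedup(num_stages, num_tasks):
    """Speedup of a balanced, stall-free pipeline over unpipelined execution."""
    unpipelined = num_stages * num_tasks      # every task uses all stages alone
    pipelined = num_stages + num_tasks - 1    # fill once, then one task per unit
    return unpipelined / pipelined

for n in (1, 4, 100, 10000):
    print(n, round(pipeline_speedup(5, n), 2))
```

With a single task there is no speedup at all (latency is unchanged), and the speedup approaches the number of stages only as the number of tasks grows, because the fill and drain time is amortized over more work.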
What is pipelining?
• Arithmetic Pipeline
– The different stages of an arithmetic operation are
handled by successive stages of the pipeline.
Instruction Pipeline
Contd…
• The processing of an instruction need not be divided into only two steps. To gain
further speedup, the pipeline must have more stages.
• Consider the following decomposition of instruction
execution:
- Fetch Instruction (FI): Read the next expected instruction into a
buffer.
- Decode Instruction (DI): Determine the opcode and the operand
specifiers.
- Calculate Operands (CO): Calculate the effective address of each
source operand.
- Fetch Operands (FO): Fetch each operand from memory.
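How the listed stages overlap across consecutive instructions can be shown as a space-time chart. This is a minimal sketch, assuming every stage takes one cycle and no instruction stalls; the helper name `space_time_diagram` is illustrative.

```python
def space_time_diagram(stages, num_instructions):
    """Return one row per instruction; column c is the stage occupied in cycle c."""
    total_cycles = len(stages) + num_instructions - 1
    rows = []
    for i in range(num_instructions):
        row = ["--"] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name        # instruction i is in stage s during cycle i + s
        rows.append(row)
    return rows

for i, row in enumerate(space_time_diagram(["FI", "DI", "CO", "FO"], 3), start=1):
    print(f"I{i}: " + " ".join(row))
# I1: FI DI CO FO -- --
# I2: -- FI DI CO FO --
# I3: -- -- FI DI CO FO
```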
[Figure: instruction fetch datapath with PC, +4 adder, NPC, instruction memory, and IR.]
5 Steps of DLX Instr. Execution:
Step 2
Step 2: Instruction decode/register fetch cycle (ID)
– Read source registers into A and B
A ← Regs[IR[6..10]]
B ← Regs[IR[11..15]]
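The register fetch in Step 2 amounts to slicing two 5-bit fields out of the instruction register and using them as register-file indices. The sketch below assumes a 32-bit instruction word in which bit 0 is the most significant bit (the numbering DLX is usually presented with); the helper names `ir_field` and `read_sources` are illustrative.

```python
def ir_field(ir, first, last):
    """Extract bits first..last of a 32-bit word, counting bit 0 as the MSB."""
    width = last - first + 1
    return (ir >> (31 - last)) & ((1 << width) - 1)

def read_sources(ir, regs):
    """Step 2: A <- Regs[IR 6..10], B <- Regs[IR 11..15]."""
    a = regs[ir_field(ir, 6, 10)]
    b = regs[ir_field(ir, 11, 15)]
    return a, b

regs = [10 * n for n in range(32)]   # made-up register contents: Rn holds 10*n
ir = (3 << 21) | (7 << 16)           # source fields set to R3 and R7
print(read_sources(ir, regs))        # (30, 70)
```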
[Figure: execution-stage datapath with NPC, A, B, Imm, two MUXes, the ALU with its OP input, and ALUOut.]
5 Steps of DLX Instr. Execution:
Step 4
Step 4: Memory access/branch completion cycle (MEM):
– Memory reference: Access memory, either
• for LD: LMD ← Mem[ALUOutput] or
• for ST: Mem[ALUOutput] ← B
– Branch: if (Cond) PC ← ALUOutput
else PC ← NPC;
[Figure: MEM-stage datapath with MUX, PC, ALUOut, Cond, data memory, LMD, and B.]
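The MEM step can be sketched as a single function. This is an illustration, not the slides' own code: memory is modeled as a dictionary, the opcode strings "LD", "ST", and "BR" are hypothetical labels, and the branch case assumes the usual DLX rule that a taken branch loads PC from ALUOutput while every other instruction continues at NPC.

```python
def mem_step(op, alu_output, b, memory, cond, npc):
    """Return (LMD, next PC) for one instruction in the MEM cycle."""
    lmd = None
    next_pc = npc                    # default: PC <- NPC
    if op == "LD":
        lmd = memory[alu_output]     # LMD <- Mem[ALUOutput]
    elif op == "ST":
        memory[alu_output] = b       # Mem[ALUOutput] <- B
    elif op == "BR" and cond:
        next_pc = alu_output         # taken branch: PC <- ALUOutput
    return lmd, next_pc

memory = {100: 42}
print(mem_step("LD", 100, 0, memory, False, 8))   # (42, 8)
print(mem_step("BR", 200, 0, memory, True, 8))    # (None, 200)
```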
5 Steps of DLX Instr. Execution:
Step 5
Step 5: Write-back cycle (WB):
– Reg-Reg ALU: Store the result into the destination register
Regs[IR[16..20]] ← ALUOutput;
[Figure: WB-stage datapath with LMD, ALUOut, MUX, OP, and the register file.]
5 Steps of DLX Datapath
[Figure: complete DLX datapath across the IF, ID, EX, MEM, and WB stages, showing PC, +4 adder, memory, MUXes, Zero? test, ALU, SMD, and a 16-to-32-bit sign extender.]
A Simple Implementation
• A multi-cycle implementation
– needs temporary registers: NPC, IR, A, B, Imm,
Cond, ALUOutput, LMD
– leaves room for CPI improvements
• A single-cycle implementation
– one long clock cycle
– very inefficient for most machines, which have a
reasonable variation in the amount of work needed by
different instructions
– requires duplicating functional units (FUs) that could be shared in
a multi-cycle implementation
Visualizing Pipeline
[Figure: pipeline chart over clock cycles CC1 to CC9. Successive instructions, listed in instruction order, each pass through IM, Reg, ALU, DM, Reg; the first cycles are labeled "Filling".]
Saving Information Produced
by Each Stage of Pipeline
• Information needs to be stored at the end of a clock cycle;
otherwise it will be lost
• Each pipeline stage produces information (data, address, and
control) at the end of the clock cycle
• Thus, we need storage (called an inter-stage buffer) at the end of
each pipeline stage
Inter-Stage Buffer
in DLX Pipeline
• F/D Buffer
– IR, NPC
• D/EX Buffer
– A, B, Imm, b (destination register address for storing the result),
OP (opcode), cond
– NPC
• EX/M Buffer
– ALUout (arithmetic result or effective address)
– NPC, cond, b, OP
• M/W Buffer
– LMD (data for LD)
– ALUout (arithmetic result), b, OP
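One way to picture the buffers is as small records that each stage fills for the next one. The sketch below is illustrative only: the class and function names are assumptions, the destination register address ("b" in the list above) is called `dest` to avoid clashing with operand B, and only an ADD-style register-register instruction is modeled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DEXBuffer:            # written by ID, read by EX
    a: int
    b: int
    imm: int
    dest: int               # "b" in the slide: destination register address
    op: str
    cond: bool
    npc: int

@dataclass
class EXMBuffer:            # written by EX, read by MEM
    alu_out: int
    npc: int
    cond: bool
    dest: int
    op: str

@dataclass
class MWBuffer:             # written by MEM, read by WB
    lmd: Optional[int]
    alu_out: int
    dest: int
    op: str

def ex_add(dex):
    """EX for an ADD-style op: compute the result, pass control fields along."""
    return EXMBuffer(dex.a + dex.b, dex.npc, dex.cond, dex.dest, dex.op)

def mem_pass_through(exm):
    """MEM for a non-memory op: no load happens, so LMD stays empty."""
    return MWBuffer(None, exm.alu_out, exm.dest, exm.op)

dex = DEXBuffer(a=2, b=3, imm=0, dest=5, op="ADD", cond=False, npc=8)
print(mem_pass_through(ex_add(dex)))
```

The destination address and opcode are copied from buffer to buffer because the write-back stage still needs them several cycles after decode, by which time IR already holds a later instruction.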
Pipelined DLX Datapath
- Multicycle -
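[Figure: pipelined DLX datapath across the IF, ID, EX, MEM, and WB stages, with the F/D, D/EX, EX/M, and M/W buffers placed between stages; also shown are PC, +4 adder, instruction memory, register file, MUXes, Zero? test, ALU, data memory, LMD, SMD, and a 16-to-32-bit sign extender.]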
Basic Performance Issues
• Pipelining increases the CPU instruction throughput.
• An increase in throughput means that a program runs faster and has a lower
execution time.
• Imbalance among the pipe stages reduces performance, since the clock can
run no faster than the time needed for the slowest pipeline stage.
• Pipeline overhead arises from pipeline register delay and clock skew.
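The impact of stage imbalance and overhead on the clock can be sketched numerically. The stage delays and overhead below are made-up values for illustration, and the function names are hypothetical; the model assumes the unpipelined machine needs the sum of the stage delays per instruction and the pipelined machine completes one instruction per cycle.

```python
def pipelined_cycle_time(stage_delays, overhead):
    """The clock must cover the slowest stage plus register delay and skew."""
    return max(stage_delays) + overhead

def steady_state_speedup(stage_delays, overhead):
    """Unpipelined time per instruction divided by the pipelined cycle time."""
    return sum(stage_delays) / pipelined_cycle_time(stage_delays, overhead)

delays_ns = [10, 8, 10, 10, 7]       # hypothetical delays of five stages
print(pipelined_cycle_time(delays_ns, 1))             # 11
print(round(steady_state_speedup(delays_ns, 1), 2))   # 4.09
```

Even with five stages the speedup stays below 5 here, because the faster stages sit idle for part of every cycle and the overhead is paid on each one.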
Contd…
• Buffering between stages marginally increases the cycle time.
• Hazards increase the CPI, because stall cycles are added.
What is a Hazard?
- A hazard is a situation in which pipeline operation stalls (stops) for one or
more clock cycles.
- It prevents the next instruction from executing during its
designated clock cycle.
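The cost of stalls can be expressed as extra cycles per instruction. The sketch below uses the standard textbook relation and assumes an ideal CPI of 1 and perfectly balanced stages; the function names and the stall rate in the example are illustrative.

```python
def effective_cpi(ideal_cpi, stall_cycles_per_instruction):
    """Every stall cycle adds directly to the cycles per instruction."""
    return ideal_cpi + stall_cycles_per_instruction

def speedup_with_stalls(pipeline_depth, stall_cycles_per_instruction):
    """Speedup over the unpipelined machine, assuming ideal CPI = 1."""
    return pipeline_depth / (1 + stall_cycles_per_instruction)

print(effective_cpi(1, 0.25))          # 1.25
print(speedup_with_stalls(5, 0.25))    # 4.0
```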
Pipeline Hazards
• There are three classes of hazards:
– Structural
• Happen due to simultaneous requests for the same
resource by two or more instructions
– E.g., IF and MEM both require the memory port.
– Data
• An instruction depends on the result of a prior instruction still
in the pipeline
– Control
• Happen due to branch and jump instructions.
Structural Hazard
Data Hazard
Data Hazard Solution
• Types:
– Interlock: Hardware detects the data dependency and stalls the dependent
instructions.
Contd..
- Forwarding or Bypassing: Forward the result to EX as soon as it is
available.
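The difference between interlocking alone and forwarding can be sketched by counting stall cycles for one dependent pair. The model below is an assumption-laden illustration: a five-stage IF/ID/EX/MEM/WB pipeline, a register file that is written in the first half of a cycle and read in the second half, ALU results forwardable at the end of EX, and load data forwardable at the end of MEM. The function name is hypothetical.

```python
def raw_stalls(distance, producer_is_load, forwarding):
    """Stall cycles for a consumer issued `distance` instructions after its producer.

    distance = 1 means the consumer immediately follows the producer.
    """
    if forwarding:
        # Load data arrives one stage later than an ALU result.
        needed_gap = 2 if producer_is_load else 1
    else:
        # Without forwarding the consumer's ID must line up with the producer's WB.
        needed_gap = 3
    return max(0, needed_gap - distance)

print(raw_stalls(1, False, False))  # 2: interlock only, back-to-back ALU ops
print(raw_stalls(1, False, True))   # 0: forwarding removes the stall
print(raw_stalls(1, True, True))    # 1: load followed by its use
```

Under these assumptions forwarding removes every stall except the one after a load, which is the case instruction scheduling tries to cover.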
Contd…
• Instruction Scheduling: Reorder instructions so that
dependent instructions are 2-3 cycles apart.
– Useful for covering load delays and branch delays
– Useful in hiding delays due to long-latency FP operations
Data Hazard Classification
• True data dependency - RAW (read after write)
• Anti dependency - WAR (write after read)
• Output dependency - WAW (write after write)
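The three classes can be told apart by comparing the registers two instructions read and write. This is a minimal sketch: each instruction is represented by a hypothetical `(destination, sources)` pair, and the register names in the examples are made up.

```python
def classify_dependences(first, second):
    """Classify dependences of `second` on the earlier instruction `first`.

    Each instruction is (dest_register_or_None, set_of_source_registers).
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    found = []
    if dest1 is not None and dest1 in srcs2:
        found.append("RAW")          # second reads what first writes
    if dest2 is not None and dest2 in srcs1:
        found.append("WAR")          # second writes what first reads
    if dest1 is not None and dest1 == dest2:
        found.append("WAW")          # both write the same register
    return found

add = ("R1", {"R2", "R3"})           # R1 <- R2 + R3
sub = ("R4", {"R1", "R5"})           # R4 <- R1 - R5
print(classify_dependences(add, sub))   # ['RAW']
```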
Control Hazard
Next Class……