
Introduction to Pipelining
Pipelining: It Works in Daily Life!
Laundry Example
• A, B, C, and D each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• Folder takes 20 minutes
Sequential Laundry
[Figure: timeline from 6 PM to midnight. Loads A, B, C, and D run one after another in task order; each load takes 30 + 40 + 20 = 90 minutes.]
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?
Pipelined Laundry
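[Figure: timeline from 6 PM to midnight. Loads A, B, C, and D overlap in task order; the stage times along the time axis are 30, 40, 40, 40, 40, and 20 minutes, and each load is labeled 90 minutes.]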
Pipelined laundry takes 3.5 hours for 4 loads
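The two totals can be checked with a short calculation. The sketch below is illustrative only: the function names are made up for this example, and it assumes a load can wait between machines until the next one is free.

```python
def sequential_minutes(stage_minutes, loads):
    """Total time when each load finishes all stages before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Total time when a load may enter a stage as soon as that stage is free.

    Assumption: a load can wait between machines (unlimited buffering).
    """
    free_at = [0] * len(stage_minutes)  # time at which each machine becomes free
    for _ in range(loads):
        t = 0  # time at which this load is ready for its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(t, free_at[s])  # wait for the machine if it is busy
            t = start + minutes
            free_at[s] = t
    return free_at[-1]


stages = [30, 40, 20]  # washer, dryer, folder (minutes)
print(sequential_minutes(stages, 4) / 60)  # hours for 4 loads, one after another
print(pipelined_minutes(stages, 4) / 60)   # hours for 4 loads, overlapped
```

With the slide's numbers this gives 6 hours and 3.5 hours. The pipelined total is 30 + 4 × 40 + 20 minutes, because the 40-minute dryer paces every load after the first.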
Pipelining Lessons
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
[Figure: pipelined laundry timeline for loads A to D, 6 PM to 9 PM, with stage times 30, 40, 40, 40, 40, 20 and the filling and draining phases marked.]
What is pipelining?

• A pipeline is a series of stages, where some work is done at each stage.
• It is an implementation technique whereby multiple instructions are overlapped in execution.
• It exploits parallelism among the instructions in a sequential instruction stream.
• A pipelined processor consists of a sequence of processing circuits, called segments or stages, through which a stream of operands can be passed.
Why is pipelining desirable?
• Used to improve performance beyond what can be achieved with non-pipelined processing.
• Yields a reduction in the average execution time per instruction.
Pipeline Designer Goal
• To balance the length of each pipeline stage
• If the stages are perfectly balanced, then the time per instruction on the pipelined machine (T) is

T = (time per instruction on the unpipelined machine) / (number of pipe stages)

• Under these ideal conditions, the speedup from pipelining equals the number of pipe stages
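A small worked example of the relation above, using made-up stage times in nanoseconds. The function names are illustrative, and the sketch ignores pipeline register delay and clock skew.

```python
def ideal_time_per_instruction(unpipelined_ns, num_stages):
    """Time per instruction with perfectly balanced stages."""
    return unpipelined_ns / num_stages


def speedup(stage_ns):
    """Speedup over the unpipelined machine for the given stage times.

    Assumptions: the unpipelined time per instruction is the sum of the
    stage times, and the pipelined clock period equals the slowest stage.
    """
    return sum(stage_ns) / max(stage_ns)


balanced = [10, 10, 10, 10, 10]    # hypothetical stage times (ns)
unbalanced = [10, 8, 10, 15, 7]    # same total work, one slow stage

print(ideal_time_per_instruction(sum(balanced), len(balanced)))  # ns per instruction
print(speedup(balanced))    # equals the number of stages
print(speedup(unbalanced))  # lower, because the 15 ns stage sets the clock
```

With balanced stages the speedup is 5, the number of pipe stages. With the unbalanced times it drops to 50 / 15, about 3.3.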
Types of Pipelines
• Instruction Pipeline
– Where the different stages of instruction fetch and execution are handled in a pipeline

• Arithmetic Pipeline
– Where the different stages of an arithmetic operation are handled along the stages of a pipeline
Instruction Pipeline
Contd…
• The processing of an instruction need not be divided into only two steps. To gain further speedup, the pipeline must have more stages.
• Consider the following decomposition of instruction execution:
- Fetch Instruction (FI): Read the next expected instruction into a buffer.
- Decode Instruction (DI): Determine the opcode and the operand specifiers.
- Calculate Operands (CO): Calculate the effective address of each source operand.
- Fetch Operands (FO): Fetch each operand from memory.
- Execute Instruction (EI): Perform the indicated operation.
- Write Operand (WO): Store the result in memory.


The timing diagram …
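A timing diagram for the six-stage decomposition above can be generated with a few lines of code. This is a minimal sketch that assumes an ideal pipeline: every stage takes one cycle and there are no stalls. The helper names are illustrative.

```python
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]


def timing_diagram(num_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage occupied in cycle c + 1."""
    total_cycles = len(stages) + num_instructions - 1
    rows = []
    for i in range(num_instructions):
        cells = ["  "] * total_cycles
        for s, name in enumerate(stages):
            cells[i + s] = name  # instruction i enters stage s in cycle i + s + 1
        rows.append(cells)
    return rows


def render(rows):
    """Format the rows as text with a header of cycle numbers."""
    cycles = len(rows[0]) if rows else 0
    header = " " * 5 + " ".join(f"{c:>2}" for c in range(1, cycles + 1))
    lines = [header]
    for i, cells in enumerate(rows, start=1):
        lines.append(f"I{i:<4}" + " ".join(cells))
    return "\n".join(lines)


print(render(timing_diagram(4)))
```

Four instructions complete in 6 + 4 - 1 = 9 cycles, compared with 24 cycles if each instruction had to finish before the next one started.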
Implementing Pipelining Using DLX
5 Steps of DLX Instr. Execution: Step 1
Step 1: Instruction fetch cycle (IF)
– Read the instruction from memory and store it in IR
• IR ← Mem[PC]
– Calculate the next instruction address
• NPC ← PC + 4
• Each instruction occupies 4 consecutive bytes
[Figure: IF datapath. PC addresses the instruction memory, whose output is loaded into IR; an adder computes PC + 4 into NPC.]
5 Steps of DLX Instr. Execution: Step 2
Step 2: Instruction decode/register fetch cycle (ID)
– Read the source registers into A and B
A ← Regs[IR6..10]
B ← Regs[IR11..15]
– Sign-extend the 16-bit immediate field to make a 32-bit immediate value
Imm ← ((IR16)^16 ## IR16..31)
– Decoding is done in parallel with the register reads: fixed-field decoding
[Figure: ID datapath. IR drives the register file, which outputs A and B; the Rd field is latched into b (b ← Rd); the OP field is passed along; the sign extender widens the 16-bit immediate to the 32-bit Imm.]
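The field extraction and sign extension in Step 2 can be sketched as follows. The sketch assumes the DLX convention that bit 0 is the most significant bit of the 32-bit instruction word; the helper names `bits`, `sign_extend_16`, and `decode` are illustrative, not part of any DLX toolchain.

```python
def bits(word, first, last):
    """Extract bits first..last (inclusive) of a 32-bit word.

    Assumption: bit 0 is the most significant bit, as in the IR6..10 notation.
    """
    width = last - first + 1
    shift = 31 - last
    return (word >> shift) & ((1 << width) - 1)


def sign_extend_16(value):
    """Interpret a 16-bit field as a signed integer (the sign bit is replicated)."""
    return value - 0x10000 if value & 0x8000 else value


def decode(ir, regs):
    """ID cycle: read A and B from the register file and build Imm."""
    a = regs[bits(ir, 6, 10)]
    b = regs[bits(ir, 11, 15)]
    imm = sign_extend_16(bits(ir, 16, 31))
    return a, b, imm


# Hypothetical instruction word: source fields 1 and 2, immediate 0xFFFC (-4).
regs = [i * 10 for i in range(32)]
ir = (1 << 21) | (2 << 16) | 0xFFFC
print(decode(ir, regs))
```

Because the fields sit at fixed positions, all three values can be read without first knowing the opcode, which is what makes fixed-field decoding possible.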
5 Steps of DLX Instr. Execution: Step 3
Step 3: Execution/effective address cycle (EX)
– Memory reference: effective address calculation
» ALUOutput ← A + Imm
– Register-register ALU instruction: perform the ALU operation on two register operands
» ALUOutput ← A func B
– Register-immediate ALU instruction: perform the ALU operation with an immediate operand
» ALUOutput ← A op Imm
– Branch: calculate the branch target address and determine the condition
» ALUOutput ← NPC + Imm; Cond ← (A op 0)
[Figure: Step 3 EX datapath. One MUX selects NPC or A and another selects B or Imm as the ALU inputs; OP controls the ALU, which produces ALUOut; a Zero? test produces Cond.]
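The four cases of the EX cycle can be sketched as a single function. The instruction-class names, the `ALU_FUNCS` table, and the default branch test (equal to zero) are assumptions made for this illustration; 32-bit wraparound is ignored.

```python
import operator

# A few ALU functions, chosen for illustration only.
ALU_FUNCS = {
    "add": operator.add,
    "sub": operator.sub,
    "and": operator.and_,
    "or": operator.or_,
}


def execute(kind, a, b, imm, npc, func="add", cond_op=operator.eq):
    """EX cycle: return (alu_output, cond) for one instruction class."""
    if kind == "mem":        # load/store: effective address
        return a + imm, False
    if kind == "reg-reg":    # ALUOutput <- A func B
        return ALU_FUNCS[func](a, b), False
    if kind == "reg-imm":    # ALUOutput <- A op Imm
        return ALU_FUNCS[func](a, imm), False
    if kind == "branch":     # target address and condition (A op 0)
        return npc + imm, cond_op(a, 0)
    raise ValueError(f"unknown instruction class: {kind}")


print(execute("mem", a=100, b=0, imm=8, npc=0))
print(execute("reg-reg", a=7, b=5, imm=0, npc=0, func="sub"))
print(execute("branch", a=0, b=0, imm=16, npc=104))
```

The two MUXes in front of the ALU correspond to the choice of operands in each branch of the function: A or NPC on one input, B or Imm on the other.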
5 Steps of DLX Instr. Execution: Step 4
Step 4: Memory access/branch completion cycle (MEM)
– Memory reference: access memory, either
• for LD: LMD ← Mem[ALUOutput] or
• for ST: Mem[ALUOutput] ← B
– Branch: test the condition
• if (cond) PC ← ALUOutput else PC ← NPC
[Figure: MEM datapath. A MUX controlled by Cond selects NPC or ALUOut as the next PC; the data memory is addressed by ALUOut, takes B as store data, and loads its output into LMD.]
5 Steps of DLX Instr. Execution: Step 5
Step 5: Write-back cycle (WB)
– Reg-Reg ALU: store the result into the destination register
Regs[IR16..20] ← ALUOutput
– Reg-Immediate ALU: store the result into the destination register
Regs[IR11..15] ← ALUOutput
– Load instruction: store the data read from memory into the destination register
Regs[IR11..15] ← LMD
[Figure: WB datapath. A MUX controlled by OP selects LMD or ALUOut as the value written to the register file.]
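The MEM and WB rules from Steps 4 and 5 can be sketched together. As before, the instruction-class names and helper functions are illustrative, memory is modeled as a plain dictionary, and bit 0 is assumed to be the most significant bit of IR.

```python
def bits(word, first, last):
    """Extract bits first..last (inclusive); bit 0 is the most significant bit."""
    return (word >> (31 - last)) & ((1 << (last - first + 1)) - 1)


def memory_access(kind, alu_output, b, cond, npc, memory):
    """MEM cycle: return (lmd, next_pc)."""
    lmd = None
    next_pc = npc
    if kind == "load":
        lmd = memory[alu_output]       # LMD <- Mem[ALUOutput]
    elif kind == "store":
        memory[alu_output] = b         # Mem[ALUOutput] <- B
    elif kind == "branch" and cond:
        next_pc = alu_output           # taken branch: PC <- ALUOutput
    return lmd, next_pc


def write_back(kind, ir, alu_output, lmd, regs):
    """WB cycle: write the result into the destination register, if any."""
    if kind == "reg-reg":
        regs[bits(ir, 16, 20)] = alu_output
    elif kind == "reg-imm":
        regs[bits(ir, 11, 15)] = alu_output
    elif kind == "load":
        regs[bits(ir, 11, 15)] = lmd


memory = {108: 42}
regs = [0] * 32
load_ir = 3 << 16  # hypothetical load whose destination field (IR11..15) is 3
lmd, next_pc = memory_access("load", 108, 0, False, 104, memory)
write_back("load", load_ir, 108, lmd, regs)
print(regs[3], next_pc)
```

Note that the destination field differs by instruction class: IR16..20 for register-register ALU instructions, IR11..15 for register-immediate ALU instructions and loads.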
5 Steps of DLX Datapath
[Figure: DLX datapath across the IF, ID, EX, MEM, and WB stages, showing PC, the +4 adder, instruction memory, register file, sign extender (16 → 32), ALU with Zero? test, ALU output, data memory, LMD, SMD, and the multiplexers.]
A Simple Implementation
• A multi-cycle implementation
– needs temporary registers: NPC, IR, A, B, Imm, Cond, ALUOutput, LMD
– CPI improvements

• A single-cycle implementation
– one long clock cycle
– very inefficient for most machines, because the amount of work varies considerably from instruction to instruction
– requires duplicating functional units (FUs) that could be shared in a multi-cycle implementation
Visualizing Pipeline
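Time (clock cycles); instruction order runs top to bottom

          CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9
Instr 1   IM   Reg  ALU  DM   Reg
Instr 2        IM   Reg  ALU  DM   Reg
Instr 3             IM   Reg  ALU  DM   Reg
Instr 4                  IM   Reg  ALU  DM   Reg
Instr 5                       IM   Reg  ALU  DM   Reg

The pipeline is filling during the early cycles and draining during the late cycles.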
Saving Information Produced
by Each Stage of Pipeline
• Information needs to be stored at the end of each clock cycle; otherwise it will be lost
• Each pipeline stage produces information (data, addresses, and control) at the end of the clock cycle
• Thus, we need storage (called an inter-stage buffer) at the end of each pipeline stage
Inter-Stage Buffers in the DLX Pipeline
• F/D Buffer
– IR, NPC
• D/EX Buffer
– A, B, Imm, b (destination register address for the result), OP (opcode), cond
– NPC
• EX/M Buffer
– ALUout (arithmetic result or effective address)
– NPC, cond, b, OP
• M/W Buffer
– LMD (data for LD)
– ALUout (arithmetic result), b, OP
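One way to picture the buffers listed above is as plain records that each stage fills for the next. The class and function names below are hypothetical, the slide's `b` field is called `dest` to avoid a clash with the B operand, and only a load-style flow is modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FDBuffer:
    ir: int
    npc: int


@dataclass
class DEXBuffer:
    a: int
    b: int
    imm: int
    dest: int   # "b" on the slide: destination register address
    op: str
    cond: bool
    npc: int


@dataclass
class EXMBuffer:
    alu_out: int
    npc: int
    cond: bool
    dest: int
    op: str


@dataclass
class MWBuffer:
    lmd: Optional[int]
    alu_out: int
    dest: int
    op: str


def ex_stage(d_ex: DEXBuffer) -> EXMBuffer:
    """EX for a memory reference: compute A + Imm and pass the other fields on."""
    return EXMBuffer(d_ex.a + d_ex.imm, d_ex.npc, d_ex.cond, d_ex.dest, d_ex.op)


def mem_stage(ex_m: EXMBuffer, memory: dict) -> MWBuffer:
    """MEM: read memory for a load and pass ALUout, dest, and op on to WB."""
    lmd = memory[ex_m.alu_out] if ex_m.op == "load" else None
    return MWBuffer(lmd, ex_m.alu_out, ex_m.dest, ex_m.op)


d_ex = DEXBuffer(a=100, b=0, imm=8, dest=3, op="load", cond=False, npc=104)
print(mem_stage(ex_stage(d_ex), {108: 42}))
```

The destination address and opcode travel with the instruction from buffer to buffer, so the WB stage still knows where to write even though later instructions have already overwritten IR.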
Pipelined DLX Datapath
- Multicycle -
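[Figure: pipelined DLX datapath. The IF, ID, EX, MEM, and WB stages are separated by the F/D, D/EX, EX/M, and M/W buffers; the components shown are PC, the +4 adder, instruction memory, register file, sign extender (16 → 32), ALU with Zero? test, data memory, LMD, SMD, and the multiplexers.]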
Basic Performance Issues
• Pipelining increases the CPU instruction throughput.

• It does not reduce the execution time of an individual instruction; rather, it slightly increases that time because of overhead in the control of the pipeline.

• The increase in throughput means that a program runs faster and has a lower total execution time.

• Imbalance among the pipe stages reduces performance, since the clock can run no faster than the time needed for the slowest pipeline stage.

• Pipeline overhead arises from pipeline register delay and clock skew.
Contd…
• Buffering between stages marginally increases the cycle time
• Hazards increase the CPI, and so reduce performance

What is a hazard?
- A situation in which pipeline operation stalls (stops) for one or more clock cycles.
- It prevents the next instruction from executing during its designated clock cycle.
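Since every stall cycle adds to the cycles per instruction, the cost of hazards can be estimated with a short calculation. The sketch below uses the common textbook simplification of an ideal CPI of 1, balanced stages, and no clock overhead; the function names and the stall rate are made up for the example.

```python
def pipelined_cpi(stall_cycles_per_instruction, ideal_cpi=1.0):
    """CPI of the pipelined machine once stalls are included."""
    return ideal_cpi + stall_cycles_per_instruction


def speedup_with_stalls(pipeline_depth, stall_cycles_per_instruction):
    """Speedup over the unpipelined machine.

    Assumptions: ideal CPI of 1, perfectly balanced stages, no clock overhead.
    """
    return pipeline_depth / pipelined_cpi(stall_cycles_per_instruction)


print(speedup_with_stalls(5, 0.0))   # no hazards: speedup equals the depth
print(speedup_with_stalls(5, 0.25))  # one stall cycle every four instructions
```

With no stalls a five-stage pipeline reaches a speedup of 5; with an average of 0.25 stall cycles per instruction it falls to 4.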
Pipeline Hazards
• There are three classes of hazards:
– Structural
• Happen due to simultaneous requests for the same resource by two or more instructions
– E.g., IF and MEM both require a memory port.
– Data
• An instruction depends on the result of a prior instruction still in the pipeline
– Control
• Happen due to branch and jump instructions.
Structural Hazard
Data Hazard
Data Hazard Solution
• Types:
– Interlock: hardware detects the data dependency and stalls the dependent instructions.
Contd…
– Forwarding or bypassing: forward the result to EX as soon as it is available
Contd…
• Instruction scheduling: reorder instructions so that dependent instructions are 2-3 cycles apart.
– Useful for covering load delays and branch delays
– Useful in hiding delays due to long-latency FP operations
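The effect of the three techniques can be compared by counting stall cycles for a short instruction sequence. The sketch below is a rough model, not a description of any particular DLX implementation. It assumes the five-stage pipeline above and a register file that is written in the first half of a cycle and read in the second half. It also assumes that, with forwarding, only a load followed immediately by a use of its result still stalls. The tuple format and function name are illustrative.

```python
def count_stalls(program, forwarding):
    """Count stall cycles for a list of (op, dest, sources) instructions.

    Assumed minimum distance (in cycles) between a producer and a consumer:
      interlock only: 3
      forwarding:     1 after an ALU instruction, 2 after a load
    """
    issue = []        # cycle in which each instruction enters IF
    last_writer = {}  # register name -> index of the most recent writer
    for i, (op, dest, sources) in enumerate(program):
        cycle = issue[-1] + 1 if issue else 0
        for reg in sources:
            if reg in last_writer:
                p = last_writer[reg]
                if forwarding:
                    gap = 2 if program[p][0] == "load" else 1
                else:
                    gap = 3
                cycle = max(cycle, issue[p] + gap)  # wait for the result
        issue.append(cycle)
        if dest is not None:
            last_writer[dest] = i
    return issue[-1] - (len(program) - 1) if program else 0


program = [
    ("load", "R1", ["R2"]),
    ("add", "R3", ["R1", "R4"]),
    ("sub", "R5", ["R3", "R6"]),
]
# Scheduling: an independent instruction is moved between the load and its use.
scheduled = [program[0], ("or", "R7", ["R8", "R9"])] + program[1:]

print(count_stalls(program, forwarding=False))   # interlock only
print(count_stalls(program, forwarding=True))    # with forwarding
print(count_stalls(scheduled, forwarding=True))  # forwarding plus scheduling
```

Under these assumptions the sequence stalls for 4 cycles with interlocks alone and 1 cycle with forwarding. It does not stall at all once an independent instruction fills the load delay slot.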
Data Hazard Classification
• True data dependency: RAW (read after write)
• Anti-dependency: WAR (write after read)
• Output dependency: WAW (write after write)
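The three classes can be told apart by comparing the destination and source registers of two instructions in program order. A minimal sketch, assuming each instruction is described by a hypothetical `(dest, sources)` pair:

```python
def classify(first, second):
    """Return the dependencies of `second` on the earlier instruction `first`.

    Each instruction is a (dest, sources) pair; dest may be None.
    """
    dest1, sources1 = first
    dest2, sources2 = second
    found = set()
    if dest1 is not None and dest1 in sources2:
        found.add("RAW")  # second reads what first writes
    if dest2 is not None and dest2 in sources1:
        found.add("WAR")  # second writes what first reads
    if dest1 is not None and dest1 == dest2:
        found.add("WAW")  # both write the same register
    return found


print(classify(("R1", ["R2", "R3"]), ("R4", ["R1", "R5"])))  # true dependency
print(classify(("R1", ["R2"]), ("R2", ["R3"])))              # anti-dependency
print(classify(("R1", ["R2"]), ("R1", ["R3"])))              # output dependency
```

A pair of instructions can fall into more than one class at once, which is why the function returns a set.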
Control Hazard

Next Class……
