Improving Processor Performance with Pipelining
COMP381 by M. Hamdi 1
Introduction to Pipelining
• Pipelining: An implementation technique that overlaps the execution of
multiple instructions. It is a key technique for achieving high performance.
• Laundry Example
• Ann, Brian, Cathy, Dave
each have one load of clothes
to wash, dry, and fold
[Figure: sequential laundry timeline. Loads A, B, C, and D run one after another in task order; each load takes 30, 40, and 20 minutes for its three steps.]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
Pipelined Laundry
Start work ASAP
[Figure: pipelined laundry timeline, 6 PM to midnight. Loads A, B, C, and D overlap in task order; the segments along the time axis are 30, 40, 40, 40, 40, and 20 minutes.]
• Pipelined laundry takes 3.5 hours for 4 loads
• Speedup = 6/3.5 = 1.7
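The two laundry totals can be reproduced with a small simulation. This is a minimal sketch, assuming the stage times 30, 40, and 20 minutes from the figures and assuming a load can wait between stages until the next machine is free; the function names are illustrative only.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Each stage handles one load at a time; loads start as soon as possible."""
    stage_free_at = [0] * len(stage_minutes)  # when each stage next becomes idle
    finish = 0
    for _ in range(loads):
        ready = 0  # time at which this load may enter its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free_at[s])
            ready = start + minutes
            stage_free_at[s] = ready
        finish = ready
    return finish


if __name__ == "__main__":
    stages = [30, 40, 20]  # wash, dry, fold (minutes)
    seq = sequential_minutes(stages, 4)
    pipe = pipelined_minutes(stages, 4)
    print(seq / 60, pipe / 60, round(seq / pipe, 1))  # 6.0 3.5 1.7
```

The pipelined total is 30 + 4 × 40 + 20 = 210 minutes: once the pipeline is full, one load completes every 40 minutes, the time of the slowest stage.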
Pipelining Lessons
• Latency vs. Throughput
• Question
– What is the latency in both cases ?
– What is the throughput in both cases ?
Pipelining Lessons [contd…]
• Question
– What is the fastest operation in the example?
– What is the slowest operation in the example?
• Pipeline rate is limited by the slowest pipeline stage
Pipelining Lessons [contd…]
• Multiple tasks operate simultaneously using different resources
Pipelining Lessons [contd…]
• Question
– Would the speedup increase if we had more steps?
• Potential speedup = number of pipe stages
Pipelining Lessons [contd…]
• Question
– Would it matter if the “Folder” also took 40 minutes?
• Non-pipelined: 1 operation finishes every 1 ns
• Pipelined: 1 operation finishes every 200 ps
Comments about pipelining
• Limitations:
– Computations must be divisible into stages of equal
sizes
– Pipeline registers add overhead
Another Example
Unpipelined System
[Figure: 30 ns of combinational logic followed by a 3 ns register (REG), driven by a clock; timing diagram omitted.]
• Delay = 33 ns
• Throughput = 30 MHz
3 Stage Pipelining
[Figure: three stages, each a block of combinational logic followed by a register (REG); timing diagram omitted.]
• Delay = 39 ns
• Throughput = 77 MHz
Limitation: Nonuniform Pipelining
[Figure: three stages with combinational delays of 5 ns, 15 ns, and 10 ns, each followed by a 3 ns register (REG), driven by a common clock.]
• Delay = 18 ns * 3 = 54 ns
• Throughput = 1 / 18 ns ≈ 55 MHz
• Throughput limited by slowest stage
• Delay determined by clock period * number of stages
• Must attempt to balance stages
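The delay and throughput figures in these register-pipeline examples follow from two rules: the clock period is the slowest stage's logic delay plus the register delay, and the delay is the clock period times the number of stages. Below is a minimal sketch; the function name is hypothetical, and the 3-stage case assumes the 30 ns of logic is split into three equal 10 ns stages, which is consistent with the 39 ns delay quoted for it.

```python
def pipeline_timing(logic_delays_ns, register_delay_ns):
    """Return (clock period in ns, total delay in ns, throughput in MHz)."""
    clock_ns = max(logic_delays_ns) + register_delay_ns  # slowest stage sets the clock
    delay_ns = clock_ns * len(logic_delays_ns)           # every stage takes one full clock
    throughput_mhz = 1000.0 / clock_ns                   # one result per clock period
    return clock_ns, delay_ns, throughput_mhz


if __name__ == "__main__":
    cases = {
        "unpipelined": [30],
        "3-stage, balanced (assumed 10 ns each)": [10, 10, 10],
        "3-stage, nonuniform": [5, 15, 10],
    }
    for name, delays in cases.items():
        clock, delay, mhz = pipeline_timing(delays, register_delay_ns=3)
        print(f"{name}: clock {clock} ns, delay {delay} ns, {mhz:.1f} MHz")
```

The nonuniform case has the same total logic delay as the balanced one, yet its clock is 18 ns instead of 13 ns because the 15 ns stage dominates.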
Limitation: Deep Pipelines
[Figure: six stages, each with 5 ns of combinational logic followed by a 3 ns register (REG).]
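To see why depth has a limit, one can sweep the number of stages while holding the register delay fixed. This is a sketch under stated assumptions: the 30 ns of logic and the 3 ns register delay are taken from the earlier unpipelined example, the logic is assumed to divide perfectly evenly, and `balanced_pipeline` is an illustrative name.

```python
def balanced_pipeline(total_logic_ns, register_delay_ns, stages):
    """Timing of a perfectly balanced pipeline with a fixed per-stage register delay."""
    clock_ns = total_logic_ns / stages + register_delay_ns
    return {
        "clock_ns": clock_ns,
        "delay_ns": clock_ns * stages,
        "throughput_mhz": 1000.0 / clock_ns,
        "register_share": register_delay_ns / clock_ns,  # fraction of each cycle spent in registers
    }


if __name__ == "__main__":
    for n in (1, 3, 6, 12):
        r = balanced_pipeline(30, 3, n)
        print(f"{n:2d} stages: clock {r['clock_ns']:.1f} ns, delay {r['delay_ns']:.0f} ns, "
              f"{r['throughput_mhz']:.0f} MHz, registers {r['register_share']:.0%} of cycle")
```

With six stages the clock is 8 ns, of which 3 ns (37.5%) is register overhead, so doubling the depth from three to six stages raises throughput from about 77 MHz to 125 MHz rather than doubling it, while the total delay grows from 39 ns to 48 ns.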
Pipelining
• Multiple instructions overlapped in execution
• Throughput optimization: doesn’t reduce time for
individual instructions
Pipelining: Design Goals
• An important pipeline design consideration is to balance
the length of each pipeline stage.
• Under these ideal conditions:
– Speedup from pipelining equals the number of pipeline stages, n.
– One instruction is completed every cycle, CPI = 1.
– This is an asymptote, of course, but coming within about 10% of it is commonly achieved
– The difference is due to the difficulty of achieving a balanced stage
design
• Two ways to view the performance mechanism
– Reduced CPI (i.e. non-piped to piped change)
• Close to 1 instruction/cycle if you’re lucky
– Reduced cycle-time (i.e. increasing pipeline depth)
• Work split into more stages
• Simpler stages result in faster clock cycles
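The asymptotic nature of the ideal speedup can be illustrated numerically. The sketch below assumes perfectly balanced stages, no stalls, and an unpipelined machine that spends n cycles of the same length on every instruction; the function names are illustrative.

```python
def pipelined_cycles(stages, instructions):
    """Cycles to run the instructions: fill the pipeline once, then finish one per cycle."""
    return stages + instructions - 1


def ideal_speedup(stages, instructions):
    """Speedup over an unpipelined machine taking `stages` cycles per instruction."""
    return (stages * instructions) / pipelined_cycles(stages, instructions)


def effective_cpi(stages, instructions):
    """Average cycles per instruction, including the pipeline fill time."""
    return pipelined_cycles(stages, instructions) / instructions


if __name__ == "__main__":
    for k in (5, 100, 10_000):
        print(k, round(ideal_speedup(5, k), 2), round(effective_cpi(5, k), 3))
```

For a 5-stage pipeline the speedup is about 2.78 for 5 instructions and approaches 5 as the instruction count grows, while the effective CPI approaches 1.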
Implementation of MIPS
• We use the MIPS processor as an example to
demonstrate the concepts of computer pipelining.
• MIPS ISA is designed based on sound measurements
and sound architectural considerations (as covered in
class).
• It is used by numerous companies (Nintendo and PlayStation)
through licensing agreements.
• These same concepts are being used by ALL other
processors as well.
MIPS64 Instruction Format
I-type instruction
Field:  Opcode  rs    rt     Immediate
Width:  6       5     5      16
Bits:   0-5     6-10  11-15  16-31
Encodes: Loads and stores of bytes, half words, and words. All immediates (rt ← rs op immediate).
Conditional branch instructions (rs is register, rd unused).
Jump register, jump and link register (rd = 0, rs = destination, immediate = 0).
R-type instruction
Field:  Opcode  rs    rt     rd     shamt  funct
Width:  6       5     5      5      5      6
Bits:   0-5     6-10  11-15  16-20  21-25  26-31
Register-register ALU operations: rd ← rs func rt. Function encodes the data path operation: Add, Sub, ...
Read/write special registers and moves.
J-type instruction
Field:  Opcode  Offset added to PC
Width:  6       26
Bits:   0-5     6-31
Jump and jump and link. Trap and return from exception.
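The field layouts above can be exercised with a small decoder for a 32-bit instruction word. This is a sketch, not a complete MIPS decoder: it assumes that opcode 0 marks an R-type instruction and that opcodes 2 and 3 (`j` and `jal`) are the J-type instructions, and it treats everything else as I-type. Note that the slide numbers bits from the most significant end (bit 0 is the leftmost opcode bit), whereas the shifts below count from the least significant bit.

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into its fields."""
    opcode = (word >> 26) & 0x3F          # top 6 bits
    if opcode in (2, 3):                  # assumed J-type opcodes: j, jal
        return {"format": "J", "opcode": opcode, "offset": word & 0x03FFFFFF}
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    if opcode == 0:                       # assumed R-type marker
        return {
            "format": "R", "opcode": opcode, "rs": rs, "rt": rt,
            "rd": (word >> 11) & 0x1F,
            "shamt": (word >> 6) & 0x1F,
            "funct": word & 0x3F,
        }
    return {"format": "I", "opcode": opcode, "rs": rs, "rt": rt,
            "immediate": word & 0xFFFF}   # raw 16-bit immediate, not sign-extended


if __name__ == "__main__":
    print(decode(0x8C010064))  # lw $1, 100($0)
    print(decode(0x00C72820))  # add $5, $6, $7
```

The two example words are hand-assembled encodings of `lw $1, 100($0)` and `add $5, $6, $7`.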
A Basic Multi-Cycle Implementation of MIPS
• Every integer MIPS instruction can be implemented in at most
five clock cycles (branch – 2 cycles, store – 4 cycles, other – 5
cycles):
1. Instruction fetch cycle (IF):
IR ← Mem[PC]
NPC ← PC + 4
A Basic Implementation of MIPS (continued)
LMD ← Mem[ALUOutput] or
Mem[ALUOutput] ← B;
– Branch:
A Basic Implementation of MIPS (continued)
– Register-Register ALU instruction:
Regs[rd] ← ALUOutput;
– Register-Immediate ALU instruction:
Regs[rt] ← ALUOutput;
– Load instruction:
Regs[rt] ← LMD;
Simple MIPS Pipelined Integer Instruction Processing
                     Clock number (time in clock cycles)
Instruction number   1    2    3    4    5    6    7    8    9
Instruction I        IF   ID   EX   MEM  WB
Instruction I+1           IF   ID   EX   MEM  WB
Instruction I+2                IF   ID   EX   MEM  WB
Instruction I+3                     IF   ID   EX   MEM  WB
Instruction I+4                          IF   ID   EX   MEM  WB
• Time to fill the pipeline: the cycles before the first instruction completes
• First instruction, I, completed: cycle 5
• Last instruction, I+4, completed: cycle 9
MIPS Pipeline Stages:
IF = Instruction Fetch
ID = Instruction Decode
EX = Execution
MEM = Memory Access
WB = Write Back
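A chart like the one above can be generated mechanically, since instruction i simply enters stage s in cycle i + s + 1. The following is a minimal sketch that assumes no stalls; `pipeline_chart` is an illustrative name.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_chart(num_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage occupied in cycle c + 1."""
    total_cycles = len(stages) + num_instructions - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i is one cycle behind instruction i - 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    for i, row in enumerate(pipeline_chart(5)):
        print(f"I+{i}  " + " ".join(f"{cell:<4}" for cell in row))
```

For five instructions the chart is nine cycles wide: instruction I occupies cycles 1 to 5 and instruction I+4 occupies cycles 5 to 9.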
Pipelining The MIPS Processor
• There are 5 steps in instruction execution:
1. Instruction Fetch
2. Instruction Decode and Register Read
3. Execute operation or calculate address
4. Memory access
5. Write result into register
Datapath for Instruction Fetch
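[Figure: instruction fetch datapath with the PC, an adder (ADD), and the instruction memory (address input ADDR, read output RD supplying the instruction).]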
Datapath for R-Type Instructions
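[Figure: R-type datapath. Register file with 5-bit register-number inputs RN1, RN2, and WN, read-data outputs RD1 and RD2, write-data input WD, and the RegWrite control; ALU with a 3-bit Operation control and a Zero output.]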
Datapath for Load/Store Instructions
Datapath for Branch Instructions
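[Figure: branch datapath. Instruction fields op, rs, rt, and offset/immediate; register file (RN1, RN2, WN, RD1, RD2, WD, RegWrite); ALU with Operation control and Zero output; a 16-to-32-bit sign extension unit (EXTND), a shift left by 2 (<<2), and an adder (ADD) that takes PC+4 from the instruction datapath.]
beq rs, rt, offset
if (R[rs] == R[rt]) then
    PC <- PC+4 + (s_extend(offset) << 2)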
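The branch target computation can be written out directly, following the datapath order: sign-extend the 16-bit offset, shift it left by 2, and add it to PC+4. This is a sketch assuming a 32-bit PC, as drawn in the datapath figures; the function names are illustrative.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def beq_next_pc(pc, rs_value, rt_value, offset16):
    """Next PC after `beq rs, rt, offset` (assumes a 32-bit PC)."""
    fall_through = (pc + 4) & 0xFFFFFFFF
    if rs_value == rt_value:
        # Offset counts instructions (words), so it is scaled by 4 after sign extension.
        return (fall_through + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF
    return fall_through


if __name__ == "__main__":
    print(hex(beq_next_pc(0x1000, 5, 5, 3)))       # taken, forward: 0x1010
    print(hex(beq_next_pc(0x1000, 5, 6, 3)))       # not taken: 0x1004
    print(hex(beq_next_pc(0x1000, 1, 1, 0xFFFF)))  # taken, offset -1: 0x1000
```

An offset of -1 (0xFFFF) branches back to the `beq` instruction itself, because the offset is relative to PC+4.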
Single-Cycle Processor
[Figure: single-cycle datapath (PC, adders, instruction memory, register file, sign extension, ALU, data memory, multiplexers), divided into five sections:]
IF = Instruction Fetch
ID = Instruction Decode
EX = Execute / Address Calculation
MEM = Memory Access
WB = Write Back
Pipelining - Key Idea
Pipeline Registers
• Pipeline registers are named with 2 stages (the
stages that the register is “between.”)
• ANY information needed in a later pipeline stage
MUST be passed via a pipeline register
[Figure: MIPS datapath (PC, adders, instruction memory, register file, sign extension, ALU, data memory, multiplexers) illustrating the pipeline registers.]
Non-Pipelined vs. Pipelined Execution
Instructions, in program order: lw $1, 100($0); lw $2, 200($0); lw $3, 300($0)
Each instruction passes through Instruction Fetch, REG RD, ALU, MEM, and REG WR.
• Non-pipelined [figure, time axis 0 to 1800 ps]: a new instruction starts every 800 ps
• Pipelined [figure, time axis 0 to 1600 ps]: each stage takes 200 ps, and a new instruction starts every 200 ps
Pipelined Example -
Executing Multiple Instructions
• Consider the following instruction sequence:
lw $r0, 10($r1)
sw $r3, 20($r4)
add $r5, $r6, $r7
sub $r8, $r9, $r10
Executing Multiple Instructions
[Figure: the pipelined datapath, repeated for clock cycles 1 to 8, with the instructions in flight labeled above the stages they occupy.]
Clock Cycle  IF    ID    EX    MEM   WB
1            LW
2            SW    LW
3            ADD   SW    LW
4            SUB   ADD   SW    LW
5                  SUB   ADD   SW    LW
6                        SUB   ADD   SW
7                              SUB   ADD
8                                    SUB
Alternative View - Multicycle Diagram
[Figure: multicycle pipeline diagram spanning clock cycles CC 1 to CC 8.]
Pipelining Performance Example
• Example: For an unpipelined CPU:
– Clock cycle = 1ns, 4 cycles for ALU operations and branches and 5 cycles for
memory operations with instruction frequencies of 40%, 20% and 40%,
respectively.
– If pipelining adds 0.2 ns to the machine clock cycle then the speedup in
instruction execution from pipelining is:
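The calculation this example sets up can be sketched as follows. It assumes the frequencies apply in the order listed (ALU operations 40%, branches 20%, memory operations 40%) and that the pipelined machine reaches the ideal CPI of 1, so its average instruction time is just the lengthened clock cycle of 1.2 ns.

```python
def average_cpi(mix):
    """Weighted average cycles per instruction for (frequency, cycles) pairs."""
    return sum(freq * cycles for freq, cycles in mix)


# (frequency, cycles): ALU operations, branches, memory operations
MIX = [(0.40, 4), (0.20, 4), (0.40, 5)]

unpipelined_ns = 1.0 * average_cpi(MIX)  # 1 ns clock cycle times the average CPI
pipelined_ns = (1.0 + 0.2) * 1           # clock cycle with 0.2 ns overhead, assumed CPI = 1
speedup = unpipelined_ns / pipelined_ns

if __name__ == "__main__":
    print(f"unpipelined: {unpipelined_ns:.1f} ns per instruction")
    print(f"pipelined:   {pipelined_ns:.1f} ns per instruction")
    print(f"speedup:     {speedup:.2f}")
```

Under these assumptions the average unpipelined instruction takes 4.4 ns, the pipelined one 1.2 ns, and the speedup is about 3.7.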
Pipeline Throughput and Latency: A More Realistic Example
Stage:  IF    ID    EX    MEM    WB
Delay:  5 ns  4 ns  5 ns  10 ns  4 ns
Pipeline Throughput and Latency
[Figure: pipelined execution of I1 to I4 on a time axis from 0 to 60 ns.]
I1  IF   ID   EX   MEM  WB
I2       IF   ID   EX   MEM  WB      L(I2) = 50 ns
I3            IF   ID   EX   MEM  WB
I4                 IF   ID   EX   MEM
Pipeline Throughput and Latency
Stage:  IF    ID    EX    MEM1  MEM2  WB
Delay:  5 ns  4 ns  5 ns  5 ns  5 ns  4 ns
[Figure: pipelined execution of I1 to I7 with the memory access split into MEM1 and MEM2.]
I1  IF   ID   EX   MEM1 MEM2 WB
I2       IF   ID   EX   MEM1 MEM2 WB
I3            IF   ID   EX   MEM1 MEM2 WB
I4                 IF   ID   EX   MEM1 MEM2 WB
I5                      IF   ID   EX   MEM1 MEM2 WB
I6                           IF   ID   EX   MEM1 MEM2 WB
I7                                IF   ID   EX   MEM1 MEM2 WB
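The latency and throughput of the two pipelines can be compared with a few lines of arithmetic. This is a sketch assuming that the clock period equals the slowest stage delay and that pipeline register overhead is ignored, since none is given for this example; the function names are illustrative.

```python
def unpipelined(stage_ns):
    """Return (latency in ns, throughput in million instructions per second)."""
    latency = sum(stage_ns)  # one instruction runs start to finish before the next
    return latency, 1000.0 / latency


def pipelined(stage_ns):
    """Return (latency in ns, throughput in million instructions per second)."""
    clock = max(stage_ns)    # every stage is stretched to the slowest one
    return clock * len(stage_ns), 1000.0 / clock


if __name__ == "__main__":
    five_stage = [5, 4, 5, 10, 4]    # IF, ID, EX, MEM, WB
    six_stage = [5, 4, 5, 5, 5, 4]   # MEM split into MEM1 and MEM2
    print(unpipelined(five_stage))   # latency 28 ns, about 35.7 M instr/s
    print(pipelined(five_stage))     # latency 50 ns, 100 M instr/s
    print(pipelined(six_stage))      # latency 30 ns, 200 M instr/s
```

Pipelining the five-stage design raises the latency of each instruction from 28 ns to 50 ns, matching L(I2) = 50 ns, because every stage is stretched to the 10 ns MEM stage. Splitting MEM into two 5 ns stages halves the clock period, which doubles the throughput and brings the latency down to 30 ns.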