Professional Documents
Culture Documents
Lec11 Pipeline 1 Notes
Lec11 Pipeline 1 Notes
Lec11 Pipeline 1 Notes
Spring 2022
Single Cycle Disadvantages
Cycle 1 Cycle 2
Clk
lw sw Waste
2/25
Single Cycle Disadvantages
• Though simple, the single cycle approach is not used because it is very slow
• Clock cycle must have the same length for every instruction
• What is the longest path (slowest instruction)? Load instruction!
• It is too long for the store instruction so the last part of the cycle here is wasted.
3/25
EX: Instruction Critical Paths
Calculate cycle time assuming negligible delays (for muxes, control unit, sign extend, PC
access, shift left 2, wires) except:
• Instruction memory (IM) and data memory (DM) (4 ns)
• ALU and adders (ALU) (2 ns)
• Register File access (Reg) (1 ns)
4/25
EX: Instruction Critical Paths
Calculate cycle time assuming negligible delays (for muxes, control unit, sign extend, PC
access, shift left 2, wires) except:
• Instruction memory (IM) and data memory (DM) (4 ns)
• ALU and adders (ALU) (2 ns)
• Register File access (Reg) (1 ns)
4/25
Solution:
Instr. IM Reg ALU DM Reg Total
R/I-type 4 1 2 1 8
lw 4 1 2 4 1 12
sw 4 1 2 4 11
beq 4 1 2 7
jal 4 2 1 7
jalr 4 1 2 1 8
5/25
How Can We Make It Faster?
• Start fetching and executing the next instruction before the current one has
completed
• Pipelining – (all?) modern processors are pipelined for performance
• Under ideal conditions and with a large number of instructions, the speedup
from pipelining is approximately equal to the number of pipe stages
• A five stage pipeline is nearly five times faster because the CC is “nearly” five
times faster
• Fetch (and execute) more than one instruction at a time
• Superscalar processing – stay tuned
6/25
Note:
• CC: clock cycle;
• IC: instruction count
• In reality the time per instruction in a pipeline processor is longer than the minimum
possible because
1 the pipeline stages may not be perfectly balanced
2 pipelining involves some overhead (like pipeline stage isolation registers).
7/25
The Five Stages of Load Instruction
lw IF ID EXE MEM WB
8/25
Equivalent Definitions
lw IF ID EXE MEM WB
9/25
A Pipelined RISC-V Processor
Start the next instruction before the current one has completed
• improves throughput - total amount of work done in a given time
• instruction latency (execution time, delay time, response time - time from the start of
an instruction to its completion) is not reduced
lw IF ID EXE MEM WB
sw IF ID EXE MEM WB
lw IF ID EXE MEM WB
sw IF ID EXE MEM WB
Instructions in order
11/25
Single Cycle versus Pipeline
lw sw Waste
sw IF ID EXE MEM WB
• To complete an entire instruction in the pipelined case takes 1000 ps (as compared to
800 ps for the single cycle case). Why ?
• How long does each take to complete 1,000,000 adds ?
12/25
Example:
If IF=100ps, ID=100ps, EXE=200ps, MEM=200ps, WB=100ps
• In single cycle setting, cycle-length=700ps
• In pipeline setting, each cycle length=200ps, so finish one instr will take 5 stages
(ie. 5 × 200ps).
13/25
Pipelining the RISC-V ISA
14/25
RISC-V Pipeline Datapath Additions/Mods
Add
Shift Add MEM/WB
4
left 2
Read Addr 1
Instruction Data
Register Read
Memory Memory
Read Addr 2Data 1
Read
PC
File Address
Read
Address Write Addr Read ALU Data
Data 2 Write Data
Write Data
Sign
16 Extend 32
System Clock
15/25
RISC-V Pipeline Control Path Modifications
PCSrc
ID/EX
EX/MEM
Control
IF/ID
Add
Branch MEM/WB
4 RegWrite Shift Add
left 2
Instruction Read Addr 1
Register Read Data
Memory Read Addr Data
2 1 Memory MemtoReg
Read ALUSrc
PC
16 Extend 32 ALUOp
RegDst
16/25
Pipeline Control
• IF Stage: read Instr Memory (always asserted) and write PC (on System Clock)
• ID Stage: no optional control signals to set
17/25
Graphically Representing RISC-V Pipeline
ALU
IM Reg DM Reg
18/25
Other Pipeline Structures Are Possible
MUL
ALU
IM Reg DM Reg
What if the data memory access is twice as slow as the instruction memory?
• make the clock twice as slow or ...
• let data memory access take two cycles (and keep the same clock rate)
ALU
19/25
Other Sample Pipeline Alternatives
• ARM7:
IM Reg EX
PC update decode
IM access Reg access
ALU op DM access
shift/rotate commit result
(write back)
• XScale:
Reg
ALU
IM1 IM2 Reg SHFT DM1
DM2
PC update decode DM write
BTB access reg 1 access ALU op reg write
start IM access
shift/rotate start DM access
IM access reg 2 access exception
20/25
Why Pipeline? For Performance!
Once the
Inst 0
ALU
I IM Reg DM Reg pipeline is full,
n one instruction
s is completed
Inst 1
ALU
t IM Reg DM Reg
every cycle, so
r. CPI = 1
ALU
O Inst 2 IM Reg DM Reg
r
d
ALU
e Inst 3 IM Reg DM Reg
r
ALU
Inst 4 IM Reg DM Reg
21/25
Can Pipelining Get Us Into Trouble?
22/25
Structure Hazards
23/25
Resolve Structural Hazard 1
ALU
Mem Reg Mem Reg
I memory
n
s
Inst 1
ALU
t Mem Reg Mem Reg
r.
ALU
O Inst 2 Mem Reg Mem Reg
r
d
ALU
e Inst 3 Mem Reg Mem Reg
r
ALU
Inst 4 Reading instruction Mem Reg Mem Reg
from memory
Fix with separate instr and data memories (I$ and D$)
24/25
Resolve Structural Hazard 2
ALU
I IM Reg DM Reg access hazard by
n doing reads in the
s second half of the
Inst 1
ALU
t IM Reg DM Reg
cycle and writes in
r. the first half
ALU
O Inst 2 IM Reg DM Reg
r
d
ALU
e add $2,$1, IM Reg DM Reg
r
25/25