Pipe 1 New

Curso de Arquiteturas Avançadas de Computadores
Professora: Roseli S. Wedemann

roseli@ime.uerj.br
sala D6019
Texto: Computer Organization & Design, the Hardware

Software Interface, D. A. Patterson and J. L. Hennessy,
Morgan Kaufman Publishers
cs 152 L1 3 .1 DAP Fa97,  U.CB

Sobre o Curso
Critério de Aprovação:
duas provas + prova final
para passar sem prova final, média >= 7
para passar com prova final, média >= 5
direito à 2a. chamada
para ir para a prova final, média >= 4
Nota = [ (P1 + P2) / 2 + PF ] / 2
Conteúdo
Capítulo 6: Enhancing Performance with Pipelining
Capítulo 9: Multiprocessors
2
cs 152 L1 3 .2 DAP Fa97,  U.CB
Recap: Microprogramming
Specialize state-diagrams easily captured by

microsequencer
simple increment & “branch” fields
datapath control fields
Control design reduces to Microprogramming

Microprogramming is a fundamental concept
implement an instruction set by building a very simple processor
and interpreting the instructions
essential for very complex instructions and when few register
transfers are possible
overkill when ISA matches datapath 1-1
3
cs 152 L1 3 .3 DAP Fa97,  U.CB
Summary: Microprogramming one inspiration for RISC
If simple instruction could execute at very high clock

rate…
If you could even write compilers to produce
microinstructions…
If most programs use simple instructions and
addressing modes…
If microcode is kept in RAM instead of ROM so as to fix
bugs …
If same memory used for control memory could be used
instead as cache for “macroinstructions”…
Then why not skip instruction interpretation by a
microprogram and simply compile directly into lowest
language of machine? (microprogramming is overkill
when ISA matches datapath 1-1)
4
cs 152 L1 3 .4 DAP Fa97,  U.CB
Recap: Exceptions and Interrupts
Exceptions are the hard part of control
Need to find convenient place to detect exceptions

and to branch to state or microinstruction that
saves PC and invokes the operating system
As we get pipelined, CPUs that support page faults

on memory accesses, which means that the
instruction cannot complete AND you must be able
to restart the program at exactly the instruction
with the exception, it gets even harder
5
cs 152 L1 3 .5 DAP Fa97,  U.CB
Definitions of control variables
EPC: Exception Program Counter (stores return

address)
exp_addr: address of exception handler routine
cause: reg used for storing the cause of exception
rd: reg destination operand in instruction
rt: second reg source operand in instruction
SX: constant specified in instruction
ZX: zero extended immediate
Ori: S <= reg or ZX, S is a reg (see pg A-57)
(see chapter 3 and section 5.6)
6
cs 152 L1 3 .6 DAP Fa97,  U.CB
Modification to the Control
Specification
IR <= MEM[PC]
PC <= PC + 4 “instruction fetch”
0000
undefined instruction
“decode”
EPC <= PC - 4 EPC <= PC - 4
S<= PC +SX
PC <= exp_addr other PC <= exp_addr
cause <= 12 (Ovf) 0001 cause <= 10 (RI)
R-type ORi LW BEQ

SW
Execute
S <= AIf- AB = B
S <= A fun B S <= A op ZX S <= A + SX S <= A + SX then PC <= S
0100 0110 1000 1011 0010
overflow
Memory
M <= MEM[S] MEM[S] <= B

1001
1100
Write-back
R[rd] <= S R[rt] <= S R[rt] <= M
0101 0111 1010
7
The Big Picture: Where are We Now?
The Five Classic Components of a Computer
Processor
Input
Control
Memory
Datapath Output
Today’s Topics:
Pipelining by Analogy
8
6.1 An overview of Pipelining. It’s Natural!
Example: Stages of Laundry

Ann, Brian, Cathy, Dave A B C D
each have one load of clothes
to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 30 minutes
“Folder” takes 30 minutes
“Storer” takes 30 minutes

to put clothes into drawers 9
cs 152 L1 3 .9 DAP Fa97,  U.CB
Sequential Laundry
6 PM 7 8 9 10 11 12 1 2 AM
T 3030 30 30 3030 30 30 3030 30 30 3030 30 30

a Time
A
s
k B
C
O
r D
d
e
r Sequential laundry takes 8 hours for 4 loads
If they learned pipelining, how long would laundry
take? 10
cs 152 L1 3 .10 DAP Fa97,  U.CB
Pipelined Laundry: Start work ASAP
6 PM 7 8 9 10 11 12 1 2 AM
3030 30 30 30 3030 Time

T
a A
s
k B
C
O
D
r
d
e
r
Pipelined laundry takes 3.5 hours for 4 loads!
11
cs 152 L1 3 .11 DAP Fa97,  U.CB
Pipelining Lessons
Pipelining doesn’t help
latency of single task, it
6 PM 7 8 9 helps throughput of entire
workload
Time
T Multiple tasks operating
simultaneously using
a 3030 30 30 30 3030 different resources
s A Potential speedup = Number
k pipe stages
B Pipeline rate limited by
O slowest pipeline stage
C
r Stages should take about
d D same amount of time
e Unbalanced lengths of pipe
stages reduces speedup
r
Time to “fill” pipeline and
time to “drain” it reduces
speedup
cs 152 L1 3 .12 Stall for Dependences 12
DAP Fa97,  U.CB
The Five Stages of MIPS Instructions, ex. Load
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5
Load Ifetch Reg/Dec Exec Mem Wr
Ifetch: Instruction Fetch

Fetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode

Exec: Calculate the memory address
Mem: Read the data from the Data Memory
Wr: Write the data back to the register file
base addressing
Ex: lw $1, 100 ($0) $1 = Memory[$0 + 100] 13
cs 152 L1 3 .13
DAP Fa97,  U.CB
We will limit our attention to 8 instrs
lw: load word (8ns) instruction fetch: 2ns

sw: store word (7ns) register read or write: 1ns
add: add (6ns) ALU operation: 2ns
sub: subtract (6ns) memory acces: 2ns
and: and (6ns)
or: or (6ns)
slt: set-less-than (compares 2 regs and sets a third to
1, if the first is less than the second, pg. 128) (6ns)
beq: branch-if-equal (branch to label if 2 regs are
equal, pg 123) (5ns)
Ex: beq $3, $4, L if $3 = $4 go to L
see Chapter 3 for descriptions of instructions
14
cs 152 L1 3 .14 DAP Fa97,  U.CB
Pipelining
Improve perfomance by increasing instruction

throughput
figure 6.3
Ideal speedup is number of stages in the pipeline. Do we
achieve this? 15
cs 152 L1 3 .15 DAP Fa97,  U.CB
Performance Metrics
In the single-cycle design model (every instruction
takes 1 clock cycle), the clock cycle must be
stretched to accommodate the slowest instruction
(Load is slowest and takes 8ns in MIPS).
In the pipelined design, the pipeline stages take a
single clock cycle. The clock cycle must be long
enough to accommodate the slowest operation
(2ns).
Assume write to reg file occurs in first half of clock
cycle and read from reg file occurs in second half of
cycle.
Speedup = execution time without pipeline /
execution time with pipeline = ideally to number of
stages.
(definition in pg. 101)
cs 152 L1 3 .16
16
DAP Fa97,  U.CB
Performance Improvement
pipe with 5 balanced stages: ideal speedup = 5,

8ns / 5 = 1.6ns clock cycle
stages may be imperfectly balanced and pipeline
involves some overhead, so time per instruction
(CPI) will exceed minimum possible and speedup
will be less than number of stages.
In ex., for 3 instrs: speedup = 24ns / 14ns = 1.7
If we add 1000 lw instrs (1 instr per 2ns):
speedup = 8024ns / 2014ns = 3.98 (almost 4)
Pipelining increases instruction throughput, as

opposed to decreasing the execution time of an
individual instruction.
Instruction throughput is the important metric!
cs 152 L1 3 .17
17
DAP Fa97,  U.CB
Basic Idea
fig 6.10 (pg. 450)

What do we need to add to actually split the datapath into
stages?
18
cs 152 L1 3 .18 DAP Fa97,  U.CB
Instruction sets for piplines
1. MIPS instrs are all the same length. Easier to

fetch instr in 1st stage and decode in 2nd stage.
2. MIPS has only a few instr formats, with the
source reg fields in the same place. 2nd stage can
start reading the reg file while HW is determining
what type of instr was fetched. Reg read + decode
in one stage.
3. Memory operands only appear in loads and
stores in MIPS. Use the execute stage to calculate
mem address and access mem in the next stage.
Otherwise stages 3 and 4 would break into 3
stages: address, access, execute.
4. Operands must be aligned in memory (chapt 3,
pg. 112). 1 data transfer takes 1 data mem access
(1 pipeline stage).
19
cs 152 L1 3 .19
DAP Fa97,  U.CB
Graphically Representing Pipelines
Can help with answering questions like:

how many cycles does it take to execute this code?
what is the ALU doing during cycle 4?
use this representation to help understand datapaths
20
cs 152 L1 3 .20 DAP Fa97,  U.CB
Conventional Pipelined Execution Representation
Time
IFetch Dcd Exec Mem WB

Program Flow
21
cs 152 L1 3 .21 DAP Fa97,  U.CB
Single Cycle, Multiple Cycle, vs. Pipeline
Cycle 1 Cycle 2
Clk
Single Cycle Implementation:
Load Store Waste
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10
Clk
Multiple Cycle Implementation:

Load Store R-type
Ifetch Reg Exec Mem Wr Ifetch Reg Exec Mem Ifetch
Pipeline Implementation:
Load Ifetch Reg Exec Mem Wr
Store Ifetch Reg Exec Mem Wr
R-type Ifetch Reg Exec Mem Wr 22

cs 152 L1 3 .22
DAP Fa97,  U.CB
Why Pipeline?
Suppose we execute 100 instructions

Single Cycle Machine
45 ns/cycle x 1 CPI x 100 inst = 4500 ns
Multicycle Machine
10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
Ideal pipelined machine

10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
speedup = 4500ns / 1040ns = 4.33
cs 152 L1 3 .23
23
DAP Fa97,  U.CB
Why Pipeline? Because the resources are there!
Time (clock cycles)
ALU
I Im Reg Dm Reg
n Inst 0
s
ALU
t
r.
Inst 1 Im Reg Dm Reg
ALU
O Inst 2 Im Reg Dm Reg
r
d Inst 3
ALU
Im Reg Dm Reg
e
r
Inst 4
ALU
Im Reg Dm Reg
24
cs 152 L1 3 .24
DAP Fa97,  U.CB
Can pipelining get us into trouble?
Yes: Pipeline Hazards (next instr can’t execute in next
cycle)
structural hazards: attempt to use the same resource two different
ways at the same time
- E.g., combined washer/dryer would be a structural hazard or
folder busy doing something else (watching TV)
data hazards: attempt to use item before it is ready
- E.g., one sock of pair in dryer and one in washer; can’t fold until
get sock from washer through dryer
- instruction depends on result of prior instruction still in the
pipeline
control hazards: attempt to make a decision before condition is
evaluated
- E.g., washing football uniforms and need to get proper
detergent level; need to see after dryer before next load in
- branch instructions
Can always resolve hazards by waiting

pipeline control must detect the hazard 25
cs 152 L1 3 .25
take action (or delay action) to resolve hazards DAP Fa97,  U.CB
Single Memory is a Structural Hazard
Time (clock cycles)
ALU
I Mem Reg Mem Reg
n Load
s
ALU
Mem Reg Mem Reg
t
r.
Instr 1
ALU
Mem Reg Mem Reg
O Instr 2
r
ALU
d Mem Reg Mem Reg
e
Instr 3
r
ALU
Mem Reg Mem Reg
Instr 4
Detection is easy in this case! (right half highlight means read, left half write)
26
cs 152 L1 3 .26
DAP Fa97,  U.CB
Structural Hazards limit performance
Example: if 1.3 memory accesses per instruction and

only one memory access per cycle, then 30% of
instrs are loads
average CPI = 0.7*1 + 0.3*2 = 1.3 cycles / instr

(30% of the time stall 1 cycle)
otherwise resource is more than 100% utilized
27
cs 152 L1 3 .27
DAP Fa97,  U.CB
Control Hazard Solutions
Stall (Bubble): wait until decision is clear

It’s possible to move up decision to 2nd stage by adding hardware
to check registers as being read, calculate branch address and
I update PC
Time (clock cycles)
n
ALU
s Mem Reg Mem Reg
t Add
r.
ALU
Mem Reg Mem Reg
Beq
O
r
Load
ALU
Mem Reg Mem Reg
d
e
r
Impact: 2 clock cycles per branch instruction

=> slow
28
cs 152 L1 3 .28 DAP Fa97,  U.CB
Control Hazards Limit Performance
Estimate average CPI for stalling on branches.

Suppose 17% of the instrs executed are branches
(gcc execution benchmark).
average CPI = 0.83*1 + 0.17*2 = 1.17 cycles / instr
29
cs 152 L1 3 .29
DAP Fa97,  U.CB
Predict: guess one direction then back up if wrong

Predict branch not taken is a common approach
I Time (clock cycles)

n
ALU
s Mem Reg Mem Reg
t Add
r.
ALU
Mem Reg Mem Reg
Beq
O
r
Load
ALU
Mem Reg Mem Reg
d
e
r
Impact: 1 clock cycle per branch instruction if right,

2 if wrong (right 50% of time)
30
cs 152 L1 3 .30
DAP Fa97,  U.CB
Dynamic Scheme
Dynamic HW predictors make guesses depending

on the behavior of each branch and may change
predictions for a branch over the life of a program.
Ex: Keep history of each branch as taken or not
taken and use the past to predict the future:
Such HW has 90% accuracy.
When the guess is wrong, the pipeline control must
ensure that the instrs following the wrongly
guessed branch have no effect, and must restart
the pipe from the proper branch address.
Longer pipes raise cost of mispredictions as is
usual with all solutions to control hazards.
31
cs 152 L1 3 .31
DAP Fa97,  U.CB
Redefine branch behavior (takes place after next
instruction) “delayed branch”. Place a “safe”instr after
the branch instr. The compiler rearranges instrs when
possible (doesn’t change program results).
I Time (clock cycles)
n
ALU
s Mem Reg Mem Reg
t Add
r.
ALU
Mem Reg Mem Reg
Beq
O
r
ALU
d Misc Mem Reg Mem Reg
ALU
r Load Mem Reg Mem Reg
Impact: 0 clock cycles per branch instruction if can find instruction to put
in “slot” ( 50% of time)
Longer pipes, more slots, harder to fill. 32
cs 152 L1 3 .32
DAP Fa97,  U.CB
Data Hazard on r1
An instr depends on the results of a previous instr still in the

pipe. Too many of these dependencies to be solved by a compiler
successfuly.
add r1 ,r2,r3
sub r4, r1 ,r3
and r6, r1 ,r7
or r8, r1 ,r9
xor r10, r1 ,r11
33
cs 152 L1 3 .33
DAP Fa97,  U.CB
Data Hazard on r1:
• Dependencies backwards in time are hazards
Time (clock cycles)

IF ID/RF EX MEM WB
ALU
I add r1,r2,r3 Im Reg Dm Reg
ALU
s Im Reg Dm Reg
t
sub r4,r1,r3
r.
ALU
Im Reg Dm Reg
and r6,r1,r7
O
ALU
r Im Reg Dm Reg
d or r8,r1,r9
e
ALU
Im Reg Dm Reg
r xor r10,r1,r11
cs 152 L1 3 .34
34
DAP Fa97,  U.CB
Data Hazard Solution:
• “Forward” result from one stage to another
Time (clock cycles)

IF ID/RF EX MEM WB
ALU
I add r1,r2,r3 Im Reg Dm Reg
ALU
s Im Reg Dm Reg
t
sub r4,r1,r3
r.
ALU
Im Reg Dm Reg
and r6,r1,r7
O
ALU
r Im Reg Dm Reg
d or r8,r1,r9
e
ALU
r xor r10,r1,r11 Im Reg Dm Reg
• “or” OK if define read/write properly

cs 152 L1 3 .35
35
DAP Fa97,  U.CB
Forwarding (or Bypassing): What about Loads
• Dependencies backwards in time are hazards

(shading indicates when operand is used by the instr)
Time (clock cycles)
IF ID/RF EX MEM WB
ALU
lw r1,0(r2) Im Reg Dm Reg
ALU
Im Reg Dm Reg
sub r4,r1,r3
• Can’t solve with forwarding:

• Must delay/stall instruction dependent on loads
Study example of reordering of code to avoid stalls caused

by data hazards (pg 447). 36
cs 152 L1 3 .36
DAP Fa97,  U.CB
6.2 Designing a Pipelined Processor
Go back and examine your datapath and control

diagram
associated resources with states
ensure that flows do not conflict, or figure out how to
resolve
assert control in appropriate stage
To retain the value of an individual instr for its other

four stages, the value output by one stage must be
saved in a regs. We must place regs where there
are dividing lines between the stages. These are
called pipeline regs.
37
cs 152 L1 3 .37
DAP Fa97,  U.CB
Pipelined Datapath (as in book); hard to read
IF ID EX MEM WB
Fig 6.12
Data flowing from right to left does not affect
the current instr, only later instrs in the pipe
38
cs 152 L1 3 .38
DAP Fa97,  U.CB
Pipelined Processor (almost) for slides
What happens if we start a new instruction every

cycle?
Inst. Mem Valid
WB Ctrl
Mem Ctrl
IRmem
IRwb
Dcd Ctrl
IRex
Ex Ctrl
IR
Equal
Reg.
File
Exec
Reg
A
Next PC
File
S
PC
Access
Mem
M
Mem
Data
39
cs 152 L1 3 .39
DAP Fa97,  U.CB
Control and Datapath
IR <- Mem[PC]; PC <– PC+4;
A <- R[rs]; B<– R[rt]
S <– A + B; S <– A or ZX; S <– A + SX; S <– A + SX; If Cond

PC < PC+SX;
M <– Mem[S] Mem[S] <- B
Equal
R[rd] <– S; R[rt] <– S; R[rd] <– M;
Reg.
Inst. Mem
File
Exec
Reg
A
Next PC
File
S
PC
IR
M
B
Access
Mem
Mem
Data
D 40
cs 152 L1 3 .40
DAP Fa97,  U.CB
Pipelining the Load Instruction
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7

Clock
1st lw Ifetch Reg/Dec Exec Mem Wr
2nd lw Ifetch Reg/Dec Exec Mem Wr
3rd lw Ifetch Reg/Dec Exec Mem Wr
The five independent functional units in the pipeline

datapath are:
Instruction Memory for the Ifetch stage
Register File’s Read ports (bus A and busB) for the Reg/Dec stage
ALU for the Exec stage
Data Memory for the Mem stage
Register File’s Write port (bus W) for the Wr stage
41
cs 152 L1 3.41
DAP Fa97,  U.CB
The Four Stages of R-type
Cycle 1 Cycle 2 Cycle 3 Cycle 4
R-type Ifetch Reg/Dec Exec Wr


Exec:
ALU operates on the two register operands
Update PC
Wr: Write the ALU output back to the register file
42
cs 152 L1 3 .42
DAP Fa97,  U.CB
Pipelining the R-type and Load Instruction
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9
Clock
R-type Ifetch Reg/Dec Exec Wr Ops! We have a problem!
We have pipeline conflict or structural hazard:

Two instructions try to write to the register file at the same time!
Only one write port
43
cs 152 L1 3 .43
DAP Fa97,  U.CB
Important Observation
Each functional unit can only be used once per

instruction
Each functional unit must be used at the same stage
for all instructions:
Load uses Register File’s Write Port during its 5th stage
1 2 3 4 5
R-type uses Register File’s Write Port during its 4th stage
1 2 3 4
° 2 ways to solve this pipeline hazard.
44
cs 152 L1 3 .44
DAP Fa97,  U.CB
Solution 1: Insert “Bubble” into the Pipeline

Clock
Ifetch Reg/Dec Exec Wr
R-type Ifetch Reg/Dec Pipeline Exec Wr

R-type Ifetch Bubble Reg/Dec Exec Wr
Ifetch Reg/Dec Exec
Insert a “bubble” into the pipeline to prevent 2 writes at

the same cycle
The control logic can be complex.
Lose instruction fetch and issue opportunity.
No instruction is started in Cycle 6!

45
cs 152 L1 3 .45
DAP Fa97,  U.CB
Solution 2: Delay R-type’s Write by One Cycle
Delay R-type’s register write by one cycle:

Now R-type instructions also use Reg File’s write port at Stage 5
Mem stage is a NOOP stage: nothing is being done.
1 2 3 4 5
R-type Ifetch Reg/Dec Exec Mem Wr

Clock
46
cs 152 L1 3 .46
DAP Fa97,  U.CB
Modified Control & Datapath
IR <- Mem[PC]; PC <– PC+4;
A <- R[rs]; B<– R[rt]
S <– A + B; S <– A or ZX; S <– A + SX; S <– A + SX; if Cond PC

< PC+SX;
M <– S M <– S M <– Mem[S] Mem[S] <- B
Equal
R[rd] <– M; R[rt] <– M; R[rd] <– M;
Reg.
Inst. Mem
File
Exec
Reg
A M
Next PC
File
S
PC
IR
Access
Mem
Mem
Data
D 47
cs 152 L1 3 .47
DAP Fa97,  U.CB
The Four Stages of Store
Store Ifetch Reg/Dec Exec Mem Wr


Exec: Calculate the memory address
Mem: Write the data into the Data Memory
48
cs 152 L1 3 .48
DAP Fa97,  U.CB
The Three Stages of Beq
Beq Ifetch Reg/Dec Exec Mem Wr

Reg/Dec:
Registers Fetch and Instruction Decode
Exec:
compares the two register operand,
select correct branch target address
latch into PC
49
cs 152 L1 3 .49 DAP Fa97,  U.CB
Control
Diagram
IR <- Mem[PC]; PC < PC+4;
A <- R[rs]; B<– R[rt]
S <– A + B; S <– A or ZX; S <– A + SX; S <– A + SX; If Cond PC

< PC+SX;
M <– S M <– S M <– Mem[S] Mem[S] <- B
Equal
R[rd] <– S; R[rt] <– S; R[rd] <– M;
Reg.
Inst. Mem
File
Exec
Reg
A M
Next PC
File
S
PC
IR
Access
Mem
Mem
Data
D
50
cs 152 L1 3 .50 DAP Fa97,  U.CB
Datapath + Data Stationary Control
IR v v v
fun rw rw
Inst. Mem
rw
Decode
wb wb wb
rt me me WB
rs ex Mem
op Ctrl
Ctrl
rs rt im
Reg.
File
Reg
File A M
Exec
S
Access
Mem
Mem
Data
D
Next PC
PC
cs 152 L1 3 .51
51
DAP Fa97,  U.CB
Let’s Try it Out
10 lw r1, r2(35)
14 addI r2, r2, 3
20 sub r3, r4, r5
24 beq r6, r7, 100
these addresses are octal
30 ori r8, r9, 17
34 add r10, r11, r12
100 and r13, r14, 15
52
cs 152 L1 3 .52
DAP Fa97,  U.CB
Start: Fetch 10
Inst. Mem n n n n
Decode
WB
Mem
Ctrl
Ctrl
IR im
rs rt
Reg.
File
Reg
A M
File
Exec
S
B 10 lw r1, r2(35)
Access
Mem
Mem
Data
D 14 addI r2, r2, 3
20 sub r3, r4, r5
Next PC
24 beq r6, r7, 100

30 ori r8, r9, 17
10
34 add r10, r11, r12

PC
100 and r13, r14, 15

cs 152 L1 3 .53 53
DAP Fa97,  U.CB
Fetch 14, Decode 10
n n n
lw r1, r2(35)
Inst. Mem
Decode
WB
Mem
Ctrl
Ctrl
IR im
2 rt
Reg.
File
Reg
A M
File
Exec
S
B 10 lw r1, r2(35)
Access
Mem
Mem
Data
D 14 addI r2, r2, 3
20 sub r3, r4, r5
Next PC
24 beq r6, r7, 100

30 ori r8, r9, 17
14
34 add r10, r11, r12

PC
100 and r13, r14, 15

cs 152 L1 3 .54 54
DAP Fa97,  U.CB
Fetch 20, Decode 14, Exec 10
addI r2, r2, 3

Inst. Mem n n
Decode
lw r1
WB
Mem
Ctrl
Ctrl
IR
35
2 rt
Reg.
File
Reg
r2 M
File
Exec
S
B 10 lw r1, r2(35)
Access
Mem
Mem
14 addI r2, r2, 3
Data
D
20 sub r3, r4, r5
24 beq r6, r7, 100
Next PC
30 ori r8, r9, 17

20
34 add r10, r11, r12

PC
100 and r13, r14, 15

cs 152 L1 3 .55
55
DAP Fa97,  U.CB
Fetch 24, Decode 20, Exec 14, Mem 10
sub r3, r4, r5

n
addI r2, r2, 3

Inst. Mem
Decode
lw r1
WB
Mem
Ctrl
Ctrl
IR
4 5
3
Reg.
r2+35
File
Reg
r2 M
File
Exec
B
10 lw r1, r2(35)
Access
Mem
Mem
Data
D 14 addI r2, r2, 3
20 sub r3, r4, r5
Next PC
24 beq r6, r7, 100

24
30 ori r8, r9, 17

34 add r10, r11, r12
PC
100 and r13, r14, 15

56
cs 152 L1 3 .56
DAP Fa97,  U.CB
Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
beq r6, r7 100

Inst. Mem
Decode
addI r2
sub r3
lw r1
WB
Mem
Ctrl
Ctrl
IR
6 7
M[r2+35]
Reg.
File
Reg
r2+3
r4
File
Exec
r5 10 lw r1, r2(35)
Access
Mem
Mem
Data
D 14 addI r2, r2, 3
20 sub r3, r4, r5
Next PC
24 beq r6, r7, 100

30 ori r8, r9, 17
30
34 add r10, r11, r12

PC
100 and r13, r14, 15

cs 152 L1 3 .57
57
DAP Fa97,  U.CB
ori r8, r9 17
Inst. Mem
Decode
addI r2
sub r3
WB
beq
Mem
Ctrl
Ctrl
r1=M[r2+35]
IR
100
9 xx
Reg.
r2+3
File
Reg
r4-r5
r6
File
Exec
r7 10 lw r1, r2(35)
Access
Mem
Mem
14 addI r2, r2, 3
Data
D
20 sub r3, r4, r5
24 beq r6, r7, 100
Next PC
30 ori r8, r9, 17

34
34 add r10, r11, r12

PC
100 and r13, r14, 15

58
cs 152 L1 3 58.
DAP Fa97,  U.CB
add r10, r11, r12

Inst. Mem
Decode
ori r8
sub r3
beq
WB
Mem
Ctrl
Ctrl
17
11 12
Reg.
IR r1=M[r2+35]
r4-r5
File
Reg
r9
File
Exec
xxx
r2 = r2+3
x 10 lw r1, r2(35)
Access
Mem
Mem
14 addI r2, r2, 3
Data
D
20 sub r3, r4, r5
24 beq r6, r7, 100
Next PC
100
30 ori r8, r9, 17
34 add r10, r11, r12
PC
ooops, we should have only one delayed instruction 100 and r13, r14, 15
cs 152 L1 3 .59
59
DAP Fa97,  U.CB
n
and r13, r14, r15

Inst. Mem
add r10
Decode
ori r8
beq
WB
Mem
Ctrl
Ctrl
xx
14 15
r9 | 17
Reg.
IR r1=M[r2+35]
File
xxx
Reg
r11
File
Exec
r2 = r2+3
r3 = r4-r5
r12 10 lw r1, r2(35)
Access
Mem
Mem
14 addI r2, r2, 3
Data
D
20 sub r3, r4, r5
24 beq r6, r7, 100
Next PC
104
30 ori r8, r9, 17
34 add r10, r11, r12
PC
100 and r13, r14, 15

Squash the extra instruction in the branch shadow! 60
cs 152 L1 3 .60
DAP Fa97,  U.CB
Inst. Mem n
add r10
Decode
and r13
ori r8
WB
Mem
Ctrl
Ctrl
xx
r11+r12
r9 | 17
Reg.
IR r1=M[r2+35]
File
Reg
r14
File
Exec
r2 = r2+3
r3 = r4-r5
r15 10 lw r1, r2(35)
Access
Mem
Mem
14 addI r2, r2, 3
Data
D
20 sub r3, r4, r5
24 beq r6, r7, 100
Next PC
110
30 ori r8, r9, 17
34 add r10, r11, r12
PC
100 and r13, r14, 15

Squash the extra instruction in the branch shadow! 61
cs 152 L1 3 .61
DAP Fa97,  U.CB
n
NO WB
Inst. Mem
and r13
add r10
Decode
NO Ovflow
WB
Mem
Ctrl
Ctrl
r14 & R15
r11+r12
Reg.
IR r1=M[r2+35]
File
Reg
File
Exec
r2 = r2+3
r3 = r4-r5
r8 = r9 | 17
Access
10 lw r1, r2(35)
Mem
Mem
Data
D 14 addI r2, r2, 3
20 sub r3, r4, r5
Next PC
24 beq r6, r7, 100

114
30 ori r8, r9, 17

34 add r10, r11, r12
PC
Squash the extra instruction in the branch shadow! 100 and r13, r14,
62 15
cs 152 L1 3 .62
DAP Fa97,  U.CB
Summary: Pipelining
What makes it easy

all instructions are the same length
just a few instruction formats
memory operands appear only in loads and stores
What makes it hard?

structural hazards: suppose we had only one memory
control hazards: need to worry about branch instructions
data hazards: an instruction depends on a previous instruction
We’ll build a simple pipeline and look at these issues
We’ll talk about modern processors and what really

makes it hard:
exception handling
trying to improve performance with out-of-order execution, etc.
63
cs 152 L1 3 .63
DAP Fa97,  U.CB
Summary
Pipelining is a fundamental concept

multiple steps using distinct resources
Utilize capabilities of the Datapath by pipelined

instruction processing
start next instruction while working on the current one
limited by length of longest stage (plus fill/flush)
detect and resolve hazards
64
cs 152 L1 3 .64
DAP Fa97,  U.CB

Pipe 1 New

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Pipe 1 New

Uploaded by

Copyright:

Available Formats

Curso de Arquiteturas Avançadas de Computadores

Professora: Roseli S. Wedemann

Texto: Computer Organization & Design, the Hardware

cs 152 L1 3 .1 DAP Fa97,  U.CB

Nota = [ (P1 + P2) / 2 + PF ] / 2

Specialize state-diagrams easily captured by

Control design reduces to Microprogramming

If simple instruction could execute at very high clock

Exceptions are the hard part of control

Need to find convenient place to detect exceptions

As we get pipelined, CPUs that support page faults

EPC: Exception Program Counter (stores return

R-type ORi LW BEQ

M <= MEM[S] MEM[S] <= B

The Five Classic Components of a Computer

Example: Stages of Laundry

Dryer takes 30 minutes

“Folder” takes 30 minutes

“Storer” takes 30 minutes

T 3030 30 30 3030 30 30 3030 30 30 3030 30 30

3030 30 30 30 3030 Time

Load Ifetch Reg/Dec Exec Mem Wr

Ifetch: Instruction Fetch

Reg/Dec: Registers Fetch and Instruction Decode

lw: load word (8ns) instruction fetch: 2ns

Improve perfomance by increasing instruction

pipe with 5 balanced stages: ideal speedup = 5,

Pipelining increases instruction throughput, as

fig 6.10 (pg. 450)

1. MIPS instrs are all the same length. Easier to

Can help with answering questions like:

IFetch Dcd Exec Mem WB

IFetch Dcd Exec Mem WB

IFetch Dcd Exec Mem WB

IFetch Dcd Exec Mem WB

IFetch Dcd Exec Mem WB

Multiple Cycle Implementation:

Store Ifetch Reg Exec Mem Wr

R-type Ifetch Reg Exec Mem Wr 22

Suppose we execute 100 instructions

Ideal pipelined machine

speedup = 4500ns / 1040ns = 4.33

Time (clock cycles)

Can always resolve hazards by waiting

Time (clock cycles)

Example: if 1.3 memory accesses per instruction and

average CPI = 0.7*1 + 0.3*2 = 1.3 cycles / instr

Stall (Bubble): wait until decision is clear

Impact: 2 clock cycles per branch instruction

Estimate average CPI for stalling on branches.

average CPI = 0.83*1 + 0.17*2 = 1.17 cycles / instr

Predict: guess one direction then back up if wrong

I Time (clock cycles)

Impact: 1 clock cycle per branch instruction if right,

Dynamic HW predictors make guesses depending

An instr depends on the results of a previous instr still in the

xor r10, r1 ,r11

• Dependencies backwards in time are hazards

Time (clock cycles)

• “Forward” result from one stage to another

Time (clock cycles)

• “or” OK if define read/write properly

• Dependencies backwards in time are hazards

average CPI = 0.71 + 0.32 = 1.3 cycles / instr

average CPI = 0.831 + 0.172 = 1.17 cycles / instr