Download as pdf or txt
Download as pdf or txt
You are on page 1of 64

Curso de Arquiteturas Avançadas de Computadores

Professora: Roseli S. Wedemann


roseli@ime.uerj.br
sala D6019

Texto: Computer Organization & Design, the Hardware


Software Interface, D. A. Patterson and J. L. Hennessy,
Morgan Kaufman Publishers

cs 152 L1 3 .1 DAP Fa97,  U.CB


Sobre o Curso

Critério de Aprovação:
duas provas + prova final
para passar sem prova final, média >= 7
para passar com prova final, média >= 5
direito à 2a. chamada
para ir para a prova final, média >= 4

Nota = [ (P1 + P2) / 2 + PF ] / 2

Conteúdo
Capítulo 6: Enhancing Performance with Pipelining
Capítulo 9: Multiprocessors
2
cs 152 L1 3 .2 DAP Fa97,  U.CB
Recap: Microprogramming

Specialize state-diagrams easily captured by


microsequencer
simple increment & “branch” fields
datapath control fields

Control design reduces to Microprogramming


Microprogramming is a fundamental concept
implement an instruction set by building a very simple processor
and interpreting the instructions
essential for very complex instructions and when few register
transfers are possible
overkill when ISA matches datapath 1-1

3
cs 152 L1 3 .3 DAP Fa97,  U.CB
Summary: Microprogramming one inspiration for RISC

If simple instruction could execute at very high clock


rate…
If you could even write compilers to produce
microinstructions…
If most programs use simple instructions and
addressing modes…
If microcode is kept in RAM instead of ROM so as to fix
bugs …
If same memory used for control memory could be used
instead as cache for “macroinstructions”…
Then why not skip instruction interpretation by a
microprogram and simply compile directly into lowest
language of machine? (microprogramming is overkill
when ISA matches datapath 1-1)
4
cs 152 L1 3 .4 DAP Fa97,  U.CB
Recap: Exceptions and Interrupts

Exceptions are the hard part of control

Need to find convenient place to detect exceptions


and to branch to state or microinstruction that
saves PC and invokes the operating system

As we get pipelined, CPUs that support page faults


on memory accesses, which means that the
instruction cannot complete AND you must be able
to restart the program at exactly the instruction
with the exception, it gets even harder

5
cs 152 L1 3 .5 DAP Fa97,  U.CB
Definitions of control variables

EPC: Exception Program Counter (stores return


address)
exp_addr: address of exception handler routine
cause: reg used for storing the cause of exception
rd: reg destination operand in instruction
rt: second reg source operand in instruction
SX: constant specified in instruction
ZX: zero extended immediate
Ori: S <= reg or ZX, S is a reg (see pg A-57)
(see chapter 3 and section 5.6)
6
cs 152 L1 3 .6 DAP Fa97,  U.CB
Modification to the Control
Specification
IR <= MEM[PC]
PC <= PC + 4 “instruction fetch”
0000
undefined instruction
“decode”
EPC <= PC - 4 EPC <= PC - 4
S<= PC +SX
PC <= exp_addr other PC <= exp_addr
cause <= 12 (Ovf) 0001 cause <= 10 (RI)

R-type ORi LW BEQ


SW
Execute

S <= AIf- AB = B
S <= A fun B S <= A op ZX S <= A + SX S <= A + SX then PC <= S
0100 0110 1000 1011 0010
overflow
Memory

M <= MEM[S] MEM[S] <= B


1001
1100

Write-back
R[rd] <= S R[rt] <= S R[rt] <= M
0101 0111 1010
7
The Big Picture: Where are We Now?

The Five Classic Components of a Computer

Processor
Input
Control
Memory

Datapath Output

Today’s Topics:
Pipelining by Analogy

8
6.1 An overview of Pipelining. It’s Natural!

Example: Stages of Laundry


Ann, Brian, Cathy, Dave A B C D
each have one load of clothes
to wash, dry, and fold
Washer takes 30 minutes

Dryer takes 30 minutes

“Folder” takes 30 minutes

“Storer” takes 30 minutes


to put clothes into drawers 9
cs 152 L1 3 .9 DAP Fa97,  U.CB
Sequential Laundry

6 PM 7 8 9 10 11 12 1 2 AM

T 3030 30 30 3030 30 30 3030 30 30 3030 30 30


a Time
A
s
k B
C
O
r D
d
e
r Sequential laundry takes 8 hours for 4 loads
If they learned pipelining, how long would laundry
take? 10
cs 152 L1 3 .10 DAP Fa97,  U.CB
Pipelined Laundry: Start work ASAP

6 PM 7 8 9 10 11 12 1 2 AM

3030 30 30 30 3030 Time


T
a A
s
k B
C
O
D
r
d
e
r
Pipelined laundry takes 3.5 hours for 4 loads!
11
cs 152 L1 3 .11 DAP Fa97,  U.CB
Pipelining Lessons
Pipelining doesn’t help
latency of single task, it
6 PM 7 8 9 helps throughput of entire
workload
Time
T Multiple tasks operating
simultaneously using
a 3030 30 30 30 3030 different resources
s A Potential speedup = Number
k pipe stages
B Pipeline rate limited by
O slowest pipeline stage
C
r Stages should take about
d D same amount of time
e Unbalanced lengths of pipe
stages reduces speedup
r
Time to “fill” pipeline and
time to “drain” it reduces
speedup
cs 152 L1 3 .12 Stall for Dependences 12
DAP Fa97,  U.CB
The Five Stages of MIPS Instructions, ex. Load
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

Load Ifetch Reg/Dec Exec Mem Wr

Ifetch: Instruction Fetch


Fetch the instruction from the Instruction Memory

Reg/Dec: Registers Fetch and Instruction Decode


Exec: Calculate the memory address
Mem: Read the data from the Data Memory
Wr: Write the data back to the register file

base addressing
Ex: lw $1, 100 ($0) $1 = Memory[$0 + 100] 13
cs 152 L1 3 .13
DAP Fa97,  U.CB
We will limit our attention to 8 instrs

lw: load word (8ns) instruction fetch: 2ns


sw: store word (7ns) register read or write: 1ns
add: add (6ns) ALU operation: 2ns
sub: subtract (6ns) memory acces: 2ns
and: and (6ns)
or: or (6ns)
slt: set-less-than (compares 2 regs and sets a third to
1, if the first is less than the second, pg. 128) (6ns)
beq: branch-if-equal (branch to label if 2 regs are
equal, pg 123) (5ns)
Ex: beq $3, $4, L if $3 = $4 go to L
see Chapter 3 for descriptions of instructions
14
cs 152 L1 3 .14 DAP Fa97,  U.CB
Pipelining

Improve perfomance by increasing instruction


throughput

figure 6.3
Ideal speedup is number of stages in the pipeline. Do we
achieve this? 15
cs 152 L1 3 .15 DAP Fa97,  U.CB
Performance Metrics
In the single-cycle design model (every instruction
takes 1 clock cycle), the clock cycle must be
stretched to accommodate the slowest instruction
(Load is slowest and takes 8ns in MIPS).
In the pipelined design, the pipeline stages take a
single clock cycle. The clock cycle must be long
enough to accommodate the slowest operation
(2ns).
Assume write to reg file occurs in first half of clock
cycle and read from reg file occurs in second half of
cycle.
Speedup = execution time without pipeline /
execution time with pipeline = ideally to number of
stages.
(definition in pg. 101)

cs 152 L1 3 .16
16
DAP Fa97,  U.CB
Performance Improvement

pipe with 5 balanced stages: ideal speedup = 5,


8ns / 5 = 1.6ns clock cycle
stages may be imperfectly balanced and pipeline
involves some overhead, so time per instruction
(CPI) will exceed minimum possible and speedup
will be less than number of stages.
In ex., for 3 instrs: speedup = 24ns / 14ns = 1.7
If we add 1000 lw instrs (1 instr per 2ns):
speedup = 8024ns / 2014ns = 3.98 (almost 4)

Pipelining increases instruction throughput, as


opposed to decreasing the execution time of an
individual instruction.
Instruction throughput is the important metric!
cs 152 L1 3 .17
17
DAP Fa97,  U.CB
Basic Idea

fig 6.10 (pg. 450)


What do we need to add to actually split the datapath into
stages?
18
cs 152 L1 3 .18 DAP Fa97,  U.CB
Instruction sets for piplines

1. MIPS instrs are all the same length. Easier to


fetch instr in 1st stage and decode in 2nd stage.
2. MIPS has only a few instr formats, with the
source reg fields in the same place. 2nd stage can
start reading the reg file while HW is determining
what type of instr was fetched. Reg read + decode
in one stage.
3. Memory operands only appear in loads and
stores in MIPS. Use the execute stage to calculate
mem address and access mem in the next stage.
Otherwise stages 3 and 4 would break into 3
stages: address, access, execute.
4. Operands must be aligned in memory (chapt 3,
pg. 112). 1 data transfer takes 1 data mem access
(1 pipeline stage).

19
cs 152 L1 3 .19
DAP Fa97,  U.CB
Graphically Representing Pipelines

Can help with answering questions like:


how many cycles does it take to execute this code?
what is the ALU doing during cycle 4?
use this representation to help understand datapaths

20
cs 152 L1 3 .20 DAP Fa97,  U.CB
Conventional Pipelined Execution Representation

Time

IFetch Dcd Exec Mem WB

IFetch Dcd Exec Mem WB

IFetch Dcd Exec Mem WB

IFetch Dcd Exec Mem WB

IFetch Dcd Exec Mem WB


Program Flow
IFetch Dcd Exec Mem WB

21
cs 152 L1 3 .21 DAP Fa97,  U.CB
Single Cycle, Multiple Cycle, vs. Pipeline
Cycle 1 Cycle 2
Clk
Single Cycle Implementation:
Load Store Waste

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10
Clk

Multiple Cycle Implementation:


Load Store R-type
Ifetch Reg Exec Mem Wr Ifetch Reg Exec Mem Ifetch

Pipeline Implementation:
Load Ifetch Reg Exec Mem Wr

Store Ifetch Reg Exec Mem Wr

R-type Ifetch Reg Exec Mem Wr 22


cs 152 L1 3 .22
DAP Fa97,  U.CB
Why Pipeline?

Suppose we execute 100 instructions


Single Cycle Machine
45 ns/cycle x 1 CPI x 100 inst = 4500 ns

Multicycle Machine
10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns

Ideal pipelined machine


10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns

speedup = 4500ns / 1040ns = 4.33

cs 152 L1 3 .23
23
DAP Fa97,  U.CB
Why Pipeline? Because the resources are there!

Time (clock cycles)

ALU
I Im Reg Dm Reg
n Inst 0
s

ALU
t
r.
Inst 1 Im Reg Dm Reg

ALU
O Inst 2 Im Reg Dm Reg
r
d Inst 3

ALU
Im Reg Dm Reg
e
r
Inst 4

ALU
Im Reg Dm Reg

24
cs 152 L1 3 .24
DAP Fa97,  U.CB
Can pipelining get us into trouble?
Yes: Pipeline Hazards (next instr can’t execute in next
cycle)
structural hazards: attempt to use the same resource two different
ways at the same time
- E.g., combined washer/dryer would be a structural hazard or
folder busy doing something else (watching TV)
data hazards: attempt to use item before it is ready
- E.g., one sock of pair in dryer and one in washer; can’t fold until
get sock from washer through dryer
- instruction depends on result of prior instruction still in the
pipeline
control hazards: attempt to make a decision before condition is
evaluated
- E.g., washing football uniforms and need to get proper
detergent level; need to see after dryer before next load in
- branch instructions

Can always resolve hazards by waiting


pipeline control must detect the hazard 25
cs 152 L1 3 .25
take action (or delay action) to resolve hazards DAP Fa97,  U.CB
Single Memory is a Structural Hazard

Time (clock cycles)

ALU
I Mem Reg Mem Reg
n Load
s

ALU
Mem Reg Mem Reg
t
r.
Instr 1

ALU
Mem Reg Mem Reg
O Instr 2
r

ALU
d Mem Reg Mem Reg
e
Instr 3
r

ALU
Mem Reg Mem Reg
Instr 4

Detection is easy in this case! (right half highlight means read, left half write)
26
cs 152 L1 3 .26
DAP Fa97,  U.CB
Structural Hazards limit performance

Example: if 1.3 memory accesses per instruction and


only one memory access per cycle, then 30% of
instrs are loads

average CPI = 0.7*1 + 0.3*2 = 1.3 cycles / instr


(30% of the time stall 1 cycle)
otherwise resource is more than 100% utilized

27
cs 152 L1 3 .27
DAP Fa97,  U.CB
Control Hazard Solutions

Stall (Bubble): wait until decision is clear


It’s possible to move up decision to 2nd stage by adding hardware
to check registers as being read, calculate branch address and
I update PC
Time (clock cycles)
n

ALU
s Mem Reg Mem Reg
t Add
r.

ALU
Mem Reg Mem Reg
Beq
O
r
Load

ALU
Mem Reg Mem Reg
d
e
r

Impact: 2 clock cycles per branch instruction


=> slow
28
cs 152 L1 3 .28 DAP Fa97,  U.CB
Control Hazards Limit Performance

Estimate average CPI for stalling on branches.


Suppose 17% of the instrs executed are branches
(gcc execution benchmark).

average CPI = 0.83*1 + 0.17*2 = 1.17 cycles / instr

29
cs 152 L1 3 .29
DAP Fa97,  U.CB
Control Hazard Solutions

Predict: guess one direction then back up if wrong


Predict branch not taken is a common approach

I Time (clock cycles)


n

ALU
s Mem Reg Mem Reg
t Add
r.

ALU
Mem Reg Mem Reg
Beq
O
r
Load
ALU
Mem Reg Mem Reg
d
e
r

Impact: 1 clock cycle per branch instruction if right,


2 if wrong (right 50% of time)
30
cs 152 L1 3 .30
DAP Fa97,  U.CB
Dynamic Scheme

Dynamic HW predictors make guesses depending


on the behavior of each branch and may change
predictions for a branch over the life of a program.
Ex: Keep history of each branch as taken or not
taken and use the past to predict the future:
Such HW has 90% accuracy.
When the guess is wrong, the pipeline control must
ensure that the instrs following the wrongly
guessed branch have no effect, and must restart
the pipe from the proper branch address.
Longer pipes raise cost of mispredictions as is
usual with all solutions to control hazards.

31
cs 152 L1 3 .31
DAP Fa97,  U.CB
Control Hazard Solutions
Redefine branch behavior (takes place after next
instruction) “delayed branch”. Place a “safe”instr after
the branch instr. The compiler rearranges instrs when
possible (doesn’t change program results).
I Time (clock cycles)
n

ALU
s Mem Reg Mem Reg
t Add
r.

ALU
Mem Reg Mem Reg
Beq
O
r

ALU
d Misc Mem Reg Mem Reg

ALU
r Load Mem Reg Mem Reg

Impact: 0 clock cycles per branch instruction if can find instruction to put
in “slot” ( 50% of time)
Longer pipes, more slots, harder to fill. 32
cs 152 L1 3 .32
DAP Fa97,  U.CB
Data Hazard on r1

An instr depends on the results of a previous instr still in the


pipe. Too many of these dependencies to be solved by a compiler
successfuly.

add r1 ,r2,r3
sub r4, r1 ,r3
and r6, r1 ,r7

or r8, r1 ,r9

xor r10, r1 ,r11

33
cs 152 L1 3 .33
DAP Fa97,  U.CB
Data Hazard on r1:

• Dependencies backwards in time are hazards

Time (clock cycles)


IF ID/RF EX MEM WB

ALU
I add r1,r2,r3 Im Reg Dm Reg

ALU
s Im Reg Dm Reg
t
sub r4,r1,r3
r.

ALU
Im Reg Dm Reg
and r6,r1,r7
O

ALU
r Im Reg Dm Reg
d or r8,r1,r9
e

ALU
Im Reg Dm Reg
r xor r10,r1,r11

cs 152 L1 3 .34
34
DAP Fa97,  U.CB
Data Hazard Solution:

• “Forward” result from one stage to another

Time (clock cycles)


IF ID/RF EX MEM WB

ALU
I add r1,r2,r3 Im Reg Dm Reg

ALU
s Im Reg Dm Reg
t
sub r4,r1,r3
r.

ALU
Im Reg Dm Reg
and r6,r1,r7
O

ALU
r Im Reg Dm Reg
d or r8,r1,r9
e

ALU
r xor r10,r1,r11 Im Reg Dm Reg

• “or” OK if define read/write properly


cs 152 L1 3 .35
35
DAP Fa97,  U.CB
Forwarding (or Bypassing): What about Loads

• Dependencies backwards in time are hazards


(shading indicates when operand is used by the instr)
Time (clock cycles)
IF ID/RF EX MEM WB

ALU
lw r1,0(r2) Im Reg Dm Reg

ALU
Im Reg Dm Reg
sub r4,r1,r3

• Can’t solve with forwarding:


• Must delay/stall instruction dependent on loads

Study example of reordering of code to avoid stalls caused


by data hazards (pg 447). 36
cs 152 L1 3 .36
DAP Fa97,  U.CB
6.2 Designing a Pipelined Processor

Go back and examine your datapath and control


diagram
associated resources with states
ensure that flows do not conflict, or figure out how to
resolve
assert control in appropriate stage

To retain the value of an individual instr for its other


four stages, the value output by one stage must be
saved in a regs. We must place regs where there
are dividing lines between the stages. These are
called pipeline regs.

37
cs 152 L1 3 .37
DAP Fa97,  U.CB
Pipelined Datapath (as in book); hard to read

IF ID EX MEM WB

Fig 6.12
Data flowing from right to left does not affect
the current instr, only later instrs in the pipe
38
cs 152 L1 3 .38
DAP Fa97,  U.CB
Pipelined Processor (almost) for slides

What happens if we start a new instruction every


cycle?

Inst. Mem Valid

WB Ctrl
Mem Ctrl
IRmem

IRwb
Dcd Ctrl

IRex

Ex Ctrl
IR

Equal
Reg.
File
Exec
Reg

A
Next PC

File

S
PC

Access
Mem
M

Mem
Data
39
cs 152 L1 3 .39
DAP Fa97,  U.CB
Control and Datapath

IR <- Mem[PC]; PC <– PC+4;

A <- R[rs]; B<– R[rt]

S <– A + B; S <– A or ZX; S <– A + SX; S <– A + SX; If Cond


PC < PC+SX;

M <– Mem[S] Mem[S] <- B

Equal
R[rd] <– S; R[rt] <– S; R[rd] <– M;

Reg.
Inst. Mem

File
Exec
Reg

A
Next PC

File

S
PC

IR

M
B

Access
Mem

Mem
Data
D 40
cs 152 L1 3 .40
DAP Fa97,  U.CB
Pipelining the Load Instruction

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7


Clock

1st lw Ifetch Reg/Dec Exec Mem Wr

2nd lw Ifetch Reg/Dec Exec Mem Wr

3rd lw Ifetch Reg/Dec Exec Mem Wr

The five independent functional units in the pipeline


datapath are:
Instruction Memory for the Ifetch stage
Register File’s Read ports (bus A and busB) for the Reg/Dec stage
ALU for the Exec stage
Data Memory for the Mem stage
Register File’s Write port (bus W) for the Wr stage

41
cs 152 L1 3.41
DAP Fa97,  U.CB
The Four Stages of R-type

Cycle 1 Cycle 2 Cycle 3 Cycle 4

R-type Ifetch Reg/Dec Exec Wr

Ifetch: Instruction Fetch


Fetch the instruction from the Instruction Memory

Reg/Dec: Registers Fetch and Instruction Decode


Exec:
ALU operates on the two register operands
Update PC

Wr: Write the ALU output back to the register file

42
cs 152 L1 3 .42
DAP Fa97,  U.CB
Pipelining the R-type and Load Instruction
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9
Clock

R-type Ifetch Reg/Dec Exec Wr Ops! We have a problem!

R-type Ifetch Reg/Dec Exec Wr

Load Ifetch Reg/Dec Exec Mem Wr

R-type Ifetch Reg/Dec Exec Wr

R-type Ifetch Reg/Dec Exec Wr

We have pipeline conflict or structural hazard:


Two instructions try to write to the register file at the same time!
Only one write port

43
cs 152 L1 3 .43
DAP Fa97,  U.CB
Important Observation

Each functional unit can only be used once per


instruction
Each functional unit must be used at the same stage
for all instructions:
Load uses Register File’s Write Port during its 5th stage
1 2 3 4 5
Load Ifetch Reg/Dec Exec Mem Wr

R-type uses Register File’s Write Port during its 4th stage

1 2 3 4
R-type Ifetch Reg/Dec Exec Wr

° 2 ways to solve this pipeline hazard.

44
cs 152 L1 3 .44
DAP Fa97,  U.CB
Solution 1: Insert “Bubble” into the Pipeline

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9


Clock
Ifetch Reg/Dec Exec Wr
Load Ifetch Reg/Dec Exec Mem Wr

R-type Ifetch Reg/Dec Exec Wr

R-type Ifetch Reg/Dec Pipeline Exec Wr


R-type Ifetch Bubble Reg/Dec Exec Wr
Ifetch Reg/Dec Exec

Insert a “bubble” into the pipeline to prevent 2 writes at


the same cycle
The control logic can be complex.
Lose instruction fetch and issue opportunity.

No instruction is started in Cycle 6!


45
cs 152 L1 3 .45
DAP Fa97,  U.CB
Solution 2: Delay R-type’s Write by One Cycle

Delay R-type’s register write by one cycle:


Now R-type instructions also use Reg File’s write port at Stage 5
Mem stage is a NOOP stage: nothing is being done.
1 2 3 4 5
R-type Ifetch Reg/Dec Exec Mem Wr

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9


Clock

R-type Ifetch Reg/Dec Exec Mem Wr

R-type Ifetch Reg/Dec Exec Mem Wr

Load Ifetch Reg/Dec Exec Mem Wr

R-type Ifetch Reg/Dec Exec Mem Wr

R-type Ifetch Reg/Dec Exec Mem Wr

46
cs 152 L1 3 .46
DAP Fa97,  U.CB
Modified Control & Datapath

IR <- Mem[PC]; PC <– PC+4;

A <- R[rs]; B<– R[rt]

S <– A + B; S <– A or ZX; S <– A + SX; S <– A + SX; if Cond PC


< PC+SX;

M <– S M <– S M <– Mem[S] Mem[S] <- B

Equal
R[rd] <– M; R[rt] <– M; R[rd] <– M;

Reg.
Inst. Mem

File
Exec
Reg

A M
Next PC

File

S
PC

IR

Access
Mem

Mem
Data
D 47
cs 152 L1 3 .47
DAP Fa97,  U.CB
The Four Stages of Store

Cycle 1 Cycle 2 Cycle 3 Cycle 4

Store Ifetch Reg/Dec Exec Mem Wr

Ifetch: Instruction Fetch


Fetch the instruction from the Instruction Memory

Reg/Dec: Registers Fetch and Instruction Decode


Exec: Calculate the memory address
Mem: Write the data into the Data Memory

48
cs 152 L1 3 .48
DAP Fa97,  U.CB
The Three Stages of Beq

Cycle 1 Cycle 2 Cycle 3 Cycle 4

Beq Ifetch Reg/Dec Exec Mem Wr

Ifetch: Instruction Fetch


Fetch the instruction from the Instruction Memory

Reg/Dec:
Registers Fetch and Instruction Decode

Exec:
compares the two register operand,
select correct branch target address
latch into PC

49
cs 152 L1 3 .49 DAP Fa97,  U.CB
Control
Diagram
IR <- Mem[PC]; PC < PC+4;

A <- R[rs]; B<– R[rt]

S <– A + B; S <– A or ZX; S <– A + SX; S <– A + SX; If Cond PC


< PC+SX;

M <– S M <– S M <– Mem[S] Mem[S] <- B

Equal
R[rd] <– S; R[rt] <– S; R[rd] <– M;

Reg.
Inst. Mem

File
Exec
Reg

A M
Next PC

File

S
PC

IR

Access
Mem

Mem
Data
D
50
cs 152 L1 3 .50 DAP Fa97,  U.CB
Datapath + Data Stationary Control

IR v v v
fun rw rw

Inst. Mem
rw

Decode
wb wb wb
rt me me WB
rs ex Mem
op Ctrl
Ctrl
rs rt im

Reg.
File
Reg
File A M

Exec
S

Access
Mem

Mem
Data
D
Next PC

PC

cs 152 L1 3 .51
51
DAP Fa97,  U.CB
Let’s Try it Out

10 lw r1, r2(35)
14 addI r2, r2, 3
20 sub r3, r4, r5
24 beq r6, r7, 100
these addresses are octal
30 ori r8, r9, 17
34 add r10, r11, r12

100 and r13, r14, 15

52
cs 152 L1 3 .52
DAP Fa97,  U.CB
Start: Fetch 10

Inst. Mem n n n n

Decode
WB
Mem
Ctrl
Ctrl
IR im
rs rt

Reg.
File
Reg

A M
File

Exec
S
B 10 lw r1, r2(35)

Access
Mem

Mem
Data
D 14 addI r2, r2, 3
20 sub r3, r4, r5
Next PC

24 beq r6, r7, 100


30 ori r8, r9, 17
10

34 add r10, r11, r12


PC

100 and r13, r14, 15


cs 152 L1 3 .53 53
DAP Fa97,  U.CB
Fetch 14, Decode 10

n n n
lw r1, r2(35)
Inst. Mem

Decode
WB
Mem
Ctrl
Ctrl
IR im
2 rt

Reg.
File
Reg

A M
File

Exec
S
B 10 lw r1, r2(35)

Access
Mem

Mem
Data
D 14 addI r2, r2, 3
20 sub r3, r4, r5
Next PC

24 beq r6, r7, 100


30 ori r8, r9, 17
14

34 add r10, r11, r12


PC

100 and r13, r14, 15


cs 152 L1 3 .54 54
DAP Fa97,  U.CB
Fetch 20, Decode 14, Exec 10

addI r2, r2, 3


Inst. Mem n n

Decode

lw r1
WB
Mem
Ctrl
Ctrl
IR

35
2 rt

Reg.
File
Reg

r2 M
File

Exec
S
B 10 lw r1, r2(35)

Access
Mem

Mem
14 addI r2, r2, 3

Data
D
20 sub r3, r4, r5
24 beq r6, r7, 100
Next PC

30 ori r8, r9, 17


20

34 add r10, r11, r12


PC

100 and r13, r14, 15


cs 152 L1 3 .55
55
DAP Fa97,  U.CB
Fetch 24, Decode 20, Exec 14, Mem 10

sub r3, r4, r5


n

addI r2, r2, 3


Inst. Mem

Decode

lw r1
WB
Mem
Ctrl
Ctrl
IR
4 5
3

Reg.
r2+35

File
Reg

r2 M
File

Exec
B
10 lw r1, r2(35)

Access
Mem

Mem
Data
D 14 addI r2, r2, 3
20 sub r3, r4, r5
Next PC

24 beq r6, r7, 100


24

30 ori r8, r9, 17


34 add r10, r11, r12
PC

100 and r13, r14, 15


56
cs 152 L1 3 .56
DAP Fa97,  U.CB
Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10

beq r6, r7 100


Inst. Mem

Decode

addI r2
sub r3

lw r1
WB
Mem
Ctrl
Ctrl
IR
6 7

M[r2+35]

Reg.
File
Reg

r2+3
r4
File

Exec
r5 10 lw r1, r2(35)

Access
Mem

Mem
Data
D 14 addI r2, r2, 3
20 sub r3, r4, r5
Next PC

24 beq r6, r7, 100


30 ori r8, r9, 17
30

34 add r10, r11, r12


PC

100 and r13, r14, 15


cs 152 L1 3 .57
57
DAP Fa97,  U.CB
Fetch 34, Dcd 30, Ex 24, Mem 20, WB 14

ori r8, r9 17
Inst. Mem

Decode

addI r2
sub r3
WB

beq
Mem
Ctrl
Ctrl

r1=M[r2+35]
IR

100
9 xx

Reg.
r2+3

File
Reg

r4-r5
r6
File

Exec
r7 10 lw r1, r2(35)

Access
Mem

Mem
14 addI r2, r2, 3

Data
D
20 sub r3, r4, r5
24 beq r6, r7, 100
Next PC

30 ori r8, r9, 17


34

34 add r10, r11, r12


PC

100 and r13, r14, 15


58
cs 152 L1 3 58.
DAP Fa97,  U.CB
Fetch 100, Dcd 34, Ex 30, Mem 24, WB 20

add r10, r11, r12


Inst. Mem

Decode

ori r8

sub r3
beq
WB
Mem
Ctrl
Ctrl

17
11 12

Reg.
IR r1=M[r2+35]

r4-r5

File
Reg

r9
File

Exec

xxx
r2 = r2+3

x 10 lw r1, r2(35)

Access
Mem

Mem
14 addI r2, r2, 3

Data
D
20 sub r3, r4, r5
24 beq r6, r7, 100
Next PC

100
30 ori r8, r9, 17
34 add r10, r11, r12
PC

ooops, we should have only one delayed instruction 100 and r13, r14, 15
cs 152 L1 3 .59
59
DAP Fa97,  U.CB
Fetch 104, Dcd 100, Ex 34, Mem 30, WB 24
n

and r13, r14, r15


Inst. Mem

add r10
Decode

ori r8

beq
WB
Mem
Ctrl
Ctrl

xx
14 15

r9 | 17

Reg.
IR r1=M[r2+35]

File
xxx
Reg

r11
File

Exec
r2 = r2+3
r3 = r4-r5
r12 10 lw r1, r2(35)

Access
Mem

Mem
14 addI r2, r2, 3

Data
D
20 sub r3, r4, r5
24 beq r6, r7, 100
Next PC

104
30 ori r8, r9, 17
34 add r10, r11, r12
PC

100 and r13, r14, 15


Squash the extra instruction in the branch shadow! 60
cs 152 L1 3 .60
DAP Fa97,  U.CB
Fetch 108, Dcd 104, Ex 100, Mem 34, WB 30

Inst. Mem n

add r10
Decode

and r13

ori r8
WB
Mem
Ctrl
Ctrl

xx

r11+r12

r9 | 17

Reg.
IR r1=M[r2+35]

File
Reg

r14
File

Exec
r2 = r2+3
r3 = r4-r5
r15 10 lw r1, r2(35)

Access
Mem

Mem
14 addI r2, r2, 3

Data
D
20 sub r3, r4, r5
24 beq r6, r7, 100
Next PC

110
30 ori r8, r9, 17
34 add r10, r11, r12
PC

100 and r13, r14, 15


Squash the extra instruction in the branch shadow! 61
cs 152 L1 3 .61
DAP Fa97,  U.CB
Fetch 114, Dcd 110, Ex 104, Mem 100, WB 34
n
NO WB
Inst. Mem

and r13

add r10
Decode
NO Ovflow
WB
Mem
Ctrl
Ctrl

r14 & R15

r11+r12

Reg.
IR r1=M[r2+35]

File
Reg
File

Exec
r2 = r2+3
r3 = r4-r5
r8 = r9 | 17

Access
10 lw r1, r2(35)

Mem

Mem
Data
D 14 addI r2, r2, 3
20 sub r3, r4, r5
Next PC

24 beq r6, r7, 100


114

30 ori r8, r9, 17


34 add r10, r11, r12
PC

Squash the extra instruction in the branch shadow! 100 and r13, r14,
62 15
cs 152 L1 3 .62
DAP Fa97,  U.CB
Summary: Pipelining

What makes it easy


all instructions are the same length
just a few instruction formats
memory operands appear only in loads and stores

What makes it hard?


structural hazards: suppose we had only one memory
control hazards: need to worry about branch instructions
data hazards: an instruction depends on a previous instruction

We’ll build a simple pipeline and look at these issues

We’ll talk about modern processors and what really


makes it hard:
exception handling
trying to improve performance with out-of-order execution, etc.
63
cs 152 L1 3 .63
DAP Fa97,  U.CB
Summary

Pipelining is a fundamental concept


multiple steps using distinct resources

Utilize capabilities of the Datapath by pipelined


instruction processing
start next instruction while working on the current one
limited by length of longest stage (plus fill/flush)
detect and resolve hazards

64
cs 152 L1 3 .64
DAP Fa97,  U.CB

You might also like