Professional Documents
Culture Documents
ece4750-T02-proc-uarch
ece4750-T02-proc-uarch
revision: 2021-09-12-02-56
1
5 Pipeline Hazards: RAW Data Hazards 39
5.1. Expose in Instruction Set Architecture . . . . . . . . . . . . . . . . . 41
5.2. Hardware Stalling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3. Hardware Bypassing/Forwarding . . . . . . . . . . . . . . . . . . . 43
5.4. RAW Data Hazards Through Memory . . . . . . . . . . . . . . . . . 47
Copyright © 2016 Christopher Batten. All rights reserved. This handout was originally
prepared by Prof. Christopher Batten at Cornell University for ECE 4750 / CS 4420. It has
2
been updated in 2017-2021 by Prof. Christina Delimitrou. Download and use of this hand-
out is permitted for individual educational non-commercial purposes only. Redistribution
either in part or in whole via both commercial or non-commercial means requires written
permission.
3
1. Processor Microarchitectural Design Patterns 2.0. Transactions and Steps
3
1. Processor Microarchitectural Design Patterns
1.2. Microarchitecture: Control/Datapath Split
Control Unit
Datapath
Memory
4
2. TinyRV1 Single-Cycle Processor 2.1. High-Level Idea for Single-Cycle Processors
Technology Constraints
• Assume technology where Control Unit
logic is not too expensive, so Control Status
we do not need to overly
Datapath
minimize the number of
registers and combinational
logic
regfile
• Assume multi-ported register
file with a reasonable number
of ports is feasible imem imem dmem dmem
req resp req resp
• Assume a dual-ported Memory
combinational memory <1 cycle
combinational
5
2. TinyRV1 Single-Cycle Processor 2.1. High-Level Idea for Single-Cycle Processors
Fetch Decode Reg Read Write Fetch Decode Reg Write Fetch Decode Write
Inst Inst Arith Mem Reg Inst Inst Arith Reg Inst Inst Reg
6
2. TinyRV1 Single-Cycle Processor 2.2. Single-Cycle Processor Datapath
regfile regfile
(read) (write)
pc
ADDI 31 20 19 15 14 12 11 7 6 0
regfile regfile
(read) (write)
pc
7
2. TinyRV1 Single-Cycle Processor 2.2. Single-Cycle Processor Datapath
To control unit
ir[11:7]
alu
(read) (write)
ir[24:20]
+4
ir[31:7]
imm
pc gen
op2_sel
imemreq. imemresp.
addr data
MUL 31 25 24 20 19 15 14 12 11 7 6 0
wb_sel
(read) (write)
ir[24:20]
+4
ir[31:7]
imm
pc gen
op2_sel
imemreq. imemresp.
addr data
8
2. TinyRV1 Single-Cycle Processor 2.2. Single-Cycle Processor Datapath
LW 31 20 19 15 14 12 11 7 6 0
mul
wb_sel
alu
(read) (write)
ir[24:20]
+4
ir[31:7]
imm
pc gen
op2_sel
SW 31 25 24 20 19 15 14 12 11 7 6 0
PC ← PC + 4
To control unit
ir[11:7]
mul
rf_wen
wb_sel
(read) (write)
ir[24:20]
+4
ir[31:7]
imm
pc gen
imm_type op2_sel
9
2. TinyRV1 Single-Cycle Processor 2.2. Single-Cycle Processor Datapath
JAL 31 12 11 7 6 0
mul
jalbr_targ rf_wen
wb_sel
alu
(read) (write)
ir[24:20]
+4
ir[31:7]
imm alu_func
pc gen
pc_sel
imm_type op2_sel
+
JR 31 20 19 15 14 12 11 7 6 0
PC ← R[rs1]
To control unit
jr_targ ir[11:7]
mul
jalbr_targ rf_wen
wb_sel
(read) (write)
ir[24:20]
+4
ir[31:7]
imm alu_func
pc gen
pc_sel
imm_type op2_sel
+
10
2. TinyRV1 Single-Cycle Processor 2.2. Single-Cycle Processor Datapath
BNE 31 25 24 20 19 15 14 12 11 7 6 0
bne rs1, rs2, imm imm rs2 rs1 001 imm 1100011
mul
jalbr_targ rf_wen
eq wb_sel
alu
(read) (write)
ir[24:20]
+4
ir[31:7]
imm alu_func
pc gen
pc_sel
imm_type op2_sel
+
11
2. TinyRV1 Single-Cycle Processor 2.2. Single-Cycle Processor Datapath
LW.AI 31 20 19 15 14 12 11 7 6 0
mul
jalbr_targ rf_wen
eq wb_sel
alu
(read) (write)
ir[24:20]
+4
ir[31:7]
imm alu_func
pc gen
pc_sel
imm_type op2_sel
+
12
2. TinyRV1 Single-Cycle Processor 2.3. Single-Cycle Processor Control Unit
imem dmem
pc imm op1 alu wb rf req req
inst sel type sel func sel wen val val
addi
sw
jal
jr jr i – – – 0 1 0
bne
13
2. TinyRV1 Single-Cycle Processor 2.4. Analyzing Performance
There are many paths through the design that start at a state element
and end at a state element. The “critical path” is the longest path across
all of these paths. We can usually use a simple first-order static timing
estimate to estimate the cycle time (i.e., the clock period and thus also
the clock frequency).
To control unit
jr_targ ir[11:7]
mul
jalbr_targ rf_wen
eq wb_sel
alu
(read) (write)
ir[24:20]
+4
ir[31:7]
imm alu_func
pc gen
pc_sel
imm_type op2_sel
+
• register read = 1τ
• register write = 1τ
• regfile read = 10τ
• regfile write = 10τ
• memory read = 20τ
• memory write = 20τ
• +4 unit = 4τ
• immgen = 2τ
• mux = 3τ
• multiplier = 20τ
• alu = 10τ
• adder = 8τ
14
2. TinyRV1 Single-Cycle Processor 2.4. Analyzing Performance
15
3. TinyRV1 FSM Processor 3.1. High-Level Idea for FSM Processors
Technology Constraints
• Assume legacy technology Control Unit
where logic is expensive, so Control Status
we want to minimize the
Datapath
number of registers and
combinational logic
• Assume an (unrealistic) regfile
combinational memory
• Assume multi-ported register
mem mem
files and memories are too req resp
expensive, these structures Memory
can only have a single <1 cycle
read/write port combinational
16
3. TinyRV1 FSM Processor 3.1. High-Level Idea for FSM Processors
Fetch Decode Reg Read Write Fetch Decode Reg Write Fetch Decode Write
Inst Inst Arith Mem Reg Inst Inst Arith Reg Inst Inst Reg
Fetch Decode Reg Read Write Fetch Decode Reg Write Fetch Decode Write
FSM
Inst Inst Arith Mem Reg Inst Inst Arith Reg Inst Inst Reg
17
3. TinyRV1 FSM Processor 3.2. FSM Processor Datapath
To control unit
PC IR A
+4
pc_ alu_
bus_en bus_en
rd_
bus_en
RD
memreq. memresp.
addr data
F0 (pseudo-control-signal syntax)
F1
F2
18
3. TinyRV1 FSM Processor 3.2. FSM Processor Datapath
To control unit
PC IR A B
alu_ rs1
func alu
+4 RF rs2
rf_wen rd
pc_ alu_ rf_ rf_addr
bus_en bus_en bus_en _sel
memreq. memresp.
addr data
F0 (pseudo-control-signal syntax)
F1
add rd, rs1, rs2
F2
A0
A1
A2
19
3. TinyRV1 FSM Processor 3.2. FSM Processor Datapath
To control unit
b_sel c_sel
pc_en ir_en a_en
b_en c_en
>> >>
Datapath Bus
PC IR A B C
x0
imm_ imm
sext alu_ rs1
type gen func alu
+4 RF
imm rf_wen
rs2
rd
eq
pc_ immgen_ alu_ rf_ rf_addr
bus_en bus_en bus_en bus_en _sel
F0 ADDI Pseudo-Control-Signal
Fragment
F1
A1 AI1 M1 L1 S1 JA1 B1
A2 AI2 M2 L2 S2 JA2 B2
M3 L3 S3 B3
B4
M35 B5
20
3. TinyRV1 FSM Processor 3.2. FSM Processor Datapath
BNE Instruction
LW Instruction bne rs1, rs2, imm
lw rd, imm(rs1)
B0: A ← RF[rs1]
L0: A ← RF[rs1] B1: B ← RF[rs2]
L1: B ← sext(imm_i) B2: B ← sext(imm_b);
L2: memreq.addr ← A + B if A == B goto F0
L3: RF[rd] ← RD; goto F0 B3: A ← PC
B4: A ← A − 4
B5: PC ← A + B; goto F0
SW Instruction
sw rs2, imm(rs1)
S0: WD ← RF[rs2]
S1: A ← RF[rs1]
S2: B ← sext(imm_s)
S3: memreq.addr ← A + B; goto F0
21
3. TinyRV1 FSM Processor 3.2. FSM Processor Datapath
To control unit
b_sel c_sel
pc_en ir_en a_en
b_en c_en
>> >>
Datapath Bus
PC IR A B C
x0
imm_ imm
sext alu_ rs1
type gen func alu
+4 RF
imm rf_wen
rs2
rd
eq
pc_ immgen_ alu_ rf_ rf_addr
bus_en bus_en bus_en bus_en _sel
22
3. TinyRV1 FSM Processor 3.2. FSM Processor Datapath
To control unit
b_sel c_sel
pc_en ir_en a_en
b_en c_en
>> >>
Datapath Bus
PC IR A B C
x0
imm_ imm
sext alu_ rs1
type gen func alu
+4 RF
imm rf_wen
rs2
rd
eq
pc_ immgen_ alu_ rf_ rf_addr
bus_en bus_en bus_en bus_en _sel
23
3. TinyRV1 FSM Processor 3.3. FSM Processor Control Unit
Hardwired FSM
State
Control State
Signal Transition
Logic Logic
24
3. TinyRV1 FSM Processor 3.3. FSM Processor Control Unit
To control unit
b_sel c_sel
pc_en ir_en a_en
b_en c_en
>> >>
Datapath Bus
PC IR A B C
x0
imm_ imm
sext alu_ rs1
type gen func alu
+4 RF
imm rf_wen
rs2
rd
eq
pc_ immgen_ alu_ rf_ rf_addr
bus_en bus_en bus_en bus_en _sel
F0 1 0 0 0 0 0 0 1 0 0 0 – – – – – 0 1 r
F1 0 0 0 0 1 0 1 0 0 0 0 – – – – – 0 0 –
F2 0 0 1 0 0 1 0 0 0 0 0 – – – +4 – 0 0 –
A0
A1
A2
25
3. TinyRV1 FSM Processor 3.3. FSM Processor Control Unit
F0
opcode decoder Next State Encoding
n : increment uPC by one
d : dispatch based on opcode
f : goto state F0
+1 b : goto state F0 if A == B
uPC
Control Next
eq
Signals State
Status Signals
(1)
bus mux
en sel
• Use memory array (called the control store) instead of random logic
to encode both the control signal logic and the state transition logic
• Enables a more systematic approach to implementing complex
multi-cycle instructions
• Microcoding can produce good performance if accessing the control
store is much faster than accessing main memory
• Read-only control stores might be replaceable enabling in-field
updates, while read-write control stores can simplify diagnostics
and microcode patches
26
3. TinyRV1 FSM Processor 3.3. FSM Processor Control Unit
To control unit
b_sel c_sel
pc_en ir_en a_en
b_en c_en
>> >>
Datapath Bus
PC IR A B C
x0
imm_ imm
sext alu_ rs1
type gen func alu
+4 RF
imm rf_wen
rs2
rd
eq
pc_ immgen_ alu_ rf_ rf_addr
bus_en bus_en bus_en bus_en _sel
B0 0 0 0 1 0 0 0 1 0 0 0 – – – – rs1 0 0 –
B1 0 0 0 1 0 0 0 0 1 0 0 b – – – rs2 0 0 –
B2 0 1 0 0 0 0 0 0 1 0 0 b – b cmp – 0 0 –
B3 1 0 0 0 0 0 0 1 0 0 0 – – – – – 0 0 –
B4 0 0 1 0 0 0 0 1 0 0 0 – – – −4 – 0 0 –
B5 0 0 1 0 0 1 0 0 0 0 0 – – – + – 0 0 –
27
3. TinyRV1 FSM Processor 3.4. Analyzing Performance
To control unit
b_sel c_sel
pc_en ir_en a_en
b_en c_en
>> >>
Datapath Bus
PC IR A B C
x0
imm_ imm
sext alu_ rs1
type gen func alu
+4 RF
imm rf_wen
rs2
rd
eq
pc_ immgen_ alu_ rf_ rf_addr
bus_en bus_en bus_en bus_en _sel
• register read/write = 1τ
• regfile read/write = 10τ
• mem read/write = 20τ
• immgen = 2τ
• mux = 3τ
• alu = 10τ
• 1b shifter = 1τ
• tri-state buf = 1τ
28
3. TinyRV1 FSM Processor 3.4. Analyzing Performance
29
4. TinyRV1 Pipelined Processor 4.1. High-Level Idea for Pipelined Processors
Technology Constraints
• Assume modern technology Control Unit
where logic is cheap and fast Control Status
(e.g., fast integer ALU)
Datapath
• Assume multi-ported register
files with a reasonable
number of ports are feasible regfile
30
4. TinyRV1 Pipelined Processor 4.1. High-Level Idea for Pipelined Processors
Anne's
Load
Ben's
Load
Cathy's
Load
Dave's
Load
Anne's Anne's
Load Load
Ben's Ben's
Load Load
Cathy's Cathy's
Load Load
Dave's Dave's
Load Load
Pipelining lessons
31
4. TinyRV1 Pipelined Processor 4.1. High-Level Idea for Pipelined Processors
Fetch Decode Reg Read Write Fetch Decode Reg Write Fetch Decode Write
Inst Inst Arith Mem Reg Inst Inst Arith Reg Inst Inst Reg
Fetch Decode Reg Read Write Fetch Decode Reg Write Fetch Decode Write
FSM
Inst Inst Arith Mem Reg Inst Inst Arith Reg Inst Inst Reg
lw Read Update
Reg PC
Read Update
Reg PC
Update
PC
32
4. TinyRV1 Pipelined Processor 4.2. Pipelined Processor Datapath and Control Unit
Control Signal
Table
jbtarg
jr
pc_plus4
ir[31:0]
immgen
pc_sel_F
+
imm_type
pc_F +4 rf_
waddr_W
result_sel_X
wb_sel_M rf_
mul
wen_W
ir[19:15]
eq_X
op2_sel_D
alu_fn_X
33
4. TinyRV1 Pipelined Processor 4.2. Pipelined Processor Datapath and Control Unit
Draw on the above datapath diagram what paths we need to use as well
as any new paths we will need to add in order to implement the
following auto-incrementing load instruction.
val_F
imm_type
pc_F +4 pc_FD rf_
btarg_DX waddr_W
result_sel_X
wb_sel_M rf_
mul
wen_W
result result
_XM _MW
ir[19:15]
eq_X
op2_sel_D op2_DX sd_XM
alu_fn_X
sd_DX
34
4. TinyRV1 Pipelined Processor 4.2. Pipelined Processor Datapath and Control Unit
Pipeline diagrams
What would be the total execution time if these three instructions were
repeated 10 times?
35
4. TinyRV1 Pipelined Processor 4.2. Pipelined Processor Datapath and Control Unit
next_
val_B val_B
if ( reset )
Stage A Stage B Stage C
val_B <= 0 Datapath Datapath Datapath
else if ( reg_en_B ) Logic Logic Logic
val_B <= next_val_A
36
4. TinyRV1 Pipelined Processor 4.2. Pipelined Processor Datapath and Control Unit
next_
val_B val_B
reg_en_B
always_ff @( posedge clk )
if ( reset ) Stage A Stage B Stage C
val_B <= 0 Datapath Datapath Datapath
else if ( reg_en_B ) Logic Logic Logic
val_B <= next_val_A
37
5. Pipeline Hazards: RAW Data Hazards
38
5. Pipeline Hazards: RAW Data Hazards
39
5. Pipeline Hazards: RAW Data Hazards 5.1. Expose in Instruction Set Architecture
40
5. Pipeline Hazards: RAW Data Hazards 5.2. Hardware Stalling
val_F
imm_type
pc_F +4 pc_FD rf_
btarg_DX waddr_W
result_sel_X
wb_sel_M rf_
mul
wen_W
result result
_XM _MW
reg_ ir[19:15]
en_F reg_ regfile op1_sel_D op1_DX regfile
en_D ir[24:20] (read) (write)
ir_FD
alu
eq_X
op2_sel_D op2_DX sd_XM
alu_fn_X
sd_DX
ostall_waddr_X_rs1_D =
val_D && rs1_en_D && val_X && rf_wen_X
&& (inst_rs1_D == rf_waddr_X) && (rf_waddr_X != 0)
ostall_waddr_M_rs1_D =
val_D && rs1_en_D && val_M && rf_wen_M
&& (inst_rs1_D == rf_waddr_M) && (rf_waddr_M != 0)
ostall_waddr_W_rs1_D =
val_D && rs1_en_D && val_W && rf_wen_W
&& (inst_rs1_D == rf_waddr_W) && (rf_waddr_W != 0)
... similar for ostall signals for rs2 source register ...
ostall_D = val_D
&& ( ostall_waddr_X_rs1_D || ostall_waddr_X_rs2_D
|| ostall_waddr_M_rs1_D || ostall_waddr_M_rs2_D
|| ostall_waddr_W_rs1_D || ostall_waddr_W_rs2_D )
42
5. Pipeline Hazards: RAW Data Hazards 5.3. Hardware Bypassing/Forwarding
val_F
imm_type op1_
pc_F +4 pc_FD rf_
byp_ btarg_DX waddr_W
sel_D result_sel_X
wb_sel_M rf_
mul
wen_W
result result
_XM _MW
reg_ ir[19:15]
en_F reg_ regfile op1_sel_D op1_DX regfile
en_D ir[24:20] (read) (write)
ir_FD
alu
eq_X
op2_sel_D op2_DX sd_XM
alu_fn_X
sd_DX
ostall_waddr_X_rs1_D = 0
bypass_waddr_X_rs1_D =
val_D && rs1_en_D && val_X && rf_wen_X
&& (inst_rs1_D == rf_waddr_X) && (rf_waddr_X != 0)
43
5. Pipeline Hazards: RAW Data Hazards 5.3. Hardware Bypassing/Forwarding
val_F
imm_type op1_
pc_F +4 pc_FD rf_
byp_ btarg_DX waddr_W
sel_D result_sel_X
wb_sel_M rf_
mul
wen_W
result result
_XM _MW
reg_ ir[19:15]
en_F reg_ regfile op1_sel_D op1_DX regfile
en_D ir[24:20] (read) (write)
ir_FD
alu
eq_X
op2_sel_D op2_DX sd_XM
op2_ alu_fn_X
byp_
sel_D
bypass_from_X sd_DX
bypass_from_M
bypass_from_W
44
5. Pipeline Hazards: RAW Data Hazards 5.3. Hardware Bypassing/Forwarding
ALU-use latency is only one cycle, but load-use latency is two cycles.
lw x1, 0(x2)
addi x3, x1, 1
lw x1, 0(x2)
addi x3, x1, 1
ostall_load_use_X_rs1_D =
val_D && rs1_en_D && val_X && rf_wen_X
&& (inst_rs1_D == rf_waddr_X) && (rf_waddr_X != 0)
&& (op_X == lw)
ostall_load_use_X_rs2_D =
val_D && rs2_en_D && val_X && rf_wen_X
&& (inst_rs2_D == rf_waddr_X) && (rf_waddr_X != 0)
&& (op_X == lw)
ostall_D =
val_D && ( ostall_load_use_X_rs1_D || ostall_load_use_X_rs2_D )
bypass_waddr_X_rs1_D =
val_D && rs1_en_D && val_X && rf_wen_X
&& (inst_rs1_D == rf_waddr_X) && (rf_waddr_X != 0)
&& (op_X != lw)
bypass_waddr_X_rs2_D =
val_D && rs2_en_D && val_X && rf_wen_X
&& (inst_rs2_D == rf_waddr_X) && (rf_waddr_X != 0)
&& (op_X != lw)
45
5. Pipeline Hazards: RAW Data Hazards 5.4. RAW Data Hazards Through Memory
lw x1, 0(x2)
lw x3, 0(x4)
add x5, x1, x3
sw x5, 0(x6)
addi x2, x2, 4
addi x4, x4, 4
addi x6, x6, 4
addi x7, x7, -1
bne x7, x0, loop
sw x1, 0(x2)
lw x3, 0(x4) # RAW dependency occurs if R[x2] == R[x4]
sw x1, 0(x2)
lw x3, 0(x4)
46
6. Pipeline Hazards: Control Hazards
47
6. Pipeline Hazards: Control Hazards 6.1. Expose in Instruction Set Architecture
The jump resolution latency and branch resolution latency are the
number of cycles we need to delay the fetch of the next instruction in
order to avoid any kind of control hazard. Jump resolution latency is
two cycles, and branch resolution latency is three cycles.
48
6. Pipeline Hazards: Control Hazards 6.1. Expose in Instruction Set Architecture
Pipeline diagram showing using branch delay slots for control hazards
49
6. Pipeline Hazards: Control Hazards 6.2. Hardware Speculation
50
6. Pipeline Hazards: Control Hazards 6.2. Hardware Speculation
val_F
+
imm_type op1_
pc_F +4 pc_FD rf_
byp_ btarg_DX waddr_W
sel_D result_sel_X
wb_sel_M rf_
mul
wen_W
result result
_XM _MW
reg_ ir[19:15]
en_F reg_ regfile op1_sel_D op1_DX regfile
en_D ir[24:20] (read) (write)
ir_FD
alu
eq_X
op2_sel_D op2_DX sd_XM
op2_ alu_fn_X
byp_
sel_D
bypass_from_X sd_DX
bypass_from_M
bypass_from_W
51
6. Pipeline Hazards: Control Hazards 6.2. Hardware Speculation
52
6. Pipeline Hazards: Control Hazards 6.3. Interrupts and Exceptions
• Asynchronous Interrupts
– Input/output device needs to be serviced
– Timer has expired
– Power distruption or hardware failure
• Synchronous Exceptions
– Undefined opcode, privileged instruction
– Arithmetic overflow, floating-point exception
– Misaligned memory access for instruction fetch or data access
– Memory protection violation
– Virtual memory page faults
– System calls (traps) to jump into the operating system kernel
53
6. Pipeline Hazards: Control Hazards 6.3. Interrupts and Exceptions
54
6. Pipeline Hazards: Control Hazards 6.3. Interrupts and Exceptions
F D X M W
55
6. Pipeline Hazards: Control Hazards 6.3. Interrupts and Exceptions
Exception
Handling Logic
val_F
+
pc_sel_F
imm_type op1_
pc_F +4 pc_FD rf_
byp_ btarg_DX waddr_W
sel_D result_sel_X
wb_sel_M rf_
mul
wen_W
result result
_XM _MW
reg_ ir[19:15]
en_F reg_ regfile op1_sel_D op1_DX regfile
en_D ir[24:20] (read) (write)
ir_FD alu
eq_X rf_
op2_sel_D op2_DX sd_XM wen1_W
op2_ alu_fn_X
byp_ rf_
sel_D waddr1_W
bypass_from_X sd_DX
bypass_from_M
bypass_from_W
Control logic needs to redirect the front end of the pipeline just like for a
jump or branch. Again, squashes take priority over stalls, and PC select
logic must give priority to older instructions (i.e., priortize exceptions,
over branches, over jumps)!
56
6. Pipeline Hazards: Control Hazards 6.3. Interrupts and Exceptions
57
7. Pipeline Hazards: Structural Hazards
val_F
imm_type op1_
pc_F +4 pc_FD rf_
byp_ btarg_DX waddr_W
sel_D result_sel_X
wb_sel_M rf_
mul
wen_W
result result
_XM _MW
reg_ ir[19:15]
en_F reg_ regfile op1_sel_D op1_DX regfile
en_D ir[24:20] (read) (write)
ir_FD
alu
eq_X
op2_sel_D op2_DX sd_XM
op2_ alu_fn_X
byp_
sel_D
bypass_from_X sd_DX
bypass_from_M
bypass_from_W
58
7. Pipeline Hazards: Structural Hazards 7.1. Expose in Instruction Set Architecture
Note that the key shared resources that are causing the structural
hazard are the pipeline registers at the end of the M stage. We cannot
write these pipeline registers with the transaction that is in the X stage
and also the transaction that is the M stage at the same time.
59
7. Pipeline Hazards: Structural Hazards 7.2. Hardware Stalling
60
7. Pipeline Hazards: Structural Hazards 7.3. Hardware Duplication
val_F
mul
wen_W
result mem_result
_XM _MW
reg_ ir[19:15]
en_F reg_ regfile op1_sel_D op1_DX regfile
en_D ir[24:20] (read) (write)
ir_FD
alu
eq_X rf_
op2_sel_D op2_DX sd_XM wen1_W
op2_ alu_fn_X
byp_ rf_
sel_D waddr1_W
bypass_from_X sd_DX
bypass_from_M
bypass_from_W
61
8. Pipeline Hazards: WAW and WAR Name Hazards
62
8. Pipeline Hazards: WAW and WAR Name Hazards 8.1. Software Renaming
63
9. Summary of Processor Performance 8.2. Hardware Stalling
ostall_waw_hazard_D =
val_D && rf_wen_D && val_Z && rf_wen_Z
&& (rf_waddr_D == rf_waddr_Z) && (rf_waddr_Z != 0)
64
9. Summary of Processor Performance
val_F
+
imm_type op1_
pc_F +4 pc_FD rf_
byp_ btarg_DX waddr_W
sel_D result_sel_X
wb_sel_M rf_
mul
wen_W
result result
_XM _MW
reg_ ir[19:15]
en_F reg_ regfile op1_sel_D op1_DX regfile
en_D ir[24:20] (read) (write)
ir_FD
alu
eq_X
op2_sel_D op2_DX sd_XM
op2_ alu_fn_X
byp_
sel_D
bypass_from_X sd_DX
bypass_from_M
bypass_from_W
• register read = 1τ
• register write = 1τ
• regfile read = 10τ
• regfile write = 10τ
• memory read = 20τ
• memory write = 20τ
• +4 unit = 4τ
• immgen = 2τ
• mux = 3τ
• multiplier = 20τ
• alu = 10τ
• adder = 8τ
65
9. Summary of Processor Performance
lw
lw
add
sw
addi
addi
addi
addi
bne
opA
opB
lw
66
9. Summary of Processor Performance
67
10. Case Study: Transition from CISC to RISC
68
10. Case Study: Transition from CISC to RISC 10.1. Example CISC: IBM 360/M30
Instruction Fetch
OR Writeback Result
69
10. Case Study: Transition from CISC to RISC 10.1. Example CISC: IBM 360/M30
70
10. Case Study: Transition from CISC to RISC 10.1. Example CISC: IBM 360/M30
μPC User PC
Small "Larger"
Decoder Decoder
71
10. Case Study: Transition from CISC to RISC 10.2. Example RISC: MIPS R2K
72
10. Case Study: Transition from CISC to RISC 10.2. Example RISC: MIPS R2K
Used two-phase clocking to enable five pipeline stages to fit into four
clock cycles. This avoided the need for explicit bypassing from the W
1.2 The MIPS Five-Stage Pipeline 5
stage to the end of the D stage.
RD WB
IF MEM
from to
Instruction 1 from ALU from
register register
I-cache D-cache
Instruction sequence
file file
Instruction 2
IF RD ALU MEM WB
Time
74
10. Case Study: Transition from CISC to RISC 10.2. Example RISC: MIPS R2K
4.0
Performance Ratio
3.5
Ratio of
3.0
MIPS
MICROPROCESSOR REPORT
to 2.5
VAX
MIPS R10000
2.0
Uses Decoupled Architecture
High-Performance Core Will Drive MIPS High-End
Instructions for Years
Executed Ratio
1.5
by Linley Gwennap Taking advantage of its experience with the 200-
2x more instr MHz R4400, MTI was able to streamline the design and
P R O C ES
RO Not to be left out in the move to the
1.0 6x lower expects it to run at a high clock rate. Speaking at the
CPI
FORUM next generation of RISC, MIPS Tech- Microprocessor Forum, MTI’s Chris Rowen said that the
SO R
MIC
4 instr
Resume SRAM codes them. If a branch is discovered, it
Cache
is immediately predicted; if it is pre-
PC Decode, Map, Active Map dicted taken, the target address is sent
Unit Dispatch List Table to the instruction cache, redirecting the
128
4 instr Data fetch stream. Because of the one cycle
SRAM needed to decode the branch, taken
Memory Integer FP
branches create a “bubble” in the fetch
512K
Queue
16 entries
Queue
16 entries
Queue -16M stream; the deep queues, however, gen-
16 entries
128 erally prevent this bubble from delay-
ing the execution pipeline.
Integer Registers FP Registers The sequential instructions that
Avalanche Bus (64 bit addr/data)
64 ! 64 bits 64 ! 64 bits
are loaded during this extra cycle are
not discarded but are saved in a “re-
System Interface
75
10. Case Study: Transition from CISC to RISC 10.2. Example RISC: MIPS R2K
Microprogamming Today
76