Professional Documents
Culture Documents
Tomasulo Algorithm and Dynamic Branch Prediction: Professor David A. Patterson Computer Science 252 Spring 1998
Tomasulo Algorithm and Dynamic Branch Prediction: Professor David A. Patterson Computer Science 252 Spring 1998
Review: Summary
Instruction Level Parallelism (ILP) in SW or HW Loop level parallelism is easiest to see SW parallelism dependencies defined for program, hazards if HW cannot resolve SW dependencies/compiler sophistication determine if compiler can unroll loops
Memory dependencies hardest to determine
HW exploiting ILP
Works when cant know dependence at run time Code for one machine runs well on another
Key idea of Scoreboard: Allow instructions behind stall to proceed (Decode => Issue instr & read operands)
Enables out-of-order execution => out-of-order completion DAP Spr.98 UCB 2 ID stage checked both for structural
Page 1
3.Register result statusIndicates which functional unit will write each register, if one exists. Blank when no DAP Spr.98 UCB 3 pending instructions will write that register
Busy Yes No No No No
Op Load
dest Fi F6
S1 Fj
S2 Fk R2
Fk? Rk Yes
Clock
3 FU
F0
F2
F4
F6 F8 F10
Integer
F12
...
F30
DAP Spr.98 UCB 4
Page 2
dest Fi F0 F8 F10
S1 Fj F2 F6 F0
S2 Fk F4 F2 F6
Mult1
Clock
9 FU
F0
Mult1
F2
F4
F6 F8 F10
Add Divide
F12
...
F30
DAP Spr.98 UCB 5
S1 Fj F2 F8 F0
S2 Fk F4 F2 F6
Mult1
Clock
17 FU
F0
Mult1
F2
F4
F6 F8 F10
Add Divide
F12
...
F30
DAP Spr.98 UCB 6
Page 3
Fk? Rk
Clock
62 FU
F0
F2
F4
F6 F8 F10
F12
...
F30
DAP Spr.98 UCB 7
Page 4
Why Study? lead to Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604,
Registers in instructions replaced by values or pointers to reservation stations(RS); called register renaming ;
avoids WAR, WAW hazards More reservation stations than registers, so can do optimizations compilers cant
Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs Load and Stores treated as FUs with RSs as well Integer instructions can go past branches, allowing DAP Spr.98 UCB 10 FP ops beyond basic block in FP queue
Page 5
Tomasulo Organization
From memory Load buffers 6 5 4 3 2 1 From instruction unit Floatingpoint operation queue FP registers
Operand buses
Operation bus
3 2 1 FP adders
FIGURE 4.8 The basic structure of a DLX FP unit using Tomasulo's algorithm.
BusyIndicates reservation station or FU is busy Register result statusIndicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that DAP Spr.98 UCB 12 register.
Page 6
If reservation station free (no structural hazard), control issues instr & sends operands (renames registers).
2. E x e c u t i o n operate
When both operands ready then execute; if not ready, watch Common Data Bus for result
3. W r i t e r e s u l tfinish
Write on Common Data Bus to all awaiting units; mark reservation station available
Normal data bus: data + destination (go to bus) Common data bus: data + source (come from bus)
64 bits of data + 4 bits of Functional Unit source address Write if matches expected Functional Unit (produces result) Does the broadcast DAP Spr.98 UCB 13
Busy Op No No No No No
S1 Vj
S2 Vk
RS for j Qj
RS for k Qk
Clock
0 FU
F0
F2
F4
F6
F8
F10
F12 ...
F30
Page 7
Busy Op No No No No No
S1 Vj
S2 Vk
RS for j Qj
RS for k Qk
Clock
1 FU
F0
F2
F4
F6
Load1
F8
F10
F12 ...
F30
Busy Op No No No No No
S1 Vj
S2 Vk
RS for j Qj
RS for k Qk
Clock
2 FU
F0
F2
Load2
F4
F6
Load1
F8
F10
F12 ...
F30
Page 8
S2 Vk
RS for j Qj
RS for k Qk
R(F4)
Load2
Clock
3 FU
F0
Mult1
F2
Load2
F4
F6
Load1
F8
F10
F12 ...
F30
Note: registers names are removed (renamed) in Reservation Stations; MULT issued vs. scoreboard DAP Spr.98 UCB 17 Load1 completing; what is waiting for Load1?
S2 Vk
RS for j Qj
RS for k Qk Load2
R(F4)
Load2
Clock
4 FU
F0
Mult1
F2
Load2
F4
F6
M(34+R2)
F8
Add1
F10
F12 ...
F30
Page 9
Op SUBD
S1 Vj M(34+R2)
S2 Vk M(45+R3)
RS for j Qj
RS for k Qk
R(F4) M(34+R2)
Mult1
Clock
5 FU
F0
Mult1
F2
M(45+R3)
F4
F6
M(34+R2)
F8
Add1
F10
Mult2
F12 ...
F30
RS for j Qj Add1
RS for k Qk
Mult1
Clock
6 FU
F0
Mult1
F2
M(45+R3)
F4
F6
Add2
F8
Add1
F10
Mult2
F12 ...
F30
Page 10
RS for j Qj Add1
RS for k Qk
Mult1
Clock
7 FU
F0
Mult1
F2
M(45+R3)
F4
F6
Add2
F8
Add1
F10
Mult2
F12 ...
F30
RS for j Qj
RS for k Qk
Mult1
Clock
8 FU
F0
F2
F4
F6
Add2
F8
F30
Mult1 M(45+R3)
M()-M() Mult2
Page 11
S1 Vj
RS for j Qj
RS for k Qk
Mult1
Clock
9 FU
F0
Mult1
F2
M(45+R3)
F4
F6
Add2
F8
F10
F12 ...
F30
M()M() Mult2
Mult1
Clock
10 FU
F0
Mult1
F2
M(45+R3)
F4
F6
Add2
F8
F10
F12 ...
F30
M()M() Mult2
Page 12
R(F4) M(34+R2)
Mult1
Clock
11 FU
F0
F2
F4
F6
(M-M)+M()
F8
F30
Mult1 M(45+R3)
M()M() Mult2
R(F4) M(34+R2)
Mult1
Clock
12 FU
F0
Mult1
F2
M(45+R3)
F4
F6
(M-M)+M()
F8
F10
F12 ...
F30
M()M() Mult2
Page 13
R(F4) M(34+R2)
Mult1
Clock
13 FU
F0
Mult1
F2
M(45+R3)
F4
F6
(MM)+M()
F8
F10
F12 ...
F30
M()M() Mult2
R(F4) M(34+R2)
Mult1
Clock
14 FU
F0
Mult1
F2
M(45+R3)
F4
F6
(MM)+M()
F8
F10
F12 ...
F30
M()M() Mult2
Page 14
R(F4) M(34+R2)
Mult1
Clock
15 FU
F0
Mult1
F2
M(45+R3)
F4
F6
(MM)+M()
F8
F10
F12 ...
F30
M()M() Mult2
M*F4
M(34+R2)
Clock
16 FU
F0
M*F4
F2
M(45+R3)
F4
F6
(MM)+M()
F8
F10
F12 ...
F30
M()M() Mult2
Page 15
M*F4
M(34+R2)
Clock
55 FU
F0
M*F4
F2
M(45+R3)
F4
F6
(MM)+M()
F8
F10
F12 ...
F30
M()M() Mult2
M*F4
M(34+R2)
Clock
56 FU
F0
M*F4
F2
M(45+R3)
F4
F6
(MM)+M()
F8
F10
F12 ...
F30
M()M() Mult2
Page 16
Busy Op No No No No No
RS for k Qk
Clock
57 FU
F0
M*F4
F2
M(45+R3)
F4
F6
(MM)+M()
F8
F10
F12 ...
F30
M()M() M*F4/M
Fk? Rk
Clock
62 FU
F0
F2
F4
F6 F8 F10
F12
...
F30
Page 17
Tomasulo Drawbacks
Complexity
delays of 360/91, MIPS 10000, IBM 620?
Many associative stores (CDB) at high speed Performance limited by Common Data Bus
Multiple CDBs => more FU logic for parallel assoc stores
Page 18
Assume Multiply takes 4 clocks Assume first load takes 8 clocks (cache miss?), second load takes 4 clocks (hit) To be clear, will show clocks for SUBI, BNEZ DAP Spr.98 UCB 37 Reality, integer instructions ahead
Busy Op No No No No No
S1 Vj
S2 Vk
F0
Qi
F2
F4
F6
F8
Page 19
Busy Op No No No No No
S1 Vj
S2 Vk
F0
Qi Load1
F2
F4
F6
F8
S1 Vj
S2 Vk
R(F2)
F0
Qi Load1
F2
F4
Mult1
F6
F8
Page 20
S1 Vj
S2 Vk
R(F2)
F0
Qi Load1
F2
F4
Mult1
F6
F8
S1 Vj
S2 Vk
R(F2)
F0
Qi Load1
F2
F4
Mult1
F6
F8
Page 21
S1 Vj
S2 Vk
R(F2)
F0
Qi Load1
F2
F4
Mult1
F6
F8
S1 Vj
S2 Vk
R(F2)
F0
Qi Load2
F2
F4
Mult1
F6
F8
Page 22
S2 Vk
R(F2) R(F2)
F0
Qi Load2
F2
F4
Mult2
F6
F8
S2 Vk
R(F2) R(F2)
F0
Qi Load2
F2
F4
Mult2
F6
F8
Page 23
S2 Vk
R(F2) R(F2)
F0
Qi Load2
F2
F4
Mult2
F6
F8
10
S2 Vk
F0
Qi Load2
F2
F4
Mult2
F6
F8
Page 24
10
S2 Vk
F0
Qi Load3
F2
F4
Mult2
F6
F8
10
S2 Vk
F0
Qi Load3
F2
F4
Mult2
F6
F8
Page 25
10
S2 Vk
F0
Qi Load3
F2
F4
Mult2
F6
F8
S2 Vk
F0
Qi Load3
F2
F4
Mult2
F6
F8
Page 26
M(72) R(F2)
F0
Qi Load3
F2
F4
Mult2
F6
F8
R(F2)
F0
Qi Load3
F2
F4
Mult1
F6
F8
Page 27
R(F2)
F0
Qi Load3
F2
F4
Mult1
F6
F8
R(F2)
F0
Qi Load3
F2
F4
Mult1
F6
F8
Page 28
R(F2)
F0
Qi Load3
F2
F4
Mult1
F6
F8
R(F2)
F0
Qi Load3
F2
F4
Mult1
F6
F8
Page 29
R(F2)
F0
Qi Load3
F2
F4
Mult1
F6
F8
Tomasulo Summary
Reservations stations: renaming to larger set of registers + buffering source operands
Prevents registers as bottleneck Avoids WAR, WAW hazards of Scoreboard Allows loop unrolling in HW
Not limited to basic blocks (integer units gets ahead, beyond branches) Helps cache misses as well Lasting Contributions
Dynamic scheduling Register renaming Load/store disambiguation
360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264 DAP Spr.98 UCB 60
Page 30