Professional Documents
Culture Documents
Tomasulo Algorithm and Dynamic Branch Prediction
Tomasulo Algorithm and Dynamic Branch Prediction
Professor David A. Patterson Computer Science 252 Spring 1998 edited by C. Halatsis Winter 2001
Review: Summary
Instruction Level Parallelism (ILP) in SW or HW Loop level parallelism is easiest to see SW parallelism dependencies defined for program, hazards if HW cannot resolve SW dependencies/compiler sophistication determine if compiler can unroll loops
Memory dependencies hardest to determine
HW exploiting ILP
Works when cant know dependence at run time Code for one machine runs well on another
Key idea of Scoreboard: Allow instructions behind stall to proceed (Decode => Issue instr & read operands)
Enables out-of-order execution => out-of-order completion ID stage checked both for structural and WAW hazards
3. Register result statusIndicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register
Fk? Rk
Clock
62 FU
F0
F2
F4
F6 F8 F10
F12
...
F30
DAP Spr.98 UCB 4
Why Study? lead to Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604,
Registers in instructions replaced by values or pointers to reservation stations(RS); called register renaming ;
avoids WAR, WAW hazards More reservation stations than registers, so can do optimizations compilers cant
Results to FU from RS, not through registers, over Common Data Bus that broadcasts results to all FUs Load and Stores treated as FUs with RSs as well Integer instructions can go past branches, allowing FP ops beyond basic block in FP queue
Tomasulo Organization
FP Op Queue Load Buffer FP Registers
Store Buffer
BusyIndicates reservation station or FU is busy Register result statusIndicates which functional unit will write each register, if one exists. Blank when no pending instructions that will write that register.
DAP Spr.98 UCB 9
S1 Vj
S2 Vk
RS for j Qj
RS for k Qk
Clock
0 FU
F0
F2
F4
F6
F8
F10
F12 ...
F30
S1 Vj
S2 Vk
RS for j Qj
RS for k Qk
Clock
1 FU
F0
F2
F4
F6
Load1
F8
F10
F12 ...
F30
S1 Vj
S2 Vk
RS for j Qj
RS for k Qk
Clock
2 FU
F0
F2
Load2
F4
F6
Load1
F8
F30
S2 Vk
RS for j Qj
RS for k Qk
R(F4)
Load2
Clock
3 FU
F0
F2
F4
F6
Load1
F8
F10
F12 ...
F30
Mult1 Load2
Note: registers names are removed (renamed) in Reservation Stations; MULT issued vs. scoreboard DAP Load1 completing; what is waiting for Load1? (onlySpr.98 UCB 14 F6)
S2 Vk
RS for j Qj
RS for k Qk Load2
R(F4)
Load2
Clock
4 FU
F0
F2
F4
F6
M(34+R2)
F8
Add1
F10
F12 ...
F30
Mult1 Load2
S2 Vk M(45+R3)
RS for j Qj
RS for k Qk
R(F4) M(34+R2)
Mult1
Clock
5 FU
F0
F2
F4
F6
M(34+R2)
F8
Add1
F10
Mult2
F12 ...
F30
Mult1 M(45+R3)
RS fo r k Qk
Clock
6 FU
F0
F2
F4
F6
Add2
F8
Add1
F30
Issue ADDD here vs. scoreboard? (Yes,because the Add unit has a
free reservation station to queue ADDD; the other RS queues the SUBD instruction) In this cycle SUBD starts executing (requres 1 more cycle to complete) In this cycle MULTD starts executing too. (requires 9 more cycles to complete)
RS for j Qj Add1
RS for k Qk
Mult1
Clock
7 FU
F0
F2
F4
F6
Add2
F8
Add1
F10
Mult2
F12 ...
F30
Mult1 M(45+R3)
Mult1
Clock
8 FU
F0
F2
F4
F6
Add2
F8
F10
F12 ...
F30
Mult1 M(45+R3)
M()-M() Mult2
RS for j Qj
RS for k Qk
Mult1
Clock
9 FU
F0
F2
F4
F6
Add2
F8
F10
F12 ...
F30
Mult1 M(45+R3)
M()M() Mult2
RS for j Qj
RS for k Qk
Mult1
Clock
10 FU
F0
F2
F4
F6
Add2
F8
F10
F12 ...
F30
Mult1 M(45+R3)
M()M() Mult2
Clock
11 FU
F0
F2
F4
F6
F8
F30
R(F4) M(34+R2)
Mult1
Clock
12 FU
F0
F2
F4
F6
F8
F10
F12 ...
F30
Mult1 M(45+R3)
R(F4) M(34+R2)
Mult1
Clock
13 FU
F0
F2
F4
F6
F8
F10
F12 ...
F30
Mult1 M(45+R3)
R(F4) M(34+R2)
Mult1
Clock
14 FU
F0
F2
F4
F6
F8
F10
F12 ...
F30
Mult1 M(45+R3)
R(F4) M(34+R2)
Mult1
Clock
15 FU
F0
F2
F4
F6
F8
F10
F12 ...
F30
Mult1 M(45+R3)
Mult1 completing; what is waiting for it? (DIVD andDAP Spr.98 UCB 26 F0)
M*F4
M(34+R2)
Clock
16 FU
F0
M*F4
F2
M(45+R3)
F4
F6
F8
F10
F12 ...
F30
M*F4
M(34+R2)
Clock
55 FU
F0
M*F4
F2
M(45+R3)
F4
F6
F8
F10
F12 ...
F30
M*F4
M(34+R2)
Clock
56 FU
F0
M*F4
F2
M(45+R3)
F4
F6
F8
F10
F12 ...
F30
DAP Mult 2 completing; what is waiting for it? (nothing) Spr.98 UCB 29
RS for j Qj
RS for k Qk
Clock
57 FU
F0
M*F4
F2
M(45+R3)
F4
F6
F8
F10
F12 ...
F30
Fk? Rk
Clock
62 FU
F0
F2
F4
F6 F8 F10
F12
...
F30
DAP Spr.98 UCB 31
Tomasulo Drawbacks
Complexity
delays of 360/91, MIPS 10000, IBM 620?
Many associative stores (CDB) at high speed Performance limited by Common Data Bus
Multiple CDBs => more FU logic for parallel assoc stores
Assume Multiply takes 4 clocks Assume first load takes 8 clocks (cache miss?), second load takes 4 clocks (hit) To be clear, will show clocks for SUBI, BNEZ Reality, integer instructions ahead
DAP Spr.98 UCB 34
S1 Vj
S2 Vk
0 R1 F0 F2 0 R1 R1 #8 Loop
Clock
0
R1 80
F0
Qi
F2
F4
F6
F8
Clock
1
R1 80
F0
Qi Load1
F2
F4
F6
F8
Clock
2
R1 80
F0
Qi Load1
F2
F4
Mult1
F6
F8
Clock
3
R1 80
F0
Qi Load1
F2
F4
Mult1
F6
F8
Clock
4
R1 72
F0
Qi Load1
F2
F4
Mult1
F6
F8
Clock
5
R1 72
F0
Qi Load1
F2
F4
Mult1
F6
F8
S1 Vj
S2 Vk
R(F2)
0 R1 F0 F2 0 R1 R1 #8 Loop
Clock
6
R1 72
F0
Qi Load2
F2
F4
Mult1
F6
F8
R(F2) R(F2)
0 R1 F0 F2 0 R1 R1 #8 Loop
Clock
7
R1 72
F0
Qi Load2
F2
F4
Mult2
F6
F8
Clock
8
R1 72
F0
Qi Load2
F2
F4
Mult2
F6
F8
Clock
9
R1 64
F0
Qi Load2
F2
F4
Mult2
F6
F8
Clock
10
R1 64
F0
Qi Load2
F2
F4
Mult2
F6
F8
Clock
11
R1 64
F0
Qi Load3
F2
F4
Mult2
F6
F8
Clock
12
R1 64
F0
Qi Load3
F2
F4
Mult2
F6
F8
Clock
13
R1 64
F0
Qi Load3
F2
F4
Mult2
F6
F8
Clock
14
R1 64
F0
Qi Load3
F2
F4
Mult2
F6
F8
Clock
15
R1 64
F0
Qi Load3
F2
F4
Mult2
F6
F8
Clock
16
R1 64
F0
Qi Load3
F2
F4
Mult1
F6
F8
Clock
17
R1 64
F0
Qi Load3
F2
F4
Mult1
F6
F8
Clock
18
R1 56
F0
Qi Load3
F2
F4
Mult1
F6
F8
Clock
19
R1 56
F0
Qi Load3
F2
F4
Mult1
F6
F8
Clock
20
R1 56
F0
Qi Load3
F2
F4
Mult1
F6
F8
Clock
21
R1 56
F0
Qi Load3
F2
F4
Mult1
F6
F8
Tomasulo Summary
Reservations stations: renaming to larger set of registers + buffering source operands
Prevents registers as bottleneck Avoids WAR, WAW hazards of Scoreboard Allows loop unrolling in HW
Not limited to basic blocks (integer units gets ahead, beyond branches) Helps cache misses as well Lasting Contributions
Dynamic scheduling Register renaming Load/store disambiguation
360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264