Download as pdf or txt
Download as pdf or txt
You are on page 1of 53

EEF011 Computer Architecture

Sections 3.2 and 3.3


Dynamic Scheduling
Tomasulos Algorithm

October 2004

A Dynamic Algorithm:
Tomasulos Algorithm
For IBM 360/91 (before caches!) 3 years after CDC
Goal: High Performance without special compilers
Small number of floating point registers (4 in 360)
prevented interesting compiler scheduling of operations
This led Tomasulo to try to figure out how to get more effective
registers renaming in hardware!

Why Study 1966 Computer?


The descendants of this have flourished!
Alpha 21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604,

Example to eleminate WAR and WAW


by register renaming
Original
DIV.D
F0, F2, F4
ADD.D
F6, F0, F8
S.D
F6, 0(R1)
SUB.D
F8, F10, F14
MUL.D
F6, F10, F8
WAR between ADD.D and SUB.D, WAW between ADD.D and MUL.D
(Due to that DIV.D needs to take much longer cycles to get F0)

Register renaming
DIV.D
ADD.D
S.D
SUB.D
MUL.D

F0, F2, F4
S, F0, F8
S, 0(R1)
T, F10, F14
F6, F10, T

Tomasulo Algorithm
Register renaming provided

by reservation stations, which buffer the operands of


instructions waiting to issue
by the issue logic

Basic idea:

a reservation station fetches and buffers an operand as


soon as it is available, eliminating the need to get the
operand from a register (WAR)
pending instructions designate the reservation station that
will provide their input (RAW)
when successive writes to a register overlap in execution,
only the last one is actually used to update the register
(WAW)
As instructions are issued, the register specifiers for pending
operands are renamed to the names of the reservation
station, which provides register renaming
more reservation stations than real registers

Properties of Tomasulo Algorithm


1. Control & buffers distributed with Function Units (FU)
Hazard detection and execution control are distributed
FU buffers called reservation stations; have pending
operands
Registers in instructions replaced by values or pointers to
reservation stations(RS)
form of register renaming to avoids WAR, WAW hazards

2. Bypassing: Results passed directly to FU from RS, not


through registers, over Common Data Bus
that broadcasts results to all FUs, so allows all units waiting
for an operand to be loaded simultaneously

Load and Stores treated as FUs with RSs as well


Integer instructions can go past branches, allowing
FP ops beyond basic block in FP queue

Figure 3.2 Basic structure of a MIPS


floating-point unit using Tomasulos algorithm
Load buffers:
1. hold components of the effected addr
2. track outstanding loads that are waiting
on the memory
3. hold the results of completed loads that
are waiting for the CDB
Store buffers:
1. hold components of the
effected addr
2. hold the destination
memory addresses of
outstanding stores that are
waiting for the data value
to store
3. hold the addr and value to
store until the memory unit
is available

Three Stages of Tomasulo Algorithm


1. Issueget instruction from the head of the instruction queue
If reservation station free (no structural hazard),
control issues instr with the operand values (renames registers).
No free RS => there is a structural hazard
If the operands are not in the registers, keep track of FU
This step renames registers, eliminating WAR and WAW hazards

2. Executeoperate on operands (EX)

When both operands ready (placed into RS), then execute;


if not ready, monitor Common Data Bus for result
By delaying EX until the operands are available, RAW hazards are avoided

3. Write resultfinish execution (WB)

Write on Common Data Bus to the registers and the RS of all awaiting units;
mark reservation station available

Normal data bus: data + destination (go to bus)


Common data bus: data + source (come from bus)
64 bits of data + 4 bits of Functional Unit source address
Write if matches expected Functional Unit (produces result)
Does the broadcast

7 Components of Reservation Station


Op: Operation to perform in the unit (e.g., + or )
Qj, Qk: Reservation stations producing the
corresponding source operand
Note: Qj,Qk=0 => ready or unnessary

Store buffers only have Qi for RS producing result

Vj, Vk: Value of Source operands

Only one of V field or the Q field is valid


Store buffers has V field, result to be stored

A: used to hold information for the memory address


calculation for a load or a store
Busy: Indicates reservation station or FU is busy
Register result status QiIndicates which functional
unit will write each register, if one exists. Blank when
no pending instructions that will write that register.

Tomasulo Example

Instruction stream

Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6

j
34+
45+
F2
F6
F0
F8

k
R2
R3
F4
F2
F6
F2

Exec Write
Issue Comp Result
Load1
Load2
Load3

Register result status:


Clock
0

No
No
No

3 Load/Buffers

Reservation Stations:
Time Name Busy
Add1
No
Add2
No
FU count
Add3
No
down
Mult1 No
Mult2 No

Busy Address

Op

S1
Vj

S2
Vk

RS
Qj

RS
Qk

3 FP Adder R.S.
2 FP Mult R.S.

F0

F2

F4

F6

F8

FU

Clock cycle
counter
Example speed: 3 clocks for FP +,-; 10 for * ; 40 clks for /

F10

F12

...

F30

Tomasulo Example Cycle 1


Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6

j
34+
45+
F2
F6
F0
F8

k
R2
R3
F4
F2
F6
F2

Exec Write
Issue Comp Result
1

Reservation Stations:
Time Name Busy
Add1
No
Add2
No
Add3
No
Mult1 No
Mult2 No

Register result status:


Clock
1

FU

Busy Address
Load1
Load2
Load3

Op

S1
Vj

S2
Vk

RS
Qj

RS
Qk

F0

F2

F4

F6

F8

Load1

Yes
No
No

34+R2

F10

F12

...

F30

Tomasulo Example Cycle 2


Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6

j
34+
45+
F2
F6
F0
F8

k
R2
R3
F4
F2
F6
F2

Exec Write
Issue Comp Result
1
2

Reservation Stations:
Time Name Busy
Add1
No
Add2
No
Add3
No
Mult1 No
Mult2 No

Register result status:


Clock
2

FU

Busy Address
Load1
Load2
Load3

Op

S1
Vj

S2
Vk

RS
Qj

RS
Qk

F0

F2

F4

F6

F8

Load2

Yes
Yes
No

34+R2
45+R3

F10

F12

Load1

Note: Can have multiple loads outstanding

...

F30

Tomasulo Example Cycle 3


Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6

j
34+
45+
F2
F6
F0
F8

k
R2
R3
F4
F2
F6
F2

Exec Write
Issue Comp Result
1
2
3

Reservation Stations:
Time Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes MULTD
Mult2 No

Register result status:


Clock
3

FU

F0

Busy Address

S1
Vj

Load1
Load2
Load3

S2
Vk

RS
Qj

Yes
Yes
No

34+R2
45+R3

F10

F12

RS
Qk

R(F4) Load2

F2

Mult1 Load2

F4

F6

F8

...

Load1

Note: registers names are removed (renamed) in


Reservation Stations; MULT issued
Load1 completing; what is waiting for Load1?

F30

Tomasulo Example Cycle 4


Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6

j
34+
45+
F2
F6
F0
F8

k
R2
R3
F4
F2
F6
F2

Exec Write
Issue Comp Result
1
2
3
4

Reservation Stations:

Busy Address

3
4

Load1
Load2
Load3

S1
Vj

S2
Vk

RS
Qj

RS
Qk

F2

F4

F6

F8

No
Yes
No

45+R3

F10

F12

Time Name Busy Op


Add1 Yes SUBD M(A1)
Load2
Add2
No
Add3
No
R(F4) Load2
Mult1 Yes MULTD
Mult2 No

Register result status:


Clock
4

FU

F0

Mult1 Load2

M(A1) Add1

Load2 completing; what is waiting for Load2?

...

F30

Tomasulo Example Cycle 5


Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6

j
34+
45+
F2
F6
F0
F8

k
R2
R3
F4
F2
F6
F2

Exec Write
Issue Comp Result
1
2
3
4
5

Reservation Stations:

Busy Address

3
4

4
5

Load1
Load2
Load3

S1
Vj

S2
Vk

RS
Qj

RS
Qk

F2

F4

F6

F8

Time Name Busy Op


2 Add1 Yes SUBD M(A1) M(A2)
Add2
No
Add3
No
10 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1

Register result status:


Clock
5

FU

F0

Mult1 M(A2)

No
No
No

F10

M(A1) Add1 Mult2

Timer starts down for Add1, Mult1

F12

...

F30

Tomasulo Example Cycle 6


Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6

j
34+
45+
F2
F6
F0
F8

k
R2
R3
F4
F2
F6
F2

Exec Write
Issue Comp Result
1
2
3
4
5
6

Reservation Stations:

Busy Address

3
4

4
5

Load1
Load2
Load3

S1
Vj

S2
Vk

RS
Qj

RS
Qk

F2

F4

F6

F8

Time Name Busy Op


1 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD
M(A2) Add1
Add3
No
9 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1

Register result status:


Clock
6

FU

F0

Mult1 M(A2)

Add2

No
No
No

F10

F12

...

Add1 Mult2

Issue ADDD here despite name dependency on F6?

F30

Tomasulo Example Cycle 7


Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6

j
34+
45+
F2
F6
F0
F8

k
R2
R3
F4
F2
F6
F2

Exec Write
Issue Comp Result
1
2
3
4
5
6

Reservation Stations:

3
4

Busy Address

4
5

Load1
Load2
Load3

S1
Vj

S2
Vk

RS
Qj

RS
Qk

F2

F4

F6

F8

Time Name Busy Op


0 Add1 Yes SUBD M(A1) M(A2)
Add2 Yes ADDD
M(A2) Add1
Add3
No
8 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1

Register result status:


Clock
7

FU

F0

No
No
No

Mult1 M(A2)

Add2

F10

F12

Add1 Mult2

Add1 (SUBD) completing; what is waiting for it?

...

F30

Tomasulo Example Cycle 8


Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6

j
34+
45+
F2
F6
F0
F8

k
R2
R3
F4
F2
F6
F2

Exec Write
Issue Comp Result
1
2
3
4
5
6

Reservation Stations:

Busy Address

3
4

4
5

Load1
Load2
Load3

S1
Vj

S2
Vk

RS
Qj

RS
Qk

F2

F4

F6

F8

Time Name Busy Op


Add1
No
2 Add2 Yes ADDD (M-M) M(A2)
Add3
No
7 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1

Register result status:


Clock
8

FU

F0

Mult1 M(A2)

No
No
No

F10

Add2 (M-M) Mult2

F12

...

F30

Tomasulo Example Cycle 9


Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6

j
34+
45+
F2
F6
F0
F8

k
R2
R3
F4
F2
F6
F2

Exec Write
Issue Comp Result
1
2
3
4
5
6

Reservation Stations:

Busy Address

3
4

4
5

Load1
Load2
Load3

S1
Vj

S2
Vk

RS
Qj

RS
Qk

F2

F4

F6

F8

Time Name Busy Op


Add1
No
1 Add2 Yes ADDD (M-M) M(A2)
Add3
No
6 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1

Register result status:


Clock
9

FU

F0

Mult1 M(A2)

No
No
No

F10

Add2 (M-M) Mult2

F12

...

F30

Tomasulo Example Cycle 10


Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6

j
34+
45+
F2
F6
F0
F8

k
R2
R3
F4
F2
F6
F2

Exec Write
Issue Comp Result
1
2
3
4
5
6

Reservation Stations:

3
4

4
5

Busy Address
Load1
Load2
Load3

10

S1
Vj

S2
Vk

RS
Qj

RS
Qk

F2

F4

F6

F8

Time Name Busy Op


Add1
No
0 Add2 Yes ADDD (M-M) M(A2)
Add3
No
5 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1

Register result status:


Clock
10

FU

F0

No
No
No

Mult1 M(A2)

F10

F12

Add2 (M-M) Mult2

Add2 (ADDD) completing; what is waiting for it?

...

F30

Tomasulo Example Cycle 11


Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6

j
34+
45+
F2
F6
F0
F8

k
R2
R3
F4
F2
F6
F2

Exec Write
Issue Comp Result
1
2
3
4
5
6

Reservation Stations:

Busy Address

3
4

4
5

Load1
Load2
Load3

10

11

S1
Vj

S2
Vk

RS
Qj

RS
Qk

F2

F4

F6

F8

Time Name Busy Op


Add1
No
Add2
No
Add3
No
4 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1

Register result status:


Clock
11

FU

F0

Mult1 M(A2)

No
No
No

F10

(M-M+M(M-M) Mult2

Write result of ADDD here?


All quick instructions complete in this cycle!

F12

...

F30

Tomasulo Example Cycle 12


Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6

j
34+
45+
F2
F6
F0
F8

k
R2
R3
F4
F2
F6
F2

Exec Write
Issue Comp Result
1
2
3
4
5
6

Reservation Stations:

Busy Address

3
4

4
5

Load1
Load2
Load3

10

11

S1
Vj

S2
Vk

RS
Qj

RS
Qk

F2

F4

F6

F8

Time Name Busy Op


Add1
No
Add2
No
Add3
No
3 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1

Register result status:


Clock
12

FU

F0

Mult1 M(A2)

No
No
No

F10

(M-M+M(M-M) Mult2

F12

...

F30

Tomasulo Example Cycle 13


Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6

j
34+
45+
F2
F6
F0
F8

k
R2
R3
F4
F2
F6
F2

Exec Write
Issue Comp Result
1
2
3
4
5
6

Reservation Stations:

Busy Address

3
4

4
5

Load1
Load2
Load3

10

11

S1
Vj

S2
Vk

RS
Qj

RS
Qk

F2

F4

F6

F8

Time Name Busy Op


Add1
No
Add2
No
Add3
No
2 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1

Register result status:


Clock
13

FU

F0

Mult1 M(A2)

No
No
No

F10

(M-M+M(M-M) Mult2

F12

...

F30

Tomasulo Example Cycle 14


Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6

j
34+
45+
F2
F6
F0
F8

k
R2
R3
F4
F2
F6
F2

Exec Write
Issue Comp Result
1
2
3
4
5
6

Reservation Stations:

Busy Address

3
4

4
5

Load1
Load2
Load3

10

11

S1
Vj

S2
Vk

RS
Qj

RS
Qk

F2

F4

F6

F8

Time Name Busy Op


Add1
No
Add2
No
Add3
No
1 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1

Register result status:


Clock
14

FU

F0

Mult1 M(A2)

No
No
No

F10

(M-M+M(M-M) Mult2

F12

...

F30

Tomasulo Example Cycle 15


Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6

j
34+
45+
F2
F6
F0
F8

k
R2
R3
F4
F2
F6
F2

Exec Write
Issue Comp Result
1
2
3
4
5
6

Reservation Stations:

Busy Address

3
4
15
7

4
5

Load1
Load2
Load3

10

11

S1
Vj

S2
Vk

RS
Qj

RS
Qk

F2

F4

F6

F8

Time Name Busy Op


Add1
No
Add2
No
Add3
No
0 Mult1 Yes MULTD M(A2) R(F4)
Mult2 Yes DIVD
M(A1) Mult1

Register result status:


Clock
15

FU

F0

Mult1 M(A2)

No
No
No

F10

F12

...

(M-M+M(M-M) Mult2

Mult1 (MULTD) completing; what is waiting for it?

F30

Tomasulo Example Cycle 16


Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6

j
34+
45+
F2
F6
F0
F8

k
R2
R3
F4
F2
F6
F2

Exec Write
Issue Comp Result
1
2
3
4
5
6

Reservation Stations:

3
4
15
7

4
5
16
8

Load1
Load2
Load3

10

11

S1
Vj

S2
Vk

RS
Qj

RS
Qk

F2

F4

F6

F8

Time Name Busy Op


Add1
No
Add2
No
Add3
No
Mult1 No
40 Mult2 Yes DIVD M*F4 M(A1)

Register result status:


Clock
16

FU

F0

Busy Address

M*F4 M(A2)

No
No
No

F10

(M-M+M(M-M) Mult2

Just waiting for Mult2 (DIVD) to complete

F12

...

F30

Tomasulo Example Cycle 55


Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6

j
34+
45+
F2
F6
F0
F8

k
R2
R3
F4
F2
F6
F2

Exec Write
Issue Comp Result
1
2
3
4
5
6

Reservation Stations:

3
4
15
7

4
5
16
8

Load1
Load2
Load3

10

11

S1
Vj

S2
Vk

RS
Qj

RS
Qk

F2

F4

F6

F8

Time Name Busy Op


Add1
No
Add2
No
Add3
No
Mult1 No
1 Mult2 Yes DIVD M*F4 M(A1)

Register result status:


Clock
55

FU

F0

Busy Address

M*F4 M(A2)

No
No
No

F10

(M-M+M(M-M) Mult2

F12

...

F30

Tomasulo Example Cycle 56


Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6

j
34+
45+
F2
F6
F0
F8

k
R2
R3
F4
F2
F6
F2

Exec Write
Issue Comp Result
1
2
3
4
5
6

Reservation Stations:

3
4
15
7
56
10

4
5
16
8

Load1
Load2
Load3

11

S1
Vj

S2
Vk

RS
Qj

RS
Qk

F2

F4

F6

F8

Time Name Busy Op


Add1
No
Add2
No
Add3
No
Mult1 No
0 Mult2 Yes DIVD M*F4 M(A1)

Register result status:


Clock
56

FU

F0

Busy Address

M*F4 M(A2)

No
No
No

F10

F12

...

(M-M+M(M-M) Mult2

Mult2 (DIVD) is completing; what is waiting for it?

F30

Tomasulo Example Cycle 57


Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6

j
34+
45+
F2
F6
F0
F8

k
R2
R3
F4
F2
F6
F2

Exec Write
Issue Comp Result
1
2
3
4
5
6

Reservation Stations:

3
4
15
7
56
10

4
5
16
8
57
11

Load1
Load2
Load3

S1
Vj

S2
Vk

RS
Qj

RS
Qk

F2

F4

F6

F8

Time Name Busy Op


Add1
No
Add2
No
Add3
No
Mult1 No
Mult2 Yes DIVD M*F4 M(A1)

Register result status:


Clock
56

FU

F0

Busy Address

M*F4 M(A2)

No
No
No

F10

F12

...

(M-M+M(M-M) Result

Once again: In-order issue, out-of-order execution


and out-of-order completion.

F30

Tomasulo Drawbacks
Complexity

delays of 360/91, MIPS 10000, Alpha 21264,


IBM PPC 620 in CA:AQA 2/e, but not in silicon!

Many associative stores (CDB) at high speed


Performance limited by Common Data Bus

Each CDB must go to multiple functional units


high capacitance, high wiring density
Number of functional units that can complete per cycle
limited to one!
Multiple CDBs more FU logic for parallel assoc stores

Non-precise interrupts!
We will address this later

Tomasulo Loop Example


Loop:LD
MULTD
SD
SUBI
BNEZ

F0
F4
F4
R1
R1

0
F0
0
R1
Loop

R1
F2
R1
#8

This time assume Multiply takes 4 clocks


Assume 1st load takes 8 clocks
(L1 cache miss), 2nd load takes 1 clock (hit)
To be clear, will show clocks for SUBI, BNEZ
Reality: integer instructions ahead of Fl. Pt. Instructions

Show 2 iterations

Loop Example
Instruction status:
ITER Instruction
1
1
1
Iter2
ation 2
Count 2

LD
MULTD
SD
LD
MULTD
SD

F0
F4
F4
F0
F4
F4

0
F0
0
0
F0
0

R1
F2
R1
R1
F2
R1

Reservation Stations:
Time

Name Busy
Add1
No
Add2
No
Add3
No
Mult1 No
Mult2 No

Op

Vj

Exec Write
Issue CompResult

Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3

S1
Vk

S2
Qj

RS
Qk

Code:
LD
MULTD
SD
SUBI
BNEZ

No
No
No
No
No
No

Added Store Buffers


F0
F4
F4
R1
R1

Register result status

Clock
0

F0

R1
80

F2

F4

F6

F8

Fu

F10 F12

0
F0
0
R1
Loop

R1
F2
R1
#8

...

F30

Instruction Loop

Fu
Value of Register used for address, iteration control

Loop Example Cycle 1


Instruction status:
ITER Instruction
1

LD

F0

R1

Vj

S1
Vk

Reservation Stations:
Time

Name Busy
Add1
No
Add2
No
Add3
No
Mult1 No
Mult2 No

Exec Write
Issue CompResult

Op

S2
Qj

RS
Qk

Busy Addr

Fu

Load1
Load2
Load3
Store1
Store2
Store3

Yes
No
No
No
No
No

80

Code:
LD
MULTD
SD
SUBI
BNEZ

F0
F4
F4
R1
R1

0
F0
0
R1
Loop

R1
F2
R1
#8

...

F30

Register result status

Clock
1

R1
80

F0
Fu Load1

F2

F4

F6

F8

F10 F12

Loop Example Cycle 2


Instruction status:
ITER Instruction
1
1

LD
MULTD

F0
F4

0
F0

R1
F2

1
2

Vj

S1
Vk

Reservation Stations:
Time

Exec Write
Issue CompResult

Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No

S2
Qj

RS
Qk

R(F2) Load1

Busy Addr

Fu

Load1
Load2
Load3
Store1
Store2
Store3

Yes
No
No
No
No
No

80

Code:
LD
MULTD
SD
SUBI
BNEZ

F0
F4
F4
R1
R1

0
F0
0
R1
Loop

R1
F2
R1
#8

...

F30

Register result status

Clock
2

R1
80

F0
Fu Load1

F2

F4
Mult1

F6

F8

F10 F12

Loop Example Cycle 3


Instruction status:
ITER Instruction
1
1
1

LD
MULTD
SD

F0
F4
F4

0
F0
0

R1
F2
R1

Reservation Stations:
Time

Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No

Vj

Exec Write
Issue CompResult
1
2
3

S1
Vk

S2
Qj

RS
Qk

R(F2) Load1

Busy Addr

Fu

Load1
Load2
Load3
Store1
Store2
Store3

Yes
No
No
Yes
No
No

80

80

Mult1

Code:
LD
MULTD
SD
SUBI
BNEZ

F0
F4
F4
R1
R1

0
F0
0
R1
Loop

R1
F2
R1
#8

...

F30

Register result status

Clock
3

R1
80

F0
Fu Load1

F2

F4

F6

F8

F10 F12

Mult1

Implicit renaming sets up data flow graph

Loop Example Cycle 4


Instruction status:
ITER Instruction
1
1
1

LD
MULTD
SD

F0
F4
F4

0
F0
0

R1
F2
R1

Reservation Stations:
Time

Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No

Vj

Exec Write
Issue CompResult
1
2
3

S1
Vk

S2
Qj

RS
Qk

R(F2) Load1

Fu

Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3

Yes
No
No
Yes
No
No

80

80

Mult1

Code:
LD
MULTD
SD
SUBI
BNEZ

F0
F4
F4
R1
R1

0
F0
0
R1
Loop

R1
F2
R1
#8

...

F30

Register result status

Clock
4

R1
80

F0
Fu Load1

F2

F4

F6

F8

F10 F12

Mult1

Dispatching SUBI Instruction (not in FP queue)

Loop Example Cycle 5


Instruction status:
ITER Instruction
1
1
1

LD
MULTD
SD

F0
F4
F4

0
F0
0

R1
F2
R1

Reservation Stations:
Time

Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No

Vj

Exec Write
Issue CompResult
1
2
3

S1
Vk

S2
Qj

RS
Qk

R(F2) Load1

Busy Addr

Fu

Load1
Load2
Load3
Store1
Store2
Store3

Yes
No
No
Yes
No
No

80

80

Mult1

Code:
LD
MULTD
SD
SUBI
BNEZ

F0
F4
F4
R1
R1

0
F0
0
R1
Loop

R1
F2
R1
#8

...

F30

Register result status

Clock
5

R1
72

F0
Fu Load1

F2

F4

F6

F8

F10 F12

Mult1

And, BNEZ instruction (not in FP queue)

Loop Example Cycle 6


Instruction status:
ITER Instruction
1
1
1
2

LD
MULTD
SD
LD

F0
F4
F4
F0

0
F0
0
0

R1
F2
R1
R1

1
2
3
6

Vj

S1
Vk

Reservation Stations:
Time

Exec Write
Issue CompResult

Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No

S2
Qj

RS
Qk

R(F2) Load1

Busy Addr

Fu

Load1
Load2
Load3
Store1
Store2
Store3

Yes
Yes
No
Yes
No
No

80
72
80

Mult1

Code:
LD
MULTD
SD
SUBI
BNEZ

F0
F4
F4
R1
R1

0
F0
0
R1
Loop

R1
F2
R1
#8

...

F30

Register result status

Clock
6

R1
72

F0
Fu Load2

F2

F4

F6

F8

F10 F12

Mult1

Notice that F0 never sees Load from location 80

Loop Example Cycle 7


Instruction status:
ITER Instruction
1
1
1
2
2

LD
MULTD
SD
LD
MULTD

F0
F4
F4
F0
F4

0
F0
0
0
F0

R1
F2
R1
R1
F2

1
2
3
6
7

Vj

S1
Vk

Reservation Stations:
Time

Exec Write
Issue CompResult

Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 Yes Multd

S2
Qj

RS
Qk

R(F2) Load1
R(F2) Load2

Busy Addr

Fu

Load1
Load2
Load3
Store1
Store2
Store3

Yes
Yes
No
Yes
No
No

80
72
80

Mult1

Code:
LD
MULTD
SD
SUBI
BNEZ

F0
F4
F4
R1
R1

0
F0
0
R1
Loop

R1
F2
R1
#8

...

F30

Register result status

Clock
7

R1
72

F0
Fu Load2

F2

F4

F6

F8

F10 F12

Mult2

Register file completely detached from computation


First and Second iteration completely overlapped

Loop Example Cycle 8


Instruction status:
ITER Instruction
1
1
1
2
2
2

LD
MULTD
SD
LD
MULTD
SD

F0
F4
F4
F0
F4
F4

0
F0
0
0
F0
0

R1
F2
R1
R1
F2
R1

1
2
3
6
7
8

Vj

S1
Vk

Reservation Stations:
Time

Exec Write
Issue CompResult

Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 Yes Multd

S2
Qj

RS
Qk

R(F2) Load1
R(F2) Load2

Busy Addr

Fu

Load1
Load2
Load3
Store1
Store2
Store3

Yes
Yes
No
Yes
Yes
No

80
72
80
72

Mult1
Mult2

Code:
LD
MULTD
SD
SUBI
BNEZ

F0
F4
F4
R1
R1

0
F0
0
R1
Loop

R1
F2
R1
#8

...

F30

Register result status

Clock
8

R1
72

F0
Fu Load2

F2

F4
Mult2

F6

F8

F10 F12

Loop Example Cycle 9


Instruction status:
ITER Instruction
1
1
1
2
2
2

LD
MULTD
SD
LD
MULTD
SD

F0
F4
F4
F0
F4
F4

0
F0
0
0
F0
0

R1
F2
R1
R1
F2
R1

1
2
3
6
7
8

Vj

S1
Vk

S2
Qj

Reservation Stations:
Time

Exec Write
Issue CompResult

Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 Yes Multd

RS
Qk

R(F2) Load1
R(F2) Load2

Busy Addr

Fu

Load1
Load2
Load3
Store1
Store2
Store3

Yes
Yes
No
Yes
Yes
No

80
72
80
72

Mult1
Mult2

Code:
LD
MULTD
SD
SUBI
BNEZ

F0
F4
F4
R1
R1

0
F0
0
R1
Loop

R1
F2
R1
#8

...

F30

Register result status

Clock
9

R1
72

F0
Fu Load2

F2

F4

F6

Mult2

Load1 completing: who is waiting?


Note: Dispatching SUBI

F8

F10 F12

Loop Example Cycle 10


Instruction status:
ITER Instruction
1
1
1
2
2
2

LD
MULTD
SD
LD
MULTD
SD

F0
F4
F4
F0
F4
F4

0
F0
0
0
F0
0

R1
F2
R1
R1
F2
R1

Reservation Stations:
Time

Exec Write
Issue CompResult
1
2
3
6
7
8
S1
Vk

10

10

S2
Qj

Name Busy Op
Vj
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd M[80] R(F2)
Mult2 Yes Multd
R(F2) Load2

RS
Qk

Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3

No
Yes
No
Yes
Yes
No

Code:
LD
MULTD
SD
SUBI
BNEZ

F0
F4
F4
R1
R1

Fu

72
80
72

Mult1
Mult2

0
F0
0
R1
Loop

R1
F2
R1
#8

...

F30

Register result status

Clock
10

R1
64

F0
Fu Load2

F2

F4

F6

Mult2

Load2 completing: who is waiting?


Note: Dispatching BNEZ

F8

F10 F12

Loop Example Cycle 11


Instruction status:
ITER Instruction
1
1
1
2
2
2

LD
MULTD
SD
LD
MULTD
SD

F0
F4
F4
F0
F4
F4

0
F0
0
0
F0
0

R1
F2
R1
R1
F2
R1

Reservation Stations:
Time

3
4

Exec Write
Issue CompResult
1
2
3
6
7
8
S1
Vk

Name Busy Op
Vj
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd M[80] R(F2)
Mult2 Yes Multd M[72] R(F2)

10

10

11

S2
Qj

RS
Qk

Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3

No
No
Yes
Yes
Yes
No

Code:
LD
MULTD
SD
SUBI
BNEZ

F0
F4
F4
R1
R1

64
80
72

Fu

Mult1
Mult2

0
F0
0
R1
Loop

R1
F2
R1
#8

...

F30

Register result status

Clock
11

R1
64

F0
Fu Load3

F2

F4
Mult2

Next load in sequence

F6

F8

F10 F12

Loop Example Cycle 12


Instruction status:
ITER Instruction
1
1
1
2
2
2

LD
MULTD
SD
LD
MULTD
SD

F0
F4
F4
F0
F4
F4

0
F0
0
0
F0
0

R1
F2
R1
R1
F2
R1

Reservation Stations:
Time

2
3

Exec Write
Issue CompResult
1
2
3
6
7
8
S1
Vk

Name Busy Op
Vj
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd M[80] R(F2)
Mult2 Yes Multd M[72] R(F2)

10

10

11

S2
Qj

RS
Qk

Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3

No
No
Yes
Yes
Yes
No

Code:
LD
MULTD
SD
SUBI
BNEZ

F0
F4
F4
R1
R1

64
80
72

Fu

Mult1
Mult2

0
F0
0
R1
Loop

R1
F2
R1
#8

...

F30

Register result status

Clock
12

R1
64

F0
Fu Load3

F2

F4

F6

F8

Mult2

Why not issue third multiply?

F10 F12

Loop Example Cycle 13


Instruction status:
ITER Instruction
1
1
1
2
2
2

LD
MULTD
SD
LD
MULTD
SD

F0
F4
F4
F0
F4
F4

0
F0
0
0
F0
0

R1
F2
R1
R1
F2
R1

Reservation Stations:
Time

1
2

Exec Write
Issue CompResult
1
2
3
6
7
8
S1
Vk

Name Busy Op
Vj
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd M[80] R(F2)
Mult2 Yes Multd M[72] R(F2)

10

10

11

S2
Qj

RS
Qk

Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3

No
No
Yes
Yes
Yes
No

Code:
LD
MULTD
SD
SUBI
BNEZ

F0
F4
F4
R1
R1

64
80
72

Fu

Mult1
Mult2

0
F0
0
R1
Loop

R1
F2
R1
#8

...

F30

Register result status

Clock
13

R1
64

F0
Fu Load3

F2

F4

F6

Mult2

Why not issue third store?

F8

F10 F12

Loop Example Cycle 14


Instruction status:
ITER Instruction
1
1
1
2
2
2

LD
MULTD
SD
LD
MULTD
SD

F0
F4
F4
F0
F4
F4

0
F0
0
0
F0
0

R1
F2
R1
R1
F2
R1

Reservation Stations:
Time

0
1

Exec Write
Issue CompResult
1
2
3
6
7
8

9
14
10

11

S1
Vk

S2
Qj

RS
Qk

Name Busy Op
Vj
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd M[80] R(F2)
Mult2 Yes Multd M[72] R(F2)

10

Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3

No
No
Yes
Yes
Yes
No

Code:
LD
MULTD
SD
SUBI
BNEZ

F0
F4
F4
R1
R1

64
80
72

Fu

Mult1
Mult2

0
F0
0
R1
Loop

R1
F2
R1
#8

...

F30

Register result status

Clock
14

R1
64

F0
Fu Load3

F2

F4

F6

F8

F10 F12

Mult2

Mult1 completing. Who is waiting?

Loop Example Cycle 15


Instruction status:
ITER Instruction
1
1
1
2
2
2

LD
MULTD
SD
LD
MULTD
SD

F0
F4
F4
F0
F4
F4

0
F0
0
0
F0
0

R1
F2
R1
R1
F2
R1

Reservation Stations:
Time

Exec Write
Issue CompResult
1
2
3
6
7
8

9
14
10
15

11

S1
Vk

S2
Qj

RS
Qk

Name Busy Op
Vj
Add1
No
Add2
No
Add3
No
Mult1 No
Mult2 Yes Multd M[72] R(F2)

10
15

Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3

No
No
Yes
Yes
Yes
No

Code:
LD
MULTD
SD
SUBI
BNEZ

F0
F4
F4
R1
R1

64
80
72

Fu

[80]*R2
Mult2

0
F0
0
R1
Loop

R1
F2
R1
#8

...

F30

Register result status

Clock
15

R1
64

F0
Fu Load3

F2

F4

F6

F8

F10 F12

Mult2

Mult2 completing. Who is waiting?

Loop Example Cycle 16


Instruction status:
ITER Instruction
1
1
1
2
2
2

LD
MULTD
SD
LD
MULTD
SD

F0
F4
F4
F0
F4
F4

0
F0
0
0
F0
0

R1
F2
R1
R1
F2
R1

1
2
3
6
7
8

9
14
10
15

11
16

Vj

S1
Vk

S2
Qj

RS
Qk

Reservation Stations:
Time

Exec Write
Issue CompResult

Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No

10
15

R(F2) Load3

Busy Addr
Load1
Load2
Load3
Store1
Store2
Store3

No
No
Yes
Yes
Yes
No

Code:
LD
MULTD
SD
SUBI
BNEZ

F0
F4
F4
R1
R1

64
80
72

Fu

[80]*R2
[72]*R2

0
F0
0
R1
Loop

R1
F2
R1
#8

...

F30

Register result status

Clock
16

R1
64

F0
Fu Load3

F2

F4
Mult1

F6

F8

F10 F12

Loop Example Cycle 17


Instruction status:
ITER Instruction
1
1
1
2
2
2

LD
MULTD
SD
LD
MULTD
SD

F0
F4
F4
F0
F4
F4

0
F0
0
0
F0
0

R1
F2
R1
R1
F2
R1

1
2
3
6
7
8

9
14
10
15

11
16

Vj

S1
Vk

S2
Qj

RS
Qk

Reservation Stations:
Time

Exec Write
Issue CompResult

Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No

10
15

R(F2) Load3

Busy Addr

Fu

Load1
Load2
Load3
Store1
Store2
Store3

No
No
Yes
Yes
Yes
Yes

64
80
72
64

Code:
LD
MULTD
SD
SUBI
BNEZ

F0
F4
F4
R1
R1

0
F0
0
R1
Loop

R1
F2
R1
#8

...

F30

[80]*R2
[72]*R2
Mult1

Register result status

Clock
17

R1
64

F0
Fu Load3

F2

F4
Mult1

F6

F8

F10 F12

Loop Example Cycle 18


Instruction status:
ITER Instruction
1
1
1
2
2
2

LD
MULTD
SD
LD
MULTD
SD

F0
F4
F4
F0
F4
F4

0
F0
0
0
F0
0

R1
F2
R1
R1
F2
R1

1
2
3
6
7
8

9
14
18
10
15

10
15

Vj

S1
Vk

S2
Qj

RS
Qk

Reservation Stations:
Time

Exec Write
Issue CompResult

Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No

11
16

R(F2) Load3

Busy Addr

Fu

Load1
Load2
Load3
Store1
Store2
Store3

No
No
Yes
Yes
Yes
Yes

64
80
72
64

Code:
LD
MULTD
SD
SUBI
BNEZ

F0
F4
F4
R1
R1

0
F0
0
R1
Loop

R1
F2
R1
#8

...

F30

[80]*R2
[72]*R2
Mult1

Register result status

Clock
18

R1
64

F0
Fu Load3

F2

F4
Mult1

F6

F8

F10 F12

Loop Example Cycle 19


Instruction status:
ITER Instruction
1
1
1
2
2
2

LD
MULTD
SD
LD
MULTD
SD

F0
F4
F4
F0
F4
F4

0
F0
0
0
F0
0

R1
F2
R1
R1
F2
R1

1
2
3
6
7
8

9
14
18
10
15
19

10
15
19
11
16

Vj

S1
Vk

S2
Qj

RS
Qk

Reservation Stations:
Time

Exec Write
Issue CompResult

Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No

R(F2) Load3

Busy Addr

Fu

Load1
Load2
Load3
Store1
Store2
Store3

No
No
Yes
No
Yes
Yes

72
64

[72]*R2
Mult1

Code:
LD
MULTD
SD
SUBI
BNEZ

F0
F4
F4
R1
R1

0
F0
0
R1
Loop

R1
F2
R1
#8

...

F30

64

Register result status

Clock
19

R1
56

F0
Fu Load3

F2

F4
Mult1

F6

F8

F10 F12

Loop Example Cycle 20


Instruction status:
ITER Instruction
1
1
1
2
2
2

LD
MULTD
SD
LD
MULTD
SD

F0
F4
F4
F0
F4
F4

0
F0
0
0
F0
0

R1
F2
R1
R1
F2
R1

1
2
3
6
7
8

9
14
18
10
15
19

10
15
19
11
16
20

Vj

S1
Vk

S2
Qj

RS
Qk

Reservation Stations:
Time

Exec Write
Issue CompResult

Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 No

R(F2) Load3

Busy Addr

Fu

Load1
Load2
Load3
Store1
Store2
Store3

Yes
No
Yes
No
No
Yes

56

64

Mult1

Code:
LD
MULTD
SD
SUBI
BNEZ

F0
F4
F4
R1
R1

0
F0
0
R1
Loop

R1
F2
R1
#8

...

F30

64

Register result status

Clock
20

R1
56

F0
Fu Load1

F2

F4

F6

F8

F10 F12

Mult1

Once again: In-order issue, out-of-order execution


and out-of-order completion.

Why can Tomasulo overlap iterations


of loops?
Register renaming

Multiple iterations use different physical destinations for


registers (dynamic loop unrolling).

Reservation stations

Permit instruction issue to advance past integer control flow


operations
Also buffer old values of registers - totally avoiding the WAR
stall that we saw in the scoreboard.

Other perspective: Tomasulo building data flow


dependency graph on the fly.

Tomasulos scheme offers 2 major


advantages
(1) the distribution of the hazard detection logic

distributed reservation stations and the CDB


If multiple instructions waiting on single result, & each
instruction has other operand, then instructions can be
released simultaneously by broadcast on CDB
If a centralized register file were used, the units
would have to read their results from the registers
when register buses are available.

(2) the elimination of stalls for WAW and WAR


hazards of scoreboard

You might also like