
Lecture 6

Scoreboard Contd. and Tomasulo's Algorithm
Instructor: Laxmi Bhuyan

Nov. 2, 2004

Lec. 7

Three Parts of the Scoreboard

1. Instruction status: which of 4 steps the instruction is in
(Issue, Read operands, EX, Write)
2. Functional unit status: indicates the state of the functional unit (FU). 9 fields for
each functional unit
Busy: Indicates whether the unit is busy or not
Op: Operation to perform in the unit (e.g., + or -)
Fi: Destination register
Fj, Fk: Source-register numbers
Qj, Qk: Functional units producing source registers Fj, Fk
Rj, Rk: Flags indicating when Fj, Fk are ready and not yet read. Set to
No after operands are read.

3. Register result status: indicates which functional unit will write each register, if one
exists. Blank when no pending instruction will write that register
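
The three parts can be pictured as three small data structures. The Python sketch below is illustrative only: the names `FUStatus`, `Scoreboard`, `instr_status`, `fu_status` and `reg_result` are assumptions chosen to mirror the slide, not part of any real simulator.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FUStatus:
    """The nine functional-unit status fields from the slide."""
    busy: bool = False
    op: Optional[str] = None   # operation to perform
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source register 1
    fk: Optional[str] = None   # source register 2
    qj: Optional[str] = None   # FU producing Fj, if any
    qk: Optional[str] = None   # FU producing Fk, if any
    rj: bool = False           # Fj ready and not yet read
    rk: bool = False           # Fk ready and not yet read


@dataclass
class Scoreboard:
    # 1. Instruction status: instruction index -> {step name: cycle reached}
    instr_status: Dict[int, Dict[str, int]] = field(default_factory=dict)
    # 2. Functional unit status: FU name -> its nine fields
    fu_status: Dict[str, FUStatus] = field(default_factory=dict)
    # 3. Register result status: register -> FU that will write it (None = blank)
    reg_result: Dict[str, Optional[str]] = field(default_factory=dict)


# Hypothetical initial state for the example machine used in these slides.
sb = Scoreboard(
    fu_status={n: FUStatus() for n in ["Integer", "Mult1", "Mult2", "Add", "Divide"]},
    reg_result={f"F{i}": None for i in range(0, 32, 2)},
)
```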


Detailed Scoreboard Pipeline Control

Instruction status: Issue
Wait until: Not Busy(FU) and not Result(D)   [the Result(D) check prevents WAW]
Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2;
Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk; Result(D) ← FU

Instruction status: Read operands
Wait until: Rj and Rk
Bookkeeping: Rj ← No; Rk ← No

Instruction status: Execution complete
Wait until: Functional unit done

Instruction status: Write result
Wait until: ∀f((Fj(f) != Fi(FU) or Rj(f) = No) & (Fk(f) != Fi(FU) or Rk(f) = No))   [prevents WAR]
Bookkeeping: ∀f(if Qj(f) = FU then Rj(f) ← Yes); ∀f(if Qk(f) = FU then Rk(f) ← Yes);
Result(Fi(FU)) ← 0; Busy(FU) ← No

A.55 on page A-76
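
The two "wait until" conditions that involve hazards can be written as small predicates. This is a minimal sketch, assuming busy functional units are kept in a dictionary keyed by name; `can_issue`, `can_write_result`, `fus` and `result` are hypothetical names, and the sample state is taken from cycle 17 of the example that follows.

```python
def can_issue(fu, dest, fus, result):
    """Issue: the FU must be free and no active instruction may write dest (WAW)."""
    return fu not in fus and result.get(dest) is None


def can_write_result(fu, fus):
    """Write result: no unit may still have to read the old value of Fi(FU) (WAR)."""
    fi = fus[fu]["Fi"]
    return all(
        (f["Fj"] != fi or not f["Rj"]) and (f["Fk"] != fi or not f["Rk"])
        for f in fus.values()
    )


# Busy units at cycle 17: ADDD has finished executing,
# but DIVD has not yet read F6 (its Rk flag is still Yes).
fus = {
    "Mult1":  {"Fi": "F0",  "Fj": "F2", "Fk": "F4", "Rj": False, "Rk": False},
    "Add":    {"Fi": "F6",  "Fj": "F8", "Fk": "F2", "Rj": False, "Rk": False},
    "Divide": {"Fi": "F10", "Fj": "F0", "Fk": "F6", "Rj": False, "Rk": True},
}
result = {"F0": "Mult1", "F6": "Add", "F10": "Divide"}

print(can_write_result("Add", fus))           # False: WAR on F6 with DIVD
fus["Divide"]["Rk"] = False                   # DIVD has now read its operands
print(can_write_result("Add", fus))           # True
print(can_issue("Mult2", "F6", fus, result))  # False: F6 already has a pending writer
```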


Scoreboard Example
The following latencies are chosen to illustrate behavior; they are not
representative.
LD is 1 cycle
(compute address + data cache access)
ADDD and SUBD are 2 cycles
Multiply is 10 cycles
Divide is 40 cycles


Scoreboard Example

Instruction status
Instruction  j    k    Issue  Read operands  Execution complete  Write Result
LD     F6    34+  R2
LD     F2    45+  R3
MULTD  F0    F2   F4
SUBD   F8    F6   F2
DIVD   F10   F0   F6
ADDD   F6    F8   F2

Functional unit status
                           dest  S1   S2   FU for j  FU for k  Fj?  Fk?
Time  Name     Busy  Op    Fi    Fj   Fk   Qj        Qk        Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
Clock:
      F0     F2     F4     F6       F8     F10     F12  ...  F30
FU

Scoreboard Example Cycle 1

Instruction status
Instruction  j    k    Issue  Read operands  Execution complete  Write Result
LD     F6    34+  R2   1
LD     F2    45+  R3
MULTD  F0    F2   F4
SUBD   F8    F6   F2
DIVD   F10   F0   F6
ADDD   F6    F8   F2

Functional unit status
                           dest  S1   S2   FU for j  FU for k  Fj?  Fk?
Time  Name     Busy  Op    Fi    Fj   Fk   Qj        Qk        Rj   Rk
      Integer  Yes   Load  F6         R2                            Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
Clock: 1
      F0     F2     F4     F6       F8     F10     F12  ...  F30
FU                         Integer

Scoreboard Example Cycle 2

Instruction status
Instruction  j    k    Issue  Read operands  Execution complete  Write Result
LD     F6    34+  R2   1      2
LD     F2    45+  R3
MULTD  F0    F2   F4
SUBD   F8    F6   F2
DIVD   F10   F0   F6
ADDD   F6    F8   F2

Note: Can't issue I2 because the Integer unit is busy. Can't issue the
next instruction either, due to in-order issue.

Functional unit status
                           dest  S1   S2   FU for j  FU for k  Fj?  Fk?
Time  Name     Busy  Op    Fi    Fj   Fk   Qj        Qk        Rj   Rk
      Integer  Yes   Load  F6         R2                            No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
Clock: 2
      F0     F2     F4     F6       F8     F10     F12  ...  F30
FU                         Integer

Scoreboard Example Cycle 3


Instruction status
j
k Issue
Instruction
LD
F6 34+ R2
1
LD
F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD
F10 F0 F6
ADDD F6 F8 F2
Functional unit status
TimeName Busy
Integer Yes
Mult1 No
Mult2 No
Add
No
Divide No
Register result status
F0
Clock
3
FU

Nov. 2, 2004

Read Execution Write


operands
complete Result
2
3

dest
Op Fi
Load F6

S1 S2 FU for FU
j for F
k j?
Fj Fk Qj
Qk
Rj
R2

Fk?
Rk
No

F2

F6 F8 F10
Integer

F30

F4

Lec. 7

F12

...

Scoreboard Example Cycle 4


Instruction status
j
k Issue
Instruction
LD
F6 34+ R2
1
LD
F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status
TimeName Busy
Integer Yes
Mult1 No
Mult2 No
Add
No
Divide No
Register result status
F0
Clock
4
FU

Nov. 2, 2004

Read Execution Write


operands
complete Result
2
3
4

dest
Op Fi
Load F6

S1 S2 FU for FU
j for F
k j?
Fj Fk Qj
Qk
Rj
R2

Fk?
Rk
No

F2

F6 F8 F10

F30

F4

Lec. 7

F12

...

Scoreboard Example Cycle 5


Instruction status
j
k Issue
Instruction
LD
F6 34+ R2
1
LD
F2 45+ R3
5
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status
TimeName Busy
Integer Yes
Mult1 No
Mult2 No
Add
No
Divide No
Register result status
F0
Clock
5
FU

Nov. 2, 2004

Read Execution Write


operands
complete Result
2
3
4
Now I2 is issued

dest
Op
Fi
Load F2

S1 S2 FU for FU
j for F
k j?
Fj Fk Qj
Qk
Rj
R3

Fk?
Rk
Yes

F2
F4
Integer

F6 F8 F10

F30

Lec. 7

F12

...

10

Scoreboard Example Cycle 6


Instruction status
j
k Issue
Instruction
LD
F6 34+ R2
1
LD
F2 45+ R3
5
MULTD F0 F2 F4
6
SUBD F8 F6 F2
DIVD
F10 F0 F6
ADDD F6 F8 F2
Functional unit status
TimeName Busy
Integer Yes
Mult1 Yes
Mult2 No
Add
No
Divide No
Register result status
F0
Clock
6
FU Mult

Nov. 2, 2004

Read Execution Write


operands
complete Result
2
3
4
6

dest
Op Fi
Load F2
Mult F0

S1 S2 FU for j FU for k Fj?


Fj Fk Qj
Qk
Rj
R3
F2 F4 Integer
No

Fk?
Rk
No
Yes

F2
F4
Integer

F6 F8 F10

F30

Lec. 7

F12

...

11

Scoreboard Example Cycle 7


Instruction status
j
k Issue
Instruction
LD
F6 34+ R2
1
LD
F2 45+ R3
5
MULTD F0 F2 F4
6
SUBD F8 F6 F2
7
DIVD
F10 F0 F6
ADDD F6 F8 F2
Functional unit status
TimeName Busy
Integer Yes
Mult1 Yes
Mult2 No
Add
Yes
Divide No
Register result status
F0
Clock
7
FU Mult

Nov. 2, 2004

Read Execution Write


operands
complete Result
2
3
4
6
7

Note: I3 stalled at read because I2 isn't complete

dest
Op Fi
Load F2
Mult F0

S1 S2 FU for j FU for k Fj?


Fj Fk Qj
Qk
Rj
R3
F2 F4 Integer
No

Subd F8

F6 F2

Integer Yes No

F2
F4
Integer

F6 F8 F10
Add

F12

Lec. 7

...

Fk?
Rk
No
Yes

F30

12

Scoreboard Example Cycle 8


Instruction status
j
k Issue
Instruction
LD
F6 34+ R2
1
LD
F2 45+ R3
5
MULTD F0 F2 F4
6
SUBD F8 F6 F2
7
DIVD F10 F0 F6
8
ADDD F6 F8 F2
Functional unit status
TimeName Busy
Integer No
Mult1
Yes
Mult2
No
Add
Yes
Divide Yes
Register result status
F0
Clock
8
FU Mult1

Read EX
Write
Op
compl. Result
2
3
4
6
7
8

Op

dest
Fi

S1 S2 FU for FU
j for kFj? Fk?
Fj Fk Qj
Qk
Rj Rk

Mult

F0

F2 F4

Yes Yes

Sub
Div

F8
F10

F6 F2
F0 F6 Mult1

Yes Yes
No Yes

F2

F4

F6 F8 F10 F12
Add Divide

...

F30

13

Scoreboard Example Cycle 9

Instruction status
Instruction  j    k    Issue  Read operands  Execution complete  Write Result
LD     F6    34+  R2   1      2              3                   4
LD     F2    45+  R3   5      6              7                   8
MULTD  F0    F2   F4   6      9
SUBD   F8    F6   F2   7      9
DIVD   F10   F0   F6   8
ADDD   F6    F8   F2

Note: I3 and I4 read operands because F2 is now available. ADDD (I6) can't
be issued because SUBD (I4) uses the adder.

Functional unit status
                           dest  S1   S2   FU for j  FU for k  Fj?  Fk?
Time  Name     Busy  Op    Fi    Fj   Fk   Qj        Qk        Rj   Rk
      Integer  No
10    Mult1    Yes   Mult  F0    F2   F4                       No   No
      Mult2    No
2     Add      Yes   Sub   F8    F6   F2                       No   No
      Divide   Yes   Div   F10   F0   F6   Mult1               No   Yes

Register result status
Clock: 9
      F0     F2     F4     F6       F8     F10     F12  ...  F30
FU    Mult1                         Add    Divide

Scoreboard Example Cycle 11

Instruction status
Instruction  j    k    Issue  Read operands  Execution complete  Write Result
LD     F6    34+  R2   1      2              3                   4
LD     F2    45+  R3   5      6              7                   8
MULTD  F0    F2   F4   6      9
SUBD   F8    F6   F2   7      9              11
DIVD   F10   F0   F6   8
ADDD   F6    F8   F2

Note: Add takes 2 cycles, so nothing happens in cycle 10. MUL continues.

Functional unit status
                           dest  S1   S2   FU for j  FU for k  Fj?  Fk?
Time  Name     Busy  Op    Fi    Fj   Fk   Qj        Qk        Rj   Rk
      Integer  No
8     Mult1    Yes   Mult  F0    F2   F4                       No   No
      Mult2    No
0     Add      Yes   Sub   F8    F6   F2                       No   No
      Divide   Yes   Div   F10   F0   F6   Mult1               No   Yes

Register result status
Clock: 11
      F0     F2     F4     F6       F8     F10     F12  ...  F30
FU    Mult1                         Add    Divide

Scoreboard Example Cycle 12


Read Execution
Write
Instruction status
k Issueoperands
complete
Result
Instruction j
LD
F6 34+ R2 1
2
3
4
LD
F2 45+ R3 5
6
7
8
MULTD F0 F2 F4
6
9
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
ADDD F6 F8 F2
dest S1 S2
Functional unit status
TimeName Busy Op
Fi
Fj Fk
Integer No
7 Mult1 Yes Mult F0
F2 F4
Mult2 No
Add
No
Divide Yes Div F10 F0 F6
Register result status
F0 F2
F4
F6 F8
Clock
12
FU Mult1

Nov. 2, 2004

FU for FU
j for F
k j?
Qj
Qk
Rj

Fk?
Rk

No

No

Mult1

No

Yes

F10 F12
Divide

...

F30

Lec. 7

16

Scoreboard Example Cycle 13


Read Execution
Write
Instruction status
j
k Issueoperands
complete
Result
Instruction
LD
F6 34+ R2 1
2
3
4
LD
F2 45+ R3 5
6
7
8
MULTD F0 F2 F4
6
9
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
ADDD F6 F8 F2 13
dest S1 S2
Functional unit status
TimeName Busy Op
Fi
Fj Fk
Integer No
6 Mult1 Yes Mult F0
F2 F4
Mult2 No
Add
Yes Add F6
F8 F2
Divide Yes Div F10 F0 F6
Register result status
F0 F2
F4
F6 F8
Clock
13
FU Mult1
Add

Nov. 2, 2004

Now ADDD is issued


because SUBD has
completed

FU for j FU for kFj?


Qj
Qk
Rj
No

Mult1
F10
F12
Divide

Lec. 7

Fk?
Rk
No

Yes Yes
No Yes
...

F30

17

Scoreboard Example Cycle 14


Read Execution
Write
Instruction status
j
k Issueoperands
complete
Result
Instruction
LD
F6 34+ R2 1
2
3
4
LD
F2 45+ R3 5
6
7
8
MULTD F0 F2 F4
6
9
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
ADDD F6 F8 F2 13
14
dest S1 S2
Functional unit status
TimeName Busy Op
Fi
Fj Fk
Integer No
5 Mult1 Yes Mult F0
F2 F4
Mult2 No
2 Add
Yes Add F6
F8 F2
Divide Yes Div F10 F0 F6
Register result status
F0 F2
F4
F6 F8
Clock
14
FU Mult1
Add

Nov. 2, 2004

FU for FU
j for F
k j?
Qj
Qk
Rj

Mult1
F10 F12
Divide

Lec. 7

Fk?
Rk

No

No

No
No

No
Yes

...

F30

18

Scoreboard Example Cycle 15


Read Execution
Write
Instruction status
j
k Issueoperands
complete
Result
Instruction
LD
F6 34+ R2 1
2
3
4
LD
F2 45+ R3 5
6
7
8
MULTD F0 F2 F4
6
9
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
ADDD F6 F8 F2 13
14
dest S1 S2
Functional unit status
TimeName Busy Op
Fi
Fj Fk
Integer No
4 Mult1 Yes Mult F0
F2 F4
Mult2 No
1 Add
Yes Add F6
F8 F2
Divide Yes Div F10 F0 F6
Register result status
F0 F2
F4
F6 F8
Clock
15
FU Mult1
Add

Nov. 2, 2004

Note: ADDD takes 2


cycles, so no change

FU for j FU for k Fj?


Qj
Qk
Rj

Mult1
F10
F12
Divide

Lec. 7

Fk?
Rk

No

No

No
No

No
Yes

...

F30

19

Scoreboard Example Cycle 16


Read Execution
Write
Instruction status
j
k Issue operands
complete
Result
Instruction
LD
F6 34+ R2
1
2
3
4
LD
F2 45+ R3
5
6
7
8
MULTD F0 F2 F4
6
9
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
ADDD F6 F8 F2 13
14
16
dest S1 S2
Functional unit status
TimeName Busy Op
Fi
Fj Fk
Integer No
3 Mult1
Yes Mult F0
F2 F4
Mult2
No
0 Add
Yes Add F6
F8 F2
Divide Yes Div F10 F0 F6
Register result status
F0 F2
F4
F6 F8
Clock
16
FU Mult1
Add

Nov. 2, 2004

ADDD completes, but


MULTD and DIVD go on

FU for j FU for k Fj?


Qj
Qk
Rj

Mult1
F10
Divide

Lec. 7

F12

Fk?
Rk

No

No

No
No

No
Yes

...

F30

20

Scoreboard Example Cycle 17

Instruction status
Instruction  j    k    Issue  Read operands  Execution complete  Write Result
LD     F6    34+  R2   1      2              3                   4
LD     F2    45+  R3   5      6              7                   8
MULTD  F0    F2   F4   6      9
SUBD   F8    F6   F2   7      9              11                  12
DIVD   F10   F0   F6   8
ADDD   F6    F8   F2   13     14             16

Note: ADDD stalls; it can't write back due to the WAR hazard with DIVD.
MULT and DIV continue.

Functional unit status
                           dest  S1   S2   FU for j  FU for k  Fj?  Fk?
Time  Name     Busy  Op    Fi    Fj   Fk   Qj        Qk        Rj   Rk
      Integer  No
2     Mult1    Yes   Mult  F0    F2   F4                       No   No
      Mult2    No
      Add      Yes   Add   F6    F8   F2                       No   No
      Divide   Yes   Div   F10   F0   F6   Mult1               No   Yes

Register result status
Clock: 17
      F0     F2     F4     F6       F8     F10     F12  ...  F30
FU    Mult1                Add             Divide

Scoreboard Example Cycle 18


Read Execution
Write
Instruction status
j
k Issueoperands
complete
Result
Instruction
LD
F6 34+ R2 1
2
3
4
LD
F2 45+ R3 5
6
7
8
MULTD F0 F2 F4
6
9
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
ADDD F6 F8 F2 13
14
16
dest S1 S2
Functional unit status
TimeName Busy Op
Fi
Fj Fk
Integer No
1 Mult1
Yes Mult F0
F2 F4
Mult2
No
Add
Yes Add F6
F8 F2
Divide Yes Div
F10 F0 F6
Register result status
F0 F2
F4
F6 F8
Clock
18
FU Mult1
Add

Nov. 2, 2004

MULT and DIV


continue

FU for FU
j for F
k j?
Qj
Qk
Rj

Mult1
F10 F12
Divide

Lec. 7

Fk?
Rk

No

No

No
No

No
Yes

...

F30

22

Scoreboard Example Cycle 19


Read Execution
Write
Instruction status
j
k Issueoperands
complete
Result
Instruction
LD
F6 34+ R2 1
2
3
4
LD
F2 45+ R3 5
6
7
8
MULTD F0 F2 F4
6
9
19
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
ADDD F6 F8 F2 13
14
16
dest S1 S2
Functional unit status
TimeName Busy Op
Fi
Fj Fk
Integer No
0 Mult1
Yes Mult F0
F2 F4
Mult2
No
Add
Yes Add F6
F8 F2
Divide Yes Div
F10 F0 F6
Register result status
F0 F2
F4
F6 F8
Clock
19
FU Mult1
Add

Nov. 2, 2004

MULT completes
after 10 cycles

FU for FU
j for F
k j?
Qj
Qk
Rj

Mult1
F10 F12
Divide

Lec. 7

Fk?
Rk

No

No

No
No

No
Yes

...

F30

23

Scoreboard Example Cycle 20


j
k Issueoperands
complete
Result
Instruction
LD
F6 34+ R2
1
2
3
4
LD
F2 45+ R3
5
6
7
8
MULTD F0 F2 F4
6
9
19 20
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
ADDD F6 F8 F2 13
14
16
dest S1 S2
Functional unit status
TimeName Busy Op
Fi
Fj Fk
Integer No
Mult1
No
Mult2
No
Add
Yes Add F6
F8 F2
Divide Yes Div
F10 F0 F6
Register result status
F0 F2
F4
F6 F8
Clock
20
FU
Add

Nov. 2, 2004

MULTD completes and


writes to F0

FU for FU
j for F
k j?
Qj
Qk
Rj

Fk?
Rk

No No
Yes Yes
F10 F12
Divide

Lec. 7

...

F30

24

Scoreboard Example Cycle 21


j
k Issueoperands
complete
Result
Instruction
LD
F6 34+ R2
1
2
3
4
LD
F2 45+ R3
5
6
7
8
MULTD F0 F2 F4
6
9
19 20
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
21
ADDD F6 F8 F2 13
14
16
dest S1 S2
Functional unit status
TimeName Busy Op
Fi
Fj Fk
Integer No
Mult1
No
Mult2
No
Add
Yes Add F6
F8 F2
Divide Yes Div
F10 F0 F6
Register result status
F0 F2
F4
F6 F8
Clock
21
FU
Add

Nov. 2, 2004

Now DIVD reads


because F0 is
available

FU for FU
j for F
k j?
Qj
Qk
Rj

F10 F12
Divide

Lec. 7

Fk?
Rk

No
No

No
No

...

F30

25

Scoreboard Example Cycle 22


j
k Issueoperands
complete
Result
Instruction
LD
F6 34+ R2
1
2
3
4
LD
F2 45+ R3
5
6
7
8
MULTD F0 F2 F4
6
9
19 20
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
21
ADDD F6 F8 F2 13
14
16 22
dest S1 S2
Functional unit status
TimeName Busy Op
Fi
Fj Fk
Integer No
Mult1
No
Mult2
No
Add
No
Divide Yes Div
F10 F0 F6
Register result status
F0 F2
F4
F6 F8
Clock
22
FU

Nov. 2, 2004

ADDD writes result


because WAR is
removed.

FU for FU
j for F
k j?
Qj
Qk
Rj

F10 F12
Divide

Lec. 7

Fk?
Rk

No

No

...

F30

26

Scoreboard Example Cycle 61


j
k Issueoperands
complete
Result
Instruction
LD
F6 34+ R2
1
2
3
4
LD
F2 45+ R3
5
6
7
8
MULTD F0 F2 F4
6
9
19 20
SUBD F8 F6 F2
7
9
11 12
DIVD F10 F0 F6
8
21
61
ADDD F6 F8 F2 13
14
16 22
dest S1 S2
Functional unit status
TimeName Busy Op
Fi
Fj Fk
Integer No
Mult1
No
Mult2
No
Add
No
Divide Yes Div
F10 F0 F6
Register result status
F0 F2
F4
F6 F8
Clock
61
FU

Nov. 2, 2004

DIVD completes
execution

FU for FU
j for F
k j?
Qj
Qk
Rj

F10 F12
Divide

Lec. 7

Fk?
Rk

No

No

...

F30

27

Scoreboard Example Cycle 62

Instruction status
Instruction  j    k    Issue  Read operands  Execution complete  Write Result
LD     F6    34+  R2   1      2              3                   4
LD     F2    45+  R3   5      6              7                   8
MULTD  F0    F2   F4   6      9              19                  20
SUBD   F8    F6   F2   7      9              11                  12
DIVD   F10   F0   F6   8      21             61                  62
ADDD   F6    F8   F2   13     14             16                  22

Execution is finished

Functional unit status
                           dest  S1   S2   FU for j  FU for k  Fj?  Fk?
Time  Name     Busy  Op    Fi    Fj   Fk   Qj        Qk        Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
0     Divide   No

Register result status
Clock: 62
      F0     F2     F4     F6       F8     F10     F12  ...  F30
FU

Review: Scoreboard
Limitations of 6600 scoreboard

No forwarding
Limited to instructions in basic block (small window)
Number of functional units (structural hazards)
Stall on WAR hazards
Stall on WAW hazards

DIV.D F0, F2, F4
ADD.D F6, F0, F8
S.D   F6, 0(R1)
SUB.D F8, F10, F14
MUL.D F6, F10, F8

WAR (antidependence): ADD.D reads F8, which the later SUB.D writes.
WAW (output dependence): ADD.D and the later MUL.D both write F6.
Both are name dependences.
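
The name dependences in the sequence above can be found mechanically. The following is a rough sketch, assuming each instruction is encoded as a hypothetical `(op, dest, sources)` tuple; the pairwise check is deliberately naive and ignores intervening writes.

```python
def name_dependences(instrs):
    """Return (kind, earlier, later, register) for every WAR and WAW pair."""
    found = []
    for i, (_, dest_i, srcs_i) in enumerate(instrs):
        for j in range(i + 1, len(instrs)):
            dest_j = instrs[j][1]
            if dest_j is None:          # stores write no register
                continue
            if dest_j in srcs_i:        # later write to a register read earlier
                found.append(("WAR", i, j, dest_j))
            if dest_j == dest_i:        # two writes to the same register
                found.append(("WAW", i, j, dest_j))
    return found


code = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
deps = name_dependences(code)
for kind, i, j, reg in deps:
    print(kind, code[i][0], "->", code[j][0], "on", reg)
```

Besides the two pairs marked on the slide, this pairwise check also reports a WAR between S.D (which reads F6) and MUL.D (which writes it).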

Another Dynamic Algorithm: Tomasulo Algorithm

For IBM 360/91, about 3 years after CDC 6600
Goal: High performance without special compilers
Differences between Tomasulo Algorithm & Scoreboard
Control & buffers distributed with functional units vs. centralized in
scoreboard; the distributed buffers are called reservation stations
Registers in instructions replaced by pointers to reservation station buffers
HW renaming of registers to avoid WAW hazards
Buffer operand values to avoid WAR hazards
Common Data Bus broadcasts results to all FUs
Loads and stores treated as FUs as well

Why study? Led to Alpha 21264, HP 8000, MIPS 10000,
Pentium II, PowerPC 604

FP unit and load-store unit using Tomasulo's algorithm

Another Dynamic Algorithm: Tomasulo Algorithm

Register renaming:

DIV.D F0, F2, F4
ADD.D S, F0, F8
S.D   S, 0(R1)
SUB.D T, F10, F14
MUL.D F6, F10, T

Implemented through reservation stations (rs) per functional unit
Buffers an operand as soon as it is available: avoids WAR hazards.
Pending instructions designate the rs that will provide their inputs: avoids WAW hazards.
The last write in a sequence of writes to the same register actually updates the
register
Decentralizes hazard detection and execution control
Instruction results are passed directly to the FUs from the rs rather than from registers,
through the common data bus (CDB)
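
Renaming can be illustrated in a few lines. This sketch is a simplification and not the slide's exact scheme: it gives every destination a fresh temporary name (`T0`, `T1`, ...), whereas the slide renames only F6 and F8 to S and T. The `(op, dest, sources)` encoding and the function name `rename` are assumptions.

```python
def rename(instrs):
    """Give each destination a fresh name and redirect later reads to it."""
    latest = {}      # architectural register -> most recent temporary name
    renamed = []
    next_id = 0
    for op, dest, srcs in instrs:
        # Sources are looked up before the destination is renamed.
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = f"T{next_id}"
            next_id += 1
            latest[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


code = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
renamed = rename(code)
for op, dest, srcs in renamed:
    print(op, dest, srcs)
```

After renaming, no two instructions write the same name and no instruction writes a name that an earlier one reads, so only the true (RAW) dependences remain.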


Three Stages of Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
Stall on a structural hazard, i.e., no space in the rs. If a reservation station (rs) is free, the
issue logic issues the instruction to the rs and reads operands into the rs if they are ready (register renaming =>
solves WAR). Mark the destination register as waiting for this latest instruction, even if
a previous instruction writing to this register hasn't completed => solves WAW
hazards.

2. Execution: operate on operands (EX)
When both operands are ready, execute;
if not ready, watch the CDB for the result => solves RAW

3. Write result: finish execution (WB)
Write on the Common Data Bus to all awaiting units;
mark the reservation station available. Write the result into the destination register if its status is r
(this reservation station) => solves WAW.

Normal data bus: data + destination ("go to" bus)
CDB: data + source ("come from" bus)
64 bits of data + 4 bits of Functional Unit source address
Write if matches expected Functional Unit (produces result)
Does broadcast
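
The "come from" behavior of the CDB can be sketched as a broadcast that every waiting consumer compares against its own tag. The class `RS` and the function `broadcast` are hypothetical names, and the numeric values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RS:
    name: str
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # producing station, while the value is pending
    qk: Optional[str] = None


def broadcast(tag, value, stations, reg_status, regs):
    """Write result: put (source tag, value) on the CDB for all awaiting units."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg, producer in list(reg_status.items()):
        if producer == tag:          # only the latest pending writer updates the register
            regs[reg] = value
            reg_status[reg] = None


add1 = RS("Add1", vj=1.5, qk="Load2")      # waiting for Load2 as second operand
mult1 = RS("Mult1", vk=2.5, qj="Load2")    # waiting for Load2 as first operand
reg_status = {"F2": "Load2", "F6": "Add2"}
regs = {"F2": 0.0, "F6": 0.0}

broadcast("Load2", 4.0, [add1, mult1], reg_status, regs)
print(add1, mult1, regs, reg_status)
```

A register whose status names a different station (F6 here) ignores the broadcast, which is how an older write to a renamed register is discarded.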


Reservation Station Components

Op: Operation to perform in the unit (e.g., + or -)
Vj, Vk: Values of the source operands.
Qj, Qk: Names of the RSs that will provide the source
operands. A value of zero means the source operand is already
available in Vj or Vk, or is not necessary.
Busy: Indicates the reservation station or FU is busy
Register File Status Qi:
Qi: Indicates which functional unit will write each register, if
one exists. Blank (0) when no pending instruction will write
that register, meaning that the value is already available.
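
How these fields are filled at issue time can be sketched as follows. The function `issue` and the dictionary layout are assumptions for illustration; the sample call mirrors MULTD F0, F2, F4 issuing while F2 is still being loaded, with a made-up value for F4.

```python
def issue(op, dest, src_j, src_k, rs_name, regs, qi):
    """Fill one reservation-station entry and update the register status Qi."""
    entry = {"Busy": True, "Op": op, "Vj": None, "Vk": None, "Qj": 0, "Qk": 0}
    for src, v, q in ((src_j, "Vj", "Qj"), (src_k, "Vk", "Qk")):
        if qi.get(src, 0):            # a pending instruction will write src
            entry[q] = qi[src]        # remember which station to wait for
        else:
            entry[v] = regs[src]      # value already available: copy it now
    qi[dest] = rs_name                # dest now waits for this station
    return entry


regs = {"F2": 0.0, "F4": 2.5}
qi = {"F2": "Load2"}                  # F2 is still being loaded
mult1 = issue("Mult", "F0", "F2", "F4", "Mult1", regs, qi)
print(mult1)
print(qi)
```

Sources are examined before `qi[dest]` is overwritten, so an instruction that reads and writes the same register still picks up the older producer.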


Tomasulo Example Cycle 0

Instruction status
Instruction  j    k    Issue  Execution complete  Write Result
LD     F6    34+  R2
LD     F2    45+  R3
MULTD  F0    F2   F4
SUBD   F8    F6   F2
DIVD   F10   F0   F6
ADDD   F6    F8   F2

       Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations
                         S1  S2  RS for j  RS for k
Time  Name   Busy  Op    Vj  Vk  Qj        Qk
0     Add1   No
0     Add2   No
      Add3   No
0     Mult1  No
0     Mult2  No

Register result status
Clock: 0
      F0     F2     F4     F6       F8     F10     F12  ...  F30
FU

Tomasulo Example Cycle 1


Instruction status
Execution Write
Instruction
j
k Issue complete Result
Busy
LD
F6
34+
R2
1
Load1 Yes
LD
F2
45+
R3
Load2 No
MULTD F0
F2
F4
Load3 No
SUBD
F8
F6
F2
DIVD
F10
F0
F6
ADDD
F6
F8
F2
Reservation Stations
S1
S2 RS for j RS for k
Time Name Busy Op
Vj
Vk
Qj
Qk
0 Add1 No
0 Add2 No
Add3 No
0 Mult1 No
0 Mult2 No
Register result status
Clock
F0
F2
F4
F6
F8
F10
1
FU
Load1

Nov. 2, 2004

Lec. 7

Address
34+R2

F12

...

36

F30

Tomasulo Example Cycle 2

Nov. 2, 2004

Lec. 7

37

Tomasulo Example Cycle 3


Instruction status
Execution Write
Instruction
j
k Issue complete Result
LD
F6
34+
R2
1
2--3
Load1
LD
F2
45+
R3
2
3Load2
MULTD F0
F2
F4
3
Load3
SUBD
F8
F6
F2
DIVD
F10
F0
F6
ADDD
F6
F8
F2
Reservation Stations
S1
S2 RS for j RS for k
Time Name Busy Op
Vj
Vk
Qj
Qk
0 Add1 No
0 Add2 No
read value
Add3 No
0 Mult1 Yes Mult
R(F4) Load2
0 Mult2 No
Register result status
Clock
F0
F2
F4
F6
F8
3
FU Mult1 Load2
Load1

Nov. 2, 2004

Lec. 7

Busy
Yes
Yes
No

Address
34+R2
45+R3

F10

F12

...

38

F30

Tomasulo Example Cycle 4


Instruction status
Instruction
j
LD
F6
34+
LD
F2
45+
MULTD F0
F2
SUBD
F8
F6
DIVD
F10
F0
ADDD
F6
F8
Reservation Stations
Time Name
0 Add1
0 Add2
Add3
0 Mult1
0 Mult2
Register result status
Clock
4

k
R2
R3
F4
F2
F6
F2

Execution Write
Issue complete Result
1
2--3
4
2
3--4
3
4

Busy Op
Yes Sub
No
No
Yes Mult
No

FU

Nov. 2, 2004

F0
Mult1

S1
Vj
M(A1)

F2
Load2

S2
Vk

Busy
Load1 No
Load2 Yes
Load3 No

Address
45+R3

RS for j RS for k
Qj
Qk
Load2

R(F4)

Load2

F4

F6
M(A1)

Lec. 7

F8
Add1

F10

F12

...

39

F30

Tomasulo Example Cycle 5


Instruction status
Instruction
j
LD
F6
34+
LD
F2
45+
MULTD F0
F2
SUBD
F8
F6
DIVD
F10
F0
ADDD
F6
F8
Reservation Stations
Time Name
2 Add1
0 Add2
Add3
10 Mult1
0 Mult2
Register result status
Clock
5

k
R2
R3
F4
F2
F6
F2

Execution Write
Issue complete Result
1
2--3
4
2
3--4
5
3
4
5

Busy Op
Yes Sub
No
No
Yes Mult
Yes Div

FU

Nov. 2, 2004

F0
Mult1

Busy
Load1 No
Load2 No
Load3 No

S1
Vj
M(A1)

S2 RS for j RS for k
Vk
Qj
Qk
M(A2)

M(A2)

R(F4)
M(A1)

F2
M(A2)

F4

Address

Mult1
F6
M(A1)

Lec. 7

F8
F10 F12
Add1 Mult2

...

40

F30

Tomasulo Example Cycle 6


Instruction status
Instruction
j
LD
F6
34+
LD
F2
45+
MULTD F0
F2
SUBD
F8
F6
DIVD
F10
F0
ADDD
F6
F8
Reservation Stations
Time Name
1 Add1
0 Add2
Add3
9 Mult1
0 Mult2
Register result status
Clock
6

Execution Write
k Issue complete Result
Busy
R2
1
2--3
4
Load1 No
R3
2
3--4
5
Load2 No
F4
3
6 -Load3 No
F2
4
6 -F6
5
F2
6
S1
S2 RS for j RS for k
Busy Op
Vj
Vk
Qj
Qk
Yes Sub
M(A1)
M(A2)
Yes Add
M(A2) Add1
No
Yes Mult
M(A2)
R(F4)
Yes Div
M(A1) Mult1

FU

Nov. 2, 2004

F0
Mult1

F2
M(A2)

F4

F6
Add2

Lec. 7

Address

F8
F10 F12
Add1 Mult2

...

41

F30

Tomasulo Example Cycle 7


Instruction status
Instruction
j
LD
F6
34+
LD
F2
45+
MULTD F0
F2
SUBD
F8
F6
DIVD
F10
F0
ADDD
F6
F8
Reservation Stations
Time Name
0 Add1
0 Add2
Add3
8 Mult1
0 Mult2
Register result status
Clock
7

Execution Write
k Issue complete Result
Busy
R2
1
2--3
4
Load1 No
R3
2
3--4
5
Load2 No
F4
3
6 -Load3 No
F2
4
6 -- 7
F6
5
F2
6
S1
S2 RS for j RS for k
Busy Op
Vj
Vk
Qj
Qk
Yes Sub
M(A1)
M(A2)
Yes Add
M(A2) Add1
No
Yes Mult
M(A2)
R(F4)
Yes Div
M(A1) Mult1

FU

Nov. 2, 2004

F0
Mult1

F2
M(A2)

F4

F6
Add2

Lec. 7

Address

F8
F10 F12
Add1 Mult2

...

42

F30

Tomasulo Example Cycle 8


Instruction status
Instruction
j
LD
F6
34+
LD
F2
45+
MULTD F0
F2
SUBD
F8
F6
DIVD
F10
F0
ADDD
F6
F8
Reservation Stations
Time Name
0 Add1
2 Add2
Add3
7 Mult1
0 Mult2
Register result status
Clock
8

Execution Write
k Issue complete Result
Busy
R2
1
2--3
4
Load1 No
R3
2
3--4
5
Load2 No
F4
3
6 -Load3 No
F2
4
6 -- 7
8
F6
5
F2
6
S1
S2 RS for j RS for k
Busy Op
Vj
Vk
Qj
Qk
No
Yes Add
M1-M2
M(A2)
No
Yes Mult
M(A2)
R(F4)
Yes Div
M(A1) Mult1

FU

Nov. 2, 2004

F0
Mult1

F2
M(A2)

F4

Address

F6
F8
F10 F12
Add2 M1-M2 Mult2

Lec. 7

...

43

F30

Tomasulo Example Cycle 9


Instruction status
Instruction
j
LD
F6
34+
LD
F2
45+
MULTD F0
F2
SUBD
F8
F6
DIVD
F10
F0
ADDD
F6
F8
Reservation Stations
Time Name
0 Add1
1 Add2
Add3
6 Mult1
0 Mult2
Register result status
Clock
9

Execution Write
k Issue complete Result
Busy
R2
1
2--3
4
Load1 No
R3
2
3--4
5
Load2 No
F4
3
6 -Load3 No
F2
4
6 -- 7
8
F6
5
F2
6
9 -S1
S2 RS for j RS for k
Busy Op
Vj
Vk
Qj
Qk
No
Yes Add
M1-M2
M(A2)
No
Yes Mult
M(A2)
R(F4)
Yes Div
M(A1) Mult1

FU

Nov. 2, 2004

F0
Mult1

F2
M(A2)

F4

Address

F6
F8
F10 F12
Add2 M1-M2 Mult2

Lec. 7

...

44

F30

Tomasulo Example Cycle 10


Instruction status
Instruction
j
LD
F6
34+
LD
F2
45+
MULTD F0
F2
SUBD
F8
F6
DIVD
F10
F0
ADDD
F6
F8
Reservation Stations
Time Name
0 Add1
0 Add2
Add3
5 Mult1
0 Mult2
Register result status
Clock
10

Execution Write
k Issue complete Result
Busy
R2
1
2--3
4
Load1 No
R3
2
3--4
5
Load2 No
F4
3
6 -Load3 No
F2
4
6 -- 7
8
F6
5
F2
6
9 -- 10
S1
S2 RS for j RS for k
Busy Op
Vj
Vk
Qj
Qk
No
Yes Add
M1-M2
M(A2)
No
Yes Mult
M(A2)
R(F4)
Yes Div
M(A1) Mult1

FU

Nov. 2, 2004

F0
Mult1

F2
M(A2)

F4

Address

F6
F8
F10 F12
Add2 M1-M2 Mult2

Lec. 7

...

45

F30

Tomasulo Example Cycle 11


Instruction status
Instruction
j
LD
F6
34+
LD
F2
45+
MULTD F0
F2
SUBD
F8
F6
DIVD
F10
F0
ADDD
F6
F8
Reservation Stations
Time Name
0 Add1
Add2
Add3
4 Mult1
0 Mult2
Register result status
Clock
11

Execution Write
k Issue complete Result
Busy
R2
1
2--3
4
Load1 No
R3
2
3--4
5
Load2 No
F4
3
6 -Load3 No
F2
4
6 -- 7
8
F6
5
F2
6
9 -- 10
11
S1
S2 RS for j RS for k
Busy Op
Vj
Vk
Qj
Qk
No
No
No
Yes Mult
M(A2)
R(F4)
Yes Div
M(A1) Mult1

FU

Nov. 2, 2004

F0
Mult1

F2
M(A2)

F4

Address

F6
F8
F10 F12
M1-M2+M(A2)
M1-M2 Mult2

Lec. 7

...

46

F30

Tomasulo Example Cycle 12


Instruction status
Instruction
j
LD
F6
34+
LD
F2
45+
MULTD F0
F2
SUBD
F8
F6
DIVD
F10
F0
ADDD
F6
F8
Reservation Stations
Time Name
0 Add1
Add2
Add3
4 Mult1
0 Mult2
Register result status
Clock
12

Execution Write
k Issue complete Result
Busy
R2
1
2--3
4
Load1 No
R3
2
3--4
5
Load2 No
F4
3
6 -Load3 No
F2
4
6 -- 7
8
F6
5
F2
6
9 -- 10
11
S1
S2 RS for j RS for k
Busy Op
Vj
Vk
Qj
Qk
No
No
No
Yes Mult
M(A2)
R(F4)
Yes Div
M(A1) Mult1

FU

Nov. 2, 2004

F0
Mult1

F2
M(A2)

F4

Address

F6
F8
F10 F12
M1-M2+M(A2)
M1-M2 Mult2

Lec. 7

...

47

F30

Tomasulo Example Cycle 15


Instruction status
Instruction
j
LD
F6
34+
LD
F2
45+
MULTD F0
F2
SUBD
F8
F6
DIVD
F10
F0
ADDD
F6
F8
Reservation Stations
Time Name
0 Add1
Add2
Add3
0 Mult1
0 Mult2
Register result status
Clock
15

Execution Write
k Issue complete Result
Busy
R2
1
2--3
4
Load1 No
R3
2
3--4
5
Load2 No
F4
3
6 -- 15
Load3 No
F2
4
6 -- 7
8
F6
5
F2
6
9 -- 10
11
S1
S2 RS for j RS for k
Busy Op
Vj
Vk
Qj
Qk
No
No
No
Yes Mult
M(A2)
R(F4)
Yes Div
M(A1) Mult1

FU

Nov. 2, 2004

F0
Mult1

F2
M(A2)

F4

Address

F6
F8
F10 F12
M1-M2+M(A2)
M1-M2 Mult2

Lec. 7

...

48

F30

Tomasulo Example Cycle 16


Instruction status
Execution Write
Instruction
j
k Issue complete Result
Busy Address
LD
F6
34+
R2
1
2--3
4
Load1 No
LD
F2
45+
R3
2
3--4
5
Load2 No
MULTD F0
F2
F4
3
6 -- 15
16
Load3 No
SUBD
F8
F6
F2
4
6 -- 7
8
DIVD
F10
F0
F6
5
ADDD
F6
F8
F2
6
9 -- 10
11
Reservation Stations
S1
S2 RS for j RS for k
Time Name Busy Op
Vj
Vk
Qj
Qk
0 Add1 No
Add2 No
Add3 No
Mult1 No
40 Mult2 Yes Div
M*F4
M(A1)
Register result status
Clock
F0
F2
F4
F6
F8
F10 F12
...
16
FU M*F4
M(A2)
M1-M2+M(A2)
M1-M2 Mult2

Nov. 2, 2004

Lec. 7

49

F30

Tomasulo Example Cycle 56


Instruction status
Execution Write
Instruction
j
k Issue complete Result
Busy Address
LD
F6
34+
R2
1
2--3
4
Load1 No
LD
F2
45+
R3
2
3--4
5
Load2 No
MULTD F0
F2
F4
3
6 -- 15
16
Load3 No
SUBD
F8
F6
F2
4
6 -- 7
8
DIVD
F10
F0
F6
5
17 -- 56
ADDD
F6
F8
F2
6
9 -- 10
11
Reservation Stations
S1
S2 RS for j RS for k
Time Name Busy Op
Vj
Vk
Qj
Qk
0 Add1 No
Add2 No
Add3 No
Mult1 No
0 Mult2 Yes Div
M*F4
M(A1)
Register result status
Clock
F0
F2
F4
F6
F8
F10 F12
...
56
FU M*F4
M(A2)
M1-M2+M(A2)
M1-M2 Mult2

Nov. 2, 2004

Lec. 7

50

F30

Tomasulo Example Cycle 57

Instruction status
Instruction  j    k    Issue  Execution complete  Write Result
LD     F6    34+  R2   1      2--3                4
LD     F2    45+  R3   2      3--4                5
MULTD  F0    F2   F4   3      6--15               16
SUBD   F8    F6   F2   4      6--7                8
DIVD   F10   F0   F6   5      17--56              57
ADDD   F6    F8   F2   6      9--10               11

       Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations
                         S1  S2  RS for j  RS for k
Time  Name   Busy  Op    Vj  Vk  Qj        Qk
0     Add1   No
      Add2   No
      Add3   No
      Mult1  No
0     Mult2  No

Register result status
Clock: 57
      F0     F2     F4  F6           F8     F10     F12  ...  F30
FU    M*F4   M(A2)      M1-M2+M(A2)  M1-M2  result

Branch Prediction (3.4, 3.5)


Branch Prediction
Easiest (static prediction)

Always taken, always not taken


Opcode based
Displacement based (forward not taken, backward taken)
Compiler directed (branch likely, branch not likely)

Next easiest
1-bit predictor: remember last taken/not taken per branch
Use a branch-prediction buffer or branch-history table
Use part of the PC (low-order bits) to index buffer/table
Multiple branches may share the same bit

Invert the bit if the prediction is wrong

Backward branches for loops will be mispredicted twice

Q: Assume a loop branch is taken nine times in a row, then not taken once. What
is the prediction accuracy using a 1-bit predictor?
A: On the first iteration, the predictor says not taken, because the last time
execution left the loop it set the bit to 0. So, it's a misprediction.
The bit is now set to 1. Prediction works fine until the last iteration, when the branch is predicted
as taken but falls through. So, 2 mispredictions in 10 loop iterations => 80% accuracy.
How about a 2-bit predictor? Let the prediction be changed only after it misses
twice in a row.
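
The 80% figure can be checked with a short simulation. This is a minimal sketch, assuming the bit starts at 0 (not taken) as in the answer above; `one_bit_accuracy` is a hypothetical helper name.

```python
def one_bit_accuracy(outcomes, bit=False):
    """Fraction of correct predictions for a single 1-bit predictor entry."""
    correct = 0
    for taken in outcomes:
        correct += (bit == taken)   # predict whatever happened last time
        bit = taken                 # then remember the actual outcome
    return correct / len(outcomes)


loop = [True] * 9 + [False]         # taken nine times, then not taken once
accuracy = one_bit_accuracy(loop * 100)
print(accuracy)                     # 0.8
```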


2-bit Branch Prediction


Has 4 states instead of 2, allowing for more information about
tendencies
A prediction must miss twice before it is changed
Good for backward branches of loops
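
For comparison, here is the same loop run through a 2-bit scheme. The sketch uses a saturating counter (0 to 3, predict taken when the counter is 2 or 3), which is one common way to build a 4-state predictor and may differ in its weak states from the state diagram on the slide; the starting state 3 is an assumption.

```python
def two_bit_accuracy(outcomes, counter=3):
    """Fraction of correct predictions for one 2-bit saturating counter."""
    correct = 0
    for taken in outcomes:
        predict_taken = counter >= 2
        correct += (predict_taken == taken)
        # Move one step toward the actual outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)


loop = [True] * 9 + [False]         # taken nine times, then not taken once
accuracy = two_bit_accuracy(loop * 100)
print(accuracy)                     # 0.9
```

Only the loop exit is mispredicted; the single miss weakens the counter but does not flip the prediction for the next entry into the loop.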


Branch History Table

Has limited size
2 bits by N entries (e.g. N = 4K)
A 4K-entry table performs about the same as an infinite one, see Fig. 3.9
Uses low-order bits of branch PC to
choose entry
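
Indexing with the low-order PC bits can be sketched as a shift and a mask. The function name `bht_index` and the assumption of 4-byte, word-aligned instructions (so the two lowest bits are dropped) are illustrative choices, not taken from the slide.

```python
ENTRIES = 4096   # 4K entries, each holding a 2-bit counter


def bht_index(pc, entries=ENTRIES):
    """Pick a table entry from the low-order bits of the branch PC."""
    return (pc >> 2) & (entries - 1)   # entries must be a power of two


a = bht_index(0x400100)
b = bht_index(0x400100 + 4 * ENTRIES)   # a different branch, 4K instructions later
print(a, b)
```

Both addresses select the same entry, which is the aliasing behind "multiple branches may share the same bit".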
[Figure: low-order bits of the branch PC index the BHT, selecting a 2-bit entry]

Can we do better?
Correlating branch predictors also look at other branches for
clues

if (aa==2)
    aa = 0;
if (bb==2)
    bb = 0;
if (aa!=bb) {

If the first two conditions are both true (T), aa and bb are both set to 0, so the third condition is false (NT).

[Figure: each entry holds two predictions, one used if the last branch was NT and one used if the last branch was T]

(1,1) predictor: uses the history of 1 branch and uses a 1-bit predictor

Correlating Branch Predictor

If we use 2 branches as histories, then there are 4 possibilities
(T-T, T-NT, NT-T, NT-NT).
For each possibility, we need to use a predictor (1-bit, 2-bit).
And this repeats for every branch.

(2,2) branch prediction

Performance of Correlating Branch Prediction

With the same number of state bits, (2,2) performs better than a noncorrelating
2-bit predictor.
Outperforms a 2-bit predictor with an infinite number of entries

General (m,n) Branch Predictors


The global history register (GHR) is an m-bit shift register that records
the outcomes of the last m branches encountered by the processor
Usually use both the PC address and the GHR (2-level)
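
A small (m,n) predictor can be sketched as a table of n-bit counters indexed by PC bits combined with the GHR. Everything here is an assumption made for illustration: the class name `CorrelatingPredictor`, concatenation as the combining function, `pc_bits=4`, and saturating counters as the n-bit predictors.

```python
class CorrelatingPredictor:
    def __init__(self, m=2, n=2, pc_bits=4):
        self.m, self.n, self.pc_bits = m, n, pc_bits
        self.ghr = 0                                # last m outcomes, newest in bit 0
        self.max_count = (1 << n) - 1
        self.table = [0] * (1 << (pc_bits + m))     # n-bit counters, all start at 0

    def _index(self, pc):
        pc_part = (pc >> 2) & ((1 << self.pc_bits) - 1)
        return (pc_part << self.m) | self.ghr       # combining function: concatenation

    def predict(self, pc):
        return self.table[self._index(pc)] >= (1 << (self.n - 1))

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = max(self.table[i] - 1, 0)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.m) - 1)


def run(predictor, pc, outcomes):
    """Return a list of booleans: was each prediction correct?"""
    hits = []
    for taken in outcomes:
        hits.append(predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    return hits


pattern = [True, False] * 100                       # a strictly alternating branch
correlating = run(CorrelatingPredictor(m=2, n=2), 0x400100, pattern)
plain = run(CorrelatingPredictor(m=0, n=2), 0x400100, pattern)
print(sum(correlating[-100:]), sum(plain[-100:]))
```

On this alternating pattern the (2,2) configuration is right every time once warmed up, while the history-free 2-bit counter (m = 0) is right only half the time.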
[Figure: the PC and the m-bit GHR feed a combining function that selects among the n-bit predictors]

Is Branch Predictor Enough?

When is using branch prediction beneficial?
When the outcome is known later than the target
For example, in our standard MIPS pipeline, we compute the target in the ID
stage, but testing the branch condition there incurs a structural hazard on the register
file.

If we predict the branch is taken and the prediction is correct, what is
the target address?
Need a mechanism to provide the target address as well

Can we eliminate the one-cycle delay for the 5-stage pipeline?
Need to fetch from the branch target immediately after the branch

Branch Target Buffer (BTB)

Is the current instruction a branch?
BTB provides the answer before the current instruction is decoded
and therefore enables fetching to begin after the IF stage.

What is the branch target?
BTB provides the branch target if the prediction is a taken direct
branch (for not-taken branches the target is simply PC+4).
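
A BTB lookup can be sketched as a map from branch PC to predicted target that is consulted during fetch. This is a simplified model with hypothetical names (`BranchTargetBuffer`, `next_pc`, `update`); it stores only predicted-taken branches and ignores capacity and tags.

```python
class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}                  # branch PC -> predicted target

    def next_pc(self, pc):
        """Fetch address to use after pc, available before pc is decoded."""
        target = self.entries.get(pc)
        return target if target is not None else pc + 4

    def update(self, pc, taken, target):
        """Record the resolved branch outcome."""
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)     # not taken: fall through to PC+4


btb = BranchTargetBuffer()
first = btb.next_pc(0x100)                 # unknown instruction: PC+4
btb.update(0x100, taken=True, target=0x40)
second = btb.next_pc(0x100)                # now predicted taken to 0x40
print(hex(first), hex(second))
```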
