
Instruction Level Parallelism and Scheduling

CPI

Pipelined CPI = Ideal CPI + Structural Stalls + Data Hazard Stalls + Control Stalls

Reduce Stalls!

Instruction Level Parallelism

ILP is most easily found within a basic block
But the number of instructions in a basic block is small, so little parallelism is available there
Hence, we need to extract ILP across basic blocks
In other words, across loop iterations!

Parallelism in loops
for (i = 0; i < 100; i++)
    x[i] = x[i] + y[i];
Unrolling of the loop will expose the parallelism!
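
A minimal sketch of what unrolling does to the loop above, written in Python for illustration only. The function names and the unroll factor of 4 are assumptions, not part of the slides; the factor works here because 100 is divisible by 4.

```python
def add_rolled(x, y):
    """Original loop: one element per iteration, one loop test per element."""
    for i in range(100):
        x[i] = x[i] + y[i]


def add_unrolled_by_4(x, y):
    """Same computation, unrolled by an assumed factor of 4."""
    for i in range(0, 100, 4):
        # Four independent statements per iteration: none of them reads
        # a value written by another, so they could execute in parallel.
        x[i] = x[i] + y[i]
        x[i + 1] = x[i + 1] + y[i + 1]
        x[i + 2] = x[i + 2] + y[i + 2]
        x[i + 3] = x[i + 3] + y[i + 3]
```

The unrolled version runs the loop test and branch 25 times instead of 100, and exposes four independent statements to the scheduler in each iteration.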

Dependences and Hazards

Data dependences
Instruction I is data dependent on instruction J if I uses a result produced by J
Or transitively: I is dependent on J and J is dependent on K, etc.
Chain of dependences!

Data Dependences

Dependences are properties of programs


Whether a dependence leads to a hazard is a property of the pipeline organization

Data dependence conveys

The possibility of a hazard


The order in which results must be calculated
Upper bound on how much parallelism can be exploited

It is easier to detect data dependences through registers than through memory

Name Dependences
Anti-dependence
Output dependence

Not true dependences; they can be removed by using different names for the registers (renaming)!

Data Hazards

RAW
(due to Data dependence or True dependence)

WAR
(due to Anti-dependence)

WAW
(due to Output dependence)

Control dependence
if (p1)
    S1;

if (p2)
    S2;

S1 is control dependent on p1 and S2 on p2, so S1 and S2 cannot be moved freely => control dependence


Properties of control dependence

A statement inside the control structure cannot be moved out (before the branch)
A statement outside (before) the control structure cannot be moved in (after the branch)

Two more properties to look at


Exception behavior
Data Flow

Loop Unrolling

Unroll loop
Use different symbolic names
Branch instruction is also avoided

Where is the penalty?

Another way of loop unrolling (Example on page 78)

Determine that the loop iterations are independent
Use different registers
Eliminate the extra test and branch instructions
Loads and stores from different iterations are independent; use this fact to reduce stalls
Preserve dependences

Loop Unrolling & Scheduling

Determine whether loop unrolling is useful: the loop iterations should be independent, apart from the loop maintenance code
Use different registers to avoid unnecessary constraints
Adjust the loop termination and iteration code
Check whether the loads and stores from different iterations can be scheduled independently
Schedule the instructions within the loop, preserving the dependences

Branch Prediction

Static branch predictors


Predict the branch as taken

Predictors based on profiling information


Use profilers
Collect the information from earlier runs
Useful because the behavior of an individual branch is often strongly biased towards taken or not-taken

Effectiveness of a branch prediction scheme depends upon
The accuracy of the scheme
The frequency of conditional branches

Higher branch frequency for integer programs
And the misprediction rate is also higher for integer programs

So, how well does static branch prediction perform?
For FP programs: misprediction rate 9%
For integer programs: misprediction rate 15%

Dynamic Branch Prediction and Branch Prediction Buffers

1-bit predictors
Taken or not-taken
Use the low-order address bits to access the branch buffer (and the prediction bit)
Multiple addresses will map to the same predictor
If a branch is almost always taken, a single not-taken outcome causes 2 mispredictions: one for the not-taken execution and one for the next taken execution
Remedy: use 2-bit predictors

2-bit Predictors

[Figure: the states in a 2-bit prediction scheme]

By using 2 bits rather than 1, a branch that strongly favors Taken or Not Taken will be mispredicted less often than with a 1-bit predictor.

2 bits are used to encode 4 states in the system.
With an n-bit counter, the branch is predicted as Taken if the count value >= half of 2^n (that is, 2^(n-1))
Otherwise it is predicted Not Taken
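
To compare 1-bit and 2-bit prediction on a loop-closing branch, here is a small sketch of an n-bit saturating counter using the threshold rule above. The function name, the initial counter state, and the branch pattern (taken 9 times, then not taken once, repeated 10 times) are illustrative assumptions.

```python
def mispredictions(outcomes, bits):
    """Count the mispredictions of an n-bit saturating counter."""
    max_count = (1 << bits) - 1
    threshold = 1 << (bits - 1)      # predict taken if count >= half of 2^n
    count = max_count                # assumption: start in the strongest "taken" state
    wrong = 0
    for taken in outcomes:
        predicted_taken = count >= threshold
        if predicted_taken != taken:
            wrong += 1
        # Saturating update: never go above max_count or below 0.
        if taken:
            count = min(count + 1, max_count)
        else:
            count = max(count - 1, 0)
    return wrong


# Hypothetical loop branch: taken 9 times, then not taken once, 10 times over.
pattern = ([True] * 9 + [False]) * 10

print(mispredictions(pattern, bits=1))   # 19
print(mispredictions(pattern, bits=2))   # 10
```

With 1 bit, every loop exit costs two mispredictions (the exit itself and the first iteration of the next run); with 2 bits, only the exit is mispredicted.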


Correlating branch predictors

if (aa == 2)
    aa = 0;
if (bb == 2)
    bb = 0;
if (aa != bb)
{
}

When the conditions of both the first and the second branch are true, aa and bb are both 0, so the condition of the third branch (aa != bb) is false.
This kind of information cannot be captured using just 2-bit predictors

(m, n) Predictors

Look at the last (most recent) m branches in the code
Depending on their outcomes (taken/not-taken), choose from 2^m n-bit predictors
Depending on the size of the buffer, the lower address bits are also used to index
A 2-bit predictor with no history is a (0,2) predictor

Number of bits in an (m, n) predictor =
2^m x n x Number-of-prediction-entries
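
The storage formula can be checked with a one-line function. The function name and the entry counts below are illustrative assumptions, chosen to show that a (2,2) predictor with fewer entries can cost the same number of bits as a (0,2) predictor with more entries.

```python
def predictor_bits(m, n, entries):
    """Total storage in bits: 2^m x n x number of prediction entries."""
    return (2 ** m) * n * entries


print(predictor_bits(0, 2, 4096))   # 8192: plain 2-bit predictor, 4K entries
print(predictor_bits(2, 2, 1024))   # 8192: (2,2) predictor, 1K entries
```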


Tournament Predictors

Which one to choose, local or global?
Combination of local and global predictors
Use a 2-bit saturating counter to choose among local, global, or mixed predictors

Used in P-4 and Alpha processors

Local predictor is a 2-level predictor
Top level: 1024 10-bit entries, each entry holding the 10 most recent outcomes of a branch
Use the selected 10 bits to index into a table with 1K entries consisting of 3-bit counters

Dynamic Scheduling, again!

Same compiled code can be used irrespective of the specific design of a pipeline
Tries to avoid stalling
Enables out-of-order execution
Possibility of WAR and WAW hazards
Stages
Issue (WAW and structural hazards)
Read operands (RAW)
Execute
Write results (WAR)

DIV.D F0, F2, F4
ADD.D F6, F0, F8
SUB.D F8, F10, F14
MUL.D F6, F10, F8

Score-boarding revision


Tomasulo's approach

One of the earliest dynamic scheduling schemes, developed by IBM for the IBM 360/91

Basic idea: enable register renaming and buffering of results


1. DIV.D F0, F2, F4
2. ADD.D F6, F0, F8
3. S.D F6, 0(R1)
4. SUB.D F8, F10, F14
5. MUL.D F6, F10, F8

Identify the hazards in the above code
WAR: 2 & 4, 3 & 5
WAW: 2 & 5
RAW: 1 & 2, 2 & 3, 4 & 5
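
The hazard list can be cross-checked mechanically. The sketch below encodes the five instructions as (label, destination, sources) tuples and compares every pair; the encoding and the function name are assumptions made for illustration. It is a naive pairwise check that ignores any intervening instruction redefining a register, which does not matter for this short sequence. Note that it also reports the WAR pair between S.D (3) and MUL.D (5) on F6.

```python
from itertools import combinations

# (label, destination register or None, source registers)
CODE = [
    ("1. DIV.D", "F0", {"F2", "F4"}),
    ("2. ADD.D", "F6", {"F0", "F8"}),
    ("3. S.D",   None, {"F6", "R1"}),    # a store writes memory, not a register
    ("4. SUB.D", "F8", {"F10", "F14"}),
    ("5. MUL.D", "F6", {"F10", "F8"}),
]


def find_hazards(code):
    """Return (kind, earlier, later) for each potentially hazardous pair."""
    found = []
    numbered = enumerate(code, start=1)
    for (i, (_, d1, s1)), (j, (_, d2, s2)) in combinations(numbered, 2):
        if d1 is not None and d1 in s2:
            found.append(("RAW", i, j))   # later reads what earlier writes
        if d2 is not None and d2 in s1:
            found.append(("WAR", i, j))   # later writes what earlier reads
        if d1 is not None and d1 == d2:
            found.append(("WAW", i, j))   # both write the same register
    return found


for hazard in find_hazards(CODE):
    print(hazard)
```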

Write the above code with renaming, using
S for F6 in 2 and
T for F8 in 4.
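
As a rough illustration of how renaming removes name dependences, the sketch below gives every register write a fresh name and makes later reads follow the latest mapping. This is a more aggressive policy than the exercise above, which renames only two registers; the temporary names T0, T1, ... and the tuple encoding are assumptions for illustration.

```python
from itertools import count


def rename(code):
    """Give every register write a fresh name; sources use the latest mapping."""
    fresh = (f"T{k}" for k in count())
    current = {}                    # architectural register -> latest name
    renamed = []
    for op, dest, srcs in code:
        # Read the sources first, so an instruction never sees its own result.
        new_srcs = [current.get(r, r) for r in srcs]
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            current[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


CODE = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F6", ["F0", "F8"]),
    ("S.D",   None, ["F6", "R1"]),
    ("SUB.D", "F8", ["F10", "F14"]),
    ("MUL.D", "F6", ["F10", "F8"]),
]

for instr in rename(CODE):
    print(instr)
```

After renaming, no two instructions write the same name and no instruction writes a name that an earlier one reads, so only the RAW dependences remain.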

Tomasulo's approach

Use of reservation stations (buffers)
Implicitly achieves renaming
There can be more reservation stations than registers

Reservation stations versus a centralized register file
Hazard detection and execution control are distributed
Results are passed directly to the functional units rather than through registers (bypassing)
Requires a common data bus (CDB)

Tomasulo's approach

The CDB goes everywhere except to the load unit

Stages
1. Issue
Get the next instruction from the instruction queue
Allot a reservation station if one is available
No reservation station -> structural hazard
If the operands are not in registers, keep track of the functional units generating them (thus renaming registers)

2. Execute
Delay execution till the operands are available (RAW)

3. Write
Write onto the CDB
This in turn writes to the registers and to the stations waiting for this operand

Tomasulo's approach

Table: reservation station fields
Op: the operation to perform
Qj, Qk: the reservation stations that will produce the operand values
Vj, Vk: the actual values of the operands that are available
A: the address used (for a load or store)
Busy: indicates that the station is busy
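
The fields above map naturally onto a small record type. The sketch below is one possible model, not the actual hardware layout: the station names "Add1" and "Mult1", the use of None to mean "no pending producer", and the broadcast helper standing in for a CDB write are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str                       # tag other stations use to refer to this one
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None      # operand values, once available
    vk: Optional[float] = None
    qj: Optional[str] = None        # producing station; None means Vj is valid
    qk: Optional[str] = None
    a: Optional[int] = None         # address, used by loads and stores

    def ready(self):
        """An instruction can execute once both operands have arrived."""
        return self.busy and self.qj is None and self.qk is None


def broadcast(stations, tag, value):
    """Model a CDB write: every station waiting on `tag` captures `value`."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None


add1 = ReservationStation("Add1", busy=True, op="ADD.D", qj="Mult1", vk=3.0)
print(add1.ready())                 # False: still waiting on Mult1
broadcast([add1], "Mult1", 2.5)
print(add1.ready())                 # True: the operand arrived over the CDB
```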

Tomasulo's approach

Advantages
Independent of the pipeline used
Effective in the presence of caches (even though it was designed before caches were introduced!)

Register renaming and dynamic scheduling are important techniques
They made their re-entry in the 1990s.

Hardware-based speculation

How to overcome control dependence?
This requires us to speculate on branches.
Tomasulo's algorithm (and scoreboarding) doesn't speculate on branches

Basic idea
Why can't an instruction speculatively proceed ahead if its operands are available?
Just don't commit the instruction if you are not sure
In a sense, separate the execution of an instruction from its commit

Hardware-based speculation

1. Dynamic branch prediction to choose which instruction to execute
2. Speculation to allow the execution of instructions before the control dependences are resolved (with the ability to undo the effects of an incorrectly speculated sequence)
3. Dynamic scheduling to deal with the scheduling of different combinations of basic blocks.

Dynamic scheduling with speculation

So, what are the issues with such an approach?
If the speculation is wrong, roll back.
That also gives us a solution
Commit only when we know that the instruction is no longer speculative
Always commit in order
Execute out of order, but commit in order
We require a buffer to store instructions which are speculative and yet to commit
Re-Order Buffer (ROB)

Tomasulo's and hardware speculation

Tomasulo's approach writes results into the register file, so all subsequent instructions get them from there
Hardware speculation writes results into buffers (until commit)
Useful in maintaining precise exceptions and in avoiding WAW hazards easily.

ROB

Each ROB entry
Instruction type
Destination field
Value field
Ready field (instruction completed or not)

Where is renaming occurring?
In the ROB

We will still use reservation stations as buffers
Each entry in a reservation station refers to a ROB tag rather than to a reservation-station entry (as was the case with Tomasulo)

Four steps

Issue
Issue if there is an empty ROB slot AND an empty reservation station; otherwise instruction issue is stalled

Execute
Execute WHEN the operands are available (RAW hazards)

Write result
Write onto the CDB, then into the ROB and any waiting reservation stations

Commit
Normal commit when the instruction reaches the head of the ROB
In case of a store, write to memory
In case of wrong speculation, flush the remaining entries in the ROB
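
The commit step can be sketched as a queue that retires entries only from its head. This is a simplified model under stated assumptions: the field names, the instruction-type strings, and the `mispredicted` flag are hypothetical, and the issue, execute, and write-result steps are not modeled.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class ROBEntry:
    instr_type: str                  # assumed values: "alu", "store", "branch"
    dest: Optional[str]              # register name, or address label for a store
    value: Optional[float] = None
    ready: bool = False
    mispredicted: bool = False       # only meaningful for branches


def commit(rob, registers, memory):
    """Commit ready entries from the head of the ROB, in program order."""
    committed = 0
    while rob and rob[0].ready:
        entry = rob.popleft()
        committed += 1
        if entry.instr_type == "branch":
            if entry.mispredicted:
                rob.clear()          # flush everything younger than the branch
                break
        elif entry.instr_type == "store":
            memory[entry.dest] = entry.value
        else:
            registers[entry.dest] = entry.value
    return committed


regs, mem = {}, {}
rob = deque([
    ROBEntry("alu", "F0", 1.0, ready=True),
    ROBEntry("alu", "F6"),                    # still executing
    ROBEntry("alu", "F8", 3.0, ready=True),   # finished out of order
])
print(commit(rob, regs, mem), regs)   # 1 {'F0': 1.0}
```

The third entry finished early but cannot commit until the second one does, which is what keeps register and memory state precise.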

[Figure: Tomasulo's approach extended to handle speculation, with the ROB]

Recovery

Recover as soon as possible after an incorrectly predicted branch
Don't wait for the branch to reach the head of the ROB
Exceptions are handled only when the instruction reaches the head of the ROB

Loads and Stores

Difference in the working of stores
Stores write their value into the ROB; memory is updated only at commit
Intervening load and store
Don't allow a load to proceed if any non-committed store with the same address is present in the ROB

Summary of H/W speculation

Technique more useful for
Integer pipelines
Multiple-issue processors

Exploiting ILP: Using Multiple Issue & Static Scheduling

Multiple Issue Processors:
Allow multiple instructions to issue in a clock cycle
1. Statically scheduled superscalar processors
2. VLIW (Very Long Instruction Word) processors
3. Dynamically scheduled superscalar processors
4. Dynamically scheduled superscalar processors (speculative)
Superscalar Processors:
Issue varying no. of instructions per clock cycle
Use in-order or out-of-order execution
VLIW Processors:
Issue multiple (fixed no.) instructions per clock
Statically scheduled by compiler

Exploiting ILP Using Dynamic Scheduling, Multiple Issue & Speculation

How to assign reservation stations and update the pipeline controls for more than one instruction per clock cycle?
1. Do this in half a clock cycle, so that 2 instructions can be processed in one clock cycle
Or
2. Build the logic necessary to handle two instructions at once, including any possible dependences between the instructions

Modern processors do both: pipeline and widen the issue logic

Exploiting ILP Using Dynamic Scheduling, Multiple Issue & Speculation

Additional challenge:
How to complete & commit multiple instructions per clock cycle?

See the examples on page 119, on page 120 (without speculation) & on page 121 (with speculation)

Advanced Techniques for Instruction Delivery & Speculation

Increasing Instruction Fetch Bandwidth
Use of a Branch Target Buffer (BTB)
The BTB predicts the address of the next instruction and sends it out before the instruction is decoded
If the PC of the fetched instruction matches a PC in the buffer, then the corresponding predicted PC is used as the next PC
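
The lookup described above can be sketched with a dictionary keyed by the branch PC. This is a deliberately simplified model: a real BTB is a small, finite, tagged cache, while the unbounded dictionary, the fixed 4-byte instruction size, the class and method names, and the policy of keeping only taken branches in the buffer are all assumptions made for illustration.

```python
class BranchTargetBuffer:
    def __init__(self):
        self.table = {}              # PC of a branch -> predicted next PC

    def next_pc(self, pc, instr_size=4):
        """Predicted next PC at fetch time, before the instruction is decoded."""
        # Hit: use the stored target. Miss: fall through to the next instruction.
        return self.table.get(pc, pc + instr_size)

    def update(self, pc, taken, target):
        """After the branch resolves, record or drop its entry."""
        if taken:
            self.table[pc] = target
        else:
            self.table.pop(pc, None)


btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x100)))       # 0x104: miss, sequential fetch
btb.update(0x100, taken=True, target=0x40)
print(hex(btb.next_pc(0x100)))       # 0x40: hit, predicted target
```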

Branch Target Buffer

[Figure: steps involved in handling an instruction with a branch-target buffer]

Speculation: Implementation Issues & Execution

Return Address Predictors


Integrated Instruction Fetch Units
Register Renaming Vs ROBs
How much to Speculate?
Speculating through Multiple Branches
Value Prediction
