Instruction Level Parallelism

Instruction Level Parallelism and Scheduling
CPI
Pipelined CPI = Ideal CPI + Structural Stalls + Data Hazard Stalls + Control Stalls
Reduce Stalls!
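A minimal sketch of the CPI sum above. The stall figures are hypothetical values (stall cycles per instruction) chosen only for illustration, and the function name `pipelined_cpi` is an assumption, not something from the slides.

```python
def pipelined_cpi(ideal_cpi, structural_stalls, data_hazard_stalls, control_stalls):
    """Pipelined CPI = ideal CPI plus the stall cycles per instruction from each source."""
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls


# Hypothetical numbers, for illustration only.
example_cpi = pipelined_cpi(
    ideal_cpi=1.0,
    structural_stalls=0.05,
    data_hazard_stalls=0.20,
    control_stalls=0.15,
)
print(f"Pipelined CPI: {example_cpi:.2f}")
```

Reducing any one of the three stall terms moves the pipelined CPI closer to the ideal CPI.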
Instruction Level Parallelism
Generally achieved within basic blocks
But the number of instructions in a basic block is small, so the ILP available there is limited
Hence, we need to extract ILP across basic blocks
In other words, across loop iterations
Parallelism in loops
for (i = 0; i < 100; i++)
    x[i] = x[i] + y[i];
Unrolling of the loop will expose the parallelism!
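The idea can be sketched in Python for the loop above. This only illustrates the transformation (Python itself gains nothing from it); the function names are hypothetical, and the unrolled version assumes the array length is a multiple of 4, as it is for 100 elements.

```python
def add_arrays_rolled(x, y):
    """Original loop: one element and one loop test per iteration."""
    for i in range(len(x)):
        x[i] = x[i] + y[i]
    return x


def add_arrays_unrolled_by_4(x, y):
    """Loop unrolled four times; assumes len(x) is a multiple of 4."""
    for i in range(0, len(x), 4):
        # These four statements touch different elements, so they are
        # independent of each other and could execute in parallel.
        x[i] = x[i] + y[i]
        x[i + 1] = x[i + 1] + y[i + 1]
        x[i + 2] = x[i + 2] + y[i + 2]
        x[i + 3] = x[i + 3] + y[i + 3]
    return x


rolled = add_arrays_rolled(list(range(100)), list(range(100, 200)))
unrolled = add_arrays_unrolled_by_4(list(range(100)), list(range(100, 200)))
print(rolled == unrolled)
```

The unrolled body performs one loop test per four elements instead of one per element.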
Dependences and Hazards
Data dependences
Instruction I is data dependent on instruction J if I uses a result produced by J
Or if I is dependent on J and J is dependent on K, and so on
A chain of dependences!
Data Dependences
Dependences are properties of programs

Whether a dependence leads to a hazard or not is a property of the pipeline organization
Data dependence conveys
The possibility of a hazard

The order in which results must be calculated
Upper bound on how much parallelism can be exploited
It is easier to detect data dependences through registers than through memory
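Name Dependences
Anti-dependence: a later instruction writes a name that an earlier instruction reads
Output dependence: two instructions write the same name
Not true dependences; they can be removed by using different names for the registers (renaming)!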
Data Hazards
RAW (due to a data dependence, i.e. a true dependence)
WAR (due to an anti-dependence)
WAW (due to an output dependence)
Control dependence
if (p1)
    S1;
if (p2)
    S2;
S1 is control dependent on p1 and S2 on p2; neither can be moved freely across its branch => control dependence

Properties of control dependence
A statement inside a control structure cannot be moved out of it (before the branch)
A statement outside (before) the control structure cannot be moved into it
Two more properties to look at

Exception behavior
Data Flow
Loop Unrolling
Unroll the loop
Use different symbolic names for each unrolled copy
The extra branch instructions are also eliminated
Where is the penalty?
Another way of loop unrolling (see the example on page 78)
Determine that the loop iterations are independent
Use different registers for each iteration
Eliminate the extra test and branch instructions
Loads and stores from different iterations are independent; use this fact to reorder them and reduce stalls
Preserve dependences
Loop Unrolling & Scheduling
Determine whether unrolling the loop is useful: the iterations must be independent, apart from the loop maintenance code.
Use different registers to avoid unnecessary constraints
Adjust the loop termination and iteration code.
Check whether the load and store instructions from different iterations can be scheduled (interchanged).
Schedule the instructions within the loop while preserving the dependences.
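The steps above can be sketched for the x[i] = x[i] + y[i] loop, under some assumptions: an unroll factor of 4, Python temporaries standing in for distinct registers, and a cleanup loop for lengths that are not a multiple of 4. The function name is hypothetical, and Python will not actually overlap these operations; the point is the ordering a scheduler would aim for.

```python
def add_arrays_unrolled_scheduled(x, y):
    """Unroll by 4, group loads, adds and stores, and adjust loop termination."""
    n = len(x)
    main_end = n - (n % 4)  # adjusted termination: only whole groups of 4
    i = 0
    while i < main_end:
        # Loads first, each into its own temporary (a distinct "register").
        a0, a1, a2, a3 = x[i], x[i + 1], x[i + 2], x[i + 3]
        b0, b1, b2, b3 = y[i], y[i + 1], y[i + 2], y[i + 3]
        # Independent adds; no add waits on another add.
        s0 = a0 + b0
        s1 = a1 + b1
        s2 = a2 + b2
        s3 = a3 + b3
        # Stores last; each store still follows the add that produces its value.
        x[i], x[i + 1], x[i + 2], x[i + 3] = s0, s1, s2, s3
        i += 4  # one index update and one loop test per four elements
    # Cleanup loop for the leftover 0 to 3 elements.
    while i < n:
        x[i] = x[i] + y[i]
        i += 1
    return x


print(add_arrays_unrolled_scheduled(list(range(10)), [10] * 10))
```

Using separate temporaries is what removes the name dependences between the unrolled copies; reusing a single temporary would force the copies to run one after another.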
Branch Prediction
Static branch predictors

Predict the branch as taken
Predictors based on profiling information

Use profilers
Collect the information from earlier runs
Useful because the behavior of an individual branch is often strongly biased towards taken or not taken
Effectiveness of a branch prediction scheme depends upon
The accuracy of the scheme
The frequency of conditional branches
Integer programs have a higher branch frequency
And the misprediction rate is also higher for integer programs
So, how well does static branch prediction do?
For FP programs: misprediction rate 9%
For integer programs: misprediction rate 15%
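A rough worked example of how accuracy and branch frequency combine. Only the 9% and 15% misprediction rates come from the slide; the branch frequencies, the 3-cycle penalty, and the function name are assumptions chosen for illustration.

```python
def branch_stall_cpi(branch_frequency, misprediction_rate, misprediction_penalty):
    """Stall cycles per instruction caused by mispredicted conditional branches."""
    return branch_frequency * misprediction_rate * misprediction_penalty


# Assumed values: 5% of FP instructions and 20% of integer instructions are
# conditional branches, and a misprediction costs 3 cycles.
fp_stalls = branch_stall_cpi(0.05, 0.09, 3)
integer_stalls = branch_stall_cpi(0.20, 0.15, 3)
print(f"FP: {fp_stalls:.4f} stall cycles per instruction")
print(f"Integer: {integer_stalls:.4f} stall cycles per instruction")
```

Under these assumptions the integer program pays more on both counts: it has more branches and mispredicts a larger share of them.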
Dynamic Branch Prediction and Branch Prediction Buffers
1-bit predictors
Taken or not taken
Use the low-order address bits of the branch to index the branch prediction buffer (and its prediction bit)
Multiple branch addresses can map to the same predictor entry
If a branch is almost always taken, a single not-taken outcome causes two mispredictions
Use 2-bit predictors
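A small simulation of the 1-bit behaviour described above, as a sketch: a single predictor entry, no aliasing, and an initial prediction of taken are all assumptions, and the function name is hypothetical.

```python
def count_mispredictions_1bit(outcomes, initial_prediction=True):
    """Count mispredictions of a 1-bit predictor that remembers only the last outcome."""
    prediction = initial_prediction
    mispredictions = 0
    for taken in outcomes:
        if prediction != taken:
            mispredictions += 1
        prediction = taken  # the single bit flips on every misprediction
    return mispredictions


# A loop branch: taken 9 times, then not taken once; the loop is run 3 times.
loop_branch = ([True] * 9 + [False]) * 3
print(count_mispredictions_1bit(loop_branch))
```

Every loop exit is mispredicted, and so is the first iteration of the next run of the loop, because the bit was flipped by the exit.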
2-bit Predictors
The states in a 2-bit prediction scheme
By using 2 bits rather than 1, a branch that strongly favors taken or not taken will be mispredicted less often than with a 1-bit predictor.
2 bits are used to encode 4 states in the system.
With an n-bit counter, the branch is predicted as taken if the count value >= 2^(n-1), i.e. half of 2^n
Otherwise it is predicted not taken
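A sketch of the n-bit saturating counter rule, assuming a single predictor entry that starts at its maximum count (strongly taken). The function name and the example branch pattern are hypothetical.

```python
def count_mispredictions_nbit(outcomes, n_bits=2):
    """Count mispredictions of an n-bit saturating-counter predictor."""
    max_count = 2 ** n_bits - 1
    threshold = 2 ** (n_bits - 1)  # half of 2^n
    counter = max_count            # assumed initial state: strongly taken
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = counter >= threshold
        if predicted_taken != taken:
            mispredictions += 1
        # Saturating update: count up on taken, down on not taken.
        if taken:
            counter = min(counter + 1, max_count)
        else:
            counter = max(counter - 1, 0)
    return mispredictions


# A loop branch: taken 9 times, then not taken once; the loop is run 3 times.
loop_branch = ([True] * 9 + [False]) * 3
print(count_mispredictions_nbit(loop_branch, n_bits=2))
print(count_mispredictions_nbit(loop_branch, n_bits=1))
```

With 2 bits, one not-taken outcome only weakens the prediction, so only the loop exits are mispredicted; with 1 bit the same pattern also mispredicts the first iteration after each exit.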
Correlating branch predictors
if (aa == 2)
    aa = 0;
if (bb == 2)
    bb = 0;
if (aa != bb)
{
}
When the first two conditions are both true (aa and bb are both set to 0), the third condition is false.
This kind of information cannot be captured using just 2-bit predictors, which look only at the history of a single branch
(m, n) Predictors
Look at the outcomes of the last (most recent) m branches executed
Depending on their outcomes (taken/not taken), choose from 2^m n-bit predictors
Depending on the size of the buffer, the low-order address bits of the branch are also used to index
A 2-bit predictor with no history is a (0,2) predictor
Number of bits in an (m, n) predictor = 2^m x n x number of prediction entries
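The size formula as a short calculation. The entry counts below (4096 and 1024) are assumed values picked so the two configurations can be compared; the function name is hypothetical.

```python
def predictor_bits(m, n, entries):
    """Total state bits in an (m, n) predictor: 2^m * n * number of prediction entries."""
    return (2 ** m) * n * entries


# A (0,2) predictor with 4096 entries and a (2,2) predictor with 1024 entries.
plain_bits = predictor_bits(0, 2, 4096)
correlating_bits = predictor_bits(2, 2, 1024)
print(plain_bits, correlating_bits)
```

Both configurations use the same amount of storage, so a comparison between them isolates the effect of using global history.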
Tournament Predictors
Which one to choose, local or global?

Combination of local and global predictors
Use a 2-bit saturating counter to choose between the local, global, or mixed predictors
Used in the Pentium 4 and Alpha processors
The local predictor is a 2-level predictor
1024 10-bit entries, each 10 bits recording the 10 most recent outcomes of a branch
The selected 10 bits index a table of 1K entries consisting of 3-bit counters
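A sketch of the selector only, not of the local and global predictors themselves. The convention that counter values of 2 and 3 favour the global predictor is an assumption, as are the function names.

```python
def update_selector(selector, local_correct, global_correct):
    """Move a 2-bit saturating selector (0..3) towards the predictor that was right."""
    if global_correct and not local_correct:
        return min(selector + 1, 3)
    if local_correct and not global_correct:
        return max(selector - 1, 0)
    return selector  # both right or both wrong: no change


def choose_prediction(selector, local_prediction, global_prediction):
    """Use the global prediction when the selector is 2 or 3, else the local one."""
    return global_prediction if selector >= 2 else local_prediction


selector = 1  # currently weakly favouring the local predictor
selector = update_selector(selector, local_correct=False, global_correct=True)
print(selector, choose_prediction(selector, local_prediction=False, global_prediction=True))
```

The selector changes only when the two predictors disagree in correctness, so it needs two such events in a row to switch from strongly preferring one predictor to using the other.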
Dynamic Scheduling, again!
The same compiled code can be used irrespective of the specific pipeline design
Tries to avoid stalling
Enables out-of-order execution
Introduces the possibility of WAR and WAW hazards
Stages
Issue (checks for WAW and structural hazards)
Read operands (waits for RAW hazards to clear)
Execute
Write results (checks for WAR hazards)
DIV.D F0, F2, F4
ADD.D F6, F0, F8
SUB.D F8, F10, F14
MUL.D F6, F10, F8
Scoreboarding revision
Tomasulo's approach
One of the earliest dynamic scheduling schemes, developed by IBM for the IBM 360/91
Basic idea: enable register renaming and buffering of results

1. DIV.D F0, F2, F4
2. ADD.D F6, F0, F8
3. S.D F6, 0(R1)
4. SUB.D F8, F10, F14
5. MUL.D F6, F10, F8
Identify the hazards in the above code
WAR: 2 & 4, 3 & 5
WAW: 2 & 5
RAW: 1 & 2, 2 & 3, 4 & 5
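The hazard identification can be checked mechanically with a naive pairwise comparison of destination and source registers. This is a sketch: the tuple encoding of the instructions and the function name are assumptions, and the check ignores intervening writes, which does not matter for this five-instruction sequence.

```python
# Each instruction: (opcode, destination register or None, source registers).
program = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F6", ["F0", "F8"]),
    ("S.D", None, ["F6", "R1"]),
    ("SUB.D", "F8", ["F10", "F14"]),
    ("MUL.D", "F6", ["F10", "F8"]),
]


def find_dependences(instructions):
    """Return (kind, earlier, later, register) for every pair, numbering from 1."""
    found = []
    for i, (_, dest_i, srcs_i) in enumerate(instructions):
        for j in range(i + 1, len(instructions)):
            _, dest_j, srcs_j = instructions[j]
            if dest_i is not None and dest_i in srcs_j:
                found.append(("RAW", i + 1, j + 1, dest_i))  # later reads earlier's result
            if dest_j is not None and dest_j in srcs_i:
                found.append(("WAR", i + 1, j + 1, dest_j))  # later overwrites earlier's source
            if dest_i is not None and dest_i == dest_j:
                found.append(("WAW", i + 1, j + 1, dest_i))  # both write the same register
    return found


for kind, earlier, later, register in find_dependences(program):
    print(f"{kind}: {earlier} & {later} on {register}")
```

For this sequence the check reports RAW on 1 & 2, 2 & 3 and 4 & 5, WAR on 2 & 4 and 3 & 5, and WAW on 2 & 5.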
Rewrite the above code with renaming, using
S for F6 in 2 (and its use in 3) and
T for F8 in 4 (and its use in 5).
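A sketch of the renaming exercise. The tuple encoding, the `rename` helper, and the way new names are supplied (a map from instruction number to new destination name) are all assumptions made for illustration; hardware does this with reservation station tags rather than symbolic names.

```python
# Each instruction: (opcode, destination register or None, source registers).
program = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F6", ["F0", "F8"]),
    ("S.D", None, ["F6", "R1"]),
    ("SUB.D", "F8", ["F10", "F14"]),
    ("MUL.D", "F6", ["F10", "F8"]),
]


def rename(instructions, new_names):
    """Rename the destinations listed in new_names and every later use of them."""
    current = {}  # architectural register -> name currently holding its value
    renamed = []
    for number, (op, dest, srcs) in enumerate(instructions, start=1):
        new_srcs = [current.get(reg, reg) for reg in srcs]
        new_dest = dest
        if dest is not None:
            if number in new_names:
                new_dest = new_names[number]
                current[dest] = new_dest
            else:
                current.pop(dest, None)  # a plain write ends any earlier alias
        renamed.append((op, new_dest, new_srcs))
    return renamed


for op, dest, srcs in rename(program, {2: "S", 4: "T"}):
    print(op, dest, srcs)
```

After renaming, ADD.D writes S and S.D reads S, while SUB.D writes T and MUL.D reads T, so the WAR and WAW conflicts on F6 and F8 disappear and only the RAW dependences remain.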
Tomasulo's approach
Use of reservation stations (buffers)
Implicitly achieves renaming.
There can be more reservation stations than registers
Reservation stations versus a centralized register file
Hazard detection and execution control are distributed
Results are passed directly to the functional units rather than through registers (bypassing)
Requires a common data bus (CDB)
Tomasulo's approach
The CDB goes everywhere except the load unit
Stages
1. Issue
Get the next instruction from the instruction queue
Allot a reservation station if one is available
No reservation station available -> structural hazard
If the operands are not in registers, keep track of the functional units that will produce them (thus renaming registers)
2. Execute
Delay execution until the operands are available (RAW)
3. Write result
Write the result onto the CDB.
This in turn writes to the registers and to the stations waiting for this operand
Tomasulo's approach
Reservation station table fields
Op: the operation to perform
Qj, Qk: the reservation stations that will produce the source operand values
Vj, Vk: the actual values of the source operands that are already available
A: address information for a load or store
Busy: indicates that the station is busy
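The fields above map naturally onto a small record, and a CDB write becomes a loop over the waiting stations. This is a sketch under assumptions: the class and function names and the station names "Add1" and "Mult1" are hypothetical, and a real CDB also updates the register file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # value of first operand, if available
    vk: Optional[float] = None  # value of second operand, if available
    qj: Optional[str] = None    # station that will produce the first operand
    qk: Optional[str] = None    # station that will produce the second operand
    a: Optional[int] = None     # address information for a load or store

    def ready(self):
        """A busy station can execute once it waits on no producer."""
        return self.busy and self.qj is None and self.qk is None


def broadcast(stations, producer, value):
    """Model a CDB write: every station waiting on `producer` captures `value`."""
    for station in stations:
        if station.qj == producer:
            station.vj, station.qj = value, None
        if station.qk == producer:
            station.vk, station.qk = value, None


add1 = ReservationStation("Add1", busy=True, op="ADD.D", qj="Mult1", vk=2.0)
print(add1.ready())
broadcast([add1], "Mult1", 5.0)
print(add1.ready(), add1.vj)
```

Because a waiting station holds the producer's tag in Qj or Qk instead of a register name, the register named in the original instruction no longer matters once the instruction has issued.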
Tomasulo's approach
Advantages
Independent of the pipeline used
Effective in the presence of caches (even though it was designed before caches existed!)
Register renaming and dynamic scheduling are important techniques
They made their re-entry in the 1990s.
Hardware-based speculation
How to overcome control dependence?
It requires us to speculate on the outcome of branches.
Tomasulo's algorithm (and scoreboarding) does not speculate on branches
Basic idea
Why can't an instruction proceed speculatively if its operands are available?
Just don't commit the instruction until you are sure
In a sense, separate the execution of an instruction from its commit
Hardware-based speculation
1. Dynamic branch prediction to choose which instructions to execute
2. Speculation to allow the execution of instructions before the control dependences are resolved (with the ability to undo the effects of an incorrectly speculated sequence)
3. Dynamic scheduling to deal with the scheduling of different combinations of basic blocks.
Dynamic scheduling with speculation
So, what are the issues with such an approach?
If the speculation is wrong, roll back.
That also gives us a solution
Commit only when we know that the instruction is no longer speculative
Always commit in order
Execute out of order, but commit in order
We require a buffer to hold instructions that are speculative and yet to commit
Reorder buffer (ROB)
Tomasulo's approach and hardware speculation
Tomasulo's approach writes results into the register file, so all subsequent instructions get them from there
Hardware speculation writes results into a buffer (the ROB) until the instruction commits
Useful for maintaining precise exceptions and for avoiding WAW hazards easily.
ROB
Each ROB entry has
Instruction type
Destination field
Value field
Ready field (whether the instruction has completed execution or not)
Where does renaming occur?
In the ROB
We still use reservation stations as buffers
Each reservation station entry refers to a ROB tag rather than to another reservation station (as was the case with Tomasulo's approach)
Four steps
Issue
Issue if there is an empty ROB slot AND an empty reservation station; otherwise instruction issue is stalled
Execute
Execute WHEN the operands are available (RAW hazards)
Write result
Write onto the CDB, and from there to the ROB and any waiting reservation stations
Commit
Normal commit when the instruction reaches the head of the ROB
In the case of a store, write to memory
In the case of wrong speculation, flush the remaining entries in the ROB
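A sketch of the in-order commit rule, leaving out reservation stations and the CDB. The class and method names, the "alu" and "store" instruction kinds, and the dictionaries standing in for the register file and memory are all assumptions.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class ROBEntry:
    kind: str                    # instruction type, e.g. "alu" or "store"
    destination: Optional[str]   # register name, or memory address for a store
    value: Optional[float] = None
    ready: bool = False          # set when the result has been written


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()   # head of the ROB is entries[0]

    def issue(self, kind, destination):
        """Allocate an entry at the tail; return None if the ROB is full (stall)."""
        if len(self.entries) >= self.size:
            return None
        entry = ROBEntry(kind, destination)
        self.entries.append(entry)
        return entry

    def commit(self, registers, memory):
        """Commit ready entries from the head only, so commit stays in program order."""
        committed = 0
        while self.entries and self.entries[0].ready:
            entry = self.entries.popleft()
            if entry.kind == "store":
                memory[entry.destination] = entry.value
            elif entry.destination is not None:
                registers[entry.destination] = entry.value
            committed += 1
        return committed

    def flush(self):
        """Discard all uncommitted entries after a wrong speculation."""
        self.entries.clear()


rob = ReorderBuffer(size=4)
first = rob.issue("alu", "F0")
second = rob.issue("alu", "F6")
second.value, second.ready = 3.0, True   # the younger instruction finishes first
registers, memory = {}, {}
print(rob.commit(registers, memory))     # nothing commits: the head is not ready
first.value, first.ready = 1.0, True
print(rob.commit(registers, memory), registers)
```

The younger instruction finishes execution first but cannot update the register file until the older one at the head has committed, which is what keeps exceptions precise.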
Figure: Tomasulo's approach extended to handle speculation (with the ROB)
Recovery
Recover as soon as possible after a mispredicted branch
Don't wait for the branch to reach the head of the ROB
Exceptions are handled only when the instruction reaches the head of the ROB
Loads and Stores
Stores work differently
A store writes its value into the ROB; memory is updated only at commit
Loads and intervening stores
Don't allow a load to proceed if any uncommitted store in the ROB has the same address
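The load rule can be written as a single check. This is a sketch: the `(kind, address, committed)` tuple layout for ROB entries and the function name are assumptions made for illustration.

```python
def load_may_proceed(load_address, rob_entries):
    """Return False if an uncommitted store to the same address is still in the ROB."""
    return not any(
        kind == "store" and not committed and address == load_address
        for kind, address, committed in rob_entries
    )


# Hypothetical ROB contents ahead of the load: (kind, address, committed).
rob_entries = [("store", 0x40, False), ("alu", None, False)]
print(load_may_proceed(0x40, rob_entries))
print(load_may_proceed(0x48, rob_entries))
```

A load to a different address is free to go ahead of the pending store, which is where the scheduling freedom comes from.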
Summary of H/W speculation
Technique more useful for

Integer pipelines
Multiple-issue processors
Exploiting ILP Using Multiple Issue & Static Scheduling
Multiple Issue Processors:
Allow multiple instructions to issue in a clock cycle
1. Statically scheduled superscalar processors
2. VLIW (Very Long Instruction Word) processors
3. Dynamically scheduled superscalar processors
4. Dynamically scheduled superscalar processors (speculative)
Superscalar Processors:
Issue a varying number of instructions per clock cycle
Use in-order or out-of-order execution
VLIW Processors:
Issue a fixed number of instructions per clock cycle
Statically scheduled by the compiler
Exploiting ILP Using Dynamic Scheduling, Multiple Issue & Speculation
How to assign reservation stations and update the pipeline control for more than one instruction per clock cycle?
1. Do this in half a clock cycle per instruction, so that 2 instructions can be processed in one clock cycle
Or
2. Build the logic necessary to handle two instructions at once, including any possible dependences between the instructions
Modern processors do both:
Pipeline and widen the issue logic
Exploiting ILP Using Dynamic Scheduling, Multiple Issue & Speculation
Additional challenge:
How to complete & commit multiple instructions per clock cycle?
See the example on page 119, on page 120 (without speculation), and on page 121 (with speculation)
Advanced Techniques for Instruction Delivery & Speculation
Increasing Instruction Fetch Bandwidth
Use of a Branch Target Buffer (BTB)
The BTB predicts the address of the next instruction and sends it out before the instruction is decoded
If the PC of the fetched instruction matches a PC in the buffer, the corresponding predicted PC is used as the next PC
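A sketch of the lookup described above. A dictionary stands in for the buffer, so there is no capacity limit or tag matching on address bits; storing only taken branches, the 4-byte instruction size, and the class and method names are assumptions.

```python
class BranchTargetBuffer:
    def __init__(self):
        self.table = {}  # PC of a branch -> predicted target PC

    def predict_next_pc(self, pc, instruction_size=4):
        """On a hit use the stored target; on a miss fall through to the next instruction."""
        return self.table.get(pc, pc + instruction_size)

    def update(self, pc, taken, target):
        """After the branch resolves, remember taken branches and drop not-taken ones."""
        if taken:
            self.table[pc] = target
        else:
            self.table.pop(pc, None)


btb = BranchTargetBuffer()
print(hex(btb.predict_next_pc(0x100)))   # miss: sequential fetch
btb.update(0x100, taken=True, target=0x200)
print(hex(btb.predict_next_pc(0x100)))   # hit: fetch from the predicted target
```

The lookup uses only the PC, which is why the prediction can be made before the fetched instruction has been decoded.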
Branch Target Buffer
Steps involved in handling an instruction with a branch-target buffer
Speculation: Implementation Issues & Execution
Return Address Predictors

Integrated Instruction Fetch Units
Register Renaming Vs ROBs
How much to Speculate?
Speculating through Multiple Branches
Value Prediction