
ECE338 Parallel Computer Architecture
Spring 2022

Introduction to Instruction Level Parallelism (ILP)

Nikos Bellas
Electrical and Computer Engineering Department
University of Thessaly


Instruction Level Parallelism

Execution Time = Instruction Count × CPI × Clock Period

CPI (Clocks per Instruction) is the focus of this class:

CPI = Ideal pipeline CPI   // = 1 in the MIPS pipeline
    + Structural stalls    // need more hardware to reduce structural stalls
    + Data hazard stalls   // due to dependences
    + Control stalls       // due to branches

(see the worked example below)

• Stalls increase CPI (reduce IPC, Instructions Per Cycle) and reduce performance
• ILP refers to the simultaneous completion of instructions
• Large ILP → Large IPC
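To make the formulas concrete, here is a minimal C sketch that evaluates the two equations above; the stall counts and machine parameters are made-up numbers for illustration, not course data:

    #include <stdio.h>

    int main(void) {
        double ideal_cpi   = 1.0;  /* ideal pipeline CPI (= 1 in MIPS pipeline) */
        double structural  = 0.1;  /* structural stalls per instruction (assumed) */
        double data_hazard = 0.3;  /* data hazard stalls per instruction (assumed) */
        double control     = 0.2;  /* control stalls per instruction (assumed) */

        double cpi = ideal_cpi + structural + data_hazard + control;  /* = 1.6 */

        double inst_count   = 1e9;     /* 10^9 instructions (assumed) */
        double clock_period = 0.5e-9;  /* 0.5 ns clock, i.e. 2 GHz (assumed) */

        /* Execution Time = Instruction Count * CPI * Clock Period */
        printf("CPI = %.2f, IPC = %.3f\n", cpi, 1.0 / cpi);
        printf("Execution time = %.2f s\n", inst_count * cpi * clock_period);
        return 0;
    }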



Data Dependence

• Data dependence (Read After Write, RAW dependence)
  – Instruction j is data dependent on instruction i if
    • instruction i produces a result that may be used by instruction j,
    • OR instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (the dependence is transitive)

  Loop: fld    f0,0(x1)      // writes f0
        fadd.d f4,f0,f2      // reads f0 (RAW), writes f4
        fsd    f4,0(x1)      // reads f4 (RAW)
        addi   x1,x1,-8      // writes x1
        bne    x1,x2,Loop    // reads x1 (RAW)

  Potential data dependence through memory, detectable only at execution time (not good!); a C analogue is sketched below:

  sd $s0, 0($s0)
  ld $s1, -20($s1)

• Dependent instructions cannot be executed simultaneously
• Program order must always be obeyed between dependent instructions
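The sd/ld pair above is hard exactly because the addresses depend on runtime register values. A C analogue of the same situation (the function is hypothetical, not from the slides):

    /* If a + i and b + j happen to point to the same element at run
     * time, the load reads the value just stored (a RAW dependence
     * through memory); otherwise the two statements are independent.
     * The compiler usually cannot prove which case holds, so it must
     * preserve program order, just like the sd/ld pair above. */
    double update(double *a, double *b, long i, long j, double s) {
        a[i] = s;      /* store, like  sd $s0, 0($s0)   */
        return b[j];   /* load,  like  ld $s1, -20($s1) */
    }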
Data Dependence

Given two instructions i and j, determine when they can safely execute in parallel.

  Ri : set of locations read by instruction i
  Wi : set of locations written by instruction i

The two instructions i and j can be executed in parallel if (Bernstein conditions):

  Ri ∩ Wj = ∅
  Rj ∩ Wi = ∅
  Wi ∩ Wj = ∅
Example

(Slide figure with a worked example; not reproduced in this extract.)


Name Dependences

• Two instructions use the same name but there is no flow of information
  – Not a true data dependence, but a problem when reordering instructions

  Loop: fld    f0,0(x1)
        fadd.d f4,f0,f2
        fsd    f4,0(x1)
        addi   x1,x1,-8
        bne    x1,x2,Loop

  (f0, f4 and x1 are reused in every iteration, creating name dependences across iterations)

  add x0, x1, x2
  add x3, x0, x2    // reads x0, which the sub below rewrites: WAR
  sub x0, x5, x6    // rewrites x0, already written by the first add: WAW

  – Antidependence (WAR): instruction j writes a register or memory location that instruction i reads
  – Output dependence (WAW): instruction i and instruction j write the same register or memory location

• To resolve, use register renaming techniques


Name Dependences

• WAW and WAR dependences may stall a static pipeline when the CPU has few registers
• Compilers require a lot of registers to do register renaming and avoid WAR and WAW stalls

Register renaming by the compiler:

  Before renaming:
  fdiv.d f0,f2,f4
  fadd.d f6,f0,f8     // WAR: reads f8, which fsub.d below writes
  fsd    f6,0(x1)
  fsub.d f8,f10,f14
  fmul.d f6,f10,f8    // WAW: writes f6, already written by fadd.d

  After renaming (S and T are spare registers):
  fdiv.d f0,f2,f4
  fadd.d S,f0,f8
  fsd    S,0(x1)
  fsub.d T,f10,f14
  fmul.d f6,f10,T

It is not always possible to have two extra registers available (at compile time!)
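The same effect can be seen at the source level. A hypothetical C analogue of the example above, where the fresh temporaries s and t play the role of the spare registers S and T:

    /* Writing the fadd.d and fsub.d results into fresh temporaries
     * removes the WAR dependence on f8 and the WAW dependence on f6;
     * only true (RAW) dependences remain, so the subtract and multiply
     * may now be reordered with respect to the add and the store. */
    double renamed(double f2, double f4, double f8,
                   double f10, double f14, double *x1) {
        double f0 = f2 / f4;      /* fdiv.d f0,f2,f4  */
        double s  = f0 + f8;      /* fadd.d S,f0,f8   */
        *x1 = s;                  /* fsd    S,0(x1)   */
        double t  = f10 - f14;    /* fsub.d T,f10,f14 */
        return f10 * t;           /* fmul.d f6,f10,T  */
    }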
Control Dependence

• Ordering of instruction i with respect to a branch instruction

  S1;
  if p2 {
      S2;
  }

• Instruction S2 cannot be moved before the p2 branch
• Instruction S1 cannot be moved after the p2 branch so that its execution is controlled by the branch
  – But, in some cases, instruction movement can happen:

    add x1,x2,x3
    beq x12,x0,skip
    sub x4,x5,x6      // if x4 isn't used after skip, it is possible to
    add x5,x4,x9      // move sub before the branch
  skip:
    or  x7,x8,x9
Instruction Level Parallelism

• ILP is restricted by all these dependences
• ILP extraction is usually transparent to the programmer
• Extracted by the hardware (superscalar processors):
  – Extracting ILP by examining hundreds of instructions
  – Scheduling them in parallel as operands become available
  – Renaming registers to eliminate antidependences
  – Out-of-order execution
  – Speculative execution


Instruction Level Parallelism

• OR extracted by the compiler (e.g. VLIW, Very Long Instruction Word, processors):
  – Multiple operations packed in one instruction (pictured in the sketch below)
  – Each operation slot corresponds to a specific FU (functional unit)
  – Parallelism within a VLIW instruction
  – The compiler forms the instructions
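As a rough illustration only (a made-up three-slot format, not any real VLIW encoding), a VLIW instruction can be pictured as a record with one operation per functional-unit slot:

    #include <stdint.h>

    /* Hypothetical 3-slot VLIW instruction word: the compiler fills
     * one slot per functional unit; slots it cannot fill hold NOPs. */
    enum { VLIW_NOP = 0 };

    typedef struct {
        uint32_t int_alu_op;  /* slot 0: integer ALU operation */
        uint32_t fp_op;       /* slot 1: floating-point operation */
        uint32_t mem_op;      /* slot 2: load/store operation */
    } vliw_inst;

    /* The operations of one vliw_inst execute in parallel; parallelism
     * across slots is fixed by the compiler, not found by the hardware. */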



Instruction Scheduling for Exposing ILP

• Pipeline scheduling is the simplest compiler optimization to improve ILP
• Separate dependent instruction from the source instruction by the pipeline latency of the source instruction

Source loop:

  for (i=999; i>=0; i=i-1)
      x[i] = x[i] + s;

Before scheduling (8 cycles per element, CPI = 8/5 = 1.6):

  Loop: fld    f0,0(x1)
        stall
        fadd.d f4,f0,f2
        stall
        stall
        fsd    f4,0(x1)
        addi   x1,x1,-8
        bne    x1,x2,Loop

After scheduling (7 cycles per element, CPI = 7/5 = 1.4):

  Loop: fld    f0,0(x1)
        addi   x1,x1,-8
        fadd.d f4,f0,f2
        stall
        stall
        fsd    f4,8(x1)
        bne    x1,x2,Loop



Loop Unrolling for Exposing ILP

• Loop unrolling
  – Unroll by a factor of 4 (assume the number of elements is divisible by 4)
  – Eliminate unnecessary instructions (three of the four addi & bne pairs are dropped)

Unrolled loop:

  Loop: fld    f0,0(x1)
        fadd.d f4,f0,f2
        fsd    f4,0(x1)      // drop addi & bne
        fld    f6,-8(x1)
        fadd.d f8,f6,f2
        fsd    f8,-8(x1)     // drop addi & bne
        fld    f10,-16(x1)
        fadd.d f12,f10,f2
        fsd    f12,-16(x1)   // drop addi & bne
        fld    f14,-24(x1)
        fadd.d f16,f14,f2
        fsd    f16,-24(x1)
        addi   x1,x1,-32
        bne    x1,x2,Loop

Schedule the unrolled loop (14 instructions in 14 cycles, 3.5 cycles per element, CPI = 1):

  Loop: fld    f0,0(x1)
        fld    f6,-8(x1)
        fld    f10,-16(x1)
        fld    f14,-24(x1)
        fadd.d f4,f0,f2
        fadd.d f8,f6,f2
        fadd.d f12,f10,f2
        fadd.d f16,f14,f2
        fsd    f4,0(x1)
        fsd    f8,-8(x1)
        fsd    f12,-16(x1)
        fsd    f16,-24(x1)
        addi   x1,x1,-32
        bne    x1,x2,Loop

Register pressure can be a big problem!
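A source-level view of the transformation (a sketch assuming, as on the slide, that the element count is divisible by 4):

    /* Unrolled-by-4 version of: for (i=999; i>=0; i=i-1) x[i] = x[i] + s;
     * One decrement-and-test (the addi/bne pair) runs per 4 elements. */
    void add_scalar_unrolled(double *x, double s) {
        for (int i = 999; i >= 3; i -= 4) {
            x[i]     = x[i]     + s;
            x[i - 1] = x[i - 1] + s;
            x[i - 2] = x[i - 2] + s;
            x[i - 3] = x[i - 3] + s;
        }
    }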
Strip Mining

– If the number of iterations n is unknown at compile time:
– Unroll by k
– Generate a pair of loops (sketched in C below):
  • The first executes n mod k times
  • The second executes n / k times
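A minimal C sketch of strip mining with k = 4 (the function name and the choice of k are illustrative):

    /* Trip count n is unknown at compile time. The first loop runs
     * n mod 4 times; the second, unrolled, loop runs n/4 times. */
    void add_scalar_strip_mined(double *x, long n, double s) {
        long i = 0;
        for (; i < n % 4; i++)          /* first loop: n mod k iterations */
            x[i] = x[i] + s;
        for (; i < n; i += 4) {         /* second loop: n/k iterations,   */
            x[i]     = x[i]     + s;    /* body unrolled by k = 4         */
            x[i + 1] = x[i + 1] + s;
            x[i + 2] = x[i + 2] + s;
            x[i + 3] = x[i + 3] + s;
        }
    }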
