
ECE338 Parallel Computer Architecture
Spring 2022

Introduction to Instruction Level Parallelism (ILP)

Nikos Bellas
Electrical and Computer Engineering Department
University of Thessaly


Instruction Level Parallelism

Execution Time = Instruction Count × CPI × Clock Period

CPI (Clocks per Instruction) is the focus of this class:

CPI = Ideal pipeline CPI   // = 1 in the MIPS pipeline
    + Structural stalls    // need more hardware to reduce structural stalls
    + Data hazard stalls   // due to dependences
    + Control stalls       // due to branches

(see the worked example below)

• Stalls increase CPI (reduce IPC, Instructions Per Cycle) and reduce performance
• ILP refers to the simultaneous completion of instructions
• Large ILP → Large IPC
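To make the formulas concrete, here is a minimal C sketch that evaluates the two equations above; the stall counts and machine parameters are made-up numbers for illustration, not course data:

    #include <stdio.h>

    int main(void) {
        double ideal_cpi   = 1.0;  /* ideal pipeline CPI (= 1 in MIPS pipeline) */
        double structural  = 0.1;  /* structural stalls per instruction (assumed) */
        double data_hazard = 0.3;  /* data hazard stalls per instruction (assumed) */
        double control     = 0.2;  /* control stalls per instruction (assumed) */

        double cpi = ideal_cpi + structural + data_hazard + control;  /* = 1.6 */

        double inst_count   = 1e9;     /* 10^9 instructions (assumed) */
        double clock_period = 0.5e-9;  /* 0.5 ns clock, i.e. 2 GHz (assumed) */

        /* Execution Time = Instruction Count * CPI * Clock Period */
        printf("CPI = %.2f, IPC = %.3f\n", cpi, 1.0 / cpi);
        printf("Execution time = %.2f s\n", inst_count * cpi * clock_period);
        return 0;
    }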



Data Dependence

• Data dependence (Read After Write, RAW dependence)
  – Instruction j is data dependent on instruction i if
    • instruction i produces a result that may be used by instruction j,
    • OR instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (the dependence is transitive)

  Loop: fld    f0,0(x1)      // writes f0
        fadd.d f4,f0,f2      // reads f0 (RAW), writes f4
        fsd    f4,0(x1)      // reads f4 (RAW)
        addi   x1,x1,-8      // writes x1
        bne    x1,x2,Loop    // reads x1 (RAW)

  Potential data dependence through memory, detectable only at execution time (not good!); a C analogue is sketched below:

  sd $s0, 0($s0)
  ld $s1, -20($s1)

• Dependent instructions cannot be executed simultaneously
• Program order must always be obeyed between dependent instructions
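The sd/ld pair above is hard exactly because the addresses depend on runtime register values. A C analogue of the same situation (the function is hypothetical, not from the slides):

    /* If a + i and b + j happen to point to the same element at run
     * time, the load reads the value just stored (a RAW dependence
     * through memory); otherwise the two statements are independent.
     * The compiler usually cannot prove which case holds, so it must
     * preserve program order, just like the sd/ld pair above. */
    double update(double *a, double *b, long i, long j, double s) {
        a[i] = s;      /* store, like  sd $s0, 0($s0)   */
        return b[j];   /* load,  like  ld $s1, -20($s1) */
    }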
Data Dependence

Given two instructions i and j, determine when they can safely execute in parallel.

  Ri : set of locations read by instruction i
  Wi : set of locations written by instruction i

The two instructions i and j can be executed in parallel if (Bernstein conditions):

  Ri ∩ Wj = ∅
  Rj ∩ Wi = ∅
  Wi ∩ Wj = ∅
Example

(Slide figure with a worked example; not reproduced in this extract.)


Name Dependences

• Two instructions use the same name but there is no flow of information
  – Not a true data dependence, but a problem when reordering instructions

  Loop: fld    f0,0(x1)
        fadd.d f4,f0,f2
        fsd    f4,0(x1)
        addi   x1,x1,-8
        bne    x1,x2,Loop

  (f0, f4 and x1 are reused in every iteration, creating name dependences across iterations)

  add x0, x1, x2
  add x3, x0, x2    // reads x0, which the sub below rewrites: WAR
  sub x0, x5, x6    // rewrites x0, already written by the first add: WAW

  – Antidependence (WAR): instruction j writes a register or memory location that instruction i reads
  – Output dependence (WAW): instruction i and instruction j write the same register or memory location

• To resolve, use register renaming techniques


Name Dependences

• WAW and WAR dependences may stall a static pipeline when the CPU has few registers
• Compilers require a lot of registers to do register renaming and avoid WAR and WAW stalls

Register renaming by the compiler:

  Before renaming:
  fdiv.d f0,f2,f4
  fadd.d f6,f0,f8     // WAR: reads f8, which fsub.d below writes
  fsd    f6,0(x1)
  fsub.d f8,f10,f14
  fmul.d f6,f10,f8    // WAW: writes f6, already written by fadd.d

  After renaming (S and T are spare registers):
  fdiv.d f0,f2,f4
  fadd.d S,f0,f8
  fsd    S,0(x1)
  fsub.d T,f10,f14
  fmul.d f6,f10,T

It is not always possible to have two extra registers available (at compile time!)
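The same effect can be seen at the source level. A hypothetical C analogue of the example above, where the fresh temporaries s and t play the role of the spare registers S and T:

    /* Writing the fadd.d and fsub.d results into fresh temporaries
     * removes the WAR dependence on f8 and the WAW dependence on f6;
     * only true (RAW) dependences remain, so the subtract and multiply
     * may now be reordered with respect to the add and the store. */
    double renamed(double f2, double f4, double f8,
                   double f10, double f14, double *x1) {
        double f0 = f2 / f4;      /* fdiv.d f0,f2,f4  */
        double s  = f0 + f8;      /* fadd.d S,f0,f8   */
        *x1 = s;                  /* fsd    S,0(x1)   */
        double t  = f10 - f14;    /* fsub.d T,f10,f14 */
        return f10 * t;           /* fmul.d f6,f10,T  */
    }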
Control Dependence

• Ordering of instruction i with respect to a branch instruction

  S1;
  if p2 {
      S2;
  }

• Instruction S2 cannot be moved before the p2 branch
• Instruction S1 cannot be moved after the p2 branch so that its execution is controlled by the branch
  – But, in some cases, instruction movement can happen:

    add x1,x2,x3
    beq x12,x0,skip
    sub x4,x5,x6      // if x4 isn't used after skip, it is possible to
    add x5,x4,x9      // move sub before the branch
  skip:
    or  x7,x8,x9
Instruction Level Parallelism

• ILP is restricted by all these dependences
• ILP extraction is usually transparent to the programmer
• Extracted by the hardware (superscalar processors):
  – Extracting ILP by examining hundreds of instructions
  – Scheduling them in parallel as operands become available
  – Renaming registers to eliminate antidependences
  – Out-of-order execution
  – Speculative execution


Instruction Level Parallelism

• OR extracted by the compiler (e.g. VLIW, Very Long Instruction Word, processors):
  – Multiple operations packed in one instruction (pictured in the sketch below)
  – Each operation slot corresponds to a specific FU (functional unit)
  – Parallelism within a VLIW instruction
  – The compiler forms the instructions
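As a rough illustration only (a made-up three-slot format, not any real VLIW encoding), a VLIW instruction can be pictured as a record with one operation per functional-unit slot:

    #include <stdint.h>

    /* Hypothetical 3-slot VLIW instruction word: the compiler fills
     * one slot per functional unit; slots it cannot fill hold NOPs. */
    enum { VLIW_NOP = 0 };

    typedef struct {
        uint32_t int_alu_op;  /* slot 0: integer ALU operation */
        uint32_t fp_op;       /* slot 1: floating-point operation */
        uint32_t mem_op;      /* slot 2: load/store operation */
    } vliw_inst;

    /* The operations of one vliw_inst execute in parallel; parallelism
     * across slots is fixed by the compiler, not found by the hardware. */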



Instruction Scheduling for Exposing ILP

• Pipeline scheduling is the simplest compiler optimization to improve ILP
• Separate dependent instruction from the source instruction by the pipeline latency of the source instruction

Source loop:

  for (i=999; i>=0; i=i-1)
      x[i] = x[i] + s;

Before scheduling (8 cycles per element, CPI = 8/5 = 1.6):

  Loop: fld    f0,0(x1)
        stall
        fadd.d f4,f0,f2
        stall
        stall
        fsd    f4,0(x1)
        addi   x1,x1,-8
        bne    x1,x2,Loop

After scheduling (7 cycles per element, CPI = 7/5 = 1.4):

  Loop: fld    f0,0(x1)
        addi   x1,x1,-8
        fadd.d f4,f0,f2
        stall
        stall
        fsd    f4,8(x1)
        bne    x1,x2,Loop



Loop Unrolling for Exposing ILP

• Loop unrolling
  – Unroll by a factor of 4 (assume the number of elements is divisible by 4)
  – Eliminate unnecessary instructions (three of the four addi & bne pairs are dropped)

Unrolled loop:

  Loop: fld    f0,0(x1)
        fadd.d f4,f0,f2
        fsd    f4,0(x1)      // drop addi & bne
        fld    f6,-8(x1)
        fadd.d f8,f6,f2
        fsd    f8,-8(x1)     // drop addi & bne
        fld    f10,-16(x1)
        fadd.d f12,f10,f2
        fsd    f12,-16(x1)   // drop addi & bne
        fld    f14,-24(x1)
        fadd.d f16,f14,f2
        fsd    f16,-24(x1)
        addi   x1,x1,-32
        bne    x1,x2,Loop

Schedule the unrolled loop (14 instructions in 14 cycles, 3.5 cycles per element, CPI = 1):

  Loop: fld    f0,0(x1)
        fld    f6,-8(x1)
        fld    f10,-16(x1)
        fld    f14,-24(x1)
        fadd.d f4,f0,f2
        fadd.d f8,f6,f2
        fadd.d f12,f10,f2
        fadd.d f16,f14,f2
        fsd    f4,0(x1)
        fsd    f8,-8(x1)
        fsd    f12,-16(x1)
        fsd    f16,-24(x1)
        addi   x1,x1,-32
        bne    x1,x2,Loop

Register pressure can be a big problem!
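A source-level view of the transformation (a sketch assuming, as on the slide, that the element count is divisible by 4):

    /* Unrolled-by-4 version of: for (i=999; i>=0; i=i-1) x[i] = x[i] + s;
     * One decrement-and-test (the addi/bne pair) runs per 4 elements. */
    void add_scalar_unrolled(double *x, double s) {
        for (int i = 999; i >= 3; i -= 4) {
            x[i]     = x[i]     + s;
            x[i - 1] = x[i - 1] + s;
            x[i - 2] = x[i - 2] + s;
            x[i - 3] = x[i - 3] + s;
        }
    }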
Strip Mining

– If the number of iterations n is unknown at compile time:
– Unroll by k
– Generate a pair of loops (sketched in C below):
  • The first executes n mod k times
  • The second executes n / k times
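A minimal C sketch of strip mining with k = 4 (the function name and the choice of k are illustrative):

    /* Trip count n is unknown at compile time. The first loop runs
     * n mod 4 times; the second, unrolled, loop runs n/4 times. */
    void add_scalar_strip_mined(double *x, long n, double s) {
        long i = 0;
        for (; i < n % 4; i++)          /* first loop: n mod k iterations */
            x[i] = x[i] + s;
        for (; i < n; i += 4) {         /* second loop: n/k iterations,   */
            x[i]     = x[i]     + s;    /* body unrolled by k = 4         */
            x[i + 1] = x[i + 1] + s;
            x[i + 2] = x[i + 2] + s;
            x[i + 3] = x[i + 3] + s;
        }
    }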
