Pipelining and
Instruction-Level
Parallelism
15-418 Parallel Computer Architecture and Programming
CMU 15-418/15-618, Spring 2019
▪ CPUs
▪ “Sequential” code – i.e., single / few threads
▪ GPUs
▪ Programs with lots of independent work (“embarrassingly parallel”)
▪ Key questions:
▪ Where is the parallelism?
▪ Whose job is it to find parallelism?
▪ Highlighted areas actually execute instructions
▪ Most area spent executing the program
▪ (Rest is mostly I/O & memory, not scheduling)
▪ Key ideas:
▪ Pipelining & Superscalar: Work on multiple instructions at once
▪ Out-of-order execution: Dynamically schedule instructions
whenever they are “ready”
▪ Speculation: Guess what the program will do next to discover
more independent work, “rolling back” incorrect guesses
Example: Polynomial evaluation

    int poly(int *coef, int terms, int x) {
        int power = 1;
        int value = 0;
        for (int j = 0; j < terms; j++) {
            value += coef[j] * power;
            power *= x;
        }
        return value;
    }

▪ Compiling on ARM

        // Preamble
        cmp r1, #0
        ble .L4
        push {r4, r5}
        mov r3, r0
        add r1, r0, r1, lsl #2
        movs r4, #1
        movs r0, #0
    .L3:
        // Iteration
        ldr r5, [r3], #4
        cmp r1, r3
        mla r0, r4, r5, r0
        mul r4, r2, r4
        bne .L3
        // Fini
        pop {r4, r5}
        bx lr
    .L4:
        movs r0, #0
        bx lr

Register use:
    r0: value
    r1: &coef[terms]
    r2: x
    r3: &coef[j]
    r4: power
    r5: coef[j]

▪ The loop body:
.L3:
ldr r5, [r3], #4 // r5 <- coef[j]; j++ (two operations)
cmp r1, r3 // compare: j < terms?
mla r0, r4, r5, r0 // value += r5 * power (mul + add)
mul r4, r2, r4 // power *= x
bne .L3 // repeat?
Example: Polynomial evaluation
▪ Executing poly(A, 3, x)

        // Preamble
        cmp r1, #0
        ble .L4
        push {r4, r5}
        mov r3, r0
        add r1, r0, r1, lsl #2
        movs r4, #1
        movs r0, #0
        // j = 0 iteration
        ldr r5, [r3], #4
        cmp r1, r3
        mla r0, r4, r5, r0
        mul r4, r2, r4
        bne .L3
        ...
        // Fini
        ...
        bx lr
The software-hardware boundary
▪ The instruction set architecture (ISA) is a functional
contract between hardware and software
▪ It says what each instruction does, but not how
▪ Example: Ordered sequence of x86 instructions
    Fetch    ldr  cmp  mla  mul  bne  ldr  cmp  mla  mul  bne
    Decode        ldr  cmp  mla  mul  bne  ldr  cmp  mla  mul
    Execute            ldr  cmp  mla  mul  bne  ldr  cmp  mla
    …

Latency = 4 ns / instr
Throughput = 1 instr / ns
4X speedup!
Speedup achieved through pipeline parallelism
Processor works on 4 instructions at a time
[Figure: 4-stage CPU pipeline (Fetch, Decode, Execute, Commit) with ldr, cmp, mul, and mla in flight; labels r1, r3, r3+4]
[Figure: CPU pipeline with an added memory stage (Fetch, Decode, Execute, Execute/Mem, Commit)]
Loop iteration

        ldr r5, [r3], #4
        cmp r1, r3
        mla r0, r4, r5, r0
        mul r4, r2, r4
        bne .L3
        ...

[Figure: dataflow graph of the instructions in one loop iteration]
[Figure: cycle-by-cycle schedule of the loop's instructions (ldr, cmp, mla, mul, bne) over cycles 1-16, with TIME running down the vertical axis; successive slides add instructions from later iterations to the schedule]
[Figure: CPU in which Fetch & Decode fills an Instruction Buffer with the loop's instructions (ldr, cmp, mla, mul, bne, ...), which feed multiple Execute units followed by Commit; successive slides step instructions through the stages]
[Figure: pipeline diagram with Fetch & Decode, three specialized Execute units, and Commit]

    Execute (Mem)    ldr ldr ldr ldr ldr ldr
    Execute (Int)    cmp bne cmp bne cmp bne cmp bne cmp bne cmp
    Execute (Mult)   mla mul mla mul mla mul