Compilation Techniques For Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
José M. Andión
Outline
1. Introduction
5. Conclusions
“HPC is a crucial asset for Europe’s innovation capacity
and is of strategic importance to its scientific and
industrial capabilities, as well as to its citizens”
– H2020 Work Programme
Figure 1.1: Trends in Transistors, Performance, and Power for General-Purpose Processors – Various metrics are shown for a selection of processors usually found in high-performance desktop and server systems from 1975 to 2010. For over 30 years, engineers used increased clock frequencies and power-hungry architectural techniques to turn the wealth of available transistors into single-thread performance. Unfortunately, power constraints are forcing engineers to integrate multiple cores onto a single die in an attempt to continue performance scaling, albeit only for parallel applications. (Data gathered from publicly available datasheets, press-releases, and SPECint benchmark results. Some data gathered by M. Horowitz, F. Labonte, …) Source: C. F. Batten. Simplified vector-thread architectures for flexible and efficient data-parallel accelerators. PhD thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2010.
[Charts: TOP500 "Development over Time", June 1993 – June 2015. Left panel: number of systems with Accelerators (y-axis 0–500); vendors IBM, Intel, AMD, NVIDIA, Hybrid, Other, N/A. Right panel: Cores per Socket, growing from 1 through 2, 4, 6, 8, 9, 10, 12, 14, 16, 18 and 32 up to 60.]
A New “Software Crisis”
Extraction of Parallelism

Approaches, ordered along the productivity axis from lower (−) to higher (+):
• libraries
• compiler directives
• programming languages
• parallelizing compilers
Extraction of Parallelism

• libraries
• compiler directives
• programming languages
• parallelizing compilers

Our Proposal: a Source-to-Source Parallelizing Compiler for CPUs and GPUs
Extraction of Locality

• The Locality Principle: temporal & spatial clustering
• data placement
Outline
1. Introduction
5. Conclusions
Outline
• KIR: A diKernel-based IR
• Experimental Evaluation
Outline
• KIR: A diKernel-based IR
• Experimental Evaluation
State-of-the-art vs. Our Approach

• Our approach:

Sequential C/Fortran Source Code → Compiler IR (ASTs, DDG, CFG) → Construction of the KIR (diKernel Recognition; Classification of diK-level Dependences; Spurious diK-level Dependences; Execution Scopes) → Automatic Partitioning (Parallelization Strategy) → OpenMP-enabled Parallel C/Fortran Source Code
[Figure: CFG and DDG of the dense matrix-vector multiplication example. The basic blocks (BB0: i = 0; BB1: t = 0, j = 0; BB3: if (j < m); BB4: y[i] = t; BB5: if (i < n)) are grouped into the stages BB0, BB5 & BB4 and BB1, BB3 & BB2, and the DDG exhibits data dependences between statements (solid edges (1) and (2)).]
diKernel: Domain-Independent Computational Kernel

• DOMAIN-INDEPENDENT CONCEPT LEVEL (programming practice)
• SEMANTIC LEVEL (control flow and data dependence graphs)
• SYNTACTIC LEVEL (abstract syntax tree)
• TEXT LEVEL (ASCII code)

[Figure: the DDG edges (1) and (2) among the statements t = t + A[i][j] * x[j], i++ and if (i < n) (BB2, BB4, BB5) are abstracted with the diKernels K<i_BB4> and K<j_BB1>.]
diKernel-level Flow Dependences

• Identification of the flow of information across the program
• Statement-level dominance
• Ranges of values produced and used throughout the execution of statements:
  – DEF(x, xi) is the range of values of variable x produced by the execution of statement xi.
  – USE(x, yj) is the range of values of variable x used throughout the execution of statement yj.

Definition 2.1.5. Let K<x1 ... xn> → K<y1 ... ym> be a dependence that connects diKernels Kx and Ky through the DDG edge xi → yj. It is a diKernel-level flow dependence, Kx ⇒ Ky, if it holds that xi dominates yj and DEF(x, xi) ⊇ USE(x, yj).

[Figure: in the example, i = 0 (BB0) dominates i++ (BB4) and DEF(i, i=0) ⊇ USE(i, i++).]
Hierarchy of Execution Scopes
Outline
• KIR: A diKernel-based IR
• Experimental Evaluation
Parallelizing Transformations

Forall loop:

#pragma omp parallel for
for (i = 0; i < n; i++) {
  A[i] = 2;
}

Scalar reduction:

r = 0;
#pragma omp parallel for reduction(+:r)
for (i = 0; i < n; i++) {
  r = r + A[i];
}

Array Expansion
Automatic Partitioning driven by the KIR (I)

The construction of the KIR consists of three steps: first, the construction of the diKernels of the program and their data dependence relationships; second, the construction of the flow dependences between diKernels (Definitions 2.1.3–2.1.5); and third, the construction of the hierarchy of execution scopes (Definitions 2.1.6–2.1.8), which reflects the computational stages of the program and groups diKernels into these stages.

critical path: ROOT EXECUTION SCOPE → ES_fori (Figure 2.2, lines 1-7)
Automatic Partitioning driven by the KIR (and II)

critical path: ROOT EXECUTION SCOPE

Figure 2.6 – Parallelized code of the dense matrix-vector multiplication.
Outline
• KIR: A diKernel-based IR
• Experimental Evaluation
Automatic Parallelization of the Benchmark Suite
• Synthetic Benchmarks
DenseAMUX

1 for (i = 0; i < n; i++) {
2   t = 0;
3   for (j = 0; j < m; j++) {
4     t = t + A[i][j] * x[j];
5   }
6   y[i] = t;
7 }

(a) Source code of DenseAMUX (dense matrix-vector multiplication).

KIR:
ROOT EXECUTION SCOPE
  ES_fori (Figures 2.8a and 2.8b, lines 1-7): K<t2> scalar assignment
    ES_forj (Figures 2.8a and 2.8b, lines 3-5): K<t4> scalar reduction

SparseAMUX
AMUXMS & ATMUX

1 for (i = 0; i < n; i++) {
2   y[i] = A[i] * x[i];
3 }
4 for (j = 0; j < n; j++) {
5   for (l = ja[j]; l < ja[j+1]-1; l++) {
6     y[j] = y[j] + A[l] * x[ja[l]];
7   }
8 }

(a) Source code of routine AMUXMS from SparsKit-II.
(b) Source code of routine ATMUX from SparsKit-II.

KIR:
ROOT EXECUTION SCOPE
  ES_fori (Figures 2.9a and 2.9b, lines 1-3): K<y2> regular assignment
  ES_forj,l (Figures 2.9a and 2.9b, lines 4-8)
AMUXMS & ATMUX

(a) Parallelized code of routine AMUXMS of Figure 2.9a: PARTIALLY PARALLEL LOOP.

EQUAKE (I)

ES_fori,j (Figure 2.18, lines 2-4)
EQUAKE (II)

19       K[Anext][0][0] * disp[dispt][i][0]
20       + K[Anext][1][0] * disp[dispt][i][1]
21       + K[Anext][2][0] * disp[dispt][i][2];
22     disp[disptplus][col][1] += K[Anext][0][1] ...
23     disp[disptplus][col][2] += K[Anext][0][2] ...
24     Anext++;
25   }
26   disp[disptplus][i][0] += sum0; ...
27 }
28 time = iter * Exc.dt;
29 for (i = 0; i < ARCHnodes; i++)
30   for (j = 0; j < 3; j++)
31     disp[disptplus][i][j] *= - Exc.dt * Exc.dt;
32 for (i = 0; i < ARCHnodes; i++)
33   for (j = 0; j < 3; j++)
34     disp[disptplus][i][j] +=
35       2.0 * M[i][j] * disp[dispt][i][j]
36       - (M[i][j] - Exc.dt / 2.0 * C[i][j])
37       * disp[disptminus][i][j] - ...
38 for (i = 0; i < ARCHnodes; i++)
39   for (j = 0; j < 3; j++)
40     disp[disptplus][i][j] /= (M[i][j] + Exc.dt / 2.0 * C[i][j]);
41 for (i = 0; i < ARCHnodes; i++)
42   for (j = 0; j < 3; j++)
43     vel[i][j] = 0.5 / Exc.dt * (disp[disptplus][i][j]
44       - disp[disptminus][i][j]);
45 i = disptminus; disptminus = dispt; dispt = disptplus; disptplus = i;
46 }

Figure 2.18 – Excerpt of the source code of the EQUAKE application.

KIR:
ES_fori,j (Figure 2.18, lines 2-4): <disp4> regular assignment
ES_fori,while (Figure 2.18, lines 5-27): <disp26> irregular reduction
ES_fori,j (Figure 2.18, lines 29-31): <disp31> regular reduction
ES_fori,j (Figure 2.18, lines 32-37): <disp34> regular reduction
ES_fori,j (Figure 2.18, lines 38-40): <disp40> regular reduction
ES_fori,j (Figure 2.18, lines 41-44): <vel43> regular assignment
EQUAKE (III)

Minimization of thread creation/destruction
Outline
• KIR: A diKernel-based IR
• Experimental Evaluation
Program Characteristics vs. Compilers

[Table: for each benchmark and its diKernel, the program characteristics it exhibits (Unknown LB = unknown loop bounds, Complex CF = complex control flow, Irreg. writes, Irreg. reads, Temp. vars = temporary variables) and whether PLUTO, GCC, ICC and the KIR-based approach parallelize it (✓ = parallelized, ≈ = partially). Rows – Synthetic: reg. assig. (regular assignment), irreg. assig. (irregular assignment), sc. reduc. 1–3 (scalar reductions), reg. reduc. (regular reduction), irreg. reduc. (irregular reduction), reg. recurr. (regular recurrence); Algebra: DenseAMUX (regular assignment), …]
Performance: EQUAKE (Execution Time)
[Chart: execution time breakdown (Remaining, Overhead, Irregular; left axis 0–100, right axis 0.0–3.0) for KIR/ICC vs. ICC with 1, 2, 4 and 8 threads, on workloads WL x 1, WL x 2 and WL x 3.]
1. Introduction
5. Conclusions
Outline
• Experimental Evaluation
Outline
• Experimental Evaluation
GPGPU with CUDA
• The first GPGPU programs looked like graphics applications
• Main ideas:
2. Shared memory
3. Barriers
Example of CUDA-enabled GPU architecture

[Figure 1. Tesla unified graphics and computing GPU architecture. TPC: texture/processor cluster; SM: streaming multiprocessor; SP: streaming processor; Tex: texture; ROP: raster operation processor.]
GPU Memories

CUDA also exposes two more read-only memories: the constant memory and the texture memory (with special addressing modes, not targeted in this thesis). CUDA assumes that all of these memories are physically on the GPU card, separated from the memory addressed by the CPU. Thus, memory allocations and transfers must be explicitly managed by the programmer.

The compute capability of an NVIDIA GPU defines its core architecture (Tesla, Fermi, etc.), supported features (e.g., double-precision floating-point operations), technical specifications (e.g., the maximum dimensions of the hierarchy of threads) and architectural specifications (e.g., the number of warp schedulers).

Memory          Location  Access        Scope
registers       SM        read & write  one GPU thread
local memory    DRAM      read & write  one GPU thread
shared memory   SM        read & write  all GPU threads in a block
global memory   DRAM      read & write  all GPU threads & CPU

Table 3.1 – Characteristics of the GPU memories considered in this thesis.

In summary, the generation of efficient GPGPU code requires the programmer to explicitly handle the GPU hardware architecture (explicit allocations and transfers) through the following programming features exposed by CUDA:
1. Threadification
4. Coalescing
6. Divergency
7. Occupancy
GPGPU with OpenHMPP

[Table comparing directive-based approaches; the column headers were lost in extraction. Rows: data transfers (automatic & manual in each approach); sw-managed caches (explicit handling / automatic handling / explicit handling); parallelism specification (gangs, workers, SIMD / loop iterations, tasks, SIMD / loop iterations); standard loop transformation directives (no / no / …).]
Outline
• Experimental Evaluation
GPU Programming Features addressed by our Automatic Technique
1. Threadification
4. Coalescing
6. Divergency
7. Occupancy
3.3 Chains of Recurrences (chrecs)

• Algebraic formalism to represent closed-form functions; chrecs have been successfully used to expedite function evaluation at a number of points in a regular interval [18].

Definition 3.3.1. Given a constant φ ∈ Z, a function g : N0 → Z, and the operator +, the chrec f = {φ, +, g} is defined as a function f : N0 → Z such that:

    {φ, +, g}(i) = φ + Σ_{j=0}^{i-1} g(j)

• Useful for representing the iterations of a loop and array access patterns. For example, the loop index i of for_i in Figure 3.4 takes integer values in the interval [0, sizex − 1]; the chrec {0,+,1} provides a closed-form function to compute the value of i at each iteration. Likewise, the memory access pattern of the first dimension of input[i][j][k] (see line 10 of Figure 3.4) can be represented with the chrec {0,+,1}.

• The chrecs, which are given by the KIR, have demonstrated to be a powerful representation of the complex loops and the memory accesses that appear in full-scale real applications [13]. Their algebraic properties provide rules for carrying out arithmetic operations with them.

• We instantiate (particularize) them for each GPU thread. For the access x[i][j], CHRECS_xk = [{0,+,1}][{0,+,1}]:

(a) Source code S1:
1 // only for_i is threadified
2 for (i = 0; i <= N; i++) {
3   for (j = 0; j <= N; j++) {
4     ... x[i][j] ...
5   }
6 }

Accesses per thread:
      T0 (i=0)   T1 (i=1)   T2 (i=2)
j=0   x[0][0]    x[1][0]    x[2][0]
j=1   x[0][1]    x[1][1]    x[2][1]
j=2   x[0][2]    x[1][2]    x[2][2]

(c) Source code S2:
1 // only for_j is threadified
2 for (j = 0; j <= N; j++) {
3   for (i = 0; i <= N; i++) {
4     ... x[i][j] ...
5   }
6 }

Accesses per thread:
      T0 (j=0)   T1 (j=1)   T2 (j=2)
i=0   x[0][0]    x[0][1]    x[0][2]
i=1   x[1][0]    x[1][1]    x[1][2]
i=2   x[2][0]    x[2][1]    x[2][2]
Detection of Coalesced Accesses to the GPU Global Memory

Row-major storage; CHRECS_xk = [{0,+,1}][{0,+,1}]:

(a) Source code S1:
1 // only for_i is threadified
2 for (i = 0; i <= N; i++) {
3   for (j = 0; j <= N; j++) {
4     ... x[i][j] ...
5   }
6 }

(b) Non-coalesced accesses: chrecs of the 2nd dimension per thread are {0, +, 1}, {0, +, 1}, {0, +, 1}.

(c) Source code S2:
1 // only for_j is threadified
2 for (j = 0; j <= N; j++) {
3   for (i = 0; i <= N; i++) {
4     ... x[i][j] ...
5   }
6 }

(d) Coalesced accesses: chrecs of the 2nd dimension per thread are {0}, {1}, {2} – a convex set.
Usage of Registers to Store Reused Data within a GPU Thread

If the intersection is not empty, then some data are accessed several times, and they can be stored in the GPU registers if they are not modified by another thread (lines 6–9). Note that the shared memory could be used for the same purpose, as it has the same access time as registers.

Algorithm 3.2 Usage of registers to store reused data within a GPU thread
1: PROCEDURE StoreReusedDataInRegisters
   Input: n-dimensional array x[s1][s2]...[sn]
   Input: loop nest L = L1, L2, ..., Ll where L1 is the threadified loop
   Output: a modified program that exploits reused data to maximize the usage of the GPU registers
2: collect accesses xk[ik,1][ik,2]...[ik,n] with k ∈ {1, ..., m}
3: CHRECS_xk ← [{φk,1, +, gk,1}][{φk,2, +, gk,2}]...[{φk,n, +, gk,n}]
4: for each thread Ti do
5:   CHRECS_xk^Ti ← [{φk,1^Ti, +, gk,1^Ti}][{φk,2^Ti, +, gk,2^Ti}]...[{φk,n^Ti, +, gk,n^Ti}]
6:   REUSED_DATA_x^Ti ← ∩_{k=1}^{m} CHRECS_xk^Ti
7:   if (REUSED_DATA_x^Ti ≠ ∅) then
8:     store reused data between the accesses made by Ti in its set of registers if data are private
9:   end if
10: end for
11: end PROCEDURE
Usage of the GPU Shared Memory for Data Shared between the Threads of a Block

For each considered block, the chrecs are instantiated by fixing the value of the index of L1 that the thread executes (lines 5–7). If the intersection of the instantiated chrecs associated to all the accesses is not empty, then some data are accessed several times.

Algorithm 3.3 Usage of the GPU shared memory for data shared between the threads of a block
1: PROCEDURE StoreSharedDataInSharedMemory
   Input: n-dimensional array x[s1][s2]...[sn]
   Input: loop nest L = L1, L2, ..., Ll where L1 is the threadified loop
   Output: a modified program using the GPU shared memory to share data between the threads of a block
2: collect accesses xk[ik,1][ik,2]...[ik,n] with k ∈ {1, ..., m}
3: CHRECS_xk ← [{φk,1, +, gk,1}][{φk,2, +, gk,2}]...[{φk,n, +, gk,n}]
4: for each block B do
5:   for each thread Ti in B do
6:     CHRECS_xk^Ti ← [{φk,1^Ti, +, gk,1^Ti}][{φk,2^Ti, +, gk,2^Ti}]...[{φk,n^Ti, +, gk,n^Ti}]
7:   end for
8:   SHDATA_x ← ∩_{Ti} CHRECS_xk^Ti with k ∈ {1, ..., m}
9:   if (SHDATA_x ≠ ∅) then
10:    store data shared between the threads of block B in the shared memory
11:  end if
12: end for
13: end PROCEDURE

The previous technique can prevent some GPU compiler optimizations: typically, these binary compilers make better optimizations if the program is coded with several instructions using scalar variables (avoiding arrays and loops). In order to solve this issue, Algorithm 3.5 applies loop unrolling and loop interchange.
Use Scalar Variables to Enable GPU Compiler
Optimizations
• Experimental Evaluation
CONV3D & SGEMM

The diKernels that represent privatizable temporaries are shaded in the KIR and omitted in the discovering of parallelism; only the regular reduction K<output31> determines whether CONV3D is parallelizable. As the regular reduction diKernel represents conflict-free loop iterations, it can be converted into a forall parallel loop. On the CPU, it can be parallelized using the OpenMP parallel for directive (see Section 2.2.1).

Table 3.2 summarizes the GPU features addressed by our locality-aware automatic parallelization technique to generate the same optimal variant as the one written by an expert in GPU programming.

[Table 3.2 – GPU features (Coalescing, Registers, Shared Memory) exploited with each variant of CONV3D and SGEMM: conv3d-cpu, conv3d-hmpp1, conv3d-hmpp2, conv3d-hmpp3, sgemm-cpu, sgemm-mkl, sgemm-hmpp1, sgemm-hmpp2, sgemm-hmpp3, sgemm-hmpp4, sgemm-cublas.]
CONV3D (I)

[KIR of CONV3D: the privatizable temporaries are shaded to be omitted in the discovering of parallelism.]

Figure 3.4 – Source code of the 3D discrete convolution operator (CONV3D).
CONV3D (II)

• conv3d-cpu: Sequential code
• conv3d-hmpp1: Coalescing

1. int i, j, k, size_x, size_y, size_z;
2. float coefx,coefy,coefz,*input,*output;
3.
4. for (i = 0; i < size_x; i++) {
5.   for (j = 0; j < size_y; j++) {
6.     for (k = 0; k < size_z; k++) {
7.       float tempx = input[i][j][k]+coefx*
8.       …

CHRECS_input1 = [{0,+,1}][{0,+,1}][{0,+,1}]
CHRECS_input1^T0 = [{0,+,1}][{0}][{0}]
CHRECS_input1^T1 = [{0,+,1}][{0}][{1}]
CONV3D (III)

• conv3d-hmpp2: Registers

4. for (j = 0; j < size_y; j++) {
5.   for (k = 0; k < size_z; k++) {
6.     for (i = 0; i < size_x; i++) {
7.       float tempx = input[i][j][k]+coefx*
8.       (
9.         input[i-1][j][k]+input[i+1][j][k]+
10.        …

∩ ≠ ∅: the intersection of the instantiated chrecs is non-empty, so the reused data can be kept in registers.
CONV3D (IV)

1 #pragma hmpp conv3d___hmpp2 codelet
2 void conv3d___hmpp2(float output[sizex][sizey][sizez],
3     float input[bound+sizex+bound][4+sizey+4][4+sizez+4],
4     float coefx, float coefy, float coefz) {
5
6 #pragma hmppcg gridify (j, k)
7   for (int j = 0; j < sizey; j++) {
8     for (int k = 0; k < sizez; k++) {
9       float i___minus4 = 0;
10      float i___minus3 = input[-4][j][k];
11      float i___minus2 = input[-3][j][k];
12      float i___minus1 = input[-2][j][k];
13      float i___plus0 = input[-1][j][k];
14      float i___plus1 = input[0][j][k];
15      float i___plus2 = input[1][j][k];
16      float i___plus3 = input[2][j][k];
17      float i___plus4 = input[3][j][k];
18      for (int i = 0; i < sizex; i++) {
19        i___minus4 = i___minus3;
20        i___minus3 = i___minus2;
21        i___minus2 = i___minus1;
22        i___minus1 = i___plus0;
23        i___plus0 = i___plus1;
24        i___plus1 = i___plus2;
25        i___plus2 = i___plus3;
26        i___plus3 = i___plus4;
27        i___plus4 = input[i+4][j][k];
28        float tempx = i___plus0 + coefx *
29          (
30            i___minus1 + i___plus1 +
31            i___minus2 + i___plus2 +
32            i___minus3 + i___plus3 +
33            i___minus4 + i___plus4
34          );
35        float tempy = ...
36        float tempz = ...
37        output[i][j][k] =
38          output[i][j][k] + tempx + tempy + tempz;
39      }
40    }
41  }
42 }

Figure 3.6 – Excerpt of the parallelized code of the 3D discrete convolution operator.
CONV3D (V)

• conv3d-hmpp3: shared clause of the gridify directive

Table 3.3 – Chrecs for the accesses in lines 24–30 of Figure 3.4 (CONV3D).

3.5.2 Case Study: SGEMM

The simple-precision general matrix multiplication from the BLAS library [24] performs the matrix operation:

    C = α · A × B + β · C
SGEMM (II)

From the point of view of the locality, the challenge of SGEMM is to handle the tradeoff between opposite array traversals efficiently: row-major for C and A, and column-major for B. On the CPU, the general solution is to apply loop tiling: matrices are computed in small tiles to keep data in the cache memory. This approach can also be applied on the GPU, using the shared memory as cache and being aware of coalescing.

• sgemm-cpu: Sequential code – the first variant of SGEMM is the sequential code shown in Figure 3.8.
• sgemm-mkl: Intel MKL – we have selected the cblas_sgemm function of the non-clustered, threaded part of the Intel MKL library [69] to build the sgemm-mkl variant.

The accesses are analyzed by Algorithm 3.1 as follows. Regarding A, all the threads of a warp have the same chrecs and thus access the same memory position (see line 14 of Algorithm 3.1). Regarding B, coalescing is maximized because the chrecs of the first dimension are the same, while the chrecs of the second one are consecutive across the threads of a warp.

Figure 3.8 – Source code of the simple-precision general matrix multiplication.
SGEMM (III)

Figure 3.10 – Excerpt of the parallelized code of the simple-precision general matrix multiplication (SGEMM): variant sgemm-hmpp2.
SGEMM (and IV)

• sgemm-hmpp3: Let the compiler use the registers (fullunroll).
• sgemm-hmpp4: Use the shared memory for B.
• sgemm-cublas: NVIDIA CUBLAS library.

int m, n, k;

#define DELTA 16

#pragma hmpp sgemm___hmpp3 codelet
void sgemm___hmpp3(float C[m][n], float alpha, float A[m][k],
                   float B[k][n], float beta) {
  #pragma hmppcg gridify (i,j), blocksize(128x1)
  for (int i = 0; i < m; i = i + DELTA) {
    for (int j = 0; j < n; j++) {
      float prod[DELTA];
      #pragma hmppcg fullunroll
      for (int t = 0; t < DELTA; t++) {
        prod[t] = 0;
      }
      for (int l = 0; l < k; l++) {
        #pragma hmppcg fullunroll
        for (int t = 0; t < DELTA; t++) {
          prod[t] += A[i+t][l] * B[l][j];
        }
      }
      #pragma hmppcg fullunroll
      for (int t = 0; t < DELTA; t++) {
        C[i+t][j] = alpha * prod[t] + beta * C[i+t][j];
      }
    }
  }
}
Outline
• Experimental Evaluation
Performance Evaluation: CONV3D

sizex, sizey and sizez in 128, 256, 384, 512, 640 and 768

[Bar chart: GFLOPS (0–120) of conv3d-cpu, conv3d-hmpp1, conv3d-hmpp2 and conv3d-hmpp3 on CPU (nova), GPU Tesla S1070 (nova) and GPU Tesla S2050 (pluton)]
Performance Evaluation: SGEMM (I)

m, n and k in 128, 256, 384, 512, 640, 768, 896, 1024, 1152, 1280, 1408, 1536, 1664, 1792, 1920, 2048, 4096, 6144 and 8192

[Bar chart: GFLOPS (0–500) of sgemm-cpu, sgemm-mkl, sgemm-hmpp1, sgemm-hmpp2, sgemm-hmpp3, sgemm-hmpp4 and sgemm-cublas on CPU (nova), GPU Tesla S1070 (nova) and GPU Tesla S2050 (pluton)]
Performance Evaluation: SGEMM (and II)

[3-D plots over m, n and k (128–8192); blue: sgemm-cublas, red: sgemm-hmpp4, black: sgemm-mkl. Left: GPU Tesla S1070 (nova); right: GPU Tesla S2050 (pluton)]
Outline
1. Introduction
Outline
• Problem Formulation
• Experimental Evaluation
Problem Statement

Figure 2. Source code of the cholesky application (excerpt):

10     for (k = 0; k <= i - 1; ++k)
11       x = x - A[j][k] * A[i][k];
12     A[j][i] = x * p[i];
13   }
14 }

When loop index i_l is incremented, the difference between consecutive iteration vectors is:

d⃗^k = ı⃗^{k+1} − ı⃗^k = [0, …, 0, 1, −i^k_{l+1}, …, −i^k_n]^T

Positions {i_j, 0 < j < l} will not change between iterations k and (k + 1), and therefore d^k_j = 0; while positions {i_j, l < j ≤ n} will be reset to 0, and therefore d^k_j = −i^k_j.
DO i_1 = 0, u_1(ı⃗)
  DO i_2 = 0, u_2(ı⃗)
    …
    DO i_n = 0, u_n(ı⃗)
      V[f_1(ı⃗)]…[f_m(ı⃗)]

Loop bounds are assumed to be inclusive, i.e., 0 ≤ i_j ≤ u_j(ı⃗), where {u_j, 0 < j ≤ n} are the affine functions that provide the upper bounds of the loop indices; {f_d(i_1, …, i_n), 0 < d ≤ m} is the set of affine functions that converts a given point in the iteration space of the nest to a point in the data space of V; and ı⃗ = {i_1, …, i_n}^T is a column vector which encodes the state of each iteration variable. Since f is affine, the access can be rewritten as:

V[f_1(ı⃗)]…[f_m(ı⃗)] = V[c_0 + i_1·c_1 + … + i_n·c_n]   (4.1)

where V is the base address of the array, c_0 is a constant stride, and each {c_j, 0 < j ≤ n} is the coefficient of the loop index i_j, which must account for the dimensionality of the original array. For instance, an access A[2*i][j] to an array A[N][M] is equivalent to an access V[2M·i + j] over the linearized array V. Note that, if the original access is not affine, it can be modeled in the worst case as one loop (with one iteration and the corresponding stride) per entry.
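The linearization in Equation (4.1) can be checked directly in C: the address of A[2*i][j] advances exactly as V[c_0 + c_1·i + c_2·j] with c_0 = 0, c_1 = 2M and c_2 = 1. A minimal sketch (N and M are illustrative sizes, not taken from the thesis experiments):

```c
#include <stddef.h>

#define N 8
#define M 5

/* Linearized view of the 2-D access A[2*i][j], per Equation (4.1):
 * base address V = &A[0][0], c0 = 0, c1 = 2*M, c2 = 1. */
ptrdiff_t linear_offset(int i, int j) {
    return 0 + (ptrdiff_t)(2 * M) * i + (ptrdiff_t)1 * j;
}

/* Returns 1 iff &A[2*i][j] == V + linear_offset(i, j) for the whole
 * iteration space 0 <= i < N, 0 <= j < M. */
int check_linearization(void) {
    int A[2 * N][M];      /* row index 2*i stays in range for i < N */
    int *V = &A[0][0];    /* 1-D alias of the array */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            if (&A[2 * i][j] != V + linear_offset(i, j))
                return 0;
    return 1;
}
```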
In all other cases, the next access is ı⃗^{k+1} = {i^k_1, …, i^k_l + 1, 0, …, 0}. A set of indices built complying with these conditions will be referred to as sequential (Definition 4.1.1).

Lemma 4.1.2. The stride between two consecutive accesses, s^k = V(ı⃗^{k+1}) − V(ı⃗^k), is a linear combination of the coefficients of the loop indices.

Proof. Using Equation (4.1), s^k can be rewritten as:

s^k = V + (c_0 + c_1·i^{k+1}_1 + … + c_n·i^{k+1}_n) − V − (c_0 + c_1·i^k_1 + … + c_n·i^k_n)
    = c_1·d^k_1 + … + c_n·d^k_n
    = c⃗ · d⃗^k
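Lemma 4.1.2 is just a dot product, which can be sketched in a few lines of C. The coefficients below (c = {256, 8}, i.e., a row-major array of 32 columns of 8-byte elements) are an assumed example, not data from the thesis:

```c
/* s^k = c . (i^{k+1} - i^k): the stride between two consecutive
 * accesses is a linear combination of the loop-index coefficients. */
long stride(int n, const long c[], const long ik[], const long ik1[]) {
    long s = 0;
    for (int r = 0; r < n; r++)
        s += c[r] * (ik1[r] - ik[r]);
    return s;
}
```

For instance, advancing from index (0, 31) to (1, 0) gives d⃗ = [1, −31], so the stride is 256·1 + 8·(−31) = 8: the address moves to the next row's first element.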
Outline
• Problem Formulation
• Experimental Evaluation
Solution Space
• 2n + 1 candidates
Problem Resolution (I)

Since the upper-bound functions are affine, each u_j(ı⃗) can be written as:

u_j(ı⃗) = w_j + u_{j,1}·i_1 + … + u_{j,(j−1)}·i_{(j−1)}   (4.2)

and therefore it is possible to build a matrix U ∈ Z^{n×n} and a column vector w⃗ ∈ Z^n such that:

U = [ −1 0 0 … 0 ; u_{2,1} −1 0 … 0 ; u_{3,1} u_{3,2} −1 … 0 ; … ; u_{n,1} u_{n,2} u_{n,3} … −1 ],   w⃗ = [w_1, w_2, …, w_n]^T   (4.3)

Note that U is a lower triangular matrix, since no index i_j can depend on an inner index, and that its main diagonal is equal to −1⃗ ∈ Z^n. Using U and w⃗, the condition for a given iteration tuple ı⃗ to be valid under the loop constraints in the canonical loop form can be written as:

U·ı⃗ + w⃗ ≥ 0⃗   (4.4)

For example, a 2-level loop with indices i and j might iterate sequentially over all the elements in an array A[N][M] if the upper bounds are defined as u_i = N, u_j = M and the access is V[i*M + j]. This can be rewritten as an equivalent 1-level loop with index i, using u_i = N*M and access V[i]. Section 4.1.5 will detail a mechanism to increase the dimensionality of the resulting nest.

Let A = {a_1, …, a_N} = {V(ı⃗^1), …, V(ı⃗^N)} be the addresses generated by a single instruction, included in the execution trace. Assume that the algorithm has already identified a partial solution S^k = {c⃗, I^k, U, w⃗}, which reconstructs the subtrace {a_1, …, a_k} using n nested loops, and whose components are defined as follows:

• Coefficients of the loop indices: vector c⃗ ∈ Z^n.
• Reconstructed indices: matrix I^k = [ı⃗^1 | … | ı⃗^k] ∈ Z^{n×k}.
• Bounds: matrix U ∈ Z^{n×n} and vector w⃗ ∈ Z^n, as in Equation (4.3).
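Condition (4.4) is easy to evaluate mechanically. The sketch below checks U·ı⃗ + w⃗ ≥ 0⃗ for a tuple, with U stored row-major; the triangular nest used in the test (for i in 0..4, for j in 0..i) is an assumed example:

```c
/* Checks condition (4.4): U*idx + w >= 0, plus the implicit idx[j] >= 0.
 * U is lower triangular with -1 on the diagonal, so row j encodes
 * i_j <= w_j + u_{j,1}*i_1 + ... + u_{j,j-1}*i_{j-1}. */
int is_valid_iteration(int n, const int *U, const int *w, const int *idx) {
    for (int j = 0; j < n; j++) {
        int g = w[j];
        for (int r = 0; r < n; r++)
            g += U[j * n + r] * idx[r];
        if (idx[j] < 0 || g < 0)
            return 0;
    }
    return 1;
}
```

For the nest "for (i = 0; i <= 4; i++) for (j = 0; j <= i; j++)", the bounds are U = {{-1, 0}, {1, -1}} and w⃗ = {4, 0}: row 0 encodes i ≤ 4 and row 1 encodes j ≤ i.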
Problem Resolution (II)

To be a valid solution, S_n has to meet the following requirements:

• Each consecutive pair of indices ı⃗^k and ı⃗^{k+1} must be sequential as per Definition 4.1.1. Note that this condition is stronger than simply requiring that the indices stay inside the loop bounds, which could be written extending Equation (4.4) as:

U·I^k + w⃗·1^{1×k} ≥ 0^{n×k}

• The observed strides are coherent with the reconstructed ones. Using Lemma 4.1.2, this can be expressed as:

c⃗·(ı⃗^{k+1} − ı⃗^k) = c⃗·d⃗^k = s^k
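The sequentiality requirement can be expressed as a small predicate: ı⃗^{k+1} is sequential to ı⃗^k iff, for some level l, outer indices are unchanged, i_l advances by exactly one, and all inner indices reset to 0. This is a sketch of our reading of Definition 4.1.1, not the thesis pseudocode:

```c
/* 1 iff b = +(l, a) for some level l: outer indices equal,
 * index l incremented by one, inner indices reset to zero. */
int is_sequential(int n, const int a[], const int b[]) {
    for (int l = 0; l < n; l++) {
        int ok = 1;
        for (int j = 0; j < l && ok; j++)
            ok = (b[j] == a[j]);          /* outer levels: unchanged */
        ok = ok && (b[l] == a[l] + 1);    /* level l: advance by one */
        for (int j = l + 1; j < n && ok; j++)
            ok = (b[j] == 0);             /* inner levels: reset */
        if (ok)
            return 1;
    }
    return 0;
}
```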
Figure 4.2 – The Cholesky matrix decomposition.

(a) Source code:

 1 #define N 32
 2
 3 double p[N], A[N][N], x;
 4 int i, j, k;
 5 #pragma scop
 6 for (i = 0; i < N; ++i) {
 7   x = A[i][i];
 8   for (j = 0; j <= i - 1; ++j)
 9     x = x - A[i][j] * A[i][j];
10   p[i] = 1.0 / sqrt(x);
11   for (j = i + 1; j < N; ++j) {
12     x = A[i][j];
13     for (k = 0; k <= i - 1; ++k)
14       x = x - A[j][k] * A[i][k];
15     A[j][i] = x * p[i];
16   }
17 }
18 #pragma endscop

(b) Excerpt of the memory trace. The first column uniquely identifies the instruction that emits the memory access, and the second is the address of the accessed memory location:

0x00400cbe 0x1e2d140
0x00400cbe 0x1e2d140
0x00400cbe 0x1e2d140
…
0x00400cbe 0x1e2ef18
0x00400cbe 0x1e2ef20
0x00400cbe 0x1e2ef28
…

4.2.1 Reconstruction Process

The pseudocode that implements our proposal was presented in Section 4.1.4.
Problem Resolution: Building the System

• Calculate the observed stride. Upon processing access a_{k+1}, the algorithm first calculates the observed stride:

s^k = a_{k+1} − a_k   (4.6)

• Build a diophantine linear equation system. Afterwards, it builds a diophantine linear equation system based on Lemma 4.1.2 to discover the potential indices ı⃗^{k+1} which generate an access stride equal to the observed one:

c⃗·(ı⃗^{k+1} − ı⃗^k) = s^k ⟹ (c⃗·c⃗^T)·d⃗^k = c⃗·s^k   (4.7)

where (c⃗·c⃗^T) ∈ Z^{n×n} is the system matrix, and d⃗^k ∈ Z^n is the solution. There are two possible situations when solving this system:

• One or more solutions: explore them independently. For each solution d⃗^k, the new index ı⃗^{k+1} = ı⃗^k + d⃗^k is calculated, and I^{k+1} = [I^k | ı⃗^{k+1}]. U, w⃗, and c⃗ remain unchanged.
• No solution under the current boundaries:
  – Increase the dimensionality, adding a new loop.
  – Modify the boundaries.
  – Discard this branch.
Problem Resolution: Solving the System

Although the system in Equation (4.7) has infinite solutions in the general case, only a few are valid in the context of the affine loop reconstruction, which makes it possible to develop ad-hoc solving strategies.

Lemma 4.1.3. There are at most n valid solutions to the system in Equation (4.7). They correspond to the indices:

{ı⃗^{k+1}_l = +(l, ı⃗^k), 0 < l ≤ n}

• As indices must be sequential, there are at most n solutions.
• We only need to calculate the predicted stride for each valid index and compare it with the observed stride.

Since index ı⃗^{k+1} must be sequential to index ı⃗^k as per Definition 4.1.1, there is a single degree of freedom for d⃗^k_l: the position d^k_l that is equal to 1:

d⃗^k_l = [0, …, 0, 1, −i^k_{l+1}, …, −i^k_n]^T

Taking this result into account, it is possible to find all valid solutions in linear time (O(n)) by simply testing the valid indices ı⃗^{k+1}_l, calculating the stride for each combination as ŝ^k_l = c⃗·d⃗^k_l, and accepting those that generate a stride equal to the observed one (ŝ^k_l = s^k). These will be particular solutions of the subtrace {a_1, …, a_{k+1}}.

Note: the system must be diophantine, as loop indices may only have integer values.
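The O(n) candidate test of Lemma 4.1.3 can be sketched directly: for each level l, build d⃗_l = [0, …, 0, 1, −i^k_{l+1}, …, −i^k_n] and accept it when c⃗·d⃗_l matches the observed stride. The coefficients in the test (c = {256, 8}) are an assumed row-major array example:

```c
/* Stores in accepted_l[] the levels l whose candidate index +(l, ik)
 * predicts the observed stride, and returns how many were accepted. */
int valid_candidates(int n, const long c[], const long ik[],
                     long observed, int accepted_l[]) {
    int count = 0;
    for (int l = 0; l < n; l++) {
        long s = c[l];                 /* d_l[l] = 1 */
        for (int r = l + 1; r < n; r++)
            s -= c[r] * ik[r];         /* d_l[r] = -ik[r] for inner levels */
        if (s == observed)
            accepted_l[count++] = l;
    }
    return count;
}
```

Note that several candidates may match the same stride (e.g., at index (0, 31) with c = {256, 8}, both advancing the outer loop and advancing the inner one predict a stride of 8); such branches are explored independently and later pruned by the boundary conditions.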
Problem Resolution: CHOLESKY (II)

The extraction process starts by calling EXTRACT() with the following solution, generated by the access A[i][k] (see line 14 of Figure 4.2a). As commented in Section 4.1.3, this is the only feasible solution for the subtrace {a_1 = 0x1e2d140, a_2 = 0x1e2d140}.

• Solution for the first two accesses:

S^2_1 : { c⃗ = [0], I^2 = [ı⃗^1 | ı⃗^2] = [0 1], U = [−1], w⃗ = [1] }

Next, EXTRACT() starts to process the following access in the trace: a_3 = 0x1e2d140.

• Processing the third access. The engine predicts an iteration of the only loop that has been found so far:

γ⃗ = U·ı⃗^2 + w⃗ = [−1]·[1] + [1] = [0]

and computes the stride with the prior access (see line 4 of Algorithm 4.1):

s^2 = a_3 − a_2 = 0x1e2d140 − 0x1e2d140 = 0
ŝ^2_1 = c⃗·d⃗^2_1 = [0]·[1] = 0

which is equal to the observed stride s^2. The matrix of reconstructed indices is updated:

I^3 = [I^2 | +(1, ı⃗^2)] = [0 1 2]

and the loop bounds need to be recomputed; thanks to Corollary 4.1.6, w⃗ = [2], and U remains unchanged. The new solution is linear, and the algorithm continues processing the trace, updating I and w⃗ in the same way, until the observed stride changes to:

s^{30} = a_{31} − a_{30} = 0x1e2d240 − 0x1e2d140 = 0x100 = 256

Then, our method tries to efficiently traverse the solution space. Note that γ⃗ does not have positive elements, and thus cannot guide the exploration (as expected in the first iterations). As will be suggested in Section 4.4.1, a simple heuristic that solves this kind of situation is considering that newly discovered loops are outer than the previously known ones.
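The strides driving this example can be reproduced by replaying the A[i][k] access stream of Figure 4.2a. The sketch below is self-contained (the absolute base address differs from the trace; only the strides matter):

```c
#define N 32

/* Replays the addresses touched by the access A[i][k] on line 14 of
 * Figure 4.2a, storing up to max of them; returns how many were stored. */
int replay(double A[N][N], double *addr[], int max) {
    int c = 0;
    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++)
            for (int k = 0; k <= i - 1; k++)
                if (c < max)
                    addr[c++] = &A[i][k];
    return c;
}
```

For i = 1 the instruction reads A[1][0] thirty times (stride 0, as in the trace); access 31 (i = 2) reads A[2][0], one row of 32 doubles away (stride 0x100); access 32 reads A[2][1] (stride 8).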
Increasing the Solution Dimensionality (I)

None of the potential solutions {+(l, ı⃗^k), 0 < l ≤ n} (see Section 4.1.2) can predict a stride different from 0, because c⃗ = [0]. Therefore, the subtrace {a_1, …, a_31} cannot be generated with an affine access enclosed in a 1-level loop, and the dimensionality of the current solution S^30_1 must be increased to build S^31_1. For this purpose, the function EXTRACT() calls GROW() for the two possible insertion points of the new loop (lines 25–30 of Algorithm 4.1). As the most common situation is that newly discovered loops are outer than the previously known ones, it starts with p = 0.

• Add a new loop. Given an insertion point p for the new loop index i_p, GROW() builds the new matrix of indices I^{k+1} ∈ Z^{(n+1)×(k+1)}, inserting the new index into I and updating the previous ones with a row of 0 to match the new dimensionality; it also inserts a new row and column into U and adds a new element in w⃗.

• In CHOLESKY:

I^{31} = [ 0 … 0 1 ; I^{30} 0 ] = [ 0 0 … 0 1 ; 0 1 … 29 0 ]
Increasing the Solution Dimensionality (and II)

• The coefficient c'_p associated with the new loop index can be derived from the observed stride, using Equation (4.7):

c⃗·(ı⃗^{k+1} − ı⃗^k) = s^k ⟹ [c_1, …, c_p, c'_p, c_{p+1}, …, c_n] · [0, …, 0, 1, −i^k_{p+1}, …, −i^k_n]^T = s^k ⟹ c'_p = s^k + Σ_{r=p+1..n} i^k_r·c_r

After calculating the new c⃗, U and w⃗ are updated as described previously in this section to reflect any new information available. If no solution is found for the boundary conditions, then this branch is discarded. Note that there must be a practical limit to the maximum acceptable solution size, as in the general case any trace {a_1, …, a_N} can be generated using at most N affine nested loops. For this reason, the solution space should be traversed in a breadth-first fashion, to ensure that a minimal solution (in terms of number of generated nested loops) is found.

• In CHOLESKY, revisiting the example, there are two possible insertion points for the new loop. As the most common situation is that newly discovered loops are outer than the previously known ones, p = 0 is explored first:

c'_0 = s^{30} + c_1·i^{30}_1 = 256 + 0·29 = 256 ⟹ c⃗ = [256, 0]^T

The traversal of the solution space continues. The next observed stride is a_{32} − a_{31} = 8; none of the currently found loop indices produces such a stride.
118 stride: of loop i before i >
1 l l 1 l l +1 l
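The coefficient update used when growing the solution can be sketched as a one-liner over the existing solution; the 0-based convention below (existing coefficients c[0..n−1], new loop inserted before position p) is an assumption of the sketch:

```c
/* c'_p = s^k + sum of i^k_r * c_r over the levels that become inner to
 * the inserted loop (Equation (4.7) applied to the grown solution). */
long new_coefficient(int n, int p, long sk, const long c[], const long ik[]) {
    long cp = sk;
    for (int r = p; r < n; r++)   /* old levels p..n-1 end up inside p */
        cp += ik[r] * c[r];
    return cp;
}
```

With the cholesky values (s^30 = 256, one old loop with c_1 = 0 at index 29), this reproduces c'_0 = 256; with an assumed row of 32 elements of stride 8 ending at index 31 and observed stride 8, it recovers the row stride 256.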
Problem Resolution: Updating the Loop Bounds

If the predicted stride of the selected branch does not match the observed stride (i.e., ŝ^k_l ≠ s^k), a different loop index i_{l'} will have to be selected as described in Section 4.1.2, but the constructed ı⃗^{k+1} = +(l, ı⃗^k) will not be valid in the context of the extracted loop bounds, U and w⃗.

• Loop indices must be sequential and stay inside the loop bounds: the new bounds w⃗' are found by solving the system in Equation (4.10). The entries {w'_j, j < l} will be equal to the ones calculated for the previous step, while {w'_j, l < j ≤ n} can be calculated by selecting only iterations of I^k where index i_j is maximum. By taking advantage of the canonical loop form, it is always possible to build and solve each of these systems in linear time, O(j); as a result, the calculation of w⃗' has a complexity of O(1), and the complexity of the calculation of U' becomes O(n²).

• Inconsistent system: the generated branch is discarded (the iteration space is not a polytope, and the solution is not valid).

• System with solutions: it will be overdetermined. Once w⃗' is known, the unknown rows {U'_{(j,:)}, l < j ≤ n} can be calculated by reducing the original system in Equation (4.10) to (n − l + 1) smaller equation systems of the form given in Equation (4.12).
! k ! k !
g =U ı + w
Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
the engine predicts an iteration of the only loop that has been 94 / 118found so far:
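The prediction step (g⃗^k = U ı^k + w⃗ together with the + operation) can be illustrated with a minimal sketch. This is illustrative only, not the thesis implementation: it assumes a two-level nest with a diagonal U (per-loop strides SI and SJ) and known trip counts NI and NJ.

```c
#include <assert.h>

/* Illustrative sketch: a two-level nest accessing a[i][j] with per-loop
 * strides SI and SJ (SI = 40 models a padded row) and trip counts NI, NJ. */
enum { NI = 3, NJ = 4, SI = 40, SJ = 8 };

/* The "+" operation: increment the innermost index that can still grow,
 * resetting the indices of the loops nested inside it. */
static int plus_op(int idx[2]) {
    if (idx[1] + 1 < NJ) { idx[1]++; return 1; }
    if (idx[0] + 1 < NI) { idx[0]++; idx[1] = 0; return 1; }
    return 0; /* iteration space exhausted */
}

/* Address predicted for an index vector (diagonal U for simplicity). */
static long predict(const int idx[2], long base) {
    return base + (long)idx[0] * SI + (long)idx[1] * SJ;
}

/* Walk the whole nest and check every predicted stride against the
 * strides a row-major trace of a[i][j] would exhibit. */
static int engine_matches_trace(long base) {
    int idx[2] = { 0, 0 };
    long prev = predict(idx, base);
    while (plus_op(idx)) {
        long next = predict(idx, base);
        long stride = next - prev;
        /* within a row: SJ; at a row change: SI - (NJ - 1) * SJ */
        if (stride != SJ && stride != SI - (NJ - 1) * SJ)
            return 0;
        prev = next;
    }
    return 1;
}
```

In the real algorithm U is lower triangular rather than diagonal and its rows are unknowns to be solved for; the sketch only shows how a candidate (U, w) pair yields a stride prediction that is matched against the trace.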
Outline
• Problem Formulation
• Experimental Evaluation
Supporting Nonlinearity: Input Noise
(affine or unlabeled traces)
• The exploration of the solution space can be modified to discard some observations before concluding that a branch cannot lead to a solution. This feature has to be statistically guided, to avoid discarding too many points and reaching a very simplified nest version.
• Hence, the algorithm has been extended to allow discarding input noise: whenever g⃗ predicts a stride ŝ^k that does not match the observed s^k, up to max observed accesses can be discarded before the reconstruction of the branch is abandoned.
• The use of g⃗ is capable of identifying errors as long as the current S_n^k accurately represents the trace. However, this does not happen in the initial stages of the reconstruction.
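The discarding mechanism can be modeled with a rough sketch (a hypothetical helper, not the thesis code): scan a trace against a fixed predicted stride and tolerate up to a bounded number of contradicting observations.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch: scan a trace against a fixed predicted stride,
 * discarding up to MAX_DISCARD observations that contradict the
 * prediction.  Returns the number of discarded observations, or -1 if
 * the trace cannot be explained even after discarding MAX_DISCARD. */
enum { MAX_DISCARD = 2 };

static int scan_with_noise(const long *trace, size_t n, long stride) {
    int discarded = 0;
    long expected = trace[0];
    for (size_t k = 1; k < n; k++) {
        expected += stride;            /* prediction from current model */
        if (trace[k] == expected)
            continue;                  /* observation matches           */
        if (++discarded > MAX_DISCARD)
            return -1;                 /* too much noise: give up       */
        expected -= stride;            /* drop the noisy observation    */
    }
    return discarded;
}
```

The real feature is statistically guided rather than using a fixed cap, but the control flow, discard and keep predicting, or abandon the branch, is the same.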
Supporting Nonlinearity: Missing Data
• The exploration of the solution space can be modified to insert up to max predicted strides, until an access predicted by g⃗^k allows the exploration of the branch to continue:

    s^k = Σ_{r=0}^{e} ŝ^{k+r},  0 < e ≤ max

(Fragment of the reconstruction algorithm: increments curr_depth for each inserted level, closes the loops, and returns W.)
Inserting predicted strides without bound would make the problem intractable due to the vast amount of different alternatives to be explored.
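The stride check above can be sketched as follows (illustrative, with a hypothetical helper name): an observed stride s^k is accepted if it equals the sum of up to max consecutive predicted strides, which corresponds to accesses missing from the trace.

```c
#include <assert.h>

/* Illustrative sketch: an observed stride is accepted if it equals the
 * sum of the first e predicted strides (0 < e <= max_missing), meaning
 * e - 1 accesses are missing from the trace.  Returns e, or 0 when the
 * observed stride cannot be explained this way. */
static int explain_stride(long observed, const long *predicted,
                          int max_missing) {
    long sum = 0;
    for (int e = 0; e < max_missing; e++) {
        sum += predicted[e];    /* s^k = sum of s-hat^{k+r}, r = 0..e */
        if (sum == observed)
            return e + 1;
    }
    return 0;
}
```

For example, if the predicted strides are 8, 8, 16 and the observed stride is 16, two predicted strides sum to it, so one access is assumed missing and exploration of the branch continues.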
Outline
• Problem Formulation
• Experimental Evaluation
(Figure: normalized extraction time, log scale, for the 3mm, dynprog, cholesky, trisolv, and jacobi-1d benchmarks.)
Experimental Evaluation:
Piecewise Reconstruction (I)

(a) Original sequential code:

    for (i=1;i<=N-2;i++)
      for (j=1;j<=N-2;j++)
        a[i][j] = ...

(b) Automatically parallelized code:

    for (t1=3;t1<=3*N-6;t1++) {
      lbp=max(ceild(t1+1,2),t1-N+2);
      ubp=min(floord(t1+N-2,2),t1-1);
      #pragma omp parallel for
      for (t2=lbp;t2<=ubp;t2++)
        a[(t1-t2)][(-t1+2*t2)] = ...
    }

seidel (from the PLUTO examples)
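That the two nests are equivalent can be checked with a small program (written for this presentation, not part of the thesis toolchain): both enumerate exactly the same (i, j) iterations, and the skewed nest merely reorders them so that the t2 loop is parallel. The check assumes PLUTO's usual ceild/floord helpers and a small problem size.

```c
#include <assert.h>
#include <string.h>

#define N 5  /* small problem size for the check */

static int ceild(int a, int b)  { return (a + b - 1) / b; } /* a, b > 0 */
static int floord(int a, int b) { return a / b; }           /* a, b > 0 */

/* Count how often each (i, j) iteration is visited by each nest and
 * check that the counts coincide (every iteration visited exactly once). */
static int same_iteration_set(void) {
    static int orig[N][N], skew[N][N];
    memset(orig, 0, sizeof orig);
    memset(skew, 0, sizeof skew);
    for (int i = 1; i <= N - 2; i++)            /* original nest */
        for (int j = 1; j <= N - 2; j++)
            orig[i][j]++;
    for (int t1 = 3; t1 <= 3 * N - 6; t1++) {   /* skewed nest   */
        int lbp = ceild(t1 + 1, 2);
        if (t1 - N + 2 > lbp) lbp = t1 - N + 2;
        int ubp = floord(t1 + N - 2, 2);
        if (t1 - 1 < ubp) ubp = t1 - 1;
        for (int t2 = lbp; t2 <= ubp; t2++)
            skew[t1 - t2][-t1 + 2 * t2]++;      /* i = t1-t2, j = 2*t2-t1 */
    }
    return memcmp(orig, skew, sizeof orig) == 0;
}
```

The underlying mapping is t1 = 2i + j and t2 = i + j, which is why the bounds on t2 reduce to the lbp/ubp expressions above.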
Reconstructed Trace Pieces of seidel for Thread #0:
max_depth = 1 (161 pieces)
Reconstructed Trace Pieces of seidel for Thread #0:
max_depth = 2 (58 pieces)
Reconstructed Trace Pieces of seidel for Thread #0:
max_depth = 3 (41 pieces)
Reconstructed Trace Pieces of seidel for Thread #0:
max_depth = 4 (3 pieces)
Reconstructed Trace Pieces of seidel for All the Threads:
max_depth = 4
Experimental Evaluation:
Piecewise Reconstruction (and VII)
• Increasing the maximum depth has diminishing returns.
• A small number of loops accounts for most of the issued accesses.
Outline
1. Introduction
5. Conclusions
Main Contributions (I)
Main Contributions (II)
Generation of parallel code for multicore processors with the insertion of
OpenMP directives
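This contribution can be illustrated with a generic example (not the compiler's actual output): a loop whose iterations carry no dependences (a DOALL loop) is parallelized by inserting an OpenMP work-sharing directive, here with a reduction clause for the accumulator.

```c
#include <assert.h>

#define M 1000

/* Generic illustration: once the analysis proves the iterations
 * independent, a "#pragma omp parallel for" directive is inserted.
 * Without -fopenmp the pragma is ignored and the loop runs
 * sequentially, producing the same result. */
static long sum_of_squares(void) {
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < M; i++)
        total += (long)i * i;
    return total;
}
```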
Main Contributions (III)
Main Contributions (and IV)
Future Research Lines

Publications (I)
• J. M. Andión, M. Arenaz, and J. Touriño. Domain-independent kernel-based intermediate representation for automatic parallelization of sequential
programs. In Poster Abstracts of the 6th International Summer School on Advanced Computer Architecture and Compilation for High-Performance and
Embedded Systems (ACACES), pages 71–74, Terrassa, Spain, 2010.
• J. M. Andión, M. Arenaz, and J. Touriño. Automatic partitioning of sequential applications driven by domain-independent kernels. In Proceedings of the
15th Workshop on Compilers for Parallel Computing (CPC), CDROM, Vienna, Austria, 2010.
• J. M. Andión, M. Arenaz, and J. Touriño. A new intermediate representation for GCC based on the XARK compiler framework. In Proceedings of the 2nd
International Workshop on GCC Research Opportunities (GROW) (in conjunction with HiPEAC), pages 89–100, Pisa, Italy, 2010.
Publications (II)
Publications (and III)
Compilation Techniques
for Automatic Extraction
of Parallelism and Locality
in Heterogeneous Architectures
José M. Andión