
Compilation Techniques

for Automatic Extraction


of Parallelism and Locality
in Heterogeneous Architectures

José M. Andión

PHD ADVISORS: Gabriel Rodríguez and Manuel Arenaz


Outline

1. Introduction

2. A Novel Compiler Support for Multicore Systems

3. Locality-Aware Automatic Parallelization for GPGPU

4. Trace-Based Affine Reconstruction of Code

5. Conclusions

“HPC is a crucial asset for Europe’s innovation capacity
and is of strategic importance to its scientific and
industrial capabilities, as well as to its citizens”
– H2020 Work Programme

“HPC has contributed substantially to national economic prosperity and rapidly accelerated scientific discovery”
– POTUS Executive Order for Creating a National Strategic Computing Initiative

“To out-compute is to out-compete”


– US Council of Competitiveness

Trends in Transistors, Performance, and Power for General-Purpose Processors

Figure 1.1 – Various metrics are shown for a selection of processors usually found in high-performance desktop and server systems from 1975 to 2010. For over 30 years, engineers used increased clock frequencies and power-hungry architectural techniques to turn the wealth of available transistors into single-thread performance. Unfortunately, power constraints are forcing engineers to integrate multiple cores onto a single die in an attempt to continue performance scaling, albeit only for parallel applications. (Data gathered from publicly available data-sheets, press-releases, and SPECint benchmark results. Some data gathered by M. Horowitz, F. Labonte, …)

Source: C. F. Batten. Simplified vector-thread architectures for flexible and efficient data-parallel accelerators. PhD thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2010.
The TOP500 List: Development over Time

[Charts (June 1993 to June 2015): number of TOP500 systems per processor vendor (IBM, Intel, AMD, Other, N/A), per accelerator type (NVIDIA, Hybrid, other accelerators), and per cores-per-socket count (1, 2, 4, 6, 8, 9, 10, 12, 14, 16, 18, 32, 60).]
A New “Software Crisis”

• Modeling reality in software is already difficult

• Hardware architecture with multiple levels of increasing complexity

• How do we distribute computations?

• How do we distribute data?

Extraction of Parallelism

Approaches ordered by increasing productivity (− to +):

• libraries

• compiler directives

• programming languages

• parallelizing compilers

Extraction of Parallelism

Approaches ordered by increasing productivity (− to +):

• libraries

• compiler directives

• programming languages

• parallelizing compilers

Our Proposal: a Source-to-Source Parallelizing Compiler for CPUs and GPUs

Extraction of Locality
• The Locality Principle: temporal & spatial clustering

• Techniques to improve locality:

• loop interchange, fission and fusion of loops and arrays, tiling…

• hardware and software prefetching

• data placement

• design of ad-hoc memory systems

• Implemented in compiler frameworks

Extraction of Locality
• The Locality Principle: temporal & spatial clustering

• Techniques to improve locality:

• loop interchange, fission and fusion of loops and arrays, tiling… (see the interchange sketch below)

• hardware and software prefetching

• data placement

• design of ad-hoc memory systems

• Implemented in compiler frameworks

Our Proposal: Affine Reconstruction of Code from a Trace of Memory Accesses
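A minimal sketch (not taken from the thesis) of how loop interchange improves spatial locality for a row-major C array; the function names and the size N are illustrative:

#define N 1024

/* Column-wise traversal of a row-major array: consecutive inner-loop
 * iterations touch elements N floats apart (poor spatial locality). */
void sum_cols(float A[N][N], float *s) {
  for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
      *s += A[i][j];
}

/* After loop interchange the inner loop walks consecutive memory
 * locations, so each cache line is fully reused before eviction. */
void sum_rows(float A[N][N], float *s) {
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      *s += A[i][j];
}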

Outline

1. Introduction

2. A Novel Compiler Support for Multicore Systems

3. Locality-Aware Automatic Parallelization for GPGPU

4. Trace-Based Affine Reconstruction of Code

5. Conclusions

Outline

2. A Novel Compiler Support for Multicore Systems

• KIR: A diKernel-based IR

• Automatic Partitioning driven by the KIR

• Automatic Parallelization of the Benchmark Suite

• Experimental Evaluation

State-of-the-art vs. Our Approach

• Current parallelizing compilers are based on systems of equations that respect all dependences of statement-based IRs, even if they are merely implementation artifacts

• Our approach:

Sequential C/Fortran source code → compiler IR (ASTs, DDG, CFG) → Construction of the KIR (diKernel recognition, classification of diKernel-level dependences, execution scopes, spurious diKernel-level dependences) → Automatic partitioning (OpenMP-enabled parallelization strategy) → parallel C/Fortran source code

Standard Statement-based IR

Source code of the dense matrix-vector multiplication:

1 for (i = 0; i < n; i++) {
2   t = 0;
3   for (j = 0; j < m; j++) {
4     t = t + A[i][j] * x[j];
5   }
6   y[i] = t;
7 }

[Figure: control flow graph of this code with basic blocks BB0-BB5 (BB0, BB5 and BB4 initialize, test and increment the loop index i; BB1, BB3 and BB2 do the same for j), together with the data dependences of the DDG between statements.]
diKernel: Domain-Independent
Computational Kernel

• Characterizes the computations carried out in a program without being affected by how they are coded
• DOMAIN-SPECIFIC CONCEPT LEVEL (problem solving methods and application domain)

• DOMAIN-INDEPENDENT CONCEPT LEVEL (programming practice)

• SEMANTIC LEVEL (control flow and data dependence graphs)

• SYNTACTIC LEVEL (abstract syntax tree)

• TEXT LEVEL (ASCII code)

• An SCC (strongly connected component) of the DDG, ignoring flow-of-control statements


Building the KIR (I)

1 for (i = 0; i < n; i++) {
2   t = 0;
3   for (j = 0; j < m; j++) {
4     t = t + A[i][j] * x[j];
5   }
6   y[i] = t;
7 }

Figure 2.2 – Source code of the dense matrix-vector multiplication.

The DDG edges (1) and (2) are abstracted with diKernels: K<iBB0>, K<iBB4>, K<jBB1>, K<jBB2>, K<tBB1>, K<tBB2> and K<yBB4>.
diKernel-level Flow Dependences

• Identification of the flow of information across the program

• Statement-level dominance

• Range of values of a variable produced and used throughout the execution of statements: DEF(x, xi) is the range of values of variable x produced through the execution of statement xi; USE(x, yj) is the range of values of variable x used through the execution of statement yj.

• Definition 2.1.5. Let K<x1…xn> → K<y1…ym> be a dependence that connects diKernels Kx and Ky through the DDG edge xi → yj. It is a diKernel-level flow dependence between Kx and Ky if it holds that xi dominates yj and DEF(x, xi) ⊇ USE(x, yj).
Building the KIR (II)

Example: the DDG edge from "i = 0" (BB0) to "i++" (BB4) is a diKernel-level flow dependence from K<iBB0> to K<iBB4>, because "i = 0" dominates "i++" and DEF(i, i=0) ⊇ USE(i, i++).

[Figure: CFG of the dense matrix-vector multiplication (BB0-BB5) and the resulting diKernel-level flow dependences among K<iBB0>, K<iBB4>, K<jBB1>, K<jBB2>, K<tBB1>, K<tBB2> and K<yBB4>.]
Hierarchy of Execution Scopes

• To expose the computational stages of the program

• Based on the hierarchy of loops: one execution scope for each perfect loop nest

• The root execution scope is a special node that represents the program as a whole

• diKernels belong to the innermost execution scope that contains all of their statements

• diKernels that compute the loop indices belong to the ES of the corresponding loop
Building the KIR (and III)

1 for (i = 0; i < n; i++) {
2   t = 0;
3   for (j = 0; j < m; j++) {
4     t = t + A[i][j] * x[j];
5   }
6   y[i] = t;
7 }

Hierarchy of execution scopes (KIR):

ROOT EXECUTION SCOPE
  ES_fori (Figure 2.2, lines 1-7)
    K<tBB1>  scalar assignment
    ES_forj (Figure 2.2, lines 3-5)
      K<tBB2>  scalar reduction
    K<yBB4>  regular assignment
Outline

2. A Novel Compiler Support for Multicore Systems

• KIR: A diKernel-based IR

• Automatic Partitioning driven by the KIR

• Automatic Parallelization of the Benchmark Suite

• Experimental Evaluation

Spurious diKernel-level Dependences

• They do not prevent the parallelization: t is a privatizable scalar variable

1 for (i = 0; i < n; i++) {
2   t = 0;
3   for (j = 0; j < m; j++) {
4     t = t + A[i][j] * x[j];
5   }
6   y[i] = t;
7 }

ROOT EXECUTION SCOPE
  ES_fori (Figure 2.2, lines 1-7)
    K<tBB1>  scalar assignment
    ES_forj (Figure 2.2, lines 3-5)
      K<tBB2>  scalar reduction
    K<yBB4>  regular assignment
OpenMP-enabled Parallelization Strategy

• Find the critical path in the KIR

• diKernel-level flow dependences

• Parallelizing transformations for each type of diKernel

• Optimizations for the joint parallelization of loops

• Minimize synchronization between diKernels

• Minimize thread creation/destruction

Parallelizing Transformations

FULLY PARALLEL LOOP:

1. #pragma omp parallel for
2. for (i = 0; i < n; i++) {
3.   A[i] = 2;
4. }

PARTIALLY PARALLEL LOOP:

1. r = 0;
2. #pragma omp parallel for reduction(+:r)
3. for (i = 0; i < n; i++) {
4.   r = r + A[i];
5. }

Array Expansion (see the sketch below)
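A minimal sketch (not from the thesis) of array expansion for the scalar reduction above: the scalar r is expanded into one partial sum per thread, the loop runs fully parallel, and the partial sums are combined afterwards. The function name and sizes are illustrative; omp_get_max_threads and omp_get_thread_num are standard OpenMP calls.

#include <omp.h>

/* Array expansion of "r = r + A[i]": each thread accumulates into its
 * own slot of r_part[], removing the loop-carried dependence on r. */
float sum_array_expanded(const float *A, int n) {
  int nthreads = omp_get_max_threads();
  float r_part[nthreads];              /* one partial sum per thread */
  for (int t = 0; t < nthreads; t++)
    r_part[t] = 0.0f;

  #pragma omp parallel
  {
    int tid = omp_get_thread_num();
    #pragma omp for
    for (int i = 0; i < n; i++)
      r_part[tid] += A[i];             /* conflict-free accesses */
  }

  float r = 0.0f;                      /* sequential final combination */
  for (int t = 0; t < nthreads; t++)
    r += r_part[t];
  return r;
}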

Automatic Partitioning driven by the KIR (I)

1 for (i = 0; i < n; i++) {
2   t = 0;
3   for (j = 0; j < m; j++) {
4     t = t + A[i][j] * x[j];
5   }
6   y[i] = t;
7 }

The critical path of the KIR drives the partitioning:

ROOT EXECUTION SCOPE
  ES_fori (Figure 2.2, lines 1-7)
    K<tBB1>  scalar assignment
    ES_forj (Figure 2.2, lines 3-5)
      K<tBB2>  scalar reduction
    K<yBB4>  regular assignment
Automatic Partitioning driven by the KIR (and II)

Following the critical path of the KIR, the whole loop nest becomes a FULLY PARALLEL LOOP:

1 #pragma omp parallel shared(A,x,y) private(i,j,t)
2 {
3 #pragma omp for schedule(static)
4 for (i = 0; i < n; i = i + 1) {
5   t = 0;
6   for (j = 0; j < m; j = j + 1) {
7     t = (t) + ((A[i][j]) * (x[j]));
8   }
9   y[i] = t;
10 }
11 }

Figure 2.6 – Parallelized code of the dense matrix-vector multiplication.
Outline

2. A Novel Compiler Support for Multicore Systems

• KIR: A diKernel-based IR

• Automatic Partitioning driven by the KIR

• Automatic Parallelization of the Benchmark Suite

• Experimental Evaluation

Automatic Parallelization of the Benchmark Suite

• Synthetic Benchmarks

• Dense/Sparse Matrix-Vector Multiplication

• Sobel Edge Filter

• SWIM from SPEC CPU2000

• EQUAKE from SPEC CPU2000

DenseAMUX & SparseAMUX

(a) Source code of DenseAMUX (dense matrix-vector multiplication):

1 for (i = 0; i < n; i++) {
2   t = 0;
3   for (j = 0; j < m; j++) {
4     t = t + A[i][j] * x[j];
5   }
6   y[i] = t;
7 }

(b) Source code of routine AMUX from SparsKit-II:

1 for (i = 0; i < n; i++) {
2   t = 0;
3   for (j = ia[i]; j < ia[i+1]-1; j++) {
4     t = t + A[j] * x[ja[j]];
5   }
6   y[i] = t;
7 }

(c) KIR for DenseAMUX and AMUX:

ROOT EXECUTION SCOPE
  ES_fori (Figures 2.8a and 2.8b, lines 1-7)
    K<t2>  scalar assignment
    ES_forj (Figures 2.8a and 2.8b, lines 3-5)
      K<t4>  scalar reduction
    K<y6>  regular assignment
SparseAMUX

FULLY PARALLEL LOOP: (d) Parallelized code of the routine AMUX:

1 #pragma omp parallel shared(A,ia,ja,x,y) private(i,j,t)
2 {
3 #pragma omp for schedule(static)
4 for (i = 0; i < n; i++) {
5   t = 0;
6   for (j = ia[i]; j < (ia[i+1] - 1); j = j + 1) {
7     t = (t) + ((A[j]) * (x[ja[j]]));
8   }
9   y[i] = t;
10 }
11 }

Figure 2.8 – Dense and sparse matrix-vector multiplication.
AMUXMS & ATMUX

(a) Source code of routine AMUXMS from SparsKit-II:

1 for (i = 0; i < n; i++) {
2   y[i] = A[i] * x[i];
3 }
4 for (j = 0; j < n; j++) {
5   for (l = ja[j]; l < ja[j+1]-1; l++) {
6     y[j] = y[j] + A[l] * x[ja[l]];
7   }
8 }

(b) Source code of routine ATMUX from SparsKit-II:

1 for (i = 0; i < n; i++) {
2   y[i] = 0;
3 }
4 for (j = 0; j < n; j++) {
5   for (l = ia[j]; l < ia[j+1]-1; l++) {
6     y[ja[l]] = y[ja[l]] + x[j] * A[l];
7   }
8 }

(c) KIR for AMUXMS and ATMUX:

ROOT EXECUTION SCOPE
  ES_fori (Figures 2.9a and 2.9b, lines 1-3)
    K<y2>  regular assignment
  ES_forj,l (Figures 2.9a and 2.9b, lines 4-8)
    K<y6>  irregular reduction
AMUXMS

FULLY PARALLEL LOOPS: (a) Parallelized code of routine AMUXMS of Figure 2.9a:

1 #pragma omp parallel shared(A,x,ja,y) private(i,j,l,t)
2 {
3 #pragma omp for schedule(static) nowait
4 for (i = 0; i < n; i = i + 1) {
5   y[i] = (A[i]) * (x[i]);
6 }
7 #pragma omp for schedule(static)
8 for (j = 0; j < n; j = j + 1) {
9   for (l = ja[j]; l < (ja[j+1] - 1); l = l + 1) {
10     y[j] = (y[j]) + ((A[l]) * (x[ja[l]]));
11   }
12 }
13 }
ATMUX

PARTIALLY PARALLEL LOOP: (b) Parallelized code of routine ATMUX of Figure 2.9b:

1 #pragma omp parallel shared(A,ia,ja,x,y) private(i,j,l,y___private)
2 {
3 if (omp_get_thread_num() == 0) {
4   y___private = y;
5 } else {
6   y___private = (float *) malloc(n * sizeof(float));
7 }
8 for (i = 0; i < n; i = i + 1) {
9   y___private[i] = 0;                                    /* Initialization */
10 }
11 #pragma omp for schedule(static)
12 for (j = 0; j < n; j = j + 1) {
13   for (l = ia[j]; l < (ia[j+1] - 1); l = l + 1) {
14     y___private[ja[l]] = (y___private[ja[l]]) + ((x[j]) * (A[l]));
15   }
16 }                                                        /* Computation */
17 #pragma omp critical
18 if (omp_get_thread_num() != 0) {
19   for (i = 0; i < n; i = i + 1) {
20     y[i] += y___private[i];
21   }
22 }                                                        /* Reduction */
23 if (omp_get_thread_num() != 0) {
24   free(y___private);
25 }
26 }

Figure 2.10 – Parallelized codes of variations of the sparse matrix-vector multiplication.
EQUAKE (I)

KIR of the EQUAKE time-stepping loop:

ROOT EXECUTION SCOPE
  ES_foriter (Figure 2.18, lines 1-46)
    ES_fori,j (Figure 2.18, lines 2-4):      K<disp4>   regular assignment
    ES_fori,while (Figure 2.18, lines 5-27): K<disp26>  irregular reduction
    ES_fori,j (Figure 2.18, lines 29-31):    K<disp31>  regular reduction
    ES_fori,j (Figure 2.18, lines 32-37):    K<disp34>  regular reduction
    ES_fori,j (Figure 2.18, lines 38-40):    K<disp40>  regular reduction
    ES_fori,j (Figure 2.18, lines 41-44):    K<vel43>   regular assignment

1 for (iter = 1; iter <= timesteps; iter++) {
2   for (i = 0; i < ARCHnodes; i++)
3     for (j = 0; j < 3; j++)
4       disp[disptplus][i][j] = 0.0;
5   for (i = 0; i < ARCHnodes; i++) {
6     Anext = ARCHmatrixindex[i]; Alast = ARCHmatrixindex[i+1];
7     sum0 = K[Anext][0][0] * disp[dispt][i][0]
8          + K[Anext][0][1] * disp[dispt][i][1]
9          + K[Anext][0][2] * disp[dispt][i][2];
10    sum1 = K[Anext][1][0] * ...; sum2 = K[Anext][2][0] * ...;
11    Anext++;
12    while (Anext < Alast) {
13      col = ARCHmatrixcol[Anext];
14      sum0 += K[Anext][0][0] * disp[dispt][col][0]
15           + K[Anext][0][1] * disp[dispt][col][1]
16           + K[Anext][0][2] * disp[dispt][col][2];
17      sum1 += K[Anext][1][0]*...; sum2 += K[Anext][2][0]*...;
18      disp[disptplus][col][0] +=
19        K[Anext][0][0] * disp[dispt][i][0]
20        + K[Anext][1][0] * disp[dispt][i][1]
21        + K[Anext][2][0] * disp[dispt][i][2];
22      disp[disptplus][col][1] += K[Anext][0][1] ...
23      disp[disptplus][col][2] += K[Anext][0][2] ...
24      Anext++;
25    }
26    disp[disptplus][i][0] += sum0; ...
27  }
28  time = iter * Exc.dt;
29  for (i = 0; i < ARCHnodes; i++)
30    for (j = 0; j < 3; j++)
EQUAKE (II)

Source code (continued; the KIR is the same as on the previous slide):

31      disp[disptplus][i][j] *= - Exc.dt * Exc.dt;
32  for (i = 0; i < ARCHnodes; i++)
33    for (j = 0; j < 3; j++)
34      disp[disptplus][i][j] +=
35        2.0 * M[i][j] * disp[dispt][i][j]
36        - (M[i][j] - Exc.dt / 2.0 * C[i][j])
37        * disp[disptminus][i][j] - ...
38  for (i = 0; i < ARCHnodes; i++)
39    for (j = 0; j < 3; j++)
40      disp[disptplus][i][j] /= (M[i][j] + Exc.dt / 2.0 * C[i][j]);
41  for (i = 0; i < ARCHnodes; i++)
42    for (j = 0; j < 3; j++)
43      vel[i][j] = 0.5 / Exc.dt * (disp[disptplus][i][j]
44        - disp[disptminus][i][j]);
45  i = disptminus; disptminus = dispt; dispt = disptplus; disptplus = i;
46 }

Figure 2.18 – Excerpt of the source code of the EQUAKE application.
EQUAKE (III): Minimization of thread creation/destruction

1 #pragma omp parallel shared(disp) private(disp___disptplus___private,...)


2 {
3 if (omp_get_thread_num() == 0) {
4 disp___disptplus___private = disp[disptplus];
5 } else {
6 disp___disptplus___private = (double **) malloc (ARCHnodes * sizeof(double *));
7 for (i = 0; i < ARCHnodes; i = i + 1)
8 disp___disptplus___private[i] = (double *) malloc(3 * sizeof(double));
9 }
10 for (iter = 1; iter < (timesteps + 1); iter = iter + 1) { PARTIALLY PARALLEL LOOP
11 #pragma omp barrier
12 for (i = 0; i < ARCHnodes; i = i + 1)
13 for (j = 0; j < 3; j = j + 1)
14 disp___disptplus___private[i][j] = 0.0; Initialization
15 #pragma omp for schedule(static)
16 for (i = 0; i < ARCHnodes; i = i + 1) {
17 Anext = ARCHmatrixindex[i]; Alast = ARCHmatrixindex[i+1];
18 sum0 = K[Anext][0][0] * ...
19 Anext++;
20 while (Anext < Alast) {
21 col = ARCHmatrixcol[Anext];
22 sum0 += K[Anext][0][0] * ...
23 disp___disptplus___private[col][0] += K[Anext][0][0] * ... Computation
24 Anext++;
25 }
26 disp___disptplus___private[i][0] += sum0; ...
27 }
28 #pragma omp critical
29 if (omp_get_thread_num() != 0)
30 for (i = 0; i < ARCHnodes; i = i + 1) Reduction
31 for (j = 0; j < 3; j = j + 1)
32 disp[disptplus][i][j] += disp___disptplus___private[i][j];
33 #pragma omp barrier
34 time = iter * Exc.dt;
35 #pragma omp for schedule(static) nowait
36 for (i = 0; i < ARCHnodes; i = i + 1)
EQUAKE (and IV)

18 sum0 = K[Anext][0][0] * ...
19 Anext++;
20 while (Anext < Alast) {
21   col = ARCHmatrixcol[Anext];
22   sum0 += K[Anext][0][0] * ...
23   disp___disptplus___private[col][0] += K[Anext][0][0] * ...
24   Anext++;
25 }
26 disp___disptplus___private[i][0] += sum0; ...
27 }
28 #pragma omp critical
29 if (omp_get_thread_num() != 0)
30 for (i = 0; i < ARCHnodes; i = i + 1)
31 for (j = 0; j < 3; j = j + 1)
32 disp[disptplus][i][j] += disp___disptplus___private[i][j];
33 #pragma omp barrier
34 time = iter * Exc.dt;
35 #pragma omp for schedule(static) nowait
36 for (i = 0; i < ARCHnodes; i = i + 1) FULLY PARALLEL LOOP
37 for (j = 0; j < 3; j = j + 1)
38 disp[disptplus][i][j] *= - Exc.dt * Exc.dt;
39 #pragma omp for schedule(static) nowait
40 for (i = 0; i < ARCHnodes; i = i + 1)
41 for (j = 0; j < 3; j = j + 1)
FULLY PARALLEL LOOP
42 disp[disptplus][i][j] += ...
43 #pragma omp for schedule(static) nowait
44 for (i = 0; i < ARCHnodes; i = i + 1)
45 for (j = 0; j < 3; j = j + 1) FULLY PARALLEL LOOP
46 disp[disptplus][i][j] /= ...
47 #pragma omp for schedule(static) nowait
48 for (i = 0; i < ARCHnodes; i = i + 1)
49 for (j = 0; j < 3; j = j + 1) FULLY PARALLEL LOOP
50 vel[i][j] = ...
51 i = disptminus; disptminus = dispt; dispt = disptplus; disptplus = i;
52 } /* for iter */
53 if (omp_get_thread_num() != 0) {
54 for (i = 0; i < ARCHnodes; i = i + 1)
55 free(disp___disptplus___private[i]);
56 free(disp___disptplus___private);
57 }
58 }

Figure 2.20 – Excerpt of the parallelized code of the EQUAKE application.

Outline

2. A Novel Compiler Support for Multicore Systems

• KIR: A diKernel-based IR

• Automatic Partitioning driven by the KIR

• Automatic Parallelization of the Benchmark Suite

• Experimental Evaluation

Effectiveness

[Table: for each benchmark, its diKernel, the program characteristics that hinder automatic parallelization (unknown loop bounds, complex control flow, irregular writes, irregular reads, temporary variables), and which compilers (PLUTO, GCC, ICC, KIR) parallelize it effectively.]

Benchmarks and diKernels:

• Synthetic: reg. assig. (regular assignment), irreg. assig. (irregular assignment), sc. reduc. 1-3 (scalar reductions), reg. reduc. (regular reduction), irreg. reduc. (irregular reduction), reg. recurr. (regular recurrence)

• Algebra: DenseAMUX (regular assignment), AMUX (regular assignment), AMUXMS (regular reduction), ATMUX (irregular reduction)

• Image & Apps: sobel1 (regular assignment), sobel2 (regular assignment), SWIM (regular recurrence), EQUAKE (irregular reduction)

Experimental platform: 2 Intel Xeon E5520 quad-core Nehalem processors at 2.26 GHz, with 8 MB of cache memory per processor and 8 GB of RAM.
Performance: EQUAKE (Execution Time)

[Chart: execution time in seconds (left axis, 0-100) and speedup (right axis, 0.0-3.0) of EQUAKE parallelized with KIR/ICC vs. ICC alone, for 1, 2, 4 and 8 threads and workloads WL x 1, WL x 2 and WL x 3; execution time broken down into Irregular, Overhead and Remaining.]
Outline

1. Introduction

2. A Novel Compiler Support for Multicore Systems

3. Locality-Aware Automatic Parallelization for GPGPU

4. Trace-Based Affine Reconstruction of Code

5. Conclusions

Outline

3. Locality-Aware Automatic Parallelization for GPGPU

• GPGPU with CUDA and OpenHMPP

• Locality-Aware Generation of Efficient GPGPU Code

• CONV3D & SGEMM

• Experimental Evaluation

GPGPU with CUDA
• The first GPGPU programs looked like graphics applications

• CUDA enables the use of C

CUDA kernel: specifies the operation of a single GPU thread

• Main ideas:

1. Lightweight parallel threads in hierarchy: grid, block

2. Shared memory

3. Barriers

Example of a CUDA-enabled GPU architecture

[Figure: Tesla unified graphics and computing GPU architecture. TPC: texture/processor cluster; SM: streaming multiprocessor; SP: streaming processor; Tex: texture; ROP: raster operation processor.]

GPU Memories

There are also read-only memories: the constant memory and the texture memory (with special addressing modes, not targeted in this thesis). CUDA assumes that all of these memories are physically on the GPU card, separated from the memory addressed by the CPU. Thus, memory allocations and transfers must be explicitly managed by the programmer (explicit allocations and transfers).

Table 3.1 – Characteristics of the GPU memories considered in this thesis:

Memory         | Location | Access                   | Scope
registers      | SM       | low-latency read & write | one GPU thread
local memory   | DRAM     | read & write             | one GPU thread
shared memory  | SM       | read & write             | all GPU threads in a block
global memory  | DRAM     | read & write             | all GPU threads & CPU

The compute capability of an NVIDIA GPU defines its core architecture (Tesla, Fermi, etc.), supported features (e.g., double-precision floating-point operations), technical specifications (e.g., the maximum dimensions of the hierarchy of threads) and architectural specifications (e.g., the number of warp schedulers).

In summary, the generation of efficient GPGPU code requires the programmer to explicitly handle the GPU hardware architecture through the following programming features exposed by CUDA:
GPU Programming Features in CUDA

(ordered by decreasing impact on performance)

1 Threadification
2 Thread grouping: warps
3 Minimization of CPU-GPU data transfers
4 Coalescing
5 Maximization of the usage of registers and shared memory
6 Divergency
7 Occupancy
8 Threads per block
GPGPU with OpenHMPP

[Table comparing directive-based GPGPU programming approaches: interaction (RPC in all cases); address spaces (disjoint in all cases); data transfers (automatic & manual in all cases); software-managed caches (explicit vs. automatic handling); parallelism specification (gangs, workers, SIMD / loop iterations, tasks, SIMD / loop iterations); directives for standard loop transformations.]
Outline

3. Locality-Aware Automatic Parallelization for GPGPU

• GPGPU with CUDA and OpenHMPP

• Locality-Aware Generation of Efficient GPGPU Code

• CONV3D & SGEMM

• Experimental Evaluation

GPU Programming Features addressed by our Automatic Technique

(ordered by decreasing impact on performance)

1 Threadification
2 Thread grouping: warps
3 Minimization of CPU-GPU data transfers
4 Coalescing
5 Maximization of the usage of registers and shared memory
6 Divergency
7 Occupancy
8 Threads per block
Chains of Recurrences (chrecs)

• Algebraic formalism to represent closed-form functions, successfully used to expedite function evaluation at a number of points in a regular interval [18].

• Definition 3.3.1. Given a constant φ ∈ ℤ, a function g : ℕ₀ → ℤ, and the operator +, the chrec f = {φ, +, g} is defined as a function f : ℕ₀ → ℤ such that:

  {φ, +, g}(i) = φ + Σ_{j=0}^{i−1} g(j)

• Useful for representing the iterations of a loop and array access patterns: the loop index i of for_i in Figure 3.4 takes integer values in the interval [0, sizex − 1], and the chrec {0, +, 1} provides a closed-form function to compute the value of i at each for_i iteration. Likewise, the memory access pattern of the first dimension of input[i][j][k] (line 10 of Figure 3.4) can be represented with the chrec {0, +, 1}.

• The chrecs, which are given by the KIR, have demonstrated to be a powerful representation of the complex loops and the memory accesses that appear in full-scale real applications [13]. Their algebraic properties provide rules for carrying out arithmetic with them.

• We instantiate (particularize) the chrecs for each GPU thread.
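A minimal sketch (not part of the thesis) of how a basic chrec {φ, +, g} with a constant increment g can be evaluated in closed form and used to enumerate the values taken by a loop index; the type and function names are illustrative:

#include <stdio.h>

/* Chrec {phi, +, g} with a constant increment g:
 * its value at iteration i is phi + g * i (closed form of phi + sum_{j<i} g). */
typedef struct { long phi; long g; } chrec;

static long chrec_eval(chrec c, long i) {
  return c.phi + c.g * i;
}

int main(void) {
  /* The loop index of "for (i = 0; i < sizex; i++)" is {0,+,1};
   * the first dimension of input[i][j][k] follows the same chrec. */
  chrec idx = { .phi = 0, .g = 1 };
  for (long i = 0; i < 5; i++)
    printf("iteration %ld -> index %ld\n", i, chrec_eval(idx, i));
  return 0;
}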
Detection of Coalesced Accesses to the GPU Global Memory

Assume x is stored in row-major order and CHRECS_xk = [{0,+,1}][{0,+,1}].

(a) Source code S1:

1 // only for_i is threadified
2 for (i = 0; i <= N; i++) {
3   for (j = 0; j <= N; j++) {
4     ... x[i][j] ...
5   }
6 }

(b) Non-coalesced accesses:

         T0 (i=0)   T1 (i=1)   T2 (i=2)
j=0      x[0][0]    x[1][0]    x[2][0]
j=1      x[0][1]    x[1][1]    x[2][1]
j=2      x[0][2]    x[1][2]    x[2][2]
...
chrecs, 1st dim:   {0}        {1}        {2}
chrecs, 2nd dim:   {0,+,1}    {0,+,1}    {0,+,1}

(c) Source code S2:

1 // only for_j is threadified
2 for (j = 0; j <= N; j++) {
3   for (i = 0; i <= N; i++) {
4     ... x[i][j] ...
5   }
6 }

(d) Coalesced accesses: the threads of the warp share the same chrec {0,+,1} in the first dimension, and their chrecs in the last dimension ({0}, {1}, {2}, …) form a convex set of consecutive locations.

         T0 (j=0)   T1 (j=1)   T2 (j=2)
i=0      x[0][0]    x[0][1]    x[0][2]
i=1      x[1][0]    x[1][1]    x[1][2]
i=2      x[2][0]    x[2][1]    x[2][2]
...
chrecs, 1st dim:   {0,+,1}    {0,+,1}    {0,+,1}
chrecs, 2nd dim:   {0}        {1}        {2}

Figure 3.3 – Examples of access patterns to the GPU global memory.
Detection of whether an Access to the GPU Global Memory can be Coalesced

Algorithm 3.1 Detection of whether an access to the GPU global memory can be coalesced
1: FUNCTION ISCOALESCEDACCESS
Input: access xk[ik,1][ik,2]…[ik,n] to an n-dimensional array x stored in row-major order
Input: loop nest L = L1, L2, …, Ll where L1 is the threadified loop
Output: returns whether the given access xk can be coalesced after threadifying the loop nest L
2:  CHRECS_xk ← [{φk,1, +, gk,1}][{φk,2, +, gk,2}]…[{φk,n, +, gk,n}]
3:  W ← warp of GPU threads {T0, T1, T2, …}
4:  for each thread Ti in W do
5:    CHRECS_xk^Ti ← [{φk,1^Ti, +, gk,1^Ti}][{φk,2^Ti, +, gk,2^Ti}]…[{φk,n^Ti, +, gk,n^Ti}]
6:  end for
7:  if (∃ d ∈ {1…n−1}, Tj ∈ W−{T0} : {φk,d^Tj, +, gk,d^Tj} ≠ {φk,d^T0, +, gk,d^T0}) then
8:    return false                    ▷ first n−1 chrecs differ
9:  end if
10: CHRECS_RANGE_xk,n ← ∪_Ti {φk,n^Ti, +, gk,n^Ti}
11: if CHRECS_RANGE_xk,n defines a convex set then
12:   return true                     ▷ threads of the warp access consecutive locations
13: else
14:   return (∀ Tj ∈ W−{T0} : {φk,n^Tj, +, gk,n^Tj} = {φk,n^T0, +, gk,n^T0})
                                      ▷ threads of the warp access the same location
15: end if
16: end FUNCTION
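A simplified sketch (not the thesis implementation) of the coalescing test of Algorithm 3.1 for a 2-D access whose per-thread chrecs have constant increments; the chrec2 type, the warp size constant and the consecutive-offset test are illustrative assumptions:

#include <stdbool.h>

#define WARP_SIZE 32

/* Per-thread chrec {phi, +, g} with constant increment, one per dimension. */
typedef struct { long phi; long g; } chrec2;

/* 2-D version of the test: the access is coalesced if all threads of the
 * warp agree on the first-dimension chrec and their last-dimension chrecs
 * either cover consecutive locations (convex set) or coincide. */
static bool is_coalesced_2d(const chrec2 dim1[WARP_SIZE],
                            const chrec2 dim2[WARP_SIZE]) {
  for (int t = 1; t < WARP_SIZE; t++) {
    /* the first n-1 chrecs (here: dimension 1) must not differ */
    if (dim1[t].phi != dim1[0].phi || dim1[t].g != dim1[0].g)
      return false;
  }
  bool consecutive = true, same = true;
  for (int t = 1; t < WARP_SIZE; t++) {
    if (dim2[t].phi != dim2[t - 1].phi + 1) consecutive = false;  /* convex set */
    if (dim2[t].phi != dim2[0].phi || dim2[t].g != dim2[0].g) same = false;
  }
  return consecutive || same;
}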
Usage of Registers to Store Reused Data within a GPU Thread

If the intersection of the instantiated chrecs is not empty, then some data are accessed several times and they can be stored in the GPU registers if they are not modified by another thread (lines 6-9). Note that the shared memory could be used for the same purpose, as it has the same access time as registers.

Algorithm 3.2 Usage of registers to store reused data within a GPU thread
1: PROCEDURE STOREREUSEDDATAINREGISTERS
Input: n-dimensional array x[s1][s2]…[sn]
Input: loop nest L = L1, L2, …, Ll where L1 is the threadified loop
Output: a modified program that exploits reused data to maximize the usage of the GPU registers
2:  collect accesses xk[ik,1][ik,2]…[ik,n] with k ∈ {1, …, m}
3:  CHRECS_xk ← [{φk,1, +, gk,1}][{φk,2, +, gk,2}]…[{φk,n, +, gk,n}]
4:  for each thread Ti do
5:    CHRECS_xk^Ti ← [{φk,1^Ti, +, gk,1^Ti}][{φk,2^Ti, +, gk,2^Ti}]…[{φk,n^Ti, +, gk,n^Ti}]
6:    REUSED_DATA_x^Ti ← ∩_{k=1}^{m} CHRECS_xk^Ti
7:    if (REUSED_DATA_x^Ti ≠ ∅) then
8:      store reused data between the accesses made by Ti in its set of registers if data are private
9:    end if
10: end for
11: end PROCEDURE
Usage of the GPU Shared Memory for Data Shared between the Threads of a Block

For each considered block, the chrecs are instantiated by fixing the value of the index of L1 that each thread executes (lines 5-7). If the intersection of the instantiated chrecs associated to all the accesses is not empty, then some data are accessed several times and can be stored in the shared memory (lines 8-11).

Algorithm 3.3 Usage of the GPU shared memory for data shared between the threads of a block
1: PROCEDURE STORESHAREDDATAINSHAREDMEMORY
Input: n-dimensional array x[s1][s2]…[sn]
Input: loop nest L = L1, L2, …, Ll where L1 is the threadified loop
Output: a modified program using the GPU shared memory to share data between the threads of a block
2:  collect accesses xk[ik,1][ik,2]…[ik,n] with k ∈ {1, …, m}
3:  CHRECS_xk ← [{φk,1, +, gk,1}][{φk,2, +, gk,2}]…[{φk,n, +, gk,n}]
4:  for each block B do
5:    for each thread Ti in B do
6:      CHRECS_xk^Ti ← [{φk,1^Ti, +, gk,1^Ti}][{φk,2^Ti, +, gk,2^Ti}]…[{φk,n^Ti, +, gk,n^Ti}]
7:    end for
8:    SHDATA_x ← ∩_Ti CHRECS_xk^Ti with k ∈ {1, …, m}
9:    if (SHDATA_x ≠ ∅) then
10:     store data shared between the threads of block B in the shared memory
11:   end if
12: end for
13: end PROCEDURE

Another general technique to improve performance is loop tiling.
Increase the Computational Load of a GPU Thread

For each access xk, a bigger portion of data to compute (D) is given to each thread. Hence, the algorithm increments the step of L1 to i = i + D (see line 2 of Algorithm 3.4). Scalar variables inside L are promoted to arrays of size D, and their corresponding reads and writes are transformed into loops preserving dependences (lines 3-6). The optimization of the size of D depends on runtime information about the GPU hardware, thus it has been adjusted empirically by hand in this work.

Algorithm 3.4 Increase the computational load of a GPU thread
1: PROCEDURE INCREASELOAD
Input: access xk[ik,1][ik,2]…[ik,n] to an n-dimensional array x stored in row-major order
Input: loop nest L = L1, L2, …, Ll where both L1, L2 are threadified
Input: amount of data D to be processed by a GPU thread
Output: a modified program after applying loop tiling under the OpenHMPP programming model
2:  increment the step of the outer loop L1 to D
3:  for each scalar variable s in L do
4:    promote s to an array s[D]
5:    transform reads and writes to s into loops of D iterations
6:  end for
7: end PROCEDURE
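A hand-written sketch (not compiler output) of the effect of Algorithm 3.4 on a simple threadified loop, assuming D = 4 and n a multiple of D; the saxpy-style functions and names are illustrative:

#define D 4   /* amount of data processed by each GPU thread (tuned empirically) */

/* Before: one element of y per loop iteration. */
void saxpy_before(int n, float a, const float *x, float *y) {
  for (int i = 0; i < n; i++) {
    float t = a * x[i];
    y[i] = y[i] + t;
  }
}

/* After Algorithm 3.4: the step of the outer loop becomes D and the scalar
 * t is promoted to an array t[D]; its reads and writes become D-iteration loops. */
void saxpy_after(int n, float a, const float *x, float *y) {
  for (int i = 0; i < n; i += D) {
    float t[D];
    for (int d = 0; d < D; d++)
      t[d] = a * x[i + d];
    for (int d = 0; d < D; d++)
      y[i + d] = y[i + d] + t[d];
  }
}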

The previous technique can prevent some GPU compiler optimizations: typically, these binary compilers make better optimizations if the program is coded with several instructions using scalar variables (avoiding arrays and loops). In order to solve this issue, Algorithm 3.5 applies loop unrolling and loop interchange.
Use Scalar Variables to Enable GPU Compiler Optimizations

Algorithm 3.5 Use scalar variables to enable GPU compiler optimizations
1: PROCEDURE INCREASELOAD
Input: loop nest L = L1, L2, L3, …, Ll that results from Algorithm 3.4, where both L1, L2 are threadified, the step of L1 is D, and L3 is the created loop with D iterations
Output: a modified program that uses more scalar variables to enable GPU compiler optimizations
2:  apply loop fission to L3, the loop created in line 5 of Algorithm 3.4
3:  for each loop L3' resulting from the fission of L3 do
4:    interchange loops until L3' is the innermost one
5:    insert a fullunroll directive before L3'
6:  end for
7: end PROCEDURE
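A hand-unrolled continuation of the illustrative saxpy sketch above (not compiler output): loop fission splits the D-iteration loops of saxpy_after into one loop per statement and full unrolling turns each of them back into straight-line code over scalar values, which GPU binary compilers optimize well. In the generated OpenHMPP code this unrolling is requested through the fullunroll directive named in Algorithm 3.5 rather than written out by hand.

#define D 4

/* Result of fission plus full unrolling of the D-iteration loops:
 * only scalar temporaries remain inside the thread's tile. */
void saxpy_unrolled(int n, float a, const float *x, float *y) {
  for (int i = 0; i < n; i += D) {
    float t0 = a * x[i + 0];
    float t1 = a * x[i + 1];
    float t2 = a * x[i + 2];
    float t3 = a * x[i + 3];
    y[i + 0] = y[i + 0] + t0;
    y[i + 1] = y[i + 1] + t1;
    y[i + 2] = y[i + 2] + t2;
    y[i + 3] = y[i + 3] + t3;
  }
}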

Section 3.5.1 presents the study of the three-dimensional discrete convolution (CONV3D). With this case study we cover stencil codes, which are commonly found in computer simulations, image processing and finite element methods. Next, Section 3.5.2 addresses the simple-precision general matrix multiplication (SGEMM).
Outline

3. Locality-Aware Automatic Parallelization for GPGPU

• GPGPU with CUDA and OpenHMPP

• Locality-Aware Generation of Efficient GPGPU Code

• CONV3D & SGEMM

• Experimental Evaluation

CONV3D & SGEMM

Only the regular reduction K<output31> determines if CONV3D is parallelizable (note that the remaining parts of the KIR are shaded because they represent privatizable temporaries). As the regular reduction diKernel represents conflict-free loop iterations, it can be converted into a forall parallel loop. On the CPU, it can be parallelized using the OpenMP parallel for directive (see Section 2.2.1).

Table 3.2 summarizes the GPU features addressed by our locality-aware automatic parallelization technique to generate the same optimal variant as the one written by an expert in GPU programming.

[Table 3.2 – GPU features (Coalescing, Registers, Shared Memory) exploited with each variant of CONV3D (conv3d-cpu, conv3d-hmpp1, conv3d-hmpp2, conv3d-hmpp3) and SGEMM (sgemm-cpu, sgemm-mkl, sgemm-hmpp1, sgemm-hmpp2, sgemm-hmpp3, sgemm-hmpp4, sgemm-cublas).]
CONV3D (I)

1 int sizex, sizey, sizez, bound = 4;
2
3 void conv3d(float output[sizex][sizey][sizez],
4             float input[bound+sizex+bound][4+sizey+4][4+sizez+4],
5             float coefx, float coefy, float coefz) {
6
7   for (int i = 0; i < sizex; i++) {
8     for (int j = 0; j < sizey; j++) {
9       for (int k = 0; k < sizez; k++) {
10        float tempx = input[i][j][k] + coefx *
11          (
12            input[i-1][j][k] + input[i+1][j][k] +
13            input[i-2][j][k] + input[i+2][j][k] +
14            input[i-3][j][k] + input[i+3][j][k] +
15            input[i-4][j][k] + input[i+4][j][k]
16          );
17        float tempy = input[i][j][k] + coefy *
18          (
19            input[i][j-1][k] + input[i][j+1][k] +
20            input[i][j-2][k] + input[i][j+2][k] +
21            input[i][j-3][k] + input[i][j+3][k] +
22            input[i][j-4][k] + input[i][j+4][k]
23          );
24        float tempz = input[i][j][k] + coefz *
25          (
26            input[i][j][k-1] + input[i][j][k+1] +
27            input[i][j][k-2] + input[i][j][k+2] +
28            input[i][j][k-3] + input[i][j][k+3] +
29            input[i][j][k-4] + input[i][j][k+4]
30          );
31        output[i][j][k] =
32          output[i][j][k] + tempx + tempy + tempz;
33      }
34    }
35  }
36 }

Figure 3.4 – Source code of the 3D discrete convolution operator (CONV3D).

KIR (the scalar assignments are shaded, i.e., omitted in the discovering of parallelism):

ROOT EXECUTION SCOPE
  ES_fori,j,k (Figure 3.4, lines 7-35)
    K<tempx10>   scalar assignment
    K<tempy17>   scalar assignment
    K<tempz24>   scalar assignment
    K<output31>  regular reduction   FULLY PARALLEL LOOP
CONV3D (II)

• conv3d-cpu: Sequential code

• conv3d-hmpp1: Coalescing

1. int i, j, k, size_x, size_y, size_z;
2. float coefx,coefy,coefz,*input,*output;
3.
4. for (i = 0; i < size_x; i++) {
5.   for (j = 0; j < size_y; j++) {
6.     for (k = 0; k < size_z; k++) {
7.       float tempx = input[i][j][k]+coefx*
8.       …

CHRECS_input1 = [{0,+,1}][{0,+,1}][{0,+,1}]

• Default OpenHMPP policy:

CHRECS_input1^T0 = [{0}][{0}][{0,+,1}]      CHRECS_input1^T1 = [{0}][{1}][{0,+,1}]

• Loop nest is permuted to forj, fork, fori:

CHRECS_input1^T0 = [{0,+,1}][{0}][{0}]      CHRECS_input1^T1 = [{0,+,1}][{0}][{1}]
CONV3D (III)

• conv3d-hmpp2: Registers

4. for (j = 0; j < size_y; j++) {
5.   for (k = 0; k < size_z; k++) {
6.     for (i = 0; i < size_x; i++) {
7.       float tempx = input[i][j][k]+coefx*
8.       (
9.         input[i-1][j][k]+input[i+1][j][k]+
10.        …

CHRECS_input1 = [{0,+,1}][{0,+,1}][{0,+,1}]
CHRECS_input2 = [{-1,+,1}][{0,+,1}][{0,+,1}]
CHRECS_input3 = [{1,+,1}][{0,+,1}][{0,+,1}]

Instantiated for thread T0:

CHRECS_input1^T0 = [{0,+,1}][{0}][{0}]
CHRECS_input2^T0 = [{-1,+,1}][{0}][{0}]
CHRECS_input3^T0 = [{1,+,1}][{0}][{0}]

∩ ≠ ∅: the reused data can be stored in the GPU registers (Algorithm 3.2).
CONV3D (IV)

1 #pragma hmpp conv3d___hmpp2 codelet
2 void conv3d___hmpp2(float output[sizex][sizey][sizez],
3   float input[bound+sizex+bound][4+sizey+4][4+sizez+4],
4   float coefx, float coefy, float coefz) {
5
6 #pragma hmppcg gridify (j, k)
7   for (int j = 0; j < sizey; j++) {
8     for (int k = 0; k < sizez; k++) {
9       float i___minus4 = 0;
10      float i___minus3 = input[-4][j][k];
11      float i___minus2 = input[-3][j][k];
12      float i___minus1 = input[-2][j][k];
13      float i___plus0 = input[-1][j][k];
14      float i___plus1 = input[0][j][k];
15      float i___plus2 = input[1][j][k];
16      float i___plus3 = input[2][j][k];
17      float i___plus4 = input[3][j][k];
18      for (int i = 0; i < sizex; i++) {
19        i___minus4 = i___minus3;
20        i___minus3 = i___minus2;
21        i___minus2 = i___minus1;
22        i___minus1 = i___plus0;
23        i___plus0 = i___plus1;
24        i___plus1 = i___plus2;
25        i___plus2 = i___plus3;
26        i___plus3 = i___plus4;
27        i___plus4 = input[i+4][j][k];
28        float tempx = i___plus0 + coefx *
29          (
30            i___minus1 + i___plus1 +
31            i___minus2 + i___plus2 +
32            i___minus3 + i___plus3 +
33            i___minus4 + i___plus4
34          );
35        float tempy = ...
36        float tempz = ...
37        output[i][j][k] =
38          output[i][j][k] + tempx + tempy + tempz;
39      }
40    }
41  }
42 }

Figure 3.6 – Excerpt of the parallelized code of the 3D discrete convolution.
CONV3D (V)

• conv3d-hmpp3: Shared memory (shared clause of the gridify directive)

4. for (j = 0; j < size_y; j++) {
5.   for (k = 0; k < size_z; k++) {
6.     for (i = 0; i < size_x; i++) {
       …
21.      float tempz = input[i][j][k]+coefz*
22.      (
23.        input[i][j][k-1]+input[i][j][k+1]+
24.        input[i][j][k-2]+input[i][j][k+2]+
25.        input[i][j][k-3]+input[i][j][k+3]+
26.        input[i][j][k-4]+input[i][j][k+4]
27.      );

Table 3.3 – Chrecs for the accesses in lines 24-30 of Figure 3.4 (CONV3D):

                  T0: 1st dim  2nd dim  3rd dim     T1: 1st dim  2nd dim  3rd dim
CHRECS_input19    {0,+,1}      {0}      {0}         {0,+,1}      {0}      {1}
CHRECS_input20    {0,+,1}      {0}      {-1}        {0,+,1}      {0}      {0}
CHRECS_input21    {0,+,1}      {0}      {1}         {0,+,1}      {0}      {2}
CHRECS_input22    {0,+,1}      {0}      {-2}        {0,+,1}      {0}      {-1}
CHRECS_input23    {0,+,1}      {0}      {2}         {0,+,1}      {0}      {3}
CHRECS_input24    {0,+,1}      {0}      {-3}        {0,+,1}      {0}      {-2}
CHRECS_input25    {0,+,1}      {0}      {3}         {0,+,1}      {0}      {4}
CHRECS_input26    {0,+,1}      {0}      {-4}        {0,+,1}      {0}      {-3}
CHRECS_input27    {0,+,1}      {0}      {4}         {0,+,1}      {0}      {5}

3.5.2 Case Study: SGEMM

The simple-precision general matrix multiplication from the BLAS library [24] performs the matrix operation C = α·A×B + β·C.
CONV3D (and VI)

1 #pragma hmpp conv3d___hmpp3 codelet
2 void conv3d___hmpp3(float output[sizex][sizey][sizez],
3   float input[bound+sizex+bound][4+sizey+4][4+sizez+4],
4   float coefx, float coefy, float coefz) {
5   float input___shared[bound+8+bound][bound+32+bound];
6 #pragma hmppcg gridify(j,k),blocksize(32x8),shared(input___shared),unguarded
7   for (int j = 0; j < sizey; j++) {
8     for (int k = 0; k < sizez; k++) {
9       int tx = 0;
10      int ty = 0;
11 #pragma hmppcg set tx = RankInBlockX()
12 #pragma hmppcg set ty = RankInBlockY()
13      int rk = tx + bound;
14      int rj = ty + bound;
15      float i___minus4 = ...
16      for (int i = 0; i < sizex; i++) {
17        i___minus4 = ...
18 #pragma hmppcg grid barrier
19        input___shared[rj-bound][rk-bound] = input[i][j-bound][k-bound];
20        input___shared[rj+bound][rk-bound] = input[i][j+bound][k-bound];
21        input___shared[rj-bound][rk+bound] = input[i][j-bound][k+bound];
22        input___shared[rj+bound][rk+bound] = input[i][j+bound][k+bound];
23 #pragma hmppcg grid barrier
24        float tempx = ...
25        float tempy = i___plus0 + coefy *
26          (
27            input___shared[rj-1][rk] + input___shared[rj+1][rk] +
28            input___shared[rj-2][rk] + input___shared[rj+2][rk] +
29            input___shared[rj-3][rk] + input___shared[rj+3][rk] +
30            input___shared[rj-4][rk] + input___shared[rj+4][rk]
31          );
32        float tempz = i___plus0 + coefz *
33          (
34            input___shared[rj][rk-1] + input___shared[rj][rk+1] +
35            input___shared[rj][rk-2] + input___shared[rj][rk+2] +
36            input___shared[rj][rk-3] + input___shared[rj][rk+3] +
37            input___shared[rj][rk-4] + input___shared[rj][rk+4]
38          );
39        output[i][j][k] =
40          output[i][j][k] + tempx + tempy + tempz;
41      }
42    }
43  }
44 }

Figure 3.7 – Excerpt of the parallelized code of the 3D discrete convolution.
SGEMM (I)

   1 int m, n, k;
   2 void sgemm(float C[m][n], float alpha, float A[m][k],
   3            float B[k][n], float beta) {
   4
   5   for (int i = 0; i < m; i++) {
   6     for (int j = 0; j < n; j++) {
   7       float prod = 0;
   8       for (int l = 0; l < k; l++) {
   9         prod += A[i][l] * B[l][j];
  10       }
  11       C[i][j] = alpha * prod + beta * C[i][j];
  12     }
  13   }
  14 }

Figure 3.8 – Source code of the simple-precision general matrix multiplication (SGEMM).

  ROOT EXECUTION SCOPE
    ES_fori,j (Figure 3.8, lines 5-13)
      K<prod7>  (scalar assignment)
      ES_forl (Figure 3.8, lines 8-10)
        K<prod9>  (scalar reduction)
      K<C11>  (regular reduction)
  → FULLY PARALLEL LOOP

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
66 / 118
SGEMM (II)

• sgemm-cpu: Sequential code

• sgemm-mkl: Intel MKL

• sgemm-hmpp1: Offloading (and check coalescing)

The simple-precision general matrix multiplication from the BLAS library [24] performs the matrix
operation C = alpha·A×B + beta·C. From the point of view of locality, the challenge of SGEMM is to
handle the tradeoff between opposite array traversals efficiently: row-major for C and A, and
column-major for B. On the CPU, the general solution is to apply loop tiling: matrices are computed
in small tiles to keep data in the cache memory. This approach can also be applied on the GPU, using
the shared memory as cache and being aware of coalescing.

The first variant of SGEMM is the sequential code shown in Figure 3.8 (sgemm-cpu). In addition, we
have selected the cblas_sgemm function of the non-clustered, threaded part of the Intel MKL
library [69] to build the sgemm-mkl variant.

The first OpenHMPP variant is sgemm-hmpp1. It is trivially built by offloading to the GPU the same
code as sgemm-cpu. Table 3.4 shows the chrecs for this variant, which are analyzed by Algorithm 3.1
as follows. Regarding A, all the threads of a warp have the same chrecs and thus access the same
memory position (see line 14 of Algorithm 3.1). Regarding B, coalescing is maximized because the
chrecs of the first dimension are the same while the chrecs of the second one define a convex set
(lines 10–12). Finally, the same situation holds for C and thus accesses are coalesced.

The second OpenHMPP variant is sgemm-hmpp2. Algorithm 3.4 transforms the source code of Figure 3.8
as follows. The scalar variable prod is promoted to an array prod[D], and thus a new loop for_t is
created to enclose all its definitions and uses (see lines 3–6 of Algorithm 3.4). The step of the
outer for_i is incremented by DELTA (see Figure 3.10).

                 not instantiated           T0                    T1
  CHRECS       1st dim    2nd dim     1st dim   2nd dim     1st dim   2nd dim
  CHRECS_A     {0,+,1}    {0,+,1}     {0}       {0,+,1}     {0}       {0,+,1}
  CHRECS_B     {0,+,1}    {0,+,1}     {0,+,1}   {0}         {0,+,1}   {1}
  CHRECS_C     {0,+,1}    {0,+,1}     {0}       {0}         {0}       {1}

  Table 3.4 – Chrecs for the accesses to arrays A, B and C in SGEMM.

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
67 / 118
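The simplified check sketched below illustrates the kind of coalescing test described above; it is
an illustrative sketch, not the thesis' Algorithm 3.1, and the chrec2_t representation (only the
thread-dependent starting value of each dimension) is an assumption made for the example.

  #include <stdbool.h>

  /* Instantiated chrec of a 2D access for one thread of a warp (hypothetical,
   * simplified representation: the starting value of each dimension). */
  typedef struct { long dim1; long dim2; } chrec2_t;

  /* Accesses of a warp are taken as favourable when either (a) every thread
   * touches the same element (broadcast, as for A in sgemm-hmpp1), or (b) all
   * threads share the first dimension and their second-dimension values are
   * consecutive (coalesced, as for B and C). */
  static bool warp_access_ok(const chrec2_t c[], int warp_size) {
    bool broadcast = true, coalesced = true;
    for (int t = 1; t < warp_size; t++) {
      if (c[t].dim1 != c[0].dim1 || c[t].dim2 != c[0].dim2)
        broadcast = false;
      if (c[t].dim1 != c[0].dim1 || c[t].dim2 != c[t - 1].dim2 + 1)
        coalesced = false;
    }
    return broadcast || coalesced;
  }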
SGEMM (III)

• sgemm-hmpp2: Tiling preserving coalescing

  int m, n, k;
  #define DELTA 16

  #pragma hmpp sgemm___hmpp2 codelet
  void sgemm___hmpp2(float C[m][n], float alpha, float A[m][k],
                     float B[k][n], float beta) {

    #pragma hmppcg gridify (i,j), blocksize(128x1)
    for (int i = 0; i < m; i = i + DELTA) {
      for (int j = 0; j < n; j++) {
        float prod[DELTA];
        for (int t = 0; t < DELTA; t++) {
          prod[t] = 0;
          for (int l = 0; l < k; l++) {
            prod[t] += A[i+t][l] * B[l][j];
          }
          C[i+t][j] = alpha * prod[t] + beta * C[i+t][j];
        }
      }
    }
  }

Figure 3.10 – Excerpt of the parallelized code of the simple-precision general matrix multiplication
(SGEMM): variant sgemm-hmpp2.

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
68 / 118
SGEMM (and IV)

• sgemm-hmpp3: Let the compiler use the registers (fullunroll)

• sgemm-hmpp4: Use the shared memory for B

• sgemm-cublas: NVIDIA CUBLAS library

  int m, n, k;
  #define DELTA 16

  #pragma hmpp sgemm___hmpp3 codelet
  void sgemm___hmpp3(float C[m][n], float alpha, float A[m][k],
                     float B[k][n], float beta) {

    #pragma hmppcg gridify (i,j), blocksize(128x1)
    for (int i = 0; i < m; i = i + DELTA) {
      for (int j = 0; j < n; j++) {
        float prod[DELTA];
        #pragma hmppcg fullunroll
        for (int t = 0; t < DELTA; t++) {
          prod[t] = 0;
        }
        for (int l = 0; l < k; l++) {
          #pragma hmppcg fullunroll
          for (int t = 0; t < DELTA; t++) {
            prod[t] += A[i+t][l] * B[l][j];
          }
        }
        #pragma hmppcg fullunroll
        for (int t = 0; t < DELTA; t++) {
          C[i+t][j] = alpha * prod[t] + beta * C[i+t][j];
        }
      }
    }
  }

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
69 / 118
Outline

3. Locality-Aware Automatic Parallelization for GPGPU

• GPGPU with CUDA and OpenHMPP

• Locality-Aware Generation of Efficient GPGPU Code

• CONV3D & SGEMM

• Experimental Evaluation

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
70 / 118
Performance Evaluation: CONV3D

sizex, sizey and sizez in 128, 256, 384, 512, 640 and 768

[Bar chart: GFLOPS (0–120) of conv3d-cpu, conv3d-hmpp1, conv3d-hmpp2 and conv3d-hmpp3 on
CPU (nova), GPU Tesla S1070 (nova) and GPU Tesla S2050 (pluton)]

Fermi cards introduced memory caches

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
71 / 118
Performance Evaluation: SGEMM (I)

m, n and k in 128, 256, 384, 512, 640, 768, 896, 1024, 1152, 1280,
1408, 1536, 1664, 1792, 1920, 2048, 4096, 6144 and 8192

[Bar chart: GFLOPS (0–500) of sgemm-cpu, sgemm-mkl, sgemm-hmpp1, sgemm-hmpp2, sgemm-hmpp3,
sgemm-hmpp4 and sgemm-cublas on CPU (nova), GPU Tesla S1070 (nova) and GPU Tesla S2050 (pluton)]

The biggest improvement factor is the usage of the GPU shared memory

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
72 / 118
Performance Evaluation: SGEMM (and II)

[3D plots over m, n and k (128–8192); blue: sgemm-cublas, red: sgemm-hmpp4, black: sgemm-mkl.
Left: GPU Tesla S1070 (nova). Right: GPU Tesla S2050 (pluton)]

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
73 / 118
Outline

1. Introduction

2. A Novel Compiler Support for Multicore Systems

3. Locality-Aware Automatic Parallelization for GPGPU

4. Trace-Based Affine Reconstruction of Code

5. Main Contributions and Future Research Lines

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
74 / 118
Outline

4. Trace-Based Affine Reconstruction of Code

• Problem Formulation

• Problem Resolution with CHOLESKY

• Extensions for Supporting Nonlinear Traces

• Experimental Evaluation

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
75 / 118
Outline

4. Trace-Based Affine Reconstruction of Code

• Problem Formulation

• Problem Resolution with CHOLESKY

• Extensions for Supporting Nonlinear Traces

• Experimental Evaluation

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
76 / 118
Problem Statement

  ...
  10   for (k = 0; k <= i - 1; ++k)
  11     x = x - A[j][k] * A[i][k];
  12   A[j][i] = x * p[i];
  13   }
  14 }

  Figure 2. Source code of the cholesky application.

   1 0x1e2d140        88 0x1e2d248
   2 0x1e2d140        89 0x1e2d340
   .                  90 0x1e2d348
   .                  91 0x1e2d350
  30 0x1e2d140        92 0x1e2d340
  31 0x1e2d240        93 0x1e2d348
  32 0x1e2d248        94 0x1e2d350
  33 0x1e2d240         .
  34 0x1e2d248         .

  Figure 3. Excerpt of the memory trace generated by the access A[i][k] (line 11 of Figure 2).

  1. for (i = 0; i <= 29; i++) {
  2.   for (j = 0; j <= 29 - i; j++) {
  3.     for (k = 0; k <= i; k++) {
  4.       ... A[i][k] ...
  5.     }
  6.   }
  7. }

• We assume that:

  • Addresses are generated by a single instruction

  • Instruction is enclosed in an affine loop nest

• Existing memory optimization techniques based on the polyhedral model, and any other static or
  dynamic optimization technique in the absence of source and/or binary code, can be applied

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
77 / 118
Problem Formulation (I)

Our proposal is designed to recreate the same sequence of accesses that the memory trace contains.
Hence, the memory access to be reconstructed is modeled as an affine access enclosed in an affine
loop nest:

  DO i1 = 0, u1(ı)
    DO i2 = 0, u2(ı)
      ...
        DO in = 0, un(ı)
          V[f1(ı)]...[fm(ı)]

where loop bounds are assumed to be inclusive, i.e., 0 ≤ ij ≤ uj(ı); {uj, 0 < j ≤ n} are affine
functions that provide the upper bounds of the loop indices ij; {fd(i1, ..., in), 0 < d ≤ m} is the
set of affine functions that converts a given point in the iteration space of the nest to a point
in the data space of V; and ı = {i1, ..., in}^T is a column vector which encodes the state of each
iteration variable.

Since f is affine, the access can be rewritten as:

  V[f1(ı)]...[fm(ı)] = V[c0 + i1·c1 + ... + in·cn]                     (4.1)

where V is the base address of the array, c0 is a constant stride, and each {cj, 0 < j ≤ n} is the
coefficient of the loop index ij; it must account for the dimensionality of the original array.
For instance, in an access A[2*i][j] to an array A[N][M], the coefficient of i is 2·M and the
coefficient of j is 1.

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
78 / 118
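As a concrete, illustrative instance of Equation (4.1) (not part of the thesis toolchain), the
sketch below expresses the byte address of A[i][j] in a row-major array of doubles as an affine
combination of the loop indices, with the coefficients read off from the element size and the
array dimensions; these are the same byte-level coefficients the reconstruction engine works with.

  #include <stdio.h>

  #define N 32
  #define M 32

  static double A[N][M];

  int main(void) {
    /* For the access A[i][j], Equation (4.1) gives (in bytes):
     *   &A[i][j] = V + c1*i + c2*j
     * with c1 = M*sizeof(double) = 256 and c2 = sizeof(double) = 8. */
    const long c1 = M * sizeof(double);
    const long c2 = sizeof(double);
    const char *V = (const char *)&A[0][0];

    for (int i = 0; i < 2; i++)
      for (int j = 0; j < 3; j++)
        printf("A[%d][%d]: real=%p affine=%p\n", i, j,
               (void *)&A[i][j], (void *)(V + c1 * i + c2 * j));
    return 0;
  }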
Problem Formulation (II)

• In our model, only three possible variations of a loop index are allowed between two consecutive
  iterations

Definition 4.1.1. A set of indices built complying with these conditions will be referred to as a
set of sequential indices.

The instantaneous variation of loop index ij between iterations k and (k+1),
δj^k = (ij^{k+1} − ij^k), can only take one of three possible values:

  1. ij does not change      ⟹  δj^k = 0
  2. ij is increased by one  ⟹  δj^k = 1
  3. ij is reset to 0        ⟹  δj^k = −ij^k

In the following, vector notation will be used for δ:

  (ı^{k+1} − ı^k) = [i1^{k+1} − i1^k, i2^{k+1} − i2^k, ..., in^{k+1} − in^k]^T = δ^k

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
79 / 118
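A minimal C illustration of sequential indices (a hypothetical helper, not from the thesis):
walking a small 2-level nest and printing δ^k shows that every consecutive pair of index vectors
differs only through the three allowed variations (no change, +1, or a reset to 0).

  #include <stdio.h>

  int main(void) {
    /* Enumerate a 2-level nest with inclusive bounds u1 = 2, u2 = 3 and print
     * the instantaneous variation delta between consecutive index vectors. */
    int prev_i = 0, prev_j = 0, first = 1;
    for (int i = 0; i <= 2; i++) {
      for (int j = 0; j <= 3; j++) {
        if (!first)
          printf("delta = (%d, %d)\n", i - prev_i, j - prev_j);
        first = 0;
        prev_i = i;
        prev_j = j;
      }
    }
    /* The output alternates between (0, 1) while the inner loop advances and
     * (1, -3) when j resets to 0 and i is increased by one. */
    return 0;
  }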
Problem Formulation (and III)

• The stride between two consecutive accesses is a linear combination of the coefficients of the
  loop indices

Lemma 4.1.2. The stride between two consecutive accesses s^k = V(ı^{k+1}) − V(ı^k) is a linear
combination of the coefficients of the loop indices.

Proof. Using Equation (4.1), s^k can be rewritten as:

  s^k = V + (c0 + c1·i1^{k+1} + ... + cn·in^{k+1})
      − V − (c0 + c1·i1^k + ... + cn·in^k)
      = c1·δ1^k + ... + cn·δn^k
      = c · δ^k

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
80 / 118
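A worked instance of Lemma 4.1.2 with illustrative numbers (the same byte-level coefficients used
later for CHOLESKY): for a row-major array of doubles A[32][32], c = (256, 8) bytes, so the stride
observed when the iteration moves from (i, j) = (0, 31) to (1, 0) follows directly from c·δ.

  % Stride predicted by Lemma 4.1.2 for A[32][32] of doubles, c = (256, 8) bytes
  \vec{\delta}^k = (1,\; -31), \qquad
  s^k = \vec{c} \cdot \vec{\delta}^k = 256 \cdot 1 + 8 \cdot (-31) = 8 \text{ bytes}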
Outline

4. Trace-Based Affine Reconstruction of Code

• Problem Formulation

• Problem Resolution with CHOLESKY

• Extensions for Supporting Nonlinear Traces

• Experimental Evaluation

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
81 / 118
Solution Space

2n + 1 candidates

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
82 / 118
Problem Resolution (I)

The engine incrementally builds a partial solution S_n^k = {c, I^k, U, w}, a tuple (with n the
number of nested loops) capable of generating the subtrace {a1, ..., ak} emitted by a single
instruction and observed so far in the execution trace:

• Coefficients of the loop indices

  Vector c ∈ Z^n of coefficients of the loop indices.

• Iteration indices

  Matrix I^k = [ı^1 | ... | ı^k] ∈ Z^{n×k} of reconstructed indices.

• Bounds

  Matrix U ∈ Z^{n×n} and column vector w ∈ Z^n such that the condition for a given iteration tuple
  ı to be valid under the loop constraints can be written as:

    U·ı + w ≥ 0                                               (4.4)

  Since the upper bound functions are affine, each uj(ı) can be written as:

    uj(ı) = wj + uj,1·i1 + ... + uj,(j-1)·i(j-1)               (4.2)

  and therefore it is possible to build U and w as:

    U = [ -1    0    0   ...  0  ]         w = [ w1 ]
        [ u2,1 -1    0   ...  0  ]             [ w2 ]          (4.3)
        [ u3,1 u3,2 -1   ...  0  ]             [ w3 ]
        [ ...  ...  ...  ... ... ]             [ ...]
        [ un,1 un,2 un,3 ... -1  ]             [ wn ]

  Note that U is a lower triangular matrix, since no index ij can depend on an inner index, and
  that its main diagonal is equal to -1.

Note that such a tuple is not unique: for example, a 2-level loop with indices i and j iterating
sequentially over all the elements of an array A[N][M] (upper bounds ui = N, uj = M and access
V[i*M + j]) can be rewritten as an equivalent 1-level loop with index i, using u = N*M and access
V[i]. Section 4.1.5 details a mechanism to increase the dimensionality of the resulting nest.

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
83 / 118

Problem Resolution (II)

• To be a valid solution, S_n^k has to meet the following requirements:

  • Each consecutive pair of indices ı^k and ı^{k+1} must be sequential, as per Definition 4.1.1.
    Note that this condition is stronger than simply requiring that the indices stay inside the
    loop bounds, which could be written extending Equation (4.4) as:

      U·I^k + w·1^{1×k} ≥ 0^{n×k}

  • The observed strides must be coherent with the reconstructed ones. Using Lemma 4.1.2 this can
    be expressed as:

      c·(ı^{k+1} − ı^k) = c·δ^k = s^k

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
84 / 118
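A small C sketch of the validity test in Equation (4.4) (illustrative, not the thesis
implementation): an iteration vector is accepted under the reconstructed bounds when every
component of U·ı + w is non-negative.

  #include <stdbool.h>

  #define MAX_N 8   /* maximum nest depth handled by this sketch */

  /* Returns true when U*idx + w >= 0 holds component-wise, i.e., when the
   * iteration vector idx lies inside the reconstructed loop bounds. */
  static bool iteration_is_valid(int n, const long U[MAX_N][MAX_N],
                                 const long w[MAX_N], const long idx[MAX_N]) {
    for (int j = 0; j < n; j++) {
      long g = w[j];
      for (int l = 0; l < n; l++)
        g += U[j][l] * idx[l];
      if (g < 0)
        return false;
    }
    return true;
  }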
Problem Resolution: CHOLESKY (I)

   1 #define N 32;
   2 double p[N], A[N][N], x;
   3 int i, j, k;
   4
   5 #pragma scop
   6 for (i = 0; i < N; ++i) {
   7   x = A[i][i];
   8   for (j = 0; j <= i - 1; ++j)
   9     x = x - A[i][j] * A[i][j];
  10   p[i] = 1.0 / sqrt(x);
  11   for (j = i + 1; j < N; ++j) {
  12     x = A[i][j];
  13     for (k = 0; k <= i - 1; ++k)
  14       x = x - A[j][k] * A[i][k];
  15     A[j][i] = x * p[i];
  16   }
  17 }
  18 #pragma endscop

  (a) Source code.

  1 0x00400cbe 0x1e2d140
  2 0x00400cbe 0x1e2d140
  3 0x00400cbe 0x1e2d140
    ...
    0x00400cbe 0x1e2ef18
    0x00400cbe 0x1e2ef20
    0x00400cbe 0x1e2ef28

  (b) Excerpt of the memory trace generated by the access A[i][k] (see line 14 of Figure 4.2a).

Figure 4.2 – The Cholesky matrix decomposition.

The first column uniquely identifies the instruction that emits the memory access, and the second
is the address of the accessed memory location. Note that this affine access to the 2D matrix A is
enclosed into a 3-level loop nest whose inner indices depend on the outer ones.

The extraction process starts by calling EXTRACT() with the following S_1^2, the only feasible
solution for the subtrace {a1 = 0x1e2d140, a2 = 0x1e2d140}:

  c = s^1 = [a2 − a1] = [0]
  I^2 = [ı^1 | ı^2] = [0, 1]
  U = [-1]
  w = [1]

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
85 / 118
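The initial 1-level solution can be written down mechanically from the first two trace entries;
the sketch below (illustrative, with scalars standing in for the n = 1 case) mirrors the values
listed above.

  #include <stdio.h>

  int main(void) {
    /* First two addresses of the CHOLESKY trace for the access A[i][k]. */
    const unsigned long a1 = 0x1e2d140UL, a2 = 0x1e2d140UL;

    /* Initial 1-level partial solution S_1^2 = {c, I, U, w}. */
    long c = (long)(a2 - a1);   /* coefficient  c = s^1 = a2 - a1 = 0     */
    long I[2] = {0, 1};         /* reconstructed indices i^1 = 0, i^2 = 1 */
    long U = -1;                /* bounds: U*i + w >= 0  <=>  i <= w      */
    long w = 1;                 /* largest index seen so far              */

    printf("c=%ld  I=[%ld %ld]  U=[%ld]  w=[%ld]\n", c, I[0], I[1], U, w);
    return 0;
  }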
Problem Resolution: Building the System

• Calculate the observed stride

Upon processing access a^{k+1}, the algorithm first calculates the observed stride:

  s^k = a^{k+1} − a^k                                          (4.6)

• Build a diophantine linear equation system

Afterwards, it builds a diophantine linear equation system, based on Lemma 4.1.2, to discover the
potential indices ı^{k+1} which generate an access stride that is equal to the observed one:

  c·(ı^{k+1} − ı^k) = s^k  ⟹  (c^T·c)·δ^k = c^T·s^k            (4.7)

where (c^T·c) ∈ Z^{n×n} is the system matrix and δ^k ∈ Z^n is the solution. The system must be
diophantine, as loop indices may only have integer values. There are two possible situations when
solving this system:

• One or more solutions: explore them independently. For each solution δ^k, the new index
  ı^{k+1} = ı^k + δ^k is calculated, and I^{k+1} = [I^k | ı^{k+1}]. U, w and c remain unchanged.

• No solution under current boundaries. Three courses of action:

  • Increase dimensionality adding a new loop
  • Modify boundaries
  • Discard this branch

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
86 / 118

Problem Resolution: Solving the System

Although the system in Equation (4.7) has infinite solutions in the general case, only a few are
valid in the context of the affine loop reconstruction, which makes it possible to develop ad-hoc
solving strategies.

• As indices must be sequential, there are at most n solutions

Lemma 4.1.3. There are at most n valid solutions to the system in Equation (4.7). They correspond
to the indices:

  { ı_l^{k+1} = +(l, ı^k), 0 < l ≤ n }

Since ı^{k+1} must be sequential to ı^k as per Definition 4.1.1, there is a single degree of
freedom for δ^k: the position δ_l^k that is equal to 1:

  [δ_1^k ... δ_{l-1}^k  δ_l^k  δ_{l+1}^k ... δ_n^k]^T = [0 ... 0  1  −i_{l+1}^k ... −i_n^k]^T

Positions {ij, 0 < j < l} do not change between iterations k and (k+1), and therefore δj^k = 0,
while positions {ij, l < j ≤ n} are reset to 0, and therefore δj^k = −ij^k.

• We only need to calculate the predicted stride for each valid index and compare it with the
  observed stride

Taking this result into account, it is possible to find all valid solutions of the system in linear
time, O(n), by simply testing the n valid candidates ı_l^{k+1} = +(l, ı^k), calculating their
associated strides ŝ_l^k = c·δ_l^k, and accepting those solutions with a stride equal to the
observed one, ŝ_l^k = s^k. These are particular solutions of the subtrace {a1, ..., a^{k+1}}, which
will be explored to construct a solution for the entire trace.

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
87 / 118
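A compact C sketch of this O(n) candidate test (illustrative; the fixed-size arrays and the
calling convention are assumptions): for each position l it forms the sequential candidate
+(l, ı^k), computes the predicted stride ŝ_l^k = c·δ_l^k, and keeps the candidates whose
prediction matches the observed stride.

  #define MAX_N 8

  /* Tests the n sequential candidates +(l, idx) for the next iteration vector.
   * For each l, delta = (0,...,0, 1, -idx[l+1], ..., -idx[n-1]): loop l
   * advances by one and the inner loops reset.  A candidate is accepted when
   * the predicted stride c . delta equals the observed stride s.  Returns the
   * number of accepted candidates and stores their positions in accepted[]. */
  static int test_candidates(int n, const long c[MAX_N], const long idx[MAX_N],
                             long s, int accepted[MAX_N]) {
    int n_accepted = 0;
    for (int l = 0; l < n; l++) {           /* l = 0 is the outermost loop */
      long predicted = c[l];                 /* contribution of delta_l = +1 */
      for (int r = l + 1; r < n; r++)
        predicted -= c[r] * idx[r];          /* inner indices reset to 0     */
      if (predicted == s)
        accepted[n_accepted++] = l;
    }
    return n_accepted;
  }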
Problem Resolution: CHOLESKY (II)

• Solution for the first two accesses:

  S_1^2 :  c = s^1 = [a2 − a1] = [0]
           I^2 = [ı^1 | ı^2] = [0, 1]
           U = [-1]
           w = [1]

• Processing the third access, a3 = 0x1e2d140:

  s^2 = a3 − a2 = 0x1e2d140 − 0x1e2d140 = 0

  The engine predicts an iteration of the only loop that has been found so far and checks whether
  ı^3 = +(1, ı^2) = [2] produces a stride that matches the observed one:

    ŝ_1^2 = c·δ_1^2 = [0]·[1] = 0 = s^2

  The equality holds, so the solution is accepted. The matrix of reconstructed indices is updated,

    I^3 = [I^2 | +(1, ı^2)] = [0 1 2]

  the loop bounds are recomputed thanks to Corollary 4.1.6 (w = [2]), and U remains unchanged.
  The new solution is linear and the algorithm continues processing the trace and updating I and w
  in the same way until it builds S_1^30, with I^30 = [0 1 ... 29].

• The traversal is guided by:

    g^2 = U·ı^2 + w = [-1]·[1] + [1] = [0]

  g^2 does not have positive elements, and thus cannot guide the exploration (as expected in the
  first iterations). A simple heuristic that solves this kind of situation is suggested in
  Section 4.4.1.

• At this point the observed stride changes to:

    s^30 = a^31 − a^30 = 0x1e2d240 − 0x1e2d140 = 256

  The constructed loop with c = [0] cannot produce a stride different from 0. As such, the subtrace
  {a1, ..., a31} cannot be generated with an affine access enclosed in a 1-level loop, and the
  dimensionality of the current solution S_1^30 must be increased to build S_2^31.

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
88 / 118

Problem Resolution: Increasing the Solution Dimensionality (I)

• Add a new loop

There are (n + 1) potential solutions that need to be explored, one for each insertion position of
the newly generated index. The most common situation, particularly for large values of k, is that
the newly discovered loops are outer than the previously known ones, so EXTRACT() calls GROW() for
the possible insertion points of the new loop (see lines 25–30 of Algorithm 4.1), starting with the
outermost position. GROW() inserts a new row and column in U,

  U = [ -1   0            ]  =  [ -1   0 ]
      [  0   U_(1:1,1:1)  ]     [  0  -1 ]

inserts the new index into I, updating the previous ones with a row of 0 to match the new
dimensionality,

  I = [ 0 ... 0   1 ]  =  [ 0 0 ...  0  1 ]
      [ I_(1:30)  0 ]     [ 0 1 ... 29  0 ]

and adds a new element in w.

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
89 / 118

Problem Resolution: Increasing the Solution Dimensionality (and II)

• The coefficient for the new loop can be derived from the observed stride

A 0 in position p has been added to each index ı ∈ I^k, and a new column ı^{k+1} = f(p, ı^k) has
been added to the matrix. The coefficient c'_p associated with the new loop index can be derived
from Equation (4.7):

  c·(ı^{k+1} − ı^k) = s^k
    ⟹ [c1, ..., cp, c'_p, c_{p+1}, ..., cn] · [0, ..., 0, 1, −i^k_{p+1}, ..., −i^k_n]^T = s^k
    ⟹ c'_p = s^k + Σ_{r=p+1..n} i^k_r·c_r

After calculating the new c, U and w are updated as described previously to reflect any new
information available. If no solution is found for the boundary conditions, then this branch is
discarded. Note that there must be a practical limit to the maximum acceptable solution size, as in
the general case any trace {a1, ..., aN} can be generated using at most N affine nested loops. For
this reason, the solution space should be traversed in a breadth-first fashion, to ensure that a
minimal solution (in terms of number of generated nested loops) is found.

• In CHOLESKY

Revisiting the cholesky example, there are two possible insertion points for the new loop in
S_2^31. As the most common situation is that newly discovered loops are outer than the previously
known ones, p = 0 is explored first. The new loop coefficient and index matrix are:

  c'_0 = s^30 + i^30_1·c_1 = 256 + 0·29 = 256   ⟹   c = [256  0]

  I = [ 0 ...  0  1 ]
      [ 0 ... 29  0 ]

The traversal of the solution space continues. The next observed stride is s^31 = a^32 − a^31 = 8.

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
90 / 118

Problem Resolution: Updating the Loop Bounds

• Loop indices must be sequential and stay inside the loop bounds

When the predicted stride of the selected branch does not match the observed stride (ŝ_l^k ≠ s^k),
a different loop index has to be selected as described in Section 4.1.2, but the constructed index
ı^{k+1} may not be valid in the context of the extracted loop bounds U and w: either it is not
sequential to ı^k, or it violates the boundary conditions. In this scenario it is necessary to
generate new boundary conditions U' and w', found by solving the system:

  U'·I^{k+1} + w'·1^{1×(k+1)} ≥ 0^{n×(k+1)}

• Inconsistent system: the generated branch is discarded
  (the reconstructed iteration space is not a polytope)

• System with solutions: it will be overdetermined

The new bounds {w'_j} can be calculated exclusively selecting vectors of the reconstructed index
matrix, with a complexity of O(1). Once w' is known, the unknown rows {U'_(j,:)} can be calculated
by reducing the original system to small equation systems of the form

  U'_(j,:)·I_z + w'_j·1^{1×n} = 0^{1×n}

where I_z is a full-rank matrix of columns extracted from I^{k+1}, selecting iterations where index
ij is maximum. Taking advantage of the loop form, each of these systems can be solved in linear
time, and the complexity of the calculation of U' becomes O(n^2).

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
91 / 118

Problem Resolution: Accelerating the Traversal

• In the general case, the complexity of exploring the solution space for a trace with A addresses
  generated by n loops is O(n^A)

The traversal can still produce a large number of potential solutions that will be discarded when
processing the remaining addresses in the trace. In order to guide the traversal of the solution
space, consider the column vector g^k ∈ Z^n defined as:

  g^k = U·ı^k + w

• Each element indicates how many more iterations of each index are left before it resets under
  the bounds U and w

Lemma 4.1.4. Each element g_j^k ∈ g^k indicates how many more iterations of ij are left before it
resets under the bounds U and w: g_j^k is equal to the value of the upper bound of loop ij, defined
in Equation (4.2), minus the current value of ij.

• The most plausible value for the next index is ı_l^{k+1} = +(l, ı^k), where l is the position of
  the innermost positive element of g^k

The correctness of ı_l^{k+1} can be assessed by comparing the predicted stride with the observed
one, s^k. Note that using g^k as described also guarantees consistency with the boundary conditions
in Equation (4.4).

• Several accesses are recognized in block

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
92 / 118

Problem Resolution: CHOLESKY (III)

The new loop is inserted at position p = 0 and the components of the solution are updated:

  U = [ -1   0           ]  =  [ -1   0 ]        w = [1 | w_(1:1)]^T = [1 | 29]^T
      [  0   U_(1:1,1:1) ]     [  0  -1 ]

  I = [ 0 ... 0   1 ]  =  [ 0 0 ...  0  1 ]
      [ I_(1:30)  0 ]     [ 0 1 ... 29  0 ]

There are not enough points in the reconstructed iteration space to infer any dependence between
the number of iterations of i2 and the newly inserted i1 (see Equation (4.12)), and therefore U
remains unchanged. The engine checks the linearity of the calculated loop bounds as indicated in
Equation (4.5):

  U·I + w·1^{1×31} ≥ 0^{2×31}

  [ -1   0 ] [ 0 0 ...  0  1 ]   [  1 ...  1 ]   [  1  1  1 ...  1  0 ]
  [  0  -1 ] [ 0 1 ... 29  0 ] + [ 29 ... 29 ] = [ 29 28 27 ...  0 29 ]  ≥ 0

The insertion of the new loop in position p = 0 is accepted and the traversal of the solution space
continues from k = 31. The next observed stride is s^31 = 8. (Note that strides are calculated in
bytes; see Section 4.2.2.)

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
93 / 118

Problem Resolution: CHOLESKY (and IV)

The engine has now collected all the information that it needs. From this point on, our method
keeps incorporating the remaining accesses of the trace into the solution, with g predicting all
remaining iterations, until it reaches the end of the trace having reconstructed the following
solution:

  c = [256  0  8]

      [ -1   0   0 ]
  U = [ -1  -1   0 ]          w = [29  29  0]^T
      [  1   0  -1 ]

Note that, since addresses in the trace are expressed in bytes, the access reconstructed by the
engine is also expressed in bytes: A[i][k] is rebuilt as A[256·i + 8·k]. The original source-level
indexing can be recovered taking into account the data type size (double, 8 bytes) and the
dimensionality of the array.

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
94 / 118
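As a sanity check (an illustrative sketch; the base address is simply taken from the first trace
entry, and the loop bounds are read off U and w as reconstructed above), the solution can be turned
back into a loop nest that regenerates the observed addresses.

  #include <stdio.h>

  int main(void) {
    /* Loop nest regenerated from the CHOLESKY solution: c = [256 0 8] and
     * bounds i1 <= 29, i2 <= 29 - i1, i3 <= i1 (rows of U*i + w >= 0). */
    const unsigned long base = 0x1e2d140UL;   /* first address of the trace */
    unsigned long n = 0;

    for (long i1 = 0; i1 <= 29; i1++)
      for (long i2 = 0; i2 <= 29 - i1; i2++)
        for (long i3 = 0; i3 <= i1; i3++) {
          unsigned long addr = base + 256 * i1 + 0 * i2 + 8 * i3;
          if (++n <= 3 || (n >= 31 && n <= 34))
            printf("%lu 0x%lx\n", n, addr);   /* matches the trace excerpt */
        }
    return 0;
  }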
Outline

4. Trace-Based Affine Reconstruction of Code

• Problem Formulation

• Problem Resolution with CHOLESKY

• Extensions for Supporting Nonlinear Traces

• Experimental Evaluation

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
95 / 118
Supporting Nonlinearity: Input Noise

• Some trace files mainly contain references issued by a single access, but mixed with unrelated
  ones (e.g., nearly affine or unlabeled traces)

• The exploration of the solution space can be modified to discard up to max observed accesses
  before concluding that a branch cannot lead to a solution. This feature has to be statistically
  guided, to avoid discarding too many points and reaching a very simplified nest version.

Hence, the algorithm has been extended to allow discarding input noise. Whenever g predicts a
stride ŝ^k that does not match the observed s^k, the reconstruction engine checks whether

  ŝ^k = Σ_{r=0..e} s^{k+r} ,   0 < e ≤ max

where max is the maximum number of consecutive noise references to be tolerated. If this condition
holds for some value of e, then it is plausible that accesses {a^k, ..., a^{k+e-1}} are spurious.
The engine discards these references and resumes the exploration. A backtracking point is created
in case this assumption proves false.

• Tolerance parameter for discarding a branch

The use of g is capable of identifying errors as long as the current S_n^k accurately represents
the trace; however, this does not happen in the initial stages of the reconstruction.

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
96 / 118
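A C sketch of this noise check (illustrative; the indexing convention is simplified): when the
prediction disagrees with the trace, consecutive observed strides are accumulated, and if their sum
reaches the predicted stride within max steps, the intermediate accesses are treated as spurious.

  /* Returns the number e of consecutive accesses (0 < e <= max) that should
   * be discarded as noise so that the accumulated observed stride matches the
   * predicted one, or 0 if no such e exists.  strides[r] holds the observed
   * stride r positions ahead of the current access. */
  static int noise_span(long predicted, const long strides[],
                        int available, int max) {
    long sum = 0;
    int limit = (max < available) ? max : available;
    for (int e = 1; e <= limit; e++) {
      sum += strides[e - 1];
      if (sum == predicted)
        return e;
    }
    return 0;
  }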
Supporting Nonlinearity: Missing Data

• A trace file may be missing some data to make it completely representable by an affine loop

• The exploration of the solution space can be modified to insert up to max predicted strides

• Tolerance parameter to avoid the exploration of improbable branches

• Particular case: access guarded by a boolean function

Inserting the access predicted by γ_k allows to continue exploring the branch whenever

    s_k = \sum_{r=0}^{\varepsilon} \hat{s}_{k+r}, \quad 0 < \varepsilon \le max

where max is the maximum number of missing references to be tolerated. As before, if this condition holds then it is plausible that there is a sequence of missing accesses \{a'_k, \ldots, a'_{k+\varepsilon-1}\} in between a_k and a_{k+1} such that \{a'_{k+j-1} = a_k + \sum_{r=0}^{j} \hat{s}_{k+r}, \; 0 < j \le \varepsilon\}. The engine tentatively inserts these accesses and resumes the exploration. A backtracking point is created in case no solution is reached by exploring that branch, and a tolerance parameter is used to avoid the exploration of improbable branches, as described in Section 4.3.1. When allowing for missing data, the final solution for a trace \vec{a} containing ...

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
97 / 118
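A minimal sketch of this missing-data check follows (hypothetical helper and names, not the thesis code): if the observed stride equals the sum of several consecutive predicted strides, the candidate missing addresses are synthesized from the last observed address by accumulating those predictions.

#include <stddef.h>

/* If s_obs equals the sum of the predicted strides s_hat[0..eps] for some
 * 0 < eps <= max_missing, writes the eps candidate missing addresses
 * (a_k plus the partial sums of the predictions) into `missing` and
 * returns eps; returns 0 otherwise. */
static size_t missing_span(long s_obs, const long *s_hat, size_t n_pred,
                           size_t max_missing, unsigned long a_k,
                           unsigned long *missing) {
    if (n_pred == 0)
        return 0;
    long sum = s_hat[0];                              /* term r = 0 */
    for (size_t eps = 1; eps <= max_missing && eps < n_pred; ++eps) {
        sum += s_hat[eps];                            /* accumulate s_hat_{k+r} up to r = eps */
        if (sum == s_obs) {
            long partial = 0;
            for (size_t j = 0; j < eps; ++j) {        /* synthesize a'_k, ..., a'_{k+eps-1} */
                partial += s_hat[j];
                missing[j] = a_k + (unsigned long)partial;
            }
            return eps;
        }
    }
    return 0;
}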
Supporting Nonlinearity:
Automatically Parallelized Codes with PLUTO

• Codes parallelized by simply adding an OpenMP parallel pragma & the iterations are scheduled statically

• The same algorithm can be applied.

• Otherwise

• Piecewise reconstruction as a sequence of perfectly nested loops

Algorithm 4.3 Pseudocode of the piecewise reconstruction
 1: FUNCTION PIECEWISEEXTRACT
    Input: \vec{a}: the execution trace
    Input: max_depth: maximum reconstruction depth
    Output: W = {S_0, ..., S_{L-1}}: set of perfectly nested affine loops that form a piecewise reconstruction of \vec{a}
 2:   W ← PIECEWISEEXTRACT(\vec{a}, depth = 1)
 3:   curr_depth ← 2
 4:   while (curr_depth ≤ max_depth) ∧ (|W| > 1) do
 5:     for S_l ∈ W do
 6:       S'_l ← EXTRACT(S_l, \vec{a}, depth = curr_depth)
 7:       if S'_l overlaps perfectly with {S_l, ..., S_l'} ⊆ W then
 8:         W ← (W \ {S_l, ..., S_l'}) ∪ S'_l
 9:       end if
10:     end for
11:     curr_depth++
12:   end while
13:   return W
14: end FUNCTION

... the problem intractable due to the vast amount of different alternatives to be explored.

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
98 / 118
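A structural sketch of the merging step of Algorithm 4.3 is shown below (hypothetical piece_t type and a stubbed extract(), not the thesis implementation): each piece covers a contiguous slice of the trace, and a run of consecutive pieces is replaced by a deeper nest whenever that nest covers exactly the same slice. In the real engine, extract() is the reconstruction procedure presented earlier in the chapter, applied to the slice that starts at the given piece.

#include <stddef.h>

typedef struct { size_t first, last; int depth; } piece_t;  /* covers trace entries [first, last] */

/* Stub standing in for the reconstruction engine: it should try to rebuild the
 * slice starting at p.first as a single affine nest of the given depth and
 * return the slice it manages to cover; here it returns p unchanged. */
static piece_t extract(const unsigned long *trace, size_t n, piece_t p, int depth) {
    (void)trace; (void)n; (void)depth;
    return p;
}

/* Piecewise driver: w initially holds the depth-1 pieces in trace order;
 * returns how many pieces remain after merging up to max_depth. */
size_t piecewise_extract(const unsigned long *trace, size_t n,
                         piece_t *w, size_t n_pieces, int max_depth) {
    for (int depth = 2; depth <= max_depth && n_pieces > 1; ++depth) {
        for (size_t l = 0; l < n_pieces; ++l) {
            piece_t grown = extract(trace, n, w[l], depth);
            size_t lp = l;                            /* last original piece covered by grown */
            while (lp + 1 < n_pieces && w[lp].last < grown.last)
                ++lp;
            if (lp > l && grown.first == w[l].first && grown.last == w[lp].last) {
                w[l] = grown;                         /* replace w[l..lp] by the deeper nest */
                for (size_t m = lp + 1; m < n_pieces; ++m)
                    w[l + 1 + (m - lp - 1)] = w[m];
                n_pieces -= lp - l;
            }
        }
    }
    return n_pieces;
}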
Outline

4. Trace-Based Affine Reconstruction of Code

• Problem Formulation

• Problem Resolution with CHOLESKY

• Extensions for Supporting Nonlinear Traces

• Experimental Evaluation

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
99 / 118
Experimental Evaluation: Affine Codes

[Bar chart: reconstruction times (s), log scale, for the sequential and parallel traces of each benchmark (3mm, 2mm, syr2k, syrk, gemm, floyd-warshall, symm, correlation, covariance, trmm, lu, adi, doitgen, dynprog, fdtd-apml, ludcmp, fdtd-2d, gramschmidt, cholesky, gemver, seidel, jacobi-2D, gesummv, atax, bicg, mvt, reg_detect, durbin, trisolv, jacobi-1D), together with the total trace refs (millions), the % of trace reconstructed without γ in 48h, and the % of accesses predicted by γ.]

Trace    %      Trace     %      Trace     %
3mm      0.02   lu        0.11   seidel    0.00
2mm      0.04   adi       0.01   jac-2D    0.00
syr2k    0.02   doit.     0.58   gesum.   25.01
syrk     0.05   dynp.     0.00   atax     25.00
gemm     0.05   fdtd-a.  24.21   bicg     25.00
floyd    0.00   lud.      0.66   mvt      12.50
symm     0.13   fdtd-2d   0.01   reg_d.    2.07
corr.    0.67   grams.    0.58   durbin     100
covar.   0.37   chol.     0.58   trisolv    100
trmm     0.00   gemv.    21.43   jac-1D     100

Table 4.1 – Percentage of trace reconstructed after 48h without γ prediction.

Trace    %      Trace     %      Trace     %
3mm     99.85   lu       99.71   seidel   95.00
2mm     99.84   adi      98.00   jac-2D   95.00
syr2k   99.85   doit.    98.83   gesum.   74.95
syrk    99.83   dynp.    99.98   atax     74.96
gemm    99.83   fdtd-a.  75.62   bicg     74.96
floyd   99.88   lud.     99.99   mvt      87.46
symm    99.80   fdtd-2d  98.00   reg_d.   99.78
corr.   99.60   grams.   99.61   durbin   99.88
covar.  99.70   chol.    99.99   trisolv  99.89
trmm    99.97   gemv.    78.53   jac-1D   99.00

Table 4.2 – Percentage of trace accesses predicted by γ.

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
100 / 118
Experimental Evaluation:
Input Noise & Missing Data

[Bar chart: normalized extraction time (log scale) for 3mm, dynprog, cholesky, trisolv and jacobi-1d, with one bar per configuration: p=0.01, p=0.05, p=0.10, p=0.15 and guarded.]

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
101 / 118
Experimental Evaluation:
Piecewise Reconstruction (I)
1 for (i=1;i<=N-2;i++)
2   for (j=1;j<=N-2;j++)
3     a[i][j] = ...

(a) Original sequential code.

1 for (t1=3;t1<=3*N-6;t1++) {
2   lbp=max(ceild(t1+1,2),t1-N+2);
3   ubp=min(floord(t1+N-2,2),t1-1);
4   #pragma omp parallel for
5   for (t2=lbp;t2<=ubp;t2++)
6     a[(t1-t2)][(-t1+2*t2)] = ...
  }

(b) Automatically parallelized code.

seidel (from the PLUTO examples)

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
102 / 118
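Because the iterations of the inner loop are scheduled statically, each thread executes a contiguous block of the t2 range for every value of t1, so the references issued by a single thread still form affine pieces. The snippet below is purely illustrative (the explicit schedule(static) clause, the value of N, the log_access helper and the printf-based tracing are assumptions, not the instrumentation used for the experiments); it only shows the shape of the per-thread address streams that the piecewise reconstruction of the following slides consumes.

#include <omp.h>
#include <stdio.h>

#define N 512
#define ceild(n, d)  (((n) + (d) - 1) / (d))
#define floord(n, d) ((n) / (d))
#define max(a, b)    ((a) > (b) ? (a) : (b))
#define min(a, b)    ((a) < (b) ? (a) : (b))

static double a[N][N];

static void log_access(int tid, const void *addr) {
    printf("%d %p\n", tid, addr);   /* in practice, one stream would be kept per thread */
}

int main(void) {
    for (int t1 = 3; t1 <= 3 * N - 6; t1++) {
        int lbp = max(ceild(t1 + 1, 2), t1 - N + 2);
        int ubp = min(floord(t1 + N - 2, 2), t1 - 1);
        #pragma omp parallel for schedule(static)
        for (int t2 = lbp; t2 <= ubp; t2++) {
            a[t1 - t2][-t1 + 2 * t2] = 0.0;            /* stands in for the seidel update */
            log_access(omp_get_thread_num(), &a[t1 - t2][-t1 + 2 * t2]);
        }
    }
    return 0;
}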
Reconstructed Trace Pieces of seidel for the Thread #0
max_depth = 1 (161 pieces)
Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
103 / 118
Reconstructed Trace Pieces of seidel for the Thread #0
max_depth = 2 (58 pieces)
Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
104 / 118
Reconstructed Trace Pieces of seidel for the Thread #0
max_depth = 3 (41 pieces)
Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
105 / 118
Reconstructed Trace Pieces of seidel for the Thread #0
max_depth = 4 (3 pieces)
Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
106 / 118
Reconstructed Trace Pieces of seidel for All the Threads
max_depth = 4
Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
107 / 118
Experimental Evaluation:
Piecewise Reconstruction (and VII)

• Increasing the maximum depth has diminishing returns

• A small number of loops represent most of the issued accesses

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
108 / 118
Outline

1. Introduction

2. A Novel Compiler Support for Multicore Systems

3. Locality-Aware Automatic Parallelization for GPGPU

4. Trace-Based Affine Reconstruction of Code

5. Conclusions

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
109 / 118
Main Contributions (I)

Definition of a new compiler intermediate representation


called KIR

• It provides the program characteristics needed for the


automatic parallelization of the input sequential code

• It is built on top of diKernels to handle syntactical variations


of the source code

• diKernels are connected with diKernel-level dependences


and are grouped into execution scopes in order to
recognize the computational stages of the input application

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
110 / 118
Main Contributions (II)
Generation of parallel code for multicore processors with the insertion of
OpenMP directives

• Automatic partitioning algorithm of the KIR focused on the minimization


of the overhead of thread synchronization

• Comprehensive benchmark suite that includes synthetic codes


representative of frequently used diKernels, routines from dense/sparse
linear algebra and image processing, and simulation applications

• Comparative evaluation in terms of effectiveness with GCC, ICC and


PLUTO. The contenders fail to parallelize codes that contain both regular
computations with complex control flows and irregular computations,
and they do not optimize the joint parallelization of multiple loops

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
111 / 118
Main Contributions (III)

KIR-based locality-aware automatic parallelization technique that


targets GPU-based heterogeneous systems

• It exploits data locality in the complex GPU memory hierarchy

• Tested with two representative case studies: CONV3D & SGEMM

• Chains of recurrences model accesses to n-dimensional arrays

• OpenHMPP directives enabled a great understandability and


portability of the generated GPU code

• Performance evaluation on NVIDIA GPUs (with two different core


architectures) has corroborated its effectiveness

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
112 / 118
Main Contributions (and IV)

Reconstruction of affine loop codes from their memory traces,


considering one instruction at a time

• Formulated as the exploration of a tree-like solution space

• Large traces are processed in a matter of minutes, without user


intervention or access to source/binary codes

• Extensions to deal with moderate nonlinearity in the trace and with


automatically parallelized codes

• Applications such as trace compression/storage/communication,


dynamic parallelization, memory placement and memory hierarchy
design

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
113 / 118
Future Research Lines

• Using the trace-based reconstruction to increase the


information available for the construction of the KIR

• New automatic partitioning algorithm of the KIR that handles


the interactions between computations for heterogeneous
clusters, considering both CPU-GPU interaction and inter-node communication

• Auto-tuning to select the best-performing variant among


several candidates of a parallelized diKernel

• Reconstructing the memory trace of a broader range of


irregular computations
Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
114 / 118
Publications (I)
• J. M. Andión, M. Arenaz, G. Rodríguez, and J. Touriño. A novel compiler support for
automatic parallelization on multicore systems. Parallel Computing, 39(9):442–460,
2013. [Q1 (15/102) in Computer Science, Theory & Methods in JCR 2013]

• J. M. Andión, M. Arenaz, G. Rodríguez, and J. Touriño. A parallelizing compiler for


multicore systems. In Proceedings of the 17th International Workshop on Software
and Compilers for Embedded Systems (SCOPES), pages 138–141, Sankt Goar,
Germany, 2014. [Type A in CORE2014]

• J. M. Andión, M. Arenaz, and J. Touriño. Domain-independent kernel-based intermediate representation for automatic parallelization of sequential
programs. In Poster Abstracts of the 6th International Summer School on Advanced Computer Architecture and Compilation for High-Performance and
Embedded Systems (ACACES), pages 71–74, Terrasa, Spain, 2010.

• J. M. Andión, M. Arenaz, and J. Touriño. Automatic partitioning of sequential applications driven by domain-independent kernels. In Proceedings of the
15th Workshop on Compilers for Parallel Computing (CPC), CDROM, Vienna, Austria, 2010.

• J. M. Andión, M. Arenaz, and J. Touriño. A new intermediate representation for GCC based on the XARK compiler framework. In Proceedings of the 2nd
International Workshop on GCC Research Opportunities (GROW) (in conjunction with HiPEAC), pages 89–100, Pisa, Italy, 2010.

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
115 / 118
Publications (II)

• J. M. Andión, M. Arenaz, F. Bodin, G. Rodríguez, and J.


Touriño. Locality-aware automatic parallelization for
GPGPU with OpenHMPP directives. International Journal
of Parallel Programming (in press), 2015. [Q4 (79/102) in
Computer Science, Theory & Methods in JCR 2013]

• J. M. Andión, M. Arenaz, F. Bodin, G. Rodríguez, and J. Touriño. Locality-aware


automatic parallelization for GPGPU with OpenHMPP directives. In Proceedings of the
7th International Symposium on High-level Parallel Programming and Applications
(HLPP), pages 217–238, Amsterdam, Netherlands, 2014. [Type C in CORE2014]

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
116 / 118
Publications (and III)

• G. Rodríguez, J. M. Andión, M. T. Kandemir, and J.


Touriño. Trace-based Affine Reconstruction of Codes. In
Proceedings of the 14th International Symposium on
Code Generation and Optimization (CGO), (accepted),
Barcelona, Spain, 2016. [Type A in CORE2014]

• G. Rodríguez, J. M. Andión, J. Touriño, and M. T. Kandemir. Reconstructing affine


codes from their memory traces. Pennsylvania State University Technical Report CSE
15-001, University Park, PA, USA, 2015.

Compilation Techniques for Automatic Extraction of Parallelism and Locality in Heterogeneous Architectures
117 / 118
Compilation Techniques
for Automatic Extraction
of Parallelism and Locality
in Heterogeneous Architectures

José M. Andión

PHD ADVISORS: Gabriel Rodríguez and Manuel Arenaz
