
A Graph-Based Iterative Compiler Pass Selection and Phase Ordering Approach


Ricardo Nobre, Faculty of Engineering, University of Porto, and INESC TEC, Porto, Portugal (ricardo.nobre@fe.up.pt)
Luiz G. A. Martins, Faculty of Computing, Federal University of Uberlândia, Uberlândia, Brazil (lgamartins@ufu.br)
João M. P. Cardoso, Faculty of Engineering, University of Porto, and INESC TEC, Porto, Portugal (jmpc@acm.org)

Abstract

Nowadays compilers include tens or hundreds of optimization passes, which makes it difficult to find sequences of optimizations that achieve code more optimized than the one obtained using typical compiler options such as –O2 and –O3. The problem involves both the selection of the compiler passes to use and their ordering in the compilation pipeline. The improvement achieved by the use of custom phase orders for each function can be significant, and is thus important to satisfy strict requirements such as the ones present in high-performance embedded computing systems. In this paper we present a new and fast iterative approach to the phase selection and ordering challenges, resulting in compiled code with higher performance than the one achieved with the standard optimization levels of the LLVM compiler. The obtained performance improvements are comparable with the ones achieved by other iterative approaches, while requiring considerably less time and resources. Our approach is based on sampling over a graph representing transitions between compiler passes. We performed a number of experiments targeting the LEON3 microarchitecture using the Clang/LLVM 3.7 compiler, considering 140 LLVM passes and a set of 42 representative signal and image processing C functions. An exhaustive cross-validation shows our new exploration method is able to achieve a geometric mean performance speedup of 1.28× over the best individually selected –OX flag when considering 100,000 iterations, versus geometric mean speedups from 1.16× to 1.25× obtained with state-of-the-art iterative methods not using the graph. From the set of exploration methods tested, our new method is the only one consistently finding compiler sequences that result in performance improvements when considering 100 or fewer exploration iterations. Specifically, it achieved geometric mean speedups of 1.08× and 1.16× for 10 and 100 iterations, respectively.

Categories and Subject Descriptors D.3.4 [Programming Languages]: Processors - compilers, optimization

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
LCTES'16, June 13–14, 2016, Santa Barbara, CA, USA
ACM. 978-1-4503-4316-9/16/06...$15.00
http://dx.doi.org/10.1145/2907950.2907959

General Terms Algorithms, Measurement, Performance, Experimentation

Keywords Phase-ordering, design space exploration, compilers

1. Introduction

Programmers typically rely on high-level languages and optimize applications in two ways: performing manual source code transformations and/or relying on a tool (the compiler) to automatically apply a set of transformations in a specific order. Programs are usually compiled with one of the default compiler optimization levels. The compiler sequences behind the –OX optimization level flags, such as the GCC [1] –O1, –O2 and –O3 flags, rely on a fixed sequence of compiler passes. Although these flags generally provide performance improvements, they do not leverage the full optimization potential of all functions/programs and target hardware/software platform pairs.

Different functions and different instances of the same functions can benefit from optimization using specialized compiler pass sequences; i.e., they can benefit from the execution of a specific set of compiler passes, executed in a specific order. This has been shown by a number of authors [2, 3], and by our own experimental results [4–6]. Individual function improvements regarding a given metric (e.g., performance, code size, energy) have been addressed by approaches focused on automatic phase selection and/or automatic phase ordering (e.g., [3, 7]). The specialized sequences of compiler optimizations provided by phase selection and phase ordering methods can be a way to comply with stringent requirements.

Moreover, the compiler pass sequences represented by the traditional compiler optimization levels tend to favor performance or code size over other metrics. Power, energy, area and clock frequency are especially important in domains such as embedded systems (e.g., when also involving custom hardware). Relying on the compiler to optimize code diminishes the effort needed to manually optimize and/or maintain different versions of the same function/program. By relying on a compiler pass sequence tailored on a per-program basis, a single source of the program/function can target multiple hardware/software platforms, or retarget the same platforms taking into account different metrics and/or requirements.

Specialization of compiler phase orders can be efficiently accomplished by an automatic Design Space Exploration (DSE) approach. The DSE can deal with the difficulty of finding phase selections, and especially phase orders, that result in better generated code. The difficulties are related to the large number of compiler passes to select from, and to the fact that compiler passes interact with each other in the compiler optimization pipeline and highly depend on code features, in ways difficult to predict. Although users may know that certain compiler passes are to be executed before other passes, the

impact on other optimization passes in later stages of the compilation pipeline is difficult (if not impossible) to predict with accuracy. This is the case even for compiler writers, and is exacerbated by the fact that even the smallest changes to a number of compiler passes (e.g., between different versions of the compiler) can lead to changes in the interdependencies between them; this makes automatic DSE, regarding a single metric (e.g., execution time) or multiple metrics (e.g., performance and energy), worth seeking.

The exploration of compiler sequences can be a computationally intensive process. Iterative approaches tend to require a number of exploration iterations (i.e., compile and simulate/execute cycles) ranging from hundreds to millions (depending on the program/function and the simulator/platform) in order to find compiler sequences that result in suitable optimizations regarding the given metric(s). Compilers typically provide tens to hundreds of compiler passes, and in many cases it is beneficial to explore long compiler sequences including repetitions of the same compiler passes at different points of the optimization pipeline. Each iteration of an iterative exploration method in the context of compiler sequences consists of the generation of candidate solution(s), compilation(s), and estimation(s), simulation(s) and/or execution(s); estimation and simulation/execution are the most computationally intensive steps.

Efficient automatic DSE of compiler phase selection and phase ordering is of high importance, especially when stringent requirements are not satisfied by generic –OX optimization options and developers are otherwise required to modify the code and/or apply different optimizations. By applying specific compiler optimization phase orders, fewer manual code modifications (if any) are needed.

In this paper we present a new algorithm for efficiently suggesting compiler phase selections and phase orders. The algorithm uses a graph representing transitions between passes to generate an arbitrary number of new compiler phase selections/orders. Connections between graph nodes are weighted to favor sub-sequences that are more likely to result in more suitable compiler sequences. The graph resembles previous compilation sequences and is built from compilation sequences previously achieved for a number of functions.

The experimental results, targeting a LEON3 microprocessor [8] with LLVM and considering 140 compiler passes when compiling 42 functions from two Texas Instruments C libraries, show that the new approach is able to achieve a geometric mean speedup of 1.28× over the best individually selected –OX flag, versus geometric mean speedups from 1.16× to 1.25× obtained with other state-of-the-art iterative methods that do not rely on the use of the graph. Additionally, using our approach with 100 and 1,000 iterations results in geometric mean speedups of 1.16× and 1.20×, respectively, over the best performance using –O1, –O2 or –O3.

The rest of the paper is organized as follows. Section 2 explains how compiler pass positioning information is represented in a graph structure and how it is used by the new approach. Section 3 presents our compiler and DSE framework, and the methodology of the experiments. Section 4 presents and discusses experimental results. Section 5 analyses and discusses limitations specific to the presented approach, and general limitations and/or challenges in the context of compiler phase order exploration. Related work is presented in Section 6. Final remarks about the presented work and possible future work are presented in Section 7.

2. Graph-based DSE

Knowledge about which compiler passes to use and their order of execution, so that non-functional requirements can be best met, can be leveraged to develop more efficient DSE methods in the context of exploration of compiler sequences.

Our approach relies on a directed graph (possibly cyclic) representing favorable compiler pass transitions to more efficiently generate suitable compiler sequences. With our approach only a fraction of the compiler sequence space is considered for iterative compilation/execution, thus making exploration of compiler sequences faster. Each graph node represents a compiler pass, and paths in the graph formed by nodes connected through directed edges represent subsequences of compiler passes. In this graph, weights are assigned to the directed edges, so that some subsequences of compiler passes are favored over others when generating new sequences.

The use of the graph to guide an iterative compiler sequence exploration scheme reduces the search space, by avoiding the test of a large number of compiler sub-sequences and by giving preference to the ones that can be represented in the graph by paths associated with large weights.

2.1 Building the Graph

The graph can be built by compiler experts based on their knowledge about compiler pass interdependence. Alternatively, in case far-reaching human expertise about the nuances of compiler pass interdependence is not available, a statistical approach can be used. The graph can be built from a list of compiler sequences, each composed of two or more compiler pass instances (including repetitions) ordered in a particular way, previously found for a group of functions/programs when compiling for the same or a similar target platform (e.g., a CPU with a similar architecture).

Building the graph from sequences previously found with iterative compilation for other functions/programs has potential as an approach. Given an objective metric, if some orderings of compiler passes were useful when compiling a set of programs/functions, then their use will more likely result in high-quality solutions when used for optimizing new program(s)/function(s) than phase orders that did not produce the same quality. The list of compiler sequences is translated into the graph by creating graph paths according to those sequences and by assigning weights to the connections between nodes. These weights are based on how many times the pairs of compiler passes represented by the connected nodes are present in the sequences used as input for the graph construction.

In the experiments presented in this paper, we use the default LLVM implementation for all the compiler passes with input parameters. For instance, for loop unrolling we do not specify the unrolling factor and let the LLVM compiler select the unrolling factor using its internal heuristic. In case one wants to explore the parameters allowed in some compiler passes, those compiler passes must be represented in the graph as many times as the number of parametric configurations considered.

The graph can contain cycles, as it must be possible to select the same compiler pass multiple times. This allows compiler pass transitions (including subsequences of a number of passes) to repeat.

2.2 Graph-based Algorithm

The new algorithm (see Algorithm 1) is based on sampling the graph representing possible compiler pass sequences by nodes and weighted connections.

The algorithm receives as input the number of iterations N, the maximum number of passes M, and a graph G. A new sequence is generated and evaluated (i.e., compiled and executed/simulated) in each iteration, starting with a call to selectFirstNode(G) (line 4) to get the first compiler pass, represented by a graph node, to add to each new sequence being generated. The heuristic to select the first compiler pass is defined by the user. For the experiments presented in this paper we select as first pass the one corresponding to the graph's most connected node. Other heuristics can be used;

Algorithm 1: New graph-based method
  Input: Number of iterations (N), maximum number of passes per sequence (M) and graph (G)
  Output: Best optimization sequence found (bestSeq)
   1: bestSeq ← {}
   2: bestFit ← getStartFitness()
   3: for i ← 1 to N do
   4:     newSeq ← selectFirstNode(G)
   5:     for j ← 1 to M do
   6:         newPass ← nextNode(G, last(newSeq))
   7:         if isNull(newPass) then
   8:             break
   9:         newSeq ← append(newSeq, newPass)
  10:     newFit ← evaluate(newSeq)
  11:     if isNewSeqBetter(newFit, bestFit) then
  12:         bestSeq ← newSeq
  13:         bestFit ← newFit
  14: return bestSeq

Figure 1. Representation of graph nodes and connections from the node representing the loop-unroll compiler pass. (Out-edges of -loop-unroll and their weights: -block-freq 3/14, -dce 3/28, -early-cse 5/14, -indvars 3/14, -reg2mem 3/28.)

Figure 2. Probability distribution for next compiler pass selection, considering the loop-unroll node as the current pass and the graph example shown in Figure 1 (shares of the 0–100% range: -block-freq 3/14, -dce 3/28, -early-cse 5/14, -indvars 3/14, -reg2mem 3/28).
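As a concrete illustration, the graph construction of Section 2.1 and the sampling loop of Algorithm 1 can be sketched in Python roughly as follows. This is our own minimal sketch, not the authors' implementation: evaluate() stands in for the compile-and-simulate step, and the "most connected node" heuristic is approximated here by out-degree.

```python
import random

def build_graph(training_sequences):
    """Section 2.1: one node per pass; edge weights count how often one
    pass directly follows another in previously found sequences."""
    graph = {}
    for seq in training_sequences:
        for a, b in zip(seq, seq[1:]):
            graph.setdefault(a, {})
            graph[a][b] = graph[a].get(b, 0) + 1
    return graph

def select_first_node(graph):
    """Heuristic used in the paper: start from the most connected node
    (approximated here by out-degree)."""
    return max(graph, key=lambda n: len(graph[n]))

def next_node(graph, current, r):
    """Sample the successor of `current` from the distribution induced by
    the out-edge weights (Figure 2); `r` is a random number in [0, 1).
    Returns None when the node has no out-edges (end of sequence)."""
    edges = graph.get(current)
    if not edges:
        return None
    total = sum(edges.values())
    acc = 0.0
    for node, weight in edges.items():
        acc += weight / total
        if r < acc:
            return node
    return node  # guard against floating-point rounding on the last edge

def generate_sequence(graph, max_passes, rng):
    """Follow weighted graph connections until M passes or a dead end."""
    seq = [select_first_node(graph)]
    while len(seq) < max_passes:
        nxt = next_node(graph, seq[-1], rng.random())
        if nxt is None:
            break
        seq.append(nxt)
    return seq

def graph_based_dse(graph, iterations, max_passes, evaluate, rng=None):
    """Algorithm 1: keep the best of N sequences sampled from the graph;
    `evaluate` returns a fitness to minimize (e.g., CPU cycles)."""
    rng = rng or random.Random(0)
    best_seq, best_fit = None, float("inf")
    for _ in range(iterations):
        seq = generate_sequence(graph, max_passes, rng)
        fit = evaluate(seq)
        if fit < best_fit:
            best_seq, best_fit = seq, fit
    return best_seq, best_fit
```

With the Figure 1 weights expressed in 28ths ({-block-freq: 6, -dce: 3, -early-cse: 10, -indvars: 6, -reg2mem: 3}), next_node for -loop-unroll and a random value of 0.75 returns -indvars, matching the worked example in Section 2.3.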

e.g., select the pass most frequently used as first pass in previously generated sequences.

Then, new compiler passes are added to the current candidate sequence by following graph connections using function append(newSeq, newPass) (line 9), until the sequence is composed of the maximum allowed number M of compiler passes, or until nextNode(G, last(newSeq)) (line 6) does not return a new pass, which happens if there are no connections in the graph from the node representing the compiler pass returned by last(newSeq) to other nodes. In case the graph was built from a set of compiler sequences (see Section 2.1), this represents the situations when a compiler pass is always in the rightmost/last position of the considered compiler sequences.

The function nextNode(G, last(newSeq)) selects a new node for appending to the new candidate sequence by generating a random real number from 0 to 1, looking at the weights of the connections from the node corresponding to last(newSeq) to other nodes, and selecting a next node (corresponding to a single compiler pass) based on the random number and the probability distribution represented by the weights.

Function evaluate(newSeq) (line 10) evaluates the new sequence by compiling and testing (i.e., executing/simulating) a given program/function, and returns a fitness value (e.g., CPU cycles for execution, energy consumption). Given an objective function (independent of the DSE algorithm and defined/selected by the user), isNewSeqBetter(newFit, bestFit) returns true if the fitness value for the new sequence is better than bestFit (the best fitness value until this point), and false otherwise. If isNewSeqBetter(newFit, bestFit) evaluates to true, then the new sequence newSeq and the fitness value newFit are stored as the best sequence, bestSeq, and the best fitness value, bestFit, respectively. At termination, the algorithm returns bestSeq, the best found sequence (line 14). The function getStartFitness() (line 2) returns a very large positive or negative number, depending on whether the purpose of the DSE is to minimize or maximize the fitness value bestFit, respectively.

2.3 Example of Pass Selection Using the Graph

Let us suppose that, during sequence generation (for loop in line 5 of Algorithm 1), the sequence being generated is currently composed of the following compiler passes, in the following order: loop-rotate, basicaa, loop-reduce, loop-unroll. Then, when selecting the next compiler pass to append to the sequence, the graph is inspected in order to suggest the compiler pass to be executed after the loop-unroll compiler pass. Figure 1 shows part of an example graph representing the loop-unroll node and the nodes it connects to. The weights of the connections represent the probability that a given compiler pass is chosen to be appended to the compiler sequence after the loop-unroll compiler pass. The next compiler pass to append to the compiler sequence being generated depends on a random number between 0 and 1 and on the probability distribution generated from the weights of the out-edges connecting the loop-unroll node to other nodes, depicted in Figure 2. As an example, if the random number equals 0.75, then the indvars compiler pass is selected, resulting in the sequence: loop-rotate, basicaa, loop-reduce, loop-unroll, indvars.

3. Experiments

We tested our new approach targeting a LEON3 [8] core with LLVM 3.7, using the LLVM Optimizer (to optimize the LLVM assembly IR generated by the Clang frontend) and the SPARC backend (with the flag for the V8 architecture) of the LLVM static compiler. The selection of the compiler passes and their execution order with the LLVM framework is accomplished by passing the flags representing the passes, in a specific order, to the LLVM Optimizer tool (a.k.a. opt). After the LLVM IR is transformed by the execution of the compiler passes in the requested order, the DSE system calls the LLVM static compiler (a.k.a. llc) to generate assembly code, which is then linked with the GNU linker from the LEON Bare-C Cross Compilation System.

Specialization of compiler phase orders is performed at the function level. Each function to optimize goes through a separate compilation flow, which includes the search for, and the compilation with, a specialized compiler sequence.

In these experiments we use TSIM2 [9] to simulate a LEON3 core with cycle accuracy. We use the default TSIM2 configuration, which configures a LEON3 core with 4,096 KB of SRAM in a single bank, 32 MB of SDRAM in a single bank, and 2,048 KB of ROM.

We use a set of programs/functions targeting embedded systems from two Texas Instruments C libraries, DSPLIB (DSP Signal Processing Library) [10] and IMGLIB (DSP Image/Video Processing Library) [11]; and a set of 11 C functions (adpcm_code, adpcm_deco, autcor, bubble_sort, dotprod, fdct, fibonacci, max, min, popcnt, sobel) of the embedded systems domain used as reference

for our clustering approach presented in [5, 12]. Table 1 presents a description of the Texas Instruments functions, including lines of code.

Experiments were executed on common x86-64 workstation hardware running Ubuntu 14.04 64-bit, serving as host for the exploration process, the compiler toolchain, and the simulator.

For the experiments presented in this paper, the compiler pass graph instances are built from compiler sequences found using a simulated-annealing-based exploration process, considering all LLVM Optimizer (a.k.a. opt) compiler passes except the ones related to visualization (e.g., view-*), printing (e.g., print-*), generating dot graphs (dot-*), and compiler passes that require loading external information (e.g., insert-gcov-profiling) or the use of external modules (e.g., asan, tsan). Note, however, that we could have used sequences found using another iterative algorithm (e.g., a Genetic Algorithm).

We validated our approach with two experiments. First, we performed leave-one-out cross-validation using only the Texas Instruments functions; i.e., 42 different instances of the graph were created using 41 compiler sequences each (all sequences except the one for the function being compiled). We used the same graph instances both in our new graph-based approach and as the structure to guide an implementation of the SA-based algorithm described in Section 3.3.4. Second, and last, we used our new DSE method to search for compiler sequences for the same Texas Instruments functions using a single graph instance built using sequences previously found for the set of 11 additional functions.

For the leave-one-out experiments, the number of nodes of the graph was 140 and the average number of connections was 2,353 (minimum: 2,280; maximum: 2,402). In the case of the second set of experiments, the graph has 98 nodes and 305 connections.

We compared the new graph-based approach with other algorithms, which we implemented in our DSE infrastructure [12]. The algorithms we considered for the experiments, in order to compare with the approach presented in this paper, include a sequential-insertion-based algorithm [13], a GA-based algorithm from [5], an SA-based algorithm from [4], and an SA-based algorithm from [6] that relies on compiler pass positional historical information to guide the algorithm.

3.1 DSE Framework

We rely on an in-house modular DSE infrastructure. Exploration schemes, objective functions, and target-specific exploration and configuration parameters are programmed with the LARA aspect-oriented programming language [14]. Modularity is assured by the fact that a new DSE infrastructure component (e.g., a new objective function) can be paired with other components from other types (e.g., available targets and available exploration schemes).

Instead of relying on code annotations, code transformations and compiler mapping strategies are described in a separate file, allowing the reuse of these transformation/mapping specifications across different sources/applications and/or targeting multiple platforms/requirements using a single annotation-free source code. DSE algorithms are programmed in a completely separate way and are independent from the objective functions, from the compiler toolchain used, and from the simulator calls or hardware interfacing (e.g., sending code to and getting results from a board through a special link or a TTL-232R cable) specific to the target platform.

The DSE infrastructure can be extended to use any compilation toolchain [15]. In this paper we use a version of the toolflow targeting Clang/LLVM [16, 17]. The framework provides means to abstract calls to the individual tools of a given toolchain (e.g., the LLVM Optimizer, the LLVM static compiler) from inside the LARA execution environment with a given set of parameters (e.g., compiler flags).

We have developed an interface between the LARA framework and LLVM which allows calling the Clang frontend, the LLVM Optimizer, and the LLVM static compiler from LARA aspects. Relying on a simple interface, a LARA programmer can access the Clang/LLVM toolchain and instruct the compilation of any given source code considering the execution of a sequence of compiler passes in an arbitrary order.

The generated code is verified by representative tests and/or other currently feasible verification schemes. Developers are open to use optimized versions using non-standard compiler sequences, knowing that they were possibly not previously tested, or were tested to a limited extent.

3.2 DSE Parameters

When comparing the algorithms, we considered the execution of each DSE algorithm with the parameters previously explained (see Section 4) for different numbers of iterations; more specifically, 10, 100, 1,000, 10,000 and 100,000 compilations/simulations. In the case of Sequential Insertion and the GA-based method, we consider neither the execution for 10 nor for 100 iterations, as they represent too low a number of iterations for those algorithms. They are

Table 1. Description and number of lines of code of Texas Instruments DSP and IMG functions (descriptions based on comments in the original code).

Function               Description                                        CLOC
DSP_autocor            Autocorrelation of an input vector.                  11
DSP_blk_eswap16        Endian-swap a block of 16-bit values.                27
DSP_blk_eswap32        Endian-swap a block of 32-bit values.                31
DSP_blk_eswap64        Endian-swap a block of 64-bit values.                39
DSP_blk_move           Move a block of memory.                              13
DSP_dotprod            Vector product of two input arrays.                   7
DSP_dotp_sqr           Dot product of two arrays.                           17
DSP_fir_cplx           Complex FIR.                                         24
DSP_firlms2            Least Mean Square Adaptive Filter.                   17
DSP_fltoq15            Convert IEEE FP into Q.15 format.                    16
DSP_mat_mul            Matrix multiply.                                     19
DSP_mat_trans          Transposes a matrix of 16-bit values.                 8
DSP_maxidx             Finds the largest element in an array.               12
DSP_maxval             Finds the maximum value of a vector.                  9
DSP_minerror           Minimum Energy Error Search.                         23
DSP_minval             Finds the minimum value of a vector.                  9
DSP_mul32              32-bit multiply.                                     25
DSP_neg32              32-bit vector negate.                                11
DSP_q15tofl            Q.15 to IEEE float conversion.                        6
DSP_vecsumsq           Sum of squares.                                      15
DSP_w_vec              Weighted vector sum.                                 13
IMG_boundary           Returns coordinates of boundary pixels.              17
IMG_conv_3x3           3x3 convolution.                                     42
IMG_corr_gen           Generalized correlation with a 1xM tap filter.       16
IMG_dilate_bin         3x3 binary dilation.                                 47
IMG_erode_bin          3x3 binary erosion.                                  47
IMG_fdct_8x8           8x8 block FDCT with rounding.                       116
IMG_idct_8x8_12q4      IEEE-1180/1990 compliant IDCT.                      121
IMG_mad_8x8            8x8 block Minimum Absolute Difference.               30
IMG_median_3x3         3x3 median filter on 8-bit unsigned values.          43
IMG_perimeter          Returns the boundary pixels of an image.             31
IMG_pix_expand         8-bit unsigned to 16-bit array.                      11
IMG_pix_sat            16-bit signed numbers to 8-bit unsigned.             23
IMG_quantize           Matrix quantization with rounding.                   27
IMG_sad_16x16          16x16 Sum of Absolute Differences.                   14
IMG_sad_8x8            8x8 Sum of Absolute Differences.                     14
IMG_sobel              Sobel filter.                                        27
IMG_wave_horz          Orthogonal wavelet decomposition.                    25
IMG_wave_vert          Compute vertical wavelet transform.                  27
IMG_ycbcr422p_rgb565   YCbCr 4:2:2/4:2:0 to 16-bit RGB 5:6:5.               61
IMG_yc_demux_be16      De-interleave a 4:2:2 BIG ENDIAN stream into         18
                       LITTLE ENDIAN 16-bit planes.
IMG_yc_demux_le16      De-interleave a 4:2:2 LITTLE ENDIAN stream into      18
                       BIG ENDIAN 16-bit planes.
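The leave-one-out protocol described in Section 3 (42 graph instances, each built from the 41 remaining sequences) reduces to a few lines. This is an illustrative sketch of ours, not the authors' code: best_sequence_for and build_graph are hypothetical stand-ins for the per-function sequences previously found by the SA-based process and for the graph builder of Section 2.1.

```python
def leave_one_out_graphs(best_sequence_for, build_graph):
    """For each function, build a graph instance only from the sequences
    previously found for the *other* functions, so no graph is trained on
    the function it is later used to compile."""
    graphs = {}
    for held_out in best_sequence_for:
        training = [seq for name, seq in best_sequence_for.items()
                    if name != held_out]
        graphs[held_out] = build_graph(training)
    return graphs
```

With 42 entries in best_sequence_for, this yields 42 graphs, each trained on 41 sequences, exactly as in the cross-validation described above.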

neither enough iterations for Sequential Insertion to consider the insertion of all compiler passes, nor enough iterations for the GA-based method, considering that we have chosen the initial population to be composed of 300 compiler sequences. In the case of the Sequential Insertion algorithm, the number of iterations does not represent the total, but instead the maximum number of iterations, as one of the stopping conditions of the algorithm is considering all compiler passes for insertion, in all positions of the candidate sequence, without any insertion.

A reduction in exploration time is expected at the expense of disregarding some compiler sequences, which may in some cases result in not allowing the DSE algorithm to achieve possible close-to-optimum results. One of our goals is to measure how much exploration time can be reduced without sacrificing too much the quality of the solutions when using the approach presented in this paper.

The maximum sequence length was set to 128; i.e., the generated sequences are composed of up to 128 compiler passes. We select a maximum of 128 compiler pass instances (compiler passes can appear multiple times in the same sequence) from the graph, which has 140 nodes, corresponding to the LLVM passes represented in Table 2.

The target of our exploration processes is performance. Therefore the objective function passed to the DSE methods consists of minimizing the number of CPU clock cycles needed to execute a particular function.

The GA-based, the SA-based, and the graph-based exploration algorithms are stochastic methods. When using those exploration

For the experiments presented in this paper, the first node visited in the graph is the one corresponding to the compiler pass most frequently present in the compiler sequences used to generate the graph, i.e., the loop-rotate compiler pass.

3.3 Baseline Algorithms and Specific Parameters

We describe next the algorithms used for comparison, in terms of efficiency, with our new approach.

3.3.1 Sequential Insertion

The sequential insertion algorithm is an implementation of the algorithm presented in [13]. The algorithm consists of individually testing the insertion of compiler passes (each iteration considers a distinct pass) in all positions of a candidate sequence, and accepting the configuration (i.e., compiler sequence phase order) that results in better optimizing the target metric, if the insertion of a given compiler pass in any position results in better code. The list of compiler passes to consider for insertion can be traversed multiple times until algorithm termination, which happens when the maximum number of iterations is reached or if the list of compiler passes to consider for insertion is completely traversed without finding a better sequence than the candidate sequence.

3.3.2 Genetic Algorithms

The Genetic Algorithm (GA) [18] is a well-known meta-heuristic usually applied to a variety of search problems. It is an evolutionary search method which consists in generating an initial population of random solutions with a subsequent iterative evolution of their individuals, determined by their evaluation and ranking. Each evolutionary step is called a generation and is formed by simple procedures: (i) the selection of the parent solutions (sampled pairs of solutions); (ii) the application of the crossover and mutation operators; (iii) the evaluation of the new solutions generated after the operators; and (iv) the reinsertion procedure, which decides the survivors for the next generation. This iterative process is performed until a stopping criterion is achieved.

In the context of compiler optimization sequence search, each solution is a sequence in the search space represented as an array of compiler passes. We used the GA with the same configuration as described in [5, 12].

3.3.3 Simulated Annealing

Simulated Annealing (SA) [19] is an effective and practical algorithm for optimization problems. It is especially suitable in cases where the design space is too large for an exhaustive approach, which is the case when exploring compiler sequences resulting from the combination of tens or hundreds of compiler passes.

SA is not memory intensive, as it does not need to keep information about the last explored design points, and it can escape the trap of local optima by accepting worse solutions with a probability that starts fairly high and lowers as the algorithm gets closer to termination. The probability that a "bad move" is accepted decreases with each successive loop iteration of the algorithm.

In our implementation of the SA algorithm, the next optimization sequence to be tested for latency is generated in each iteration by the following perturbation/transformation rule: insert a new compiler pass (from the set of compiler passes to consider for exploration) in a random position of the current candidate compiler sequence, in case the candidate compiler sequence is smaller than a maximum considered length; or replace a compiler pass in a random position of the candidate compiler sequence with a new

Table 2. LLVM Optimizer compiler passes used for exploration.

-aa-eval            -adce               -add-dis.           -alig.-f.-ass.
-alloca-hoisting    -always-inline      -argpromotion       -ass.-cache-track.
-atomic-expand      -barrier            -basicaa            -basiccg
-bb-vectorize       -bdce               -block-freq         -bounds-checking
-branch-prob        -break-crit-edg.    -cfl-aa             -codegenprepare
-consthoist         -constmerge         -constprop          -correlated-prop.
-cost-model         -count-aa           -da                 -dce
-deadargelim        -debug-aa           -delinearize        -die
-divergence         -domfrontier        -domtree            -dse
-early-cse          -elim-avail-ext.    -extract-blocks     -flattencfg
-float2int          -functionattrs      -globaldce          -globalopt
-globalsmodref-aa   -gvn                -indvars            -inline
-inline-cost        -instcombine        -instcount          -instnamer
-instrprof          -instsimplify       -intervals          -ipconstprop
-ipsccp             -irce               -iv-users           -jump-threading
-lazy-value-info    -lcssa              -libcall-aa         -licm
-lint               -load-combine       -loop-accesses      -loop-deletion
-loop-distribute    -loop-extract       -loop-ex.-single    -loop-idiom
-loop-instsimpl.    -loop-interchan.    -loop-reduce        -loop-reroll
-loop-rotate        -loop-simplify      -loop-unroll        -loop-unswitch
-loop-vectorize     -loops              -lower-expect       -loweratomic
-lowerbitsets       -lowerinvoke        -lowerswitch        -mem2reg
-memcpyopt          -memdep             -mergefunc          -mergereturn
-mldst-motion       -mod.-debuginfo     -nary-reass.        -no-aa
-objc-arc           -objc-arc-aa        -objc-arc-apelim    -objc-arc-contrac.
-objc-arc-expand    -pa-eval            -part.-inliner      -part.-inl.-libcal.
-pl.-ba.-safe.-im.  -place-safep.       -postdomtree        -prune-eh
-reassociate        -reg2mem            -regions            -rewr.-sta.-for-gc
-rewrite-symbols    -safe-stack         -sancov             -scalar-evolution
-scalarizer         -scalarrepl         -scalarrepl-ssa     -sccp
-scev-aa            -scoped-noalias     -s.-c.-o.-f.-gep    -simplifycfg
-sink               -slp-vectorizer     -slsr               -spec.-execution
-sroa               -strip              -str.-dead-d.-info  -str.-d.-proto.
-strip-d.-declare   -strip-nondebug     -structurizecfg     -tailcallelim
-targetlibinfo      -tbaa               -tti                -verify
methods, we executed the algorithm with the same number of compiler pass. All positions in the compiler sequence, and all com-
iterations for three times and registered the geometric mean of the piler passes from the considered compiler passes list, have equal
performance speedups achieved with each individual execution. probability of being selected.
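The perturbation rule just described can be sketched as follows (an illustrative Python fragment, not the authors' implementation; `passes` stands for the list of compiler passes considered for exploration):

```python
import random

def perturb(seq, passes, max_len=128):
    """One SA perturbation step: insert a randomly chosen pass at a
    random position while the sequence is below the maximum length;
    otherwise replace the pass at a random position. All positions
    and all passes are selected with equal probability."""
    seq = list(seq)
    new_pass = random.choice(passes)           # uniform over all passes
    if len(seq) < max_len:
        pos = random.randint(0, len(seq))      # any gap, including the end
        seq.insert(pos, new_pass)
    else:
        pos = random.randrange(len(seq))       # any existing position
        seq[pos] = new_pass
    return seq
```

The returned candidate is then compiled and simulated, and the usual SA acceptance test decides whether it replaces the current sequence.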
No single set of values for the SA parameters is the best (or even an acceptable) choice for all cases when exploring compiler sequences for each input program code, and therefore a careful selection of them is of utmost importance. Thus, we adopted a strategy where the parameters are automatically set at the start of the execution of the algorithm, as described in [4].

3.3.4 Extended Simulated Annealing

The only difference between the two SA-based DSE approaches is that the extended SA relies on compiler pass positional information to limit the number of compiler passes that are considered for insertion/replacement between two other passes in the candidate sequence, as in [6].

The SA approach previously introduced, when replacing or inserting (i.e., during the initial phase of exploration, when the maximum allowed compiler sequence length is yet to be reached), randomly selects a compiler pass from the list of compiler passes considered for exploration. With information about which compiler passes usually work well immediately before and/or immediately after others, the replacement or insertion at any given position of the current candidate compiler sequence (the one that is currently accepted by the iterative exploration algorithm) can be disregarded if it represents an option not present in the structure holding that information. One of the allowed compiler passes at the insertion/replacement position is randomly chosen, while respecting the probability distribution outputted by the compiler pass position data structure (in a similar way as described in Section 2.3).

4. Results

We present and analyze herein the results we achieved from executing our new exploration algorithm with different numbers of iterations, and compare those results with the ones achieved by using the algorithms described in the previous section. We also characterize the graph structures used in the context of validating our new approach using the leave-one-out method. We used the performance (i.e., measured in number of CPU cycles) resulting from optimization with the best LLVM Optimizer –OX flag individually found for each function as the baseline for calculating the individual function speedups when using the specialized sequences found with DSE.

We rely on the same graph structure as described in this paper to guide the implementation of the SA extended with compiler pass positioning information (see Section 3.3.4). Herein we refer to this algorithm as SA+Graph.

4.1 Leave-One-Out Validation

Here we present the results achieved using the leave-one-out cross-validation method with the Texas Instruments functions.

4.1.1 Individual Performance Speedups

Figures 3 and 4 present individual performance speedups obtained with 1,000 and 100,000 compile/simulate cycles. Compared with the other functions, DSP_mat_trans had the highest performance improvement. The lowest speedups are obtained for IMG_wave_vert, for all algorithms except SA+Graph (whose lowest speedup was registered with IMG_pix_expand).

For 100,000 iterations, the maximum individual function speedups are 1.85×, 2.22×, 2.13×, 2.24×, 2.02×, and the minimum are 1.02×, 1.02×, 1.03×, 1.06×, 1.05×, for the Sequential Insertion, GA, SA, SA+Graph and Graph algorithms, respectively. As the number of iterations decreases, the maximum and minimum speedups decrease. For 1,000 iterations the maximum speedups are 1.59×, 1.90×, 1.87×, 1.90×, 1.92×, and the minimum speedups are 0.55×, 0.82×, 0.78×, 0.97×, 1.01×, for the Sequential Insertion, GA, SA, SA+Graph and Graph algorithms, respectively.

4.1.2 Geometric Mean Speedups

Figure 5 presents the geometric mean speedups over the best per function chosen –OX LLVM optimization level for each DSE algorithm.

Sequential Insertion always resulted in the worst geometric mean performance speedup. GA and SA tend to result in compiled code with similar performance. The former results in better compiled code than the latter (1.14× vs. 1.12×) when the number of iterations is set to 1,000; the inverse is true for 10,000 iterations (1.20× vs. 1.21×) and 100,000 iterations (1.22× vs. 1.25×). SA+Graph always results in a higher geometric mean speedup than the other SA-based method. The new Graph-based exploration algorithm results in the highest geometric mean speedup for all numbers of iterations considered up to 10,000 iterations.

For 100,000 compile/simulate cycles, the geometric mean speedup is 1.16× for Sequential Insertion, 1.22× for GA, 1.25× for SA, 1.31× for SA+Graph and 1.28× for the new approach. For 100 iterations, the new approach resulted in a speedup of 1.16×, while SA and SA+Graph resulted in speedups of 0.84× and 1.00×. With 10 iterations, only the new algorithm achieved a geometric mean speedup above 1 over the best –OX, with a resulting speedup of 1.08×.

Given a number of compile/simulate cycles, the GA-based and the non-extended SA-based approach resulted in sequences that, when passed to the LLVM optimizer tool, result in compiled code with very similar performance. This may indicate that both approaches are well tuned, and that further increases in efficiency when using these algorithms can only be achieved by adding clever heuristics, such as the use of the graph structure representing compiler pass transitions to guide the new graph-based approach.

4.1.3 Best and Worst Speedups

Figure 6 depicts the geometric mean speedups calculated using the top 10/20 and worst 10/20 individual function speedups. Considering only the 10 or 20 best individual function speedups, the geometric mean speedups increase substantially.

Using the new graph-based approach, when considering the best 10/20 individual speedups, the geometric mean speedups increase to 1.31×/1.20× (best 10 / best 20), 1.45×/1.29×, 1.50×/1.34×, 1.62×/1.42× and 1.77×/1.49× (for 10, 100, 1,000, 10,000 and 100,000 iterations, respectively), over the original speedups of 1.08×, 1.16×, 1.20×, 1.24×, and 1.28×.

4.2 Validation with Reference Functions

Figure 7 presents the geometric mean speedups achieved for the Texas Instruments functions by our new graph-based algorithm over the best per function chosen –OX LLVM optimization level, when the graph is built using 11 sequences previously found for the set of 11 additional functions (see Section 3). We also include the results obtained using the SA+Graph algorithm with the same graph structure.

When relying on the new Graph-based method for 10 or 10,000 iterations, the compiled functions are not as fast as the programs generated in the leave-one-out experiments (1.03× vs. 1.08× and 1.22× vs. 1.24×). For 100, 1,000 and 100,000 iterations, the quality (i.e., performance) of the resulting optimized functions was the same: 1.16×, 1.20× and 1.28×, respectively.

In the case of the SA+Graph method, speedups increase for 10 (0.81× vs. 0.59×), 100 (1.07× vs. 1.00×), 1,000 (1.19× vs. 1.18×) and 10,000 (1.25× vs. 1.24×) iterations, and remain unchanged for 100,000 iterations.
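The geometric mean used to aggregate the per-function speedups reported in this section can be computed as in the following sketch (illustrative Python, not part of the original evaluation setup):

```python
import math

def geomean_speedup(speedups):
    """Aggregate per-function speedups with the geometric mean,
    the statistic used for the summary numbers in this section."""
    assert speedups, "need at least one speedup value"
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

A 2× speedup on one function and a slowdown to 0.5× on another average out to exactly 1.0×, which is why the geometric (rather than arithmetic) mean is the appropriate aggregate for ratios.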
[Figure 3: per-function bar chart omitted; legend: Seq. Insertion, GA, SA, SA+Graph, Graph; y-axis: speedup over –OX.]

Figure 3. Individual speedups over the best per function chosen –OX LLVM optimization level for each DSE algorithm, when executing 1,000 compilations/simulations.

[Figure 4: per-function bar chart omitted; legend: Seq. Insertion, GA, SA, SA+Graph, Graph; y-axis: speedup over –OX.]

Figure 4. Individual speedups over the best per function chosen –OX LLVM optimization level for each DSE algorithm, when executing 100,000 compilations/simulations.

[Figures 5 and 7: line charts omitted; legend: Seq. Insertion, GA, SA, SA+Graph, Graph; x-axis: # iterations; y-axis: speedup over –OX.]

Figure 5. Geometric mean speedups over the best per function chosen –OX LLVM optimization level for each DSE algorithm.

Figure 7. Geometric mean speedups over the best per function chosen –OX LLVM optimization level.

[Figure 6: bar chart omitted; legend: Seq. Insertion, GA, SA, SA+Graph, Graph; groups: Best 10, Best 20, Worst 10, Worst 20; x-axis: # iterations; y-axis: speedup over –OX.]

Figure 6. Geometric mean speedups considering the best/worst 10 and 20 individual function speedups for different numbers of iterations and different algorithms.

5. Challenges and Limitations

We discuss next some of the general limitations of our approach to DSE in the context of compiler pass phase ordering, and limitations specific to our new graph-based exploration method.

5.1 General Limitations

The exploration of compiler sequences based on iterative approaches requires the compilation and simulation/execution of the program or parts of the program one wants to improve regarding the given objective metric(s). The simulation and/or execution of some programs/kernels can take too long with representative input parameters, thus posing a challenge to searching for compiler phase orders for some applications. The exploration time of a given
iterative optimization phase order exploration algorithm is directly proportional to the time it takes for a single compilation and simulation/execution step. Approaches to reduce the simulation/execution time have been used (out of the scope of this work), such as the use of a smaller but still representative set of input parameters or the use of techniques that estimate metrics (e.g., performance, energy, power) based on statistical analysis of the source code or the IR.

For some programs/kernels, the configuration of the suitable optimization phase orders may change with the input parameters. Such examples include programs/kernels with different flows depending on input parameters, or programs/kernels that have a different performance depending on the shape of the input parameters. The use of a specialized optimization phase order found when using one set of input parameters may not result in the same improvement over the standard optimization levels of the compiler when executing the program/kernel with a different set of input parameters. In case different parameters change the way the application behaves (i.e., the computations) and/or how effectively the platform where it executes is used, it might be recommended to search for different compiler phase orders for each case.

Compilers must generate functionally equivalent code. When specialized compiler pass sequences are used, there is always the risk of bugs in any given compiler pass not previously detected by the battery of tests performed by compiler developers. We experimentally found that this is especially the case when generating large optimization sequences. Without going into formal verification of the optimized function/program by asserting its equivalence to a non-optimized version of the function/program, which could be performed at the IR or assembly level, the only option left is to verify the output of the optimized program/function for one or multiple sets of input parameters. Although we currently use the latter approach, we are looking into possible formal verification approaches.

5.2 Limitations of the New Graph-based Algorithm

The new approach is more likely than the other DSE approaches tested to generate sequences that result in wrong compiled code, a compiler crash, or in exceeding the imposed simulation time limit. We experimentally found that, when compiling some functions, the number of generated sequences that lead to one of those situations (which we are able to detect using our DSE system) is considerably higher than for the sequences generated by any of the other algorithms. As an example, when exploring sequences for the DSP_blk_move function with 100,000 iterations, the use of the new graph-based approach generated (for one of the three executions) 29,168 sequences with problems (close to 30%), while that number was 17, 1,173, 1,499 and 5,046 for Sequential Insertion, GA, SA and SA+Graph, respectively.

Considering a given number of algorithm iterations, we look at this as an opportunity to improve the quality of the sequences generated by our new graph-based approach (i.e., to produce code better optimized regarding the given objective function) by detecting sequences resulting in errors and avoiding wasting resources testing them. As compilation and simulation take most of the resources in the DSE of compiler sequences relying on those two steps (the execution of the DSE algorithm logic is computationally insignificant compared with the compilation or simulation steps), the avoidance of compile/simulate steps for sequences resulting in errors is important and will allow the DSE to explore additional sequences given a timing constraint.

A possibility to deal with this is to characterize what constitutes bad sequences. This can be done in a first phase based on identifying which sequences and/or subsequences always lead to problems, and in a second step based on a correlation between dynamic/static program features and problems emerging when using some sequences/subsequences with code having those features.

6. Related Work

Given the large number of optimizations present in the most commonly used compilers (e.g., GCC [1], Clang/LLVM [16, 17]), compiler optimization selection and phase ordering are hot topics in compiler research, specifically when considering techniques and methods that automatically and efficiently explore different compiler optimization selections and/or phase orders. A number of such techniques and methods rely on heuristics for pruning the design space by reducing the number of considered compiler phase selections/orders without affecting the quality of the generated solutions. Some others rely on predictive models based on relations between static/dynamic program features, the target platform and compiler pass interdependencies specific to the compiler version used.

We present related work concerning enumeration-based approaches (also known as iterative approaches) and machine-learning based approaches, which rely on predictive models, for compiler optimization selection and/or phase ordering.

6.1 Enumeration-based Approaches for Phase Ordering

Cooper et al. [20] explore phase orders at program-level with randomized search algorithms based on genetic algorithms, hill climbers and randomized sampling. They target a simulated abstract RISC-based processor with a research compiler, and report properties of several of the generated sub-spaces of phase ordering and the consequences of those properties for the search algorithms.

Almagor et al. [2] rely on Genetic Algorithms (GAs), hill climbers, and greedy constructive algorithms to explore compiler phase ordering at program-level. With 200 to 4,550 compilations, their approach can find custom sequences that are 15% to 25% better than the human-designed fixed sequence originally used by the compiler when targeting a SPARC processor.

Kulkarni et al. [21] propose GAs to iteratively explore compiler pass sequences for improving performance at function-level, targeting an Intel StrongARM SA-100 processor. In this work, 15 compiler passes (including loop unrolling) of the Very Portable Optimizer [32] were considered for exploration. Two approaches for achieving faster searches when using GAs are presented. They improve exploration efficiency by avoiding unnecessary executions and by modifying the search, resulting in average search time reductions of 62% and in a reduction of average GA generations by 59%. Additional techniques to prune the exploration space are presented in [22], culminating in a novel search algorithm which requires only a single program simulation for evaluating the performance of potentially unique sequences for each function [23].

Huang et al. [13] analyzed three sequential insertion based iterative approaches for compiler optimization phase ordering in the context of hardware compilation targeting FPGAs and in the context of high-level synthesis, considering 41 compiler passes and variable sequence length. Improvements regarding clock cycle latencies were between 10% and 17%, depending on the algorithm used.

Purini et al. [24] present an approach which relies on a list of compiler sequences previously found for a representative set of programs. Given a new program, each of these compiler sequences is tested and the one leading to better performance is used to compile the new program. The approach is tested considering 62 machine-independent LLVM 3.0 compiler passes when generating the list of compiler sequences considered for testing with new programs. Results show an average speedup of up to 14% when targeting an Intel Xeon W35550.

Compiler pass positioning information has been used to guide sequential insertion and simulated annealing based algorithms,
achieving geometric mean performance improvements of 1.23× and 1.20× when targeting the MicroBlaze processor using the CoSy-based REFLECTC compiler and considering 49 compiler passes, and 1.94× and 2.65× when targeting the LEON3 processor using LLVM 3.5 and considering 215 compiler passes, for each of the two exploration algorithms and two kernel sets considered. In previous work [6] we presented an approach that uses a structure capable of representing the same information as the graph used here to guide a simulated annealing algorithm. Our new approach differs in the way the graph is used to iteratively generate new sequences and, as evidenced by the results of the experiments presented in this paper, it is able to consistently achieve optimized code with better performance than both the extended sequential insertion and the extended SA-based approach presented in [6] when considering up to 10,000 compile/simulate iterations.

6.2 Enumeration-based Approaches for Phase Selection

Compiler optimization phase selection has been addressed by a number of authors. E.g., Chen et al. [3] experimentally demonstrate, using GCC and Intel ICC targeting an Intel Xeon E3110 processor, that there exists at least one program-level compiler optimization phase selection that achieves 83% or more of the best possible speedup across 1,000 different datasets for each of the 32 programs considered. Relying on a random probing based DSE method, the optimal program-specific combination yields performance improvements of up to 3.75×, averaged across all datasets, over –O3 and -fast in GCC and ICC, respectively. These results were obtained considering 300 different compiler option phase selection configurations, and considering 132 compiler options/optimizations (including loop unrolling and vectorization) when using GCC. The authors suggest adding a compiler selection phase (e.g., use GCC or ICC) when further performance improvements are required.

In the context of Java VMs, Jantz and Kulkarni [25] present an iterative approach relying on GAs for optimization of the steady-state performance of automatically detected hot-spot functions. In the experiments they considered a set of 28 general optimizations. The phase selections found using the GA resulted in average steady-state performance speedups of 6.2% and 4.3%, per-function and whole-program, respectively.

6.3 Machine Learning-based and Hybrid Approaches

Agakov et al. [26] present a methodology to reduce the number of evaluations of the program. Models are generated taking into account program features (30 features reduced to 5 using principal component analysis) and the shapes of compiler sequence spaces generated from iteratively evaluating a training set of programs. These models are then used by the iterative exploration for a new program. They present results concerning the evaluation of two distinct models, an independent identically distributed model and a stationary Markov model, when compiling with the SUIF source-to-source compiler coupled with Code Composer and GCC, for generating code for the TI C6713 and AMD Au1500 embedded processors. The two models are tested with GAs in order to determine how much of the design space can be pruned by the proposed approach. Experimental results using the leave-one-out method show the exploration process can be accelerated by an order of magnitude, with no negative impact on the performance of the generated code.

A GCC-based framework, named Milepost GCC [27], is used to automatically extract program features and learn the best optimizations across programs and architectures. The framework uses a probabilistic model that correlates new program static features with the closest ones seen earlier in order to suggest a custom selection of compiler optimizations when targeting Intel and AMD x86 processors. Compiler phase selection is performed according to program features and each program is classified regarding 56 features. The framework is tested considering 88 compiler passes for exploration when targeting Intel and AMD x86 processors.

Ashouri et al. [7] propose a model using Bayesian networks to correlate machine-independent features extracted at runtime with compiler pass selection configurations previously found for a set of programs. Given a new program, the model generates a probability distribution, representing different probabilities for the different considered compiler options, which is used to introduce a bias when sampling the search space. Experiments targeted an ARMv7 Cortex-A9 (TI OMAP 4430) with GCC, and resulted in performance speedups of up to 2.8× (1.5× on average) with respect to –O2 and –O3, and a 3× speedup in search time in comparison with an iterative approach.

Sher et al. [28] describe a compilation system that relies on evolutionary neural networks for phase ordering exploration using LLVM and taking into account features of functions and/or programs. The neural networks output a set of probabilities of use for each compiler pass, which is then sampled a number of times to generate different compiler sequences. The neural networks use 48 and 44 features as input for the program- and function-level approaches, respectively. The system was able to find compiler sequences resulting in performance improvements between 5% and 50% on an Intel Core i7, considering 53 (program-level) and 34 (function-level) LLVM compiler passes for exploration.

Martins et al. [5, 12] proposed a clustering method to reduce the exploration space in the context of compiler pass phase selection and order exploration. Performing clustering on top of source code representations generated with a fingerprinting method allows the classification of a new source code into one of the existing clusters. Each cluster has associated with it only the compiler passes that are known to perform well with codes that are represented by similar fingerprints, so that the exploration space (and, as a direct result, the exploration time) is considerably reduced. The approach explored the use of 49 compiler passes of the CoSy-based REFLECTC [15] compiler and of 124 passes when considering the use of LLVM 3.5 [17]. Experimental results reveal that the clustering-based DSE approach achieved a significant reduction of the total exploration time of the search space (18× over a Genetic Algorithm approach for DSE) while important performance speedups (43% over the baseline) were obtained by the optimized codes. This approach is orthogonal to the DSE algorithm presented in this paper.

7. Conclusions

This paper presented a new iterative Design Space Exploration (DSE) approach for suggesting compiler pass sequences that increase performance. The new method directly samples new compiler sequences from a graph representing compiler pass transitions.

The new approach was compared to state-of-the-art iterative approaches for DSE, including a sequential insertion based algorithm, a Genetic Algorithm (GA) and two Simulated Annealing (SA)-based algorithms, one of which extended to rely on the same graph structure. We used as objective metric the number of clock cycles a function needs to execute on a simulated LEON3 processor core. The results strongly show that our approach significantly outperforms those approaches, as it is able to achieve phase orders resulting in similar performance with a significant reduction of the number of iterations, especially when considering 100 or fewer exploration points. When targeting a set of 42 image and digital signal processing functions to a LEON3, the new algorithm is able to consistently find a compiler phase order better than the best LLVM standard optimization levels in only 10 compile/simulate iterations,
while none of the evaluated algorithms were able to achieve that. Executing for only 100 iterations, our algorithm suggested compiler phase orders achieving a geometric mean speedup of 1.16× (in both the cross-validated and non-cross-validated experiments) over the best individually (i.e., per function) found –OX flag, while the closest method tested was SA+Graph, with a 1.07× speedup in the non-cross-validated experiments.

Ongoing work is focused on schemes to prune the design space by reducing the number of nodes (i.e., compiler passes) and/or connections (i.e., legal transitions) of the graph structure. We are also measuring the performance impact of using rules to restrict the occurrence of repetitions of subsequences in the sequences suggested by our system. We are also evaluating the contribution of the DSE algorithm presented in this paper to further reducing DSE execution time when used with a clustering-based approach such as the one described in [5, 12], by removing the compiler passes (represented by nodes in the graph) that are not associated with the cluster selected for a given input program/function.

Acknowledgments

This work was partially supported by FCT (Portuguese Science Foundation) under research grant SFRH/BD/82606/2011, and by the TEC4Growth project, "NORTE-01-0145-FEDER-000020", financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF).

References

[1] GCC, the GNU Compiler Collection, https://www.gnu.org/software/gcc/.

[2] Lelac Almagor, Keith D. Cooper, Alexander Grosul, Timothy J. Harvey, Steven W. Reeves, Devika Subramanian, Linda Torczon, and Todd Waterman, 2004. Finding effective compilation sequences. SIGPLAN Not. 39, 7, 231-239.

[3] Yang Chen, Shuangde Fang, Yuanjie Huang, Lieven Eeckhout, Grigori Fursin, Olivier Temam, and Chengyong Wu, 2012. Deconstructing iterative optimization. ACM Transactions on Architecture and Code Optimization (TACO) 9, 3, 1-30.

[4] Ricardo Nobre, 2013. Identifying sequences of optimizations for HW/SW compilation. In 23rd International Conference on Field Programmable Logic and Applications (FPL), 2013, 1-2.

[5] Luiz G.A. Martins, Ricardo Nobre, Alexandre C.B. Delbem, Eduardo Marques, and João M.P. Cardoso, 2014. Exploration of compiler optimization sequences using clustering-based selection. In Proc. 2014 ACM SIGPLAN/SIGBED Conference on Languages, Compilers and Tools for Embedded Systems (LCTES), 63-72.

[12] Luiz G.A. Martins, Ricardo Nobre, João M.P. Cardoso, Alexandre C.B. Delbem, and Eduardo Marques, 2016. Clustering-Based Selection for the Exploration of Compiler Optimization Sequences. ACM Trans. Archit. Code Optim. 13, 1, Article 8 (March 2016), 28 pages.

[13] Qijing Huang, Ruolong Lian, Andrew Canis, Jongsok Choi, Ryan Xi, Nazanin Calagar, Stephen Brown, and Jason Anderson, 2013. The Effect of Compiler Optimizations on High-Level Synthesis for FPGAs. In IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2013, 89-96.

[14] João M.P. Cardoso, Tiago Carvalho, José G.F. Coutinho, Wayne Luk, Ricardo Nobre, Pedro Diniz, and Zlatko Petrov, 2012. LARA: an aspect-oriented programming language for embedded systems. In Proceedings of the 11th Annual International Conference on Aspect-Oriented Software Development (Potsdam, Germany, 2012), ACM, 179-190.

[15] Ricardo Nobre, João M.P. Cardoso, Bryan Olivier, Razvan Nane, Liam Fitzpatrick, José Gabriel de F. Coutinho, Hans van Someren, Vlad-Mihai Sima, Koen Bertels, and Pedro C. Diniz, 2013. Hardware/Software Compilation. In Compilation and Synthesis for Embedded Reconfigurable Systems, J.M.P. Cardoso, P.C. Diniz, J.G.F. Coutinho and Z.M. Petrov, Eds. Springer New York, 105-134.

[16] clang: a C language family frontend for LLVM, http://clang.llvm.org/.

[17] The LLVM Compiler Infrastructure, http://llvm.org/.

[18] David E. Goldberg, 1989. Genetic Algorithms in Search, Optimization and Machine Learning, 1st ed., Addison-Wesley Longman.

[19] Scott Kirkpatrick, C. D. Gelatt, and Mario P. Vecchi, 1983. Optimization by simulated annealing. Science 220, 671-680.

[20] Keith D. Cooper, Alexander Grosul, Timothy J. Harvey, Steve Reeves, Devika Subramanian, Linda Torczon, and Todd Waterman, 2006. Exploring the structure of the space of compilation sequences using randomized search algorithms. The Journal of Supercomputing 36, 2, 135-151.

[21] Prasad A. Kulkarni, Stephen R. Hines, David B. Whalley, Jason D. Hiser, Jack W. Davidson, and Douglas L. Jones, 2004. Fast searches for effective optimization phase sequences. SIGPLAN Not. 39, 6, 171-182.

[22] Prasad A. Kulkarni, David B. Whalley, Gary S. Tyson, and Jack W. Davidson, 2009. Practical exhaustive optimization phase order exploration and evaluation. ACM Trans. Archit. Code Optim. 6, 1, 1-36.

[23] Prasad A. Kulkarni, Michael R. Jantz, and David B. Whalley, 2010. Improving both the performance benefits and speed of optimization phase sequence searches. SIGPLAN Not. 45, 4, 95-104.

[24] Suresh Purini and Lakshya Jain, 2013. Finding good optimization sequences covering program space. ACM Trans. Archit. Code Optim. 9, 4, 1-23.

[25] Michael R. Jantz and Prasad A. Kulkarni, 2013. Performance potential of optimization phase selection during dynamic JIT compilation.
[6] Ricardo Nobre, Luiz G.A. Martins, and João M.P. Cardoso, 2015. SIGPLAN Not. 48, 7, 131-142.
Use of Previously Acquired Positioning of Optimizations for Phase [26] Felix Agakov, Edwin Bonilla, John Cavazos, Björn Franke, Grig-
Ordering Exploration. In Proc. 18th International Workshop on Soft- ori Fursin, Michael F.P. O’Boyle , John Thomson, Marc Toussaint,
ware and Compilers for Embedded Systems (SCOPES ’15) (Schloss and Christopher K.I. Williams, 2006. Using Machine Learning to
Rheinfels, St. Goar, Germany, June 1-3, 2015). Focus Iterative Optimization. In Proc. International Symposium on
[7] Amir H. Ashouri, Giovanni Mariani, Gianluca Palermo, and Cristina Code Generation and Optimization (2006), IEEE Computer Society,
Silvano, 2014. A Bayesian network approach for compiler auto- 1122412, 295-305.
tuning for embedded processors. In IEEE 12th Symposium on Em- [27] Grigori Fursin, Yuriy Kashnikov, Abdul Wahid Memon, Zbigniew
bedded Systems for Real-time Multimedia (ESTIMedia), 2014, 90- Chamski, Olivier Temam, Mircea Namolaru, Elad Yom-Tov, Bilha
97. Mendelson, Ayal Zaks, Eric Courtois, Francois Bodin, Phil Barnard,
[8] Aeroflex Gaisler, LEON3 Processor, Elton Ashton, Edwin Bonilla, John Thomson, Christopher K. I.
http://www.gaisler.com/index.php/products/processors/leon3. Williams, and Michael OBoyle, 2011. Milepost GCC: Machine
[9] Aeroflex, TSIM2 ERC32/LEON simulator, Learning Enabled Self-tuning Compiler. International Journal of Par-
http://www.gaisler.com/index.php/products/simulators/tsim. allel Programming 39, 3 (2011/06/01), 296-327.
[10] Texas Instruments, 2008. TMS320C64x+ DSP Little-Endian Library [28] Gene Sher, Kyle Martin, and Damian Dechev, 2014. Preliminary
Programmer’s Reference (Rev. B). results for neuroevolutionary optimization phase order generation for
static compilation. In Proc. 11th Workshop on Optimizations for DSP
[11] Texas Instruments, 2008. TMS320C64x+ DSP Image/Video Process- and Embedded Systems (Orlando, Florida, USA, 2014), ACM, 33-
ing Library (v2.0) Programmer’s Reference (Rev. A). 40.