
Learning to Superoptimize Real-world Programs

Alex Shypula¹, Pengcheng Yin¹, Jeremy Lacomis², Claire Le Goues², Edward Schwartz³, Graham Neubig¹

¹Carnegie Mellon University Language Technologies Institute
²Carnegie Mellon University Institute for Software Research
³Carnegie Mellon University Software Engineering Institute

arXiv:2109.13498v1 [cs.LG] 28 Sep 2021

Abstract

Program optimization is the process of modifying software to execute more efficiently. Because finding the optimal program is generally undecidable, modern compilers usually resort to expert-written heuristic optimizations. In contrast, superoptimizers attempt to find the optimal program by employing significantly more expensive search and constraint-solving techniques. Generally, these methods do not scale well to programs in real development scenarios, and as a result superoptimization has largely been confined to small-scale, domain-specific, and/or synthetic program benchmarks. In this paper, we propose a framework to learn to superoptimize real-world programs by using neural sequence-to-sequence models. We introduce the Big Assembly benchmark, a dataset consisting of over 25K real-world functions mined from open-source projects in x86-64 assembly, which enables experimentation on large-scale optimization of real-world programs. We propose an approach, Self Imitation Learning for Optimization (SILO), that is easy to implement and outperforms a standard policy gradient learning approach on our Big Assembly benchmark. SILO superoptimizes an expected 6.2% of our test set when compared with the gcc version 10.3 compiler's aggressive optimization level -O3. We also report that SILO's rate of superoptimization on our test set is over five times that of a standard policy gradient approach and of a model pre-trained on compiler optimization demonstrations.

1 Introduction

Program optimization is a classical problem in computer science that has existed for over 50 years (McKeeman 1965; Allen and Cocke 1971), and most years, entire tracks at compiler conferences are dedicated to it (Pouchet and Jimborean 2020). The standard tool for generating efficient programs is an optimizing compiler that not only converts human-written programs into executable machine code, but also performs a number of semantics-preserving code transformations to increase speed, reduce energy consumption, or improve memory footprint (Aho et al. 2006). Most optimizing compilers use semantics-preserving, heuristic-based optimizations. These optimizing transformations generally need to be written by experts for an individual compiler, and are applied to an intermediate representation of the code produced over the course of transforming high-level code into executable machine code. In an effort to automatically create optimized programs that surpass human-defined heuristics, the research community has pioneered automated optimization methods, or "superoptimizers," based on brute-force search (Massalin 1987), heuristic search (Schkufza, Sharma, and Aiken 2013), and satisfiability modulo theories (SMT) solvers (Sasnauskas et al. 2017). These superoptimizers may outperform compiler-based optimizations, but are difficult to employ in practice.

Recent work by the machine learning community has demonstrated the ability of deep learning models to reason about and even generate code to assist developers in a variety of tasks (Allamanis et al. 2018; Chen et al. 2021). Despite this, research in machine learning-based program optimization remains underexplored. Existing work focused on the highly constrained problem of expression simplification in a simple domain-specific language (DSL), and has only been tested on synthetic examples (Shi et al. 2020). While this is an important stepping stone demonstrating the feasibility of machine learning-based program optimization, it also avoids many of the interesting but more complex optimization opportunities that emerge with the combination of control flow, memory reasoning, and sheer scale, as seen in real-world software.

In this work, we demonstrate the ability of deep neural networks to optimize real-world programs from Github. For this, we introduce Big Assembly, a dataset consisting of over 25K functions in x86-64 assembly mined from online open-source projects on Github, which enables experimentation on large-scale optimization of real-world programs. We also propose an easy-to-implement algorithm, Self Imitation Learning for Optimization (SILO), that progressively improves its superoptimization ability with training. Our results indicate that it superoptimizes an expected 6.2% of our test set beyond gcc -O3, over five times the rate of a model pre-trained on the outputs of an optimizing compiler as well as a model fine-tuned with policy gradient methods. Instead of focusing on a customized search method unique to a language's implementation and semantics, our methodology relies on a dataset of demonstrations for pre-training as well as a test case generator, a sandbox for executing programs, and a method for verifying program equivalence. We find examples of our model applying compiler optimization techniques such as register allocation, redundant instruction elimination, and instruction combination to superoptimize programs.
2 Problem Formulation

The superoptimization task is, when given a specification for a program S and a reference program Fref (which meets the specification), to generate an optimized program Fo that runs more efficiently and is equivalent to the reference program Fref. In this paper we focus on program optimization at the assembly level, so S, Fref, and Fo are all programs written in x86-64 assembly code. S is a program with no optimizations applied at all, and its purpose is to demonstrate the desired program semantics. Fref is an assembly program produced by the optimizing compiler gcc at its aggressive -O3 optimization level.

We assume that programs are deterministic, and as a result model each program as a function that maps inputs to outputs. We will use I to represent the hardware state prior to executing the program (i.e., input) and O to represent the hardware state after executing the program (i.e., output). Specifically, our goal is to learn a model fθ : S ↦ Fo such that a model-generated (optimized) program F̂o attains lower cost, and ideally minimal cost, under a cost function C(·) evaluated on a suite of K input-output test cases {IO_k}_{k=1}^{K}: for example, energy consumption or runtime. This model-generated program must also meet the specification, which is determined by a verification function V(·) ∈ {0, 1}. More details about C and V are located in Section 5. The learned model's objective is then to produce rewrites that meet the condition:

\[
C\big(\hat{F}_o;\, \{IO_k\}_{k=1}^{K}\big) < C\big(F_{\text{ref}};\, \{IO_k\}_{k=1}^{K}\big)
\quad \text{s.t.} \quad V(\hat{F}_o, S) = 0 \tag{1}
\]
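Read concretely, Eq. (1) is an acceptance test over candidate rewrites. The following is a minimal sketch of that test in Python; cost and verify are hypothetical callables standing in for C and V as realized by the test-case executor and SMT-based verifiers of Section 5, not the actual implementation.

    def is_superoptimization(candidate, reference, spec, tests, cost, verify):
        """Acceptance condition of Eq. (1).

        cost(program, tests) -> float: estimated cost over the I/O test suite.
        verify(program, spec) -> int: 0 iff the program is judged equivalent
        to the specification (the convention used in Eq. (1)).
        """
        # The rewrite must be verified equivalent to the specification...
        if verify(candidate, spec) != 0:
            return False
        # ...and strictly cheaper than the compiler-optimized reference.
        return cost(candidate, tests) < cost(reference, tests)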
In order to train our model on some of the optimizations that are present in modern compilers in a supervised manner, and to improve it by learning from experience, a dataset is necessary. Our training set Do therefore consists of N tuples of (1) an I/O test suite {IO_k}_{k=1}^{K}, (2) a compiled and unoptimized program specification S, and (3) a compiled and aggressively optimized program Fref:

\[
D_o = \Big\{ \big( \{IO_k^i\}_{k=1\ldots K},\; S^i,\; F_{\text{ref}}^i \big) \Big\}_{i=1\ldots N} \tag{2}
\]

In Section 4 we explain our methodology for learning to superoptimize programs; before this, however, we first introduce our benchmark dataset for optimizing real-world programs in Section 3.
3 Big Assembly Benchmark

There is no standard benchmark for evaluating superoptimization research. Some researchers have evaluated on randomly-generated programs in a simplified domain-specific language (Chen and Tian 2019; Shi et al. 2020), while others have tested on small and hand-picked programs (Joshi, Nelson, and Randall 2002; Gulwani et al. 2011; Churchill et al. 2017). While these datasets are sufficient for demonstrating methodological capabilities, they do not necessarily reflect the properties of real code, and thus do not predict performance relative to modern optimizing compilers on real-world programs. Lastly, small-scale benchmarks are insufficient for data-hungry modern deep learning.

Table 1: A list of program optimization benchmarks from machine learning and programming languages / systems works. The criteria listed are the number of individual examples in the benchmark (Sz.), whether the programs were written by humans (H.), whether the programs were found "in the wild" (i.e., in open-source projects) (R.W.), and whether the benchmark contains branching or control flow (Ctl.).

    DATASET                   SZ.      H.   R.W.   CTL.
    Shi et al. (2020)         12,000   ✗    ✗      ✗
    Gulwani et al. (2011)     25       ✓    ✗      ✗
    Churchill et al. (2017)   13       ✓    ✓      ✓
    Ours                      25,141   ✓    ✓      ✓

In our work, we introduce the Big Assembly benchmark, which contains 25,141 C functions that we compiled using gcc, both with (-O3) and without (-O0) aggressive optimizations. We started by collecting 1.61 million functions from open-source projects on Github that were written in C. Of these 1.61M, we were able to mine test cases for a dataset of over 100,000 functions. We performed two stages of sanity checks and analysis to compute a conservative approximation of the live-out registers. The first involved using SMT solvers to find equivalence between the gcc -O0 function used as the specification S and the gcc -O3 function used as Fref, and filtering out trivial programs such as those equivalent to return 0; this reduced our dataset to 77,813 functions. We then performed a similar pass, but instead using the test case suite {IO_k}_{k=1}^{K}, reducing our total dataset to 25,141 functions (19,819 train, 2,193 dev, 3,129 test). For test cases, we automatically generated random test cases, which are part of the dataset. These test cases can be used to test for equivalent behavior and to estimate the performance of the code. We used the STOKE toolkit¹ with additional modifications for this purpose.

Table 1 compares the basic properties of our dataset to those from existing work; beyond its size, our dataset is salient in that it is representative of functions from real-world code. It also contains examples with more complex operations such as SIMD instructions, branching, and loops. Additional details on how the dataset was collected are available in the supplementary materials section.

¹https://github.com/StanfordPL/stoke
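As an illustrative sketch, the two-stage filtering described above can be written as follows; smt_equivalent, passes_tests, and is_trivial are hypothetical stand-ins for the SMT verifier, the test-case runner, and the trivial-program check, not the actual mining scripts.

    def filter_pairs(pairs, smt_equivalent, passes_tests, is_trivial):
        """Two-stage sanity filter over (tests, unoptimized, optimized) tuples.

        Stage 1 keeps pairs the SMT solver can prove equivalent and that are
        not trivial (e.g., equivalent to `return 0`); stage 2 re-checks the
        surviving pairs against the generated I/O test suite.
        """
        stage1 = [
            (tests, o0, o3)
            for tests, o0, o3 in pairs
            if smt_equivalent(o0, o3) and not is_trivial(o3)
        ]
        stage2 = [
            (tests, o0, o3)
            for tests, o0, o3 in stage1
            if passes_tests(o0, o3, tests)
        ]
        return stage2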
4 Learning Program Optimizations

In this section we elaborate on our approach that learns to optimize real-world programs. We first describe the underlying neural model before introducing the learning algorithms.

Neural Program Optimizer

Our program optimization model fθ is a neural sequence-to-sequence network, where the input specification (unoptimized program) and output (optimized program) are represented as sequences of tokens. Specifically, fθ is parameterized with a standard Transformer-based encoder-decoder model (Vaswani et al. 2017).

Learning Algorithms

For learning to optimize, we develop a two-stage learning approach. First, in a pre-training stage, to capture commonly-used optimization heuristics adopted by existing optimizing compilers, we use supervised learning to train the model on the mined corpus Do of gcc-optimized programs (Eq. 2) described in Section 3. Next, to discover more efficient optimization strategies, we investigate fine-tuning using policy-gradient methods and a proposed iterative learning approach, SILO. A sketch of the pre-training objective follows.
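The pre-training stage is ordinary maximum-likelihood (cross-entropy) training on (S, Fref) pairs. Below is a minimal PyTorch-style sketch, assuming a seq2seq model with the usual teacher-forced forward signature; this is an illustration, not the authors' training code.

    import torch
    import torch.nn.functional as F

    def pretrain_step(model, optimizer, src_tokens, tgt_tokens, pad_id=0):
        """One supervised step on a batch of (unoptimized, gcc -O3) pairs.

        src_tokens: (batch, src_len) token ids of the specification S.
        tgt_tokens: (batch, tgt_len) token ids of the optimized target Fref.
        The model is assumed to return per-position vocabulary logits for
        teacher-forced decoding, shape (batch, tgt_len - 1, vocab).
        """
        logits = model(src_tokens, tgt_tokens[:, :-1])    # predict next token
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            tgt_tokens[:, 1:].reshape(-1),
            ignore_index=pad_id,                          # skip padding
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()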
Policy Gradient Approach As in Eq. (1), our goal is to synthesize a correct program F̂o that is verified by V and outperforms a reference program Fref on the cost function C. An intuitive choice is to use policy gradient methods to learn a policy that directly minimizes our cost function C and produces correct programs under V in expectation. Specifically, we express this dual objective via the Lagrangian relaxation:

\[
J(\hat{F}_o) = C\big(\hat{F}_o;\, \{IO_k\}_{k=1}^{K}\big) + \lambda\, V(\hat{F}_o, S) \tag{3}
\]

A commonly used policy gradient approach is REINFORCE with baseline (Williams 1992). Based on our minimization objective, we can express the loss using the following equation, where b(S) is a baseline value for the given specification and p(a_t | a_<t; S) is the model-given probability of generating token a_t at time step t of sequence F̂o. In a traditional reinforcement learning context we would seek to perform gradient ascent on this term; because we are trying to minimize the objective, we perform gradient descent instead.

\[
L = \sum_{t=1}^{T} \log p(a_t \mid a_{<t}; S)\, \big( J(\hat{F}_o) - b(S) \big) \tag{4}
\]
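For concreteness, a minimal PyTorch-style sketch of this REINFORCE-with-baseline update is given below; the token log-probabilities and the scalar objective J are assumed to be computed elsewhere (e.g., by the evaluation pipeline of Section 5), and the baseline is a plain scalar rather than a learned function.

    import torch

    def reinforce_loss(token_logprobs, J_value, baseline):
        """REINFORCE with baseline for one sampled rewrite (Eq. 4).

        token_logprobs: 1-D tensor of log p(a_t | a_<t; S) for the sampled
            sequence F̂o (gradients flow through these).
        J_value: scalar objective J(F̂o) from Eq. (3), i.e. cost plus the
            weighted verification penalty (treated as a constant).
        baseline: scalar b(S), e.g. a running mean of J for this program.
        """
        advantage = J_value - baseline
        # Minimizing this loss performs gradient *descent* on Eq. (4).
        return token_logprobs.sum() * advantage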
Self Imitation Learning for Optimization (SILO) Algorithm 1 illustrates our SILO learning approach. It consists of two steps: an exploration step (lines 4-11), where the model seeks to discover alternative optimizations that are more efficient than the compiler-generated targets used in pre-training, and a learning step (lines 12-13), where the model parameters are updated using newly discovered optimized programs.

Algorithm 1: SILO for Program Optimization
1: Initialize model fθ parameters θ from the pre-trained model
2: Initialize the dataset of program pairs and test cases:
   D = Do = {({IO_k^i}_{k=1}^{K}, S^i, Fref^i)}_{i=1...N}
3: while budget not exhausted do
4:   Sample a batch Bex from D
5:   for ({IO_k^i}_{k=1}^{K}, S^i, Fref^i) in Bex do
6:     Sample F̂o^i ~ fθ(S^i)
7:     Calculate C(F̂o^i) and V(F̂o^i, S^i)
8:     if V(F̂o^i, S^i) = 0 and C(F̂o^i) < C(Fref^i) then
9:       Replace Fref^i with the sample F̂o^i in D
10:    end if
11:  end for
12:  Sample a batch Btr from D
13:  Update θ via supervised learning on Btr
14: end while

First, in the exploration step, an exploration batch Bex is sampled from the dataset D, which is initialized with program specifications (in our case, unoptimized assembly programs) S^i and their compiler-optimized outputs Fref^i. For each input specification S^i in Bex, we sample a model-predicted optimization F̂o^i and execute it on the I/O test suite {IO_k}_{k=1}^{K} to compute the cost function C. If a new sample is both functionally equivalent according to our verification function V and also achieves a lower cost under C than the compiler-optimized target in the original dataset, the compiler-optimized target in the dataset is replaced with the model's newly-discovered optimal rewrite. After the exploration step is taken, in the learning step, a separate training batch Btr is sampled for maximum-likelihood training from the dataset, which may now contain model-optimized targets.

Self-imitation learning (Oh et al. 2018) is an off-policy reinforcement learning algorithm intended to help agents solve challenging exploration problems by learning from good past actions. Intuitively, besides the ordinary on-policy reinforcement learning using the latest model-predicted actions, the model is also trained on historical states s and actions a that achieve high rewards R using a cross-entropy loss:

\[
L_{\text{sil}} = -\log p(a \mid s) \cdot \max\big( R - V_\theta(s),\; 0 \big) \tag{5}
\]

where each sample ⟨a, s⟩ is weighted by how much better the off-policy return was compared to the learned baseline Vθ(s). Our algorithm, SILO, has a few differences from standard self-imitation learning. First, for sequences that outperform Fref, we omit a learned value function and train on the entire sequence using a cross-entropy loss, as opposed to on individual actions. Additionally, we do not interpolate our loss with an on-policy reinforcement learning algorithm. Rather, we avoid the policy gradient altogether, and instead train only on the best sequence found so far in our dataset, be it the compiler-optimized output Fref or a discovered sequence Fo that outperforms it. This is also broadly related to the "hard" EM algorithm that uses the currently best model-predicted results as optimization targets (Kearns, Mansour, and Ng 1998).
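A compact Python rendering of Algorithm 1 follows, using the same kind of hypothetical cost/verify/sampling wrappers as the earlier sketches; it mirrors the exploration and learning steps rather than the distributed implementation.

    import random

    def silo(dataset, sample_rewrite, cost, verify, supervised_update,
             budget, batch_size):
        """SILO (Algorithm 1): dataset entries are (tests, spec, target).

        sample_rewrite(spec) draws F̂o from the current model; verify(...) == 0
        means equivalent; supervised_update trains on (spec, target) pairs.
        """
        for _ in range(budget):
            # Exploration step (lines 4-11): try to beat the current targets.
            for i in random.sample(range(len(dataset)), batch_size):
                tests, spec, target = dataset[i]
                candidate = sample_rewrite(spec)
                if verify(candidate, spec) == 0 and \
                   cost(candidate, tests) < cost(target, tests):
                    # Replace the target with the cheaper, verified rewrite.
                    dataset[i] = (tests, spec, candidate)
            # Learning step (lines 12-13): maximum likelihood on current bests.
            batch = random.sample(dataset, batch_size)
            supervised_update([(spec, target) for _, spec, target in batch])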
Actor-Learner with Evaluation Server Architecture

One hurdle to performing program optimization at scale is that the time required to evaluate C and V is costly, limiting model throughput of learning examples. To alleviate this bottleneck, we utilize an actor-learner setup (Liang et al. 2018; Espeholt et al. 2018). Before the training process begins, multiple actor threads inheriting the parameters of the parent learner are created. For every iteration, each of the actor threads independently samples a batch of program rewrites from the distribution of the inherited model.

After sampling a batch of rewrites, an attempt is made to evaluate them by sending them to an evaluation server. Inside the evaluation server, the programs are assembled and tested for correctness and performance to calculate C and V; more details on evaluation are located in Section 5. The actor then sends the samples with their related cost and correctness information to the learner module for learning. Then, the actor attempts to synchronize its parameters by inheriting a new copy, if available, from the learner module. Fig. 1 contains an overall diagram of our entire system.

[Figure 1: An overview of our actor-learner setup. It features interaction between a centralized learner module, numerous asynchronously-operating actors, as well as a server for evaluating program rewrites. In the diagram, actors send program rewrites to the evaluation server (which assesses I/O correctness, approximates performance, and verifies with SMT solvers) and receive feedback, while exchanging experiences and parameters with the learner.]
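The actor loop can be sketched as follows; requests to the evaluation server are abstracted behind a hypothetical evaluate() call, and parameter synchronization behind learner-side queues. This illustrates the data flow only, under those assumptions.

    import queue

    def actor_loop(sample_batch, evaluate, experience_q, param_q, model,
                   stop_event):
        """One asynchronous actor: sample rewrites, get them scored, report back.

        sample_batch(model) -> list of (spec, rewrite) drawn from the model.
        evaluate(spec, rewrite) -> (cost, verified) from the evaluation server.
        experience_q feeds the learner; param_q delivers fresh parameters.
        """
        while not stop_event.is_set():
            for spec, rewrite in sample_batch(model):
                cost, verified = evaluate(spec, rewrite)   # remote C and V
                experience_q.put((spec, rewrite, cost, verified))
            try:
                # Inherit newer learner parameters when available.
                model = param_q.get_nowait()
            except queue.Empty:
                pass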
Model Configuration and Training

Our neural optimizer fθ uses a 3-layer Transformer encoder-decoder with an embedding dimension of 512 and 8 attention heads. We utilize the Adam optimizer (Kingma and Ba 2014) with the inverse square root schedule from Vaswani et al. (2017). We pre-trained the model for 88K steps and subsequently performed our fine-tuning algorithms for an additional 5K steps. We additionally use SentencePiece² to pre-process the assembly with byte-pair encoding (Sennrich, Haddow, and Birch 2015). Additional details on training hyperparameters and data preprocessing are located in our supplementary materials section.

²https://github.com/google/sentencepiece
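In PyTorch terms, the stated configuration corresponds roughly to the following; layer count, dimensions, and schedule come from the text, while everything else (vocabulary size, a symmetric decoder, the warmup value reused from the appendix) is an illustrative assumption.

    import torch
    import torch.nn as nn

    VOCAB_SIZE = 16000   # assumed BPE vocabulary size, not stated in the paper
    D_MODEL = 512        # embedding dimension from the text

    model = nn.Transformer(
        d_model=D_MODEL,
        nhead=8,                    # 8 attention heads
        num_encoder_layers=3,       # 3-layer encoder
        num_decoder_layers=3,       # 3-layer decoder (assumed symmetric)
    )
    embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
    optimizer = torch.optim.Adam(
        list(model.parameters()) + list(embed.parameters()))

    def inverse_sqrt_lr(step, warmup=2000, base_lr=5e-4):
        """Inverse square root schedule in the style of Vaswani et al. (2017),
        normalized so the rate peaks at base_lr after `warmup` steps."""
        step = max(step, 1)
        return base_lr * min(step ** -0.5, step * warmup ** -1.5) * warmup ** 0.5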
5 Evaluation

A primary concern in constructing the verification function V in Eq. (1) is the undecidability of program equivalence for programs with control flow such as loops. Beyond using test cases for the equivalence check V, challenges also lie in utilizing a test case suite {IO_k}_{k=1}^{K} for the cost function C, whether for benchmarking a program's performance or for computing an approximate estimate of its performance.

Measuring Program Correctness

A key claim of this work is that our superoptimizer outputs correct programs, that is, programs that are more efficient but still semantically equivalent to the input programs. Program equivalence is undecidable in the general case, motivating a complementary set of mechanisms for verifying output program correctness. In our experiments, we confirm output program correctness for our verification function V in two ways. First, we run synthesized programs on the provided test cases. Second, we formally verify correctness using two solver-based verifiers: the bounded SMT-solver-based verifier from the standard STOKE toolkit, and an additional verifier available from the artifacts of the program verification work in Churchill et al. (2019). These program verifiers are based on the state-of-the-art Z3 SMT solver (De Moura and Bjørner 2008); SMT ("Satisfiability Modulo Theories") solvers decide the satisfiability of boolean formulas with respect to a set of underlying theories (such as the theory of integer arithmetic).

We use both test cases and the two verifiers for several reasons. High-coverage test suites are informative in terms of program correctness, but intrinsically incomplete. Meanwhile, verifiers do not always scale, especially to programs with arbitrary numbers of loop iterations. To ensure termination, as is standard, we configured a maximum bound of b (set to 4) loop iterations and a timeout of T seconds (set to 150). Verification is thus also incomplete past those bounds, and limited by the correctness of the verifiers' underlying models.

Indeed, we manually observed cases where our fine-tuning methods exploited either gaps in test suite coverage, or bugs in the verifiers' models of x86-64 semantics.³ Motivated by this, we use both test cases and the two verifiers for additional robustness. While this would intuitively help mitigate spuriousness during evaluation, it could still remain an issue when evaluating optimization of open-domain programs. As we later explain in Section 7, we also resort to human verification to get reliable results when reporting model performance on test sets.

³We have since reached out to the authors of Churchill et al. (2019) regarding the issues we have found.

Measuring Program Performance

We follow previous work on superoptimization of x86 assembly and primarily calculate the cost function C (Eq. 1) as a static heuristic approximation of expected CPU clock cycles. We compute the sum of both performance cost functions from Schkufza, Sharma, and Aiken (2013) and Churchill et al. (2017). The former is a sum of the expected latencies of all instructions in the assembly function (denoted Call), while the latter computes expected latencies using only the instructions executed (Cexe) on the randomly generated test suite {IO_k}_{k=1}^{K}. Cexe is a better approximation, especially for functions that contain loop constructs, while Call may additionally penalize redundant instructions that are not executed. Expected latencies were calculated by the authors of the STOKE toolkit for the Intel Haswell architecture by benchmarking and measuring instructions for servers.

Testing

In our experiments, we test our methods by taking our final model, generating 10 model-optimized candidates through beam search, and then calculating C and V for each. We then report results based on the best result of all 10 candidate programs.
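As a sketch, the combined static cost and the best-of-beam selection could look like the following; the per-opcode latency table and the executed-instruction trace are assumed inputs (the real table is STOKE's Haswell one).

    def static_cost(all_instrs, executed_instrs, latency):
        """C = Call + Cexe: whole-function plus executed-only expected latency.

        all_instrs: opcodes of every instruction in the function.
        executed_instrs: opcodes observed while running the I/O test suite.
        latency: dict mapping opcode -> expected latency.
        """
        c_all = sum(latency[op] for op in all_instrs)
        c_exe = sum(latency[op] for op in executed_instrs)
        return c_all + c_exe

    def best_of_beam(candidates, reference, spec, tests, cost, verify):
        """Pick the best verified candidate out of (e.g.) 10 beam hypotheses."""
        verified = [c for c in candidates if verify(c, spec) == 0]
        if not verified:
            return reference
        return min(verified, key=lambda c: cost(c, tests))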

6 Application to Hacker's Delight

We apply our methods to the 25 functions chosen from the HACKER'S DELIGHT benchmark (Warren 2002), first used in Gulwani et al. (2011) for program synthesis and later in Schkufza, Sharma, and Aiken (2013) for x86-64 program superoptimization. In the latter work, the authors report that they were able to either match or outperform gcc -O3 when provided programs compiled with LLVM -O0. The superoptimization benchmark consists of bit-vector manipulation challenges such as "take the absolute value of x." Before evaluating on our large-scale Big Assembly dataset in Section 7, we perform controlled experiments on HACKER'S DELIGHT, allowing for interpretable optimizations and consistency with prior work.
Results We examined the results of REINFORCE and SILO with respect to two quantities: (1) the number of programs for which a superoptimized version was found at least once during the training process, and (2) the number of programs for which a superoptimized version was found within the top-10 hypotheses generated by beam search from the last model at the end of training. Regarding the former metric, REINFORCE and SILO respectively found 3 and 2 superoptimized programs during training. For the latter metric, the final models produced by REINFORCE and SILO output 1 and 2 superoptimized programs, respectively. This can be explained by the fact that policy gradient methods have a wider breadth for exploration during the training process compared to SILO, but are also less stable in their final solutions.

Interpretation of Superoptimizations We include examples of the two programs found by both algorithms that illustrate interesting patterns of program superoptimization. The first example in Fig. 2 comes from the challenge of creating a mask to identify the trailing zeros of an input. In this example, our model combines the first two instructions into one using the x86 "load effective address" instruction, which is named after its ability to compute an offset from a memory address, but is highly useful for simplifying some simple arithmetic. It is an example of a peephole optimization: an optimization applied to a small window of instructions, with a scope metaphorical of peering through a door's peephole. This optimization is commonly used in compilers, and was likely learned from demonstration. It may be surprising that the compiler did not apply such a commonly-used optimization at the fast -O3 setting; however, it is known that compilers may not always choose the best optimizations to apply (Leather and Cummins 2020).

(a) -O3 code:
    .trailing_zeros_mask:
      movl %edi, %eax
      subl $0x1, %edi
      notl %eax
      andl %edi, %eax
      retq

(b) Model-optimized code:
    .trailing_zeros_mask:
      leal -0x1(%edi), %eax
      notl %eax
      andl %edi, %eax
      retq

Figure 2: An example of the trailing zeros mask from HACKER'S DELIGHT demonstrating a commonly used instruction combine optimization for x86 assembly.

The second example in Fig. 3 comes from the task of taking the absolute value of the input integer, and demonstrates more sophisticated behavior in performing efficient procedural optimizations. In the gcc -O3 optimized assembly, the program performs an auxiliary computation in the %eax (return) register and the majority of the computation in the %edi argument register, finally moving the result into the %eax register. A more clever solution lies in eliminating the last move instruction by switching the auxiliary computation into the %edi register and then performing the majority of the computation directly in the %eax register used for returning the result. This is an example of more-efficient register allocation, which is the process of assigning hardware register locations, of which there are a small number, to the variables and computations performed by the program. Improving register allocation can result in the program copying data fewer times, which can improve performance. For a human, realizing such an optimization would require the ingenuity to see that the final instruction may be redundant, and a competent procedural understanding of assembly to rewrite the function with the computation flipped. A plausible explanation for this result is that by pre-training our model with gcc -O0 unoptimized code as the original reference program and optimized gcc -O3 code as the optimized target program, the model may have learned a degree of competence in register allocation via demonstration, thereby placing probability mass on an alternative, yet more efficient implementation.

(a) -O3 code:
    .absolute_value:
      movl %edi, %eax
      sarl $0x1f, %eax
      xorl %eax, %edi
      subl %eax, %edi
      movl %edi, %eax
      retq

(b) Model-optimized code:
    .absolute_value:
      movl %edi, %eax
      sarl $0x1f, %edi
      xorl %edi, %eax
      subl %edi, %eax
      retq

Figure 3: An example of the HACKER'S DELIGHT absolute value function, demonstrating superior register allocation.
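The identity behind Fig. 3 is easy to spot-check with random tests, in the same spirit as the paper's I/O test suites. The following check is ours, for illustration only:

    import random

    def abs_via_sign_mask(x):
        """Branch-free absolute value as in Fig. 3: s = x >> 31; (x ^ s) - s."""
        s = x >> 31                      # Python's >> sign-extends: 0 or -1
        return ((x ^ s) - s) & 0xFFFFFFFF

    for _ in range(10_000):
        x = random.randint(-(2**31), 2**31 - 1)
        assert abs_via_sign_mask(x) == abs(x) & 0xFFFFFFFF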
7 Application to Big Assembly

In applying our methods to the Big Assembly dataset, we followed the same general experimental setup as for HACKER'S DELIGHT; however, we evaluated our results on held-out sets. As mentioned in Section 5, we observed cases where our models exploited bugs in the verifiers' models of x86-64 semantics, so we also incorporated manual human evaluation into our reporting methodology for the benchmark.

For both learning methods, we chose the best model of the ones checkpointed every 1K steps during fine-tuning. We did this by choosing the model with the highest proportion of programs that were superoptimized on a randomly-sampled subset of 329 functions from the validation set according to our cost function C, correct according to our verification methodology V, and correct again by manual human evaluation.

Using the chosen model, we then evaluated performance on the test set. We manually evaluated 25% of our test set to find what proportion of reported superoptimizations were actually correct. In Table 2 we report (1) the proportion of the entire test set that was superoptimized according to our automatic methods discussed in Section 5, as well as (2) our estimate of the actual proportion that would manually verify as correct. For the latter (2), we use our measured proportion of reported superoptimizations that manually verified as correct from the 25% subset, along with (1) the proportion of reported superoptimizations for the entire test set, to generate our expected number of programs that verify correct.

[Figure 4: A plot reflecting the proportion of the validation set sub-sample population (329 programs) superoptimized every 1000 steps of training. The y-axis shows the percent verified (0-12%) and the x-axis the step number (0-5000), for four series: SMT (SILO), SMT and Human (SILO), SMT (RL), and SMT and Human (RL).]
Results We witness that our proposed algorithm SILO far outperformed REINFORCE with baseline on this task: on the test set, SILO superoptimized an expected 6.2% of programs while REINFORCE with baseline superoptimized an expected 0.9%. We also note that despite the fact that we used two separate SMT-based verifiers to prove correctness, our SILO approach was capable of finding and learning generalizable exploits in a manner that REINFORCE did not.

Table 2: Test set results on the Big Assembly benchmark comparing the pre-trained model, SILO, and REINFORCE with baseline. The first column (SMT Ver.) reports the proportion of programs that beat the gcc -O3 baseline and verify using our automated evaluation methods in Section 5. The second column (SMT + Human Ver.) presents our expected proportion of programs that would additionally pass a human evaluation step, based on the 25% sample of the test set we manually verified.

    MODEL       SMT VER.   SMT + HUMAN VER.
    Pre-train   1.2%       1.0%
    SILO        8.3%       6.2%
    REINFORCE   0.9%       0.9%

In our study of manually verifying assembly programs, we witnessed that across all earlier and later stages of training, the REINFORCE model consistently superoptimized the same 5 to 6 programs in the validation set; in other words, it did not seem to apply superoptimization patterns to any new programs or learn any new superoptimization patterns. In contrast, the SILO model consistently increased the number of programs it superoptimized in the held-out set over time: it seemed to broaden its capacity to apply patterns to new programs and simultaneously learn different strategies to superoptimize or even exploit the verifier; this trend is reflected in the Fig. 4 plot. We hypothesize that the reward space for program superoptimization is highly sparse: while training, we saw that fewer than 1 in every 1,000 samples the REINFORCE model drew were program superoptimizations. We believe that, without a method to re-learn from past experience, the on-policy REINFORCE algorithm struggles to find the signal in the noise, and instead is faced with optimizing the already-challenging objective of synthesizing syntactically and semantically correct programs. This is a non-trivial objective, because making a small error in program syntax or semantics can have a dramatic effect on the reward the model gets.

Interpretation of Superoptimizations In our study of the assembly that outperformed gcc -O3, we saw two main patterns emerge: pushing operations out to branches, and removal of redundant instructions; Fig. 5 contains an example of both. In this one function, the first instruction, movl $0x1, %eax, is overwritten by later instructions and is only useful if the first jump of the program is taken. Given this fact, it is more efficient under our cost model to push the computation out to after the first jump is taken, preventing it from being executed redundantly.

The second example of an optimization lies in removing the redundant second xorl %eax, %eax, which zeros out the lower 32 bits of the return register %rax. This instruction is redundant because the same instruction is executed two instructions earlier, and in between there is no operation with side effects on the %rax register. As a result, it is safe to remove the second occurrence. We do note that on modern pipelined architectures, such optimizations may or may not speed up the runtime of a program, depending on how instruction-level parallelism and branch prediction are handled in hardware: this is a challenge for the count-based models of CPU runtime that are common in the program superoptimization works we are aware of. That being said, we have seen examples where our model created superoptimized assembly by removing redundant instructions in straight-line code, as well as nops that were executed and not just used as padding.

(a) -O3 code:
    .prstree_empty:
      movl $0x1, %eax
      testq %rdi, %rdi
      je .L1
      xorl %eax, %eax
      cmpl $0xb, (%rdi)
      je .L1
      xorl %eax, %eax
      cmpq $0x0, 0x10(%rdi)
      sete %al
    .L1:
      retq

(b) Model-optimized code:
    .prstree_empty:
      testq %rdi, %rdi
      je .L1
      xorl %eax, %eax
      cmpl $0xb, (%rdi)
      je .L2
      cmpq $0x0, 0x10(%rdi)
      sete %al
    .L1:
      movl $0x1, %eax
    .L2:
      retq

Figure 5: An example from the Big Assembly benchmark of simultaneously removing a redundant instruction and pushing operations to branches.
8 Related Work

Program Optimization The general undecidability of program equivalence means that there may always be room for improvement in optimizing programs (Rice 1953). This is especially true as hardware options and performance goals become more diverse: which transformations are best for a scenario may vary greatly depending on performance objectives such as energy consumption or runtime, among other factors. There are classes of functions that are amenable to exhaustive search, where the optimal sequence of calculations is found (Massalin 1987), but these are limited to very small sequences of machine instructions. State-of-the-art methods for superoptimization either rely on search-based procedures (Schkufza, Sharma, and Aiken 2013) or constraint-based methods (Sasnauskas et al. 2017). However, these methods have difficulty scaling to larger problems and, as a result, typically do not meet the performance requirements of real development scenarios at compile time.

Machine-Learning-Based Program Optimization Perhaps the closest work to ours is Shi et al. (2020), which attempted to learn symbolic expression simplification on a dataset of synthetically generated symbolic expressions in Halide by rewriting sub-trees of the parsed expression with reinforcement learning. The domain differs from ours, however, as the domain-specific language contains simple expressions, and randomly generated programs may contain redundancies not seen in assembly optimized by a compiler like gcc.

Another work that addressed automatic program optimization is Bunel et al. (2016). Unlike the Halide-based experiments, this work used reinforcement learning to learn a proposal distribution for the stochastic search used in Schkufza, Sharma, and Aiken (2013). While the learned proposal distribution showed improvements over the baseline, the method ultimately still used stochastic search, except with new hyper-parameters. Unlike our work, the model is unable to fully control program transformations end-to-end, as it provides no learned priors on where in a program to apply transformations.

A recent body of work centers on learning and searching for more efficient schedules for tensor computations (Chen et al. 2018; Adams et al. 2019). These works address real-world performance problems in image processing and deep learning, and often utilize machine learning to achieve these goals. However, they differ in that the search space is defined over how a program is run on hardware (a schedule), and this search space is decoupled from the underlying algorithms themselves. In optimizing x86-64 assembly, by contrast, both the algorithm and how it is computed on hardware are in the domain of our search space.

9 Conclusion

In our work, we explored the task of program superoptimization with neural sequence models. Towards this goal, we introduced the Big Assembly benchmark, consisting of 1.61 million programs mined from open-source projects on Github for pre-training, along with a subset of over 25K functions with test cases that can additionally be passed off to the SMT-based verifiers in the STOKE toolkit. We proposed SILO, a learning approach with a two-step process: (1) an exploration step to search for program superoptimizations, and (2) a learning step on the best sequences found during training. Our experiments on the Big Assembly dataset demonstrate that SILO is able to outperform REINFORCE with baseline. We believe that REINFORCE struggles because program superoptimization is a highly challenging exploration task with a very sparse reward space. By incorporating supervision on superoptimized sequences, SILO is able to learn optimizations more effectively from its exploration.

Recently, large neural sequence models have been proposed as an effective method for program synthesis in high-level programming languages such as Python from natural language specifications (Yin and Neubig 2017; Chen et al. 2021); however, to our knowledge, relatively little work has been done to refine these models to go beyond simply synthesizing correct programs and additionally optimize for important metrics such as performance or readability. Given the increased availability of natural-language-to-code datasets with test cases, tuning neural sequence models to go beyond program synthesis and optimize for additional metrics is a promising direction for future work.

10 Acknowledgments

We would like to thank Bogdan Vasilescu, Vincent Hellendoorn, Chris Cummins, Berkeley Churchill, and Qibin Chen for their feedback, thoughts, and ideas during the drafting of this work. This work was supported in part by the National Science Foundation under grants 1815287 and 1910067.
References

Adams, A.; Ma, K.; Anderson, L.; Baghdadi, R.; Li, T.-M.; Gharbi, M.; Steiner, B.; Johnson, S.; Fatahalian, K.; Durand, F.; et al. 2019. Learning to optimize halide with tree search and random programs. ACM Transactions on Graphics (TOG), 38(4): 1–12.

Aho, A. V.; Lam, M. S.; Sethi, R.; and Ullman, J. D. 2006. Compilers: Principles, Techniques, and Tools. Pearson Education, Inc., 2 edition.

Allamanis, M.; Barr, E. T.; Devanbu, P.; and Sutton, C. 2018. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR), 51(4): 1–37.

Allen, F. E.; and Cocke, J. 1971. A Catalogue of Optimizing Transformations. Technical report, IBM Thomas J. Watson Research Center.

Bunel, R.; Desmaison, A.; Kumar, M. P.; Torr, P. H.; and Kohli, P. 2016. Learning to superoptimize programs. arXiv preprint arXiv:1611.01787.

Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Ponde, H.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

Chen, T.; Zheng, L.; Yan, E.; Jiang, Z.; Moreau, T.; Ceze, L.; Guestrin, C.; and Krishnamurthy, A. 2018. Learning to optimize tensor programs. In Advances in Neural Information Processing Systems, 3389–3400.

Chen, X.; and Tian, Y. 2019. Learning to perform local rewriting for combinatorial optimization. In Advances in Neural Information Processing Systems, 6281–6292.

Churchill, B.; Padon, O.; Sharma, R.; and Aiken, A. 2019. Semantic program alignment for equivalence checking. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, 1027–1040.

Churchill, B.; Sharma, R.; Bastien, J.; and Aiken, A. 2017. Sound loop superoptimization for Google Native Client. ACM SIGPLAN Notices, 52(4): 313–326.

De Moura, L.; and Bjørner, N. 2008. Z3: An efficient SMT solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, 337–340. Springer.

Espeholt, L.; Soyer, H.; Munos, R.; Simonyan, K.; Mnih, V.; Ward, T.; Doron, Y.; Firoiu, V.; Harley, T.; Dunning, I.; et al. 2018. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561.

Gulwani, S.; Jha, S.; Tiwari, A.; and Venkatesan, R. 2011. Synthesis of loop-free programs. ACM SIGPLAN Notices, 46(6): 62–73.

Joshi, R.; Nelson, G.; and Randall, K. 2002. Denali: A goal-directed superoptimizer. ACM SIGPLAN Notices, 37(5): 304–314.

Kearns, M.; Mansour, Y.; and Ng, A. Y. 1998. An information-theoretic analysis of hard and soft assignment methods for clustering. In Learning in Graphical Models, 495–520. Springer.

Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Leather, H.; and Cummins, C. 2020. Machine learning in compilers: Past, present and future. In 2020 Forum for Specification and Design Languages (FDL), 1–8. IEEE.

Liang, C.; Norouzi, M.; Berant, J.; Le, Q. V.; and Lao, N. 2018. Memory augmented policy optimization for program synthesis and semantic parsing. In Advances in Neural Information Processing Systems, 9994–10006.

Massalin, H. 1987. Superoptimizer: a look at the smallest program. ACM SIGARCH Computer Architecture News, 15(5): 122–126.

McKeeman, W. M. 1965. Peephole optimization. Communications of the ACM, 8(7): 443–444.

Oh, J.; Guo, Y.; Singh, S.; and Lee, H. 2018. Self-imitation learning. arXiv preprint arXiv:1806.05635.

Pouchet, L.; and Jimborean, A., eds. 2020. CC 2020: Proceedings of the 29th International Conference on Compiler Construction. Association for Computing Machinery.

Rice, H. G. 1953. Classes of recursively enumerable sets and their decision problems. Transactions of the American Mathematical Society.

Sasnauskas, R.; Chen, Y.; Collingbourne, P.; Ketema, J.; Taneja, J.; and Regehr, J. 2017. Souper: A Synthesizing Superoptimizer. arXiv preprint arXiv:1711.04422.

Schkufza, E.; Sharma, R.; and Aiken, A. 2013. Stochastic superoptimization. ACM SIGARCH Computer Architecture News, 41(1): 305–316.

Sennrich, R.; Haddow, B.; and Birch, A. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Shi, H.; Zhang, Y.; Chen, X.; Tian, Y.; and Zhao, J. 2020. Deep Symbolic Superoptimization Without Human Knowledge. In International Conference on Learning Representations.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.

Warren, H. S. 2002. Hacker's Delight.

Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3): 229–256.

Yin, P.; and Neubig, G. 2017. A Syntactic Neural Model for General-Purpose Code Generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 440–450. Vancouver, Canada: Association for Computational Linguistics.
A SMT-based Verifier Exploits Found

In our manual evaluation stage, we witnessed that the primary pattern of exploiting the SMT-solver-based verifiers was that of branch deletion. We present a concrete example of one such exploit paired with the verifiers' output in Fig. 6. In the first subfigure we demonstrate the pattern exploiting the verifier from the original STOKE toolkit⁴, and in the second subfigure we show the output when running the verifier included in the artifacts from a follow-up work on program verification⁵. In this example, a comparison is done between the hex constant 0x2e and the value located at the address in register %rdi; if the two are equal, the program jumps to location .L1, executing a sequence of code that places a value in the %eax register conditional on multiple tests. If the constant is not equal to the value in memory, the jump is not taken, and the program returns with a 0 in the %eax register: this is because the very first line of the program, xorl %eax, %eax, zeros out the %eax register. Deletion of the branch following .L1 is therefore incorrect. We believe that the verifiers struggle with forms of branching and jump statements very often found in real-world programs. This is despite the fact that previous works the verifiers were used for included loops (which depend on jump statements) or branching. We found that when we removed the jump and label statements from the spurious rewrite, thereby preserving function semantics and eliminating all branching, both verifiers correctly identified that the two programs were not equivalent.

Target:
    .smtp_is_end.s:
      xorl %eax, %eax
      cmpb $0x2e, (%rdi)
      je .L1
      retq
    .L1:
      movzbl 0x1(%rdi), %edx
      cmpb $0xd, %dl
      sete %al
      cmpb $0xa, %dl
      sete %dl
      orl %edx, %eax
      movzbl %al, %eax
      retq

Rewrite:
    .smtp_is_end.s:
      xorl %eax, %eax
      cmpb $0x2e, (%rdi)
      je .L1
    .L1:
      retq

(a) Output from the original STOKE bounded verifier on the pair above:

    Equivalent: yes

(b) Output from the verifier in the artifacts of Churchill et al. (2019), on the same pair:

    [bv] Checking pair: 0 1 2 4; 0 1 2 3
    Couldn't take short-circuit option without memory.
    [bv] Paths 0 1 2 4 / 0 1 2 3
    verified: true
    [bv] Checking pair: 0 1 2 4; 0 1 2 3
    Couldn't take short-circuit option without memory.
    [bv] Paths 0 1 2 4 / 0 1 2 3
    verified: true
    [bv] Checking pair: 0 1 3 4; 0 1 2 3
    We've finished early without modeling memory!
    [bv] Paths 0 1 3 4 / 0 1 2 3
    verified: true
    [bv] Checking pair: 0 1 3 4; 0 1 2 3
    We've finished early without modeling memory!
    [bv] Paths 0 1 3 4 / 0 1 2 3
    verified: true
    Equivalent: yes

Figure 6: An example of the common exploit where the right-hand side deletes the branch of code following .L1. If the third and fourth lines (je .L1; .L1:) are removed, then the verifier actually returns correctly.

⁴https://github.com/StanfordPL/stoke
⁵https://github.com/bchurchill/pldi19-equivalence-checker

B Additional Information on the Big Assembly Benchmark

Data Collection Our Big Assembly benchmark was mined from open-source projects implemented in the C programming language on Github. Our programs were disassembled using GNU Binutils' objdump into x86-64 assembly and split into individual functions. We performed the process twice on the same set of source code, so that we could create a parallel corpus of functions. We split our dataset into train, development, and test sets at the level of entire individual Github projects. We deduplicated by removing any overlapping binaries between the datasets. We also deduplicated at the individual function level by removing string matches after removing function names from the assembly functions. Lastly, we removed program pairs from the dataset if either the source or target program had a length greater than 512 after byte-pair encoding tokenization.
Setting Live Out and Filtering Spurious Examples As mentioned in Section 3, for each function in our dataset we were required to determine the live-out set: the portion of the CPU state required for determining equivalence between programs. We also determine heap-out, a boolean flag that determines whether or not we should also check the heap for equivalence. We perform this sanity check by determining which parts of the CPU state are equivalent between the gcc -O0 function and the gcc -O3 function. Pseudocode for how live-out is set is given in Algorithm 2; in this algorithm we refer to the gcc -O0 function as Fu.

In line 1, the algorithm begins by initializing live_out with all possible CPU registers. In lines 2-5, until either the computation budget is exhausted or the live-out set reaches a fixed point, we iteratively execute the function get_live_out, which incrementally determines the candidate live-out set. It works by either executing the test case suite or running the SMT-solver-based verifier, analyzing any difference in the resulting CPU state, pruning any part of the CPU state that differs, and then returning the subset of the CPU state that may be equivalent between Fu and Fref. This may need to be run iteratively because, after pruning live_out with respect to one counterexample, it is possible that another counterexample may still trigger a difference in other parts of the CPU state. For most of the general-purpose CPU registers, we also perform this pruning at the sub-register level, allowing the register size used for equivalence to be pruned down to the lower 32, 16, or 8 bits. If the computation budget is exhausted, the procedure returns early, and the program is discarded from the dataset.

After determining live_out, in lines 6-10 an additional check is done to see whether both programs are equivalent when also checking the heap. If they are, heap_out is set to true, and this information is recorded to be used in the fine-tuning phases. Lastly, in lines 12-15, we perform a final sanity check to ensure that none of our programs are equivalent to a set of spurious programs such as a null program, a return 0, or a return 1. If a program is equivalent to one of these highly simplistic programs, it is discarded from our dataset.

Algorithm 2: Set Live Out and Filter Examples
1: Initialize: live_out = ALL_LIVE_OUT
2: repeat
3:   old_live_out = live_out
4:   live_out = get_live_out(Fu, Fref, live_out)
5: until old_live_out = live_out or budget exhausted
6: if diff(Fu, Fref, live_out, heap_out = True) then
7:   heap_out = False
8: else
9:   heap_out = True
10: end if
11: is_spurious = False
12: for Fspur in SPURIOUS do
13:   if ¬diff(Fspur, Fref, live_out, heap_out) then
14:     is_spurious = True
15:   end if
16: end for
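A Python rendering of Algorithm 2's fixed-point loop is sketched below, with get_live_out and diff left as assumed callables wrapping the test-case runner and SMT verifier; it is an illustration, not the dataset pipeline itself.

    def set_live_out_and_filter(fu, fref, get_live_out, diff,
                                all_live_out, spurious, budget):
        """Sketch of Algorithm 2: shrink live_out to a fixed point, set
        heap_out, and flag programs equivalent to trivial ones."""
        live_out = set(all_live_out)
        for _ in range(budget):                      # lines 2-5: fixed point
            old_live_out = set(live_out)
            live_out = get_live_out(fu, fref, live_out)
            if live_out == old_live_out:
                break
        else:
            return None                              # budget exhausted: discard
        # Lines 6-10: only require heap equivalence if it actually holds.
        heap_out = not diff(fu, fref, live_out, heap_out=True)
        # Lines 11-16: discard programs equivalent to e.g. `return 0`.
        is_spurious = any(
            not diff(fspur, fref, live_out, heap_out) for fspur in spurious)
        return None if is_spurious else (live_out, heap_out)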

Data Preprocessing for Training We perform additional processing on the programs that we feed into the model to remove noise; we do this for the gcc -O0 function S as well as the gcc -O3 functions used for pre-training Fref. x86-64 assembly often uses GOTO-like instructions to jump to different parts of a binary: this is one way that control flow is implemented. The jump targets are often marked as offsets in the binary itself; however, at the individual function level these may be canonicalized with ordinal locations (i.e., .L1, .L2, and so on). This fully preserves function semantics while removing noise from the prediction task. A sketch of this canonicalization follows.
is implemented. The jump targets are often marked as off-
sets in the binary itself; however, at the individual function

You might also like
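The per-program running baseline and normalization described above could be implemented along these lines (our sketch; the window size of 256 comes from the text):

    from collections import defaultdict, deque
    import statistics

    class RunningBaseline:
        """Per-program mean/std of the objective over the last 256 samples."""

        def __init__(self, window=256):
            self.history = defaultdict(lambda: deque(maxlen=window))

        def advantage(self, program_id, J_value):
            """Return the mean-subtracted, std-normalized objective."""
            hist = self.history[program_id]
            hist.append(J_value)
            mean = statistics.fmean(hist)
            std = statistics.pstdev(hist) if len(hist) > 1 else 1.0
            return (J_value - mean) / max(std, 1e-8)

    baseline = RunningBaseline()
    print(baseline.advantage("prog0", 120.0))  # first sample: advantage 0.0
    print(baseline.advantage("prog0", 80.0))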