Abstract

We address the problem of finding similar procedures in stripped binaries. We present a new statistical approach for measuring the similarity between two procedures. Our notion of similarity allows us to find similar code even when it has been compiled using different compilers, or has been modified. The main idea is to use similarity by composition: decompose the code into smaller comparable fragments, define semantic similarity between fragments, and use statistical reasoning to lift fragment similarity into similarity between procedures. We have implemented our approach in a tool called Esh, and applied it to find various prominent vulnerabilities across compilers and versions, including Heartbleed [3], Shellshock [5] and Venom [7]. We show that Esh produces high-accuracy results, with few to no false positives – a crucial factor in the scenario of vulnerability search in stripped binaries.

Categories and Subject Descriptors D.3.4 [Processors]: compilers, code generation; F.3.2 (D.3.1) [Semantics of Programming Languages]: Program analysis

Keywords static binary analysis; verification-aided similarity; partial equivalence; statistical similarity

1. Introduction

During December 2014, several vulnerabilities were discovered in the prominent implementation of the network time protocol (NTP), ntpd [1]. As this implementation is the de facto standard, many products from major vendors were affected, including RedHat's Linux distribution, Apple's OSX and Cisco's 5900x switches. Because some of these vulnerabilities were introduced many years ago, different versions have been ported and integrated into many software packages, ready to be deployed in binary form to home computers, enterprise servers and even appliance firmware.

A security-savvy individual, or more commonly a security-aware company, would want to use a sample of the vulnerable product (in binary form) to search for the vulnerability across all the software installed in the organization, where source code is mostly not available. Unfortunately, automatically identifying these software packages is extremely challenging. For example, given a sample from the Debian Linux distribution, finding other (recent or older) vulnerable versions of it is already hard. This is because even though older distributions were probably compiled with the same compiler, gcc, they used older gcc versions, producing syntactically different binary code. Trying to find OSX applications is even harder, as they are commonly compiled with a different compiler (CLang), and the firmware of an appliance using Intel chips might be compiled with icc (the Intel C compiler), such that the resulting binary procedures differ vastly in syntax. We address this challenge by providing an effective means of searching for semantically similar procedures at the assembly code level.

Problem definition Given a query procedure q and a large collection T of (target) procedures, in binary form, our goal is to quantitatively define the similarity of each procedure t ∈ T to the query q. The main challenge is to define a semantic notion of similarity that is precise enough to avoid false positives, but is flexible enough to allow finding the code in any combination of the following scenarios: (i) the code was compiled using different compiler versions; (ii) the code was compiled using different compiler vendors; and (iii) a different version of the code was compiled (e.g. a patch). We require a method that can operate without information about the source code and/or tool-chain used in the creation of the binaries, only on procedures in binary form.
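To make the search setting concrete, the following sketch (ours, not part of the Esh implementation) shows the retrieval loop implied by this problem definition; Procedure and similarity() are hypothetical placeholders for a procedure lifted from a stripped binary and for the quantitative score developed in the rest of the paper:

# Sketch (ours): rank every target procedure by similarity to the query.
# `similarity` stands in for the quantitative score developed in this
# paper; `Procedure` is a hypothetical container for a binary procedure.
from typing import Callable, List, Tuple

def rank_targets(query: "Procedure",
                 targets: List["Procedure"],
                 similarity: Callable[["Procedure", "Procedure"], float]
                 ) -> List[Tuple[float, "Procedure"]]:
    """Return targets sorted from most to least similar to the query."""
    scored = [(similarity(query, t), t) for t in targets]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

The output is a ranking rather than a yes/no answer, which is exactly how the results are evaluated in Section 6.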
Existing techniques Previous work on clone detection in binaries [31] can overcome different instruction selections, but is unable to handle syntactic differences beyond single instructions. Binary code search [14] is mostly syntactic and therefore fails to handle differences that result from different compilers. Equivalence checking and semantic differencing techniques [17, 25, 26, 29] operate at the source-code level and require a matching (labeling of variables) to operate. Furthermore, they do not provide quantitative measures of similarity when procedures are not equivalent. Dynamic equivalence checking techniques [29] are hard to apply as they require obtaining sufficient coverage for each procedure. A dynamic approach for binary code search [15] coerces execution using a randomized environment to achieve coverage, but suffers from a high false positive rate, and does not address or evaluate patching. Data-driven equivalence checking [32] is aimed at translation validation and is limited to short segments of assembly code.
[Figure 1: (a) target image i_t and the two query images; (b) similar regions in i_t, i_q1 and i_q2; (c) target code t; (d) query code q1; (e) query code q2. Image panels and assembly listings omitted.]

Figure 1. Image vs. Code Similarity by Composition. Query image i_q2 is similar to the target image i_t. Query code q2 is similar to the target code t. Images courtesy of Irani et al. [12].
Similarity by composition We draw inspiration from Boiman and Irani's [12] work on image similarity, where the key idea is that one image is similar to another if it can be composed using regions of the other image, and that this similarity can be quantified using statistical reasoning.

Fig. 1(a) and (b) illustrate the idea of similarity by composition for images. Looking at the three images in Fig. 1(a), we wish to determine which of the two query images, i_q1 and i_q2, is more similar to the target image, i_t. Our intuition is that i_q2 is probably more similar to i_t.

Fig. 1(b) provides some explanation for this intuition. Looking at this figure, we identify similar regions in i_q2 and i_t (marked by the numbered outlined segments). Although the images are not identical, their similarity stems from the fact that query image i_q2 can be composed using significant (and maybe transformed) regions from target image i_t. In contrast, image i_q1 shares only two small regions with i_t, and indeed we intuitively consider it to be different. These vague notions are made precise within a statistical framework that lifts similarity between significant regions into similarity between images.

In this paper, we show that the general framework of "similarity by composition" can also be applied to code. Specifically, we consider assembly code extracted from stripped binaries (with no debug information). Fig. 1(c), (d), and (e) show partial assembly code of three procedures. Snippets (c) and (e) are taken from the OpenSSL procedure vulnerable to "Heartbleed", and were compiled using different compilers: gcc 4.9 and CLang 3.5, respectively. Snippet (d) is taken from an unrelated procedure in Coreutils and compiled using gcc 4.9. For simplicity and brevity, we only present a small part of the code.

Finding similarity between the procedures using syntactic techniques is challenging, as different compilers can produce significantly different assembly code. Instead, we decompose each procedure into small code segments we name strands (a strand is a basic-block slice, as explained in Section 3.2), semantically compare the strands to uncover similarity, and lift the results into procedures.

In Fig. 1(c) & (e), the query code, q2, and the target code, t, share three matching strands, numbered in the figure as 1, 2, and 3. Each strand is a sequence of instructions, and strands are considered matches when they perform an equivalent computation. In the figure, we mark matching strands using the same circled number. Two syntactically different strands can be equivalent. For example, the strands numbered 2 in q2 and t differ syntactically but are equivalent (up to renaming and ignoring the change of r9). Furthermore, strands need not be syntactically contiguous. This is because they are based on data-flow dependencies rather than on syntactic properties. For example, strand 3 in the query procedure q2 (mov r12, rbx; lea rdi, [r12+3]) matches the strand mov r13, rbx; lea rcx, [r13+3] in the target procedure t. In contrast, the code in Fig. 1(d) only matches the single strand 1 in the target.

Our approach We present a novel notion of similarity for procedures that is based on the following key components:

Decomposing the procedure into strands: We decompose procedures into smaller segments we refer to as strands, which are feasible to compare.

Comparing strands: We use a program verifier [11] to check whether two strands are semantically equivalent by assuming input equivalence and checking if intermediate and output values are the same. When they are not equivalent, we define a quantitative notion of strand similarity based on the proportion of matching values to the total number of values in the strand.
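To give a rough feel for the decomposition step, the sketch below (our simplification, not the paper's algorithm or IR) extracts strands from a basic block by walking def-use dependencies backwards; each instruction is reduced to a pair of the variable it defines and the variables it uses:

# Sketch (ours): decompose a basic block into strands (basic-block
# slices). Each instruction is simplified to (defined_var, used_vars);
# the real system works on a richer IR lifted from assembly.
def extract_strands(block):
    """block: list of (defined_var, used_vars) tuples, in program order."""
    remaining = set(range(len(block)))      # instructions not yet covered
    strands = []
    while remaining:
        # Start a new strand from the last instruction not yet covered.
        start = max(remaining)
        needed = set(block[start][1])       # variables the slice still needs
        strand = [start]
        for i in range(start - 1, -1, -1):  # walk backwards over the block
            defined, uses = block[i]
            if defined in needed:           # instruction i feeds the strand
                strand.append(i)
                needed.discard(defined)
                needed.update(uses)
        remaining -= set(strand)
        strands.append([block[i] for i in sorted(strand)])
    return strands

Running this on a block in which two independent computations interleave yields two separate strands, matching the intuition above that strands follow data flow rather than syntactic adjacency.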
Statistical reasoning over strands: We present a statistical framework for reasoning about similarity of whole procedures using strand similarity. We compute a global similarity evidence score between procedures using the sum of the local evidence scores (LES) between strands. A key aspect of our framework is that we amplify the similarity scores of unique strands, expressed by a high LES, and diminish the significance of "common" strands (with respect to the target database examined), as they are less indicative of similarity.

[Figure 2: assembly listings of the Heartbleed vulnerability code, (a) compiled with gcc v.4.9 -O3 and (b) with icc v.15.0.1 -O3; listings omitted.]

Figure 2. Heartbleed vulnerability code snippets.

[Figure 3: two strands in SSA form; under the assumption r12_q == rbx_t, the equalities v1_q == v2_t, v2_q == v3_t, v3_q == v4_t, r14_q == r13_t, v4_q == v5_t, rsi_q == rsi_t, v5_q == v6_t and rax_q == rax_t are asserted; full listings omitted.]

Figure 3. Semantically similar strands.

Main contributions The contributions of this paper are:

• A framework for reasoning about similarity of procedures in stripped binaries. The main idea is to decompose procedures into strands, perform semantic comparison of them, and lift strand similarity into similarity between procedures using statistical techniques.

• A technique for checking input-output equivalence of strands of code, where all variables are unlabeled.

• A statistical model that quantifies the probability of similarity between procedures by examining the similarity of their strands.

• A prototype implementation in a tool called Esh, which is publicly available at github.com/tech-srl/esh. We compare Esh to previous binary code-search techniques using challenging search scenarios that combine patched and non-patched versions of real-world vulnerable procedures in binary form, compiled with different compiler versions and vendors. These experiments show that Esh achieves significantly better results.
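To illustrate the flavor of the input-output equivalence check, the following is a minimal sketch of the core of Fig. 3: the paper encodes strands for the Boogie program verifier [11], whereas querying the Z3 SMT solver directly, and dropping the int_to_ptr conversions, is a simplification of ours:

# Sketch (ours): check input-output equivalence of two tiny strands
# with an SMT solver, in the spirit of Fig. 3. The actual tool uses a
# program verifier [11] over a full IR, not this direct Z3 encoding.
from z3 import BitVec, BitVecVal, Solver, Not, unsat

r12_q = BitVec('r12_q', 64)            # input of the query strand
rbx_t = BitVec('rbx_t', 64)            # input of the target strand

# Query strand (cf. Fig. 3):  r14_q = r12_q + 0x13
r14_q = r12_q + BitVecVal(0x13, 64)
# Target strand (cf. Fig. 3): r9_t = 0x13; r13_t = rbx_t + r9_t
r9_t = BitVecVal(0x13, 64)
r13_t = rbx_t + r9_t

s = Solver()
s.add(r12_q == rbx_t)                  # assume: inputs are equivalent
s.add(Not(r14_q == r13_t))             # search for a disagreeing output
assert s.check() == unsat              # unsat: the outputs always agree

When such a check fails, the proportion of intermediate and output value pairs that can still be proven equal, out of the total, is what yields the quantitative strand-similarity measure described above.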
specified over the bar itself). All unrelated procedures were left blank.

As we can see from the results in Fig. 5, our method gives high scores to all other similar versions of the "Heartbleed" procedure, despite them being compiled using different compilers, different compiler versions, or from a patched source code. A gap of 0.08 in the GES score exists between the true positives and the rest of the procedures (0.419 for the icc 15 compiled procedure of openssl-1.0.1g "Heartbleed" vs. 0.333 for the bash 4.3 "ShellShock" procedure compiled with Clang 3.5). It is important to note that we will not try to establish a fixed threshold to evaluate the quality of these results. As mentioned, this clean separation between the true positives and the false positives is not always possible. Instead, this result and others, as shown in the following sections, are evaluated according to the produced ranking. The result in Fig. 5 receives a ROC = CROC = 1.0 score as it puts all of the true positives at the top of the ranking.

6.2 Decomposing Our Method into Sub-methods

When examining our method bottom up, we can divide it into three layers:

• S-VCP: The first layer of our method is the way we calculate the VCP between strands. Without the use of the statistical processing, we still define a similarity score as: Σ_{s_t ∈ T} max_{s_q ∈ Q} VCP(s_t, s_q). This approach attempts to generalize the VCP from a pair of strands to a pair of procedures by counting the maximal number of matched variables in the entire procedure.

• S-LOG: The next layer of our approach incorporates the statistical significance of every query strand, by using local and global significance. By alternatively defining Pr(s_q, s_t) = VCP(s_q, s_t) and applying it to the LES and GES equations, we can see how our method looks without applying the sigmoid function to the VCP.

• Esh: Adding the use of the sigmoid function results in our full method, as described in Section 3.

To fully justify and explain each layer of our method, Tab. 1 shows the results of each sub-method compared with our full method, in terms of (i) false positives (FP), and (ii) ROC & CROC (explained in Sec. 5.4). Note that we count the number of false positives as determined by a human examiner who receives the list of procedures sorted by similarity scores, and we define the number of false positives as the number of non-matching procedures the human examiner will have to test until all the true similar procedures are found. The effectiveness of a method can be measured more precisely and quickly by using CROC. We also included additional information about the number of basic-blocks and the number of strands extracted from them, and the CVE for every vulnerable procedure we attempted to find.

These results clearly show that each layer of our method increases its accuracy. Comparison of the different experiments shows that CROC & ROC scores do more than simply count the number of false positives for every threshold; they compare the rate by which the false positive rate grows. (Informally, this may be regarded as a prediction of the number of attempts after which the human researcher will give up.) An important point is that the size of the query, in terms of the number of basic blocks or strands, does not directly correlate with easier matching.

Experiment #3 shows an interesting event, where even S-VCP gets a perfect score. Upon examination of the query procedure, we discovered that this occurs because the said procedure contains several distinct numeric values which are only matched against similar procedures. (These are used to explore a data structure used to communicate with the QEMU floppy device.)

Examining the results as a whole, we see that in more than half of the experiments the use of S-VCP, which doesn't employ the statistical amplification of strands, results in a high number of false positives. To understand this better, we performed a thorough analysis of experiment #5. When we examine the values of Pr(s_q|H0), which express the frequency of appearance of strand s_q, we see that several strands get an unusually high score (appear more frequently). One of these strands was found to be a sequence of push REG instructions, which are commonplace for a procedure prologue.

6.3 Comparison of TRACY and Esh

Following our explanation of the different problem aspects in Sec. 5.3, we tested both tools, with a focus on experiment number one in Tab. 1. Each line in Tab. 2 represents a single experiment; the ✓ in one of the columns specifies that this specific problem aspect is applied: (i) Compiler version from the same vendor, (ii) Cross-compiler, meaning different compiler vendors, and (iii) applying patches to the code. For example, when (i) and (ii) are checked together, this means that all variations of queries created using all compiler vendors and compiler versions were put in the target database.
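Before turning to the table, the sub-method layering of Sec. 6.2 can be summarized in a short sketch; the probabilistic definitions, the sigmoid constants, and the VCP table are illustrative placeholders we assume here, not the paper's tuned equations (see Section 3):

# Sketch (ours) of the three scoring layers, over a precomputed
# strand-similarity table vcp[(s_q, s_t)] in [0, 1]. Constants and the
# exact probabilistic definitions are illustrative, not the paper's.
import math

def s_vcp(Q, T, vcp):
    # S-VCP: sum, over target strands, of their best VCP match in the query.
    return sum(max(vcp[(s_q, s_t)] for s_q in Q) for s_t in T)

def les(s_q, T, vcp, prob_h0, f):
    # Local evidence score: best match of s_q in the target, discounted
    # by how common s_q is across the whole corpus (Pr(s_q|H0)).
    best = max(f(vcp[(s_q, s_t)]) for s_t in T) or 1e-9   # guard log(0)
    return math.log(best / prob_h0[s_q])

def ges(Q, T, vcp, prob_h0, full=True):
    # S-LOG uses the VCP directly; Esh first sharpens it with a sigmoid.
    sig = (lambda x: 1 / (1 + math.exp(-10 * (x - 0.5)))) if full else (lambda x: x)
    return sum(les(s_q, T, vcp, prob_h0, sig) for s_q in Q)

The log-ratio form makes the amplification discussed above visible: a strand that appears everywhere has a high Pr(s_q|H0) and contributes little evidence, however well it matches.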
Versions | Cross | Patches | TRACY (Ratio-70) | Esh
   ✓     |       |         |      1.0000      | 1.0000
         |   ✓   |         |      0.6764      | 1.0000
         |       |    ✓    |      1.0000      | 1.0000
   ✓     |   ✓   |         |      0.5147      | 1.0000
   ✓     |       |    ✓    |      0.4117      | 1.0000
         |   ✓   |    ✓    |      0.8230      | 1.0000
   ✓     |   ✓   |    ✓    |      0.3529      | 1.0000

Table 2. Comparing TRACY and Esh on different problem aspects.

Alias           | Matched? | Similarity | Confidence
Heartbleed      |    ✗     |     -      |     -
Shellshock      |    ✗     |     -      |     -
Venom           |    ✗     |     -      |     -
Clobberin' Time |    ✗     |     -      |     -
Shellshock #2   |    ✗     |     -      |     -
ws-snmp         |    ✓     |    0.89    |    0.91
wget            |    ✗     |     -      |     -
ffmpeg          |    ✓     |    0.72    |    0.79

Table 3. Evaluating BinDiff.
As we can see, because TRACY was designed to handle patches, it achieves a perfect grade when dealing with them. Moreover, TRACY can successfully handle the different compiler versions on their own. However, when it is used in a cross-compiler search, its accuracy begins to plummet. Furthermore, when any two problem aspects are combined, and especially when all three are addressed, the method becomes practically unusable.

6.4 Evaluating BinDiff

BinDiff [8] is a tool for comparing whole executables/libraries by matching all the procedures within these libraries. It works by performing syntactic and structural matching, relying mostly on heuristics. The heuristic features over procedures which are the basis for similarity include: the number of jumps, the place of a given procedure in a call-chain, the number of basic blocks, and the name of the procedure (which is unavailable in stripped binaries). Detailed information about how this tool works can be found in [9], along with a clear statement that BinDiff ignores the semantics of concrete assembly-level instructions.

Tab. 3 shows the results of running BinDiff on each of the procedures in Tab. 1. As BinDiff operates on whole executables/libraries, the query/target was a whole library containing the original vulnerability. We only compared one target for each query, the same library compiled with a different vendor's compiler, and patched (for the queries where patching was evaluated). BinDiff failed to find the correct match in all experiments but two. The experiments in which BinDiff found the correct match are those where the number of blocks and branches remained the same and is relatively small, which is consistent with [9].

6.5 Pairwise Comparison

Fig. 6 shows the similarity measures produced in an all-vs-all experiment, where 40 queries were chosen at random from our corpus and compared. The result is shown in heat-map form, where the axes represent individual queries (X and Y are the same list of queries, in the same order) and each pixel's intensity represents the similarity score GES value (Eq. 1) for the query-target pair. Queries that originate from the same procedure (but are compiled with different compilers, or patched) are coalesced. Different procedures are separated by ticks. We included at least two different compilations for each procedure. The first procedure (leftmost on the X axis, bottom on the Y axis) is ftp_syst() from wget 1.8, queried in 6 different compilations. The second is taken from ffmpeg 2.4.6, queried in 7 different compilations. The rest are taken from Coreutils 8.23. The average ROC and CROC values for the experiment were 0.986 and 0.959, respectively.

[Figure 6: heat map; axis labels include parse_integer(), dev_ino_compare(), default_format(), print_stat(), cached_umask(), create_hard_link(), i_write(), compare_nodes(), ffmpeg-2.4.6: ff_rv34_decode_init_thread_copy(), and wget-1.8: ftp_syst().]

Figure 6. All vs. all experiment result in heat map form (pixel intensity = similarity GES measure).

Several observations can be made from Fig. 6:

1. The graph's diagonal represents the "ground truth", i.e., each query is matched perfectly with itself.

2. Our GES measure is not symmetrical (as it is based on the asymmetrical VCP metric).

3. Esh provides a very clear distinction for ffmpeg-2.4.6's ff_rv34_decode_init_thread_copy(), marked with a dashed region numbered 1, where all compiled queries of the procedure receive high GES scores when compared with each other and low ones when compared with random code. In general, Esh correctly matches procedures compiled with different compilers, as the pattern of "boxes" along the diagonal shows.

4. Groups of queries that originate from the same procedure generally receive similar GES values (i.e. similar shade) when compared to groups of targets that originate from the same (but different from the queries') code.
Although some procedures, such as default_format(), seem to perform poorly, as they have high GES matches with wrong targets (marked with a dashed region numbered 2), the correct evaluation of the matching should be relative to the correct matching. For default_format(), the GES values of the correct matching are high (specified by dark pixels around the diagonal), so relative to that, the set of wrong matchings (somewhat dark pixels in the middle) becomes less significant, which is reflected by ROC = .993 and CROC = .960 AUC scores.
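The ROC numbers quoted here summarize ranking quality. As a hedged illustration (ours, not the paper's evaluation code), the area under the ROC curve of a ranked result list equals the probability that a true match outranks a non-match:

# Sketch (ours): ROC AUC of a ranked retrieval result, computed as the
# fraction of (positive, negative) pairs that are ranked in the right order.
def roc_auc(scores, labels):
    """scores: similarity per target; labels: True for real matches."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
    return sum(pairs) / len(pairs)

# A perfect ranking, as in Fig. 5, yields 1.0:
print(roc_auc([0.9, 0.8, 0.3, 0.1], [True, True, False, False]))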
6.6 Limitations

In this subsection we discuss the limitations of our approach using case studies of procedures where Esh yields low recall and precision.

"Trivial" procedures and Wrappers Our approach relies on matching semantically similar, non-trivial fragments of execution. Thus, when matching a very small query (i.e. a "trivial" procedure), we are forced to rely on the statistical significance of a very small number of relatively short strands, which yields poor results. Wrappers are procedures that mostly contain calls to other procedures and hold very little logic of their own, e.g. the exit_cleanup() procedure in Fig. 7. As we operate on stripped binaries, we cannot rely on identifiers like procedure names to try and match strands based on procedure calls.

static void exit_cleanup (void)
{
  if (temphead)
    {
      struct cs_status cs = cs_enter ();
      cleanup ();
      cs_leave (cs);
    }
  close_stdout ();
}

Figure 7. Wrapper exit_cleanup procedure from sort.c of Coreutils 8.23.

Generic procedures The C concatenation preprocessor directive (##) is sometimes used as a mechanism for generics, where "template" procedures are created that have similar structure but vary in type, or use a function pointer to provide different functionality, as shown in Fig. 8, where different string comparison procedures are created to compare the creation, modification and access times of a file. As these hold the same general structure (and are specifically small), Esh deems the whole set of procedures similar. We do not consider these matches to be strict false positives, as the procedures are in fact similar in structure. Note that this problem only arises when these types of procedures are used as queries, as it is not clear whether the different variants should be considered as true positives (they perform similar semantics) or false positives (as they might operate on different data structures).

#define DEFINE_SORT_FUNCTIONS(key_name, key_cmp_func) \
  static int strcmp_##key_name (V a, V b) \
  { return key_cmp_func (a, b, strcmp); }
...
DEFINE_SORT_FUNCTIONS (ctime, cmp_ctime)
DEFINE_SORT_FUNCTIONS (mtime, cmp_mtime)
DEFINE_SORT_FUNCTIONS (atime, cmp_atime)

Figure 8. DEFINE_SORT_FUNCTIONS macro for creating "template" procedures in ls.c of Coreutils 8.23.

7. Related Work

In this section, we briefly describe closely related work.

Equivalence checking and semantic differencing Techniques in this category [25, 26, 29] are aimed at proving equivalence and describing differences between (versions of) procedures in high-level code, and do not apply to searching for similarity in machine code. The authors of [25, 26] offer some handling for loops, but apply computationally expensive abstract domain libraries, which do not scale in our setting. Sharma et al. [32] present a technique for proving equivalence between high-level code and machine code, with loops, by trying to find a simulation relation between the two. However, the search is computationally demanding and will not scale. Lahiri et al. [21] narrow the scope to assumed-to-be but cannot-be-proved-to-be-equivalent snippets of binary code, and attempt to find a set of changes (add or delete) to complete the proof. While having interesting implications for our problem, the assumption of a variable mapping encumbers the transition. Ng and Prakash [24] use symbolic execution for similarity score calculation, but their approach is not geared towards cross-compiler search, and is limited to handling patched procedures (specifically, it only handles one calling convention and rejects procedures based on the number of inputs).

Compiler bug-finding Hawblitzel et al. [17] present a technique that handles compiled versions of the same procedure from different compilers. Their goal is to identify root causes of compiler bugs, and their approach cannot be directly applied to our setting as they: (i) require strict equivalence, and thus even a slight change would be deemed a bug, (ii) know the IL originated from the same code, allowing them to easily match inputs and outputs (i.e. these are labeled) for solving, which is not the case for machine code, and (iii) offer no method for segmenting procedures and are thus limited in handling loops (they use loop unrolling up to 2 iterations).

Dynamic methods Egele et al. [15] present a dynamic approach which executes the procedure in a randomized environment and compares the side effects of the machine for similarity. As they base the similarity on a single randomized run, similarity may occur by chance, especially since they coerce execution of certain paths to achieve full coverage. Their evaluation indeed outperforms [8]. However, it is still wanting, as they rank similar functions in the top 10 in only 77% of the cases, and do not evaluate over patched versions.

An interesting method by Pewny et al. [27] uses a transition to an intermediate language (VEX-IR), a simplification using the Z3 theorem prover, sampling, and a hashing-based comparison metric. In their results they report several problems with false positives. We believe that this is because sampling alone is used to match two basic-blocks without proving the match, and that basic-blocks are not weighted against how common they are in the corpus (these basic-blocks might be a compilation by-product). Moreover, the goal of their method is to find clones for whole binaries. Thus, it might be hard to apply in situations where patching was performed.

Structure-based static methods Jacobson et al. [18] attempt to fingerprint binary procedures using the sequence of system calls used in the procedure. This approach is unstable to patching, and is only applicable to procedures which contain no indirect calls or use system calls directly.

Smith and Horwitz [33] recognized the importance of statistical significance for similarity problems, yet their method is geared towards source code and employs n-grams, which were shown ([14]) to be a weak representation for binary similarity tasks.

The authors of [28] show an interesting approach for finding similarity using expression trees and their similarity to each other, but this approach is vulnerable to code-motion and is not suited to cross-compiler search, as different compilers generate different calculation "shapes" for the same calculation.

Detecting software plagiarism Moss [10] (Measure Of Software Similarity) is an automatic system for determining the similarity of programs. To date, the main application of Moss has been in detecting plagiarism in programming classes. Moss implements some basic ideas that resemble our approach: (i) it decomposes programs and checks for fragment similarity, and (ii) it provides the ability to ignore common code. However, in Moss the code is broken down to lines and checked for an exact syntactic match, while Esh decomposes at block level and checks for semantic similarity. Furthermore, in Moss the ignored common code must be supplied by the user as the homework base template code, which is expected to appear in all submissions, while Esh finds common strands by statistical analysis. The ideas implemented in Moss bear resemblance to ours, but are preliminary and are not fully developed in a research paper or evaluated by experiments over binaries.

8. Conclusions

We presented a new statistical technique for measuring similarity between procedures. The main idea is to decompose procedures into smaller comparable fragments, define semantic similarity between them, and use statistical reasoning to lift fragment similarity into similarity between procedures. We implemented our technique, and applied it to find various prominent vulnerabilities across compilers and versions, including Heartbleed, Shellshock and Venom.

Our statistical notion of similarity is general, and not limited to binaries. We applied it to binaries, where the utility of other techniques (that use the structure of the code) is severely limited. In this paper, we used our notion of similarity for retrieval. In the future, we plan to investigate the use of our technique for clustering and classification.

Acknowledgment

We would like to thank Michal Irani for lively discussions, and for permitting the use of figures from [12]. We would like to thank Armando Solar-Lezama, our shepherd, for helping us improve and clarify the paper.

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7) under grant agreement no. 615688 - ERC-COG-PRIME, and from the Israel Ministry of Science and Technology, grant no. 3-9779.

References

[1] ClobberingTime: CVEs and affected products. http://www.kb.cert.org/vuls/id/852879.

[2] GNU Coreutils. http://www.gnu.org/software/coreutils.

[3] Heartbleed vulnerability CVE information. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2014-0160.

[4] Hex-Rays IDA Pro. http://www.hex-rays.com.

[5] Shellshock vulnerability CVE information. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2014-6271.

[6] SMACK: A bounded software verifier for C programs. https://github.com/smackers/smack.

[7] Venom vulnerability CVE information. http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-3456.

[8] zynamics BinDiff. http://www.zynamics.com/bindiff.html.

[9] zynamics BinDiff manual – Understanding BinDiff. http://www.zynamics.com/bindiff/manual/index.html#chapUnderstanding.

[10] Aiken, A. Moss. https://theory.stanford.edu/~aiken/moss/.

[11] Barnett, M., Chang, B. E., DeLine, R., Jacobs, B., and Leino, K. R. M. Boogie: A modular reusable verifier for object-oriented programs. In Formal Methods for Components and Objects, 4th International Symposium, FMCO 2005, Amsterdam, The Netherlands, November 1-4, 2005, Revised Lectures (2005), pp. 364–387.

[12] Boiman, O., and Irani, M. Similarity by composition. In NIPS (2006), MIT Press, pp. 177–184.

[13] Brumley, D., Jager, I., Avgerinos, T., and Schwartz, E. J. BAP: A binary analysis platform. In Proceedings of the 23rd International Conference on Computer Aided Verification (2011), CAV'11, Springer-Verlag, pp. 463–469.
[14] David, Y., and Yahav, E. Tracelet-based code search in executables. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (2014), PLDI '14, ACM, pp. 349–360.

[15] Egele, M., Woo, M., Chapman, P., and Brumley, D. Blanket execution: Dynamic similarity testing for program binaries and components. In Proceedings of the 23rd USENIX Security Symposium, San Diego, CA, USA, August 20-22, 2014 (2014), pp. 303–317.

[16] Ferrante, J., Ottenstein, K. J., and Warren, J. D. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst. 9, 3 (1987), 319–349.

[17] Hawblitzel, C., Lahiri, S. K., Pawar, K., Hashmi, H., Gokbulut, S., Fernando, L., Detlefs, D., and Wadsworth, S. Will you still compile me tomorrow? Static cross-version compiler validation. In Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, ESEC/FSE'13, Saint Petersburg, Russian Federation, August 18-26, 2013 (2013), pp. 191–201.

[18] Jacobson, E. R., Rosenblum, N., and Miller, B. P. Labeling library functions in stripped binaries. In Proceedings of the 10th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools (2011), PASTE '11, ACM, pp. 1–8.

[19] Khoo, W. M., Mycroft, A., and Anderson, R. Rendezvous: A search engine for binary code. In Proceedings of the 10th Working Conference on Mining Software Repositories (2013), MSR '13, IEEE Press, pp. 329–338.

[20] Kleinbaum, D. G., and Klein, M. Analysis of Matched Data Using Logistic Regression. Springer, 2010.

[21] Lahiri, S. K., Sinha, R., and Hawblitzel, C. Automatic rootcausing for program equivalence failures in binaries. In Computer Aided Verification - 27th International Conference, CAV 2015, San Francisco, CA, USA, July 18-24, 2015, Proceedings, Part I (2015), pp. 362–379.

[22] Lattner, C., and Adve, V. LLVM: A compilation framework for lifelong program analysis & transformation. In Code Generation and Optimization, 2004. CGO 2004. International Symposium on (2004), IEEE, pp. 75–86.

[23] Leino, K. R. M. This is Boogie 2. http://research.microsoft.com/en-us/um/people/leino/papers/krml178.pdf.

[24] Ng, B. H., and Prakash, A. Expose: Discovering potential binary code re-use. In Computer Software and Applications Conference (COMPSAC), 2013 IEEE 37th Annual (July 2013), pp. 492–501.

[25] Partush, N., and Yahav, E. Abstract semantic differencing for numerical programs. In Static Analysis: 20th International Symposium, SAS 2013, Seattle, WA, USA, June 20-22, 2013, Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 238–258.

[26] Partush, N., and Yahav, E. Abstract semantic differencing via speculative correlation. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA 2014, part of SPLASH 2014, Portland, OR, USA, October 20-24, 2014 (2014), pp. 811–828.

[27] Pewny, J., Garmany, B., Gawlik, R., Rossow, C., and Holz, T. Cross-architecture bug search in binary executables. In 2015 IEEE Symposium on Security and Privacy, SP 2015, San Jose, CA, USA, May 17-21, 2015 (2015), pp. 709–724.

[28] Pewny, J., Schuster, F., Bernhard, L., Holz, T., and Rossow, C. Leveraging semantic signatures for bug search in binary programs. In Proceedings of the 30th Annual Computer Security Applications Conference (2014), ACSAC '14, ACM, pp. 406–415.

[29] Ramos, D. A., and Engler, D. R. Practical, low-effort equivalence verification of real code. In Proceedings of the 23rd International Conference on Computer Aided Verification (2011), CAV'11, Springer-Verlag, pp. 669–685.

[30] Rosenblum, N., Miller, B. P., and Zhu, X. Recovering the toolchain provenance of binary code. In Proceedings of the 2011 International Symposium on Software Testing and Analysis (2011), ISSTA '11, ACM, pp. 100–110.

[31] Sæbjørnsen, A., Willcock, J., Panas, T., Quinlan, D. J., and Su, Z. Detecting code clones in binary executables. In Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, ISSTA 2009, Chicago, IL, USA, July 19-23, 2009 (2009), pp. 117–128.

[32] Sharma, R., Schkufza, E., Churchill, B., and Aiken, A. Data-driven equivalence checking. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications (2013), OOPSLA '13, ACM, pp. 391–406.

[33] Smith, R., and Horwitz, S. Detecting and measuring similarity in code clones. In Proceedings of the International Workshop on Software Clones (IWSC) (2009).

[34] Swamidass, S. J., Azencott, C., Daily, K., and Baldi, P. A CROC stronger than ROC: measuring, visualizing and optimizing early retrieval. Bioinformatics 26, 10 (2010), 1348–1356.

[35] Weiser, M. Program slicing. In Proceedings of the 5th International Conference on Software Engineering, San Diego, California, USA, March 9-12, 1981 (1981), pp. 439–449.