Graduation Assignment: Binary Similarity With Machine Learning

Graduation
assignment
Binary similarity with machine learning
Together building safe software
21-06-2017 www.securify.nl Hamza

Program
• Introduction
• Research question
• Project approach
• Research
• Proof of concept
• Results
• Conclusion
• Follow-up research
• Questions
Initial self-made assignment
• Decompilation with machine learning
Complexity
• Compilers:
  • Remove important information
  • Optimize code
• Decompilers:
  • Disassembly
  • Lifting & dataflow analysis
  • Control flow analysis
  • Type analysis
  • …
Source: https://www.backerstreet.com/decompiler/decompiler_architecture.htm
Source: https://redshift.autodesk.com/machine-learning/
State of the art
• Machine learning is used in large scale processes (malware detection)
• Decompilation is used in small scale processes (manual reverse engineering)
• Little to no research in the field of decompilation with machine learning
Changed the assignment (due to difficulty)
• Subject: binary similarity with machine learning
• Main question: which machine learning algorithm gives the most accurate results when
comparing Linux binary programs?
How similar are we?

Goal & purpose
• Clone & piracy detection
• Malware variant detection (out of scope)
Sub questions
• Machine learning types
• Linux binary
• Machine learning algorithms for Linux binaries
• Features in Linux binaries
• Binary analysis frameworks
• Evaluation / implementation
Project approach
• Agile
• Scope of the assignment
• Project phases
• Research methods
Project phases
1. Exploratory research
2. Targeted research
3. Implementation and evaluation
4. Processing results
Research methods
1. Exploratory research => exploratory research
2. Targeted research => desk research
3. Implementation and evaluation => experimental research
4. Processing results
Machine learning types
• Supervised learning
• Unsupervised learning
• Reinforcement learning
Supervised classification machine learning
• Algorithms:
  • Naive Bayes (NB)
  • Support Vector Machine (SVM)
  • K Nearest Neighbors (KNN)
  • Decision Trees (DTs)
Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow, with permission of O'Reilly Media, Copyright © Aurélien Géron
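As an illustration of one of the listed algorithms, here is a minimal K Nearest Neighbors sketch in plain Python. It is not the implementation used in the project: the binary feature vectors, the labels "sort" and "hash", the Hamming distance, and k=3 are all assumptions made up for this example.

```python
from collections import Counter

def hamming(a, b):
    """Count the positions where two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train_vectors, train_labels, query, k=3):
    """Return the majority label among the k training vectors closest to query."""
    ranked = sorted(range(len(train_vectors)),
                    key=lambda i: hamming(train_vectors[i], query))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled dataset: three samples per class.
train_vectors = [
    [1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 0, 0],
    [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1],
]
train_labels = ["sort", "sort", "sort", "hash", "hash", "hash"]

prediction = knn_predict(train_vectors, train_labels, [1, 1, 0, 1])
```

Naive Bayes, SVM, and decision trees follow the same pattern of fitting on labeled vectors and predicting a label for an unseen vector.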
Classifying Linux binaries
Source: http://www.nltk.org/book/ch06.html
• Requirements:
  • A labeled dataset with Linux binaries
  • Extract features from Linux binaries
Linux binary
• ELF format
• Contains machine code
• Machine code is hard to disassemble
Source: https://commons.wikimedia.org/wiki/File:Elf-layout--en.svg
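To make the ELF bullet concrete, the sketch below reads the first identification bytes of a file image: the magic number, the class byte (32-bit or 64-bit), and the data-encoding byte (endianness). The function name `describe_elf_header` and the sample bytes are hypothetical; a real tool would read these bytes from a binary on disk.

```python
ELF_MAGIC = b"\x7fELF"

def describe_elf_header(data: bytes):
    """Return basic facts from the ELF identification bytes, or None if not ELF."""
    if len(data) < 6 or data[:4] != ELF_MAGIC:
        return None
    bits = {1: 32, 2: 64}.get(data[4])                 # EI_CLASS byte
    endianness = {1: "little", 2: "big"}.get(data[5])  # EI_DATA byte
    return {"bits": bits, "endianness": endianness}

# Hypothetical 16-byte identification block of a 64-bit little-endian binary.
sample = b"\x7fELF\x02\x01\x01" + bytes(9)
info = describe_elf_header(sample)
```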
Linux binary analysis frameworks
• Radare2 is open source
Generate a dataset
• Compiled 60 C files into 1000 Linux binaries with gcc and clang, using different
optimization flags
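A dataset like this could be produced by compiling every source file with every compiler and flag combination. The sketch below only builds the command lines. The flag list, the output naming scheme, and the file names `sort.c` and `hash.c` are assumptions, since the slides do not list the flags that were actually used, so the counts here do not reproduce the 1000 binaries.

```python
from itertools import product
from pathlib import Path

# Assumed values: the slides name gcc and clang but not the exact flags.
COMPILERS = ["gcc", "clang"]
OPT_FLAGS = ["-O0", "-O1", "-O2", "-O3", "-Os"]

def build_commands(sources, out_dir="build"):
    """Return one compiler command per (source, compiler, flag) combination."""
    commands = []
    for src, compiler, flag in product(sources, COMPILERS, OPT_FLAGS):
        stem = Path(src).stem
        output = Path(out_dir) / f"{stem}_{compiler}_{flag.lstrip('-')}"
        commands.append([compiler, flag, "-o", str(output), str(src)])
    return commands

# Hypothetical source files.
commands = build_commands(["sort.c", "hash.c"])
```

Each command could then be run with `subprocess.run(cmd, check=True)`. Encoding the compiler and flag in the output name keeps the provenance of every binary visible.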
Features in Linux binaries
• Features need to be relevant and discriminative
• Unsuitable features:
  • Hashes
  • Strings
  • Entropy
• Suitable features:
  • Control flow graphs
  • Call graphs
  • N-grams
Control flow graph & call graph
Figures: control flow graph; call graph

Graph problem
• Most machine learning algorithms expect fixed-length feature vectors and do not work directly with graphs
• A custom graph kernel is needed
Kernel trick
Source: http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html
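A small numeric check can make the kernel trick tangible. The sketch below uses the degree-2 polynomial kernel k(x, y) = (x · y)² on 2-D inputs, chosen here purely as an illustration (the slides do not say which kernel was considered), and shows that it gives the same value as a dot product in an explicitly mapped 3-D feature space.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D input."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(a, b):
    """Kernel value computed in the original 2-D space, without calling phi."""
    return dot(a, b) ** 2

# Arbitrary example points.
x, y = (1.0, 2.0), (3.0, 0.5)
explicit = dot(phi(x), phi(y))   # map first, then take the dot product
implicit = poly_kernel(x, y)     # same quantity via the kernel
```

Both values agree up to floating-point rounding, which is why an algorithm such as an SVM can work in the mapped space while only ever evaluating the kernel.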
Graph kernel
• Complex
N-gram
• Is a contiguous sequence of N items
• Can be extracted on different levels:
  • Byte level
  • Call graph level
  • Function level
Source: http://recognize-speech.com/images/Antonio/Unigram.png
Byte level
• hexdump
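At byte level, N-grams can be taken directly from the raw bytes that a hexdump shows. The following is a minimal sketch; the function name `byte_ngrams` is hypothetical, and the sample bytes are a typical x86-64 function prologue (`push rbp; mov rbp, rsp; sub rsp, 0x10`) chosen only as an example.

```python
from collections import Counter

def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every contiguous run of n bytes in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

# Example bytes, written the way a hexdump would display them.
code = bytes.fromhex("554889e54883ec10")
counts = byte_ngrams(code, n=2)
```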
Function level
Simplification
From function to N-grams to ML vector
Function: [push, mov, push, sub, call, …]
↓ ↓↓ ↓↓
Bigrams: [push mov, mov push, push sub, sub call, …]
↓ ↓↓ ↓↓
ML vector: […, 0, 0, 1, 1, 0, 0, 1, …]
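The conversion above can be sketched in a few lines of Python. This assumes a binary presence vector over a vocabulary of all bigrams seen in the dataset, which matches the 0/1 vector on the slide. The helper names and the second example function are made up for illustration.

```python
def ngrams(items, n):
    """Return all contiguous n-item tuples of a sequence."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

def build_vocabulary(functions, n):
    """Map every distinct n-gram in the dataset to a fixed vector position."""
    vocab = sorted({gram for func in functions for gram in ngrams(func, n)})
    return {gram: index for index, gram in enumerate(vocab)}

def vectorize(function, vocabulary, n):
    """Return a 0/1 vector marking which vocabulary n-grams occur in function."""
    vector = [0] * len(vocabulary)
    for gram in ngrams(function, n):
        if gram in vocabulary:
            vector[vocabulary[gram]] = 1
    return vector

# First function taken from the slide, second one hypothetical.
functions = [
    ["push", "mov", "push", "sub", "call"],
    ["push", "mov", "sub", "ret"],
]
vocabulary = build_vocabulary(functions, 2)
vectors = [vectorize(func, vocabulary, 2) for func in functions]
```

Because the vocabulary is fixed, every function ends up with a vector of the same length, which is what the classifiers require.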
Proof of concept
1. Extract all functions from binaries
2. Filter out functions with fewer than 5 instructions
3. Simplify functions
4. Make N-grams
5. Convert N-grams to feature vectors
6. Label each feature vector with its function name
7. Apply machine learning algorithms
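Steps 2, 3, and 6 of this pipeline can be sketched as follows. The input format (a list of name and disassembly pairs) and the simplification rule (keeping only the mnemonic of each instruction) are assumptions based on the earlier example slide, not a description of the actual proof of concept.

```python
MIN_INSTRUCTIONS = 5  # threshold taken from step 2

def simplify(instructions):
    """Keep only the mnemonic of each instruction (assumed simplification)."""
    return [instruction.split()[0] for instruction in instructions]

def prepare_samples(functions):
    """Drop short functions, simplify the rest, and label them by function name."""
    samples, labels = [], []
    for name, instructions in functions:
        if len(instructions) < MIN_INSTRUCTIONS:
            continue  # too short to be discriminative
        samples.append(simplify(instructions))
        labels.append(name)
    return samples, labels

# Hypothetical disassembly of two extracted functions.
extracted = [
    ("main", ["push rbp", "mov rbp, rsp", "sub rsp, 0x10",
              "call helper", "leave", "ret"]),
    ("stub", ["xor eax, eax", "ret"]),
]
samples, labels = prepare_samples(extracted)
```

The resulting mnemonic lists would then be turned into N-grams and feature vectors before a classifier is trained on them.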
K-fold cross validation
Source: https://commons.wikimedia.org/wiki/File:K-fold_cross_validation_EN.jpg
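K-fold cross validation splits the dataset into k parts and uses each part once as the test set while training on the rest. The sketch below generates the index splits only. The sample count and k are arbitrary, and a real evaluation would normally shuffle the samples first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop

folds = list(k_fold_indices(10, 5))
```

Averaging the accuracy over all folds gives a more stable estimate than a single train/test split.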
Results - unigram
Results - bigram
Conclusion
• Which machine learning algorithm gives the most accurate results when comparing Linux
binary programs?
• Decision trees with 92.5% accuracy
Further possible research
• Parameter optimization in learning algorithms
• Graph kernel implementation
• Ensemble learning algorithms
• Assemble and use a bigger dataset
