Graduation Assignment: Binary Similarity With Machine Learning
Binary similarity with machine learning
• Decompilers:
• Disassembly
• Lifting & dataflow analysis
• Control flow analysis
• Type analysis
• …
Source: https://www.backerstreet.com/decompiler/decompiler_architecture.htm
Complexity
• Compilers:
• Remove important information
• Optimize code
• Decompilers:
• Disassembly
• Lifting & dataflow analysis
• Control flow analysis
• Type analysis
• …
Source: https://redshift.autodesk.com/machine-learning/
State of the art
• Machine learning is used in large-scale processes (e.g. malware detection)
• Decompilation is used in small-scale processes (manual reverse engineering)
• Little to no research exists on combining decompilation with machine learning
The assignment was changed (due to its difficulty)
• Subject: binary similarity with machine learning
• Main question: which machine learning algorithm gives the most accurate results when
comparing Linux binary programs?
• Algorithms:
• Naive Bayes (NB)
• Support Vector Machine (SVM)
• K Nearest Neighbors (KNN)
• Decision Trees (DTs)
Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow, with permission of O'Reilly Media, Copyright © Aurélien Géron
Classifying Linux binaries
Source: http://www.nltk.org/book/ch06.html
• Requirements:
• A labeled dataset with Linux binaries
• Extract features from Linux binaries
Linux binary
• ELF format
• Contains machine code
• Machine code is hard to disassemble
Source: https://commons.wikimedia.org/wiki/File:Elf-layout--en.svg
Linux binary analysis frameworks
Radare2 is open source
Generate a dataset
• Compiled 60 C files into 1000 Linux binaries with gcc & clang using different
optimization flags
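A build matrix along these lines could generate such a dataset. This is only a sketch: the flag set, naming scheme, and helper name are assumptions, not the original build setup (60 sources × this particular matrix would yield 600 binaries, so the real matrix was evidently larger).

```python
import itertools
from pathlib import Path

COMPILERS = ["gcc", "clang"]
OPT_FLAGS = ["-O0", "-O1", "-O2", "-O3", "-Os"]  # assumed flag matrix

def build_commands(sources, out_dir):
    """One compile command per (source file, compiler, optimization flag) combination."""
    cmds = []
    for src, cc, opt in itertools.product(sources, COMPILERS, OPT_FLAGS):
        # e.g. "foo.c" compiled by clang at -O2 becomes out_dir/foo_clang_O2
        out = Path(out_dir) / f"{Path(src).stem}_{cc}_{opt.lstrip('-')}"
        cmds.append([cc, opt, "-o", str(out), str(src)])
    return cmds

# Run each command with subprocess.run(cmd, check=True) to build the binaries.
cmds = build_commands(["hello.c"], "binaries")
```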
Features in Linux binaries
• Features need to be relevant and discriminative
• Unsuitable features:
• Hashes
• Strings
• Entropy
• Suitable features:
• Control flow graphs
• Call graphs
• N-grams
Control flow graph & call graph
Kernel trick
Source: http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html
Graph kernel
• Computationally complex to implement and evaluate
N-gram
• A contiguous sequence of N items
• Can be extracted on different levels:
• Byte level
• Call graph level
• Function level
Source: http://recognize-speech.com/images/Antonio/Unigram.png
Byte level
• hexdump
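Byte-level n-grams can be taken directly from the raw file contents, as in this minimal sketch (not the actual tooling used in the project):

```python
def byte_ngrams(data: bytes, n: int = 2):
    """Slide a window of n consecutive bytes over the raw binary."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]

# The first bytes of a typical x86-64 function prologue (push rbp; mov rbp, rsp).
grams = byte_ngrams(b"\x55\x48\x89\xe5", 2)
# grams == [b"\x55\x48", b"\x48\x89", b"\x89\xe5"]
```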
Function level
Simplification
From function to N-grams to ML vector
Function:  [push, mov, push, sub, call, …]
        ↓
Bigrams:   [push mov, mov push, push sub, sub call, …]
        ↓
ML vector: […, 0, 0, 1, 1, 0, 0, 1, …]
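The mapping above can be sketched in a few lines of Python; the mnemonic list and the bigram vocabulary here are illustrative, not the project's actual data:

```python
def mnemonic_bigrams(mnemonics):
    """Pair each instruction mnemonic with its successor."""
    return [f"{a} {b}" for a, b in zip(mnemonics, mnemonics[1:])]

def to_vector(bigrams, vocabulary):
    """Binary presence vector over a fixed bigram vocabulary."""
    present = set(bigrams)
    return [1 if v in present else 0 for v in vocabulary]

func = ["push", "mov", "push", "sub", "call"]
bigrams = mnemonic_bigrams(func)
# bigrams == ["push mov", "mov push", "push sub", "sub call"]

vocab = ["mov mov", "push mov", "mov push", "push sub", "sub call", "call ret"]
vec = to_vector(bigrams, vocab)
# vec == [0, 1, 1, 1, 1, 0]
```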
Proof of concept
1. Extract all functions from binaries
2. Filter out functions shorter than 5 instructions
3. Simplify functions
4. Make N-grams
5. Convert N-grams to feature vectors
6. Label each function according to its name
7. Apply machine learning algorithms
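Steps 2 through 6 can be sketched as a single preparation function. This is a pure-Python illustration under the assumption that function names and mnemonic lists have already been extracted (step 1) with a binary-analysis framework; the helper name and sample data are hypothetical:

```python
def prepare_dataset(functions, vocabulary, min_len=5):
    """functions: list of (name, [mnemonic, ...]) pairs extracted per binary.
    Returns (feature_vectors, labels) ready for the ML algorithms."""
    X, y = [], []
    for name, mnemonics in functions:
        if len(mnemonics) < min_len:                                      # step 2: drop tiny functions
            continue
        bigrams = {f"{a} {b}" for a, b in zip(mnemonics, mnemonics[1:])}  # steps 3-4: bigrams
        X.append([1 if v in bigrams else 0 for v in vocabulary])          # step 5: feature vector
        y.append(name)                                                    # step 6: label by name
    return X, y

funcs = [
    ("main", ["push", "mov", "push", "sub", "call", "ret"]),
    ("stub", ["ret"]),  # shorter than 5 instructions: filtered out
]
vocab = ["push mov", "mov push", "push sub", "sub call", "call ret"]
X, y = prepare_dataset(funcs, vocab)
# y == ["main"]; X == [[1, 1, 1, 1, 1]]
```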
K-fold cross validation
Source: https://commons.wikimedia.org/wiki/File:K-fold_cross_validation_EN.jpg
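With scikit-learn, comparing the four candidate algorithms under k-fold cross-validation looks roughly like this. Synthetic data stands in for the real feature vectors, and the fold count (k = 5) is an assumption, not necessarily what the project used:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))   # stand-in binary bigram feature vectors
y = rng.integers(0, 4, size=200)         # stand-in function-name labels

models = {
    "NB":  BernoulliNB(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "DT":  DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    # cross_val_score trains and evaluates on k disjoint train/test splits
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```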
Results - unigram
Results - bigram
Conclusion
• Which machine learning algorithm gives the most accurate results when comparing Linux
binary programs?
• Decision trees, with 92.5% accuracy
Further possible research
• Parameter optimization in learning algorithms
• Graph kernel implementation
• Ensemble learning algorithms
• Assemble and use a bigger dataset