
Graduation assignment
Binary similarity with machine learning

Together building safe software

21-06-2017 www.securify.nl Hamza


Program
• Introduction
• Research question
• Project approach
• Research
• Proof of concept
• Results
• Conclusion
• Follow-up research
• Questions
Initial self-defined assignment
• Decompilation with machine learning
Complexity
• Compilers:
• Remove important information
• Optimize code

• Decompilers:
• Disassembly
• Lifting & dataflow analysis
• Control flow analysis
• Type analysis
• …

Source: https://www.backerstreet.com/decompiler/decompiler_architecture.htm
Source: https://redshift.autodesk.com/machine-learning/
State of the art
• Machine learning is used in large scale processes (malware detection)
• Decompilation is used in small scale processes (manual reverse engineering)
• Little to no research in the field of decompilation with machine learning
Changed the assignment (due to difficulty)
• Subject: binary similarity with machine learning
• Main question: which machine learning algorithm gives the most accurate results
when comparing Linux binary programs?

How similar are we?


Goal & purpose
• Clone & piracy detection
• Malware variant detection (out of scope)

Sub questions
• Machine learning types
• Linux binary
• Machine learning algorithms for Linux binaries
• Features in Linux binaries
• Binary analysis frameworks
• Evaluation / implementation
Project approach
• Agile
• Scope of the assignment
• Project phases
• Research methods
Project phases
1. Exploratory research
2. Targeted research
3. Implementation and evaluation
4. Processing results
Research methods
1. Exploratory research => exploratory research
2. Targeted research => desk research
3. Implementation and evaluation => experimental research
4. Processing results
Machine learning types
• Supervised learning
• Unsupervised learning
• Reinforcement learning
Supervised classification machine learning

• Algorithms:
• Naive Bayes (NB)
• Support Vector Machine (SVM)
• K Nearest Neighbors (KNN)
• Decision Trees (DTs)
Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow, with permission from O'Reilly Media, Copyright © Aurélien Géron
Classifying Linux binaries

Source: http://www.nltk.org/book/ch06.html

• Requirements:
• A labeled dataset with Linux binaries
• Extract features from Linux binaries
Linux binary
• ELF format
• Contains machine code
• Machine code is hard to disassemble

Source: https://commons.wikimedia.org/wiki/File:Elf-layout--en.svg
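As a small illustration of the ELF format, the sketch below decodes the identification bytes at the start of a file. It is a minimal example and not part of the original project; the function name `parse_elf_ident` and the returned dictionary keys are chosen here for illustration only.

```python
import struct

ELF_MAGIC = b"\x7fELF"  # every ELF file starts with these four bytes


def parse_elf_ident(data: bytes) -> dict:
    """Decode e_ident plus the e_type and e_machine fields of an ELF header."""
    if len(data) < 20 or data[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    # Byte 4 is the class (32/64 bit), byte 5 the data encoding (endianness).
    ei_class = {1: "ELF32", 2: "ELF64"}.get(data[4], "unknown")
    ei_data = {1: "little", 2: "big"}.get(data[5], "unknown")
    # e_type and e_machine are two 16-bit fields directly after the
    # 16-byte e_ident array.
    endian = ">" if ei_data == "big" else "<"
    e_type, e_machine = struct.unpack_from(endian + "HH", data, 16)
    return {
        "class": ei_class,
        "endianness": ei_data,
        "type": e_type,
        "machine": e_machine,
    }
```

To try it on a real binary, read the first 20 bytes of a file (for example with `open(path, "rb").read(20)`) and pass them to `parse_elf_ident`.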
Linux binary analysis frameworks
Radare2 is open source
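A hedged sketch of how functions could be pulled out of a binary with Radare2 through its `r2pipe` Python bindings. The slides do not show the actual extraction code, so this is an assumption about one possible approach. The JSON field names used here (`name`, `ops`, `opcode`) are the ones radare2 is believed to emit for `aflj` and `pdfj`; check them against the installed version. Reducing each instruction to its mnemonic is also an assumption, chosen to match the `[push, mov, push, …]` form used later in the slides.

```python
def mnemonics(ops):
    """Keep only the mnemonic (first word) of each disassembled instruction."""
    result = []
    for op in ops:
        text = op.get("opcode", "")
        if text:
            result.append(text.split()[0])
    return result


def extract_functions(path):
    """Return {function name: [mnemonic, ...]} for one binary.

    Requires radare2 and the r2pipe package; imported lazily so the
    helper above can be used without them.
    """
    import r2pipe

    r2 = r2pipe.open(path)
    try:
        r2.cmd("aaa")  # run radare2's analysis so functions are detected
        functions = {}
        for fn in r2.cmdj("aflj") or []:  # list functions as JSON
            body = r2.cmdj("pdfj @ " + fn["name"]) or {}  # disassemble one function
            functions[fn["name"]] = mnemonics(body.get("ops", []))
        return functions
    finally:
        r2.quit()
```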
Generate a dataset
• Compiled 60 C files into 1000 Linux binaries with gcc & clang, using different
optimization flags
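The sketch below shows one way such a dataset could be generated: it builds one compile command per combination of source file, compiler and optimization flag. The exact flags used in the project are not listed on the slide, so `OPT_FLAGS`, the output directory and the output naming scheme are assumptions for illustration.

```python
from itertools import product
from pathlib import Path

COMPILERS = ["gcc", "clang"]
# Assumed flag set; the slide only says "different optimization flags".
OPT_FLAGS = ["-O0", "-O1", "-O2", "-O3", "-Os"]


def build_commands(sources, out_dir="bin"):
    """Return one compiler command line per (source, compiler, flag) combination."""
    commands = []
    for src, cc, flag in product(sources, COMPILERS, OPT_FLAGS):
        stem = Path(src).stem
        # Hypothetical naming scheme, e.g. bin/sort_gcc_O2
        out = Path(out_dir) / f"{stem}_{cc}_{flag.lstrip('-')}"
        commands.append([cc, flag, "-o", str(out), src])
    return commands
```

Each returned command can be passed to `subprocess.run(cmd, check=True)` to actually compile the binary.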
Features in Linux binaries
• Features need to be relevant and discriminative
• Unsuitable features:
• Hashes
• Strings
• Entropy
• Suitable features:
• Control flow graphs
• Call graphs
• N-grams
Control flow graph & call graph

Figures: control flow graph and call graph


Graph problem
• Standard machine learning algorithms expect fixed-length feature vectors, not graphs
• A custom graph kernel is needed
Kernel trick

Source: http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html
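A small worked example may help with the kernel trick. For the degree-2 polynomial kernel k(x, y) = (x · y)^2 on 2-D inputs, the kernel value equals an ordinary dot product after mapping both inputs into a 3-D feature space, without the mapping ever having to be computed. This kernel is a standard textbook example and is not taken from the slides; the function names are illustrative.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, y):
    """k(x, y) = (x . y)^2, computed without building phi(x) or phi(y)."""
    return dot(x, y) ** 2
```

For x = (1, 2) and y = (3, 4), both `poly_kernel(x, y)` and `dot(phi(x), phi(y))` give 121 (up to floating-point rounding), which is the point of the trick: a graph kernel would play the same role for graphs.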
Graph kernel
• Complex
N-gram
• A contiguous sequence of N items
• Can be extracted at different levels:
• Byte level
• Call graph level
• Function level

Source: http://recognize-speech.com/images/Antonio/Unigram.png
Byte level
• hexdump
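At byte level, N-grams can be counted directly on the raw file contents, the same bytes a hexdump displays. A minimal sketch, with the function name `byte_ngrams` chosen for illustration:

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every contiguous run of n bytes in data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))
```

For example, `byte_ngrams(open(path, "rb").read(), 2)` would count all byte bigrams in a binary.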
Function level
Simplification
From function to N-grams to ML vector
Function: [push, mov, push, sub, call, …]
↓ ↓↓ ↓↓
Bigrams: [push mov, mov push, push sub, sub call, …]
↓ ↓↓ ↓↓
ML vector: […, 0, 0, 1, 1, 0, 0, 1, …]
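The transformation above can be sketched in a few lines of Python. The vocabulary ordering (sorted bigrams) and the function names are assumptions made for this example; the slides only show that each vector position marks whether a given bigram is present.

```python
def ngrams(tokens, n=2):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(functions, n=2):
    """Map every n-gram seen in any function to a fixed vector position."""
    vocab = sorted({g for f in functions for g in ngrams(f, n)})
    return {g: i for i, g in enumerate(vocab)}


def to_vector(tokens, vocab, n=2):
    """Binary vector: 1 where the function contains the n-gram, else 0."""
    vec = [0] * len(vocab)
    for g in ngrams(tokens, n):
        if g in vocab:
            vec[vocab[g]] = 1
    return vec


f1 = ["push", "mov", "push", "sub", "call"]
f2 = ["push", "mov", "pop", "ret"]  # second function, made up for illustration
vocab = build_vocabulary([f1, f2])
v1 = to_vector(f1, vocab)  # [0, 1, 0, 1, 1, 1]
v2 = to_vector(f2, vocab)  # [1, 0, 1, 1, 0, 0]
```

Because every function is projected onto the same vocabulary, all vectors have the same length, which is what the classifiers need.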
Proof of concept
1. Extract all functions from binaries
2. Filter out functions with fewer than 5 instructions
3. Simplify functions
4. Make N-grams
5. Convert N-grams to feature vectors
6. Label each feature vector with its function name
7. Apply machine learning algorithms
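Steps 2 and 4 to 7 can be sketched end to end in plain Python. This is only an illustration under stated assumptions: the input is taken to be already extracted and simplified mnemonic lists, step 2 is read as dropping functions with fewer than 5 instructions, and a 1-nearest-neighbour classifier with Hamming distance stands in for the algorithms actually evaluated, purely because it fits in a few lines. All names are hypothetical.

```python
MIN_INSTRUCTIONS = 5  # step 2: drop functions shorter than this


def bigrams(tokens):
    """Set of contiguous mnemonic pairs in one function."""
    return {tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)}


def vectorize(tokens, vocab):
    """Binary feature vector over a fixed, ordered bigram vocabulary."""
    present = bigrams(tokens)
    return [1 if g in present else 0 for g in vocab]


def prepare(samples):
    """samples: list of (function_name, [mnemonic, ...]) pairs.

    Returns (vectors, labels, vocab) after filtering short functions.
    """
    kept = [(name, ins) for name, ins in samples if len(ins) >= MIN_INSTRUCTIONS]
    vocab = sorted({g for _, ins in kept for g in bigrams(ins)})
    vectors = [vectorize(ins, vocab) for _, ins in kept]
    labels = [name for name, _ in kept]  # step 6: label = function name
    return vectors, labels, vocab


def predict_1nn(vectors, labels, query):
    """Return the label of the training vector closest in Hamming distance."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    best = min(range(len(vectors)), key=lambda i: hamming(vectors[i], query))
    return labels[best]
```

A query function is vectorized with the same `vocab` and passed to `predict_1nn`; bigrams that never occurred in the training set are simply ignored.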
K-fold cross validation

Source: https://commons.wikimedia.org/wiki/File:K-fold_cross_validation_EN.jpg
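K-fold cross validation splits the dataset into k parts and uses each part once as the test set while training on the rest. A minimal sketch of the index bookkeeping, without shuffling; the function name is illustrative and this is not the code used in the project:

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size
```

The accuracy reported for a classifier is then the average over the k test folds. In practice the samples are usually shuffled before splitting.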
Results - unigram
Results - bigram
Conclusion
• Which machine learning algorithm gives the most accurate results when comparing Linux
binary programs?
• Decision trees with 92.5% accuracy
Further possible research
• Parameter optimization in learning algorithms
• Graph kernel implementation
• Ensemble learning algorithms
• Assemble and use a bigger dataset
