FINAL REPORT 2.pdf
50 Pages 682.9KB
Mar 26, 2024 1:58 PM GMT+5:30
CLASSIFICATION OF MALWARE
USING REVERSE ENGINEERING
PROJECT REPORT
Submitted by
SUTHARSAN M (212IT512)
DHARANIENDRAN P (212IT503)
THARUN J (212IT513)
BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY
MARCH 2024
BONAFIDE CERTIFICATE
Ms. MOHANAPRIYA K
Assistant Professor Level 1,
Department Of Artificial Intelligence and Machine Learning
ACKNOWLEDGEMENT
We would like to thank our friends, faculty and non-teaching staff who have
directly and indirectly contributed to the success of this project.
SUTHARSAN M (212IT512)
DHARANIENDRAN P (212IT503)
THARUN J (212IT513)
ABSTRACT
Keywords:
Malware analysis, Reverse engineering, Machine learning, Malware
detection, Dimensionality reduction, Malware classification, Cybersecurity
TABLE OF CONTENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
1. INTRODUCTION
1.2.6 DECOMPILERS
2. LITERATURE REVIEW
3.1 OBJECTIVES
3.2 METHODOLOGY
3.2.3.1 CALCULATION OF ENTROPY
FILE
OPTIMIZATION
4.1 CLUSTERING
4.4.3 A VERSION OF THE MAIN IMAGE
4.4.10 KMEANS
5.1 RESULTS
5.2.1 VISUALIZATION
5.2.6 INTERPRETABILITY
5.2.7 KMEANS
6. CONCLUSION AND SUGGESTION FOR FUTURE WORK
6.1 CONCLUSION
REFERENCES
WORK CONTRIBUTION
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER - 1
INTRODUCTION
forensics. Malware family classification: Malware analysis tools help group
similar malware samples into families and help researchers understand the origins
and evolution of different strains.
These tools examine the code and structure of malware without actually
running it. They identify patterns, signatures and indicators of compromise
(IOCs) to help classify malware and develop detection rules.
1.2.2 Dynamic Analysis Tools
1.2.6 Decompilers
Presently, antivirus vendors and Security Operations Centers primarily
classify malware based on its behavior within the target computer or operating
system. This involves analyzing the malware's signature and comparing it to
known signatures of other malware samples. As attackers employ increasingly
sophisticated evasion and propagation techniques, there is a pressing need for
advancements in detection techniques. This would enable security operations
centers to swiftly and accurately evaluate the potential risks associated with a
particular file and take the necessary steps for mitigation and remediation.
A helpful way to conceptualize the components of malware is to think of
the malicious code as a projectile, much like a bullet or missile. Just as a missile
requires a method to penetrate its target, malware relies on various means, such
as email attachments or remote code injections, to infiltrate a system.
CHAPTER - 2
LITERATURE SURVEY
T. Petsas, G. Voyatzis, E. Athanasopoulos, M. Polychronakis, and S.
Ioannidis, “Rage against the Virtual Machine: Hindering Dynamic Analysis of
Android Malware”. This research provides a comprehensive understanding of
malware analysis, its goals, and the different types of analysis techniques (static,
dynamic, and hybrid).
Dynamic analysis entails executing malware within a controlled
environment, typically on a virtual machine. During this process, the behavior of
the malware while running is closely observed, with a focus on monitoring system
events and network activity. This approach to malware analysis offers a
significant advantage by reducing the uncertainty that static analysis may have,
as the malware operates in an environment that closely resembles actual system
conditions.
A. Moser et al., in “Limits of Static Analysis for Malware Detection” (2007),
show that static analysis comes with limitations. It is unable to fully dissect
the behavior of a binary that employs techniques like self-modifying code or
relies on dynamic data such as the current date and time. Achieving precise
results through static analysis can be computationally demanding, posing
challenges, especially for systems requiring real-time threat detection. It's
worth noting that not all static analysis systems face this issue; for instance,
the PE Miner tool developed by Shafiq et al. in 2009 demonstrated
near-real-time detection capabilities.
Static analysis has also been shown to be prone to performance degradation
when used with obfuscated binaries. Malware authors are constantly refining their
tactics to evade detection and analysis, which presents ongoing challenges for
malware analysts. Some of the key challenges include
CHAPTER - 3
OBJECTIVES AND METHODOLOGY
3.1 OBJECTIVES
3.2 METHODOLOGY
3.2.1 Selection of Languages and Libraries:
Our Python IDE was Spyder, which comes with Anaconda. The GUI was made using
the built-in tools of the QtCreator IDE. QtCreator is a cross-platform C++ IDE
that includes QtDesigner, a WYSIWYG form design tool. A form designed in
QtDesigner can be converted into a PyQt GUI class with the command-line tool
pyuic5. This GUI design and build solution reduced the time spent developing
intricate layouts, time that was better used developing other aspects of the
solution. Furthermore, it permits layouts that are considerably more
complicated than those made feasible by Python's built-in GUI modules, such as
Tkinter.
value between startOffset and startOffset+length. Then, while looping through
the byteOccurrences array, any byte values that did not occur are skipped. For
every byte value that did occur, we determine its frequency and then set the
entropy value to be equal to: prior entropy value - frequency * base 256 log
of frequency * 8. Multiplying the base 256 logarithm by 8 gives the base 2
logarithm, so each step subtracts frequency * log2(frequency). An entropy value
between 0 and 8 is the outcome. A greater value denotes more random data within
the bounds of the file established by the caller.
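The calculation described above can be sketched as follows. This is a minimal illustration rather than the project's actual method: the function name `entropy` and the use of an in-memory `bytes` object instead of a file handle are assumptions, while `start_offset`, `length`, and `byte_occurrences` mirror the names used in the text.

```python
import math
from typing import Optional


def entropy(data: bytes, start_offset: int = 0, length: Optional[int] = None) -> float:
    """Shannon entropy (0 to 8 bits per byte) of data[start_offset:start_offset+length]."""
    if length is None:
        length = len(data) - start_offset
    window = data[start_offset:start_offset + length]
    if not window:
        return 0.0

    # Count how often each of the 256 possible byte values occurs.
    byte_occurrences = [0] * 256
    for value in window:
        byte_occurrences[value] += 1

    result = 0.0
    for count in byte_occurrences:
        if count == 0:
            continue  # byte values that never occur contribute nothing
        frequency = count / len(window)
        # A base 256 logarithm multiplied by 8 equals a base 2 logarithm.
        result -= frequency * math.log(frequency, 256) * 8
    return result


print(entropy(b"\x00" * 64))       # constant data: lowest possible entropy
print(entropy(bytes(range(256))))  # every byte value once: highest possible entropy
```

A run of identical bytes gives an entropy of 0, and data in which all 256 byte values are equally frequent gives 8.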
Because the tool frequently needs to read values from a file, it was crucial
to build these methods to be as reusable as feasible. The readByte method is
the smallest component of the Portable Executable 32 class, and each time the
class is used to retrieve a value, one or more calls to the readByte method are
made. This is demonstrated by the readBytes method, which loops over the given
number of bytes until it has finished reading them, appending each one to a
local variable that is returned to the caller.
The read bigEndian bytes and read littleEndian bytes methods extend the
readBytes method with a new level of abstraction; the only difference in the
output is that the read bigEndian bytes method reverses the order of the bytes
in the returned byte string using Python's slicing syntax. The Windows Portable
Executable 32-bit file specification does not guarantee that a value can be
found at a fixed offset from the beginning of the file, only that it can be
found relative to a previously calculated offset, typically that of a COFF
header, an optional header, or a Portable Executable header. For this reason,
these two methods accept a parent offset as well as an offset that is relative
to the parent offset.
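The layering of these read methods can be sketched as below. The class name `ByteReader`, the snake_case method names, and the use of an in-memory stream are assumptions made for illustration; the report's own class is PE32 and its exact signatures are not shown in the text.

```python
import io


class ByteReader:
    """Hypothetical stand-in for the low-level read methods described above."""

    def __init__(self, stream):
        self.stream = stream  # any seekable binary file object

    def read_byte(self, offset: int) -> bytes:
        # Smallest unit: seek to an absolute offset and read a single byte.
        self.stream.seek(offset)
        return self.stream.read(1)

    def read_bytes(self, offset: int, count: int) -> bytes:
        # Built on read_byte: append one byte at a time to a local variable.
        result = b""
        for i in range(count):
            result += self.read_byte(offset + i)
        return result

    def read_little_endian(self, parent_offset: int, relative_offset: int, count: int) -> bytes:
        # Bytes are returned in the order they are stored in the file.
        return self.read_bytes(parent_offset + relative_offset, count)

    def read_big_endian(self, parent_offset: int, relative_offset: int, count: int) -> bytes:
        # Same bytes, with the order reversed by slicing.
        return self.read_little_endian(parent_offset, relative_offset, count)[::-1]


reader = ByteReader(io.BytesIO(b"MZ\x90\x00\x03\x00\x00\x00"))
print(reader.read_bytes(0, 2))
print(reader.read_little_endian(2, 2, 2), reader.read_big_endian(2, 2, 2))
```

Passing the parent offset separately keeps every field lookup expressed relative to a header whose position was calculated earlier, which is how the specification locates values.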
3.2.3.4 Get the Names of the Imports from the 32-bit File
Some files had relative virtual addresses that were too large for the size of
the file, which resulted in null being returned and the system hanging. The
built-in 'strings' command found on Unix-based systems provided the solution to
this issue. This command returns all of the readable strings in the specified
binary. Windows has a comparable command-line tool called strings.exe. The
strings command was called from the system using the subprocess module, and
its output was split on newlines into a list of strings.
Filtering the import names out of these strings was another issue, although it
was very simple to do using Python's re regular expression module. A
straightforward find-all search was run over the resulting list of strings with
a regular expression that looks for one or more word characters, i.e., any
non-special characters, followed by .dll, .exe, .DLL, or .EXE. This method of
loading import names has proven to be effective and dependable, as discussed in
“Attributes of Malicious Files” by J. Yonts, 2012.
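A minimal sketch of this approach follows. The function names and the sample text are hypothetical, and `readable_strings` assumes that a `strings` executable is available on the PATH; only the regular expression follows the description above.

```python
import re
import subprocess

# One or more word characters followed by .dll, .exe, .DLL or .EXE.
IMPORT_PATTERN = re.compile(r"\w+\.(?:dll|exe|DLL|EXE)")


def readable_strings(path: str) -> str:
    """Run the system 'strings' tool on a file (assumes it is installed)."""
    completed = subprocess.run(
        ["strings", path], capture_output=True, text=True, check=True
    )
    return completed.stdout


def extract_import_names(strings_output: str) -> list:
    """Collect every import-like name from newline-separated strings output."""
    names = []
    for line in strings_output.split("\n"):
        names.extend(IMPORT_PATTERN.findall(line))
    return names


# Hypothetical output of the strings tool, used here instead of a real binary.
sample = "!This program cannot be run in DOS mode.\nKERNEL32.DLL\nuser32.dll\nGetProcAddress\nsetup.exe"
print(extract_import_names(sample))
```

Because the pattern lists only the all-lowercase and all-uppercase extensions, mixed-case spellings such as `.Dll` are not matched; `re.IGNORECASE` would be one way to widen it.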
The finished tool can suggest the ideal KMeans settings. The implementation of
this function is straightforward and forms a small part of the
PeMachineLearning class's populate_table function. The code keeps the best
silhouette score, and the decomposition algorithm used to preprocess the data
prior to executing KMeans clustering, in two variables that are then used to
update the label displayed in the upper right corner. This logic runs during
the iterative evaluation of the various algorithm permutations used to populate
the table on the Algorithm Evaluation tab.
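The bookkeeping described above might look like the sketch below. The function name, the tuple layout, and the example scores are all hypothetical; the sketch also tracks the number of clusters alongside the two variables mentioned in the text, which is an assumption.

```python
def best_kmeans_settings(evaluations):
    """Pick the row with the highest silhouette score.

    evaluations: iterable of (decomposition_name, n_clusters, silhouette_score).
    """
    best_score = float("-inf")
    best_decomposition = None
    best_clusters = None
    for decomposition, n_clusters, score in evaluations:
        # Keep only the best result seen so far while the table is being filled.
        if score > best_score:
            best_score = score
            best_decomposition = decomposition
            best_clusters = n_clusters
    return best_decomposition, best_clusters, best_score


# Made-up evaluation rows, for illustration only.
rows = [("PCA", 2, 0.41), ("PCA", 3, 0.57), ("NMF", 2, 0.62), ("NMF", 3, 0.48)]
print(best_kmeans_settings(rows))
```

Updating the running best inside the evaluation loop avoids a second pass over the table once it has been populated.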
3.2.3.7 Calculation of Clustering Accuracy
A unique label is determined for each file type in the dataset in order to
determine the accuracy of the clustering that was done on it. While iterating
over the file list, we assign to each file type the first unique label, that
is, a label that has not already been allocated to another file type, as a
baseline for comparison. Although there are more possibilities to consider, the
logic for determining whether the remaining labels match expectations is
essentially the same as the logic for determining detection accuracy. The
caller is then given a percentage that represents the computed accuracy.
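One possible reading of this procedure is sketched below. The function name and inputs are hypothetical, and the handling of a file type whose labels have all been claimed by other types (it simply scores as incorrect) is an assumption, since the text does not describe that case.

```python
def clustering_accuracy(file_types, cluster_labels):
    """Percentage of files whose cluster label matches their type's baseline label."""
    baseline = {}        # file type -> baseline cluster label
    used_labels = set()  # labels already claimed by some file type
    for file_type, label in zip(file_types, cluster_labels):
        # The first label not yet claimed by another type becomes the baseline.
        if file_type not in baseline and label not in used_labels:
            baseline[file_type] = label
            used_labels.add(label)

    if not file_types:
        return 0.0
    correct = sum(
        1
        for file_type, label in zip(file_types, cluster_labels)
        if file_type in baseline and baseline[file_type] == label
    )
    return 100.0 * correct / len(file_types)


# Two hypothetical families, with one file of family "b" placed in the wrong cluster.
print(clustering_accuracy(["a", "a", "b", "b"], [0, 0, 1, 0]))
```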
A boolean flag set by the user controls whether cluster centers (centroids)
are plotted. When it is set to true, white circles are drawn over the grid
points that serve as the cluster centers, and the matching label values are
then plotted on top of the white circles.
3.2.3.9 Color Charts
Compared to the relatively straightforward scatterplots covered in the
preceding part, color charts proved to be more difficult to plot. Finding the
upper and lower limits for both dimensions of the data set was the first stage
in graphing the data. To prevent points from being plotted at the graph's
absolute edge, these limits are set to the real minimum and maximum values
-1 and +1 respectively. A step size is then chosen, which determines the
granularity of the linear discriminant lines drawn later. Then, by increasing
sequential values by the step size, we produce lists of values for both
dimensions between the lower and upper boundaries. A grid of points is then
made from these two lists of values.
At this point, we treat each grid point as if it were a piece of data to
which the clustering technique should be applied. By doing so, we are able to
obtain labels for every point in the graph and determine the borders of each
cluster on the grid. When calling the imshow() method, we use this data to
specify a colormap that will be used to draw solid colors on the chart. As a
last step, the original two-dimensional data are plotted on top in full color.
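The grid construction and labelling can be sketched without any plotting library, as below. All names are hypothetical, and assigning each grid point to its nearest centroid stands in for calling the fitted clustering model's prediction step; the resulting grid of labels is what would be handed to imshow() together with a colormap.

```python
import math


def value_range(lower, upper, step):
    """Values from lower up to (but not including) upper, step apart."""
    count = math.ceil((upper - lower) / step)
    return [lower + i * step for i in range(count)]


def nearest_centroid(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda i: (point[0] - centroids[i][0]) ** 2 + (point[1] - centroids[i][1]) ** 2,
    )


def label_grid(points, centroids, step):
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Pad the limits by 1 so that no data point sits on the edge of the chart.
    x_values = value_range(min(xs) - 1, max(xs) + 1, step)
    y_values = value_range(min(ys) - 1, max(ys) + 1, step)
    # One row per y value and one column per x value; each cell is a cluster label.
    return [[nearest_centroid((x, y), centroids) for x in x_values] for y in y_values]


points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
centroids = [(0.0, 0.5), (3.0, 0.5)]
grid = label_grid(points, centroids, step=1.0)
for row in grid:
    print(row)
```

A smaller step size gives a finer grid and therefore smoother cluster borders, at the cost of labelling more points.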
The padding on the y-axis is bounded by yLower and yUpper. After that, we
label the filled area with the cluster number and create a gap between the
filled regions by setting the lower bound for the subsequent filled area to be
5 greater than the upper bound of the previous area.
A single wrapper class, PE32, has been used to implement the 32-bit file
parser component of this project. PE32 exposes the attributes of the 32-bit
file that it is instantiated with. The complete UML diagram of this class may
be seen in Appendix 2. Additionally, we developed a command-line program that,
when given the path to an executable, displays all the fields that the wrapper
class exposes in order to output these file properties. For example, run this
program by typing
python peparse.py file-path/file-name.exe on the command line.
features.csv. Because further steps of the process used by the finished tool
require that the data input into the algorithms be numeric, it is necessary to
save the file names separately. We only read from the file names CSV file when
file names are required, and we only read from features.csv when we need data
for our machine learning algorithms. Because the row order is consistent across
the two CSV files, the first row of data in one file corresponds to the first
row of data in the other, and so on.
CHAPTER - 4
PROPOSED WORK MODULES
The interface has two main tabs, and the first main tab contains three
sub-tabs. The user can browse to a directory using a button and toolbar located
outside the application's main area. The three tabs in the main area allow for
the creation of several alternative visualizations as well as the testing of
various combinations of decomposition and clustering procedures.
4.1 CLUSTERING
On the first tab, the user can choose the decomposition and clustering
algorithms, as well as the cluster range when the KMeans algorithm is used. The
user can also change how the cluster centers (centroids) and their labels
appear on the graphs.
The four graphs are, starting from the top left, clockwise:
1. Decomposition graph
2. Clustering graph
3. Color map graph showing linear discriminants between clusters
When the rendering is complete, the silhouette score becomes visible below the
graphs.
Controls for setting the minimum and maximum number of clusters used by the
KMeans algorithm in the evaluation are found on the second tab. The optimal
KMeans parameters are displayed on this tab, which enables users to rapidly
identify the best parameters for any combination of decomposition and
clustering techniques. The evaluation is shown as a table with seven columns:
1. Decomposition algorithm
2. Number of features (dimensionality of the reduced data)
3. Clustering algorithm
4. Number of clusters
5. Silhouette score
6. Detection accuracy
7. Clustering accuracy
Figure 5: Design for elbow plot tab
The third tab offers the user elbow plots for the two supported decomposition
algorithms using the KMeans algorithm. These can be used as an alternative
technique for choosing the number of clusters for KMeans.
4.4 PROPOSED MACHINE LEARNING FEATURES
4.4.1 Entropy of Files and Entropy of Partitions
This value, which ranges from 0 to 8, indicates how random the file's data is.
Packed and encrypted executables typically have greater entropy values than
unpacked and unencrypted files, according to the findings given by Lyda and
Hamrock. Entropy can be used as an indicator of whether data is packed or
encrypted and, consequently, of whether a file is benign or malicious, because
these techniques are frequently used by malware but rarely by benign files.
This measurement is a helpful tool even though it cannot accurately
differentiate between benign and malicious files on its own. The identical
computation is performed for section entropy, but only over a subset of the
data.
4.4.4 Number of Sections
This is the number of section names in the section table. The number of
sections in a file is a good indicator of whether the file is benign or
malicious, as Yonts notes. In general, benign files have 0 to 10 sections, but
malicious files typically have 3 or 4 sections. We can see this pattern in our
dataset as well.
Import Name      Function
KERNEL32.DLL     Memory Management & I/O operations
USER32.DLL       Window Management
ADVAPI32.dll     Security & registry Management
msvcrt.dll       C library for the Visual C++ Compiler
GDI32.dll        Graphics Device Interface
SHELL32.dll      Opening webpages and files
ole32.dll        Object Linking and Embedding
WS2_32.dll       Networking
SHLWAPI.dll      Registry Entry, URL and Colour Management
COMCTL32.dll     UI Components
4.4.6 Dimension Reduction Techniques
The goal of dimensionality reduction techniques is to identify the set of
features that most accurately captures all the data gathered for a particular
object. In other words, data that is deemed to be unimportant is filtered away
as part of the dimensionality reduction process, which transforms the
high-dimensional data set into a low-dimensional space. Since nothing can be
plotted in more than three dimensions, this has the advantage of making it
simpler to view whole data sets. Machine learning methods can also process
low-dimensional data more efficiently. The table above lists the functions of
the common DLL imports.
algorithms come in a variety of forms. The clustering techniques KMeans and
MeanShift are both evaluated in this research.
4.4.10 KMeans
By minimizing the sum of squared distances within each group, the K-Means
algorithm separates the data into K groups. The number of clusters that the
KMeans algorithm should produce is a crucial parameter. Unlike other clustering
algorithms such as MeanShift, the KMeans algorithm does not choose the number
of clusters based on the data. K-Means starts by randomly initializing K
centroids in the feature space. These centroids represent the initial cluster
centres. In the assignment step, each data point is assigned to the nearest
centroid based on a distance metric, typically Euclidean distance. Each point
becomes a member of the cluster associated with the nearest centroid. After all
data points are assigned to clusters, the centroids are recalculated. The new
centroids are determined as the mean of all data points within each cluster.
The assignment and recalculation steps are repeated until the assignments stop
changing.
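The steps above can be sketched in plain Python as follows. This is an illustrative implementation, not the one used by the tool; initializing the centroids by sampling K of the data points, the fixed seed, and the iteration cap are assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100, seed=0):
    """Return (centroids, assignments) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its cluster's points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```

The value of `k` has to be supplied by the caller, which is why the tool evaluates a range of cluster counts rather than a single one.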
the function value. Since there are more centroids and shorter distances
between each point and its nearest centroid, we observe that the function value
drops as K grows.
CHAPTER - 5
RESULTS AND DISCUSSIONS
5.1 RESULTS
• This feature simply involves counting the number of sections within the
PE file.
• An unusually high or low number of sections can be an indicator of a
suspicious or malicious PE file. Legitimate software typically has a
predictable number of sections, so deviations may be noteworthy.
• This feature quantifies the balance between code and data within the PE
file.
• Calculate the size of the code section and the size of the data sections, then
compute the ratio (code size / data size).
• Malware often has a different code-to-data ratio than legitimate software.
Anomalies in this ratio can be indicative of malicious behavior, such as
packing or data hiding.
• This feature extracts the major image version from the PE file header.
• The major image version can provide information about the tools and
compilers used to create the binary. Unusual or outdated versions might
suggest suspicious activity, but this feature is less commonly used for
analysis compared to the others.
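The code-to-data ratio described in the bullets above could be computed along the lines of the sketch below. The tuple layout, the section names and sizes, and the handling of a file with no data sections are assumptions made for illustration.

```python
def code_to_data_ratio(sections):
    """sections: list of (name, size_in_bytes, is_code) tuples."""
    code_size = sum(size for _, size, is_code in sections if is_code)
    data_size = sum(size for _, size, is_code in sections if not is_code)
    if data_size == 0:
        # Avoid dividing by zero for files without any data sections.
        return float("inf") if code_size else 0.0
    return code_size / data_size


# Hypothetical section table of a small executable.
sections = [(".text", 4096, True), (".data", 1024, False), (".rsrc", 1024, False)]
print(code_to_data_ratio(sections))
```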
5.1.5 Common DLL Imports
• This feature involves analysing the Import Address Table (IAT) of the PE
file to identify the dynamic-link libraries (DLLs) and functions that the
binary imports.
• Malware often uses specific DLLs and functions for its malicious
operations. Detecting uncommon or suspicious imports can be a strong
indicator of malicious behaviour.
5.2 DIMENSIONALITY REDUCTION TECHNIQUES
5.2.1 Visualization
5.2.5 Overfitting Prevention
5.2.6 Interpretability
Simpler models with fewer features are often more interpretable and
explainable, making it easier to understand and communicate the results.
5.2.7 KMeans
CHAPTER - 6
CONCLUSIONS & SUGGESTIONS FOR FUTURE WORK
6.1 CONCLUSION
As part of this project, a tool was developed that can run multiple
variations of decomposition and clustering algorithms and present the results
to the user. The tool also allows users to view 2D data in various formats. In
this project, a system was developed that uses heuristic information obtained
from Windows 32-bit portable executables to effectively classify malware.
Given the performance of the finished system, the pipeline has proven to be
very effective in classifying files as benign or malicious using portable
executable heuristics. When the NMF decomposition algorithm was used with 3
features as input to the KMeans clustering algorithm with 11 clusters, the
system achieved 100% detection accuracy. This is a significant achievement and
demonstrates the possibility of using heuristic information from Windows
32-bit portable executables to effectively classify malware.
However, the system is not successful in assigning malicious files to
families. The clustering accuracy is only 46.76%, which is not high enough for
this use. This figure may be improved with further research and testing, but
there is currently not enough evidence to support using the system for this
purpose. Overall, the proposed method demonstrates the ability to identify
malware using heuristic information obtained from Windows 32-bit portable
executables. Although there is room for improvement in the classification of
malicious files into families, the system has demonstrated its potential and
merits further research and development.
This will include evaluating which combinations of features provide the best
results and evaluating how the combinations are used.
However, this produces a very large number of results and may not be feasible
within the time allocated to the project. For example, assuming the system
currently uses 18 features, if all possible combinations of 2 features are
evaluated, the system will produce 302 results for each combination, clustering
algorithm, and number of clusters. This will result in over 5,000 result rows.
Scaling this from combinations of 1 feature up to 18 features can result in
hundreds of thousands of results. Another area that can be explored is
improving the accuracy with which malicious files are assigned to families.
This will require further research and testing to determine which combination
of features and algorithms provides the best results. Additionally, new
techniques or methods can be developed to increase this accuracy.
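The growth in the number of feature combinations can be illustrated with a short calculation. It only counts the possible feature subsets for 18 features; the result counts quoted above also depend on how many algorithms and cluster counts are evaluated, which this sketch does not model.

```python
import math

N_FEATURES = 18

# Number of ways to choose exactly 2 of the 18 features.
pairs = math.comb(N_FEATURES, 2)

# Number of non-empty feature subsets of every size from 1 to 18.
all_subsets = sum(math.comb(N_FEATURES, size) for size in range(1, N_FEATURES + 1))

print(pairs)        # 153
print(all_subsets)  # 262143, i.e. 2**18 - 1
```

Even before multiplying by the number of algorithm settings, the subset count alone reaches the hundreds of thousands, which is why an exhaustive evaluation is costly.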
The environment and IDE selected for the project were suitable for the task
at hand. They reduced the risk of errors and increased efficiency when creating
the solution. Although some tasks, such as extracting data from PE files,
proved difficult, the overall experience with all tasks was good.
In summary, there are several areas where further research and development
can be done to improve the accuracy and performance of the proposed system.
These include using more features for in-depth analysis, evaluating which
combinations of features provide the best results, and improving the accuracy
with which malicious files are classified. Through continuous research and
development, the system has the potential to become an effective tool for
malware detection and classification.
REFERENCES
[1] Muhammad Shoaib Akhtar, Tao Feng, "Malware Analysis and Detection
Using Machine Learning Algorithms", 2022, Symmetry 2022, 14(11),
2304;
https://doi.org/10.3390/sym14112304
[2] Lei Fang, Hongbin Wu, Kexiang Qian, Wenhui Wang, Longxi Han, "A
Comprehensive Analysis of DDoS attacks based on DNS" -
iopscience, 2021
https://iopscience.iop.org/article/10.1088/1742-6596/2024/1/012027/meta
[3] Arkajit Datta, Kakelli Anil Kumar, Aju. D, "An Emerging Malware
Analysis Techniques and Tools: A Comparative Analysis", 2021
https://www.ijert.org/research/an-emerging-malware-analysis-techniques-
and-tools-a-comparative-analysis-IJERTV10IS040071.pdf
[4] Sanfeng Zhang, Jiahao Wu, Mengzhe Zhang, and Wang Yang, "Dynamic
Malware Analysis Based on API Sequence Semantic Fusion", 2023, Appl.
Sci. 2023, 13(11), 6526;
https://doi.org/10.3390/app13116526
https://www.semanticscholar.org/paper/Semantics-aware-malware-
detectionChristodorescuJha/265b6313093a3c5ea4a5c75096592739f2999f
05
[6] Mohamed Lebbie, S Raja Prabhu, Animesh Kumar Agrawal,
“Comparative Analysis of Dynamic Malware Analysis Tools”, 2022.
https://www.researchgate.net/publication/357506552_Comparative_Anal
ysis_of_Dynamic_Malware_Analysis_Tools
https://www3.cs.stonybrook.edu/~mikepo/papers/ratvm.eurosec14.pdf
https://www.semanticscholar.org/paper/Techniques-for-Malware-
Analysis-Gadhiya-
Bhavsar/27bdcb57c86e8ab9521b6cde2d1a3dc49284bc88
https://web.cs.ucdavis.edu/~zubair/files/raid09-zubair.pdf
[11] M. Christodorescu and S. Jha, “Static analysis of executables to detect
malicious patterns,” SSYM’03 Proc. 12th Conf. USENIX Secur. Symp.,
vol. 12, pp. 12–12, 2003.
https://pages.cs.wisc.edu/~jha/jha-papers/security/usenix_2003.pdf
[12] A. Moser, C. Kruegel, and E. Kirda, “Limits of Static Analysis for Malware
Detection,” ACSAC, pp. 421–430, 2007.
https://ieeexplore.ieee.org/document/4413008
http://download.microsoft.com/download/9/c/5/9c5b2167-8017-4bae-
9fde-d599bac8184a/pecoff_v83.docx.
[14] R. Lyda and J. Hamrock, “Using Entropy Analysis to Find Encrypted and
Packed Malware,” IEEE Secur. Priv., vol. 5, pp. 40–45, 2007
https://ieeexplore.ieee.org/abstract/document/4140989
https://link.springer.com/chapter/10.1007/978-981-16-5747-4_33
https://www.cs.ubc.ca/~murphyk/MLbook/pml-toc-1may12.pdf
[19] S. Wold, K. H. Esbensen, and P. Geladi, “Principal Component Analysis,”
Chemom. Intell. Lab. Syst., vol. 7439, no. August, pp. 37–52, 1987.
https://www.sciencedirect.com/science/article/abs/pii/0169743987800849
https://papers.nips.cc/paper_files/paper/2000/hash/f9d1152547c0bde0183
0b7e8bd60024c-Abstract.html
https://home.deib.polimi.it/matteucc/Clustering/tutorial_html/..
https://www.jstor.org/stable/2346830
https://ieeexplore.ieee.org/document/400568
[24] Y. Cheng, “Mean Shift, Mode Seeking, and Clustering,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 17, no. 8, 1995.
https://www.continuum.io/anaconda-overview.
[26] “Matplotlib: Python plotting — Matplotlib 2.0.0 documentation.”
https://matplotlib.org/.
https://www.qt.io/ide/.
[29] Forbes, D. A., "Reverse engineering the Milky Way," Monthly Notices of the
Royal Astronomical Society, 493(1), 847-854, 2020.
https://academic.oup.com/mnras/article/493/1/847/5717311
https://dl.acm.org/doi/10.1007/978-3-030-28954-6_7
APPENDICES
PE FILE STRUCTURE:
Entropy
Randomness for benign file is from 0 to 8
Code to data ratio
Benign files often have high data and less executable code
Major Image version
Benign files often have version number
Number of sections
Benign files often have 0 – 10 sections
Common dll imports
It can be used to find the motive of software regardless of malicious or benign
files
WORK CONTRIBUTION
SUTHARSAN M (212IT512)
1. Research about the Machine learning Algorithms, feature Analysis
2. Downloading necessary modules and configuring the setup for further
development
3. Code for Malware analysis tool
4. Testing and Validation
DHARANIENDRAN P (212IT503)
THARUN J (212IT513)