CLASSIFICATION OF MALWARE
USING REVERSE ENGINEERING
PROJECT REPORT
Submitted by
SUTHARSAN M (212IT512)
DHARANIENDRAN P (212IT503)
THARUN J (212IT513)
In partial fulfilment for the award of the degree of
BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY
BANNARI AMMAN INSTITUTE OF TECHNOLOGY

(An Autonomous Institution Affiliated to Anna University, Chennai)
SATHYAMANGALAM-638401
ANNA UNIVERSITY: CHENNAI 600 025
MARCH 2024
BONAFIDE CERTIFICATE
Certified that this project report “CLASSIFICATION OF MALWARE USING REVERSE ENGINEERING” is the Bonafide work of “SUTHARSAN M (212IT512), DHARANIENDRAN P (212IT503) and THARUN J (212IT513)” who carried out the project work under my supervision.
Dr. ARUN SHALIN L V
HEAD OF THE DEPARTMENT
Department of Information Technology
Bannari Amman Institute of Technology

Ms. MOHANAPRIYA K
ASSISTANT PROFESSOR LEVEL 1
Department of Artificial Intelligence and Machine Learning
Bannari Amman Institute of Technology
Submitted for Project Viva Voce examination held on ………………
External Examiner I External Examiner II

DECLARATION
We affirm that the project work titled “CLASSIFICATION OF MALWARE USING REVERSE ENGINEERING” being submitted in partial fulfilment for the award of the degree of Bachelor of Technology in Information Technology is the record of original work done by us under the guidance of Ms. MOHANAPRIYA K, Assistant Professor Level 1, Department of Artificial Intelligence and Machine Learning. It has not formed a part of any other project work(s) submitted for the award of any degree or diploma, either in this or any other University.
SUTHARSAN M (212IT512)
DHARANIENDRAN P (212IT503)
THARUN J (212IT513)

I certify that the declaration made above by the candidates is true.

Ms. MOHANAPRIYA K
Assistant Professor Level 1,
Department of Artificial Intelligence and Machine Learning
ACKNOWLEDGEMENT
We would like to express our heartfelt thanks to our esteemed Chairman Dr. S.V. Balasubramaniam, and the respected Director Dr. M.P. Vijayakumar, for providing excellent facilities and support during the course of study in this institute.
We wish to express our sincere thanks to Dr. Palanisamy C, Principal, for his constructive ideas, inspiration, encouragement and much needed technical support extended to complete our project work.
We are grateful to Dr. Arun Shalin L V, Head of the Department, Department of Information Technology, for his valuable suggestions to carry out the project work successfully.
We wish to express our sincere thanks to Ms. Mohanapriya K, Assistant Professor Level 1, Department of Artificial Intelligence and Machine Learning, for her constructive ideas, inspiration, encouragement, excellent guidance and much needed technical support extended to complete our project work.
We would like to thank our friends, faculty and non-teaching staff who have directly and indirectly contributed to the success of this project.
THARUN J (212IT513)
ABSTRACT
The proliferation of malware in the digital landscape poses significant challenges to cybersecurity practitioners. In response to this evolving threat landscape, this project explores a holistic approach to malware analysis that leverages both reverse engineering and machine learning techniques. Reverse engineering allows for the in-depth examination of malicious code, enabling a deeper understanding of its functionality and evasion techniques. Concurrently, machine learning models are employed to automate the classification and detection of malware based on extracted features and behavioral patterns. The project investigates various aspects of malware analysis, including the reverse engineering of malware binaries to uncover their inner workings, the identification of malicious code patterns, and the extraction of relevant features. Machine learning algorithms are then trained on these features to distinguish efficiently between benign and malicious software. The report also explores the significance of continuously updating machine learning models to adapt to evolving malware tactics. The experimental results presented in this study demonstrate the effectiveness of the proposed methodology in detecting previously unknown malware samples and improving overall threat intelligence. By fusing reverse engineering expertise with machine learning capabilities, this work offers a promising avenue for enhancing the resilience of cybersecurity systems against ever-evolving malware threats.
Keywords:
Malware analysis, Reverse engineering, Machine learning, Malware detection, Dimensionality reduction, Malware classification, Cybersecurity
TABLE OF CONTENTS
CHAPTER NO. TITLE
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
1. INTRODUCTION
1.1 IMPORTANCE OF MALWARE ANALYSIS TOOLS
1.2 TYPES OF MALWARE ANALYSIS TOOLS
1.2.1 STATIC ANALYSIS TOOLS
1.2.2 DYNAMIC ANALYSIS TOOLS
1.2.3 BEHAVIORAL ANALYSIS TOOLS
1.2.4 CODE DISASSEMBLERS AND DEBUGGERS
1.2.5 MEMORY ANALYSIS TOOLS
1.2.6 DECOMPILERS
2. LITERATURE REVIEW
3. OBJECTIVES AND METHODOLOGY
3.1 OBJECTIVES
3.2 METHODOLOGY
3.2.1 SELECTION OF LANGUAGES AND LIBRARIES
3.2.2 DEVELOPMENT ENVIRONMENT
3.2.3 IMPLEMENTATION OF IMPORTANT ALGORITHMS
3.2.3.1 CALCULATION OF ENTROPY
3.2.3.2 BYTE TO DECIMAL DATA CONVERSION
3.2.3.3 READING VALUES FROM PE FILES
3.2.3.4 GET THE NAMES OF THE IMPORTS FROM THE PE FILE
3.2.3.5 AUTOMATION OF KMEANS PARAMETER OPTIMIZATION
3.2.3.6 CALCULATION OF DETECTION ACCURACY
3.2.3.7 CALCULATION OF CLUSTERING ACCURACY
3.2.3.8 DECOMPOSITION AND CLUSTERING SURFACES
3.2.3.9 COLOR CHARTS
3.2.3.10 SILHOUETTE GRAPHS
3.2.3.11 ELBOW PLOT
3.2.4 IMPLEMENTATION OF INDIVIDUAL COMPONENTS
3.2.4.1 PE FILE PARSER
3.2.4.2 FEATURE EXTRACTION
3.2.4.3 USER INTERFACE
3.2.4.4 CLUSTERING AND CATEGORIZATION
4. PROPOSED WORK MODULES
4.1 CLUSTERING
4.2 ALGORITHM EVALUATION
4.3 ELBOW PLOT
4.4 PROPOSED MACHINE LEARNING FEATURES
4.4.1 ENTROPY OF FILES AND ENTROPY OF PARTITIONS
4.4.2 RATIO OF CODE TO INITIALIZED DATA
4.4.3 A VERSION OF THE MAIN IMAGE
4.4.4 NUMBER OF SECTIONS
4.4.5 COMMON DLL IMPORTS
4.4.6 DIMENSION REDUCTION TECHNIQUES
4.4.7 PRINCIPAL COMPONENT ANALYSIS (PCA)
4.4.8 NON-NEGATIVE MATRIX FACTORIZATION (NMF)
4.4.9 CLUSTERING ALGORITHMS
4.4.10 KMEANS
4.4.11 SILHOUETTE METRICS
4.4.12 ELBOW METHOD
4.4.13 AVERAGE DISPLACEMENT
5. RESULT AND DISCUSSION
5.1 RESULTS
5.1.1 ENTROPY OF SECTIONS
5.1.2 NUMBER OF SECTIONS
5.1.3 CODE TO DATA RATIO
5.1.4 MAJOR IMAGE VERSIONS
5.1.5 COMMON DLL IMPORTS
5.2 DIMENSIONALITY REDUCTION TECHNIQUES
5.2.1 VISUALIZATION
5.2.2 FEATURE ENGINEERING
5.2.3 NOISE REDUCTION
5.2.4 COMPUTATIONAL EFFICIENCY
5.2.5 OVERFITTING PREVENTION
5.2.6 INTERPRETABILITY
5.2.7 KMEANS
6. CONCLUSION AND SUGGESTION FOR FUTURE WORK
6.1 CONCLUSION
6.2 SUGGESTIONS FOR FUTURE WORK
REFERENCES
WORK CONTRIBUTION
LIST OF TABLES
TABLE NO. TITLE OF THE TABLES
1. DLL imports and their functions
2. Input features of samples (features.csv)
LIST OF FIGURES
FIGURE NO. TITLE OF THE FIGURES
1. Information flow in the system
2. Flow diagram of the project workflow
3. Design for the clustering tab
4. Design for the algorithm evaluation tab
5. Design for elbow plot tab
LIST OF ABBREVIATIONS
PCA - Principal Component Analysis
NMF - Non-Negative Matrix Factorization
IDE - Integrated Development Environment
GUI - Graphical User Interface
PE - Portable Executable (Windows executable file format)
IAT - Import Address Table
CHAPTER-1
INTRODUCTION
Malware analysis tools play a key role in cybersecurity by providing the necessary means to dissect and understand malicious software, commonly known as malware. Malware is a broad term that includes various types of malicious software such as viruses, worms, trojans, ransomware, spyware, and more. These digital threats pose significant risks to individuals, organizations and even nations as they can lead to data breaches, financial losses and disruption of critical services.
Malware analysis is the process of examining and understanding the inner workings of malware to uncover its intent, capabilities, and potential impact. Malware analysts use specialized tools to reverse engineer malware, extract valuable information, and design effective countermeasures to protect systems from these threats.
1.1 IMPORTANCE OF MALWARE ANALYSIS TOOLS
Threat detection and identification: Malware analysis tools help identify and categorize different types of malware, allowing cybersecurity professionals to quickly spot new and emerging threats.
Behavioral analysis: These tools help analyze the behavior of malware in controlled environments and determine how it interacts with the operating system and other software components.
Signature generation: Malware analysts create unique signatures based on specific malware characteristics, which antivirus software then uses to detect and block malware on infected systems.
Forensic and incident response: By analyzing malware, analysts can reconstruct the attack chain and gather evidence for incident response and digital forensics.
Malware family classification: Malware analysis tools help group similar malware samples into families and help researchers understand the origins and evolution of different strains.
1.2 TYPES OF MALWARE ANALYSIS TOOLS
1.2.1 Static Analysis Tools
These tools examine the code and structure of malware without actually running it. They identify patterns, signatures and indicators of compromise (IOCs) to help classify malware and develop detection rules.
1.2.2 Dynamic Analysis Tools
These tools run malware in a controlled environment, such as a sandbox, to monitor its behavior, network interactions, and system modifications. This approach helps uncover hidden malicious activity.
1.2.3 Behavioral Analysis Tools
These tools focus on real-time monitoring of malware to analyze its actions, resource usage, and potential damage. They provide insight into how the malware interacts with the target system.
1.2.4 Code Disassemblers and Debuggers
These tools are essential for reverse engineering malware to understand its internal logic, functions and encryption techniques.
1.2.5 Memory Analysis Tools
These tools help extract valuable information from system memory, uncovering hidden processes, embedded code, and other advanced malware persistence techniques.
1.2.6 Decompilers
Decompilers convert compiled malware binaries back into an approximation of their source code, facilitating deeper analysis and understanding of malware functionality.
Malicious software, or malware, poses a significant threat to both businesses and individuals, as it can engage in activities like intercepting network data, encrypting or deleting files, and, in extreme cases, causing severe hardware failures by infecting or replacing existing firmware.
The symptoms of a malware infection can vary widely based on the specific type or family of malware involved. These varying symptoms correspond to different levels of risk for infected systems. Categorizing malware helps us assess the potential threat posed by a particular file. However, it's important to note that while there are numerous antivirus scanners available, their results may not always align when it comes to classifying malware.
Malware takes on diverse forms and employs a variety of techniques to deliver its payloads to vulnerable devices. Common propagation methods include viruses and worms, which have the capability to self-replicate within local systems or across networks. The payloads delivered by malware can range from relatively benign adware to highly destructive ransomware and rootkits. These payloads determine the relative risk associated with a specific malware sample, with adware posing minimal risk compared to a highly dangerous rootkit.
The number of networked devices has surged since 2012 and was projected to reach 50.1 billion by 2020. This growth presents a formidable challenge for security operations centers, as it provides attackers with an increasing number of potential targets and opportunities to exploit the interconnected nature of these devices for malicious purposes. Regardless of the computing platform, there is no foolproof method for detecting or categorizing malicious files.
Presently, antivirus vendors and Security Operations Centers primarily classify malware based on its behavior within the target computer or operating system. This involves analyzing the malware's signature and comparing it to known signatures of other malware samples. As attackers employ increasingly sophisticated evasion and propagation techniques, there is a pressing need for advancements in detection techniques. This would enable security operations centers to swiftly and accurately evaluate the potential risks associated with a particular file and take the necessary steps for mitigation and remediation.
A helpful way to conceptualize the components of malware is to think of the malicious code as a projectile, much like a bullet or missile. Just as a missile requires a method to penetrate its target, malware relies on various means, such as email attachments or remote code injections, to infiltrate a system.
CHAPTER-2
LITERATURE SURVEY
J. Yonts, “Attributes of Malicious Files,” 2012. This work describes various methods for observing the attributes of malware. The majority of antivirus software checks a file's syntactic signatures to see whether it contains any dangerous code. These signatures may include combinations or patterns of instructions previously known to be harmful. Antivirus software can also compute a file's hash and compare it to a database of hash values from samples of known malware. The drawback of the signature-based strategy employed by the majority of antivirus products is that minor modifications to the malware's source code may produce a signature that has not been seen before and, as a result, is not listed in the most recent database of signature definitions. Similarly, signature-based techniques offer scant defense against zero-day attacks. Malware that is metamorphic or polymorphic can have numerous signatures, which further complicates signature-based detection.
M. Christodorescu, S. Jha, S. A. Seshia, D. Song, and R. E. Bryant, “Semantics-aware malware detection,” 2005. To counter the evasion of syntactic signature-based detection described above, some research has been done on semantics-based detection systems. These systems use abstractions and templates to try to determine what the executable does rather than how it accomplishes it.
D. Peraković, M. Periša, and I. Cvitić, “Analysis of the IoT impact on volume of DDoS attacks,” 2015. Malware has developed into a recurring menace in the digital sphere, severely harming people, businesses, and governments all over the world. As the sophistication and frequency of malware attacks continue to rise, powerful malware analysis tools have become essential for successfully detecting, analyzing, and mitigating these threats.
T. Petsas, G. Voyatzis, E. Athanasopoulos, M. Polychronakis, and S. Ioannidis, “Rage against the Virtual Machine: Hindering Dynamic Analysis of Android Malware”. This work provides a comprehensive understanding of malware analysis, its goals, and the different types of analysis techniques (static, dynamic, and hybrid).
Dynamic analysis entails executing malware within a controlled environment, typically on a virtual machine. During this process, the behavior of the malware while running is closely observed, with a focus on monitoring system events and network activity. This approach offers a significant advantage by reducing the uncertainty inherent in static analysis, as the malware operates in an environment that closely resembles actual system conditions.
M. Egele, T. Scholte, E. Kirda, and C. Kruegel, “A survey on automated dynamic malware analysis techniques and tools,” 2011. Dynamic analysis is not without its drawbacks, however. There is a nonzero risk of the malware spreading to other machines on the same network during the analysis. Additionally, researchers have encountered instances where malware exhibits different behavior when running within a virtualized environment, making it challenging to achieve accurate categorization.
M. Christodorescu and S. Jha, “Static analysis of executables to detect malicious patterns,” 2003. Static analysis is a method employed to assess the characteristics of malware without the need to load its executable file into memory. This approach revolves around scrutinizing the malware file for heuristic indicators that provide insights into the behaviors it might exhibit if executed. One clear advantage of static analysis is that it eliminates the risk of the malware spreading to other networked machines and safeguards the host operating system from harm, since the malware remains dormant.
A. Moser, “Limits of static analysis for malware detection,” 2007. As this work shows, static analysis does come with limitations. It is unable to fully dissect the behavior of a binary that employs techniques like self-modifying code or relies on dynamic data such as the current date and time. Achieving precise results through static analysis can be computationally demanding, which poses challenges especially for systems requiring real-time threat detection. Not all static analysis systems face this issue, however; for instance, the PE Miner tool developed by Shafiq et al. in 2009 demonstrated near-real-time detection capabilities.
Static analysis has also been shown to suffer degraded performance when used with obfuscated binaries. Malware authors are constantly refining their tactics to evade detection and analysis, which presents ongoing challenges for malware analysts. Some of the key challenges include:
Polymorphism and Metamorphism: Polymorphic and metamorphic malware changes its code and appearance with each infection, making it difficult to create static signatures for detection. It achieves this by using encryption, obfuscation, and code reshuffling techniques that make traditional signature-based detection less effective.
Rootkits and kernel-level malware: Rootkits and kernel-level malware operate at the lowest levels of the operating system, hiding their presence and making themselves difficult to detect and analyze.
CHAPTER-3
OBJECTIVES AND METHODOLOGY
3.1 OBJECTIVES
The goal of developing a malware analysis tool is to create a solution that enables cybersecurity professionals to effectively identify, classify, and understand malicious software, thereby enhancing overall security preparedness and enabling rapid responses to emerging threats. This comprehensive tool will include dynamic analysis, which examines malware in controlled environments to reveal its behavioral characteristics; static analysis, which examines code without execution; behavioral analysis, which tracks and records malicious actions; signature-based detection; and heuristic and machine learning techniques for the detection of previously unknown malware.
It will provide clear visualization and reporting of analysis results, integrate with threat intelligence feeds, scale while maintaining performance, offer an intuitive user interface, and prioritize continuous updates and research to keep pace with the evolving threat landscape. Achieving this goal will make the malware analysis tool a cornerstone of an organization's cybersecurity strategy, facilitating rapid threat detection and response.
3.2 METHODOLOGY
Figure 1: Information flow in the system
3.2.1 Selection of Languages and Libraries
We chose to implement this project in Python 3.11 because of Python's suitability for rapid prototyping. Thanks to its dynamic typing, Python often requires significantly less code than other object-oriented languages such as Java and C++.
We chose scikit-learn for the project's machine learning component. This free machine learning toolkit for Python implements common clustering methods such as KMeans and MeanShift. Three implementations of dimensionality reduction/decomposition are also available in it: Principal Component Analysis (PCA), Non-Negative Matrix Factorization (NMF), and Fast Independent Component Analysis (FastICA). Both PCA and NMF are used in our solution. To cluster the reduced data, we employed the KMeans and MeanShift algorithms.
We decided to use Matplotlib for data visualization because it enables the creation of graphs with comparatively little code. Because the features we extracted from the data were initially too high-dimensional to visualize, we used PCA and NMF to project the data into a lower-dimensional subspace that retains the majority of the variance from the original dataset. The output of this projection is used as input for the KMeans algorithm, and we then plot the clustered data using Matplotlib.
3.2.2 Development environment
Because the examined Windows PE files were dangerous, we used Ubuntu running in VirtualBox as the development environment.
Our Python IDE was Spyder, which comes with Anaconda. Our GUI was made using the tools built into the Qt Creator IDE. Qt Creator is a cross-platform C++ IDE that includes Qt Designer, a WYSIWYG form design tool. With the help of the command-line tool pyuic5, a PyQt GUI class can be generated from a form designed in Qt Designer. This GUI design and build workflow reduced the time spent developing intricate layouts, time that was better used developing other aspects of the solution. Furthermore, it permits layouts that are considerably more complicated than those made feasible by Python's built-in GUI modules, such as Tkinter.
3.2.3 Implementation of Important Algorithms
Figure 2: Flow diagram of the project workflow
3.2.3.1 Calculation of Entropy
The entropy computation follows R. Lyda and J. Hamrock, 2007. The calculateEntropy method can determine the entropy of individual file subsections as well as the overall entropy value of a file. An array of 256 integers, each initialized to zero, is created first. This array stores the number of occurrences of each byte value between startOffset and startOffset+length. The index into the array is the decimal value of the byte being counted, which allows for efficient iteration over the array afterwards. Then, while looping through the byteOccurrences array, any byte values that did not occur are disregarded. For every byte value that did occur, we determine its frequency (its count divided by the number of bytes examined) and then update the entropy value as: entropy = prior entropy - frequency * log256(frequency) * 8. The outcome is an entropy value between 0 and 8. A greater value denotes more random data within the bounds specified by the caller.
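The entropy update described above can be sketched in Python as follows. This is a minimal illustration rather than the project's actual calculateEntropy method: the function name calculate_entropy, its signature, and the choice to pass the file contents as a bytes object are assumptions made for the example.

```python
import math
from typing import Optional


def calculate_entropy(data: bytes, start_offset: int = 0,
                      length: Optional[int] = None) -> float:
    """Entropy (0 to 8) of data[start_offset:start_offset + length]."""
    if length is None:
        length = len(data) - start_offset
    window = data[start_offset:start_offset + length]
    if not window:
        return 0.0

    # One counter per possible byte value; the index is the byte's decimal value.
    byte_occurrences = [0] * 256
    for byte in window:
        byte_occurrences[byte] += 1

    entropy = 0.0
    for count in byte_occurrences:
        if count == 0:
            continue  # byte values that never occur contribute nothing
        frequency = count / len(window)
        # log base 256 scaled by 8 is the same as log base 2 (bits per byte).
        entropy -= frequency * math.log(frequency, 256) * 8
    return entropy
```

A buffer consisting of one repeated byte yields an entropy of 0, while a buffer in which all 256 byte values occur equally often yields a value of 8 (up to floating-point rounding).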
3.2.3.2 Byte to Decimal Data Conversion
Since the data required for decomposition and clustering had to be in numerical format, it was crucial to be able to translate the hexadecimal byte strings extracted from the PE file into their equivalent decimal representations. A small helper was written using Python's built-in capability to convert between numbers in multiple base representations, so that this conversion procedure would not have to be repeated throughout the code.
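A conversion helper of this kind might look like the sketch below. The name byte_string_to_decimal is hypothetical; the example assumes the byte strings are plain hexadecimal text such as "5a4d" and relies only on Python's built-in int() with an explicit base.

```python
def byte_string_to_decimal(hex_string: str) -> int:
    """Convert a hexadecimal byte string such as '5a4d' to its decimal value."""
    return int(hex_string, 16)


# Example: the two bytes 'M' and 'Z' rendered as hexadecimal text, then converted.
mz_hex = b"MZ".hex()                      # '4d5a'
mz_decimal = byte_string_to_decimal(mz_hex)
```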
3.2.3.3 Reading Values from PE Files
Because the tool frequently needs to read values from a file, it was crucial to build these methods in a way that would make them as reusable as feasible. The readByte method is the PE32 class's smallest building block, and each time the class is used to retrieve a value, one or more calls to the readByte method are made. This is demonstrated by the readBytes method, which loops over the given number of bytes until it has finished reading them, appending each one to a local variable that is then returned to the caller.
The methods for reading big-endian and little-endian bytes extend the readBytes method with a further level of abstraction; the only difference in their output is that the big-endian method reverses the order of the returned byte string using Python's slicing. The Windows Portable Executable (PE) file specification does not guarantee that a value can be found at a fixed offset from the beginning of the file, only that it can be found relative to a previously calculated offset, typically that of the COFF header, the optional header, or the PE header. For this reason, these two methods accept a parent offset as well as an offset that is relative to the parent offset.
These byte-reading methods are adaptable: the same methods can be used to read many different fields from a PE file, and the only variation between the calls is the input parameters, which are taken from the PE file specification.
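The layering described above can be illustrated with the following sketch. The class name ByteReader and its method names are hypothetical, not the project's PE32 API, and the sketch works on bytes objects rather than hexadecimal text so that reversing with a slice reorders whole bytes. The offsets used in the example (the PE header pointer at 0x3C, and the Machine and NumberOfSections fields 4 and 6 bytes after the start of the PE signature) come from the PE file format; the file contents themselves are a hand-built stand-in.

```python
class ByteReader:
    """Hypothetical reader over the raw contents of a file (illustration only)."""

    def __init__(self, data: bytes):
        self.data = data

    def read_byte(self, offset: int) -> bytes:
        # Smallest building block: a single byte at an absolute offset.
        return self.data[offset:offset + 1]

    def read_bytes(self, offset: int, count: int) -> bytes:
        # Built on read_byte: append one byte at a time until count are read.
        result = b""
        for i in range(count):
            result += self.read_byte(offset + i)
        return result

    def read_file_order(self, parent_offset: int, relative_offset: int, count: int) -> str:
        # Bytes exactly as stored in the file, as hexadecimal text.
        return self.read_bytes(parent_offset + relative_offset, count).hex()

    def read_reversed(self, parent_offset: int, relative_offset: int, count: int) -> str:
        # Same bytes with their order reversed by slicing, as hexadecimal text.
        return self.read_bytes(parent_offset + relative_offset, count)[::-1].hex()


# Hand-built stand-in for the start of a PE file.
raw = bytearray(0x50)
raw[0:2] = b"MZ"
raw[0x3C:0x40] = (0x40).to_bytes(4, "little")    # pointer to the PE header
raw[0x40:0x44] = b"PE\x00\x00"                   # PE signature
raw[0x44:0x46] = (0x014C).to_bytes(2, "little")  # COFF Machine field
raw[0x46:0x48] = (3).to_bytes(2, "little")       # COFF NumberOfSections field

reader = ByteReader(bytes(raw))
pe_offset = int(reader.read_reversed(0, 0x3C, 4), 16)
machine = reader.read_reversed(pe_offset, 4, 2)
number_of_sections = int(reader.read_reversed(pe_offset, 6, 2), 16)
```

Only the parent offset, relative offset, and byte count change between the three calls, which mirrors the point made in the text about reusing the same methods for different fields.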
3.2.3.4 Get the Names of the Imports from the PE File
Reading import names from PE files as strings turned out to be more challenging than anticipated. The previous implementation used the import table to determine the relative virtual address of a given import and then read from that address until a null termination byte was detected.
Some files had relative virtual addresses that pointed beyond the end of the file, which resulted in null being returned and the system hanging. The solution to this issue was the built-in 'strings' command found on Unix-based systems, which returns all of the readable strings in the specified binary. Windows has a comparable command-line tool called strings.exe. The strings command was called from the system using the subprocess module, and the results were returned as a list of strings separated by newlines.
Picking the import names out of these strings was another issue, although it was simple to do using Python's re regular expression module. A find-all search is run over the resulting list of strings with a regular expression that looks for one or more word characters, i.e., any non-special characters, followed by .dll, .exe, .DLL, or .EXE. This method of loading import names has proven to be effective and dependable, and it is the approach referred to in “Attributes of Malicious Files” by J. Yonts, 2012.
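A possible shape for this approach is sketched below. The function names run_strings and extract_import_names are hypothetical, and the regular expression is one plausible reading of the description (word characters followed by one of the four listed extensions); the project's actual pattern is not shown in the source. The run_strings helper assumes a Unix-like system with the strings command on the PATH.

```python
import re
import subprocess

# One or more word characters followed by .dll, .exe, .DLL or .EXE.
IMPORT_NAME_PATTERN = re.compile(r"\w+\.(?:dll|exe|DLL|EXE)")


def run_strings(path: str) -> list:
    """Return the readable strings of a binary, one list entry per output line."""
    completed = subprocess.run(
        ["strings", path], capture_output=True, text=True,
        errors="replace", check=True,
    )
    return completed.stdout.split("\n")


def extract_import_names(strings_output: list) -> list:
    """Keep only the substrings that look like DLL or EXE file names."""
    names = []
    for line in strings_output:
        names.extend(IMPORT_NAME_PATTERN.findall(line))
    return names


# Example on a hand-written list instead of real 'strings' output.
sample_output = [
    "!This program cannot be run in DOS mode.",
    "KERNEL32.dll",
    "foo user32.DLL bar",
    "notepad.exe",
]
sample_imports = extract_import_names(sample_output)
```

Because the pattern only checks the file-name shape, any readable string that happens to end in one of these extensions is returned, whether or not it is a true import.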
3.2.3.5 Automation of KMeans Parameter Optimization
The finished tool can suggest the ideal KMeans settings. The implementation of this feature is straightforward: it is a small portion of the populate_table function of the PeMachineLearning class. The code simply keeps two variables, the best silhouette score and the decomposition algorithm used to preprocess the data prior to executing KMeans clustering, which are then used to update the label displayed in the upper right corner. This logic runs as part of the iterative evaluation of the various algorithm permutations that populates the table on the Algorithm Evaluation tab.
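The idea of tracking the best silhouette score while iterating over the permutations can be sketched with scikit-learn as follows. This is an assumption-laden illustration, not the populate_table code: the function name suggest_kmeans_settings, the candidate cluster counts, the NMF parameters, and the synthetic data are all chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF, PCA
from sklearn.metrics import silhouette_score


def suggest_kmeans_settings(features, cluster_counts=(2, 3, 4, 5)):
    """Return the decomposition, cluster count and score with the best silhouette."""
    # Candidate decompositions; NMF requires non-negative input features.
    decompositions = {
        "PCA": PCA(n_components=2),
        "NMF": NMF(n_components=2, init="nndsvda", max_iter=1000, random_state=0),
    }
    best = {"decomposition": None, "n_clusters": None, "score": -1.0}
    for name, decomposition in decompositions.items():
        reduced = decomposition.fit_transform(features)
        for n_clusters in cluster_counts:
            model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
            labels = model.fit_predict(reduced)
            score = silhouette_score(reduced, labels)
            # Keep only the best score seen so far and how it was obtained.
            if score > best["score"]:
                best = {"decomposition": name, "n_clusters": n_clusters, "score": score}
    return best


# Synthetic, non-negative data with two well-separated groups (illustration only).
rng = np.random.default_rng(0)
group_a = np.abs(rng.normal(loc=[1.0, 1.0, 1.0, 1.0], scale=0.05, size=(30, 4)))
group_b = np.abs(rng.normal(loc=[5.0, 5.0, 1.0, 1.0], scale=0.05, size=(30, 4)))
best_settings = suggest_kmeans_settings(np.vstack([group_a, group_b]))
```

In a GUI, the returned dictionary would be enough to fill a "suggested settings" label; here it is simply returned to the caller.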
3.2.3.6 Calculation of Detection Accuracy
To check whether the files have been labeled correctly, the calculateAccuracyDetection method requires a list of labels and a list of filenames as inputs. Choosing which label to compare against is the first step in calculating accuracy. To do so, the method checks whether the first three letters of the first file's name mark it as benign and sets the variables used for the comparison accordingly. Once it has determined which label belongs to the benign cluster, it examines all the filenames and counts the number of labels that match the label expected from the first three letters of the filename, as well as the number that do not. The number of matches is then returned as a percentage of the total number of files.
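The comparison logic might be sketched as below. The function name mirrors the method named in the text, but the body is an assumption: in particular, the benign filename prefix "ben" and the restriction to two clusters (benign and malicious) are illustrative choices not stated in the source.

```python
def calculate_accuracy_detection(labels, filenames, benign_prefix="ben"):
    """Percentage of files whose cluster label agrees with their filename prefix."""
    if not filenames or len(labels) != len(filenames):
        raise ValueError("labels and filenames must be non-empty and of equal length")

    # Use the first file to decide which cluster label stands for "benign".
    first_label = labels[0]
    first_is_benign = filenames[0][:3].lower() == benign_prefix

    matches = 0
    for label, filename in zip(labels, filenames):
        expected_benign = filename[:3].lower() == benign_prefix
        same_cluster_as_first = label == first_label
        # A file counts as predicted benign if it shares a cluster with a benign
        # first file, or sits in a different cluster from a malicious first file.
        predicted_benign = same_cluster_as_first == first_is_benign
        if predicted_benign == expected_benign:
            matches += 1
    return 100.0 * matches / len(filenames)
```

For example, with labels [1, 1, 0, 1] and filenames beginning ben, ben, mal, mal, three of the four files agree with their prefix and the function returns 75.0.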
3.2.3.7 Calculation of Clustering Accuracy
To determine the accuracy of the clustering performed on the dataset, a unique label is first identified for each file type in the dataset. While iterating over the file list, we assign to each file type the first unique label encountered, that is, a label that has not already been allocated to another file type, as a baseline for comparison. The logic for determining whether the labels fit expectations is essentially the same as the logic for determining detection accuracy, except that there are more possibilities to consider. The caller is subsequently provided with a percentage that represents the computed accuracy.
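One way to read this description is sketched below. It is an interpretation under stated assumptions, not the project's code: the function name calculate_accuracy_clustering is hypothetical, and the file type is assumed to be encoded in the first three letters of the filename, as in the detection accuracy check.

```python
def calculate_accuracy_clustering(labels, filenames, prefix_length=3):
    """Percentage of files whose label equals the baseline label of their file type."""
    if not filenames or len(labels) != len(filenames):
        raise ValueError("labels and filenames must be non-empty and of equal length")

    # Baseline: the first label seen for a file type that no other type has claimed.
    expected = {}
    for label, filename in zip(labels, filenames):
        file_type = filename[:prefix_length].lower()
        if file_type not in expected and label not in expected.values():
            expected[file_type] = label

    matches = 0
    for label, filename in zip(labels, filenames):
        file_type = filename[:prefix_length].lower()
        # A type that never received a baseline label cannot produce a match.
        if file_type in expected and expected[file_type] == label:
            matches += 1
    return 100.0 * matches / len(filenames)
```

Note that the result depends on the order of the file list, because the first file of each type fixes the baseline label for that type.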
3.2.3.8 Decomposition and Clustering Surfaces
The first phase in implementing the PCA decomposition plot is the constructor call with n_components = 2, which declares how many dimensions we want to reduce the data to. The dataset is then subjected to dimension reduction using the PCA.fit() and PCA.transform() methods. A straightforward scatterplot is then created from this data, with the x and y values representing the two-dimensional data obtained from the PCA.transform() function. The same process is used to construct the graph for NMF. The clustering algorithms take as input the output of the decomposition techniques rather than the initial dataset (see “Principal Component Analysis” by S. Wold, K. H. Esbensen).
The clusters in a graph are given different colors based on their label values: a list of colors is built from the labels and is then supplied when plotting the points on the scatterplot.
A boolean flag set by the user controls whether the cluster centers, or centroids, are plotted. When this flag is set to true, white circles are drawn over the grid points that serve as the cluster centers. The output is then obtained by plotting the matching label values on top of the white circles.
3.2.3.9 Color Charts
Compared to the relatively straightforward scatterplots covered in the preceding section, color charts proved to be more difficult to plot. The first stage in graphing the data was finding the upper and lower limits of the data set in both dimensions. To prevent points from being plotted at the very edge of the graph, these limits are extended by 1 in each direction. A step size is then chosen, which sets the granularity, and hence the accuracy, of the cluster boundary lines drawn later. Using this step size, we produce a list of values for each dimension, running from the lower limit to the upper limit in increments of the step size. A grid of points is then made from these two lists of values.
At this point, we treat each grid point as if it were a piece of data to which the clustering technique should be applied. By doing so, we obtain a label for every point in the graph and can determine the borders of each cluster on the grid. When calling the imshow() method, we use these labels together with a colormap to draw solid colors on the chart. As a last step, the original two-dimensional data points are plotted on the chart.
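The grid construction and labeling steps can be sketched in plain Python as below. All names are hypothetical. The sketch labels each grid point by its nearest centroid, which is an assumed stand-in for calling the fitted clustering model on the grid, and it leaves out the drawing step itself.

```python
import math


def value_range(lower, upper, step):
    """Values from lower up to (but not including) upper, in increments of step."""
    count = math.ceil((upper - lower) / step)
    return [lower + i * step for i in range(count)]


def build_grid(points, step=0.5, margin=1.0):
    """Rows of (x, y) grid points covering the data extended by a margin."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_values = value_range(min(xs) - margin, max(xs) + margin, step)
    y_values = value_range(min(ys) - margin, max(ys) + margin, step)
    return [[(x, y) for x in x_values] for y in y_values]


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point (ties go to the lower index)."""
    distances = [(point[0] - cx) ** 2 + (point[1] - cy) ** 2 for cx, cy in centroids]
    return distances.index(min(distances))


def label_grid(grid, centroids):
    """Treat every grid point as data and assign it a cluster label."""
    return [[nearest_centroid(point, centroids) for point in row] for row in grid]


# Two data points that also act as the cluster centers (illustration only).
data_points = [(0.0, 0.0), (4.0, 0.0)]
grid = build_grid(data_points, step=1.0, margin=1.0)
grid_labels = label_grid(grid, centroids=data_points)
```

The resulting two-dimensional list of labels has the shape that an image-drawing call expects, with one label per grid cell, and the boundary between the clusters appears where the label changes along a row.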
3.2.3.10 Silhouette Graphs
scikit-learn's silhouette metric methods were used to construct
silhouette graphs. We start by computing a silhouette score for each piece of
data in the collection. The value yLower, which leaves a gap at the bottom of
the chart, is then set. The following step is to iterate through the number of
clusters the user has provided, starting at 0. The silhouette scores of the data
points in the current cluster are sorted. The cluster's size is then determined,
and an upper limit, yUpper, is established on the y-axis in order to depict it.
For each data point, we fill in the space between 0 on the x-axis and the
silhouette score with a color that depends on the current cluster number.
The filled band on the y-axis is bounded by yLower and yUpper. After
that, we label the filled area with the cluster number and create a space between
the filled regions by setting the lower bound for the subsequent filled area to be
5 bigger than the upper bound of the previous area.
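The band layout described above can be sketched in a few lines. This is an illustrative sketch only: the function name is hypothetical, and starting yLower at the same value as the gap is an assumption, since the report does not state the initial value.

```python
def silhouette_bands(scores, labels, n_clusters, gap=5):
    """Return (cluster, y_lower, y_upper, sorted_scores) for each cluster.

    Each cluster occupies one horizontal band whose height equals the
    cluster size; consecutive bands are separated by `gap` units.
    """
    bands = []
    y_lower = gap  # assumed starting offset at the bottom of the chart
    for cluster in range(n_clusters):
        cluster_scores = sorted(
            s for s, label in zip(scores, labels) if label == cluster)
        y_upper = y_lower + len(cluster_scores)
        bands.append((cluster, y_lower, y_upper, cluster_scores))
        y_lower = y_upper + gap  # leave a space before the next band
    return bands


bands = silhouette_bands(
    scores=[0.9, 0.1, 0.5, 0.7, 0.3], labels=[0, 1, 0, 1, 0], n_clusters=2)
```

Each band could then be drawn with matplotlib's fill_betweenx(), filling from 0 to the sorted scores between y_lower and y_upper.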
3.2.3.11 Elbow Fence

Elbow plots are created by computing the score of the KMeans method for each
number of clusters between the smallest and largest values. This calculation is
done by the score method of the scikit-learn KMeans class, which returns the
negative of the within-cluster sum of squared distances. The scores are
appended to a list before being plotted, with the number of clusters on the
x-axis and the score value on the y-axis.
3.2.4 Implementation of Individual Components
3.2.4.1 32-bit File Parser
A single wrapper class, PE32, has been used to implement the 32-bit file
parser component of this project. PE32 exposes the attributes of the 32-bit file
that it is instantiated with. This class's complete UML diagram may be seen in
Appendix 2. Additionally, we have developed a command-line program that, when
given the path to the executable, displays all the fields that the wrapper class
exposes. For example, run this program by typing
python peparse.py file-path/file-name.exe on the command line.
3.2.4.2 Feature Extraction
Features are extracted by creating an instance of the PE32 class for each valid
file in the specified path. The program was created only for .exe and .dll
files, so it only attempts to extract features from those. After a PE32
instance has been created, its methods are used to retrieve the values from the
file. These values are stored in a csv file called features.csv, whereas the
associated filenames are recorded in a separate csv file called files.csv.
The file names must be saved separately because later steps of the finished
tool require the data input into the algorithms to be numeric. We read from
files.csv only when filenames are required, and from features.csv only when we
need data for our machine learning algorithms. Because the row order is the
same in both csv files, the first row of one file corresponds to the first row
of the other, and so on.
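The row-aligned pair of files can be sketched as below. The file names features.csv and files.csv come from the text; the function names and the sample values are hypothetical and only illustrate the idea.

```python
import csv
import os
import tempfile


def write_feature_files(rows, directory):
    """Write numeric features and file names to two row-aligned csv files.

    rows is a list of (filename, [numeric feature values]) pairs.
    """
    features_path = os.path.join(directory, "features.csv")
    files_path = os.path.join(directory, "files.csv")
    with open(features_path, "w", newline="") as f_feat, \
            open(files_path, "w", newline="") as f_names:
        feature_writer = csv.writer(f_feat)
        name_writer = csv.writer(f_names)
        for name, values in rows:
            feature_writer.writerow(values)  # numeric data only
            name_writer.writerow([name])     # same row index as above
    return features_path, files_path


def read_paired(features_path, files_path):
    """Re-join names and features by row position."""
    with open(features_path, newline="") as f_feat, \
            open(files_path, newline="") as f_names:
        features = [[float(v) for v in row] for row in csv.reader(f_feat)]
        names = [row[0] for row in csv.reader(f_names)]
    return list(zip(names, features))


sample = [("a.exe", [7.2, 4.0]), ("b.dll", [3.1, 6.0])]
with tempfile.TemporaryDirectory() as folder:
    paths = write_feature_files(sample, folder)
    paired = read_paired(*paths)
```

Keeping the names out of features.csv means the feature file can be loaded directly as a numeric matrix.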
3.2.4.3 User Interface
Two classes, UIMainWindow and PortableExecutableMachineLearning, are used to
construct the user interface for the finished tool.
PortableExecutableMachineLearning uses UIMainWindow as a superclass,
inheriting all of the superclass's attributes. The layout, naming, and the
majority of the display details of the UI components are handled by
UIMainWindow, which serves as a UI definition and cannot be used without
being inherited. The PortableExecutableMachineLearning class handles more
specific activities such as plotting graphs, and uses its superclass
connection to display the outcomes of these actions on the user interface.
Additionally, the PortableExecutableMachineLearning class makes use of the
PortableExecutableUtil class to carry out a few non-UI tasks such as
computing accuracy values and carrying out IO operations.
3.2.4.4 Clustering and Categorization
The PortableExecutableMachineLearning class handles clustering and
categorization. Its methods are designed to be flexible and reusable, and accept
parameters to adjust their behavior and outputs where necessary.
CHAPTER - 4
PROPOSED WORK MODULES
The main window of the tool is organised into two main areas and three
sub-tabs. The user can navigate the directory using a button and toolbar located
outside the application's main area. The three tabs in the main area allow for
the creation of numerous alternative visuals as well as the testing of various
combinations of decomposition and clustering procedures.
4.1 CLUSTERING
Figure 3: Design for the clustering tab
On the first tab, the user can choose the decomposition and clustering
algorithms, as well as the number of clusters for the KMeans algorithm alone. The
user can also toggle whether cluster centers (centroids) and their labels are
drawn on the graphs.
The four graphs are, starting from the top left, clockwise:
1. Decomposition graph
2. Clustering graph
3. Color map graph showing linear discriminants between clusters
4. Silhouette plot showing how well the data is clustered
When the rendering is complete, the silhouette score becomes visible below the
graphs.
4.2 ALGORITHM EVALUATION
Figure 4: Design for the Algorithm Evaluation Tab
Controls for setting the maximum and minimum number of clusters used
in the analysis by the KMeans method are found on the second tab. With these
controls, the optimal KMeans parameters are displayed on the page, which lets
users rapidly identify the best parameters for any combination of
decomposition and clustering techniques. The assessment is shown as a table with
seven columns:
1. Decomposition algorithm
2. Number of features/dimensionalities of reduced data
3. Clustering algorithm
4. Number of clusters
5. Score of the silhouette
6. Detection accuracy
7. Clustering accuracy
4.3 ELBOW PLOT
Figure 5: Design for elbow plot tab
The third tab offers the user elbow plots for the two supported decomposition
algorithms using the KMeans algorithm. This can be used as an alternative
technique for choosing the number of clusters for KMeans.
4.4 PROPOSED MACHINE LEARNING FEATURES
4.4.1 Entropy of Files and Entropy of Partitions
This value, which ranges from 0 to 8, indicates how random the file's
data is. Packed and encrypted executables typically have greater
entropy values than unpacked and unencrypted files, according to the findings
given by Lyda and Hamrock [14]. Entropy can be used as a criterion for whether a
file is packed or encrypted and, consequently, benign or malicious, because
these techniques are frequently used by malware but not by benign files. This
measurement cannot accurately differentiate between benign and malicious files
on its own, but it is nevertheless a helpful tool. The identical computation is
performed for section entropy, but only for a subset of the file's data.
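A minimal sketch of the byte-level Shannon entropy calculation follows. The function name is hypothetical; the same function could be applied to a whole file or to the bytes of a single section.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((count / total) * math.log2(count / total)
                for count in counts.values())


uniform = shannon_entropy(bytes(range(256)))  # every byte value once
constant = shannon_entropy(b"AAAA")           # a single repeated byte
two_values = shannon_entropy(b"AABB")         # two equally likely bytes
```

The upper limit of 8 arises because a byte has 256 possible values and log2(256) = 8.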
4.4.2 Ratio of Code to Initialized Data

This is the ratio of executable code to initialized data in the Windows 32-bit
file [30]. During our initial analysis of benign and malicious files, we
discovered that benign files frequently contained a large amount of initialized
data but just a small amount of executable code. In contrast, malicious files
frequently contained little to no initialized data, with most of the file being
executable code. To capture this observation more directly, we chose to treat
the ratio as a single machine learning feature rather than two distinct
features for code size and initialized data size.
4.4.3 Versions of the Main Image

This value is the major image version recorded in the file header. According to
Raman, benign files typically have a larger value than malicious files, which
typically have a value of zero. In our initial study, we noticed the same
pattern; this is one of the easiest ways to tell whether a file is benign or
malicious.
4.4.4 Number of Sections
This is the number of entries in the file's section table. The number of
sections in a file is an excellent indicator of whether the file is benign or
malicious, as Yonts notes [16]. In general, benign files have 0 to 10 sections,
but malicious files typically have 3 or 4 sections. When we look at the dataset
we can see this pattern as well.
4.4.5 Common DLL Imports

The DLLs that an executable imports provide a very thorough
description of the function and motive of the executable. For instance, you can
infer that a program is utilizing the network if it imports Wsock32.dll. Despite
the possibility of overlap in the imports of benign and malicious files, it should
be possible to categorize malware by family using this information because
families behave differently. We propose a true/false flag for each common
import, indicating whether the file imports it, so that each import is
represented by a numeric value. Table 1 lists common DLL imports and their
functions.
Import Name      Function
KERNEL32.DLL     Memory Management & I/O operations
USER32.DLL       Window Management
ADVAPI32.dll     Security & registry Management
msvcrt.dll       C library for the Visual C++ Compiler
GDI32.dll        Graphics Device Interface
SHELL32.dll      Opening webpages and files
ole32.dll        Object Linking and Embedding
WS2_32.dll       Networking
SHLWAPI.dll      Registry Entry, URL and Colour Management
COMCTL32.dll     UI Components
Table 1: dll imports and its functions
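The true/false import flags can be sketched as a fixed-order numeric vector. The list below simply reuses the DLL names from Table 1; the function name and the lower-casing of names are assumptions made for this illustration.

```python
# DLL names taken from Table 1, lower-cased for case-insensitive matching.
COMMON_DLLS = [
    "kernel32.dll", "user32.dll", "advapi32.dll", "msvcrt.dll", "gdi32.dll",
    "shell32.dll", "ole32.dll", "ws2_32.dll", "shlwapi.dll", "comctl32.dll",
]


def import_flags(imported_names):
    """Return one 0/1 flag per common DLL, in the order of COMMON_DLLS."""
    imported = {name.lower() for name in imported_names}
    return [1 if dll in imported else 0 for dll in COMMON_DLLS]


flags = import_flags(["KERNEL32.DLL", "WS2_32.dll", "custom.dll"])
```

Because the order of COMMON_DLLS is fixed, each position in the vector always refers to the same import, which keeps the feature columns consistent across files.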
4.4.6 Dimension Reduction Techniques
The goal of dimensionality reduction techniques is to identify the reduced
representation that most accurately captures all the data gathered for a
particular object. In other words, data that is deemed to be unimportant is
filtered away as part of the dimensionality reduction process, which transforms
the high-dimensional data set into a low-dimensional space. Since nothing can be
plotted in more than three dimensions, this has the advantage of making it
simpler to view whole data sets. Machine learning methods can also process
low-dimensional data more efficiently.
4.4.7 Principal Component Analysis (PCA)
Principal component analysis reduces dimensionality by decomposing a matrix
(the dataset) into two sub-matrices, one of which holds the principal
components, the structure that best represents the data in the original
matrix. Information that is left out of this structure is viewed as noise.
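The decomposition described above is commonly written in the following form. The symbols are chosen for this illustration and are not taken from the report: X is the mean-centred data matrix, T holds the scores, P holds the loadings (the principal components), and E is the residual treated as noise.

```latex
X = T P^{\top} + E
```

Keeping only the first few columns of T and P gives the reduced, low-dimensional representation of the data.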
4.4.8 Non-negative Matrix Factorization (NMF)
Similar to principal component analysis, non-negative matrix factorization
entails dividing a large matrix into two smaller matrices. This approach differs
from others in that it requires the data set (matrix) and its factors to be
non-negative. Because of this restriction, NMF generally yields only an
approximation of the original matrix.
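A common way to state this factorization is shown below. The symbols are chosen for this illustration: V is the non-negative data matrix, and W and H are the two smaller non-negative factors, found by minimising the reconstruction error.

```latex
V \approx W H, \qquad W \ge 0,\; H \ge 0,
\qquad
\min_{W \ge 0,\; H \ge 0} \; \lVert V - W H \rVert_F^{2}
```

The Frobenius-norm objective shown here is one of the cost functions discussed by Lee and Seung [20]; other cost functions are possible.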
4.4.9 Clustering Algorithms

Clustering techniques are used to find groups in data sets that may not be
immediately apparent to human readers. Clustering algorithms come in a variety
of forms and take different input parameters. The centroid-based clustering
techniques KMeans and MeanShift are both evaluated in this research.
4.4.10 KMeans
The K-Means algorithm separates the data into K groups by minimizing the
within-cluster sum of squares of each group. The number of clusters that the
KMeans algorithm should produce is a crucial parameter. Unlike other clustering
algorithms such as MeanShift, the KMeans algorithm does not choose the number of
clusters based on the data. K-Means starts by randomly initializing K centroids
in the feature space. These centroids represent the initial cluster centres.
Each data point is then assigned to the nearest centroid based on a distance
metric, typically Euclidean distance, and becomes a member of the cluster
associated with that centroid. After all data points are assigned to clusters,
the centroids are recalculated as the mean of all data points within each
cluster. The assignment and recalculation steps are repeated until the
assignments no longer change.
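The assignment and update steps can be sketched in plain Python as follows. This is an illustration rather than the scikit-learn implementation used by the tool: the function names are hypothetical, and the initial centroids are passed in by the caller (instead of being chosen at random) so that the result is reproducible.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c))
                 for c in centroids]
    return distances.index(min(distances))


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid update until labels stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid's cluster.
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return labels, centroids


labels, centroids = kmeans(
    points=[(0, 0), (0, 2), (10, 0), (10, 2)],
    initial_centroids=[(0, 0), (10, 0)])
```

With random initialization, different runs can converge to different clusterings, which is one reason to fix the starting centroids in a small example like this.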
4.4.11 Silhouette Metrics

The silhouette metric is one of many tools that may be used to assess the
quality of a clustering. It is effectively a score indicating how well each
data point fits into its cluster, expressed on a scale of -1 to 1. A score
close to -1 indicates that the point does not properly belong in its cluster,
which may mean that there are too many or too few clusters. If the score is
close to 1, the point was clustered accurately. A point with a score close to 0
lies between clusters and could fall under more than one of them.
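For reference, the standard per-point silhouette score is (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to the points of any other cluster. The sketch below is a plain-Python illustration with a hypothetical function name, not the scikit-learn routine used by the tool; it assumes at least two clusters are present.

```python
import math


def silhouette_scores(points, labels):
    """Per-point silhouette scores; requires at least two distinct labels."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself does not affect the sum).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of another cluster.
        b = min(sum(math.dist(point, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        largest = max(a, b)
        scores.append((b - a) / largest if largest > 0 else 0.0)
    return scores


well_separated = silhouette_scores(
    [(0, 0), (0, 1), (10, 0), (10, 1)], [0, 0, 1, 1])
```

Averaging the per-point scores gives a single silhouette score for the whole clustering.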
4.4.12 Elbow Method

Another method for assessing the quality of a clustering for a given set of
data is the elbow method. To apply it, the cost function is calculated for
values of K starting at K = 2 and increasing by 1 each time, for as long as the
cost keeps dropping noticeably. The sum of squared errors, which is the sum of
the squared Euclidean distances between each point and its nearest centroid, can
be used as the cost function. Since a larger K means more centroids and shorter
distances between each point and its nearest centroid, the cost drops as K
grows. The point where the decrease levels off, the "elbow", suggests a suitable
value of K.
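The cost function described above can be written as follows. The symbols are chosen for this illustration: C_k is the set of points assigned to cluster k and \mu_k is its centroid.

```latex
\mathrm{SSE}(K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^{2}
```

When every point is its own cluster the sum is zero, which is why the curve always decreases and the elbow, rather than the minimum, is used to choose K.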
4.4.13 Mean Shift
Mean shift is a centroid-based clustering algorithm. It works by moving each
piece of data towards the nearest peak of the probability density estimated
from the provided data. Each point is initially treated as a peak. A kernel
bandwidth value must be provided, and it indirectly determines the number of
clusters that will be found. The bandwidth is the radius of the window around
each point. If other points lie within this window, the mean of all of these
points is used as the new peak. This procedure is repeated until every point has
been moved to its nearest peak. Points that arrive at the same peak form a
cluster.
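The procedure above can be sketched with a flat (uniform) window, where the new peak is simply the mean of all points within the bandwidth. This is a simplified illustration with hypothetical names, not the scikit-learn MeanShift implementation; merging peaks that lie within half a bandwidth of each other is an assumption made to turn converged peaks into cluster labels.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift: returns (labels, cluster centres)."""
    peaks = []
    for start in points:
        current = tuple(start)
        for _ in range(max_iter):
            # All data points inside the window around the current position.
            window = [p for p in points if math.dist(p, current) <= bandwidth]
            if not window:
                break
            shifted = tuple(sum(axis) / len(window) for axis in zip(*window))
            if math.dist(shifted, current) < tol:
                break  # the position has stopped moving: a peak was reached
            current = shifted
        peaks.append(current)

    # Group points whose peaks coincide (within half a bandwidth, assumed).
    centres, labels = [], []
    for peak in peaks:
        for index, centre in enumerate(centres):
            if math.dist(peak, centre) <= bandwidth / 2:
                labels.append(index)
                break
        else:
            centres.append(peak)
            labels.append(len(centres) - 1)
    return labels, centres


labels_ms, centres_ms = mean_shift(
    [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)], bandwidth=3.0)
```

Note that the number of clusters is never passed in: it follows from the bandwidth, so a larger bandwidth would merge the two groups into one.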
CHAPTER - 5
RESULTS AND DISCUSSIONS
5.1 RESULTS
Table 2: Input Features of samples (Features.csv)
The results are generated by extracting features from a Windows32 PE
(Portable Executable) file, which can involve various techniques to gain
insights into the file's characteristics. Here are some of the features involved:
5.1.1 Entropy of Sections
• Entropy is a measure of randomness or disorder within a data stream. In
the context of a PE file, you calculate the entropy for each section of the
file.
• For each section in the PE file, you calculate the Shannon entropy. This
involves examining the bytes in the section and determining how evenly
distributed the byte values are. High entropy indicates that the data is more
random, while low entropy suggests patterns or structure.
• Sections with high entropy might contain compressed or encrypted data.
Malware often uses such techniques to obfuscate its code or payload.
5.1.2 Number of Sections
• This feature simply involves counting the number of sections within the
PE file.
• An unusually high or low number of sections can be an indicator of a
suspicious or malicious PE file. Legitimate software typically has a
predictable number of sections, so deviations may be noteworthy.
5.1.3 Code to Data Ratio
• This feature quantifies the balance between code and data within the PE
file.
• Calculate the size of the code section and the size of the data sections, then
compute the ratio (code size / data size).
• Malware often has a different code-to-data ratio than legitimate software.
Anomalies in this ratio can be indicative of malicious behavior, such as
packing or data hiding.
5.1.4 Major Image Versions
• This feature extracts the major image version from the PE file header.
• The major image version can provide information about the tools and
compilers used to create the binary. Unusual or outdated versions might
suggest suspicious activity, but this feature is less commonly used for
analysis compared to the others.
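The header-derived features in 5.1.2 to 5.1.4 can be sketched with Python's struct module. This is an illustration, not the report's PE32 class: the function name is hypothetical, and using the SizeOfCode and SizeOfInitializedData header fields for the ratio is an assumption, since the report does not say exactly how the code and data sizes are measured. The offsets follow the Microsoft PE/COFF specification [13].

```python
import struct


def read_header_features(data: bytes) -> dict:
    """Read a few header fields from the raw bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing DOS header")
    # e_lfanew at offset 0x3C points to the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")

    coff = pe_offset + 4  # the 20-byte COFF file header follows
    (number_of_sections,) = struct.unpack_from("<H", data, coff + 2)

    optional = coff + 20  # the optional header follows the COFF header
    size_of_code, size_of_init_data = struct.unpack_from(
        "<II", data, optional + 4)
    (major_image_version,) = struct.unpack_from("<H", data, optional + 44)

    # Guard against division by zero when there is no initialized data.
    ratio = size_of_code / size_of_init_data if size_of_init_data else None
    return {
        "number_of_sections": number_of_sections,
        "code_to_data_ratio": ratio,
        "major_image_version": major_image_version,
    }


# Build a tiny synthetic header (not a real executable) to exercise the code.
blob = bytearray(0x40 + 4 + 20 + 48)
blob[0:2] = b"MZ"
struct.pack_into("<I", blob, 0x3C, 0x40)
blob[0x40:0x44] = b"PE\0\0"
struct.pack_into("<H", blob, 0x44 + 2, 4)             # NumberOfSections
struct.pack_into("<II", blob, 0x44 + 20 + 4, 3000, 1000)
struct.pack_into("<H", blob, 0x44 + 20 + 44, 6)       # MajorImageVersion
features = read_header_features(bytes(blob))
```

Returning None for the ratio when there is no initialized data is a design choice for this sketch; a real feature extractor would need to map that case to a numeric value.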
5.1.5 Common DLL Imports
• This feature involves analysing the Import Address Table (IAT) of the PE
file to identify the dynamic-link libraries (DLLs) and functions that the
binary imports.
• Malware often uses specific DLLs and functions for its malicious
operations. Detecting uncommon or suspicious imports can be a strong
indicator of malicious behaviour.
5.2 DIMENSIONALITY REDUCTION TECHNIQUES
Dimensionality reduction techniques are methods used in data analysis and
machine learning to reduce the number of variables or features in a dataset while
preserving essential information. These techniques are used for various reasons:
5.2.1 Visualization
Reducing the dimensionality of data allows for easier visualization of
complex datasets. By projecting data into a lower-dimensional space (e.g., 2D or
3D), patterns and relationships between data points can become more apparent.
5.2.2 Feature Engineering
Many machine learning models perform better when provided with a
smaller set of relevant features. Dimensionality reduction helps identify and
retain the most informative features while discarding irrelevant or redundant
ones.
5.2.3 Noise Reduction
High-dimensional data often contains noise or uninformative variability.
Dimensionality reduction techniques can help filter out noise and focus on the
underlying structure in the data.
5.2.4 Computational Efficiency
Lower-dimensional data requires less computational resources for
processing and modelling. This is especially important for large datasets and
real-time applications.
5.2.5 Overfitting Prevention
High-dimensional data is prone to overfitting, where a model captures
noise instead of underlying patterns. Reducing dimensionality can mitigate
overfitting and improve model generalization.
5.2.6 Interpretability
Simpler models with fewer features are often more interpretable and
explainable, making it easier to understand and communicate the results.
5.2.7 KMeans
After the dimensionality reduction techniques have been applied, the data is
clustered using KMeans. K-Means is a commonly used clustering
algorithm that can be applied to data with a lower number of dimensions to
discover patterns and group similar data points. K-Means aims to partition a
dataset into K clusters, where K is a predefined number chosen by the user. The
goal is to group data points into clusters such that each point belongs to the
cluster with the nearest mean (centroid).
K-Means starts by randomly initializing K centroids in the feature space.
These centroids represent the initial cluster centers. Each data point is then
assigned to the nearest centroid based on a distance metric, typically
Euclidean distance.
CHAPTER - 6
CONCLUSIONS & SUGGESTIONS FOR FUTURE WORK
6.1 CONCLUSION
As part of this project, a tool was developed that can run multiple
combinations of decomposition and clustering algorithms and present the results
to the user. The tool also allows users to view 2D data in various formats. In
this project, a system was developed that uses heuristic information obtained
from Windows32 portable executables to classify malware.
Judging by the results, the pipeline has proven to be very effective in
classifying files as malicious or benign using portable executable heuristics.
When the NMF decomposition algorithm was used to produce 3 features as input to
the KMeans clustering algorithm with 11 clusters, the system achieved 100%
detection accuracy. This is a significant achievement and demonstrates the
possibility of using heuristic information from Windows32 portable executables
to effectively classify malware.
However, the system is not successful in assigning malware to families. The
clustering accuracy is only 46.76%, which is not high enough for this use. This
figure may be improved with further research and testing, but there is
currently not enough evidence to rely on the system for this purpose. Overall,
the proposed method demonstrates the ability to identify malware using heuristic
information obtained from WIN32 portable executables. Although there is room
for improvement in the classification of malware into families, the system has
demonstrated its potential and merits further research and development.
6.2 SUGGESTIONS FOR FUTURE WORK

Given more time, research and further development can be done in many
areas to improve the accuracy and efficiency of the classification process. One
area that could be explored is the use of additional features for in-depth
analysis. This would include evaluating which combinations of features provide
the best results and how those combinations should be used.
However, this produces a very large number of results and may not be possible
within the time allocated to the project. For example, assuming the system
currently uses 18 features, if all possible combinations of 2 features are
evaluated, the system will produce 302 results for each combination, clustering
algorithm, and number of clusters. This will result in over 5,000 result rows.
Scaling this up to combinations of 1 to 18 features can result in hundreds of
thousands of results. Another area that can be explored is improving the
accuracy with which malware is assigned to families. This will require further
research and testing to determine which combination of features and algorithms
provides the best results. Additionally, new techniques or methods can be
developed to increase this accuracy.
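The growth in the number of feature combinations can be checked with a short calculation. The figure of 18 features comes from the text above; the variable names are illustrative, and the sketch counts feature subsets only, before multiplying by the number of algorithms and cluster counts.

```python
import math

N_FEATURES = 18  # number of features assumed in the text above

# Number of distinct two-feature subsets.
pair_subsets = math.comb(N_FEATURES, 2)

# Number of non-empty feature subsets of every size from 1 to 18.
all_subsets = sum(math.comb(N_FEATURES, k) for k in range(1, N_FEATURES + 1))
```

Here pair_subsets is 153 and all_subsets is 262,143, which is consistent with the statement that evaluating every subset size would yield hundreds of thousands of results.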
The environment and IDE selected for the project work were suitable for the
task at hand. They reduced the risk of bad data usage and increased efficiency
when creating solutions. Although some tasks, such as extracting data from PE
files, proved difficult, the overall experience with all tasks was good.
In summary, there are several areas where further research and
development can be done to improve the accuracy and performance of the
proposed system. These include using more features for in-depth analysis,
evaluating which combinations of features provide the best results, and improving
the accuracy with which malware is assigned to families. Through continuous
research and development, the system has the potential to become an effective
tool for malware detection and classification.
REFERENCES
[1] Muhammad Shoaib Akhtar, Tao Feng, "Malware Analysis and Detection
Using Machine Learning Algorithms", Symmetry, 2022, 14(11), 2304.
https://doi.org/10.3390/sym14112304
[2] Lei Fang, Hongbin Wu, Kexiang Qian, Wenhui Wang, Longxi Han, "A
Comprehensive Analysis of DDoS attacks based on DNS", iopscience, 2021.
https://iopscience.iop.org/article/10.1088/1742-6596/2024/1/012027/meta
[3] Arkajit Datta, Kakelli Anil Kumar, Aju. D, "An Emerging Malware
Analysis Techniques and Tools: A Comparative Analysis", 2021.
https://www.ijert.org/research/an-emerging-malware-analysis-techniques-and-tools-a-comparative-analysis-IJERTV10IS040071.pdf
[4] Sanfeng Zhang, Jiahao Wu, Mengzhe Zhang, and Wang Yang, "Dynamic
Malware Analysis Based on API Sequence Semantic Fusion", Appl. Sci.,
2023, 13(11), 6526.
https://doi.org/10.3390/app13116526
[5] M. Christodorescu, S. Jha, S. A. Seshia, D. Song, and R. E. Bryant,
“Semantics-aware malware detection,” in 2005 IEEE SYMPOSIUM ON
SECURITY AND PRIVACY, PROCEEDINGS, 2005, pp. 32–46.
https://www.semanticscholar.org/paper/Semantics-aware-malware-detectionChristodorescuJha/265b6313093a3c5ea4a5c75096592739f2999f05
[6] Mohamed Lebbie, S Raja Prabhu, Animesh Kumar Agrawal,
“Comparative Analysis of Dynamic Malware Analysis Tools”, 2022.
https://www.researchgate.net/publication/357506552_Comparative_Analysis_of_Dynamic_Malware_Analysis_Tools
[7] B. Lau and V. Svajcer, “Measuring virtual machine detection in malware
using DSD tracer,” J ComputVirol, vol. 6, pp. 181–195, 2010.
https://link.springer.com/article/10.1007/s11416-008-0096-y
[8] T. Petsas, G. Voyatzis, E. Athanasopoulos, M. Polychronakis, and S.
Ioannidis, “Rage Against the Virtual Machine: Hindering Dynamic
Analysis of Android Malware,” in Proceedings of the Seventh European
Workshop on System Security, 2014, p. 5:1--5:6.
https://www3.cs.stonybrook.edu/~mikepo/papers/ratvm.eurosec14.pdf
[9] S. Gadhiya, K. Bhavsar, and P. D. Student, “Techniques for Malware
Analysis,” Int. J. Adv. Res. Comput. Sci. Softw. Eng., vol. 3, no. 4, pp.
2277–128, 2013.
https://www.semanticscholar.org/paper/Techniques-for-Malware-Analysis-Gadhiya-Bhavsar/27bdcb57c86e8ab9521b6cde2d1a3dc49284bc88
[10] M. ZubairShafiq, S. MominaTabish, F. Mirza, and M. Farooq, “‘PE-Miner:
Mining Structural Information to Detect Malicious Executables in
Realtime’ in Recent Advances in Intrusion Detection,” Springer Science +
Business Media, 2009, pp. 121–141.
https://web.cs.ucdavis.edu/~zubair/files/raid09-zubair.pdf
[11] M. Christodorescu and S. Jha, “Static analysis of executables to detect
malicious patterns,” SSYM’03 Proc. 12th Conf. USENIX Secur. Symp.,
vol. 12, pp. 12–12, 2003.
https://pages.cs.wisc.edu/~jha/jha-papers/security/usenix_2003.pdf
[12] A. Moser, C. Kruegel, and E. Kirda, “Limits of Static Analysis for Malware
Detections,” Acsac, pp. 421–430, 2007.
https://ieeexplore.ieee.org/document/4413008
[13] “Microsoft Portable Executable and Common Object File Format
Specification,” 2015.
http://download.microsoft.com/download/9/c/5/9c5b2167-8017-4bae-9fde-d599bac8184a/pecoff_v83.docx
[14] R. Lyda and J. Hamrock, “Using Entropy Analysis to Find Encrypted and
Packed Malware,” IEEE Secur. Priv., vol. 5, pp. 40–45, 2007.
https://ieeexplore.ieee.org/abstract/document/4140989
[15] Sambanthan Gurunathan, Thangaraj Yogalakshmi, “Statistical Analysis
on the Topological Indices of Graphs”, 2022.
https://link.springer.com/chapter/10.1007/978-981-16-5747-4_33
[16] J. Yonts, “Attributes of Malicious Files,” 2012. [Online]. Available:
https://uk.sans.org/reading-room/whitepapers/malicious/attributes-malicious-files-33979
[17] “ProcessLibrary.com - The Online Resource For Process Information!”
[Online]. http://www.processlibrary.com/en
[18] K. P. Murphy, Machine Learning: A Probabilistic Perspective. Cambridge,
MA: MIT Press, 2012.
https://www.cs.ubc.ca/~murphyk/MLbook/pml-toc-1may12.pdf
[19] S. Wold, K. H. Esbensen, and P. Geladi, “Principal Component Analysis,”
Chemom. Intell. Lab. Syst., vol. 7439, no. August, pp. 37–52, 1987.
https://www.sciencedirect.com/science/article/abs/pii/0169743987800849
[20] D. Lee and H. Seung, “Algorithms for non-negative matrix factorization,”
Adv. Neural Inf. Process. Syst., no. 1, pp. 556–562, 2001.
https://papers.nips.cc/paper_files/paper/2000/hash/f9d1152547c0bde01830b7e8bd60024c-Abstract.html
[21] “Clustering - An Introduction.”
https://home.deib.polimi.it/matteucc/Clustering/tutorial_html/
[22] T. M. Kodinariya and P. R. Makwana, “Review on determining number of
Cluster in K-Means Clustering,” Int. J. Adv. Res. Comput. Sci. Manag.
Stud., vol. 1, no. 6, pp. 2321–7782, 2013.
https://www.jstor.org/stable/2346830
[23] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A K-Means Clustering
Algorithm,” Source J. R. Stat. Soc. Ser. C (Applied Stat.), vol. 28, no. 1, pp.
100–108, 1979.
https://ieeexplore.ieee.org/document/400568
[24] Y. Cheng, “Mean Shift, Mode Seeking, and Clustering,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 17, no. 8, 1995.
[25] “Anaconda | Continuum.”
https://www.continuum.io/anaconda-overview.
[26] “Matplotlib: Python plotting — Matplotlib 2.0.0 documentation.”
https://matplotlib.org/.
[27] “Qt - Product | The IDE.”
https://www.qt.io/ide/.
[28] “VirusTotal - Free Online Virus, Malware and URL Scanner.”
https://www.virustotal.com/en/.
[29] Forbes, D. A., "Reverse engineering the Milky Way," Monthly Notices of the
Royal Astronomical Society, 493(1), 847-854, 2020.
https://academic.oup.com/mnras/article/493/1/847/5717311
[30] Oh, & Fritz, M., "Towards reverse-engineering black-box neural networks,"
pp. 121-144, Springer, Cham, 2019.
https://dl.acm.org/doi/10.1007/978-3-030-28954-6_7
APPENDICES
PE FILE STRUCTURE:
Entropy: Randomness for benign file is from 0 to 8
Code to data ratio: Benign files often have high data and less executable code
Major Image version: Benign files often have version number
Number of sections: Benign files often have 0 – 10 sections
Common dll imports: It can be used to find the motive of software regardless of
malicious or benign files
WORK CONTRIBUTION
1. Research about the Machine learning Algorithms, feature Analysis
2. Downloading necessary modules and configuring the setup for further
development
3. Code for Malware analysis tool
4. Testing and Validation
1. Research about the Feature extraction and Reverse Engineering

2. Implementing core functionalities for extracting and analyzing malware
3. Code for malware analysis tool
4. UI development
THARUN J (212IT513)
1. Research about Malwares, Signature and payloads

2. Data collection(malware)
3. Reporting and Documentation
4. Code for machine learning algorithms and testing