FRP Design

Dataset

Malware, short for "malicious software," can be categorized in many ways. Separating malware
into distinct types is important for understanding how each type infects computers and devices,
the level of threat it poses, and how users can protect themselves against it.

The dataset used for the analysis, “Malware_Classification.csv”, provides one column for each
parameter used in malware classification. Malware classification matters because it establishes
a baseline for future malware detection and classification algorithms, which build on results
previously obtained from the provided features: the entry method of the malware, code size,
exploit type (zero-day or broad-spectrum), nature of the compromised machines, hash tables
used, type of encryption, operating systems attacked, and more.

Simulation and implementation


1. Data Preprocessing and loading
The dataset is imported from the comma-separated file “Malware_Classification.csv” using the
pandas function read_csv(filename), which returns a DataFrame with the same dimensions and
attributes as the file. A DataFrame is the most common structured API and represents a table of
data with rows and columns. The list of columns and their respective types is called the
“schema”.
Any unnamed column is removed with the .drop() function, where axis 1 denotes columns; here
the “Unnamed: 57” column is dropped.

The dataset has 216352 rows and 57 columns, as reported by the .shape attribute of the
DataFrame, which gives the number of rows and the number of columns as a tuple.
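
The loading steps above can be sketched as follows. This is a minimal illustration, not the report's actual code: the three-row CSV text is a hypothetical stand-in for “Malware_Classification.csv”, and the unnamed column is assumed to come from a trailing comma on each line.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for "Malware_Classification.csv". The
# trailing comma on every line is what makes pandas create an "Unnamed: N"
# column (an assumption about how the real file produced "Unnamed: 57").
csv_text = (
    "md5,Machine,MajorLinkerVersion,legitimate,\n"
    "9f1c,332,9,1,\n"
    "03ab,332,8,0,\n"
    "77de,34404,10,0,\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# Collect the unnamed columns and drop them; axis=1 selects columns.
unnamed = [col for col in df.columns if col.startswith("Unnamed")]
df = df.drop(unnamed, axis=1)

# .shape is an attribute, not a method: (number of rows, number of columns).
print(unnamed)   # ['Unnamed: 4']
print(df.shape)  # (3, 4)
```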

Next, the number of null values in each column is checked. Null (NaN) values are treated as
anomalies and are detected with the .isna() function, which returns a boolean (True or False) for
every cell indicating whether the value at that location is null. Summing these booleans column
by column gives the null count for every column of the dataframe. This step reported a single
null value, in “MajorLinkerVersion”; all other columns contained 0 null values.
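
A minimal sketch of the null check, on a hypothetical three-row frame. “MajorLinkerVersion” is the column named in the report; “SizeOfCode” is an illustrative column name only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in "MajorLinkerVersion".
df = pd.DataFrame({
    "MajorLinkerVersion": [9.0, np.nan, 8.0],
    "SizeOfCode": [1024, 2048, 512],  # illustrative column name
})

# isna() yields True/False per cell; sum() counts the True values per column.
null_counts = df.isna().sum()

# Show only the columns that actually contain nulls.
print(null_counts[null_counts > 0])
```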
After the null check, the columns “md5” and “Machine”, which hold the checksum and the
machine code of the targeted machine type, are converted to categorical data. Each unique
value in a column receives an integer code assigned in lexical order, which makes the later
analysis more efficient. The codes are obtained with the .cat.codes accessor.
After the categorical conversion, the null values in each column are filled with that column’s
mean, computed with the .mean() function. The mean is calculated column by column over the
numerical values (float or int types).
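
The two steps above, categorical encoding followed by mean imputation, might look like this on a small hypothetical frame. The values are invented for illustration; only the column names come from the report.

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the column names are taken from the report.
df = pd.DataFrame({
    "md5": ["c3", "a1", "b2", "a1"],
    "MajorLinkerVersion": [9.0, np.nan, 8.0, 10.0],
})

# Convert to categorical and keep the integer codes. Categories are sorted
# lexically, so "a1" -> 0, "b2" -> 1, "c3" -> 2.
df["md5"] = df["md5"].astype("category").cat.codes

# Fill each remaining null with the mean of its own column.
df = df.fillna(df.mean(numeric_only=True))

print(df["md5"].tolist())                 # [2, 0, 1, 0]
print(df["MajorLinkerVersion"].tolist())  # [9.0, 9.0, 8.0, 10.0]
```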

Data Analysis

Once the data has been processed and checked for anomalies, data analysis is carried out.
Data analysis combines operations such as inspecting, cleansing, transforming and modelling
the data in order to discover information, draw conclusions, and support decision-making. It has
multiple facets and approaches and encompasses diverse techniques. For the provided
dataframe, legitimate files are counted to distinguish them from malware such as spyware,
ransomware and other forms of suspicious software. The counts are taken from the “legitimate”
column and plotted as a frequency-distribution pie chart. The chart shows that 34.9% of the
samples are legitimate (75,503) and 65.1% are malware (140,849).
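
The reported percentages follow directly from the two counts, as this short check shows. The counts are the ones stated above; the variable names are illustrative.

```python
# Class counts as reported for the "legitimate" column.
legitimate_count = 75_503
malware_count = 140_849

total = legitimate_count + malware_count

# Share of each class as a percentage of all samples.
shares = {
    "legitimate": 100 * legitimate_count / total,
    "malware": 100 * malware_count / total,
}

print(total)  # 216352, matching the number of rows in the dataset
for label, share in shares.items():
    print(f"{label}: {share:.1f}%")
```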
After the exploratory analysis, the values of the “legitimate” key are replaced so that the
column holds boolean categories.
The entropy of each binary is visualized against its nature (malicious or legitimate) using
matplotlib, plotting “SectionsMeanEntropy” against frequency for malicious and legitimate
binaries. Malicious binaries peak at an entropy of about 6, with roughly 50000 binaries, whereas
legitimate binaries peak at an entropy between 4 and 5, with roughly 27000 binaries.
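
For context on the scale of these values, the sketch below computes Shannon entropy in bits per byte, which ranges from 0 (one repeated byte) to 8 (all byte values equally likely). That “SectionsMeanEntropy” is this measure averaged over a file's sections is an assumption; the report does not define the column.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    # Sum -p * log2(p) over the relative frequency p of each byte value.
    return sum(-(count / n) * math.log2(count / n)
               for count in Counter(data).values())


constant = shannon_entropy(b"\x00" * 64)     # a single repeated byte
uniform = shannon_entropy(bytes(range(256)))  # every byte value once

print(constant, uniform)
```

Packed or encrypted content tends toward the upper end of this range, which is one common reading of why higher entropy is associated with the malicious samples here.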
Each activity is carried out by binary code in a DLL (Dynamic Link Library) file, which contains
the code to be loaded into memory. On loading, each DLL is given a base page alignment of 4
kilobytes on a machine using 32-bit addressing and 8 kilobytes on a machine using 64-bit
addressing, after which further allocations are made and the instructions in the DLL are
executed, whether malicious or not. The analysis shows that most of the malicious samples had
a size of 0 to 40000 kilobytes, whereas the majority of the legitimate samples had a size of 0 to
20 kilobytes. This inference is drawn from the visualization of the “ImportsNbDLL” key against
the observed frequency distribution.

The version of the operating system plays an important role in the spread of malicious activity
and in whether a breach succeeds, because newer operating systems are patched for the
vulnerabilities discovered. The pie plot of affected machines by OS version shows that the
majority ran OS version 5, accounting for 58.8% (82,760 machines), followed by the next most
affected OS version with 49,837 machines.

Feature selection

Correlation measures the relationship between two variables on a quantitative scale and is
useful for selecting the features to analyse. Every variable is perfectly correlated with itself,
which appears as a diagonal of 1s in the correlation matrix.
The correlation matrix of a dataframe is created with the .corr() function and can be plotted with
the heatmap() function of seaborn, which visualizes the correlation coefficients as hues.
The correlations are then sorted to obtain the final feature lists: the best predictors are those
with a correlation above +0.2 and the worst are those with a correlation below -0.2.
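
The thresholding step can be sketched in plain Python as below. The Pearson coefficient is written out by hand (it is the default method of pandas .corr()), and the feature names and values are hypothetical toy data rather than columns of the real dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Toy target and features (hypothetical names and values).
target = [1, 1, 1, 0, 0, 0]
features = {
    "feat_pos": [5, 6, 7, 1, 2, 3],   # rises with the target
    "feat_neg": [1, 2, 1, 8, 9, 7],   # falls with the target
    "feat_flat": [1, 2, 3, 1, 2, 3],  # unrelated to the target
}

corr = {name: pearson(values, target) for name, values in features.items()}

# Keep features above +0.2 as "best" and below -0.2 as "worst".
best = [name for name, r in corr.items() if r > 0.2]
worst = [name for name, r in corr.items() if r < -0.2]

print(best, worst)
```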
These features can also be visualized as horizontal bar graphs.

Data Splitting
The data is split into a training set and a test set using the train_test_split() function, so that the
two sets can be used separately in further analysis. The observations in the training set form the experience that the
algorithm uses to learn. In supervised learning problems, each observation consists of an
observed output variable and one or more observed input variables.
The test set is a set of observations used to evaluate the performance of the model using some
performance metric. It is important that no observations from the training set are included in the
test set. If the test set does contain examples from the training set, it will be difficult to assess
whether the algorithm has learned to generalize from the training set or has simply memorized
it.
A program that generalizes well will be able to effectively perform a task with new data. In
contrast, a program that memorizes the training data by learning an overly complex model
could predict the values of the response variable for the training set accurately, but will fail to
predict the value of the response variable for new examples. Memorizing the training set is
called over-fitting. A program that memorizes its observations may not perform its task well, as
it could memorize relations and structures that are noise or coincidence. Balancing
memorization and generalization, or overfitting and under-fitting, is a problem common to many
machine learning algorithms. Regularization may be applied to many models to reduce
overfitting.
Here, the train set contains 151446 instances (70.0%) and the test set contains 64906
instances (30.0%).
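
The idea behind a shuffled 70/30 split can be sketched without any library, as below. This is a conceptual illustration only, not the code of train_test_split(); rounding the test size up is an assumption chosen here because it reproduces the counts reported above.

```python
import math
import random


def split_indices(n_rows, test_fraction=0.3, seed=0):
    """Shuffle row indices and cut them into disjoint train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Assumption: the test size is rounded up to a whole number of rows.
    n_test = math.ceil(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(216_352)

print(len(train_idx), len(test_idx))  # 151446 64906
```

Because both parts are slices of one shuffled list, no row can appear in both sets, which is the property the text above requires of a test set.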

Logistic regression
A logistic model is used to model the probability of a certain class or event, such as pass/fail,
alive/dead, win/lose or true/false. Each outcome is assigned a probability between 0 and 1, and
the probabilities sum to one. The model is trained with LogisticRegression.fit(x_train, y_train),
predictions are made with LogisticRegression.predict(x_test), and accuracy is checked with
accuracy_score(y_test, irregpred), where the test labels and the predictions are passed as
arguments. The result is an accuracy of 87.47%, together with a classification report.
The classification report contains the metrics Precision, Recall, F1 Score and Support.
Precision is the ratio of true positives to the sum of true and false positives. Recall is the ratio of
true positives to the sum of true positives and false negatives. The F1 score is the harmonic
mean of precision and recall; the closer it is to 1.0, the better the expected performance of the
model. Support is the number of actual occurrences of the class in the dataset. It does not vary
between models; it only helps to diagnose the evaluation process.
The predictions can also be compared with the actual data in a confusion matrix.
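
The metric definitions above can be worked through on a small confusion matrix. The counts below are hypothetical and are not results from the report.

```python
def classification_metrics(tp, fp, fn):
    """Precision, recall and F1 for one class from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical confusion-matrix counts for the positive class.
tp, fp, fn, tn = 80, 20, 10, 90

precision, recall, f1 = classification_metrics(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
support = tp + fn  # actual occurrences of the positive class

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"accuracy={accuracy:.3f} support={support}")
```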

Decision tree classifier


A decision tree classifier is used for predictive modelling and is created with the
DecisionTreeClassifier() function from the scikit-learn library. It uses a decision tree (as a
predictive model) to go from observations about an item (represented in the branches) to
conclusions about the item's target value (represented in the leaves). Tree models where the
target variable can take a discrete set of values are called classification trees; in these tree
structures, leaves represent class labels and branches represent conjunctions of features that
lead to those class labels.
After training, the accuracy of the decision tree classifier is 99.84%, reported together with a
classification report.
The predictions can also be compared with the actual data in a confusion matrix.

Random forest classifier


A random forest is a meta estimator that fits a number of decision tree classifiers on various
sub-samples of the dataset and uses averaging to improve the prediction accuracy and control
over-fitting. It is created with the RandomForestClassifier() function from the scikit-learn library.
The accuracy of the random forest classifier is 99.997%, reported together with a classification
report.
The predictions can also be compared with the actual data in a confusion matrix.
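
A minimal sketch of how the two tree-based classifiers can be trained and scored with scikit-learn. It runs on synthetic data from make_classification, not on the malware dataset, so its accuracies say nothing about the figures reported here; the sample size, n_estimators and random_state values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data standing in for the malware features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Fit each model on the training set and score it on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(x_test))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```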

SVM
Training the support vector machine (SVM) is done using a pipeline that feeds the data to the
algorithm to create the model. An SVM constructs a hyperplane, or a set of hyperplanes, in a
high- or infinite-dimensional space, which can be used for classification, regression, or other
tasks such as outlier detection. It is a supervised machine learning model for two-group
classification problems and is trained on labeled data. The accuracy of the SVM is 91.192%,
reported together with a classification report.
The predictions can also be compared with the actual data in a confusion matrix.
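
One plausible shape for such a pipeline is a scaler followed by the classifier, as sketched below with scikit-learn. The report does not list the pipeline steps, so the use of StandardScaler and an RBF kernel is an assumption, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data standing in for the malware features.
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Assumed pipeline: standardize the features, then fit the SVM.
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm_pipeline.fit(x_train, y_train)

svm_pred = svm_pipeline.predict(x_test)
svm_accuracy = accuracy_score(y_test, svm_pred)
print(f"SVM accuracy on synthetic data: {svm_accuracy:.3f}")
```

Wrapping the scaler and the classifier in one pipeline means the scaling parameters are learned from the training set only and then reused on the test set.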

Naive Bayes Classifier


Gaussian naive Bayes works with continuous data. A common assumption is that the
continuous values associated with each class follow a normal (Gaussian) distribution, which is
used to estimate the likelihood of an event with Bayes’ theorem as the basis. To use the
classifier, the GaussianNB() function is called from the scikit-learn library. The accuracy of the
Naive Bayes classifier is 90.24%, reported together with a classification report.
The results can also be summarized in a confusion matrix.
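
The Gaussian assumption can be made concrete with a small hand-written classifier. This is a teaching sketch, not the GaussianNB() implementation; the two features (an entropy value and a size) and all numbers are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus a per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            # Small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (len(members) / len(rows), stats)
    return model


def predict(model, row):
    """Return the class with the highest log posterior for one row."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            # Log of the normal density, one independent term per feature.
            total += -0.5 * math.log(2 * math.pi * var)
            total -= (value - mean) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)


# Hypothetical rows: [entropy, size]; label 1 = legitimate, 0 = malware.
rows = [[6.1, 300], [5.9, 320], [6.3, 310], [4.2, 20], [4.6, 25], [4.4, 15]]
labels = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(rows, labels)
print(predict(model, [6.0, 305]), predict(model, [4.5, 18]))  # 0 1
```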

MLP
A Multilayer Perceptron (MLP) is a class of feedforward artificial neural network (ANN)
composed of multiple layers of perceptrons (with threshold activation). MLPs are colloquially
referred to as “vanilla” neural networks, especially when they have a single hidden layer. An
MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output
layer. Except for the input nodes, each node in the network is a neuron that uses a nonlinear
activation function. Training uses a supervised learning technique called backpropagation. To
implement the Multilayer Perceptron, MLPClassifier() is called from scikit-learn and fitted on the
training data, giving an accuracy of 97.164%. The metrics are detailed in a classification report,
and the frequency of the predicted instances can be visualized as a confusion matrix.
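
The layer structure described above can be illustrated with a forward pass through one hidden layer. The weights are hypothetical fixed numbers rather than trained values, the ReLU and sigmoid activations are assumptions, and backpropagation itself is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W @ inputs + b) per neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical weights: 2 inputs -> 3 hidden neurons -> 1 output neuron.
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
B1 = [0.0, 0.1, -0.1]
W2 = [[0.7, -0.5, 0.2]]
B2 = [0.05]


def forward(x):
    hidden = dense(x, W1, B1, relu)          # hidden layer, nonlinear
    return dense(hidden, W2, B2, sigmoid)[0]  # output as a probability


probability = forward([1.0, 2.0])
print(f"{probability:.3f}")  # 0.465
```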

Comparison
Once training is complete, the outcomes of all the models used in the analysis can be compared
to find the best and worst performers, which are the Random Forest Classifier at 99.99% and
Logistic Regression at 87.47%. Model accuracy is the measurement used to determine which
model is best at identifying relationships and patterns between variables in a dataset based on
the input, or training, data. The better a model can generalize to ‘unseen’ data, the better the
predictions and insights it can produce.
Similarly, precision, a metric that quantifies the number of correct positive predictions made,
can be compared across models. The best and worst performers are again the Random Forest
Classifier and Logistic Regression.

Comparing Recall, the best performer is the Random Forest Classifier and the worst performer
is again Logistic Regression.
Comparing the fourth metric, the F1 score reported for all the models, gives similar results, with
the Random Forest Classifier on top and Logistic Regression at the bottom.

These results can be compiled into a table for easier comparison.
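
As one way to compile the comparison, the sketch below ranks the models by the accuracy figures quoted in the individual model sections. Only accuracy is included because precision, recall and F1 values are not given in the text; the dictionary layout is illustrative.

```python
# Accuracy (%) as quoted in each model's section of this report.
accuracy = {
    "Logistic Regression": 87.47,
    "Decision Tree": 99.84,
    "Random Forest": 99.997,
    "SVM": 91.192,
    "Naive Bayes": 90.24,
    "MLP": 97.164,
}

# Sort from highest to lowest accuracy.
ranked = sorted(accuracy.items(), key=lambda item: item[1], reverse=True)

for name, value in ranked:
    print(f"{name:<22}{value:>8.3f}%")

best, worst = ranked[0][0], ranked[-1][0]
```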

Discussion

The dataset was tested on different classifiers with different tuning parameters, with the goal of
achieving a mean accuracy of 100%. Achieving such a high accuracy in model fitting and
prediction turned out to be infeasible. While some of the classifiers demonstrated very high
accuracy and thus promising results, building a model that reduces the false positive and false
negative ratios to zero is very difficult, so some penalty for missed cases remains when
detecting zero-day malware. It is also possible that a model with 100% accuracy is overfitted
and would perform poorly when classifying unseen data. The random forest was the best
classifier, achieving 99.99% accuracy on average, followed by the decision tree at 99.69%. An
MLP model with default parameters achieved 94.14% accuracy. Since there are computational
costs associated with the number of neurons in a layer, and given the slight improvement in
observed accuracy, the question is whether a more complex model is worth building or whether
a simpler model with slightly lower accuracy would be sufficient for the prediction. The choice of
this trade-off depends entirely on the application domain. As a special case, detecting zero-day
malware is an important and critical task, so increasing the accuracy as much as possible is
needed regardless of the cost.

According to the results, standardization and categorical conversion are critical for
classification. The primary reason may be the computational expense involved in dealing with
large numbers and therefore with larger standard deviations. Some of these classifiers use a
distance metric (e.g., Euclidean distance), in which the square root of the sum of the squared
differences between observations is calculated for clustering the data items. When larger
values are provided as data, the computational demands of such calculations increase, and a
larger standard deviation will also affect the accuracy of the prediction. While the deep learners
performed very well, some of the conventional machine learning classifiers surprisingly
performed comparably or even better. Given the lower training cost of conventional machine
learning algorithms and the considerably greater cost of training deep classifiers, the
conventional algorithms might be a better choice than the deep learning-based algorithms. The
deep learning-based classifiers show a consistent improvement from building larger models and
from additional training. However, a simple random forest algorithm, at 99.99%, still
outperforms even the larger deep learning-based classifiers with additional training.
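
The effect of standardization on a Euclidean distance can be seen in a small example. The rows below are hypothetical [entropy, size] pairs chosen so that one feature has a much larger scale than the other; z-score scaling is used as one common form of standardization.

```python
import math
from statistics import mean, pstdev


def euclidean(a, b):
    """Square root of the sum of squared differences between two rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardize(rows):
    """Z-score each column: subtract its mean, divide by its std deviation."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    return [
        [(v - mu) / sigma for v, mu, sigma in zip(row, mus, sigmas)]
        for row in rows
    ]


# Hypothetical rows: [entropy, size]. Size dwarfs entropy in raw units.
rows = [[6.0, 40000.0], [4.0, 40100.0], [6.1, 20.0]]

raw_distance = euclidean(rows[0], rows[1])
scaled = standardize(rows)
scaled_distance = euclidean(scaled[0], scaled[1])

print(f"raw: {raw_distance:.2f}  scaled: {scaled_distance:.2f}")
```

In raw units the distance between the first two rows is driven almost entirely by the size difference of 100; after scaling, the entropy difference dominates instead.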
