
Computers and Electrical Engineering 86 (2020) 106729


Improving malware detection using big data and ensemble learning

Deepak Gupta∗, Rinkle Rani

Department of Computer Science & Engineering, Thapar Institute of Engineering & Technology (Deemed to be University), Patiala, India

Article history: Received 23 May 2019; Revised 11 June 2020; Accepted 11 June 2020

Keywords: Apache Spark; Big data; Ensemble learning; Malware detection; Stacking; Weighted voting

Abstract: Malware detection and classification play a critical role in computer and network security. Although many machine learning models have been used in the detection of malicious binaries, the performance of ensemble methods has not been investigated extensively. Besides, the massive volume of malware has established it as a big data problem, forcing security researchers and practitioners to deploy big data technologies to manage, store, analyze, and visualize malware data. In this paper, the authors design two methods based on ensemble learning and big data for improving the performance of malware detection at a large scale. The first method is based on the weighted voting strategy of ensemble learning, and the second chooses an optimal set of base classifiers for stacking. The proposed methods are implemented using Apache Spark, a popular big data processing framework, and their performance is tested and evaluated on a dataset of 198,350 Windows files comprising 100,200 malicious and 98,150 benign samples. The experimental results validate the effectiveness of the proposed approach, as it improves the generalization performance in detecting new malware.

© 2020 Elsevier Ltd. All rights reserved.

1. Introduction

According to a recent threat report [1], around 973 million malware samples (targeting Windows systems) were detected in the year 2018, i.e., nearly 2,666,957 per day, 111,123 per hour, and 1852 per minute. Apart from the huge volume, malware is becoming more sophisticated and complex as well. The WannaCry ransomware exploited a vulnerability in the Windows server message block protocol and spread so quickly that it reached 150 countries within 24 hours of its outbreak [2]. The attack infected around 400,000 machines worldwide, 98% of which were running Windows 7 as the operating system. It resulted in major outages in healthcare, transport, banking, financial institutions, and telecom. The ever-expanding Internet of Things exposes more devices to the Internet and makes them even more vulnerable to such attacks. Consequently, malware is emerging at an astonishing rate that overwhelms traditional malware analysis technologies and requires new detection techniques for managing threats. As a whole, malware data has now become a big data problem [3], and there is a need to adopt big data technologies [4] which can effectively handle this massive influx of malware data.
The most popular methods for malware detection are based on signature matching. The drawback of this approach is that
it is not suitable for the detection of unknown (zero-day) malware whose signatures do not exist. Recent research mostly


∗ Corresponding author. E-mail address: deepak.vd@gmail.com (D. Gupta).

https://doi.org/10.1016/j.compeleceng.2020.106729

includes machine learning (ML) based methods to discover significant patterns from the malware data. Such approaches
employ diverse abstraction level attributes to develop a classification model for detecting unknown malware. The features
may be obtained using static malware analysis (i.e. examining the malicious code without executing the programs) or dy-
namic malware analysis (i.e. monitoring the behavior of malware by executing the malicious samples on a virtual machine).
Malware authors use ever-evolving practices in designing malicious software, which has thwarted static malware analysis techniques and driven advances in dynamic malware analysis. Unfortunately, dynamic malware analysis is more time-consuming and resource-hungry. To counteract the trade-off between detecting obfuscated malware and analysis speed, researchers have adopted an integrated set of attributes acquired from static and dynamic malware analysis for detecting and classifying malware with high accuracy [5].
ML has become a leading force in research and industrial security applications. It uses static and dynamic malware features to recognize the complex correlations, trends, and patterns that reveal malware. However, the major drawback of traditional ML algorithms for malware detection and classification is that they tend to produce high false positive (FP) and false negative (FN) rates for evolving, complex, and sophisticated malware. On the other hand, ensemble
learning combines the output of individual base classifiers to produce accurate predictions for many complex classification
problems, though it brings more complexity and computational cost. Recently, it has become a popular technique in various
competitions, e.g., Kaggle competition [6], and is commonly used to improve accuracy benchmarks for various solutions.
Ensemble methods work better when the ensemble models are loosely correlated. The key is to choose the right ensemble
method for better prediction. In general, enhancing the complexity of the model results in a reduction of errors due to
lower bias in the model. However, after a certain point, the model starts suffering from over-fitting due to high variance.
Ensemble learning tends to achieve optimum model complexity by maintaining a trade-off between bias and variance errors.
In a nutshell, the combination of multiple loosely correlated models, which are based on different algorithms, leads to a
more robust and stable model with better accuracy. In general, the diversity of the base models is the key to create a good
ensemble model.
The authors present and investigate a novel method for malware detection. The proposed method uses an integrated set
of features obtained after performing static and dynamic malware analysis at a large scale, and deploys ensemble learning
and big data technologies in a distributed environment to detect malware. The approach builds a diverse set of classification
models at the lower level. Thereafter, a collection of robust algorithms is deployed to assign rank and weight to each of
the base classifiers. Subsequently, the weights are used in majority voting and selection of an optimal set of classifiers
for stacking. The proposed solution is implemented on top of Apache Spark, which has become a standard for distributed computation of big data due to its ease of use and better performance than Apache Hadoop. The experiments
show that the proposed method improves the generalization performance of malware detection at a large scale.
Contributions. The primary objective is to develop a classification framework for malware detection using an integrated
set of features through ensemble learning methods and big data technologies. The main contributions are summarized as
below:
(1) An automated malware analysis environment is set up using distributed Cuckoo Sandbox for performing static and dynamic malware analysis.
(2) A scalable solution is proposed using Apache Spark, which can efficiently process massive numbers of malware samples. The data processing capacity can be scaled linearly by attaching more nodes to the Apache Spark cluster.
(3) Two ensemble methods are proposed which are based on the algorithms for computing the ranks and weights of the
base classifiers (used in weighted voting method), and selecting a set of base classifiers (used in stacking).
(4) Experiments are conducted to evaluate the performance of the proposed ensemble methods on a dataset (having 100,200 malware and 98,150 benign samples), and the results are compared with the best base classifier and traditional ensemble methods.
The remainder of the paper is organized as follows: Section 2 describes the basics of ensemble learning and its broad
categories, and the related research work. Next, Section 3 elaborates on the proposed methodology. Section 4 describes the
environmental setup and experimental results. Finally, Section 5 provides conclusion and future scope.

2. Background and related work

This section provides an overview of ensemble learning methods followed by related research work pertaining to mal-
ware detection and classification.

2.1. Ensemble learning – An overview

In ML, combined decision systems are widely used to build learning models. Ensemble learning systems unite the decisions of individual classifiers to achieve better accuracy and generalization performance [7]. An ensemble method delivers a highly accurate classifier by combining less accurate ones. This is accomplished through different approaches such as resampling the training set, using heterogeneous algorithms, using homogeneous algorithms with diverse parameters, or using different methods for combining the decisions of classifiers. Many ensemble methods have been suggested for various applications in the past few years; they are generally categorized into the following four types depending on how they are built [7]:

a) Data level
This type of ensemble method (e.g., bagging) creates subsets of training samples by using resampling methods and exploits these to build base classifiers. Resampling methods that can be used to generate sub-datasets include random selection with or without replacement, leave-one-out, etc. The results of the base classifiers are combined using various voting methods. This technique works well for unstable classifiers, where a small change in the training data brings a significant change in the output of the classifier.

b) Feature level
Such methods can be used in two ways. The first approach builds classification models using individual feature views and then combines their decisions through some ensemble strategy. The second approach merges different feature views to build a classification model, which gives better generalization performance compared to any individual feature view.

c) Classifier level
It uses either heterogeneous or homogeneous classifiers with different parameters, for example by injecting randomness into the same classifier. Thereafter, it combines the results using some rules to overcome bias and improve the performance of the classifiers. Such models can be built using sequential or parallel methods. Boosting falls under this category.

d) Combination level
This method is based on combining the various classifier decisions. It includes stacking, voting, and ensemble selection
as described below:
i) Stacking
Stacking is a technique in which the base models are stacked one over another, passing the output from the lower-layer models to the model above. Stacking helps to reduce bias or variance error, depending upon the learning algorithm used. It deploys a group of homogeneous or heterogeneous base classifiers whose predictions are exploited to train a meta-classifier that gives the final prediction. The meta-classifier rectifies errors made by the base classifiers, which improves the generalization performance.
ii) Voting
Voting is a simple method to combine the outputs of the base classifiers. Each base classifier predicts the class of each instance of the dataset and, subsequently, the class label of each instance is decided through a voting scheme [7]. In majority or plurality voting, the class label $\hat{y}$ of each instance $y$ is predicted using the vote of each base model $Y_j$, where $j \in \{1, 2, \ldots, m\}$, as shown in Eq. (1):

$$\hat{y} = \operatorname{mode}\{Y_1(y), Y_2(y), \ldots, Y_m(y)\} \tag{1}$$
For example, suppose we have three classifiers $Y_1$, $Y_2$, and $Y_3$ that classify an instance of the dataset into one of two classes (either 1 or 0). According to majority voting, the label for that instance would be 1 if at least two classifiers predict the class of the instance as 1.
On the other hand, in the weighted voting method, we associate a weight $w_j$ with each classifier $Y_j$. In this case, the label $\hat{y}$ of an instance is computed using Eq. (2):

$$\hat{y} = \arg\max_{i} \sum_{j=1}^{m} w_j\, f_L\big[Y_j(y) = i\big] \tag{2}$$

where $f_L$ is the characteristic function $[Y_j(y) = i \in L]$ and $L$ is the set of unique class labels. Continuing the example above, if the three classifiers predict {1, 0, 0} and carry weights {0.6, 0.2, 0.2} respectively, weighted voting yields the prediction $\hat{y} = 1$, even though two of the three classifiers vote for class 0.
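As an illustration, the following minimal Python sketch implements plurality voting (Eq. (1)) and weighted voting (Eq. (2)) for the binary case; the function names and inputs are hypothetical, not part of the paper's implementation.

```python
from statistics import mode

def plurality_vote(predictions):
    """Eq. (1): the most frequent class label among the base classifiers."""
    return mode(predictions)

def weighted_vote(predictions, weights, labels=(0, 1)):
    """Eq. (2): accumulate w_j for each label i with [Y_j(y) = i] = 1
    and return the label with the largest weighted score."""
    scores = {label: 0.0 for label in labels}
    for pred, w in zip(predictions, weights):
        scores[pred] += w
    return max(scores, key=scores.get)

# Three classifiers predict {1, 0, 0}; weights are {0.6, 0.2, 0.2}:
print(plurality_vote([1, 0, 0]))                  # -> 0
print(weighted_vote([1, 0, 0], [0.6, 0.2, 0.2]))  # -> 1
```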
iii) Ensemble selection
Most ensemble learning algorithms combine all base classifiers to build an ensemble, which may sometimes diminish performance. Ensemble selection instead picks an optimal subset of the available classifiers, which improves generalization performance. This is achieved in two steps: first, a set of (homogeneous or heterogeneous) base classifiers is built; second, a heuristic algorithm is used to choose an optimal subset of them [7].

2.2. Related work

Several data mining and ML techniques have been proposed in the literature for malware detection, some performing better than others. The literature suggests that the advantages of various techniques could be combined into a single classifier to improve performance; ensemble learning is one way to do so. This section explores various contributions made by researchers and practitioners in the field of malware detection and classification using ML.
Zhang et al. [8] presented a method for the classification of malware based on n-gram features. Information gain is used for selecting the best n-gram features. Then, probabilistic neural networks (PNNs) are used to build the classifiers, followed by Dempster-Shafer theory to combine the individual predictions made by the PNN classifiers. They evaluated the proposed method on a dataset consisting of 450 malicious and 423 benign files. The results show a better ROC for the ensemble of PNNs.
Mukkamalaa et al. [9] used the majority voting method to combine the predictions of various classifiers for detecting intrusions in network traffic. They used SVM, MARS, and three types of ANNs (RP, SCG, and OSS) as base classifiers. A DARPA dataset is used for evaluation. They showed that the majority voting method improves detection accuracy.
Menahem et al. [10] used three categories of features, i.e., n-gram, function-based, and portable executable (PE) features. They constructed five different datasets and considered five base classifiers, namely, OneR, VFI, KNN, Naïve Bayes, and the C4.5 decision tree, to process these datasets. Afterwards, majority voting, distribution summation, Bayesian combination, performance weighting, stacking, and Troika were used to combine the base classifiers. The experimental results showed that Troika and stacking outperformed the base classifiers. Distribution summation and Bayesian combination gave the same results as the best base classifiers and voting, but Naïve Bayes and performance weighting performed worse than the base classifiers.
Ye et al. [11] used application programming interface (API) calls and strings as features. They constructed eight base classifiers on various combinations of features. Afterwards, they used a simple voting method to combine the base classifiers. The results showed that the method outperformed the base classifiers.
Guo et al. [12] proposed a method in which they categorized the API calls into seven classes and used these to construct the base models. The predictions made by the base models are combined, and the results are compared with the best base classifiers and the voting method.
Landage and Wankhade [13] used opcode sequences of the malware samples to build base classifiers. The output of the three base classifiers was combined using majority voting and veto-based voting. The experimental results reveal that the veto-based voting method gave a better detection rate than the majority voting method, but increased the false positive rate.
Sheen et al. [14] constructed a set of heterogeneous base classifiers based on API calls and features extracted from the PE header. They proposed two ensemble methods to select and combine a set of base classifiers. The method attained a 99.7% detection rate, which was better than the traditional methods of bagging, boosting, and stacking.
A similar approach has been proposed in [15], where the authors divided the features into various subspaces, each of which is used to construct a base classifier. Afterwards, they used an evolutionary algorithm to assign a weight to each base classifier. A weighted voting method is used to combine the selected base classifiers. The experimental results establish that this method provides better results.
Ozdemir and Sogukpinar [16] proposed a method for the detection of Android malware based on API calls. Using these API calls, they built different base classifiers. Thereafter, the predictions made by the base classifiers are combined using majority voting. The results show that the proposed method improves the accuracy of malware detection.
Another method [17] is proposed for the detection of Android malware in which the authors used permissions and API calls. These features were used to construct six base classifiers. A collaborative decision fusion method was then used to combine the predictions of all classifiers. The authors claim that this method gave better results than traditional methods like AdaBoost and bagging.
Yerima and Sezer [18] proposed a technique, 'Droidfusion', for combining the predictions of base classifiers, based on a multilevel architecture. They proposed four ranking algorithms to rank the base classifiers and then considered combinations of these rankings to fuse the results of the base classifiers. They used four different datasets to demonstrate that the proposed technique is better than traditional ensemble approaches.
Bai and Wang [19] designed two ensemble methods along with an integrated feature set based on opcodes and n-grams. They used two datasets to evaluate the malware detection performance of the proposed schemes. The results reveal that the proposed methods improve the generalization performance of the learning models.
In [20], the authors demonstrated that instead of considering all base classifiers, a subset of them can be considered to enhance the generalization performance. They used a greedy method to iteratively add new classifiers that maximize the performance of the ensemble based on the area under the curve. In each iteration, some base classifiers are randomly selected and the performance of the current ensemble is evaluated. The classifier leading to the best generalization performance is selected as part of the ensemble. This process is repeated until a maximum ensemble size is reached.
Kuncheva [21] pointed out that there should be diversity among the classifiers to make ensemble learning more effective. In ensemble methods using multiple classifiers, diversity is obtained by making use of different types of classifiers, each containing an explicit or implicit bias. The combination of such classifiers achieves accuracy that none of the individual classifiers could achieve alone.
The aforementioned studies point to the following: (1) in general, integrating multiple feature views of malware outperforms single-view methods, (2) a single ML model or parameter optimization does not yield the best inferences, so a better approach is to combine multiple learning models to produce a strong model, and (3) researchers have yet to explore big data tools and technologies with ensemble learning for malware detection. In this paper, we have used a set of features extracted from static as well as dynamic malware analysis, and applied novel ensemble learning methods to enhance the generalization performance of malware detection at a large scale. The proposed approach deploys Apache Spark for efficient processing of massive malware data and builds multiple diverse base models at the lower level. The experimental results obtained at this level are used for two purposes: (1) to compute ranks and weights for the base
models to be used in weighted voting, and (2) to select a set of optimal base classifiers to be used in stacking. Thereafter, the
results of the proposed schemes are compared with the traditional ensemble methods which show a definite improvement
in the generalization performance of malware detection.

3. Proposed methodology

It is evident from the literature that a multiple-feature view consisting of static and dynamic features helps in providing an inclusive depiction of a program and thus assists in detecting malware more accurately. Therefore, the proposed methodology makes use of multiple features obtained through static and dynamic malware analysis.
This section discusses the detailed methodology used to detect malware. Fig. 1 presents an overview of the proposed methodology, and the subsequent subsections describe its various steps in detail. The malware and benign files are mainly collected from public malware repositories and clean Windows machines respectively. An automated environment is set up to perform static and dynamic malware analysis, and the generated reports are used as a raw dataset for malware detection. Thereafter, these reports are stored in a distributed storage system and processed using Apache Spark to extract the malware features. After extracting the features, we have used five diverse inducers as base classifiers to build models, which are evaluated on parameters such as true positive rate, true negative rate, precision, accuracy, F-measure, and the Matthews correlation coefficient (MCC). The results obtained from these are used to design the schemes for:

(1) Computing the weights for base classifiers to be used for the weighted voting ensemble method.
(2) Selecting an optimal pool of classifiers to be used for stacking.

Subsequently, the outcomes of the proposed methods are compared with the best base classifiers and some of the tradi-
tional ensemble methods.

3.1. Dataset preparation

The proposed ensemble methods for malware detection are evaluated on a dataset of 100,200 malicious and 98,150 benign files. The malicious files are collected from diverse sources such as VirusShare, Nothink, and VXHeaven. The benign files are mainly collected from clean Windows XP, 7, and 10 installation directories. Some benign samples are collected from online free software archives, such as CNET Download and Softonic, and from our local development lab. In all, our dataset is balanced, as it contains nearly equal numbers of malicious and benign files.
We have set up an automated environment using distributed Cuckoo Sandbox for conducting static and dynamic malware analysis. The host machine runs Ubuntu as the operating system, with Windows 7 guest machines set up using Oracle VM VirtualBox. The raw binary files are executed in the virtual environment, which produces analysis reports in JavaScript Object Notation (JSON) format. The JSON reports contain various static (i.e., binary metadata, packer detection, and sections' information) and dynamic attributes (i.e., dynamically linked libraries, dropped files, Windows API calls, mutex operations, file operations, registry operations, network activities, and processes) of the malware binaries. These JSON reports form the raw dataset fed to the proposed architecture for further analysis.
The data preprocessing steps include data filtering, data labeling, replacing missing numerical feature values with the mean of the observed values for the feature, removing samples having all null values, and transforming the data into numerical values. Duplicate samples are filtered and removed on the basis of their MD5 hashes. In addition, a voting method is adopted to label a sample: VirusTotal scans a malware sample using a collection of anti-virus tools, and we set a threshold of five detections to label a sample as malicious, because it is rare to find false positives from five or more detectors. Using this threshold, only six samples were discarded from our dataset.
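The labeling rule can be summarized in a short sketch; treating samples with one to four detections as ambiguous and discarding them is our reading of the text, and the function name is hypothetical.

```python
def label_sample(positives, threshold=5):
    """Map a VirusTotal-style detection count to a training label.

    >= threshold engines flag the file -> malicious (1)
    no engine flags it                 -> benign (0)
    otherwise                          -> ambiguous, discard (None)
    """
    if positives >= threshold:
        return 1
    if positives == 0:
        return 0
    return None  # only six such samples occurred in our dataset

print(label_sample(7), label_sample(0), label_sample(2))  # 1 0 None
```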

3.2. Feature extraction

We have identified an integrated set of static and dynamic features based on the state of the art in malware classification and on our dataset. The various features are briefly described below and summarized in Table 1:

(1) File Metadata: The metadata of a malware binary includes information like file size, FileVersion, ProductVersion, ProductName, CompanyName, etc.
(2) File Size: Malware authors tend to keep the size of a malicious binary small. In our dataset, around 75% of the binaries have a size of less than 400 KB and only 5% are larger than 1 MB.
(3) Packer Detection: Packers are extremely popular among malware authors for compressing and/or encrypting a binary's content to minimize download time and storage space. Packing changes the byte structure of the malware and thus can bypass signature-based detection.
(4) Sections' Information: This consists of the total number of sections in the malicious program and the number of suspicious sections. A suspicious section is one whose virtual size and raw data size differ hugely, owing to the packing or encryption of that particular section.
(5) Dynamically Linked Libraries (DLLs): DLLs are a predominant feature for malware detection as they reveal the resource information of a program.
(6) Dropped Files: Malware drops some files on the system during its execution so that it cannot be detected.
(7) Windows API Calls: These are among the most dominant features for understanding malware behavior.
(8) Mutex Operations: Mutexes ensure that only one copy of the malware is running on the system.
(9) File Activities: File activity helps in understanding the role of the malware. For example, if a malware creates a file to store browsing history, it is a kind of spyware. The file activities provide information about the number of files accessed, read, created, modified, and deleted.
(10) Registry Activities: The Windows registry is a large hierarchical database storing information about the system configuration. The features include the registries read/created, modified, and deleted, along with their key-values.
(11) Network Activities: These features cover the count of IP addresses, the type of protocol used to establish a connection with the command and control server, ICMP requests made, DNS requests, etc.
(12) Process Activities: Process activities show whether the malware has created, modified, or terminated one or more processes.

Fig. 1. Proposed methodology for malware detection.



Table 1
List of static and dynamic features.

| Type | Category | Features | Raw Type | Derived Type | Derived Value |
|---|---|---|---|---|---|
| Static | File Metadata | File metadata present | String | Boolean | [0, 1] |
| Static | File Size | File size | Integer | Integer | Integer |
| Static | Packer Detection | Packer used | String | Boolean | [0, 1] |
| Static | Sections' Information | No. of suspicious sections | String | Integer | Integer |
| Dynamic | DLLs | Frequency of DLLs | String | Integer | Integer |
| Dynamic | Dropped Files | No. of dropped files | Integer | Integer | Integer |
| Dynamic | Windows API Calls | Frequency of API calls | String | Integer | Integer |
| Dynamic | Mutex Operations | No. of mutexes | String | Integer | Integer |
| Dynamic | File Activities | No. of files accessed; read or created; modified; deleted | String | Integer | Integer |
| Dynamic | Registry Activities | No. of registries accessed; read/created; modified; deleted | String | Integer | Integer |
| Dynamic | Network Activities | No. of TCP, ICMP, UDP, IRC, HTTP, and SMTP connections; DNS requests; host requests; domains; IP addresses | String | Integer | Integer |
| Dynamic | Process Activities | No. of processes created; modified; terminated | String | Integer | Integer |


3.3. Feature vectorization

Many ML algorithms work well on a numeric feature space, where rows and columns represent instances and features respectively, rather than on raw attribute data. The process of transforming raw data into a feature space is known as vectorization. There are four commonly used techniques to convert raw data into numerical feature vectors, namely, categorical, free-form string, ordinal, and sequential vectorization [22]. We have applied categorical vectorization to all the attributes, which maps each attribute value to a unique dimension in the feature vector. For example, the OpenProcess API will correspond to the ith index in the feature vector V, where Vi = 1 if and only if the binary makes an OpenProcess API call.
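A minimal sketch of categorical vectorization is shown below; the attribute vocabulary and helper name are illustrative, not taken from the paper's code.

```python
def vectorize(samples, vocabulary):
    """Categorical vectorization: each attribute value owns a unique
    dimension; V[i] = 1 iff the sample exhibits attribute i."""
    index = {attr: i for i, attr in enumerate(vocabulary)}
    vectors = []
    for attrs in samples:
        v = [0] * len(vocabulary)
        for a in attrs:
            if a in index:
                v[index[a]] = 1
        vectors.append(v)
    return vectors

vocab = ["OpenProcess", "CreateFileW", "RegSetValueExW"]
print(vectorize([{"OpenProcess", "CreateFileW"}], vocab))  # [[1, 1, 0]]
```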

3.4. Malware classification using base classifiers

The diversity of the classifiers makes ensemble learning more effective, and it can be achieved by incorporating different types of classifiers that use distinct approaches to classify the input data. By combining diverse classifiers, the learning model can achieve accuracy that none of the individual classifiers could achieve alone [7,21]. Consequently, we have used five diverse inducers as base classifiers, namely, Naïve Bayes (NB), K-Nearest Neighbour (KNN), Decision Table (DT), Support Vector Machine (SVM), and Random Forest (RF). Each of these classifiers belongs to a different family and therefore uses a different approach to classify the input data. These classifiers are used for building the base models and are evaluated on various evaluation parameters. A description of each base classifier is given below:

3.4.1. Naïve Bayes
The NB classifier is based on Bayes' theorem with strong independence assumptions. It predicts the probability of a given instance belonging to a class in a dataset. It assumes that the occurrence of a feature in a class is independent of any other feature, i.e., all features are assumed to contribute independently to the probability used for classification. This model is useful for very large datasets and is easy to build. Let the dataset D have a feature vector $x = \{x_1, x_2, \ldots, x_n\}$; then, according to Bayes' theorem, the posterior probability is computed using Eq. (3):

$$P(c|x) = \frac{P(x|c)P(c)}{P(x)} \tag{3}$$

where $P(c|x)$ is the posterior probability of class c given predictor attribute x, $P(x|c)$ is the probability of the predictor attribute given class c, $P(c)$ is the prior probability of class c, and $P(x)$ is the prior probability of the predictor attribute.

3.4.2. K-Nearest neighbour
KNN is a representative of lazy ML algorithms: it stores the training data and waits until it is provided with test data. On receiving a test instance, it makes a prediction by scanning the training data for the k most similar instances (neighbours). Such learners support incremental learning; however, they are computationally expensive. We have chosen k = 5 and used normalized Euclidean distance to find the training instances closest to a given test instance. The Euclidean distance between two points, $a = \{a_1, a_2, \ldots, a_n\}$ and $b = \{b_1, b_2, \ldots, b_n\}$, in n-dimensional space is calculated using Eq. (4):

$$Distance(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} \tag{4}$$

If multiple training instances lie at the smallest distance from the test instance, the class of the first such instance is used.

3.4.3. Decision table
In this classifier, specific features are selected during the learning process [23]. This is done by computing the cross-validation performance of the table for different subsets of attributes and selecting the best-performing subset. The cross-validation error is obtained by manipulating the class counts associated with each table entry, because the structure of the table does not change when instances are added or deleted. The feature space is usually searched using the best-first search algorithm.

3.4.4. Support vector machine
SVM is a supervised ML algorithm in which each data point is plotted in n-dimensional space, where n represents the number of features and each feature value represents the value of a particular coordinate. Classification is then performed by identifying a hyperplane that separates the data of one class from the other. The training instances closest to the hyperplane, known as support vectors, and the margins they define are used to determine the hyperplane.

3.4.5. Random forest
Random forest is an ensemble method consisting of several decision trees combined with a technique called bagging (short for bootstrap aggregation). Bagging trains each decision tree on a subset of the original dataset created by sampling with replacement. The final class is determined by majority voting over the results of all decision trees. It is a highly efficient and effective algorithm for large datasets.
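For reference, a scikit-learn sketch of the five base inducers is given below. This is only an approximation of the paper's Spark-based setup: scikit-learn has no Decision Table learner, so a decision tree stands in for DT, and the remaining hyperparameters are defaults rather than the authors' settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Five diverse base inducers; k = 5 for KNN as stated in Section 3.4.2.
base_classifiers = {
    "NB": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "DT": DecisionTreeClassifier(),      # stand-in for the Decision Table
    "SVM": LinearSVC(),
    "RF": RandomForestClassifier(n_estimators=100),
}

def fit_all(X_train, y_train):
    """Train every base model on the same vectorized feature matrix."""
    return {name: clf.fit(X_train, y_train)
            for name, clf in base_classifiers.items()}
```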

3.5. Proposed schemes

The proposed schemes for building an ensemble classifier are based on: (1) weighted voting, and (2) selecting an optimal
pool of classifiers to be used for stacking. These schemes are based on the rank algorithms designed in [18].

3.5.1. Scheme 1

It has been observed in the literature that most classifiers give different results for different classes; for example, they do not classify the malware and benign classes with equal accuracy. This fact has been exploited to compute the weights of the base classifiers used for implementing weighted voting. The various methods for computing the ranks and weights are described below:

3.5.1.1. Average accuracy (AA)
In this method, the ranking is computed by taking the average of the per-class prediction accuracies; the base classifier with the better average accuracy gets a higher rank. Consider c classifiers, denoted by $x_1, x_2, \ldots, x_c$, used to classify the data into two classes, i.e., malware and benign. Let $P_{m,x_i}$ and $P_{b,x_i}$ be the accuracies of the malware and benign class for classifier $x_i$ respectively. The average accuracy $P_{avg_{x_i}}$ of each classifier is computed using Eq. (5):

$$P_{avg_{x_i}} = \frac{n_m \times P_{m,x_i} + n_b \times P_{b,x_i}}{n_m + n_b} \quad \forall i \in \{1, 2, \ldots, c\} \tag{5}$$

where $n_m$ and $n_b$ denote the number of malicious and benign files respectively. Suppose $A = \{P_{avg_{x_1}}, P_{avg_{x_2}}, \ldots, P_{avg_{x_c}}\}$ is the set of average accuracies of all classifiers. Then the rank is calculated using the $Rank_{desc}()$ function, which assigns a rank to every classifier on the basis of its average accuracy (i.e., the larger the value of $P_{avg_{x_i}}$, the higher the rank), as shown in Eq. (6):

$$Rank(x_i, P_{avg_{x_i}}) = Rank_{desc}(A) \quad \forall i \in \{1, 2, \ldots, c\} \tag{6}$$

The weight of each classifier is then computed by dividing its rank by the sum of the rank values of all classifiers, as shown in Eq. (7):

$$Wt_{x_i} = \frac{Rank(x_i, P_{avg_{x_i}})}{\sum_{i=1}^{c} Rank(x_i, P_{avg_{x_i}})} \quad \forall i \in \{1, 2, \ldots, c\} \tag{7}$$
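The AA computation of Eqs. (5)-(7) can be expressed compactly as below; the function is a hypothetical sketch, checked against the per-class accuracies later reported in Table 4.

```python
def aa_weights(class_acc, n_m, n_b):
    """Average-accuracy (AA) ranking and weighting, Eqs. (5)-(7).

    class_acc: {classifier: (P_m, P_b)} per-class accuracies
    n_m, n_b:  number of malicious / benign samples
    """
    avg = {x: (n_m * pm + n_b * pb) / (n_m + n_b)     # Eq. (5)
           for x, (pm, pb) in class_acc.items()}
    ordered = sorted(avg, key=avg.get)                # ascending accuracy
    rank = {x: i + 1 for i, x in enumerate(ordered)}  # Eq. (6): best gets rank c
    total = sum(rank.values())
    return {x: rank[x] / total for x in rank}         # Eq. (7)

acc = {"NB": (0.844, 0.841), "KNN": (0.921, 0.933), "DT": (0.944, 0.955),
       "SVM": (0.896, 0.892), "RF": (0.977, 0.985)}
print(aa_weights(acc, n_m=100200, n_b=98150))
# RF gets rank 5 and weight 5/15 = 0.300, matching Table 4
```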

3.5.1.2. Class accuracy differential (CAD)
In this approach, a weight is assigned to each base classifier on the basis of its average accuracy and the absolute difference between its class accuracies. Let $D_{x_i}$ be the ratio of the average accuracy to $|P_{m,x_i} - P_{b,x_i}|$, the absolute difference between the malware and benign class accuracies of a classifier. Eq. (8) computes $D_{x_i}$:

$$D_{x_i} = \frac{P_{avg_{x_i}}}{\left|P_{m,x_i} - P_{b,x_i}\right|} \quad \forall i \in \{1, 2, \ldots, c\} \tag{8}$$

Suppose $D = \{D_{x_1}, D_{x_2}, \ldots, D_{x_c}\}$ is the set of class differential values for all classifiers. Then the rank is calculated using the $Rank_{desc}()$ function (as shown in Eq. (9)), which assigns a rank to every classifier on the basis of its class accuracy differential (i.e., the larger the value of $D_{x_i}$, the higher the rank):

$$Rank(x_i, D_{x_i}) = Rank_{desc}(D) \tag{9}$$

Further, the weight of each classifier is computed by dividing its rank by the sum of the rank values of all classifiers, as shown in Eq. (10):

$$Wt_{x_i} = \frac{Rank(x_i, D_{x_i})}{\sum_{i=1}^{c} Rank(x_i, D_{x_i})} \quad \forall i \in \{1, 2, \ldots, c\} \tag{10}$$

3.5.1.3. Ranked aggregate per class accuracies (RACA)
This approach first ranks each classifier separately on each class accuracy; the final rank for a classifier is then computed from the sum of its per-class rankings. Let $B = \{B_{x_1}, B_{x_2}, \ldots, B_{x_c}\}$ be the set of benign-class accuracies of all classifiers and $M = \{M_{x_1}, M_{x_2}, \ldots, M_{x_c}\}$ the set of malware-class accuracies. A rank is assigned to every classifier separately for each class accuracy using the $Rank_{desc}()$ function (i.e., the larger the value of $B_{x_i}$ or $M_{x_i}$, the higher the rank), as shown in Eqs. (11) and (12):

$$Rank(x_i, b) = Rank_{desc}(B) \tag{11}$$

$$Rank(x_i, m) = Rank_{desc}(M) \tag{12}$$

The aggregate of the per-class ranks is then computed as shown in Eq. (13):

$$Z_{x_i} = Rank(x_i, b) + Rank(x_i, m) \quad \forall i \in \{1, 2, \ldots, c\} \tag{13}$$

Let $Z = \{Z_{x_1}, Z_{x_2}, \ldots, Z_{x_c}\}$ be the set of aggregate per-class ranks for all classifiers. The final rank for each classifier is computed using the ranking function $Rank_{desc}()$ as shown in Eq. (14):

$$Rank(x_i, Z_{x_i}) = Rank_{desc}(Z) \tag{14}$$

Thereafter, the weight of each classifier is computed by dividing its rank by the sum of the rank values of all classifiers, as shown in Eq. (15):

$$Wt_{x_i} = \frac{Rank(x_i, Z_{x_i})}{\sum_{i=1}^{c} Rank(x_i, Z_{x_i})} \quad \forall i \in \{1, 2, \ldots, c\} \tag{15}$$

3.5.1.4. Ranked aggregate of average accuracy and class differential (RACD)
In this approach, the final ranking is computed from the sum of the rank calculated using the average accuracy of the classifier and the rank calculated using the difference between its class accuracies. Let $t_{x_i}$ be the class accuracy differential, as shown in Eq. (16):

$$t_{x_i} = \left|P_{m,x_i} - P_{b,x_i}\right| \quad \forall i \in \{1, 2, \ldots, c\} \tag{16}$$

Suppose $T = \{t_{x_1}, t_{x_2}, \ldots, t_{x_c}\}$ is the set of class differential values for all classifiers. Then a rank is computed using the $Rank_{ascen}()$ function (as shown in Eq. (17)), which assigns a rank to every classifier on the basis of $t_{x_i}$ (i.e., the smaller the value of $t_{x_i}$, the higher the rank):

$$Rank(x_i, t_{x_i}) = Rank_{ascen}(T) \tag{17}$$

Now, the ranks for average accuracy and class differential are aggregated for each classifier using Eq. (18):

$$H_{x_i} = Rank(x_i, P_{avg_{x_i}}) + Rank(x_i, t_{x_i}) \tag{18}$$

Suppose $H = \{H_{x_1}, H_{x_2}, \ldots, H_{x_c}\}$ is the set of aggregate ranks based on average accuracy and class differential for all classifiers. The final ranking is computed using the ranking function $Rank_{desc}()$ as shown in Eq. (19):

$$Rank(x_i, H_{x_i}) = Rank_{desc}(H) \tag{19}$$

Then the weight of each classifier is computed by dividing its rank by the sum of the rank values of all classifiers, as shown in Eq. (20):

$$Wt_{x_i} = \frac{Rank(x_i, H_{x_i})}{\sum_{i=1}^{c} Rank(x_i, H_{x_i})} \quad \forall i \in \{1, 2, \ldots, c\} \tag{20}$$

3.5.2. Scheme 2
We further aggregate the ranks assigned to the classifiers by the different ranking algorithms to select an optimal set of base classifiers for stacking. Instead of taking all base classifiers, the top 60% are selected as the optimal set on the basis of aggregate rank. If two or more base classifiers have an equal aggregate rank, the classifier with the higher individual rank under the ranking algorithm that achieved the highest weighted-voting accuracy (as explained in proposed scheme 1) is given priority, and so on.
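A sketch of the selection step is given below, using the ranks later shown in Table 5; the function is illustrative, and the tie-breaking rule described above is omitted for brevity.

```python
def select_for_stacking(ranks, keep_fraction=0.6):
    """Scheme 2: keep the top 60% of base classifiers by aggregate rank."""
    aggregate = {x: sum(r) for x, r in ranks.items()}
    k = round(len(ranks) * keep_fraction)
    return sorted(aggregate, key=aggregate.get, reverse=True)[:k]

ranks = {"NB": [1, 5, 2, 4], "KNN": [3, 1, 3, 3], "DT": [4, 2, 4, 4],
         "SVM": [2, 4, 1, 4], "RF": [5, 3, 5, 5]}
print(select_for_stacking(ranks))  # ['RF', 'DT', 'NB'], as in Table 5
```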

4. Evaluation

This section describes the experimental setup, evaluation parameters, and the outcomes of the proposed solution, which is implemented on top of Apache Spark, a popular open-source big data processing framework. The integrated set of selected features (described in Section 3.2) is combined with the proposed ensemble methods, and the results are compared with the best base classifier and traditional ensemble methods. In our evaluation, we record model performance in terms of precision, recall, accuracy, F-measure, and the Matthews correlation coefficient.

4.1. Experimental environment

The proposed architecture leverages open-source components, which include the Hadoop Distributed File System (HDFS) for distributed storage of massive data, Apache Spark for efficient preprocessing of big data, and Python for data-intensive analysis. The experiments are conducted on an Apache Spark multi-node cluster (a master and two worker nodes) running Ubuntu 18.04.2. The master node has an Intel Core i7-5500U CPU @ 2.4 GHz, 16 GB DDR3 1600 MHz RAM, and a 1 TB HDD. The worker nodes have an Intel Core i5-8250U CPU @ 1.6 GHz, 8 GB DDR4 RAM, and a 1 TB HDD. A brief description of the open-source software components is given below:

(a) HDFS: HDFS is a popular tool for distributed storage of data at a large scale. It resembles existing distributed file systems and stores data on commodity hardware. Its major benefits include high fault tolerance, high throughput, and effortless portability among heterogeneous platforms.
(b) Apache Spark: Apache Spark is a highly reliable and scalable processing engine. It uses resilient distributed datasets (RDDs) [24], collections of fault-tolerant elements that can be operated on in parallel. Apache Spark is best known for processing large datasets in memory, and it runs faster than Apache Hadoop MapReduce.
(c) Development Language: Python is used for data-intensive analysis and is preferred as the development language for two compelling reasons: (1) the availability of a rich ecosystem of scientific libraries, and (2) the prevalence of Python implementations accompanying ML research articles. The other development tools include PySpark, a Python API for Apache Spark, and Jupyter Notebook, which provides the development environment.

4.2. Cross-validation

Methods commonly used to evaluate a classifier include random sub-sampling, leave-one-out, and cross-validation. In our work, we have used k-fold cross-validation, a standard approach to assess likely predictions on new data. In this method, a labeled dataset is randomly partitioned into k parts of the same size. Of these, k−1 subsets are used for model training and the remaining one for validation. The whole process is repeated k times so that each subset is used exactly once as the validation data. In our experiments, 10-fold cross-validation has been used to increase the chances of learning all relevant information in the training set.
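The sketch below shows 10-fold cross-validation with scikit-learn on synthetic data standing in for the vectorized malware dataset; it illustrates the procedure only and is not the authors' Spark pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the vectorized dataset of Section 3.3.
X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

# Each of the 10 folds serves exactly once as the validation set.
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X, y, cv=10, scoring="accuracy")
print(f"mean accuracy over 10 folds: {scores.mean():.3f}")
```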

Table 2
Table of confusion (confusion matrix) for the malware class.

| Actual \ Predicted | Malware | Benign |
|---|---|---|
| Malware | TP | FN |
| Benign | FP | TN |

4.3. Performance parameters

In supervised classification, each data element has a class label and a predicted value. The result for each data element can be assigned to one of four categories, which form the building blocks of the evaluation metrics: True Positive (TP), False Negative (FN), False Positive (FP), and True Negative (TN). These outcomes can be arranged in a 2 × 2 confusion (or error) matrix, as shown in Table 2, in which each row denotes an actual class and each column a predicted class. The various evaluation parameters used in classification problems are described below:
True Positive Rate (TPR): The rate of correctly predicted malware samples, also known as recall or sensitivity.

$$TPR = \frac{TP}{TP + FN} \tag{21}$$

False Positive Rate (FPR): The rate of incorrectly predicted benign samples.

$$FPR = \frac{FP}{FP + TN} \tag{22}$$

False Negative Rate (FNR): The rate of incorrectly predicted malware samples.

$$FNR = \frac{FN}{TP + FN} \tag{23}$$

Precision: A measure of exactness or quality.

$$Precision = \frac{TP}{TP + FP} \tag{24}$$

F-Measure: The harmonic mean of precision and recall.

$$F\text{-}Measure = \frac{2 \times Precision \times Recall}{Precision + Recall} = \frac{2 \times TP}{2 \times TP + FP + FN} \tag{25}$$

Accuracy (%): The percentage of correctly predicted samples (both malware and benign).

$$Accuracy\,(\%) = \frac{TP + TN}{TP + FN + TN + FP} \times 100 \tag{26}$$

Matthews Correlation Coefficient (MCC): Used to measure and compare the performance of ML algorithms for binary classification. It measures the correlation between predicted and actual data labels, taking values between −1 (perfect anti-correlation between predicted and real instances) and +1 (perfect prediction); a value of 0 means no better than random classification. It is computed using Eq. (27):

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{27}$$
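All seven parameters follow directly from the confusion matrix, as the sketch below shows; the example counts are hypothetical, chosen only to be consistent with the dataset size and RF's reported rates.

```python
from math import sqrt

def metrics(tp, fn, fp, tn):
    """Evaluation parameters of Eqs. (21)-(27) from the confusion matrix."""
    return {
        "TPR": tp / (tp + fn),                 # recall / sensitivity
        "FPR": fp / (fp + tn),
        "FNR": fn / (tp + fn),
        "Precision": tp / (tp + fp),
        "F-Measure": 2 * tp / (2 * tp + fp + fn),
        "Accuracy (%)": 100 * (tp + tn) / (tp + fn + tn + fp),
        "MCC": (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn)
                                          * (tn + fp) * (tn + fn)),
    }

# Hypothetical counts for the 198,350-file dataset:
print(metrics(tp=97895, fn=2305, fp=1472, tn=96678))
```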

4.4. Evaluation results

We now examine the experimental results obtained using the big data environment (discussed in Section 4.1) for classification of the dataset. First, the five base models are built using 10-fold cross-validation, and the various evaluation parameters are computed using Eqs. (21)–(27), as shown in Table 3.
Figs. 2–5 visualize the comparison of the base classifiers on FPR/FNR, precision/F-Measure, accuracy, and MCC respectively. Fig. 2 shows that the FPR and FNR values are highest for the NB classifier and lowest for RF. Fig. 3 shows that precision and F-measure are highest for RF and lowest for NB. Further, RF delivers the highest average accuracy, i.e., 98.1%, followed by 94.9% for DT (as shown in Fig. 4). Fig. 5 shows that RF records the maximum MCC value, i.e., 0.961 (the highest correlation between predicted and actual data labels), followed by DT at 0.899.
In order to obtain the rank and weight of each classifier, the four methods explained in Section 3.5.1 are applied to the classification results obtained for the base classifiers. All computations are based on class accuracies, i.e., TPR (accuracy of

Table 3
Classification results using base classifiers.

| Base Classifier | TPR | FPR | FNR | Precision | F-Measure | Accuracy (%) | MCC |
|---|---|---|---|---|---|---|---|
| NB | 0.844 | 0.159 | 0.156 | 0.844 | 0.844 | 84.2 | 0.685 |
| KNN | 0.921 | 0.067 | 0.079 | 0.933 | 0.927 | 92.6 | 0.854 |
| DT | 0.944 | 0.045 | 0.056 | 0.955 | 0.949 | 94.9 | 0.899 |
| SVM | 0.896 | 0.108 | 0.104 | 0.894 | 0.894 | 89.3 | 0.788 |
| RF | 0.977 | 0.015 | 0.023 | 0.985 | 0.981 | 98.1 | 0.961 |

Fig. 2. Comparison of base classifiers on the basis of FPR/FNR.

Fig. 3. Comparison of base classifiers on the basis of precision/F-Measure.

Fig. 4. Comparison of base classifiers on the basis of accuracy.



Fig. 5. Comparison of base classifiers on the basis of MCC.

Table 4
Rank and weight of different classifiers using the four ranking algorithms.

| Classifier | TPR | TNR | AA Rank | AA Weight | CAD Rank | CAD Weight | RACA Rank | RACA Weight | RACD Rank | RACD Weight |
|---|---|---|---|---|---|---|---|---|---|---|
| NB | 0.844 | 0.841 | 1 | 0.067 | 5 | 0.300 | 2 | 0.133 | 4 | 0.20 |
| KNN | 0.921 | 0.933 | 3 | 0.200 | 1 | 0.067 | 3 | 0.200 | 3 | 0.15 |
| DT | 0.944 | 0.955 | 4 | 0.267 | 2 | 0.133 | 4 | 0.267 | 4 | 0.20 |
| SVM | 0.896 | 0.892 | 2 | 0.133 | 4 | 0.267 | 1 | 0.067 | 4 | 0.20 |
| RF | 0.977 | 0.985 | 5 | 0.300 | 3 | 0.200 | 5 | 0.300 | 5 | 0.25 |

Table 5
Aggregate rank of different classifiers based on the various ranking algorithms.

| Classifier | Rank using AA | Rank using CAD | Rank using RACA | Rank using RACD | Aggregate Rank |
|---|---|---|---|---|---|
| NB | 1 | 5 | 2 | 4 | 12 |
| KNN | 3 | 1 | 3 | 3 | 10 |
| DT | 4 | 2 | 4 | 4 | 14 |
| SVM | 2 | 4 | 1 | 4 | 11 |
| RF | 5 | 3 | 5 | 5 | 18 |

Table 6
Comparison of traditional ensemble methods with the proposed ensemble methods.

| Ensemble Method | TPR | FPR | FNR | Precision | F-Measure | Accuracy (%) | MCC |
|---|---|---|---|---|---|---|---|
| Majority Voting | 0.988 | 0.014 | 0.012 | 0.986 | 0.987 | 98.7 | 0.974 |
| WeightedVoting_AA | 0.994 | 0.008 | 0.006 | 0.992 | 0.993 | 99.3 | 0.986 |
| WeightedVoting_CAD | 0.987 | 0.016 | 0.013 | 0.984 | 0.985 | 98.5 | 0.971 |
| WeightedVoting_RACA | 0.996 | 0.005 | 0.004 | 0.995 | 0.995 | 99.5 | 0.991 |
| WeightedVoting_RACD | 0.989 | 0.013 | 0.011 | 0.988 | 0.988 | 98.8 | 0.977 |
| Stacking (all classifiers) | 0.993 | 0.009 | 0.007 | 0.991 | 0.992 | 99.2 | 0.984 |
| Stacking (RF, DT, NB) | 0.994 | 0.009 | 0.006 | 0.991 | 0.992 | 99.2 | 0.985 |

malicious class) and TNR (accuracy of benign class). Table 4 shows the class accuracies of each classifier obtained from the 10-fold cross-validation method, and also presents the rankings and weights calculated using the AA, CAD, RACA, and RACD methods.
Each of the four ranking methods gives a different set of ranks and weights for the base classifiers. The weights obtained from the different methods are used separately in an ensemble method based on weighted voting, as explained in Section 2.1. Further, the ranks obtained from these methods are summed to compute the aggregate rank of each classifier, as shown in Table 5. Based on aggregate rank, the top three classifiers (i.e., 60% of the base classifiers), namely RF, DT, and NB, are chosen for stacking.
A comparative analysis of the traditional ensemble methods, i.e., majority voting and stacking (considering all base classifiers), is performed against the outcomes of the proposed methods. In stacking, logistic regression (LR) is deployed as the level-2 meta-classifier. LR is an algorithm used to classify data instances into a discrete set of classes. It converts the output using the logistic sigmoid function (as shown in Eq. (28)) and gives a probability value which can be

Fig. 6. Comparison of best base classifier, traditional and proposed ensemble methods using FPR/FNR.

Fig. 7. Comparison of best base classifier, traditional and proposed ensemble methods using precision/F-Measure.

easily mapped to distinct classes.

$$p = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \cdots + b_n x_n)}} \tag{28}$$

where p is a probability with a value between 0 and 1, and $b_0 + b_1 x_1 + \cdots + b_n x_n$ is the input to the function over the n input variables.

Fig. 8. Comparison of best base classifier, traditional ensemble and proposed ensemble methods using accuracy.

Fig. 9. Comparison of base classifiers, traditional ensemble, and proposed ensemble methods using MCC.

Besides, a threshold value is selected for mapping the probability to discrete classes. If the value of p for a data instance is greater than or equal to the selected threshold, the instance is classified as 1 (malware), otherwise as 0 (benign), as represented in Eq. (29):

$$\text{If } p \geq 0.5,\ class = 1,\ \text{else } class = 0 \tag{29}$$
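A scikit-learn sketch of the stacked ensemble is shown below; it approximates the paper's Spark implementation on synthetic data, with a decision tree again standing in for the Decision Table learner.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

# Level-1: the top three base classifiers by aggregate rank (RF, DT, NB);
# level-2: logistic regression maps their predictions to the final label.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("dt", DecisionTreeClassifier(random_state=0)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(),
    cv=10)
stack.fit(X, y)
print(f"training accuracy: {stack.score(X, y):.3f}")
```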
Table 6 shows the classification results for the traditional ensemble methods (i.e., majority voting, and stacking considering all base classifiers) and the proposed methods (WeightedVoting_AA, WeightedVoting_CAD, WeightedVoting_RACA, WeightedVoting_RACD, and stacking of the top three classifiers).
Fig. 6 visualizes the FPR/FNR for the best base classifier, traditional majority voting, stacking (considering all classifiers), and the proposed methods. It clearly shows that the FPR/FNR values for WeightedVoting_RACA are the lowest. Fig. 7 plots a comparison of the best base classifier and the traditional and proposed ensemble methods based on precision/F-Measure; both precision and F-Measure are maximum for WeightedVoting_RACA.
Fig. 8 shows the accuracy comparison of the best base classifier with the traditional and proposed ensemble methods. All ensemble methods provide better accuracy than RF (i.e., the best base classifier). It also shows that the proposed weighted-voting-based methods, except WeightedVoting_CAD, give better accuracy than majority voting and stacking. Further, stacking with all classifiers and stacking with the top 3 base classifiers give the same accuracy. Therefore, it is better to use the optimal set of classifiers instead of all base classifiers, as this reduces the overhead in terms of computation time.
Fig. 9 visualizes MCC for comparing the performance of the best base classifier and the proposed and traditional ensemble methods for malware detection. It shows that the MCC value is close to 1 for all ensemble methods and that WeightedVoting_RACA has the highest MCC value. This means that the ensemble methods provide a better correlation between predicted and actual data labels. Therefore, the proposed ensemble schemes can be used to improve the generalization performance for detecting unknown malware.

5. Conclusion and future scope

Malware data has seen exponential growth in recent years and has become a popular use case for big data. Moreover, malware is getting more complex and sophisticated. Consequently, traditional approaches for malware detection have proved ineffective.
The authors have presented a scalable solution for malware detection based on big data and ensemble learning. Two methods are designed for building an ensemble classifier, both based on calculating ranks and weights for the base classifiers. These weights are used in weighted voting and in the selection of an optimal set of classifiers for stacking. The proposed methods are evaluated on a dataset consisting of 198,350 Windows files and compared with traditional ensemble methods. The experimental results show that the proposed scheme based on weighted voting (i.e., WeightedVoting_RACA) provides the highest accuracy of 99.5%. Further, stacking with the top 3 base classifiers gives the same accuracy as stacking with all base classifiers, which may help in reducing the computation time. Hence, it is concluded that the proposed methods are able to enhance the generalization performance for detecting new malware.
In the present work, we have considered a balanced dataset for simplicity. In general, malware detection and classification is a class imbalance problem, with far more instances of the benign class than the malicious one. In the future, we intend to address this issue while classifying malware samples. Further, we aim to measure the computational complexity of the proposed ensemble methods. The efficiency of the proposed method in terms of learning time can be studied and evaluated with respect to cluster size, dataset size, etc. Apart from this, the proposed solution can also be offered as a cloud service for malware detection. A hybrid approach may be considered to process malware data on local and remote clusters.

CRediT authorship contribution statement

Deepak Gupta: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data cu-
ration, Visualization, Writing - original draft. Rinkle Rani: Supervision, Writing - review & editing.

References

[1] Quick Heal annual threat report 2019. Available online: https://www.quickheal.co.in/documents/threat-report/QH-Annual-Threat-Report-2019.pdf.
[2] CERT-EU. WannaCry ransomware campaign exploiting SMB vulnerability, May 2017. Available online: https://cert.europa.eu/static/SecurityAdvisories/2017/CERT-EU-SA2017-012.pdf.
[3] Gupta D, Rani R. Big data framework for zero-day malware detection. Cybern Syst 2018;49(2):103–21.
[4] Gupta D, Rani R. A study of big data evolution and research challenges. J Inf Sci 2019;45(3):322–40.
[5] Gandotra E, Bansal D, Sofat S. A framework for generating malware threat intelligence. Scalable Comput Pract Exp 2017;18(3):195–206.
[6] Microsoft Malware Classification Challenge (BIG 2015). Available online: https://www.kaggle.com/c/malware-classification.
[7] Kuncheva LI. Combining pattern classifiers: methods and algorithms. John Wiley & Sons; 2004.
[8] Zhang B, Yin J, Hao J, Zhang D, Wang S. Malicious codes detection based on ensemble learning. In: Proceedings of the international conference on autonomic and trusted computing. Berlin, Heidelberg: Springer; 2007. p. 468–77.
[9] Mukkamalaa S, Sunga AH, Abrahamb A. Intrusion detection using an ensemble of intelligent paradigms. J Netw Comput Appl 2005;28:167–82.
[10] Menahem E, Shabtai A, Rokach L, Elovici Y. Improving malware detection by applying multi-inducer ensemble. Comput Stat Data Anal 2009;53(4):1483–94.
[11] Ye Y, Li T, Jiang Q, Han Z, Wan L. Intelligent file scoring system for malware detection from the gray list. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2009. p. 1385–94.
[12] Guo S, Yuan Q, Lin F, Wang F, Ban T. A malware detection algorithm based on multi-view fusion. In: Proceedings of the international conference on neural information processing. Berlin, Heidelberg: Springer; 2010. p. 259–66.
[13] Landage J, Wankhade MP. Malware detection with different voting schemes. Compusoft 2014;3(1):450–6.
[14] Sheen S, Anitha R, Sirisha P. Malware detection by pruning of parallel ensembles using harmony search. Pattern Recognit Lett 2013;34(14):1679–86.
[15] Krawczyk B, Woźniak M. Evolutionary cost-sensitive ensemble for malware detection. In: Proceedings of the international joint conference SOCO'14-CISIS'14-ICEUTE'14. Cham: Springer; 2014. p. 433–42.
[16] Ozdemir M, Sogukpinar I. An Android malware detection architecture based on ensemble learning. Trans Mach Learn Artif Intell 2014;2(3):90–106.
[17] Sheen S, Anitha R, Natarajan V. Android based malware detection using a multifeature collaborative decision fusion approach. Neurocomputing 2015;151:905–12.
[18] Yerima SY, Sezer S. Droidfusion: a novel multilevel classifier fusion approach for Android malware detection. IEEE Trans Cybern 2018;99:1–4.
[19] Bai J, Wang J. Improving malware detection using multi-view ensemble learning. Secur Commun Netw 2016;9(17):4227–41.
[20] Caruana R, Niculescu-Mizil A, Crew G, Ksikes A. Ensemble selection from libraries of models. In: Proceedings of the twenty-first international conference on machine learning. ACM; 2004. p. 18.
[21] Kuncheva LI. Diversity in multiple classifier systems (editorial). Inf Fusion 2005;6(1).
[22] Miller BA. Scalable platform for malicious content detection integrating machine learning and manual review. Doctoral dissertation, UC Berkeley.
[23] Witten IH, Frank E, Hall MA, Pal CJ. Data mining: practical machine learning tools and techniques. Morgan Kaufmann; 2016.
[24] Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation. USENIX Association; 2012. p. 2.

Deepak Gupta received advanced degrees in Computer Science & Engineering and Physics. Prior to his foray into academia, he worked in the IT industry for a decade, performing different roles in software development and program management. His research interests include big data analytics, machine learning, cybersecurity, and programming languages.

Rinkle Rani is working as an Associate Professor in the Computer Science & Engineering Department, Thapar Institute of Engineering & Technology, Patiala. She did her post-graduation at BITS, Pilani and her PhD at Punjabi University, Patiala. She has more than 22 years of teaching experience and 120 research publications. Her areas of interest are data analytics and machine learning.
