

Machine Learning Classifiers for Android Malware Detection

Chapter · January 2021


DOI: 10.1007/978-981-15-5616-6_22


2 authors:

Prerna Agrawal Bhushan H Trivedi


GLS University GLS Institute of Computer Technology


Machine Learning Classifiers
for Android Malware Detection

Prerna Agrawal and Bhushan Trivedi

Abstract As Android devices grow in popularity, they become more prone to
malware attacks. Many tools are available for scanning Android malware, but most
of them perform only static analysis and require substantial resources and manual
overhead. This study aims to improve Android malware detection by using Machine
Learning Classifiers. The paper analyzes different Android malware detection
techniques that use different Machine Learning Classifiers and discusses their
strengths, weaknesses, and future scope. It concludes that, among the surveyed
approaches, the Random Forest classifier achieves the highest accuracy compared
to SVM and Naive Bayes, and that Random Forest, SVM, and Naive Bayes are the
classifiers most widely used for performance evaluation.

Keywords Machine learning · Android malware · Static analysis · Malware
detection · Android mobile security · Dynamic analysis

1 Introduction

Smartphones are now used extensively. As new technologies become easier to use,
smartphones are becoming a basic need of the end-user [1]. In 2016, Google's
Android led the smartphone market with an 82% share [1], and sales of smartphones
to end-users were around 1.5 billion units. Because Android is so popular, it is
also more vulnerable to malware attacks. Avast reported a 40% increase in
cyber-attacks on Android since 2016 [1]. A total of 316 weaknesses were found in
the Android OS in 2017, more than in any other operating system [2].

P. Agrawal (✉) · B. Trivedi
Faculty of Computer Technology (MCA), GLS University, Ellisbridge, Ahmedabad, Gujarat, India
e-mail: prerna.agrawal@glsuniversity.ac.in
B. Trivedi
e-mail: bhushan.trivedi@glsuniversity.ac.in

© Springer Nature Singapore Pte Ltd. 2021
N. Sharma et al. (eds.), Data Management, Analytics and Innovation,
Advances in Intelligent Systems and Computing 1174,
https://doi.org/10.1007/978-981-15-5616-6_22

In [3], various online Android malware scanning tools are studied and briefly
compared. The comparison shows that most existing Android malware scanning tools
perform static analysis and take a long time to scan a single file [3]. These
tools also require manual overhead and heavy resources to perform the scan [3].
In this situation, Machine Learning is a suitable solution for detecting malware.
Using different Machine Learning Classifiers makes it possible to automate the
malware detection system, which improves detection precision and reduces time,
heavy resource usage, and manual overhead [1]. A study and detailed comparison of
Android malware detection using different Machine Learning Classifiers is
therefore needed.
The paper is organized as follows: Sect. 2 reviews related work on detecting
Android Malware using Machine Learning Classifiers. Section 3 describes the
different Machine Learning Classifiers used. Section 4 provides a comparative
study of detecting Android Malware using Machine Learning Classifiers. Section 5
concludes the paper.

2 Related Work

Researchers have proposed many approaches for detecting Android Malware using
different Machine Learning Classifiers. The Android Malware detection techniques
are static analysis, dynamic analysis, and hybrid analysis [2].
Static analysis reverse engineers the APK file and focuses on the Android
Manifest file to detect malware [2]. Monica [4] applies different Machine
Learning Classifiers to static features and improves static malware detection.
Koli [1] applies different Machine Learning Classifiers to static features and
proposes a system named RanDroid. Mathew [5] applies different Machine Learning
Classifiers to static features and proposes a system based on examining
permissions. Justin [12] applies different Machine Learning Classifiers in a
static analysis and proposes an original machine learning-based malware detection
system. Zarni [7] applies different Machine Learning Classifiers to static
features and proposes a framework for classifying Android applications.
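
As an illustration of permission-based static features, the following is a minimal sketch of how requested permissions might be read from a manifest and encoded as a binary feature vector. It assumes the manifest has already been decoded to plain-text XML (inside an APK it is stored in a binary format); the sample manifest, the `VOCABULARY` list, and the function names are hypothetical and are not taken from any of the surveyed systems.

```python
import xml.etree.ElementTree as ET

# Namespace used by the android: attribute prefix in manifest files.
ANDROID_NS = "http://schemas.android.com/apk/res/android"

# Hypothetical decoded manifest, used only for illustration.
SAMPLE_MANIFEST = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.demo">
    <uses-permission android:name="android.permission.INTERNET"/>
    <uses-permission android:name="android.permission.SEND_SMS"/>
    <application android:label="Demo"/>
</manifest>
""".strip()

# Hypothetical feature vocabulary; each surveyed paper selects its own features.
VOCABULARY = [
    "android.permission.INTERNET",
    "android.permission.READ_CONTACTS",
    "android.permission.SEND_SMS",
]


def requested_permissions(manifest_xml):
    """Return the set of permission names declared via <uses-permission>."""
    root = ET.fromstring(manifest_xml)
    key = "{%s}name" % ANDROID_NS
    return {el.get(key) for el in root.iter("uses-permission") if el.get(key)}


def to_feature_vector(permissions, vocabulary):
    """Encode a permission set as 0/1 values, one per vocabulary entry."""
    return [1 if name in permissions else 0 for name in vocabulary]


perms = requested_permissions(SAMPLE_MANIFEST)
vector = to_feature_vector(perms, VOCABULARY)
print(vector)
```

A vector of this kind is the sort of input that the classifiers in Sect. 3 operate on.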
Dynamic analysis focuses on the runtime behavior of an application [2]. Ham [8]
applies different Machine Learning Classifiers to runtime features and recommends
a feature selection method that reduces the false detection rate of malware.
Chang [9] applies different Machine Learning Classifiers to runtime features and
proposes a Robotium program. Chieh [11] applies different Machine Learning
Classifiers to runtime features and proposes a framework named DroidDolphin.
Yu [10] applies different Machine Learning Classifiers to runtime features and
proposes a malware detection system.

3 Machine Learning Classifiers

Machine Learning Classifiers are mainly divided into two categories: supervised
learning and unsupervised learning [1, 4, 5, 7–12]. Supervised learning, also
known as predictive learning, predicts the class of unknown objects based on
prior class-related information about similar objects [6]. Unsupervised learning,
also known as descriptive learning, finds patterns in unknown objects by grouping
similar objects together [6].
According to the surveyed studies [1, 4, 5, 7–12], the main Machine Learning
Classifiers used are as follows.

3.1 Naive Bayesian

Naive Bayesian is used for classification tasks that assign class labels to
problem instances [12, 13]. It requires only a small amount of training data to
estimate the parameters needed for classification. Naive Bayesian classifiers are
simple linear classifiers known for straightforward and accurate results [6]. The
strengths of this classifier are that it is simple and fast to compute, it
performs well on noisy and missing data, it works well with both small and large
amounts of training data, and it obtains accurate results in a straightforward
way [6]. Its weakness is that the assumption of equal importance and independence
of features often does not hold, and if the dataset contains a large number of
numeric features, the accuracy and reliability of the output become limited [6].
Text classification, spam filtering, and online sentiment analysis are typical
applications of the Naive Bayesian classifier [6].
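
The independence assumption can be made concrete with a small sketch. The code below is a minimal Bernoulli Naive Bayes over binary features with Laplace smoothing, written from the general definition of the method rather than from any surveyed system. The toy data, the feature meaning (three hypothetical permissions), and the function names are illustrative assumptions.

```python
import math


def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate a log-prior and smoothed P(feature = 1 | class) per class."""
    model = {}
    n_features = len(X[0])
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        log_prior = math.log(len(rows) / len(X))
        # Laplace smoothing keeps every probability strictly between 0 and 1.
        probs = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[label] = (log_prior, probs)
    return model


def predict_nb(model, x):
    """Return the class with the highest posterior score for x.

    Features are treated as independent given the class, so the
    per-feature log-probabilities are simply added.
    """
    best_label, best_score = None, -math.inf
    for label, (log_prior, probs) in model.items():
        score = log_prior
        for xj, p in zip(x, probs):
            score += math.log(p if xj else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: columns could stand for INTERNET, SEND_SMS, READ_CONTACTS.
X = [[1, 1, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 0], [1, 0, 1]]
y = ["malware", "malware", "malware", "benign", "benign", "benign"]

nb_model = train_bernoulli_nb(X, y)
print(predict_nb(nb_model, [1, 1, 1]))
print(predict_nb(nb_model, [1, 0, 0]))
```

Because each feature contributes one additive term, training and prediction are both linear in the number of features, which is consistent with the speed noted above.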

3.2 Support Vector Machine

Support Vector Machine (SVM) is a model for linear classification and regression
that is based on the concept of a surface called a hyperplane. The hyperplane
draws a boundary between data instances plotted in a multidimensional feature
space [6] and is used to separate the data instances belonging to different
classes. The strengths of SVM are that it can be used for both regression and
classification, it is robust, and its prediction results are very accurate [6].
The weaknesses of SVM are that in its basic form it is applicable only to binary
classification, it is very complex, it is very slow on large datasets, and it is
memory-intensive [6]. Cancer detection and face detection in images are typical
applications of the SVM classifier [6].
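
The role of the hyperplane can be sketched as follows. The code does not train an SVM; it only shows how a given hyperplane, described by a weight vector `w` and offset `b`, separates instances by the sign of `w·x + b`, and how far an instance lies from the boundary. The values of `w`, `b`, and the sample points are hypothetical and chosen by hand.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Map the sign of the decision value to a class label."""
    return "malware" if decision_value(w, b, x) >= 0 else "benign"


def distance_to_hyperplane(w, b, x):
    """Geometric distance from x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm


# Hypothetical hyperplane in a two-feature space, not learned from data.
w, b = (1.0, 1.0), -3.0
samples = [(2.0, 2.0), (0.0, 1.0)]

svm_labels = [classify(w, b, x) for x in samples]
svm_distances = [distance_to_hyperplane(w, b, x) for x in samples]
print(svm_labels, svm_distances)
```

Training an SVM amounts to choosing `w` and `b` so that the smallest such distance over the training instances is as large as possible.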

3.3 Random Forest

Random Forest is an ensemble classifier that combines many decision tree
classifiers [6]. A set of decision trees is created from randomly selected
subsets of the dataset [14]. Once the forest has been generated, a majority vote
combines the outputs of the individual trees [6, 14]. The strengths of Random
Forest are that it works well on large and expansive datasets, it has a robust
method for estimating missing data and maintains precision even when a large
proportion of the data is missing, it has techniques for balancing errors when
the class populations in a dataset are unbalanced, it estimates which features
are most important in the overall classification, generated forests can be saved
for future use on other data, and it can be used for both classification and
regression [6]. The weaknesses of Random Forest are that it is difficult to
interpret because it combines multiple decision trees and that it is
computationally much more expensive than a simple model such as a single decision
tree [6].
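
The combination of random subsets and majority voting can be sketched in a few lines. To stay short, the sketch below replaces full decision trees with one-level trees (stumps) over binary features, which is a simplification and not how production Random Forest implementations build their trees. The toy data, the parameter values, and the function names are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(X, y, feature_ids):
    """One-level tree: split on the candidate feature with fewest errors."""
    best = None
    for j in feature_ids:
        on = [t for x, t in zip(X, y) if x[j] == 1]
        off = [t for x, t in zip(X, y) if x[j] == 0]
        label_on = majority(on) if on else majority(y)
        label_off = majority(off) if off else majority(y)
        errors = sum(t != label_on for t in on) + sum(t != label_off for t in off)
        if best is None or errors < best[0]:
            best = (errors, j, label_on, label_off)
    _, j, label_on, label_off = best
    return lambda x: label_on if x[j] == 1 else label_off


def train_forest(X, y, n_trees=25, n_sub_features=2, seed=0):
    """Train each stump on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # sample rows with replacement
        feats = rng.sample(range(len(X[0])), n_sub_features)
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest, x):
    """Combine the individual tree outputs by majority vote."""
    return majority([tree(x) for tree in forest])


# Hypothetical toy data with three binary features.
X = [[1, 1, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 0], [1, 0, 1]]
y = ["malware", "malware", "malware", "benign", "benign", "benign"]

forest = train_forest(X, y)
print(predict_forest(forest, [1, 1, 0]))
```

The vote is what makes the ensemble harder to interpret than a single tree: no individual stump explains the final decision.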

3.4 Logistic Regression

Logistic Regression is a regression-based technique used for classification [6].
It is a kind of regression analysis that predicts the outcome of a categorical
dependent variable, and it is used for binary classification [15]. The strengths
of Logistic Regression are that it is very effective, it does not need high
computational resources, it does not require the input features to be scaled, it
gives accurate predictions, and it is simple and easy to implement [15]. The
weaknesses of Logistic Regression are that it cannot solve non-linear problems
and that it does not work well unless all the relevant independent variables are
clearly identified [15].
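
A minimal sketch of binary logistic regression with a single input feature, trained by plain gradient descent on the log-loss, is given below. The feature (a hypothetical count of sensitive permissions per application), the labels, the learning rate, and the number of epochs are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.1, epochs=20000):
    """Fit p(y = 1 | x) = sigmoid(w * x + b) by full-batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean log-loss with respect to w and b.
        errors = [sigmoid(w * x + b) - t for x, t in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict_label(w, b, x):
    """Predict 1 (malware) when the estimated probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


# Hypothetical toy data: x = number of sensitive permissions, y = 1 for malware.
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]

w_lr, b_lr = train_logistic(xs, ys)
lr_predictions = [predict_label(w_lr, b_lr, x) for x in xs]
print(lr_predictions)
```

The decision boundary is the single point where `w * x + b = 0`, which illustrates why the model cannot represent non-linear class boundaries without additional feature engineering.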

3.5 K-Means Clustering

K-means is a partitioning-based clustering technique in machine learning [6],
also known as a centroid-based technique. It assigns each of n data points to one
of K clusters, where K is a user-defined parameter giving the desired number of
clusters [6]. The strengths of K-means clustering are that it is flexible and
fits most scenarios and complexities, and that its performance and efficiency are
very high [6]. The weaknesses of K-means clustering are that it involves an
element of random chance, so the resulting set of clusters may not be optimal in
some cases, and that the user needs some experience to guess the number of
natural clusters to start with for an efficient outcome [6].
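
The assignment and update steps of K-means can be sketched as follows. To keep the example reproducible, the initial centroids are passed in explicitly; in practice they are usually chosen at random, which is the source of the random chance mentioned above. The points and initial centroids are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical two-dimensional points forming two visible groups, K = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
initial = [(1, 1), (8, 8)]

km_labels, km_centroids = kmeans(points, initial)
print(km_labels, km_centroids)
```

Starting from different initial centroids can lead to a different final partition, which is why the result is not guaranteed to be the optimal clustering.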

4 Comparative Study of Detecting Android Malware Using Machine Learning Classifiers

This section presents a detailed comparison of approaches for detecting Android
Malware using Machine Learning techniques [1, 4, 5, 7–12]. The comparison
parameters are Paper, Analysis Type, Input, Dataset Type, Final Dataset, Machine
Learning Type, Machine Learning Classifiers, Detection Rate, Performance
Evaluation Criteria, Comparison with other Machine Learning Classifiers, and
Proposed Approach. Table 1 shows the detailed comparison for detecting Android
Malware using Machine Learning Classifiers.

4.1 Analysis Type

This parameter defines the type of analysis performed by the system. It can be
static, dynamic, or hybrid analysis. Monica [4], Koli [1], Mathew [5], Justin
[12], and Zarni [7] perform static analysis. Ham [8], Chang [9], Chieh [11], and
Yu [10] perform dynamic analysis.

4.2 Input

This parameter defines the type of input taken by each system. Monica [4] takes
Permissions and Intents as input. Ham [8] takes Native Size, other_shared,
VMPeak, VMData, VMLib, Dalvik_Rss, cpu_usage, RxBytes, and Send_sms as input.
Chang [9] takes Permissions, Intent Receivers, Network Activities, and File
read/write permissions as input. Koli [1] takes Requested Permissions, Vulnerable
API Calls, Dynamic Code, Reflection Code, Cryptographic Code, Database, and
Native Code as input. Mathew [5], Justin [12], and Zarni [7] take Permissions as
input. Chieh [11] takes run-time logs of applications as input. Yu [10] takes
system calls as input.

4.3 Dataset Type

This parameter defines whether the data used for the experiments is a training
dataset or a real dataset. Monica [4], Koli [1], Mathew [5], Justin [12], Chieh
[11], and Yu [10] use a training dataset for performing experiments in the
system. The dataset type is not specified for Ham [8] and Chang [9] and is not
mentioned for Zarni [7].

Table 1. Comparison of Detecting Android Malware Using Machine Learning Classifiers

| Paper | Analysis type | Input | Dataset type | Final dataset | ML type | ML classifiers | Detection rate | Performance evaluation criteria | Comparison with other ML classifiers | Proposed approach |
|---|---|---|---|---|---|---|---|---|---|---|
| Monica [4] | Static | Permissions, intents | Training | 500 benign applications and 500 malicious applications | Supervised learning | Cubic SVM | 91.7% | Not mentioned | Linear discriminant SVM, weighted KNN, complex tree, linear SVM, coarse KNN | Improves static malware detection |
| Ham [8] | Dynamic | Native size, other_shared, VMPeak, VMLib, Dalvik_Rss, RxBytes, VMData, send_sms, cpu_usage | Not specified | 11,268 benign applications and 3526 malicious applications | Supervised learning | Naïve Bayesian, random forest, logistic regression, SVM | 99% with random forest | FPR, TPR | 10-fold cross-validation | Feature selection method and reduction of false detection of malware |
| Chang [9] | Dynamic | Permissions, intent receivers, network activities, file read/write permissions | Not specified | Not specified | Supervised learning | K-fold cross-validation | 97% | FPR, TPR, accuracy | Random forest, J48, LMT, LogitBoost, bagging, KNN, KStar, PART, BayesNet | A Robotium program |
| Koli [1] | Static | Requested permissions, vulnerable API calls, dynamic code, reflection code, native code, cryptographic code, database | Training | 120 benign applications and 175 malicious applications | Supervised learning | SVM | 97.7% | FPR, accuracy, recall rate, precision, F-measure | Decision tree, Naïve Bayes, random forest | A system named RanDroid |
| Mathew [5] | Static | Permissions | Training | 2444 benign applications and 870 malicious applications | Supervised learning | SVM | 80% | Not specified | Neural networks, classification trees, fuzzy clustering, random forest of decision trees | Android malware detection technique built on examining permissions |
| Justin [12] | Static | Permissions | Training | 2081 benign applications and 91 malicious applications | Supervised learning | One-class SVM | Not specified | Not specified | Not specified | A malware detection system based on machine learning |
| Chieh [11] | Dynamic | Run time logs of applications | Training | 32,000 benign applications and 32,000 malicious applications | Supervised learning | SVM | 86.1% | Recall rate, FPR, precision rate, accuracy, F-Score | BayesNet, Naïve Bayes, J48, random forest, multilayer perceptron, logistic | A dynamic malware analysis framework named DroidDolphin |
| Zarni [7] | Static | Permissions | Not mentioned | 700 applications | Unsupervised learning | K-Means clustering | 91.75% with random forest | FPR, TPR, TP, FP, FN, TN, overall accuracy | Random forest, J48, CART | A framework for classifying Android applications |
| Yu [10] | Dynamic | System calls | Training | 96 benign applications and 92 malware applications | Supervised learning | SVM, Naïve Bayes | 78% | Detection rate, error rate, training time, detection time | Not specified | A malware detection system that uses behavior-based detection |

4.4 Final Dataset

This parameter gives the composition of the final dataset. Monica [4] uses 500
benign applications and 500 malicious applications. Ham [8] uses 11,268 benign
applications and 3,526 malicious applications. Koli [1] uses 120 benign
applications and 175 malicious applications. Mathew [5] uses 2,444 benign
applications and 870 malicious applications. Justin [12] uses 2,081 benign
applications and 91 malicious applications. Chieh [11] uses 32,000 benign
applications and 32,000 malicious applications. Zarni [7] uses 700 applications.
Yu [10] uses 96 benign applications and 92 malware applications.
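
The counts above can be turned into class proportions with simple arithmetic. The sketch below uses only the benign and malicious counts quoted in this section; the resulting shares are derived here for illustration and are not figures reported by the surveyed papers. Zarni [7] is omitted because only a total of 700 applications is given.

```python
# (benign, malicious) counts as quoted in this section.
DATASETS = {
    "Monica [4]": (500, 500),
    "Ham [8]": (11268, 3526),
    "Koli [1]": (120, 175),
    "Mathew [5]": (2444, 870),
    "Justin [12]": (2081, 91),
    "Chieh [11]": (32000, 32000),
    "Yu [10]": (96, 92),
}

# Fraction of each dataset that is malicious.
malicious_share = {
    paper: malicious / (benign + malicious)
    for paper, (benign, malicious) in DATASETS.items()
}

for paper, share in sorted(malicious_share.items(), key=lambda item: item[1]):
    print(f"{paper}: {share:.1%} malicious")
```

The shares range from a balanced 50% down to a few percent, so the datasets differ in class balance as well as in size.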

4.5 Machine Learning Type

This parameter defines the type of machine learning used. It can be supervised
learning, unsupervised learning, or reinforcement learning [6]. Monica [4], Ham
[8], Chang [9], Koli [1], Mathew [5], Justin [12], Chieh [11], and Yu [10] use
supervised learning. Zarni [7] uses unsupervised learning.

4.6 Machine Learning Classifiers

This parameter defines the Machine Learning Classifiers or algorithms used in the
system. Monica [4] uses a Cubic Support Vector Machine (SVM). Ham [8] uses Naive
Bayes, Random Forest, Logistic Regression, and SVM. Chang [9] uses K-fold
Cross-Validation. Koli [1], Mathew [5], and Chieh [11] use an SVM. Justin [12]
uses a one-class SVM. Zarni [7] uses K-Means Clustering. Yu [10] uses Naïve
Bayesian and SVM.

4.7 Detection Rate

This parameter shows the detection rate for detecting malware accurately. In Monica
[4], the detection rate is 91.7%. In Ham [8], the detection rate is 99% with Random
Forest classifier. In Chang [9], the detection rate is 97%. In Koli [1], the detection rate
is 97.7%. In Mathew [5], the detection rate is 80%. In Chieh [11], the detection rate
is 86.1%. In Zarni [7], the detection rate is 91.75% with Random Forest classifier.
In Yu [10], the detection rate is 78%.

4.8 Performance Evaluation Criteria

This parameter defines the measures used as Performance Evaluation Criteria for
the Machine Learning Classifiers. Ham [8] uses the False Positive Rate (FPR) and
True Positive Rate (TPR). Chang [9] uses FPR, TPR, and Accuracy. Koli [1] uses
FPR, Accuracy, Recall rate, Precision, and F-measure. Chieh [11] uses Recall
rate, FPR, Precision rate, Accuracy, and F-Score. Zarni [7] uses TP, FP, TN, FN,
TPR, FPR, and Overall Accuracy. Yu [10] uses Detection Rate, Error Rate, Training
Time, and Detection Time.
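
Most of these criteria are computed from the four confusion-matrix counts: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The sketch below uses the standard definitions, with malware treated as the positive class. The counts passed in are hypothetical and do not come from any surveyed paper.

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Compute common detection metrics from confusion-matrix counts."""
    tpr = tp / (tp + fn)        # true positive rate, also called recall
    fpr = fp / (fp + tn)        # false positive rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * tpr / (precision + tpr)
    return {
        "tpr": tpr,
        "fpr": fpr,
        "precision": precision,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Hypothetical counts: 100 malware samples and 100 benign samples.
metrics = evaluation_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting FPR alongside TPR matters for malware detection because a detector that flags many benign applications is impractical even when its detection rate is high.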

4.9 Comparison with Other Machine Learning Classifiers

This parameter lists the other Machine Learning Classifiers against which each
approach is compared using the performance evaluation criteria. Monica [4] uses
Coarse KNN, Weighted KNN, Complex Tree, Linear SVM, and Linear Discriminant SVM.
Ham [8] uses 10-fold Cross-Validation. Chang [9] uses Random Forest, J48, LMT,
LogitBoost, Bagging, KNN, KStar, PART, and BayesNet. Koli [1] uses Decision Tree,
Naïve Bayes, and Random Forest. Mathew [5] uses Neural Networks, Classification
Trees, Fuzzy Clustering, and Random Forest of decision trees. Chieh [11] uses
BayesNet, Naïve Bayes, J48, Random Forest, Multilayer Perceptron, and Logistic.
Zarni [7] uses Random Forest, J48, and CART.
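
K-fold cross-validation, mentioned above for Ham [8], splits the dataset into K parts and uses each part once for testing while the remaining parts are used for training. The sketch below generates the index splits without shuffling; the function name and the sample size are hypothetical, and real experiments usually shuffle or stratify the data first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for K-fold cross-validation."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


# Hypothetical example: 10 samples split into 5 folds.
folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Averaging a metric over the K test folds gives an estimate that depends less on one particular train/test split.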

4.10 Proposed Approach

This parameter defines the approach proposed by each group of researchers. Monica
[4] improves static malware detection by comparing different Machine Learning
Classifiers on a Manifest file dataset. Ham [8] proposes a feature selection
method and experiments with reducing the false detection rate of malware. Chang
[9] proposes a Robotium program in an Android sandbox that triggers the Android
application automatically and monitors its behavior. Koli [1] proposes a system
named RanDroid, which detects malicious applications in the Android system by
using machine learning techniques. Mathew [5] develops an Android Malware
detection technique based on examining permissions. Justin [12] proposes an
original machine learning-based malware detection system for the Android OS.
Chieh [11] proposes a dynamic malware analysis framework named DroidDolphin,
which uses Big Data analysis, GUI-based testing, and machine learning to detect
malicious Android applications. Zarni [7] proposes a framework that uses machine
learning techniques to classify Android applications for malware detection. Yu
[10] proposes a malware detection system that uses a behavior-based detection
approach.
The comparative study of detecting Android Malware using Machine Learning
Classifiers shows that every approach has some limitations. In Monica [4], the
dataset is very small and the detection rate is not high. The classifiers depend
only on the Manifest file, and the approach uses only static analysis and lacks
dynamic analysis. In Ham [8], the detection rate varies widely across the
different Machine Learning Classifiers. In Chang [9], very few features are
selected for analysis. In Koli [1], the dataset is small with few features. The
quality of the detection model depends critically on the availability of
malicious and benign applications, and the model performs well only on a small,
random set of applications. It uses only static analysis and lacks dynamic
analysis. In Mathew [5], the dataset is very small with few features, and the
detection rate is not high. It uses only static analysis and lacks dynamic
analysis. In Justin [12], the dataset is very small, and the approach uses only
static analysis and lacks dynamic analysis. In Chieh [11], the detection rate is
not high. Running an APK file and analyzing it takes up to 5 min, so the approach
is time-consuming and less efficient. It also cannot detect malware that uses
anti-emulation techniques. In Zarni [7], the detection rate is not high and the
dataset is very small with few features. It uses only static analysis and lacks
dynamic analysis. In Yu [10], the detection rate is not high and the dataset is
very small.

5 Conclusion

Based on the above study, it can be concluded that the accuracy of malware
detection is higher with the Random Forest classifier than with the SVM and Naive
Bayesian classifiers. Random Forest, SVM, and Naive Bayesian are the Machine
Learning Classifiers most widely used for performance evaluation. A generalized
malware detection model using Machine Learning Classifiers is still lacking. Such
a model, using a combination of supervised and unsupervised Machine Learning
Classifiers, should therefore be proposed to increase the efficiency and accuracy
of detection with a large dataset and more features. Random Forest, SVM, and
Naive Bayes classifiers should also be used for performance evaluation of the
model.

References

1. Koli, J. D. (2018). RanDroid: Android malware detection using random machine learning
classifiers. In: International Conference on Technologies for Smart City Energy Security and
Power (ICSESP). IEEE, Mar 2018.
2. Agrawal, P., & Trivedi, B. (2019). A survey on android malware and their detection techniques.
In: Third International Conference on Electrical, Computer and Communication Technologies
(ICECCT). IEEE, Feb 2019.
3. Agrawal, P., & Trivedi, B. (2019). Analysis of android malware scanning tools.
International Journal of Computer Sciences and Engineering, 7(3), 807–810.
4. Kumaran, M., & Li, W. (2016). Lightweight malware detection based on machine learning
algorithms and the android manifest file. In: MIT Undergraduate Research Technology
Conference (URTC). IEEE, Nov 2016.
5. Leeds, M., & Atkison, T. (2016). Preliminary results of applying machine learning algorithms
to android malware detection. In: International Conference on Computational Intelligence
(ICCI). IEEE, Dec 2016.
6. Dutt, S., Chandramouli, S., & Das, A. K. (2019). Machine Learning (1st ed.). India: Pearson.
7. Aung, Z., & Zaw, W. (2013). Permission-based android malware detection. International
Journal of Scientific and Technology Research, 2(3).
8. Ham, H. S., & Choi, M. J. (2013). Analysis of android malware detection performance using
machine learning classifiers. In: International Conference on ICT Convergence (ICTC). IEEE,
Oct 2013.
9. Chang, W. L., & Wu, W. (2016). An android behaviour-based malware detection method using
machine learning. In: International Conference on Signal Processing, Communications, and
Computing (ICSPCC). IEEE, Aug 2016.
10. Yu, W., & Zhang, H. (2013). On behaviour-based detection of malware on android platform.
In: Communication and Information System Security Symposium (Globecom). IEEE, Dec 2013.
11. Wu, W. C., & Hung, S. H. (2014). DroidDolphin: A dynamic android malware detection using
big data and machine learning. In: Research in Adaptive and Convergent Systems (RACS).
ACM, Oct 2014.
12. Sahs, J., & Khan, L. (2012). A machine learning approach to android malware detection. In:
European Intelligence and Security Informatics Conference (EISIC). IEEE, Aug 2012.
13. Naïve Bayesian Classifier. https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c.
14. Random Forest Classifier. https://medium.com/machine-learning-101/chapter-5-random-forest-classifier-56dc7425c3e1.
15. Logistic Regression Classifier. https://machinelearning-blog.com/2018/04/23/logistic-regression-101/.
