Donkal, Gita; Verma, Gyanendra K. (2018)
Securing Big Data Ecosystem with NSGA-II and Gradient Boosted Trees based
NIDS using Spark-3
Gita Donkal
Chandigarh University
Article info

Keywords: Hadoop MapReduce; Spark; IDS; Multimodal fusion; NSGA-II; Multiple classifiers

Abstract

Securing Big Data has become one of the major issues of the rapidly evolving computing world, where data analysis plays an integral role, as it helps data analysts to figure out the interests and detailed information of organizational and industrial assets. Acts like cyber espionage and data theft lead to the inappropriate use of data. In order to detect malicious content, we propose a model that ensures the security of heterogeneous data residing on commodity hardware. To verify the correctness of our model, we use the NSL KDD (Knowledge Data Discovery) Cup 99 dataset, which has been used by various researchers working on Intrusion Detection Systems (IDS). We incorporate decision-based majority-voting multimodal fusion, which combines the results of different classifiers and delivers better performance in terms of accuracy, detection rate and false alarm rate. Moreover, the Non-dominated Sorting Genetic Algorithm II (NSGA-II) is used to select the most promising features. Additionally, to reduce computational complexity, which is again a crucial aspect of processing Big Data, we incorporate the concepts of Hadoop MapReduce and Spark to ensure fast processing of Big Data in a parallel computational environment. Our proposed model achieves 92.03% accuracy, a 99.38% detection rate and a testing time of 0.32 seconds. We have also achieved advantages in terms of accuracy and testing time over existing techniques that use IDS as a security mechanism.
© 2018 Elsevier Ltd. All rights reserved.
∗ Corresponding author.
E-mail address: gyanendra@nitkkr.ac.in (G.K. Verma).
https://doi.org/10.1016/j.jisa.2018.10.001
2214-2126/© 2018 Elsevier Ltd. All rights reserved.
G. Donkal, G.K. Verma / Journal of Information Security and Applications 43 (2018) 1–11

The exponential growth in the size of Big Data, together with variations in data structure, requires huge computational power and processing time, which makes it difficult to gain insights into users' data within a few seconds. Moreover, the storage needed for streaming data, which is characterized by large variations in data rate, can exceed the available capacity. With the advent of technologies such as Hadoop MapReduce [6,7] and the Hadoop Distributed File System (HDFS), and of modern hardware such as i7 processors, Macintosh systems, supercomputers and mainframes, it is now possible to process Big Data in seconds. Many data models are not only complex in structure but also associated with heterogeneous sources, which makes the processing and analysis of data even more difficult. We therefore need BDSs such as Spark [8,9] and Hadoop MapReduce that support data scalability, provide resilience to failure, process voluminous data quickly and offer flexibility while remaining cost-effective [2]. A typical structure of Apache Spark showing its integral components is given in Fig. 1 [10].

Hadoop also includes Yet Another Resource Negotiator (YARN), which is entirely responsible for the parallel processing of data stored in HDFS. A comparison between Hadoop MapReduce and Apache Spark is given in Table 1.

Because BDSs need more processing nodes to handle the load, a buffer or a queue is used to process data in chunks. With large datasets, the evolution of a BDS becomes more complicated and less robust, which calls for a fault-tolerant system with infrastructure management along with fast data processing. To serve this purpose, we use Spark in our model [8,11].

Table 1
Comparison between Hadoop MapReduce and Apache Spark [2,8,11].
Criterion | Hadoop MapReduce | Apache Spark
Data processing | If the dataset is larger than the available RAM, Hadoop MapReduce may outperform Spark | Up to 100 times faster processing of data in random access memory (RAM); up to 10 times faster processing of data in storage
Near real-time processing | Not good for obtaining immediate insights | If a business needs immediate insights, Spark is the best option
Graph processing | Hadoop MapReduce supports HBase for graph computation | Apache Spark has GraphX, an application programming interface (API) for graph-parallel computation
Machine learning | Hadoop MapReduce needs a third-party library for the same | Spark has MLlib, a built-in machine learning library
Joining datasets | Hadoop MapReduce is better for joining a large number of datasets that require a lot of shuffling and sorting | Spark can create all combinations faster

1.2. Security of Big Data

Security attacks are launched almost every day and take various forms, such as viruses, worms, rootkits, spyware, ransomware, cross-site scripting, Structured Query Language (SQL) injection, buffer overflow, masquerading, Denial of Service (DoS) and many more. The non-deterministic nature of these attacks makes them complicated to detect and to react to. A number of systems and techniques have been designed to resolve these security issues, including firewalls, IDS, antivirus software, packet sniffers, encryption techniques and anti-rootkits [12]. In our model, we choose IDS over other cyber-security tools for several reasons, the foremost being that it covers almost all the attributes of an Internet Protocol (IP) packet and of network traffic that are necessary to distinguish between malicious and genuine packets, including matching against pre-stored attack signatures, which reduces the false alarm rate to an acceptable level [1,13].

IDS has always been a major concern and a crucial research topic when it comes to cyber attacks on computers, workstations and networks. It is of two types: signature-based IDS and anomaly-based IDS. Researchers are working on each of them, as well as on hybrid IDS, to enhance performance with respect to accuracy, false alarm rate and detection rate, which helps in constructing a more robust and reliable system. In our proposed model, we train our classifiers using machine learning to verify strings, regular expressions and digital signatures in the case of signature-based IDS, and to authenticate the traffic of incoming packets in the case of anomaly-based IDS [14].

To make the reinforced IDS a success, we use the classifiers built into Apache Spark: Support Vector Machine (SVM), Gradient Boosted Trees (GBT), Decision Tree (DT), Logistic Regression (LR) and Random Forest (RF). We also use the NSL KDD Cup 99 dataset [15], which is derived from the KDD Cup 99 dataset [16]. We pre-process it for our convenience, for example by changing string values to numerical values so that feature selection can be performed conveniently. In the feature selection phase we use NSGA-II [17,18], which is a Pareto-optimal solution-based approach.

Multimodal fusion [9,20] is a process of integrating two or more input modalities in order to combine them into a complete command. The integration mechanism can be based on statistical techniques, artificial neural networks, hidden Markov models, finite-state transducers or time-stamped lattices, depending on the approach. There are basically three approaches to multimodal fusion: decision-based, recognition-based and hybrid multi-level fusion. We use decision-based multimodal fusion, which categorizes the information gained on the basis of the majority decision.

The noteworthy contributions of our proposed scheme are as follows:

• A multimodal fusion based framework for reinforcing IDS in order to achieve a high detection rate and a low false alarm rate.
• Use of NSGA-II for the extraction of prominent features out of a feature set of 41 attributes in order to improve the feature selection procedure. Feature selection is taken into consideration in order to improve the overall efficiency of the proposed system.

The rest of the paper is organized as follows. Section 2 highlights some related work. Section 3 discusses our proposed model in detail, along with the system entities and working mechanism. Section 4 covers the experimental results. Finally, Section 5 concludes the paper with future work.

ter detection of novel attacks on computer systems, as there are no redundant records and no missing values in the NSL KDD dataset.

From the experimental results of Thomas et al. [24], it can be concluded that the fusion of multiple sensors can yield much higher accuracy and performance than considering their outputs individually. Kaliappan et al. [19] proposed a scheme that fuses multiple IDSs, both signature-based and anomaly-based. Their experiments show that performance and accuracy can be improved, and the false positive rate reduced, by using multimodal fusion. C. Ramachandran [20] elaborated the taxonomies of IDS along with data mining, feature selection and data preprocessing. He experimented with the KDD Cup 99 dataset, on which he performed pre-processing and feature selection that included learning and testing phases. His experimental results showed that a fusion IDS structure provides higher accuracy and a better false positive rate in comparison.
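The decision-based majority voting mentioned in the introduction can be illustrated with a minimal sketch. The function name `majority_vote`, the binary labels (1 = attack, 0 = normal) and the per-classifier predictions below are hypothetical and chosen only for illustration; this is not the authors' Spark implementation.

```python
from collections import Counter

def majority_vote(predictions):
    """Fuse per-classifier label lists into one label per sample by majority vote."""
    fused = []
    # zip(*...) groups the votes that all classifiers cast for the same sample.
    for votes in zip(*predictions.values()):
        # With an odd number of binary voters a tie cannot occur.
        fused.append(Counter(votes).most_common(1)[0][0])
    return fused

# Hypothetical predictions for four packets (1 = attack, 0 = normal).
predictions = {
    "SVM": [1, 0, 1, 0],
    "GBT": [1, 0, 0, 0],
    "DT":  [1, 1, 1, 0],
    "LR":  [0, 0, 1, 1],
    "RF":  [1, 0, 0, 0],
}
fused = majority_vote(predictions)
print(fused)
```

Using an odd number of voters, as with the five classifiers named above, avoids ties for two-class decisions; a multi-class setting would need an explicit tie-breaking rule.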
3. The proposed system
• Data Preprocessing [15,29]: We use the NSL KDD Cup 99 dataset, which is already preprocessed to some extent. We preprocess it further using Parse-labeled-point. In MLlib, labeled points are used in supervised learning algorithms. A double is used to store the label, so that labeled points can be used in both regression and classification. The spark.ml package provides a machine learning API built on DataFrames, which are also becoming a core part of the Spark SQL library. The spark.ml package can be used for developing and managing machine learning pipelines and for feature extraction, with transformers and selectors. Moreover, machine learning techniques such as classification, regression and clustering are also provided by this library.
• NSGA-II [17,18]: We use Information Gain (IG) [19] and NSGA-II in the feature selection phase. As in the framework proposed by Kaliappan et al. [19], where IG is derived using some formulas and a Genetic Algorithm, we apply the same methodology. Feature selection is done to filter out less promising and irrelevant features for IDS from our dataset.

We incorporate the upgraded version of NSGA for feature selection, NSGA-II [18], which uses fast elitist NSGA that is
Fig. 5. Information gain (IG) values for (i) Train set, and (ii) Test set.
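As an illustration of how an information gain value of the kind plotted in Fig. 5 can be obtained, the sketch below uses the standard Shannon-entropy definition for a discrete feature. The exact formulas used in the paper are not shown in this excerpt, so this definition is an assumption, and the toy `protocol`, `flag` and `labels` values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Hypothetical toy records: two discrete features and a class label.
protocol = ["tcp", "tcp", "udp", "udp", "icmp", "icmp"]
flag     = ["SF",  "S0",  "SF",  "S0",  "SF",   "S0"]
labels   = ["normal", "normal", "attack", "attack", "attack", "normal"]

ig_protocol = information_gain(protocol, labels)
ig_flag = information_gain(flag, labels)
print(round(ig_protocol, 4), round(ig_flag, 4))
```

In this toy data the protocol feature separates the classes far better than the flag feature, so it receives the larger gain and would be ranked as the more promising feature.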
Table 5
Description of 41 features in NSL KDD dataset.
Table 6
Selected features based on NSGA-II providing best accuracy.
Classifier | Total features | Selected features | Selected feature numbers | Accuracy
SVM | 41 | 30 | 1, 2, 3, 4, 8, 9, 10, 11, 12, 14, 15, 16, 18, 20, 22, 23, 24, 25, 26, 27, 29, 30, 31, 32, 33, 35, 36, 37, 39, 40 | 90.58%
GBT | 41 | 30 | 1, 3, 4, 6, 7, 8, 10, 12, 14, 16, 18, 19, 20, 21, 22, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 39, 40, 41 | 90.36%
DT | 41 | 29 | 1, 4, 5, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 29, 33, 34, 39, 40, 41 | 91.04%
LR | 41 | 32 | 1, 3, 4, 6, 7, 8, 10, 11, 13, 14, 15, 16, 17, 18, 20, 21, 22, 24, 25, 26, 28, 29, 30, 31, 34, 35, 36, 37, 38, 39, 40, 41 | 89.39%
RF | 41 | 28 | 1, 2, 3, 5, 6, 7, 8, 10, 12, 13, 15, 16, 18, 19, 20, 21, 22, 24, 25, 26, 30, 34, 35, 36, 37, 38, 39 | 90.13%
Fusion | 41 | 30 | 1, 2, 3, 4, 5, 6, 8, 9, 12, 13, 14, 15, 17, 18, 19, 20, 21, 24, 25, 27, 29, 30, 31, 32, 33, 34, 35, 37, 38, 40 | 92.03%
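A feature subset such as a row of Table 6 is commonly represented in genetic feature selection as a binary mask over the 41 attributes. The sketch below encodes the Fusion row that way and projects one record onto the selected columns. The mask encoding, the helper names and the placeholder record are assumptions made for illustration; the paper's own chromosome encoding is not shown in this excerpt.

```python
TOTAL_FEATURES = 41

# Feature numbers (1-based) taken from the Fusion row of Table 6.
FUSION_SELECTED = [1, 2, 3, 4, 5, 6, 8, 9, 12, 13, 14, 15, 17, 18, 19, 20, 21,
                   24, 25, 27, 29, 30, 31, 32, 33, 34, 35, 37, 38, 40]

def to_mask(selected, total=TOTAL_FEATURES):
    """Encode 1-based feature numbers as a 0/1 mask of length `total`."""
    chosen = set(selected)
    return [1 if i in chosen else 0 for i in range(1, total + 1)]

def project(record, mask):
    """Keep only the values of `record` whose mask bit is 1."""
    return [value for value, bit in zip(record, mask) if bit]

mask = to_mask(FUSION_SELECTED)

# Hypothetical record: placeholder values 100..140 standing in for 41 attributes.
record = list(range(100, 141))
reduced = project(record, mask)
print(sum(mask), len(reduced))
```

With this representation, a candidate solution is simply a list of 41 bits, which is the form that crossover and mutation operators typically act on.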
able in NSL KDD cup 99 dataset [39]. NSGA-II selects the most prominent features from these forty-one features. Some of the parameters of NSGA-II are dynamic and some are static; for instance, the ratio of crossover rate to mutation rate is kept at 8:2, as we do not want the complexity of diversity to escalate with the soaring scale of Big Data. If scale soars in a Big Data environment, it is very likely that complexity will increase too, due to which BDSs become less robust and more complicated to evolve.

Table 6 shows the features selected by the NSGA-II algorithm, along with the accuracy obtained, in percent. This experiment is carried out for all the attack patterns available in the NSL KDD Cup 99 dataset, which are listed in Table 4. We carried out this experiment iteratively with various combinations of features, and we obtained the best accuracy, detection rate, false alarm rate and processing speed with some of the most prominent features filtered out by NSGA-II. Table 6 shows how different combinations of features of the considered dataset yield different accuracy values, which allows us to choose a limited number of the 41 features, selecting only the most promising ones and thus reducing the burden of processing all features. The number of features selected by multimodal fusion is 30 out of 41, and these features can be looked up in Table 5.

Table 7
Calculated accuracy, precision and F-measure.
Classifiers | Feature selection | Accuracy (%) | Precision | F-measure
SVM | With FS | 90.58 | 0.916 | 0.903
SVM | Without FS | 83.38 | 0.859 | 0.827
GBT | With FS | 90.36 | 0.914 | 0.901
GBT | Without FS | 88.79 | 0.902 | 0.885
DT | With FS | 91.04 | 0.919 | 0.908
DT | Without FS | 88.68 | 0.900 | 0.884
LR | With FS | 89.39 | 0.907 | 0.891
LR | Without FS | 90.69 | 0.909 | 0.906
RF | With FS | 90.13 | 0.913 | 0.899
RF | Without FS | 90.37 | 0.915 | 0.901
Fusion | With FS | 92.03 | 0.928 | 0.919
Fusion | Without FS | 89.95 | 0.912 | 0.897

Table 8
Calculated detection rate (%) for different classifiers.
Classifiers | With feature selection (%) | Without feature selection (%)
SVM | 99.26 | 98.02
GBT | 99.14 | 99.04
DT | 99.03 | 98.85
RF | 99.42 | 99.33
LR | 99.29 | 95.78
Fusion | 99.38 | 99.34
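The Pareto-optimal idea behind NSGA-II can be sketched with a simple non-dominance check. The example reuses the feature counts and accuracies from Table 6 purely as sample points and assumes two objectives, fewer features and higher accuracy. These objectives are an assumption, since the paper's actual objective functions are not given in this excerpt, and the code is a plain dominance filter rather than the full NSGA-II algorithm.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` on every objective and strictly
    better on at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Names of the points that no other point dominates."""
    return {
        name for name, objectives in points.items()
        if not any(dominates(other, objectives)
                   for other_name, other in points.items()
                   if other_name != name)
    }

# (number of selected features, negated accuracy) from Table 6;
# accuracy is negated so that both objectives are minimized.
candidates = {
    "SVM":    (30, -90.58),
    "GBT":    (30, -90.36),
    "DT":     (29, -91.04),
    "LR":     (32, -89.39),
    "RF":     (28, -90.13),
    "Fusion": (30, -92.03),
}
front = pareto_front(candidates)
print(sorted(front))
```

Under these assumed objectives, a point stays on the front only if no other point offers at least the same accuracy with at most the same number of features.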
[Figure: Precision and F-measure for SVM, GBT, DT, LR, RF and Fusion, with and without feature selection (FS).]
[Figure: Accuracy (%) for SVM, GBT, DT, RF, LR and Fusion, with and without feature selection.]
Table 9
Comparison of results of different approaches used in building an Enhanced IDS.
Approaches | Model description | Big Data considered | Tool/language used | Classifiers used | Detection rate | FAR | Testing time (s)
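The detection rate and false alarm rate (FAR) compared in Table 9, along with the accuracy, precision and F-measure reported earlier, are conventionally derived from a binary confusion matrix. The sketch below uses those conventional definitions, which are assumed here because the paper's own formulas do not appear in this excerpt. The function name `ids_metrics` and the counts are hypothetical.

```python
def ids_metrics(tp, tn, fp, fn):
    """Conventional IDS metrics from confusion-matrix counts, where a
    'positive' is an attack and a 'negative' is normal traffic."""
    detection_rate = tp / (tp + fn)  # attacks correctly flagged (recall)
    precision = tp / (tp + fp)       # flagged records that are real attacks
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "detection_rate": detection_rate,
        "false_alarm_rate": fp / (fp + tn),  # normal traffic wrongly flagged
        "precision": precision,
        "f_measure": 2 * precision * detection_rate / (precision + detection_rate),
    }

# Hypothetical counts for illustration only.
metrics = ids_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A good IDS pushes the detection rate up while keeping the false alarm rate down, which is why both are usually reported together.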
Since data is the most important asset these days, BDSs developed for securing the entire Big Data environment must be designed so that the confidentiality, integrity and availability of data can be maintained. Large networks and data centers that support the communication of massive volumes of data need to be equipped with full-fledged security, so that users' data is not tampered with by any anonymous user or impersonated device or person. In Table 9, our scheme is compared with the existing schemes (Kaliappan et al. [19] and Gupta et al. [11]), which shows that the proposed scheme is better in terms of detection rate and false alarm rate. In addition, a slight improvement in overall testing time is also achieved by our proposed model.

In [19], by adopting multimodal fusion and GA, the model achieved a detection rate of 99.0%, whereas our proposed model achieved a detection rate of 99.38%. Additionally, in comparison to the model proposed in [18], the detection rate achieved by our proposed model is much higher, since those authors did not use a fusion-based approach. Our proposed scheme clearly indicates that, using this methodology, the overall efficiency of the model is improved significantly.

5. Conclusion and future scope

The pace at which data is being generated and transferred across the globe has moved the world towards a more cognitive and knowledgeable era. But with the bulk generation of data, hackers seek new opportunities to breach security systems built with cutting-edge techniques and algorithms. Therefore, keeping in mind the value and importance of data in our day-to-day lives, its security should be the primary concern of data scientists.

Our proposed work was inspired by the disadvantages we found in some of the existing techniques, including the prediction time and detection rate of the approach of Kaliappan et al. [19] and the accuracy of a few other approaches. We have merged the benefits of five classifiers available in the Apache Spark BDS, which led to the development of our proposed model. Consolidating the outputs of five different classifiers using multimodal fusion on the Spark BDS helped us achieve:

• Better performance of the entire system in terms of false alarm rate, detection rate, accuracy and faster processing of data.
• An accuracy of 92.03%, whereas the highest accuracy achieved by an individual classifier is 91.04%, by DT.
• High computational speed, with the entire execution performed in 0.32 s.

To resolve the bottlenecks of security and computation in a Big Data environment, measures that optimize the entire voluminous ecosystem must be taken into consideration. Apache Spark, which we used, is still under development and thus does not support many classifiers. More classifiers can therefore be introduced in the Spark BDS to obtain more accurate results and to offer more diverse options for better classifier performance. For future work, this entire implementation can be executed on Flink, which is faster than Hadoop MapReduce and Spark.

Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.jisa.2018.10.001.

References

[1] Zuech R, Khoshgoftaar TM, Wald R. Intrusion detection and big heterogeneous data: a survey. J. Big Data 2015;2(1):3.
[2] Dietrich D, Heller B, Yang B. Data science & big data analytics: discovering, analyzing, visualizing and presenting data; 2015.
[3] Chen M, Mao S, Liu Y. Big data: A survey. Mobile Netw. Appl. 2014;19(2):171–209.
[4] Ordiano JÁG, Bartschat A, Ludwig N, Braun E, Waczowicz S, Renkamp N, et al. Concept and benchmark results for Big Data energy forecasting based on Apache Spark. J. Big Data 2018;5(1):11.
[5] Moreno J, Serrano MA, Fernández-Medina E. Main issues in big data security. Future Internet 2016;8(3):44.
[6] Wang L, Jones R. Big data analytics for network intrusion detection: a survey. Int J Netw Commun 2017;7(1):24–31.
[7] Download link (Hadoop): https://archive.apache.org/dist/hadoop/core/. Accessed on December 2017.
[8] Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, et al. Apache spark: a unified engine for big data processing. Commun ACM 2016;59(11):56–65.
[9] Download link (Spark): https://www.apache.org/dyn/closer.lua/spark/. Accessed on December 2017.
[10] https://spark.apache.org/. Accessed on December 2017.
[11] Gupta GP, Kulariya M. A framework for fast and efficient cyber security network intrusion detection using apache spark. Procedia Comput Sci 2016;93:824–31.
[12] Suthaharan S. Big data classification: Problems and challenges in network intrusion prediction with machine learning. ACM SIGMETRICS Perform Eval Rev 2014;41(4):70–3.
[13] Jonnalagadda SK, Reddy RP. A literature survey and comprehensive study of intrusion detection. Int J Comput Appl 2013;81(16):40–7.
[14] Terzi DS, Terzi R, Sagiroglu S. Big data analytics for network anomaly detection from netflow data. In: Computer science and engineering (UBMK), 2017 international conference on. IEEE; 2017. p. 592–7. October.
[15] Dhanabal L, Shantharajah SP. A study on NSL-KDD dataset for intrusion detection system based on classification algorithms. Int J Adv Res Comput Commun Eng 2015;4(6):446–52.
[16] Aggarwal P, Sharma SK. Analysis of KDD dataset attributes-class wise for intrusion detection. Procedia Comput Sci 2015;57:842–51.
[17] Deb K, Pratap A, Agarwal S, Meyarivan TAMT. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 2002;6(2):182–97.
[18] Tamimi A, Naidu DS, Kavianpour S. An Intrusion detection system based on NSGA-II Algorithm. In: Cyber security, cyber warfare, and digital forensic (CyberSec), 2015 fourth international conference on. IEEE; 2015. p. 58–61. October.
[19] Kaliappan J, Thiagarajan R, Sundararajan K. Fusion of heterogeneous intrusion detection systems for network attack detection. Sci World J 2015;2015:1–9.
[20] Ramachandran C. An advanced data processing based fusion IDS structures. Int J Appl Eng Res 2017;12(21):10929–37.
[21] Giacinto G, Roli F, Didaci L. Fusion of multiple classifiers for intrusion detection in computer networks. Pattern Recog Lett 2003;24(12):1795–803.
[22] Sun K, Miao W, Zhang X, Rao R. An improvement to feature selection of random forests on spark. In: Computational science and engineering (CSE), 2014 IEEE 17th international conference on. IEEE; 2014. p. 774–9.
[23] Sung AH, Mukkamala S. Identifying important features for intrusion detection using support vector machines and neural networks. In: Applications and the internet, 2003. Proceedings. 2003 symposium on. IEEE; 2003. p. 209–16.
[24] Thomas C, Balakrishnan N. Improvement in intrusion detection with advances in sensor fusion. IEEE Trans Inf Forensics Secur 2009;4(3):542–51.
[25] Mulay SA, Devale PR, Garje GV. Intrusion detection system using support vector machine and decision tree. Int J Comput Appl 2010;3(3):40–3.
[26] Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, et al. Apache hadoop yarn: Yet another resource negotiator. In: Proceedings of the 4th annual symposium on cloud computing. ACM; 2013. p. 5.
[27] Jeya PG, Ravichandran M, Ravichandran CS. Efficient classifier for R2L and U2R attacks. Int J Comput Appl 2012;45(21):29.
[28] Ezin EC, Djihountry HA. Java-based intrusion detection system in a wired network. Int J Comput Sci Inf Secur 2011;9(11):33.
[29] Hamed T, Ernst JB, Kremer SC. A survey and taxonomy on data and pre-processing techniques of intrusion detection systems. In: Computer and Network Security Essentials. Springer; 2018. p. 113–34.
[30] Singh SP, Jaiswal UC. Machine learning for big data: a new perspective. Int J Appl Eng Res 2018;13(5):2753–62.
[31] Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Xin D. Mllib: machine learning in apache spark. J Mach Learn Res 2016;17(1):1235–41.
[32] Kevric J, Jukic S, Subasi A. An effective combining classifier approach using tree algorithms for network intrusion detection. Neural Comput Appl 2017;28(1):1051–8.
[33] Rai K, Devi MS, Guleria A. Decision tree based algorithm for intrusion detection. Int J Adv Netw Appl 2016;7(4):2828.
[34] Liu H, Gegov A, Cocea M. Rule based systems for big data: a machine learning approach, 13. Springer; 2015.
[35] del Río S, López V, Benítez JM, Herrera F. On the use of MapReduce for imbalanced big data using Random Forest. Inf Sci 2014;285:112–37.
[36] Landset S, Khoshgoftaar TM, Richter AN, Hasanin T. A survey of open source tools for machine learning with big data in the Hadoop ecosystem. J Big Data 2015;2(1):24.
[37] Hashem IAT, Yaqoob I, Anuar NB, Mokhtar S, Gani A, Khan SU. The rise of "big data" on cloud computing: Review and open research issues. Inf Syst 2015;47:98–115.
[38] Shanahan JG, Dai L. Large scale distributed data science using apache spark. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2015. p. 2323–4.
[39] Dhanabal L, Shantharajah SP. A study on NSL-KDD dataset for intrusion detection system based on classification algorithms. Int J Adv Res Comput Commun Eng 2015;4(6):446–52.