Securing Big Data Ecosystem with NSGA-II and Gradient Boosted Trees based
NIDS using Spark-3

Conference Paper · October 2018



Journal of Information Security and Applications 43 (2018) 1–11

Contents lists available at ScienceDirect

Journal of Information Security and Applications


journal homepage: www.elsevier.com/locate/jisa

A multimodal fusion based framework to reinforce IDS for securing Big Data environment using Spark
Gita Donkal, Gyanendra K. Verma∗
Computer Engineering Department, National Institute of Technology Kurukshetra, India

Keywords: Hadoop MapReduce; Spark; IDS; Multimodal fusion; NSGA-II; Multiple classifiers

Abstract

Securing Big Data has become one of the major issues of the exponentially pacing computing world, where data analysis plays an integral role, as it helps data analysts to figure out the interests and detailed information of organizational and industrial assets. Acts like cyber espionage and data theft lead to the inappropriate use of data. In order to detect malicious content, we propose a model that ensures the security of heterogeneous data residing on commodity hardware. To verify the correctness of our model, the NSL KDD Cup 99 dataset is used, which has been used by various researchers working on Intrusion Detection Systems (IDS). We incorporate decision-based majority-voting multimodal fusion that combines the results of different classifiers and facilitates better performance in terms of accuracy, detection rate and false alarm rate. Moreover, the Non-dominated Sorting Genetic Algorithm (NSGA-II) plays an integral role in the selection of the most promising features. Additionally, to reduce computational complexity, which is again a crucial aspect while processing Big Data, we incorporate the concepts of Hadoop MapReduce and Spark to ensure fast processing of Big Data in a parallel computational environment. Our proposed model achieves 92.03% accuracy, a 99.38% detection rate and a testing time of 0.32 seconds. Additionally, we achieve advantages in terms of accuracy and testing time over existing techniques that use IDS as a security mechanism.

© 2018 Elsevier Ltd. All rights reserved.

1. Introduction

The term Big Data is used for the immense collection of differently-structured data obtained from a variety of heterogeneous sources and piled up on storage devices, for which data is measured in peta-bytes and zeta-bytes. The five distinct dimensions associated with Big Data, termed the 5 Vs, are: Volume, Variety, Velocity, Value, and Veracity [1]. Each of these dimensions has a key role to play in Big Data Management Systems (BDMSs), a recent term for managing Big Data Systems (BDSs) [2]. Nowadays, data can be analyzed for unraveling covert patterns, unascertained correlations, market trends, and a lot of other beneficial information that can help industries and organizations establish more cognizant commercial decisions [3]. BDSs confront many difficulties in processing and analyzing massive data to retrieve vital information that could be useful in the medical field, the sports industry, business and so forth. Big Data deals with voluminous data stored on servers and in data centers, having a highly complex structure with a lot of variations, which confronts difficulties of storing, analyzing and visualizing the data for further results [4].

Enforcing security in Big Data is indeed a challenging task, as over the past few years the size of data has shot up astronomically. Also, a massive amount of data is generated from heterogeneous sources and comes in different forms, such as structured, semi-structured and unstructured, which results in crashing of commodity hardware. Cryptography, tokenization and de-duplication of data are some of the ways that assist in securing confidential data from getting hacked. However, just as new cutting-edge tools and techniques are being developed to ensure the safety of our data, adversaries are coming up with new techniques to exploit the susceptibilities present in a BDS, including its infrastructure. Security in a Big Data scenario includes the security of the data generated as well as infrastructure security, data availability, user authentication and communication security [5]. Most of these issues can be resolved using an Intrusion Detection System (IDS). In this paper, we consider two important aspects of Big Data.

1.1. Computational challenges in Big Data environment


∗ Corresponding author. E-mail address: gyanendra@nitkkr.ac.in (G.K. Verma).

https://doi.org/10.1016/j.jisa.2018.10.001
2214-2126/© 2018 Elsevier Ltd. All rights reserved.

The exponential growth in the size of Big Data, with variations in data structure, requires huge computational power and process-
Fig. 1. Apache spark components.

Table 1
Comparison between Hadoop MapReduce and Apache Spark [2,8,11].

Data processing — Hadoop MapReduce: if the dataset is larger than available RAM, Hadoop MapReduce may outperform Spark. Apache Spark: up to 100 times faster processing of data in RAM (random access memory), and up to 10 times faster processing of data in storage.
Near real-time processing — Hadoop MapReduce: not good for obtaining immediate insights. Apache Spark: if a business needs immediate insights, Spark is the best option.
Graph processing — Hadoop MapReduce: supports HBase for graph computation. Apache Spark: has GraphX, an API (application programming interface) for graph-parallel computation.
Machine learning — Hadoop MapReduce: needs a third-party library. Apache Spark: has MLlib, a built-in machine learning library.
Joining datasets — Hadoop MapReduce: better for joining a large number of datasets that require a lot of shuffling and sorting. Apache Spark: can create all combinations faster.
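The batch model that Table 1 attributes to Hadoop MapReduce follows a split/map/shuffle/reduce flow. A minimal single-process sketch of that flow, with plain Python standing in for the framework and hypothetical traffic-label chunks standing in for HDFS splits (the counting job itself is illustrative, not from the paper):

```python
from collections import defaultdict
from itertools import chain

# Hypothetical input chunks, standing in for HDFS splits.
chunks = [
    ["normal", "dos", "normal"],
    ["probe", "dos", "dos"],
]

def map_task(chunk):
    # Map phase: emit (key, 1) pairs for each record in a chunk.
    return [(label, 1) for label in chunk]

def shuffle(pairs):
    # Shuffle phase: group intermediate pairs by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    # Reduce phase: aggregate all values for one key.
    return key, sum(values)

mapped = chain.from_iterable(map_task(c) for c in chunks)
counts = dict(reduce_task(k, v) for k, v in shuffle(mapped).items())
print(counts)  # occurrences per traffic label across all chunks
```

In the real framework the map tasks run in parallel over the chunks and the shuffle is distributed, but the dataflow is the same.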

ing time, which makes it quite difficult to gain insights from users' data in a few seconds. Moreover, storage for streaming data, which is characterized by huge variation in data rates, can go beyond its capacity. But with the advent of technologies like Hadoop MapReduce [6,7], the Hadoop Distributed File System (HDFS) and the latest processors like i7, Macintosh, supercomputers and mainframes, it is possible to process Big Data in seconds. There are many data models that are not only complex in structure but also associated with heterogeneous sources, which makes the processing and analysis of data even more difficult. Therefore, we need a BDS like Spark [8,9] or Hadoop MapReduce that supports data scalability, facilitates resilience to failure and fast processing of voluminous data, and provides flexibility, while being cost-effective at the same time [2]. A typical structure of Apache Spark exhibiting its integral components is shown in Fig. 1 [10].

Hadoop also includes YARN (Yet Another Resource Negotiator) in its system, which is entirely responsible for the parallel processing of data stored in HDFS. A comparison between Hadoop MapReduce and Apache Spark is exhibited in Table 1.

As BDSs need more processing nodes to handle the load, a buffer or a queue is used to process data in chunks. With large datasets, the evolution of a BDS becomes more complicated and less robust, which demands a fault-tolerant system with infrastructure management along with fast processing of data. To serve this purpose, we use Spark in our model [8,11].

1.2. Security of Big Data

Security attacks are launched almost every day and executed in various forms like viruses, worms, rootkits, spyware, ransomware, cross-site scripting, SQL (Structured Query Language) injection, buffer overflow, masquerading, Denial of Service (DoS) and much more. The non-deterministic nature of attacks has made them a lot more complicated to detect and react to. A number of systems and techniques are designed for resolving these security issues, including firewalls, IDS, antivirus, packet sniffers, encryption techniques and anti-rootkits [12]. In our model, we choose IDS over other cyber-security tools for many reasons, among which the top-notch reason is that it covers almost all the attributes of an IP (Internet Protocol) packet and network traffic that are necessary to distinguish between malicious and genuine packets, including matching of pre-stored signatures of attacks, which reduces the false alarm rate to an acceptable level [1,13].

IDS has always been of major concern and a crucial topic of research when it comes to cyber attacks on computers, workstations and networks. It is of two types: Signature-based IDS and Anomaly-based IDS. Researchers are working on each of them, as well as on Hybrid IDS, to enhance their performance with respect to accuracy, false alarm rate and detection rate, which helps in constructing a more robust and reliable system. In our proposed model, we train our classifiers using machine learning for verifying the strings, regular expressions and digital signatures in the case of Signature-based IDS, and for authenticating the traffic of incoming packets in the case of Anomaly-based IDS [14].

To make the reinforced IDS a success, we use the in-built classifiers in Apache Spark: Support Vector Machine (SVM), Gradient Boosted Trees (GBT), Decision Tree (DT), Logistic Regression (LR) and Random Forest (RF). We also use the NSL KDD cup 99 dataset [15], which is derived from the KDD cup 99 dataset [16]. We pre-process it for our convenience, for example by changing string values to numerical values in order to conveniently perform the feature selection procedure. In the feature selection phase we use NSGA-II [17,18], which is a Pareto optimal solution-based approach.

Multimodal fusion [9,20] is a process of integrating two or more input modalities in order to combine them into a complete command. The integration mechanism can be based on statistical techniques, artificial neural networks, Hidden Markov models, finite-state transducers or time-stamped lattices, depending upon the approach. There are basically three approaches to executing multimodal fusion: decision-based, recognition-based and hybrid multi-level fusion. We use decision-based multimodal fusion, which categorizes the information gained on the basis of the decision of the majority.

Following are the noteworthy contributions of our proposed scheme:

• A multimodal fusion based framework for reinforcing IDS in order to achieve a high detection rate and a low false alarm rate.
• Use of NSGA-II for the extraction of prominent features out of the feature set of 41 attributes in order to improve the feature selection procedure. Feature selection is taken into consideration in order to improve the overall efficiency of the proposed system.

The rest of the paper is organized as follows. Section 2 highlights some of the related work from the past. Section 3 discusses our proposed model in detail along with the system entities and working mechanism. Section 4 covers the experimental results of the same. Finally, Section 5 concludes the paper with future work.

2. Related work

Dietrich et al. [2] mentioned the entire process of handling and managing Big Data. They performed a detailed and useful analysis over such massive data, embedding the Hadoop MapReduce architecture to incorporate a fast processing mechanism in Big Data, using classifiers like Logistic Regression (LR), Random Forest (RF) and Decision Tree (DT) for the classification of the input dataset, and also enforcing clustering at the data-chunk level to minimize the load. Gaining in-depth insights into Big Data from Dietrich et al.'s [2] work led to the development of our proposed model. Giacinto et al. [21] proposed a model which reflects the advantage of using a fusion based pattern recognition approach to network detection when implemented over the collaborative results of multiple classifiers. The experimental results of this scheme motivated us to work in a similar manner. However, we assimilated Giacinto et al.'s [21] scheme into a Big Data environment, where we can secure the Big Data environment using multiple IDS.

Zaharia et al. [8] emphasized the significant and efficient use of Apache Spark for the processing of massive data along with the concept of parallel computation, which assists in the faster processing of Big Data, as proved by their implementation results. Apache Spark utilizes the concept of composability, which is the inter-relationship of components in programming libraries for Big Data, and also encourages the development of conveniently interoperable libraries. Sun et al. [22] incorporated their improved approach on the Spark system. Their approach is based on feature selection for the Random Forest classifier and utilizes a UCI dataset for proving their propounded approach. They studied two important issues of feature selection: elimination of noisy features which have no relevance to classification, and elimination of redundant features.

Kalyanmoy et al. [17] proved the benefits of embodying NSGA-II for a better feature selection approach; it is an improved version of the existing NSGA that takes into account three very important issues in NSGA, viz. the high computational complexity of non-dominated sorting, the need for specifying the sharing parameter, and the lack of elitism. Tamimi et al. [18] proposed a method to generate rules for IDS using NSGA-II that utilizes a multi-objective method. The DARPA dataset is used to evaluate the experimental results, which eventually lead to the acceleration of rule generation and elevate the security factor in rule-based IDS. Sung et al. [23] used the KDD cup 99 dataset and considered the most promising attributes out of the 41 attributes for carrying out the experiments, which led to degradation in the performance of IDS with respect to accuracy and detection rate. Thus, we used the NSL KDD cup 99 dataset, an upgraded version of the KDD cup 99 dataset, for a better detection of novel attacks on computer systems, as there are no redundant records and no missing values in the NSL KDD dataset.

From the experimental results conducted by Thomas et al. [24], it can be concluded that fusion of multiple sensors can produce much better accuracy and performance than considering their outputs individually. Kaliappan et al. [25] proposed a scheme that performs fusion of multiple IDS, including both signature-based and anomaly-based. The experiments show that performance and accuracy can be improved, with a reduction of the false positive rate, by utilizing multimodal fusion. C. Ramachandran [20] elaborated the taxonomies of IDS along with data mining, feature selection and data preprocessing. He experimented using the KDD cup 99 dataset, on which he performed pre-processing and feature selection that included learning and testing phases. He proved through his experimental results that a fusion IDS structure provides comparatively better accuracy and a better false positive rate.

3. The proposed system

Considering the bottleneck of handling computational challenges in BDSs, we propose a framework where tremendously huge data can be processed in a very satisfactory interval of time, along with the security of massive data using IDS. Fig. 2 represents our proposed model for securing a Big Data environment using a multimodal fusion based framework on the Apache Spark BDS. A brief description of the entities of the proposed model is given in the forthcoming sub-sections.

3.1. Proposed system entities

• Hadoop MapReduce [2,26]—Hadoop is popularly known as the "Big Data Handler", as it is a system designed especially for handling data generated in bulk. We utilize Hadoop 2.6.0 for the implementation of our model. Collectively, Hadoop refers to several different techniques such as HDFS, MapReduce, HBase, Hive, Pig etc. In Hadoop, YARN is responsible for managing all the associated resources and job scheduling tasks. In certain scenarios, the data is needed immediately, and this data is stored in HDFS. Hadoop can store data in peta-bytes and zeta-bytes with no limitation on storage, and when used in collaboration with MapReduce it provides faster computation and processing of data. MapReduce is a batch-oriented programming model in Hadoop that performs management of data and scheduling of jobs. MapReduce first splits the data into independent chunks that are processed entirely by map tasks in parallel and later sorted by the framework. Then the output is given as input to the reduce task. Hadoop and MapReduce collectively manage data.
• Apache Spark [8,7,27]—Spark is a BDS built on Hadoop and runs on Hadoop YARN. We use Apache Spark version 2.1.0. Spark can be standalone or can work in collaboration with the framework of MapReduce and HDFS; we use the latter. Spark's components include SQL, Streaming and MLlib (Machine Learning library). Classifiers in Apache Spark are still evolving. Hadoop clusters are capable of running interactive query, streaming data, and much more simultaneously on Apache Spark.
• Spark RDD [8,11,28]—Spark supports a programming model that is very similar to MapReduce, but extends it with "Resilient Distributed Datasets" (RDDs), a data-sharing abstraction. Using this extension, Spark is able to capture a broader range of processing workloads that previously needed separate engines, including streaming, machine learning, SQL and graph processing. It also provides fault tolerance without requiring replication. Spark exposes RDDs through a functional programming API in Scala, Java, Python and R.
Fig. 2. The Proposed framework.

• Data Preprocessing [15,29]—We use the NSL KDD cup 99 dataset, which is already preprocessed to some extent. We further preprocess it using parse-labeled-point. In MLlib, labeled points are used in supervised learning algorithms. We use a double to store a label, so that labeled points can be used in both regression and classification. The spark.ml package provides a machine learning API built on data frames, which are also becoming the core part of the Spark SQL library. The spark.ml package can be used for developing and managing machine learning pipelines and feature extraction, as transformers and selectors. Moreover, machine learning techniques like classification, regression and clustering are also provided by this library.
• NSGA-II [17,18]—We use Information Gain (IG) [19] and NSGA-II for the feature selection phase. As in the proposed framework of Kaliappan et al. [19], IG has been derived using some formulas and a Genetic Algorithm; we apply the same methodology. Feature selection is done to filter out less promising and irrelevant features for IDS from our dataset.

We incorporate the upgraded version of NSGA for feature selection, namely NSGA-II [18], a fast elitist NSGA that is best suited for our Big Data environment. Because of the large size of the columnar data and the diversification in attributes of the NSL KDD cup 99 dataset, NSGA-II can perform feature selection at a faster rate [17]. Table 2 represents all the parameters of the Genetic Algorithm (GA) and the non-dominated algorithm combined. GA provides solutions to constrained and unconstrained optimization problems and is entirely based on the theory of Charles Darwin that all organisms develop through natural selection of a small portion of inherited variations that boost the individual's ability to compete, reproduce and survive. We have considered the population size to be 40, and the length of the binary chromosome is 41, representing the 41 features of our dataset. The crossover rate is kept at 0.8 and the mutation rate at 0.2, because recombination is mandatory to generate new offspring; however, we do not want the whole new population to have large diversity, so the crossover-to-mutation ratio is kept at 4:1.

Table 2
NSGA-II parameters.

Population size: 40
Crossover rate: 0.8
Mutation rate: 0.2
Binary chromosome length: 41
Crowding distance sorting: Dynamic
Pareto front: Dynamic
Non-domination rank: Dynamic

As we have 41 attributes in our dataset with 125,973 instances, the computational speed of the system is affected and the complexity escalates too; that is why NSGA-II is the best choice to perform feature selection. Fig. 3 shows one of the outputs of our programming model, where 17,169 rows are tested and chromosomes containing binary values are the resulting outputs, out of which the best chromosome is selected by NSGA-II.

Fig. 3. Output showing best string selected.

● Classifiers—Below are the five classifiers used by our proposed model. We train these classifiers using the in-built machine learning libraries of Apache Spark according to our needs for better results [30,31].
(i) Support vector machine [23]—SVM is a supervised learning model that examines data and acknowledges the existence of patterns in the data; it is used for linear and non-linear classification and regression analysis. Besides, it is commonly used for transduction, novelty detection and semi-supervised learning.
(ii) Gradient boosted trees [25,32]—There are three elements in GBT: a loss function to be optimized, a weak learner to make predictions, and an additive model that adds weak learners to minimize the loss function. There are a few constraints in GBT, including the number of trees, tree depth, number of nodes, number of observations per split, and minimum improvement to loss.
(iii) Decision tree [25,33,34]—A rule-based classifier that works as follows:
• The best attribute of the dataset is placed at the root of the tree and the training set is split into subsets.
• Subsets should contain data with the same value for an attribute.
• Steps 1 and 2 are repeated on each subset until leaf nodes are found in all the branches of the tree.
(iv) Random forest [12,27,34,35]—RF is a supervised classification algorithm, similar to bootstrapping algorithms, which grows bootstrap samples from the original data, grows an unpruned classification tree on each, and predicts new data by aggregating the predictions of the trees. Using RF in collaboration with MapReduce in a Big Data scenario produces better results.
(v) Logistic regression [11]—LR is used for machine learning and predictive modeling. It works in scenarios where the dependent variable is binary or dichotomous. As lots of values in our dataset are binary, we use this algorithm to obtain successful results. It doesn't require picking a learning rate and also runs fast.
● Intrusion detection system [20,23,36]—We train our classifiers in a manner where they can differentiate between malicious and non-malicious packets. In Signature-based IDS, the patterns, including hash values, signatures, string values and regular expressions, are matched with the values already stored in the database. Anomaly-based IDS analyzes the traffic and reports on its nature, which is further distinguished as legitimate or illegitimate. Examining the incoming/outgoing traffic, one can determine whether it is a DoS attack, in which the traffic of Bots goes from Master to Slave (Zombie machines). In addition, Probe, R2L (Remote to Local) and U2R (User to Root) attacks can also be identified using this technique. Anomaly-based IDS is used for the detection of DoS and R2L attacks, whereas Signature-based IDS is used for U2R and Probe attacks.
● Multimodal fusion [19–21]—We use multimodal fusion in our scheme to enhance the performance of the entire IDS, making it ready for detecting novel attacks and for classifying attacks including DoS, U2R, R2L and Probe. As a single modality won't be sufficient to achieve better accuracy and detection rate, we instill the idea of merging the techniques using decision-based multimodal fusion, which gets its foundation from the approach used by Kaliappan et al. [19], who proved that the false alarm rate and accuracy can be improved by using this technique. Decision-based multimodal fusion applies the majority voting rule to the outputs generated by the individual classifiers and then decides the output. In our case, the output is either "Attack" or "Not Attack".
● NSL KDD 99 dataset [15,18]—This dataset is a newly proposed dataset that overcomes the issues of the existing KDD Cup 99 dataset; for example, it does not include redundant records in the training set and there is no duplicity of records in the test sets. Apparently, we can say that it is a reduced version of the KDD cup 99 dataset. There are 125,973 records in the training set and 22,544 records in the test set, with forty-one attributes. It is the current upgraded version of the KDD Cup 99 dataset used for working on IDS. Table 3 shows the total number of attack and Normal packet instances present in the NSL KDD 99 dataset. Table 4 represents the types of attacks available in the NSL KDD dataset along with their patterns.

Table 3
Class-wise attacks in the NSL_KDD cup 99 dataset [15].

Class of attack — Train file — Test file
Normal — 9711 — 9711
DoS — 45927 — 7456
Probe — 11656 — 2421
R2L — 995 — 2756
U2R — 52 — 200
Total instances — 125973 — 22544

Table 4
Types of attacks in the NSL_KDD cup 99 dataset [15].

Probe: ipsweep, nmap, portsweep, satan, mscan, saint
DoS: back, land, neptune, pod, smurf, teardrop, apache2, worm, udpstorm, mailbomb, processtable
U2R: buff_overflow, loadmodule, rootkit, perl, sqlattack, xterms, ps
R2L: guess_password, imap, multihop, phf, spy, warezclient, warezmaster, ftp_write, xlock, xsnoop, snmpgue, snmpgetattack, httptunnel, sendmail, named

3.2. Working mechanism

We work on the NSL KDD cup 99 dataset, which is apparently suitable for the implementation of the proposed model. After setting up the environment for Hadoop MapReduce and Spark, we execute the following steps:

Step I: Initialization phase [37]—The system's configuration is checked initially, as only upgraded systems are able to support the parallel processing of a massive dataset. First of all, NetBeans is initiated on JDK 8.0 and all the required libraries are imported into NetBeans. Then Hadoop [38] is initialized, which further initiates YARN, MapReduce and HDFS. Then we initialize Apache Spark [39], which exists in the same cluster with Hadoop MapReduce. Hadoop MapReduce and Spark are now in the running state, which further initiates parallel computation. Finally, the initialization phase finishes with the importation of the NSL KDD dataset into NetBeans.
Step II: Data pre-processing phase—The NSL KDD cup dataset already comes as a processed dataset, as it has been in use since 1999 for IDS and has been upgraded as well, with no redundant values in the training set and no duplicate records in the test set. However, we pre-process it in order to achieve a dataset with fast processing, no redundant values, and no string values in any column. We use JavaRDD and parseRDD, which perform index mapping, parsing and labeling, and thus yield clean data.
Step III: Feature selection phase—In this phase, NSGA-II is used to extract the best features out of all the 41 features, with diversity in its selection. Using NSGA-II, which uses Pareto optimality [17] and is faster with no over-fitting problem, optimum features can be selected; it also provides multi-objective functions in the execution of the algorithm. We implement our model both with and without feature selection in order to distinguish the results in terms of false alarm rate, detection rate and accuracy.
Step IV: Classification phase—Our preprocessed and selected dataset now goes through the classification phase, where the classifiers SVM, GBT, DT, RF and LR perform the required classification on the basis of the training strategies applied to them. Their results are then input to the next phase. Their outputs basically comprise the classification of an input, categorizing it as attack or not-attack; for example, classifying a rootkit under the attack category.
Step V: Multimodal fusion phase—In this phase, decision-based fusion is applied to the inputs received from the previous phase. As we are considering a Big Data scenario, we need an efficient and highly accurate classification process which performs faster computations and facilitates precise results with a low false alarm rate and higher values for the detection rate. So we combine SVM, GBT, DT, LR and RF. Outputs from the classification phase are of two categories, attack and not-attack, where the majority wins; pseudo code is shown in Fig. 4.
Step VI: Output phase—The categorizer (Fig. 4) decides the final output on the basis of majority, which further decides whether it is an attack or not on the basis of the number of results obtained for Attack (A) and Not-Attack (NA). In Fig. 2, outputs from different classifiers are denoted by X and Y, where X represents "it's an attack" and Y represents "it's not an attack" for all the patterns and packets matched in the NSL KDD dataset. The categorizer comes into action when it has to decide, on the basis of the majority of the results, whether it is an attack or not. For example, if 3 out of 5 classifiers say it is an Attack and 2 classifiers say it is Not-Attack, then the majority wins, yielding the final output of our model.

4. Experimental results and discussions

We carried out all the experiments on the Windows 10 platform, on a 64-bit OS and a system with an i5 processor at 2.70 GHz and 8 GB of installed RAM. NetBeans 8.2 is used as our main platform to run all our applications, including the Hadoop MapReduce task (with RDD) and Spark (RDD). NetBeans assimilated with Hadoop MapReduce and Spark RDD is used for the simulation and analysis of the experimental results.

Firstly, we perform data pre-processing on the NSL KDD cup 99 dataset using JavaRDD and parse-labeled-point, which are in-built functions provided by Java and Scala. This step lets us work with our dataset with no setbacks of redundancy, string conversion issues, or chaotic adjustments of instances. We utilize all the attacks which are categorized as malicious packets, including DoS, Probe, R2L and U2R, to train our classifiers to predict which packet is malicious and which one is legitimate. Secondly, IG values are obtained for the different features available in the train set and test set, which is the basis for proceeding with the feature selection step; they are represented in Fig. 5. Reckoning IG is an integral part of the implementation of this entire model, as it helps in the feature selection procedure and later in classifying a given sample. A decline in the IG values corresponding to the attributes ranging from 1 to 41 can be observed for the training set, whereas there are fluctuations in the IG values of the testing set.

We employ NSGA-II to perform feature selection on our dataset utilizing the parameters mentioned in Table 2; it selects the most promising features, which assist us in procuring enhanced performance with respect to accuracy and detection rate. Table 5 provides the description of all the forty-one features that are avail-
G. Donkal, G.K. Verma / Journal of Information Security and Applications 43 (2018) 1–11 7

Fig. 4. Function of a categorizer.

Fig. 5. Information gain (IG) values for (i) Train set, and (ii) Test set.

Table 5
Description of 41 features in NSL KDD dataset.

S.no Feature name Description Type

1. Duration Length (number of seconds) of the connection Continuous


2. Protocol_type Type of the protocol, e.g. tcp, udp, etc. Discrete
3. Service Network service on the destination e.g. http, telnet, etc. Discrete
4. Src_bytes Number of data bytes from source to destination Continuous
5. Dst_bytes Number of data bytes from destination to source Continuous
6. Flag Normal or error status of the connection Discrete
7. Land 1 if connection is from/to the same host/port; 0 otherwise Discrete
8. Wrong_fragment Number of wrong fragments Continuous
9. Urgent Number of urgent packets Continuous
10. Hot Number of hot indicators Continuous
11. Num_failed_logins Number of failed login attempts Continuous
12. Logged_in 1 if successfully logged in; 0 otherwise Discrete
13. Num_compromised Number of compromised conditions Continuous
14. Root_shell 1 if root shell is obtained; 0 otherwise Discrete
15. Su_attempted 1 if su root command attempted; 0 otherwise Discrete
16. Num_root Number of root accesses Continuous
17. Num_file_creat ions Number of file creation operations Continuous
18. Num_shells Number of shell prompts Continuous
19. Num_access_files Number of operations on access control files Continuous
20. Num_outbound_cmd s Number of outbound commands in an ftp session Continuous
21. Is_host_login 1 if the login belongs to the host list; 0 otherwise Discrete
22. Is_guest_login 1 if the login is a guest login; 0 otherwise Discrete
23. Count Number of connections to the same host as the current connection in the past two seconds Continuous
24. Serror_rate Number of connections that have SYN errors Continuous
25. Rerror_rate Number of connections that have REJ errors Continuous
26. Same_srv_rate Number of connections to the same service Continuous
27. Diff_srv_rate number of connections to different services Continuous
28. Srv_count number of connections to the same service as the current connection in the past two seconds Continuous
29. Srv_serror_rate Number of connections that have SYN errors Continuous
30. Srv_rerror_rate Number of connections that have REJ errors Continuous
31. Srv_diff_host_rate Number of connections to different hosts Continuous
32. Dst_host_count count of connections having the same destination host Continuous
33. Dst_host_srv_count count of connections having the same destination host and using the same service Continuous
34. Dst_host_same_srv_r ate Number of connections having the same destination host and using the same service Continuous
35. Dst_host_diff_srv_ra te Number of different services on the current host Continuous
36. Dst_host_same_src_ port_rate Number of connections to the current host having the same src port Continuous
37. Dst_host_srv_diff_ho st_rate Number of connections to the same service coming from different hosts Continuous
38. Dst_host_serror_rate Number of connections to the current host that have an S0 error Continuous
39. Dst_host_srv_serror_ rate Number of connections to the current host and specified service that have an S0error Continuous
40. Dst_host_rerror_rate Number of connections to the current host that have an RST error Continuous
41. Dst_host_srv_rerror_ rate Number of connections to the current host and specified service that have an RST error Continuous
8 G. Donkal, G.K. Verma / Journal of Information Security and Applications 43 (2018) 1–11

Table 6
Selected features based on NSGA-II providing best accuracy.

Classifiers Total features Selected features Selected features Accuracy

SVM 41 30 1, 2, 3, 4, 8, 9, 10, 11, 12, 14, 15, 16, 18, 20, 22, 23, 24, 25, 26, 27, 29, 30, 31, 32, 33, 35, 36, 37, 39, 40 90.58%
GBT 41 30 1, 3, 4, 6, 7, 8, 10, 12, 14, 16, 18, 19, 20, 21, 22, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 39, 40,41 90.36%
DT 41 29 1, 4, 5, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 29, 33, 34, 39, 40,41 91.04%
LR 41 32 1, 3, 4, 6, 7, 8, 10, 11, 13, 14, 15, 16, 17, 18, 20, 21, 22, 24, 25, 26, 28, 29, 30, 31, 34, 35, 36, 37, 38, 39, 40,41 89.39%
RF 41 28 1, 2, 3, 5, 6, 7, 8, 10, 12, 13, 15, 16, 18, 19, 20, 21, 22, 24, 25, 26, 30, 34, 35, 36, 37, 38, 39 90.13%
Fusion 41 30 1, 2, 3, 4, 5, 6, 8, 9, 12, 13, 14, 15, 17, 18, 19, 20, 21, 24, 25, 27, 29, 30, 31, 32, 33, 34, 35, 37, 38, 40 92.03%

able in NSL KDD cup 99 dataset [39]. NSGA-II selects the most Table 7
Calculated accuracy, precision and F-measure.
prominent features from these forty one features. Some of the
parameters of NSGA-II are dynamic and some are set static, for Classifiers Feature selection Accuracy (%) Precision F-measure
instance the crossover and mutation rate in which the ratio of SVM With FS 90.58 0.916 0.903
crossover and mutation rate is kept 8:2, as we do not want es- Without FS 83.38 0.859 0.827
calation in complexity in diversity that will be affected with the GBT With FS 90.36 0.914 0.901
soaring scalability of Big Data. If scalability soars in Big Data envi- Without FS 88.79 0.902 0.885
DT With FS 91.04 0.919 0.908
ronment, it is very likely that complexity will increase too, due to
Without FS 88.68 0.900 0.884
which BDSs become less robust and more complicated to evolve. LR With FS 89.39 0.907 0.891
Table 6 shows the features selected by NSGA-II algorithm, along Without FS 90.69 0.909 0.906
with the accuracy obtained so far in percentage. This experiment is RF With FS 90.13 0.913 0.899
Without FS 90.37 0.915 0.901
carried out for all the attack patterns available in NSL KDD Cup
Fusion With FS 92.03 0.928 0.919
99 dataset that are mentioned in Table 4. We carried out this Without FS 89.95 0.912 0.897
experiment iteratively with various combinations of features, and
we were able to achieve maximum accuracy, detection rate, false
alarm rate and processing speed with some of the most promi- Table 8
nent features filtered out by NSGA-II. It can be articulated from Calculated detection rate (%) for different classifiers.
Table 6 that how different combinations of distinguishable features Classifiers With feature selection (%) Without feature selection (%)
of our considered dataset can generate different accurate values
SVM 99.26 98.02
and allow us to choose a limited number of features out of all 41
GBT 99.14 99.04
features selecting only the most promising features and thus re- DT 99.03 98.85
ducing the burden of all features. Apparently, the number of fea- RF 99.42 99.33
tures selected by multimodal-fusion is 30 out of 41 and these fea- LR 99.29 95.78
tures can be looked up in Table 5. Fusion 99.38 99.34
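The IG ranking that seeds the feature selection step can be reproduced in a few lines. The sketch below is illustrative Python rather than the paper's Scala/Spark implementation, and the toy feature and label vectors are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG of a discrete feature: H(labels) - sum_v p(v) * H(labels | v)."""
    total = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum((len(ys) / total) * entropy(ys) for ys in groups.values())
    return entropy(labels) - conditional

# Toy 'protocol_type'-like feature against normal/attack labels (invented data)
feature = ["tcp", "tcp", "udp", "udp", "icmp", "icmp"]
labels  = ["normal", "normal", "attack", "attack", "attack", "attack"]
print(round(information_gain(feature, labels), 3))  # → 0.918
```

Here the feature perfectly separates the classes, so its IG equals the label entropy; a feature that tells us nothing about the label would score 0.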

4.1. Performance evaluation

We use a confusion matrix to evaluate the performance of our proposed scheme as implemented on Apache Spark. The confusion matrix is built from the total numbers of True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN) outcomes. The overall classification performance of the IDS is measured by accuracy, false alarm rate, precision, F-measure, and detection rate. Accuracy is defined as the ratio of the number of correct predictions to the total number of predictions made by the classification model [19].

False alarm rate = FP / (TN + FP)  (1)

Detection rate = TP / (FN + TP) * 100  (2)

Accuracy = (TP + TN) / (TP + TN + FP + FN) * 100  (3)

Precision = TP / (TP + FP) * 100  (4)

F-measure = 2 * (Precision * Recall) / (Precision + Recall)  (5)

Table 7 shows the precision and F-measure/F1 score, with and without feature selection (FS), to three decimal places for all five individual classifiers and for the fusion. It also shows the accuracy, in percentage, of all five classifiers for both the "with FS" and "without FS" scenarios. Fig. 6 plots the precision and F-measure data given in Table 7; similarly, Fig. 7 can be consulted for the accuracy of all five classifiers.

Fig. 6 shows the experimental results of our proposed scheme, where the rise in precision and F-measure obtained after the successful incorporation of NSGA-II feature selection can be observed. Eqs. (3)–(5) give the formulas for calculating accuracy, precision and F-measure. Table 8 reports the detection rate, computed using the formula in Eq. (2), with and without feature selection. It is evident that with feature selection the detection rate is higher not only for the fusion but also for each of the five classifiers individually.

Fig. 7 depicts the increase in accuracy when fusion is performed with NSGA-II-based feature selection. A comparable improvement in accuracy and false alarm rate is not observed in the schemes proposed by Kaliappan et al. [19] and Gupta et al. [11]. Although the work of Kaliappan et al. [19] is not Big Data centric, the defense mechanism introduced there, fusing five IDSs, showed that better efficiency and performance can be achieved.

Table 9 recapitulates all our experimental results, including the classifiers used and the tools/programming models used, along with the false alarm rate (formula in Eq. (1)), the detection rate and the testing time of the entire execution. After exploring the available options for the implementation environment, we found that Eclipse provides a smooth and effective execution of our programming model.
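Eqs. (1)–(5) map directly onto confusion-matrix counts. The following Python sketch is illustrative only; the counts are hypothetical, not taken from the experiments, and precision is left as a fraction (as reported in Table 7) rather than scaled by 100 as in Eq. (4).

```python
def ids_metrics(tp, tn, fp, fn):
    """IDS evaluation measures from confusion-matrix counts, per Eqs. (1)-(5)."""
    far = fp / (tn + fp)                              # Eq. (1): false alarm rate
    detection = tp / (fn + tp) * 100                  # Eq. (2): detection rate (%), i.e. recall
    accuracy = (tp + tn) / (tp + tn + fp + fn) * 100  # Eq. (3): accuracy (%)
    precision = tp / (tp + fp)                        # Eq. (4), kept as a fraction here
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)  # Eq. (5)
    return {"FAR": far, "DR": detection, "Acc": accuracy,
            "Precision": precision, "F-measure": f_measure}

# Hypothetical counts for illustration only
m = ids_metrics(tp=900, tn=80, fp=70, fn=10)
print({k: round(v, 3) for k, v in m.items()})
```

With these counts, detection rate and recall coincide by definition, which is why Eq. (5) can reuse the recall term.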


Fig. 6. Measurements of precision and F-measure.


Fig. 7. Measurement of accuracy with and without feature selection.
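The decision-based majority-voting fusion whose accuracy Fig. 7 reports can be sketched as follows. This is an illustrative Python rendering, not the authors' Spark code, and the per-classifier predictions are invented for the example.

```python
from collections import Counter

def majority_vote(predictions):
    """Decision-level fusion: each classifier casts one vote per sample;
    the label with the most votes wins (ties broken by first-seen order)."""
    fused = []
    for sample_votes in zip(*predictions):      # one tuple of votes per sample
        fused.append(Counter(sample_votes).most_common(1)[0][0])
    return fused

# Hypothetical per-sample outputs of the five classifiers (SVM, GBT, DT, LR, RF)
svm = ["attack", "normal", "attack"]
gbt = ["attack", "normal", "normal"]
dt  = ["normal", "normal", "attack"]
lr  = ["attack", "attack", "attack"]
rf  = ["attack", "normal", "normal"]
print(majority_vote([svm, gbt, dt, lr, rf]))  # → ['attack', 'normal', 'attack']
```

With an odd number of voters over two classes, no tie can occur, which is one reason an ensemble of five classifiers is convenient for binary attack/normal decisions.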

Table 9
Comparison of results of different approaches used in building an enhanced IDS. Per-classifier columns: Detection rate (%), FAR, Testing time (s).

Kaliappan et al. [19] — Decision-based multi-modal fusion; feature selection: GA — Big Data considered: No — Tools: Weka tool; Java; NSL-KDD dataset
  SVM  95.4  0.7  126.0
  IBK  99.5  0.3  0.25
  J48  99.6  0.1  1.69
  RF  99.7  0.2  12.88
  BayesNet  93.7  0.3  0.69
  Fusion (All)  99.0  1.0  NA

Gupta et al. [11] — No fusion; correlation-based and Chi-squared feature selection — Big Data considered: Yes — Tools: Apache Spark and its MLlib; DARPA KDD'99 dataset; NSL-KDD cup 99 dataset
  LR  60.01  6.6  1.336
  SVM  58.45  83.1  1.53
  Naïve Bayes  0.65  18.6  0.345
  RF  69.28  3.6  2.242
  GBT  60.85  2.67  1.674

Proposed model — Decision-based multi-modal fusion; feature selection: NSGA-II — Big Data considered: Yes — Tools: Apache Spark and its MLlib; Scala language; NSL-KDD cup 99 dataset
  SVM  99.26  0.35  0.19
  GBT  99.14  0.21  0.14
  DT  99.03  0.19  0.11
  LR  99.42  0.21  0.43
  RF  99.29  0.23  0.22
  Fusion  99.38  0.17  0.32
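The NSGA-II feature selection listed in Table 9 rests on Pareto dominance between candidate feature subsets. A minimal sketch follows, assuming two objectives (maximize accuracy, minimize subset size); the candidate (accuracy, feature-count) pairs are illustrative, loosely echoing Table 6, and this is not the authors' implementation.

```python
def dominates(a, b):
    """a, b = (accuracy, n_features). a dominates b if it is no worse on both
    objectives (accuracy higher-or-equal, fewer-or-equal features) and
    strictly better on at least one."""
    return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

def pareto_front(solutions):
    """Return the non-dominated solutions, i.e. the first front that
    NSGA-II's non-dominated sorting would extract."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]

# Hypothetical (accuracy %, selected-feature count) pairs for candidate subsets
candidates = [(90.58, 30), (91.04, 29), (89.39, 32), (90.13, 28), (92.03, 30)]
print(pareto_front(candidates))  # → [(91.04, 29), (90.13, 28), (92.03, 30)]
```

Subsets dominated on both objectives drop out; NSGA-II then applies crowding-distance sorting within fronts to preserve the diversity of selection mentioned in Step III.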

Since data is the most important asset these days, BDSs developed for securing the entire Big Data environment must be designed so that the confidentiality, integrity and availability of data can be maintained. Large networks and data centers that support the communication of titanic volumes of data need to be provided with full-fledged security, so that users' data cannot be tampered with by any anonymous user or by an impersonated device or person. In Table 9, our scheme is compared with the existing schemes of Kaliappan et al. [19] and Gupta et al. [11], which shows that the proposed scheme is better in terms of detection rate and false alarm rate. In addition, a slight improvement in overall testing time is also achieved by our proposed model.

In [19], by adopting multimodal fusion and GA, the model was able to achieve a detection rate of 99.0%; our proposed model, on the other hand, achieved a detection rate of 99.38%. Additionally, in comparison to the model proposed in [18], the detection rate achieved by our proposed model is much higher, since those authors did not use a fusion-based approach. Our proposed scheme clearly signifies that, using this methodology, the overall efficiency of the model is improved significantly.

5. Conclusion and future scope

The pace with which data is being generated and transferred across the globe has inclined the world towards a more cognitive and knowledgeable era. But with the bulk generation of data, hackers seek out new opportunities to breach security systems built with cutting-edge techniques and algorithms. Therefore, keeping in mind the value and importance of data in our day-to-day lives, its security should be the primary concern of data scientists.

Our proposed work was inspired by the disadvantages we explored in some of the existing techniques, including the prediction time and detection rate of the approach of Kaliappan et al. [19] and the accuracy of a few other approaches. We merged the benefits of five classifiers available in the Apache Spark BDS, which led to the development of our proposed model. Consolidating the outputs of five different classifiers through multimodal fusion on the Spark BDS helped us achieve:

• Better performance of the entire system in terms of false alarm rate, detection rate, accuracy and faster processing of data.
• An overall accuracy of 92.03%, with the highest individual accuracy of 91.04% achieved by DT.
• High computational speed, with the entire execution performed in 0.32 s.

To resolve the bottlenecks of security and computation in a Big Data environment, maneuvers that optimize the entire voluminous ecosystem must be taken into consideration. Since Apache Spark is still under development, it does not yet support many classifiers; more classifiers can therefore be introduced into the Spark BDS to obtain more accurate results and to provide diverse options for better classifier performance. For future work, this entire implementation can be executed on Flink, which is faster than Hadoop MapReduce and Spark.

Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.jisa.2018.10.001.

References

[1] Zuech R, Khoshgoftaar TM, Wald R. Intrusion detection and big heterogeneous data: a survey. J Big Data 2015;2(1):3.
[2] Dietrich D, Heller B, Yang B. Data science & big data analytics: discovering, analyzing, visualizing and presenting data; 2015.
[3] Chen M, Mao S, Liu Y. Big data: a survey. Mobile Netw Appl 2014;19(2):171–209.
[4] Ordiano JÁG, Bartschat A, Ludwig N, Braun E, Waczowicz S, Renkamp N, et al. Concept and benchmark results for Big Data energy forecasting based on Apache Spark. J Big Data 2018;5(1):11.
[5] Moreno J, Serrano MA, Fernández-Medina E. Main issues in big data security. Future Internet 2016;8(3):44.
[6] Wang L, Jones R. Big data analytics for network intrusion detection: a survey. Int J Netw Commun 2017;7(1):24–31.
[7] Download link (Hadoop): https://archive.apache.org/dist/hadoop/core/. Accessed December 2017.
[8] Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, et al. Apache Spark: a unified engine for big data processing. Commun ACM 2016;59(11):56–65.
[9] Download link (Spark): https://www.apache.org/dyn/closer.lua/spark/. Accessed December 2017.
[10] https://spark.apache.org/. Accessed December 2017.
[11] Gupta GP, Kulariya M. A framework for fast and efficient cyber security network intrusion detection using Apache Spark. Procedia Comput Sci 2016;93:824–31.
[12] Suthaharan S. Big data classification: problems and challenges in network intrusion prediction with machine learning. ACM SIGMETRICS Perform Eval Rev 2014;41(4):70–3.
[13] Jonnalagadda SK, Reddy RP. A literature survey and comprehensive study of intrusion detection. Int J Comput Appl 2013;81(16):40–7.
[14] Terzi DS, Terzi R, Sagiroglu S. Big data analytics for network anomaly detection from netflow data. In: Computer Science and Engineering (UBMK), 2017 International Conference on. IEEE; 2017. p. 592–7.
[15] Dhanabal L, Shantharajah SP. A study on NSL-KDD dataset for intrusion detection system based on classification algorithms. Int J Adv Res Comput Commun Eng 2015;4(6):446–52.
[16] Aggarwal P, Sharma SK. Analysis of KDD dataset attributes-class wise for intrusion detection. Procedia Comput Sci 2015;57:842–51.
[17] Deb K, Pratap A, Agarwal S, Meyarivan T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 2002;6(2):182–97.
[18] Tamimi A, Naidu DS, Kavianpour S. An intrusion detection system based on NSGA-II algorithm. In: Cyber Security, Cyber Warfare, and Digital Forensic (CyberSec), 2015 Fourth International Conference on. IEEE; 2015. p. 58–61.
[19] Kaliappan J, Thiagarajan R, Sundararajan K. Fusion of heterogeneous intrusion detection systems for network attack detection. Sci World J 2015;2015:1–9.
[20] Ramachandran C. An advanced data processing based fusion IDS structures. Int J Appl Eng Res 2017;12(21):10929–37.
[21] Giacinto G, Roli F, Didaci L. Fusion of multiple classifiers for intrusion detection in computer networks. Pattern Recogn Lett 2003;24(12):1795–803.
[22] Sun K, Miao W, Zhang X, Rao R. An improvement to feature selection of random forests on Spark. In: Computational Science and Engineering (CSE), 2014 IEEE 17th International Conference on. IEEE; 2014. p. 774–9.
[23] Sung AH, Mukkamala S. Identifying important features for intrusion detection using support vector machines and neural networks. In: Applications and the Internet, 2003 Symposium on, Proceedings. IEEE; 2003. p. 209–16.
[24] Thomas C, Balakrishnan N. Improvement in intrusion detection with advances in sensor fusion. IEEE Trans Inf Forensics Secur 2009;4(3):542–51.
[25] Mulay SA, Devale PR, Garje GV. Intrusion detection system using support vector machine and decision tree. Int J Comput Appl 2010;3(3):40–3.
[26] Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, et al. Apache Hadoop YARN: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing. ACM; 2013. p. 5.
[27] Jeya PG, Ravichandran M, Ravichandran CS. Efficient classifier for R2L and U2R attacks. Int J Comput Appl 2012;45(21):29.
[28] Ezin EC, Djihountry HA. Java-based intrusion detection system in a wired network. Int J Comput Sci Inf Secur 2011;9(11):33.
[29] Hamed T, Ernst JB, Kremer SC. A survey and taxonomy on data and pre-processing techniques of intrusion detection systems. In: Computer and Network Security Essentials. Springer; 2018. p. 113–34.
[30] Singh SP, Jaiswal UC. Machine learning for big data: a new perspective. Int J Appl Eng Res 2018;13(5):2753–62.
[31] Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Xin D. MLlib: machine learning in Apache Spark. J Mach Learn Res 2016;17(1):1235–41.
[32] Kevric J, Jukic S, Subasi A. An effective combining classifier approach using tree algorithms for network intrusion detection. Neural Comput Appl 2017;28(1):1051–8.
[33] Rai K, Devi MS, Guleria A. Decision tree based algorithm for intrusion detection. Int J Adv Netw Appl 2016;7(4):2828.
[34] Liu H, Gegov A, Cocea M. Rule based systems for big data: a machine learning approach, vol. 13. Springer; 2015.
[35] Del Río S, López V, Benítez JM, Herrera F. On the use of MapReduce for imbalanced big data using Random Forest. Inf Sci 2014;285:112–37.
[36] Landset S, Khoshgoftaar TM, Richter AN, Hasanin T. A survey of open source tools for machine learning with big data in the Hadoop ecosystem. J Big Data 2015;2(1):24.
[37] Hashem IAT, Yaqoob I, Anuar NB, Mokhtar S, Gani A, Khan SU. The rise of "big data" on cloud computing: review and open research issues. Inf Syst 2015;47:98–115.
[38] Shanahan JG, Dai L. Large scale distributed data science using Apache Spark. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; 2015. p. 2323–4.
[39] Dhanabal L, Shantharajah SP. A study on NSL-KDD dataset for intrusion detection system based on classification algorithms. Int J Adv Res Comput Commun Eng 2015;4(6):446–52.
