Professional Documents
Culture Documents
A Detailed Analysis of New Intrusion Detection
A Detailed Analysis of New Intrusion Detection
ABSTRACT
The increasing use of Internet networks has led to increased threats and new attacks day after day. In order
to detect anomaly or misused detection, Intrusion Detection System (IDS) has been proposed as an
important component of secure network. Because of their model free properties that enable them to identify
the network pattern and discover whether they are normal or malicious, Machine-learning technique has
been useful in the area of intrusion detection. Different types of machine learning models were leveraged in
anomaly-based IDS. There is an increasing demand for reliable and real-world attacks dataset among the
research community. In this paper, a detailed analysis of most-recent dataset CICIDS2017 has been made.
During the analysis, many problems and shortcoming in a dataset were found. Some solutions are proposed
to fix these problems and produce optimized CICIDS2017 dataset. A 36-feature has been extracted during
the analysis, and compared to 23-featured extracted by the dataset from literatures. The 36-features gave the
best result of losses, accuracy and F1-score metrics for the testing model.
Keywords: CICIDS2017, Network Anomaly Detection, Imbalance Dataset, Quantile Transform, Machine
Learning.
4519
Journal of Theoretical and Applied Information Technology
15th September 2019. Vol.97. No 17
© 2005 – ongoing JATIT & LLS
Researchers at Canadian Institute of Cyber security dataset makes the machine learning algorithm
provide this dataset. It contains up-to-date attacks bias to large instances for a given attack.
and fulfills all the criteria of IDS that mentioned What is the suitable normalization function
above. The dataset consists of CSV files for flow that results in the best detection rate? Not all
records generated with CICFlowMeter, and the normalization functions are suitable for all
sniffed network traffic PCAP files. Traffic Data data set.
captured for a total of five days. Attacks carried out
on working days (Tuesday-Friday) in both morning
Table 1: CICIDS2017 dataset [5]
and afternoon. The implemented attacks include
Flow pcap Duration CS Attack name Flow
DoS, DDoS, Heartbleed, Web Attacks, FTP Brute Record File V count
Force, SSH Brute Force, Infiltration, Botnet and ing size File
Port Scans. Monday is the normal day and includes Day Size
only legitimate traffic [3]. (Worki
ng
Hours)
10 All Day 257 No Attack 529918
The CICIDS2017 Dataset builds from an abstract GB MB
behavior of 25 internet users, which are based on
Monday
email protocols and the HTTP HTTPS FTP SSH. It
involves labeled network flows comprising full
packet payloads in PCAP format, the corresponding
profiles and the labeled flows 10 All Day 166 FTP-Patator, 445909
(GeneratedLabelledFlows.zip), and CSV files for GB MB SSH-Patator
deep learning purpose (MachineLearningCSV.zip)
Tuesday
a high detection rate of the intrusion detection Afternoo 103 Infiltration 288602
system for all attacks in the data set. The following n MB
questions will be the search path:
8.2 Morning 71.8 Bot 192033
GB MB
Is the dataset contains null values? Null values
not accepted by Machine Learning algorithms.
Afternoo 92.7 DDoS 225745
What type of values of the features in the data n MB
set? The suitability of those values for
Machine Learning algorithms. Afternoo 97.1 Port Scan 286467
Friday
4520
Journal of Theoretical and Applied Information Technology
15th September 2019. Vol.97. No 17
© 2005 – ongoing JATIT & LLS
solutions to inherent shortcomings in this dataset [16] proposed a data dimensionality reduction
and relabeling it to reduce high-class imbalance. method for network intrusion detection. The model
Hossen et al. [4] explored the performance of a has been evaluated on both the datasets NSL-KDD
Network Intrusion Detection System (NIDS) that and CICIDS 2017. And the comparison between
can detect various types of attacks in the network both the proposed approaches has also been
using Deep Reinforcement Learning Algorithm, represented.
they worked on 85 attributes of CICIDS2017 which
aided as an effective means in the detection of
different types of attacks. Kostas [5] tried to detect The CICIDS2017 dataset consists of two zipped
network anomaly by using machine-learning files: GeneratedLabelledFlows and
algorithms, the CICIDS2017 is used due to its up-to MachineLearningCVE. the first file consists of 85
date, and has wide attack diversity. Feature features plus one for attacks type labels, while the
selection made by using the Random Forest second file consist of 78 features plus one for
Regressor algorithm. Seven Different machine- attacks type labels. The difference between the
learning methods are used in the implementation features in the files is 6 features these features in
step and achieved high performance. Boukhamla et the first file is for identification the flow as stated
al. [7] described optimized CICIDS2017 dataset, by the authors of the dataset. All previous works on
and then evaluated the performance by using CICIDS2017 dataset except [3] used all 85 features
machine-learning algorithms. Again, the study used of data. Some of these features such as “FlowID,
CSV file, which contains features obtained from SourceIP, SourcePort, DestinationIP” in the first
network flow. Mieden [8] presented many file wrongly considered as a features to train the
experiments on classifying unwanted behavior in model. For example, the IP addresses and the
CICIDS2017 dataset using deep learning with source ports are changing continuously throughout
tensor flow. They worked on CSV files which the network. Inaddition the FlowID consist of 4
contains 85 features. Pektas et al. [9] presented tuple they are “SourceIP, SourcePort,
deep learning architecture combining CNN and DestinationIP, DestinationPort” is considered a
LSTM to enhance the intrusion detection repeated feature. Therefore we used the second file
performance. The CICIDS2017 dataset is used for for analysis and training. For the imbalance
testing the model. Vijayanand et al. [10] proposed problem, the study [6] solve class imbalance
IDS that used the genetic algorithm as a feature problem by relabeling of classes. They merged a
selection method and multiple Support Vector few minority class to form new attack classes (DoS
Machines (SVM) for classification. A small portion and DDoS merged in one class, all Web attacks
of the CICIDS2017 dataset instances are used to merged in one class too), the merging process
evaluate the system. Radford et al. [11] compared incorrect. According to our result, DoS attack and
and contrasted a frequency-based model from five DDoS attacks can be discriminated clearly. For
sequence of aggregation rules with sequence-based dataset normalization, all of the previous literature
modeling of the Long Short-Term Memory (LSTM) used MiniMax or Standard scaling function. Our
recurrent neural network. Lavrova et al. [12] work use Quintile transform which give us better
analyzed the CICIDS2017 dataset using digital results.
wavelets. Watson [13] applied the Multi-Layer
Perceptron (MLP) classifier algorithm and a The aim of the research is to focus on analysis
Convolutional Neural Network (CNN) classifier and preprocess that we conduct on the CICIDS2017
that used the Packet Capture file of CICIDS2017. dataset. Our purpose is to prepare the dataset to
They selected specified network packet header train and test an Anomaly intrusion detection model
features for their study. Aksu et al. [14] proposed a based on machine learning. A preprocess on the
denial of service IDS that used the Fisher Score dataset has an important effect on the detection rate
algorithm for features selection and Support Vector of the model. This trained model considered the
Machine (SVM), K-Nearest Neighbor (KNN) and building block of a network intrusion detection
Decision Tree (DT) as the classification algorithm. system. Contributions of the research in this work
Marir et al. [15] use a distributed Deep Belief are: Detecting faults and problems never explored
Network (DBN) as the dimensionality reduction before in the CICIDS2017 and finding solutions for
approach. The obtained features are fed to a multi- them. The research includes features analysis, their
layer ensemble SVM. The researchers divided importance in detecting the attack, find the high
CICIDS2017 dataset into training and test datasets correlated between features and study the balancing
using a ratio of 60% to 40%, respectively. Bansal in the data set. In addition, the best normalization
4521
Journal of Theoretical and Applied Information Technology
15th September 2019. Vol.97. No 17
© 2005 – ongoing JATIT & LLS
4522
Journal of Theoretical and Applied Information Technology
15th September 2019. Vol.97. No 17
© 2005 – ongoing JATIT & LLS
have null values with label of Bot. To delete any Heartbleed) are increased so that minimum number
record have a Null values we run the command: of records of any attack type is not less than 5000
records. This is done using SMOTE (Synthetic
Bot=Bot.dropna(axis=0, how='any') Minority Over-sampling Technique), a python
algorithm in Imblearn balancing library, which do
The same operations are applied to all upsample (generate new samples) from few records
files. Then we merge all files to form a dataset for based on the n-nearest neighbor. The semi-balanced
all attacks, by run the following commands: dataset becomes as shown in Table 5. After semi-
balancing, the data set shape becomes as follows:
AllAttacks=pd.concat([bot,ddos,dosH,infilt,ports 832,373 rows, 78 feature columns, 1 label Column.
,ssh,web,benign],ignore_index=True)
Table 5: Semi-balanced dataset
Then the dataset shape becomes 2827876 rows, Attack type Record
78 feature Columns, 1 label Column. The Count
distribution of records by traffic type in the data set BENIGN 250000
shown in Table 4. DoS Hulk 230124
PortScan 158804
Table 4: Traffic flow type Distribution DDoS 128025
Attack type Record DoS GoldenEye 10293
Count FTP-Patator 7935
BENIGN 2271320 SSH-Patator 5897
DoS Hulk 230124 DoS slowloris 5796
PortScan 158804 DoS Slowhttptest 5499
DdoS 128025 Bot 5000
DoS GoldenEye 10293 Web Attack-Brute Force 5000
FTP-Patator 7935 Web Attack- XSS 5000
SSH-Patator 5897 Infiltration 5000
DoS slowloris 5796 Web Attack-Sql Injection 5000
DoS Slowhttptest 5499 Heartbleed 5000
Bot 1956 Total 832,373
Web Attack-Brute Force 1507
Web Attack-XSS 652 The following commands illustrate the
Infiltration 36 process of downsampling. We import the balancing
Web Attack-Sql Injection 21 library then define a dictionary with the number of
Heartbleed 11 records for each class:
Total 2827876 from imblearn.under_sampling import
RandomUnderSampler
From Table 4, it is noted that the data set
is imbalanced, as the number of normal traffic Define a dictionary:
records is very large compared to other records of
attacks, and the existence of a few records of some Down-Dic = {
types of attacks. Many records for one class make 'BENIGN' : 250000,
the machine-learning model to bias to that class, 'DoS Hulk' : 230124,
while little records make the machine-learning 'PortScan' : 158804,
model learn nothing about that class. To solve the 'DDoS' : 128025,
problem of imbalance, the normal traffic records 'DoS GoldenEye' : 10293,
(BENIGN) is downsampled to 250000 records by 'FTP-Patator' : 7935,
using RandomUnderSampler a python algorithm in 'SSH-Patator' : 5897,
Imblearn balancing library. In addition, in order to 'DoS slowloris' : 5796,
increase the sensitivity of the classifier to detect 'DoS Slowhttptest' : 5499,
attacks with few records in multi-class 'Bot' : 1956,
classification problems, the number of records for 'Web Attack-Brute Force’:1507,
the few attacks (Bot Web Attack-Brute Force Web 'Web Attack-XSS' : 652,
Attack-XSS Infiltration Web Attack-Sql Injection 'Infiltration' : 36,
4523
Journal of Theoretical and Applied Information Technology
15th September 2019. Vol.97. No 17
© 2005 – ongoing JATIT & LLS
4524
Journal of Theoretical and Applied Information Technology
15th September 2019. Vol.97. No 17
© 2005 – ongoing JATIT & LLS
4525
Journal of Theoretical and Applied Information Technology
15th September 2019. Vol.97. No 17
© 2005 – ongoing JATIT & LLS
4526
Journal of Theoretical and Applied Information Technology
15th September 2019. Vol.97. No 17
© 2005 – ongoing JATIT & LLS
4527
Journal of Theoretical and Applied Information Technology
15th September 2019. Vol.97. No 17
© 2005 – ongoing JATIT & LLS
4528
Journal of Theoretical and Applied Information Technology
15th September 2019. Vol.97. No 17
© 2005 – ongoing JATIT & LLS
current work by implementing various balancing intrusion detection” [bachelor thesis]. Munich:
algorithm and investigate its effects on model the Ludwig maximilians university, institute
learning. Study the trouble with the two attacks fur informatics; 2018.
(Web Attack-Brute Force Web Attack-XSS). In [9] Pektaş A., Acarman T., “A deep learning
addition, applying deep learning algorithms on this method to detect network intrusion through
data set such as convolution and recurrent neural flow‐based features”, International Journal of
networks. At another direction a new dataset CSE- Network Management 29(3), 2018, e2050.
CIC-IDS2018 [2] dataset need to explored and [10] Vijayanand R, Devaraj D, Kannapiran B.,
analyzed. This dataset is huge and needs more “Intrusion Detection System For Wireless
hardware and computing resources to be analyzed. Mesh Network Using Multiple Support Vector
Machine Classifiers With Genetic-Algorithm-
REFRENCES: Based Feature Selection”, Computers &
Security (77), 2018, 304–314.
[1] Gharib A, Sharafaldin I, Lashkari AH, [11] Radford BJ, Richardson BD, Davis SE.,
Ghorbani AA., “An Evaluation Framework “Sequence Aggregation Rules for Anomaly
for Intrusion Detection Dataset”, International Detection in Computer Network Traffic”,
Conference on Information Science and arXiv preprint arXiv: 1805.03735, 2018.
Security (ICISS), 2016, 1-6. [12] Lavrova D, Semyanov P, Shtyrkina A,
[2] Search UNB [Internet]. University of New Zegzhda P., “Wavelet-Analysis Of Network
Brunswick est.1785. [cited 2019May26]. Traffic Time-Series For Detection Of Attacks
Available from: On Digital Production Infrastructure”, SHS
https://www.unb.ca/cic/datasets/ids-2017.html Web of Conferences, EDP Sciences (44), 2018,
[3] Sharafaldin I, Lashkari AH, Ghorbani AA., 00052.
“Toward Generating a New Intrusion Detection [13] Watson, G., “A Comparison of Header and
Dataset and Intrusion Traffic Deep Packet Features When Detecting
Characterization”, Proceedings of the 4th Network Intrusions”, Technical Report,
International Conference on Information University of Maryland: College Park, MD,
Systems Security and Privacy, 2018, 108-116. USA, 2018.
[4] Hossen, S., Janagam A., “Analysis of Network [14] Aksu D, Üstebay S, Aydin MA, Atmaca
Intrusion Detection System with Machine T., “Intrusion Detection With Comparative
Learning Algorithms (Deep Reinforcement Analysis Of Supervised Learning Techniques
Learning Algorithm)” [master’s thesis]. And Fisher Score Feature Selection
Karlskrona, Sweden: Faculty of Computing at Algorithm”, International Symposium on
Blekinge Institute of Technology; 2018. Computer and Information Sciences , Springer,
[5] Kostas, K., “Anomaly Detection in Networks Cham , 2018, 141-149.
Using Machine Learning” [master’s thesis]. [15] Marir N, Wang H, Feng G, Li B, Jia M. ,
Colchester, UK: School of Computer Science “Distributed Abnormal Behavior Detection
and Electronic Engineering, University of Approach Based On Deep Belief Network And
Essex; 2018. Ensemble Svm Using Spark”, IEEE Access
[6] Panigrahi R, Borah S., “A detailed analysis of (6),2018,59657-71.
CICIDS2017 dataset for designing Intrusion
Detection Systems”, International Journal of [16] Bansal, A., “DDR Scheme and LSTM
Engineering & Technology 7 (3.24), 2018, RNN Algorithm for Building an Efficient
479-482. IDS”, [Master’s Thesis]. Punjab, India: Thapar
[7] Boukhamla A, Gaviro JC., “CICIDS2017 Institute of Engineering and Technology; 2018.
dataset: performance improvements and [17] NumPy [Internet]. NumPy. [cited
validation as a robust intrusion detection 2019May26]. Available from:
system testbed” [pdf]. 2018 [cited https://www.numpy.org/
2019May26]. Available from: [18] Learn [Internet]. scikit. [cited
https://www.researchgate.net/publication/3277 2019May26]. Available from: https://scikit-
98156_CICIDS2017_Dataset_Performance_Im learn.org/stable/
provements_and_Validation_as_a_Robust_Intr [19] Python Data Analysis Library [Internet].
usion_Detection_System_Testbed Pandas. [cited 2019May26]. Available from:
[8] Mieden, P., “Implementation and evaluation of https://pandas.pydata.org/index.html
secure and scalable anomaly-based network
4529
Journal of Theoretical and Applied Information Technology
15th September 2019. Vol.97. No 17
© 2005 – ongoing JATIT & LLS
4530
Journal of Theoretical and Applied Information Technology
15th September 2019. Vol.97. No 17
© 2005 – ongoing JATIT & LLS
Concatenate all
dataset files 36 features
dataset
Remove Null
values records
Normalization
Semi-balancing
23 Quantile 36
features features
dataset (Best scaling) dataset
4531
Journal of Theoretical and Applied Information Technology
15th September 2019. Vol.97. No 17
© 2005 – ongoing JATIT & LLS
MachineLearningCVE\Thursday- Infiltration 36 0 36
WorkingHours-Afternoon- BENIGN 288566 207 288359
Infilteration.pcap_ISCX.csv
4532
Journal of Theoretical and Applied Information Technology
15th September 2019. Vol.97. No 17
© 2005 – ongoing JATIT & LLS
4533
Journal of Theoretical and Applied Information Technology
15th September 2019. Vol.97. No 17
© 2005 – ongoing JATIT & LLS
4534
Journal of Theoretical and Applied Information Technology
15th September 2019. Vol.97. No 17
© 2005 – ongoing JATIT & LLS
4535
Journal of Theoretical and Applied Information Technology
15th September 2019. Vol.97. No 17
© 2005 – ongoing JATIT & LLS
4536
Journal of Theoretical and Applied Information Technology
15th September 2019. Vol.97. No 17
© 2005 – ongoing JATIT & LLS
4537