
School of Information Technology and Engineering

A PROJECT ON
INTRUSION DETECTION SYSTEM USING
UNSUPERVISED ML ALGORITHMS

Technical Answers for Real World Problems (ITE3999)

Faculty: Prof. DAPHNE LOPEZ

Submitted by:
Aditya Kumar (18BIT0235)
Ritvik Gupta (18BIT0218)

Significance of the Study


An institution wants a fast network and a working environment free from
viruses and disruptive messages, which also helps reduce bandwidth
utilization. Well-planned intrusion detection also simplifies network
management. As the network expands, combining different techniques gives
better coverage and more effective intrusion detection, and hence prevention.
In the long run, deploying IDSs greatly cuts the costs incurred through
unauthorized access and frees network managers or system administrators to
engage in other productive work.

Generally, attacks on a network can be classified into four types:

1. Interception: An unauthorized party gains access to an asset. The party
could be a program, a person or a computing system. Examples include
wiretapping and illicit copying of program or data files. The damage can be
worse when the attacker leaves no traces on the network.

2. Interruption: An asset of the system is made unavailable or unusable.
Examples include malicious destruction of a hardware device, erasure of a
program or data file, or a malfunction of the operating system so that it
cannot, say, access a particular disk.

3. Modification: The attacker both accesses and tampers with the asset. This
may include altering a program so that it performs an additional computation,
or modifying data being transmitted. Such changes range from simple ones to
subtle ones that may never be detected.

4. Fabrication: The attacker creates counterfeit objects on the computing
system. When skillfully done, fabrication may also go undetected and is thus
a very serious threat to network security.

Limitations of traditional IDS

1. The number of real attacks is often far smaller than the number of false
alarms raised, so real threats often go unnoticed.

2. Noise can severely reduce the capabilities of the IDS by generating a high
false-alarm rate.

3. Signature-based IDS require constant software updates to keep up with new
threats.

4. An IDS monitors the whole network, so it is vulnerable to the same attacks
as the network's hosts. Protocol-based attacks can cause the IDS to fail.

5. A network IDS can only detect network anomalies, which limits the variety
of attacks it can discover.

6. A network IDS can create a bottleneck, as all inbound and outbound traffic
passes through it.

7. A host IDS relies on audit logs, so any attack that modifies the audit
logs threatens the integrity of the HIDS.
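
The false-alarm limitation is largely a base-rate effect: when attacks are
rare, even a small false-alarm rate produces more false alerts than real ones.
The sketch below works through the arithmetic. All figures in it are
hypothetical and chosen only for illustration; they are not measurements from
this project.

```python
def alert_precision(n_attacks, n_benign, detection_rate, false_alarm_rate):
    """Return (true_alerts, false_alerts, precision) for a stream of events."""
    true_alerts = n_attacks * detection_rate      # attacks that raise an alert
    false_alerts = n_benign * false_alarm_rate    # benign events that raise an alert
    precision = true_alerts / (true_alerts + false_alerts)
    return true_alerts, false_alerts, precision


# Hypothetical scenario: 1,000 attacks hidden among 999,000 benign events,
# an IDS that catches 99% of attacks and misfires on 1% of benign events.
true_alerts, false_alerts, precision = alert_precision(
    n_attacks=1_000, n_benign=999_000, detection_rate=0.99, false_alarm_rate=0.01
)
print(f"real alerts: {true_alerts:.0f}, false alerts: {false_alerts:.0f}")
print(f"share of alerts that are real attacks: {precision:.1%}")
```

Under these assumed rates, roughly one alert in eleven corresponds to a real
attack, which is why real threats are easily lost among false alarms.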

Machine Learning is the field of study that gives computers the capability to
learn and improve from experience automatically, without being explicitly
programmed. Machine learning focuses on the development of programs that can
access data and use it to learn for themselves.

ABSTRACT
With the advent of vast amounts of information and technology, businesses of
all kinds around the world are becoming increasingly data driven. Companies
collect and handle data of high velocity, variety and volume. This also
opens up various loopholes in the systems developed for working with such
large amounts of data.

In this project, we attempt to tackle the problem of intrusions in digital
systems by creating an Intrusion Detection System using Unsupervised
Machine Learning Algorithms.

Traditional Intrusion Detection Systems exist; however, they detect
intrusions based only on network signatures and previously classified
flags. By using Machine Learning, we are able to use the data from
previous attacks and dynamically detect new intrusion patterns in real
time by analyzing that data.

AIM

The aim of our project is to compare and analyze clustering models such as
K-means and Gaussian Mixture clustering, with the help of Big Data
techniques such as Spark, on an IDS dataset, so as to find the model that
gives the most effective network threat detection.

METHODOLOGY

We train two unsupervised machine learning algorithms, K-Means and the
Gaussian Mixture Model, on a local in-memory Spark cluster.

• Feature scaling is applied to adjust for widely varying feature ranges.

• Feature selection is applied using Attribute Ratio.

• K-means and Gaussian Mixture models are used for training.

• We also use PySpark to optimize our operations and divide processing
into batches using pipelines.
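
The scaling and clustering steps above can be sketched as follows. This is a
minimal, dependency-free illustration in plain Python, not the PySpark
pipeline used in the project: it assumes min-max scaling as the scaling
method and Lloyd's algorithm for K-means, and the toy records (duration,
bytes) are hypothetical.

```python
import random


def min_max_scale(rows):
    """Rescale every column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=50, seed=0):
    """Lloyd's algorithm; returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as initial centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members)
                                      for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels


# Hypothetical records: (duration in seconds, bytes transferred).
records = [[1, 200], [2, 220], [1, 210], [50, 90000], [55, 95000], [52, 91000]]
scaled = min_max_scale(records)
centroids, labels = k_means(scaled, k=2)
print(labels)
```

Without scaling, the byte counts would dominate the distance computation and
the duration column would have almost no influence on the clusters, which is
the reason for the feature scaling step.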

DATASET DESCRIPTION

• The data sets contain records of the internet traffic seen by a
simple intrusion detection network. They are the "ghosts" of the traffic
encountered by a real IDS: only the traces of its existence remain.
• Each record contains 42 features: 41 describe the traffic input itself
and the last is a label (normal or attack).
• The test dataset contains 22,000 entries and the training dataset
contains 1.26 lakh (126,000) entries.
• The training dataset contains 22 of the 37 attack types present in the
test dataset.
• The known attack types are those present in the training dataset,
while the novel attacks are the additional attacks in the test dataset,
i.e. those not available in the training dataset.
• The attack types are grouped into four categories: DoS, Probe,
U2R and R2L.
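
A minimal sketch of the grouping into four categories is given below. It
assumes the label column uses the attack names of the public KDD-style
intrusion datasets, which also list 22 training attacks; the report itself
does not list the names, so the mapping should be checked against the actual
label values. The function name `categorize` and the "novel" fallback are
illustrative choices.

```python
# Assumed mapping of KDD-style attack labels to the four categories.
ATTACK_CATEGORIES = {
    "DoS": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    "Probe": ["ipsweep", "nmap", "portsweep", "satan"],
    "U2R": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
    "R2L": ["ftp_write", "guess_passwd", "imap", "multihop", "phf",
            "spy", "warezclient", "warezmaster"],
}

# Invert to a flat lookup: attack name -> category.
LABEL_TO_CATEGORY = {
    name: category
    for category, names in ATTACK_CATEGORIES.items()
    for name in names
}


def categorize(label):
    """Map a raw label to normal, one of the four categories, or 'novel'."""
    if label == "normal":
        return "normal"
    # Labels absent from the training mapping are treated as novel attacks.
    return LABEL_TO_CATEGORY.get(label, "novel")


print(len(LABEL_TO_CATEGORY))   # number of known training attacks in the mapping
print(categorize("neptune"), categorize("normal"), categorize("some_new_attack"))
```

Attacks that appear only in the test dataset fall through to "novel", which
mirrors the known versus novel distinction described above.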

The features in this data set can be broken down into 4 types:

• 4 Categorical (Features: 2, 3, 4, 42)
• 6 Binary (Features: 7, 12, 14, 20, 21, 22)
• 22 Discrete (Features: 8, 9, 15, 23–41)
• 10 Continuous (Features: 1, 5, 6, 10, 11, 13, 16, 17, 18, 19)
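
The index lists above can be written down as a lookup table and checked for
consistency. The sketch below uses the 1-based feature numbers exactly as
listed and adds a simple one-hot encoder, on the assumption that categorical
features must be made numeric before distance-based clustering. The helper
names and the sample protocol values are hypothetical.

```python
# 1-based feature numbers, as listed in the dataset description.
FEATURE_TYPES = {
    "categorical": [2, 3, 4, 42],
    "binary": [7, 12, 14, 20, 21, 22],
    "discrete": [8, 9, 15] + list(range(23, 42)),   # 23..41 inclusive
    "continuous": [1, 5, 6, 10, 11, 13, 16, 17, 18, 19],
}


def is_partition(groups, n_features=42):
    """True if every feature number 1..n_features appears in exactly one group."""
    seen = sorted(i for indices in groups.values() for i in indices)
    return seen == list(range(1, n_features + 1))


def one_hot(values):
    """Encode categorical values as 0/1 vectors, one column per sorted category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


print({name: len(idx) for name, idx in FEATURE_TYPES.items()})
print(is_partition(FEATURE_TYPES))

# Hypothetical values for one categorical feature.
encoded, categories = one_hot(["tcp", "udp", "tcp", "icmp"])
print(categories)
print(encoded)
```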

FEATURE ELIMINATION USING ATTRIBUTE RATIO

Feature selection is also known as attribute selection or variable
selection. It helps in selecting the most appropriate features among those
available. Feature selection can be performed manually or automatically.
Importance:
• Features may be expensive to obtain, so selecting fewer of them
reduces cost.
• It helps improve the accuracy of the model.
• It reduces the time the model needs to train.
• It discards garbage data.
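
As an illustration, the sketch below implements one formulation of Attribute
Ratio for numeric features: for each class, the mean of the feature within
that class is divided by the mean over all records, and the largest of these
class ratios is the feature's score. This formulation, the threshold value
and the toy data are assumptions made for the example; the variant for binary
features is omitted, and the exact definition used should be taken from the
references.

```python
from statistics import mean


def attribute_ratio(values, labels):
    """Score one numeric feature: max over classes of mean(class) / mean(all)."""
    overall = mean(values)
    if overall == 0:
        return 0.0  # feature is zero on average: no ratio can be formed
    by_class = {}
    for value, label in zip(values, labels):
        by_class.setdefault(label, []).append(value)
    return max(mean(vals) / overall for vals in by_class.values())


def select_features(columns, labels, threshold):
    """Keep the names of features whose attribute ratio reaches the threshold."""
    return [name for name, values in columns.items()
            if attribute_ratio(values, labels) >= threshold]


# Hypothetical data: features are assumed to be scaled to [0, 1] beforehand.
labels = ["normal", "normal", "dos", "dos"]
columns = {
    "f_informative": [0.0, 0.1, 0.9, 1.0],   # differs strongly between classes
    "f_flat": [0.5, 0.5, 0.5, 0.5],          # identical in every class
}
scores = {name: attribute_ratio(vals, labels) for name, vals in columns.items()}
print(scores)
print(select_features(columns, labels, threshold=1.5))
```

A feature whose average is the same in every class scores 1.0 and carries no
class information, so it is a candidate for elimination.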

References:
https://www.naun.org/main/UPress/cc/2014/a102019-106.pdf
http://www.wseas.us/e-library/conferences/2013/Nanjing/ACCIS/ACCIS-30.pdf
