ARTICLE IN PRESS

Computers and Electrical Engineering xxx (2007) xxx–xxx


www.elsevier.com/locate/compeleceng

Improving network security using genetic algorithm approach


Zorana Banković a,*, Dušan Stepanović b, Slobodan Bojanić a, Octavio Nieto-Taladriz a
a ETSI Telecomunicación, Technical University of Madrid, Ciudad Universitaria s/n, 28040 Madrid, Spain
b Faculty of Electrical Engineering, University of Belgrade, Bulevar Kralja Aleksandra 78, 11000 Beograd, Serbia

Abstract

With the expansion of the Internet and its growing importance, the types and number of attacks have also
grown, making intrusion detection an increasingly important technique. In this work we have realized a misuse
detection system based on a genetic algorithm (GA) approach. The KDD99Cup training and testing datasets were
used for evolving and testing new rules for intrusion detection. To be able to process network data in real
time, we have deployed principal component analysis (PCA) to extract the most important features of the data.
In this way we were able to maintain high attack detection rates while speeding up the processing of the data.
© 2007 Published by Elsevier Ltd.

Keywords: Intrusion detection; Genetic algorithm; Principal component analysis

1. Introduction

The Internet and local area networks have been expanding at an amazing rate in recent years, not just in terms
of size, but also in terms of the services offered and the mobility of users, which makes them more vulnerable
to various kinds of complex attacks. While we benefit from the convenience that new technology has brought us,
computer systems are exposed to an increasing number and complexity of security threats. Of particular
importance, thus, is the ability to rapidly apply new network security policies in order to detect and react as
quickly as possible to occurring attacks. Different techniques have been developed and deployed to protect
computer systems against network attacks (anti-virus software, firewalls, message encryption, secured network
protocols, password protection). Despite all these efforts, it is impossible to have a completely secure
system. Therefore, intrusion detection, which monitors network traffic and identifies network intrusions such
as anomalous network behaviors, unauthorized network access, or malicious attacks on computer systems, is
becoming an increasingly important technique. Most of the existing solutions are developed for well-defined
networks and systems [1–3]. However, they are not adapted to dynamic environments or to the increasing
complexity of user behaviors.

* Corresponding author.
E-mail address: zorana@die.upm.es (Z. Banković).

0045-7906/$ - see front matter © 2007 Published by Elsevier Ltd.
doi:10.1016/j.compeleceng.2007.05.010

Please cite this article in press as: Banković Z et al., Improving network security using genetic algorithm approach,
Comput Electr Eng (2007), doi:10.1016/j.compeleceng.2007.05.010

There are two general categories of intrusion detection systems (IDSs): misuse detection and anomaly
detection. Misuse detection systems are the most widely used; they detect intruders by matching known
patterns. The signatures and patterns used to identify attacks consist of various fields of a network packet,
such as source address, destination address, source and destination ports, or even some keywords in the
payload of a packet. These systems have the drawback that only attacks that already exist in the attack
database can be detected, so the database needs continuous updating, but they have the virtue of a very low
false positive rate. Anomaly detection systems identify deviations from normal behaviour and alert to
potential unknown or novel attacks without any prior knowledge of them. They exhibit a higher rate of false
alarms, but they are able to detect unknown attacks and perform their task of looking for deviations much
faster.
Application and development of specialized machine learning techniques is gaining increasing attention in
the intrusion detection community [19]. Soft computing is a collection of methodologies, which aim to exploit
tolerance for imprecision, uncertainty and partial truth to achieve tractability, robustness and low solution
cost. As soft-computing techniques can also be used for machine learning, different soft-computing techniques
have been used for intrusion detection (Fuzzy Logic, Artificial Neural Networks, Genetic Algorithms)
[4,7,15,17], but their possibilities are still under-utilized. In this work we have realized a misuse
detection system based on a genetic algorithm (GA). We have explored both possibilities: classifying network
traffic as normal or abnormal, and further classifying the attacks by type. Many features of GAs make them
very suitable for intrusion detection, such as robustness to noise, self-learning capabilities, and the fact
that the initial rules can be built randomly, so there is no need to know the exact attack mechanism at the
beginning. Further classification of the attacks is not very important for intrusion detection itself, but it
is important for network forensics: knowing the exact type of a threat and the way it performs its attack
makes recovery after an attack more successful.
Genetic algorithms (GAs) are one of the emerging fields in computer security, especially in intrusion
detection systems (IDSs) [7,15,18]. A GA operates on a population of potential solutions, applying the
principle of survival of the fittest to produce better and better approximations to the solution of the
problem it is trying to solve. At each generation, a new set of approximations is created by selecting
individuals according to their fitness in the problem domain and breeding them together using operators
borrowed from natural genetics, i.e. crossover and mutation. This process leads to the evolution of
populations of individuals that are better adapted to their environment than the individuals they were
created from, just as in natural adaptation.
For evolving new rules the KDD99Cup training and testing dataset was used [5]. The KDD99Cup dataset has been
found to have a number of drawbacks, such as missing and useless features and the impossibility of detecting
some attacks [6]. Despite these shortcomings, it is still the prevailing dataset for training and testing IDSs
because of its good structure (every connection is described by 41 features and is labelled, indicating
whether the connection is normal or a specific attack type) and because it is the only dataset of its kind
available [7,8].
It is extremely difficult to process a large amount of network traffic data in real time in order to detect
network attacks and take the appropriate actions. On the other hand, not all features have the same relevance
for intrusion detection [9]. To process network data in real time and perform efficient intrusion detection,
we need to extract the most important pieces of information that can be used for efficient detection of
network attacks. We have deployed principal component analysis (PCA), also known as the Karhunen–Loève
transform, in order to extract the most relevant features of the data. PCA seeks to reduce the dimension of
the data by finding a few orthogonal linear combinations of the original variables with the largest variance.
Our results confirm that a high detection rate is maintained while using data of lower dimension. The other
benefit is that data processing and the decision on whether a connection is an attack are performed much
faster.
The rest of the paper is organized as follows. Section 2 gives a survey of the machine learning techniques
used for intrusion detection. Section 3 gives an overview of the dimension reduction technique deployed.
Section 4 gives an overview of genetic algorithms. The genetic algorithm approach to intrusion detection, as
deployed, is presented in Section 5. Section 6 describes the software implementation of the proposed
approach. The obtained results are given in Section 7. Possible application environments of the system are
presented in Section 8. Conclusions are drawn in Section 9.

2. Survey on the machine learning techniques used for intrusion detection

The large amount of network data and the large number of network attacks have made it necessary to use
intelligent machine learning techniques in order to discover attacks and the way they function. The past few
years have witnessed a growing recognition of intelligent techniques for the construction of efficient and
reliable intrusion detection systems. Most of the well-known pattern recognition techniques, both supervised
and unsupervised, and their combinations resulting in meta-classifiers have been used for intrusion
detection. Some of the state-of-the-art techniques [15,17–24] and their results on the KDD99Cup dataset are
presented in Table 1.
Genetic algorithms are one of the techniques that have recently been recognized as having potential in the
intrusion detection field. Some of their applications are presented in [7,8,15,18]. The novelty of our
approach is that we use only three features out of 41 to describe a network connection while maintaining high
detection rates. This allows the system to perform intrusion detection rapidly, in terms of both training and
testing the detection rules, and makes it applicable to high-speed networks. Our approach exhibits detection
and false-positive rates similar to those of the approaches presented in [7,8], but with a much shorter
training process and thus faster refreshing of the rule set. Frequent refreshing of the rule set is a very
important characteristic considering the rate at which new attacks emerge [25].

3. Dimension reduction technique

3.1. PCA – overview

Advances in data collection and storage capabilities during the past decades have led to an information
overload in most sciences. Researchers working in various domains face ever larger sets of observations and
simulations every day. Traditional statistical methods do not give satisfactory results, mostly because of
the growing number of variables associated with each observation. The dimension of the data is the number of
variables used to describe it.
High-dimensional datasets have given rise to new theoretical developments, such as dimension reduction
techniques, since one of the problems with high-dimensional datasets is that, in many cases, not all measured
variables are important for the phenomena of interest. A large number of dimension reduction techniques have
been developed [10], such as principal component analysis, factor analysis, projection pursuit and
independent component analysis. For our work, we have deployed principal component analysis (PCA), or the
Karhunen–Loève transform, because it is the best linear dimension reduction technique in the mean-square
error sense [11].

Table 1
Machine learning techniques deployed for intrusion detection and their performance on the KDD99Cup dataset
Technique                                                                    Detection rate (%)   False positive rate (%)
C4.5                                                                         95                   1
Support vector machine (SVM)                                                 95.5                 1
Multilayer perceptron (MLP)                                                  94.5                 1
k-nearest neighbor (k-NN)                                                    92                   1
Linear programming machine (LPM)                                             94                   1
Regularized discriminant analysis (RDA)                                      94                   1
Fisher linear discriminant (FD)                                              89                   1
γ-algorithm                                                                  80                   1
k-means clustering                                                           65                   1
Single-linkage clustering                                                    69                   1
Quarter-sphere SVM                                                           65                   1
Y-means clustering                                                           89.89                1
Genetic programming ensemble for distributed intrusion detection (GEdIDS)    91                   0.43
SVM + GA                                                                     99                   –
SVM + Fuzzy Logic                                                            99.56                0.44
Neural Networks + PCA                                                        92.22                –
C4.5 + PCA                                                                   92.16                –
GA                                                                           97.47                0.69
C4.5 + Hybrid neural networks                                                93.28                0.2
Hidden Markov model (HMM)                                                    79                   –

In our approach it was more convenient to select a subset of the original features that preserves most of the
relevant information according to some optimality criterion (in our case, the features that are most likely
to be involved in an attack) rather than to find a mapping that uses all of the original features. The benefit
of finding a proper subset is that it avoids the cost of computing unnecessary features, leading to a speed
gain in both detecting intrusions and training the rules.
Different techniques have been deployed for feature selection, such as regression techniques, clustering and
PCA-based methods [31–33]. It has been demonstrated that PCA-based techniques have certain advantages over
regression techniques in terms of optimality and speed [33].

3.2. Implementation of the PCA technique deployed for feature selection

In essence, PCA seeks to reduce the dimension of the data by finding a few orthogonal linear combinations
(the PCs) of the original variables with the largest variance [10]. The first PC, s1, is the linear combination with
the largest variance. The second PC is the linear combination with the second largest variance and orthogonal
to the first PC, and so on. There are as many PCs as the number of original variables. For many datasets, the
first several PCs explain most of the variance, so that the rest can be discarded with minimal loss of
information.
We have deployed an alternative way of reducing the dimension of a dataset using PCA, proposed in [12].
Instead of using the PCs as new variables, this method uses the information in the PCs to find the important
variables in the original dataset. As before, one first calculates the PCs and then studies the scree plot,
i.e. the plot of the sorted eigenvalues, from large to small, as a function of the eigenvalue index, to
determine the number k of important variables to keep. Next, one considers the eigenvector corresponding to
the smallest eigenvalue (the least important PC) and discards the variable that has the largest coefficient
(in absolute value) in that vector. Then, one considers the eigenvector corresponding to the second smallest
eigenvalue and discards the variable with the largest coefficient among the variables not discarded earlier.
The process is repeated until only k variables remain.
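The discard procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name select_variables is hypothetical, the eigenvalue/eigenvector pairs are assumed to be computed elsewhere (the paper used MATLAB for this), and the three hand-made eigenvectors are toy values, not data from the paper.

```python
def select_variables(eigenpairs, k):
    """Return the indices of the k original variables that survive the discard procedure.

    eigenpairs: list of (eigenvalue, eigenvector) tuples of the covariance
    (or correlation) matrix; how they are obtained is left open here.
    """
    n_vars = len(eigenpairs[0][1])
    remaining = set(range(n_vars))
    # Walk through the PCs from the least important (smallest eigenvalue) upward.
    for _, vector in sorted(eigenpairs, key=lambda pair: pair[0]):
        if len(remaining) <= k:
            break
        # Discard the not-yet-discarded variable with the largest absolute coefficient.
        worst = max(remaining, key=lambda i: abs(vector[i]))
        remaining.remove(worst)
    return sorted(remaining)


# Hypothetical orthonormal eigenvectors for three variables (toy values).
pairs = [
    (5.0, (0.8, 0.6, 0.0)),
    (2.0, (-0.6, 0.8, 0.0)),
    (0.1, (0.0, 0.0, 1.0)),
]
kept_two = select_variables(pairs, 2)  # variable 2 is discarded first
kept_one = select_variables(pairs, 1)  # then variable 1
```

In this toy case the least important PC loads entirely on variable 2, so it is dropped first; the second least important PC has its largest coefficient on variable 1.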

4. Genetic algorithm overview

Genetic algorithms (GAs) are search algorithms based on the principles of natural selection and genetics.
The foundations of the genetic algorithm approach were laid by Holland [13], and it has been deployed to
solve a wide range of problems.
A GA evolves a population of initial individuals into a population of high-quality individuals, where each
individual represents a solution of the problem to be solved. Each individual is called a chromosome and is
composed of a predetermined number of genes [14]. The quality of each individual (in our case, each rule) is
measured by a fitness function, a quantitative representation of the individual's adaptation to a certain
environment. The procedure starts from an initial population of randomly generated individuals. The
population is then evolved for a number of generations, gradually improving the quality of the individuals in
the sense of increasing their fitness values. During each generation, three basic genetic operators, i.e.
selection, crossover and mutation, are sequentially applied to each individual with certain probabilities.
The algorithm flow is presented in Fig. 1.
The following factors have a crucial impact on the efficiency of the algorithm: the choice of fitness
function, the representation of individuals and the values of the GA parameters (crossover and mutation
rates, population size, fitness threshold). The choice of these factors usually depends on the application.
In our work we have employed two simple fitness functions. The first one calculates the detection rate of a
rule minus its false detection rate, and the second one is the support-confidence framework. The GA
parameters were chosen after a large number of experiments.

[Flowchart: start; generate initial population; evaluate population; is end of evaluation reached? If yes:
best individuals (result). If no: selection, crossover, mutation, generation of new population, then evaluate
population again.]

Fig. 1. Genetic algorithm flow.

Deployment of GAs in the intrusion detection field offers a number of advantages, namely:

• GAs are intrinsically parallel: since they have multiple offspring, they can explore the solution space in
multiple directions at once. If one path turns out to be a dead end, they can easily eliminate it and
continue working on more promising avenues, giving them a greater chance of finding the optimal solution on
each run.
• Due to the parallelism that allows them to implicitly evaluate many schemas at once, GAs are particularly
well suited to solving problems where the space of all potential solutions is truly huge, too vast to search
exhaustively in any reasonable amount of time, as is the case with network data.
• A system based on a GA can easily be re-trained, which makes it possible to evolve new rules for intrusion
detection. This property makes a GA-based system adaptable, which is an imperative quality of an intrusion
detection system given the high rate at which new attacks emerge.

5. Genetic algorithm approach to intrusion detection

The proposed approach contains two stages. In the first, the training stage, a set of rules for detecting
intruders is generated offline using network audit data. In the second stage, the best rules, i.e. the rules
with the highest fitness values, are used for intrusion detection in a real-time environment.
As some network characteristics are more likely than others to be involved in network intrusions, we have
deployed the PCA approach to identify these characteristics. The PCA algorithm described above was
implemented in MATLAB and applied to the training dataset in order to identify the features that participate
most frequently in the mechanism of an attack. According to the obtained results, we have selected three
features out of the forty-one used to describe each connection in the KDD99Cup dataset. The objective was to
select the smallest possible number of features while maintaining a high detection rate of intrusions, so
that detection could be performed in real time. The selected features and their descriptions are presented in
Table 2. The Appendix presents the original 41 features of the KDD99Cup dataset and their descriptions, as
well as the selected features up to dimension 16 that we obtained using the same PCA technique. Every feature
represents one gene of the chromosome. As one byte is used to represent every feature, i.e. every gene, the
chromosome that represents each individual is composed of three bytes.

Table 2
Selected network features
Name of the feature         Description                                         Number of genes
duration                    Length (number of seconds) of the connection        1
src_bytes                   Number of data bytes from source to destination     1
dst_host_srv_serror_rate    Percentage of connections that have "SYN" errors    1

Every rule for intrusion detection is a simple if-then clause. The features from Table 2 are connected using
the and function, thus forming the conditional part of a rule. The result of every rule is the confirmation
of an intrusion. For example, one rule could be:
if (duration = "1" and src_bytes = "0" and dst_host_srv_serror_rate = "50") then intrusion;
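As an illustration of how such a rule can be checked against a connection, the sketch below stores a rule as a tuple of three gene values and compares it with the three features of Table 2. The Connection type, the function name rule_matches and the sample values are hypothetical, and the sketch assumes the feature values have already been mapped into the one-byte range used for genes; the paper does not describe that mapping.

```python
from collections import namedtuple

# Field names follow Table 2; the Connection type itself is a hypothetical helper.
Connection = namedtuple("Connection", "duration src_bytes dst_host_srv_serror_rate")


def rule_matches(rule, conn):
    """A rule fires only when all three genes equal the corresponding features (logical and)."""
    return (rule[0] == conn.duration
            and rule[1] == conn.src_bytes
            and rule[2] == conn.dst_host_srv_serror_rate)


rule = (1, 0, 50)  # duration = 1, src_bytes = 0, dst_host_srv_serror_rate = 50
suspicious = Connection(duration=1, src_bytes=0, dst_host_srv_serror_rate=50)
ordinary = Connection(duration=0, src_bytes=181, dst_host_srv_serror_rate=0)
```

Here rule_matches(rule, suspicious) is true and rule_matches(rule, ordinary) is false, since the second connection differs in every feature.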
To determine a fitness value of each rule, the following fitness function is deployed [15]:
\[ \mathrm{fitness} = \frac{a}{A} - \frac{b}{B} \qquad (1) \]
where a is the number of correctly detected attacks, A is the total number of attacks in the training
dataset, b is the number of normal connections incorrectly characterized as attacks, i.e. false positives,
and B is the total number of normal connections in the training dataset. The fitness values range over
[-1, 1], where -1 is the lowest and 1 the highest value. A high detection rate and a low rate of false
positives result in a high fitness value. Conversely, a low detection rate and a high rate of false positives
result in a low fitness value.
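A small worked example may make formula (1) concrete. The function name fitness_detection and the toy data are assumptions made for illustration only; the rule is passed in as a predicate so that the sketch does not depend on any particular rule encoding.

```python
def fitness_detection(rule_fires, connections):
    """Formula (1): a/A - b/B.

    rule_fires: predicate on a feature tuple, True when the rule flags it as an intrusion.
    connections: list of (features, is_attack) pairs from the training data.
    """
    attacks = [f for f, is_attack in connections if is_attack]
    normals = [f for f, is_attack in connections if not is_attack]
    a = sum(1 for f in attacks if rule_fires(f))  # correctly detected attacks
    b = sum(1 for f in normals if rule_fires(f))  # false positives
    return a / len(attacks) - b / len(normals)


# Toy training data (hypothetical): 4 attacks and 4 normal connections.
data = ([((1, 0, 50), True)] * 3 + [((0, 0, 100), True)]
        + [((1, 0, 50), False)] + [((0, 181, 0), False)] * 3)

value = fitness_detection(lambda f: f == (1, 0, 50), data)
```

The rule fires on 3 of the 4 attacks and on 1 of the 4 normal connections, so the value is 3/4 - 1/4 = 0.5.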
We have also exploited the possibility of GA to detect the exact type of an attack. Detecting the type of
each intrusion is not very important for intrusion detection, but it is important for forensics in order to recover
from an attack. In this case a rule can be presented as:
if (duration = "1" and src_bytes = "0" and dst_host_srv_serror_rate = "50") then (attack_name = "portsweep");
As the previous fitness function is aware only of the total number of intrusions and not of their exact type,
we have deployed the support-confidence framework [8] to determine the fitness of each rule:
\[
\begin{aligned}
\mathrm{support} &= |A \text{ and } B| / N \\
\mathrm{confidence} &= |A \text{ and } B| / |A| \\
\mathrm{fitness} &= w_1 \cdot \mathrm{support} + w_2 \cdot \mathrm{confidence}
\end{aligned}
\qquad (2)
\]
where N is the total number of network connections in the testing dataset, |A| stands for the number of
connections matching the condition A, and |A and B| stands for the number of connections that match the rule
if A then B. The weights w1 and w2 are used to control the balance between the two terms.
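The support-confidence fitness of formula (2) can be sketched as follows. The function name, the toy data and the choice to return a confidence of 0 when no connection matches the condition A are assumptions of this sketch; the paper does not say how the case |A| = 0 is handled.

```python
def fitness_support_confidence(condition, label, connections, w1, w2):
    """Formula (2) for the rule 'if A then B'.

    condition: predicate on a feature tuple (the condition A).
    label: attack name predicted by the rule (the consequent B).
    connections: list of (features, attack_name) pairs.
    """
    n = len(connections)
    labels_matching_a = [name for features, name in connections if condition(features)]
    count_a = len(labels_matching_a)
    count_a_and_b = sum(1 for name in labels_matching_a if name == label)
    support = count_a_and_b / n
    # Assumption: a rule whose condition never matches gets zero confidence.
    confidence = count_a_and_b / count_a if count_a else 0.0
    return w1 * support + w2 * confidence


# Toy data (hypothetical): 8 connections, 4 match the condition, 2 of those are portsweep.
data = ([((1, 0, 50), "portsweep")] * 2 + [((1, 0, 50), "normal")] * 2
        + [((0, 0, 100), "neptune")] * 2 + [((0, 181, 0), "normal")] * 2)

value = fitness_support_confidence(lambda f: f == (1, 0, 50), "portsweep", data, 0.5, 0.5)
```

With these numbers the support is 2/8 = 0.25 and the confidence is 2/4 = 0.5, so equal weights of 0.5 give a fitness of 0.375.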
The algorithm for generating new rules proceeds as follows. The first step is the initialization of the
population, in which each gene is given a random value. Then the parameters of the genetic algorithm
(crossover and mutation rates, population size, stopping condition for the evolution of rules) are specified
and the network audit data is loaded. After that, the initial population is evolved for a number of
generations. In every generation the quality of every rule, i.e. its fitness value, is calculated according
to the fitness function, then a number of rules with the highest fitness values are selected, and finally the
genetic operators (crossover and mutation) are applied with a certain probability. The output of the
algorithm is a set of rules for intrusion detection.
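The steps above can be put together in a compact sketch. It is not the authors' C++ implementation: keeping the fitter half of the population as parents, cutting chromosomes at a gene boundary, the exact-match rule semantics, and all function names and default parameter values are assumptions made for illustration. The toy data restricts gene values to 0..3 through the hypothetical n_values parameter so that the search space stays tiny.

```python
import random


def fitness(rule, data):
    """Formula (1): detection rate minus false-positive rate of one rule."""
    attacks = [f for f, is_attack in data if is_attack]
    normals = [f for f, is_attack in data if not is_attack]
    a = sum(1 for f in attacks if f == rule)  # attacks the rule fires on
    b = sum(1 for f in normals if f == rule)  # normal connections it fires on
    return a / len(attacks) - b / len(normals)


def crossover(p1, p2, rng):
    """One-point crossover; cutting at a gene boundary is an assumption."""
    cut = rng.randint(1, len(p1) - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]


def mutate(rule, rate, n_values, rng):
    """Replace each gene by a random value with probability `rate`."""
    return tuple(rng.randrange(n_values) if rng.random() < rate else g for g in rule)


def evolve(data, pop_size=20, generations=30, p_cross=0.6, p_mut=0.01,
           n_values=256, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial population of three-gene rules.
    population = [tuple(rng.randrange(n_values) for _ in range(3))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Step 2: evaluate every rule and keep the fitter half (assumed selection scheme).
        ranked = sorted(population, key=lambda r: fitness(r, data), reverse=True)
        parents = ranked[:pop_size // 2]
        # Step 3: refill the population with crossed-over and mutated children.
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            c1, c2 = crossover(p1, p2, rng) if rng.random() < p_cross else (p1, p2)
            children += [mutate(c1, p_mut, n_values, rng),
                         mutate(c2, p_mut, n_values, rng)]
        population = parents + children[:pop_size - len(parents)]
    # Output: all rules of the final population, best first.
    return sorted(population, key=lambda r: fitness(r, data), reverse=True)


# Toy audit data (hypothetical): one attack pattern and one normal pattern.
toy = [((1, 0, 2), True)] * 6 + [((0, 3, 0), False)] * 10
rules = evolve(toy, n_values=4)
```

Because the parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next in this sketch.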

6. Implementation of the system

The system proposed here is implemented in software using the C++ programming language. The implementation
contains two systems: an "offline" system used to derive the rules from network audit data, and an "online"
system that uses the derived rules for intrusion detection and is intended to operate in a real-time
environment.

Fig. 2. Class diagram of the realized system.

The program contains the following classes:

• Individual: represents an individual of the rule population; contains its chromosome representation and
fitness value; also contains a field called attack_name, which is used in the second case of rule generation
to determine the exact type of an attack;
• Fitness: used for calculating a fitness value; contains the definitions of both fitness functions described
above, each of which is selected when necessary;
• Initializer: used for the initialization of a population;
• Evaluator: used for selecting the rules whose fitness values are higher than a given threshold value (the
"best-fit" rules);
• Breeder: used for breeding each generation; it first selects two individuals randomly, performs one-point
crossover on them with a certain probability, thus generating two new individuals, and then mutates the new
individuals with a small probability;
• PriorityQueue: a queue of elements organized by priority, in this case the individuals in descending order
of their fitness values, which facilitates the work of the Breeder and Evaluator classes and the selection of
the best-fit individuals at the end of the training process.

The classes (with their most important attributes and functions) and their interconnections and dependencies
are depicted in Fig. 2.
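To illustrate the role of the PriorityQueue and Evaluator classes, the sketch below keeps rules ordered by fitness and returns those above a threshold. It is written in Python rather than the C++ of the actual system, and the class name RuleQueue, its methods and the sample fitness values are hypothetical; only the idea of ordering individuals by descending fitness comes from the description above.

```python
import heapq


class RuleQueue:
    """Keeps (rule, fitness) pairs retrievable in descending order of fitness."""

    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker so that rules themselves are never compared

    def push(self, rule, fitness):
        # heapq is a min-heap, so the fitness is negated to pop the best rule first.
        heapq.heappush(self._heap, (-fitness, self._count, rule))
        self._count += 1

    def pop_best(self):
        neg_fitness, _, rule = heapq.heappop(self._heap)
        return rule, -neg_fitness

    def best_above(self, threshold):
        """Rules whose fitness exceeds the threshold, best first (the 'best-fit' rules)."""
        return [rule for neg_fitness, _, rule in sorted(self._heap)
                if -neg_fitness > threshold]


queue = RuleQueue()
queue.push((1, 0, 50), 0.9)
queue.push((0, 0, 100), 0.95)
queue.push((3, 7, 2), 0.1)
best_fit = queue.best_above(0.5)
top_rule, top_fitness = queue.pop_best()
```

With these sample values, best_fit lists (0, 0, 100) before (1, 0, 50), and the rule with fitness 0.1 is left out.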

7. Training and testing the rules for intrusion detection

7.1. Training and testing data subsets

For the purpose of this work, two subsets of KDD99Cup datasets for training and testing are derived. Each
connection has the corresponding marking that states whether it is a normal connection or a certain type of an
attack. The subset used for training contained 137 attacks and 839 normal connections. The testing subset
contained 234 attacks and 743 normal connections. Most of the selected connections are normal, which is
generally the case in real-world networks. The subsets contained three types of network attacks: portsweep,
smurf and neptune [16]. Portsweep is an attack that sweeps through many ports to determine which services are
supported on a single host. Smurf is a denial-of-service attack that sends a stream of ICMP "ECHO" requests
to the broadcast address of many subnets, resulting in a large continuous stream of "ECHO" replies that
floods the victim. Neptune (or SYN flood) is a denial-of-service attack in which the attacking system keeps
sending IP-spoofed packets requesting new connections faster than the victim system can close the pending
connections, i.e. faster than they expire. In some cases, the system may exhaust its memory, crash or be
rendered otherwise inoperative.
In terms of the features used to describe a particular type of attack, portsweep attacks exhibit a greater
range of durations, i.e. the duration feature usually has a value higher than '0', considering that it takes
some time to perform the attack, while for the other two attacks it has the value '0' in most cases.
Portsweep attacks also exhibit a wider range of dst_host_srv_serror_rate than the other two attacks. The
neptune attack exhibits a high value of dst_host_srv_serror_rate, usually around 100%, since the attacker
sends a stream of SYN packets to a port on the target machine and dst_host_srv_serror_rate reports the
percentage of connections that have SYN errors, while the duration and src_bytes features have the value '0'
in most cases. The smurf attack, on the other hand, has a high value of the src_bytes feature, a few times
larger than src_bytes in a normal connection, while the values of dst_host_srv_serror_rate and duration
remain '0'.
The training dataset contained 74 neptune, 24 smurf and 39 portsweep attacks. The testing dataset contained
87 neptune, 107 smurf and 40 portsweep attacks.

7.2. GA parameters deployed for training the rules

The system was trained using the fitness function defined in formula (1) with the following genetic algorithm
parameters: 1000 generations, 500 initial rules, one-point crossover with a probability of 0.6 in the first
and 0.7 in the second experiment, and a mutation rate of 0.01 in the first and 0.05 in the second experiment.
When the training process was finished, the 10 best-fit rules were selected for classifying the intrusions
and the normal connections in the testing dataset.

7.3. Obtained results

The obtained results are presented in Table 3. It can be observed from the table that in the first experiment
normal connections are classified 100% correctly, i.e. there are no false positives, but the trade-off is a
lower attack detection rate.
The obtained detection rates are similar to those obtained in [7,8], but the main advantage of our approach
is that we use only 3 features of a network connection, while they use 7 and 41 respectively. Hence, our
system can perform both training and intrusion detection faster and could be applied to high-speed networks
while maintaining high detection rates.
The system was also trained using support-confidence framework with the fitness function defined with the
formula (2) and the same parameters of the algorithm as defined before. In our experiments we have varied the
values of the weight coefficients w1 and w2. Obtained results are presented in Table 4.
From Table 4 we can conclude that the maximum detection rate of the rules trained with fitness function (2)
is 87.6% and that the set of rules trained with fitness function (1) gives a higher detection rate. The
detection rates of neptune and smurf are very high, but the detection rate of portsweep remains low, probably
because of the smaller sample used for training and the higher diversity of individual attacks with regard to
the features used to describe them. The false-positive rate in this case is also 0, which confirms the
previous statement that the GA approach provides a low false-positive rate. Again, we have maintained a high
detection rate while using only three features of a network connection.

Table 3
Detection rates (%) in experiments 1 and 2 of the system trained with fitness function (1)
Type of connection    Experiment 1    Experiment 2
Normal                100             98.38
Attack                92.74           94.87

Table 4
Detection rates (%) in experiments 1 and 2 of the system trained with fitness function (2)
              Experiment 1                                  Experiment 2
w1     w2     Neptune  Smurf  Portsweep  False-positive     Neptune  Smurf  Portsweep  False-positive
0      1      0        0      22.5       0                  0        0      30         1.6
0.1    0.9    0        0      7.5        0                  0        0      30         0
0.2    0.8    0        99     22.5       0                  0        0      22.5       0
0.3    0.7    0        0      22.5       0                  0        99     22.5       0
0.4    0.6    0        0      35         1.6                0        0      35         1.6
0.5    0.5    0        99     30         0                  0        99     30         0
0.6    0.4    100      99     30         0                  100      99     30         0
0.7    0.3    100      99     30         0                  100      99     30         0
0.8    0.2    100      99     30         0                  100      99     30         0
0.9    0.1    100      99     30         0                  100      99     30         0
1      0      100      99     30         0                  100      99     30         0
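The 87.6% figure can be reproduced from Table 4 and the attack counts of the testing subset, assuming that the overall detection rate is the per-class rates weighted by the number of test attacks of each type. That reading is an assumption of this sketch, since the text does not state how the figure was computed.

```python
# Attack counts in the testing subset (Section 7.1) and the best per-class
# detection rates from Table 4 (rows with w1 >= 0.6), in percent.
test_counts = {"neptune": 87, "smurf": 107, "portsweep": 40}
class_rates = {"neptune": 100.0, "smurf": 99.0, "portsweep": 30.0}

# Expected number of detected attacks, summed over the three attack types.
detected = sum(test_counts[k] * class_rates[k] / 100.0 for k in test_counts)
total = sum(test_counts.values())
overall = 100.0 * detected / total
print(round(overall, 1))  # 87.6
```

The weighted sum is 87 + 105.93 + 12 = 204.93 detected attacks out of 234, which rounds to the reported 87.6%.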

8. Application environments

The implementation of the algorithm presented here is intended for deployment in a wired network environment.
Nevertheless, the system could also be deployed in a wireless network environment as a supplementary means of
reinforcing security, considering the number of existing security flaws in wireless networks [26], provided
that a proper attack taxonomy customized to wireless network attacks and a proper training dataset are
available. The IEEE 802.11 protocol for wireless local area networks (WLANs) has security issues similar to
those of the Ethernet protocol for LAN or WAN networks [27], plus some new ones due to the security issues of
Wired Equivalent Privacy (WEP) [26,30]. This implies the need to deploy supplementary means of security
reinforcement.
For example, as the RTS/CTS (RequestToSend/ClearToSend) exchange is similar to TCP's synchronize
(SYN) and acknowledge (ACK) exchange [28], an attack whose mechanism is similar to that of the SYN-flood
(Neptune) attack could be mounted in a WLAN environment. The RTS frame is followed by a CTS frame to
ensure that no hidden node can transmit while another node out of the sender's range is also transmitting [29].
This allows other nodes in the broadcast area to suspend transmission until the current frame has been
transmitted. Many RTS frames could be sent in a flood, thus tying up the medium and causing a denial of
service (DoS). Such an RTS flood is possible because the senders of the packets are not validated, or are
validated improperly. This example, and the others presented in [27], suggest that some existing intrusion
detection systems for wired networks, if flexible enough, as the system presented here is, could be deployed
in a wireless environment after the necessary changes have been made.
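
To make the RTS flood scenario more concrete, the sketch below counts RTS frames per sender in a sliding time window and raises an alert when a threshold is exceeded. This is only an illustrative heuristic under stated assumptions, not part of the system presented here: the class name, the threshold, the window length and the example address are all hypothetical, and a real deployment would obtain the frames from a wireless capture interface.

```python
from collections import defaultdict, deque


class RtsFloodDetector:
    """Flags a sender that emits too many RTS frames within a time window."""

    def __init__(self, max_frames=50, window_seconds=1.0):
        self.max_frames = max_frames          # hypothetical threshold
        self.window_seconds = window_seconds  # hypothetical window length
        self._times = defaultdict(deque)      # sender -> recent RTS timestamps

    def observe(self, sender, timestamp):
        """Record one RTS frame; return True if the sender looks like a flooder."""
        times = self._times[sender]
        times.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.max_frames


detector = RtsFloodDetector(max_frames=3, window_seconds=1.0)
# Five RTS frames from the same (made-up) address within half a second.
alerts = [detector.observe("aa:bb:cc:dd:ee:ff", t * 0.1) for t in range(5)]
print(alerts)
```

A fixed threshold of this kind plays the role of a single hand-written rule; in the approach described in this paper, comparable rules would instead be evolved from a training dataset of wireless traffic.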

9. Conclusions

In this work we have applied a genetic algorithm approach to intrusion detection and presented a software
implementation of the proposed approach. A genetic algorithm was used to obtain classification rules for
intrusion detection, while principal component analysis was used to identify the most important features of
network connections.
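
The idea of using principal component analysis to rank features can be sketched as follows. This is a simplified illustration and not the selection procedure used in this work: it approximates only the first principal component of the correlation matrix by power iteration and ranks features by the absolute value of their loadings. The function names and the toy data are assumptions made for the example.

```python
import math


def leading_component(rows, iterations=200):
    """Approximate the first principal component by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    # A constant column has zero deviation; use 1.0 to avoid division by zero.
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(d)]
    z = [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]
    # Correlation matrix of the standardized features.
    corr = [[sum(row[a] * row[b] for row in z) / n for b in range(d)]
            for a in range(d)]
    # Deterministic, non-uniform start vector.
    v = [float(j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(corr[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def rank_features(rows, names):
    """Order feature names by decreasing absolute loading on the first component."""
    loadings = leading_component(rows)
    order = sorted(range(len(names)), key=lambda j: abs(loadings[j]), reverse=True)
    return [names[j] for j in order]


# Toy data (assumed): two strongly correlated features and one constant feature.
names = ["duration", "src_bytes", "land"]
rows = [[t, 100 * t + (t % 3), 0] for t in range(1, 21)]
ranking = rank_features(rows, names)
print(ranking)
```

In this toy example the constant feature carries no variance and is ranked last, which mirrors the intuition that uninformative features can be dropped before the rules are evolved.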
As types of intrusions in the real world are changing rapidly and becoming increasingly complex, an intrusion
detection system should be adaptive in order to cope with the evolution of the threat space. As our
system can upload and update new data and evolve new rules for detecting new intrusions, it is adaptive, and
it is also cost-effective because it is easy to maintain. Therefore, the GA approach, with the appropriate and simple
representation of the rules and effective fitness functions that can be applied, is easy to implement and
maintain. Moreover, our system is flexible enough to be used in different application environments, provided
that a proper attack taxonomy and a proper training dataset are available.
We have demonstrated that the GA approach can be used either to classify network connections as normal
or intrusive, or to further classify attacks by their type. Classification of attacks is not essential in intrusion
detection, considering that the goal of intrusion detection is to detect attacks in real time so that they can be
contained before causing any damage. By contrast, classification of attacks is very important in network
forensics, because good recovery from the damage caused by an attack requires knowing the exact type of the
attack and its mechanism.
The high attack detection rate and low false-positive rate demonstrate the advantages of applying this
technique to intrusion detection without any of the complementary techniques typically used with other
soft-computing techniques. Our system uses only three features of the network connections while maintaining
high detection rates, so it can perform intrusion detection quickly and could be applied to high-speed networks.

Appendix

See Tables A1–A3.

Table A1
List of features with their description and data types of KDD99Cup data set [4]
Feature | Description | Type
1. Duration | Duration of the connection | Continuous
2. Protocol type | Connection protocol (udp, tcp, icmp) | Discrete
3. Service | Destination service (e.g. telnet, ftp) | Discrete
4. Flag | Status flag of the connection | Discrete
5. Source bytes | Bytes sent from source to destination | Continuous
6. Destination bytes | Bytes sent from destination to source | Continuous
7. Land | 1 if connection is from/to the same host/port; 0 otherwise | Discrete
8. Wrong fragments | Number of wrong fragments | Continuous
9. Urgent | Number of urgent packets | Continuous
10. Hot | Number of hot indicators | Continuous
11. Failed logins | Number of failed logins | Continuous
12. Logged in | 1 if successfully logged in; 0 otherwise | Discrete
13. # Compromised | Number of "compromised" conditions | Continuous
14. Root shell | 1 if root shell is obtained; 0 otherwise | Discrete
15. su attempted | 1 if "su root" command attempted; 0 otherwise | Discrete
16. # Root | Number of "root" accesses | Continuous
17. # File creations | Number of file creation operations | Continuous
18. # Shells | Number of shell prompts | Continuous
19. # Access files | Number of operations on access control files | Continuous
20. # Outbound cmds | Number of outbound commands in an ftp session | Continuous
21. Is hot login | 1 if the login belongs to the "hot" list; 0 otherwise | Discrete
22. Is guest login | 1 if the login is a "guest" login; 0 otherwise | Discrete
23. Count | Number of connections to the same host as the current connection in the past two seconds | Continuous
24. srv count | Number of connections to the same service as the current connection in the past two seconds | Continuous
25. serror rate | % of connections that have "SYN" errors | Continuous
26. srv serror rate | % of connections that have "SYN" errors | Continuous
27. rerror rate | % of connections that have "REJ" errors | Continuous
28. srv rerror rate | % of connections that have "REJ" errors | Continuous
29. Same srv rate | % of connections to the same service | Continuous
30. Diff srv rate | % of connections to different services | Continuous
31. srv diff host rate | % of connections to different hosts | Continuous
32. dst host count | Count of connections having the same destination host | Continuous
33. dst host srv count | Count of connections having the same destination host and using the same service | Continuous
34. dst host same srv rate | % of connections having the same destination host and using the same service | Continuous
35. dst host diff srv rate | % of different services on the current host | Continuous
36. dst host same src port rate | % of connections to the current host having the same src port | Continuous
37. dst host srv diff host rate | % of connections to the same service coming from different hosts | Continuous
38. dst host serror rate | % of connections to the current host that have an S0 error | Continuous
39. dst host srv serror rate | % of connections to the current host and specified service that have an S0 error | Continuous
40. dst host rerror rate | % of connections to the current host that have RST errors | Continuous
41. dst host srv rerror rate | % of connections to the current host and specified service that have an RST error | Continuous

Table A2
List of important features for each dimension, part I
Dimension 1: src_bytes
Dimension 2: duration, src_bytes
Dimension 3: duration, src_bytes, dst_host_srv_serror_rate
Dimension 4: duration, src_bytes, serror_rate, dst_host_srv_serror_rate
Dimension 5: duration, src_bytes, serror_rate, dst_bytes, dst_host_srv_serror_rate
Dimension 6: duration, flag, src_bytes, serror_rate, dst_bytes, dst_host_srv_serror_rate
Dimension 7: duration, flag, src_bytes, hot, serror_rate, dst_bytes, dst_host_srv_serror_rate
Dimension 8: duration, flag, src_bytes, srv_rerror_rate, hot, serror_rate, dst_bytes, dst_host_srv_serror_rate

Table A3
List of important features for each dimension, part II
Dimension 9: duration, service, flag, src_bytes, srv_rerror_rate, hot, dst_host_srv_serror_rate, serror_rate, dst_bytes
Dimension 10: duration, service, flag, src_bytes, srv_rerror_rate, hot, count, serror_rate, dst_bytes, dst_host_srv_serror_rate
Dimension 11: duration, protocol_type, service, flag, src_bytes, srv_rerror_rate, hot, count, serror_rate, dst_bytes, dst_host_srv_serror_rate
Dimension 12: duration, protocol_type, service, flag, src_bytes, srv_rerror_rate, hot, count, serror_rate, dst_bytes, dst_host_srv_serror_rate, num_root
Dimension 13: duration, protocol_type, service, flag, src_bytes, srv_rerror_rate, hot, count, serror_rate, dst_bytes, dst_host_srv_serror_rate, num_root, srv_count
Dimension 14: duration, protocol_type, service, flag, src_bytes, srv_rerror_rate, hot, count, serror_rate, dst_bytes, dst_host_srv_serror_rate, num_root, srv_count, num_compromised
Dimension 15: duration, protocol_type, service, flag, src_bytes, srv_rerror_rate, hot, count, serror_rate, dst_bytes, dst_host_srv_serror_rate, num_root, srv_count, num_compromised, wrong_fragment
Dimension 16: duration, protocol_type, service, flag, src_bytes, srv_rerror_rate, hot, count, serror_rate, dst_bytes, dst_host_srv_serror_rate, num_root, srv_count, num_compromised, wrong_fragment, srv_diff_host_rate

References

[1] Balasubramaniyan JS, Garcia-Fernandez JO, Isacoff D, Spafford E, Zamboni D. An architecture for intrusion detection using
autonomous agents. In: Proceedings of the 14th annual computer security applications conference, 1998.
[2] Heberlein LT, Mukherjee B, Levitt KN, Mansur DL. Towards Detecting Intrusions in a Networked Environment. In: Proceedings of
14th department of energy computer security group conference, 1991.
[3] White B, Fisch EA, Pooch UW. Cooperating security managers: a peer-based intrusion detection system. IEEE Network J
1996(Jan/Feb):20–3.
[4] Yao JT, Zhao SL, Saxton LV. A study on fuzzy intrusion detection. In: Belur V. Dasarathy, editor. In Proceedings of SPIE Vol. 5812,
Data Mining, Intrusion Detection, Information Assurance, And Data Networks Security, 28 March–1 April 2005, Orlando, Florida,
USA: SPIE, Bellingham, WA; 2005. p. 23–30.
[5] KDD Cup 1999 data. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, October 1999.
[6] McHugh J. Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA IDS evaluation as performed by Lincoln
laboratory. ACM Trans Inform Syst Security 2000;3(4):262–94, November.
[7] Gong RH, Zulkernine M, Abolmaesumi P. A software implementation of a genetic algorithm based approach to network intrusion
detection. In: Proceedings of the sixth international conference on software engineering, artificial intelligence, networking
and parallel/distributed computing and first ACIS international workshop on self-assembling wireless networks (SNPD/SAWN‘05),
2005.
[8] Lu W, Traore I. Detecting new forms of network intrusion using genetic programming. Comput Intell 2004;20(3):470–90.
[9] Kayacik HG, Zincir-Heywood AN, Heywood MI. Selecting Features for Intrusion Detection: A Feature Relevance Analysis on
KDD99 Intrusion Detection Datasets. <http://www.cs.dal.ca/projectx/>.
[10] Fodor IK. A Survey of Dimension Reduction Techniques. <http://llnl.gov/CASC/sapphire/pubs>.
[11] Jolliffe IT. Principal component analysis. Springer Verlag; 1986.
[12] Mardia KV, Kent JT, Bibby JM. Multivariate analysis. Probability and mathematical statistics. Academic Press; 1995.
[13] Holland J. Adaptation in natural and artificial systems. Ann Arbor: The University of Michigan Press; 1975.
[14] Pohlheim H. Genetic and Evolutionary Algorithms: Principles, Methods and Algorithms. <http://www.geatbx.com/docu/index.html>,
accessed in 2006.
[15] Chittur A. Model Generation for an Intrusion Detection System Using Genetic Algorithms.
<http://www1.cs.columbia.edu/ids/publications/gaids-thesis01.pdf>, accessed in 2006.
[16] Kendall K. A database of computer attacks for the evaluation of intrusion detection systems.
<http://www.kkendall.org/files/thesis/krkthesis.pdf>, accessed in 2005.
[17] Pan Z, Chen S, Hu G, Zhang D. Hybrid Neural Network and C4.5 for Misuse Detection. In: Proceedings of the second international
conference on machine learning and cybernetics, vol. 4. 2003. pp. 2463–2467 [Nov.].
[18] Folino G, Pizzuti C, Spezzano G. GP ensemble for distributed intrusion detection systems. In ICAPR 2005, 3rd international
conference on advances in pattern recognition, LNCS, Springer Verlag, 3686/2005, Bath, UK, August 2005.
[19] Laskov P, Düssel P, Schäfer C, Rieck K. Learning intrusion detection: supervised or unsupervised? CIAP: international conference
on image analysis and processing No. 13, vol. 3617. Italy: Cagliari; 2005, September.
[20] Guan Y, Ghorbani AA, Belacel N. Y-means: a clustering method for intrusion detection. In: Canadian conference on electrical and
computer engineering, IEEE CCECE, vol. 2. 2003. p. 1083–6.
[21] Kim DS, Ha-Nam Nguyen, Jong Sou Park. Genetic algorithm to improve SVM based network intrusion detection system. In:
Proceedings of the 19th international conference on advanced informational networking and applications, vol. 2. 2005. p. 155–8.
[22] Yao JT, Zhao SL, Saxton LV. A study on fuzzy intrusion detection. data mining, intrusion detection, information assurance, and
data networks security 2005. Orlando FL: 2005; [28–29 March].
[23] Bouzida Y, Gombault S. Eigenconnections to intrusion detection. In: Proceedings of the 19th IFIP international information security
conference. Kluwer Academic; 2004. August.

[24] Joshi SS, Phoha VV. Investigating hidden Markov model capabilities in anomaly detection. ACM Regional Conference. In
Proceedings of the 43rd annual southeast regional conference, vol. 1, 2005; p. 98–103.
[25] <http://www.developer.net.au/A_brief_history_of_Malware.htm>.
[26] Sklavos N, Koufopavlou O. Mobile communications world: security implementations aspects – a state of the art. CSJM J, vol.
11. Institute of Mathematics and Computer Science; 2003, Number 2(32), p. 163–87.
[27] Lough DL. A Taxonomy of Computer Attacks with Applications to Wireless Networks. PhD Dissertation.
<http://scholar.lib.vt.edu/theses/available/etd-04252001-234145/unrestricted/lough.dissertation.pdf>.
[28] Postel J. Transmission control protocol: DARPA Internet Program protocol specification. Request for Comments (RFC) 793,
September 1981. Internet Engineering Task Force; <http://www.ietf.org>.
[29] O'Hara B, Petrick A. The IEEE 802.11 handbook: a designer's companion. Standards Information Network. 3 Park Avenue, New
York, New York, 10016-5997: IEEE Press; 1999.
[30] Borisov N, Goldberg I, Wagner D. Intercepting mobile communications: the insecurity of 802.11. In: The seventh annual
international conference on mobile computing and networking. Rome: 2001. p. 180–9.
[31] Hocking RR. Development in linear regression methodology: 1959–1982. Technometrics 1983;25:219–49.
[32] Jolliffe IT. Discarding variables in principal component analysis. I: artificial data. Appl Statist 1972;21:160–72.
[33] Krzanowski WJ. Selection of variables to preserve multivariate data structure, using principal component analysis. Appl Stat – J Roy
Stat Soc Series C 1987;36:22–33.

Zorana Banković received the title of Electrical Engineer from the Faculty of Electrical Engineering, University of
Belgrade (Serbia), in 2005. Currently she is a PhD student at the Technical University of Madrid. Her main research
interests are related to network security and FPGA implementations of common intrusion detection algorithms.

Dušan Stepanović received the title of Electrical Engineer from the University of Belgrade, Serbia, in 2004.
Currently he is a PhD student at the University of California, Berkeley. His research interests include the design of
high-speed and low-power digital integrated circuits and the general area of signal processing and its applications to
image and video compression.

Slobodan Bojanić received his B.Sc., M.Sc. and Ph.D. degrees from the University of Belgrade (Serbia) in 1986, 1991
and 1997, respectively. Currently he is a Research Scientist at the Technical University of Madrid. His research
interests are in the areas of data security related to FPGA implementations for accelerating cryptographic algorithms,
cryptanalysis, network security and bioinformatics.

Octavio Nieto-Taladriz received his B.Sc. and Ph.D. degrees from the Universidad Politécnica de Madrid in 1984 and
1989, respectively. Currently he is Full Professor and Head of the Departamento de Ingeniería Electrónica at the
ETSI Telecomunicación of the Universidad Politécnica de Madrid. His main research and development fields are
embedded systems, high-performance digital architectures mainly focused on broadband radio communications,
and the development and integration of services and applications for mobility over heterogeneous communication
platforms, security, ambient intelligence and domotics.
