
Feature selection to detect botnets using machine learning algorithms
Francisco Villegas Alejandre, Nareli Cruz Cortés, and Eleazar Aguirre Anaya

Instituto Politécnico Nacional,
Centro de Investigación en Computación,
Laboratory of Cybersecurity, México, D.F.
email: b140572@sagitario.cic.ipn.mx, {nareli, eaguirre}@cic.ipn.mx

Abstract: This paper presents a novel feature selection method for detecting botnets during their Command and Control (C&C) phase. A major problem is that researchers have proposed features based on their expertise, but there is no method to evaluate these features, even though some of them could yield a lower detection rate than others. To this end, we find the feature set, based on connections of botnets in their C&C phase, that maximizes the detection rate of these botnets. A Genetic Algorithm (GA) was used to select the set of features that gives the highest detection rate. The machine learning algorithm C4.5 was used to classify connections as belonging or not belonging to a botnet. The datasets used in this paper were extracted from the ISOT and ISCX repositories. Tests were done to find the best parameters for the GA and for the algorithm C4.5. We also performed experiments to obtain the best set of features for each botnet analyzed (specific) and for each type of botnet (general). The results, shown at the end of the paper, exhibit a considerable reduction of features and a higher detection rate than the related work.

Keywords: Malware Detection, Botnet, Feature Selection, Machine Learning

I. INTRODUCTION

Due to the growth of the Internet, the number of people attempting cyber-crime has also increased. Botnets are one of the tools used by criminals to attack the Internet.

Botnets are collections of bots (infected computers) that are controlled remotely by a botmaster through a Command and Control (C&C) channel. Botnets are used to perform different types of attacks such as Distributed Denial of Service (DDoS), credential theft, spam, phishing, etc. Each member, that is, bot and botmaster, communicates with the others.

Botnets are classified as centralized or decentralized. In centralized botnets, bots periodically contact the C&C server to receive instructions. Some of the communication protocols used are HTTP and IRC. In decentralized botnets, also called "peer to peer" (P2P), only one of the bots receives the message directly from the C&C server; this bot is then responsible for transmitting the message to other bots, and those bots to more bots. Some of the communication protocols used are Overnet, Kademlia, and HTTP2P.

Leonard et al. divided the life cycle of a botnet into four phases: training, C&C, attack, and post-attack [7]. During the training phase, the botmaster infects other machines through the Internet; these infected machines become bots controlled by the botmaster and receive instructions from the botmaster during the C&C phase. During the attack phase, bots perform malicious activities based on the instructions received. During the post-attack phase, some bots could be detected and removed; for this reason, the botmaster occasionally analyzes the botnet to detect which bots are still active.

The detection of botnets during the C&C phase is very important, mainly because it allows the detection of bots before the attack phase. Furthermore, if all the C&C servers are identified, the botnet would be disabled.

In this research, we present a proposal for obtaining the set of features to detect botnets in their C&C phase using connections between devices. The features used were extracted from related work on algorithms that detect botnets in the C&C phase. Network connections were used to identify the behavior of botnets. Packets were organized into connections identified by a 5-tuple of the form <source IP address, destination IP address, source port, destination port, protocol>. The Genetic Algorithm (GA) was used to solve the optimization problem of finding the set of features with the best detection rate for botnets in the C&C phase. The machine learning algorithm C4.5 is responsible both for evaluating each solution generated by the GA and for performing the classification.

This paper is organized as follows: Section II reviews related algorithms to detect botnets, Section III describes the proposed solution for feature selection, Section IV explains the different types of experiments carried out, Section V presents the results of the experiments, and Section VI gives the conclusions of the paper.

II. RELATED WORK WITH ALGORITHMS TO DETECT BOTNETS

Some related work on algorithms based on connections with time intervals to detect botnets in the C&C phase is presented in this section. Only the first work performed feature selection:

E. Beigi et al. [1] performed a selection of features to detect
botnets. The C4.5 machine learning algorithm was used. They also used a greedy algorithm called stepwise for feature selection. The datasets used were obtained from three different agencies: ISOT, ISCX, and the Malware Capture Facility Project. They performed different experiments in which they separated features into four groups based on their type: bytes, packets, time, and behavior. The final set of features shows a 99% detection rate on a dataset with a limited number of botnets. In another experiment, which contained some botnets for the training phase and a much more diverse set of botnets for the test phase, the detection rate was 75%.

K. Huseynov et al. [2] compared the K-means algorithm and the Ant Colony System algorithm for detecting decentralized botnets. They used host-based features and proposed a method able to detect botnets quickly and accurately. Their results show that K-means performs better than Ant Colony System. K-means obtained a detection rate of 82.1% with 2.4% false positives, whereas Ant Colony System scored a very low detection rate of 67.8% and a very high false positive rate of 23.5%. The dataset used was ISOT.

Saad et al. [3] compared five machine learning techniques commonly used to detect decentralized botnets. The results of the experiments on the ISOT dataset show that it is possible to detect botnets with precision during the Command and Control (C&C) phase. The classification techniques used were Support Vector Machine (SVM), Naive Bayes, Gaussian Classifier, Artificial Neural Network, and Nearest Neighbor; the SVM algorithm obtained the best detection rate, almost 98%.

D. Zhao et al. [4] used the RepTree algorithm to detect botnets using the ISOT dataset. The results show a detection rate of 98.1% for a reduced dataset and 98.3% for the full set with time windows of 300 seconds, with 8.58 seconds and 29.4 seconds of training, respectively. They analyzed the detection rate and false positives with various time windows, and the best time window was 300 seconds. Furthermore, they built a server to detect botnets in real time and ran tests with 2 centralized botnets, BlackEnergy and Weasel, which gave a 100% detection rate.

P. Narang et al. [5] replaced the traditional 5-tuple flow-based detection approach with a 2-tuple conversation-based approach, which is port-oblivious and protocol-oblivious and does not require Deep Packet Inspection. They named their detector PeerShark. It also classified different P2P applications such as eMule and uTorrent, and it detected Storm and Waledac P2P traffic with a detection rate of more than 95%. They ran tests with 3 different algorithms (Bayesian Network, C4.5, and AdaBoost with REP trees) to detect the decentralized botnets, and the best algorithm was C4.5. The dataset used was ISOT.

III. PROPOSED SOLUTION

Our proposal evaluates the features to obtain the set of features with the highest detection rate; this method improves the detection rate of the related work.

In our proposal, the initial feature vector is borrowed from the features presented in various related works that detect botnets in the C&C phase using connections within time intervals. In general, our proposal is to use a GA to select the set of features that best detects the botnets. The initial vector is the input of our GA, which is responsible for generating different solutions; these solutions are sets of features. The algorithm C4.5 is responsible for evaluating each set of features generated by the GA and delivers the detection rate obtained. As a result of the interaction between the GA and the algorithm C4.5, the best set of features to detect botnets in their C&C stage is obtained. Figure 1 shows the proposed solution to obtain the feature set.

In the related work, only one work performed feature selection, using the greedy algorithm stepwise. "A greedy algorithm is an algorithmic paradigm that follows the problem solving heuristic of making the locally optimal choice at each stage" [14] with the hope of finding a global optimum. A greedy strategy does not in general produce an optimal solution. Furthermore, in that work the features were separated into groups and the best features of each group were obtained, so an evaluation of each individual feature was not made. The rest of the related works proposed their own features based on their expertise, without using a method to evaluate these features.

Fig. 1. The general idea of the proposed solution.

A. Initial feature vector

The initial set of features used in this work was borrowed from the sets of features in the related work to detect botnets in the C&C stage, using features without time intervals. This feature vector is shown in Table I: the first column gives the name of the feature, the second column its description, and the third column the references from which the feature was extracted. This initial set has 19 features.
TABLE I
INITIAL FEATURE VECTOR.

Name         Description                                                  Reference
BytesAB      Bytes from origin to destination                             [6]
BytesBA      Bytes from destination to origin                             [6]
NPackets     Number of transmitted packets                                [3], [4], [5], [6]
NPacketsAB   Number of packets from origin to destination                 [6]
NPacketsBA   Number of packets from destination to origin                 [6]
Duration     Duration of connection                                       [1], [5], [6]
APL          Average payload length                                       [1], [2], [3], [4]
DPS          Number of different packet sizes for connection              [1]
Payload      Total number of bytes per connection excluding the header    [3]
TBT          Total bytes transmitted                                      [1], [3], [5], [6]
Flen         Length of the first packet of the connection                 [1], [2], [3], [4]
NNP          Number of null packets exchanged (payload size 0)            [1]
NSP          Number of small packets exchanged (length 63-400 bytes)      [1]
PSP          Percentage of small packets exchanged                        [1]
IPP          Inbound packets over total packets                           [1], [3]
OPP          Outbound packets over total packets                          [1], [3]
PV           Standard deviation of payload                                [1], [2], [4]
BS           Average bits per second                                      [1]
PPS          Average packets per second                                   [1]

B. Genetic Algorithm (GA)

The GA searches for the set of features that best detects botnets in the C&C phase by trying different sets. Individuals are represented as binary vectors. Each entry represents a feature: a value of 1 means that the feature is included, and a value of 0 means that the feature is not included in the solution (feature set). The length of the vector is 19.

Individuals are represented as in Figure 2, where the representation of the individual is at the top of the figure and the initial feature vector that the individual refers to is at the bottom. In this example, the first feature, BytesAB, is not included, but the second feature, BytesBA, is included, and so on.

Fig. 2. Individual representation in the GA.

The GA handles a set of N possible solutions, that is, a set of N binary strings, each of length 19. The set of N strings is called the population, and each potential solution (binary string) is called an individual. The initial bit values are chosen randomly. Each individual is evaluated by the classification algorithm C4.5, which obtains its detection rate. The fitness value of an individual is precisely the detection rate given by the algorithm C4.5. Figure 3 illustrates the main loop of the GA.

Fig. 3. Overall Genetic Algorithm.

C. Classifier C4.5

The classification process is performed by the algorithm C4.5, which evaluates individuals, that is, possible solutions generated by the GA.

The classification process receives as input an individual (a set of features generated by the GA), which is applied to the training set and the testing set. These datasets contain connections belonging to a botnet as well as normal connections. The C4.5 classification algorithm first receives the training set and carries out the learning stage, generating a tree-based model. To build the model, the feature with the highest gain (Fx1) forms a node that splits on the remaining features (Fx2, Fx3, ..., Fx19), generating as many branches as there are distinct values in the feature with the highest gain. In the deduction stage, the model generated in the learning stage receives the testing set, which contains the same features as the training set. The deduction stage yields the detection rate, which is the measure used to evaluate the individual.

Figure 4 illustrates the classification process performed to evaluate an individual.

Fig. 4. Classification to evaluate an individual.

D. Best parameters in algorithms C4.5 and GA

Tests were done to find the best parameters for the GA and for the algorithm C4.5. These parameters were obtained by testing all combinations over a stepped grid of values, where the best parameters are those that gave the highest detection rate.
In the classification algorithm C4.5, two parameters were tuned: the confidence factor, in the interval [0, 1], and the minimum number of objects, whose values start at 1. For the confidence factor we used steps of 0.1, and for the minimum number of objects steps of 1. The best parameters were obtained by testing each of the combinations; thus we performed 10x10 = 100 experiments, using the ISOT training set and test set and keeping all 19 features of the feature vector.

The best parameters found for the classifier C4.5 are a confidence factor = 0.2 and a minimum number of objects = 1.

In the GA, two parameters were tuned: the crossover rate (CR), in the interval [0, 1], and the mutation rate (MR), in the interval [0, 1]. For the crossover rate we used steps of 0.1. For the mutation rate, on the other hand, we stayed at a low level using only two values, 1/L and 2/L, where L = 19 is the number of features in the feature vector. The best parameters were obtained by testing each of the combinations; thus we performed 5x2 = 10 experiments, using the ISOT training set and testing set and keeping the classifier at the best parameters obtained, which were a confidence factor = 0.2 and a minimum number of objects = 1.

The best parameters found for the GA are crossover rate = 0.5 and mutation rate = 2/L.

IV. EXPERIMENTS

We performed specific experiments and general experiments using two datasets containing decentralized and centralized botnets in order to obtain the best set of features with the highest detection rate. The datasets used were:
· ISOT, which contains decentralized botnets.
· ISCX, which contains centralized botnets.

The experiments performed were:

Specific, which aim to find the feature set of a specific botnet. These experiments were categorized as:
· Experiments for decentralized botnets
· Experiments for centralized botnets

General, which aim to find the feature set of a type of botnet and of botnets in general. These experiments were categorized as:
· Experiment for decentralized botnets
· Experiment for centralized botnets
· Experiment for decentralized and centralized botnets

In the next subsections, the datasets used for the experiments are described first, followed by a description of the specific and general experiments carried out.

A. Datasets

This subsection describes the two datasets used for the experiments. These datasets contain representative data for centralized and decentralized botnets as well as normal traffic, focusing only on the connections of this traffic. The training set and the testing set are extracted from these datasets. The two datasets are the following:

1) ISOT: This dataset was created by the Information Security and Object Technology (ISOT) research laboratory at the University of Victoria [10]. Basically, it is a mixture of many existing datasets (malicious and non-malicious). The malicious traffic in the ISOT dataset was extracted from the French chapter of the Honeynet Project [11] and includes three different decentralized botnets: Waledac, Storm, and Zeus. The non-malicious traffic was extracted from two agencies: the first part from the Ericsson Research Laboratory in Hungary [12], and the second from the Lawrence Berkeley National Lab (LBNL) [13].

This combination of normal traffic is important because the dataset from the Ericsson Lab includes traffic from various applications such as HTTP traffic, search engines, games like World of Warcraft, and traffic from the BitTorrent client Azureus. The LBNL traffic, on the other hand, comes from a midsize business network and consists of 5 datasets.

In total, the ISOT dataset contains 14.1 GB of data in pcap format. The description of this dataset is shown in Table II, in which the first column gives the botnet, the second the number of connections, and the third the type of botnet.

TABLE II
DESCRIPTION OF DATASET ISOT.

Botnet            Number of connections   Type
ISOT Storm        22,888                  Decentralized
ISOT Waledac      34,442                  Decentralized
ISOT NO Botnet    77,586                  NO Botnet

2) ISCX: This dataset was created by the Information Security Centre of Excellence (ISCX) research laboratory at the University of New Brunswick [8]. It was generated in a physical test environment using actual devices that generate traffic (SSH, HTTP, and SMTP). It contains the centralized botnets Neris and RBot. In total, the ISCX dataset contains 5.6 GB of data in pcap format. The description of this dataset is shown in Table III, in which the first column gives the botnet, the second the number of connections, and the third the type of botnet.

TABLE III
DESCRIPTION OF DATASET ISCX.

Botnet            Number of connections   Type
ISCX Neris        33,084                  Centralized
ISCX RBot         34,217                  Centralized
ISCX NO Botnet    76,175                  NO Botnet

B. Parameters in experiments

The GA was programmed in Java. The parameters applied to the GA are the following:
· Generations: 50
· Individuals: 50
· MR: 2/19
· CR: 0.15
· Selection: Tournament
· Crossover: Uniform

The classification algorithm C4.5 is the one implemented in Weka [9]. The parameters applied to the C4.5 algorithm are the following:
· Confidence factor: 0.2
· Minimum number of objects: 1

To divide the dataset used in each experiment into the training set and the test set, we used Weka with 3 folds [9].

C. Specific experiments

The main goal of these experiments is to obtain the set of features that maximizes the detection rate for a specific botnet. We conducted four different experiments.

The botnets evaluated in the experiments were Storm, Waledac, Neris, and RBot, shown in Tables VIa, VIb, VIc, and VId respectively. These tables show the data used in our experiments. The first column corresponds to the botnet in the dataset, the second column to the class or label that identifies the data, and the third to the number of connections used for that botnet.

TABLE VIa
SPECIFIC EXPERIMENT 1 FOR STORM.

Botnet       Class        Connections
Storm        Storm        22,888
Waledac      No Storm     11,444
NO Botnet    No Storm     11,444

TABLE VIb
SPECIFIC EXPERIMENT 2 FOR WALEDAC.

Botnet       Class        Connections
Storm        No Waledac   17,221
Waledac      Waledac      34,442
NO Botnet    No Waledac   17,221

TABLE VIc
SPECIFIC EXPERIMENT 3 FOR NERIS.

Botnet       Class        Connections
Neris        Neris        33,084
RBot         No Neris     16,542
NO Botnet    No Neris     16,542

TABLE VId
SPECIFIC EXPERIMENT 4 FOR RBOT.

Botnet       Class        Connections
Neris        No RBot      17,109
RBot         RBot         34,217
NO Botnet    No RBot      17,108

D. General experiments

The main goal of these experiments is to obtain the set of features that maximizes the detection rate for a type of botnet and for botnets in general. We conducted three different experiments.

The first general experiment, called ISOT, contains the decentralized botnets Storm and Waledac, shown in Table VIIa. The second general experiment, called ISCX, contains the centralized botnets Neris and RBot, shown in Table VIIb. The third general experiment, called ISOT+ISCX, contains both centralized and decentralized botnets (Storm, Waledac, Neris, and RBot), shown in Table VIIc. These tables show the data used in our experiments. The first column corresponds to the botnet in the dataset, the second column to the class or label that identifies the data, and the third to the number of connections used for that botnet.

TABLE VIIa
GENERAL EXPERIMENT 1 FOR ISOT.

Botnet       Class        Connections
Storm        Botnet       22,888
Waledac      Botnet       34,442
NO Botnet    No Botnet    77,586

TABLE VIIb
GENERAL EXPERIMENT 2 FOR ISCX.

Botnet       Class        Connections
Neris        Botnet       33,084
RBot         Botnet       34,217
NO Botnet    No Botnet    76,175

TABLE VIIc
GENERAL EXPERIMENT 3 FOR ISOT+ISCX.

Botnet       Class        Connections
Storm        Botnet       22,888
Waledac      Botnet       34,442
NO Botnet    No Botnet    77,586
Neris        Botnet       33,084
RBot         Botnet       34,217
NO Botnet    No Botnet    76,175

V. RESULTS

In this section the results of the experiments, the comparison, and the analysis of results are presented. We show the feature set and the detection rate obtained for each experiment, as well as the comparison against the related work.

A. Results of the specific features

To verify the robustness of our proposal, we repeated the specific experiments 10 times; the results are shown in Table VI. The first column of the table gives the botnet, the second column the best detection rate obtained in the 10 runs, the third column the worst detection rate obtained in the 10 runs, the fourth column the average detection rate over the 10 runs, and the last column the standard deviation of the detection rates over the 10 runs.

TABLE VI
SPECIFIC STATISTICS 10 RUNS.

Botnet     Best      Worst     Average   Standard deviation
Storm      97.58%    96.17%    96.61%    0.607036
Waledac    97.12%    97.04%    97.09%    0.024701
Neris      84.92%    84.56%    84.74%    0.106575
RBot       99.58%    99.57%    99.58%    0.005782
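The columns of Table VI are simple aggregates over the per-run detection rates. The sketch below shows one way such a summary could be computed; it is an assumption rather than the authors' procedure. The run values are invented purely for illustration (the paper reports only the aggregates), and since the paper does not state whether the sample or the population standard deviation was used, the sample form is assumed here.

```python
from statistics import mean, stdev


def summarize_runs(rates):
    """Summarize detection rates (in percent) from repeated GA runs."""
    return {
        "best": max(rates),
        "worst": min(rates),
        "average": mean(rates),
        "std": stdev(rates),  # sample standard deviation (assumption)
    }


if __name__ == "__main__":
    # Hypothetical detection rates for 10 runs; not taken from the paper.
    example_runs = [96.2, 96.4, 96.5, 97.1, 96.3, 96.6, 96.8, 96.4, 96.7, 97.5]
    for name, value in summarize_runs(example_runs).items():
        print(f"{name}: {value:.4f}")
```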
Table VII shows the best set of features obtained in each of the specific experiments; the results are the best out of 10 runs. The first column of the table gives the botnet, the second column the best detection rate, the third column the false positives, and the fourth column the individual of the GA corresponding to the best feature set obtained.

TABLE VII
RESULTS OF THE SPECIFIC FEATURES.

Botnet     Detection rate   False positives   Feature set
Storm      97.58%           2.08%             [1010100100111100000]
Waledac    97.12%           0.36%             [0011010110110000100]
Neris      84.92%           5.11%             [1101011110100110100]
RBot       99.58%           1.04%             [1101101110101100000]

B. Results of the general features

To verify the robustness of our proposal, we repeated the general experiments 10 times; the results are shown in Table VIII. The first column of the table gives the botnet, the second column the best detection rate obtained in the 10 runs, the third column the worst detection rate obtained in the 10 runs, the fourth column the average detection rate over the 10 runs, and the last column the standard deviation of the detection rates over the 10 runs.

TABLE VIII
GENERAL STATISTICS 10 RUNS.

Botnet       Best      Worst     Average   Standard deviation
ISOT         99.46%    99.42%    99.44%    0.014946
ISCX         95.58%    95.51%    95.55%    0.023309
ISOT+ISCX    96.52%    96.51%    96.52%    0.003862

Table IX shows the best set of features obtained in each of the general experiments; the results are the best out of 10 runs. The first column of the table gives the botnet, the second column the best detection rate, the third column the false positives, and the fourth column the individual of the GA corresponding to the best feature set obtained.

TABLE IX
RESULTS OF THE GENERAL FEATURES.

Botnet       Detection rate   False positives   Feature set
ISOT         99.46%           0.57%             [1010110011111100001]
ISCX         95.58%           2.24%             [1100011110111100111]
ISOT+ISCX    96.52%           1.23%             [1101011111111001111]

C. Comparison of results

Table X shows a comparison of our results with the related work. The results listed for the related work are those reported in the corresponding papers. The first column of the table gives the reference of the related work, the second column the algorithm used, the third column the type of connections (with or without time intervals), the fourth column the dataset used, and the fifth and sixth columns the detection rate and the false positives.

TABLE X
COMPARISON OF RESULTS.

Reference         Algorithm              Type of connection   Dataset   Detection rate   False positives
K. Huseynov [2]   K-Means                With intervals       ISOT      82.1%            2.4%
K. Huseynov [2]   Ant Colony System      With intervals       ISOT      67.8%            23.5%
S. Saad [3]       SVM                    With intervals       ISOT      97.8%            5.1%
D. Zhao [4]       RepTree                With intervals       ISOT      98.3%            0.01%
P. Narang [5]     C4.5                   With intervals       ISOT      98.7%            0.04%
This proposal     Best set of features   Without intervals    ISOT      99.46%           0.57%
This proposal     Using 19 features      Without intervals    ISOT      98.25%           1.9%

The specific experiments obtained the feature sets of decentralized botnets and of centralized botnets. For the decentralized botnets Storm and Waledac, 4 features appear in both feature sets: NPackets, DPS, Flen, and NNP. Storm and Waledac each obtained a reduced feature set of 8 features, which means that they share 50% of their features. The detection rate in these experiments is more than 97%.

The results for the centralized botnets Neris and RBot show similarities too. 6 features are shared: BytesBA, NPacketsBA, Duration, APL, Payload, and NNP. Neris obtained a feature set of length 9 and RBot a feature set of length 8, which means a similarity of 66% for Neris and 75% for RBot. The detection rate in these experiments is 99.58% for RBot and 84.92% for Neris. Neris obtained a lower detection rate than the others, which means that the feature vector used is not as accurate for detecting this botnet as it is for the others.

In the general experiments to detect a type of botnet, we obtained reduced feature sets of 10 and 11 features for ISOT (decentralized botnets) and ISCX (centralized botnets) respectively. In ISOT we keep 9 of the features obtained in the specific experiments, and in ISCX 8. ISOT and ISCX share 7 features: NPacketsAB, APL, Payload, TBT, Flen, NNP, and PV, which means a similarity of 70% for ISOT and 63% for ISCX. The detection rate in these experiments is 99.46% for ISOT and 95.58% for ISCX, and it could be higher if Neris had obtained a higher detection rate.

In the last experiment, which shows the features needed to detect botnets of any type, we combined the datasets ISOT and ISCX and called this experiment ISOT+ISCX. In this experiment we obtained only 2 features that were not included in each general experiment, for a total of 13; the features added
were BytesBA and PSP. The detection rate of this experiment is 96.52%, and it could be higher if Neris had obtained a higher detection rate.

As shown in our comparison of results, we obtained a significant improvement in the detection rate over other representative works of the state of the art. This is because most of them did not use feature selection (they proposed their features based on their expertise, and the features were not evaluated), and the related work that did perform feature selection used a greedy algorithm, which follows only one route of experiments without evaluating that route. Our proposed solution uses a GA to guide the experiments, creating and evaluating feature sets, in order to obtain the feature set with the best detection rate.

VI. CONCLUSIONS

This paper presented a proposal to select a set of features to detect botnets in the C&C phase, using a GA as the optimization algorithm and a C4.5 classifier to evaluate the individuals in the GA. We conclude the following:
· The results show a significant improvement over other representative works of the state of the art: we obtained a higher detection rate using the same dataset.
· Our results show a reduction of features with respect to the original feature vector, although the proposal did not explicitly aim for it. These reduced feature sets obtained a higher detection rate than the original vector.

Our main objective was accomplished because our proposal obtained a feature set with a better detection rate than the related work. We also addressed the major problem of evaluating the feature vector.

Possible future work includes testing with larger datasets that include more botnets, to obtain the best feature set for these datasets, and testing with a larger feature vector, to evaluate more features that may obtain a better or worse detection rate.

ACKNOWLEDGMENT

The authors would like to thank the Instituto Politécnico Nacional (IPN), the Centro de Investigación en Computación (CIC), and the Consejo Nacional de Ciencia y Tecnología (CONACYT) for their support in this research.

REFERENCES

[1] E. Beigi, H. Jazi, N. Stakhanova and A. Ghorbani, Towards Effective Feature Selection in Machine Learning-Based Botnet Detection Approaches, in IEEE Conference on Communications and Network Security (CNS), Oct 29-31, 2014, San Francisco, CA.
[2] K. Huseynov, K. Kim and P. Yoo, Semi-supervised Botnet Detection Using Ant Colony System, in 31st Symposium on Cryptography and Information Security, Kagoshima, Japan, Jan. 21-24, 2014.
[3] S. Saad, B. Sayed, J. Felix, I. Traore, D. Zhao, A. Ghorbani, W. Lu and P. Hakimian, Detecting P2P botnets through network behavior analysis and machine learning, in Privacy, Security and Trust (PST), Ninth Annual International Conference on, July 19-21, 2011.
[4] D. Zhao, I. Traore, B. Sayed, W. Lu, S. Saad, A. Ghorbani, and D. Garan, Botnet detection based on traffic behavior analysis and flow intervals, in 27th IFIP International Information Security Conference, November, 2013.
[5] P. Narang, S. Ray, C. Hota and V. Venkatakrishnan, PeerShark: Detecting Peer-to-Peer Botnets by Tracking Conversations, in IEEE Security and Privacy Workshops (SPW), May 17-18, 2014, San Jose, CA.
[6] Wireshark, Conversations, obtained from: https://www.wireshark.org/docs/wsug_html_chunked/ChStatConversations.html
[7] Leonard, S. Xu, and R. Sandhu, A Framework for Understanding Botnets, in Proceedings of the International Workshop on Advances in Information Security (WAIS at ARES), Fukuoka, Japan, Fukuoka Institute of Technology, March 16-19, 2009.
[8] A. Shiravi, H. Shiravi, M. Tavallaee, and A. A. Ghorbani, Toward developing a systematic approach to generate benchmark datasets for intrusion detection, in Computers & Security, vol. 31, no. 3, pp. 357–374, May, 2012.
[9] E. Ian, H. Witten, L. Trigg, M. Hall, G. Holmes, and S. Cunningham, Weka: Practical Machine Learning Tools and Techniques with Java Implementations, 1999.
[10] Information Security and Object Technology (ISOT) Research Lab, ISOT Botnet dataset, University of Victoria, obtained from: http://www.uvic.ca/engineering/ece/isot/datasets/index.php
[11] The Honeynet Project, French Chapter of Honeynet, obtained from: http://www.honeynet.org/chapters/france.
[12] G. Szabó, D. Orincsay, S. Malomsoky, and I. Szabó, On the validation of traffic classification algorithms, in Proceedings of the 9th international conference on Passive and active network measurement (PAM'08), Berlin, Heidelberg, pp. 72–81, Springer-Verlag, 2008.
[13] LBNL and ICSI, LBNL Enterprise Trace Repository, obtained from: http://www.icir.org/enterprise-tracing.
[14] Cormen, Leiserson, and Rivest, Introduction to Algorithms, Chapter 17 "Greedy Algorithms", p. 329, 1990.
