A Fast Unsupervised Preprocessing Method for Network Monitoring
Martin Andreoni Lopez, Diogo Mattos, Otto Carlos M. B. Duarte, Guy Pujolle
Abstract—Identifying a network misuse takes days or even weeks, and network administrators usually neglect zero-day threats until a large number of malicious users exploit them. Besides, security applications, such as anomaly detection and attack mitigation systems, must apply real-time monitoring to reduce the impacts of security incidents. Thus, information processing time should be as small as possible to enable an effective defense against attacks. In this paper, we present a fast preprocessing method for network traffic classification based on feature correlation and feature normalization. Our proposed method couples a normalization algorithm and a feature selection algorithm. We evaluate the proposed algorithms against three different datasets for eight different machine learning classification algorithms. Our proposed normalization algorithm reduces the classification error rate when compared with traditional methods. Our feature selection algorithm chooses an optimized subset of features, improving accuracy by more than 11% with a 100-fold reduction in processing time when compared to traditional feature selection and feature reduction algorithms. The preprocessing method is performed on batch and streaming data, and it is able to detect concept-drift.

I. INTRODUCTION

Maintaining the stability, reliability, and security of computer networks implies understanding the type, volume, and intrinsic features of the flows comprising the carried traffic. Efficient network monitoring allows the administrator to achieve a better understanding of the network [1]. Nevertheless, the process of network monitoring varies in complexity, from a simple collection of link utilization statistics up to complex upper-layer protocol analysis to achieve network intrusion detection, network performance tuning, and protocol debugging. Current network analysis tools, such as tcpdump, NetFlow, SNMP, Bro IDS [2], Snort [3], among others, are mainly executed on one server, and these systems are not able to cope with the ever-growing traffic throughput [4]. Bro IDS, for example, evolved to be used in a cluster to reach up to 10 Gb/s. Another example is the Suricata IDS, which was proposed as an improvement of the Snort IDS, using Graphical Processing Units (GPUs) to parallelize packet inspection.

In traffic analysis and network monitoring applications, data arrives in real time from different heterogeneous sources [5], such as packets captured over the network, or multiple kinds of logging from systems and applications. An infinite sequence of events characterizes real-time stream applications, representing a continuously arriving sequence of tuples [6]. This kind of application generates a significant volume of data. Even moderate-speed networks generate huge amounts of data; for instance, monitoring a single gigabit Ethernet link running at 50% of utilization generates a terabyte of data in a couple of hours.

One way to attain data processing optimization is to employ Machine Learning (ML) methods. ML methods are well suited for big data since, with more samples to train on, these methods tend to have higher effectiveness [7]. Nevertheless, with high-dimensional data, machine learning methods tend to exhibit high latency due to computational resource consumption. For example, the K-Nearest Neighbours (K-NN) algorithm uses the Euclidean distance to calculate the similarity between points. Nonetheless, the Euclidean distance is unrepresentative in high dimensions. Thus, it is necessary to apply other similarity metrics, such as the cosine distance, which have a higher computational cost, introducing more latency into the preprocessing. The high latency is a drawback for machine learning methods, as they must analyze data as fast as possible to reach near real-time responses. Feature Selection is a process to overcome this drawback, reducing the number of dimensions or features to a smaller subset of the original ones [8]. Whereas traditional machine learning methods, also known as batch processing, deal with non-continuous data, in stream-processing machine learning methods data arrive continuously, one sample at a time. Stream processing methods must process data under strict constraints of time and resources [9]. When data arrive continuously, changes in the distribution of the data, as well as in the relationship between input and output variables, are observed over time. This behavior, known as concept-drift [10], deteriorates the accuracy and the performance of machine learning models. As a consequence, in case of occurrence of concept-drift, we must train a new model.

Streaming data arrive continuously from different heterogeneous sources and frequently contain duplicated or missing information, producing incorrect or misleading statistics [11]. Data preprocessing deals with detecting and removing errors and inconsistencies from the data to improve its quality. The purpose of data preprocessing is to prepare the data set that will be used to fit the model. In general, the preprocessing step consumes 60% to 80% of the time of the entire machine learning process and provides essential information that guides the analysis and adjustment of the models [12].
The number of input variables is often reduced before applying machine learning methods, to minimize the use of computational resources and to improve machine learning metrics. The variable reduction can be made in two different ways, by feature selection or by dimensionality reduction. On the one hand, feature selection only keeps the most relevant variables from the original dataset, creating a new subset of features from the original ones. On the other hand, dimensionality reduction exploits the redundancy of the input data by finding a smaller set of new synthetic variables, each being a linear or non-linear combination of the input variables.

In this paper, we present a data preprocessing method for network traffic monitoring. First, we propose a fast and efficient feature-selection algorithm. Our algorithm performs feature selection in an unsupervised manner, i.e., with no a priori knowledge of the output classes of each flow. We compare the presented algorithm with three traditional methods: ReliefF [13], Sequential Feature Selection (SFS), and Principal Component Analysis (PCA) [14]. Our algorithm shows better accuracy with the used machine learning methods, as well as a reduction in the total processing time. Second, we propose a normalization algorithm for streaming data, which can detect and adapt to concept-drift.

The remainder of the paper is organized as follows. In Section II, we introduce the concept of data preprocessing. We present our preprocessing method in Section III. In Section IV, we validate our proposal and show the results. Section V presents the related work. Finally, Section VI concludes the work.

II. DATA PREPROCESSING

Data preprocessing is the most time-consuming task in machine learning [15]. As shown in Figure 1, data preprocessing is composed of four main steps [16]. In the first step, Data Consolidation, several sources generate data; the system collects the data, and specialists interpret the data for better understanding. In the second step, Data Cleaning, the system analyzes all samples and verifies whether there are empty or missing values or anomalies in the dataset; this step also checks for inconsistencies. In the third step, Data Transformation, different functions are applied to the data to improve the machine learning process; Data Normalization, for instance, converts variables from categorical into numerical values. In the last step, Data Reduction, techniques such as feature selection are applied to reduce the data and to speed up the machine learning process. When the entire process finishes, the data is ready to be input into any machine learning algorithm. In this work, we focus on the last two steps, Data Transformation and Data Reduction, which are the most time-consuming.

Figure 1. Preprocessing steps composed of Data Consolidation, Data Cleaning, Data Transformation and Data Reduction. Data Transformation and Data Reduction are the most time-consuming steps.

Furthermore, all feature selection algorithms assume that data arrive preprocessed. Normalization, also known as feature scaling, is an essential method for the proper use of classification algorithms because normalization bounds the domain of each feature to a known interval. If the dataset features have different scales, they may impact in different ways the performance of the classification algorithm. Ensuring normalized feature values, usually in [−1, 1], implicitly weights all features equally in their representation. Classifier algorithms that calculate the distance between two points, e.g., K-NN and K-Means, suffer from this weighted-feature effect [17]. If one of the features has a broader range of values, this feature will profoundly influence the distance calculation. Therefore, the range of all features should be normalized, so that each feature contributes approximately proportionately to the final distance. Besides, many preprocessing algorithms consider that the data is statically available before the beginning of the learning process [18].

Although dealing with high-dimensional data is computationally possible, the more the number of dimensions increases, the more challenging the computational costs become, and the more ineffective machine learning techniques are. Most machine learning techniques are ineffective in handling high-dimensional data because they incur overfitting while classifying data. As a consequence, if the number of input variables¹ is reduced before running a machine learning algorithm, the algorithm can achieve better accuracy. We achieve the desired reduction of the number of variables in two different ways, dimensionality reduction and feature selection. Dimensionality reduction exploits the redundancy of the input data by computing a smaller set of new synthetic features, each being a linear or non-linear combination of the data features. Hence, in this case, the set of input variables is a subset of synthetic features that better describe the data than the original set of features. On the other hand, feature selection only keeps the most relevant variables from the original dataset, creating a new subset of features from the original one. Thus, the set of variables is a subset of the set of features.

¹ Features refer to the original set of attributes that describe the data. Variables refer to the input of the machine learning algorithms applied over the data. If no preprocessing method handles the original data, the set of variables and the set of features are the same.

Dimensionality reduction is the process of deriving a set of degrees of freedom that reproduces most of the variability of a data set [19]. Ideally, the reduced representation should have a dimensionality that corresponds to the intrinsic dimensionality of the data. The intrinsic dimensionality of the data is the minimum number of parameters needed to account for the observed properties of the data. Mathematically, in dimensionality reduction, given the p-dimensional random variable x = (x1, x2, . . . , xp),
we calculate a lower-dimensional representation of it, s = (s1, s2, . . . , sk), with k ≤ p.

Algorithms with different approaches have been developed to reduce dimensionality, classified into two groups: linear and non-linear. The linear reduction of dimensionality is a linear projection, in which the p-dimensional data are reduced to k-dimensional data using k linear combinations of the p original features. Two important examples of linear dimensionality reduction algorithms are Principal Component Analysis (PCA) and Independent Component Analysis (ICA). The goal of PCA is to find an orthogonal linear transformation that maximizes the feature variance. The first PCA base vector, the main direction, best describes the variability of the data. The second vector is the second-best description and must be orthogonal to the first, and so on in order of importance. On the other hand, the goal of ICA is to find a linear transformation in which the base vectors are statistically independent and non-Gaussian, i.e., the mutual information between two features in the new vector space is equal to zero. Unlike PCA, the base vectors in ICA are neither orthogonal nor ranked in order: all vectors are equally important. PCA is usually applied to reduce the representation of the data. On the other hand, ICA is usually used for feature extraction, identifying and selecting the features that best suit the application.
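To make the linear projection concrete, the sketch below (not part of the original paper) applies scikit-learn's PCA to a synthetic dataset; the data and the choice of k = 6 components are placeholders, not the configuration used by the authors.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 26))        # placeholder: 1000 flows, 26 original features

pca = PCA(n_components=6)              # k = 6 synthetic features (illustrative choice)
S = pca.fit_transform(X)               # S has shape (1000, 6)

# Each component is a linear combination of the original features;
# the explained variance ratio shows how much variability is retained.
print(S.shape, pca.explained_variance_ratio_.sum())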
Dimensionality reduction techniques lack expressiveness, since the generated features are combinations of the original features. Hence, the meaning of the new synthetic features is lost. When there is a need for an interpretation of the model, for example when creating rules in a firewall, it is necessary to use other methods. Feature selection produces a subset of the original features, which are the best representatives of the data. Thus, there is no loss of meaning. There are three types of feature selection techniques [8]: wrapper, filtering, and embedded.
Wrapper methods, also called closed-loop methods, use different classifiers, such as Support Vector Machine (SVM) and Decision Tree, among others, to measure the quality of a feature subset without incorporating knowledge about the specific structure of the classification function. Thus, the method evaluates subsets based on the accuracy of the classifier. These methods consider feature selection as a search problem, which is NP-hard: an exhaustive search over the full dataset must be done to evaluate the relevance of the features. Wrapper methods tend to be more accurate than filtering methods, but they present a higher computational cost [20]. One popular wrapper method is Sequential Forward Selection (SFS), thanks to its simplicity. The algorithm begins with an empty set S and the full set of all features X. The SFS algorithm does a bottom-up search and gradually adds features, selected by an evaluation function, to S, minimizing the mean square error (MSE). At each iteration, the algorithm selects the feature to be included in S from the remaining available features of X. The main disadvantage of SFS is that adding a feature to the set S prevents the method from removing that feature later, even if removing it would yield the smallest error after other features have been added.

Filtering methods are computationally lighter than wrapper methods and avoid overfitting. Filtering methods, also called open-loop methods, use heuristics to evaluate the relevance of the features in the dataset [21]. As the name implies, the algorithm filters out the features that do not fulfill the heuristic criterion. One of the most popular filtering algorithms is Relief. The Relief algorithm associates each feature with a score, calculated as the difference between the distance to the nearest example of the same class and the distance to the nearest example of the other class. The main drawback of this method is the obligation to label data records in advance. Relief is limited to problems with just two classes, but ReliefF [13] is an improvement of the Relief method that deals with multiple classes using the technique of the k-nearest neighbors.

Embedded methods have a behavior similar to wrapper methods, using a classifier's accuracy to evaluate feature relevance. However, embedded methods perform the feature selection during the learning process and use its properties to guide the feature evaluation. Therefore, this modification reduces the computational time compared to wrapper methods. Support Vector Machine Recursive Feature Elimination (SVM-RFE) first appeared in gene selection for cancer classification [22]. The algorithm ranks features according to a classification problem based on the training of a Support Vector Machine (SVM) with a linear kernel. The feature with the smallest ranking is removed, according to a criterion w, in a sequential backward elimination manner. The criterion w is the value of the decision hyperplane of the SVM.

All the previous feature selection algorithms are supervised methods, in which the label of the class is presented before the preprocessing step. In applications such as network monitoring and threat detection, network streams reach classifiers without a known label. Therefore, unsupervised algorithms must be applied.

III. THE PROPOSED PREPROCESSING METHOD

Our preprocessing method comprises two algorithms. First, a normalization algorithm enforces the data to a normal distribution, re-scaling the values to the interval between −1 and 1. Indeed, the largest value for each attribute is 1 and the smallest value is −1. Normalization and standardization methods such as Max-Min or Z-score output values in a known range, usually [−1, 1] or [0, 1]. Our normalization method is parameter-less. Then, we propose a feature selection algorithm based on the correlation between pairwise features. The Correlation Feature Selection (CFS) [23] inspires the proposed algorithm. CFS scores the features through the correlation between the feature and the target class. The CFS algorithm calculates the correlation between pairwise features and the target class to get the importance of each feature. Thus, CFS depends on target class information a priori, so it is a supervised algorithm. The proposed algorithm performs an unsupervised feature selection. The correlation and the variance between the features measure the amount of information that each feature represents compared to the others. Thus, the presented algorithm demands less computational time, independently of a priori class labeling.
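As a rough illustration of this idea, the following sketch ranks features by variance and discards one feature of each highly correlated pair. It is a simplified reading of the description above, with illustrative thresholds, and not the authors' exact algorithm.

import numpy as np

def select_features(X, corr_threshold=0.95, n_keep=6):
    """Unsupervised selection sketch: rank features by variance and drop one
    feature of every highly correlated pair (illustrative only)."""
    variances = X.var(axis=0)
    order = np.argsort(variances)[::-1]        # highest variance first
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = []
    for i in order:
        # keep feature i only if it is not strongly correlated
        # with an already selected feature
        if all(corr[i, j] < corr_threshold for j in selected):
            selected.append(i)
        if len(selected) == n_keep:
            break
    return selected

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))                 # placeholder traffic-flow features
print(select_features(X))

Note that no class label is used anywhere in the selection, which is the property the proposed method relies on for unlabeled network streams.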
A. The Proposed Normalization Algorithm

Normalization should be applied when the data distribution is unknown, since determining the distribution of streaming data is computationally expensive [24]. In our normalization algorithm (Algorithm 1), the histogram of a feature fi is represented as a vector b1, b2, . . . , bn, such that bk represents the number of samples that fall in bin k.

In practice, it is not possible to know in advance the min and max of any feature. As a consequence, we use a sliding-window approach, where the dataset X consists of the s last seen samples. For every sliding window, we obtain the min and max values of each feature. Then, the data values are grouped into a set b of intervals called bins. The idea is to divide the feature fi into a histogram composed of bins b1, b2, . . . , bm, where m = √n + 1, with n the number of features, as shown in line 3 of Algorithm 1.

Each bin is delimited by thresholds k; for example, the feature fi is grouped into b1 = [mini, k1), b2 = [k1, k2), . . . , bm = [km−1, maxi]. The step between thresholds k is called the pivot and is determined as (maxi − mini)/m, as shown in Algorithm 2. If the min or max values of the current sliding window fall outside the range of the previous window, that is, mini < mini−1 or maxi > maxi−1, new bins are created until the new min or max values are covered. With the creation of new bins, the proposal is able to detect concept-drift.

Algorithm 2: CreateHistogram() Function
  Input:  X: sliding window of features, bn: number of bins
  Output: H: histogram
  1  [max, min] = CalculateMaxMin;
  2  k = (max − min)/bn;        /* k: threshold */
  3  for bin b in bn do
  4      b = [mini, k);
  5      k += min;
  6  end

Algorithm 3: UpdateHistogram() Function
  Input:  X: sliding window of features, bn: number of bins
  Output: H: histogram, fr: relative frequency
  1  for sample s in X do
  2      for b in bins do
  3          if s in b then
  4              b += 1;        /* getting the frequency */
  5          else
  6              add bins to b until s in b;
  7      end
  8      fr = calculate using Equation 1;
  9      H = map s to NormalDistribution;
 10  end
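A minimal Python sketch of the sliding-window histogram logic of Algorithms 2 and 3 follows. The bin step, the bin-extension policy, and the relative-frequency computation reflect our reading of the text; the function names create_histogram and update_histogram are illustrative, not the authors' implementation.

import numpy as np

def create_histogram(window, n_bins):
    """Build bin edges for one feature from the current sliding window
    (rough analogue of CreateHistogram, Algorithm 2)."""
    lo, hi = window.min(), window.max()
    pivot = (hi - lo) / n_bins                      # step between thresholds k
    return lo + pivot * np.arange(n_bins + 1)

def update_histogram(window, edges):
    """Count samples per bin, extending the bins when values fall outside
    the previous min-max range (rough analogue of Algorithm 3)."""
    pivot = edges[1] - edges[0]
    while window.min() < edges[0]:                  # new min seen: prepend bins
        edges = np.insert(edges, 0, edges[0] - pivot)
    while window.max() >= edges[-1]:                # new max seen: append bins
        edges = np.append(edges, edges[-1] + pivot)
    counts, _ = np.histogram(window, bins=edges)
    rel_freq = counts / counts.sum()                # relative frequency per bin
    return edges, rel_freq

rng = np.random.default_rng(2)
edges = create_histogram(rng.normal(size=256), n_bins=17)
edges, rel_freq = update_histogram(rng.normal(loc=0.5, size=256), edges)
print(len(edges) - 1, rel_freq.round(3))

Extending the edges instead of recomputing them is what lets the structure register values outside the previous min-max range, which is the signal the text uses for concept-drift.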
Figure 3. Representation of a feature divided into a histogram. Each feature is divided into bins that represent the relative frequency of the samples comprised between the thresholds k.

In the following experiments, we verify our preprocessing method in a use case of traffic classification. Thus, we implement the Decision Tree (DT), with the C4.5 algorithm, Artificial Neural Networks (ANN), and the Support Vector Machine (SVM).

the highest one. The activation function enables and disables the contribution of a certain neuron to the final result.

The Support Vector Machine Algorithm: The Support Vector Machine (SVM) is a binary classifier based on the concept of a decision plane that defines the decision thresholds. The SVM algorithm classifies through the construction of a hyperplane in a multidimensional space that splits the different classes. An iterative algorithm minimizes an error function, finding the best hyperplane separation. A kernel function defines this hyperplane. In this way, SVM finds the hyperplane with maximum margin, that is, the hyperplane with the largest possible distance between both classes.

The real-time detection is performed by the classification of each class pair: normal and non-normal; DoS and non-DoS; and probe and non-probe. Once SVM calculates the output, the chosen class is the one with the highest score. The classifier score of a sample x is the distance from x to the decision boundary, which goes from −∞ to +∞. The classifier score is given by

f(x) = Σ_{j=1}^{n} α_j y_j G(x_j, x) + b,    (8)

where (α_1, . . . , α_n, b) are the estimated parameters of the SVM and G(x_j, x) is the used kernel. In this work, the kernel is linear, that is, G(x_j, x) = x_j' x, which presents a good performance with the minimum quantity of input parameters.

A. Classification Results

This experiment shows the efficiency of our feature selection algorithm when compared to literature methods. We evaluate linear Principal Component Analysis (PCA), the ReliefF algorithm, Sequential Forward Selection (SFS), and Support Vector Machine Recursive Feature Elimination (SVM-RFE). For all methods, we analyze their versions with four and six output features on the GTA/UFRJ dataset. For the sake of fairness, we tested all the algorithms with the classification methods presented before. We use a decision tree with a minimum of 4096 leaves, a binary support vector machine (SVM) with a linear kernel, and finally a neural network with one hidden layer of 10 neurons. We use ten-fold cross-validation in our experiments.

Figure 4 presents the information gain (IG) sum of the selected features for each evaluated algorithm. Information gain measures the amount of information, in bits, that a feature adds to the class prediction. Thus, the information gain calculation computes the difference between the entropy of the target class and the conditional entropy of the target class given that the feature value is known. When employing six features, the results show our algorithm has an information retention capability between SFS and ReliefF, and higher than SVM-RFE. The information retention capability of PCA is higher than that of the feature selection methods, as each feature is a linear combination of the original features and is computed to retain most of the dataset variance.
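For reference, the information gain of a discrete feature can be computed as the difference between the class entropy and the conditional class entropy, as in the short sketch below; this is an illustration of the definition above, not the evaluation code used in the paper, and the toy data are placeholders.

import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels):
    """IG = H(class) - H(class | feature), for a discrete feature."""
    total = entropy(labels)
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [l for l, fv in zip(labels, feature_values) if fv == v]
        cond += len(subset) / n * entropy(subset)
    return total - cond

# toy example: a feature that separates the classes fairly well
labels = ['dos', 'dos', 'normal', 'normal', 'probe', 'normal']
feature = ['small', 'small', 'large', 'large', 'small', 'large']
print(round(information_gain(feature, labels), 3))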
Figure 4. Information gain sum for the feature selection algorithms. The features selected by our algorithm keep an information retention capability between SFS and ReliefF in the GTA/UFRJ dataset.

Figure 5 shows the accuracy of the three classification methods, Decision Tree, Neural Network, and Support Vector Machine (SVM), when different dimensionality reduction methods choose the input variables. In the first group, our proposal with six features reaches 97.4% accuracy, which is the best result for the decision tree classifier. The next results are PCA with four and six features, at 96% and 97.2%. Sequential Forward Selection (SFS) presents the same result with four and six features, 95.5%. The ReliefF algorithm also has the same result for both four and six features, 91.2%. Finally, the lowest result is shown by the SVM-RFE algorithm with four and six features. As the decision tree algorithm creates the decision nodes based on the variables with higher entropy, the proposed feature selection algorithm performs better because it keeps most of the variance of the dataset.

For the second classifier, the neural network, the best result is shown by PCA with six features, with 97.6% accuracy. However, PCA with four features presents a lower performance, 85.5%. ReliefF presents the same result for both feature counts, 90.2%. Our proposal shows 83.9% and 85.0% for four and six features. On the other hand, SFS presents the worst results of all, 78.4% with four features and 79.2% with six features. One striking result is the SVM-RFE: with four features it presents a deficient result of 73.6%, one of the worst for all classifiers; however, with six features it presents almost the second-best result, 90.1%.

In the Support Vector Machine (SVM) classifier, PCA presents a behavior similar to the neural network. With six features it presents the highest accuracy of all classifiers, 98.3%, but just 87.8% with four features. ReliefF again presents the same result in both cases, 91.4%. Our proposal has 84% for four features and 85% for six features. SFS presents the same result for both feature counts, 79.5%. The lowest accuracy for this classifier is the SVM-RFE, with 73.6% in both cases. As our proposal maximizes the variance of the resulting features, the resulting reduced dataset spreads out in the new space. For a linear classifier, such as SVM, it is hard to define a classification surface for spread data. Thus, the resulting accuracy is not among the highest. However, as the selected set of features is still significant for defining the data, the resulting accuracy is not the worst one.

The Sensitivity metric shows the rate of correctly classified samples. It is a good metric to evaluate the success of a classifier when using a dataset in which one class has many more samples than the others. In our problem, we use sensitivity as a metric to evaluate our detection success. For this, we consider the detection problem as a binary classification, i.e., we consider two classes: normal and abnormal traffic. In this way, the Denial of Service (DoS) and Port Scanning threat classes compose a common attack class. Similar to the accuracy representation in Figure 5, Figure 6 represents the sensitivity of the classifiers applying the different feature selection methods. In the first group, the classification with Decision Tree, PCA shows the best sensitivity with 99% of correct classification, and our algorithm achieves a performance of almost 95% of sensitivity with four and six features.
Figure 5. Accuracy comparison of the feature selection methods. Our Proposal, Linear PCA, ReliefF, SFS and SVM-RFE compared in decision tree, SVM, and neural network algorithms in the GTA/UFRJ dataset.
Neural Networks, represented in the second group, have the best sensitivity with PCA using six features, 97.7%; our results show good performance with both four and six features, at 89%. In this group, the worst sensitivity of all classifiers is reached by the SVM-RFE with four and six features, 69.3%. Finally, the last group shows the sensitivity for the Support Vector Machine (SVM) classifier. Again showing a similar behavior to the previous group, PCA with six features shows the best sensitivity, 97.8%. Then, the second-best result is reached by our algorithm, as well as by ReliefF, with 89% sensitivity for both feature counts. It is worth noting that our algorithm presents a stable behavior in Accuracy as well as in Sensitivity. We highlight that our algorithm performs nearly equal to PCA. PCA creates artificial features that are a composition of all real features, while our algorithm selects some features from the complete set of features. In this way, our algorithm was the best feature-selection method among the evaluated ones, and it also introduces less computing overhead when compared with PCA.

When analyzing the features each method chooses, it is possible to see that none of the methods selects the same set of features. Nevertheless, ReliefF and SFS both select Amount of IP Packets as the second-best feature. One surprising result from the SFS is the election of Amount of ECE Flags and Amount of CWR Flags. In a correlation test, these two features show no information inclusion because they are empty variables. However, we realized that one of the main features is Average Packet Size. In this dataset, the average packet size is fundamental to classify attacks. One possible reason is that, during the creation of the dataset, an automated tool performed the Denial of Service (DoS) and Probe attacks, and this automated tool mainly produces attacks without altering the length of the packet.

Figure 7 shows a comparison of the processing time of all implemented feature selection and dimensionality reduction methods. All measures are in relative values. We can see that SFS presents the worst performance. The SFS algorithm performs multiple iterations to minimize the mean square error (MSE). Consequently, all these iterations increase the processing time. Our proposal shows the best processing time, together with PCA, because both implementations perform matrix multiplication. Matrix multiplication is a simple computation function.

The next experiment evaluates our proposal on different datasets. We use the NSL-KDD dataset and the NetOp dataset. Besides linear SVM, Neural Network, and Decision Tree, we also evaluate K-Nearest Neighbors (K-NN), Random Forest, two kernels, linear and Radial Basis Function (RBF), in the Support Vector Machine (SVM), Gaussian Naive Bayes, and Stochastic Gradient Descent. Adding these algorithms, we cover the full range of the most important algorithms for supervised machine learning.

The Random Forest (RF) algorithm avoids overfitting when compared to the simple decision tree because it constructs several decision trees, trained on different parts of the same dataset. This procedure decreases the variance of the classification and improves the performance with respect to the classification of a single tree. The prediction of the class in the RF classifier consists of applying the sample as input to all the trees, obtaining the classification of each one of them, and then a voting system decides the resulting class. The construction of each tree must follow these rules: (i) for each node d, select k input variables out of the total of m input variables, such that k ≪ m; (ii) calculate the best binary division of the k input variables for the node d, using an objective function; (iii) repeat the previous steps until each tree reaches l nodes or until its maximum extension.

The simple Bayesian classifier (Naive Bayes, NB) takes the strong premise of independence between the input variables to simplify the classification prediction, that is, the value of each input variable does not influence the value of the other input variables. From this, the method calculates the a priori probabilities of each input variable, or a set of them, belonging to a given class. As a new sample arrives, the algorithm calculates for each input variable the probability of being a sample of each class. The combination of all probabilities of each input variable results in a posterior probability of this sample belonging to each class. The algorithm then returns the classification that has the highest estimated probability.

In the K-Nearest Neighbors (K-NN) the class definition
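A minimal sketch of how such a pool of classifiers could be assembled with scikit-learn and ten-fold cross-validation is shown below; the dataset, hyperparameters, and dictionary keys are placeholders, not the exact configuration used in these experiments.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

# placeholder data standing in for a preprocessed flow dataset
rng = np.random.default_rng(3)
X = rng.normal(size=(600, 10))
y = rng.integers(0, 2, size=600)

classifiers = {
    "K-NN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Gaussian NB": GaussianNB(),
    "SGD": SGDClassifier(),
    "SVM (linear)": SVC(kernel="linear"),
    "SVM (RBF)": SVC(kernel="rbf"),
    "Decision Tree": DecisionTreeClassifier(),
    "Neural Network (MLP)": MLPClassifier(hidden_layer_sizes=(10,), max_iter=500),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)     # ten-fold cross-validation
    print(f"{name}: {scores.mean():.3f}")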
Figure 6. Sensitivity of detection in decision tree, SVM, and neural network algorithms for feature selection methods in the GTA/UFRJ dataset.
The scheme relies on the Stochastic Gradient Descent (SGD) [33] algorithm, a stochastic approximation of Gradient Descent in which a single sample approximates the gradient. In our application, we consider two classes, normal and threat. Therefore, we use the Sigmoid Function, expressed by

h_θ(x) = 1 / (1 + e^(−θᵀx)).    (10)

When a new sample x(i) arrives, the SGD evaluates the Sigmoid function and returns one if h_θ(x(i)) is greater than 0.5 and zero otherwise.
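A small sketch of Equation (10) and the 0.5 decision threshold is shown below; the parameter vector θ is a placeholder, since the fitted values are not given in the text.

import numpy as np

def sigmoid(theta, x):
    """h_theta(x) = 1 / (1 + exp(-theta^T x)), Equation (10)."""
    return 1.0 / (1.0 + np.exp(-np.dot(theta, x)))

def classify(theta, x):
    """Return 1 (threat) if h_theta(x) > 0.5, otherwise 0 (normal)."""
    return int(sigmoid(theta, x) > 0.5)

theta = np.array([0.8, -1.2, 0.3])       # placeholder parameters learned by SGD
sample = np.array([1.0, 0.2, 2.5])       # placeholder feature vector of a flow
print(sigmoid(theta, sample), classify(theta, sample))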
The metrics are reported with no feature selection (Figure 9) and with 10% of feature reduction (Figure 10). For K-NN, SVM with the Radial Basis Function (RBF) kernel, and Gaussian Naive Bayes, the metrics remain the same. For the Neural Network (MLP) and SVM with a linear kernel, an improvement of between 2-3% in all metrics is reached with 10% of feature reduction. Random Forest presents the worst performance when features are reduced; all its metrics worsen by between 8-9%. Stochastic Gradient Descent (SGD) also suffers a small reduction of 1% in its metrics. The decision tree is the most benefited, improving its metrics by between 3-4%, which shows the capability of feature selection to reduce overfitting.

Figure 11. Classification and training time in the NSL-KDD dataset with no feature selection.
Figure 12. Classification and training time in NSL-KDD Dataset with only
Figure 15. Metrics reducing only 90% of the initial features in the NetOp dataset.
Figure 16. Classification and training time in NetOp Dataset with no feature selection.
Figure 14. Accuracy, precision, sensitivity and F1-Score for the NetOp dataset. Metrics with no feature selection.
Figure 18. Evaluation of group features with different machine learning algorithms.
joins equal-size chunks where all operations originate. The first data chunk is used to take the reference min-max values and to send the normalized data to the training model. Metric 1 represents the number of samples falling outside the min-max reference values; metric 2 is the ratio between the new sample values in each dimension and the referenced min-max value for that dimension. Similar to our proposal, the algorithm works with numerical data.
The Incremental Discretization Algorithm (IDA) uses a quantile approach to discretize the data stream [26]. The algorithm discretizes the data stream into m equal-frequency bins. A sliding-window version of the algorithm is proposed to follow the evolution of the data stream. The algorithm maintains the data in bins with fixed quantiles of the distribution, rather than fixed absolute values, to follow the distribution drift.

Figure 19. Flow diagram used for the proposal evaluation.
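The equal-frequency idea behind IDA can be illustrated with quantile-based bin edges, as in the sketch below; this is a simplified batch version of the principle, not the incremental algorithm of [26], and the data and bin count are placeholders.

import numpy as np

def equal_frequency_bins(values, m):
    """Split values into m bins holding roughly the same number of samples,
    using quantiles of the current distribution as bin edges."""
    quantiles = np.linspace(0, 1, m + 1)
    edges = np.quantile(values, quantiles)
    bin_index = np.clip(np.searchsorted(edges, values, side="right") - 1, 0, m - 1)
    return edges, bin_index

rng = np.random.default_rng(4)
stream_window = rng.exponential(scale=2.0, size=1000)   # skewed, drifting data
edges, idx = equal_frequency_bins(stream_window, m=10)
print(np.bincount(idx))                                  # roughly equal counts per bin

Because the edges are quantiles rather than absolute values, re-estimating them on each window lets the bins track the distribution as it drifts.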
In our proposal, we propose an unsupervised preprocessing method. Our method includes normalization and feature selection altogether. The proposal is parameter-less. Our algorithm follows an active approach for concept-drift detection. The active approach monitors the concept, the label, to determine when drift occurs before taking any action. A passive approach