A Fast Unsupervised Preprocessing Method for Network Monitoring

Martin Andreoni Lopez∗†, Diogo M. F. Mattos‡, Otto Carlos M. B. Duarte∗, and Guy Pujolle†
∗ Universidade Federal do Rio de Janeiro - GTA/COPPE/UFRJ - Rio de Janeiro, Brazil
† Sorbonne Université, CNRS, Laboratoire d’Informatique de Paris 6, F-75005 Paris, France
‡ Universidade Federal Fluminense - (UFF) - Niteroi, Brazil

Emails: {martin,otto}@gta.ufrj.br, menezes@midiacom.uff.br, guy.pujolle@lip6.fr

Abstract—Identifying a network misuse takes days or even weeks, and network administrators usually neglect zero-day threats until a large number of malicious users exploit them. Besides, security applications, such as anomaly detection and attack mitigation systems, must apply real-time monitoring to reduce the impacts of security incidents. Thus, information processing time should be as small as possible to enable an effective defense against attacks. In this paper, we present a fast preprocessing method for network traffic classification based on feature correlation and feature normalization. Our proposed method couples a normalization algorithm and a feature selection algorithm. We evaluate the proposed algorithms against three different datasets for eight different machine learning classification algorithms. Our proposed normalization algorithm reduces the classification error rate when compared with traditional methods. Our feature selection algorithm chooses an optimized subset of features, improving accuracy by more than 11% with a 100-fold reduction in processing time when compared to traditional feature selection and feature reduction algorithms. The preprocessing method runs over both batch and streaming data and is able to detect concept-drift.

I. INTRODUCTION

Maintaining the stability, reliability, and security of computer networks implies understanding the type, volume, and intrinsic features of the flows comprising the carried traffic. Efficient network monitoring allows the administrator to achieve a better understanding of the network [1]. Nevertheless, network monitoring varies in complexity, from a simple collection of link utilization statistics up to complex upper-layer protocol analysis for network intrusion detection, network performance tuning, and protocol debugging. Current network analysis tools, such as tcpdump, NetFlow, SNMP, Bro IDS [2], and Snort [3], are mainly executed on a single server, and these systems are not able to cope with the ever-growing traffic throughput [4]. Bro IDS, for example, evolved to be used in a cluster to reach up to 10 Gb/s. Another example is the Suricata IDS, which was proposed as an improvement of the Snort IDS, using Graphics Processing Units (GPUs) to parallelize packet inspection.

In traffic analysis and network monitoring applications, data arrives in real time from different heterogeneous sources [5], such as packets captured over the network or multiple kinds of logs from systems and applications. An infinite sequence of events characterizes real-time stream applications, representing a continuously arriving sequence of tuples [6]. This kind of application generates a significant volume of data. Even moderate-speed networks generate huge amounts of data; for instance, monitoring a single gigabit Ethernet link running at 50% utilization generates a terabyte of data in a couple of hours.

One way to attain data processing optimization is to employ Machine Learning (ML) methods. ML methods are well suited for big data since, with more samples to train on, the methods tend to reach higher effectiveness [7]. Nevertheless, with high-dimensional data, machine learning methods tend to show high latency due to computational resource consumption. For example, the K-Nearest Neighbors (K-NN) algorithm uses the Euclidean distance to calculate the similarity between points. Nonetheless, the Euclidean distance is unrepresentative in high dimensions. Thus, it is necessary to apply other similarity metrics, such as the cosine distance, which have a higher computational cost, introducing more latency into the preprocessing. The high latency is a drawback for machine learning methods, as they must analyze data as fast as possible to reach near-real-time responses. Feature selection is a process to overcome this drawback, reducing the number of dimensions, or features, to a subset smaller than the original one [8]. Whereas traditional machine learning methods, also known as batch processing, deal with non-continuous data, in stream-processing machine learning methods data arrive continuously, one sample at a time. Stream processing methods must process data under strict constraints of time and resources [9]. When data arrive continuously, changes in the distribution of the data, as well as in the relationship between input and output variables, are observed over time. This behavior, known as concept-drift [10], deteriorates the accuracy and the performance of machine learning models. As a consequence, in case of concept-drift, a new model must be trained.

Streaming data arrive continuously from different heterogeneous sources and frequently contain duplicated or missing information, producing incorrect or misleading statistics [11]. Data preprocessing deals with detecting and removing errors and inconsistencies from data to improve data quality. The purpose of data preprocessing is to prepare the dataset that will be used to fit the model. In general, the preprocessing step consumes 60% to 80% of the time of the entire machine learning process and provides essential information that guides the analysis and adjustment of the models [12].

The number of input variables is often reduced before applying machine learning methods, to minimize the use of computational resources and to improve machine learning metrics. The variable reduction can be made in two different ways: by feature selection or by dimensionality reduction. On the one hand, feature selection only keeps the most relevant variables from the original dataset, creating a new subset of features from the original ones. On the other hand, dimensionality reduction exploits the redundancy of the input data by finding a smaller set of new synthetic variables, each being a linear or non-linear combination of the input variables.
In this paper, we present a data preprocessing method for network traffic monitoring. First, we propose a fast and efficient feature-selection algorithm. Our algorithm performs feature selection in an unsupervised manner, i.e., with no a priori knowledge of the output classes of each flow. We compare the presented algorithm with three traditional methods: ReliefF [13], Sequential Feature Selection (SFS), and Principal Component Analysis (PCA) [14]. Our algorithm shows better accuracy with the evaluated machine learning methods, as well as a reduction in the total processing time. Second, we propose a normalization algorithm for streaming data, which can detect and adapt to concept-drift.

The remainder of the paper is organized as follows. In Section II, we introduce the concept of data preprocessing. We present our preprocessing method in Section III. In Section IV, we validate our proposal and show the results. Section V presents the related work. Finally, Section VI concludes the work.

II. DATA PREPROCESSING

Data preprocessing is the most time-consuming task in machine learning [15]. As shown in Figure 1, data preprocessing is composed of four main steps [16]. In the first step, Data Consolidation, several sources generate data; the system collects the data, and specialists interpret the data for better understanding. In the second step, Data Cleaning, the system analyzes all samples and verifies whether there are empty or missing values or anomalies in the dataset; this step also checks for inconsistencies. In the third step, Data Transformation, different functions are applied to the data to improve the machine learning process; data normalization and the conversion of variables from categorical into numerical values happen in this step. In the last step, Data Reduction, techniques such as feature selection are applied to reduce the data, improving and speeding up the machine learning process. As the entire process finishes, the data is ready to be input into any machine learning algorithm. In this work, we focus on the last two steps, Data Transformation and Data Reduction, which are the most time-consuming steps.

Figure 1. Preprocessing steps composed of Data Consolidation, Data Cleaning, Data Transformation, and Data Reduction. Data Transformation and Data Reduction are the most time-consuming steps.

Furthermore, all feature selection algorithms assume that data arrive preprocessed. Normalization, also known as feature scaling, is an essential method for the proper use of classification algorithms because normalization bounds the domain of each feature to a known interval. If the dataset features have different scales, they may impact in different ways the performance of the classification algorithm. Ensuring normalized feature values, usually in [−1, 1], implicitly weights all features equally in their representation. Classifier algorithms that calculate the distance between two points, e.g., K-NN and K-Means, suffer from the weighted feature effect [17]. If one of the features has a broader range of values, this feature will profoundly influence the distance calculation. Therefore, the range of all features should be normalized, so that each feature contributes approximately proportionately to the final distance. Besides, many preprocessing algorithms consider that data is statically available before the beginning of the learning process [18].

Although dealing with high-dimensional data is computationally possible, the more the number of dimensions increases, the more challenging the computational costs become, and the more ineffective machine learning techniques are. Most machine learning techniques are ineffective in handling high-dimensional data because they incur overfitting while classifying data. As a consequence, if the number of input variables¹ is reduced before running a machine learning algorithm, the algorithm can achieve better accuracy. We achieve the desired variable-number reduction in two different ways: dimensionality reduction and feature selection. Dimensionality reduction exploits the redundancy of the input data by computing a smaller set of new synthetic features, each being a linear or non-linear combination of the data features. Hence, in this case, the set of input variables is a set of synthetic features that describe the data better than the original set of features. On the other hand, feature selection only keeps the most relevant variables from the original dataset, creating a new subset of features from the original one. Thus, the set of variables is a subset of the set of features.

¹ Features refer to the original set of attributes that describe the data. Variables refer to the input of the machine learning algorithms applied over the data. If no preprocessing method handles the original data, the set of variables and the set of features are the same.

Dimensionality reduction is the process of deriving a set of degrees of freedom that reproduces most of the variability of a dataset [19]. Ideally, the reduced representation should have a dimensionality that corresponds to the intrinsic dimensionality of the data. The intrinsic dimensionality of the data is the minimum number of parameters needed to account for the observed properties of the data. Mathematically, in dimensionality reduction, given the p-dimensional random variable $x = (x_1, x_2, \ldots, x_p)$, we calculate a lower-dimensional representation of it, $s = (s_1, s_2, \ldots, s_k)$ with $k \leq p$.

Algorithms with different approaches have been developed to reduce dimensionality, classified into two kinds: linear and non-linear. The linear reduction of dimensionality is a linear projection, in which the p-dimensional data are reduced into k-dimensional data using k linear combinations of the p original features. Two important examples of linear dimensionality reduction algorithms are Principal Component Analysis (PCA) and Independent Component Analysis (ICA). The goal of PCA is to find an orthogonal linear transformation that maximizes the feature variance. The first PCA base vector, the main direction, best describes the variability of the data. The second vector is the second-best description and must be orthogonal to the first, and so on, in order of importance. On the other hand, the goal of ICA is to find a linear transformation in which the base vectors are statistically independent and non-Gaussian, i.e., the mutual information between two features in the new vector space is equal to zero. Unlike PCA, the base vectors in ICA are neither orthogonal nor ranked in order; all vectors are equally important. PCA is usually applied to reduce the representation of the data, whereas ICA is usually used for feature extraction, identifying and selecting the features that best suit the application.

Dimensionality reduction techniques lack expressiveness, since the generated features are combinations of the original ones. Hence, the meaning of the new synthetic features is lost. When there is a need for an interpretation of the model, for example when creating rules in a firewall, it is necessary to use other methods. Feature selection produces a subset of the original features that best represents the data; thus, there is no loss of meaning. There are three types of feature selection techniques [8]: wrapper, filtering, and embedded.
Wrapper methods, also called closed-loop methods, use different classifiers, such as Support Vector Machine (SVM) or decision trees, to measure the quality of a feature subset without incorporating knowledge about the specific structure of the classification function. Thus, the method evaluates subsets based on the accuracy of the classifier. These methods consider feature selection as a search problem, which is NP-hard: an exhaustive search over the full dataset must be done to evaluate the relevance of each feature. Wrapper methods tend to be more accurate than filtering methods, but they present a higher computational cost [20]. One popular wrapper method is the Sequential Forward Selection (SFS), thanks to its simplicity. The algorithm begins with an empty set S and the full set of all features X. The SFS algorithm does a bottom-up search and gradually adds features, selected by an evaluation function, to S, minimizing the mean square error (MSE). At each iteration, the algorithm selects the feature to be included in S from the remaining available features of X. The main disadvantage of SFS is that, once a feature is added to the set S, the method cannot remove it later, even if dropping it would yield the smallest error after other features are added.

Filtering methods are computationally lighter than wrapper methods and avoid overfitting. Filtering methods, also called open-loop methods, use heuristics to evaluate the relevance of a feature in the dataset [21]. As the name implies, the algorithm filters out the features that do not fulfill the heuristic criterion. One of the most popular filtering algorithms is Relief. The Relief algorithm associates each feature with a score, which is calculated as the difference between the distance to the nearest example of the same class and to the nearest example of the other class. The main drawback of this method is the obligation of labeling data records in advance. Relief is limited to problems with just two classes, but ReliefF [13] is an improvement of the Relief method that deals with multiple classes using the technique of the k-nearest neighbors.

Embedded methods behave similarly to wrapper methods, using classifier accuracy to evaluate feature relevance. However, embedded methods perform the feature selection during the learning process and use its properties to guide the feature evaluation. Therefore, this modification reduces the computational time of wrapper methods. Support Vector Machine Recursive Feature Elimination (SVM-RFE) first appeared in gene selection for cancer classification [22]. The algorithm ranks features according to a classification problem based on the training of a Support Vector Machine (SVM) with a linear kernel. The feature with the smallest ranking is removed, according to a criterion w, in a sequential backward elimination manner. The criterion w is the value of the decision hyperplane of the SVM.

All the previous feature selection algorithms are supervised methods, in which the label of the class is available before the preprocessing step. In applications such as network monitoring and threat detection, network streams reach classifiers without a known label. Therefore, unsupervised algorithms must be applied.

III. THE PROPOSED PREPROCESSING METHOD

Our preprocessing method comprises two algorithms. First, a normalization algorithm enforces a normal distribution on the data, re-scaling the values into the interval between −1 and 1; indeed, the largest value of each attribute is 1 and the smallest value is −1. Normalization and standardization methods such as Max-Min or Z-score output values in a known range, usually [−1, 1] or [0, 1]. Our normalization method is parameter-less. Second, we propose a feature selection algorithm based on the correlation between pairwise features. The Correlation Feature Selection (CFS) [23] inspires the proposed algorithm. CFS scores the features through the correlation between the feature and the target class: the CFS algorithm calculates the correlation between pairwise features and the target class to get the importance of each feature. Thus, CFS depends on target class information a priori, so it is a supervised algorithm. The proposed algorithm, in contrast, performs an unsupervised feature selection. The correlation and the variance between the features measure the amount of information that each feature represents compared to the others. Thus, the presented algorithm demands less computational time and is independent of a priori class labeling.
A. The proposed Normalization Algorithm

Normalization should be applied when the data distribution is unknown, and determining the distribution of streaming data is computationally expensive [24]. In our normalization algorithm (Algorithm 1), a histogram of a feature $f_i$ is represented as a vector $b_1, b_2, \ldots, b_n$, such that $b_k$ represents the number of samples that fall into bin $k$.

In practice, it is not possible to know in advance the min and max of a feature. As a consequence, we use a sliding-window approach, where the dataset X is the s last seen samples. For every sliding window, we obtain the min and max values of each feature. Then, the data values are grouped into a set b of intervals called bins. The idea is to divide the feature $f_i$ into a histogram composed of bins $b_1, b_2, \ldots, b_m$, where $m = \sqrt{n} + 1$, with n the number of features, as shown in line 3 of Algorithm 1.

Each bin is delimited by thresholds k; for example, the feature $f_i$ is grouped into $b_1 = [\min_i, k_1), b_2 = [k_1, k_2), \ldots, b_m = [k_{m-1}, \max_i]$. The step between thresholds is called the pivot and is determined as $(\max_i - \min_i)/m$, as shown in Algorithm 2. If the min or max values of the current sliding window fall outside the range of the previous one, that is, $\min_i < \min_{i-1}$ or $\max_i > \max_{i-1}$, new bins are created until the new values of min or max are covered. With the creation of new bins, the proposal is able to detect concept-drift.

Algorithm 1: Stream Normalization Algorithm
  Input : X: sliding window of features, w: window number
  Output: H: normalized features, fr: relative frequency
  1  if w == 1 then
  2      for feature f in X do
  3          bn = √n + 1;                 /* n: number of features */
  4          H = CreateHistogram(X, bn);
  5      end
  6  else if w > 1 then
  7      for sample s in f do
  8          [H, fr] = UpdateHistogram(X, b);
  9      end

Algorithm 2: CreateHistogram() Function
  Input : X: sliding window of features, bn: number of bins
  Output: H: histogram
  1  [max, min] = CalculateMaxMin();
  2  k = (max − min)/bn;                  /* k: threshold step (pivot) */
  3  for bin b in bn do
  4      b = [min_i, k);
  5      k += min;
  6  end

Algorithm 3: UpdateHistogram() Function
  Input : X: sliding window of features, bn: number of bins
  Output: H: histogram, fr: relative frequency
  1  for sample s in X do
  2      for bin b in H do
  3          if s in b then
  4              b += 1;                  /* update the bin hit count */
  5          else
  6              add bins to H until s falls in a bin;
  7      end
  8      fr = calculate using Equation 1;
  9      H = map s to normal distribution;
 10 end

The ratio between the number of observed samples in a bin and the total number of samples in the entire histogram produces the frequency of each bin. Comparing the sample $x_i$ against the thresholds k of the bins (line 4 of Algorithm 3), we determine in which bin we have to increment the number of observed samples. If the value of the sample $x_i$ is in between the thresholds of bin j, then the hit count $fq_j$ of bin j is increased by one. Moreover, we calculate the relative frequency of each bin as the ratio between the bin hit count and the total number of samples, $fr_j = fq_j / N$. Finally, the relative frequency values $fr$ are mapped into a normal distribution by

$$Z_{>P}(x) = \sum_{j=0}^{m} fr_j. \qquad (1)$$

With Equation 1, it is possible to see that all values are now mapped into a normal probability distribution with $\mu = 0$ and $\sigma = 1$ (line 8 in Algorithm 3). As a consequence, all samples are normalized to $-1 \leq x_i \leq 1$.

Figure 2. Representation of a feature divided into a histogram. Each feature is divided into bins that represent the relative frequency of the samples comprised between the thresholds k. The second step of the algorithm approximates the histogram to a normal distribution.

If we consider that the process that generates the stream is non-stationary, a concept-drift is possible. Haim and Tov affirm that histograms must be dynamic when dealing with streaming data [25]. As a consequence, intervals do not have a fixed value, and the bins adapt to concept-drift. However, if the bins remain static, they reflect the evolution of the change over time [26].
In our application, feature normalization for network monitoring, we follow the latter approach: maintaining fixed intervals allows us to see how a feature evolves. Also, as our histogram algorithm creates new bins when a value does not fit in any of the current intervals, it can detect outliers dynamically. With streaming data, it is not possible to keep all the samples $x_i$, because it is computationally inefficient and, in the case of unlimited data, they do not fit in memory. Our algorithm efficiently keeps only the frequency of each bin.

The most complicated function in the normalization process is updating the bins. If the max and min reference values of the window change, the bin update function takes O(n) time. The creation of the histogram is done only in the first window and takes constant time. The histogram update uses a binary search to find the bin of a value in O(log n) time.
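To make the streaming normalization concrete, the sketch below implements the sliding-window histogram of Algorithms 1-3 in Python for a single feature. It is a minimal illustration rather than the authors' implementation: the bin count follows the m = √n + 1 rule, new bins are appended when a sample falls outside the current range, and SciPy's `norm.ppf` is assumed as one way to realize the normal mapping of Equation 1.

```python
import math
import numpy as np
from scipy.stats import norm

class StreamNormalizer:
    """Sketch of the histogram-based stream normalization (Algorithms 1-3)."""

    def __init__(self, n_features, window_size=1000):
        self.m = int(math.sqrt(n_features)) + 1   # number of bins, m = sqrt(n) + 1
        self.window_size = window_size
        self.edges = None                          # bin thresholds k
        self.counts = None                         # hit count fq_j per bin

    def _create_histogram(self, window):
        # Algorithm 2: derive bin thresholds from the window's min and max.
        lo, hi = window.min(), window.max()
        self.edges = np.linspace(lo, hi, self.m + 1)
        self.counts = np.zeros(self.m)

    def _update_histogram(self, x):
        # Algorithm 3: extend the histogram when x falls outside the range,
        # so new bins capture outliers and potential concept-drift.
        pivot = self.edges[1] - self.edges[0]
        while x < self.edges[0]:
            self.edges = np.insert(self.edges, 0, self.edges[0] - pivot)
            self.counts = np.insert(self.counts, 0, 0.0)
        while x > self.edges[-1]:
            self.edges = np.append(self.edges, self.edges[-1] + pivot)
            self.counts = np.append(self.counts, 0.0)
        j = np.searchsorted(self.edges, x, side="right") - 1  # binary search
        j = min(j, len(self.counts) - 1)
        self.counts[j] += 1.0
        # Cumulative relative frequency up to the sample's bin (Equation 1).
        fr = self.counts / self.counts.sum()
        z = fr[: j + 1].sum()
        # Map into a standard normal distribution, clipping away the tails.
        return norm.ppf(np.clip(z, 1e-6, 1 - 1e-6))

    def normalize(self, stream):
        out = []
        for start in range(0, len(stream), self.window_size):
            window = stream[start : start + self.window_size]
            if self.edges is None:
                self._create_histogram(window)     # first window (Algorithm 1)
            out.extend(self._update_histogram(x) for x in window)
        return np.array(out)

# Usage: normalize a synthetic, right-skewed one-feature stream.
rng = np.random.default_rng(0)
normalizer = StreamNormalizer(n_features=46, window_size=1000)
z = normalizer.normalize(rng.exponential(scale=3.0, size=5000))
print(z.mean(), z.std())  # roughly 0 and 1 after the normal mapping
```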
B. The proposed Correlation-Based Feature Selection

We propose Correlation-Based Feature Selection, a simple unsupervised filter method for feature selection. Our method uses the correlation between features. The Pearson correlation of two variables is a measure of their linear dependence. The key idea of the method is to weight each feature based on the correlation of the feature against all other features that describe the dataset. We adopt Pearson's coefficient as the correlation metric. Pearson's coefficient value lies in $-1 \leq \rho \leq 1$, where 1 means that the two variables are directly correlated, a linear relationship, and $-1$ means an inverse linear relationship, also called anticorrelation. The Pearson coefficient $\rho$ can be calculated in terms of the mean $\mu$ and the standard deviation $\sigma$,

$$\rho(A,B) = \frac{1}{N-1} \sum_{i=1}^{N} \left( \frac{A_i - \mu_A}{\sigma_A} \right) \left( \frac{B_i - \mu_B}{\sigma_B} \right), \qquad (2)$$

or in terms of the covariance,

$$\rho(A,B) = \frac{\mathrm{cov}(A,B)}{\sigma_A \sigma_B}. \qquad (3)$$

Then we calculate the weight vector,

$$w_i = \frac{\sigma_i^2}{\sum_{j=0}^{N} |\rho_{ij}|}. \qquad (4)$$

Firstly, we need to obtain the correlation matrix, calculated by Equation 3 (line 1 of Algorithm 4). The correlation matrix holds the pairwise correlation calculations between features. Then, applying Equation 4, we establish a weight w that is a measure of the importance of the feature. To calculate w, we sum the absolute values of the feature correlations (lines 5-6 of Algorithm 4); the absolute value is taken because Pearson's coefficient ρ may assume negative values. Then we calculate the variance V̂ of each feature, which privileges the features with greater variance and lower covariance (line 8 of Algorithm 4). The idea is to establish which features represent the most information, given the correlation between pairs of features. Furthermore, the weights give us an indication of the amount of information a feature carries independently from the others. The weight w takes values between 0 and N, where N is the number of features; the higher the value of w, the higher the variance of the feature and the lower its correlation with the other features, and thus the more information this feature aggregates.

Algorithm 4: Correlation-Based Feature Selection
  Input : X: matrix of features and data
  Output: r: vector of ranked features, w: vector of weights
  1  ρ = Corr(X);                         /* correlation matrix */
  2  for 0 ≤ i < len(ρ) do
  3      w_i = 0;
  4      for 0 ≤ j < len(ρ_i) do
  5          k_i = abs(ρ_ij);             /* absolute values */
  6          aux_i += k_i;                /* sum of correlations */
  7      end
  8      w_i = V̂(i)/aux_i;                /* calculate weights */
  9  end
 10 r = sort(w, by higher values)
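A compact NumPy rendering of Equation 4 and Algorithm 4 is sketched below, assuming X holds one flow per row and one numerical feature per column; the function name and the handling of constant features are ours, not the paper's.

```python
import numpy as np

def correlation_based_feature_selection(X):
    """Sketch of Algorithm 4: unsupervised, correlation-based feature weights.

    X: (n_samples, n_features) array of numerical features.
    Returns (ranking, weights): feature indices sorted by decreasing weight,
    with w_i = Var(x_i) / sum_j |rho_ij| as in Equation 4.
    """
    rho = np.corrcoef(X, rowvar=False)          # pairwise Pearson correlations
    rho = np.nan_to_num(rho)                    # constant features yield NaN rows
    variances = X.var(axis=0, ddof=1)           # V(i), privileging high variance
    weights = variances / np.abs(rho).sum(axis=0)
    ranking = np.argsort(weights)[::-1]         # higher weight = more information
    return ranking, weights

# Usage: keep the six highest-weighted features of a random dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 26))
ranking, weights = correlation_based_feature_selection(X)
X_reduced = X[:, ranking[:6]]
print(ranking[:6], X_reduced.shape)
```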
IV. EVALUATION

To evaluate the proposed algorithms, we perform traffic classification to detect threats. We chose the traffic classification application because it is time-sensitive and our algorithm can significantly reduce the processing time, enabling prompt defense mechanisms. We implemented traffic classification using machine learning algorithms on three different datasets. The NSL-KDD is a modification of the original KDD-99 dataset and has the same 41 features and the same five classes as KDD 99: Denial of Service (DoS), Probe, Root2Local (R2L), User2Root (U2R), and normal [27]. The improvements of the NSL-KDD over KDD 99 are the elimination of redundant and duplicate samples, to avoid a biased classification and overfitting, and a better cross-class balancing to avoid random selection. The GTA/UFRJ dataset² [28] combines real network traffic captured from a laboratory and network threats produced in a controlled environment. Network traffic is abstracted into 26 features and contains three classes: DoS, probe, and normal traffic. The NetOp dataset is a real dataset from a Brazilian network operator [29]. The dataset contains the anonymized access traffic of 373 broadband users of the South Zone of the city of Rio de Janeiro. We collected data during one week of uninterrupted data collection, from February 24 to March 4, 2017. We summarized the packets into a dataset of 46 flow features, associated with an IDS alarm class or with the legitimate traffic class. All datasets are summarized in Table I.

² Anonymized data can be requested by sending an email to the authors.

We use Intel Xeon processors with a clock frequency of 2.6 GHz and 256 GB of RAM for conducting the measurements.

Table I
SUMMARY OF DATASETS USED FOR EVALUATION (FE: FEATURES; FL: FLOWS).

Dataset     Format   Size      Attacks   Type
NSL-KDD     41 Fe    150k Fl   80.5%     Synthetic
GTA/UFRJ    26 Fe    95 GB     30%       Synthetic
NetOp       46 Fe    5M Fl     -         Real
In the first experiment, we use one day of the NetOp dataset to evaluate our normalization method. The Shapiro-Wilk test was used to verify that our proposal enforces a normal distribution on the normalized features. Table II shows the Shapiro-Wilk test results; we considered α = 0.05. We evaluate the hypothesis that our proposed normalization method follows a normal distribution. According to the results, the proposed method has a p-value of 0.24 > 0.05, and W is close to one, W = 0.93; we therefore assume that the samples are not significantly different from a normal population. In the case of Max-Min normalization [30], the p-value is much smaller than α, and W is far from 1. As a consequence, we reject the hypothesis, assuming that the sampled data are significantly different from a normal population. Figure 3 shows a graphical interpretation of the Shapiro-Wilk test; it represents a sample after being normalized. As our proposal follows the normal distribution, its points follow the dashed line, while the Max-Min approach follows a right-skewed distribution.
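The hypothesis test of Table II can be replayed with SciPy on synthetic stand-in data, as sketched below; `scipy.stats.shapiro` returns the W statistic and the p-value, and the α = 0.05 decision rule is the one used in the text.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
normalized = rng.normal(0, 1, 500)    # stands in for our normalized feature
maxmin = rng.exponential(1.0, 500)    # stands in for a Max-Min scaled feature

alpha = 0.05
for name, sample in [("proposal", normalized), ("max-min", maxmin)]:
    W, p = shapiro(sample)
    verdict = "normal" if p > alpha else "not normal"
    print(f"{name}: W={W:.2f}, p={p:.3g} -> {verdict}")
```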
by each parameter. The back-propagation algorithm minimizes
Table II
HYPOTHESIS COMPARISON FOR A NORMAL DISTRIBUTION APPROACH. IN THE SHAPIRO-WILK TEST, THE p-VALUE IS 0.24 > 0.05 AND W IS CLOSE TO ONE (W = 0.93), CONFIRMING THAT THE VALUES FOLLOW A NORMAL DISTRIBUTION.

            Shapiro-Wilk
            Mean W    Mean p
proposal    0.93      0.24
max-min     0.65      9.28e-07

Figure 3. Normal probability plots (probability versus data) of a sample after normalization, for the proposal and for Max-Min normalization.
hyperplane. In this way, SVM finds the hyperplane with
In the following experiments, we verify our preprocessing method in a traffic classification use case. Thus, we implement Decision Tree (DT), with the C4.5 algorithm, Artificial Neural Networks (ANN), and Support Vector Machine (SVM) as classification algorithms to evaluate the proposed feature selection algorithm. We selected these algorithms because they are the ones most used for network security [31]. In all methods, we perform the training on a 70% partition of the dataset and run the test over the remaining 30%. During the training phase, we perform tenfold cross-validation to avoid overfitting. In cross-validation, parts of the dataset are set aside and not used in the estimation of the model parameters; they are later used to check whether the model is general enough to adapt to new data, avoiding overfitting to the training data.

The Decision Tree Algorithm: In a decision tree, leaves represent the final classes and branches represent conditions based on the value of one of the input variables. During the training part, the C4.5 algorithm determines a tree-like classification structure. The real-time implementation of the decision tree consists of if-then-else rules that reproduce the tree-like structure previously calculated. The results are presented in Section IV-A, along with the ones from the other algorithms.

The Artificial Neural Network Algorithm: Artificial neural networks are inspired by the human brain, in which each neuron performs a small part of the processing and transfers the result to the next neuron. In artificial neural networks, the output represents a degree of membership for each class, and the highest degree is selected. The weight vectors Θ are calculated during training; these vectors determine the weight of each neuron connection. In training, there are input and output sample spaces and the errors caused by each parameter. The back-propagation algorithm minimizes the errors. In order to determine to which class a sample belongs, each neural network layer computes the following equations:

$$z^{(i+1)} = \Theta^{(i)} a^{(i)}, \qquad (5)$$
$$a^{(i+1)} = g(z^{(i+1)}), \qquad (6)$$
$$g(z) = \frac{1}{1 + e^{-z}}, \qquad (7)$$

where $a^{(i)}$ is the vector that determines the output of layer i, $\Theta^{(i)}$ is the weight vector that leads layer i to layer i + 1, and $a^{(i+1)}$ is the output of layer i + 1. The function g(z) is the activation function, represented by the sigmoid function, which plays an important role in the classification. For high values of z, g(z) returns one, and for low values g(z) returns zero. Therefore, the output layer gives the degree of membership in each class, between zero and one, classifying the sample as the highest one. The activation function enables and disables the contribution of a given neuron to the final result.
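Equations 5-7 amount to the short forward pass sketched below; the layer sizes (46 inputs, 10 hidden neurons, 3 classes) are illustrative assumptions matching the experimental setup described in this section, and the weights are random stand-ins for trained parameters.

```python
import numpy as np

def sigmoid(z):
    # Equation 7: g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def forward(a, thetas):
    # Equations 5 and 6: z^(i+1) = Theta^(i) a^(i),  a^(i+1) = g(z^(i+1))
    for theta in thetas:
        a = sigmoid(theta @ a)
    return a  # membership degree per class, in (0, 1)

rng = np.random.default_rng(0)
theta1 = rng.normal(size=(10, 46))   # input (46 flow features) -> 10 hidden neurons
theta2 = rng.normal(size=(3, 10))    # hidden -> 3 classes (normal, DoS, probe)
x = rng.normal(size=46)
memberships = forward(x, [theta1, theta2])
print(memberships, "predicted class:", memberships.argmax())
```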
The Support Vector Machine Algorithm: The Support Vector Machine (SVM) is a binary classifier based on the concept of a decision plane that defines the decision thresholds. The SVM algorithm classifies through the construction of a hyperplane in a multidimensional space that splits the different classes. An iterative algorithm minimizes an error function, finding the best hyperplane separation. A kernel function defines this hyperplane. In this way, SVM finds the hyperplane with maximum margin, that is, the hyperplane with the largest possible distance between both classes.
The real-time detection is performed by classification of each class pair: normal and non-normal, DoS and non-DoS, and probe and non-probe. Once SVM calculates the output, the chosen class is the one with the highest score. The classifier score of a sample x is the distance from x to the decision boundaries, which goes from $-\infty$ to $+\infty$. The classifier score is given by

$$f(x) = \sum_{j=1}^{n} \alpha_j y_j G(x_j, x) + b, \qquad (8)$$

where $(\alpha_1, \ldots, \alpha_n, b)$ are the estimated parameters of the SVM and $G(x_j, x)$ is the used kernel. In this work, the kernel is linear, that is, $G(x_j, x) = x_j' x$, which presents a good performance with the minimum quantity of input parameters.
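The class-pair scoring of Equation 8 with a linear kernel can be sketched with scikit-learn, whose `decision_function` returns the signed distance f(x) to each hyperplane; the one-vs-rest wrapper below mirrors the normal/DoS/probe pair decomposition, on synthetic stand-in data.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 26))
y = rng.integers(0, 3, size=600)        # 0: normal, 1: DoS, 2: probe

# One binary linear SVM per class pair (class vs. non-class).
clf = OneVsRestClassifier(LinearSVC(dual=False)).fit(X, y)
scores = clf.decision_function(X[:5])   # f(x): signed distance to each hyperplane
print(scores)
print("chosen classes:", scores.argmax(axis=1))  # highest score wins
```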
A. Classification Results
This experiment shows the efficiency of our feature selection algorithm compared to literature methods. We evaluate a linear Principal Component Analysis (PCA), the ReliefF algorithm, the Sequential Forward Selection (SFS), and the Support Vector Machine Recursive Feature Elimination (SVM-RFE). For all methods, we analyze versions with four and six output features in the GTA/UFRJ dataset. For the sake of fairness, we tested all the algorithms with the classification methods presented before. We use a decision tree with a minimum of 4096 leaves, a binary support vector machine (SVM) with linear kernel, and a neural network with one hidden layer of 10 neurons. We use ten-fold cross-validation for our experiments.

Figure 4 presents the information gain (IG) sum of the selected features for each evaluated algorithm. Information gain measures the amount of information, in bits, that a feature adds to the class prediction. Thus, the information gain calculation computes the difference between the target class entropy and the conditional entropy of the target class given the feature value. When employing six features, the results show that our algorithm has an information retention capability between SFS and ReliefF, and higher than SVM-RFE. The information retention capability of PCA is higher than that of the feature selection methods, as each feature is a linear combination of the original features and is computed to retain most of the dataset variance.

Figure 4. Information gain sum for feature selection algorithms. The features selected by our algorithm keep an information retention capability between SFS and ReliefF in the GTA/UFRJ dataset.
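Information gain, as plotted in Figure 4, is the difference between the class entropy and the conditional class entropy given the feature. A small sketch, assuming the feature has already been discretized into bins:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature_bins, labels):
    # IG(class; feature) = H(class) - H(class | feature)
    h_class = entropy(labels)
    n = len(labels)
    h_cond = 0.0
    for value in set(feature_bins):
        subset = [l for f, l in zip(feature_bins, labels) if f == value]
        h_cond += len(subset) / n * entropy(subset)
    return h_class - h_cond

labels = ["dos", "dos", "normal", "probe", "normal", "dos"]
feature = [1, 1, 0, 2, 0, 1]   # discretized feature values
print(information_gain(feature, labels))  # bits added to the class prediction
```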
Figure 5 shows the accuracy of the three classification methods, Decision Tree, Neural Network, and Support Vector Machine (SVM), when different dimensionality reduction methods choose the input variables. In the first group, our proposal with six features reaches 97.4% accuracy, which is the best result for the decision tree classifier. The next results are PCA with four and six features, at 96% and 97.2%. The Sequential Forward Selection (SFS) presents the same result with four and six features, 95.5%. The ReliefF algorithm also has the same result for four and six features, 91.2%. Finally, the lowest result is shown by the SVM-RFE algorithm with four and six features. As the decision tree algorithm creates the decision nodes based on the variables with higher entropy, the proposed feature selection algorithm performs better because it keeps most of the variance of the dataset.

For the second classifier, the neural network, the best result is shown by PCA with six features, with 97.6% accuracy. However, PCA with four features presents a lower performance, with 85.5%. ReliefF presents the same result for both feature counts, 90.2%. Our proposal reaches 83.9% and 85.0% for four and six features. On the other hand, SFS presents the worst results of all classifiers, 78.4% with four features and 79.2% with six features. One notable result is SVM-RFE: with four features it presents a deficient result of 73.6%, one of the worst for all classifiers; however, with six features it presents almost the second-best result, with 90.1%.

In the Support Vector Machine (SVM) classifier, PCA presents a behavior similar to the neural network case. With six features, it presents the highest accuracy of all classifiers, 98.3%, but just 87.8% with four features. ReliefF again presents the same result for both cases, 91.4%. Our proposal reaches 84% with four features and 85% with six features. SFS presents the same result for both feature counts, 79.5%. The lowest accuracy of this classifier is SVM-RFE, with 73.6% in both cases. As our proposal maximizes the variance of the resulting features, the reduced dataset spreads into the new space. For a linear classifier, such as SVM, it is hard to define a classification surface for spread data. Thus, the resulting accuracy is not among the highest. However, as the selected set of features is still significant for defining the data, the resulting accuracy is not the worst one.
Figure 5. Accuracy comparison of feature selection methods. Our proposal, linear PCA, ReliefF, SFS, and SVM-RFE compared in decision tree, SVM, and neural network algorithms in the GTA/UFRJ dataset.

The sensitivity metric shows the rate of correctly classified samples. It is a good metric to evaluate the success of a classifier when using a dataset in which one class has many more samples than the others. In our problem, we use sensitivity as a metric to evaluate our detection success. For this, we consider the detection problem as a binary classification, i.e., we consider two classes: normal and abnormal traffic. In this way, the Denial of Service (DoS) and Port Scanning threat classes compose a common attack class. Similarly to the accuracy representation in Figure 5, Figure 6 represents the sensitivity of the classifiers applying the different methods of feature selection. In the first group, the classification with Decision Tree, PCA shows the best sensitivity with 99% of correct classification; our algorithm achieves a performance of almost 95% sensitivity with four and six features. Neural Networks, represented in the second group, have the best sensitivity with PCA using six features, with 97.7%; then our results show good performance with both four and six features, at 89%. In this group, the worst sensitivity of all classifiers is reached by SVM-RFE with four and six features, at 69.3%. Finally, the last group shows the sensitivity for the Support Vector Machine (SVM) classifier. Again showing a behavior similar to the previous group, PCA with six features presents the best sensitivity, with 97.8%. Then, the second-best result is reached by our algorithm, as well as by ReliefF, with 89% sensitivity for both feature counts. It is worth noting that our algorithm presents stable behavior in accuracy as well as in sensitivity. We highlight that our algorithm performs nearly equal to PCA. PCA creates artificial features that are a composition of all real features, while our algorithm selects features from the complete set of features. In this way, our algorithm was the best feature-selection method among the evaluated ones, and it also introduces less computing overhead when compared with PCA.

When analyzing the features each method chooses, it is possible to see that none of the methods selects the same set of features. Nevertheless, ReliefF and SFS select Amount of IP Packets as the second-best feature. One surprising result from SFS is the election of Amount of ECE Flags and Amount of CWR Flags: in a correlation test, these two features show no information inclusion because they are empty variables. However, we realized that one of the main features is Average Packet Size. In this dataset, the average packet size is fundamental to classify attacks. One possible reason is that, during the creation of the dataset, an automated tool performed the Denial of Service (DoS) and Probe attacks, and this automated tool mainly produces attacks without altering the packet length.

Figure 7 shows a comparison of the processing time of all implemented feature selection and dimensionality reduction methods. All measures are relative values. We can see that SFS presents the worst performance. The SFS algorithm performs multiple iterations to minimize the mean square error (MSE); consequently, all these iterations increase the processing time. Our proposal shows the best processing time together with PCA, because both implementations reduce to matrix multiplication, which is a simple computational function.

The next experiment evaluates our proposal on different datasets. We use the NSL-KDD dataset and the NetOp dataset. Besides linear SVM, Neural Network, and Decision Tree, we also evaluate K-Nearest Neighbors (K-NN), Random Forest, two kernels for the Support Vector Machine (SVM), linear and Radial Basis Function (RBF), Gaussian Naive Bayes, and Stochastic Gradient Descendant. Adding these algorithms, we cover the full range of the most important algorithms for supervised machine learning.

The Random Forest (RF) algorithm avoids overfitting when compared to the simple decision tree because it constructs several decision trees, trained on different parts of the same dataset. This procedure decreases the variance of the classification and improves the performance compared to the classification of a single tree. The class prediction in the RF classifier consists of applying the sample as input to all the trees, obtaining the classification of each one of them, and then a voting system decides the resulting class. The construction of each tree must follow these rules: (i) for each node d, select k input variables of the total m input variables, such that k ≪ m; (ii) calculate the best binary division of the k input variables for the node d, using an objective function; (iii) repeat the previous steps until each tree reaches l nodes or until its maximum extension.

The simple Bayesian classifier (Naive Bayes, NB) takes the strong premise of independence between the input variables to simplify the classification prediction; that is, the value of each input variable does not influence the value of the other input variables. From this, the method calculates the a priori probabilities of each input variable, or a set of them, belonging to a given class. As a new sample arrives, the algorithm calculates, for each input variable, the probability of being a sample of each class. The product of all the probabilities of each input variable results in the posterior probability of the sample belonging to each class. The algorithm then returns the classification with the highest estimated probability.
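For reference, the Random Forest voting and the Naive Bayes posterior described above correspond to the following scikit-learn calls; the data and hyperparameters here are illustrative assumptions only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 46)), rng.integers(0, 2, size=1000)

# Random Forest: many trees on different parts of the data, majority vote;
# max_features="sqrt" draws k << m variables at each node split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt").fit(X, y)

# Gaussian Naive Bayes: per-feature likelihoods multiplied into a posterior.
nb = GaussianNB().fit(X, y)

print(rf.predict(X[:3]), nb.predict_proba(X[:3]).round(2))
```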
Figure 6. Sensitivity of detection in decision tree, SVM, and neural network algorithms for feature selection methods in the GTA/UFRJ dataset.

In the K-Nearest Neighbors (K-NN) algorithm, the class of an unknown sample is based on the classes of the k neighbors closest to the sample. The value k is a positive integer and usually small. If k = 1, the sample class is assigned to the class of its nearest neighbor. If k > 1, the sample class is obtained by a resultant function, such as simple voting or weighted voting, of the k neighbors' classes. The neighborhood definition uses a measure of similarity between samples in the feature space. The threat detection literature often uses the Euclidean distance, although other distances give good results, and the best choice of similarity measure depends on the type of dataset used [32]. The Euclidean distance of two samples p and q in the space of n features is given by

$$d(p,q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}. \qquad (9)$$

Stochastic Gradient Descent with Momentum: This scheme relies on the Stochastic Gradient Descent (SGD) [33] algorithm, a stochastic approximation of gradient descent in which a single sample approximates the gradient. In our application, we consider two classes, normal and threat. Therefore, we use the sigmoid function, expressed by

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^{\top} x}}, \qquad (10)$$

to perform logistic regression. In the sigmoid function, low values of the parameters $\theta^{\top}$ times the sample feature vector x return zero, whereas high values return one. When a new sample $x^{(i)}$ arrives, SGD evaluates the sigmoid function and returns one for $h_\theta(x^{(i)})$ greater than 0.5 and zero otherwise. This decision presents an associated cost, based on the real class of the sample $y^{(i)}$. The cost function is defined in Equation 11. This function is convex, and the goal of the SGD algorithm is to find its minimum, expressed by

$$J^{(i)}(\theta) = -\left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]. \qquad (11)$$

On each new sample, the algorithm takes a step toward the cost function minimum based on the gradient of the cost function.
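Equations 10 and 11 define one logistic-regression SGD step; the sketch below omits the momentum term and assumes a fixed learning rate, both illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # Equation 10

def sgd_step(theta, x_i, y_i, lr=0.1):
    """One stochastic step toward the minimum of the cost of Equation 11."""
    h = sigmoid(theta @ x_i)
    gradient = (h - y_i) * x_i              # gradient of the logistic cost
    return theta - lr * gradient

rng = np.random.default_rng(0)
theta = np.zeros(3)
for _ in range(2000):                       # stream of (x, y) samples
    x_i = np.append(rng.normal(size=2), 1.0)       # features + bias term
    y_i = 1.0 if x_i[0] + x_i[1] > 0 else 0.0      # ground truth: a hyperplane
    theta = sgd_step(theta, x_i, y_i)

x_new = np.array([1.5, -0.2, 1.0])
print("threat" if sigmoid(theta @ x_new) > 0.5 else "normal")
```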
Figure 7. Performance of feature selection algorithms according to processing time. Our proposal and PCA present the best processing time in the GTA/UFRJ dataset.

Validation in NSL-KDD and NetOp Datasets: The first experiment evaluates the performance of the feature selection in both datasets. In this experiment, we vary the number of selected features to evaluate the impact on the accuracy. We analyze the performance with no feature selection (No FS), and then we reduce the features from 10% to 90% of the original set of features. All the experiments were performed using K-fold cross-validation. The K-fold cross-validation performs K training iterations, each training on K − 1 partitions of the data and testing on the remaining partition, in a mutually exclusive manner. We use K = 10, which is the most commonly used value.

Figure 8 shows the effect of feature selection. No feature selection performs well for almost all algorithms. Reducing the number of features by 10%, however, improves the accuracy of all algorithms, except for Random Forest. In contrast, a bigger feature reduction deteriorates the accuracy of all classifiers.

Figure 8. Evaluation of feature selection varying the selected features in the NSL-KDD dataset.
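The K = 10 protocol just described maps directly onto scikit-learn's cross-validation utilities; a sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 41)), rng.integers(0, 2, size=1000)

# 10 iterations: train on 9 folds, test on the held-out fold each time.
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10, scoring="accuracy")
print(scores.mean(), scores.std())
```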
We also measure other metrics, such as precision, sensitivity, F1-score, and training and classification time. We compare the effect of the 10% reduction on all these metrics. Figures 9 and 10 show the accuracy, precision, sensitivity, and F1-score for the dataset with no feature selection (Figure 9) and with 10% of reduction (Figure 10). For K-NN, SVM with the Radial Basis Function (RBF) kernel, and Gaussian Naive Bayes, the metrics remain the same. For the neural network (MLP) and SVM with a linear kernel, an improvement of 2-3% in all metrics is reached with 10% of feature reduction. Random Forest presents the worst performance when features are reduced: all its metrics worsen by 8-9%. Stochastic Gradient Descendant (SGD) also suffers a small reduction of 1% in its metrics. The decision tree benefits the most, improving its metrics by 3-4%, which shows the capability of feature selection to reduce overfitting.

Figure 11. Classification and training time in the NSL-KDD dataset with no feature selection.

1 785.08
Accuracy Training Time 349.05
Precision Sensitivity Classification Time

Time (s)
20.92 66.29 32.88
0.8 F1-score
5.03 2.88
0.68 1.50
0.6 0.79
100 0.08 0.10
0.4 0.02 0.03
0.003 0.003
0.2

BF
LP

ee
st
N

r
ea

ye

G
KN

re

Tr
.M

-R

in

Ba
Fo

ic
-L
M
0

st
et

N
d.

M
SV

ha
lN
LP

G
an

SV
BF

ee
t
N

oc
es

ra
ea
.M

ye

R
G
KN

St
Tr

eu
-R
r

in

Ba
Fo
et

ic
-L

N
M

st
N

N
d.

M
SV

ha
l

G
ra

an

SV

oc
eu

St

Figure 12. Classification and training time in NSL-KDD Dataset with only
N

10% of the initial features


Figure 9. Accuracy, precision, sensitivity and F1-Score for NSL-KDD.
Metrics with no future selection.

Figure 10. Metrics reducing only 10% of the initial features in the NSL-KDD dataset.

Figure 11 shows the training and classification times with no feature selection, while Figure 12 shows the results with 10% of reduced features. The K-NN algorithm training time augmented considerably, passing from 0.63 seconds to 5.03 seconds, while classification time also suffered an increase, passing from 1.89 seconds to 2.88 seconds. The Neural Network reduced its training time by 9%, from 22.99 seconds to 20.92 seconds, while classification time increased by 0.01 second. Random Forest training time increased by 0.02 second, and classification time remained the same, which is negligible given the intrinsic error of the cross-validation. SVM with Radial Basis Function (RBF) and SVM with linear kernel benefit the most from feature selection. SVM-RBF training time was reduced by 11% and classification time by 16%. SVM-Linear training time was reduced by 46%, from 654 seconds to 349 seconds, and classification time by 40%, from 54.86 to 32.88 seconds. Feature selection in Gaussian Naive Bayes, Stochastic Gradient Descendant, and Decision Tree strongly impacts training time, with an approximate reduction of 30%, while the classification time was reduced by one time unit in these three algorithms.

We performed the same experiment in the NetOp dataset. Figure 13 shows the accuracy of different classifiers while reducing from 10% to 90% of the features. Using the NetOp dataset, applying feature selection keeps classifier accuracy unaffected. In the case of K-NN, the accuracy variation is less than 0.02%. A similar case occurs with Neural Networks, SVM with linear and RBF kernels, Stochastic Gradient Descendant, and Decision Tree. In Random Forest, the best accuracy is found with a reduction of 30% of the original set of features. The best result is reached with Gaussian Naive Bayes, for which a 90% reduction in the selected features increases the accuracy from 57% to 78%, using only five features.
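The 10% to 90% reduction sweep of Figures 8 and 13 can be reproduced by ranking the features once and retraining on progressively smaller prefixes of the ranking; the sketch below reuses the weight formula of Equation 4 on synthetic stand-in data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = rng.normal(size=(2000, 46)), rng.integers(0, 2, size=2000)

# Rank once with the correlation-based weights (Equation 4), then sweep.
weights = X.var(axis=0) / np.abs(np.corrcoef(X, rowvar=False)).sum(axis=0)
ranking = np.argsort(weights)[::-1]

for reduction in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9]:
    kept = max(1, int(round(X.shape[1] * (1 - reduction))))
    acc = cross_val_score(DecisionTreeClassifier(), X[:, ranking[:kept]], y, cv=10).mean()
    print(f"reduction {int(reduction * 100):2d}% -> {kept:2d} features, accuracy {acc:.3f}")
```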
Figure 13. Evaluation of feature selection varying the selected features in the NetOp dataset.

Reducing 90% of the selected features, we analyze other metrics, such as precision, sensitivity, and F1-score, for all classifiers. We compare the results with no feature selection (Figure 14) and with only five features (Figure 15). All metrics remain almost equal. We achieve a slight positive variation in Gaussian Naive Bayes and Random Forest. We conclude that, for this dataset, our feature selection method keeps the metrics unaltered or increases classifier performance, because our proposal keeps most of the independent features in the dataset.

Figure 14. Accuracy, precision, sensitivity, and F1-score for the NetOp dataset; metrics with no feature selection.

Figure 15. Metrics reducing only 90% of the initial features in the NetOp dataset.

Figure 16. Classification and training time in the NetOp dataset with no feature selection.
Figure 16 shows the training and classification times with no 10 0
0.14622
0.04
feature selection, while Figure 16 shows the training and clas- 0.03
0.01
0.004 0.005
sification times for the dataset with 90% of feature reduction. 0.001

All the classifiers reduced their times. K-NN training time is


LP

BF

ee
st

es
N

D
ea
KN

G
re

Tr
.M

-R

in

reduced by 71%, while classification time is reduced by 84%.


Ba
Fo

tic
-L
M
et

as
N
d.

M
SV
lN

ch
an

For Neural Networks reduced the training time by 25% and


SV
ra

o
R

St
eu

classification time is reduced in 0.02 seconds. Random Forest


N

reduced their training time by 38% while their classification


Figure 17. Classification and training time in NetOp Dataset with only 90%
time remains the same. SVM with RBF kernel training time of the initial features
is reduced by 78% and training time is reduced by 54%.
SVM with linear kernel received the biggest improvement. In this experiment, we show the most important group of
Training time was reduced by 88% while classification time features. Thus, we group features into eight groups in the
was reduced by 81%. Gaussian Naive Bayes reduced their NetOp dataset. We remove the flow tuple information features
In this experiment, we show the most important groups of features. Thus, we group the features of the NetOp dataset into eight groups. We remove the flow tuple information features, because our algorithm works on numerical features and the tuple information features are categorical. Table III describes the remaining groups. We set the window size to 1000 samples. Figure 18 shows the accuracy of all seven classification algorithms. For Decision Tree, all groups show similar behavior and present high accuracy. Gaussian Naive Bayes and SVM with the linear kernel present the lowest accuracy for group 3, Time Between Packets, and for group 5, SubFlow Information; for the rest of the groups, these classifiers also reach high accuracy. K-Nearest Neighbors (K-NN) is a special case: besides group 2, which yields the highest accuracy, all the other groups show different behaviors. For Neural Networks, groups 2 and 3, Packet Statistics and Time Between Packets, show the highest accuracy, while the remaining groups stay around 50%. Random Forest shows a behavior similar to Decision Tree, with high accuracy for all groups; nevertheless, group 5, SubFlow Information, presents the lowest accuracy. Stochastic Gradient Descent shows the highest accuracy for groups 2, 6, and 7. We conclude that group 2, Packet Statistics, is the most important for the accuracy of all the classifiers.

Figure 18. Evaluation of group features with different machine learning algorithms.

Table III
FEATURES GROUPS

Group  Description              Number of Features
G1     Packet Volume            4
G2     Packet Statistics        8
G3     Time Between Packets     8
G4     Flow Time Statistics     9
G5     SubFlow Information      4
G6     TCP Flags                4
G7     Bytes in Headers + ToS   3
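A per-group evaluation such as the one summarized in Figure 18 reduces to training one classifier per column subset. In this sketch the column indices assigned to each group are hypothetical, since the real mapping depends on the NetOp feature ordering:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Hypothetical column ranges for the groups of Table III (G1-G7).
GROUPS = {
    "G1 Packet Volume":          range(0, 4),
    "G2 Packet Statistics":      range(4, 12),
    "G3 Time Between Packets":   range(12, 20),
    "G4 Flow Time Statistics":   range(20, 29),
    "G5 SubFlow Information":    range(29, 33),
    "G6 TCP Flags":              range(33, 37),
    "G7 Bytes in Headers + ToS": range(37, 40),
}

def accuracy_per_group(X_tr, y_tr, X_te, y_te):
    # Trains one classifier per feature group and reports its test accuracy.
    for name, cols in GROUPS.items():
        cols = list(cols)
        clf = RandomForestClassifier(n_estimators=50, random_state=0)
        clf.fit(X_tr[:, cols], y_tr)
        print(name, "accuracy:", accuracy_score(y_te, clf.predict(X_te[:, cols])))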
Finally, this experiment shows that our preprocessing method, when executed with machine learning classifiers on stream data, can detect concept-drift. The experiment also demonstrates that the proposed preprocessing method can run on both batch and stream data. We use the flow diagram of Figure 19. We force traditional learning methods to become adaptive learners to detect concept-drift. Adaptive learners dynamically adapt to new training data when the learned concept is contradicted. Once a concept-drift is detected, a new model is created.

We validate the proposal with the NetOp dataset. The dataset presents samples labeled as threats or normal traffic, a binary classification problem. We divide the dataset into a training set and a test set, with 70% of the samples for training and 30% for testing. We consider the training set as static, with T consecutive sample windows. We used the Synthetic Minority class Oversampling TEchnique (SMOTE) [34] to oversample the minority class, only in the training set, i.e., the initial window. When the number of samples of one class predominates in the dataset, it is called class imbalance. Class imbalance is typical in our kind of threat detection application, since attacks are rare events when compared to normal traffic. The test set is streaming data arriving at the same frequency. We group the data in sliding windows of N samples.
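A minimal sketch of this setup, assuming the imbalanced-learn package for SMOTE and arrays X and y ordered by arrival time (the function name and default values are ours, not from the paper):

from imblearn.over_sampling import SMOTE  # requires the imbalanced-learn package

def prepare_stream(X, y, train_frac=0.7, window=1000):
    # Static training set (70%), oversampled with SMOTE only in the initial
    # window; the remaining stream is grouped into windows of N samples.
    n_train = int(train_frac * len(X))
    X_tr, y_tr = SMOTE(random_state=0).fit_resample(X[:n_train], y[:n_train])
    X_te, y_te = X[n_train:], y[n_train:]
    windows = [(X_te[i:i + window], y_te[i:i + window])
               for i in range(0, len(X_te), window)]
    return X_tr, y_tr, windows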

Figure 19. Flow diagram used for the proposal evaluation.

Figure 20 shows the accuracy when we analyze one day of the NetOp dataset. In this experiment, we measure the impact of concept-drift on the final accuracy. Detecting the concept-drift helps to improve the performance of the system, since the model will not be recalculated. We train the different static algorithms with 30% of the dataset and use 1000 samples as the window size. The trained static algorithms are the Support Vector Machine (SVM) with linear and with Radial Basis Function (RBF) kernels, Gaussian Naive Bayes, Decision Tree, and Stochastic Gradient Descent (SGD). The Decision Tree has the worst accuracy when compared with the other algorithms, showing low accuracy already in the second window. This behavior means that the model created during the training step does not adequately represent the entire dataset. Stochastic Gradient Descent shows a behavior similar to Decision Tree, with a concept-drift in the second window. The SVM with linear kernel presents a concept-drift in the seventh window. The SVM with RBF kernel shows lower accuracy during the whole experiment and a concept-drift in the last window. Finally, due to its implementation, Gaussian Naive Bayes follows the same probability distribution as our normalization method and, as a consequence, does not present any concept-drift.

Figure 20. Concept-drift detection in one day of the NetOp dataset. Our proposal was able to detect early concept-drift for SGD and for SVM with linear kernel. Gaussian Naive Bayes shows very high performance with no concept-drift.
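The detection loop behind Figure 20 can be approximated as follows; the accuracy-drop threshold and retraining on the drifting window are our assumptions, standing in for the exact criterion used in the experiments:

from sklearn.metrics import accuracy_score

DRIFT_DROP = 0.15  # hypothetical accuracy drop that declares a concept-drift

def detect_drift(clf, X_tr, y_tr, windows):
    # Trains a static model, evaluates it per window, and adapts on drift.
    clf.fit(X_tr, y_tr)
    reference = None
    for i, (X_w, y_w) in enumerate(windows, start=1):
        acc = accuracy_score(y_w, clf.predict(X_w))
        if reference is None:
            reference = acc
        elif reference - acc > DRIFT_DROP:
            print("concept-drift detected at window", i)
            clf.fit(X_w, y_w)  # adaptive learner: rebuild the model
            reference = acc
    return clf

The same loop wraps any of the static classifiers used in the experiment, for instance detect_drift(GaussianNB(), X_tr, y_tr, windows).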
V. RELATED WORK

State-of-the-art proposals focus on algorithms for online feature selection. Perkins and Theiler propose Grafting, an algorithm based on a stage-wise gradient descent approach. Gradient Feature Testing (Grafting) [35] treats feature selection as part of the core of the learning process. The objective function is a binomial negative log-likelihood loss. The Grafting method uses an incremental and iterative gradient descent. At each step, a heuristic is used to identify which feature improves the existing model. If the feature optimizes the model, the selected feature and the model are returned. The Grafting algorithm is able to operate with streaming features. However, a value of the regularization parameter λ must be determined in advance to establish which feature is most likely to improve the model at each iteration. Determining the value of λ requires information about the global feature set. As a consequence, the Grafting method is ineffective on streaming data with an unknown feature set size.

The Alpha-investing method [36] considers that new features arrive in a streaming manner, generated sequentially for a predictive model. The main advantage of Alpha-investing is the possibility of running on a feature set of unknown or even infinite size. Every time a feature arrives, Alpha-investing uses linear regression to dynamically reduce the error threshold. As a drawback, Alpha-investing only considers the addition of new features, without evaluating redundancy after the feature inclusion.

Wu et al. presented the OSFS (Online Streaming Feature Selection) algorithm and its faster version, the Fast-OSFS algorithm, to avoid the redundancy of added features [37]. The OSFS algorithm uses the Markov blanket of a feature to determine the relevance of the feature in relation to its neighbors. The Markov blanket of a node A, MB(A), is its set of neighboring nodes. The computational cost of calculating the Markov blanket of a feature is prohibitive when dealing with high-dimensional data.

Smart Preprocessing for Streaming Data (SPSD) is an approach that uses min-max normalization of numerical features [30]. The authors use two metrics to avoid unnecessary renormalization: SPSD only renormalizes when one of the metrics exceeds a threshold value. Streaming data joins equal-size chunks where all operations originate. The first data chunk is used to take the reference min-max values and to send the normalized data to the training model. Metric 1 represents the number of samples falling outside the reference min-max values; metric 2 is the relation between new sample values in each dimension and the reference min-max value for that dimension. Similar to our proposal, the algorithm works with numerical data.

The Incremental Discretization Algorithm (IDA) uses a quantile approach to discretize data streams [26]. The algorithm discretizes the data stream into m equal-frequency bins. A sliding-window version of the algorithm is proposed to follow the evolution of the data stream. The algorithm maintains the data in bins with fixed quantiles of the distribution, rather than fixed absolute values, to follow the distribution drift.

In contrast, we propose an unsupervised preprocessing method that includes normalization and feature selection altogether. The proposal is parameter-free. Our algorithm follows an active approach for concept-drift detection: the active approach monitors the concept, the label, to determine when drift occurs before taking any action. A passive approach, in contrast, updates the model every time new data arrives, wasting resources. We modified our proposed Feature Selection algorithm to calculate the correlation between features in a sliding window. Also, a normalization algorithm is proposed to handle data streams.
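To make the contrast with these proposals concrete, the sketch below shows one plausible reading of unsupervised, correlation-based selection over a sliding window; the aggregation rule is illustrative, not the exact criterion of our algorithm:

import numpy as np

def select_by_correlation(X_window, k):
    # Pairwise Pearson correlation between features inside one window.
    corr = np.nan_to_num(np.corrcoef(X_window, rowvar=False))
    np.fill_diagonal(corr, 0.0)
    score = np.abs(corr).sum(axis=0)  # aggregate absolute correlation per feature
    return np.argsort(score)[-k:]     # indices of the k highest-scoring features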
VI. CONCLUSION

Achieving good classification metrics for streaming data is a challenge because neither the number of samples nor the domain of each sample feature is bounded. Therefore, it is mandatory to apply preprocessing methods to streaming data to bound the domain of each feature and to select only the most representative features for the classification model. In this paper, we presented a method of data preprocessing for the classification of network traffic. The method is composed of two algorithms. First, we propose a normalization algorithm that enforces data to a normal distribution with values between −1 and 1, which produces a more accurate classification model and reduces the classification error. Then, a feature selection algorithm calculates the correlation of all pairs of features and selects the best features in an unsupervised way.
We select the features with the highest absolute correlation. When compared with traditional feature selection algorithms, our proposal selects an optimized subset of features, improving the accuracy by more than 11% with a 100-fold reduction in processing time. Moreover, we modified our preprocessing method to work both on batch and on streaming data. When applied over streaming data, our preprocessing method was able to detect concept-drift thanks to its sensibility for detecting changes in the accuracy over a threshold.
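One simple transform with the bounded behavior described above is a per-feature z-score followed by a tanh squashing into (−1, 1); this is a sketch of the described effect, not the paper's exact normalization algorithm:

import numpy as np

def normalize_window(X):
    # Center and scale each feature, then bound the result in (-1, 1).
    mu = X.mean(axis=0)
    sigma = X.std(axis=0) + 1e-9  # guard against constant features
    return np.tanh((X - mu) / sigma)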
ACKNOWLEDGMENT

The authors would like to thank Antonio Lobato, Igor Alvarenga, and Igor Sanz for their significant contributions to obtain the results. This research is supported by CNPq, CAPES, FAPERJ, and FAPESP (2015/24514-9, 2015/24485-9, and 2014/50937-1).
REFERENCES

[1] Hu, P., Li, H., Fu, H., Cansever, D., and Mohapatra, P., “Dynamic defense strategy against advanced persistent threat with insiders,” in IEEE Conference on Computer Communications (INFOCOM), Apr. 2015, pp. 747–755.
[2] Paxson, V., “Bro: a System for Detecting Network Intruders in Real-Time,” Computer Networks, vol. 31, no. 23-24, pp. 2435–2463, 1999.
[3] Roesch, M., “Snort - Lightweight Intrusion Detection for Networks,” in Proceedings of the 13th USENIX Conference on System Administration. USENIX Association, 1999, pp. 229–238.
[4] Vallentin, M., Sommer, R., Lee, J., Leres, C., Paxson, V., and Tierney, B., “The NIDS Cluster: Scalable, Stateful Network Intrusion Detection on Commodity Hardware,” in Recent Advances in Intrusion Detection. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 107–126.
[5] Bar, A., Finamore, A., Casas, P., Golab, L., and Mellia, M., “Large-scale network traffic monitoring with DBStream, a system for rolling big data analysis,” in 2014 IEEE International Conference on Big Data (Big Data). IEEE, Oct. 2014, pp. 165–170.
[6] Stonebraker, M., Çetintemel, U., and Zdonik, S., “The 8 requirements of real-time stream processing,” ACM SIGMOD Record, vol. 34, no. 4, pp. 42–47, 2005.
[7] Mayhew, M., Atighetchi, M., Adler, A., and Greenstadt, R., “Use of machine learning in big data analytics for insider threat detection,” in IEEE Military Communications Conference, MILCOM, Oct. 2015, pp. 915–922.
[8] Mladenić, D., “Feature Selection for Dimensionality Reduction,” in Subspace, Latent Structure and Feature Selection (SLSFS): Statistical and Optimization Perspectives Workshop, Saunders, C., Grobelnik, M., Gunn, S., and Shawe-Taylor, J., Eds. Bohinj, Slovenia: Springer Berlin Heidelberg, 2006, pp. 84–102.
[9] Bifet, A. and Morales, G. D. F., “Big data stream learning with SAMOA,” in 2014 IEEE International Conference on Data Mining Workshop, Dec. 2014, pp. 1199–1202.
[10] Khamassi, I., Sayed-Mouchaweh, M., Hammami, M., and Ghédira, K., “Discussion and review on evolving data streams and concept drift adapting,” Evolving Systems, vol. 9, no. 1, pp. 1–23, 2018.
[11] Rahm, E. and Do, H. H., “Data cleaning: Problems and current approaches,” IEEE Bulletin of the Technical Committee on Data Engineering, vol. 23, no. 4, pp. 3–13, 2000.
[12] García, S., Luengo, J., and Herrera, F., Data Preprocessing in Data Mining. Springer, 2016.
[13] Robnik-Šikonja, M. and Kononenko, I., “Theoretical and Empirical Analysis of ReliefF and RReliefF,” Machine Learning, vol. 53, no. 1-2, pp. 23–69, 2003.
[14] Schölkopf, B., Smola, A. J., and Müller, K.-R., “Kernel principal component analysis,” in Advances in Kernel Methods. MIT Press, 1999, pp. 327–352.
[15] García, S., Luengo, J., and Herrera, F., “Tutorial on practical tips of the most influential data preprocessing algorithms in data mining,” Knowledge-Based Systems, vol. 98, pp. 1–29, Apr. 2016. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S0950705115004785
[16] Zhang, S., Zhang, C., and Yang, Q., “Data preparation for data mining,” Applied Artificial Intelligence, vol. 17, no. 5-6, pp. 375–381, 2003.
[17] Tan, S., “Neighbor-weighted k-nearest neighbor for unbalanced text corpus,” Expert Systems with Applications, vol. 28, no. 4, pp. 667–671, 2005.
[18] Ramírez-Gallego, S., Krawczyk, B., García, S., Woźniak, M., and Herrera, F., “A survey on data preprocessing for data stream mining: Current status and future directions,” Neurocomputing, 2017.
[19] Van Der Maaten, L., Postma, E., and den Herik, J., “Dimensionality reduction: a comparative review,” Journal of Machine Learning Research, vol. 10, pp. 66–71, 2009.
[20] Ang, J. C., Mirzal, A., Haron, H., and Hamed, H. N. A., “Supervised, Unsupervised, and Semi-Supervised Feature Selection: A Review on Gene Selection,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 13, no. 5, pp. 971–989, Sep. 2016.
[21] Chandrashekar, G. and Sahin, F., “A survey on feature selection methods,” Computers & Electrical Engineering, vol. 40, no. 1, pp. 16–28, 2014.
[22] Guyon, I., Weston, J., Barnhill, S., and Vapnik, V., “Gene selection for cancer classification using support vector machines,” Machine Learning, vol. 46, no. 1-3, pp. 389–422, 2002.
[23] Hall, M. A., “Correlation-based Feature Selection for Machine Learning,” Ph.D. dissertation, The University of Waikato, 1999.
[24] Kumar, A., Sung, M., Xu, J. J., and Wang, J., “Data streaming algorithms for efficient and accurate estimation of flow size distribution,” in ACM SIGMETRICS Performance Evaluation Review, vol. 32, no. 1. ACM, 2004, pp. 177–188.
[25] Ben-Haim, Y. and Tom-Tov, E., “A streaming parallel decision tree algorithm,” Journal of Machine Learning Research, vol. 11, no. Feb, pp. 849–872, 2010.
[26] Webb, G. I., “Contrary to popular belief incremental discretization can be sound, computationally efficient and extremely useful for streaming data,” in IEEE International Conference on Data Mining (ICDM). IEEE, 2014, pp. 1031–1036.
[27] Tavallaee, M., Bagheri, E., Lu, W., and Ghorbani, A. A., “A detailed analysis of the KDD CUP 99 data set,” in 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, Jul. 2009, pp. 1–6.
[28] Lobato, A., Andreoni Lopez, M., Sanz, I. J., Cárdenas, A., Duarte, O. C. M. B., and Pujolle, G., “An adaptive Real-Time architecture for Zero-Day threat detection,” in IEEE ICC 2018 Next Generation Networking and Internet Symposium (ICC’18 NGNI), Kansas City, USA, May 2018.
[29] Andreoni Lopez, M., Silva, R. S., Alvarenga, I. D., Rebello, G. A. F., Sanz, I. J., Lobato, A. G. P., Mattos, D. M. F., Duarte, O. C. M. B., and Pujolle, G., “Collecting and characterizing a real broadband access network traffic dataset,” in IEEE/IFIP 1st Cyber Security in Networking Conference (CSNet), Oct. 2017, pp. 1–8.
[30] Hu, H. and Kantardzic, M., “Smart preprocessing improves data stream mining,” in 49th Hawaii International Conference on System Sciences (HICSS). IEEE, 2016, pp. 1749–1757.
[31] Buczak, A. and Guven, E., “A Survey of Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection,” IEEE Communications Surveys & Tutorials, no. 99, pp. 1–26, 2015.
[32] Prasath, V. B. S., Alfeilat, H. A. A., Lasassmeh, O., and Hassanat, A. B. A., “Distance and Similarity Measures Effect on the Performance of K-Nearest Neighbor Classifier - A Review,” CoRR, vol. abs/1708.04321, 2017. [Online]. Available: http://arxiv.org/abs/1708.04321
[33] Zhang, T., “Solving large scale linear prediction problems using stochastic gradient descent algorithms,” in Proceedings of the Twenty-First International Conference on Machine Learning. ACM, 2004, p. 116.
[34] Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P., “SMOTE: synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[35] Perkins, S. and Theiler, J., “Online feature selection using grafting,” in Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 592–599.
[36] Zhou, J., Foster, D. P., Stine, R. A., and Ungar, L. H., “Streamwise feature selection,” Journal of Machine Learning Research, vol. 7, no. Sep, pp. 1861–1885, 2006.
[37] Wu, X., Yu, K., Ding, W., Wang, H., and Zhu, X., “Online feature selection with streaming features,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 5, pp. 1178–1192, 2013.
