Chapter 4
4.1 INTRODUCTION
Real-time intrusion data for various attacks was collected in a cloud environment using the Honeynet design discussed in Chapter 3. These datasets were subjected to further analysis for intrusion detection. The efficiency of an intrusion detection system depends, among other factors, on the selection of relevant features. In this research work, we study feature selection in the detection of network attacks for two purposes. Firstly, we study the main purpose of feature selection, which is reducing the number of features used in building the attack detection predictive models; our analysis provides a guideline for selecting a proper feature selection method for an intrusion detection task. Secondly, we investigate the use of feature selection to discover the important features in the detection of a specific attack. Such features provide more insight into the attack behaviour and how it can be distinguished from normal traffic, and this information can be used in the process of feature engineering for similar attacks [32].
In this research work, we used an ensemble of different filter feature selection methods to find the important features for the detection of attacks in intrusion datasets. We proposed a univariate ensemble feature selection technique for intrusion detection; this approach selects a valuable reduced feature set from the given intrusion datasets, exploiting the simplicity and speed of five univariate filter methods to identify the features that contribute towards intrusion detection.
Then, we applied four different classification algorithms along with the univariate filter feature selection methods to provide a good classification performance, which guided the selection of the valuable reduced feature set from the given intrusion datasets. These selected features were subjected to the classification algorithms discussed in the next chapter. A pair-wise T-test was performed, and the results obtained showed the proposed feature selection technique to be statistically significantly different from the existing approaches.
4.2 PROBLEM DESCRIPTION
The elimination of redundant features in high-dimensional intrusion datasets has become a major challenge for network intrusion detection. The focus of a particular feature selection method on one specific region of the feature space may not provide better performance, whereas using efficient feature selection methods in intrusion detection leads to better solutions for handling high-dimensional intrusion datasets. However, different feature selection algorithms may produce different feature subsets.
In order to solve this issue, ensemble learning has become an important part of most IDS research; this technique combines the independent feature subsets obtained by each method in order to get a robust feature subset. This study uses a univariate ensemble-based filter feature selection technique for intrusion detection, which selects a valuable reduced feature set from the given intrusion datasets. The outputs of the univariate filter feature selection techniques, namely Information Gain, Gain Ratio, Chi-squared, Symmetric Uncertainty and Relief, are combined to produce the final outcome.
In this study, two novel algorithms, namely Combined Feature Scoring (CFS) and Minimum Threshold Value Selection (MTVS), have been proposed. These two algorithms address the following issues: they reduce the problems of ranking without adopting any learning algorithm, thereby avoiding the computational overheads and statistical biases of existing methods, and they use the minimum threshold value selection to identify the useful features. In this investigation, the effect of ensemble feature selection on model accuracy was examined by looking deeply into the ensemble feature selection and by performing a comparative study against existing approaches.
Feature selection is essentially categorised into search and evaluation processes. The search process is applied to locate the features that are viable with respect to the target concept. The search method is further categorised as exhaustive/complete search, heuristic search and random search; one of these search techniques is used for searching the features in the feature space, and every search method has its own merits and demerits in the feature selection process. The search process comprises three phases, namely the starting phase, the generation of subsets and the stopping criterion of the search. The evaluation process has two phases: individual feature evaluation and feature subset evaluation, otherwise described as deterministic and nondeterministic evaluation of features. Figure 4.1 shows the four steps of the feature selection process.
Figure 4.1: Feature Selection Process.
A filter algorithm first ranks features based on some quality criterion; the features with the highest weights or ranks are then selected to induce classification. Filter-based feature ranking methods are further split into two subcategories: univariate and multivariate. Univariate filter methods evaluate the relevance of each feature independently of the others, while multivariate methods take feature dependencies into account while evaluating them. Univariate methods can easily be scaled to very high-dimensional datasets; they are known for their speed, computational simplicity and high performance compared with other approaches, while multivariate approaches are more complex, especially for datasets with a large number of features, as they consider all feature dependencies. In that sense, univariate methods are advantageous. The most widespread univariate filter methods are described as follows.
a) Information-gain
Information Gain (IG) evaluates a feature f_j against a training set S as:

$$ IG(S, f_j) = H(S) - \sum_{v \in \mathrm{values}(f_j)} \frac{|S_{f_j = v}|}{|S|}\, H(S_{f_j = v}) \qquad (4.1) $$

where |S_{f_j = v}| / |S| is the fraction of examples in which f_j takes the value v, and H(S) is the entropy given by:

$$ H(S) = -\sum_{k=1}^{K} p(c_k) \log_2 p(c_k) \qquad (4.2) $$

where p(c_k) is the probability of observing class c_k in the training set S and K is the number of classes. H(S_{f_j = v}) is calculated in the same way using only the subset of instances for which f_j has the value v. A feature is relevant if it has a high IG. Information Gain is the most popular filter-based feature selection technique; it is an information-theoretic measure, also called a symmetric measure, for feature selection. In dataset notation, it is defined by the following equation [26]:

$$ \mathrm{InfoGain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D) \qquad (4.3) $$

where InfoGain(A) is the IG of a feature A, Info(D) is the entropy of the whole dataset D, and Info_A(D) is the expected entropy of D after partitioning on attribute A.
b) Gain-ratio
The Gain Ratio filter feature selection method is a normalised variant of Information Gain that corrects the IG score's bias towards attributes with many values. The split information value is used to compute the ratio; the formula for split information is as follows [26]:

$$ \mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|} \qquad (4.4) $$

where v is the number of partitions of D induced by attribute A. The gain ratio formula is as follows [26]:

$$ \mathrm{GainRatio}(A) = \frac{\mathrm{InfoGain}(A)}{\mathrm{SplitInfo}_A(D)} \qquad (4.5) $$
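Building on the Information Gain sketch above, equations (4.4)-(4.5) only add the split information, which is simply the entropy of the feature's own value distribution; the sketch below reuses the entropy() and info_gain() helpers defined earlier.

```python
def gain_ratio(feature_values, labels):
    """GainRatio(A) = InfoGain(A) / SplitInfo_A(D), equations (4.4)-(4.5).

    SplitInfo_A(D) is the entropy of A's value distribution, so the
    entropy() helper from the Information Gain sketch is reused here.
    """
    split_info = entropy(feature_values)
    if split_info == 0:
        return 0.0  # a constant feature carries no information
    return info_gain(feature_values, labels) / split_info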
c) Chi-squared
The Chi-squared statistic is one of the most widespread statistical feature selection measures; it quantifies the relationship between two variables and can be used to assess the independence of a feature from its class. The Chi-squared statistic is defined as follows [26]:

$$ \chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \qquad (4.6) $$

where i and j index the values of the two variables, O_{ij} is the observed frequency and E_{ij} is the expected frequency. For a feature A and a class C_i it reduces to:

$$ \mathrm{CHI}(A, C_i) = \frac{N\,(F_1 F_4 - F_2 F_3)^2}{(F_1 + F_3)(F_2 + F_4)(F_1 + F_2)(F_3 + F_4)} \qquad (4.7) $$

where F_1, F_2, F_3 and F_4 are the co-occurrence frequencies of feature A and class C_i, and N is the total number of instances.
d) Symmetric Uncertainty
This method is an information-theoretic, symmetric measure for feature selection; it is used to evaluate the ranking of produced solutions. It is defined by the following formula [27]:

$$ SU(A, B) = \frac{2 \cdot IG(A \mid B)}{H(A) + H(B)} \qquad (4.9) $$

where IG(A | B) is the information gain of feature A with respect to feature B, and H(A) and H(B) denote the entropies of features A and B.
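Reusing the entropy() and info_gain() helpers from the Information Gain sketch, equation (4.9) can be written as a short function; note that SU normalises the information gain into the range [0, 1].

```python
def symmetric_uncertainty(a_values, b_values):
    """SU(A, B) = 2 * IG(A|B) / (H(A) + H(B)), as in equation (4.9)."""
    h_a, h_b = entropy(a_values), entropy(b_values)
    if h_a + h_b == 0:
        return 0.0  # two constant variables share no information
    return 2.0 * info_gain(a_values, b_values) / (h_a + h_b)
```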
e) Relief
Relief is an efficient filter-based feature selection method. It uses a heuristic to generate candidate feature subsets and a distance measure to evaluate them. The algorithm is feature-weight based and uses a statistical method to choose the relevant features: it correlates the features with a feature weight value, and can correctly estimate the quality of features in problems with strong dependencies between features. The key idea of the original RELIEF algorithm is to estimate the quality of features according to how well their values distinguish between instances that are near to each other, using the concepts of NearHit and NearMiss. Given a randomly selected instance R_i, Relief searches for its two nearest neighbours: one from the same class, called the NearHit (H), and one from the different class, called the NearMiss (M). The algorithm uses a function diff() to find the difference of the same feature between two records, and the original algorithm updates the weight of each attribute A over m sampled instances as:

$$ W[A] := W[A] - \frac{\mathrm{diff}(A, R_i, H)}{m} + \frac{\mathrm{diff}(A, R_i, M)}{m} $$

where W[A] is the quality estimate of attribute A, updated according to the values of R_i, M and H.
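The following is a simplified NumPy sketch of the original Relief weight update for numeric (float-valued) features over m randomly sampled instances; it assumes binary labels and a min-max scaled diff(), and is an illustration rather than the thesis implementation.

```python
import numpy as np

def relief(X, y, m=50, seed=0):
    """Original Relief: W[A] -= diff(A, Ri, H)/m and W[A] += diff(A, Ri, M)/m."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    diff = lambda a, b: np.abs(a - b) / span  # per-feature diff()
    w = np.zeros(d)
    for _ in range(m):
        i = rng.integers(n)
        dist = np.abs(X - X[i]).sum(axis=1)   # Manhattan distance to Ri
        dist[i] = np.inf                      # exclude Ri itself
        hit = np.argmin(np.where(y == y[i], dist, np.inf))   # NearHit H
        miss = np.argmin(np.where(y != y[i], dist, np.inf))  # NearMiss M
        w += -diff(X[i], X[hit]) / m + diff(X[i], X[miss]) / m
    return w
```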
4.5 UNIVARIATE ENSEMBLE BASED FILTER FEATURE SELECTION (UEFFS)
The outputs of the univariate filter feature selection techniques, namely Info Gain, Gain Ratio, Chi-squared, Symmetric Uncertainty and Relief, were combined to produce the final outcome. The proposed feature ranking methodology provides solutions for the following issues:
- using an efficient feature ranking algorithm to reduce the problems of ranking without adopting any learning algorithm and without computational overheads;
- using a minimum threshold value selection for identifying the important features to be retained.
In the proposed CFS algorithm, which is based on ensemble methods, instead of selecting one specific feature selection method and taking its outcome as the final subset, different models are combined using ensemble feature selection approaches. This not only improves the classification performance but also helps the classifiers obtain precise results during attack detection. As discussed above, various filter approaches have been proposed for selecting the most significant features to improve predictive performance. The proposed method uses the ranking approach, which is considered desirable because it is very fast and simple, and measuring the relevance of a feature subset is not time consuming. Hence, this method is very appropriate for choosing the significant features in intrusion datasets.
The proposed approach evaluates the significance of the features by their association with the class and ranks the independent features according to their degrees of weight. Features with the highest weights or ranks are then selected for inducing classification. In order to equalise the impact of dissimilar scales, the proposed approach rescales the values to an identical scale (i.e. the range from 0 to 1): the feature with the highest weight or rank is assigned a scaled rank of 1, whereas existing approaches assign rank 0 to the feature with the highest weight or rank. Following this, the scaled ranks are ordered in ascending order and combined. Finally, the proposed algorithm calculates the mean to determine the rank and significance of each feature.
Algorithm 1 first takes the intrusion datasets as input for the proposed CFS scheme and calculates the weights of the attributes; the following key steps describe the process of the algorithm. The algorithm uses univariate filter-based measures: the univariate scheme evaluates each feature independently of the others, while a multivariate scheme evaluates features in batches. The proposed CFS algorithm is presented as follows:
Algorithm 1: CFS-Combined feature scoring
Input: Input Datasets (Honeypot, NSL-KDD, Kyoto)
Output: FR - Feature Ranks
Step 1: Compute the number of features
        totFeatures ← totFeatures(data)
Step 2: Let n be the number of filter feature measures
        (FtrEv1, FtrEv2, FtrEv3, …, FtrEvn)
Step 3: Compute the feature ranks using the filter feature measures
        FRC1[] ← featureRanksCompute(data, FtrEv1) // FRC denotes computed ranks
        FRC2[] ← featureRanksCompute(data, FtrEv2)
        FRC3[] ← featureRanksCompute(data, FtrEv3)
        FRCn[] ← featureRanksCompute(data, FtrEvn)
Step 4: Feature Scaling Ranks (FSR) - invoke Algorithm 2 to scale the computed ranks
        scaledRanks1[] ← scaleRanks(FRC1)
        scaledRanks2[] ← scaleRanks(FRC2)
        scaledRanks3[] ← scaleRanks(FRC3)
        scaledRanksn[] ← scaleRanks(FRCn)
Step 5: Combine the computed scaled ranks
Step 6: SumofcombinedRanks ← 0
Step 7: combinedRanks ← []
Step 8: for i = 1 to totFeatures do
Step 9:     add all computed scaled ranks to obtain the combined rank of feature i
Step 10:    combinedRanks_i ← Σ_{j=1}^{n} scaledRanks_{ji}
Step 11:    SumofcombinedRanks ← SumofcombinedRanks + combinedRanks_i
Step 12: end for
Step 13: Sort the rank list in ascending order
Step 14: sortedRanks[] ← sort(combinedRanks)
Step 15: Compute the score, weight and priority of each feature
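To make the combination step concrete, the following is a minimal Python sketch of the CFS idea using scikit-learn, with only two filters (mutual information as an info-gain style score, and chi-squared) standing in for the five measures FtrEv1..FtrEvn; the function names and the toy data are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, chi2

def scale_scores(scores):
    """Rescale one filter's scores to [0, 1] (the role of Algorithm 2)."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)

def combined_feature_scoring(X, y, filter_fns):
    """Mean of the scaled scores of several univariate filters (Steps 3-15)."""
    scaled = [scale_scores(fn(X, y)) for fn in filter_fns]
    combined = np.mean(scaled, axis=0)
    return combined, np.argsort(combined)   # ascending rank list, as in Step 13

# Example usage on toy non-negative data (chi2 requires non-negative inputs).
rng = np.random.default_rng(0)
X = rng.random((100, 8))
y = rng.integers(0, 2, 100)
filters = [lambda X, y: mutual_info_classif(X, y, random_state=0),
           lambda X, y: chi2(X, y)[0]]
scores, ranking = combined_feature_scoring(X, y, filters)
print(np.round(scores, 3), ranking)
```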
The development of the minimum threshold value selection is described in Algorithm 3. The first step of this algorithm takes the three intrusion datasets (Honeypot, NSL-KDD, Kyoto) and four classifiers (naïve Bayes, logistic regression, decision tree, SVM) as input for evaluation. All of the datasets are then fed to the info-gain filter measure to compute the attribute ranks, and the attributes are arranged in ascending order based on their ranks, as shown in steps 3 and 4 of Algorithm 3. After that, each dataset is segregated into distinct chunks: the top-ranked 80% of the features are retained and the lowest-ranked 20% are discarded. Once the filtered datasets are generated, the next step is to feed each filtered dataset to the four classifiers of various types and different characteristics, as shown in steps 6 to 11 of Algorithm 3. In step 12, the predictive accuracies of those classifiers are computed using the 10-fold cross-validation approach. Finally, the minimum cut-off value is identified from the average predictive accuracies over each chunk of the dataset, as shown in steps 16 and 20. Figure 4.5 illustrates the process of the minimum threshold value selection.
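As an illustration of the process described above, the sketch below ranks features with a mutual-information (info-gain style) measure, keeps the top-ranked 80%, and averages 10-fold cross-validation accuracy over the four classifier families named in the text; dataset loading and all other details are assumptions.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

def average_accuracy_for_chunk(X, y, keep_fraction=0.8):
    """Keep the top-ranked features, then average 10-fold CV accuracy."""
    scores = mutual_info_classif(X, y, random_state=0)  # info-gain style ranking
    k = max(1, int(keep_fraction * X.shape[1]))
    top = np.argsort(scores)[::-1][:k]                  # retain top-ranked 80%
    X_kept = X[:, top]
    clfs = [GaussianNB(), LogisticRegression(max_iter=1000),
            DecisionTreeClassifier(random_state=0), SVC()]
    accs = [cross_val_score(c, X_kept, y, cv=10).mean() for c in clfs]
    return float(np.mean(accs))                         # average predictive accuracy
```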
Feature selection results are interpretable in many applications, and selection methods leading to meaningful results should be preferred over those that do not. However, in some cases the interpretability of the selected features requires deep knowledge of the application field that a computer scientist may not have, making this evaluation far from obvious. For this reason, the following four performance metrics were used: accuracy, precision, recall and F-measure. In this research work, the focus is on the applicability of the evaluation methods rather than on the interpretability of results. However, the evaluation of classification accuracy relies on a predictive algorithm; hence, predictive models were used, and several metrics for assessing classification performance are described below.
Data classification is categorised under supervised learning, where the objective is the prediction of a class membership value, also called the class label, of unknown observations or samples using training data for which the class labels are known. Each observation in the training or test data is represented by an associated feature vector. The process of data classification generally involves two steps: training of a classifier and testing of the trained classifier. Many different classifiers can be applied in different applications; the details of these classification algorithms were discussed in section 1.6.2.
The results of any classification can be summarised with the help of a confusion matrix. A confusion matrix shows the details of the actual and predicted results given by the classifier. This representation is made by comparing true and predicted labels in a matrix whose rows and columns correspond to the true and predicted classes, respectively. TP and TN represent correct decisions made by the classifier, while FN and FP are classification errors. Given a set of known labels for the positive and negative classes, any input sample may be classified into the positive or negative class. According to the result, each classification is counted as a True Positive (TP), True Negative (TN), False Positive (FP) or False Negative (FN).
Accuracy: The most used performance criterion is the correct classification rate, known as
accuracy. This measure ranges from 0, with perfect misclassification, to 1 when the classifier
perfectly classifies the testing data. It is the most common and simplest measure to evaluate a
classifier. It specifies the overall effectiveness of the classifier. It is calculated as follows:
$$ \mathrm{Accuracy} = \frac{TP + TN}{P + N} \qquad (4.16) $$
Recall: Recall, also called the true positive rate, is the proportion of positive cases that were correctly identified. It is calculated as follows:

$$ \mathrm{Recall} = \frac{TP}{P} \qquad (4.17) $$
Precision: Precision is the proportion of predicted positive cases that are actually positive. It is calculated as follows:

$$ \mathrm{Precision} = \frac{TP}{TP + FP} \qquad (4.18) $$
F-measure: The F-measure is used to check test accuracy using the precision p and recall r of the test; it is defined as the harmonic mean of precision and recall. The maximum value of the F score is 1 and the minimum value is 0. It is calculated as:

$$ F\text{-measure} = \frac{2}{1/\mathrm{precision} + 1/\mathrm{recall}} \qquad (4.19) $$
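As a quick worked example of equations (4.16)-(4.19), with hypothetical confusion-matrix counts:

```python
TP, TN, FP, FN = 90, 80, 10, 20               # hypothetical counts
P, N = TP + FN, TN + FP                       # actual positives and negatives

accuracy  = (TP + TN) / (P + N)               # (4.16) -> 0.85
recall    = TP / P                            # (4.17) -> ~0.818
precision = TP / (TP + FP)                    # (4.18) -> 0.9
f_measure = 2 / (1 / precision + 1 / recall)  # (4.19) -> ~0.857
print(accuracy, recall, precision, round(f_measure, 3))
```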
4.7 EVALUATION OF THE UEFFS METHOD
The proposed UEFFS method was evaluated on three intrusion datasets, namely NSL-KDD, Kyoto and Honeypot, and was also compared with existing methods in order to demonstrate the effectiveness of the predicted informative feature subsets. This study used the following four performance measures: predictive accuracy, precision, recall and F-measure. To gain a better understanding, the investigation was conducted as two studies, on text and non-text datasets, to check the effectiveness of the proposed methodology.
The proposed method was implemented on a MacBook Pro running macOS High Sierra version 10.13.6 with an Intel Core i5 processor (3.10 GHz) and 8 GB RAM. In this work, three datasets of varying complexity were chosen, namely NSL-KDD, Kyoto and Honeypot. A pair-wise T-test with a 5% significance level was performed to investigate the statistical significance of the classification accuracy results. The p-values were then calculated (i.e. p < 0.05) for all three intrusion datasets and compared across the five univariate FS techniques; the results obtained showed the proposed feature selection technique to be statistically significantly different from the existing approaches.
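A paired t-test of this kind can be reproduced with scipy, for example on per-fold accuracies from 10-fold cross-validation; the accuracy values below are hypothetical placeholders, not the thesis results.

```python
from scipy import stats

# Hypothetical per-fold accuracies for the proposed and an existing FS method.
ueffs_acc    = [0.95, 0.96, 0.94, 0.95, 0.96, 0.95, 0.94, 0.96, 0.95, 0.95]
existing_acc = [0.93, 0.95, 0.91, 0.94, 0.93, 0.94, 0.92, 0.93, 0.94, 0.92]

t_stat, p_value = stats.ttest_rel(ueffs_acc, existing_acc)  # paired T-test
print(f"t = {t_stat:.2f}, p = {p_value:.4g}, significant: {p_value < 0.05}")
```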
The descriptions of all the final labelled feature datasets, identified at both packet and flow levels, are given in Table 4.1 and Table 4.2.
Table 4.1: Features identified for packet level data
Table 4.2: Features identified for flow level data
The experiment empirically evaluated the effectiveness of the proposed univariate ensemble-based feature ranking methodology for selecting a valuable reduced feature set on three intrusion datasets (Honeypot, NSL-KDD, Kyoto) and four classifiers (naïve Bayes, logistic regression, decision tree, SVM). The proposed methodology was compared with five univariate filter measures in terms of four performance metrics, and offered competitive results compared to the earlier FS measures. As discussed above, two studies were carried out to ensure the effectiveness of the proposed methodology.
The results achieved after applying the three intrusion datasets (Honeypot, NSL-KDD, Kyoto) and the four classifiers (naïve Bayes, logistic regression, decision tree, SVM) are summarised in Tables 4.3, 4.4 and 4.5; the proposed methodology shows significantly better results compared to the existing FS measures. In the first study, the comparison of precision with the existing FS measures is recorded in Table 4.3.
Table 4.3 Comparison of precision with FS measures.
IG: Information Gain, GR: Gain ratio, CS: Chi-squared, SU: Symmetrical Uncertainty, R:
Relief
Similarly, Figure 4.4 depicts the comparison of the average classifier precision with the FS measures, so that the results obtained may be better visualised and compared with the existing approaches.
Figure 4.4: Comparison of average classifier precision of the FS measures (IG, GR, CS, SU, R, UEFFS) on the Honeypot, NSL-KDD and Kyoto datasets.
Similarly, Figure 4.5 depicts the comparison of recall with the FS measures.
Figure 4.5: Comparison of recall of the FS measures (IG, GR, CS, SU, R, UEFFS) on the Honeypot, NSL-KDD and Kyoto datasets.
Figure 4.6: Comparison of predictive accuracy (%) of the FS measures (IG, GR, CS, SU, R, UEFFS) on the Honeypot, NSL-KDD and Kyoto datasets.
The results obtained after applying the proposed method on the three intrusion datasets (Honeypot, NSL-KDD, Kyoto), four classifiers and five feature selection measures, in terms of accuracy, TP and FP, are recorded in Table 4.6, which shows that the proposed method evaluated on the Honeypot dataset yields an improved outcome compared to the other two datasets (NSL-KDD and Kyoto).
Table 4.6 (UEFFS row): Honeypot: TP 0.953, FP 0.12, accuracy 95.39%; NSL-KDD: TP 0.948, FP 0.12, accuracy 94.12%; Kyoto: TP 0.942, FP 0.13, accuracy 92.12%.
Figure 4.7 depicts the performance of the classification based on the three intrusion datasets.
Figure 4.7: Classification performance (accuracy, TP and FP) of the FS measures (IG, GR, CS, SU, R, UEFFS) on the three intrusion datasets.
Table 4.7: Predictive accuracy comparison of the proposed method with existing FS
measures.
In the second study, the results were achieved following evaluation on text datasets, namely Spam Assassin, MiniNewsGroups and Course-Cotrain. The proposed method was compared with the existing feature measures in terms of classification predictive accuracy as well as F-measure; Figure 4.8 shows that the proposed method achieves better performance than the existing methods.
Figure 4.8: Comparison of F-measure of the FS measures on the Spam Assassin, MiniNewsGroups and Course-Cotrain text datasets.
4.8 CONCLUSION
This research work proposed an ensemble-based univariate filter feature selection methodology for the choice of informative features from given intrusion datasets. Two novel algorithms, namely the combined feature scoring algorithm and the minimum threshold value selection, have been proposed. The results obtained from these algorithms show the UEFFS technique achieving better classification accuracy, F-measure and effectiveness than any single feature selection method. The method was evaluated on three intrusion datasets, four classifiers and five univariate feature selection measures, and was also compared with the existing methods. A pair-wise T-test was performed, and the results obtained showed the proposed feature selection technique to be statistically significantly different from the existing approaches. The experiment and its results provided the informative features with higher accuracy by removing insignificant and redundant features. Hence, the conclusion is that the proposed method has the potential to improve the accuracy and robustness of various classification tasks, confirming that feature selection is a key step not only in intrusion detection systems but also in many other applications.