Professional Documents
Culture Documents
Article Vivien Armel Eyangolo
Article Vivien Armel Eyangolo
Article Vivien Armel Eyangolo
models from a training sample. In this article we always verified. And in this case, the classifier
propose a new approach, which consists of controlling cannot accurately detect the false positive and false
the sensitivity of support vector machine using the negative.
perturbation method.
In a two-class classification
Keywords—component; Artificial learning, Predictive problem, when the training data of the majority class
model, Support vector machine, Imbalanced classes,
are much greater in number than those of the
Sensitivity and Specificity.
minority class, the algorithms allowing to obtain a
minimum error rate will always tend to neglect the
class minority, because of this disproportion. Which
2
justifies the great connection between the two forms - Section 3 is the
of asymmetry in supervised learning. Indeed, evaluation of our model.
asymmetry comes in two main forms: class
imbalance and cost asymmetry. Class imbalance
concerns problems where one of the modalities of
the target variable is much less represented than the
others, which disrupts the learning algorithms. Cost
asymmetry concerns cases where the costs of errors
are not symmetrical. In this article, it is the
asymmetry of classes that interests us. To deal with
this problem, several algorithms have been
developed, for example sampling, under sampling.
In this article, we propose a new approach, which is
an algorithmic approach which consists of
perturbing the machine, so as to take into account the
minority class.
Several definitions of the term “cybersecurity” have For its part, the risk analysis method makes it
been established at the national and international possible to resolve the problem from above by
levels. For the purposes of this document, studying the threats and terrible events which
“cybersecurity” means all tools, policies, guidelines, particularly affect the IS studied, in order to better
risk management methods, actions, training, best adapt to the security measures in place. Therefore,
practices, safeguards and technologies that can be risks can have various origins or properties, but it is
used to protect the availability, integration and obvious that what people can now call cyber risks is
confidentiality of assets in connected infrastructures a risk that puts pressure on countries, organizations
of government, private organizations and users. and individuals. Risk analysis methods are
These assets include connected computing devices, particularly suited to complex systems subject to
personnel, infrastructure, applications, services, major threats. However, it must rely on a relatively
telecommunications systems and data in the cyber diversified and specific skills base, which will allow
environment. the implementation of appropriate methodological
tools. Depending on the objectives set for the
Before going any further, this is probably where we exercise, it may require a substantial investment of
need to try to clarify the use of several expressions resources.
and terms commonly used almost interchangeably,
such as: These two methods complement each other:
depending on the issues of the IS studied, the
information system security (ISS) which compliance method will constitute a basic
can be understood as a set of measures foundation which can effectively complement the
implemented to achieve and maintain the
risk analysis work, and adapt more precisely and
cybersecurity state of an information
specifically to the IS concerned and its context. The
system (IS);
cybersecurity which is therefore the business processes in which he participates and the
desirable state to achieve; risk scenarios that put him under pressure.
cyber defense which also includes Cyber risk, the threat of exploiting one or more
measures implemented to maintain a state
vulnerabilities in a digital system, is itself a risk. It
of cybersecurity, but in the face of
has strategic importance, and its achievements can
particularly marked adversity and within a
very specific time frame. be fatal and dizzying for an organization. For all
Therefore, the protection of the Information System these reasons, if the notion of cyber risk historically
(IS) aims to identify, analyze and evaluate the risks refers to a very technical area, it is now obvious that
that affect these assets (which can be hardware, it will inevitably extend to areas that matter to
software, business processes), and to take the managers and decision-makers, and will come more
necessary measures to control these risks. In order to from all areas. At the functional level, any profession
properly protect itself from threats from online that uses digital technology to support its value chain
sources, a country or organization can take certain or produces digital equipment or services will be
security measures. In this area, you have the choice affected by cyber risks. Digital security is, as we
between several levers. However, the often say, everyone’s business. This awareness and
implementation of such measures must be done in an this collective and holistic involvement in the search
informed and reasonable manner, compatible with for the state of cybersecurity is not trivial.
the threat of pressure on the elements intended to be Taking preventive measures is obviously an
protected and the value attributed to them by their important aspect of ensuring the security of the
owners. system network. However, this is not unique. Once
The first basic approach consists of setting up a the protection elements are deployed, it is also useful
measurement base based on a priori benchmarks to dedicate resources to the monitoring system in
applicable to the organization's environment. These order to detect possible security incidents that may
standards can be associated with professional or arise. For this reason, security monitoring
activity-specific regulations. They can also be very capabilities rely at least on software or hardware
basic lists of indicators. This approach focuses tools and data to guide them properly. In order to
resources on implementing security measures, detect non-trivial attacks, manual analysis is almost
instead of defining a list of measures to be essential. When capabilities are implemented as
implemented. It is not very personal, but is designed services for the entire organization or a group of
4
𝜕𝐿
Assume that the data is nonlinearly =0
{𝜕𝜔
𝜕𝐿
separable. The determination of the decision =0
𝜕𝑏
function first involves a transformation of the data
space into another characteristic () or representation We obtain the result:
space, possibly of high dimension, where the data
w = ∑𝑀 𝑖=1 αi 𝑦𝑖 ф (𝑥𝑖 )
becomes linearly separable.ℱ { 𝑀 (3)
∑𝑖=1 αi 𝑦𝑖 = 0
For certain characteristic spaces and 𝜉𝑖 Indicates to what extent the example is on
associated applications, the scalar products are easily the wrong side or not.𝑥𝑖
calculated using specific functions, called kernel
If =0 then is well classified𝜉𝑖 𝑥𝑖
functions such as:ф
If ≥1 then is misclassified𝜉𝑖 𝑥𝑖
𝐾(𝑥𝑖 , 𝑥𝑗 ) = ⟨ф (𝑥𝑖 ), ф (𝑥𝑗 )⟩
(5) Thus, in general, the optimization problem of
Support Vector Machines can be written as follows
The interest of these kernel functions is to
[4,6,23]:
make it possible to calculate scalar products without
1
2
having to explicitly transform the data by the min ‖w‖ + 𝐶 ∑𝑀
𝑖=1 𝜉𝑖
2
function, therefore, without necessarily knowing this 𝑆/𝐶 (7)
∀i, yi (w. ф (𝑥𝑖 ) + b) ≥ 1 − 𝜉𝑖
function.ℱ ф ф
{ 𝜉𝑖 ≥ 0
By integrating equation (1) into (2), we obtain the
The dual of the general optimization problem to be
following dual problem:
solved will therefore be:
1
max ∑𝑀
𝑖=1 𝛼𝑖 − ∑𝑀 𝑀
𝑖=1 ∑𝑗=1 𝛼𝑖 𝛼𝑗 𝑦𝑖 𝑦𝑗 𝐾(𝑥𝑖 , 𝑥𝑗 ) 1
2 max ∑𝑀
𝑖=1 𝛼𝑖 − ∑𝑀 𝑀
𝑖=1 ∑𝑗=1 𝛼𝑖 𝛼𝑗 𝑦𝑖 𝑦𝑗 𝐾(𝑥𝑖 , 𝑥𝑗 )
2
𝑆/𝐶
𝑆/𝐶
∀i, ∑𝑀𝑖=1 𝛼𝑖 𝑦𝑖 = 0 𝑀
∑
∀i, 𝑖=1 𝛼𝑖 𝑦𝑖 = 0
{ 𝛼𝑖 ≥ 0
{ 0 ≤ 𝛼𝑖 ≤ 𝐶
(6)
(8)
Theorem 1[8]:
= 𝑠𝑔𝑛 (∑𝑀 ∗
𝑖=1 𝛼 i 𝑦𝑖 ⟨ф (𝑥𝑖 ), ф (𝑥 )⟩ + 𝑏)
A function is a valid kernel if it is symmetric and
= 𝑠𝑔𝑛 (∑𝑀 ∗
𝑖=1 𝛼 i 𝑦𝑖 𝐾(𝑥𝑖 , 𝑥) + 𝑏) positive definite.𝒌𝑿 × 𝑿 → ℝ
In other words, a function is kernel if and only if:𝑘
- 𝐾(𝑖, 𝑗) = 𝐾(𝑗, 𝑖) ;
A- Soft margin [4,8,23,25]
- ∑𝑀 𝑀
𝑖=1 ∑𝑗=1 𝑐𝑖 𝑐𝑗 𝑘(𝑥𝑖 , 𝑥𝑗 ) ≥ 0,
Polynomial: or𝐾(𝑥𝑖 , 𝑥𝑗 ) =
The probability that a positive example is classified
(𝑥𝑖 . 𝑥𝑗 )𝑑 (𝑥𝑖 . 𝑥𝑗 + 𝑐)𝑑 as positive by the model is what we call the
sensitivity rate:
RBF (Radial Basic Function): 𝐾(𝑥𝑖 , 𝑥𝑗 ) =
2
𝑒 −𝛾‖𝑥𝑖 −𝑥𝑗 ‖ , 𝛾 > 0 𝑉𝑃
𝑠𝑒𝑛 =
𝑉𝑃 + 𝐹𝑁
III. UNBALANCED CLASSES AND SVM
[20,24] 2- Specificity
𝑁 𝑁
‖𝜔‖2 – Calculation𝐿𝑝
𝐿𝑝 = + 𝐶 + ∑ 𝜉𝑖 + 𝐶 − ∑ 𝜉𝑖
2
𝑖|𝑦𝑖 =1 𝑖|𝑦𝑖 =−1 - Determine𝛼𝑖
𝑁
Either𝜖 + =
1
, 𝜖+ =
1 on imbalanced data.
4𝑐 + 4𝑐 −
The components of the diagonal of the kernel Agency) and AFRL (Air Force Research
matrices are fixed as follows: Laboratory) for the evaluation of research on
intrusion detection. The KDD data is obtained from
raw data from tcpdump, a command-line packet
𝑘(𝑥𝑖 , 𝑥𝑖 ) + 𝜖 + 𝑝𝑜𝑢𝑟 𝑦𝑖 = +1
𝑘(𝑥𝑖 , 𝑥𝑖 ) = {
𝑘(𝑥𝑖 , 𝑥𝑖 ) + 𝜖 − 𝑝𝑜𝑢𝑟 𝑦𝑖 = −1 analyzer, based on simulations on the United States
R2L, U2R, Probing examples belonging to the same 17 num_fil_creation Number of file
creation
class (attack). Which means that we only have two
operations
classes: normal and intrusion. 18 num_shells Number of
command
prompts or
shells requested
TABLE III. Attributes characterizing records 19 num_access_files Number of
[kdd.ics.uci.edu/databases/kddcup99/kddcup99.html] configuration
file access
No. Attribute Description operations
1 Duration Connection 20 num_out_bound_cmds Number of
duration in commands
seconds outside the ftp
2 Protocol Protocol type session
3 Service Network service 21 is_host_login 1 If access
for the belongs to the
destination hot list and 0
4 Flag Connection otherwise
status (normal, 22 is_guess_login 1 if invited 0
error otherwise
5 src_bytes Data size in 23 Count Number of
bytes from connections to
source to the same host as
destination the current
6 dst_bytes Data size in connection two
bytes from seconds ago
destination to 24 srv_count Number of
source connections to
7 land 1 if the the same
connection uses service as the
the same port at current service
the source as at two seconds
the destination ago
8 wrong_fragment Wrong number 25 serror_rate % of
of fragments connections
9 Urgent Number of with SYN error
urgent packets 26 srv_serror_rate % of
10 Hot Number of hot connections
indicators with SYN error
11 num_failed_logins Number of 27 rerror_rate % of
incorrect access connections
attempts with REJ error
12 logged in 1 if accessed 28 srv_rerror_rate % of
successfully and connections
0 otherwise with REJ error
13 num_compromised Number of 29 same_srv_rate % of
compromised connections to
conditions the same
14 root_shell 1 if root shell service
obtained and 0 30 diff_srv_rate % of
otherwise connections to
15 su_attempted 1 if attempting a different
"su root" services
command and 0
otherwise
16 num_root Number of root
accesses
10
We implemented our algorithm ((9), (10)) in python We note that, faced with unbalanced data, classical
and found the following results by considering algorithms provide very poor results. We must
1830 example datasets.
therefore take this imbalance into account as we did
in section 2, to improve the performance of the
TABLE IV. Implementation on the algorithm
model. Our proposed approach gives better results.
Attack Normal Total
Attack 57 243 300
Normal 3 1527 1530
Total 60 1770 1830
CONCLUSION
These are the confusion matrices for the classic case Most real data is often unbalanced, and this poses a
and the unbalanced case
performance problem for the classifier which is more
Classic case: likely in this case to classify individuals from the
Out of 300 attacks the machine predicted 57 as attack positive (minority) class as negative, which
and 243 as normal
constitutes a great danger. In this article, we solved
Out of 1530 normal, the machine predicted 3 as this problem of imbalance of training games. We
attack and 1527 as normal
took the case of intrusion detection using the
Machine Learning approach, we used the SVM with
TABLE V. Implementation on the algorithm an algorithmic modification, and the KDD99 cup
Attack Normal Total data. We found very satisfactory results. As future
Attack 218 82 300
prospects we propose the use of ensemble methods
Normal 1 1529 1530
Total 219 1611 1830 which involve several classifiers and then combine
them by majority vote. This way of doing things can
On the other hand, in the table just above, this is the also lead to good results since we use several experts.
case for unbalanced classes and the results are
improved.
23. VAPNIK, VN: The Nature of Statistical Engineering Mathematics, Bristol Universit
Learning Theory, Springer, New York, y, Bristol BS8 1TR, United King, 1999
1995, 188p. 25. WANG, L.: Support Vector Machines:
24. VEROPOULOS, C. CAMPBELL, N. Theory and Applications, Springer, New
CRISTIANINI, Controlling the Sensitivit y York, 2005, 431 p
of Supp ort Vector Machines Department of