Operational Constraints (Op-Ct) Compatibility With The Machine Learning For An APT Detection

Article · November 2021

Operational constraints (Op-Ct) compatibility with the Machine
Learning for an APT detection

Mourad H. Henchiri          Abdullah Al Aamri          Said Al Rashdi


KICT, IIUM, Malaysia KICT, IIUM, Malaysia KICT, IIUM, Malaysia
mourad@unizwa.edu.om bustan2005@yahoo.com saidrashdi@unizwa.edu.om

Abstract- Whether network based or host based, intrusion detection systems, traditionally based on signatures, have not escaped the recent appeal of machine learning techniques. While the results presented in academic research articles are often excellent, security experts still have many reservations about the use of machine learning in intrusion detection systems. They generally fear that these techniques do not fit Op-Ct (operational constraints), in particular because of the level of expertise required or a large number of false positives. In this research, we show that rules generated by refined, recursive machine learning can be fully compatible with the operational constraints of detection systems. We tackle the case of how to build a detection model and present good practices to validate it before it goes into production. The methodology is illustrated by a case study on detecting malicious PDF files.

Key words: APT, Op-Ct, ML, ML rules, detection systems.

I. INTRODUCTION

Operational systems working upon intrusion detection have widely and traditionally relied on signatures generated manually by security experts. However, Big Data, artificial intelligence, Machine Learning and Deep Learning are often presented as technologies that can revolutionize intrusion detection systems [18, 23]. These methods induce detection rules automatically from data, and their generalization capacities allow them to detect malicious events as yet unknown. Numerous research papers on the application of machine learning to intrusion detection have been published and often show exceptional results (detection of malicious PDFs [6, 12, 20, 21], malicious executable files [11], or botnets [2, 3]). However, from an operational point of view, there are still a lot of reservations about using machine learning in production:
- Real time: can a flow filter based on machine learning methods perform its processing in real time?
- Is the false positive rate of machine learning methods, often presumed to be too high, acceptable for them to be put into production?
- Production: machine learning is not the core business of a security expert, yet it is up to him to establish a detection system. How can he trust these methods enough to put them into production?
- Are the alerts generated by such a detection system sufficiently interpretable to allow their use in production?

In this research, answers are provided with solutions so that machine learning methods are not incompatible with operational constraints (Op-Ct), and so that they can be integrated into detection systems alongside other methods, like signatures.

II. SUPERVISED ANOMALY DETECTION

The role of an intrusion detection system is to detect malicious events through the network and system activity it analyzes. A suspicious event could be, for example, the retrieval of a malicious file attached to an email or a visit to a corrupted website. The administrator of the detection system is responsible, in particular, for setting up detection methods and for making them evolve over time. The security operator analyzes and qualifies the alerts in order to allow the necessary measures to be taken to deal with any security incidents. This work can be costly.
Intrusion detection systems are traditionally based on signatures: detection rules built by an expert following an in-depth analysis of malicious events. This approach is effective against threats that have already been observed and for which a signature has been generated, but it is often ineffective at detecting new threats. In addition, simple variations in the threat, such as polymorphism [5], may be enough to render the signature ineffective. Signatures remain ubiquitous in current detection systems, but machine learning detection methods are being considered in addition, to better detect new threats. In this section, we present the two main categories of machine learning methods that can complement the signature approach: anomaly detection and supervised learning.
Anomaly detection is the first machine learning method applied to intrusion detection [7]. This approach requires only non-malicious data to build the detection model. The model will then generate an alert as soon as an event differs too much from the normal behavior induced by the benign data provided initially.
Anomaly detection methods are very attractive because they can detect unknown threats. They have no prejudices about what a malicious event is and are therefore prone to detect new threats. In addition, putting them into production is often presented as very simple: all you need is a benign dataset devoid of malicious activity. Obtaining such a dataset is not easy in practice, however, as there is no easy way to ensure that there is no malicious activity. If the supposedly healthy dataset contains malicious activity, it can distort the learning of the model and prevent detection of certain threats. These detection systems are simple in principle, but rarely in practice, and putting them into production can be very complex.
Additionally, anomaly detection methods raise alerts for abnormal events that are not necessarily malicious. For example, an abnormal transmit/receive ratio over HTTPS may be a sign of data exfiltration, but may also be caused by the use of certain social networks; popular websites can be the source of seemingly unusually large data exchanges that are not necessarily malicious; and simple configuration errors can also lead to behaviors that trigger false alerts. Thus, these detection methods often suffer from a high false positive rate.

III. SUPERVISED LEARNING

Supervised learning addresses with high proficiency this need for the integration of expert knowledge. Indeed, a supervised detection model is built from labeled data provided by the expert: benign events, but also malicious events to guide the detection model. The learning algorithm will automatically look for the points making it possible to characterize each of the classes, or to discriminate them, in order to build the detection model. Once the detection model is learned on a training dataset, it can be applied automatically to detect malicious events.
Through supervised learning, the security operator supervising the detection system can easily participate in improving the detection model from the alerts he analyzes. Indeed, false alerts can be reinjected to correct the detection model and thus avoid generating the same false alerts in the future. The real alerts can also be fed back into the model to let it follow the evolution of the threat. Thus, security experts do not give control of the detection system to an automatic model, but actively supervise it to improve its performance over time [16].

IV. SOCIAL ENGINEERING FOR AN APT INJECTION

Through the case study presented here, we introduce the problem of detecting malicious PDF files based on machine learning. The rest of the article will use this case study to illustrate best practices and highlight pitfalls to avoid when using machine learning to build a detection model.
The starting point is that the PDF format is an open document description format, created by Adobe
Company in 1993, aimed at preserving formatting regardless of the reader software or operating system used. It consists, among other things, of a number of metadata fields, such as the author and the date of production, as well as objects of different types referenced in a table called Xref. These objects can in particular be text, images, video or even JavaScript code. Its richness and the availability of readers on different platforms make it a widely used format in most organizations for creating and exchanging electronic documents. On the other hand, the volume of the associated specifications (more than 1,300 pages available publicly) implies a significant software complexity, amplified by dependencies on many third-party libraries. As a result, this software is often prone to vulnerabilities, which makes the PDF format all the more attractive to attackers.
Among the elements that we can look for to detect malicious PDF files, we can note:
- typical characteristics linked to the triggering of a vulnerability (use of the OpenAction function, JavaScript code, etc.);
- the presence of a malicious payload (a shellcode, etc.);
- functions to deceive detection (obfuscation by encryption, multiple encodings, concealment of objects, etc.);
- the more or less realistic nature of the files (malformations, low number of pages and/or objects, etc.).

V. BENIGN PREDICTION AND PERFORMANCE ESTIMATION

Two phases are suggested: learning and prediction.
Supervised learning can be used for intrusion detection via a binary classifier. The classifier takes as input an instance, a PDF file for example, and returns as output the predicted label, benign or malicious. This classifier can also be called a detection model in the context of intrusion detection.
Supervised learning has two main stages:
1. learning the classifier from labeled data;
2. use of the classifier to detect malicious instances.
In the first step, the classifier is built from labeled data, that is, a set of malicious and benign PDF files with known labels. This labeled data used to train the model is called training data. The classifier is created by a machine learning algorithm looking for the commonalities of instances sharing the same label, and the discriminating points of instances of different labels.
Once the classifier is trained from the training data, it can be used to predict the label of a PDF file. In practice, most classifiers do not just predict a binary value (benign vs malicious), but rather a probability of maliciousness. An alert is then generated only if the
probability of malicious activity is greater than the detection threshold set by the administrator of the detection system. For example, an alert may be raised for the considered PDF file only if its predicted probability of maliciousness exceeds a detection threshold of 75%. The probability of malicious activity predicted by the detection model makes it possible to rank the alerts according to the confidence of the model, and therefore to define the priority of the alerts to be processed by the operator supervising the detection system.

Performance estimators
A detection model is not perfect: it can make prediction errors. It is essential to validate it, that is to say to measure the relevance of the alerts generated, before putting it into production.
The best known performance estimator is the classification error rate, which is equal to the percentage of misclassified instances. However, in the case of intrusion detection the data is generally very asymmetric (with a low proportion of malicious instances), and the error rate is not able to correctly estimate the performance of a classifier in this situation. Here is an example showing the limits of the classification error rate. We consider 100 instances: 2 malicious and 98 benign. In this situation, a naive detection model that always predicts benign will have a classification error rate of only 2%, while it is not able to detect any malicious instance.
In order to correctly analyze the performance of a detection model, the first step consists in writing the confusion matrix, which takes into account the two types of possible errors: false positives, that is to say false alerts raised for benign instances, and false negatives, i.e. undetected malicious instances. The following figure explains the contents of a confusion matrix.

                    Predicted label
Real label      malicious     benign
malicious          TP           FN
benign             FP           TN

True:     the prediction is correct (predicted label = true label)
False:    the prediction is incorrect (predicted label ≠ true label)
Positive: the prediction is malicious
Negative: the prediction is benign

Figure: Confusion Matrix
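The 2%-error example above can be checked with a short pure-Python sketch; the helper function and the 0/1 label encoding are our own illustrative choices, not the paper's:

```python
def confusion_matrix(y_true, y_pred):
    """Count TP, FN, FP, TN for binary labels (1 = malicious, 0 = benign)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

# 100 instances: 2 malicious and 98 benign
y_true = [1] * 2 + [0] * 98
# A naive detection model that always predicts benign
y_pred = [0] * 100

tp, fn, fp, tn = confusion_matrix(y_true, y_pred)
error_rate = (fp + fn) / len(y_true)
print(tp, fn, fp, tn, error_rate)  # 0 2 0 98 0.02
```

The error rate is indeed only 2%, even though both malicious instances are missed (TP = 0, FN = 2), which is exactly why the confusion matrix, rather than the raw error rate, is needed.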

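The thresholding and alert-ranking logic described in Section V can be sketched in a few lines as well; the file names, probabilities and the 75% threshold below are illustrative values only, not the authors' system:

```python
# Hypothetical maliciousness probabilities predicted by a detection model
scores = {"a.pdf": 0.97, "b.pdf": 0.40, "c.pdf": 0.81, "d.pdf": 0.02}
THRESHOLD = 0.75  # detection threshold set by the administrator

# Raise an alert only when the probability exceeds the threshold, and rank
# the alerts by model confidence to define the operator's priorities.
alerts = sorted(
    ((name, p) for name, p in scores.items() if p > THRESHOLD),
    key=lambda item: item[1],
    reverse=True,
)
print(alerts)  # [('a.pdf', 0.97), ('c.pdf', 0.81)]
```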

Generic method
Our supervised learning presentation was based on the example of detecting malicious PDF files, but the instance can also represent a DOC file, the traffic associated with an IP address, or a web page. Machine learning algorithms do not take raw instances as input, but a representation of them as vectors of fixed-size numerical attributes. With this representation of instances, machine learning algorithms are generic and can easily be applied to various intrusion detection problems. The attribute extraction step, on the other hand, is specific to each detection problem.

VI. METHODOLOGY AND DETECTION MODEL CONSTRUCTION

The first step before building a detection model with machine learning is to define the target, that is, what you want to detect. This preliminary step has been taken for the malicious PDF file detection issue in the previous section.
Then, to create a detection model you must:
- collect learning data containing benign and malicious instances corresponding to the target;
- define the attributes to be extracted to represent the instances in the form of numerical vectors;
- choose a type of classification model adapted to the operational constraints.
We now describe these three steps, giving generic advice and examples from the PDF file case study.
Here, in the case of detecting malicious PDF files, it is easy to get a labeled dataset, as this type of file is popular and frequently used to spread malicious code. The experiments presented in this paper are based on two datasets: Contagio (9,000 benign files and 11,101 malicious) and webPdf (2,078 benign files and 767 malicious). Contagio is a public dataset used in a lot of academic work, and we built webPdf from benign files from Google's search engine and malicious files obtained from the VirusTotal platform.
PDF files contain two types of information that can be used to generate attributes (Op-Ct): metadata (the author or the creation date, for example), and the list of its objects. Some information is already numerical, or a simple transformation can make it numerical. For example, the file size is numeric, and creation and modification dates can be turned into timestamps. However, other information, such as the author or the objects, is not numerical, and therefore cannot be exploited directly by machine learning algorithms. Moreover, each PDF file has a variable number of objects: how can this information be represented as a vector of fixed size?
This non-numerical information generally comes in one of the following formats:
- character strings;
- numerical lists of variable size;
- categorical characteristics.
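As a sketch of how these three formats can be mapped to a fixed-size numerical vector, one common option is the hashing trick for strings, summary statistics for variable-size lists, and one-hot encoding for categories; the attribute names, bucket count and sample values below are our own illustrative choices, not the paper's:

```python
import hashlib

def text_feature(s, n_buckets=8):
    """Hash a character string into a fixed number of buckets (hashing trick)."""
    vec = [0] * n_buckets
    digest = int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)
    vec[digest % n_buckets] = 1
    return vec

def list_feature(values):
    """Summarize a numerical list of variable size with fixed-size statistics."""
    if not values:
        return [0, 0.0, 0, 0]
    return [len(values), sum(values) / len(values), min(values), max(values)]

def categorical_feature(value, categories):
    """One-hot encode a categorical characteristic."""
    return [1 if value == c else 0 for c in categories]

# A hypothetical PDF instance: author string, object sizes, producer category
features = (
    text_feature("Jane Doe")
    + list_feature([120, 45, 3001])  # sizes of a variable number of objects
    + categorical_feature("pdflatex", ["pdflatex", "word", "other"])
)
print(len(features))  # 8 + 4 + 3 = 15: a fixed-size numerical vector
```

Whatever the number of objects or the length of the author string, every instance is mapped to a vector of the same dimension, which is the representation machine learning algorithms require.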
VII. CONCLUSION

In conclusion, we would like to point out that applying the stages of construction and validation of the detection model once is generally not sufficient. In practice, the implementation of a model is an iterative process in which the validation phase improves the learning of the model.

REFERENCES
[1] Saleema Amershi, Max Chickering, Steven M Drucker, Bongshin Lee, Patrice Simard, and Jina Suh. Modeltracker: Redesigning performance analysis tools for machine learning. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 337–346. ACM, 2015.
[2] Manos Antonakakis, Roberto Perdisci, Yacin Nadji, Nikolaos Vasiloglou, Saeed Abu-Nimeh, Wenke Lee, and David Dagon. From throw-away traffic to bots: detecting the rise of DGA-based malware. In USENIX Security, pages 491–506, 2012.
[3] Leyla Bilge, Davide Balzarotti, William Robertson, Engin Kirda, and Christopher Kruegel. Disclosure: detecting botnet command and control servers through large-scale netflow analysis. In ACSAC, pages 129–138, 2012.
[4] Dong Chen, Rachel KE Bellamy, Peter K Malkin, and Thomas Erickson. Diagnostic visualization for non-expert machine learning practitioners: A design study. In Visual Languages and Human-Centric Computing (VL/HCC), 2016 IEEE Symposium on, pages 87–95. IEEE, 2016.
[5] Mihai Christodorescu and Somesh Jha. Testing malware detectors. ACM SIGSOFT Software Engineering Notes, 29(4):34–44, 2004.
[6] Igino Corona, Davide Maiorca, Davide Ariu, and Giorgio Giacinto. Lux0r: Detection of malicious pdf-embedded javascript code through discriminant analysis of api references. In AISEC, pages 47–57, 2014.
[7] Dorothy E Denning. An intrusion-detection model. IEEE Transactions on Software Engineering, (2):222–232, 1987.