InSIS Project Description PartA
Participating SROs:
• Faculty of Technical Sciences, University of Priština temporarily settled in Kosovska Mitrovica (FTN-KM).
Abstract:
Background: The use of machine learning methods for solving various cyber-security problems has recently
become popular in the research community. Still, due to the many facets of cyber-threats, it remains a
challenging problem. The overall objective of this project is to introduce a novel algorithmic framework for
designing, analyzing and evaluating intelligent computational systems intended for automatic real-time detection
of two separate but interrelated cyber-threats at an early stage: (i) email-based social engineering attacks, and (ii)
botnet attacks.
Methods: At the methodological level, the proposed approach combines (iii) statistical and machine learning
methods, with (iv) computational linguistic methods. The research is essentially supported by two authentic datasets
provided by the Ministry of Interior of the Republic of Serbia: (v) email dataset, and (vi) dataset describing network
traffic behavior.
Expected results: The project will introduce a set of algorithms and models aimed at automatic real-
time detection of the considered cyber-attacks. The introduced approaches will be practically evaluated both
in laboratory and real-life settings. To measure the achievement of the project’s objectives, the performance of the
proposed algorithms and models will be compared to the performance of existing software systems for protection
against the considered cyber-attacks used by the Ministry of Interior of the Republic of Serbia.
Impact: The primary beneficiary impacted by the project is the Ministry of Interior of the Republic of Serbia.
In addition, the project strengthens the domain expertise of the participating researchers, thus making a direct
long-term impact on the education of new generations of ICT experts. Finally, the project has an indirect
long-term impact on society in the Republic of Serbia, through its dissemination activities, by raising the
overall knowledge and awareness of cyber-security.
1 Excellence
1.1 Objectives
Information security in cyberspace has been a pressing issue for some time now, and is treated as a matter of national and
international importance [16]. Although significant research effort has been devoted to this question, due to the
many facets of cyber-threats, it remains a challenging problem.
The overall objective of this project is to introduce a novel algorithmic framework for designing, analyzing
and evaluating intelligent computational systems intended for automatic real-time detection of two separate
but interrelated cyber-threats at an early stage:
(i) email-based social engineering attacks (in further text: phishing emails),
(ii) botnet attacks.
Both kinds of attack represent a serious security threat, and they are expected to continue to
account for a significant share of cyber-attacks in the future.
(i) A social engineering attack is the act of manipulating a person into taking an action that may allow the attacker
to obtain information, gain access to resources, etc. [10]. This kind of attack fundamentally relies on interaction
between the attacker and the individual user targeted by an attack. In order to conduct an attack targeting a
significant number of users in an automated manner, the interaction usually evolves over digital channels, and is
text-based, e.g., emails, web, social media, short text messages. Therefore, automatic text classification has an
important role in the detection of social engineering attacks.
Since the research question of automatic text classification has already been elaborated to a great extent [15], it is
important to note that the approach presented in this proposal is novel with respect to two aspects:
• At the specification level, the approach introduced in this project considers both analytical and generative
aspects of phishing emails as a type of social engineering attack:
– The analytical aspect is related to automatic real-time detection of security-critical email contents.
– The generative capability is related to automatic generation of phishing emails aimed at exposing the
end-user to a simulated attack, for the purpose of end-user training and increasing awareness of this
type of social engineering attack.
• At the methodological level (as discussed in Section 1.2 in more detail), the proposed approach is hybrid to
the extent that it combines:
– widely acknowledged language-independent statistical and machine learning methods,
– with underutilized, if not completely neglected, language-dependent computational linguistic methods.
Statistical and machine learning methods are important because we also aim at enabling systems to dynamically
adapt their underlying models according to constantly evolving attack tactics. The computational linguistic methods
are important because we devote special attention to natural languages that are of interest in dealing with email-based
social engineering attacks in the Republic of Serbia, i.e.:
• the English language, as the globally dominant language in unsolicited emails,
• the Serbian and other South Slavic languages, which are dominantly represented in the Republic of Serbia
and some of the neighboring countries.
(ii) A botnet is a collection of computers (i.e. bots) acting in a coordinated fashion to accomplish a common
goal with little or no intervention from the hacker, and without the hacker having to log into the client’s operating
system. In a typical botnet architecture, bot-clients join a predesignated Internet Relay Chat channel on a bot-server.
The attacker sends a command to a bot-server, which forwards it to bot-clients for execution (e.g., conducting a
distributed denial of service attack) [24].
Despite the presence of various intrusion detection and prevention mechanisms and devices, attacks that come
through the network still occur daily in large numbers. One of the underlying reasons is that the
majority of legacy systems still use a signature-based approach, which is inherently incapable of detecting zero-day
attacks. It is only recently that companies (e.g., Darktrace [26]) and products have appeared that
apply artificial intelligence techniques of unsupervised machine learning to autonomously detect and take action
against cyber-threats. The primary objective in the context of botnet attacks is to introduce an approach to detection
of threats to the network and computer systems through the analysis of threat specific anomalies in network traffic
behavior and network element logs in real time. The aim is to create a system that uses detailed real-time
in-network traffic and log inspection to detect specific threat-related traffic patterns that indicate the
presence of malware or other types of network-based attacks. Unlike existing approaches, which treat
departures from the “normal” behavior of network or computer systems as indications of an attack [26], we
focus on the analysis and detection of specific classes of botnet attacks. Our preliminary research reveals that,
although numerous threats appear every day with ever-changing signatures, some fundamental
mechanisms (e.g., malware communication patterns and time series patterns) are often reused by many different
attacks and tools and can be exploited for more efficient threat detection and suppression.
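As one hypothetical illustration of a reusable time series pattern, the sketch below flags near-periodic ("beaconing") connections from a host to a single destination by measuring how regular the gaps between connections are. The choice of the coefficient of variation, the threshold of 0.1, the function names and the sample timestamps are all assumptions made for illustration; the proposal itself does not prescribe this particular measure.

```python
from statistics import mean, pstdev


def interarrival_times(timestamps):
    """Return the gaps (in seconds) between consecutive, sorted timestamps."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def beacon_score(timestamps):
    """Coefficient of variation of the gaps; values near 0 mean near-periodic traffic."""
    gaps = interarrival_times(timestamps)
    if len(gaps) < 2 or mean(gaps) == 0:
        return None  # too little evidence to judge regularity
    return pstdev(gaps) / mean(gaps)


def looks_periodic(timestamps, threshold=0.1):
    """Flag a connection series whose gaps vary by less than the (assumed) threshold."""
    score = beacon_score(timestamps)
    return score is not None and score < threshold


# Made-up connection times (seconds) from one host to one destination.
bot_like = [0.0, 60.1, 119.9, 180.0, 240.2, 299.8]
human_like = [0.0, 3.0, 47.0, 52.0, 300.0, 310.0]

print(looks_periodic(bot_like), looks_periodic(human_like))
```

A real detector would have to cope with jittered schedules and with legitimate periodic traffic (e.g., update checks), so a score like this could only serve as one feature among several.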
To measure the achievement of the project’s objectives, the performance of the proposed algorithms will be
compared to the performance of existing software systems for protection against cyber-attacks considered in this
project, used by the Ministry of Interior of the Republic of Serbia. The introduced approaches will be practically
evaluated both in laboratory and real-life settings:
• Evaluation in laboratory settings is conducted over authentic historical data on actual cyber-attacks of interest,
provided by the Ministry of Interior of the Republic of Serbia (datasets are described in Section 1.2.1),
• Evaluation in real-life settings is conducted independently by the Ministry of Interior of the Republic of
Serbia, using fresh, previously unseen data.
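The proposal does not name the performance measures used for this comparison. As a sketch under that caveat, the example below assumes the common detection measures precision, recall and F1, and compares a hypothetical existing system with a hypothetical proposed detector on the same labeled messages. All labels and predictions are made-up toy data.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives (1 = attack)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp, fp, fn


def detection_metrics(y_true, y_pred):
    """Return precision, recall and F1, using 0.0 where a ratio is undefined."""
    tp, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy ground truth and predictions (assumed data, not project results).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
existing_system = [1, 0, 0, 0, 0, 1, 1, 0]
proposed_detector = [1, 1, 0, 0, 0, 0, 1, 0]

for name, predictions in [("existing", existing_system), ("proposed", proposed_detector)]:
    print(name, detection_metrics(y_true, predictions))
```

Reporting both systems on identical data keeps the comparison fair; for early-stage detection, a time-to-detection measure would be a natural addition.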
At the start of the spectrum are naive Bayes classifiers, which are based on two fundamental assumptions
[15]. The first assumption is that a text can be conceptualized as a bag of words, i.e., the position of a word in a text
is not considered important. Thus, for a given set of classes C, a text t that contains a set of features f1, f2, . . . , fn is
categorized as belonging to class ĉ ∈ C, where:
The second assumption is that the probabilities P( fi |c) are mutually independent, i.e.:
\[
\hat{c}(f_1, f_2, \ldots, f_n) = \underset{c \in C}{\operatorname{argmax}} \; \underbrace{P(c)}_{\text{prior}} \; \underbrace{\prod_{i=1}^{n} P(f_i \mid c)}_{\text{likelihood}} , \tag{2}
\]
where probabilities P(c) and P( fi |c) are estimated by using the frequencies in an underlying textual corpus,
combined with a smoothing algorithm.
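A minimal sketch of such a classifier is given below, assuming word tokens as features and add-one (Laplace) smoothing as the smoothing algorithm. The class name, the smoothing choice and the four toy messages are illustrative assumptions. Log-probabilities are summed instead of multiplying probabilities, which gives the same argmax while avoiding numerical underflow.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Bag-of-words naive Bayes with additive smoothing (alpha=1 is add-one)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_docs = Counter()                 # documents seen per class
        self.feature_counts = defaultdict(Counter)  # feature frequencies per class
        self.vocabulary = set()

    def fit(self, documents, labels):
        for tokens, label in zip(documents, labels):
            self.class_docs[label] += 1
            self.feature_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_posterior(self, tokens, label):
        """log P(c) + sum_i log P(f_i | c), up to a class-independent constant."""
        total_docs = sum(self.class_docs.values())
        score = math.log(self.class_docs[label] / total_docs)  # prior
        counts = self.feature_counts[label]
        denominator = sum(counts.values()) + self.alpha * len(self.vocabulary)
        for token in tokens:
            if token not in self.vocabulary:
                continue  # features never seen in training are ignored
            score += math.log((counts[token] + self.alpha) / denominator)
        return score

    def predict(self, tokens):
        return max(self.class_docs, key=lambda c: self.log_posterior(tokens, c))


# Toy training data (assumed for illustration only).
docs = [
    ["verify", "your", "account", "now"],
    ["urgent", "verify", "password"],
    ["meeting", "agenda", "attached"],
    ["your", "report", "attached"],
]
labels = ["phishing", "phishing", "legitimate", "legitimate"]
model = NaiveBayes().fit(docs, labels)
print(model.predict(["verify", "account", "urgent"]))
```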
In the middle of the spectrum, there are statistical language models (i.e., n-grams) [15]. These models are based
on the Markov-like assumption that the probability of a feature fi (e.g., a word or a letter) depends only on the
(k − 1) immediately preceding features, i.e.:
In other words, n-grams estimate features from a fixed-size window of previous features. For an order-k statistical
language model, the probability of a feature sequence f1, f2, . . . , fn is estimated as:
\[
P(f_1, f_2, \ldots, f_n) = \prod_{i=1}^{n} P(f_i \mid f_{i-k+1}, \ldots, f_{i-1}) , \tag{4}
\]
where probabilities P( fi | fi−k+1, . . . , fi−1 ) are estimated by using the frequencies in an underlying textual corpus,
combined with a smoothing algorithm. N-gram models can be applied in different manners for the purpose of text
categorization. For example, text t represented by a feature sequence f1, f2, . . . , fn is categorized as belonging to
class ĉ ∈ C, where:
\[
\hat{c}(f_1, f_2, \ldots, f_n) = \underset{c \in C}{\operatorname{argmax}} \; P(f_1, f_2, \ldots, f_n \mid c) . \tag{5}
\]
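The classification rule in Equation (5) can be sketched as follows, assuming letters as features, one order-k model trained per class, and add-one smoothing. The class name, the start-of-text padding symbol and the toy texts are assumptions made for this example.

```python
import math
from collections import Counter

BOS = "\x02"  # assumed padding symbol marking the start of a text


class CharNgramModel:
    """Order-k character model: each symbol depends on the k-1 preceding symbols."""

    def __init__(self, k=2, alpha=1.0):
        self.k = k
        self.alpha = alpha
        self.ngrams = Counter()    # counts of (context + symbol)
        self.contexts = Counter()  # counts of context alone
        self.alphabet = set()

    def _padded(self, text):
        return BOS * (self.k - 1) + text

    def fit(self, texts):
        for text in texts:
            padded = self._padded(text)
            self.alphabet.update(text)
            for i in range(self.k - 1, len(padded)):
                context = padded[i - self.k + 1:i]
                self.ngrams[context + padded[i]] += 1
                self.contexts[context] += 1
        return self

    def log_prob(self, text):
        """Smoothed log P(f_1, ..., f_n | c) as a sum over the chain of symbols."""
        padded = self._padded(text)
        size = len(self.alphabet) + 1  # one extra slot for unseen symbols
        total = 0.0
        for i in range(self.k - 1, len(padded)):
            context = padded[i - self.k + 1:i]
            numerator = self.ngrams[context + padded[i]] + self.alpha
            denominator = self.contexts[context] + self.alpha * size
            total += math.log(numerator / denominator)
        return total


def classify(text, models):
    """Pick the class whose model assigns the text the highest probability."""
    return max(models, key=lambda label: models[label].log_prob(text))


# Toy per-class training texts (assumed for illustration only).
models = {
    "phishing": CharNgramModel().fit(["verify your account", "verify your password now"]),
    "legitimate": CharNgramModel().fit(["meeting agenda attached", "project report attached"]),
}
print(classify("verify account", models))
```

Because the models work on letters rather than words, they still assign a usable score to words that never occurred in training.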
Alternatively, a separate n-gram-based profile can be calculated for text t and for each class in C. Text t is then
categorized as belonging to the class with minimum distance to the profile of t, i.e.:
At the end of the spectrum, there are recurrent neural networks. The core idea can be briefly presented as
follows. Each word w from vocabulary V is assigned a learned distributed word feature vector in ℝ^k, i.e., a
real-valued vector of dimensionality k, where k ≪ |V|. This vector is called an embedding (cf. [2, 19, 20]). The
joint probability function of a word sequence is then expressed in terms of the embeddings of the words contained in the
given sequence. The word embeddings and the probability function are learned simultaneously, by applying a
recurrent neural network language model. This model contains three layers: an input layer x_t ∈ ℝ^{|V|} holding the
one-hot representation of the input word at time t, a hidden layer s_t ∈ ℝ^k representing the sentence history at time t, and
an output layer y_t ∈ ℝ^{|V|} representing the probability distribution over words at time t. At a more practical level, this
recurrent neural network is described by three matrices (cf. [25]):
• M_x: input matrix of dimension |V| × k. Each word from vocabulary V is assigned a row in M_x containing
its k-dimensional embedding, i.e., the embedding of word x_t is equal to x_t^T × M_x, where x_t^T is the transposed
one-hot representation of x_t.
• M_s: recurrent matrix of dimension k × k representing sentence history. Recurrent connections allow
information to cycle inside the network for an arbitrarily long time (as illustrated in Fig. 1(a)), which is used
to overcome the problem of fixed-length dialogue context mentioned in the previous section.
• M_y: output matrix of dimension k × |V| that maps the hidden state to a probability distribution over words.
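The role of the three matrices can be sketched as a single forward step. The example assumes the common Elman-style update, s_t = sigmoid(x_t^T M_x + s_{t-1} M_s) and y_t = softmax(s_t M_y); the text above fixes only the matrix dimensions, so the activation functions, the random initialization and all names are assumptions. The weights are untrained, so the output distribution is not meaningful beyond its shape.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def vec_mat(vec, mat):
    """Multiply a row vector by a matrix."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


class TinyRNNLM:
    def __init__(self, vocabulary, k, seed=0):
        rng = random.Random(seed)
        self.index = {word: i for i, word in enumerate(vocabulary)}
        self.k = k
        self.Mx = make_matrix(len(vocabulary), k, rng)  # |V| x k, one embedding per row
        self.Ms = make_matrix(k, k, rng)                # k x k, recurrent connections
        self.My = make_matrix(k, len(vocabulary), rng)  # k x |V|, hidden state to words

    def step(self, word, state):
        """Consume one word; return the new hidden state and next-word distribution."""
        embedding = self.Mx[self.index[word]]  # same as one-hot^T x Mx
        recurrent = vec_mat(state, self.Ms)
        new_state = [sigmoid(e + r) for e, r in zip(embedding, recurrent)]
        return new_state, softmax(vec_mat(new_state, self.My))


model = TinyRNNLM(["verify", "your", "account"], k=2)
state = [0.0] * model.k
for word in ["verify", "your"]:
    state, probs = model.step(word, state)
print(probs)
```

Selecting a row of Mx is equivalent to multiplying the one-hot vector by Mx, which is why no explicit one-hot vector is built in `step`.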
Figure 1: (a) Cycling of sentence history information inside the recurrent neural network model (cf. [19]), (b)
illustration of the recurrent neural network model for text processing.
The probability of a word sequence w1, w2, ..., wn is then estimated as:
\[
P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1}) . \tag{11}
\]
This estimation of the probability of a word sequence accounts for a much broader span of text than is
practically achievable with n-grams. Let A = a_1, a_2, ..., a_p be a sequence of immediately preceding words. The
probability of an ensuing word sequence B = b_1, b_2, ..., b_q is estimated as:
\[
P(B \mid A) = \prod_{i=1}^{q} P(b_i \mid \underbrace{a_1, a_2, \ldots, a_p}_{\text{history sequence}}, \underbrace{b_1, b_2, \ldots, b_{i-1}}_{\text{ensuing sequence}}) . \tag{12}
\]
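The product in Equation (12) can be sketched as a loop that scores each ensuing word given the history plus the ensuing words already consumed. The function `toy_next_word_prob` and its probability table are hypothetical stand-ins for a trained language model; this toy version looks only at the last word, whereas a recurrent model would use the whole context it receives.

```python
import math


def continuation_log_prob(history, continuation, next_word_prob):
    """Return log P(B | A) by the chain rule, growing the context word by word."""
    context = list(history)
    total = 0.0
    for word in continuation:
        total += math.log(next_word_prob(word, context))
        context.append(word)  # b_i joins the context for b_{i+1}
    return total


# Hypothetical probability table standing in for a trained model.
TOY_TABLE = {("your", "account"): 0.5, ("account", "now"): 0.4}


def toy_next_word_prob(word, context):
    previous = context[-1] if context else None
    return TOY_TABLE.get((previous, word), 0.1)


A = ["verify", "your"]
B = ["account", "now"]
log_p = continuation_log_prob(A, B, toy_next_word_prob)
print(math.exp(log_p))
```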
The point of departure in our approach is that the temporal aspect of text essentially depends on the
underlying syntactic structure of a given text — which remains underutilized in the state-of-the-art machine
learning approaches. Thus, in languages with fixed word order (e.g., English, German, etc.), taking longer
textual spans to model context may be an appropriate decision. In languages with flexible word order (e.g.,
Serbian and other South Slavic languages), the feature sequence contained in a shorter, phrase-level textual span can be
considered more relevant for context modeling.
To illustrate this, we consider the Serbian language. The word order in a sentence is rather flexible. For
example, the order of subject, verb and object can be arbitrarily selected, i.e., all six permutations are considered
correct and carry the same propositional content. In contrast to this rather flexible word order, enclitics occur in a
rather fixed order. They cannot stand alone, but are dependent on the word preceding them, and thus they appear
either immediately after the first antecedent word in the clause, or immediately after an accented verb form. When
they occur together, the enclitic forms are organized in the following order: (1) the question particle li, (2) an
auxiliary verb, (3) a dative form of a pronoun, (4) an accusative form of a pronoun, (5) the reflexive enclitic se, and
(6) the verb form je. However, the speaker/writer typically does not fill all six places, but just some of them, as
illustrated in the following examples [12, 18]:
• Lexical features such as word and subword (i.e., symbol) units, and their sequential ordering. While
the word features are typically used for the purpose of text classification, they do not suffice, due to the
presence of out-of-vocabulary words, and different levels of morphological information encoding in different
languages:
– Out-of-vocabulary words: a text is expected to contain out-of-vocabulary words, i.e., words that were
not seen during the training phase or that are possibly maliciously formulated, e.g., by inserting HTML
comments (e.g., awa<!--asd -->rd), diacritic marks on letters (e.g., hänäx), etc.
– Highly inflectional languages (such as the Serbian language) have significant information (e.g., case,
gender, semantic role) coded in word morphology.
Therefore, it is necessary to consider also subword units (e.g., letters, bytes), which further implies some
sort of morphological text analysis.
• Lexicogrammatical features such as lexical cohesive agencies that reflect suprasegmental relations between
words. This group includes two complementary classes of features:
– Anaphoric referring expressions (e.g., pronouns) establish relations with elements that were explicitly
mentioned in the preceding text [11]. For example, in the following sentence:
(17) My friend received her email but he accidentally deleted it.
pronoun he refers to my friend, it refers to an email, and her cannot be resolved in the given context.
However, it should be noted that we do not deal with the research problem of coreference resolution,
but rather consider anaphoric referring expressions as features indicating the degree of text cohesion.
– Ellipsis-substitutions, i.e., forms of anaphoric cohesion in a discourse, where we presuppose something
by means of what is left out [11]. In contrast to anaphoric referring expressions, the typical meaning of
ellipsis-substitutions is not one of co-reference — there is always some significant difference between
the second instance and the first. For example, the clause Do it! does not explicitly carry propositional
information, but contains an ellipsis-substitution (do) and a reference (it). However, it may be considered
as a potential signal of a critical point in text. Again, we do not deal with the research question of
ellipsis-substitution resolution, but rather consider them as features indicating possibly critical points
in a text.
• Structural features, e.g., paragraph structure, the presence of images, the ratio of text to image areas,
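The lexical and lexicogrammatical feature groups listed above can be sketched as simple extractors: one undoes the two obfuscation tricks mentioned (HTML comments and added diacritics) before collecting letter n-grams, and one measures pronoun density as a crude cohesion indicator. The function names and the English pronoun list are assumptions for illustration. Stripping all diacritics is itself a simplifying assumption: for Serbian written in Latin script, letters such as č, ć, š and ž are legitimate and would need to be preserved.

```python
import re
import unicodedata
from collections import Counter

HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

# Illustrative, incomplete list of English anaphoric pronouns (an assumption).
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "his", "its", "their"}


def normalize(text):
    """Remove HTML comments and combining diacritic marks, then lowercase."""
    without_comments = HTML_COMMENT.sub("", text)
    decomposed = unicodedata.normalize("NFKD", without_comments)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def char_ngrams(text, n=3):
    """Count overlapping letter n-grams, a simple kind of subword feature."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def pronoun_ratio(text):
    """Share of tokens that are pronouns; no coreference resolution is attempted."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in PRONOUNS for token in tokens) / len(tokens)


print(normalize("awa<!--asd -->rd"), normalize("hänäx"))
print(char_ngrams(normalize("awa<!--asd -->rd")))
print(pronoun_ratio("My friend received her email but he accidentally deleted it."))
```

The pronoun ratio treats referring expressions purely as surface features, in line with the stated decision not to resolve them.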
– data on email contents (and metadata) containing instances of both legitimate and social engineering
attack contents,
– data describing network traffic behavior representing both legitimate and security-critical activities.
The data will be both obtained from external sources (i.e., the Ministry of Interior of the Republic of
Serbia) and generated by the participating researchers during the project implementation.
• What significant datasets are needed for the Project implementation? Specify data types and data
size. Specify primary or secondary use of data.
The Ministry of Interior of the Republic of Serbia will provide two datasets to support the research activities
in this project:
– email dataset (including metadata) containing instances of both legitimate and social engineering attack
contents,
– dataset describing network traffic behavior and/or containing network element logs, representing both
legitimate and security-critical activities.
These datasets are authentic and machine-readable. They are representative to the extent that they proportionally
include a relevant range of cyber-threat manifestations that are of interest, and will be solely used
for research purposes.
The size of the corpus remains to be estimated.
• Do you already have access to this data, or will the data be obtained during Project implementation?
If the data is to be obtained during Project implementation, specify so.
The data provided by the Ministry of Interior of the Republic of Serbia will be obtained at the start of
the project. The production of additional data will be conducted during the project implementation by the
participating researchers.
• How will the data be stored and accessed? What measures will be taken to ensure secure data storage
and use, including data security?
The data provided by the Ministry of Interior of the Republic of Serbia will be stored at a dedicated physical
server at the University of Criminal Investigation and Police Studies in Belgrade. The storage and access to
the data will be conducted under the regulations of the Ministry of Interior and the University of Criminal
Investigation and Police Studies.
• Who will have access to the data during Project implementation?
The participating researchers will have access to the data during the project implementation, under the
regulations of the Ministry of Interior and the University of Criminal Investigation and Police Studies.
• How will the data be used with reference to AI?
With reference to AI, the datasets will be used for the design, training and evaluation of statistical, machine
learning and computational linguistic models aimed at automatic detection of cyber-attacks of interest, as
described in Sections 1.1 and 1.2.
• How will the costs for data curation and preservation be covered?
The costs of data curation are included in the researchers’ salaries. The costs of data preservation are
included in the equipment-related costs.
• How will these data be exploited and/or shared/made accessible for verification and re-use during and
after Project implementation? If data cannot be made available, explain why. Who will have access
to the data after Project implementation?
The datasets provided by the Ministry of Interior of the Republic of Serbia will not be publicly available.
During the project implementation, the data will be shared among the participating researchers under the
regulations of the Ministry of Interior and the University of Criminal Investigation and Police Studies.
In contrast, the datasets generated by the participating researchers during the project implementation
will, after appropriate anonymization, be made publicly available for research purposes.
• Who will have access to the code, or software after Project implementation ends?
The Ministry of Interior of the Republic of Serbia will have access to the code and software after the project
implementation ends. Selected parts of the code and functionality of the systems may be presented in
scientific publications, at scientific meetings and for educational purposes, with the approval of the Ministry.
1.3 Ambition
Criminal activity on the Internet naturally grows with the growth of business activity on it. Various types of
crime, such as disruption of regular services, theft or scams, can easily be carried out on the Internet owing to the
possibility of hiding the real identity of the attacker, the use of infected machines in remote networks as platforms
for the attack, and the possibility of purchasing low-cost tools for conducting such attacks, or of finding and modifying existing
ones.
Current research in the domain of network-based cyber-attacks focuses largely on attack detection and mitigation
once the attack has already begun (e.g., DDoS attack detection [17, 23]). However, the InSIS project team believes
that such an approach is inherently suboptimal, as it often detects the attack once the damage to the victim and/or
neighboring networks is already done; furthermore, the devices that caused the attack remain undetected.
By exploring various network and social engineering threat behaviour patterns and creating a system that can
efficiently discover threats in the early stage of the attack lifecycle, it would be possible to mitigate a wider set of
attacks and locate the attacker.
References
[1] I. Baptista, S. Shiaeles, N. Kolokotronis (2019) A Novel Malware Detection System Based On Machine
Learning and Binary Visualization, in Proc. of the 1st International Workshop on Data Driven Intelligence
for Networks and Systems (DDINS), Shanghai, China, May 20-24, 2019, IEEE.
[2] Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin (2003) A Neural Probabilistic Language Model. Journal of
Machine Learning Research 3, pp. 1137–1155.
[3] P. Čisar, D. Joksimović (2019) Heuristic scanning and sandbox approach in malware detection, in Proc. of
the IX International Scientific Conference “Archibald Reiss Day”, Belgrade.
[4] P. Čisar, S. Maravić Čisar (2019) EWMA Statistics and Fuzzy Logic in Function of Network Anomaly
Detection, Facta Universitatis, Series: Electronics and Energetics, University of Niš, Vol. 32, No. 2, pp.
249–265.
[5] P. Čisar, S. Maravić Čisar, B. Markoski (2014) Implementation of Immunological Algorithms in Solving
Optimization Problems, Acta Polytechnica Hungarica, Vol. 11, No. 4, pp. 225–240.
[6] CyberTrust (accessed on January 24, 2020) CyberTrust — Advanced Cyber-Threat Intelligence, Detection,
and Mitigation Platform for a Trusted Internet of Things, https://cyber-trust.eu/.
[7] M. Gnjatović (2019) Conversational Agents and Negative Lessons from Behaviourism, in Innovations in Big
Data Mining and Embedded Knowledge, Springer series in Intelligent System Reference Library, Springer,
Invited chapter, pp. 259–274.
[8] M. Gnjatović, V. Delić (2014) Cognitively-inspired representational approach to meaning in machine dialogue,
Knowledge-Based Systems 71, pp. 25–33, category: M21a.
[9] M. Gnjatović, M. Janev, V. Delić (2012) Focus Tree: Modeling Attentional Information in Task-Oriented
Human-Machine Interaction, Applied Intelligence 37(3), pp. 305–320, category: M21.
[10] C. Hadnagy, P. Wilson (2010) Social Engineering: The Art of Human Hacking, Wiley.
[11] M.A.K. Halliday, C.M.I.M. Matthiessen (2004) An introduction to functional grammar, third edition, Hodder
Arnold.
[12] L. Hammond (2005) Serbian: An Essential Grammar, CRC Press, Taylor & Francis Group.
[13] D. Joksimović, G. Kežević, V. Pavlović, M. Ljubić, V. Surovy (2017) Some aspects of the application of
Benford’s law in the analysis of the data set anomalies, Chapter 4 in the edited volume Knowledge Discovery in
Cyberspace: Statistical Analysis and Predictive Modeling, editors: K. Kuk and D. Ranđelović, NOVA
Publishers New York, pp. 85–120.
[14] O. Joldžić, Z. Ðurić, P. Vuletić (2016), A Transparent and Scalable Anomaly-Based DoS Detection Method,
Computer Networks, Vol 104, pages 27-42, category: M21.
[15] D. Jurafsky, J.H. Martin (2009) Speech and Language Processing: An Introduction to Natural Language
Processing, Speech Recognition, and Computational Linguistics. 2nd edition. Prentice-Hall.
[16] A. Klimburg, ed. (2012) National Cyber Security Framework Manual, NATO CCD COE Publications.
[17] A.C. Lapolli, J. Adilson Marques, L.P. Gaspar (2019) Offloading Real-time DDoS Attack Detection to
Programmable Data Planes, 2019 IFIP/IEEE Symposium on Integrated Network and Service Management
(IM), Arlington, VA, USA, 2019, pp. 19–27.
[18] T.F. Magner (1995) Introduction to the Croatian and Serbian Language, The Pennsylvania State University
Press.
[19] T. Mikolov, M. Karafiát, L. Burget, J.H. Černocký, S. Khudanpur (2010) Recurrent neural network
based language model. In Proc. of INTERSPEECH 2010, pp 1045–1048.
[20] T. Mikolov, W.-t. Yih, G. Zweig (2013) Linguistic Regularities in Continuous Space Word Representations.
In Proceedings of NAACL-HLT 2013, Association for Computational Linguistics, pp. 746–751.
[21] V. Pavlović, G. Kežević, M. Joksimović, D. Joksimović (2019) Fraud Detection in Financial Statements
Applying Benford’s Law with Monte Carlo Simulation, Acta Oeconomica, vol. 69, no. 2, pp. 217–239.
[22] Ramses 2020 (accessed on January 24, 2020) RAMSES 2020 — Internet Forensic platform for tracking the
money flow of financially-motivated malware, https://ramses2020.eu/.
[23] F. Rebecchi, J. Boite, P.A. Nardin, M. Bouet, V. Conan (2019) DDoS protection with stateful software-defined
networking. Int J Network Mgmt. 2019; 29(1).
[24] C.A. Schiller, J. Binkley, D. Harley, G. Evron, T. Bradley, C. Willems, M. Cross (2007) Botnets Overview.
Botnets, Burlington, Syngress, pp. 29–75.
[25] A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J.-Y. Nie, J. Gao, B. Dolan (2015) A Neural
Network Approach to Context-Sensitive Generation of Conversational Responses. In Proc. of HLT-NAACL
2015, pp. 196–205.
[26] The Enterprise Immune System (accessed on January 24, 2020) Darktrace, https://www.darktrace.com/en/.
[27] S. Wang, C.D. Manning (2012) Baselines and bigrams: Simple, good sentiment and topic classification. In
ACL 2012, pp. 90–94.
2 Impact
2.1 Expected impact
The direct and primary beneficiary impacted by the project is the Ministry of Interior of the Republic of Serbia.
It is expected that the project results will create — both in mid-term and long-term — the possibility for the
Ministry to:
At the educational and technological levels, funding of the proposed project will provide substantial support for the
further development of:
• the Department of Information Technology at the University of Criminal Investigation and Police Studies in
Belgrade,
• the recently established Laboratory for Information Security at the School of Electrical Engineering in
Belgrade and the establishment of the new research group around it.
The project aims to involve researchers at the beginning or middle of their careers. They also serve as lecturers
at their scientific organizations, and are crucial to the development and education of new generations of ICT
experts. The experience gained in this project would improve, in the long term, the quality of the courses and lectures
and the amount of knowledge transferred to the students.
At the societal level, the project has an indirect long-term impact on society in the Republic of Serbia.
Through the dissemination of the project’s findings and the lectures given by the project partners, it will raise
the overall knowledge of cyber-security in Serbia, as well as awareness across society of modern attack
vectors and of the mechanisms for mitigating them.
• publication of papers in prominent international journals with impact factor, including two publications in
open-access journals with impact factor,
• presenting research results at international conferences, workshops and scientific meetings,
• maintenance of a website featuring information about the project,
3 Implementation Plan
...