InSIS Project Description PartA

Project Description - Part A
Project proposal title (and acronym):

Intelligent Systems and Information Security (InSIS)
Participating SROs acronyms: KPU, ETF, ELFAK, FTN-NS, FTN-KM

Subprogram: Applied Research
Research field of the Project: Intelligent systems
Participating SROs:
• University of Criminal Investigation and Police Studies, Belgrade, Serbia (KPU),

• School of Electrical Engineering, University of Belgrade, Serbia (ETF),
• Faculty of Electronic Engineering, University of Niš (ELFAK),
• Faculty of Technical Sciences, University of Novi Sad, Serbia (FTN-NS),
• Faculty of Technical Sciences, University of Priština temporarily settled in Kosovska Mitrovica (FTN-KM).
Principal Investigator (PI): prof. dr Kristijan Kuk (KPU)
Abstract:
Background: The use of machine learning methods for solving various cyber-security problems has recently
become popular in the research community. Still, due to the many facets of cyber-threats, it remains a
challenging problem. The overall objective of this project is to introduce a novel algorithmic framework for
designing, analyzing and evaluating intelligent computational systems intended for automatic real-time detection
of two separate but interrelated cyber-threats at an early stage: (i) email-based social engineering attacks, and (ii)
botnet attacks.
Methods: At the methodological level, the proposed approach combines (iii) statistical and machine learning
methods, with (iv) computational linguistic methods. The research is essentially supported by two authentic datasets
provided by the Ministry of Interior of the Republic of Serbia: (v) email dataset, and (vi) dataset describing network
traffic behavior.
Expected results: The project will introduce a set of algorithms and models aimed at automatic real-
time detection of the considered cyber-attacks. The introduced approaches will be practically evaluated both
in laboratory and real-life settings. To measure the achievement of the project’s objectives, the performance of the
proposed algorithms and models will be compared to the performance of existing software systems for protection
against the considered cyber-attacks used by the Ministry of Interior of the Republic of Serbia.
Impact: The primary beneficiary impacted by the project is the Ministry of Interior of the Republic of Serbia.
In addition, the project strengthens the domain expertise of the participating researchers, thus making a direct
long-term impact on the education of the new generations of ICT experts. Finally, the project has an indirect
long-term impact on society in the Republic of Serbia, through dissemination activities, by raising
overall knowledge and awareness of cyber-security.
Total requested budget in EUR: ← to be completed
1 Excellence
1.1 Objectives
Information security in cyberspace has been a burning issue for some time now, treated as an issue of national and
international importance [16]. Although significant research effort has been devoted to this question, due to the
many facets of cyber-threats, it remains a challenging problem.
The overall objective of this project is to introduce a novel algorithmic framework for designing, analyzing
and evaluating intelligent computational systems intended for automatic real-time detection of two separate
but interrelated cyber-threats at an early stage:
(i) email-based social engineering attacks (in further text: phishing emails),
(ii) botnet attacks.
Both kinds of attack represent a serious security threat, and they are expected to continue to
account for a significant share of cyber-attacks in the future.
(i) A social engineering attack is the act of manipulating a person into taking an action that may allow the attacker
to obtain information, gain access to resources, etc. [10]. This kind of attack fundamentally relies on interaction
between the attacker and the individual user targeted by an attack. In order to conduct an attack targeting a
significant number of users in an automated manner, the interaction usually evolves over digital channels, and is
text-based, e.g., emails, web, social media, short text messages. Therefore, automatic text classification has an
important role in the detection of social engineering attacks.
Since the research question of automatic text classification has already been elaborated to a great extent [15], it is
important to note that the approach presented in this proposal is novel with respect to two aspects:
• At the specification level, the approach introduced in this project considers both analytical and generative
aspects of phishing emails as a type of social engineering attack:
– The analytical aspect is related to automatic real-time detection of security-critical email contents.
– The generative capability is related to automatic generation of phishing emails aimed at exposing the
end-user to a simulated attack, for the purpose of end-user training and increasing awareness of this
type of social engineering attack.
• At the methodological level (as discussed in Section 1.2 in more detail), the proposed approach is hybrid to
the extent that it combines:
– widely acknowledged language-independent statistical and machine learning methods,
– with underutilized, if not completely neglected, language-dependent computational linguistic methods.
Statistical and machine learning methods are important because we also aim at enabling systems to dynamically
adapt their underlying models according to constantly evolving attack tactics. The computational linguistic methods
are important because we devote special attention to the natural languages that are of interest in dealing with email-based
social engineering attacks in the Republic of Serbia, i.e.:
• the English language, as the globally dominant language in unsolicited emails,
• the Serbian and other South Slavic languages, which are dominantly represented in the Republic of Serbia
and some of the neighboring countries.
(ii) A botnet is a collection of computers (i.e. bots) acting in a coordinated fashion to accomplish a common
goal with little or no intervention from the hacker, and without the hacker having to log into the client’s operating
system. In a typical botnet architecture, bot-clients join a predesignated Internet Relay Chat channel on a bot-server.
The attacker sends a command to a bot-server, which forwards it to bot-clients for execution (e.g., conducting a
distributed denial of service attack) [24].
Despite the presence of various intrusion detection and prevention mechanisms and devices, attacks that come
through the network still happen on a daily basis in large quantities. One of the underlying reasons is that the
majority of legacy systems still use a signature-based approach, which is inherently incapable of detecting zero-day
attacks. Only recently have companies (e.g., Darktrace [26]) and products appeared that
apply unsupervised machine learning techniques to autonomously detect and take action
against cyber-threats. The primary objective in the context of botnet attacks is to introduce an approach to the detection
of threats to the network and computer systems through the analysis of threat-specific anomalies in network traffic
behavior and network element logs in real time. The aim is to create a system that will use detailed real-time
in-network traffic and log inspection for the detection of specific threat-related traffic patterns that indicate the
presence of malware or other types of network-based attacks. Unlike the existing approaches based on detection
of departures from the “normal” behavior of network or computer systems as indications of the attack [26], we
focus on the analysis and detection of the specific classes of botnet attacks. Our preliminary research reveals that
although numerous threats appear every day, with ever-changing signatures, some fundamental
mechanisms (e.g. malware communication patterns and time series patterns) are often reused by many different
attacks/tools and can be exploited for more efficient threat detection and suppression.
To measure the achievement of the project’s objectives, the performance of the proposed algorithms will be
compared to the performance of existing software systems for protection against cyber-attacks considered in this
project, used by the Ministry of Interior of the Republic of Serbia. The introduced approaches will be practically
evaluated both in laboratory and real-life settings:
• Evaluation in laboratory settings is conducted over authentic historical data on actual cyber-attacks of interest,
provided by the Ministry of Interior of the Republic of Serbia (datasets are described in Section 1.2.1),
• Evaluation in real-life settings is conducted independently by the Ministry of Interior of the Republic of
Serbia, using fresh, previously unseen data.
1.2 Concept and methodology

The use of machine learning methods for solving various cyber-security problems (mainly detecting threats and
attacks) has recently become very popular in the research community. As an example, a method introduced in
the RAMSES project [22] is aimed at the detection of fraudulent financial transactions, using machine-learning-based
methods to detect anomalies in the behavior patterns of users of ATMs and other electronic
payment methods.
At the conceptual level, the approach proposed in this project is fundamentally organized around the
notion of time series patterns. Whether we consider the inherent sequentiality of phishing email
content or malware communication patterns in botnet attacks, a given input can be considered in a general sense as
a time-ordered sequence of symbols over a given alphabet. This conceptualization allows for the employment of
statistical and machine learning techniques, which is also in line with the ongoing H2020 Cyber Trust project [6],
one of whose technological objectives is to detect already formed botnets using network analytics and to process
the gathered data with deep learning and other cutting-edge methods and tools [1].
At the methodological level, the proposed approach integrates three diverse methodological lines:
• feature-based probabilistic generative machine learning models (e.g., naive Bayes, n-grams, and recurrent
neural networks),
• statistical methods from information theory and data mining (e.g., methods based on entropy, the notion
of information gain, and Benford’s law),
• nonstatistical computational linguistic methods (e.g., adaptations of the edit distance algorithm for the
purpose of weighted multiple sequence alignment).
To adapt these rather general statistical and machine learning methods to the particular tasks stated in
Section 1.1, they are extended by linguistic concepts, in particular with respect to the research question of
context modeling.
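The information-theoretic notions named in the list above (entropy, information gain, and Benford’s law) can be illustrated with a minimal sketch. The function names and the toy inputs below are assumptions made for illustration only; they are not part of the proposed framework.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of a symbol sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(labels, feature_values):
    """Reduction in label entropy obtained by splitting the data on one feature."""
    groups = {}
    for label, value in zip(labels, feature_values):
        groups.setdefault(value, []).append(label)
    total = len(labels)
    conditional = sum(len(g) / total * shannon_entropy(g) for g in groups.values())
    return shannon_entropy(labels) - conditional

def benford_expected(digit):
    """Expected relative frequency of a leading digit (1..9) under Benford's law."""
    return math.log10(1 + 1 / digit)

# Hypothetical toy data: a binary feature that separates the two classes perfectly.
labels = ["phishing", "phishing", "legitimate", "legitimate"]
has_suspicious_link = [1, 1, 0, 0]
gain = information_gain(labels, has_suspicious_link)
```

In this toy case the feature removes all uncertainty about the class, so the gain equals the full label entropy of one bit. Observed leading-digit frequencies (e.g., of packet sizes) could be compared against `benford_expected` in the same spirit.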
The task of automatic text classification can be briefly described as assigning a category from a predefined set
$C = \{c_1, c_2, \ldots, c_m\}$ to a given text, which is typically represented by a feature set or a feature vector $f_1, f_2, \ldots, f_n$.
The field of state-of-the-art probabilistic text classifiers is characterized by a methodological dichotomy between
generative and discriminative models. In our approach, we particularly focus on generative probabilistic models
for text classification, since they allow for generating the features of a text. This is particularly important keeping in
mind that we aim not only at detecting phishing email attacks (and botnet attacks), but also at automatic generation
of phishing emails aimed at exposing the end-user to a simulated attack.
However, probabilistic generative models differ in how they deal with the fundamental
question of context modeling with respect to the sequentiality of text occurring at the level of words
and symbols. They range from models that do not account for text sequentiality at all (e.g., naive Bayes),
over models that can capture limited spans of text (e.g., statistical language models, i.e., n-grams), to models that
can capture an arbitrarily long span of text (e.g., recurrent neural networks, and, more specifically, long short-term
memory networks). The models on this spectrum involve a specific trade-off that is discussed below (a
more detailed discussion is provided in [7]).
At the start of the spectrum, there are naive Bayes classifiers which are based on two fundamental assumptions
[15]. The first assumption is that a text can be conceptualized as a bag-of-words, i.e., position of a word in a text
is not considered important. Thus, for a given set of classes C, text t that contains a set of features f1, f2, . . . , fn is
categorized as belonging to class ĉ ∈ C, where:
\[
\hat{c}(f_1, f_2, \ldots, f_n) = \operatorname*{argmax}_{c \in C} \; \underbrace{P(c)}_{\text{prior}} \; \underbrace{P(f_1, f_2, \ldots, f_n \mid c)}_{\text{likelihood}} . \tag{1}
\]
The second assumption is that the features are conditionally independent given the class, i.e.:
\[
\hat{c}(f_1, f_2, \ldots, f_n) = \operatorname*{argmax}_{c \in C} \; \underbrace{P(c)}_{\text{prior}} \; \underbrace{\prod_{i=1}^{n} P(f_i \mid c)}_{\text{likelihood}} , \tag{2}
\]
where probabilities P(c) and P( fi |c) are estimated by using the frequencies in an underlying textual corpus,
combined with a smoothing algorithm.
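As a minimal sketch of Eq. (2), the classifier below estimates the prior and the likelihoods from frequencies and works in log space to avoid numerical underflow. Add-one (Laplace) smoothing is assumed here as one possible smoothing algorithm, and the four toy documents are hypothetical; neither is prescribed by the proposal.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents):
    """documents: list of (tokens, class_label) pairs."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in documents:
        class_counts[label] += 1
        feature_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, feature_counts, vocabulary

def classify(tokens, model):
    """Return the class maximizing log P(c) + sum_i log P(f_i | c), cf. Eq. (2)."""
    class_counts, feature_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        denominator = sum(feature_counts[label].values()) + len(vocabulary)
        for token in tokens:
            if token in vocabulary:  # tokens never seen in training are skipped
                # add-one smoothed log likelihood of a single feature
                score += math.log((feature_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy corpus, for illustration only.
toy_corpus = [
    (["verify", "your", "account", "now"], "phishing"),
    (["urgent", "verify", "password"], "phishing"),
    (["meeting", "agenda", "attached"], "legitimate"),
    (["project", "meeting", "tomorrow"], "legitimate"),
]
model = train_naive_bayes(toy_corpus)
predicted = classify(["verify", "account"], model)
```

Because the features are treated as a bag of words, reordering the tokens of the input does not change the result, which is exactly the limitation discussed in this section.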
In the middle of the spectrum, there are statistical language models (i.e., n-grams) [15]. These models are based
on the Markov-like assumption that the probability of a feature $f_i$ (e.g., a word or a letter) depends only on the
$(k - 1)$ immediately preceding features, i.e.:
\[
P(f_i \mid f_1, f_2, \ldots, f_{i-1}) \approx P(f_i \mid f_{i-k+1}, \ldots, f_{i-1}) . \tag{3}
\]
In other words, n-grams estimate features from a fixed-size window of previous features. For an order-k statistical
language model, the probability of a feature sequence f1, f2, . . . , fn is estimated as:
\[
P(f_1, f_2, \ldots, f_n) = \prod_{i=1}^{n} P(f_i \mid f_{i-k+1}, \ldots, f_{i-1}) , \tag{4}
\]
where probabilities P( fi | fi−k+1, . . . , fi−1 ) are estimated by using the frequencies in an underlying textual corpus,
combined with a smoothing algorithm. N-gram models can be applied in different manners for the purpose of text
categorization. For example, text t represented by a feature sequence f1, f2, . . . , fn is categorized as belonging to
class ĉ ∈ C, where:
\[
\hat{c}(f_1, f_2, \ldots, f_n) = \operatorname*{argmax}_{c \in C} \; P(f_1, f_2, \ldots, f_n \mid c) . \tag{5}
\]
Alternatively, a separate n-gram-based profile can be calculated for text t and for each class in C. Text t is then
categorized as belonging to the class with minimum distance to the profile of t, i.e.:
\[
\hat{c}(t) = \operatorname*{argmin}_{c \in C} \; \mathrm{distance}(c, t) . \tag{6}
\]
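The first of the two options, Eq. (5), can be sketched as one order-k model per class, each scoring the input with the product of Eq. (4) in log space. The sketch assumes character-level features, add-one smoothing, a start-padding symbol `<s>`, and an alphabet size of 64; the class name, these settings, and the two training strings are all hypothetical.

```python
import math
from collections import Counter

def ngrams(sequence, k):
    """All order-k windows of a sequence, left-padded with a start symbol."""
    padded = ["<s>"] * (k - 1) + list(sequence)
    return [tuple(padded[i:i + k]) for i in range(len(sequence))]

class NGramModel:
    def __init__(self, k, alphabet_size):
        self.k = k
        self.alphabet_size = alphabet_size  # used by add-one smoothing
        self.ngram_counts = Counter()
        self.context_counts = Counter()

    def train(self, sequence):
        for gram in ngrams(sequence, self.k):
            self.ngram_counts[gram] += 1
            self.context_counts[gram[:-1]] += 1

    def log_probability(self, sequence):
        """Log of Eq. (4), with add-one smoothed conditional probabilities."""
        total = 0.0
        for gram in ngrams(sequence, self.k):
            numerator = self.ngram_counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + self.alphabet_size
            total += math.log(numerator / denominator)
        return total

def classify(sequence, models):
    """Eq. (5): choose the class whose model gives the sequence the highest probability."""
    return max(models, key=lambda label: models[label].log_probability(sequence))

ALPHABET_SIZE = 64  # assumed upper bound on the number of distinct symbols
models = {
    "phishing": NGramModel(2, ALPHABET_SIZE),
    "legitimate": NGramModel(2, ALPHABET_SIZE),
}
models["phishing"].train("verify your account now")
models["legitimate"].train("see you at the meeting")
predicted = classify("verify account", models)
```

Working at the character level keeps the model applicable to out-of-vocabulary words, at the cost of a shorter effective context for a given order k.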
At the end of the spectrum, there are recurrent neural networks. The core idea can be briefly presented as
follows. Each word $w$ from vocabulary $V$ is assigned a learned distributed word feature vector in $\mathbb{R}^k$, i.e., a
real-valued vector of dimensionality $k$, where $k \ll |V|$. This vector is called an embedding (cf. [2, 19, 20]). The
joint probability function of a word sequence is then expressed in terms of the embeddings of the words contained in the
given sequence. The word embeddings and the probability function are learned simultaneously, by applying a
recurrent neural network language model. This model contains three layers: an input layer $x_t \in \mathbb{R}^{|V|}$ holding
the one-hot representation of the input word at time $t$, a hidden layer $s_t \in \mathbb{R}^k$ representing the sentence history at time $t$, and
an output layer $y_t \in \mathbb{R}^{|V|}$ representing the probability distribution over words at time $t$. At a more practical level, this
recurrent neural network is described by three matrices (cf. [25]):
• $M_x$: input matrix of dimension $|V| \times k$. Each word from vocabulary $V$ is assigned a row in $M_x$ containing
its $k$-dimensional embedding, i.e., the embedding of word $x_t$ is equal to $x_t^T M_x$, where $x_t^T$ is the transposed
one-hot representation of $x_t$.
• $M_s$: recurrent matrix of dimension $k \times k$ representing sentence history. Recurrent connections allow
information to cycle inside the network for an arbitrarily long time (as illustrated in Fig. 1(a)), which
overcomes the fixed-length context window of the n-gram models described above.
• $M_y$: output matrix of dimension $k \times |V|$ that maps the hidden state to a probability distribution over words.
Figure 1: (a) Cycling of sentence history information inside the recurrent neural network model (cf. [19]), (b)
illustration of the recurrent neural network model for text processing.
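These matrices are initialized randomly. The hidden state and the output are then calculated as follows:
\[
s_t = \begin{cases} f(x_t^T M_x + s_{t-1}^T M_s), & \text{if } 1 \le t \le T , \\ 0, & \text{if } t = 0 , \end{cases} \tag{7}
\]
\[
y_t = s_t^T M_y , \tag{8}
\]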
where f is the sigmoid activation function (i.e., logistic activation function):

\[
f(z) = \frac{1}{1 + e^{-z}} . \tag{9}
\]
The probability distribution of the next word $w$, given the previous words, is estimated by the softmax activation
function (i.e., multiple logistic function):
\[
P(w_t = w \mid w_1, w_2, \ldots, w_{t-1}) = \frac{e^{y_t^{w}}}{\sum_{v \in V} e^{y_t^{v}}} . \tag{10}
\]
The probability of a word sequence w1, w2, ..., wn is then estimated as:
\[
P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1}) . \tag{11}
\]
This estimation of the probability of a word sequence accounts for a much broader span of text than is
practically achievable with n-grams. Let $A = a_1, a_2, \ldots, a_p$ be a sequence of immediately preceding words. The
probability of an ensuing word sequence $B = b_1, b_2, \ldots, b_q$ is estimated as:
\[
P(B \mid A) = \prod_{i=1}^{q} P(b_i \mid \underbrace{a_1, a_2, \ldots, a_p}_{\text{history sequence}}, \underbrace{b_1, b_2, \ldots, b_{i-1}}_{\text{ensuing sequence}}) , \tag{12}
\]
which is illustrated in Fig. 1(b).
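The forward computation of Eqs. (7) to (11) can be sketched in plain Python, without training. Two assumptions are made that the text does not state: the input at each step is taken to be the word preceding the one being predicted (the usual convention in recurrent language models), and a hypothetical start symbol with index 0 is used as the first input. The matrix sizes and random weights are placeholders.

```python
import math
import random

def sigmoid(z):
    """Logistic activation, Eq. (9)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Softmax of Eq. (10); scores are shifted by their maximum for numerical stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def row_times_matrix(row, matrix):
    """Product of a row vector and a matrix given as a list of rows."""
    return [sum(row[i] * matrix[i][j] for i in range(len(row)))
            for j in range(len(matrix[0]))]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def step(prev_word, state, Mx, Ms, My):
    """One time step: new hidden state (Eq. 7) and next-word distribution (Eqs. 8 and 10)."""
    embedding = Mx[prev_word]  # equals the one-hot row vector times Mx
    recurrent = row_times_matrix(state, Ms)
    new_state = [sigmoid(e + r) for e, r in zip(embedding, recurrent)]
    return new_state, softmax(row_times_matrix(new_state, My))

def sequence_log_probability(word_ids, Mx, Ms, My, start_id=0):
    """Log of Eq. (11): sum of log conditional probabilities along the sequence."""
    state = [0.0] * len(Ms)  # s_0 = 0
    prev, log_prob = start_id, 0.0
    for word in word_ids:
        state, distribution = step(prev, state, Mx, Ms, My)
        log_prob += math.log(distribution[word])
        prev = word
    return log_prob

rng = random.Random(0)
V, k = 5, 3  # toy vocabulary size and embedding dimension
Mx, Ms, My = random_matrix(V, k, rng), random_matrix(k, k, rng), random_matrix(k, V, rng)
log_p = sequence_log_probability([1, 4, 2], Mx, Ms, My)
```

The conditional probability of Eq. (12) follows from the same loop: feeding the history sequence first and summing the log probabilities only over the ensuing sequence gives log P(B | A).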

The trade-off between the described approaches can be described as follows. The advantages of naive
Bayes are that it is easy to implement and fast to train, and that it performs very well on short texts [27], which
is particularly important given that we aim at dealing with email content of limited size. Still,
the temporal aspect is missing. The naive Bayes assumption of conditional independence is too restrictive,
because word order may carry information that is relevant for predictive processing, and some kind of
textual sequence should be taken into account. On the other hand, recurrent neural networks are designed to allow
for variable-length inputs, and can practically capture an arbitrarily long span of text. Still, the notions of textual
history and textual context should not be confused, since not all information in the textual history is equally important
for decision making. This context management problem is at least partially addressed by long short-term memory
networks, which extend the architecture of recurrent neural networks with an explicit context layer and neural units
(i.e., gates) intended to control the flow of information (e.g., adding information to and removing it from the context)
[15].
The point of departure in our approach is that the temporal aspect of text essentially depends on the
underlying syntactic structure of a given text, which remains underutilized in state-of-the-art machine
learning approaches. Thus, in languages with fixed word order (e.g., English, German), taking longer
textual spans to model context may be an appropriate decision. In languages with flexible word order (e.g.,
Serbian and other South Slavic languages), the feature sequence contained in a shorter phrase-level textual span can be
considered more relevant for context modeling.
To illustrate this, we consider the Serbian language. The word order in a sentence is rather flexible. For
example, the order of subject, verb and object can be arbitrarily selected, i.e., all six permutations are considered
correct and carry the same propositional content. In contrast to this rather flexible word order, enclitics occur in a
rather fixed order. They cannot stand alone, but are dependent on the word preceding them, and thus they appear
either immediately after the first antecedent word in the clause, or immediately after an accented verb form. When
they occur together, the enclitic forms are organized in the following order: (1) the question particle li, (2) an
auxiliary verb, (3) a dative form of a pronoun, (4) an accusative form of a pronoun, (5) the reflexive enclitic se, and
(6) the verb form je. However, the speaker/writer typically does not fill all six places, but just some of them, as
illustrated in the following examples [12, 18]:
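(13) Da li ćeš mu ga dati?
Enclitic slots: li (1), ćeš (2), mu (3), ga (4).
‘Will you give it to him?’
(14) Da li vam ga je dala?
Enclitic slots: li (1), vam (3), ga (4), je (6).
‘Did she give it to you?’
(15) Da li joj se obećao?
Enclitic slots: li (1), joj (3), se (5).
‘Did he pledge himself to her?’
(16) Rekao mi je da će zakasniti.
Enclitic slots: mi (3), je (6) in the main clause; će (2) in the subordinate clause.
‘He told me he would be late.’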
Although these examples are by no means exhaustive, they clearly indicate that the notion of context cannot be
reduced to the notion of text sequentiality, which is one of the misconceptions present in machine learning approaches
(a more detailed discussion is provided in [7, 8, 9]). Thus, the aim of this research line is to address the
question of appropriate context modeling in machine learning, which is often missing or neglected in modern
approaches. We aim at:
• addressing the context management problem for the purpose of predictive processing by extending data-driven
machine learning methods with language-based computational linguistic methods,
• enabling a text classification system to dynamically adjust the span of the immediately preceding feature
sequence when estimating/generating the ensuing feature,
• integrating the advantages of the considered generative models.
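To make the fixed enclitic order from the Serbian examples concrete, the sketch below assigns each enclitic of a single clause to one of the six slots and checks that the slots appear in increasing order. The lexicon is a small hypothetical sample, not a complete inventory, and it ignores ambiguity (e.g., je can also be an accusative pronoun form, and mi or ti can be nominative pronouns), so it should be read as an illustration of phrase-level context rather than as a linguistic resource.

```python
# Toy lexicon: enclitic form -> slot number (an illustrative assumption, far from complete).
ENCLITIC_SLOTS = {
    "li": 1,                                                       # question particle
    "ću": 2, "ćeš": 2, "će": 2, "sam": 2, "si": 2, "smo": 2, "ste": 2, "su": 2,  # auxiliaries
    "mi": 3, "ti": 3, "mu": 3, "joj": 3, "nam": 3, "vam": 3, "im": 3,  # dative pronouns
    "me": 4, "te": 4, "ga": 4, "nas": 4, "vas": 4, "ih": 4,          # accusative pronouns
    "se": 5,                                                       # reflexive enclitic
    "je": 6,                                                       # verb form je
}

def enclitic_slots(clause):
    """Return (token, slot) pairs for the enclitics found in a single clause."""
    tokens = clause.lower().strip("?.!").split()
    return [(tok, ENCLITIC_SLOTS[tok]) for tok in tokens if tok in ENCLITIC_SLOTS]

def follows_fixed_order(clause):
    """True if the enclitics of the clause occur in strictly increasing slot order."""
    slots = [slot for _, slot in enclitic_slots(clause)]
    return all(a < b for a, b in zip(slots, slots[1:]))

found = enclitic_slots("Da li ćeš mu ga dati?")
```

A check of this kind operates on a short phrase-level span, independently of where the clause sits in the text, which is the sense in which context differs from plain sequentiality here.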

Furthermore, our approach to text classification is feature-based, and it is important to note that we
extend the feature set typically used for the considered purposes (e.g., detection of critical email contents).
We consider features of four different categories:
• Lexical features such as word and subword (i.e., symbol) units, and their sequential ordering. While
the word features are typically used for the purpose of text classification, they do not suffice, due to the
presence of out-of-vocabulary words, and different levels of morphological information encoding in different
languages:
– Out-of-vocabulary words: a text is expected to contain out-of-vocabulary words, i.e., words that have
not been seen during the training phase or that are maliciously obfuscated, e.g., by inserting HTML
comments (e.g., awa<!--asd -->rd), diacritic marks on letters (e.g., hänäx), etc.
– Highly inflectional languages (such as the Serbian language) have significant information (e.g., case,
gender, semantic role) coded in word morphology.
Therefore, it is necessary to consider also subword units (e.g., letters, bytes), which further implies some
sort of morphological text analysis.
• Lexicogrammatical features such as lexical cohesive agencies that reflect suprasegmental relations between
words. This group includes two complementary classes of features:
– Anaphoric referring expressions (e.g., pronouns) establish relations with elements that were explicitly
mentioned in the preceding text [11]. For example, in the following sentence:
(17) My friend received her email but he accidentally deleted it.
pronoun he refers to my friend, it refers to an email, and her cannot be resolved in the given context.
However, it should be noted that we do not deal with the research problem of coreference resolution,
but rather consider anaphoric referring expressions as features indicating the degree of text cohesion.
– Ellipsis-substitutions, i.e., forms of anaphoric cohesion in a discourse, where we presuppose something
by means of what is left out [11]. In contrast to anaphoric referring expressions, the typical meaning of
ellipsis-substitutions is not one of co-reference — there is always some significant difference between
the second instance and the first. For example, the clause Do it! does not explicitly carry propositional
information, but contains an ellipsis-substitution (do) and a reference (it). However, it may be considered
as a potential signal of a critical point in the text. Again, we do not deal with the research question of
ellipsis-substitution resolution, but rather consider them as features indicating possibly critical points
in a text.
• Structural features, e.g., paragraph structure, the presence of images, the ratio of text to image areas,
• Metadata features, e.g., the sender information, link addresses, etc.
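The lexical features described in the list above can be illustrated with a short sketch that undoes the two obfuscation examples given for out-of-vocabulary words (inserted HTML comments and added diacritic marks) and extracts subword units as character n-grams. The function names are hypothetical, and the sketch uses only the Python standard library.

```python
import re
import unicodedata

HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_html_comments(text):
    """Remove HTML comments that may be inserted to split a word."""
    return HTML_COMMENT.sub("", text)

def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def char_ngrams(word, n):
    """Subword units: all contiguous character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

cleaned = strip_html_comments("awa<!--asd -->rd")
subwords = char_ngrams(cleaned, 3)
```

One design caveat: in Serbian, letters such as š, č, ć and ž carry legitimate diacritics, so stripping them unconditionally would discard information. A feature extractor would more plausibly keep both the original and the normalized form as separate features.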

Outside the textual domain, the mechanism that analyzes bulk data from full packet traces and network element
logs, and extracts the outliers and anomalies indicating attempts to compromise the information security of a
protected system, will also be based on machine learning algorithms, enabling, inter alia, the detection of zero-day
attacks. The project will explore the use of modern programmable network elements (e.g., with the P4 programming
language, OpenFlow and eBPF features), which enable stateful and adaptable traffic filtering, offload a
part of the traffic analysis from the main CPU, and allow line-rate throughput of the system on high-speed links
during threat and attack detection.
Relation to other projects

The proposed project is related to other projects in which the members of the project team have been involved, as
summarized below (for more details of the projects, cf. the Project Description Part B).
Prof. Dušan Joksimović participated in a number of national and international research projects. His
contributions relevant for this project proposal include but are not limited to the introduction of statistical and heuristic
methods for the analysis of dataset anomalies, malware detection, and detection of fraud in financial statements
[3, 13, 21].
Prof. Petar Čisar participated in an ERASMUS+ project devoted to the improvement of academic and
professional education capacity in Serbia in the field of safety and security. His contributions primarily relate to
intrusion and network traffic anomaly detection [3, 4, 5].
Prof. Pavle Vuletić has led the research tasks in several GEANT H2020 projects related to network monitoring
and traffic analysis using dedicated programmable hardware over high-speed links, and has also completed research
in the domain of DDoS attack detection using programmable network elements and anomaly detection [14].
Prof. Milan Gnjatović was involved in a number of national and international research projects dealing
with human-machine interaction, natural language processing and artificial intelligence. His contributions are
primarily related but not limited to the introduction of representational and statistical approaches to context
modeling and meaning in human-machine interaction. The introduced approaches were validated through functional
conversational software agents in the English, German, and Serbian languages [7, 8, 9].
1.2.1 Data usage

• What types of data will the Project generate/collect?
The research activities in this project are essentially supported by two types of data that describe cyber-threat
manifestations that are of interest for this project:
– data on email contents (and metadata) containing instances of both legitimate and social engineering
attack contents,
– data describing network traffic behavior representing both legitimate and security-critical activities.
The data will be both obtained from external sources (i.e., the Ministry of Interior of the Republic of
Serbia), and generated by the participating researchers during the project implementation.
• What significant datasets are needed for the Project implementation? Specify data types and data
size. Specify primary or secondary use of data.
The Ministry of Interior of the Republic of Serbia will provide two datasets to support the research activities
in this project:
– email dataset (including metadata) containing instances of both legitimate and social engineering attack
contents,
– dataset describing network traffic behavior and/or containing network element logs, representing both
legitimate and security-critical activities.
These datasets are authentic and machine-readable. They are representative to the extent that they proportionally
include a relevant range of cyber-threat manifestations that are of interest, and will be used solely
for research purposes.
The corpus size remains to be estimated.
• Do you already have access to this data, or will the data be obtained during Project implementation?
If the data is to be obtained during Project implementation, specify so.
The data provided by the Ministry of Interior of the Republic of Serbia will be obtained at the start of
the project. The production of additional data will be conducted during the project implementation by the
participating researchers.
• How will the data be stored and accessed? What measures will be taken to ensure secure data storage
and use, including data security?
The data provided by the Ministry of Interior of the Republic of Serbia will be stored at a dedicated physical
server at the University of Criminal Investigation and Police Studies in Belgrade. The storage and access to
the data will be conducted under the regulations of the Ministry of Interior and the University of Criminal
Investigation and Police Studies.
• Who will have access to the data during Project implementation?
The participating researchers will have access to the data during the project implementation, under the
regulations of the Ministry of Interior and the University of Criminal Investigation and Police Studies.
• How will the data be used with reference to AI?
With reference to AI, the datasets will be used for the design, training and evaluation of statistical, machine
learning and computational linguistic models aimed at automatic detection of cyber-attacks of interest, as
described in Sections 1.1 and 1.2.
• How will the costs for data curation and preservation be covered?
The costs for data curation are included in the researchers’ salaries. The costs for data preservation are
included in the equipment-related costs.
• How will these data be exploited and/or shared/made accessible for verification and re-use during and
after Project implementation? If data cannot be made available, explain why. Who will have access
to the data after Project implementation?
The datasets provided by the Ministry of Interior of the Republic of Serbia will not be publicly available.
During the project implementation, the data will be shared among the participating researchers under the
regulations of the Ministry of Interior and the University of Criminal Investigation and Police Studies.
In contrast to this, the datasets generated by the participating researchers during the project implementation
will be, after appropriate anonymization, publicly available for research purposes.
• Who will have access to the code, or software after Project implementation ends?
The Ministry of Interior of the Republic of Serbia will have access to the code and software after the project
implementation ends. Selected parts of the code and functionality of the systems may be presented in
scientific publications, at scientific meetings and for educational purposes, with the approval of the Ministry.
1.3 Ambition
Criminal activity on the Internet naturally grows with the growth of business activity on it. Various types of
crimes, such as disruption of regular services, theft or scams, can easily be carried out on the Internet due to the
possibility of hiding the real identity of the attacker, the use of infected machines in remote networks as platforms
for the attack, and the possibility of purchasing low-cost tools for conducting such attacks or finding and modifying existing
ones.
Current research in the domain of network-based cyber-attacks focuses largely on attack detection and mitigation
once the attack has already begun (e.g., DDoS attack detection [17, 23]). However, the InSIS project team believes
that such an approach is inherently suboptimal, as it often detects the attack once the damage to the victim and/or
neighboring networks is already done and, furthermore, the devices which caused the attack remain undetected.
By exploring various network and social engineering threat behaviour patterns and creating a system that can
efficiently discover threats in the early stage of the attack lifecycle, it would be possible to mitigate a wider set of
attacks and locate the attacker.
References
[1] I. Baptista, S. Shiaeles, N. Kolokotronis (2019) A Novel Malware Detection System Based On Machine
Learning and Binary Visualization, in Proc. of the 1st International Workshop on Data Driven Intelligence
for Networks and Systems (DDINS), Shanghai, China, May 20-24, 2019, IEEE.
[2] Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin (2003) A Neural Probabilistic Language Model. Journal of
Machine Learning Research 3, pp. 1137–1155.
[3] P. Čisar, D. Joksimović (2019) Heuristic scanning and sandbox approach in malware detection, in Proc. of
the IX International Scientific Conference “Archibald Reiss Day”, Belgrade.
[4] P. Čisar, S. Maravić Čisar (2019) EWMA Statistics and Fuzzy Logic in Function of Network Anomaly
Detection, Facta Universitatis, Series: Electronics and Energetics, University of Niš, Vol. 32, No. 2, pp.
249–265.
[5] P. Čisar, S. Maravić Čisar, B. Markoski (2014) Implementation of Immunological Algorithms in Solving
Optimization Problems, Acta Polytechnica Hungarica, Vol. 11, No. 4, pp. 225–240.
[6] CyberTrust (accessed on January 24, 2020) CyberTrust — Advanced Cyber-Threat Intelligence, Detection,
and Mitigation Platform for a Trusted Internet of Things, https://cyber-trust.eu/.
[7] M. Gnjatović (2019) Conversational Agents and Negative Lessons from Behaviourism, in Innovations in Big
Data Mining and Embedded Knowledge, Springer series in Intelligent System Reference Library, Springer,
Invited chapter, pp. 259–274.
[8] M. Gnjatović, V. Delić (2014) Cognitively-inspired representational approach to meaning in machine dialogue,
Knowledge-Based Systems 71, pp. 25–33, category: M21a.
[9] M. Gnjatović, M. Janev, V. Delić (2012) Focus Tree: Modeling Attentional Information in Task-Oriented
Human-Machine Interaction, Applied Intelligence 37(3), pp. 305–320, category: M21.
[10] C. Hadnagy, P. Wilson (2010) Social Engineering: The Art of Human Hacking, Wiley.
[11] M.A.K. Halliday, C.M.I.M. Matthiessen (2004) An introduction to functional grammar, third edition, Hodder
Arnold.
[12] L. Hammond (2005) Serbian: An Essential Grammar, CRC Press, Taylor & Francis Group.
[13] D. Joksimović, G. Kežević, V. Pavlović, M. Ljubić, V. Surovy (2017) Some aspects of the application of
Benford’s law in the analysis of the data set anomalies, Chapter 4 in Knowledge Discovery in
Cyberspace: Statistical Analysis and Predictive Modeling, editors: K. Kuk and D. Ranđelović, NOVA
Publishers New York, pp. 85–120.
[14] O. Joldžić, Z. Ðurić, P. Vuletić (2016) A Transparent and Scalable Anomaly-Based DoS Detection Method,
Computer Networks, Vol. 104, pp. 27–42, category: M21.
[15] D. Jurafsky, J.H. Martin (2009) Speech and Language Processing: An Introduction to Natural Language
Processing, Speech Recognition, and Computational Linguistics. 2nd edition. Prentice-Hall.
[16] A. Klimburg, ed. (2012) National Cyber Security Framework Manual, NATO CCD COE Publications.
[17] A.C. Lapolli, J. Adilson Marques, L.P. Gaspar (2019) Offloading Real-time DDoS Attack Detection to
Programmable Data Planes, 2019 IFIP/IEEE Symposium on Integrated Network and Service Management
(IM), Arlington, VA, USA, 2019, pp. 19–27.
[18] T.F. Magner (1995) Introduction to the Croatian and Serbian Language, The Pennsylvania State University
Press.
[19] T. Mikolov, M. Karafiát, L. Burget, J.H. Černocký, S. Khudanpur (2010) Recurrent neural network
based language model. In Proc. of INTERSPEECH 2010, pp. 1045–1048.
[20] T. Mikolov, W.-t. Yih, G. Zweig (2013) Linguistic Regularities in Continuous Space Word Representations.
In Proceedings of NAACL-HLT 2013, Association for Computational Linguistics, pp. 746–751.
[21] V. Pavlović, G. Kežević, M. Joksimović, D. Joksimović (2019) Fraud Detection in Financial Statements
Applying Benford’s Law with Monte Carlo Simulation, Acta Oeconomica, vol. 69, no. 2, pp. 217–239.
[22] Ramses 2020 (accessed on January 24, 2020) RAMSES 2020 — Internet Forensic platform for tracking the
money flow of financially-motivated malware, https://ramses2020.eu/.
[23] F. Rebecchi, J. Boite, P.A. Nardin, M. Bouet, V. Conan (2019) DDoS protection with stateful software-defined
networking. Int J Network Mgmt. 2019; 29(1).
[24] C.A. Schiller, J. Binkley, D. Harley, G. Evron, T. Bradley, C. Willems, M. Cross (2007) Botnets Overview.
Botnets, Burlington, Syngress, pp. 29–75.
[25] A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J.-Y. Nie, J. Gao, B. Dolan (2015) A Neural
Network Approach to Context-Sensitive Generation of Conversational Responses. In Proc. of HLT-NAACL
2015, pp. 196–205.
[26] The Enterprise Immune System (accessed on January 24, 2020) Darktrace, https://www.darktrace.com/en/.
[27] S. Wang, C.D. Manning (2012) Baselines and bigrams: Simple, good sentiment and topic classification. In
ACL 2012, pp. 90–94.
2 Impact
2.1 Expected impact
The direct and primary beneficiary impacted by the project is the Ministry of Interior of the Republic of Serbia.
It is expected that the project results will create — both in mid-term and long-term — the possibility for the
Ministry to:
• additionally strengthen the capability to automatically and in real-time detect cyber-threats,

• additionally increase the awareness of social engineering attacks among the employees in the targeted group.
It is important to note that the approaches introduced in this project will be particularly tailored:
• to the Serbian and other South Slavic languages, which are dominantly represented in the Republic of Serbia
and some of the neighboring countries,
• to the specific cyber-threats that are of interest to the Ministry (we recall that the Ministry will provide
datasets underlying the research in this project, and will evaluate the introduced approaches in real-life
settings).
At the educational and technological levels, funding the proposed project will provide strong support for the
further development of:
• the Department of Information Technology at the University of Criminal Investigation and Police Studies in
Belgrade,
• the recently established Laboratory for Information Security at the School of Electrical Engineering in
Belgrade and the establishment of the new research group around it.
The project aims to involve researchers at the beginning or middle of their careers. They also serve as lecturers
at their scientific organizations and are crucial to the development and education of new generations of ICT
experts. In the long term, the experience gained in this project would improve the quality of the courses and lecturing,
and the amount of knowledge transferred to the students.
At the societal level, the project has an indirect long-term impact on society in the Republic of Serbia.
Disseminating the project’s findings and the lectures given by the project partners will raise the overall
knowledge of cyber-security in Serbia, as well as the awareness across the whole society of modern attack
vectors and of the mechanisms for mitigating them.
2.2 Dissemination of results

Appropriate attention will be devoted to the dissemination activities, including the following:
• publication of papers in prominent international journals with impact factor, including two publications in
open-access journals with impact factor,
• presenting research results at international conferences, workshops and scientific meetings,
• maintenance of a website featuring information about the project,
• generation of publicly available datasets, intended for scientific purposes,

• presenting the project’s results to the Ministry of Interior of the Republic of Serbia, and other relevant
governmental stakeholders.
• ← to be completed
3 Implementation Plan
...