
Received December 30, 2018; revised October 8, 2019 and July 3, 2020; accepted March 8, 2021; date of current version December 16, 2021.

Approach to Automatic Identification of Terrorist and Radical Content in Social Networks Messages

Andrey I. Kapitanov, Ilona I. Kapitanova, Vladimir M. Troyanovskiy, Vladimir F. Shangin, Nikolay O. Krylikov


National Research University of Computer and Electronic Technology
Zelenograd, Moscow, Russia
andrey@kapdx.ru, kapitanovai@ya.ru, troy40@mail.ru, incos@miee.ru, krylikov-no@rambler.ru

978-1-5386-4340-2/18/$31.00 ©2021 IEEE

Abstract— Terrorist and radical groups use instant messengers and accounts on social networks to publish propaganda messages. Blocking such accounts is one of the most effective methods of countering them. To do this, analysts need to read and process a huge amount of information. In this paper we propose an approach based on machine learning that automates the processing and classification of messages into radical and non-radical ones.

Keywords— security; counter-terrorism; classifier; machine learning; text analysis; social networks

I. INTRODUCTION

It is hard to find someone in today's world who has not heard of “the Islamic State of Iraq and the Levant” (ISIL or ISIS), also known as “Islamic State” (IS), an organization prohibited in the Russian Federation. The speed of the organization's development, the scale of its actions and, unlike any other existing terrorist organization, its use of new devices and techniques for ideological and informational activity allow us to say that we are dealing with a new phenomenon in terrorism, one that has reached a qualitatively different level of development [1].

Neither the international anti-terrorist coalition led by the US nor the Ministry of Defense of Russia officially provides a timeline for combating the threat of ISIS. The fact that this young organization is highly aggressive and behaves provocatively toward other Islamist terrorist organizations adds to the significance of ISIS in the Middle East.

In order to have maximum impact on the minds of the Islamic world, including tech-savvy westernized Muslim youth, ISIS has actively used all the possibilities of media technologies and PR.

Modern terrorist organizations weekly, and even daily, publish on the Internet professionally shot and staged videos of public executions, the destruction of cultural monuments, fighting, interviews with their commanders, radical preachers and activists, and photos of trophies and killed enemies. “Media Jihad” has firmly won its place in the international information space and has become one of the most important activities of terrorists. The financial capability of ISIS allows it to actively finance various media projects and to promote them in the network, which ultimately leads to an influx of new fighters and also helps to consolidate power over its territory.

Terrorist and radical groups use instant messengers and accounts on social networks to publish propaganda messages. Blocking such accounts is one of the most effective methods of countering them. To do this, analysts need to read and process a huge amount of information. Table I shows the message frequency statistics for different social networks and messengers.

TABLE I. NETWORK MESSAGES FREQUENCY

Social networks     Average message frequency
and messengers      Per sec          Per day         Per month
WhatsApp [2]        636 thousand     55 billion      1.6 trillion
Telegram [3]        175 thousand     15 billion      450 billion
Facebook [4]        2.5 thousand     216 million     6.5 billion
Twitter [5]         5.8 thousand     500 million     15 billion
Instagram [6]       1 thousand       95 million      2.8 billion

II. RELATED WORK

Various machine learning methods are most often used for the analysis of texts with radical content. Named Entity Recognition (NER) makes it possible to extract structured information from unstructured or semi-structured documents and is successfully applied to short text messages. Clustering, logistic regression and Dynamic Query Expansion (DQE) are more suitable for predicting terrorist acts, riots or protests. The methods most often used to identify radicalism and extremism in real time are K-Nearest Neighbor, the Naive Bayes classifier, Support Vector Machine (SVM) with different kernel functions (linear, etc.), AdaBoost, decision trees and others.

The task of separating messages into radical and non-radical ones can be represented as a binary classification problem and solved using K-Nearest Neighbor or LIBSVM [7].

Another direction in the investigation of extremist texts on the Internet is to determine the type of activity of Internet users. By analyzing messages in real time or on the basis of aggregated data after the fact, it is possible to estimate the
extent of potential threats from users. Using clustering, it is possible to analyze arbitrary text data and identify the typical interests of users (user groups).

All of these investigations are based on solving classification and categorization problems in the case where, generally, there are assumptions about the topics of the analyzed text documents. However, a more thematic analysis of texts, for example, to organize information about terrorist and extremist activity, identify the type of activity, or investigate the evolution of terrorist groups, requires other approaches.

III. ALTERNATIVE APPROACH

The solution of the thematic analysis problem is complicated by several factors. The information disseminated by terrorist groups is heterogeneous; messages in social networks are quite short and contain slang and coded words, which makes semantic analysis ineffective.

A further difficulty is that communication on forums proceeds in different languages and, possibly, in combinations of them (the same goes for Internet documents). Also, a simple search based on keywords or specific phrases does not help to distinguish terrorist sites from sites such as news agencies. In addition, terrorist sites are often disguised as news sites and religious forums. The number of sites is huge, which makes manual analysis inefficient; therefore, correct identification of the sites and forums associated with terrorist groups requires automatic means for effective selection and filtering. A more difficult problem is to attribute the disseminated information to a particular terrorist group, because terrorist groups may be ideologically close and use similar vocabulary.

Another method used by ISIS (prohibited in the Russian Federation) in social networks is the promotion of specific “hashtags” (special tags using the # sign to organize posts by subject or group). Hundreds and thousands of activists repeatedly post the required messages with the required “hashtags” at a specific time of day. Such mass posting of messages by Islamists in social networks leads to dramatic results. Experts give the following example: during the assault on Mosul by militants, about 40 thousand tweets were published in support of ISIS. This was enough to bring the needed hashtags (e.g., #ISIS, #AllEyesOnISIS, #Iraqwar) and pictures to the top, thereby manipulating the news agenda.

The approach proposed in this paper consists of several consecutive stages. First, the text is cleared of “information noise”, such as links, emoticons and words without any meaning (articles, pronouns, conjunctions, etc.). The next step is fixing typos in the keywords using the Levenshtein distance metric. Then the naive Bayes classifier is trained on the training data. The classifier separates messages into two groups: those containing radical content and those not containing it. However, the first group may include messages that are dedicated to the fight against terrorism and strongly condemn radical views. In order to separate such messages from those advocating extremism, it is necessary to perform tone analysis of the texts.

IV. DATASET

To test algorithms for classifying text documents with respect to radical content, we used data collected by scientists from the Artificial Intelligence Lab of the University of Arizona. These data represent information collected from various websites, forums, chats, blogs, social networks, etc. related to designated terrorist organizations.

To confirm the efficiency of the proposed method, we took the reference dataset prepared within the Dark Web project [8]. This annotated dataset contains information from Russian sites and forums. The text messages may contain typographical errors, messages in transliterated Russian, and messages in other languages (national languages of the North Caucasus, Arabic, etc.).

The dataset contains many discussion threads with different thematic focus. Not all threads contain information with potentially extremist content. Many messages are dedicated to the discussion of religious topics, such as the rules of conduct in an Islamic society, relations between men and women, etc. There are also everyday topics such as cooking, cars and sports. Many messages are devoted to the discussion of political events in the world connected in one way or another with Russia, the Caucasus and the Middle East, for example, the wars in Afghanistan and Libya, the events in Georgia and Poland, and the accident at the nuclear power plant in Japan.

Therefore, before using this dataset, it is necessary to bring it to a uniform format convenient for further processing.

V. TEXT NORMALIZATION

Canonization (normalization) of the text is the process of bringing it to a single format convenient for further processing. When working with a large amount of information, it is necessary to exclude from the document all non-informative parts of speech (prepositions, particles, conjunctions, etc.).

In the first stage of canonization, hyperlinks, HTML tags and punctuation are deleted. Then all words are converted to lower case.

During preprocessing, uninformative parts are removed from the text. To do this, we use a list of stop-words that carry no meaning for this type of processing. In addition, it is assumed that words shorter than three letters do not carry meaning, so they are deleted as well.

The result of the canonization stage is a set of documents reduced to a single form: each document contains a set of words in their initial form, separated by spaces.

VI. NAÏVE BAYES CLASSIFIER

The naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. This classifier was first introduced in the early 1960s [9] and has since been widely used in text document classification problems [10, 11]. Despite the “naïve” view, and, of course, very simplified conditions,

naive Bayes classifiers often show good results in many difficult situations. An advantage of the naive Bayes classifier is the small amount of training data needed to estimate the parameters required for classification.

A. Mathematical Description of Classifiers

Formally, the classical classification problem can be described as follows. Let $X$ be a set of object features and $Y$ a set of class numbers. There is an unknown target dependence $y^* : X \to Y$ whose values are known only on the training set $X^m = \{(x_1, y_1), \ldots, (x_m, y_m)\}$. It is required to build an algorithm $a : X \to Y$ that is able to classify any object $x \in X$.

The formal statement of the text document classification problem is as follows. Let $D = \{d_1, \ldots, d_k\}$ be a set of text documents. Each document $d \in D$ is a sequence of words $W_d = (w_1, \ldots, w_{n_d})$, where $n_d$ is the length of the document $d$. $C = \{c_1, \ldots, c_K\}$ is a set of class numbers. There is an unknown target dependence $c^* : D \to C$ whose values are known only on the objects of the finite training set $D^m = \{(d_1, c_1), \ldots, (d_m, c_m)\}$. It is required to build an algorithm $a : D \to C$ that is able to classify any object $d \in D$.

B. Probabilistic Model

Using Bayes' theorem, the conditional probability can be written as

$$p(c_k \mid f_1, \ldots, f_n) = \frac{p(c_k) \cdot p(f_1, \ldots, f_n \mid c_k)}{p(f_1, \ldots, f_n)} \tag{1}$$

where $f_1, \ldots, f_n$ are the features on which the classifier is trained. Since the denominator $p(f_1, \ldots, f_n)$ does not depend on $c_k$ and the values $f_i$ are given, the denominator can be treated as a constant. The assumption of "naivety" means that each feature $f_i$ does not depend on any other feature [12]. This means that

$$p(c_k, f_1, \ldots, f_n) = p(c_k) \cdot \prod_{i=1}^{n} p(f_i \mid c_k),$$
$$c^* = \arg\max_{k \in \{1, \ldots, K\}} \; p(c_k) \cdot \prod_{i=1}^{n} p(f_i \mid c_k). \tag{2}$$

In practice, to avoid values that are very close to zero, the logarithmic form (3) is used [13]:

$$\log p(c_k, f_1, \ldots, f_n) \approx \log p(c_k) + \sum_{i=1}^{n} \log\bigl(p(f_i \mid c_k) + \varepsilon\bigr),$$
$$c^* = \arg\max_{k \in \{1, \ldots, K\}} \Bigl[\log p(c_k) + \sum_{i=1}^{n} \log\bigl(p(f_i \mid c_k) + \varepsilon\bigr)\Bigr]. \tag{3}$$

When $\varepsilon = 0$, it is assumed that $p(f_i \mid c_k) > 0$.

VII. TONE ANALYSIS

Tone analysis is usually defined as one of the tasks of computational linguistics, i.e. it is assumed that the tone can be found and classified using natural language processing tools (such as taggers, parsers, etc.). Generalizing broadly, the existing approaches can be divided into the following categories:

• approaches based on rules;
• approaches based on dictionaries;
• supervised machine learning;
• unsupervised machine learning.

The first type of system consists of a set of rules by which the system draws a conclusion about the tone of the text. Many commercial systems use this approach, despite the fact that it is costly, because good performance requires a large number of rules. Rules are often tied to a specific domain (e.g., “tourism”), and changing the domain (e.g., to “cars”) requires the rules to be rewritten. This approach is the most accurate when there is a good rule base, but it is of little research interest.

Approaches based on dictionaries use so-called tonal vocabularies (affective lexicons) to analyze the text. In its simplest form, a tone dictionary is a word list with a tone value for each word.

Supervised machine learning is the most common method used in research. Its essence is to train a machine classifier on a dataset of pre-labeled texts and then use the resulting model to analyze new documents.

The process of creating a tone analysis system is very similar to the process of creating other systems that use machine learning:

1) Assemble a collection of documents to train the classifier.
2) Represent each document from the training collection as a feature vector.
3) Specify the “right” answer for each document, i.e. the type of sentiment (e.g. positive or negative); the classifier is trained on these answers.
4) Choose a classification algorithm and train the classifier.
5) Use the resulting model.

Unsupervised machine learning is probably the most interesting and at the same time the least accurate method of tone analysis. One example of this method is automated clustering of documents.
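The supervised workflow listed in steps 1) to 5) can be illustrated with a minimal sketch, assuming a toy labeled collection and a bag-of-words naive Bayes model scored in the log domain. The function names (`tokenize`, `train`, `predict`), the sample texts, the labels and the smoothing constant `eps` are illustrative assumptions and are not taken from the system described in this paper.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and keep alphabetic words of three or more letters."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [w for w in words if len(w) >= 3]


def train(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(tokenize(doc))
    return class_counts, word_counts


def predict(text, class_counts, word_counts, eps=1e-6):
    """Return the class with the highest log prior plus summed log likelihoods."""
    total_docs = sum(class_counts.values())
    words = tokenize(text)
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in words:
            # Relative frequency of the word in this class; eps avoids log(0).
            p = word_counts[label][w] / total_words if total_words else 0.0
            score += math.log(p + eps)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy collection (steps 1-3): texts with their "right" answers.
docs = ["great wonderful film", "awful boring film",
        "wonderful great acting", "boring awful plot"]
labels = ["pos", "neg", "pos", "neg"]

# Steps 4-5: train the classifier and apply it to a new document.
class_counts, word_counts = train(docs, labels)
print(predict("great acting", class_counts, word_counts))
```

Summing logarithms instead of multiplying probabilities keeps the score numerically stable for long documents, and a single small `eps` is only one of several possible smoothing choices.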

The quality of the results directly depends on how the document is presented to the classifier, namely, on which set of characteristics is used to compose the feature vector. The most common way to represent a document in problems of computational linguistics and search is either as a bag-of-words or as a set of N-grams. For example, the sentence "I hate Monday morning" can be represented as a set of unigrams (I, hate, Monday, morning) or bigrams (I hate, hate Monday, Monday morning).

The accuracy and quality of tone analysis of a text is assessed by how well it agrees with a person's opinion about the emotional evaluation of the studied text. Metrics such as precision and recall can be used for this [14]. The formula for finding recall is

$$R = \frac{CEO}{TN} \tag{4}$$

where $CEO$ (correctly extracted opinions) is the number of correctly extracted opinions and $TN$ (total number of opinions) is the total number of opinions (both found and not found) [14]. The precision is calculated by (5):

$$P = \frac{CEO}{TNF} \tag{5}$$

where $CEO$ (correctly extracted opinions) is the number of correctly extracted opinions and $TNF$ (total number of opinions found by the system) is the total number of opinions found by the system [14]. Thus, the precision represents the share of investigated texts, sentences or documents in which the opinion of the tone analysis system coincided with the opinion of the expert. According to one study, experts usually agree on the assessment of the tone of a particular text in 79% of cases [15]. Consequently, a program that determines the tone of a text with an accuracy of 70% does it almost as well as people.

VIII. EXPERIMENTS

To estimate the quality of the approach, we chose the following metrics:

• precision;
• recall;
• F1 score.

The average values of the quality measures depending on the size of the N-gram used for text representation are shown in Table II.

TABLE II. APPROACH QUALITY

Size of N-gram    Precision, %    Recall, %    F1 score, %
1                 85.23           85.1         85.16
2                 88.37           87.98        88.17
3                 88.74           88.12        88.43
4                 89.05           88.83        88.94
5                 89.11           88.71        88.91

As we can see, the classification of messages into radical and non-radical ones achieves the best quality (by the F1 score metric) with an N-gram size of 4.

IX. CONCLUSION

In this paper we consider the important applied problem of using machine learning methods to identify potentially extremist and terrorist information on the Internet. We provide an overview of existing solutions and approaches and propose a new original approach to the identification of terrorist and radical content.

We plan to continue research in this direction and to address the task of implementing systems for continuous monitoring of the flow of text messages, news feeds, and entries in forums, social networks and Internet communities, with a view to continuously searching for and identifying potentially extremist information.

REFERENCES

[1] Vasiliev M., Modern media technologies in the service of international terrorism, http://katehon.com/ru/article/sovremennye-media-tehnologii-na-sluzhbe-mezhdunarodnogo-terrorizma.
[2] Connecting One Billion Users Every Day, https://blog.whatsapp.com/10000631/Connecting-One-Billion-Users-Every-Day?l=en.
[3] 10 Amazing Telegram Stats (October 2017) | By the Numbers, https://expandedramblings.com/index.php/telegram-stats/.
[4] Facebook by the Numbers: Stats, Demographics & Fun Facts, https://www.omnicoreagency.com/facebook-statistics/.
[5] Twitter by the Numbers: Stats, Demographics & Fun Facts, https://www.omnicoreagency.com/twitter-statistics/.
[6] Instagram by the Numbers: Stats, Demographics & Fun Facts, https://www.omnicoreagency.com/instagram-statistics/.
[7] I. Mashechkin, M. Petrovskiy, I. Pospelova, D. Tsarev, Automatic summarization and keywords extraction methods for discovering extremist information on the internet, 2016.
[8] Yulei Zhang, Shuo Zeng, Li Fan, Yan Dang, Catherine A. Larson, and Hsinchun Chen. 2009. Dark web forums portal: searching and analyzing Jihadist forums. In Proceedings of the 2009 IEEE international conference on Intelligence and security informatics (ISI'09). IEEE Press, Piscataway, NJ, USA, 71-76.
[9] Russell, Stuart; Norvig, Peter (2003) [1995]. Artificial Intelligence: A Modern Approach (2nd ed.). Prentice Hall. ISBN 978-0137903955.
[10] J. Liu, Z. Tian, P. Liu, J. Jiang, Z. Li. An Approach of Semantic Web Service Classification Based on Naive Bayes. IEEE, 2016.
[11] F. Gumus, C. Okan Sakar, Z. Erdem, O. Kursun. Online Naive Bayes classification for network intrusion detection. IEEE/ACM 2014.
[12] Zhang, Harry. The Optimality of Naive Bayes. FLAIRS2004 conference.
[13] Eibe Frank and Remco R. Bouckaert, Naive Bayes for Text Classification with Unbalanced Classes. ICML
[14] Nozomi Kobayashi, Ryu Iida, Kentaro Inui, Yuji Matsumoto. Opinion Mining on the Web by Extracting Subject-Aspect-Evaluation Relations. Nara Institute of Science and Technology, Takayama, Ikoma, Nara 630-0192, Japan: conference, 2006, P. 1-6.
[15] Ogneva, M. How Companies Can Use Sentiment Analysis to Improve Their Business. Mashable (2012).
