Cyberspace Monitoring Using AI and Graph Theoretic Tools
Bachelor of Technology
by
Arinjaya Khare
(1501CS10)
under the guidance of
Dr Joydeep Chandra
to the
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY PATNA
PATNA - 800013, BIHAR
CERTIFICATE
This is to certify that the work contained in this thesis entitled “Cyberspace Monitoring Using AI and Graph Theoretic Tools” is a bona fide work of Arinjaya Khare (Roll No. 1501CS10), carried out in the Department of Computer Science and Engineering, Indian Institute of Technology Patna under my supervision, and that it has not been submitted elsewhere for a degree.
Assistant/Associate Professor,
May, 2019 Department of Computer Science & Engineering,
Patna. Indian Institute of Technology Patna, Bihar.
Acknowledgements
This project was a huge learning opportunity, as I got to work with various deep learning frameworks. I also gained a lot of insight into writing good-quality, scalable code and into adapting quickly to changing requirements.
I would like to thank Dr Joydeep Chandra for being my supervisor and providing valuable guidance during the development of this project. I would also like to thank IIT Patna for providing the resources necessary to work on this project.
Contents
List of Tables

1 Introduction
  1.1 Problem Description
  1.2 Dataset Used
  1.3 Organization of The Report
  3.5 Conclusion

References
List of Figures

3.1 Architecture of Predictor Network P(θ)
3.3 RLAN Training Algorithm
4.1 Design of Project
List of Tables
5.1 Results of 20-Fold Cross Validation for Selection of Predictor Model P(θ)
5.2 Performance of our RLAN Model on the CDAC-annotated Dataset
5.3 Performance of the classifier on the CDAC-annotated Dataset by itself
Chapter 1
Introduction
The 21st century is the era of social media. Social networking sites like Facebook and Twitter boast more than 2.27 billion users worldwide, and this number is set to grow even further due to factors like Digital India and cheap internet rates.
One effect of this phenomenon is that users can now share information with other people more quickly. This has also allowed various radical and violent propaganda messages to spread very rapidly.
While technology has grown at an exponential rate, our legal system has not been able to keep up, so many such "cybercrimes" have slipped through the hands of the authorities, and many perpetrators roam freely, creating more havoc and chaos.
Our project tries to create a model to identify such posts, estimate locations, identify menacing individuals, and report them to the relevant authorities to curb such behaviour.
1.1 Problem Description
Our goal is to identify tweets belonging to 5 categories: violent extremism, non-violent extremism, radical violence, non-radical violence, and not relevant. We treat this as a classification problem in which each category has 3 possible scores symbolising severity: 0 (not relevant for the category), 1 (moderate) and 2 (high).
More formally, given a tweet T_i = [w_1, w_2, ..., w_m], we must return a label vector l_i = [l_0, l_1, l_2, l_3, l_4] such that l_j = score_severity, where score_severity = 0 if the tweet does not belong to category j, and 1 or 2 depending on the severity of its membership of that category.
The model should also be able to retrain itself on newly classified data, so as to stay up to date with the evolving vocabulary of cyberspace over time.
The data should also be stored in a presentable manner so that it is easy to gain insight into the menacing tweets.
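As a minimal sketch of this label scheme (the ordering of the categories in the vector is an assumption, not stated explicitly in the report):

```python
# The five categories, in an assumed order for the label vector l_i.
CATEGORIES = ["violent extremism", "non-violent extremism",
              "radical violence", "non-radical violence", "not relevant"]

def make_label(severities):
    """Build l_i = [l_0, ..., l_4], each entry a severity score in {0, 1, 2}."""
    assert len(severities) == len(CATEGORIES)
    assert all(s in (0, 1, 2) for s in severities)
    return list(severities)

# A tweet scored as highly violent-extremist and moderately radical-violent:
label = make_label([2, 0, 1, 0, 0])
```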
1.2 Dataset Used
2. Self-annotated dataset consisting of 100,000 tweets, of which roughly 50% are menacing tweets and the rest are normal tweets. This is fed as unlabelled data to our model.
1.3 Organization of The Report
This chapter discussed the challenges faced by authorities in the digital age regarding cybercrimes, and listed the various requirements and expectations of our project. In Chapter 2, we will discuss how our database of tweets was created. In Chapter 3, we will propose a novel deep learning model inspired by Generative Adversarial Networks and reinforcement learning, called RLANs (Reinforcement Learning based Adversarial Networks). In Chapter 4, we will discuss the architecture and workings of our project. In Chapter 5, we will discuss the results we achieved with our model.
Chapter 2
In this chapter, we will discuss how the tweet datasets used to train our model were created.
2.1.1 Dataset 1
As can be seen, the data is biased and also quite scarce, and training on it directly would result in a poor classifier. We will have to develop a model that can overcome this bias in the data and give us a better classifier.
2.1.2 Dataset 2
The unlabelled dataset consists of 100,000 tweets. To create it, we first compiled a set of keywords associated with cybercrime and collected a set of 50,000 tweets associated with cybercrime. We then retrieved 50,000 normal tweets that do not contain these keywords. In the end, these tweets were merged to create the final dataset.
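A minimal sketch of this keyword-based split (the keyword list and the tweet source here are illustrative placeholders, not the actual ones used for the dataset):

```python
import random

# Illustrative keywords; the real keyword list is not reproduced in this report.
CYBERCRIME_KEYWORDS = {"attack", "bomb", "jihad", "hack"}

def build_unlabelled_dataset(tweets, keywords, n_per_class):
    """Split raw tweets into keyword-matching and normal sets, then merge them."""
    matching = [t for t in tweets if any(k in t.lower() for k in keywords)]
    normal = [t for t in tweets if not any(k in t.lower() for k in keywords)]
    dataset = matching[:n_per_class] + normal[:n_per_class]
    random.shuffle(dataset)  # merge into one unlabelled pool
    return dataset
```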
2.2 Conclusion
In this chapter, we discussed how the datasets which we would be using in our project were
created and their properties.
Chapter 3
Various deep learning models exist for classifying text. These classifiers have been proven to give quite good performance, but they tend to fail when the annotated training data is scarce or biased, resulting in poor performance.
We propose a novel model that trains not only on labelled data but also on unlabelled data. We take inspiration from Generative Adversarial Networks (GANs) [GPAM+14], invented by Ian Goodfellow, which are generally used to train image classifiers, and try to extend their methodology to train a text classifier.
Our model is an evolution of GANs, which we modify for text classification and combine with reinforcement learning to give us a well-trained classifier. This model is called Reinforcement Learning based Adversarial Networks (RLANs) [LY18]. Just like GANs, our model consists of 2 networks:
1. Predictor Network P(θ) - This will be our main text classifier, trained using reinforcement learning. The tweet is the input, and the label vector of score_severity values is the output. The primary goal of the learning is to learn a good policy for P such that the generated predicted label maximises the expected total reward of reinforcement learning.
2. Judge Network J(φ) - The judge network evaluates the predicted label from P(θ) and provides a feedback reward which drives the relearning of the predictor network. As the reward is dynamically generated, it can iteratively improve the predictor network.
We are using a hybrid CNN-LSTM based model as our classifier, which learns from the continuous stream of tweets and picks up salient features from them to predict the label vector of score_severity values.
The model consists of 3 parts:
1. An embedding layer using pretrained GloVe [glo] embeddings that converts tokenised tweets into vectors.
2. A convolutional layer.
3. 3 layers of LSTM that extract the salient features to predict the output.
Even though CNNs on their own are not very useful for text classification, and stacked LSTMs provide slightly better performance than our model, adding a convolutional layer provides huge speed improvements (close to 3 times faster training). Even though convolutional layers don't have the sequence-processing abilities of LSTMs, they work well here as they act as a filter over the data and create a higher-level representation of it.
We had to make a trade-off between a huge speed advantage and slightly better scores, and we chose to work with faster training times.
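A minimal Keras sketch of such an architecture (all layer sizes, the vocabulary size, and the sequence length are illustrative assumptions; in practice the embedding layer would be initialised with the pretrained GloVe vectors):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Input, Embedding, Conv1D, MaxPooling1D,
                                     LSTM, Dense)

VOCAB_SIZE, EMBED_DIM, MAX_LEN = 20000, 100, 50  # illustrative values

model = Sequential([
    Input(shape=(MAX_LEN,)),
    # In practice this layer is loaded with pretrained GloVe weights.
    Embedding(VOCAB_SIZE, EMBED_DIM),
    Conv1D(64, kernel_size=5, activation="relu"),  # filter / higher-level features
    MaxPooling1D(pool_size=2),
    LSTM(64, return_sequences=True),
    LSTM(64, return_sequences=True),
    LSTM(64),                                      # 3 stacked LSTM layers
    Dense(5, activation="sigmoid"),                # one score per category
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```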
Fig. 3.1 Architecture of Predictor Network P(θ)
3.3 Judge Network J(φ)
The goal of the judge model is to predict, for a given tweet-label pair (T_i, l_i), how likely it is to belong to the set of labelled tweets.
We use an LSTM-based model as our judge network. At the t-th step of our algorithm, an estimated label vector O_t is generated by the predictor network. O_t is then concatenated with the one-hot embedded vector y from the LSTM layer. A weighted combination of all these concatenations is used as input to the output layer of the judge network.
The judge model outputs 1 if it thinks the tweet-label pair belongs to the labelled dataset, and 0 if it does not.
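A minimal Keras sketch of a judge network along these lines (layer sizes are illustrative, and the concatenation follows the description above only loosely: the tweet's LSTM representation is paired with the predicted label vector before the sigmoid output):

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Concatenate

VOCAB_SIZE, EMBED_DIM, MAX_LEN, N_LABELS = 20000, 100, 50, 5  # illustrative

tweet_in = Input(shape=(MAX_LEN,), name="tweet")   # tokenised tweet T_i
label_in = Input(shape=(N_LABELS,), name="label")  # predicted label vector O_t
h = Embedding(VOCAB_SIZE, EMBED_DIM)(tweet_in)
h = LSTM(64)(h)                                    # tweet representation
merged = Concatenate()([h, label_in])              # pair the tweet with its label
out = Dense(1, activation="sigmoid")(merged)       # P(pair is from labelled set)
judge = Model([tweet_in, label_in], out)
judge.compile(optimizer="adam", loss="binary_crossentropy")
```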
3.4 How the Predictor is trained
For any particular tweet, we use the tweet text only, discarding other parameters. NLTK and Keras were used to tokenise the text and remove stopwords and words not found in the English language. We also discarded words that contain no alphabetic characters, as well as mentions and punctuation tokens, as they won't contain any significant information required for classification.
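A minimal sketch of this preprocessing (using a tiny hard-coded stopword list in place of NLTK's corpus, and a simple regex tokeniser instead of the actual NLTK/Keras pipeline):

```python
import re

STOPWORDS = {"the", "a", "an", "is", "at", "to", "and"}  # tiny illustrative subset

def preprocess(tweet):
    """Tokenise a tweet, dropping mentions, non-alphabetic and stopword tokens."""
    tokens = re.split(r"\s+", tweet.lower().strip())
    cleaned = []
    for tok in tokens:
        if tok.startswith("@"):            # drop mentions
            continue
        tok = re.sub(r"[^a-z]", "", tok)   # strip punctuation and digits
        if tok and tok not in STOPWORDS:   # drop empty and stopword tokens
            cleaned.append(tok)
    return cleaned
```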
Currently we are using only labels of 0 and 1 due to lack of adequate data.
We consider the predictor P_θ(y|x) as a policy model which gives the probability that action y is chosen given state x. We will work on generating the best possible action so as to maximise the reward:
J_φ(x, y) is the probability that the tweet-label pair (x, y) originated from the set of labelled tweets (D_L) rather than the set of unlabelled tweets (D_U).
Initially, we train our predictor model using the set of labelled tweets (D_L) and their corresponding labels. We then generate the label vectors of the unlabelled tweets (D_U). Both these sets are used to train our judge model.
The algorithm then runs for a set number of iterations. In each iteration we sample m tweets from the set of labelled tweets (D_L). We also sample m tweets from the unlabelled set (D_U) and generate their predicted output labels.
Each iteration then runs for a set number of steps. In each step, we first update the judge model using the formula:
After updating the judge model, we update the action-value function. To update the predictor function, we maximise the reward function defined in eq. (1). To maximise the reward function, we calculate its gradient using the formula:
Due to mini-batch training using m labelled and m unlabelled tweets, we can approximate this gradient. At the end of the step, we can update the predictor network using:
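The training procedure described above can be sketched as the following skeleton. The model objects are illustrative stubs standing in for the actual gradient updates, and the policy-gradient estimate in the comment is the standard REINFORCE form with the judge score as reward, assumed from [LY18] rather than taken from this report's equations:

```python
import random

class StubModel:
    """Illustrative stand-in for the real predictor and judge networks."""
    def fit(self, *data): pass                               # pre-training
    def update(self, labelled_pairs, predicted_pairs): pass  # judge update step
    def predict(self, x): return random.randint(0, 1)        # predicted label for x
    def score(self, x, y): return random.random()            # judge reward J(x, y)
    def policy_update(self, pairs, rewards):                 # predictor policy step
        self.last_rewards = rewards

def rlan_train(D_L, D_U, predictor, judge, iterations=100, steps=3, m=64):
    """Skeleton of the RLAN training loop described above."""
    predictor.fit(D_L)                                       # pre-train P(θ) on D_L
    judge.fit(D_L, [(x, predictor.predict(x)) for x in D_U])  # pre-train J(φ)
    for _ in range(iterations):
        batch_l = random.sample(D_L, min(m, len(D_L)))
        batch_u = random.sample(D_U, min(m, len(D_U)))
        pairs_u = [(x, predictor.predict(x)) for x in batch_u]
        for _ in range(steps):
            # Judge update: push J -> 1 on labelled pairs, J -> 0 on predicted pairs.
            judge.update(batch_l, pairs_u)
            # Predictor update by policy gradient, with the judge score as reward:
            #   grad_θ ≈ (1/m) Σ  J_φ(x, y) · grad_θ log P_θ(y | x)
            rewards = [judge.score(x, y) for x, y in pairs_u]
            predictor.policy_update(pairs_u, rewards)
    return predictor
```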
Fig. 3.3 RLAN Training Algorithm
3.5 Conclusion
In this chapter we presented a new deep learning model which trains not only on labelled data but also on unlabelled data. This model provides better performance than traditional models in cases where the data is biased or scarce. In the next chapter, we will discuss the architecture of our entire project.
Chapter 4
The requirements of the project dictate that the model should be able to classify tweets in real time. The model should also retrain itself to keep up to date with the evolving vocabulary of cyberspace. The tweets should also be stored in a presentable manner so that it is easier to gain insights from the results.
We tried to create the best design that would satisfy all these requirements.
4.1.1 Database
We will be using MongoDB as our database, where new and classified tweets will be stored. We decided to use MongoDB as it is easy to use with Python and highly scalable. It is also very easy to set an expiration policy to automatically delete old tweets after a set amount of time.
New tweets will be gathered from the Twitter Stream API and stored in MongoDB. The classifier will retrieve these new tweets and classify them. These newly classified tweets will then be stored in a separate collection with their scores.
To fulfil these requirements, we created 3 collections:
1. tweet collection - a collection to store tweetId, tweetText and other relevant fields of a tweet.
2. tweetScores collection - a collection to store tweetId and the corresponding score label.
3. tweetStatus collection - a collection to store tweetId and its status, i.e. whether the tweet has been classified, used for training, or used by other collaborators on the project.
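A minimal pymongo-style sketch of this layout (collection and field names follow the list above; the 30-day TTL window is an illustrative assumption, and `db` is a pymongo Database handle such as `MongoClient("mongodb://localhost:27017")["cyberspace"]`):

```python
def setup_collections(db):
    """Create TTL and lookup indexes for the three collections."""
    # MongoDB TTL index: a document is deleted once its `createdAt` timestamp
    # is older than expireAfterSeconds (30 days here is an assumed value).
    db.tweet.create_index("createdAt", expireAfterSeconds=30 * 24 * 3600)
    db.tweet.create_index("tweetId", unique=True)
    db.tweetScores.create_index("tweetId")
    db.tweetStatus.create_index("tweetId")
```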
1. TweetStream thread - this thread retrieves new tweets from the Twitter API and stores them in our MongoDB database.
2. Classifier thread - this thread will load the saved model, retrieve the new tweets from the database and classify them. These classified tweets will be stored in MongoDB and used for retraining the model and for other tasks like location estimation, virality detection, etc.
3. Trainer thread - this thread will load the previously trained model, take in the newly classified tweets and retrain the model with them.
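A minimal sketch of running such threads at regular intervals (the interval values and task names in the commented wiring are illustrative placeholders):

```python
import threading

def run_periodically(interval_s, task, stop_event):
    """Run `task` every `interval_s` seconds on a daemon thread until stopped."""
    def loop():
        # Event.wait doubles as an interruptible sleep and a stop check.
        while not stop_event.wait(interval_s):
            task()
    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread

# Illustrative wiring: classifier every 30 minutes, trainer every 6 hours.
# stop = threading.Event()
# run_periodically(30 * 60, classify_new_tweets, stop)
# run_periodically(6 * 3600, retrain_model, stop)
```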
Fig. 4.1 Design of Project
The first requirement of our project was to classify tweets in real time. We achieved this using the TweetStream thread and the Classifier thread. The TweetStream thread gathers new tweets from the Twitter API in real time. The Classifier thread runs at regular intervals (currently set to 30 minutes), classifying tweets in pseudo-real time, fulfilling our first requirement.
Our second requirement was for the classifier to retrain itself with the newly classified tweets. We satisfied this requirement by creating a trainer that runs at regular intervals (currently set to 6 hours).
Finally, we used MongoDB to store our results so that it is easy for other collaborators on the project to use the results and gather relevant insights from them.
4.3 Conclusion
In this chapter, we discussed the design we used to create our project and how it fulfils all our requirements.
Chapter 5
5.1 Results
To evaluate the model, we used Precision, Recall and F1-score. These metrics are widely used in the case of a multi-labelled dataset. Say a multi-labelled dataset contains a total of N instances; each instance can be represented as (x_i, y_i), where x_i is the set of attributes and y_i ⊆ L is the set of labels, L being the full set of labels used in the dataset. If y_i and ŷ_i represent the sets of true and predicted labels respectively for the i-th instance, then the metrics can be defined for the i-th instance by the given formulae.
Precision: the ratio of the number of accurately predicted labels to the total number of predicted labels.
Precision = |y_i ∩ ŷ_i| / |ŷ_i|
Recall: the ratio of the number of accurately predicted labels to the total number of actual labels of the instance.
Recall = |y_i ∩ ŷ_i| / |y_i|
F1-score: the harmonic mean of Precision and Recall, which gives a balanced evaluation between them.
F1-score = (2 · Precision · Recall) / (Precision + Recall)
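These per-instance metrics can be computed directly with Python sets:

```python
def instance_metrics(true_labels, predicted_labels):
    """Per-instance Precision, Recall and F1 for one multi-labelled example."""
    y, y_hat = set(true_labels), set(predicted_labels)
    correct = len(y & y_hat)          # |y_i ∩ ŷ_i|
    precision = correct / len(y_hat) if y_hat else 0.0
    recall = correct / len(y) if y else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```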
First, we ran tests for selection of the best model for our Predictor Network P(θ). We ran 20-fold cross validation for various models using Dataset 2. We considered just a single label for this case: the 50,000 tweets associated with cybercrime were given the label 1 and the rest the label 0. This gave us an insight into which model would be the best candidate for use as the Predictor Network P(θ).
Once the Predictor Network P(θ) was selected, we ran tests to check its viability. We set the number of iterations of RLAN training to 100 and the number of steps to 3. As it took more than 3 days to train for just 1 iteration of 10-fold cross validation, we considered only 1 iteration of 10-fold cross validation.
The batch size for training in both tests was set to 64.
5.1.3 Results of selection of Predictor Network P(θ)
As can be seen, our ensemble CNN-LSTM model is far superior to the other models and so is the best candidate for use as the Predictor Model P(θ).
5.1.5 Performance of our RLAN model
In contrast, had we used just the ensemble CNN-LSTM model on its own, we would have got the following result.
As our CDAC dataset was biased in favour of non-violent extremism, and was not adequate for training a neural network, using the classifier on its own would have produced the result in Table 5.3, i.e. all the tweets would be classified as belonging to non-violent extremism.
By using our RLAN model to train our classifier, we were able to get better performance, as it was able to predict tweets in all categories. With just 100 iterations, we obtained respectable results in 3 of the 5 categories. With increased computing power, we would be able to run the RLAN training for more iterations and get even better results.
5.2 Future Work
References
[GPAM+ 14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-
Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative ad-
versarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence,
and K. Q. Weinberger, editors, Advances in Neural Information Processing
Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
[LY18] Yan Li and Jieping Ye. Learning adversarial networks for semi-supervised text
classification via policy gradient. In Proceedings of the 24th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining, KDD
’18, pages 1715–1723, New York, NY, USA, 2018. ACM.