
Cyberspace monitoring using AI and Graph Theoretic Tools

A B. Tech Project Report Submitted


in Partial Fulfillment of the Requirements
for the Degree of

Bachelor of Technology

by

Arinjaya Khare
(1501CS10)
under the guidance of

Dr Joydeep Chandra

to the
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY PATNA
PATNA - 800013, BIHAR
CERTIFICATE

This is to certify that the work contained in this thesis entitled "Cyberspace monitoring using AI and Graph Theoretic Tools" is a bonafide work of Arinjaya Khare (Roll No. 1501CS10), carried out in the Department of Computer Science and Engineering, Indian Institute of Technology Patna under my supervision and that it has not been submitted elsewhere for a degree.

Supervisor: Dr Joydeep Chandra
Assistant/Associate Professor,
Department of Computer Science & Engineering,
Indian Institute of Technology Patna, Bihar.

May, 2019
Patna.

Acknowledgements

This project was a huge learning opportunity, as I got to work with various deep learning frameworks. I also gained a lot of insight into writing good-quality, scalable code and adapting quickly to changing requirements.
I would like to thank Dr Joydeep Chandra for being my supervisor and providing valuable guidance during the development of this project. I would also like to thank IIT Patna for providing the resources necessary to work on this project.

Contents

List of Figures

List of Tables

1 Introduction
  1.1 Problem Description
  1.2 Dataset Used
  1.3 Organization of The Report

2 Details of our Datasets
  2.1 Details of our datasets
    2.1.1 Dataset 1
    2.1.2 Dataset 2
  2.2 Conclusion

3 Reinforcement Learning Based Adversarial Networks (RLANs)
  3.1 Architecture of our Model
  3.2 Predictor Network P(θ)
  3.3 Judge Network J(φ)
  3.4 How the Predictor is trained
    3.4.1 Input to the model
    3.4.2 Training Algorithm
  3.5 Conclusion

4 Architecture of our project
  4.1 Detailed Design of our project
    4.1.1 Database
    4.1.2 Detailed Architecture of our project
  4.2 How this design fulfilled our requirements
  4.3 Conclusion

5 Conclusion and Future Work
  5.1 Results
    5.1.1 Evaluation Metrics
    5.1.2 Experimental Setup
    5.1.3 Results of selection of Predictor Network P(θ)
    5.1.4 Analysis of Results in Table 5.1
    5.1.5 Performance of our RLAN model
    5.1.6 Discussion of Results of the RLAN model
  5.2 Future Work

References
List of Figures

3.1 Architecture of Predictor Network P(θ)
3.2 Architecture of Judge Network J(φ)
3.3 RLAN Training Algorithm
4.1 Design of Project
List of Tables

5.1 Results of 20 Fold Cross Validation for selection of Predictor Model P(θ)
5.2 Performance of our RLAN Model on CDAC annotated Dataset
5.3 Performance of classifier on CDAC annotated Dataset by itself
Chapter 1

Introduction

The 21st century is the era of social media. Social networking sites like Facebook and Twitter boast more than 2.27 billion users worldwide, and this number is set to grow even further due to factors like Digital India and cheap internet rates.

An effect of this phenomenon is that users can now share information more quickly with other people. This has also allowed various radical and violent propaganda messages to spread very quickly.

While technology has grown at an exponential rate, our legal system has not been able to keep up, so many such "cybercrimes" have slipped through the hands of the authorities, and many perpetrators roam freely, creating more havoc and chaos.

Our project tries to create a model to identify such posts, estimate locations, identify menacing individuals and report them to the relevant authorities to curb such behaviour.

1.1 Problem Description

Our goal is to identify tweets belonging to 5 categories: violent extremism, non-violent extremism, radical violence, non-radical violence and not relevant. We treat this as a classification problem, with each category having 3 possible scores symbolising severity: 0 (not relevant for the category), 1 (moderate) and 2 (high).
More formally, given a tweet T = [w_1, w_2, ..., w_m], we must return a label vector l = [l_0, l_1, l_2, l_3, l_4] such that l_i = score_severity, where score_severity = 0 if the tweet does not belong to category i, and 1 or 2 depending on the severity of its membership in that category.
The model should also be able to retrain itself with newly classified data, so as to stay up to date with the evolving vocabulary of the cyberspace over time.
The data should also be stored in a presentable manner, so that it is easy to gain insight into the menacing tweets.
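As an illustrative sketch of this label representation (the category order and the example scores below are assumptions for illustration, not taken from the dataset):

```python
# Assumed order of the five categories in the label vector.
CATEGORIES = [
    "violent extremism",
    "non-violent extremism",
    "radical violence",
    "non-radical violence",
    "not relevant",
]

def make_label_vector(scores):
    """Build l = [l0..l4] from a {category: severity} dict.

    Severity is 0 (not relevant for that category), 1 (moderate) or 2 (high);
    categories missing from `scores` default to 0.
    """
    for cat, sev in scores.items():
        assert cat in CATEGORIES and sev in (0, 1, 2)
    return [scores.get(cat, 0) for cat in CATEGORIES]

# A hypothetical tweet judged highly non-violent-extremist and
# moderately radical-violent:
label = make_label_vector({"non-violent extremism": 2, "radical violence": 1})
# -> [0, 2, 1, 0, 0]
```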

1.2 Dataset Used

For training our model, we will be making use of 2 datasets:

1. An annotated dataset provided by CDAC (Centre for Development of Advanced Computing), consisting of around 5000 tweets. This is fed as labelled data to our model.

2. A self-annotated dataset consisting of 1,00,000 tweets, of which roughly 50% are menacing tweets and the rest are normal tweets. This is fed as unlabelled data to our model.

1.3 Organization of The Report

This chapter discussed the challenges authorities face with cybercrime in the digital age, and listed the requirements and expectations for our project. In Chapter 2, we discuss how our database of tweets was created. In Chapter 3, we propose a novel deep learning model, inspired by Generative Adversarial Networks and reinforcement learning, called RLAN (Reinforcement Learning Based Adversarial Network). In Chapter 4, we discuss the architecture and workings of our project. In Chapter 5, we discuss the results we achieved with our model.

Chapter 2

Details of our Datasets

In this chapter, we discuss how the tweet datasets used to train our model were created.

2.1 Details of our datasets

We are currently working with 2 Datasets:

1. Labelled Tweet Dataset provided by CDAC used as Labelled Dataset.

2. Self Generated Dataset used as Unlabelled Dataset.

2.1.1 Dataset 1

CDAC (Centre for Development of Advanced Computing) provided us with a dataset consisting of 5033 tweets, with the following breakdown:

• Violent Extremism : 210

• Non-Violent Extremism : 3492

• Radical Violence : 220

• Non-Radical Violence : 181

• Not Relevant : 930

As can be seen, the data is heavily skewed and also quite small, and training directly on it would result in a poor classifier. We will have to develop a model which can overcome this bias in the data and give us a better classifier.

2.1.2 Dataset 2

The unlabelled dataset consists of 1,00,000 tweets. To create it, we first compiled a set of keywords associated with cybercrime and collected 50,000 tweets containing those keywords. We then retrieved 50,000 normal tweets which do not contain these keywords. In the end, the two sets were merged to create the final dataset.
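A minimal sketch of this keyword-based split (the keyword list and tweets below are hypothetical placeholders, not the actual ones used):

```python
# Hypothetical keywords; the real list was curated for cybercrime-related content.
KEYWORDS = {"attack", "bomb", "jihad"}

def contains_keyword(tweet_text):
    """True if any cybercrime keyword appears as a word in the tweet."""
    words = {w.strip(".,!?#@").lower() for w in tweet_text.split()}
    return not KEYWORDS.isdisjoint(words)

def build_unlabelled_dataset(tweets, n_per_class):
    """Take n_per_class keyword tweets and n_per_class normal tweets, merged."""
    menacing = [t for t in tweets if contains_keyword(t)][:n_per_class]
    normal = [t for t in tweets if not contains_keyword(t)][:n_per_class]
    return menacing + normal
```

In the project, n_per_class was 50,000 per side, giving the 1,00,000-tweet unlabelled set.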

2.2 Conclusion

In this chapter, we discussed how the datasets used in our project were created, along with their properties.

Chapter 3

Reinforcement Learning Based Adversarial Networks (RLANs)

Various deep learning models exist for classifying text. These classifiers have been shown to perform quite well, but they tend to fail when the annotated training data is scarce or biased. We propose a novel model which trains not only on labelled data but also on unlabelled data. We take inspiration from Generative Adversarial Networks (GANs) [GPAM+14], invented by Ian Goodfellow and generally used with image data, and extend the methodology to train a text classifier.

3.1 Architecture of our Model

Our model is an evolution of GANs, modified for text classification and combined with reinforcement learning to give us a well-trained classifier. The model is called a Reinforcement Learning Based Adversarial Network (RLAN) [LY18]. Just like a GAN, our model consists of 2 networks:

1. Predictor Network P(θ) - This is our main text classifier, trained using reinforcement learning. The tweet is the input, and the label vector of score_severity values is the output. The primary goal of learning is to find a good policy for P such that the predicted labels maximise the expected total reward of the reinforcement learning.

2. Judge Network J(φ) - The judge network evaluates the predicted label from P(θ) and provides a feedback reward which drives the relearning of the predictor network. As the reward is generated dynamically, it can iteratively improve the predictor network.

3.2 Predictor Network P(θ)

We use a hybrid CNN-LSTM model as our classifier, which learns from the continuous stream of tweets and picks up salient features from them to predict the label vector of score_severity values.
The model consists of 3 parts:

1. An embedding layer using pretrained GloVe [glo] embeddings that converts tokenised tweets into vectors.

2. A convolutional layer.

3. 3 layers of LSTM that extract the salient features to predict the output.

Even though CNNs alone are not very effective for text classification, and stacked LSTMs give slightly better scores than our model, adding a convolutional layer provides a huge speed improvement (close to 3 times faster training). Although convolutions lack the sequence-processing abilities of LSTMs, they work well here because they act as a filter over the data and create a higher-level representation of it.
We had to make a trade-off between a large speed advantage and slightly better scores, and we chose faster training times.
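A minimal Keras sketch of such an Embedding → Conv → stacked-LSTM predictor (the vocabulary size, sequence length and layer widths below are assumptions for illustration, not the values used in the project):

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 20_000   # assumed; set from the tokeniser in practice
SEQ_LEN = 50          # assumed maximum tweet length in tokens
EMBED_DIM = 100       # GloVe 100-d vectors, loaded as pretrained weights in practice

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    # Embedding layer; in the project this is initialised with pretrained GloVe vectors.
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # Convolutional layer: a filter over the token sequence that builds a
    # higher-level representation and shortens the sequence the LSTMs process.
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    # 3 stacked LSTM layers extracting salient sequential features.
    layers.LSTM(64, return_sequences=True),
    layers.LSTM(64, return_sequences=True),
    layers.LSTM(64),
    # One sigmoid output per category (multi-label).
    layers.Dense(5, activation="sigmoid"),
])
```

The pooling after the convolution is what buys the speedup: the LSTM stack sees a shorter, higher-level sequence instead of every raw token.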

Fig. 3.1 Architecture of Predictor Network P(θ)

3.3 Judge Network J(φ)

The goal of the judge model is to predict, for a given tweet-label pair (T_i, l_i), how likely it is to belong to the set of labelled tweets.
We use an LSTM-based model as our judge network. At the t-th step of our algorithm, an estimated label vector O_t is generated by the predictor network. O_t is then concatenated with the one-hot embedded vector y from the LSTM layer. A weighted combination of all these concatenations is used as input to the output layer of the judge network.

The judge model outputs 1 if it thinks the tweet-label pair belongs to the labelled dataset, and 0 if it does not.

Fig. 3.2 Architecture of Judge Network J(φ)

3.4 How the Predictor is trained

3.4.1 Input to the model

There are 2 sets of inputs:

• a set of labelled (annotated) tweets, denoted D_L

• a set of unlabelled tweets, denoted D_U

For any particular tweet, we use the tweet text only, discarding the other fields. NLTK and Keras were used to tokenise the text and remove stopwords and words not found in the English language. We also discarded tokens which contain no alphabetic characters, mentions and punctuation tokens, as they won't contain any information significant for classification.
Currently we use only labels of 0 and 1, due to lack of adequate data.
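A pure-Python sketch of this preprocessing (a simple regex and a tiny stopword list stand in for the NLTK pipeline; the example tweet is hypothetical):

```python
import re

# A small stopword list for illustration; the project used NLTK's full list.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "rt"}

def preprocess(tweet_text):
    """Tokenise a tweet, dropping mentions, tokens without any alphabetic
    character, punctuation-only tokens and stopwords."""
    cleaned = []
    for tok in tweet_text.split():
        if tok.startswith("@"):                        # drop mentions
            continue
        tok = re.sub(r"[^A-Za-z']", "", tok).lower()   # strip non-letter chars
        if not tok or not any(c.isalpha() for c in tok):
            continue                                   # drop tokens with no alphabets
        if tok in STOPWORDS:
            continue
        cleaned.append(tok)
    return cleaned

# e.g. preprocess("RT @user: The attack is planned!!! 123")
# -> ["attack", "planned"]
```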

3.4.2 Training Algorithm

We treat the predictor P_θ(y|x) as a policy model which gives the probability that action y (a label vector) is chosen in state x (a tweet). Training aims at generating the best possible actions so as to maximise the expected reward:

    V(θ) = E[ J_φ(x, y) ],  with y ~ P_θ(·|x)                                            (1)

Here V(θ) is the action-value function, and J_φ(x, y) is the probability that the tweet-label pair (x, y) originated from the set of labelled tweets (D_L) rather than the set of unlabelled tweets (D_U).

Initially we train our predictor model using the set of labelled tweets D_L and their corresponding labels. We then generate label vectors for the unlabelled tweets D_U; both sets are used to train our judge model.
The algorithm then runs for a set number of iterations. In each iteration we sample m tweets from the labelled set D_L, sample m tweets from the unlabelled set D_U, and generate predicted output labels for the latter.
Each iteration runs for a set number of steps. In each step, we first update the judge model by maximising:

    E_{(x,y) ∈ D_L}[ log J_φ(x, y) ] + E_{x ∈ D_U, y ~ P_θ(·|x)}[ log(1 − J_φ(x, y)) ]    (2)

After updating the judge model, we update the action-value function. To update the predictor, we maximise the reward function defined in eq. (1), computing its gradient using the policy-gradient identity:

    ∇_θ V(θ) = E_{y ~ P_θ(·|x)}[ J_φ(x, y) ∇_θ log P_θ(y|x) ]                            (3)

With mini-batch training over m labelled and m unlabelled tweets, the gradient of the reward function can be approximated as:

    ∇_θ V(θ) ≈ (1/m) Σ_{i=1}^{m} J_φ(x_i, ŷ_i) ∇_θ log P_θ(ŷ_i | x_i)                    (4)

At the end of the step, the predictor network is updated using:

    θ ← θ + α ∇_θ V(θ)                                                                   (5)

where α is the learning rate defined by the user.
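To make the policy-gradient step concrete, here is a toy pure-Python sketch for a linear softmax predictor (the features, labels and judge below are hypothetical stand-ins; the real predictor is the CNN-LSTM network and the real judge is the LSTM judge network):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predictor_probs(theta, x):
    """Linear softmax policy P_theta(y | x): one weight row per label."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in theta])

def policy_gradient_step(theta, batch, judge, alpha=0.1):
    """One REINFORCE-style update: grad log P(y|x) weighted by the judge's
    reward J(x, y), averaged over the mini-batch, then theta += alpha * grad."""
    n_labels, n_feats = len(theta), len(theta[0])
    grad = [[0.0] * n_feats for _ in range(n_labels)]
    for x, y in batch:
        probs = predictor_probs(theta, x)
        reward = judge(x, y)  # in [0, 1]: how "labelled-like" the pair looks
        for k in range(n_labels):
            # d/d theta_k of log softmax at label y is (1[k == y] - p_k) * x
            coeff = reward * ((1.0 if k == y else 0.0) - probs[k])
            for j in range(n_feats):
                grad[k][j] += coeff * x[j]
    m = len(batch)
    return [[theta[k][j] + alpha * grad[k][j] / m for j in range(n_feats)]
            for k in range(n_labels)]
```

Sampling a label from P_θ, scoring it with the judge and applying this step repeatedly is the inner loop of the training procedure described above.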

Fig. 3.3 RLAN Training Algorithm

3.5 Conclusion

In this chapter, we presented a new deep learning model which trains not only on labelled data but also on unlabelled data. It performs better than traditional models when the data is biased or scarce. In the next chapter, we discuss the architecture of our entire project.

Chapter 4

Architecture of our project

The requirements of the project dictate that the model should classify tweets in real time, retrain itself to keep up with the evolving vocabulary of the cyberspace, and store tweets in a presentable manner so that it is easy to gain insights from the results.
We tried to create a design that satisfies all these requirements in the best way.

4.1 Detailed Design of our project

4.1.1 Database

We use MongoDB as the database where new and classified tweets are stored. We chose MongoDB because it is easy to use with Python and highly scalable; it is also very easy to set an expiration policy that automatically deletes old tweets after a set amount of time.
New tweets are gathered from the Twitter Streaming API and stored in MongoDB. The classifier retrieves these new tweets and classifies them; the newly classified tweets are then stored in a separate collection along with their scores.

To fulfill these requirements, we created 3 collections:

1. tweet collection - stores tweetId, tweetText and the other relevant fields of a tweet.

2. tweetScores collection - stores each tweetId with its score label.

3. tweetStatus collection - stores each tweetId with its status, i.e. whether the tweet has been classified, used for training, or used by other collaborators on the project.

4.1.2 Detailed Architecture of our project

Our project consists mainly of 3 threads:

1. TweetStream thread - retrieves new tweets from the Twitter API and stores them in our MongoDB database.

2. Classifier thread - loads the saved model, retrieves the new tweets and classifies them. The classified tweets are stored in MongoDB and used for retraining the model and for other tasks like location estimation and virality detection.

3. Trainer thread - loads the previously trained model and retrains it with the newly classified tweets.
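A minimal sketch of this periodic-thread layout (the intervals and worker bodies are placeholders; the real threads talk to the Twitter API, MongoDB and the model):

```python
import threading
import time

def run_periodically(task, interval_seconds, stop_event):
    """Run `task`, then sleep `interval_seconds`, until `stop_event` is set."""
    while not stop_event.is_set():
        task()
        stop_event.wait(interval_seconds)  # wakes early if stop is set

counts = {"stream": 0, "classify": 0, "train": 0}

def make_worker(name):
    # Placeholder body; the real workers fetch tweets, classify them,
    # and retrain the model respectively.
    def worker():
        counts[name] += 1
    return worker

stop = threading.Event()
# Tiny intervals for demonstration; the project used near-real-time streaming,
# a 30-minute classifier interval and a 6-hour trainer interval.
specs = [("stream", 0.01), ("classify", 0.02), ("train", 0.05)]
threads = [
    threading.Thread(target=run_periodically,
                     args=(make_worker(name), interval, stop), daemon=True)
    for name, interval in specs
]
for t in threads:
    t.start()
time.sleep(0.2)  # let the threads run briefly
stop.set()
for t in threads:
    t.join()
```

Using a shared stop event keeps shutdown clean: every thread exits its loop at the next wake-up instead of being killed mid-task.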

Fig. 4.1 Design of Project

4.2 How this design fulfilled our requirements

The first requirement of our project was to classify tweets in real time. We achieved this using the TweetStream thread and the Classifier thread: the TweetStream thread gathers new tweets from the Twitter API in real time, and the Classifier thread runs at regular intervals (currently every 30 minutes), classifying tweets in pseudo-real time and fulfilling our first requirement.
Our second requirement was for the classifier to retrain itself with newly classified tweets. We satisfied this by having the Trainer thread run at regular intervals (currently every 6 hours).
Finally, we used MongoDB to store our results, making it easy for other collaborators on the project to use them and gather relevant insights.

4.3 Conclusion

In this chapter, we discussed the design we used for our project and how it fulfilled all our requirements.

Chapter 5

Conclusion and Future Work

5.1 Results

5.1.1 Evaluation Metrics

To evaluate the model, we used Precision, Recall and F1-score, metrics widely used for multi-labelled datasets. Say a multi-labelled dataset contains a total of N instances; each instance can be represented as (x_i, y_i), where x_i is the set of attributes and y_i ⊆ L is the set of true labels, L being the set of all labels used in the dataset. Let ŷ_i denote the set of predicted labels for the i-th instance. The metrics for the i-th instance are then given by the following formulae.

Precision: the ratio of accurately predicted labels to the total number of predicted labels.

    Precision = |y_i ∩ ŷ_i| / |ŷ_i|

Recall: the ratio of accurately predicted labels to the total number of true labels for the instance.

    Recall = |y_i ∩ ŷ_i| / |y_i|

F1-score: the harmonic mean of Precision and Recall, which gives a balanced evaluation of the two.

    F1-score = (2 · Precision · Recall) / (Precision + Recall)
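A small pure-Python sketch of these per-instance metrics over label sets (the example labels are hypothetical):

```python
def precision_recall_f1(true_labels, predicted_labels):
    """Per-instance Precision, Recall and F1 over sets of labels."""
    true_set, pred_set = set(true_labels), set(predicted_labels)
    correct = len(true_set & pred_set)       # |y_i ∩ ŷ_i|
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(true_set) if true_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0        # avoid division by zero
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. true = {"radical violence", "not relevant"}, predicted = {"radical violence"}
# -> precision 1.0, recall 0.5, F1 ≈ 0.667
```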

5.1.2 Experimental Setup

First, we ran tests to select the best model for our predictor network P(θ). We ran 20-fold cross validation for various models using Dataset 2, considering just a single label: the 50,000 tweets associated with cybercrime were given the label 1 and the rest the label 0. This gave us insight into which model would be the best candidate for the predictor network P(θ).
Once the predictor network P(θ) was selected, we ran tests to check its viability. We set the number of RLAN training iterations to 100 and the number of steps to 3. As it took more than 3 days to train for just 1 iteration of 10-fold cross validation, we consider only 1 iteration of 10-fold cross validation.
The batch size for both tests was set to 64.

5.1.3 Results of selection of Predictor Network P(θ)

Approach       Accuracy   Precision        Recall           F1 Score
Our Model      0.912      [0.912, 0.862]   [0.961, 0.861]   [0.928, 0.862]
LSTM           0.7698     [0.771, 0.769]   [0.769, 0.771]   [0.770, 0.770]
GRU            0.498      [0.399, 0.099]   [0.800, 0.200]   [0.532, 0.133]
CNN            0.49704    [0.249, 0.249]   [0.500, 0.500]   [0.332, 0.332]
Stacked LSTM   0.8048     [0.830, 0.796]   [0.829, 0.786]   [0.830, 0.791]

Table 5.1 Results of 20 Fold Cross Validation for selection of Predictor Model P(θ)

5.1.4 Analysis of Results in Table 5.1

As can be seen, our Ensemble CNN-LSTM model is far superior to the other models, and so it is the best candidate for the predictor model P(θ).

5.1.5 Performance of our RLAN model

The results of our RLAN model are presented in Table 5.2.

Category Precision Recall F1 Score


Violent Extremism 0.3200 0.2459 0.2781
Non-violent Extremism 0.8516 0.7431 0.7937
Radical Violence 0.6843 0.5567 0.61934
Non-radical Violence 0.3984 0.2897 0.33065
Not Relevant 0.7958 0.7759 0.7857
Table 5.2 Performance of our RLAN Model on CDAC annotated Dataset

In contrast, if we had used just the Ensemble CNN-LSTM model on its own, we would have obtained the results shown in Table 5.3.

Category Precision Recall F1 Score


Violent Extremism 0 0 0
Non-violent Extremism 0.6938 1 0.8193
Radical Violence 0 0 0
Non-radical Violence 0 0 0
Not Relevant 0 0 0
Table 5.3 Performance of classifier on CDAC annotated Dataset by itself

5.1.6 Discussion of Results of the RLAN model

As our CDAC dataset was biased in favour of Non-Violent Extremism and was inadequate in size for training a neural network, using the classifier on its own would have produced the results in Table 5.3, i.e. all tweets would be classified as Non-Violent Extremism.
By using our RLAN model to train the classifier, we obtained better performance, as the classifier was able to predict tweets for all categories. With just 100 iterations, we achieved respectable results in 3 of the 5 categories. With increased computing power, we would be able to run RLAN training for more iterations and get even better results.

5.2 Future Work

Currently, in the label vector, we consider score_severity to be only 0 or 1. Our requirement stated that score_severity should be 0 (not relevant for that category), 1 (moderate) or 2 (highly relevant for that category). However, as there was very little data with score_severity = 2 (around 15 tweets per category), we decided not to consider it. If we are able to get more annotated data from CDAC, we can also add this score; in that case, our label vector will have 11 classes instead of 5.
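The class arithmetic can be made explicit. The sketch below assumes one plausible encoding (a hypothetical choice, not specified in this report): one class per (category, severity) pair plus a single "none" class, giving 5 × 2 + 1 = 11 classes.

```python
# Hypothetical encoding of the expanded label space: one class per
# (category, severity) pair plus a single "none" class -> 5 * 2 + 1 = 11.
CATEGORIES = [
    "violent extremism",
    "non-violent extremism",
    "radical violence",
    "non-radical violence",
    "not relevant",
]
SEVERITIES = [1, 2]  # 1 = moderate, 2 = high

classes = ["none"] + [f"{c} (severity {s})" for c in CATEGORIES for s in SEVERITIES]
assert len(classes) == 11
```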

References

[glo] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/

[GPAM+14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.

[LY18] Yan Li and Jieping Ye. Learning adversarial networks for semi-supervised text
classification via policy gradient. In Proceedings of the 24th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining, KDD
’18, pages 1715–1723, New York, NY, USA, 2018. ACM.

