

TRUMOR

Overall System Architecture
Given the wide range of tools available for classification and data processing, the system is built mostly as a set of Python modules. A Python module was created for each of the tasks, with the exception of the crawler, for which NodeXL Pro was used instead of a Python library, mainly because of its powerful network discovery and graph visualization functionality.

Crawler & Visual Cluster Analysis
Using the Twitter API we crawled over 15,000 tweets related to health rumors; for each tweet, its relationships in the social network were also recorded. The results are focused on the relevance of the search queries. In addition, eigenvector centrality clustering was applied to the crawled dataset to obtain the following force-directed graph.
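For this step the project used NodeXL Pro, so the following is only a rough Python equivalent with networkx: it computes eigenvector centrality and draws a force-directed (spring) layout over a made-up, hypothetical edge list of user interactions.

    # Illustrative only: the project itself used NodeXL Pro for this step.
    import networkx as nx
    import matplotlib.pyplot as plt

    # Hypothetical user-interaction edges (retweets/mentions between users).
    edges = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"), ("dave", "alice")]
    G = nx.Graph(edges)

    # Eigenvector centrality scores how influential each user is in the graph.
    centrality = nx.eigenvector_centrality(G, max_iter=1000)

    # Force-directed (spring) layout; node size proportional to centrality.
    pos = nx.spring_layout(G, seed=42)
    nx.draw_networkx(G, pos, node_size=[3000 * centrality[n] for n in G])
    plt.axis("off")
    plt.show()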

The focus is on Twitter due to the availability of an open application programming interface (API) and the amount of data that can be retrieved publicly.

The project consists of several modules, which can be run separately depending on the specific task (a sketch of how they fit together follows this list):
- Crawler: queries Twitter for rumors or facts based on given search queries, analyzes social relationships, and provides visual graph clustering and overall dataset analytics
- Preprocessor: refines texts
- Indexer: creates an inverted index for a given set of texts and enables search
- Topic classifier: classifies text input as health or non-health related
- Rumor classifier: calculates the trustworthiness of a tweet input
- User ranker: calculates user ranks for the users in the user graph of crawled rumors
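As a rough illustration of how the modules could be chained for a single query, consider the sketch below; all object and method names (crawler.search, preprocessor.clean, and so on) are hypothetical and do not reflect the project's actual interfaces.

    # Hypothetical glue code: module objects and method names are illustrative only.
    def analyze_query(query, crawler, preprocessor, topic_clf, rumor_clf, user_ranker):
        """Fetch tweets for a query and score each one as rumor / non-rumor."""
        tweets = crawler.search(query)                      # Crawler module
        results = []
        for tweet in tweets:
            text = preprocessor.clean(tweet.text)           # Preprocessor module
            if not topic_clf.is_health_related(text):       # Topic classifier module
                continue
            rumor_prob = rumor_clf.rumor_probability(text)  # Rumor classifier module
            author_rank = user_ranker.rank(tweet.user)      # User ranker module
            results.append((tweet, rumor_prob, author_rank))
        return results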

GROUP 4
Xuqiang Fang
Meri-Kris Jaama
Matteo Presutto
Santiago Salgado
2IMW15 Web Information
Retrieval and Data Mining
Prof: Mykola Pechenizkiy
TU/e Eindhoven 2016

IMPROVEMENTS
With a small to moderate budget it would be possible to build a wider dataset using crowdsourcing platforms such as Clickworker to improve the rumor classifier. Having an actual GPU server would allow us to use more complex models, introducing convolutional layers and additional recurrent layers.

GOAL

The goal of the project is to develop a system that helps to determine whether widespread tweets are rumors or not. The end user would be able to search through tweets related to a given term and see how they are classified according to the multi-step model developed. The project is focused on data from the social network Twitter only.

FUNCTIONALITY

Identifying health-related rumors on Twitter

Preprocessor
We use NLTK to remove stop words. The preprocessor also:
- detects hashtags (#) and removes them
- detects mentions (@) and removes them
- detects URLs (http://) and removes them
- detects emojis and removes them
- detects retweet (RT) markers and removes them
Other basic functions tokenize the text, collapse repeated letters in a word (like happpppyyyyy), and remove digits and hyphens within words (a sketch of these steps follows below).
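A minimal sketch of these cleaning steps, assuming NLTK for stop words and tokenization and simple regular expressions for the rest; the exact patterns and their order are an approximation rather than the project's actual code.

    import re
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    # Requires: nltk.download('stopwords') and nltk.download('punkt')

    STOPWORDS = set(stopwords.words("english"))

    def preprocess(tweet):
        """Approximate version of the cleaning steps listed above."""
        text = tweet.lower()
        text = re.sub(r"\brt\b", " ", text)              # retweet markers
        text = re.sub(r"@\w+", " ", text)                # mentions
        text = re.sub(r"#\w+", " ", text)                # hashtags
        text = re.sub(r"http\S+", " ", text)             # URLs
        text = text.encode("ascii", "ignore").decode()   # crude emoji removal
        text = re.sub(r"(.)\1{2,}", r"\1", text)         # collapse happpppyyyyy-style runs
        text = re.sub(r"\d+", " ", text)                 # digits
        text = text.replace("-", " ")                    # hyphens
        tokens = word_tokenize(text)
        return [t for t in tokens if t.isalpha() and t not in STOPWORDS]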

Indexer

The indexer relies heavily on the preprocessor module. Before indexing, every tweet goes through the preprocessor to obtain a normalized form, on which the inverted index is built. The inverted index is essentially a dictionary created by iterating over all words in the tweets and, using each word as a dictionary key, saving the indexes of the tweets in which that word occurs as an array. To filter for tweets containing a defined feature word, the word is used as a key to look up the array of indexes of the wanted tweets, and the tweets are then filtered based on those indexes.
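A minimal sketch of the inverted index described above, together with the feature-word filtering built on top of it; the preprocess function is assumed to be the one sketched for the preprocessor.

    from collections import defaultdict

    def build_inverted_index(tweets, preprocess):
        """Map each word to the list of indexes of tweets containing it."""
        index = defaultdict(list)
        for i, tweet in enumerate(tweets):
            for word in set(preprocess(tweet)):   # set(): one entry per tweet per word
                index[word].append(i)
        return index

    def filter_by_word(tweets, index, feature_word):
        """Keep only the tweets whose index is listed under the feature word."""
        return [tweets[i] for i in index.get(feature_word, [])]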

Rumor Classifier

Lasagne and Theano have been used to develop the model. The neural network has 32 LSTM units and 2 softmax neurons in the head, and the whole architecture is trained with the cross-entropy loss. Problems encountered include dying ReLUs, exploding gradients, severe overfitting with too many neurons (64-128 unit variants), batch-size tuning, class imbalance, and preprocessing (too many letters implies too many features). Empirical experiments show that misspelled strings make it more probable that a tweet is a rumor, while citing sources makes it more probable that it is not.
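A minimal sketch of such a network in Lasagne/Theano; only the 32 LSTM units, the 2-unit softmax head and the cross-entropy loss come from the text above, while the input shape and the optimizer are assumptions.

    import theano
    import theano.tensor as T
    import lasagne

    SEQ_LEN, N_FEATURES = 50, 64   # assumed input shape; not stated on the poster

    # 32 LSTM units followed by a 2-unit softmax head, trained with cross-entropy.
    l_in = lasagne.layers.InputLayer(shape=(None, SEQ_LEN, N_FEATURES))
    l_lstm = lasagne.layers.LSTMLayer(l_in, num_units=32, only_return_final=True)
    l_out = lasagne.layers.DenseLayer(l_lstm, num_units=2,
                                      nonlinearity=lasagne.nonlinearities.softmax)

    targets = T.ivector("targets")
    probs = lasagne.layers.get_output(l_out)
    loss = lasagne.objectives.categorical_crossentropy(probs, targets).mean()
    params = lasagne.layers.get_all_params(l_out, trainable=True)
    updates = lasagne.updates.adam(loss, params)   # optimizer choice is an assumption

    train_fn = theano.function([l_in.input_var, targets], loss, updates=updates)
    predict_fn = theano.function([l_in.input_var], probs)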

User and Network Ranker


The user ranker is a module that estimates the average probability that a given user tweets a rumor. It is calculated by retrieving as many tweets as possible from the given user, estimating the rumor probability of those tweets, and averaging over them; the final user rank is a float between 0 and 1.
The network ranker is used to mitigate the noise introduced by the performance of the classifier itself and by the limited number of retrieved tweets. It consists of a linear model trained on a simple graph of users: given a user <U>, a linear regressor takes as input 4 features extracted from the list of user ranks of the users interacting with <U>. These four features are the minimum, maximum, mean and standard deviation of the ranks of the neighboring users. The user rank is then updated with the recurrent formula UR = 0.995*UR + 0.005*NUR (UR = user rank, NUR = new user rank).
[Figure: force-directed graph]
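A minimal sketch of the user rank computation, the four neighbor features, and the recurrent update formula; the fallback rank for users with no retrieved tweets is an assumption, and the linear regressor itself is only hinted at.

    import statistics

    def user_rank(tweet_texts, rumor_probability):
        """Average rumor probability over a user's retrieved tweets (float in [0, 1])."""
        probs = [rumor_probability(t) for t in tweet_texts]
        return sum(probs) / len(probs) if probs else 0.5   # 0.5 fallback is an assumption

    def neighbor_features(neighbor_ranks):
        """Features for the linear regressor: min, max, mean, std of neighbor ranks."""
        return (min(neighbor_ranks), max(neighbor_ranks),
                statistics.mean(neighbor_ranks), statistics.pstdev(neighbor_ranks))

    def update_rank(current_rank, new_rank):
        """Recurrent update from the poster: UR = 0.995*UR + 0.005*NUR."""
        return 0.995 * current_rank + 0.005 * new_rank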

Topic Classifier
Libraries used: nltk, sklearn, pickle

The topic classifier is based on multinomial Naïve Bayes classification. The feature independence assumption makes the calculation quite simple and results in a large performance gain compared, for example, to approaches based on deep learning. Naïve Bayes classification calculates the probability that the document (a tweet in our case) belongs to each of the 7 predefined classes based on its features, and then assigns the document to the class with the highest probability.
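A minimal sketch of such a classifier using scikit-learn's CountVectorizer and MultinomialNB (the poster lists nltk, sklearn and pickle as the libraries used); the toy documents and labels below are made up, and the real classifier distinguishes 7 predefined classes.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Toy training data; the real classifier uses 7 predefined topic classes.
    docs = ["flu vaccine causes autism", "new phone released today",
            "garlic cures cancer claim spreads", "football match tonight"]
    labels = ["health", "tech", "health", "sports"]

    topic_clf = make_pipeline(CountVectorizer(), MultinomialNB())
    topic_clf.fit(docs, labels)

    print(topic_clf.predict(["vaccine rumor spreading on twitter"]))  # e.g. ['health']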

The purpose of the topic classifier is to correctly identify health-related documents. To evaluate it, the classifier was used to classify 2,000 non-health-related and 2,201 health-related documents. The results are presented in the following confusion matrix.

It would make sense to introduce batch normalization and dropout on a dataset of moderate size to further improve performance. Since a GPU would speed up the training phase, it would also become possible to pre-train the model on other datasets such as Wikipedia.
The network ranker and user ranker would benefit from the computational speed-up provided by the GPU and from the higher accuracy of the new classifier.

PERFORMANCE
The performance of the modules has been analyzed one by one, independently: the rumor classifier has an estimated accuracy of 62%, while the topic classifier reaches 85.5%.
An overall performance analysis of our model is unfeasible due to the difficulties and ambiguities that would arise in building a labelled dataset of queries and relevant tweets, which would also have to take into account another set of rumor labels.
