Overall System Architecture Crawler & Visual Cluster Analysis
GOAL
Overall System
Architecture
Given the wide range of Python tools available for classification and data
processing, the system is built mostly as a set of Python modules, one per
task. The only exception is the crawler, for which NodeXL Pro was chosen
over a Python library, mainly because of its powerful network discovery
and graph visualization functionality.
Crawler &
Visual Cluster Analysis
Using the Twitter API we crawled over 15,000 tweets
related to health rumors; for each tweet, its
relationships in the social network were also
recorded. The results focus on the relevance of the
search queries. Eigenvector centrality clustering was
then applied to the crawled dataset to produce the
following force-directed graph.
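Eigenvector centrality scores each node by the centrality of its neighbors and can be approximated by power iteration on the adjacency matrix. A minimal NumPy sketch (the tiny graph below is illustrative, not the crawled Twitter network):

```python
import numpy as np

def eigenvector_centrality(adj, iters=100, tol=1e-9):
    """Power iteration on the adjacency matrix: the dominant
    eigenvector gives each node's centrality score."""
    n = adj.shape[0]
    x = np.ones(n) / n
    for _ in range(iters):
        x_new = adj @ x
        x_new /= np.linalg.norm(x_new)
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x_new

# Tiny undirected example graph (4 nodes, hypothetical).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
scores = eigenvector_centrality(A)
# Node 2 has the most (and best-connected) neighbors,
# so it receives the highest centrality score.
```

In the actual poster pipeline these scores would come from the crawled follower/retweet graph; the clusters are then laid out with a force-directed algorithm.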
GROUP 4
Xuqiang Fang
Meri-Kris Jaama
Matteo Presutto
Santiago Salgado
2IMW15 Web Information
Retrieval and Data Mining
Prof: Mykola Pechenizkiy
TU/e Eindhoven 2016
IMPROVEMENTS
With a small to moderate budget it would
be possible to build a larger dataset using
crowdsourcing platforms such as
Clickworker to improve the rumor classifier.
Having an actual GPU server would allow
us to use more complex models,
introducing convolutional layers and
other recurrent layers.
FUNCTIONALITY
Identifying
health-related
rumors on Twitter
Rumor Classifier
Pre-Processor
We use NLTK to remove stop words
Detect hashtags (#) and remove them
Detect mentions (@) and remove them
Detect URLs (http://) and remove them
Detect emojis and remove them
Detect retweet (RT) markers and remove them
Other basic functions tokenize the text, collapse repeated letters in a
word (like happpppyyyyy), remove digits in a word, and remove hyphens
in a word.
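The cleaning steps above can be sketched with plain regular expressions. This is a self-contained approximation: NLTK's stop-word list is swapped for a small hard-coded set, and the function name is illustrative.

```python
import re

STOP_WORDS = {"the", "is", "a", "an", "to", "of"}  # stand-in for NLTK's list

def preprocess(tweet):
    """Strip Twitter-specific noise, then tokenize and normalize."""
    tweet = re.sub(r"\bRT\b", " ", tweet)           # retweet markers
    tweet = re.sub(r"https?://\S+", " ", tweet)     # URLs
    tweet = re.sub(r"[@#]\w+", " ", tweet)          # mentions and hashtags
    tweet = re.sub(r"[^\x00-\x7F]", " ", tweet)     # emojis / non-ASCII
    tokens = re.findall(r"[a-z']+", tweet.lower())  # tokenize; drops digits, hyphens
    # Collapse 3+ repeated letters to 2 (happpppyyyyy -> happyy).
    tokens = [re.sub(r"(.)\1{2,}", r"\1\1", t) for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("RT @user Vaccines cause #autism!!! see http://example.com"))
# → ['vaccines', 'cause', 'see']
```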
Indexer
Lasagne and Theano were used to develop the model. The
neural network has 32 LSTM units and 2 softmax neurons in the
head; the whole architecture is trained with the cross-entropy error
function. Problems encountered include dying ReLUs, exploding
gradients, severe overfitting from too many units (64- and 128-unit
variants), batch-size choice, class imbalance, and preprocessing (too many
letters implies too many features). Empirical experiments show that
misspelled strings make a tweet more likely to be a rumor, while citing
sources makes it more likely not to be a rumor.
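The training objective named above (a two-way softmax head with cross-entropy loss) can be written out in NumPy. This is a sketch of the math only, not the actual Lasagne/Theano graph:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the true class."""
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

# Two logits per tweet (rumor / not-rumor), as a 2-neuron head would emit.
logits = np.array([[2.0, 0.5],   # leans "rumor"
                   [0.1, 1.9]])  # leans "not rumor"
labels = np.array([0, 1])        # true classes
loss = cross_entropy(logits, labels)
```

Minimizing this loss pushes the probability of the true class toward 1; in the actual model the logits come from the 32-unit LSTM.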
Topic Classifier
Libraries used: nltk, sklearn, pickle
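The poster lists only the libraries, not the model itself; one plausible shape for such a topic classifier is a TF-IDF plus linear-model pipeline in scikit-learn, persisted with pickle. The training texts and labels below are invented for illustration:

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled tweets: 1 = health-related, 0 = other (hypothetical data).
texts = ["vaccines and flu shots", "new diet cures cancer",
         "football match tonight", "stock market rises again",
         "doctors warn about virus", "concert tickets on sale"]
labels = [1, 1, 0, 0, 1, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# pickle persists the trained pipeline so it can be reloaded later.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
pred = restored.predict(["flu vaccines work"])[0]
```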
PERFORMANCE
The performance of the modules has been
analyzed one by one, independently: the
rumor classifier has an estimated accuracy
of 62%, while the topic classifier reaches
85.5%.
An overall performance analysis of our
model is unfeasible due to the difficulties
and ambiguities that would arise in
building a labelled dataset of queries and
relevant tweets, which would also have to
take into account another set of rumor labels.