
NileTMRG at SemEval-2017 Task 8:

Determining Rumour and Veracity Support for Rumours on Twitter

Omar Enayet and Samhaa R. El-Beltagy


Center for Informatics Science
Nile University
Giza, Egypt
omar.enayet@gmail.com, samhaa@computer.org

Abstract

This paper presents the results and conclusions of our participation in SemEval-2017 Task 8: Determining rumour veracity and support for rumours. We participated in 2 subtasks: SDQC (Subtask A), which deals with tracking how tweets orient to the accuracy of a rumourous story, and Veracity Prediction (Subtask B), which deals with predicting the veracity of a given rumour. Our participation was in the closed task variant, in which the prediction is made solely from the tweet itself. For Subtask A, linear support vector classification was applied to a bag-of-words model, with the help of a naïve Bayes classifier for semantic feature extraction. For Subtask B, a similar approach was used. Many features were tried during the experimentation process, but only a few proved useful with the data set provided. Our system achieved 71% accuracy, ranking 5th among 8 systems for Subtask A, and achieved 53% accuracy with the lowest RMSE value of 0.672, ranking 1st among 5 systems for Subtask B.

1 Introduction

Over the past 15 years, and in a gradual manner, social media has become a main source of news. However, social media has also become ripe ground for rumours, spreading them in a matter of minutes. A rumour is defined as a claim that could be true or false. False rumours may greatly affect the social, economic and political stability of any society around the world, hence the need for tools to help people, especially journalists, analyze the spread of rumours and their effect on society, as well as determine their veracity.

Twitter is a popular social media platform capable of spreading breaking news, so most rumour-related research uses the Twitter feed as its basis.

SemEval (Semantic Evaluation) is an ongoing series of evaluations of computational semantic analysis systems. Task 8 (RumourEval) (Derczynski et al., 2017) is one of 12 tasks presented in SemEval 2017. This paper describes the system that we used to participate in this task. The task consists of 2 subtasks: SDQC (Subtask A), which has the objective of tracking how other tweets orient to the accuracy of a rumourous story, and Veracity Prediction (Subtask B), which has the goal of predicting the veracity of a given rumour. Subtask B has two variants: an open variant and a closed one. We participated only in the closed variant, in which the prediction is made solely from the tweet itself.

Scientific literature related to rumours on social media has started to emerge over the past 7 years. It can be categorized into 4 main categories: 1) detection of the spreading of a rumour, 2) determination of the veracity of a rumour, 3) analysis of rumour propagation through a social network, and 4) speech act analysis of the different online replies to a rumour. Subtask A belongs to the 4th category, while Subtask B belongs to the 2nd category.

The rest of the paper is organized as follows: section 2 briefly overviews related work, section 3 provides task description details, and section 4 provides a detailed system description covering pre-processing, feature extraction and selection, the learning model, and the evaluation done for both subtasks A and B. Finally, a conclusion is given along with the future work needed.
2 Related Work

Zubiaga et al. (2016) presented a methodology that enabled them to collect, identify and annotate a large data set of rumours associated with multiple newsworthy events, and analyzed how people orient to and spread rumours in social media. This data set was used for task 8 of SemEval 2017: RumourEval. Qazvinian et al. (2011) addressed the problem of automatic rumour detection in microblogs, as well as identifying users who support, deny or question the rumour. They achieved this by exploring the effectiveness of 3 categories of features: content-based, network-based and microblog-specific memes. Vosoughi (2015) addressed the problem of rumour detection via a speech act classifier that detects assertions using different semantic and syntactic features, in addition to a clustering algorithm that groups rumourous tweets talking about the same topic. Hamidian and Diab (2015), Castillo et al. (2011), Vosoughi (2015) and Giasemidis et al. (2016) addressed the issue of detecting the veracity of rumours using manually selected and annotated rumours on Twitter, using linguistic, user, rumour, pragmatic, content, twitter-specific and propagation features; the latter also developed a software demonstration that provides a visual user interface allowing the user to examine the analysis. Chua and Banerjee (2016) concentrated on linguistic features such as comprehensibility, sentiment and writing style to predict rumour veracity, ignoring all non-linguistic features. Galitsky (2015) also concentrated on linguistic features to detect disinformation, by comparing the text to search results obtained using the significant sentences in that text. Liu et al. (2015) proposed the first real-time rumour debunking algorithm for Twitter, while Zhao et al. (2015) concentrated on identifying a trending rumour as early as possible without trying to assess its veracity.

3 Task Description

Below is a brief overview of each subtask. For more details, please refer to RumourEval (Derczynski et al., 2017).

Subtask A: The input to this task is a set of tweets, each replying to a rumourous tweet, which we name the rumour source tweet. The training data is composed of the tweet content and its speech act class. A tweet can be classified as a support, deny, query or comment.

Subtask B: The input to this task is a set of tweets, each representing the source of a rumour. The training data is composed of the tweet content and its veracity. A tweet's veracity can be either true, false or unknown. Also, a confidence value, a float from 0 to 1, is required for each tweet.

4 System Overview

The systems used for subtasks A and B were very similar, except that each focused on a different set of features. The Python libraries scikit-learn (Buitinck et al., 2013) and NLTK (Bird et al., 2009) were mainly used to implement this work. Below are the general system specifications. All classifiers were used with their default parameters.

4.1 Preprocessing and feature extraction

The system depends on performing some preprocessing on the tweets' texts, extracting simple bag-of-words features from them, and then extracting additional higher-level features from them as well as from the entire twitter feed provided. These steps were carried out with the aid of the NLTK Tweet Tokenizer (2015).

Preprocessing also included the removal of stop words, punctuation characters and twitter-specific words such as 'rt' and 'via'.

No further pre-processing was performed. Below are some notes in this context (a short sketch of the preprocessing step follows the list):

• The case of the words could be useful in showing the sentiment and the context of the word, thus all words were kept in their original case.
• Performing stemming or lemmatization caused worse performance, as keeping each word in its original form proved to be useful.
• Removing URLs from the text yielded worse performance: tweets using the same URL usually shared the same speech act, so the URL token acted as an important feature.
• Using bi-grams resulted in noise being added to the training data, causing the classifier's performance to degrade.
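As an illustration, below is a minimal sketch of this preprocessing step (not our exact code). It assumes the NLTK 'stopwords' corpus has been downloaded, and the example tweet is invented:

# Illustrative sketch of the preprocessing described above.
# Requires: nltk.download("stopwords")
import string

from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer

STOP_WORDS = set(stopwords.words("english"))
TWITTER_WORDS = {"rt", "via"}          # twitter-specific words removed by the system
PUNCTUATION = set(string.punctuation)

# preserve_case=True keeps the original casing, which proved useful.
tokenizer = TweetTokenizer(preserve_case=True)

def preprocess(tweet_text):
    """Tokenize a tweet, dropping stop words, punctuation, 'rt' and 'via'.

    URLs are deliberately kept: tweets sharing a URL tend to share a speech act.
    """
    tokens = tokenizer.tokenize(tweet_text)
    return [t for t in tokens
            if t.lower() not in STOP_WORDS
            and t.lower() not in TWITTER_WORDS
            and t not in PUNCTUATION]

print(preprocess("RT @user: Impossible, this is false! http://example.com"))
# -> ['@user', 'Impossible', 'false', 'http://example.com']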
4.2 Feature Selection

The following feature selection and dimensionality reduction methods were tried on the basic bag-of-words features, before adding higher-level features:

• Chi-squared feature selection (2010)
• Variance threshold feature selection (2010)
• Truncated SVD dimensionality reduction (2010)

In the end, none of the above algorithms was used, as they all yielded worse results when the model was cross-validated. We attribute this to the fact that the number of features was not big enough.

Additional features were instead selected manually, by measuring the performance of the classifier on the training data when adding/removing each additional feature. Features with big numerical values were scaled down to the range between 0 and 1. A sketch of these selectors and of the scaling step follows.
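For illustration, the three selectors that were tried (and ultimately dropped) and the [0, 1] scaling that was kept could be wired up in scikit-learn as follows. The data and the parameters k and n_components are toy values of ours, not the paper's:

# Sketch of the feature selection / reduction methods and the scaling step.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.preprocessing import MinMaxScaler

texts = ["this is false", "is it true ?", "omg so true", "not true at all"]
labels = [1, 2, 0, 1]  # toy SDQC-style class ids

X = CountVectorizer().fit_transform(texts)  # basic bag-of-words counts

X_chi2 = SelectKBest(chi2, k=5).fit_transform(X, labels)   # chi-squared selection
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)  # variance threshold
X_svd = TruncatedSVD(n_components=2).fit_transform(X)      # truncated SVD

# Features with big numerical values (e.g. follower counts) were scaled to [0, 1].
followers = np.array([[10.0], [25000.0], [130.0], [999.0]])
followers_scaled = MinMaxScaler().fit_transform(followers)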
4.3 Additional feature extraction

Several features were extracted, though not all of them proved useful in the classification process.

Below is the complete list of features that apply to both subtasks A and B (sketches of the utility components mentioned here follow the list):

• Question Existence: The relationship between a question and a query is tight: a question is often a query, and a query is often a question. Also, if a tweet is a question, it is highly unlikely to be a normal comment; it is more likely a support or a denial, if not a query. Thus, being a question is an important feature to consider. A question detection module was built for this purpose, with the following details:
  • Any sentence containing a question mark was assumed to be a question.
  • In the absence of a question mark, any sentence classified as a question had to contain at least one of the following keywords used in WH-questions: "what, why, how, when, where", or in Yes/No questions: "did, do, does, have, has, am, is, are, can, could, may, would, will", as well as their negations. It is highly unlikely that a question does not contain one of these words.
  • A utility classifier was used for further detection of questions: we performed speech act recognition using a naïve Bayes classifier trained on the NLTK corpus 'nps_chat' (2015). On cross-validating that classifier, we got an accuracy of 67%. If this utility classifier marks the tweet as a Yes/No question or a WH-question, the tweet is considered to be a question.
• Denial term detection: We found that explicitly signaling the existence of a denial word within a tweet is a useful feature for generalizing over the data. The list of words used is: 'not true', 'don't agree', 'impossible', 'false', 'shut'.
• Support words detection: Like denial words, we included another feature signaling the existence of a support term. These were detected based on the following list of common support words: 'true', 'exactly', 'yes', 'indeed', 'omg', 'know'.
• Hashtag Existence
• URL Existence
• Tweet is reply: This feature specifies whether the tweet is a reply to another tweet or a source tweet. Source tweets are rarely queries, and not often a denial or a support; most of them are normal comments.
• Tweet's words' sentiments: Simple sentiment prediction was performed on each tweet's text by counting the number of positive and negative sentiment words in the tweet using the NLTK opinion lexicon (2015). If the positive words outnumbered the negative words, the feature got a value of 1; otherwise, it got a value of 0. If there were no sentiment words, or if the positive and negative words were equal, the feature was set to 0.5.
• Tweet sentiment: A utility classifier was used for further detection of sentiment. For setting this feature, a naïve Bayes classifier was trained for sentiment analysis using the NLTK movie reviews corpus (2015). It would of course be better to train this classifier on tagged tweets, which is what we intend to do in future work.
• Is User Verified
• Number of followers
• Number of user's past tweets
• Number of user's friends
• Retweet Ratio: This feature represents the ratio between the number of retweets of the target tweet and the number of retweets of the rumour source tweet.
• Photo Existence
• Days since user creation: This feature represents the number of days since the user account was created on Twitter. Older accounts may have more credibility than new ones.
• Source tweet user is verified: This feature represents whether the tweeter of the rumour source has a verified account or not.
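Below is a rough sketch of the utility components described in the list above: the question detector trained on nps_chat, the opinion-lexicon sentiment count, and the movie-reviews sentiment classifier. It assumes the relevant NLTK corpora are downloaded, and the binary bag-of-words feature function is our illustrative choice, not necessarily the exact one used:

# Sketches of the naive Bayes utility classifiers and the lexicon feature.
# Requires: nltk.download() of "nps_chat", "movie_reviews", "opinion_lexicon".
from nltk import NaiveBayesClassifier
from nltk.corpus import movie_reviews, nps_chat, opinion_lexicon

def bow_features(words):
    """Binary bag-of-words featureset over lowercased tokens."""
    return {w.lower(): True for w in words}

# Question detector: speech act recognition over the NPS chat corpus.
posts = nps_chat.xml_posts()
question_clf = NaiveBayesClassifier.train(
    [(bow_features(p.text.split()), p.get("class")) for p in posts])

def is_question(tokens):
    label = question_clf.classify(bow_features(tokens))
    return label in ("ynQuestion", "whQuestion")  # NPS chat speech act labels

# Lexicon-based sentiment feature: 1, 0 or 0.5 as described above.
POSITIVE = set(opinion_lexicon.positive())
NEGATIVE = set(opinion_lexicon.negative())

def lexicon_sentiment(tokens):
    pos = sum(t.lower() in POSITIVE for t in tokens)
    neg = sum(t.lower() in NEGATIVE for t in tokens)
    if pos == neg:  # also covers the no-sentiment-words case
        return 0.5
    return 1.0 if pos > neg else 0.0

# Sentiment utility classifier trained on the movie reviews corpus.
sentiment_clf = NaiveBayesClassifier.train(
    [(bow_features(movie_reviews.words(fid)), cat)
     for cat in movie_reviews.categories()
     for fid in movie_reviews.fileids(cat)])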
The following list of features applies to subtask A only:

• User 'replied to' is verified
• Cosine similarity with the root rumourous tweet and with the 'replied to' tweet: Using the same words may imply that the tweet is a support (a sketch follows this list).
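A plausible sketch of this similarity feature using scikit-learn follows; the choice of a plain count vectorizer is our assumption:

# Sketch of the cosine-similarity feature between a reply and another tweet
# (the rumour's source tweet or the tweet being replied to).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tweet_cosine_similarity(reply_text, other_text):
    """Cosine similarity between the word-count vectors of two tweets."""
    vectors = CountVectorizer().fit_transform([reply_text, other_text])
    return cosine_similarity(vectors[0], vectors[1])[0, 0]

sim = tweet_cosine_similarity("it is true, sadly", "reports say it is true")
print(round(sim, 2))  # higher word overlap may indicate a supporting reply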
Finally, the following features apply to subtask B only:

• Percentage of replying tweets classified as queries, denies or support: These 3 features represent the percentage of replying tweets assigned to each class by the system implemented for Subtask A, for this rumour's source tweet (a sketch follows).
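A small sketch of how these three percentages could be produced from the Subtask A classifier; the subtask_a_clf object and its scikit-learn-style predict interface are hypothetical:

# Sketch: turning Subtask A predictions over a rumour's replies into the
# three percentage features used for Subtask B.
from collections import Counter

def reply_class_percentages(subtask_a_clf, reply_feature_rows):
    """Fraction of replies predicted as query, deny and support."""
    predictions = subtask_a_clf.predict(reply_feature_rows)
    counts = Counter(predictions)
    total = float(len(predictions))
    return [counts[label] / total for label in ("query", "deny", "support")]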
5 Evaluation

Several scikit-learn classifiers were used during experimentation before deciding on the final model.

For subtask A, the linear support vector machine classifier (Linear SVC) proved the most accurate during cross-validation; however, logistic regression generalized best on the test data. During cross-validation, the macro-averaged F1 measure was used to evaluate the classifiers and choose the best among them, as the distribution of categories was clearly skewed towards comments.

For subtask B, Linear SVC proved to be the best in terms of both accuracy and the confidence root mean square error (RMSE).

Table 1 shows the features used for each subtask along with their types. Type 'content' refers to features determined from the tweet's text, 'user' refers to features determined from the user who tweeted and his behavior, and 'twitter' refers to twitter-specific features. Table 2 compares the accuracy of the different classifiers for each subtask.

Feature                                        Type     A  B
Question Existence                             content  Y  N
Denial term detection                          content  Y  N
Support words detection                        content  Y  N
Hashtag existence                              twitter  N  Y
URL existence                                  content  Y  Y
Tweet is reply                                 twitter  Y  -
Tweets' words' sentiments                      content  Y  N
Tweet sentiment                                content  N  N
Is User Verified                               user     N  N
Number of followers                            user     Y  N
Number of user's past tweets                   user     N  N
Number of user's friends                       user     N  N
Retweet Ratio                                  twitter  N  N
Photo Existence                                content  N  N
Days since user creation                       user     N  N
Source tweet user is verified                  user     Y  N
User 'replied to' is verified                  user     Y  -
Cosine similarity with root rumourous tweet    content  Y  -
Cosine similarity with the 'replied to' tweet  content  N  -
Percentage of replying tweets classified
  as queries, denies or support                content  -  Y

Table 1 – Features found useful for each subtask and used in the final evaluation.

Classifier                     A     B     B (RMSE)
Linear SVC                     0.71  0.53  0.67
Random Forest                  0.75  0.39  0.77
Linear SVM with SGD learning   0.72  0.50  0.73
Logistic Regression            0.76  0.53  0.71
Decision Tree                  0.71  0.46  0.73

Table 2 – The resulting accuracy and confidence RMSE for subtasks A and B.
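As a rough illustration of this comparison (not our exact code): the toy data below stands in for the real feature matrix, which combined bag-of-words and the features in Table 1, and all classifiers keep their scikit-learn defaults as stated in section 4:

# Sketch: cross-validate several scikit-learn classifiers with macro-averaged F1.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Toy 4-class data standing in for the SDQC feature matrix.
X, y = make_classification(n_samples=300, n_features=20, n_classes=4,
                           n_informative=10, random_state=0)

classifiers = {
    "Linear SVC": LinearSVC(),
    "Random Forest": RandomForestClassifier(),
    "Linear SVM with SGD learning": SGDClassifier(),
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
    print("%-28s macro-F1 = %.2f" % (name, scores.mean()))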
6 Conclusion

In this paper, we have presented a quick analysis of different pre-processing, feature extraction and selection, and learning classifiers, which achieved good results in the RumourEval task. For subtask A, a combination of different types of content, twitter and user-specific features was used. For subtask B, it was clear that only content and twitter features were useful. User-based features did not enhance the performance for the latter subtask; we therefore conclude that the identity and behavior of the user did not much affect the credibility of the rumour he/she is spreading, at least for the data set provided.

7 Future Work

Additional features could be extracted that play a better role in classifying each tweet or rumour. On the tweet-text level, better linguistic features could be extracted, and a better sentiment analysis model could be employed. On the rumour level, network-based features may be extracted, such as in the work done by Vosoughi (2015). Time-based analysis could be performed to detect certain patterns in the change of reactions to the rumour.
8 References

Derczynski, L., Bontcheva, K., Liakata, M., Procter, R., Wong Sak Hoi, G. and Zubiaga, A., 2017. SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours. In Proceedings of SemEval 2017.

Zubiaga, A., Liakata, M., Procter, R., Hoi, G.W.S. and Tolmie, P., 2016. Analysing how people orient to and spread rumours in social media by looking at conversational threads. PLoS ONE, 11(3), p.e0150989.

Vosoughi, S., 2015. Automatic detection and verification of rumors on Twitter. Doctoral dissertation, Massachusetts Institute of Technology.

Qazvinian, V., Rosengren, E., Radev, D.R. and Mei, Q., 2011. Rumor has it: Identifying misinformation in microblogs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 1589-1599). Association for Computational Linguistics.

Giasemidis, G., Singleton, C., Agrafiotis, I., Nurse, J.R., Pilgrim, A., Willis, C. and Greetham, D.V., 2016. Determining the veracity of rumours on Twitter. In International Conference on Social Informatics (pp. 185-205). Springer International Publishing.

Castillo, C., Mendoza, M. and Poblete, B., 2011. Information credibility on twitter. In Proceedings of the 20th International Conference on World Wide Web (pp. 675-684). ACM.

Hamidian, S. and Diab, M., 2015. Rumor detection and classification for twitter data. In The Fifth International Conference on Social Media Technologies, Communication, and Informatics, SOTICS, IARIA (pp. 71-77).

Liu, X., Nourbakhsh, A., Li, Q., Fang, R. and Shah, S., 2015. Real-time rumor debunking on twitter. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management (pp. 1867-1870). ACM.

Zhao, Z., Resnick, P. and Mei, Q., 2015. Enquiring minds: Early detection of rumors in social media from enquiry posts. In Proceedings of the 24th International Conference on World Wide Web (pp. 1395-1405). ACM.

Galitsky, B., 2015. Detecting rumor and disinformation by web mining. In 2015 AAAI Spring Symposium Series.

Chua, A.Y. and Banerjee, S., 2016. Linguistic predictors of rumor veracity on the Internet. In Proceedings of the International MultiConference of Engineers and Computer Scientists (Vol. 1).

Bird, S., Loper, E. and Klein, E., 2009. Natural Language Processing with Python. O'Reilly Media Inc.

Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P., Gramfort, A., Grobler, J. and Layton, R., 2013. API design for machine learning software: experiences from the scikit-learn project. arXiv preprint arXiv:1309.0238.

NLTK Tweet Tokenizer, 2015. Retrieved April 1, 2017, from http://www.nltk.org/api/nltk.tokenize.html

NLTK NPS Chat Corpus Reader, 2015. Retrieved April 1, 2017, from http://www.nltk.org/_modules/nltk/corpus/reader/nps_chat.html

NLTK Opinion Lexicon Corpus Reader, 2015. Retrieved April 1, 2017, from http://www.nltk.org/_modules/nltk/corpus/reader/opinion_lexicon.html

NLTK Categorized Sentences Corpus Reader. Retrieved April 1, 2017, from http://www.nltk.org/_modules/nltk/corpus/reader/categorized_sents.html

Scikit-learn Chi-Squared Feature Selection, 2010. Retrieved April 1, 2017, from http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html

Scikit-learn Variance Threshold Feature Selection, 2010. Retrieved April 1, 2017, from http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html

Scikit-learn Truncated SVD, 2010. Retrieved April 1, 2017, from http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
