Two-path Deep Semi-supervised Learning for Timely Fake News Detection

Xishuang Dong, Uboho Victor, and Lijun Qian, Senior Member, IEEE
Abstract—News in social media such as Twitter has been generated in high volume and speed. However, very few of them are labeled (as fake or true news) by professionals in near real time. In order to achieve timely detection of fake news in social media, a novel framework of two-path deep semi-supervised learning is proposed where one path is for supervised learning and the other is for unsupervised learning. The supervised learning path learns on the limited amount of labeled data while the unsupervised learning path is able to learn on a huge amount of unlabeled data. Furthermore, these two paths implemented with convolutional neural networks (CNN) are jointly optimized to complete semi-supervised learning. In addition, we build a shared CNN to extract the low-level features on both labeled data and unlabeled data to feed them into these two paths. To verify this framework, we implement a Word CNN based semi-supervised learning model and test it on two datasets, namely, LIAR and PHEME. Experimental results demonstrate that the model built on the proposed framework can recognize fake news effectively with very few labeled data.

Index Terms—Fake News Detection, Deep Semi-supervised Learning, Convolutional Neural Networks, Joint Optimization

X. Dong, U. Victor, and L. Qian are with the Center of Excellence in Research and Education for Big Military Data Intelligence (CREDIT Center), Department of Electrical and Computer Engineering, Prairie View A&M University, Texas A&M University System, Prairie View, TX 77446, USA. Email: xidong@pvamu.edu, uboho.dpc@outlook.com, liqian@pvamu.edu

I. INTRODUCTION

Social media (e.g., Twitter and Facebook) has become a new ecosystem for spreading news [1]. Nowadays, people are relying more on social media services rather than traditional media because of advantages such as social awareness, global connectivity, and real-time sharing of digital information. Unfortunately, social media is also full of fake news. Fake news consists of information that is intentionally and verifiably false, crafted to mislead readers and motivated by chasing personal or organizational profits [2]. For example, fake news propagated on Twitter like an infectious virus during the 2016 election cycle in the United States [3], [4]. Understanding what can be done to discourage fake news is of great importance.

One of the fundamental steps to discourage fake news would be timely fake news detection. Fake news detection [5], [6], [7] is to determine the truthfulness of the news by analyzing the news contents and related information such as propagation patterns. The problem has attracted a lot of attention from different aspects, where supervised learning based fake news detection dominates this domain. For instance, Ma et al. detect fake news with data representations of the contents that are learned on the labeled news [8]. Early detection is also an effective approach to recognize fake news by identifying the signature of text phrases in social media posts [9]. Moreover, temporal features play a crucial role in the fast-paced social media environment because information spreads more rapidly than in traditional media [10]. For example, detecting a burst of topics on social media can capture the variations in temporal patterns of news [11], [12]. Notably, deep learning based fake news detection achieves the state-of-the-art performance on different datasets [13], where both recurrent neural networks (RNN) and convolutional neural networks (CNN) are employed to recognize fake news [6], [14], [15]. However, since news spreads on social media at very high speed when an event happens, only very limited labeled data is available in practice for fake news detection, which is inadequate for a supervised model to perform well.

As an emerging task in the field of natural language processing (NLP), fake news detection requires big labeled data to meet the requirement of building supervised learning based detection models. However, annotating the news on social media is too expensive and costs a huge amount of human labor due to the huge size of social media data. Furthermore, this is almost impossible to achieve in near real time. In addition, it is difficult to ensure the annotation consistency for big data labeling [16], and the inconsistency worsens as the data size grows. Therefore, using unlabeled data to enhance fake news detection becomes a promising and increasingly urgent solution.

In this paper, we propose a deep semi-supervised learning framework built from two-path convolutional neural networks to accomplish timely fake news detection in the case of limited labeled data; the framework is shown in Figure 1. It consists of three components, namely, a shared CNN, a supervised CNN, and an unsupervised CNN. One path is composed of the shared CNN and the supervised CNN while the other is made of the shared CNN and the unsupervised CNN. Moreover, the architectures of these three CNNs can be similar or different, as determined by the application and performance. All data (labeled and unlabeled) are used to generate the mean squared error loss, while only labeled data are used to calculate the cross-entropy loss. Then a weighted sum of these two losses is used to optimize the proposed framework. We validate the proposed framework on detecting fake news using two datasets, namely, LIAR [6] and PHEME [17]. Experimental results demonstrate the effectiveness of the proposed framework even with very limited labeled data.

In summary, the contributions of this study are as follows:
[Figure 1: block diagram — samples x_i pass through word embedding into the shared CNN; its features feed the supervised CNN (softmax, cross entropy, labeled data only) and the unsupervised CNN (mean squared error, all data); the two losses enter a weighted sum with weight w(t) that drives the optimizer.]

Fig. 1. Framework of two-path deep semi-supervised learning. Samples x_i are inputs. Labels y_i are available only for the labeled inputs and the associated cross-entropy loss component is evaluated only for those. z_i and z'_i are outputs from the supervised CNN and the unsupervised CNN, respectively. y'_i is the predicted label for x_i. l_i is the cross-entropy loss and l'_i is the mean squared error loss. w(t) are the weights for joint optimization of l_i and l'_i.

• We propose a novel two-path deep semi-supervised learning (TDSL) framework containing three CNNs, where both labeled data and unlabeled data are used jointly to train the model and enhance the detection performance.
• We implement a Word CNN [18] based TDSL to detect fake news with limited labeled data and compare its performance with various deep learning based baselines. Specifically, we validate the implemented model by testing on the LIAR and PHEME datasets. It is observed that the proposed model detects fake news effectively even with extremely limited labeled data.
• The proposed framework could be applied to address other tasks. Furthermore, novel deep semi-supervised learning models can be implemented based on the proposed framework with various designs of CNNs, as determined by the intended applications and tasks.

II. RELATED WORK

Fake news detection has attracted a lot of attention in recent years. There are extensive studies following two main directions: content based methods and propagation pattern based methods. Content based methods typically involve two steps: preprocessing news contents and training a supervised learning model on the preprocessed contents. The first step usually involves tokenization, stemming, and/or weighting words [19], [20]. In the second step, Term Frequency-Inverse Document Frequency (TF-IDF) [21], [22] may be employed to build samples to train supervised learning models. However, the samples generated by TF-IDF will be sparse, especially for social media data. To overcome this challenge, word embedding methods such as word2vec [23] and GloVe [24] are used to convert words into vectors.

In addition, Mihalcea et al. [25] used linguistic inquiry and word count (LIWC) [26] to explore the difference of word usage between deceptive language and non-deceptive language. Recently, deep learning based models have been explored more than other supervised learning models [13]. For example, Rashkin et al. [14] built the detection model with two LSTM RNN models: one learns on simple word embeddings, and the other enhances the performance by concatenating long short-term memory (LSTM) outputs with LIWC feature vectors. Doc2vec [28] is also applied to represent content that is related to each social engagement. Attention based RNNs are employed to achieve better performance as well. Long et al. [15] incorporate the speaker names and the statement topics into the inputs of an attention based RNN. In addition, convolutional neural networks are also widely used since they succeed in many text classification tasks. Karimi et al. [29] proposed the Multi-source Multi-class Fake news Detection framework (MMFD), where a CNN analyzes local patterns of each text in a claim and an LSTM analyzes temporal dependencies in the entire text.

In propagation pattern based methods, the propagation patterns are extracted from time-series information of news spreading on social media such as Twitter and used as features for detecting fake news. For instance, to identify fake news from microblogs, Ma et al. [30] proposed the Dynamic Series-Time Structure (DSTS) to capture variations in social context features such as microblog contents and users over time for early detection of rumors. Lan et al. [31] proposed the Hierarchical Attention RNN (HARNN) that uses a Bi-GRU layer with the attention mechanism to capture high-level representations of rumor contents, and a gated recurrent unit (GRU) layer to extract semantic changes. Hashimoto et al. [32] visualized topic structures in time-series variation and sought help from external reliable sources to determine topic truthfulness.

In summary, most of the current methods are based on supervised learning, which requires a large amount of labeled data to implement the detection processes, especially for the deep learning based approaches. However, annotating the news on social media is too expensive and costs a huge amount of human labor due to the huge size of social media data. Furthermore, this is almost impossible to achieve in near real time. Even with labeled data, constructing the huge amount of labeled corpus is an extremely difficult task in the field of natural language processing as it costs a large volume of resources and it is challenging to guarantee the label consistency.
Algorithm 1 Learning in the proposed framework

Require: x_i = training sample
Require: S = set of labeled training samples
Require: y_i = label for labeled x_i, i ∈ S
Require: f_embedding(x) = word embedding
Require: f_θshared(x) = shared CNN with trainable parameters θ_shared
Require: f_θsup(x) = supervised CNN with trainable parameters θ_sup
Require: f_θunsup(x) = unsupervised CNN with trainable parameters θ_unsup
Require: w(t) = unsupervised weight ramp-up function
 1: for t in [1, num_epochs] do
 2:   for each minibatch B do
 3:     x'_{i∈B} ← f_embedding(x_{i∈B})                    ▷ represent words with word embedding
 4:     z_{i∈B} ← f_θsup(f_θshared(x'_{i∈B}))              ▷ evaluate supervised CNN outputs for inputs
 5:     z'_{i∈B} ← f_θunsup(f_θshared(x'_{i∈B}))           ▷ evaluate unsupervised CNN outputs for inputs
 6:     l ← −(1/|B|) Σ_{i∈B∩S} log f_softmax(z_i)[y_i]      ▷ supervised loss component
 7:     l' ← (1/(C|B|)) Σ_{i∈B} ‖z_i − z'_i‖²               ▷ unsupervised loss component
 8:     loss ← l + w(t) · l'                                ▷ total loss
 9:     update θ_shared, θ_sup, θ_unsup using, e.g., Adam   ▷ update parameters
10: return θ_shared, θ_sup, θ_unsup
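Algorithm 1 maps directly onto standard deep learning toolkits. The following is a minimal PyTorch-style sketch of one training epoch under assumptions the paper does not fix (module names, the exact ramp-up constants, and how unlabeled samples are masked); the authors' implementation is TensorFlow-based, so treat this as an illustration of the procedure rather than the reference code.

import math
import torch
import torch.nn.functional as F

def ramp_up_weight(epoch, num_epochs, max_weight=1.0):
    # Gaussian ramp-up from zero, in the spirit of [41]; the constant -5.0
    # and the ramp length are assumptions, not values reported in the paper.
    t = min(epoch / num_epochs, 1.0)
    return max_weight * math.exp(-5.0 * (1.0 - t) ** 2)

def train_epoch(embedding, shared_cnn, sup_cnn, unsup_cnn, loader,
                optimizer, epoch, num_epochs, num_classes):
    w_t = ramp_up_weight(epoch, num_epochs)
    for x, y, is_labeled in loader:      # is_labeled: boolean mask over B
        h = shared_cnn(embedding(x))     # shared low-level features (lines 3-4)
        z = sup_cnn(h)                   # supervised-path outputs z_i (line 4)
        z_prime = unsup_cnn(h)           # unsupervised-path outputs z'_i (line 5)

        # Cross-entropy over the labeled subset B ∩ S only (line 6).
        if is_labeled.any():
            l_sup = F.cross_entropy(z[is_labeled], y[is_labeled])
        else:
            l_sup = z.new_zeros(())

        # MSE between the two paths over the whole batch, scaled by 1/(C|B|)
        # as in line 7 of Algorithm 1.
        l_unsup = ((z - z_prime) ** 2).sum(dim=1).mean() / num_classes

        loss = l_sup + w_t * l_unsup     # weighted total loss (line 8)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                 # e.g., Adam (line 9)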

Therefore, it is imperative to incorporate unlabeled data together with labeled data in fake news detection to enhance the detection performance. Semi-supervised learning [33], [34] is a technique that is able to use both labeled data and unlabeled data. Hence, we propose a novel framework of deep semi-supervised learning via convolutional neural networks for timely fake news detection. In the next section, we present the proposed framework and show how to implement the deep semi-supervised learning model in detail.

III. MODEL

We propose a general framework of deep semi-supervised learning and apply it to accomplish fake news detection. Suppose the training data consist of a total of N inputs, out of which M are labeled. The inputs, denoted by x_j, where j ∈ 1...N, are the news contents that contain sentences related to fake news. In general, the news on social media normally contains a limited number of words, e.g., 100 or less, which will lead to data sparsity if we apply TF-IDF to extract features. To relieve the data sparsity problem, we employ word embedding techniques, for instance, word2vec [23], [35], [36], to represent the news contents. Here S is the set of labeled inputs, |S| = M. For every i ∈ S, we have a known correct label y_i ∈ 1...C, where C is the number of different classes.

The proposed framework of deep semi-supervised learning and the corresponding learning procedure are shown in Figure 1 and Algorithm 1, respectively. As shown in Figure 1, we evaluate the network for each training input x_i with the supervised path and the unsupervised path to complete two tasks, resulting in prediction vectors z_i and z'_i, respectively. One task is to learn how to mine patterns of fake news regarding the news labels while the other is to optimize the representations of news without the news labels. Specifically, before these two paths, there is a shared CNN to extract low-level features to feed the latter two CNNs. This is similar to deep multi-task learning [37], [38] since the low-level features are shared among the different tasks [39], [40]. The major difference between the proposed framework and deep multi-task learning is that tasks in the proposed framework involve both supervised learning and unsupervised learning, while all tasks in deep multi-task learning are only based on supervised learning.

In addition, these two paths can have independent CNNs with identical or different setups for supervised learning and unsupervised learning, respectively. They generate two prediction vectors that are new representations for the inputs with respect to their tasks. For identical setups of these two paths, i.e., using the same CNN structure for both paths, it is important to notice that, because of dropout regularization, training the CNNs in these two paths is a stochastic process. This will result in the two CNNs having different link weights and filters during training. It implies that there will be a difference between the two prediction vectors z_i and z'_i of the same input x_i. Given that the original input x_i is the same, this difference can be seen as an error, and thus minimizing the mean square error (MSE) is a reasonable objective in the learning procedure.

We utilize those two vectors z_i and z'_i to calculate the loss given by

$$\mathrm{Loss} = -\frac{1}{|B|}\sum_{i\in B\cap S}\log f_{\mathrm{softmax}}(z_i)[y_i] + w(t)\times\frac{1}{C|B|}\sum_{i\in B}\lVert z_i - z'_i\rVert^2, \qquad (1)$$

where B is the minibatch in the learning process. The loss consists of two components. As illustrated in Algorithm 1, l_i is the standard cross-entropy loss, evaluated for labeled inputs only. On the other hand, l'_i, evaluated for all inputs, penalizes different predictions for the same training input x_i by taking the mean squared error between z_i and z'_i.
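The dropout-induced discrepancy described above can be observed directly: two forward passes of the same input through a network in training mode generally disagree because each pass samples a different dropout mask. A toy sketch follows (the tiny network is hypothetical, and the paper further uses two independently parameterized path CNNs, so the two views differ even beyond the dropout effect):

import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(),
                    nn.Dropout(p=0.5), nn.Linear(32, 2))
net.train()                      # keep dropout active, as during training

x = torch.randn(1, 16)           # one input x_i
z = net(x)                       # first stochastic view, playing the role of z_i
z_prime = net(x)                 # second stochastic view, playing the role of z'_i

# Nonzero in general: exactly the discrepancy the MSE term penalizes.
print(((z - z_prime) ** 2).mean().item())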
[Figure 2: block diagram — token ids x_i are mapped to x'_i = e_1 ⊕ e_2 ⊕ ... ⊕ e_n by the word embedding; the shared CNN extracts low-level features; the supervised path yields r_1, r_2, r_3 (concatenated into z_i, followed by softmax and cross entropy against y_i, labeled data only) and the unsupervised path yields r'_1, r'_2, r'_3 (concatenated into z'_i, compared with z_i by mean squared error over all data); the weighted sum of losses with weight w(t) drives the optimizer.]

Fig. 2. Word-CNN based deep semi-supervised learning. In the shared CNN, the convolutional layers contain 100 (3 × 3) filters, 100 (4 × 4) filters, and 100 (5 × 5) filters, respectively. Both the supervised CNN and the unsupervised CNN have the same architecture as the shared CNN but with different filters, where each convolutional layer contains 100 (3 × 3) filters. We use (2 × 2) max-pooling for all pooling layers. ⊕ is the concatenation operator. r_1, r_2 and r_3 are outputs from the supervised path while r'_1, r'_2 and r'_3 are those from the unsupervised path. Furthermore, we concatenate r_1, r_2 and r_3 to construct z_i and concatenate r'_1, r'_2 and r'_3 to generate z'_i.
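A minimal sketch of the three modules in Figure 2 is given below. It hews to the stated filter counts and kernel sizes but fills in details the caption leaves open (padding, the final pooling, and a linear projection to produce z_i); those choices, as well as the vocabulary size and sequence length in the usage lines, are assumptions for illustration only.

import torch
import torch.nn as nn

class SharedCNN(nn.Module):
    # Three parallel branches over the embedded sentence, with 100 filters
    # of sizes 3x3, 4x4 and 5x5 each, followed by 2x2 max-pooling.
    def __init__(self):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, 100, kernel_size=s, padding=s // 2),
                          nn.ReLU(), nn.MaxPool2d(2))
            for s in (3, 4, 5)])

    def forward(self, x):                  # x: (batch, 1, n, k)
        return [branch(x) for branch in self.branches]

class PathCNN(nn.Module):
    # One path (supervised or unsupervised): a 100-filter 3x3 convolution
    # per branch, pooling, then concatenation of r1, r2, r3 into one vector.
    def __init__(self, out_dim):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Sequential(nn.Conv2d(100, 100, kernel_size=3, padding=1),
                          nn.ReLU(), nn.AdaptiveMaxPool2d(1))
            for _ in range(3)])
        self.fc = nn.Linear(300, out_dim)  # assumed projection to z_i

    def forward(self, feats):
        r = [conv(f).flatten(1) for conv, f in zip(self.convs, feats)]
        return self.fc(torch.cat(r, dim=1))

# Usage: embed a batch of token ids, then run both paths on shared features.
embedding = nn.Embedding(30000, 128)       # vocabulary size is an assumption
shared, sup, unsup = SharedCNN(), PathCNN(out_dim=2), PathCNN(out_dim=2)
tokens = torch.randint(0, 30000, (4, 100))     # batch of 4 posts, n = 100
x = embedding(tokens).unsqueeze(1)             # (4, 1, 100, 128) = (B, 1, n, k)
feats = shared(x)
z, z_prime = sup(feats), unsup(feats)          # prediction vectors z_i and z'_i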

To combine the supervised loss l_i and the unsupervised loss l'_i, we scale the latter by a time-dependent weighting function w(t) [41] that ramps up, starting from zero, along a Gaussian curve. In the beginning of training, the total loss and the learning gradients are dominated by the supervised loss component, i.e., by the labeled data only. Unlabeled data will contribute more than the labeled data at a later stage of training.

Based on the proposed framework, we implement a deep semi-supervised learning model with Word CNN [18], whose architecture is shown in Figure 2. The Word CNN is a powerful classifier with a simple CNN architecture for sentence classification [18]. Considering fake news detection, let e_i ∈ R^k be the k-dimensional word vector corresponding to the i-th word of the sentence in the news. A sentence of length n is represented as

$$x'_i = e_1 \oplus e_2 \oplus \ldots \oplus e_n, \qquad (2)$$

where ⊕ is the concatenation operator. A convolution operation involves a filter c ∈ R^{hk}, which is applied to a window of h words to produce a new feature. The pooling operation deals with variable sentence lengths. As shown in Figure 2, the input x_i is represented as x'_i with the word embedding¹. Then the representation x'_i is input into three convolutional layers followed by pooling layers to extract low-level features in the shared CNN. The extracted low-level features are given to the supervised CNN and the unsupervised CNN, respectively. Moreover, the outputs from the pooling layers in these two paths are concatenated into two vectors z_i and z'_i, respectively. For the supervised path, we use the label y_i and the result generated by running softmax on z_i to build the supervised loss with the cross entropy. On the other hand, the two vectors z_i and z'_i are employed to construct the unsupervised loss with the mean squared error. Finally, the supervised loss and the unsupervised loss are jointly optimized with the Adam optimizer, as shown in Algorithm 1.

¹ https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup

Although there are a few related works in the literature such as the Π model [41], there exist significant differences. Firstly, the Π model is designed for processing image data while the proposed framework is more suitable for problems related to NLP. Secondly, the Π model combines image data augmentation with dropout to generate its two outputs, which cannot be carried over to language processing since data augmentation for text contents is difficult to implement. Finally, instead of using a one-path CNN, we apply two independent CNNs to generate those two outputs. Furthermore, the proposed model is more flexible as the two independent paths can be tuned for specific goals.

IV. EXPERIMENT

A. Datasets

To validate the effectiveness of the proposed framework on fake news detection, we test the implemented model on two benchmarks: LIAR and PHEME.
[Figure 3: five panels (Charlie Hebdo, Ferguson, Germanwings-crash, Ottawa shooting, Sydney siege), each a horizontal bar chart of that event's top 50 words ranked by TF-IDF value (×10⁻³).]

Fig. 3. An example of the difference of word distributions between five events in the PHEME dataset. The x-axis indicates the TF-IDF value of the word while the y-axis shows the top 50 words ranked by corresponding TF-IDF values.
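Rankings of the kind shown in Figure 3 can be reproduced in spirit with any TF-IDF implementation [42]. A sketch using scikit-learn follows, where the grouping of tweets by event, the stop-word handling, and the choice to treat each event as one document are assumptions rather than the paper's stated procedure:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_words_per_event(event_texts, k=50):
    # event_texts: dict mapping an event name to a list of tweet strings.
    # Concatenating each event's tweets into one document makes the IDF
    # term contrast the five events against each other.
    names = list(event_texts)
    docs = [" ".join(texts) for texts in event_texts.values()]
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(docs).toarray()
    vocab = np.array(vectorizer.get_feature_names_out())
    return {name: vocab[np.argsort(tfidf[i])[::-1][:k]].tolist()
            for i, name in enumerate(names)}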

1) LIAR: LIAR [6] is a recent benchmark dataset for fake news detection. This dataset includes 12,836 real-world short statements collected from PolitiFact², where editors handpicked the claims from a variety of occasions such as debates, campaigns, Facebook, Twitter, interviews, ads, etc. Each statement is labeled with a six-grade truthfulness, namely, true, false, half-true, pants-fire, barely-true, and mostly-true. Information about the subjects, party, context, and speakers is also included in this dataset. In this paper, this benchmark contains three parts: a training dataset with 10,269 statements, a validation dataset with 1,284 statements, and a testing dataset with 1,283 statements. Furthermore, we reorganize the data into two classes by treating the five classes false, half-true, pants-fire, barely-true, and mostly-true as the Fake class and true as the True class. Therefore, fake news detection on this benchmark becomes a binary classification task.

² https://www.politifact.com

2) PHEME: We employ PHEME [17] to verify the effectiveness of the proposed framework on social media data; it collects 330 Twitter threads. Tweets in PHEME are labeled as true or false in terms of thread structures and follow-follower relationships. The PHEME dataset is related to nine events, whereas this paper only focuses on the five main events, namely, Germanwings-crash (GC), Charlie Hebdo (CH), Sydney siege (SS), Ferguson (FE), and Ottawa shooting (OS). It has different levels of annotations such as the thread level and the tweet level. We only adopt the annotations on the tweet level and thus classify the tweets as fake or true. The detailed distribution of tweets and classes is shown in Table II after the data is preprocessed, e.g., by removing data redundancy. It is observed that the class distribution differs among these events. For example, three events, CH, SS, and FE, have more true news while the events GC and OS have more fake news. Furthermore, class distributions for events such as CH, SS, and FE are significantly imbalanced, which will be a challenge to the detection task (a binary classification task).

In addition, the word distributions vary among the five events; one example is shown in Figure 3, where 50 words are ranked by their TF-IDF values, TF-IDF being used to evaluate the word relevance to the event [42]. It is observed that the top ranked words are very different for different events. Moreover, even for the same word, its relevance is not the same for different events. For example, the relevance (TF-IDF value) of the word "thank" differs between events including Charlie Hebdo (CH), Ferguson (FE), and Ottawa shooting (OS).

In addition, we perform leave-one-event-out (LOEO) cross-validation [43], which is closer to the realistic scenario where we have to verify unseen truth. For example, the training data can contain the events GC, CH, SS, and FE whereas the testing data will contain the event OS. However, as highlighted in Figure 3, this fake news detection task becomes very difficult since the training data has a very different data distribution from that of the testing data.

It should be noted that the data in the original datasets is fully labeled. We employ all labeled data to train the baseline models.
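Both preprocessing choices described in this subsection are easy to express in code. A sketch of the LIAR label binarization and the leave-one-event-out split for PHEME follows; the record structure, field names, and event identifiers are assumptions, not the datasets' actual schemas:

# LIAR: collapse the six-grade truthfulness labels into two classes.
FAKE_GRADES = {"false", "half-true", "pants-fire", "barely-true", "mostly-true"}

def binarize_liar(label):
    return "fake" if label in FAKE_GRADES else "true"

# PHEME: leave-one-event-out (LOEO) cross-validation over the five events.
EVENTS = ["germanwings-crash", "charlie-hebdo", "sydney-siege",
          "ferguson", "ottawa-shooting"]

def loeo_splits(tweets):
    # tweets: list of dicts, each with at least 'event' and 'text' keys.
    for held_out in EVENTS:
        train = [t for t in tweets if t["event"] != held_out]
        test = [t for t in tweets if t["event"] == held_out]
        yield held_out, train, test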
Compared to the baselines, we train the proposed models with only partially labeled data; the percentage of labeled data used for training our proposed models ranges from 1% to 30%.

TABLE I
Hyper-parameters for the training of Word CNN based TDSL.

Hyper-parameters | Values
Dropout | 0.5
Minibatch size | 128
Number of epochs | 200
Maximum learning rate | 1e-4

TABLE II
Number of tweets and class distribution in the PHEME dataset.

Events | Tweets | Fake | True
Germanwings-crash | 3,920 | 2,220 | 1,700
Charlie Hebdo | 34,236 | 6,452 | 27,784
Sydney siege | 21,837 | 7,765 | 14,072
Ferguson | 21,658 | 5,952 | 15,706
Ottawa shooting | 10,848 | 5,603 | 5,245
Total | 92,499 | 27,992 | 64,507

B. Experiment Setup

The key hyper-parameters for the implemented model are shown in Table I, and we employ the Adam optimizer to complete the training of the implemented proposed model (TDSL).

In addition, we employ five deep supervised learning models as baselines: 1) Word-level CNN (Word CNN) [18], 2) Character-level CNN (Char CNN) [44], 3) Very Deep CNN (VD CNN) [45], 4) Attention-Based Bidirectional RNN (Att RNN) [46], and 5) Recurrent CNN (RCNN) [47], where these models perform well on text classification. Specifically, Word CNN performs well on sentence classification, which makes it well suited to social media data, whose content is short like that of a sentence. In addition, we build 6) a word-level bidirectional RNN (Word RNN) as a further comparison for the implemented model, where Word RNN contains one embedding layer and one bidirectional RNN layer, and concatenates all the outputs from the RNN layer to feed the final layer, which is a fully-connected layer. Thus, there are in total 6 baseline models. Note that the baseline models use all labeled data from the original datasets.

C. Evaluation

We apply different metrics to evaluate the performance of fake news detection regarding the task features on these two benchmarks.

• LIAR: We employ accuracy, precision, recall and Fscore to evaluate the detection performance. Accuracy is calculated by dividing the number of statements detected correctly by the total number of statements:

$$\mathrm{Accuracy} = \frac{N_{correct}}{N_{total}}. \qquad (3)$$

In addition, we employ the Fscore of each class to check the performance since the task is a binary text classification:

$$\mathrm{Fscore} = \frac{2\times\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}, \qquad (4)$$

where Precision indicates the capability of a model to identify only fake statements and Recall computes the aptness to retrieve all corresponding fake statements:

$$\mathrm{Precision} = \frac{TP}{TP+FP}. \qquad (5)$$

$$\mathrm{Recall} = \frac{TP}{TP+FN}. \qquad (6)$$

Here TP (True Positive) counts the news whose predicted label matches the annotated label, FP (False Positive) counts the news recognized as fake whose annotation does not match, and FN (False Negative) counts the fake news that the model fails to predict as such.

• PHEME: Accuracy is one of the common evaluation metrics to measure the performance of fake news detection on this dataset [13]. However, we also evaluate the performance based on the Fscore since our task on the PHEME dataset is binary text classification with imbalanced data. Specifically, as we perform leave-one-event-out cross-validation on the PHEME dataset, we utilize the macro-averaged Fscore [48] to evaluate the overall performance of mining fake news across different events:

$$\mathrm{MacroF} = \frac{1}{T}\sum_{t=1}^{T}\mathrm{Fscore}_t. \qquad (7)$$

$$\mathrm{MacroP} = \frac{1}{T}\sum_{t=1}^{T}\mathrm{Precision}_t. \qquad (8)$$

$$\mathrm{MacroR} = \frac{1}{T}\sum_{t=1}^{T}\mathrm{Recall}_t. \qquad (9)$$

Here T denotes the total number of events and Fscore_t, Precision_t, Recall_t are the Fscore, Precision, and Recall values in the t-th event. Additionally, we use the macro-averaged accuracy over the five events to examine performance:

$$\mathrm{MacroA} = \frac{1}{T}\sum_{t=1}^{T}\mathrm{Accuracy}_t. \qquad (10)$$

The main goal for learning from imbalanced datasets is to improve the recall without hurting the precision. However, the recall and precision goals are often conflicting: when increasing the true positives (TP) for the minority class (True), the number of false positives (FP) can also be increased, which reduces the precision [49]. It means that when MacroP is increased, MacroR might be decreased in the case of PHEME.
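Equations (7)–(10) average per-event scores rather than pooling predictions across events; a sketch, assuming binary labels with the fake class encoded as 1:

def binary_scores(y_true, y_pred):
    # Per-event accuracy, precision, recall and Fscore, Eqs. (3)-(6).
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fscore = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, precision, recall, fscore

def macro_scores(per_event_results):
    # per_event_results: list of (y_true, y_pred) pairs, one per event (T = 5).
    scores = [binary_scores(y, p) for y, p in per_event_results]
    T = len(scores)
    return tuple(sum(s[i] for s in scores) / T for i in range(4))  # A, P, R, F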
D. Results and Discussion

We compare the two-path deep semi-supervised learning model (TDSL) implemented based on Word CNN with the baselines on two datasets: LIAR and PHEME. In addition, we examine the effects on the performance when applying different hyper-parameters. All evaluation results are average values over five runs.

TABLE III
Comparing performance between baselines and the proposed model (TDSL) on the LIAR dataset. The baselines, namely, Word CNN, Char CNN, VD CNN, RCNN, Word RNN, and Att RNN, are built with the training data that is fully labeled. On the contrary, we only apply 1% and 30% labeled training data and the rest of unlabeled training data to accomplish learning of the proposed model.

Model | Accuracy | Precision | Recall | Fscore
Word CNN [18] | 85.01% | 83.57% | 99.94% | 91.02%
Char CNN [44] | 77.93% | 84.09% | 91.41% | 87.59%
VD CNN [45] | 83.77% | 83.56% | 99.88% | 91.00%
RCNN [47] | 79.30% | 84.19% | 89.66% | 86.81%
Word RNN | 72.44% | 84.19% | 81.52% | 82.79%
Att RNN [46] | 75.90% | 84.26% | 86.79% | 85.52%
TDSL (1%) | 79.81% | 83.62% | 94.30% | 88.62%
TDSL (30%) | 83.36% | 83.59% | 99.64% | 90.91%

1) LIAR: Table III presents the performance comparison on the LIAR dataset. When we focus on the baselines, Word CNN outperforms the other baselines with respect to accuracy, recall, and Fscore. However, with respect to precision, Att RNN is better than the other baselines. It means we should select specific models when certain evaluation metrics matter most. In addition, we present results generated by the proposed TDSL, where we utilize 1% and 30% of the labeled data and the rest of the unlabeled data to build the deep semi-supervised model. It is observed that the performance is strengthened by increasing the amount of labeled data for training TDSL. Notably, even when we use a small amount of labeled data, we still obtain acceptable performance. For example, when we use 1% of the labeled training data to construct Word CNN based TDSL, compared with Word CNN, its accuracy and Fscore are only reduced by about 6% and 3%, respectively. Moreover, the accuracy, recall and Fscore of TDSL (1%) are better than those of some baselines such as Char CNN, RCNN, Word RNN, and Att RNN. It means that the deep semi-supervised learning model built based on the proposed framework is able to detect fake news even with little labeled data.

TABLE IV
Comparing performances generated by the proposed model (TDSL) learning on different ratios of labeled training data and the rest of unlabeled training data.

Ratio of Labeled Data | Accuracy | Precision | Recall | Fscore
1% | 79.81% | 83.62% | 94.30% | 88.62%
3% | 80.71% | 83.69% | 95.54% | 89.21%
5% | 80.74% | 83.70% | 95.59% | 89.23%
8% | 82.24% | 83.56% | 98.02% | 90.21%
10% | 82.52% | 83.58% | 98.41% | 90.39%
30% | 83.36% | 83.59% | 99.84% | 90.91%

In Table IV, we show how the ratio of labeled data affects the detection performance generated with TDSL. When we increase the ratio of labeled data step by step, the performances such as accuracy, recall, and Fscore are improved significantly while only precision is relatively stable. Moreover, when we use 30% labeled data, the performance is similar to the state-of-the-art obtained with Word CNN. Specifically, the precision is decreased slightly when increasing the ratio of labeled data. The supervised loss, cross entropy, is defined in terms of the prediction accuracy. Therefore, we are able to gain higher accuracy when adding more labeled data into the training procedure, but there is no guarantee to increase precision.

Additionally, we examine the performance effects of different hyper-parameters in Figures 4, 5, and 6. In Figure 4, we focus on the effects of different batch sizes. When we train the model with fewer labeled data, the different batch sizes affect the performance more significantly. For instance, the performance hardly changes when there are 30% labeled data while the performance varies when there are only 1% labeled data. This is because more information on the ground truth will be embedded in the training samples when the batch size is large, which results in robust prediction.

In Figures 5 and 6, we examine the performance effects of different embedding sizes and learning rates. In Figure 5, there are similar observations to those of the case of batch size. For example, in the case of more labeled data for training, different embedding sizes don't affect the performance significantly. On the contrary, for fewer labeled data, the embedding size should be chosen carefully because different embedding sizes will lead to different performances. In addition, a larger embedding size enhances the performance. In Figure 6, we observe that there is no significant difference when using different learning rates in the cases of 10% labeled data and 30% labeled data, while only a small performance difference can be observed in the case of 1% labeled data.

2) PHEME: Compared to the LIAR dataset, the PHEME dataset introduces new challenges such as imbalanced class distributions and word distribution differences among the events. Therefore, it leads to different observations compared to the case of LIAR. Table V shows the performance comparison on the PHEME dataset. When we examine the baselines, VD CNN outperforms the other baselines with respect to MacroA, MacroR, and MacroF.

TABLE V
Comparing performance between baselines and the proposed model (TDSL) on the PHEME dataset. The baselines, namely, Word CNN, Char CNN, VD CNN, RCNN, Word RNN, and Att RNN, are built with the training data that is fully labeled. On the contrary, we only apply 1% and 30% labeled training data and the rest of unlabeled training data to accomplish learning of the proposed model.

Model | MacroA | MacroP | MacroR | MacroF
Word CNN [18] | 61.75% | 50.82% | 17.60% | 24.03%
Char CNN [44] | 63.68% | 50.66% | 19.91% | 26.73%
VD CNN [45] | 65.42% | 49.21% | 30.04% | 28.50%
RCNN [47] | 60.62% | 45.86% | 16.40% | 22.24%
Word RNN | 59.70% | 45.57% | 22.89% | 28.22%
Att RNN [46] | 60.32% | 45.58% | 25.49% | 31.15%
TDSL (1%) | 56.19% | 38.83% | 18.73% | 21.13%
TDSL (30%) | 60.64% | 41.14% | 4.77% | 6.75%
[Figure 4: three grouped bar charts (1%, 10%, 30% labeled data), with bars for batch sizes 128, 256, and 512 over Accuracy, Precision, Recall, and Fscore.]

Fig. 4. Different performances generated with three batch sizes, 128, 256, and 512, on three ratios of labeled data, namely 1%, 10%, and 30%. The x-axis is for different evaluation metrics while the y-axis is for performance. Different color bars illustrate different batch sizes, where green bars are for batch size 128, blue bars are for batch size 256, and red bars are for batch size 512.

[Figure 5: three grouped bar charts (1%, 10%, 30% labeled data), with bars for embedding sizes 64, 128, and 256 over Accuracy, Precision, Recall, and Fscore.]

Fig. 5. Different performances generated with three embedding sizes, 64, 128, and 256, on three ratios of labeled data, namely 1%, 10%, and 30%. The x-axis is for different evaluation metrics while the y-axis is for performance. Different color bars show different embedding sizes, where green bars are for embedding size 64, blue bars are for embedding size 128, and red bars are for embedding size 256.

TABLE VI
Comparing performances generated by the proposed model (TDSL) learning on different ratios of labeled training data from the PHEME dataset.

Labeled Ratio | MacroA | MacroP | MacroR | MacroF
1% | 56.19% | 38.83% | 18.73% | 21.13%
3% | 58.58% | 39.38% | 13.12% | 17.83%
5% | 58.40% | 39.18% | 12.58% | 16.31%
8% | 59.74% | 40.48% | 8.11% | 11.18%
10% | 59.84% | 40.38% | 7.08% | 10.49%
30% | 60.64% | 41.14% | 4.77% | 6.75%

TABLE VII
Comparing performance with different batch sizes on the PHEME dataset. We choose three cases of ratios of labeled training data, namely, 1%, 10%, and 30%.

1% Labeled Data
Batch size | MacroA | MacroP | MacroR | MacroF
128 | 56.19% | 38.83% | 18.73% | 21.13%
256 | 57.78% | 39.72% | 18.50% | 23.52%
512 | 57.78% | 39.07% | 15.73% | 20.43%

10% Labeled Data
Batch size | MacroA | MacroP | MacroR | MacroF
128 | 59.84% | 40.38% | 7.08% | 10.49%
256 | 60.08% | 40.46% | 8.55% | 12.49%
512 | 59.02% | 40.81% | 12.96% | 17.18%

30% Labeled Data
Batch size | MacroA | MacroP | MacroR | MacroF
128 | 60.64% | 41.14% | 4.77% | 6.75%
256 | 60.59% | 41.50% | 7.09% | 10.77%
512 | 59.94% | 42.77% | 8.51% | 12.27%

However, considering MacroP, Word CNN is better than the other baselines. In addition, we observe that the performance (MacroA) is enhanced when increasing the ratio of labeled data for training TDSL. Moreover, even when we use a small amount of labeled data, we still obtain acceptable performance. For example, when we use 1% of the labeled training data to construct Word CNN based TDSL, compared with VD CNN, its MacroA and MacroF are only reduced by about 9% and 7%, respectively.
[Figure 6: three grouped bar charts (1%, 10%, 30% labeled data), with bars for learning rates 1e-4 and 1e-3 over Accuracy, Precision, Recall, and Fscore.]

Fig. 6. Different performances generated with two learning rates, 1e-3 and 1e-4, on three ratios of labeled data, namely 1%, 10%, and 30%. The x-axis is for different evaluation metrics while the y-axis is for performance. Different color bars indicate different learning rates, where blue bars are for learning rate 1e-3, and red bars are for learning rate 1e-4.

TABLE VIII
Comparing performance with different embedding sizes on the PHEME dataset. We choose three cases of ratios of labeled training data, namely, 1%, 10%, and 30%.

1% Labeled Data
Embedding size | MacroA | MacroP | MacroR | MacroF
64 | 57.38% | 39.20% | 16.57% | 20.78%
128 | 56.19% | 38.83% | 18.73% | 21.13%
256 | 57.13% | 38.86% | 18.07% | 22.23%

10% Labeled Data
Embedding size | MacroA | MacroP | MacroR | MacroF
64 | 59.73% | 40.44% | 7.91% | 11.06%
128 | 59.84% | 40.38% | 7.08% | 10.49%
256 | 58.89% | 40.27% | 11.67% | 15.22%

30% Labeled Data
Embedding size | MacroA | MacroP | MacroR | MacroF
64 | 60.79% | 42.32% | 4.37% | 6.99%
128 | 60.64% | 41.14% | 4.77% | 6.75%
256 | 60.63% | 42.54% | 4.63% | 6.66%

However, the MacroF is decreased while MacroA is increased when adding more labeled data for training. There are two reasons for this observation. One is that the learning of TDSL aims to optimize the accuracy, but not the Fscore. The other is that the data distribution of the training data is different from that of the testing data since we utilize the leave-one-event-out policy to complete the validation, which breaks the assumption that the training data should share the same distribution as the testing data. The more labeled data there is, the more serious the difference in distribution becomes.

Similar to the case of LIAR, in Table VI, we illustrate how the ratio of labeled data affects the detection performance. When we increase the ratio of labeled data step by step, the MacroA is improved as well, but the MacroF is reduced significantly. Specifically, MacroR and MacroF are decreased significantly when increasing the ratio of labeled data. The supervised loss, cross entropy, is defined in terms of the prediction accuracy. Therefore, there is no guarantee to increase MacroR and MacroF when adding more labeled data into the training procedure.

In Tables VII and VIII, we examine the performance differences when choosing different batch sizes and embedding sizes, respectively. We observe similar trends regarding MacroA and MacroP, where there is no big difference in MacroA and MacroP when choosing different batch sizes and embedding sizes for building the proposed model. However, MacroR is changed more significantly by different batch sizes when compared to the case of embedding sizes.

Moreover, in Figures 7, 8, and 9, we show the detailed performances for the five events when choosing different batch sizes to train the model on different ratios of labeled data. When examining the results shown in Figure 7, MacroA is increased for the events Charlie Hebdo, Sydney siege, and Ferguson when increasing the amount of labeled data, whereas for the events Germanwings-crash and Ottawa shooting, the MacroA is decreased. This is because more imbalanced classes involved in the training procedure will lead to reduced performance. For MacroR and MacroF, the performance is reduced when raising the ratios of labeled data. In addition, similar observations can be obtained in terms of the results shown in Figures 8 and 9. However, the difference is that larger batch sizes can mitigate these performance effects.

Finally, we examine the performance differences when choosing two different embedding sizes, namely 64 and 256, where the results are shown in Figures 10 and 11, respectively. MacroA and MacroP are increased for the events Charlie Hebdo, Sydney siege, and Ferguson when increasing the amount of labeled data for training, whereas for the events Germanwings-crash and Ottawa shooting, the MacroA is decreased. It is caused by the same reason as in the case of batch size.

In summary, in terms of the observations from the aforementioned results, increasing the amount of labeled data for training TDSL will enhance performance when the training data has a similar distribution to the testing data, for instance, in the case of LIAR. In addition, for both benchmarks, we can obtain acceptable performance even using extremely limited labeled data for training.

[Figure 7: five panels (Charlie Hebdo, Ferguson, Germanwings-crash, Ottawa shooting, Sydney siege), each a grouped bar chart of MacroA, MacroP, MacroR, and MacroF for 1%, 10%, and 30% labeled data.]

Fig. 7. Comparing detailed performances generated with batch size 128 for five events. The x-axis is for different evaluation metrics while the y-axis is for performance. Different color bars are for different ratios of labeled data, where green bars are for 1%, blue bars are for 10%, and red bars are for 30%.

However, we should pay more attention to choosing the ratio of labeled data when processing an imbalanced classification task, for example, in the case of PHEME. Meanwhile, we should carefully choose the hyper-parameters if we plan to obtain reasonable performance.

V. CONCLUSION AND FUTURE WORK

In this paper, a novel framework of deep semi-supervised learning is proposed for fake news detection. Because of the fast propagation of fake news, timely detection is critical to mitigate its effects. However, usually very few data samples can be labeled in a short time, which in turn makes supervised learning models infeasible. Hence, a deep semi-supervised learning model is implemented based on the proposed framework. The two paths in the model generate the supervised loss (cross-entropy) and the unsupervised loss (mean squared error), respectively. Then training is performed by jointly optimizing these two losses. Experimental results indicate that the implemented model could detect fake news on the two benchmarks, LIAR and PHEME, effectively using very limited labeled data and a large amount of unlabeled data. Furthermore, given the data distribution differences between training and testing datasets in the case of PHEME, and using the leave-one-event-out cross-validation, increasing the percentage of labeled data to train the semi-supervised model does not automatically imply performance improvement. In the future, we plan to examine the proposed framework on other NLP tasks such as sentiment analysis and dependency analysis.

ACKNOWLEDGMENT

This research work is supported in part by the U.S. Office of the Under Secretary of Defense for Research and Engineering (OUSD(R&E)) under agreement number FA8750-15-2-0119. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Office of the Under Secretary of Defense for Research and Engineering (OUSD(R&E)) or the U.S. Government.

REFERENCES

[1] G. Pennycook and D. G. Rand, "Fighting misinformation on social media using crowdsourced judgments of news source quality," Proceedings of the National Academy of Sciences, p. 201806781, 2019.
[2] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, "Fake news detection on social media: A data mining perspective," ACM SIGKDD Explorations Newsletter, vol. 19, no. 1, pp. 22–36, 2017.
[3] H. Allcott and M. Gentzkow, "Social media and fake news in the 2016 election," Journal of Economic Perspectives, vol. 31, no. 2, pp. 211–236, 2017.
[4] N. Grinberg, K. Joseph, L. Friedland, B. Swire-Thompson, and D. Lazer, "Fake news on Twitter during the 2016 U.S. presidential election," Science, vol. 363, no. 6425, pp. 374–378, 2019.
[5] D. Hovy, "The enemy in your own camp: How well can we detect statistically-generated fake reviews – an adversarial study," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), vol. 2, 2016, pp. 351–356.
[6] W. Y. Wang, "'Liar, liar pants on fire': A new benchmark dataset for fake news detection," arXiv preprint arXiv:1705.00648, 2017.
[7] M. Potthast, J. Kiesel, K. Reinartz, J. Bevendorff, and B. Stein, "A stylometric inquiry into hyperpartisan and fake news," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 231–240.
[8] J. Ma, W. Gao, P. Mitra, S. Kwon, B. J. Jansen, K.-F. Wong, and M. Cha, "Detecting rumors from microblogs with recurrent neural networks," in IJCAI, 2016, pp. 3818–3824.
[9] Z. Zhao, P. Resnick, and Q. Mei, "Enquiring minds: Early detection of rumors in social media from enquiry posts," in Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015, pp. 1395–1405.
[10] S. Kwon, M. Cha, K. Jung, W. Chen, and Y. Wang, "Prominent features of rumor propagation in online social media," in 2013 IEEE 13th International Conference on Data Mining. IEEE, 2013, pp. 1103–1108.
[Figure 8: five panels (Charlie Hebdo, Ferguson, Germanwings-crash, Ottawa shooting, Sydney siege), each a grouped bar chart of MacroA, MacroP, MacroR, and MacroF for 1%, 10%, and 30% labeled data.]

Fig. 8. Comparing detailed performances generated with batch size 256 for five events. The x-axis is for different evaluation metrics while the y-axis is for performance. Different color bars show different ratios of labeled data, where green bars are for 1%, blue bars are for 10%, and red bars are for 30%.

[Figure 9: five panels (Charlie Hebdo, Ferguson, Germanwings-crash, Ottawa shooting, Sydney siege), each a grouped bar chart of MacroA, MacroP, MacroR, and MacroF for 1%, 10%, and 30% labeled data.]

Fig. 9. Comparing detailed performances generated with batch size 512 for five events. The x-axis is for different evaluation metrics while the y-axis is for performance. Different color bars indicate different ratios of labeled data, where green bars are for 1%, blue bars are for 10%, and red bars are for 30%.

[11] Y. Chang, M. Yamada, A. Ortega, and Y. Liu, "Lifecycle modeling for buzz temporal pattern discovery," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 11, no. 2, p. 20, 2016.
[12] ——, "Ups and downs in buzzes: Life cycle modeling for temporal pattern discovery," in 2014 IEEE International Conference on Data Mining. IEEE, 2014, pp. 749–754.
[13] R. Oshikawa, J. Qian, and W. Y. Wang, "A survey on natural language processing for fake news detection," arXiv preprint arXiv:1811.00770, 2018.
[14] H. Rashkin, E. Choi, J. Y. Jang, S. Volkova, and Y. Choi, "Truth of varying shades: Analyzing language in fake news and political fact-checking," in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2931–2937.
[15] Y. Long, Q. Lu, R. Xiang, M. Li, and C.-R. Huang, "Fake news detection through multi-perspective speaker profiles," in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2017, pp. 252–256.
[16] C. Bosco, V. Lombardo, L. Lesmo, and V. Daniela, "Building a treebank for Italian: a data-driven annotation schema," in LREC 2000. ELDA, 2000, pp. 99–105.
[17] A. Zubiaga, M. Liakata, R. Procter, G. W. S. Hoi, and P. Tolmie, "Analysing how people orient to and spread rumours in social media by looking at conversational threads," PLoS ONE, vol. 11, no. 3, p. e0150989, 2016.
[18] Y. Kim, "Convolutional neural networks for sentence classification," arXiv preprint arXiv:1408.5882, 2014.
[19] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[20] S. Tong and D. Koller, "Support vector machine active learning with applications to text classification," Journal of Machine Learning Research, vol. 2, no. Nov, pp. 45–66, 2001.
[21] J. Ramos et al., "Using TF-IDF to determine word relevance in document queries," in Proceedings of the First Instructional Conference on Machine Learning, vol. 242. Piscataway, NJ, 2003, pp. 133–142.
[Figure 10: five panels (Charlie Hebdo, Ferguson, Germanwings-crash, Ottawa shooting, Sydney siege), each a grouped bar chart of MacroA, MacroP, MacroR, and MacroF for 1%, 10%, and 30% labeled data.]

Fig. 10. Comparing performance for the case of embedding size 64. The x-axis is for different evaluation metrics while the y-axis is for performance. Different color bars present different ratios of labeled data, where green bars are for 1%, blue bars are for 10%, and red bars are for 30%.

[Figure 11: five panels (Charlie Hebdo, Ferguson, Germanwings-crash, Ottawa shooting, Sydney siege), each a grouped bar chart of MacroA, MacroP, MacroR, and MacroF for 1%, 10%, and 30% labeled data.]

Fig. 11. Comparing performance for the case of embedding size 256. The x-axis is for different evaluation metrics while the y-axis is for performance. Different color bars illustrate different ratios of labeled data, where green bars are for 1%, blue bars are for 10%, and red bars are for 30%.

[22] J. H. Paik, "A novel TF-IDF weighting scheme for effective ranking," in Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2013, pp. 343–352.
[23] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[24] J. Pennington, R. Socher, and C. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[25] R. Mihalcea and C. Strapparava, "The lie detector: Explorations in the automatic recognition of deceptive language," in Proceedings of the ACL-IJCNLP 2009 Conference Short Papers. Association for Computational Linguistics, 2009, pp. 309–312.
[26] J. W. Pennebaker, M. E. Francis, and R. J. Booth, "Linguistic inquiry and word count: LIWC 2001," Mahway: Lawrence Erlbaum Associates, vol. 71, no. 2001, p. 2001, 2001.
[27] N. Ruchansky, S. Seo, and Y. Liu, "CSI: A hybrid deep model for fake news detection," in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 2017, pp. 797–806.
[28] Q. Le and T. Mikolov, "Distributed representations of sentences and documents," in International Conference on Machine Learning, 2014, pp. 1188–1196.
[29] H. Karimi, P. Roy, S. Saba-Sadiya, and J. Tang, "Multi-source multi-class fake news detection," in Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 1546–1557.
[30] J. Ma, W. Gao, Z. Wei, Y. Lu, and K.-F. Wong, "Detect rumors using time series of social context information on microblogging websites," in Proceedings of the 24th ACM International on Conference on
Information and Knowledge Management. ACM, 2015, pp. 1751–1754.


[31] T. Lan, C. Li, and J. Li, “Mining semantic variation in time series
for rumor detection via recurrent neural networks,” in 2018 IEEE
20th International Conference on High Performance Computing and
Communications; IEEE 16th International Conference on Smart City;
IEEE 4th International Conference on Data Science and Systems
(HPCC/SmartCity/DSS). IEEE, 2018, pp. 282–289.
[32] T. Hashimoto, T. Kuboyama, and Y. Shirota, “Rumor analysis framework
in social media,” in TENCON 2011-2011 IEEE Region 10 Conference.
IEEE, 2011, pp. 133–137.
[33] O. Chapelle, B. Scholkopf, and A. Zien, “Semi-supervised learning
(chapelle, o. et al., eds.; 2006)[book reviews],” IEEE Transactions on
Neural Networks, vol. 20, no. 3, pp. 542–542, 2009.
[34] X. J. Zhu, “Semi-supervised learning literature survey,” University of
Wisconsin-Madison Department of Computer Sciences, Tech. Rep.,
2005.
[35] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean,
“Distributed representations of words and phrases and their composi-
tionality,” in Advances in neural information processing systems, 2013,
pp. 3111–3119.
[36] Y. Goldberg and O. Levy, “word2vec explained: deriving mikolov
et al.’s negative-sampling word-embedding method,” arXiv preprint
arXiv:1402.3722, 2014.
[37] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Facial landmark detection by
deep multi-task learning,” in European conference on computer vision.
Springer, 2014, pp. 94–108.
[38] S. Ruder, “An overview of multi-task learning in deep neural networks,”
arXiv preprint arXiv:1706.05098, 2017.
[39] S. Chowdhury, X. Dong, L. Qian, X. Li, Y. Guan, J. Yang, and Q. Yu,
“A multitask bi-directional rnn model for named entity recognition on
chinese electronic medical records,” BMC bioinformatics, vol. 19, no. 17,
p. 499, 2018.
[40] X. Dong, S. Chowdhury, L. Qian, X. Li, Y. Guan, J. Yang, and Q. Yu,
“Deep learning for named entity recognition on chinese electronic
medical records: Combining deep transfer learning with multitask bi-
directional lstm rnn,” PloS one, vol. 14, no. 5, p. e0216046, 2019.
[41] S. Laine and T. Aila, “Temporal ensembling for semi-supervised learn-
ing,” arXiv preprint arXiv:1610.02242, 2016.
[42] A. Rajaraman and J. D. Ullman, Mining of massive datasets. Cambridge
University Press, 2011.
[43] E. Kochkina, M. Liakata, and A. Zubiaga, “All-in-one: Multi-task learn-
ing for rumour verification,” in Proceedings of the 27th International
Conference on Computational Linguistics, 2018, pp. 3402–3413.
[44] X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional
networks for text classification,” in Advances in neural information
processing systems, 2015, pp. 649–657.
[45] A. Conneau, H. Schwenk, L. Barrault, and Y. Lecun, “Very
deep convolutional networks for text classification,” arXiv preprint
arXiv:1606.01781, 2016.
[46] P. Zhou, W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, and B. Xu, “Attention-
based bidirectional long short-term memory networks for relation classi-
fication,” in Proceedings of the 54th Annual Meeting of the Association
for Computational Linguistics (Volume 2: Short Papers), 2016, pp. 207–
212.
[47] S. Lai, L. Xu, K. Liu, and J. Zhao, “Recurrent convolutional neural
networks for text classification,” in Twenty-ninth AAAI conference on
artificial intelligence, 2015.
[48] Y. Yang, “A study of thresholding strategies for text categorization,” in
Proceedings of the 24th annual international ACM SIGIR conference on
Research and development in information retrieval, 2001, pp. 137–145.
[49] N. V. Chawla, “Data mining for imbalanced datasets: An overview,” in
Data mining and knowledge discovery handbook. Springer, 2009, pp.
875–886.
