Professional Documents
Culture Documents
2002.00763
2002.00763
Abstract—News in social media such as Twitter has been effective approach to recognize fake news by identifying the
generated in high volume and speed. However, very few of them signature of text phrases in social media posts [9]. Moreover,
are labeled (as fake or true news) by professionals in near real temporal features play a crucial role in the fast-paced social
time. In order to achieve timely detection of fake news in social
media environment because information spreads more rapidly
arXiv:2002.00763v1 [cs.CL] 31 Jan 2020
yi
zi y' cross
... Softmax
entropy
li
w(t)
Supervised CNN
xi Word
Embedding
... weighted
sum of loss
Optimizer
zi
Shared CNN ... mean squared l'i
error
z'i
Unsupervised CNN
For all data
Fig. 1. Framework of two-path deep semi-supervised learning. Samples xi are inputs. Labels yi are available only for the labeled inputs and the associated
cross-entropy loss component is evaluated only for those. zi and zi0 are outputs from the supervised CNN and the unsupervised CNN, respectively. yi0 is the
predicted label for xi . li is the cross-entropy loss and li0 is the mean squared error loss. w(t) are the weights for joint optimization of li and li0 .
ing (TDSL) framework containing three CNNs, where RNN models: one learns on simple word embeddings, and the
both labeled data and unlabeled data are used jointly to other enhances the performance by concatenating long short-
train the model and enhance the detection performance. term memory (LSTM) outputs with LIWC feature vectors.
• We implement a Word CNN [18] based TDSL to detect Doc2vec [28] is also applied to represent content that is
fake news with limited labeled data and compare its related to each social engagement. Attention based RNNs
performance with various deep learning based baselines. are employed to achieve better performance as well. Long
Specifically, we validate the implemented model by test- et al. [15] incorporates the speaker names and the statement
ing on the LIAR and PHEME datasets. It is observed that topics into the inputs to the attention based RNN. In addition,
the proposed model detects fake news effectively even convolutional neural networks are also widely used since
with extremely limited labeled data. they succeed in many text classification tasks. Karimi et al.
• The proposed framework could be applied to address [29] proposed Multi-source Multi-class Fake news Detection
other tasks. Furthermore, novel deep semi-supervised framework (MMFD), where CNN analyzes local patterns of
learning models can be implemented based on the pro- each text in a claim and LSTM analyze temporal dependencies
posed framework with various designs of CNNs, which in the entire text.
will be determined by the intended applications and tasks.
In propagation pattern based method, the propagation pat-
terns have been extracted from time-series information of news
II. R ELATED W ORK spreading on social media such as Twitter and they are used
Fake news detection has attracted a lot of attention in as features for detecting fake news. For instance, to identify
recent years. There are extensive studies such as content fake news from microblogs, Ma et al. [30] proposed the
based method and propagation pattern based method. Content Dynamic Series-Time Structure (DSTS) to capture variations
based method typically involves two steps: preprocessing in social context features such as microblog contents and
news contents and training supervised learning model on users over time for early detection of rumors. Lan et al. [31]
the preprocessed contents. The first step usually involves proposed Hierarchical Attention RNN (HARNN) that uses a
tokenization, stemming, and/or weighting words [19], [20]. In Bi-GRU layer with the attention mechanism to capture high-
the second step, Term Frequency-Inverse Document Frequency level representations of rumor contents, and a gated recurrent
(TF-IDF) [21], [22] may be employed to build samples to train unit (GRU) layer to extract semantic changes. Hashimoto et
supervised learning models. However, the samples generated al. [32] visualized topic structures in time-series variation and
by TF-IDF will be sparse, especially for social media data. seeks help from external reliable source to determine the topic
To overcome this challenge, word embedding methods such truthfulness.
as word2vec [23] and GloVe [24] are used to convert words In summary, most of the current methods are based on
into vectors. supervised learning. It requires a large amount of labeled
In addition, Mihalcea et al. [25] used linguistic inquiry and data to implement the detection processes, especially for the
word count (LIWC) [26] to explore the difference of word deep learning based approaches. However, annotating the news
usage between deceptive language and non-deceptive ones. on social media is too expensive and costs a huge amount
Specifically, deep learning based models have been explored of human labor due to the huge size of social media data.
more than other supervised learning models [13]. For example, Furthermore, this is almost impossible to achieve in near
Rashkin et al. [27] built the detection model with two LSTM real time. Even with labeled data, constructing the huge
iii
amount of labeled corpus is an extremely difficult task in before these two paths, there is a shared CNN to extract low-
the field of natural language processing as it costs a large level features to feed the later two CNNs. It is similar to deep
volume of resources and it is challenging to guarantee the multi-task learning [37], [38] since the low-level features are
label consistence. Therefore, it is imperative to incorporate shared in the different tasks [39], [40]. The major difference
unlabeled data together with labeled data in fake news detec- between the proposed framework and deep multi-task learning
tion to enhance the detection performance. Semi-supervised is that tasks in the proposed framework will involve both
learning [33], [34] is a technique that is able to use both supervised learning and unsupervised learning, while all tasks
labeled data and unlabeled data. Therefore, we propose a novel in deep multi-task learning are only based on supervised
framework of deep semi-supervised learning via convolutional learning.
neural networks for timely fake news detection. In the next In addition, these two paths can have independent CNNs
section, we present the proposed framework and show how to with the identical or different setups for supervised learning
implement the deep semi-supervised learning model in detail. and unsupervised learning, respectively. They generate two
prediction vectors that are new representations for the inputs
III. M ODEL with respect to their tasks. For the identical setups of these
two path, i.e., using the same CNN structure for both paths, it
We propose a general framework of deep semi-supervised is important to notice that, because of dropout regularization,
learning and apply it to accomplish fake news detection. training CNNs in these two paths is a stochastic process. This
Suppose the training data consist of total N inputs, out of will result in the two CNNs having different link weights and
which M are labeled. The inputs, denoted by xj , where filters during training. It implies that there will be difference
j ∈ 1...N , are the news contents that contain sentences related between the two prediction vectors zi and zi0 of the same
to fake news. In general, the news on social media normally input xi . Given that the original input xi is the same, this
contains limited number of words like 100 or less. It will lead difference can be seen as an error and thus minimizing the
to the data sparsity if we apply TF-IDF to extract features. To mean square error (MSE) is a reasonable objective in the
relieve the data sparsity problem, we employ word embedding learning procedure.
techniques, for instance, word2vec [23], [35], [36], to represent We utilize those two vectors zi and zi0 to calculate the loss
the news contents. Here S is the set of labeled inputs, |S|= M . given by
For every i ∈ S, we have a known correct label yi ∈ 1...C,
where C is the number of different classes. 1 X
Loss = − logfsof tmax (zi )[yi ]
The proposed framework of deep semi-supervised learning |B|
i∈B∩S
and corresponding learning procedures are shown in Figure (1)
1 X
1 and Algorithm 1, respectively. As shown in Figure 1, we + w(t) × ||zi − zi0 ||2 ,
C|B|
evaluate the network for each training input xi with the i∈B
supervised path and the unsupervised path to complete two where B is the minibatch in the learning process. The loss
tasks, resulting in prediction vectors zi and zi0 , respectively. consists of two components. As illustrated in Algorithm 1,
One task is to learn how to mine patterns of fake news li is the standard cross-entropy loss to evaluate the loss for
regarding the news labels while the other is to optimize the labeled inputs only. On the other hand, li0 , evaluated for all
representations of news without the news labels. Specially, inputs, penalizes different predictions for the same training
iv
r2
zi y' cross
entropy
Softmax
Word Embedding r3
x'i
k
li
... x'i
xi e1
... Supervised CNN
e2
x'i w(t) weighted
sum of loss
Optimizer
...
...
...
...
...
...
...
...
en ...
x'i = e1 e2 ... en r'1
zi l'i
Shared CNN
r'2 mean squared
error
r'3 z'i
For all data
Unsupervised CNN
Fig. 2. Word-CNN Based Deep Semi-supervised Learning. In the shared CNN, each convolutional layer contains 100 (3 × 3) filters, 100 (4 × 4) filters, and
100 (5 × 5) filters, respectively. Both the supervised CNN and the unsupervised CNN have the same architecture of the shared CNN with different numbers
of filters, where each convolutional layer contains 100 (3 × 3) filters. We use (2 × 2) max-pooling for all pooling layers. ⊕ is the concatenation operator.
r1 , r2 and r3 are outputs from the supervised path while r10 , r20 and r30 are those from the unsupervised path. Furthermore, we concatenate r1 , r2 and r3
to conduct zi and connect r10 , r20 and r30 to generate zi0 .
input xi by taking the mean squared error between zi and zi0 . to the supervised CNN and unsupervised CNN, respectively.
To combine the supervised loss li and unsupervised loss li0 , Moreover, the outputs from pooling layers in these two paths
we scale the latter by time-dependent weighting function w(t) are concatenated as two vectors zi and zi0 , respectively. For the
[41] that ramps up, starting from zero, along a Gaussian curve. supervised path, we use the label yi and the result generated
In the beginning of training, the total loss and the learning by running softmax on zi to build the supervised loss with the
gradients are dominated by the supervised loss component, cross entropy. On the other hand, these two vectors zi and zi0
i.e., the labeled data only. Unlabeled data will contribute more are employed to construct the unsupervised loss with the mean
than the labeled data at later stage of training. squared error. Finally, the supervised loss and unsupervised
Based on the proposed framework, we implement a deep loss are jointly optimized with the Adam optimizer, which is
semi-supervised learning model with Word CNN [18], where shown in Algorithm 1.
the architecture is shown in Figure 2. The Word CNN is a Although there are a few related works in the literature
powerful classifier with the simple architecture of CNN for such as the Π model [41], there exist significant differences.
sentence classification [18]. Considering fake news detection, Firstly, the Π model is designed for processing image data
let ei ∈ Rk be the k-dimensional word vector corresponding while the proposed framework is more suitable to be used to
to the i-th word of the sentence in the news. A sentence of solve problems related to NLP. Secondly, in the Π model, it
length n is represented as combined image data augmentation with dropout to generate
two outputs, but it cannot be used to process language in NLP
x0i = e1 ⊕ e2 ⊕ ... ⊕ en , (2)
tasks since data augmentation for processing text contents
where ⊕ is the concatenation operator. A convolution opera- is difficult to implement. Finally, instead of using one path
tion involves a filter c ∈ Rhk , which is applied to a window CNN, we apply two independent CNNs to generate those two
of h words to produce a new feature. The pooling operation outputs. Furthermore, the proposed model is more flexible as
deals with variable sentence lengths. As shown in Figure 2, the two independent paths can be tuned for specific goals.
the input xi is represented as x0i with the word embedding1 . IV. E XPERIMENT
Then the representation x0i is input into three convolutional
A. Datasets
layers followed by pooling layers to extract low-level features
in the shared CNN. The extracted low-level features are given To validate the effectiveness of the proposed framework on
fake news detection, we test the implemented model on two
1 https://www.tensorflow.org/api docs/python/tf/nn/embedding lookup benchmarks: LIAR and PHEME.
v
0 20 0 20 0 20 0 20 0 20
TF-IDF (10-e3) TF-IDF (10-e3) TF-IDF (10-e3) TF-IDF (10-e3) TF-IDF (10-e3)
Fig. 3. An example of the difference of word distributions between five events in PHEME dataset. x-axis indicates the TF-IDF value of the word while y-axis
shows top 50 words ranked by corresponding TF-IDF values.
1) LIAR: LIAR [6] is the recent benchmark dataset for the data is preprocessed such as removing data redundancy.
fake news detection. This dataset includes 12,836 real-world It is observed that the class distribution is different among
short statements collected from PolitiFact2 , where editors these events. For example, three events including CH, SS, and
handpicked the claims from a variety of occasions such as FE have more true news while the event GC and OS have
debate, campaign, Facebook, Twitter, interviews, ads, etc. Each more fake news. Furthermore, class distributions for events
statement is labeled with six-grade truthfulness, namely, true, such as CH, SS, FE are significantly imbalanced, which will
false, half-true, part-fire, barely-true, and mostly-true. The be a challenge to the detection task (a binary classification
information about the subjects, party, context, and speakers task).
are also included in this dataset. In this paper, this benchmark In addition, the word distributions vary among five events,
contains three parts: training dataset with 10,269 statements, where one example is shown in Figure 3: there are 50 words
validation dataset with 1,284 statements, and testing dataset ranked by their TF-IDF values, where TF-IDF values are used
with 1,283 statements. Furthermore, we reorganize the data to evaluate the word relevance to the event [42]. It is observed
as two classes by treating five classes including false, half- that the top ranked words are very different for different events.
true, part-fire, barely-true, and mostly-true as Fake class and Moreover, even for the same word, their relevance are not the
true as True class. Therefore, the fake news detection on this same for different events. For example, the relevance (TF-
benchmark becomes a binary classification task. IDF values) of the word “thank” are different between events
2) PHEME: We employ PHEME [17] to verify the ef- including Charlie Hebdo (CH), Ferguson (FE), and Ottawa
fectiveness of the proposed framework on social media data, shooting (OS).
where it collects 330 twitter threads. Tweets in PHEME are In addition, we perform leave-one-event-out (LOEO) cross-
labeled as true or false in terms of thread structures and follow- validation [43], which is closer to the realistic scenario where
follower relationships. PHEME dataset is related to nine events we have to verify unseen truth. For example, the training data
whereas this paper only focuses on the five main events, can contain the events GC, CH, SS, and FE whereas the testing
namely, Germanwings-crash (GC), Charlie Hebdo (CH), Syd- data will contain the event OS. However, as highlighted in
ney siege (SS), Ferguson (FE), and Ottawa shooting (OS). It Figure 3, this fake news detection task becomes very difficult
has different levels of annotations such as thread level and since the training data has very different data distribution from
tweet level. We only adopt the annotations on the tweet level that of the testing data,.
and thus classify the tweets as fake or true. The detailed
distribution of tweets and classes is shown in table II after It should be noted that the data in original datasets is fully
labeled. We employ all labeled data to train the baseline
2 https://www.politifact.com models. Compared to the baselines, we train the proposed
vi
TABLE I
H YPER - PARAMETERS FOR THE TRAINING OF W ORD CNN BASED TDSL 2 × P recision × Recall
F score = . (4)
Hyper-parameters Values P recision + Recall
Dropout 0.5
Minibatch size 128 where P recision indicates precision measurement that
Number of epochs 200 defines the capability of a model to represent only fake
Maximum learning rate 1e-4
statements and Recall computes the aptness to refer all
corresponding fake statements:
TABLE II
N UMBER OF TWEETS AND CLASS DISTRIBUTION IN THE PHEME TP
DATASET. P recision = . (5)
TP + FP
Events Tweets Fake True
Germanwings-crash 3,920 2,220 1,700
Charlie Hebdo 34,236 6,452 27,784 TP
Sydney siege 21.837 7,765 14,072 Recall = . (6)
Ferguson 21,658 5,952 15,706 TP + FN
Ottawa shooting 10,848 5,603 5,245
Total 92,499 27,992 64,507 whereas T P (True Positive) counts total number of news
matched with the news in the labels. F P (False Positive)
measures the number of recognized label does not match
models with only partially labeled data and the percentage the annotated corpus dataset. F N (False Negative) counts
of labeled data for training of our proposed models is from the number of news that does not match the predicted
1% to 30%. label news.
• PHEME: Accuracy is one of the common evaluation
metric to measure the performance of fake news de-
B. Experiment Setup
tection on this dataset [13]. However, we also evaluate
The key hyper-parameters for the implemented model are the performance based on the Fscore since our task on
shown in Table I and we employ Adam optimizer to complete PHEME datasets is the binary text classification with
the training the implemented proposed model (TDSL). imbalanced data. Specifically, as we perform leave-one-
In addition, we employ five deep supervised learning models event-out cross-validation on the PHEME dataset, we
as baselines including 1) Word-level CNN (Word CNN) [18], utilize macro-averaged Fscore [48] to evaluate the whole
2) Character-level CNN (Char CNN) [44], 3) Very Deep CNN performance of mining fake news on different events.
(VD CNN) [45], 4) Attention-Based Bidirectional RNN (Att
RNN) [46], and 5) Recurrent CNN (RCNN) [47], where these T
models perform well on text classification. Specifically, Word 1X
M acroF = F scoret . (7)
CNN performs well on sentence classification, which is more T t=1
suitable to process social media data as the length of the
content of the data is short like that of the sentence. In addition, T
we build 6) word-level bidirectional RNN (Word RNN) to 1X
M acroP = P recisiont . (8)
compare the implemented model, where Word RNN contains T t=1
one embedding layer and one bidirectional RNN layer, and
concatenate all the outputs from the RNN layer to feed to the T
final layer that is a fully-connected layer. Thus, there are total 1X
M acroR = Recallt . (9)
6 baseline models. Note that baseline models use all labeled T t=1
data from the original datasets.
where T denotes the total number of events and F scoret ,
C. Evaluation P recisiont , Recallt are F score, P recision, Recall
values in the tth event. Additionally, we use macro-
We apply different metrics to evaluate the performance of
average accuracy on five events to examine performance.
fake news detection regarding the task features on these two
The main goal for learning from imbalanced datasets
benchmarks.
is to improve the recall without hurting the precision.
• LIAR: We employ accuracy, precision, recall and Fs-
However, recall and precision goals can be often con-
core to evaluate the detection performance. Accuracy is flicting, since when increasing the true positive (TP) for
calculated by dividing the number of statements detected the minority class (True), the number of false positives
correctly over the total number of statements. (FP) can also be increased; this will reduce the precision
Ncorrect [49]. It means that when the MacroP is increased, the
Accuracy = . (3) MacroR might be decreased for the case of PHEME.
Ntotal
In addition, we employ F score values of each class to T
check the performance since the task is a binary text 1X
M acroA = Accuracyt . (10)
classification. T t=1
vii
truth will be embedded in the training samples when batch Model MacroA MacroP MacroR MacroF
size is large, which results in robust prediction. Word CNN[18] 61.75% 50.82% 17.60% 24.03%
For Figure 5 and 6, we examine the performance effects Char CNN)[44] 63.68% 50.66% 19.91% 26.73%
VD CNN [45] 65.42% 49.21% 30.04% 28.50%
on different embedding sizes and learning rates. In Figure 5, RCNN[47] 60.62% 45.86% 16.40% 22.24%
there are similar observations to those of the case of batch Word RNN 59.70% 45.57% 22.89% 28.22%
size. For example, for the case of more labeled data for Att RNN[46] 60.32% 45.58% 25.49% 31.15%
TDSL (1%) 56.19% 38.83% 18.73% 21.13%
training, different embedding sizes doesn’t affect the perfor- TDSL (30%) 60.64% 41.14% 4.77% 6.75%
mance significantly. On the contrary, for the fewer labeled
viii
80 80 80
70 70 70
60 60 60
50 50 50
Accuracy Precision Recall Fscore Accuracy Precision Recall Fscore Accuracy Precision Recall Fscore
Fig. 4. Different performances generated with three batch sizes, 128, 256, and 512 on three ratios of labeled data, namely 1%, 10%, and 30%. x-axis is for
different evaluation metrics while y-axis is for performance. Different color bars illustrate different batch sizes, where green bars are for batch size 128, blue
bars are for batch size 256, and red bars are for batch size 512.
80 80 80
70 70 70
60 60 60
50 50 50
Accuracy Precision Recall Fscore Accuracy Precision Recall Fscore Accuracy Precision Recall Fscore
Fig. 5. Different performances generated with three embedding sizes, 64, 128, and 256 on three ratios of labeled data, namely 1%, 10%, and 30%. x-axis is
for different evaluation metrics while y-axis is for performance. Different color bars show different batch sizes, where green bars are for embedding size 64,
blue bars are for embedding size 128, and red bars are for embedding size 256.
TABLE VI
C OMPARING PERFORMANCES GENERATED BY PROPOSED MODEL (TDSL)
LEARNING ON DIFFERENT RATIOS OF LABELED TRAINING DATA FROM TABLE VII
PHEME DATASETS . C OMPARING PERFORMANCE WITH DIFFERENT BATCH SIZES ON PHEME
DATASETS . W E CHOOSE THREE CASES OF RATIOS OF LABELED TRAINING
Labeled Ratio MacroA MacroP MacroR MacroF DATA , NAMELY, 1%, 10%, AND 30%.
1% 56.19% 38.83% 18.73% 21.13%
3% 58.58% 39.38% 13.12% 17.83% 1% Labeled Data
5% 58.40% 39.18% 12.58% 16.31% Batch size MacroA MacroP MacroR MacroF
8% 59.74% 40.48% 8.11% 11.18% 128 56.19% 38.83% 18.73% 21.13%
10% 59.84% 40.38% 7.08% 10.49% 256 57.78% 39.72% 18.50% 23.52%
30% 60.64% 41.14% 4.77% 6.75% 512 57.78% 39.07% 15.73% 20.43%
10% Labeled Data
Batch size MacroA MacroP MacroR MacroF
128 59.84% 40.38% 7.08% 10.49%
MacroR, and MacroF. However, considering MacroP, Word 256 60.08% 40.46% 8.55% 12.49%
512 59.02% 40.81% 12.96% 17.18%
CNN is better than other baselines. In addition, we observe
30% Labeled Data
that the performance (MacroA) is enhanced when increasing Batch size MacroA MacroP MacroR MacroF
the ratio of the labeled data for training TDSL. Moreover, even 128 60.64% 41.14% 4.77% 6.75%
we use little amount of labeled data, we still obtain acceptable 256 60.59% 41.50% 7.09% 10.77%
512 59.94% 42.77% 8.51% 12.27%
performance. For example, we use 1% labeled training data
to construct Word CNN based TDSL, compared with the
VD CNN, its MacroA and MacroF are just reduced about
ix
80 80 80
70 70 70
60 60 60
50 50 50
Accuracy Precision Recall Fscore Accuracy Precision Recall Fscore Accuracy Precision Recall Fscore
Fig. 6. Different performances generated with three learning rate, 1e-3 and 1e-4 on three ratios of labeled data, namely 1%, 10%, and 30%. x-axis is for
different evaluation metrics while y-axis is for performance. Different color bars indicate different batch sizes, where blue bars are for learning rate 1e-3, and
red bars are for learning rate 1e-4.
60 60 60 60 60
50 50 50 50 50
40 40 40 40 40
30 30 30 30 30
20 20 20 20 20
10 10 10 10 10
0 0 0 0 0
MacroA MacroP MacroR MacroF MacroA MacroP MacroR MacroF MacroA MacroP MacroR MacroF MacroA MacroP MacroR MacroF MacroA MacroP MacroR MacroF
Fig. 7. Comparing detailed performances generated with batch size 128 for five events. x-axis is for different evaluation metrics while y-axis is for performance.
Different color bars are for different ratios of labeled data, where green bars are for 1%, blue bars are for 10%, and red bars are for 30%
obtain acceptable performance even using extremely limited neering (OUSD(R&E)) under agreement number FA8750-15-
labeled data for training. However, we should pay more 2-0119. The U.S. Government is authorized to reproduce and
attention to choosing the ratio of labeled when processing im- distribute reprints for governmental purposes notwithstanding
balance classification task, for example, the case of PHEME. any copyright notation thereon. The views and conclusions
Meantime, we should delicately choose the hyper-parameters contained herein are those of the authors and should not be
if we plan to obtain reasonable performance. interpreted as necessarily representing the official policies or
endorsements, either expressed or implied, of the Office of
V. C ONCLUSION AND F UTURE W ORK the Under Secretary of Defense for Research and Engineering
(OUSD(R&E)) or the U.S. Government.
In this paper, a novel framework of deep semi-supervised
learning is proposed for fake news detection. Because of the R EFERENCES
fast propagation of fake news, timely detection is critical [1] G. Pennycook and D. G. Rand, “Fighting misinformation on social me-
to mitigate their effects. However, usually very few data dia using crowdsourced judgments of news source quality,” Proceedings
samples can be labeled in a short time, which in turn makes of the National Academy of Sciences, p. 201806781, 2019.
[2] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, “Fake news detection on
the supervised learning models infeasible. Hence, a deep social media: A data mining perspective,” ACM SIGKDD Explorations
semi-supervised learning model is implemented based on the Newsletter, vol. 19, no. 1, pp. 22–36, 2017.
proposed framework. The two paths in the model generate [3] H. Allcott and M. Gentzkow, “Social media and fake news in the 2016
election,” Journal of economic perspectives, vol. 31, no. 2, pp. 211–36,
supervised loss (cross-entropy) and unsupervised loss (Mean 2017.
Squared Error), respectively. Then training is performed by [4] N. Grinberg, K. Joseph, L. Friedland, B. Swire-Thompson, and D. Lazer,
jointly optimizing these two losses. Experimental results indi- “Fake news on twitter during the 2016 us presidential election,” Science,
vol. 363, no. 6425, pp. 374–378, 2019.
cate that the implemented model could detect fake news from [5] D. Hovy, “The enemy in your own camp: How well can we detect
these two benchmarks, LIAR and PHEME effectively using statistically-generated fake reviews–an adversarial study,” in Proceedings
very limited labeled data and a large amount of unlabeled data. of the 54th Annual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers), vol. 2, 2016, pp. 351–356.
Furthermore, given the data distribution differences between [6] W. Y. Wang, “” liar, liar pants on fire”: A new benchmark dataset for
training and testing datasets for the case of PHEME, and fake news detection,” arXiv preprint arXiv:1705.00648, 2017.
using the leave-one-event-out cross-validation, increasing the [7] M. Potthast, J. Kiesel, K. Reinartz, J. Bevendorff, and B. Stein, “A
stylometric inquiry into hyperpartisan and fake news,” in Proceedings
percentage of labeled data to train the semi-supervised model of the 56th Annual Meeting of the Association for Computational
does not automatically imply performance improvement. In Linguistics (Volume 1: Long Papers), 2018, pp. 231–240.
the future, we plan to examine the proposed framework on [8] J. Ma, W. Gao, P. Mitra, S. Kwon, B. J. Jansen, K.-F. Wong, and M. Cha,
“Detecting rumors from microblogs with recurrent neural networks.” in
other NLP tasks such as sentiment analysis and dependency Ijcai, 2016, pp. 3818–3824.
analysis. [9] Z. Zhao, P. Resnick, and Q. Mei, “Enquiring minds: Early detection
of rumors in social media from enquiry posts,” in Proceedings of the
24th International Conference on World Wide Web. International World
ACKNOWLEDGMENT Wide Web Conferences Steering Committee, 2015, pp. 1395–1405.
[10] S. Kwon, M. Cha, K. Jung, W. Chen, and Y. Wang, “Prominent features
This research work is supported in part by the U.S. Office of rumor propagation in online social media,” in 2013 IEEE 13th
of the Under Secretary of Defense for Research and Engi- International Conference on Data Mining. IEEE, 2013, pp. 1103–1108.
xi
60 60 60 60 60
50 50 50 50 50
40 40 40 40 40
30 30 30 30 30
20 20 20 20 20
10 10 10 10 10
0 0 0 0 0
MacroA MacroP MacroR MacroF MacroA MacroP MacroR MacroF MacroA MacroP MacroR MacroF MacroA MacroP MacroR MacroF MacroA MacroP MacroR MacroF
Fig. 8. Comparing detailed performances generated with batch size 256 for five events. x-axis is for different evaluation metrics while y-axis is for performance.
Different color bars show different ratios of labeled data, where green bars are for 1%, blue bars are for 10%, and red bars are for 30%.
60 60 60 60 60
50 50 50 50 50
40 40 40 40 40
30 30 30 30 30
20 20 20 20 20
10 10 10 10 10
0 0 0 0 0
MacroA MacroP MacroR MacroF MacroA MacroP MacroR MacroF MacroA MacroP MacroR MacroF MacroA MacroP MacroR MacroF MacroA MacroP MacroR MacroF
Fig. 9. Comparing detailed performances generated with batch size 512 for five events. x-axis is for different evaluation metrics while y-axis is for performance.
Different color bars indicate different ratios of labeled data, where green bars are for 1%, blue bars are for 10%, and red bars are for 30%.
[11] Y. Chang, M. Yamada, A. Ortega, and Y. Liu, “Lifecycle modeling 2: Short Papers), 2017, pp. 252–256.
for buzz temporal pattern discovery,” ACM Transactions on Knowledge [16] C. Bosco, V. Lombardo, L. Lesmo, and V. Daniela, “Building a treebank
Discovery from Data (TKDD), vol. 11, no. 2, p. 20, 2016. for italian: a data-driven annotation schema.” in LREC 2000. ELDA,
[12] ——, “Ups and downs in buzzes: Life cycle modeling for temporal 2000, pp. 99–105.
pattern discovery,” in 2014 IEEE International Conference on Data [17] A. Zubiaga, M. Liakata, R. Procter, G. W. S. Hoi, and P. Tolmie,
Mining. IEEE, 2014, pp. 749–754. “Analysing how people orient to and spread rumours in social media by
[13] R. Oshikawa, J. Qian, and W. Y. Wang, “A survey on natural language looking at conversational threads,” PloS one, vol. 11, no. 3, p. e0150989,
processing for fake news detection,” arXiv preprint arXiv:1811.00770, 2016.
2018. [18] Y. Kim, “Convolutional neural networks for sentence classification,”
[14] H. Rashkin, E. Choi, J. Y. Jang, S. Volkova, and Y. Choi, “Truth of arXiv preprint arXiv:1408.5882, 2014.
varying shades: Analyzing language in fake news and political fact- [19] C. D. Manning, C. D. Manning, and H. Schütze, Foundations of
checking,” in Proceedings of the 2017 Conference on Empirical Methods statistical natural language processing. MIT press, 1999.
in Natural Language Processing, 2017, pp. 2931–2937. [20] S. Tong and D. Koller, “Support vector machine active learning with
[15] Y. Long, Q. Lu, R. Xiang, M. Li, and C.-R. Huang, “Fake news detection applications to text classification,” Journal of machine learning research,
through multi-perspective speaker profiles,” in Proceedings of the Eighth vol. 2, no. Nov, pp. 45–66, 2001.
International Joint Conference on Natural Language Processing (Volume [21] J. Ramos et al., “Using tf-idf to determine word relevance in document
xii
60 60 60 60 60
50 50 50 50 50
40 40 40 40 40
30 30 30 30 30
20 20 20 20 20
10 10 10 10 10
0 0 0 0 0
MacroA MacroP MacroR MacroF MacroA MacroP MacroR MacroF MacroA MacroP MacroR MacroF MacroA MacroP MacroR MacroF MacroA MacroP MacroR MacroF
Fig. 10. Comparing performance for the case of embedding size 64. x-axis is for different evaluation metrics while y-axis is for performance. Different color
bars present different ratios of labeled data, where green bars are for 1%, blue bars are for 10%, and red bars are for 30%.
60 60 60 60 60
50 50 50 50 50
40 40 40 40 40
30 30 30 30 30
20 20 20 20 20
10 10 10 10 10
0 0 0 0 0
MacroA MacroP MacroR MacroF MacroA MacroP MacroR MacroF MacroA MacroP MacroR MacroF MacroA MacroP MacroR MacroF MacroA MacroP MacroR MacroF
Fig. 11. Comparing performance for the case of embedding size 256. x-axis is for different evaluation metrics while y-axis is for performance. Different
color bars illustrate different ratios of labeled data, where green bars are for 1%, blue bars are for 10%, and red bars are for 30%.
queries,” in Proceedings of the first instructional conference on machine Computational Linguistics, 2009, pp. 309–312.
learning, vol. 242. Piscataway, NJ, 2003, pp. 133–142. [26] J. W. Pennebaker, M. E. Francis, and R. J. Booth, “Linguistic inquiry
[22] J. H. Paik, “A novel tf-idf weighting scheme for effective ranking,” and word count: Liwc 2001,” Mahway: Lawrence Erlbaum Associates,
in Proceedings of the 36th international ACM SIGIR conference on vol. 71, no. 2001, p. 2001, 2001.
Research and development in information retrieval. ACM, 2013, pp. [27] N. Ruchansky, S. Seo, and Y. Liu, “Csi: A hybrid deep model for fake
343–352. news detection,” in Proceedings of the 2017 ACM on Conference on
[23] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of Information and Knowledge Management. ACM, 2017, pp. 797–806.
word representations in vector space,” arXiv preprint arXiv:1301.3781, [28] Q. Le and T. Mikolov, “Distributed representations of sentences and
2013. documents,” in International conference on machine learning, 2014, pp.
[24] J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors 1188–1196.
for word representation,” in Proceedings of the 2014 conference on [29] H. Karimi, P. Roy, S. Saba-Sadiya, and J. Tang, “Multi-source multi-
empirical methods in natural language processing (EMNLP), 2014, pp. class fake news detection,” in Proceedings of the 27th International
1532–1543. Conference on Computational Linguistics, 2018, pp. 1546–1557.
[25] R. Mihalcea and C. Strapparava, “The lie detector: Explorations in [30] J. Ma, W. Gao, Z. Wei, Y. Lu, and K.-F. Wong, “Detect rumors using
the automatic recognition of deceptive language,” in Proceedings of time series of social context information on microblogging websites,”
the ACL-IJCNLP 2009 Conference Short Papers. Association for in Proceedings of the 24th ACM International on Conference on
xiii