IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 23, 2021

Multimodal Disentangled Domain Adaption for Social Media Event Rumor Detection

Huaiwen Zhang, Shengsheng Qian, Quan Fang, and Changsheng Xu, Fellow, IEEE

Digital Object Identifier 10.1109/TMM.2020.3042055

Abstract—With the rapid development of social media and the increasing scale of social media data, rumor detection on social media platforms has become vitally important. The key challenges for rumor detection on social media platforms are how to identify rumors that are deeply entangled with specific content and how to detect rumors for emerging social media events without labeled data. Unfortunately, most existing approaches can hardly handle these challenges, since they tend to learn event-specific features and cannot transfer the learned features to newly emerged events. To tackle the above challenges, we propose a novel Multimodal Disentangled Domain Adaption (MDDA) method which derives event-invariant features and thus benefits the detection of rumors on emerging social media events. The model consists of two components: multimodal disentangled representation learning and unsupervised domain adaptation. The multimodal disentangled representation learning is responsible for disentangling multimedia posts into content features and rumor style features, and for removing the content-specific features from the post representation. The unsupervised domain adaptation aims to filter out the event-specific features and keep the rumor style features shared among events. Based on the final event-invariant rumor style features, we train a robust social media rumor detector that transfers knowledge from source events to target events and therefore performs well on newly emerged events. Extensive experiments on two Twitter benchmark datasets demonstrate that our rumor detection model outperforms state-of-the-art methods.

Index Terms—Disentangled representation learning, domain adaptation, event rumor detection, social media.

I. INTRODUCTION

NOWADAYS, social media has become one of the most important platforms for people to share information. Hundreds of millions of users spontaneously release the latest news or share their opinions every day. However, few users carefully check the authenticity of the shared information, which means large volumes of rumors may emerge and spread. Without proper supervision, widespread social media rumors could cause serious consequences and sometimes may even manipulate critical public events. For example, during the 2016 U.S. election, rumors about presidential candidates were widely circulated on social media, which slandered the candidates and misled the voters [1]. Therefore, it is urgent to detect and regulate social media rumors, so that users receive truthful information and social harmony is maintained.

To debunk rumors and minimize their harmful effects, many efforts have been made. The early efforts come from news websites (e.g., snopes.com and politifact.com), which try to confirm rumors by expert analysis and crowdsourcing. However, collecting and investigating rumors manually is time-consuming and has obvious limitations in efficiency. Recently, automatically mining and detecting rumors has drawn much attention in the research community. Basically, existing studies on automatic rumor detection can be summarized into two categories: (1) The first is traditional learning methods [2]–[6], which design plenty of hand-crafted features from the media content of posts and the social context of users. With these sophisticated features, SVM classifiers [2], [3], [6] and decision tree classifiers [4], [5] have been trained to debunk rumors. (2) The second is deep learning methods [7]–[9], which capture deep features based on neural networks. For example, Ma et al. [7] employ Recurrent Neural Networks (RNNs) to learn hidden features from posts. Yu et al. [8] use Convolutional Neural Networks (CNNs) to obtain key features and their high-level interactions from rumors. Khattar et al. [10] propose a multimodal Variational Autoencoder (VAE) to obtain latent multimodal representations of multimedia posts and classify them via a binary classifier.

With sufficient posts on different events, existing deep learning models have achieved performance improvements over traditional ones due to their superior ability of feature extraction. However, they are still not able to handle the following challenges:

1) Entanglement Challenge: In real-world social media platforms, rumors are always entangled with specific content.
Thus, rumors are expressed diversely along with different post content. For example, in Figure 1, the two rumors that belong to the same event (i.e., a terrorist attack) differ significantly in texts and images. Existing methods tend to capture lots of content-specific features for rumor detection, while ignoring the entanglement of rumor writing style and content information in posts, leading to unsatisfactory performance.

2) Domain Challenge: In real-world social media platforms, new social media events keep arising. Rumor detection models in real situations always face newly emerged and time-critical events which have no labeled data. Although historical events have annotated data, there are domain gaps between past events and new events. For example, in Figure 1, a rumor detector trained with the terrorist attack data may perform poorly on the pop star samples, since the two events differ in rumor writing style. Existing methods extract large numbers of event-specific features from the past event data. Such event-specific features, though able to help classify the posts of verified events, are hard to transfer to newly emerged events, resulting in poor performance.

Fig. 1. The content of rumors within the same event is different. Besides, rumors under different events have domain gaps.

For the sake of brevity, and also to highlight the difference from conventional social media rumor detection, we summarize these two challenges of real-life social media rumor detection and propose a new and more realistic task: social media event rumor detection. Specifically, the social media event rumor detection task is designed to detect rumors for emerging social media events, which have no labeled data.

In order to address the aforementioned challenges, we introduce a novel end-to-end framework named Multimodal Disentangled Domain Adaptation (MDDA) for social media event rumor detection. (1) For the Entanglement Challenge, we propose the multimodal disentangled representation learning algorithm, which disentangles the feature space of a multimedia post into an event content space and a rumor style space. Then, we remove the content features that vary with different posts and concentrate on the content-invariant rumor style features which determine the classification result. (2) For the Domain Challenge, we propose a domain adversarial neural network to deal with the lack of labeled data in newly emerged events. The proposed domain adaptation method aligns the feature distributions over different events and learns transferable rumor style features from the multimedia posts. Based on the final transferable rumor style features, we train a robust social media rumor detector that transfers knowledge learned from source events to target events and performs well on newly emerged events. Extensive experiments on two Twitter benchmark datasets demonstrate that our rumor detection method achieves much better results than state-of-the-art methods.

The main contributions of this work are four-fold:
1) Based on the practical rumor detection application, we consider a new detection scenario, social media event rumor detection, which is common in reality and more difficult to solve.
2) We introduce multimodal disentangled representation learning to social media rumor detection, which disentangles a multimedia post into content-specific information and rumor style information. Without the distraction of the content-specific information, rumor classifiers trained on style information can be more robust and achieve better performance.
3) We propose a domain adversarial neural network to solve the problem of lacking labeled data in newly emerged events, which makes the rumor classifier generalize to all arising events on social media platforms.
4) We experimentally evaluate our method on two public benchmark datasets, and the results demonstrate that our proposed model is more robust and effective than state-of-the-art baselines for the social media event rumor detection task.

The rest of the paper is organized as follows. In Section II, the related work is reviewed. In Section III, we introduce the problem statement of social media event rumor detection. In Section IV, we present the proposed method. The extensive experimental results and analysis are presented in Section V. Finally, we conclude the paper in Section VI.

II. RELATED WORK

In this section, we briefly review the previous methods most relevant to our work, including social media rumor detection, disentangled representation learning, and domain adaptation.

A. Social Media Rumor Detection

Social psychology literature generally defines a rumor as "unverified and instrumentally relevant information statements in circulation" [11]. This unverified information may eventually turn out to be true or false.
Existing approaches [7], [10], [12]–[16] regard social media rumor detection as a supervised binary classification problem. Thus, the main concern of the supervised classification approach is to define effective features for training rumor classifiers. Basically, existing studies on automatic rumor detection can be summarized into two categories.

The first category of methods extracts features from the multimedia content of posts. Most existing models [7], [12], [13] focus only on textual features. For example, Ma et al. [7] introduce recurrent neural networks to learn hidden representations from the text content of relevant posts. Chen et al. [12] utilize the attention mechanism to selectively learn temporal hidden representations of sequential posts for identifying rumors. Potthast et al. [13] try to detect rumors by extracting various style features from text contents. Recently, many studies explore the multimedia content on social media for understanding social events, such as multi-modal topic models [17], [18], election prediction [19], opinion mining [20], information retrieval [21], [22], question answering [23], and abuse detection [24]. Inspired by these works, some multimodal approaches [10], [14]–[16] have been proposed, which take visual features into account and conduct rumor detection based on the multimedia content. For example, Jin et al. [15] propose a recurrent neural network with an attention mechanism to fuse image and text features of the post for rumor detection. Yang et al. [25] propose a convolutional neural network for fake news detection by projecting the explicit and latent features of text and image into a unified feature space. Zhou et al. [26] take the similarity relationship between the textual and visual information into account and detect fake news by jointly considering the text, image, and similarity of news articles. Furthermore, some recent work [27]–[29] adopts external resources to determine the truthfulness of posts. For example, Zhang et al. [29] propose a knowledge-aware network that obtains the background knowledge features of posts from external knowledge graphs to improve the accuracy of rumor detection.

The second category of methods [2], [4], [30]–[33] adopts features from the social context of users, since users are the primary path along which posts disseminate [34]. For example, Shu et al. [30] extract user features from user profiles to measure their credibility and estimate the veracity of the posts they share. Yang et al. [31] utilize users' opinions on social media and their credibility to detect rumors. Tacchini et al. [33] extract stance information from users' social responses towards the post to infer rumor veracity. Wu et al. [35] aim to capture propagation patterns, such as the graph structure of message propagation in social media, to detect rumors. Kwon et al. [4] construct a diffusion network for posts based on the propagation and extract network-based features for rumor detection.

Although these methods have achieved fairly good performance, existing approaches to social media rumor detection ignore the entanglement between rumor style information and content-specific information, and tend to capture lots of untransferable event-specific features, making them hard to apply to newly emerged events. In this paper, we propose a novel end-to-end multimodal disentangled domain adaption to disentangle the rumor style features from the content-specific features and learn transferable features across different events. In addition, the model achieves state-of-the-art performance and generalizes well for emerging social media events.

B. Disentangled Representation Learning

Disentangled representation learning aims to learn a proper representation that has the ability to disentangle the factors of variation [36]. To seek this property of learned representations, researchers have shown substantial interest in learning disentangled representations [37], [38], including some work based on generative models [39]–[41]. One of the earliest architectures for learning disentangled representations using deep learning [42] was applied to the task of emotion recognition. Reed et al. [41] propose to learn each factor of variation of the image manifold as its own sub-manifold using a higher-order Boltzmann machine. Recently, Mathieu et al. [40] combine a variational autoencoder with adversarial training to disentangle representations, in which the learned code encodes the semantic information of specific factors, successfully controlling some factors of variation when generating images. Chen et al. [43] extend the generative adversarial network with an information-theoretic objective, which can learn disentangled representations in an unsupervised manner. It successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting in 3D rendered images, and background digits from the central digit on the SVHN dataset. Even though disentangled representation learning has been extensively studied, to the best of our knowledge, no existing work has explored utilizing it for rumor detection.

C. Domain Adaptation

The idea behind domain adaptation is inspired by the human ability to learn with minimal or no supervision based on previously acquired knowledge. In the domain adaptation field, the data distribution is assumed to change across the training and the testing data while the learning task remains the same. The work on domain adaptation can be reviewed in two branches: domain transfer based methods [44]–[48] and common feature embedding based methods [49]–[56].

Domain transfer based methods generalize the functional component from the source domain to the target. Bousmalis et al. [44] use a generative adversarial network to transfer source images to target images and assign them their corresponding source labels. Russo et al. [45] and Hoffman et al. [46] ensure mutual transfer between the source and the target datasets by adding cycle consistency terms. Saito et al. [47] train two classifiers on the source dataset in order to artificially label the target dataset.

A different approach to domain adaptation is to embed both domains in a shared feature space, which can minimize the discrepancy between the source and the target data. For example, Arthur et al. [51] propose Maximum Mean Discrepancy (MMD) metrics to measure the discrepancy of different domains. Ganin et al. [52] propose Domain Adversarial Neural Networks (DANN), which employ the same encoder network, followed by classification layers, to correctly classify the source dataset while fooling a domain classifier. Haeusser et al. [53] produce statistically domain-invariant embeddings by reinforcing associations between source and target data directly in embedding space.
Fig. 2. Illustration of the proposed Multimodal Disentangled Domain Adaptation (MDDA) method. The MDDA first performs multimodal disentangled representation learning to separate the representation of a multimedia post into content features and style features and to remove the content-specific features from training. Without the distraction of content information, the rumor classifier trained only on style features becomes more precise and robust. Then, adversarial learning based domain adaptation is employed to deal with the style representation distribution drift over different events. In this way, the MDDA can handle the social media event rumor detection task and consistently performs well on newly emerged events. The red line is the inference pipeline: once the model is trained, the target data can be directly fed into the style encoders and the label predictor to obtain the label prediction.

In recent years, domain adaptation technology has been applied to many fields; however, to the best of our knowledge, our work is the first effort to focus on the social media event rumor detection scenario.

III. PROBLEM STATEMENT

In this work, we address a realistic rumor detection scenario faced by social media platforms: social media event rumor detection, which aims to detect rumors for emerging social media events without labeled data.

Let $D_S = \{p^i, y^i\}_{i=1}^{N_S}$ be the set of labeled posts for the historical events (source event) $S$, where $y^i \in \{0, 1\}$ is the class label for post $p^i$, $N_S$ is the number of posts for the source event, and the post $p^i$ consists of a text sentence $x^i$ and an image $v^i$, i.e., $p^i = \{x^i, v^i\}$. In addition, we have unlabeled posts $D_T = \{p^i\}_{i=1}^{N_T}$ for the newly emerged events (target event) $T$, with $N_T$ being the number of posts in the target event.

Our ultimate goal is to train a cross-event model $p(y\,|\,p, \theta)$ with parameters $\theta$ that can classify posts in the target event $T$ without having any information about class labels in $T$.
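To make the data setup concrete, the following minimal Python sketch shows one way the labeled source set and the unlabeled target set defined above could be represented; the class and field names are illustrative assumptions, not part of the paper.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Post:
    """A multimedia post p = {x, v}: a text sentence and an (optionally) attached image."""
    text: str                  # text sentence x
    image_path: Optional[str]  # path to the attached image v; None if the post has no image

@dataclass
class LabeledPost:
    post: Post
    label: int                 # y in {0, 1}: 1 = rumor, 0 = non-rumor

# D_S: labeled posts from the historical (source) event; D_T: unlabeled posts
# from the newly emerged (target) event. Only the source side carries labels.
source_posts: List[LabeledPost] = []   # {(p_i, y_i)}, i = 1..N_S
target_posts: List[Post] = []          # {p_i}, i = 1..N_T
```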

IV. METHODOLOGY

A. Model Overview

In order to accurately detect rumors in social media events, we propose the Multimodal Disentangled Domain Adaptation (MDDA) method. The overall architecture is illustrated in Figure 2. Our model consists of the following components:

1) Multimodal Disentangled Representation Learning: Rumors are always entangled with specific content, resulting in diverse expressions in different posts. We propose the multimodal disentangled representation learning method to disentangle multimedia posts into content-specific features and rumor style features. Then, we remove the content features that vary with different posts and concentrate on the content-invariant rumor style features, which determine the classification results.

2) Unsupervised Domain Adaptation: Social media event rumor detection aims to detect rumors for emerging social events which have no labeled data. To address this challenge, we employ adversarial learning based domain adaptation to learn transferable features from multimedia posts. Based on the final transferable rumor style features, we train a robust social media rumor detector that transfers knowledge learned from historical events to new events and performs well on newly emerged events.

We detail the architecture of multimodal disentangled representation learning in Section IV-B. The design of the domain adversarial neural network is detailed in Section IV-C. The proposed training and inference flow are described in Section IV-D. The main notations of this paper are summarized in Table I.

TABLE I: MAIN NOTATIONS OF THIS PAPER AND THEIR EXPLANATIONS

B. Multimodal Disentangled Representation Learning

Rumor posts are expressed diversely along with different post content, even if they belong to the same event. Existing rumor detection methods capture lots of content-specific features from posts, leading to unsatisfactory performance on unseen posts. To avoid the misleading content features, we propose the multimodal disentangled representation learning method to disentangle the feature space of multimedia posts into an event content space and a rumor style space, and only use the rumor style features for training. Given a multimedia post $p = \{x, v\}$, which consists of a text sentence $x$ and an image $v$, we perform textual disentangled representation learning and visual disentangled representation learning, respectively.

1) Textual Disentangled Representation Learning: The framework of textual disentangled representation learning is shown in the top half of Figure 2. We employ the variational autoencoder (VAE [57]) as the base model of the textual disentangled representation learning, since it enables more fluent sentence generation from a latent space than a normal autoencoder [58]. To disentangle the text content distribution and the rumor writing style distribution from the text latent distribution, we design three special encoders. The first encoder is the basic encoder $E_b^x$, which learns the latent distribution of the text. The second and third encoders are the content encoder $E_c^x$ and the style encoder $E_s^x$, which take the latent distribution of the text as input to learn the text content distribution and the rumor writing style distribution, respectively.

The text basic encoder module $E_b^x$ aims to capture the contextual information of the given text sentence $x = [x_1, x_2, \ldots, x_n]$, where $n$ is the number of words. To accomplish this, the text basic encoder module uses a recurrent neural network with Gated Recurrent Units (GRU) [59], which reads the input sentence word by word and captures the whole contextual information of sentence $x$ in the final hidden state:

$$h_n = E_b^x(x; \theta_{E_b^x}) = \mathrm{GRU}(x_n, h_{n-1}) \qquad (1)$$

where $\theta_{E_b^x}$ is the parameter of the text basic encoder, and $h_n$ is the $n$-step hidden state of post $p$.

The text content encoder $E_c^x$ and the style encoder $E_s^x$ employ Multilayer Perceptrons (MLP) to extract the content information and style information from the final hidden state $h_n$:

$$\big(\mu_c, \log \sigma_c^2\big) = E_c^x(h_n; \theta_{E_c^x}) = \mathrm{MLP}_{content}(h_n), \qquad \big(\mu_s, \log \sigma_s^2\big) = E_s^x(h_n; \theta_{E_s^x}) = \mathrm{MLP}_{style}(h_n) \qquad (2)$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the distributions of content information and style information, and $\theta_{E_c^x}$ and $\theta_{E_s^x}$ are the parameters of the text content encoder and the text style encoder.

Then, we sample the content latent variable $x_c$ and the style latent variable $x_s$ from the distributions of content information and style information, respectively:

$$x_c \sim \mathcal{N}\big(\mu_c, \sigma_c^2 I\big), \qquad x_s \sim \mathcal{N}\big(\mu_s, \sigma_s^2 I\big) \qquad (3)$$

The final post representation $x_z$ is generated by concatenating $x_c$ and $x_s$, i.e., $x_z = x_c \oplus x_s$, where $\oplus$ is the concatenation operator.

The text decoder module $D^x$ is also built with GRUs. It takes the post representation $x_z$ as input to generate a decoded sentence $\hat{x}$, which ideally should equal $x$:

$$\hat{x} = D^x(x_z; \theta_{D^x}) \qquad (4)$$

where $\theta_{D^x}$ is the parameter of the text decoder.

To sum up, we formulate the loss function of the proposed disentanglement VAE as follows:

$$\mathcal{L}^x(\theta_{E_b^x}, \theta_{E_c^x}, \theta_{E_s^x}, \theta_{D^x}) = -\mathbb{E}_{q_E(x_c|x)}[\log p(x|x_z)] - \mathbb{E}_{q_E(x_s|x)}[\log p(x|x_z)] + \lambda_{kl}\,\mathrm{KL}\big(q_E(x_c|x)\,\|\,p(x_c)\big) + \lambda_{kl}\,\mathrm{KL}\big(q_E(x_s|x)\,\|\,p(x_s)\big) \qquad (5)$$

where $\lambda_{kl}$ is the hyper-parameter balancing the reconstruction loss and the KL terms, $p(x_c)$ and $p(x_s)$ are the priors, typically the standard normal $\mathcal{N}(0, I)$, and $q_E(x_c|x)$ and $q_E(x_s|x)$ are the posteriors of the form $\mathcal{N}(\mu_c, \sigma_c^2 I)$ and $\mathcal{N}(\mu_s, \sigma_s^2 I)$.

To further ensure that the content-specific information and the rumor style information are effectively encoded in the corresponding feature distributions, we design two auxiliary classifiers. The first one is the text style predictor $P_s^x$, which takes the mean of the style distribution $\mu_s$ as input to ensure that $\mu_s$ is discriminative for the style. The second one is the adversarial text style discriminator $P_c^x$, which takes the mean of the content distribution $\mu_c$ as input to remove the style features from $\mu_c$.

The text style predictor $P_s^x$ deploys a fully connected layer with softmax to predict whether the post is a rumor or a non-rumor. It is built on top of the text style encoder $E_s^x$ and takes the mean of the style distribution $\mu_s$ as input:

$$y_s^x = P_s^x(\mu_s; \theta_{P_s^x}) \qquad (6)$$

where $\theta_{P_s^x}$ is the parameter of the style predictor, and $y_s^x$ is the output of the softmax layer. The text style predictor is trained with cross-entropy loss against the ground-truth distribution $y$:

$$\mathcal{L}_s^x(\theta_{E_s^x}, \theta_{P_s^x}) = -\mathbb{E}_{(p,y)\sim D_S}\big[y \log(y_s^x) + (1 - y)\log(1 - y_s^x)\big] \qquad (7)$$

By minimizing $\mathcal{L}_s^x$, we can find the optimal parameters for the text style encoder $E_s^x$ and the style predictor $P_s^x$. Note that the parameter of the text basic encoder $E_b^x$ is not updated during the training of $P_s^x$.

The text style discriminator $P_c^x$ is a neural network that consists of two fully connected layers with corresponding activation functions. It takes the content representation $\mu_c$ as input and tries to discriminate whether the input post is a rumor or a non-rumor:

$$y_c^x = P_c^x(\mu_c; \theta_{P_c^x}) \qquad (8)$$

where $\theta_{P_c^x}$ is the parameter of the text style discriminator, and $y_c^x$ is the output. Hence, we define the adversarial loss as follows:

$$\mathcal{L}_c^x(\theta_{E_c^x}, \theta_{P_c^x}) = -\mathbb{E}_{y=1}[\log y_c^x] - \mathbb{E}_{y=0}[\log(1 - y_c^x)] \qquad (9)$$

In adversarial training, we seek optimal parameters for the text content encoder $E_c^x$ and the text style discriminator $P_c^x$ through the min-max game:

$$\big(\theta^*_{E_c^x}, \theta^*_{P_c^x}\big) = \arg\min_{E_c^x}\max_{P_c^x} \mathcal{L}_c^x(\theta_{E_c^x}, \theta_{P_c^x}) \qquad (10)$$

Specifically, the text content encoder tries to fool the text style discriminator to maximize the discrimination loss. In contrast, the style discriminator aims to discover the style information remaining in the content code to recognize the rumors. After the adversarial training, we obtain a text content encoder that focuses only on extracting the content information of the text.
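The two auxiliary text classifiers and the min-max game of Eqs. (6)–(10) could be realized as in the sketch below. The alternating update scheme and the helper names are assumptions; they show one common way to implement the adversarial objective, not necessarily the authors' exact procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# P_s^x: a single softmax layer on the style mean mu_s (Eq. 6); 16 = text style dimension.
style_predictor = nn.Linear(16, 2)
# P_c^x: two fully connected layers with Tanh activation on the content mean mu_c (Eq. 8).
content_discriminator = nn.Sequential(nn.Linear(128, 64), nn.Tanh(), nn.Linear(64, 2))

opt_pred = torch.optim.Adam(style_predictor.parameters(), lr=1e-3)
opt_disc = torch.optim.Adam(content_discriminator.parameters(), lr=1e-3)

def adversarial_step(mu_s, mu_c, labels, opt_content_encoder):
    """One alternating update realizing Eqs. (7), (9), and the min-max game of Eq. (10).

    mu_s and mu_c are assumed to come from a differentiable forward pass of the text
    encoders; opt_content_encoder is an optimizer over the content encoder E_c^x.
    """
    # (i) Style predictor: make mu_s discriminative for the rumor label (Eqs. 6-7).
    loss_style = F.cross_entropy(style_predictor(mu_s), labels)
    opt_pred.zero_grad()
    loss_style.backward(retain_graph=True)
    opt_pred.step()

    # (ii) Style discriminator: try to read the label from the (detached) content mean (Eqs. 8-9).
    loss_disc = F.cross_entropy(content_discriminator(mu_c.detach()), labels)
    opt_disc.zero_grad()
    loss_disc.backward()
    opt_disc.step()

    # (iii) Content encoder: fool the discriminator so mu_c keeps no style cue (Eq. 10).
    loss_fool = -F.cross_entropy(content_discriminator(mu_c), labels)
    opt_content_encoder.zero_grad()
    loss_fool.backward()
    opt_content_encoder.step()
```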
2) Visual Disentangled Representation Learning: The framework of visual disentangled representation learning is composed of a visual content encoder $E_c^v$, a visual style encoder $E_s^v$, and a visual decoder $D^v$. The framework is shown in the bottom half of Figure 2, and the network architectures of visual disentangled representation learning can be found in Table II.

TABLE II: NETWORK ARCHITECTURES OF VISUAL DISENTANGLED REPRESENTATION LEARNING

The visual content encoder $E_c^v$ consists of several strided convolutional layers to downsample the input and several residual blocks [60] to further process it. Furthermore, all the convolutional layers are followed by Instance Normalization (IN) [61], since it removes the mean and variance of the original feature, which discourages the content space from containing style information:

$$v_c = E_c^v(v; \theta_{E_c^v}) \qquad (11)$$

where $\theta_{E_c^v}$ is the parameter of the visual content encoder and $v_c$ is the content feature of image $v$.

The visual style encoder $E_s^v$ includes several strided convolutional layers, followed by a global average pooling layer and a fully connected layer:

$$v_s = E_s^v(v; \theta_{E_s^v}) \qquad (12)$$

where $\theta_{E_s^v}$ is the parameter of the visual style encoder and $v_s$ is the style feature of image $v$.

The visual decoder $D^v$ reconstructs the input image from its content and style features:

$$\hat{v} = D^v(v_c, v_s; \theta_{D^v}) \qquad (13)$$

where $\theta_{D^v}$ is the parameter of the visual decoder and $\hat{v}$ is the reconstructed image. In particular, $D^v$ processes the content features with a set of residual blocks and finally produces the reconstructed image with several upsampling and convolutional layers, where the residual blocks are equipped with Adaptive Instance Normalization (AdaIN) [62] layers whose parameters are dynamically generated from the style features by a multilayer perceptron.

The image reconstruction loss can be formulated as:

$$\mathcal{L}^v(\theta_{E_c^v}, \theta_{E_s^v}, \theta_{D^v}) = \mathbb{E}_p\big[\,\|D^v(E_c^v(v), E_s^v(v)) - v\|_1\,\big] \qquad (14)$$

To further constrain the visual content space and the visual style space, we propose two visual auxiliary classifiers: the visual style predictor $P_s^v$ and the adversarial visual style discriminator $P_c^v$.
The visual style predictor $P_s^v$ takes the style code $v_s$ as input to help the visual style encoder efficiently extract the style features from the image. In particular, it is designed as a fully connected layer with softmax and aims to predict whether the post is a rumor or a non-rumor:

$$y_s^v = P_s^v(v_s; \theta_{P_s^v}) \qquad (15)$$

where $\theta_{P_s^v}$ is the parameter of the visual style predictor, and $y_s^v$ is the output of the softmax layer. The classifier is trained with cross-entropy loss against the ground-truth distribution $y$:

$$\mathcal{L}_s^v(\theta_{E_s^v}, \theta_{P_s^v}) = -\mathbb{E}_{(p,y)\sim D_S}\big[y \log(y_s^v) + (1 - y)\log(1 - y_s^v)\big] \qquad (16)$$

The visual style discriminator $P_c^v$ is also built with two fully connected layers and corresponding activation functions. $P_c^v$ aims to discriminate the post into rumor or non-rumor based on the content representation $v_c$:

$$y_c^v = P_c^v(v_c; \theta_{P_c^v}) \qquad (17)$$

where $\theta_{P_c^v}$ is the parameter of the visual style discriminator and $y_c^v$ is the output. Thus, we define the adversarial loss as follows:

$$\mathcal{L}_c^v(\theta_{E_c^v}, \theta_{P_c^v}) = -\mathbb{E}_{y=1}[\log y_c^v] - \mathbb{E}_{y=0}[\log(1 - y_c^v)] \qquad (18)$$

and the visual adversarial training process is summarized as follows:

$$\big(\theta^*_{E_c^v}, \theta^*_{P_c^v}\big) = \arg\min_{E_c^v}\max_{P_c^v} \mathcal{L}_c^v(\theta_{E_c^v}, \theta_{P_c^v}) \qquad (19)$$
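A compact PyTorch sketch of the visual branch of Eqs. (11)–(14) follows: a content encoder with Instance Normalization, a style encoder with global average pooling, and an AdaIN-based decoder. Since Table II is not reproduced here, the channel counts and block depths are illustrative assumptions; only the overall structure follows the description above.

```python
import torch
import torch.nn as nn

class AdaINResBlock(nn.Module):
    """Residual block whose normalization statistics are replaced by style-driven scale/shift (AdaIN)."""
    def __init__(self, ch, style_dim):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.norm = nn.InstanceNorm2d(ch, affine=False)
        self.affine = nn.Linear(style_dim, 2 * ch)       # MLP producing AdaIN gamma/beta

    def adain(self, x, style):
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        return self.norm(x) * (1 + gamma[..., None, None]) + beta[..., None, None]

    def forward(self, x, style):
        h = torch.relu(self.adain(self.conv1(x), style))
        return x + self.adain(self.conv2(h), style)

class VisualContentEncoder(nn.Module):                   # E_c^v: strided convs + IN (Eq. 11)
    def __init__(self, content_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.InstanceNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, content_ch, 4, stride=2, padding=1), nn.InstanceNorm2d(content_ch), nn.ReLU())

    def forward(self, v):
        return self.net(v)                               # spatial content feature map v_c

class VisualStyleEncoder(nn.Module):                     # E_s^v: strided convs + GAP + FC (Eq. 12)
    def __init__(self, style_dim=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU())
        self.fc = nn.Linear(64, style_dim)

    def forward(self, v):
        return self.fc(self.conv(v).mean(dim=(2, 3)))    # global average pooling -> style code v_s

class VisualDecoder(nn.Module):                          # D^v: AdaIN residual block + upsampling (Eq. 13)
    def __init__(self, content_ch=64, style_dim=8):
        super().__init__()
        self.res = AdaINResBlock(content_ch, style_dim)
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(content_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(32, 3, 3, padding=1), nn.Tanh())

    def forward(self, v_c, v_s):
        return self.up(self.res(v_c, v_s))

def image_recon_loss(E_c, E_s, D, v):
    """L1 image reconstruction loss of Eq. (14)."""
    return (D(E_c(v), E_s(v)) - v).abs().mean()
```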
3) Multimodal Integration: Given a multimedia post $p = \{x, v\}$, which is composed of a text sentence $x$ and an image $v$, we extract the text style features and the visual style features through the textual and visual disentangled representation learning, respectively. In order to use both modalities of style features for rumor detection, we combine the text style predictor $P_s^x$ and the visual style predictor $P_s^v$ into the final label predictor $P_s^z(\cdot; \theta_{P_s^z})$, which takes the multimodal style representation $z$ as input to classify the input post into rumor or non-rumor:

$$y_s^z = P_s^z(z; \theta_{P_s^z}) \qquad (20)$$

where $z$ is generated by concatenating the text style features $\mu_s$ and the visual style features $v_s$, i.e., $z = \mu_s \oplus v_s$. Thus, Eq. (7) and Eq. (16) can be replaced with:

$$\mathcal{L}_z(\theta_{E_b^x}, \theta_{E_s^x}, \theta_{E_s^v}, \theta_{P_s^z}) = -\mathbb{E}_{(p,y)\sim D_S}\big[y \log(y_s^z) + (1 - y)\log(1 - y_s^z)\big] \qquad (21)$$

C. Unsupervised Domain Adaptation

To detect rumors for the newly emerged events, we employ adversarial learning based domain adaptation. Specifically, we use a domain adversarial neural network to remove the event-specific information from the multimedia post style representation $z$ and align the feature distribution across the source event $D_S$ and the target event $D_T$.

We propose a domain discriminator $P_d^z$, which takes the multimodal style representation $z$ as input and tries to discriminate whether the input post comes from $D_S$ or $D_T$:

$$d_z = P_d^z(z; \theta_{P_d^z}) \qquad (22)$$

where $\theta_{P_d^z}$ is the parameter of the domain discriminator, and $d_z$ is the output. The domain adversary loss is defined as follows:

$$\mathcal{L}_d(\theta_{E_b^x}, \theta_{E_s^x}, \theta_{E_s^v}, \theta_{P_d^z}) = -\mathbb{E}_{p\sim D_S}[\log d_z] - \mathbb{E}_{p\sim D_T}[\log(1 - d_z)] \qquad (23)$$

The parameters of the domain discriminator $P_d^z$ minimizing the loss $\mathcal{L}_d(\cdot,\cdot,\cdot,\cdot)$ can be written as:

$$\theta^*_{P_d^z} = \arg\min_{P_d^z} \mathcal{L}_d(\theta_{E_b^x}, \theta_{E_s^x}, \theta_{E_s^v}, \theta_{P_d^z}) \qquad (24)$$

The loss $\mathcal{L}_d(\cdot,\cdot,\cdot,\theta^*_{P_d^z})$ estimates the dissimilarity between different event distributions: when this loss is large, the distributions of different event representations are similar and the learned features are event-invariant. Thus, in order to remove the uniqueness of each event, we need to maximize the discrimination loss $\mathcal{L}_d(\cdot,\cdot,\cdot,\theta^*_{P_d^z})$ by seeking the optimal parameters for the text basic encoder $E_b^x$, the text style encoder $E_s^x$, and the visual style encoder $E_s^v$.

This idea motivates a minimax game between the encoders $E_b^x$, $E_s^x$, $E_s^v$ and the domain discriminator $P_d^z$. On one hand, the three encoders try to fool the domain discriminator to maximize the discrimination loss. On the other hand, the domain discriminator aims to discover the event-specific information included in the style feature representations to recognize the event. Based on the final event-invariant style representation, we can train a robust social media rumor detector which performs well on the newly emerged events.
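The multimodal label predictor and the domain-adversarial branch of Eqs. (20)–(24) can be sketched as below, using a gradient reversal layer as indicated in the implementation details (Section V-A2). The fusion by concatenation and the feature dimensions follow the text; the module names and the simple two-layer discriminator are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

label_predictor = nn.Linear(16 + 8, 2)                   # P_s^z on z = mu_s concat v_s (Eq. 20)
domain_discriminator = nn.Sequential(                    # P_d^z (Eq. 22)
    nn.Linear(16 + 8, 64), nn.ReLU(), nn.Linear(64, 2))

def mdda_losses(mu_s_src, v_s_src, labels_src, mu_s_tgt, v_s_tgt, lamb=1.0):
    """Label loss on source posts (Eq. 21) plus domain loss on source and target posts (Eq. 23).

    Backpropagating through grad_reverse makes the style encoders maximize the domain loss
    while the discriminator minimizes it, i.e. the minimax game behind Eq. (24).
    """
    z_src = torch.cat([mu_s_src, v_s_src], dim=-1)
    z_tgt = torch.cat([mu_s_tgt, v_s_tgt], dim=-1)

    label_loss = F.cross_entropy(label_predictor(z_src), labels_src)

    z_all = torch.cat([z_src, z_tgt], dim=0)
    domain_targets = torch.cat([torch.zeros(len(z_src)), torch.ones(len(z_tgt))]).long()
    domain_logits = domain_discriminator(grad_reverse(z_all, lamb))
    domain_loss = F.cross_entropy(domain_logits, domain_targets)
    return label_loss, domain_loss
```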

D. Training and Inference

The main idea of our training approach is to alternate between the disentangled representation learning and the domain adaptation. Algorithm 1 illustrates the training algorithm. The training iterates over two consecutive stages. The goal of the first, disentanglement stage is to combine the style features with the content features to reconstruct the input, while ensuring that the style features carry the label-related message and the content features carry only content-specific information. The former is achieved by minimizing the loss $\mathcal{L}^x$, while the latter depends on the label loss $\mathcal{L}_z$ and the two adversarial losses $\mathcal{L}_c^x$ and $\mathcal{L}_c^v$. The second, domain adaptation stage learns to extract common style features which are discriminative and event-invariant. In other words, the common style features should ideally capture all the information relevant to the classification task, where that information is not event-specific.

In the inference phase, given a post in the target event, the proposed multimodal disentangled domain adaptation network extracts the event-invariant style features for testing.
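As a rough, self-contained skeleton, the alternating schedule described above might be organized as follows; the two loss callables stand in for the stage-1 and stage-2 objectives of Sections IV-B and IV-C and are supplied by the caller, so nothing here should be read as the authors' actual Algorithm 1.

```python
def train_epoch(batches, opt_disentangle, opt_domain, disentangle_loss_fn, domain_loss_fn):
    """One epoch of the alternating two-stage schedule sketched in Section IV-D.

    disentangle_loss_fn(src_batch, src_labels) is assumed to return the stage-1 loss
    (text/image reconstruction, the label loss L_z on the fused style code, and the
    content-side adversarial terms), and domain_loss_fn(src_batch, tgt_batch) the
    stage-2 domain-adversarial loss L_d; both are hypothetical callables supplied by
    the caller, not functions defined in the paper.
    """
    for src_batch, src_labels, tgt_batch in batches:
        # Stage 1: multimodal disentangled representation learning (labeled source posts only).
        loss_disentangle = disentangle_loss_fn(src_batch, src_labels)
        opt_disentangle.zero_grad()
        loss_disentangle.backward()
        opt_disentangle.step()

        # Stage 2: unsupervised domain adaptation (source and target posts together).
        loss_domain = domain_loss_fn(src_batch, tgt_batch)
        opt_domain.zero_grad()
        loss_domain.backward()
        opt_domain.step()
```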
V. EXPERIMENTS

A. Experimental Setup

1) Dataset: Nine events from the PHEME [63] and PHEME_veracity [64] datasets are used to validate the effectiveness of the proposed method on social media event rumor detection.

The PHEME dataset is constructed by collecting thousands of claims about five breaking news events, including: the shooting at Charlie Hebdo, the hostage situation in Sydney, the Ferguson unrest, the shooting in Ottawa, and the crash of a Germanwings plane. The PHEME_veracity dataset extends the PHEME dataset with four more events: the missing of Putin, the secret show of Prince in Toronto, Gurlitt's art collection, and the Ebola virus rumor about Essien. Table III gives the statistical details of the nine events.

TABLE III: STATISTICAL DETAILS OF THE NINE EVENTS

To give a rough estimation of how difficult it is to migrate knowledge between events, we briefly describe each event:
1) Charlie Hebdo (C): A terrorist attack on the offices of Charlie Hebdo (a French satirical weekly newspaper) in Paris, France, on Jan. 7th, 2015.
2) Sydney Siege (S): A hostage situation that took place in Sydney, Australia, on Dec. 15th, 2014.
3) Ferguson Unrest (F): Civil unrest caused by a white police officer shooting a young African American man in the city of Ferguson, Missouri, on Aug. 9th, 2014.
4) Ottawa Shooting (O): A series of shootings at Parliament Hill in Ottawa, Canada, on Oct. 22nd, 2014.
5) Germanwings Crash (G): A plane crash. An Airbus A320-211 operated by Germanwings crashed in the French Alps on Mar. 24th, 2015.
6) The Missing of Putin (M): An online farce. In March 2015, Russian President Vladimir Putin disappeared from public view for several weeks, with numerous rumors circulating.
7) Toronto Prince Show (T): Celebrity gossip claiming that Prince would perform a secret show at Toronto's Massey Hall on Nov. 4th, 2014.
8) Gurlitt Trove: A politically sensitive event. The Gurlitt Trove was a collection of around 1500 artworks assembled by the late German art dealer Hildebrand Gurlitt, which was announced as "Nazi loot" by the media in 2014.
9) Ebola Essien: Celebrity gossip. The Ghana midfielder Michael Essien was left out of his country's squad for an African Cup of Nations qualifier on Oct. 12th, 2014, which prompted rumors on Twitter that he had contracted the Ebola virus.

As shown in Table III, we divide the nine events into three parts: (1) the four largest events (C, S, F, O) have enough data for training; (2) the three small events (G, M, T), which have relatively less data, are regarded as the target events waiting for knowledge transfer; (3) the two tiny events (Gurlitt Trove and Ebola Essien) are discarded since their data volume is too small.

To achieve an unbiased evaluation, we perform two experiments. In Experiment One, we compare all methods on twelve tasks in which the source and target events both come from the four largest events: C → S, C → F, C → O, S → C, S → F, S → O, F → C, F → S, F → O, O → C, O → S, O → F. In Experiment Two, we compare all methods on twelve tasks in which the source events come from the four largest events and the target events come from the three small events: C → G, C → M, C → T, S → G, S → M, S → T, F → G, F → M, F → T, O → G, O → M, O → T.

2) Implementation Details: For textual disentangled representation learning, we set the embedding size to 200 and use the pre-trained GloVe word vectors [65] trained on Twitter data to initialize the word embeddings. The hidden size of the GRU in the text encoder and the text decoder is 128. The number of GRU steps $n$ is set to 15. The dimensions of the text style latent variable and the text content latent variable are set to 16 and 128, respectively. The activation function of the text style discriminator $P_c^x$ and the visual style discriminator $P_c^v$ is Tanh. We use the Adam optimizer [66] for the text autoencoder, the text style discriminator, and the label predictor with an initial learning rate of $10^{-3}$. For visual disentangled representation learning, we set the dimensions of the visual style feature and the visual content feature to 8 and 64, respectively. We use the Adam optimizer for the visual autoencoder and visual style discriminator with an initial learning rate of $10^{-4}$. The visual disentangled representation learning is first pre-trained following [67]. The hyper-parameters $\lambda_z$, $\lambda_c^x$, and $\lambda_c^v$ are set to 10, 1, and 1, respectively.
The batch size is 32. We implement the adversarial learning based domain adaptation with the gradient reversal layer [52].
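For reference, the hyper-parameters reported above can be collected into a single configuration; the key names below are illustrative, only the values come from the text.

```python
# Hyper-parameters reported in the implementation details; key names are illustrative.
config = {
    "text": {
        "embedding_dim": 200,        # GloVe Twitter vectors used for initialization
        "gru_hidden": 128,           # encoder/decoder GRU hidden size
        "max_tokens": 15,            # number of GRU steps n
        "style_dim": 16,
        "content_dim": 128,
        "learning_rate": 1e-3,       # Adam: text autoencoder, style discriminator, label predictor
    },
    "visual": {
        "style_dim": 8,
        "content_dim": 64,
        "learning_rate": 1e-4,       # Adam: visual autoencoder, visual style discriminator
    },
    "loss_weights": {"lambda_z": 10, "lambda_c_text": 1, "lambda_c_visual": 1},
    "batch_size": 32,
    "discriminator_activation": "tanh",
}
```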
3) Evaluation Protocol: In this paper, we follow the standard evaluation protocol for unsupervised domain adaptation and use all source examples with labels and all target examples without labels [68]–[71]. We calculate the mean accuracy for each category and the overall mean of these accuracies.
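The reported metric (the accuracy of each category and the overall mean of these accuracies) can be computed as in the short sketch below, assuming binary labels and predictions given as integer arrays.

```python
import numpy as np

def per_class_and_mean_accuracy(y_true, y_pred):
    """Accuracy for each class (rumor / non-rumor) and the mean of these accuracies."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = {c: float((y_pred[y_true == c] == c).mean()) for c in np.unique(y_true)}
    return per_class, float(np.mean(list(per_class.values())))

# Example: per_class_and_mean_accuracy([0, 0, 1, 1], [0, 1, 1, 1]) -> ({0: 0.5, 1: 1.0}, 0.75)
```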
ternal memory shared during the whole training pro-
cess is adapted to capture training data’s internal la-
B. Comparison Models tent topic information. In this paper, we remove the se-
To validate the effectiveness of the proposed model, we quence modeling module of MKEMN to suit the problem
choose baselines from the following five categories: the style settings.
based rumor detection models, the conventional rumor detec- 4) Deep Domain Adaptation Models: In this category, we
tion models, the multimodal based rumor detection models, the choose some deep domain adaptation models as the baselines
deep domain adaptation models, and the variants of the proposed of social media event rumor detection. The comparisons of
model. deep domain adaptation models are Domain Adversarial Neu-
1) Style Based Rumor Detection Models: We first compare ral Networks (DANN) [52], Event Adversarial Neural Net-
our model with the traditional style based rumor detection work (EANN) [16], and Joint Adversarial Domain Adaptation
models, which extract various hand-crafted style features from (JADA) [74]. The details are as follows:
tweets. The comparisons of style based models are: Linguistic 1) DANN: The DANN aims to learn the deep features that
Inquiry and Word Count (LIWC) [72] and Stylometric Inquiry are discriminative for the main learning task and invariant
(SI) [13]. The details are as follow: between domains. We re-implement the DANN algorithm
1) LWIC: LIWC is a widely accepted psycho-linguistics lex- and apply it to the rumor detection task. In particular, we
icon. Given a tweet, LIWC can count the words in the use only the textual information of post to conduct rumor
text falling into one or more of over 80 linguistic, psy- detection.
chological, and topical categories. These numbers act as 2) EANN: The EANN aims to classify a post as rumor or not
hand-crafted features used by a Text-CNN network to pre- by leveraging both the textual and visual information of the
dict rumors. post and utilizing a gradient reversal layer [52] to remove
2) SI: SI proposes a bunch of hand-crafted features to identify event content information from post representation.
the hyperpartisan news the fake news, including readabil- 3) JADA: The JADA simultaneously aligns domain-wise and
ity scores, ratios of quoted words and external links and class-wise distributions across source and target in a uni-
dictionary features, etc. These hand-crafted features are fied adversarial learning process. We apply the JADA to
also used by a Text-CNN network to predict rumors. the social media event rumor detection by replacing the
2) Conventional Rumor Detection Models: We also compare event discriminator in EANN with JADA.
our model with the conventional deep learning models: recurrent 5) Variants of the Proposed MDDA: The proposed MDDA
neural network with Gated Recurrent Unit (GRU) [59], Text model consists of three components: the textual disentangled
Convolutional Neural Network (Text-CNN) [73]. Some details representation learning, the visual disentangled representation
are as follows: learning, and domain adaptation. To evaluate the effectiveness
1) Text-CNN: Text-CNN uses a convolutional neural net- of different components in our method, we ablate our method
work to learn the representation for the text of the post, into several simplified models and compare their performance
and adopts a fully connected layer with sigmoid function against related methods. The details of these methods are de-
to predict the label of the post. scribed as follows:
2) GRU: The GRU method uses a gated recurrent neural net- 1) MDDA w/o V: A single modality model, in which we re-
work to extract the textual feature for each post. Then, a move the visual disentangled representation learning mod-
fully connected layer with a sigmoid function is used to ule.
predict whether this post is rumor or not. 2) MDDA w/o S: A variant of MDDA with the text and visual
3) Multimodal Based Rumor Detection Models: We select style discriminators being removed.
some state-of-the-art multimodal based models as baselines: 3) MDDA w/o D: A variant of MDDA with the domain dis-
Recurrent Neural Network with an attention (att-RNN) [15] and criminator being removed.
Multimodal Variational Autoencoder (MVAE) [10], Multimodal Since not all the posts have an image attached, we do not set
Knowledge-aware Event Memory Network (MKEMN) [29]. a variant which has only visual modality.
1) att-RNN: The att-RNN is a multimodal model which uses
attention mechanism to fuse the textual, visual, and social
context features to debunk rumors. For a fair comparison, C. Result
in our experiments, we work with a variant of att-RNN According to the problem setting of social media event rumor
which does not include the social context information. detection, we use the entire labeled data in the source domain
2) MVAE: The MVAE aims to learn the latent multimodal and unlabeled data in the target domain for the experiment. The
code of post for rumor detection. It uses a multimodal classification accuracy results in Experiment One are shown in

TABLE IV: RESULTS OF THE CLASSIFICATION ACCURACY (%) IN EXPERIMENT ONE FOR SOCIAL MEDIA EVENT RUMOR DETECTION. FROM LEFT TO RIGHT: DETECTION METHODS, TWELVE EVENT RUMOR DETECTION TASKS, AND AVERAGE CLASSIFICATION ACCURACY. NOTATION: C → S REPRESENTS THE EVENT RUMOR DETECTION TASK WITH LABELED DATA IN EVENT C AND UNLABELED DATA IN EVENT S. THE DETAILS OF THE EVENTS ARE DESCRIBED IN SECTION V-A1.

TABLE V: RESULTS OF THE CLASSIFICATION ACCURACY (%) IN EXPERIMENT TWO FOR SOCIAL MEDIA EVENT RUMOR DETECTION. FROM LEFT TO RIGHT: DETECTION METHODS, TWELVE EVENT RUMOR DETECTION TASKS, AND AVERAGE CLASSIFICATION ACCURACY. NOTATION: C → G REPRESENTS THE EVENT RUMOR DETECTION TASK WITH LABELED DATA IN EVENT C AND UNLABELED DATA IN EVENT G. THE DETAILS OF THE EVENTS ARE DESCRIBED IN SECTION V-A1.

The classification accuracy results in Experiment One are shown in Table IV. Table V illustrates the results in Experiment Two, in which the source events come from the four largest events and the target events come from the three small events.

Based on the results of Experiment One in Table IV, we can make the following observations:
1) LIWC and SI, which use hand-crafted style features, perform worst among all compared methods in Experiment One. This may be because the hand-crafted style features cannot cover all the style features needed for rumor detection and suffer from limited generalization.
2) The Text-CNN model performs poorly in both Experiment One and Experiment Two. This is because Text-CNN tends to capture local specific features, which do not transfer to newly emerged events. The GRU outperforms the CNN on most tasks, indicating that higher-level features, for example the sentence-level features captured by the GRU, can provide more information for rumor detection.
3) The multimodal models (att-RNN, MVAE, MKEMN) perform better than the single-modal models (Text-CNN, GRU), which demonstrates that the additional visual and knowledge data provide more information for rumor detection [10], [15]. Furthermore, the MVAE performs better thanks to the more transferable latent multimodal code generated by the VAE. This confirms that the abstract feature representations learned by deep networks are able to reduce the domain discrepancy [75]. The MKEMN method, which adopts external knowledge features, works worse than expected. This is because, under the domain adaptation setting, the memory network in MKEMN can only capture the latent topic information of the source data.
4) Deep domain adaptation methods outperform the traditional rumor detection methods. The conventional rumor detection methods tend to capture lots of event-specific features that are not shared among different events. Such event-specific features, though able to help classify the posts of verified events, hurt the detection of newly emerged events. The results also validate that domain discrepancy can be further reduced by inserting domain adaptation constraints into deep networks (MDDA, EANN, JADA).
5) The multimodal domain adaptation models (EANN, JADA) outperform the single-modal domain adaptation model (DANN), which shows that multimodal information provides more transferable knowledge. JADA performs better than EANN because the JADA model simultaneously aligns domain-wise and class-wise distributions across source and target domains in a unified adversarial learning process, which can prevent performance degradation caused by class-wise mismatch across domains.

6) The MDDA outperforms all the baseline methods by a large margin. We attribute the superiority of MDDA to two properties: 1) The MDDA model uses multimodal disentangled representation learning to disentangle the multimedia post into event content-specific information and rumor style information. Without the distraction of the content-specific information, rumor classifiers trained on style information are more robust and achieve better performance. 2) We propose the domain adversarial neural network to solve the problem of lacking labeled data in newly emerged events, which makes the rumor classifier generalize well to all the emerging events on social media platforms.

Experiment Two provides some harder tasks to test the performance of our MDDA model. From the results in Table V, we can observe that MDDA outperforms the baseline methods on most tasks. The reasons are similar to those for Experiment One in Table IV. Besides, some other interesting conclusions can be drawn:
1) LIWC and SI achieve competitive results in Experiment Two. This demonstrates that hand-crafted features specifically designed for writing style are consistently effective in identifying rumors, whereas the deep learning methods may suffer from the limited data size and decline in performance.
2) The conventional rumor detection models work fine in Experiment One, perhaps only because the four events in Experiment One are similar. With the more dissimilar events in Experiment Two, for example * → M and * → T, they perform badly. This indicates that conventional methods cannot handle the social media event rumor detection scenario effectively.
3) The proposed MDDA model outperforms the unsupervised domain adaptation methods on most event rumor detection tasks, especially on some hard tasks such as C → T and S → T, where the source and target events are quite different. The performance of MDDA on these hard tasks reflects the robustness of the algorithm.
4) We can observe that MDDA underperforms JADA on tasks F → M and F → T. The scarce labeled training data and overfitting on the source data can explain this phenomenon. In addition, JADA proposes an improved adversarial loss to reduce the domain discrepancy while MDDA uses the original adversarial loss, which may also lead to a performance drop. Even so, MDDA achieves the best average performance among the compared methods.

D. Ablation Study

To demonstrate the effectiveness of each proposed component in our model, we design several variants of our model and compare their performance.

TABLE VI: COMPARISON AMONG MDDA VARIANTS ON THE EXPERIMENT ONE TASK SET

TABLE VII: COMPARISON AMONG MDDA VARIANTS ON THE EXPERIMENT TWO TASK SET

1) Impact of the Multimodal Information: To investigate how the multimodal information helps learn more transferable features, we remove the visual disentangled representation learning from MDDA, denoted as MDDA w/o V. The improvements of MDDA over this variant in Table VI and Table VII show the effectiveness of multimodal information, i.e., the additional visual data provide more information for rumor detection.

2) Impact of the Disentangled Representation Learning: To evaluate the effectiveness of disentangled representation learning, we compare the performance of MDDA with EANN. According to the results in Table VI and Table VII, MDDA consistently and significantly beats EANN on both datasets across various tasks, confirming the ability of the disentangled representation in conducting rumor detection for newly emerged social events. Furthermore, it is surprising that MDDA w/o V can surpass EANN, although EANN leverages both the textual and visual information of the post while MDDA w/o V uses only the textual information. This proves the great power of the disentangled representation learning.

3) Impact of the Domain Adaptation: To assess the effect of the proposed adversarial learning based domain adaptation method, we remove the gradient reversal layer from MDDA, denoted as MDDA w/o D. As shown in Table VI and Table VII, the MDDA model significantly outperforms MDDA w/o D, indicating the effectiveness of the domain alignment method in conducting rumor detection for newly emerged social events.

4) Impact of the Style Discriminator: To validate the importance of the separation between the style and content distributions, we remove the text and visual style discriminators from MDDA, named MDDA w/o S. Without the filtering of the style discriminators, the style codes $x_s$ and $v_s$ may contain content information that decreases the performance of domain knowledge transfer. As shown in Table VI and Table VII, MDDA w/o S underperforms MDDA by a large gap, indicating the importance of the separation between the style and content feature spaces in disentangled representation learning.

E. Visualization

For a more intuitive understanding of what the proposed multimodal disentangled representation learning is doing, we visualize the features generated by the text style encoder, the text content encoder, the visual style encoder, and the visual content encoder in task C → S using t-SNE embedding [76]. Note that the multimodal disentangled representation learning aims to learn the content representation and the style representation of multimedia posts, where the rumor writing style features are included in the style representation and the event content features are included in the content representation. As shown in Figure 3, posts with different styles are noticeably and cleanly separated in the style space but are indistinguishable in the content space, for both the textual and visual distributions. The visual style space is less discriminative than the text style space since not all posts are attached with images.

Fig. 3. t-SNE plots of the disentangled style and content spaces on the Charlie Hebdo (C) event [a–d] and the Sydney Siege (S) event [e–h] in task C → S.
nent in our model, we design several variants of our model and
F. Parameter Sensitivity
compare the performance of them.
1) Impact of the Multimodal Information: To investigate In this subsection we evaluate the parameter sensitivity of
how the multimodal information helps learn more transferable the hyper parameters λz , λxc , λvc . Specifically, we first select
features, we remove the visual disentangled representation Task C → S from Experiment One and task C → G from

Authorized licensed use limited to: Northeastern University. Downloaded on April 13,2023 at 15:27:55 UTC from IEEE Xplore. Restrictions apply.
4452 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 23, 2021

TABLE VI
COMPARISON AMONG MDDA VARIANTS ON EXPERIMENT ONE TASK SET

TABLE VII
COMPARISON AMONG MDDA VARIANTS ON EXPERIMENT TWO TASK SET

Fig. 3. t-SNE plots of the disentangled style and content spaces on Charlie Hebdo (C) event [a–d] and Sydney Siege (S) event [e–h] in task C → S.
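As a rough illustration of how such plots can be produced, the sketch below projects the four groups of encoder outputs to two dimensions with scikit-learn's t-SNE [76]. The variable names (`encoder_feats`, `rumor_labels`) and the perplexity setting are placeholders for whatever the trained model exports, not settings reported in the paper.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


def plot_space(features, labels, title, ax):
    """Project one feature space (e.g., the text style codes) to 2-D and color by rumor label."""
    emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)
    labels = np.asarray(labels)
    for lab, color in [(0, "tab:blue"), (1, "tab:red")]:
        idx = labels == lab
        ax.scatter(emb[idx, 0], emb[idx, 1], s=5, c=color, label=f"label {lab}")
    ax.set_title(title)
    ax.legend()


# `encoder_feats` is assumed to map each encoder name to an (N, d) feature array
# for the C -> S task, and `rumor_labels` holds the binary rumor labels.
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
names = ["text style", "text content", "visual style", "visual content"]
for ax, name in zip(axes, names):
    plot_space(encoder_feats[name], rumor_labels, name, ax)
plt.tight_layout()
plt.show()
```

If the disentanglement works as intended, the two style panels show separated clusters while the two content panels remain mixed, which is the pattern reported in Figure 3.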

F. Parameter Sensitivity

In this subsection, we evaluate the sensitivity of the hyperparameters λz, λxc, and λvc. Specifically, we first select task C → S from Experiment One and task C → G from Experiment Two as the target tasks. Then, we keep two of the hyperparameters fixed and change the value of the remaining one. In Figure 4, we illustrate the effect of λz, λxc, and λvc on performance. When λxc and λvc are fixed to 1, the relative performance across different values of λz is quite consistent (±1%) on both task C → S and task C → G. This result shows that if a good representation is learned through the disentangled learning, it will achieve good performance on rumor detection. We set λz to 10 and set one of λ∗c ∈ {λxc, λvc} to 1 when we investigate the other. From Figure 4, we can see that fluctuations of λxc and λvc have a larger impact on the model performance. This is because they directly affect the separation of the style and content distributions, and an improper setting leads to significant performance degradation.
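The sweep itself is the usual one-factor-at-a-time procedure: hold two of the weights at their defaults and retrain while varying the third. A schematic version is given below; `train_and_evaluate` is a hypothetical helper that trains MDDA on a source/target event pair with the given weights and returns target accuracy, and the candidate grid is illustrative rather than taken from the paper (only the defaults of 10 and 1 follow the text above).

```python
# One-factor-at-a-time sensitivity sweep over the loss weights.
defaults = {"lambda_z": 10.0, "lambda_xc": 1.0, "lambda_vc": 1.0}  # fixed settings from the text
grid = [0.01, 0.1, 1.0, 10.0, 100.0]                               # candidate values (illustrative)

results = {}
for name in ("lambda_z", "lambda_xc", "lambda_vc"):
    for value in grid:
        weights = dict(defaults)   # hold the other two weights at their defaults
        weights[name] = value
        acc = train_and_evaluate(source="C", target="S", **weights)  # hypothetical training helper
        results[(name, value)] = acc
        print(f"{name}={value}: target accuracy {acc:.3f}")
```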

Fig. 4. Effect of λz, λxc, and λvc on model performance.

VI. CONCLUSION

In this work, we address the realistic rumor detection scenario faced by social media platforms: social media event rumor
detection, which aims to detect rumors for emerging social media events without labeled data. To tackle the challenge of social media event rumor detection, we introduce a novel Multimodal Disentangled Domain Adaption (MDDA) method. It consists of two main components: the multimodal disentangled representation learning and the unsupervised domain adaptation. The multimodal disentangled representation learning is responsible for decomposing the multimedia posts into the event content information and the rumor writing style information. The unsupervised domain adaptation removes the event-specific features and keeps the rumor style features shared among events. Extensive experiments on two Twitter benchmark datasets demonstrate that our rumor detection method achieves much better results than the state-of-the-art methods.

REFERENCES

[1] H. Allcott and M. Gentzkow, "Social media and fake news in the 2016 election," J. Econ. Perspectives, vol. 31, no. 2, pp. 211–236, 2017.
[2] C. Castillo, M. Mendoza, and B. Poblete, "Information credibility on twitter," in Proc. 20th Int. Conf. World Wide Web, India, Mar. 28–Apr. 1, 2011, pp. 675–684.
[3] S. Kwon, M. Cha, and K. Jung, "Rumor detection over varying time windows," PLoS One, vol. 12, no. 1, 2017, Art. no. e0168344.
[4] S. Kwon et al., "Prominent features of rumor propagation in online social media," in Proc. IEEE 13th Int. Conf. Data Mining, Dec. 2013, pp. 1103–1108.
[5] X. Liu et al., "Real-time rumor debunking on twitter," in Proc. 24th ACM Int. Conf. Inf. Knowl. Manag., New York, NY, USA: ACM, 2015, pp. 1867–1870.
[6] J. Ma et al., "Detect rumors using time series of social context information on microblogging websites," in Proc. 24th ACM Int. Conf. Inf. Knowl. Manag., Melbourne, Australia, Oct. 2015, pp. 1751–1754.
[7] J. Ma et al., "Detecting rumors from microblogs with recurrent neural networks," in Proc. 25th Int. Joint Conf. Artif. Intell., New York, NY, USA, Jul. 9–15, 2016, pp. 3818–3824.
[8] F. Yu et al., "A convolutional approach for misinformation identification," in Proc. 25th Int. Joint Conf. Artif. Intell., 2017, pp. 3901–3907.
[9] J. Ma, W. Gao, and K.-F. Wong, "Detect rumors on twitter by promoting information campaigns with generative adversarial learning," in Proc. Conf. World Wide Web, New York, NY, USA, 2019, pp. 3049–3055.
[10] D. Khattar, J. S. Goud, M. Gupta, and V. Varma, "MVAE: Multimodal variational autoencoder for fake news detection," in Proc. Conf. World Wide Web, New York, NY, USA, 2019, pp. 2915–2921.
[11] N. DiFonzo and P. Bordia, Rumor Psychology: Social and Organizational Approaches. Washington, DC, USA: American Psychological Association, 2007.
[12] T. Chen, X. Li, H. Yin, and J. Zhang, "Call attention to rumors: Deep attention based recurrent neural networks for early rumor detection," in Proc. PAKDD 2018 Workshops (BDASC, BDM, ML4Cyber, PAISI, DaMEMO), Melbourne, VIC, Australia, Jun. 3, 2018, vol. 11154, Berlin, Germany: Springer, 2018, pp. 40–52.
[13] M. Potthast et al., "A stylometric inquiry into hyperpartisan and fake news," in Proc. ACL, Melbourne, Australia, Jul. 2018, pp. 231–240.
[14] Z. Jin et al., "Novel visual and statistical image features for microblogs news verification," IEEE Trans. Multimedia, vol. 19, no. 3, pp. 598–608, Mar. 2017.
[15] Z. Jin et al., "Multimodal fusion with recurrent neural networks for rumor detection on microblogs," in Proc. Multimedia Conf., New York, NY, USA, 2017, pp. 795–816.
[16] Y. Wang et al., "EANN: Event adversarial neural networks for multi-modal fake news detection," in Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, New York, NY, USA: ACM, 2018, pp. 849–857.
[17] S. Qian, T. Zhang, C. Xu, and J. Shao, "Multi-modal event topic model for social event analysis," IEEE Trans. Multimedia, vol. 18, no. 2, pp. 233–246, Feb. 2016.
[18] S. Qian, T. Zhang, and C. Xu, "Multi-modal multi-view topic-opinion mining for social event analysis," in MM 2016, Assoc. Comput. Mach., Amsterdam, The Netherlands, 2016, pp. 2–11.
[19] Q. You et al., "A multifaceted approach to social multimedia-based prediction of elections," IEEE Trans. Multimedia, vol. 17, no. 12, pp. 2271–2280, 2015.
[20] Q. Fang et al., "Word-of-mouth understanding: Entity-centric multimodal aspect-opinion mining in social media," IEEE Trans. Multimedia, vol. 17, no. 12, pp. 2281–2296, Dec. 2015.
[21] S. Liu et al., "Joint-modal distribution-based similarity hashing for large-scale unsupervised deep cross-modal retrieval," in SIGIR 2020, Virtual Event, China, Jul. 25–30, ACM, 2020, pp. 1379–1388.
[22] D. Kim, D. Kim, E. Hwang, and S. Rho, "TwitterTrends: A spatio-temporal trend detection and related keywords recommendation scheme," Multimedia Syst., vol. 21, pp. 73–86, 2013.
[23] Y. Zhang, S. Qian, Q. Fang, and C. Xu, "Multi-modal knowledge-aware hierarchical attention network for explainable medical question answering," in MM 2019, Assoc. Comput. Mach., Nice, France, 2019, pp. 1089–1097.
[24] A. Kumar and N. Sachdeva, "Multi-input integrative learning using deep neural networks and transfer learning for cyberbullying detection in real-time code-mix data," Multimedia Syst., pp. 1–15, 2020, doi: 10.1007/s00530-020-00672-7.
[25] Y. Yang et al., "Ti-CNN: Convolutional neural networks for fake news detection," 2018, arXiv:1806.00749.
[26] X. Zhou, J. Wu, and R. Zafarani, "Safe: Similarity-aware multi-modal fake news detection," 2020, arXiv:2003.04981.
[27] A. Vlachos and S. Riedel, "Fact checking: Task definition and dataset construction," in Proc. ACL 2014 Workshop, Baltimore, MD, USA: Association for Computational Linguistics, Jun. 2014, pp. 18–22.
[28] A. Magdy and N. Wanas, "Web-based statistical fact checking of textual documents," in Proc. 2nd Int. Workshop Search Mining User-Generated Contents, New York, NY, USA: Association for Computing Machinery, 2010, pp. 103–110.
[29] H. Zhang, Q. Fang, S. Qian, and C. Xu, "Multi-modal knowledge-aware event memory network for social media rumor detection," in Proc. 27th ACM Int. Conf. Multimedia, 2019, pp. 1942–1951.
[30] K. Shu, S. Wang, and H. Liu, "Understanding user profiles on social media for fake news detection," in Proc. IEEE Conf. Multimedia Inf. Process. Retrieval (MIPR), Apr. 2018, pp. 430–435.
[31] S. Yang et al., "Unsupervised fake news detection on social media: A generative approach," in Proc. Ass. Adv. Artif. Intell., vol. 33, 2019, pp. 5644–5651.
[32] Z. Jin, J. Cao, Y. Zhang, and J. Luo, "News verification by exploiting conflicting social viewpoints in microblogs," in Proc. Ass. Adv. Artif. Intell., 2016.
[33] E. Tacchini et al., "Some like it hoax: Automated fake news detection in social networks," in Proc. 2nd Workshop Data Sci. Social Good, ECML-PKDD, vol. 1960, 2017, pp. 1–15.
[34] S. Tang, N. Blenn, C. Doerr, and P. V. Mieghem, "Digging in the digg social news website," IEEE Trans. Multimedia, vol. 13, no. 5, pp. 1163–1175, Oct. 2011.
[35] K. Wu, S. Yang, and K. Q. Zhu, "False rumors detection on sina weibo by propagation structures," in ICDE, Seoul, South Korea, 2015, pp. 651–662.
[36] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
[37] G. E. Hinton, A. Krizhevsky, and S. D. Wang, "Transforming auto-encoders," in Proc. Int. Conf. Artif. Neural Netw., Berlin, Germany: Springer, 2011, pp. 44–51.
[38] J. B. Tenenbaum and W. T. Freeman, "Separating style and content," in Proc. Adv. Neural Inf. Process. Syst., 1997, pp. 662–668.
[39] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, "Semi-supervised learning with deep generative models," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 3581–3589.
[40] M. F. Mathieu et al., "Disentangling factors of variation in deep representation using adversarial training," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 5040–5048.
[41] S. Reed, K. Sohn, Y. Zhang, and H. Lee, "Learning to disentangle factors of variation with manifold interaction," in Proc. 31st Int. Conf. Mach. Learn., 2014, pp. II-1431–II-1439.
[42] S. Rifai et al., "Disentangling factors of variation for facial expression recognition," in Proc. Eur. Conf. Comput. Vis., Berlin, Heidelberg: Springer-Verlag, 2012, pp. 808–822.
[43] X. Chen et al., "Infogan: Interpretable representation learning by information maximizing generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., Barcelona, Spain, 2016, pp. 2172–2180.
[44] K. Bousmalis et al., "Unsupervised pixel-level domain adaptation with generative adversarial networks," in Proc. Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3722–3731.
[45] P. Russo, F. M. Carlucci, T. Tommasi, and B. Caputo, "From source to target and back: Symmetric bi-directional adaptive GAN," in Proc. Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8099–8108.
[46] J. Hoffman et al., "Cycada: Cycle-consistent adversarial domain adaptation," in Proc. Int. Conf. Mach. Learn., PMLR, vol. 80, 2017, pp. 1994–2003.
[47] K. Saito, Y. Ushiku, and T. Harada, "Asymmetric tri-training for unsupervised domain adaptation," in Proc. 34th Int. Conf. Mach. Learn., 2017, pp. 2988–2997.
[48] X. Yang, T. Zhang, and C. Xu, "Cross-domain feature learning in multimedia," IEEE Trans. Multimedia, vol. 17, no. 1, pp. 64–78, Jan. 2015.
[49] M. Long, H. Zhu, J. Wang, and M. I. Jordan, "Deep transfer learning with joint adaptation networks," in Proc. 34th Int. Conf. Mach. Learn., 2017, pp. 2208–2217.
[50] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, "Adversarial discriminative domain adaptation," in Proc. Conf. Comput. Vis. Pattern Recognit., 2017, pp. 7167–7176.
[51] A. Gretton et al., "A kernel two-sample test," J. Mach. Learn. Res., vol. 13, pp. 723–773, Mar. 2012.
[52] Y. Ganin et al., "Domain-adversarial training of neural networks," J. Mach. Learn. Res., vol. 17, no. 1, pp. 2096–2030, 2016.
[53] P. Haeusser, T. Frerix, A. Mordvintsev, and D. Cremers, "Associative domain adaptation," in Proc. Int. Conf. Comput. Vis., 2017, pp. 2765–2773.
[54] X. Ma, T. Zhang, and C. Xu, "Deep multi-modality adversarial networks for unsupervised domain adaptation," IEEE Trans. Multimedia, vol. 21, no. 9, pp. 2419–2431, 2019.
[55] S. Qian, T. Zhang, and C. Xu, "Cross-domain collaborative learning via discriminative nonparametric bayesian model," IEEE Trans. Multimedia, vol. 20, no. 8, pp. 2086–2099, Aug. 2018.
[56] M. Wu et al., "Unsupervised domain adaptive graph convolutional networks," in WWW'20: Web Conf., Taipei, Taiwan, Apr. 20–24, 2020, pp. 1457–1467.
[57] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," in Proc. ICLR, 2014.
[58] S. R. Bowman et al., "Generating sentences from a continuous space," in Proc. CoNLL, 2015.
[59] K. Cho et al., "Learning phrase representations using rnn encoder-decoder for statistical machine translation," 2014, arXiv:1406.1078.
[60] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Comput. Vis. Pattern Recognit., pp. 770–778, 2015.
[61] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky, "Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis," Comput. Vis. Pattern Recognit., pp. 4105–4113, 2017.
[62] X. Huang and S. J. Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," in Proc. Int. Conf. Comput. Vis., 2017, pp. 1510–1519.
[63] A. Zubiaga, M. Liakata, and R. Procter, "Exploiting context for rumour detection in social media," in Proc. SocInfo, Oxford, U.K., vol. 10539, 2017, pp. 109–123.
[64] E. Kochkina, M. Liakata, and A. Zubiaga, "All-in-one: Multi-task learning for rumour verification," in Proc. 27th Int. Conf. Comput. Linguistics, Santa Fe, New Mexico, USA: Association for Computational Linguistics, Aug. 2018, pp. 3402–3413.
[65] J. Pennington, R. Socher, and C. D. Manning, "Glove: Global vectors for word representation," in Proc. Empirical Methods Natural Lang. Process., 2014, pp. 1532–1543.
[66] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Represent., San Diego, CA, USA, 2015.
[67] X. Huang, M.-Y. Liu, S. J. Belongie, and J. Kautz, "Multimodal unsupervised image-to-image translation," in Proc. Eur. Conf. Comput. Vis., 2018.
[68] B. Gong, K. Grauman, and F. Sha, "Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation," in Proc. 30th Int. Conf. Mach. Learn., 2013, pp. I-222–I-230.
[69] M. Long, Y. Cao, J. Wang, and M. I. Jordan, "Learning transferable features with deep adaptation networks," in Proc. Int. Conf. Mach. Learn., 2015.
[70] S. Xie, Z. Zheng, L. Chen, and C. Chen, "Learning semantic representations for unsupervised domain adaptation," in Proc. Int. Conf. Mach. Learn., 2018.
[71] S. Li et al., "Deep residual correction network for partial domain adaptation," IEEE Trans. Pattern Anal. Mach. Intell., to be published, doi: 10.1109/TPAMI.2020.2964173.
[72] Y. Tausczik and J. Pennebaker, "The psychological meaning of words: LIWC and computerized text analysis methods," J. Lang. Social Psychol., vol. 29, pp. 24–54, 2010.
[73] Y. Kim, "Convolutional neural networks for sentence classification," in Proc. Empirical Methods Natural Lang. Process., 2014.
[74] S. Li et al., "Joint adversarial domain adaptation," in Proc. Multimedia Conf., New York, NY, USA, 2019, pp. 729–737.
[75] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?," in Proc. Adv. Neural Inf. Process. Syst., Cambridge, MA, USA: MIT Press, 2014, pp. 3320–3328.
[76] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," J. Mach. Learn. Res., vol. 9, pp. 2579–2605, 2008.

Huaiwen Zhang received the B.E. degree from the Inner Mongolia University, Inner Mongolia, China, in 2016. He is currently working toward the Ph.D. degree with the Multimedia Computing Group, National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include social multimedia analysis and multimedia computing.

Shengsheng Qian received the B.E. degree from the Jilin University, Changchun, China, in 2012, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2017. He is currently an Associate Professor with the Institute of Automation, Chinese Academy of Sciences. His current research interests include social media data mining and social event content analysis.

Quan Fang received the B.E. degree from Beihang University, Beijing, China, in 2010, and the Ph.D. degree from the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2015. He is currently an Associate Professor with the Institute of Automation, Chinese Academy of Sciences. His research interests include georeferenced social media mining and application, multimedia content analysis, knowledge mining, computer vision, and pattern recognition.

Changsheng Xu (Fellow, IEEE) is a Professor with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, and Executive Director of the China-Singapore Institute of Digital Media. He holds 30 granted/pending patents and has published more than 200 refereed research papers in these areas. His research interests include multimedia content analysis/indexing/retrieval, pattern recognition, and computer vision. He is an Associate Editor of IEEE TRANSACTIONS ON MULTIMEDIA, ACM Transactions on Multimedia Computing, Communications, and Applications, and ACM/Springer Multimedia Systems Journal. He was the recipient of the Best Associate Editor Award of ACM Transactions on Multimedia Computing, Communications, and Applications in 2012 and the Best Editorial Member Award of ACM/Springer Multimedia Systems Journal in 2008. He served as Program Chair of ACM Multimedia 2009. He was an Associate Editor, Guest Editor, General Chair, Program Chair, Area/Track Chair, Special Session Organizer, Session Chair, and TPC Member for more than 20 prestigious IEEE and ACM multimedia journals, conferences, and workshops. He is an IAPR Fellow and ACM Distinguished Scientist.
