Multi-modal Emotion and Cause Analysis in Modality-Switching Conversations: A New Task and the Benchmarks

Anonymous ACL submission

Abstract

Previous emotion recognition mainly focuses on text domains or on multi-modal domains where different modalities appear simultaneously, while emotion cause extraction studies mainly focus on text domains, e.g., blogs, documents, and news. However, in reality, especially in customer service and chatting conversations, image and text utterances appear alternately in a conversation. Therefore, in this paper, we attempt to recognize emotions and extract their causes in a new scenario, i.e., modality-switching conversations, where each turn of utterance has only one modality, text or image. In this scenario, we naturally explore two new tasks, namely Emotion Recognition and Cause Extraction in Modality-Switching Conversations (aka ERMSC and CEMSC). Since existing studies have never investigated these two tasks, we develop a benchmark dataset[1] and set up some benchmark approaches to accomplish both tasks. Specifically, to annotate a dataset from modality-switching conversations, we first build an annotation system[2]. Then, we obtain 5740 emotion-cause pairs and 53464 utterances with labels (we call this dataset MECMSC for short). Finally, we identify suitable methods for this scenario and build some strong benchmarks on the dataset. The benchmark results demonstrate that there is still room for improvement and that this task has great potential for application in real-life scenarios.

1 Introduction

Emotion is intrinsically a key part of human beings, and understanding people's emotions has become a vital research direction in artificial intelligence (AI). Especially in the natural language processing (NLP) community, emotion analysis (Izard, 1992; Hu et al., 2021b; Shen et al., 2021; Rajan et al., 2022) has received much attention, as it can be widely used in many fields, such as customer service, online chatting, news analysis, and dialogue systems.

As we know, in the real world, conversations are the most common form of communication between people. When a person interacts with others in a dialogue, emotional fluctuation is more likely to happen. Thus, this scenario normally yields a causal pair (an emotion and a cause utterance), in which the cause is responsible for the emotion generated by the speaker. Due to the rich meaning of people's textual, visual, and acoustic expressions, conversations naturally occur in a multi-modal form. For example, in e-commerce customer service conversations, customers and servers often communicate with texts and images in a dialogue window. Notably, in this scenario, the modalities appear alternately in each turn rather than simultaneously, which we refer to as the modality-switching phenomenon. As shown in Figure 1, two examples present modality-switching conversations, in which each turn of utterance has only one modality (text or image) and the modality may change in the next turn. Different from previous studies (Tripathi et al., 2018; Jia et al., 2021; Poria et al., 2021), the emotion and cause analysis in modality-switching conversations that we focus on in this paper raises the following challenges.

Emotion Recognition in Modality-Switching Conversations has a modality missing problem in each turn, which cannot be handled by a single temporal encoder, such as BERT or a Transformer over a long dialogue (Ghosal et al., 2019; Shen et al., 2021). This is mainly because different modalities at different time steps exhibit different semantic spaces. Moreover, as shown in Figure 1, we cannot reduce the dialogue content to a uni-modal form, since removing any utterance may lose part of the meaning of the dialogue. Traditional multi-modal approaches (Chen and Zeng, 2021; Jia et al., 2021; Rajan et al., 2022; Poria et al., 2021) assume that every utterance contains every possible modality and then fuse all modalities in each turn. However, such multi-modal approaches are difficult to extend to our modality-switching scenario.

[1] The dataset will be released upon acceptance.
[2] The label system will be open upon acceptance.
[Figure 1 shows two customer-service dialogues between a customer (顾客) and a server (客服). Each utterance is either text or an image, and each is labeled with an emotion such as angry (生气), neutral (中性), frustrated (沮丧), or happy (开心).]

Figure 1: Two examples of our dataset based on Modality-Switching Conversations.

Cause Extraction in Modality-Switching Conversations also faces a modality missing problem similar to that of emotion recognition. Besides, although one study (Poria et al., 2021) has started to explore emotion cause extraction in textual conversations, none has done so in multi-modal conversations. A key difference between textual and modality-switching conversations is that both text and image can express the cause of an emotion. For example, in Figure 1(a), utterance 4 (an image) can be not only the context of the target emotion utterance 6 but also the cause of the emotion. In Figure 1(b), the emotion sometimes comes from an image (server, utterance 6), while the cause of the emotion comes from a textual utterance in the context (customer, utterance 5), which contains the customer's forgiveness. This raises modality representation and multi-modal fusion problems for emotion cause extraction in the new scenario.

To summarise, our contributions are as follows:

1. We define the Emotion Recognition and Cause Extraction tasks in Modality-Switching Conversations (aka ERMSC and CEMSC).

2. We build an annotation toolkit and annotate a new dataset for the analysis of ERMSC and CEMSC. To the best of our knowledge, we are the first to provide a dataset for these two tasks.

3. We provide some benchmarks for further research, due to the lack of studies in this scenario.

2 Related Work

Emotion Recognition in Conversations (ERC). In the past years, many studies have been carried out for this task. We divide the current works into textual and multi-modal approaches.

Textual approaches were proposed first in conversation research. Several related studies (Poria et al., 2017; Majumder et al., 2019) that focus on static emotion recognition leverage past and future context to forecast the emotion of the target utterance. Ghosal et al. (2019) propose a graph neural network based approach, which leverages self- and inter-speaker dependencies of the interlocutors to model conversational context. Jiao et al. (2020) propose an Attention Gated Hierarchical Memory Network for real-time emotion recognition, which captures historical context and summarizes the memories appropriately to retrieve relevant information.
Ghosal et al. (2020) propose a new framework, which incorporates different elements of commonsense. Hu et al. (2021a) propose a contextual reasoning network, namely DialogueCRN, to comprehend the conversational context from a cognitive perspective.

With the development of technology, multiple modalities can complement each other for better performance. Zhang et al. (2020) propose a bidirectional dynamic dual influence network for real-time emotion recognition in conversations, which can simultaneously model both intra- and inter-modal influence with bidirectional information propagation. Hu et al. (2021b) propose a GCN-based model to explore a more effective way of utilizing both multi-modal and long-distance contextual information. Li et al. (2022) introduce a new structure to extract multi-modal emotion vectors from different modalities and involve them in an emotion capsule.

However, these approaches mainly focus on textual scenarios or on multi-modal scenarios where all modalities appear simultaneously in each turn. They have never taken modality-switching conversations into consideration.

Emotion Cause Extraction. Although many studies focus on emotion recognition, there is still a lack of research on emotion cause extraction (ECE), especially in conversation. We split the current studies into non-conversational and conversational approaches.

Several studies have addressed the non-conversational scenario. Lee et al. (2010) propose the task originally. Then, some studies follow with rule-based methods, such as (Li and Xu, 2014; Gao et al., 2015a,b; Yada et al., 2017), or with machine learning methods, such as (Ghazi et al., 2015; Song and Meng, 2015). Inspired by the corpus of Lee et al. (2010), Chen et al. (2010) argue that the clause is a suitable unit for annotation and cause analysis, and several studies (Russo et al., 2011; Gui et al., 2014) follow this task setting. In particular, Gui et al. (2016b) release a Chinese emotion cause dataset, which has received much attention and is used as the benchmark dataset of subsequent studies, such as (Xu et al., 2017; Gui et al., 2017; Li et al., 2018; Yu et al., 2019; Ding et al., 2019; Xia and Ding, 2019; Yan et al., 2021).

These approaches mainly focus on articles, blogs, and other areas, but not conversations. Recently, Poria et al. (2021) introduce the task of recognizing emotion causes in textual conversations and collect a dataset named RECCON. In addition, a pre-printed study (Wang et al., 2021) explores multi-modal emotion-cause pair extraction on TV shows reconstructed from MELD (Poria et al., 2019), namely ECF. The essential distinctions from our dataset are that 1) the modalities of ECF appear simultaneously at each turn while ours appear alternately, and 2) following the previous studies, ECF focuses on virtual multi-modal dialogues performed by actors, instead of real conversations in e-commerce.

To this end, to advance emotion cause analysis and solve the practical problem in customer service, we build a new multi-modal dataset for emotion recognition and cause extraction in modality-switching conversations. Furthermore, we set up some benchmark baselines to observe and analyze the performance on the above-mentioned new tasks.

3 Task Definition

We distinguish between emotion and cause in conversations, following the studies (Poria et al., 2021; Wang et al., 2021):

- Emotion is the psychological state expressed in the utterance that indicates the mood of the speaker. For emotion analysis, we classify emotions into seven concrete categories: Neutral, Happy, Frustrated, Angry, Surprised, Sad, and Fear. Moreover, in our modality-switching conversations, an emotion can be expressed by a textual or a visual utterance.

- Emotion cause is the utterance expressing the reason why the speaker expresses the emotion given by the target utterance. In our work, the cause comes from a visual or textual utterance, and each target emotion utterance has at least one cause utterance.

We define two kinds of tasks on our MECMSC dataset: Emotion Recognition and Cause Extraction in Modality-Switching Conversations (aka ERMSC and CEMSC). The goal of ERMSC is to detect the emotion of utterances, while the goal of CEMSC is to extract the cause of the target emotion. Moreover, the two tasks could be united as emotion-cause pair extraction, though this still needs research and is not addressed in this work.

Therefore, we define the following notations, used throughout the paper. Let D = {(X_n, E_n, C_n)}_{n=1}^N be the set of data samples.
ERC Dataset Language Modality Source Size Format
IEMOCAP (Busso et al., 2008) English T,A,V Conversation 7,433 utterances
DailyDialog (Li et al., 2017) English T Conversation 102,979 utterances
EmotionLines (Hsu et al., 2018) English T Conversation 14,503 utterances
SEMAINE (McKeown et al., 2012) English T,A,V Conversation 5,798 utterances
EmoContext (Chatterjee et al., 2019) English T Conversation 115,272 utterances
MELD (Poria et al., 2019) English T,A,V Conversation 13,708 utterances
MEISD (Firdaus et al., 2020) English T,A,V Conversation 20,000 utterances
ECE Dataset Language Modality Source Size Format
Emotion-Stimulus (Ghazi et al., 2015) English T FrameNet 2,414 sentences
ECE Corpus (Gui et al., 2016a) Chinese T SINA city news 2,105 documents
NTCIR-13-ECA (Gao et al., 2017) Chinese T SINA city news 2,403 documents
Weibo-Emotion (Cheng et al., 2017) Chinese T Microblog 7,000 posts
REMAN (Kim and Klinger, 2018) English T Fiction 1,720 documents
GoodNewsEveryone (Bostan et al., 2020) English T News 5,000 sentences
RECCON (Poria et al., 2021) English T Conversation 11,769 utterances
ECF (Wang et al., 2021) English T,A,V Conversation 13,509 utterances
Our ERC & ECE Dataset Language Modality Source Size Format
MECMSC-SHP Chinese T/V Conversation 25,974 utterances
MECMSC-COS Chinese T/V Conversation 27,490 utterances
MECMSC Chinese T/V Conversation 53,464 utterances

Table 1: A comparison of emotion and cause datasets, following (Wang et al., 2021; Poria et al., 2021). ERC: emotion recognition in conversation; ECE: emotion cause extraction; T: text, A: audio, V: vision. "T,A,V" denotes that text, audio, and vision appear simultaneously, while "T/V" denotes that text and vision appear alternately.

Given a conversation X = {u_1, u_2, ..., u_k}, ERMSC is to classify the utterances into an emotion list E = {e_1, e_2, ..., e_k}. Given X and E, the CEMSC task is to extract the emotion causes into a list C = {c_1, c_2, ..., c_k}, where k is the length of the conversation. Each utterance u_i = {s_i, t_i/o_i} contains two items: s_i denotes the speaker name, and t_i/o_i denotes either a textual utterance t_i or a visual utterance o_i.

For 1 ≤ i ≤ k, e_i ∈ E is the id of the emotion, where E is the emotion label set. c_i ∈ ζ is the position of the cause utterance, where ζ is the position set ζ = {−η, −η+1, ..., 0, oth} and η is the maximal position span before the target emotion utterance. Note that: 1) when there is more than one cause, we select the most related cause as the final cause; 2) when the cause lies after the target emotion utterance, we abandon the cause and classify it into "oth"; 3) when the span between the target emotion and cause utterance is longer than η, the cause position is classified into "oth". These three cases happen rarely and are discussed in the dataset analysis.

4 Building the MECMSC Dataset

4.1 Conversation Sources

Modality-switching conversations often appear in reality, e.g., in customer service, chat rooms, and other areas. Here, we utilize the JD customer service conversations from (Zhao et al., 2021). The raw corpus is a large-scale multi-modal multi-turn dialogue dataset collected from a mainstream Chinese e-commerce platform[3]. Distinguished from previous data, such as RECCON (Poria et al., 2021), ECF (Wang et al., 2021), IEMOCAP (Busso et al., 2008), DailyDialog (Li et al., 2017), and MELD (Poria et al., 2019), this raw data is presented as modality-switching conversations, which contain multiple modalities that appear alternately.

We find that customer service contains rich emotions and causes. Most of the conversations come from after-sales services, which usually involve many disputes. Normally, the customer points out a flaw and expresses a negative emotion, and the cause of the emotion clearly exists in the conversation. These preconditions make our annotation feasible.

Concretely, we select the raw conversations related to goods of small home appliances (SHP) and costume (COS) for annotation. Thus our MECMSC dataset consists of two parts: MECMSC-SHP and MECMSC-COS. The two datasets share the same emotion and cause label set.

[3] https://JD.com
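To make the notation of Section 3 concrete, the sketch below shows how one annotated MECMSC conversation could be represented and how a raw cause index maps into the position set ζ = {−η, ..., 0, oth}. This is an illustrative layout used only for exposition (field names are hypothetical, not the released file format), assuming η = 9 as in the experiments.

```python
from dataclasses import dataclass
from typing import List, Optional

ETA = 9  # maximal position span before the target emotion utterance (assumption: η = 9)

@dataclass
class Utterance:
    speaker: str                        # s_i, e.g. "customer" (顾客) or "server" (客服)
    text: Optional[str] = None          # t_i, set for textual turns
    image: Optional[str] = None         # o_i, e.g. an image file id, set for visual turns
    emotion: str = "Neutral"            # e_i, one of the seven emotion categories
    cause_index: Optional[int] = None   # index of the cause utterance (None for Neutral)

@dataclass
class Conversation:
    utterances: List[Utterance]         # X = {u_1, ..., u_k}; exactly one of text/image per turn

def cause_position_label(target_idx: int, cause_idx: Optional[int], eta: int = ETA) -> str:
    """Map a raw cause index to the position set ζ = {-η, ..., 0, oth}."""
    if cause_idx is None or cause_idx > target_idx:
        return "oth"                    # no cause, or cause behind the target utterance
    offset = cause_idx - target_idx     # 0 means the target utterance itself
    return str(offset) if offset >= -eta else "oth"

# e.g. a cause two turns before the target: cause_position_label(5, 3) -> "-2"
```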
4.2 Annotation Guidelines and Toolkit

To obtain a high-quality dataset, we follow the annotation settings shown in Section A.1 of the Appendix.

With the toolkit, we can easily read, annotate, and save the dataset. Context can be viewed by turning pages, and images can be enlarged by clicking. We fix the input format and add input checks to ensure that the annotators do not make unwanted errors. We have refined the toolkit several times and will release the final version. Details can be found in Section A.2 of the Appendix.

Metrics (%) COS SHP Total
Emotion Cohen's Kappa 79.5 70.6 75.4
Cause Cohen's Kappa 75.3 67.7 71.8

Table 2: The inter-annotator agreement for emotion and cause annotations.
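For reference, the agreement scores in Table 2 can be reproduced from two annotators' label sequences with a standard Cohen's Kappa implementation; a minimal sketch (assuming scikit-learn is available and using made-up labels, not our annotation files):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-utterance emotion labels from two independent annotators.
annotator_a = ["Neutral", "Angry", "Neutral", "Frustrated", "Happy", "Neutral"]
annotator_b = ["Neutral", "Angry", "Neutral", "Angry", "Happy", "Neutral"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Emotion Cohen's Kappa: {kappa:.3f}")

# The cause agreement is computed the same way, over the annotated
# cause positions (elements of the position set ζ) of the emotional utterances.
```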

Figure 2: A selection of the images in the dataset.

4.3 Annotation Quality Assessment

To evaluate the quality of the annotated dataset, we measure the consistency of the annotations with established methods. In our annotation, each utterance was first annotated by two workers with professional knowledge backgrounds. Inconsistent annotations are identified by an expert, who then gives a final judgment.

Here, we use Cohen's Kappa to estimate consistency, as exhibited in Table 2. Cohen's Kappa measures the agreement between any two annotators (Cohen, 1960). From the table, it can be seen that 1) the annotators achieve 75.4% and 71.8% Cohen's Kappa for emotion and cause annotation, respectively, and 2) MECMSC-COS performs better than MECMSC-SHP, mainly because the costume (COS) after-sales issues are relatively simple and unambiguous.

4.4 Dataset Statistics and Analysis

Global View: In Table 1, we collect many emotion and cause datasets and compare them in terms of language, modality, source, size, and format. We can see that we are the first to construct a dataset for emotion and cause analysis in modality-switching conversations (see "T/V"). Particularly among ECE datasets, few (RECCON and ECF) involve conversations. Only the unpublished dataset ECF focuses on multi-modal conversation, and it still does not involve modality-switching conversations. Moreover, the size of our dataset is comparably large, especially among ECE datasets.

Concrete Statistics: As shown in Table 3, MECMSC contains 1562 dialogues and 53464 utterances in total. 5740 utterances were annotated with emotion and cause (excluding Neutral). 63.7% (3658 items) of the causes lie in the target emotion utterance itself, while 36.3% (2082 items) lie in the contextual utterances. Moreover, a total of 350 target emotion utterances contain more than one cause. For easier modeling, the model only selects the cause with the highest priority, as defined in the annotation guidelines, as the final cause.

For further statistics, it can be seen that: 1) images often appear in dialogues: 6479 images are involved in total, about 4.1 images per dialogue on average, and 112 causes and 27 emotions come from images; 2) the average number of utterances per dialogue is about 34.2, and each utterance contains about 14.1 words; 3) about 3.7 utterances per dialogue carry emotions (excluding Neutral), and each emotion has about 32.0 contextual utterances; 4) the average span between emotion and cause utterances is about 0.68, which means the cause utterance is usually near the target emotion utterance.

Image Composition: Figure 2 gives a part of the images in our modality-switching conversations.
Number of items COS SHP Total
Dialogues 790 772 1562
Utterances 27490 25974 53464
Utterances annotated with emotion and cause 3243 2497 5740
Utterances whose cause lies solely in the same utterance 2018 1640 3658
Utterances whose cause lies solely in the contextual utterances 1225 857 2082
Utterances containing more than one cause 146 204 350
Total number of images in dialogues 3189 3290 6479
Total number of causes from images 76 36 112
Total number of emotions from images 24 3 27
Average number of images in each dialogue 4.0 4.3 4.1
Average number of utterances in each dialogue 34.8 33.6 34.2
Average number of emotions in each dialogue 4.1 3.2 3.7
Average word length of textual utterances in each dialogue 13.8 14.4 14.1
Average word length of textual cause utterances 15.7 13.8 14.8
Average word length of textual emotion utterances 13.3 12.9 13.1
Average number of contextual utterances for each emotion 33.6 29.8 32.0
Average length between emotion and cause utterance 0.74 0.59 0.68

Table 3: The concrete statistics of the dataset.
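Several of the quantities in Table 3 (e.g., images per dialogue and the average emotion-cause span) are simple aggregates over the annotations; the sketch below shows how they could be computed, using an illustrative in-memory representation of the annotated dialogues (not the released data format):

```python
from statistics import mean

# Each dialogue is a list of utterances; the fields below are illustrative only.
dialogues = [
    [
        {"modality": "text", "emotion": "Angry", "cause_index": 0},
        {"modality": "image", "emotion": "Neutral", "cause_index": None},
        {"modality": "text", "emotion": "Frustrated", "cause_index": 1},
    ],
]

# Average number of images per dialogue.
avg_images = mean(sum(u["modality"] == "image" for u in d) for d in dialogues)

# Average length between emotion and cause utterance (cf. last row of Table 3).
spans = [
    i - u["cause_index"]
    for d in dialogues
    for i, u in enumerate(d)
    if u["emotion"] != "Neutral" and u["cause_index"] is not None
]
avg_span = mean(spans)
print(avg_images, avg_span)
```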

Name Neutral Happy Frustrated Angry Surprised Sad Fear


MECMSC-COS 24247 422 735 1898 143 9 36
MECMSC-SHP 23477 483 686 1053 164 50 61
MECMSC 47724 905 1421 2951 307 59 97

Table 4: Emotion distribution of MECMSC-COS, MECMSC-SHP, and MECMSC.
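Table 4 shows that the emotion distribution is heavily skewed toward Neutral. The benchmark's emotion loss (Appendix B.4) therefore weights each class by the inverse of its frequency. The following is a minimal sketch of how such weights could be derived from the counts above and plugged into a weighted cross-entropy loss; it illustrates the weighting idea rather than reproducing the released training code, and the scaling constant follows the "divided by ten" convention described in Appendix B.4.

```python
import torch
import torch.nn as nn

# Emotion counts of MECMSC (Table 4), in a fixed label order.
EMOTIONS = ["Neutral", "Happy", "Frustrated", "Angry", "Surprised", "Sad", "Fear"]
COUNTS = torch.tensor([47724, 905, 1421, 2951, 307, 59, 97], dtype=torch.float)

# Inverse ratio of each class count to the total, divided by ten (cf. Appendix B.4).
total = COUNTS.sum()
class_weights = (total / COUNTS) / 10.0

# Weighted cross-entropy over per-utterance emotion logits.
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, len(EMOTIONS))          # e.g. 8 utterances in a batch
labels = torch.randint(0, len(EMOTIONS), (8,))  # gold emotion ids
loss = criterion(logits, labels)
```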

The images are mainly related to after-sales customer service. We find that the images are mostly screenshots of order information, screenshots of products for sale, photos of products after purchase, customer-service emoticons, and so on. The sources of the images are concentrated in a few categories and their content is distinguishable. Thus, it is worthwhile to consider how to make the model understand the pictures. A more comprehensive view of the images is given in Section A.3 of the Appendix.

Emotion Distribution: The emotion set contains seven categories, i.e., Neutral, Happy, Frustrated, Angry, Surprised, Sad, and Fear, as exhibited in Table 4. Each emotion is connected to at least one cause utterance. We can see that the distribution of emotion categories is unbalanced. In particular, the dataset contains many "Neutral" utterances, because customer service conversations contain many emotionless product descriptions and polite expressions. Also, the dataset contains more negative emotions (i.e., Frustrated and Angry) than positive emotions (i.e., Happy and Surprised), mainly because the dataset comes from customer service, which involves many after-sales complaints.

Cause Positions COS SHP Total
cause at U(>t) 2 2 4
cause at U(t−0) 2018 1640 3658
cause at U(t−1) 711 588 1299
cause at U(t−2) 252 151 403
cause at U(t−3) 110 58 168
cause at U(<t−3) 152 60 212

Table 5: Cause distribution of the dataset. U(t−i) denotes that the cause occurs i positions before the current emotion utterance position t; U(>t) denotes a cause after the emotion utterance.

Cause Distribution: The detailed cause distribution is shown in Tables 3 and 5. We find that 1) emotion causes tend to occur near the target emotion utterance, and 2) the longer the span between emotion and cause utterance, the fewer emotion-cause pairs exist. When the span is beyond a certain length, the emotion-cause pairs become negligible. Thus, emotion-cause pairs with an overly long span are ignored in the experiments.

5 Benchmarking

In this section, we build several benchmarks for emotion and cause analysis in modality-switching conversations. Since few studies have focused on this scenario, we design the approaches from scratch. Inspired by previous deep learning studies, i.e., (Jiao et al., 2020; Vaswani et al., 2017), we propose two approaches named MMSC-ERC and MMSC-ECE for the two tasks, ERMSC and CEMSC, respectively.
[Figure 3 schematic: the Modality-Switching Feature Encoding module (interlocutor, text, and image encoders), the Modality-Switching Feature Fusion Module (MSFFM), the Contextual Feature Extraction Module (CFEM), and the prediction heads for the ERMSC and CEMSC tasks.]
Figure 3: The overview of our baselines. Details can be seen in Figure 6 of Appendix.

MMSC-ERC and MMSC-ECE share some modules. Figure 3 shows an overview of our benchmarks.

Modality-Switching Feature Encoding. The task is defined in Section 3. u_i = {s_i, t_i/o_i} is one turn of a modality-switching conversation. Because textual and visual utterances appear alternately, we first complement a textual utterance with a blank image, or complement an image utterance with a blank textual utterance. We then obtain a new data group containing three items: one speaker, one existing utterance, and one padding utterance. Finally, we feed the new data group into three encoders (i.e., InterlocEncoder, TextEncoder, and ImageEncoder) and obtain the multi-modal utterance feature group h_i = [h^s_i, h^t_i, h^o_i], i ∈ {1, 2, ..., k}, where k is the length of the conversation. The InterlocEncoder and TextEncoder are based on BERT (Devlin et al., 2019), while the ImageEncoder is based on ResNet (He et al., 2016). The three encoders are described in Section B.1 of the Appendix.

Modality-Switching Feature Fusion. After modality encoding, we utilize a multi-modal fusion structure to model the relations between modalities and build a common feature for each utterance. The structure is based on the multi-modal Transformer encoder (Vaswani et al., 2017) for intra-utterance fusion, shown as MSFFM in Figure 3. The encoder consists of N layers, and each layer contains Multi-Head Attention and FeedForward modules. Details can be found in Section B.2 of the Appendix.

Contextual Feature Extraction. The two modules above are shared by the two tasks, and the Contextual Feature Extraction Module (CFEM) is shared partially. Here, we first introduce the CFEM for emotion recognition, shown in Figure 3. To capture the relations among utterances in a conversation, we apply a bidirectional GRU (Cho et al., 2014) in CFEM, following (Jiao et al., 2020). We can only make use of past contextual utterances, because the task is real-time and future utterances are unknown. The detailed description of CFEM for ERMSC is given in Section B.3 of the Appendix.

Emotion Recognition and Cause Extraction in Modality-Switching Conversations. The ERMSC task is shown in Figure 3. We feed the refined vector into a classification module to obtain the emotion probability, and calculate the loss with a cross-entropy function. The process of emotion recognition is described in Section B.4 of the Appendix.

Meanwhile, for CEMSC we utilize the shared Modality-Switching Feature Encoding and Fusion modules to encode and fuse the multi-modal features, obtaining the multi-modal utterance common representation. Subsequently, we feed the representation into a modified CFEM to obtain vectors with contextual clues. Finally, we predict the position of the cause utterance with attention methods and calculate the loss with a cross-entropy function. A detailed description is given in Section B.5 of the Appendix.

6 Experimentation

We evaluate emotion recognition and cause extraction with different settings. We discuss the details of the benchmarks and the hyperparameters for our experiments in Section C.1 of the Appendix.
Approaches dataset Neutral Positive Negative w-avg Neutral Emotion w-avg
MMSC-ERC COS 0.9591 0.0280 0.0000 0.9214 0.9604 0.0087 0.9238
MMSC-ERC SHP 0.9674 0.0000 0.0000 0.9221 0.9615 0.0296 0.9259
MMSC-ERC Total 0.9599 0.0137 0.0150 0.9224 0.9565 0.0246 0.9166

Table 6: The performance of the ERMSC task under two emotion groupings (Neutral/Positive/Negative and Neutral/Emotion) on the MECMSC dataset. The evaluation metric is F1 and w-avg is the weighted-average F1 score.

Approaches dataset Pos.0 Pos.1 Pos.2 Happy Frustrated Angry Other Positive Negative w-avg
MMSC-ECE COS 0.7465 0.2785 0.0000 0.4231 0.7887 0.4948 0.4000 0.4333 0.6154 0.5879
MMSC-ECE SHP 0.7273 0.6372 0.1250 0.7551 0.6857 0.5469 0.5000 0.7451 0.5960 0.6467
MMSC-ECE Total 0.7901 0.6575 0.1212 0.7551 0.6933 0.6605 0.2500 0.7353 0.6709 0.6903

Table 7: The performance of the CEMSC task from three aspects on the MECMSC datasets. The evaluation metric is F1 and w-avg is the weighted-average F1 score. Pos.i denotes that the span between cause and emotion utterance is i. "Other" denotes the emotion samples excluding Happy, Frustrated, and Angry.

6.1 Experimental Results

In this section, we mainly exhibit the results and give an analysis of the two tasks, ERMSC and CEMSC.

Emotion Recognition in Modality-Switching Conversations. Table 6 exhibits the experimental results for the two emotion groupings. We give a detailed analysis in Section C.2 of the Appendix.

Emotion Cause Extraction in Modality-Switching Conversations. As shown in Table 7, we exhibit the performance on the three datasets from three aspects. The detailed analysis is given in Section C.3 of the Appendix.

6.2 Analysis and Discussion

In this section, we provide some further analysis.

Case Study. We analyze the output cases of our approach for the CEMSC task. Due to space limitations, we put the analysis in Section C.4 of the Appendix.

Context Analysis. We analyze the effect of the context length and put the analysis in Section C.5 of the Appendix.

7 Conclusion and Future Work

In this work, we introduce our research on multi-modal emotion and cause analysis in modality-switching conversations. This research has great potential for application in real-life scenarios. We first construct and annotate a dataset, named MECMSC, from modality-switching conversations with the help of our designed annotation system. Secondly, inspired by advanced multi-modal methods proposed in the past few years, we design some strong benchmarks for this dataset. Finally, we report experimental results and find that there is still room for improvement. To the best of our knowledge, we are the first to conduct emotion and cause analysis research in modality-switching conversations.

Emotion recognition and cause extraction in modality-switching conversations are still challenging tasks. Our work focuses on introducing the new tasks and the annotated dataset. Only some concise benchmarks are provided, and we believe these benchmarks still have some weaknesses. In the future, the following directions deserve to be explored for better performance in modality-switching conversations:

- The annotated dataset is still not large enough, and a larger-scale dataset would probably improve the performance.

- Our dataset involves two interlocutors, a customer and a server. Better modeling of the interlocutor interaction may assist the ERMSC and CEMSC tasks.

- The pre-trained language model BERT may not be fully suitable for conversation-based tasks. A conversation-oriented pre-trained language model may bring some enhancement.

- Better understanding of the images from customer service in modality-switching conversations may promote the work.

- Better fusion of and interaction between the textual and visual modalities in modality-switching conversations may enhance the performance.
539 References dialogue dataset for emotion recognition and senti- 594
ment analysis in conversations. In Proceedings of 595
540 Laura Ana Maria Bostan, Evgeny Kim, and Roman COLING 2020, pages 4441–4453. International Com- 596
541 Klinger. 2020. Goodnewseveryone: A corpus of mittee on Computational Linguistics. 597
542 news headlines annotated with emotions, semantic
543 roles, and reader perception. In Proceedings of LREC Kai Gao, Hua Xu, and Jiushuo Wang. 2015a. Emotion 598
544 2020, pages 1554–1566. European Language Re- cause detection for chinese micro-blogs based on 599
545 sources Association. ECOCC model. In Proceedings of PAKDD 2015, 600
volume 9078 of Lecture Notes in Computer Science, 601
546 Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe pages 3–14. Springer. 602
547 Kazemzadeh, Emily Mower, Samuel Kim, Jean-
548 nette N. Chang, Sungbok Lee, and Shrikanth S. Kai Gao, Hua Xu, and Jiushuo Wang. 2015b. A rule- 603
549 Narayanan. 2008. IEMOCAP: interactive emotional based approach to emotion cause detection for chi- 604
550 dyadic motion capture database. Lang. Resour. Eval- nese micro-blogs. Expert Syst. Appl., 42(9):4517– 605
551 uation, 42(4):335–359. 4528. 606

552 Ankush Chatterjee, Umang Gupta, Manoj Kumar Chin- Qinghong Gao, Jiannan Hu, Ruifeng Xu, Lin Gui, Yulan 607
553 nakotla, Radhakrishnan Srikanth, Michel Galley, and He, Kam-Fai Wong, and Qin Lu. 2017. Overview 608
554 Puneet Agrawal. 2019. Understanding emotions in of NTCIR-13 ECA task. In Proceedings of NTCIR 609
555 text using deep learning and big data. Comput. Hum. 2017. National Institute of Informatics (NII). 610
556 Behav., 93:309–317.
Diman Ghazi, Diana Inkpen, and Stan Szpakowicz. 611
557 Guanghui Chen and Xiaoping Zeng. 2021. Multi-modal
2015. Detecting emotion stimuli in emotion-bearing 612
558 emotion recognition by fusing correlation features of
sentences. In Proceedings of CICLing 2015, volume 613
559 speech-visual. IEEE Signal Process. Lett., 28:533–
9042 of Lecture Notes in Computer Science, pages 614
560 537.
152–165. Springer. 615
561 Ying Chen, Sophia Yat Mei Lee, Shoushan Li, and Chu-
Deepanway Ghosal, Navonil Majumder, Alexander F. 616
562 Ren Huang. 2010. Emotion cause detection with
Gelbukh, Rada Mihalcea, and Soujanya Poria. 2020. 617
563 linguistic constructions. In Proceedings of COLING
COSMIC: commonsense knowledge for emotion 618
564 2010, pages 179–187. Tsinghua University Press.
identification in conversations. In Proceedings of 619
565 Xiyao Cheng, Ying Chen, Bixiao Cheng, Shoushan Li, EMNLP 2020, volume EMNLP 2020 of Findings of 620
566 and Guodong Zhou. 2017. An emotion cause corpus ACL, pages 2470–2481. Association for Computa- 621
567 for chinese microblogs with multiple-user structures. tional Linguistics. 622
568 ACM Trans. Asian Low Resour. Lang. Inf. Process.,
569 17(1):6:1–6:19. Deepanway Ghosal, Navonil Majumder, Soujanya Poria, 623
Niyati Chhaya, and Alexander F. Gelbukh. 2019. Di- 624
570 Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bah- aloguegcn: A graph convolutional neural network for 625
571 danau, and Yoshua Bengio. 2014. On the properties emotion recognition in conversation. In Proceedings 626
572 of neural machine translation: Encoder-decoder ap- of EMNLP-IJCNLP 2019, pages 154–164. Associa- 627
573 proaches. In Proceedings of SSST@EMNLP 2014, tion for Computational Linguistics. 628
574 pages 103–111. Association for Computational Lin-
575 guistics. Lin Gui, Jiannan Hu, Yulan He, Ruifeng Xu, Qin 629
Lu, and Jiachen Du. 2017. A question answer- 630
576 Jacob Cohen. 1960. A coefficient of agreement for ing approach to emotion cause extraction. CoRR, 631
577 nominal scales. Educational and Psychological Mea- abs/1708.05482. 632
578 surement, 20(1):37–46.
Lin Gui, Dongyin Wu, Ruifeng Xu, Qin Lu, and 633
579 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Yu Zhou. 2016a. Event-driven emotion cause ex- 634
580 Kristina Toutanova. 2019. BERT: pre-training of traction with corpus construction. In Proceedings of 635
581 deep bidirectional transformers for language under- EMNLP 2016, pages 1639–1649. The Association 636
582 standing. In Proceedings of NAACL-HLT 2019, for Computational Linguistics. 637
583 pages 4171–4186. Association for Computational
584 Linguistics. Lin Gui, Ruifeng Xu, Qin Lu, Dongyin Wu, and 638
Yu Zhou. 2016b. Emotion cause extraction, A chal- 639
585 Zixiang Ding, Huihui He, Mengran Zhang, and Rui lenging task with corpus construction. In Proceed- 640
586 Xia. 2019. From independent prediction to reordered ings of SMP 2016, volume 669 of Communications 641
587 prediction: Integrating relative position and global in Computer and Information Science, pages 98–109. 642
588 label information to emotion cause identification. In
589 Proceedings of AAAI 2019, pages 6343–6350. AAAI Lin Gui, Li Yuan, Ruifeng Xu, Bin Liu, Qin Lu, and 643
590 Press. Yu Zhou. 2014. Emotion cause detection with linguis- 644
tic construction in chinese weibo text. In Proceedings 645
591 Mauajama Firdaus, Hardik Chauhan, Asif Ekbal, and of NLPCC 2014, volume 496 of Communications in 646
592 Pushpak Bhattacharyya. 2020. MEISD: A multi- Computer and Information Science, pages 457–464. 647
593 modal multi-label emotion, intensity and sentiment Springer. 648

9
649 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang 703
650 Sun. 2016. Deep residual learning for image recogni- Cao, and Shuzi Niu. 2017. Dailydialog: A manually 704
651 tion. In Proceedings of CVPR 2016, pages 770–778. labelled multi-turn dialogue dataset. In Proceedings 705
652 IEEE Computer Society. of IJCNLP 2017, pages 986–995. Asian Federation 706
of Natural Language Processing. 707
653 Chao-Chun Hsu, Sheng-Yeh Chen, Chuan-Chun Kuo,
654 Ting-Hao K. Huang, and Lun-Wei Ku. 2018. Emo- Zaijing Li, Fengxiao Tang, Ming Zhao, and Yusen Zhu. 708
655 tionlines: An emotion corpus of multi-party conver- 2022. Emocaps: Emotion capsule based model for 709
656 sations. In Proceedings of LREC 2018. European conversational emotion recognition. In Proceedings 710
657 Language Resources Association (ELRA). of ACL 2022, pages 1610–1618. Association for Com- 711
putational Linguistics. 712
658 Dou Hu, Lingwei Wei, and Xiaoyong Huai. 2021a. Di-
659 aloguecrn: Contextual reasoning networks for emo- Navonil Majumder, Soujanya Poria, Devamanyu Haz- 713
660 tion recognition in conversations. In Proceedings of arika, Rada Mihalcea, Alexander F. Gelbukh, and 714
661 ACL/IJCNLP 2021, pages 7042–7052. Association Erik Cambria. 2019. Dialoguernn: An attentive RNN 715
662 for Computational Linguistics. for emotion detection in conversations. In Proceed- 716
ings of AAAI 2019, pages 6818–6825. AAAI Press. 717
663 Jingwen Hu, Yuchen Liu, Jinming Zhao, and Qin Jin.
664 2021b. MMGCN: multimodal fusion via deep graph Gary McKeown, Michel François Valstar, Roddy Cowie, 718
665 convolution network for emotion recognition in con- Maja Pantic, and Marc Schröder. 2012. The SE- 719
666 versation. In Proceedings of ACL/IJCNLP 2021, MAINE database: Annotated multimodal records of 720
667 pages 5666–5675. Association for Computational emotionally colored conversations between a person 721
668 Linguistics. and a limited agent. IEEE Trans. Affect. Comput., 722
3(1):5–17. 723
669 Carroll E Izard. 1992. Basic emotions, relations among
670 emotions, and emotion-cognition relations. Razvan Pascanu, Tomás Mikolov, and Yoshua Bengio. 724
2013. On the difficulty of training recurrent neural 725
671 Ziyu Jia, Youfang Lin, Jing Wang, Zhiyang Feng, Xi-
networks. In Proceedings of ICML 2013, volume 28 726
672 angheng Xie, and Caijie Chen. 2021. Hetemotionnet:
of JMLR Workshop and Conference Proceedings, 727
673 Two-stream heterogeneous graph recurrent neural
pages 1310–1318. JMLR.org. 728
674 network for multi-modal emotion recognition. In
675 Proceedings of ACM MM 2021, pages 1047–1056. Soujanya Poria, Erik Cambria, Devamanyu Hazarika, 729
676 ACM. Navonil Majumder, Amir Zadeh, and Louis-Philippe 730
Morency. 2017. Context-dependent sentiment analy- 731
677 Wenxiang Jiao, Michael R. Lyu, and Irwin King. 2020.
sis in user-generated videos. In Proceedings of ACL 732
678 Real-time emotion recognition via attention gated
2017, pages 873–883. Association for Computational 733
679 hierarchical memory network. In Proceedings of
Linguistics. 734
680 AAAI 2020, pages 8002–8009. AAAI Press.

681 Evgeny Kim and Roman Klinger. 2018. Who feels what Soujanya Poria, Devamanyu Hazarika, Navonil Ma- 735
682 and why? annotation of a literature corpus with se- jumder, Gautam Naik, Erik Cambria, and Rada Mi- 736
683 mantic roles of emotions. In Proceedings of COLING halcea. 2019. MELD: A multimodal multi-party 737
684 2018, pages 1345–1359. Association for Computa- dataset for emotion recognition in conversations. In 738
685 tional Linguistics. Proceedings of ACL 2019, pages 527–536. Associa- 739
tion for Computational Linguistics. 740
686 Diederik P. Kingma and Jimmy Ba. 2015. Adam: A
687 method for stochastic optimization. In Proceedings Soujanya Poria, Navonil Majumder, Devamanyu Haz- 741
688 of ICLR 2015. arika, Deepanway Ghosal, Rishabh Bhardwaj, Sam- 742
son Yu Bai Jian, Pengfei Hong, Romila Ghosh, Ab- 743
689 Sophia Yat Mei Lee, Ying Chen, and Chu-Ren Huang. hinaba Roy, Niyati Chhaya, Alexander F. Gelbukh, 744
690 2010. A text-driven rule-based system for emotion and Rada Mihalcea. 2021. Recognizing emotion 745
691 cause detection. In Proceedings of NAACL HLT 2010, cause in conversations. Cogn. Comput., 13(5):1317– 746
692 pages 45–53, Los Angeles, CA. Association for Com- 1332. 747
693 putational Linguistics.
Vandana Rajan, Alessio Brutti, and Andrea Cavallaro. 748
694 Weiyuan Li and Hua Xu. 2014. Text-based emotion 2022. Is cross-attention preferable to self-attention 749
695 classification using emotion cause extraction. Expert for multi-modal emotion recognition? In Proceed- 750
696 Syst. Appl., 41(4):1742–1749. ings of ICASSP 2022, pages 4693–4697. IEEE. 751

697 Xiangju Li, Kaisong Song, Shi Feng, Daling Wang, and Irene Russo, Tommaso Caselli, Francesco Rubino, Ester 752
698 Yifei Zhang. 2018. A co-attention neural network Boldrini, and Patricio Martínez-Barco. 2011. Emo- 753
699 model for emotion cause analysis with emotional cause: An easy-adaptable approach to extract emo- 754
700 context awareness. In Proceedings of EMNLP 2018, tion cause contexts. In Proceedings of WASSA@ACL 755
701 pages 4752–4757. Association for Computational 2011, pages 153–160. Association for Computational 756
702 Linguistics. Linguistics. 757

10
758 Weizhou Shen, Junqing Chen, Xiaojun Quan, and Zhix- and inter-modal influence for real-time emotion de- 811
759 ian Xie. 2021. Dialogxl: All-in-one xlnet for multi- tection in conversations. In Proceedings of ACM MM 812
760 party conversation emotion recognition. In Proceed- 2020, pages 503–511. 813
761 ings of AAAI 2021, pages 13789–13797. AAAI Press.
Nan Zhao, Haoran Li, Youzheng Wu, Xiaodong He, 814
762 Shuangyong Song and Yao Meng. 2015. Detecting and Bowen Zhou. 2021. The JDDC 2.0 corpus: A 815
763 concept-level emotion cause in microblogging. In large-scale multimodal multi-turn chinese dialogue 816
764 Proceedings of WWW 2015, pages 119–120. ACM. dataset for e-commerce customer service. CoRR, 817
abs/2109.12913. 818
765 Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky,
766 Ilya Sutskever, and Ruslan Salakhutdinov. 2014. A Dataset 819
767 Dropout: a simple way to prevent neural networks
768 from overfitting. J. Mach. Learn. Res., 15(1):1929– A.1 Annotation Guidelines 820
769 1958. The following guidelines are used during annota- 821

770 Samarth Tripathi, Sarthak Tripathi, and Homayoon tion: 822


771 Beigi. 2018. Multi-modal emotion recognition on
• We classify the emotion into seven categories: 823
772 iemocap dataset using deep learning. arXiv preprint
773 arXiv:1804.05788. Neutral, Happy, Frustrated, Angry, Surprised, 824
Sad, and Fear. 825
774 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob • For each utterance in conversation, we select one 826
775 Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
776 Kaiser, and Illia Polosukhin. 2017. Attention is all
emotion as the utterance final emotion. 827

777 you need. In Proceedings of NIPS 2017, pages 5998– • The emotion or cause utterance can be an image. 828
778 6008. • The emotion cause can exist before or behind the 829
target emotion utterance. Also, the cause can be 830
779 Fanfan Wang, Zixiang Ding, Rui Xia, Zhaoyu Li, and
780 Jianfei Yu. 2021. Multimodal emotion-cause pair
the target emotion utterance itself. 831

781 extraction in conversations. CoRR, abs/2110.08020. • If the emotion of utterance is Neutral, the annota- 832
tion of emotion cause is needless. 833
782 Penghui Wei, Jiahao Zhao, and Wenji Mao. 2020. Ef- • When an emotion utterance has more than one 834
783 fective inter-clause modeling for end-to-end emotion-
784 cause pair extraction. In Proceedings of ACL 2020, cause utterance, the cause nearest to the emotion 835
785 pages 3171–3181. Association for Computational utterance or in the above context has the higher 836
786 Linguistics. priority. 837

787 Rui Xia and Zixiang Ding. 2019. Emotion-cause pair A.2 Toolkit 838
788 extraction: A new task to emotion analysis in texts.
789 In Proceedings of ACL 2019, pages 1003–1012. As- Figure 4 gives an exhibition of our annotation 839
790 sociation for Computational Linguistics. toolkit. At the beginning of the annotation, we 840
label the dataset purely by hand. Then, we found 841
791 Ruifeng Xu, Jiannan Hu, Qin Lu, Dongyin Wu, and Lin that the speed of annotation was very slow and 842
792 Gui. 2017. An ensemble approach for emotion cause
793 detection with event extraction and multi-kernel svms. many mistakes appeared. During annotation, we 843
794 Tsinghua Science and Technology, 22(6):646–659. need to search the image by imageId in the image 844
file folder, which brings much trouble. Especially, 845
795 Shuntaro Yada, Kazushi Ikeda, Keiichiro Hoashi, and the annotator sometimes may forget or mix the pre- 846
796 Kyo Kageura. 2017. A bootstrap method for auto-
797 matic rule acquisition on emotion cause extraction. vious images, when there is more than one image 847
798 In Proceedings of ICDM 2017, pages 414–421. IEEE in context. Moreover, the cost of annotation is ex- 848
799 Computer Society. pensive. Therefore, we developed an annotation 849
toolkit. 850
800 Hanqi Yan, Lin Gui, Gabriele Pergola, and Yulan He.
801 2021. Position bias mitigation: A knowledge-aware In this toolkit, we can easily read the raw files, 851
802 graph model for emotion cause extraction. In Pro- annotate the sample with labels, and save the an- 852
803 ceedings of ACL/IJCNLP 2021, pages 3364–3375. notated samples into new files. Context can be 853
804 Association for Computational Linguistics. viewed by turning the page conveniently. Images 854

805 Xinyi Yu, Wenge Rong, Zhuo Zhang, Yuanxin Ouyang, can be amplified by clicking conveniently. We 855
806 and Zhang Xiong. 2019. Multiple level hierarchical write the code to fix the input format and make 856
807 network-based clause selection for emotion cause some input judgments to ensure that the annotators 857
808 extraction. IEEE Access, 7:9071–9079. do not make unwanted errors. We have improved 858

809 Dong Zhang, Weisheng Zhang, Shoushan Li, Qiaoming the toolkit many times and finally it is examined by 859
810 Zhu, and Guodong Zhou. 2020. Modeling both intra- the annotators. 860

11
Figure 4: The interface of our developed annotation toolkit.

Figure 5: The overview of the images in the dataset.

A.3 Image Composition

Figure 5 gives an overview of the images in our modality-switching conversations.
[Figure 6 schematic (detailed version of Figure 3): per-utterance Interlocutor/Text/Image encoders, the MSFFM fusion module, the CFEM context module, the ERMSC emotion classification head, and the CEMSC cause-position attention head over the candidate positions {−η, ..., 0, oth}.]
Figure 6: The detailed overview of our baselines.

13
B Benchmarking Details

B.1 Modality-Switching Feature Encoding Module

Interlocutor Encoder. We utilize BERT (Devlin et al., 2019) to encode the names of the interlocutors:

  h^s_i = InterlocEncoder(s_i), i ∈ {1, 2, ..., k}   (1)

where h^s_i = {h^s_{ij}}_{j=1}^{L_s} ∈ R^{L_s × d} and L_s is the word length of the speaker name with padding.

Text Encoder. Similar to the interlocutor encoder, we utilize BERT (Devlin et al., 2019) to extract the textual utterance feature, since BERT as a language model has excellent feature extraction capability:

  h^t_i = TextEncoder(t_i), i ∈ {1, 2, ..., k}   (2)

where h^t_i = {h^t_{ij}}_{j=1}^{L_t} ∈ R^{L_t × d} and L_t is the word length of the textual utterance with padding.

Image Encoder. We utilize ResNet (He et al., 2016) to extract the image information. ResNet, as a deep convolutional network, extracts image features well and has been applied in many previous multi-modal studies:

  h^o_i = ImageEncoder(o_i), i ∈ {1, 2, ..., k}   (3)

where h^o_i = {h^o_{ij}}_{j=1}^{L_o} ∈ R^{L_o × d} and L_o is the sequence length of the visual information with padding.

B.2 Modality-Switching Feature Fusion Module

After modality encoding, we obtain the multi-modal feature groups:

  H = {h_1, h_2, ..., h_k},  h_i = [h^s_i, h^t_i, h^o_i]   (4)

Then, we utilize a multi-modal fusion structure to model the relation between modalities and build a common feature for each utterance.

Firstly, we reshape the image feature and convert its dimension into d with a linear module; we still use h^o_i to denote the image feature. Then, we concatenate the multi-modal features with a CLS embedding and obtain:

  ĥ_i = {cls, {h^s_{ij}}_{j=1}^{L_s}, {h^t_{ij}}_{j=1}^{L_t}, {h^o_{ij}}_{j=1}^{L_o}}   (5)
  Ĥ = {ĥ_i}_{i=1}^{k}   (6)

where ĥ_i ∈ R^{L × d} and L = 1 + L_s + L_t + L_o.

Subsequently, we feed Ĥ into the multi-modal Transformer encoder (Vaswani et al., 2017) for intra-utterance fusion. The encoder consists of N layers, and each layer contains Multi-Head Attention and FeedForward modules:

  f_i = TransEnc(ĥ_i), i ∈ {1, 2, ..., k}   (7)
  m_i = f_i^{cls}   (8)
  M = {m_1, m_2, ..., m_i, ..., m_k}   (9)

where f_i ∈ R^{L × d}. We select the hidden feature at the "CLS" position of f_i as the final utterance representation m_i. Finally, we obtain the multi-modal common representation group M ∈ R^{k × d}.

B.3 Contextual Feature Extraction Module

The detailed CFEM is as follows. Assume we are calculating the refined representation of the j-th utterance. We obtain:

  →c^j_i = GRU_fwd(m_i, →c^j_{i−1}), i ∈ {1, 2, ..., j}   (10)
  ←c^j_i = GRU_bwd(m_i, ←c^j_{i+1}), i ∈ {1, 2, ..., j}   (11)
  c^j = {[←c^j_i, m_i, →c^j_i]}_{i=1}^{j}   (12)
  φ^j = c^j W_c + b_c   (13)

where c^j ∈ R^{j × 3d}, φ^j ∈ R^{j × d}, and W_c ∈ R^{3d × d}.

Subsequently, we apply a unidirectional GRU to capture the context representation and take the last vector in φ^j as the query:

  v^j_i = GRU_fwd(φ^j_i) + φ^j_i, 1 ≤ i < j   (14)
  q^j = φ^j_j W_1 + b_1   (15)

where v^j ∈ R^{(j−1) × d}, q^j ∈ R^{1 × d}, and W_1 ∈ R^{d × d}.

Finally, it is essential to weight and summarize the context to refine the representation of the query: the query in a conversation depends on the context, and the context can provide vital information. Here, we resort to an attention layer to make the query interact with the context and produce a refined vector r:

  a_i = exp(q^j v^{jT}_i) / Σ_{i'=1}^{j−1} exp(q^j v^{jT}_{i'}), 1 ≤ i < j   (16)
  r^j = Σ_{i=1}^{j−1} a_i v^j_i + q^j   (17)

where r^j ∈ R^{1 × d} is the refined vector of the emotion cues.
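To illustrate B.1 and B.2 together, the sketch below shows the per-utterance path from a modality-switching turn (text or image, with the missing modality padded) to the fused CLS representation m_i. It is a simplified, hypothetical rendition of the modules above: the BERT/ResNet encoders are replaced by stub projections, and the names and dimensions are illustrative rather than the released implementation.

```python
import torch
import torch.nn as nn

D = 768  # hidden size d, matching BERT's dimension

class MSFusion(nn.Module):
    """Pad the missing modality, encode speaker/text/image, fuse with a CLS Transformer."""

    def __init__(self, n_layers: int = 2, n_heads: int = 8):
        super().__init__()
        # Stand-ins for InterlocEncoder / TextEncoder (BERT) and ImageEncoder (ResNet):
        # here they are plain linear projections over pre-extracted features.
        self.speaker_proj = nn.Linear(D, D)
        self.text_proj = nn.Linear(D, D)
        self.image_proj = nn.Linear(2048, D)     # e.g. ResNet feature size -> d
        self.cls = nn.Parameter(torch.randn(1, 1, D))
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)  # MSFFM

    def forward(self, speaker_feat, text_feat=None, image_feat=None):
        # speaker_feat: (1, Ls, D); text_feat: (1, Lt, D) or None; image_feat: (1, Lo, 2048) or None
        if text_feat is None:                    # image turn -> blank textual utterance
            text_feat = torch.zeros(1, 1, D)
        if image_feat is None:                   # text turn -> blank image utterance
            image_feat = torch.zeros(1, 1, 2048)
        h_s = self.speaker_proj(speaker_feat)                 # h^s_i
        h_t = self.text_proj(text_feat)                       # h^t_i
        h_o = self.image_proj(image_feat)                     # h^o_i (reshaped to d)
        h_hat = torch.cat([self.cls, h_s, h_t, h_o], dim=1)   # [CLS; h^s; h^t; h^o]
        fused = self.fusion(h_hat)                            # intra-utterance fusion
        return fused[:, 0, :]                                 # m_i = CLS representation

# Example: one textual turn of a conversation.
model = MSFusion()
m_i = model(speaker_feat=torch.randn(1, 4, D), text_feat=torch.randn(1, 20, D))
print(m_i.shape)  # torch.Size([1, 768])
```

The subsequent context refinement (B.3) would then run a GRU plus attention over the sequence of m_i vectors, as in equations (10)-(17).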
B.4 Emotion Recognition

We feed the refined vector r into a classification (Linear) layer and obtain:

ρ_j = softmax(r_j W_ρ + b_ρ)    (18)

where W_ρ ∈ R^{d×|E|} and ρ_j ∈ R^{1×|E|} is the emotion probability distribution; |E| is the size of the emotion label set E.

Loss function. We calculate the loss with a cross-entropy function:

L_e = -\frac{1}{\sum_{i=1}^{N} k} \sum_{j=1}^{k} \sum_{e'=1}^{|E|} w^{e'} y_j^{e'} \log ρ_j^{e'}    (19)

where y_j is the one-hot vector of the true utterance emotion e_j, and y_j^{e'} and ρ_j^{e'} are the elements of y_j and ρ_j for emotion e'. Note that k is the length of a conversation and N is the total number of conversations. w^{e'} is the weight of emotion e', computed as the inverse ratio of the number of samples with emotion e' to the total number of samples, divided by ten.

B.5 Emotion Cause Extraction

We reuse the shared modules for multi-modal feature encoding and fusion, as in Equations (1)-(9), and finally obtain the vector group M in Equation (9) as the common multi-modal utterance representation.

Contextual Feature Extraction Module. Then, we feed M into a bidirectional GRU and obtain the vector group φ_j ∈ R^{j×d}, j ∈ {1, 2, ..., k}, as in Equations (10)-(13).

Subsequently, we embed the utterance emotion labels E, defined in Section 3, and append them to the utterance representations. We also embed each utterance's position relative to the target utterance to inject location information, since the position can influence cause extraction performance significantly (Yan et al., 2021; Wei et al., 2020). Besides, we append a randomized vector b_oth to b_j; when the cause span is larger than normal or no cause exists, b_oth is selected to predict the "oth" label. Moreover, we use the last representation (i.e., the target emotion utterance representation) as the query.

emo_i = Emb_e(e_i),   i ∈ {1, 2, ..., j}    (20)

pos_i = Emb_p(p_i),   i ∈ {1, 2, ..., η+1}    (21)

φ̂_j = {[φ^j_{j-i}; emo_i; pos_{j-i+1}]}_{i=j-η}^{j}    (22)

b_j = ReLU(φ̂_j W_2 + b_2) W_3 + b_3    (23)

b̂_j = {b_{j1}, b_{j2}, ..., b_{jη+1}, b_oth}    (24)

q_j = φ̂_{jj} W_4 + b_2    (25)

where W_2 ∈ R^{3d×d}, W_3 ∈ R^{d×d}, W_4 ∈ R^{3d×d}, b̂_j ∈ R^{(η+2)×d} and q_j ∈ R^{1×d}.

Finally, we predict the position of the cause utterance. The cause comes either from the target utterance itself or from its context. We utilize the dot product to measure the probability of each candidate cause utterance:

ρ_{ji} = \frac{\exp(q_j b̂_{ji}^T)}{\sum_{i'=1}^{η+2} \exp(q_j b̂_{ji'}^T)},   1 ≤ i ≤ η+2    (26)

where ρ_{ji} is the probability that the i-th utterance is the emotion cause of the target j-th utterance.

Loss function. We calculate the loss with a cross-entropy function:

L_c = -\frac{1}{\sum_{i=1}^{N} k} \sum_{j=1}^{k} \begin{cases} 0, & e_j = Neu \\ \sum_{c'=1}^{η+2} z^j_{c'} \log ρ^j_{c'}, & e_j ≠ Neu \end{cases}    (27)

where z^j is the one-hot vector of the true emotion cause position c_j, and z^j_{c'} and ρ^j_{c'} are the elements of z^j and ρ^j for cause position c'. |ζ| = η+2, as denoted in the task definition (Section 3), is the size of the cause position set ζ. e_j is the emotion of the j-th target utterance; note that when the target utterance is labeled "Neutral", it has no cause and contributes no loss.
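As a concrete reading of Equations (20)-(26), here is a minimal PyTorch-style sketch of the cause-scoring head (again an illustrative reconstruction rather than the released code); the emotion label-set size of 7 is an assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausePositionHead(nn.Module):
    """Illustrative sketch of Eqs. (20)-(26): score the eta+1 candidate
    utterances plus an extra 'oth' slot as the cause of the target."""

    def __init__(self, d: int = 768, n_emotions: int = 7, eta: int = 9):
        super().__init__()
        # n_emotions = 7 assumes six emotions plus Neutral.
        self.emb_e = nn.Embedding(n_emotions, d)      # Emb_e, Eq. (20)
        self.emb_p = nn.Embedding(eta + 1, d)         # Emb_p, Eq. (21)
        self.w2 = nn.Linear(3 * d, d)                 # W2, b2, Eq. (23)
        self.w3 = nn.Linear(d, d)                     # W3, b3, Eq. (23)
        self.w4 = nn.Linear(3 * d, d)                 # W4, Eq. (25)
        self.b_oth = nn.Parameter(torch.randn(1, d))  # randomized 'oth' vector

    def forward(self, phi, emotions):
        # phi: (eta+1, d) representations of the target utterance and its
        # eta preceding utterances (target last); emotions: (eta+1,) label ids.
        pos = torch.arange(phi.size(0))               # relative positions

        # Eq. (22): concatenate utterance, emotion and position embeddings.
        phi_hat = torch.cat([phi, self.emb_e(emotions), self.emb_p(pos)], dim=-1)

        # Eq. (23): candidate vectors b.
        b = self.w3(F.relu(self.w2(phi_hat)))         # (eta+1, d)

        # Eq. (24): append b_oth for "no cause in the window / other".
        b_hat = torch.cat([b, self.b_oth], dim=0)     # (eta+2, d)

        # Eq. (25): the query comes from the target utterance's own row.
        q = self.w4(phi_hat[-1:])                     # (1, d)

        # Eq. (26): softmax over dot-product scores of the eta+2 candidates.
        return F.softmax(q @ b_hat.T, dim=-1)         # (1, eta+2)

The returned distribution covers the η+1 utterances in the window plus the extra "oth" slot, matching the cause position set ζ with |ζ| = η+2.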
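The two training objectives in Equations (19) and (27) then reduce to a class-weighted cross-entropy over emotions and a cross-entropy over cause positions that is masked out for Neutral targets. A short sketch (the Neutral label id of 0 is an assumption):

import torch.nn.functional as F

def emotion_loss(log_rho, labels, class_weights):
    """Eq. (19): class-weighted cross-entropy over utterance emotions.
    class_weights[e] is the inverse class frequency of emotion e divided
    by ten; log_rho are log-probabilities, e.g. from log_softmax."""
    return F.nll_loss(log_rho, labels, weight=class_weights)

def cause_loss(log_rho_cause, cause_positions, emotion_labels, neutral_id=0):
    """Eq. (27): cross-entropy over the eta+2 cause positions, contributing
    zero loss whenever the target utterance is labeled Neutral."""
    per_target = F.nll_loss(log_rho_cause, cause_positions, reduction="none")
    mask = (emotion_labels != neutral_id).float()
    # Average over all target utterances, as in Eq. (27).
    return (per_target * mask).sum() / emotion_labels.numel()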
C Experiment

C.1 Experiment Setup

We implement our approach with the PyTorch toolkit (torch 1.7.0) on a Tesla V100-SXM2 GPU. For a reliable comparison, we perform 10-fold cross-validation in all our experiments. The hidden size d of our model is 768, the same as the hidden dimension of BERT (Devlin et al., 2019), and the number of heads in the multi-head attention is 8. We set η = 9 in our experiments; the detailed parameters can be found in the code.

Besides, we apply dropout regularization (Srivastava et al., 2014) to avoid overfitting and clip the gradients (Pascanu et al., 2013) to a maximum norm of 10.0. We train each model for a fixed number of 20 epochs and monitor its performance on the validation set. Once training is finished, we select the model with the best F1 score on the validation set as our final model and evaluate it on the test set.

During training, we use the Adam optimizer (Kingma and Ba, 2015) to minimize the loss on the training data. Because a conversation can be very long, we input one conversation at each step and split its utterances into several groups, each containing 5 target utterances and 9 contextual utterances. We then accumulate the loss over 4 groups, which yields an effective batch size of 20. For the hyper-parameters of the Adam optimizer, we set the learning rate to 0.001 for emotion recognition and 0.00001 for cause extraction, with momentum parameters β1 = 0.9 and β2 = 0.999.
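As an illustration of this schedule (not the released training script; model, conversation_loader, and split_into_groups are hypothetical placeholders), the accumulation loop could look as follows:

import torch

# model, conversation_loader and split_into_groups are assumed to be
# defined elsewhere; this only sketches the optimization schedule.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,   # 1e-5 for cause extraction
                             betas=(0.9, 0.999))
ACCUM_GROUPS = 4   # 4 groups x 5 target utterances = effective batch size 20

for epoch in range(20):
    for conversation in conversation_loader:     # one conversation per step
        optimizer.zero_grad()
        groups = split_into_groups(conversation, n_targets=5, n_context=9)
        for g, group in enumerate(groups):
            # Accumulate the loss of 4 groups (averaged so one update
            # corresponds to a batch of 20 targets).
            loss = model(group) / ACCUM_GROUPS
            loss.backward()
            if (g + 1) % ACCUM_GROUPS == 0:
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
                optimizer.step()
                optimizer.zero_grad()
    # After each epoch: evaluate on the validation set and keep the best-F1 model.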
C.2 Emotion Recognition Analysis in Modality-Switching Conversations

In our experiments, the "Neutral" and "Angry" samples (the largest non-neutral class) are heavily imbalanced, at roughly 50:3, while "Neutral" and non-neutral utterances are imbalanced at roughly 25:3. This makes the model difficult to train. Therefore, we conduct studies on two types of groupings. The first classifies the emotions into three categories: Neutral: {Neutral}, Positive: {Happy, Surprised}, and Negative: {Angry, Fear, Frustrated, Sad}. The second classifies the emotions into two categories, i.e., Neutral: {Neutral} and Emotion: {Angry, Fear, Frustrated, Happy, Sad, Surprised}. Table 6 reports the experimental results for the two groupings.

From the table, we can see that: 1) even after merging the emotions into coarser groups, the imbalance persists and still limits the performance of the approach; we hope future work can address this problem. 2) The overall weighted-average score is very high, above 90%, which indicates that the approach classifies the categories accurately, although most utterances are "Neutral".

C.3 Emotion Cause Extraction Analysis in Modality-Switching Conversations

Firstly, we report the F1 scores for the three nearest cause positions. The approach performs best at Pos.0 and worst at Pos.3, mainly because the cause usually lies near the target emotion utterance and the larger number of samples at Pos.0 yields better performance.

Secondly, we evaluate the scores on the three emotions with the largest sample sizes; the "Other" group contains the Surprised, Sad, and Fear samples. Among all the scores, those on the Happy and Frustrated samples are strong, even though their sample size is only half that of the Angry samples. This is because Happy and Frustrated, which express the speaker's own mood, often take the target emotion utterance itself as the cause, which makes the learning problem easier.

Thirdly, we evaluate the F1 on the "Positive" and "Negative" groups, where "Positive" contains the Happy and Surprised samples and "Negative" contains the Angry, Fear, Frustrated, and Sad samples. The F1 scores of the two groups are comparable, and the "Positive" samples sometimes perform better than the "Negative" ones even though the "Negative" group is 3.7 times larger. This suggests that emotion cause extraction is less influenced by the number of samples per emotion.

Finally, we overview the results across the three datasets: 1) The F1 on MECMSC-SHP is higher than its counterpart on MECMSC-COS despite the smaller dataset size, because the after-sales issues of small home appliances (SHP) are more straightforward. 2) The full dataset performs better than the COS and SHP subsets, mainly because the larger scale makes detection easier.

[Figure 7 panels (a)-(i): customer (顾客) and server (客服) dialogue excerpts, each marking the target emotion utterance and the predicted cause utterance, with Chinese utterances and their English translations.]
Figure 7: The overview of the output cases.
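For reference, the coarse groupings evaluated in Appendix C.2, and analogous breakdowns such as the per-position scores in C.3, can be computed from per-utterance predictions with a few lines of scikit-learn; the label spellings below are assumptions:

from sklearn.metrics import f1_score

# Appendix C.2 groupings (assumed label spellings).
POLARITY = {"Neutral": "Neutral", "Happy": "Positive", "Surprised": "Positive",
            "Angry": "Negative", "Fear": "Negative",
            "Frustrated": "Negative", "Sad": "Negative"}
BINARY = {e: ("Neutral" if e == "Neutral" else "Emotion") for e in POLARITY}

def grouped_f1(y_true, y_pred, mapping):
    """Map fine-grained emotion labels to coarse groups and report the
    weighted-average F1 (the w-avg score discussed in Appendix C.2)."""
    return f1_score([mapping[y] for y in y_true],
                    [mapping[y] for y in y_pred],
                    average="weighted")

# e.g., grouped_f1(gold, pred, POLARITY) and grouped_f1(gold, pred, BINARY)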
C.4 Case Analysis

Figure 7 shows output cases of our proposed approaches, from which we can see the following. 1) In (a), when the span between the cause and the target emotion utterance is too long (i.e., 5), it is difficult to predict the cause of the target emotion. 2) In (b), because the target emotion "Surprised" appears rarely (5.3% of the emotion utterances), the approach gives a wrong prediction. 3) Panels (c)-(f) give four cases of predicting the cause of "Frustrated" utterances. In (c) and (d), the servers only apologize to the customers for quality and delivery issues, while in (e) and (f) the servers give the reasons (i.e., the limited duration of the promotional activities and the large number of ongoing consultations) together with the apologies; the former predictions are wrong and the latter are right. This is mainly because the proposed approach cannot capture the causal logic and facts in the dialogue well. 4) In (g) and (h), the causes come from images and are predicted correctly. The image containing text in (g) and the image with a meme in (h) both provide key clues for the emotion utterances. Although the meaning of images is hard to understand, we still believe that images play an important role in our tasks. 5) In (i), the image in the context sent by the server gives supplementary information (i.e., that the health pot has a secondary heating function) in response to the customer's suspicion, although the cause is the target emotion utterance itself.

C.5 Context Analysis

Table 8 shows the effect of the context length. From the table, we can see that the approach performs well when the context length is 9 and that, overall, the context length has little effect on cause extraction.

Length   COS      SHP      Total
2        0.5628   0.5729   0.6873
3        0.5879   0.5879   0.6873
4        0.5879   0.6467   0.6903
5        0.5829   0.6467   0.6873
6        0.5879   0.6400   0.6903
7        0.5678   0.6400   0.6844
8        0.5879   0.6400   0.6873
9        0.5879   0.6467   0.6903
10       0.5829   0.6467   0.6903
11       0.5829   0.6467   0.6696
12       0.5829   0.6400   0.6711
13       0.5879   0.6467   0.6873

Table 8: Effect of the length of the context on the approaches.