Multi-modal Emotion and Cause Analysis in Modality-Switching Conversations: A New Task and the Benchmarks

Abstract

Previous emotion recognition studies mainly focus on text domains or on multi-modal domains where different modalities appear simultaneously, while emotion cause extraction studies mainly focus on text domains, e.g., blogs, documents, and news. In reality, however, especially in customer service and chat conversations, image and text utterances appear alternately within a conversation. In this paper, we therefore attempt to recognize emotions and extract their causes in a new scenario, i.e., modality-switching conversations, where each utterance turn has only one modality, either text or image. In this scenario, we explore two new tasks, namely Emotion Recognition and Cause Extraction in Modality-Switching Conversations (aka ERMSC and CEMSC). Since existing studies have never investigated these two tasks, we develop a benchmark dataset [1] and set up benchmark approaches for both tasks. Specifically, to annotate a dataset from modality-switching conversations, we first build an annotation system [2]. Then, we obtain 5740 emotion-cause pairs and 53464 utterances with labels (we call this dataset MECMSC for short). Finally, we identify suitable methods for this scenario and build strong benchmarks on this dataset. The benchmark results demonstrate that there is still room for improvement and that this task has great potential for application in real-life scenarios.

[1] The dataset will be released upon acceptance.
[2] The label system will be open upon acceptance.

1 Introduction

Emotion is an intrinsic part of human beings, and understanding people's emotions has become a vital research direction in artificial intelligence (AI). In the natural language processing (NLP) community in particular, emotion analysis (Izard, 1992; Hu et al., 2021b; Shen et al., 2021; Rajan et al., 2022) has received much attention, as it can be widely used in many fields, such as customer service, online chatting, news analysis, and dialogue systems.

In the real world, conversations are the most common form of communication between people. When a person interacts with others in a dialogue, emotional fluctuation is more likely to happen. This scenario therefore normally yields a causal pair (an emotion utterance and a cause utterance), in which the cause is responsible for the emotion expressed by the speaker. Owing to the rich meaning of people's textual, visual, and acoustic expressions, conversations naturally occur in a multi-modal form. For example, in e-commerce customer service conversations, customers and servers often communicate with both texts and images in a dialogue window. Notably, in this scenario the modalities appear alternately across turns rather than simultaneously, which we refer to as the modality-switching phenomenon. Figure 1 shows two examples of modality-switching conversations, in which each utterance turn has only one modality (text or image) and the modality may change in the next turn. Different from previous studies (Tripathi et al., 2018; Jia et al., 2021; Poria et al., 2021), the emotion and cause analysis in modality-switching conversations that we focus on in this paper raises the following challenges.

[Figure 1 (image not reproduced): two example dialogues, (a) and (b), between a customer (顾客) and a server (客服). Each turn is either a text utterance or an image, and each utterance carries an emotion label such as Angry (生气), Frustrated (沮丧), Neutral (中性), or Happy (开心).]

Figure 1: Two examples of our dataset based on Modality-Switching Conversations.

Emotion Recognition in Modality-Switching Conversations has a modality-missing problem in each turn, which cannot be handled by a single temporal encoder, such as BERT or a Transformer, over a long dialogue (Ghosal et al., 2019; Shen et al., 2021). This is mainly because different modalities at different time steps lie in different semantic spaces. Moreover, as shown in Figure 1, we cannot reduce the dialogue to a uni-modal form, since removing any utterance may lose part of the meaning of the dialogue. Traditional multi-modal approaches (Chen and Zeng, 2021; Jia et al., 2021; Rajan et al., 2022; Poria et al., 2021) assume by default that every utterance contains every possible modality and then fuse all modalities in each turn. Such multi-modal approaches are therefore difficult to extend to our modality-switching scenario.

Cause Extraction in Modality-Switching Conversations faces the same modality-missing problem as emotion recognition. Besides, although one study (Poria et al., 2021) has started to explore emotion cause extraction in textual conversations, none has done so in multi-modal conversations. A key difference between textual and modality-switching conversations is that both text and images can express the cause of an emotion. For example, in Figure 1(a), utterance 4 (an image) is not only part of the context of the target emotion utterance 6 but also the cause of its emotion. In Figure 1(b), the emotion comes from an image (server, utterance 6), while its cause comes from a textual utterance in the context (customer, utterance 5), which contains the customer's forgiveness. This raises modality representation and multi-modal fusion problems for emotion cause extraction in the new scenario.

To summarise, our contributions are as follows:

1. We define the Emotion Recognition and Cause Extraction tasks in Modality-Switching Conversations (aka ERMSC and CEMSC).

2. We build an annotation toolkit and annotate a new dataset for the analysis of ERMSC and CEMSC. To the best of our knowledge, we are the first to provide a dataset for these two tasks.

3. We provide some benchmarks for further research, given the lack of studies in this scenario.

2 Related Work

Emotion Recognition in Conversations (ERC). In the past years, many studies have been carried out for this task. We divide the current works into textual and multi-modal approaches.

Textual approaches were the first to be proposed in conversation research. Several related studies (Poria et al., 2017; Majumder et al., 2019) focusing on static emotion recognition leverage past and future context to predict the emotion of the target utterance. Ghosal et al. (2019) propose a graph neural network based approach, which leverages self- and inter-speaker dependency of the interlocutors to model conversational context. Jiao et al. (2020) propose an Attention Gated Hierarchical Memory Network to conduct real-time emotion recognition, which captures historical context and summarizes

the memories appropriately to retrieve relevant information. Ghosal et al. (2020) propose a new framework that incorporates different elements of commonsense. Hu et al. (2021a) propose a contextual reasoning network, namely DialogueCRN, to comprehend the conversational context from a cognitive perspective.

Beyond text, multiple modalities can complement each other for better performance. Zhang et al. (2020) propose a bidirectional dynamic dual influence network for real-time emotion recognition in conversations, which can simultaneously model both intra- and inter-modal influence with bidirectional information propagation. Hu et al. (2021b) propose a model based on GCN to explore a more effective way of utilizing both multi-modal and long-distance contextual information. Li et al. (2022) introduce a new structure to extract multi-modal emotion vectors from different modalities and involve them in an emotion capsule.

However, these approaches mainly focus on textual scenarios or on multi-modal scenarios where all modalities appear simultaneously in each turn. They have never taken modality-switching conversations into consideration.

Emotion Cause Extraction. Although many studies focus on emotion recognition, research on emotion cause extraction (ECE) is still lacking, especially in conversations. We split the current studies into non-conversational and conversational approaches.

Several studies address the non-conversational scenario. Lee et al. (2010) originally propose the task. Subsequent studies follow with rule-based methods (Li and Xu, 2014; Gao et al., 2015a,b; Yada et al., 2017) or with machine learning methods (Ghazi et al., 2015; Song and Meng, 2015). Inspired by the corpus of Lee et al. (2010), Chen et al. (2010) speculate that the clause may be a suitable unit for annotation and cause analysis. Several studies (Russo et al., 2011; Gui et al., 2014) then follow this task setting. In particular, Gui et al. (2016b) release a Chinese emotion cause dataset, which has received much attention and is used as the benchmark dataset in subsequent studies (Xu et al., 2017; Gui et al., 2017; Li et al., 2018; Yu et al., 2019; Ding et al., 2019; Xia and Ding, 2019; Yan et al., 2021).

These approaches mainly focus on articles, blogs, and other areas, but not on conversations. More recently, Poria et al. (2021) introduce the task of recognizing emotion cause in textual conversations and collect a dataset named RECCON. A preprint study (Wang et al., 2021) explores multi-modal emotion-cause pair extraction on TV shows with a dataset named ECF, which is reconstructed from MELD (Poria et al., 2019). The essential distinctions from our dataset are that 1) the modalities of ECF appear simultaneously at each turn, while ours appear alternately; and 2) following previous studies, ECF focuses on scripted multi-modal dialogues performed by actors instead of real conversations in e-commerce.

To this end, to advance emotion cause analysis and solve a practical problem in customer service, we build a new multi-modal dataset for emotion recognition and cause extraction in modality-switching conversations. Furthermore, we set up some benchmark baselines to observe and analyze performance on the two new tasks.

3 Task Definition

We distinguish between emotion and cause in conversations, following previous studies (Poria et al., 2021; Wang et al., 2021):

- Emotion is the psychological state expressed in the utterance that indicates the mood of the speaker. For emotion analysis, we classify emotions into seven concrete categories: Neutral, Happy, Frustrated, Angry, Surprised, Sad, and Fear. Moreover, an emotion can be expressed by a textual or a visual utterance in our modality-switching conversations.

- Emotion cause is the utterance expressing the reason why the speaker expresses the emotion given by the target utterance. In our work, the cause comes from a visual or textual utterance, and each target emotion utterance has at least one cause utterance.

We define two tasks on our MECMSC dataset: Emotion Recognition and Cause Extraction in Modality-Switching Conversations (aka ERMSC and CEMSC). The goal of ERMSC is to detect the emotion of each utterance, while the goal of CEMSC is to extract the cause of the target emotion. Moreover, the two tasks can be combined into emotion-cause pair extraction, though this still needs research and is not covered in this work.

We define the following notation, used throughout the paper. Let $D = \{(X_n, E_n, C_n)\}_{n=1}^{N}$ be the set of data samples.

ERC Dataset                              Language  Modality  Source          Size     Format
IEMOCAP (Busso et al., 2008)             English   T,A,V     Conversation    7,433    utterances
DailyDialog (Li et al., 2017)            English   T         Conversation    102,979  utterances
EmotionLines (Hsu et al., 2018)          English   T         Conversation    14,503   utterances
SEMAINE (McKeown et al., 2012)           English   T,A,V     Conversation    5,798    utterances
EmoContext (Chatterjee et al., 2019)     English   T         Conversation    115,272  utterances
MELD (Poria et al., 2019)                English   T,A,V     Conversation    13,708   utterances
MELSD (Firdaus et al., 2020)             English   T,A,V     Conversation    20,000   utterances

ECE Dataset                              Language  Modality  Source          Size     Format
Emotion-Stimulus (Ghazi et al., 2015)    English   T         FrameNet        2,414    sentences
ECE Corpus (Gui et al., 2016a)           Chinese   T         SINA city news  2,105    documents
NTCIR-13-ECA (Gao et al., 2017)          Chinese   T         SINA city news  2,403    documents
Weibo-Emotion (Cheng et al., 2017)       Chinese   T         Microblog       7,000    posts
REMAN (Kim and Klinger, 2018)            English   T         Fiction         1,720    documents
GoodNewsEveryone (Bostan et al., 2020)   English   T         News            5,000    sentences
RECCON (Poria et al., 2021)              English   T         Conversation    11,769   utterances
ECF (Wang et al., 2021)                  English   T,A,V     Conversation    13,509   utterances

Our ERC & ECE Dataset                    Language  Modality  Source          Size     Format
MECMSC-SHP                               Chinese   T/V       Conversation    25,974   utterances
MECMSC-COS                               Chinese   T/V       Conversation    27,490   utterances
MECMSC                                   Chinese   T/V       Conversation    53,464   utterances

Table 1: The comparison of emotion and cause datasets, following (Wang et al., 2021; Poria et al., 2021). ERC: emotion recognition in conversation, ECE: emotion cause extraction, T: text, A: audio, V: vision. "T,A,V" denotes that text, audio, and vision appear simultaneously, while "T/V" denotes that text and vision appear alternately.

Given a conversation $X = \{u_1, u_2, \ldots, u_k\}$, the ERMSC task is to classify the utterances into the emotion list $E = \{e_1, e_2, \ldots, e_k\}$. Given $X$ and $E$, the CEMSC task is to extract the emotion causes into the list $C = \{c_1, c_2, \ldots, c_k\}$, where $k$ is the length of the conversation. Each utterance $u_i = \{s_i, t_i/o_i\}$ contains two items: $s_i$ denotes the speaker name, $t_i$ denotes the textual utterance, and $o_i$ denotes the visual utterance. Note that $t_i/o_i$ means either the textual utterance or the visual utterance.

For $1 \le i \le k$, $e_i \in \mathcal{E}$ is the id of the emotion, where $\mathcal{E}$ is the emotion label set, and $c_i \in \zeta$ is the position of the cause utterance, where $\zeta$ is the position set. We define $\zeta = \{-\eta, -\eta + 1, \cdots, 0, \mathrm{oth}\}$, where $\eta$ is the maximal position span before the target emotion utterance. Note that: 1) when there is more than one cause, we select the most related cause as the final cause; 2) when the cause appears after the target emotion utterance, we abandon the cause and classify it as "oth"; 3) when the span between the target emotion utterance and the cause utterance is longer than $\eta$, the cause position is classified as "oth". These three cases happen rarely and will be discussed in the dataset analysis.

4 Building the MECMSC Dataset

4.1 Conversation Sources

Modality-switching conversations often appear in reality, e.g., in customer service, chat rooms, and other areas. Here, we utilize the JD customer service conversations from (Zhao et al., 2021). The raw conversations form a large-scale multi-modal multi-turn dialogue dataset collected from a mainstream Chinese e-commerce platform [3]. In contrast to previous data, such as RECCON (Poria et al., 2021), ECF (Wang et al., 2021), IEMOCAP (Busso et al., 2008), DailyDialog (Li et al., 2017), and MELD (Poria et al., 2019), this raw data is presented as modality-switching conversations, which contain multiple modalities that appear alternately.

We find that customer service conversations contain rich emotions and causes. Most of the conversations come from after-sales services, which usually involve many disputes. Typically, the customer points out a flaw and expresses a negative emotion, and the cause of the emotion clearly exists in the conversation. These preconditions make our annotation feasible.

Concretely, we select the raw conversations related to small home appliances (SHP) and costumes (COS) for annotation. Thus, our MECMSC dataset consists of two parts: MECMSC-SHP and MECMSC-COS. The two datasets share the same emotion and cause label set.
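Returning to the label definition in Section 3, the mapping from a (target, cause) utterance pair to a label in the position set ζ can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names `cause_position_label` and `position_set` are hypothetical, utterance indices are assumed to be 0-based positions in the conversation, and the value of η used in the usage note is an assumption.

```python
from typing import List, Union

OTH = "oth"  # catch-all label of the position set (assumed string encoding)


def position_set(eta: int) -> List[Union[int, str]]:
    """Return the label set {-eta, ..., -1, 0, oth} as a list."""
    if eta < 0:
        raise ValueError("eta must be non-negative")
    return list(range(-eta, 1)) + [OTH]


def cause_position_label(target_idx: int, cause_idx: int, eta: int) -> Union[int, str]:
    """Map a target/cause index pair to its cause-position label.

    0 means the cause is the target utterance itself, -1 the utterance
    immediately before it, and so on. Causes that follow the target, or that
    lie more than eta utterances before it, fall into "oth" (rules 2 and 3).
    """
    if eta < 0:
        raise ValueError("eta must be non-negative")
    offset = cause_idx - target_idx
    if offset > 0:       # rule 2: cause appears after the target utterance
        return OTH
    if -offset > eta:    # rule 3: span longer than the maximal span eta
        return OTH
    return offset
```

For instance, with an assumed η = 3, `cause_position_label(5, 4, 3)` yields -1, while a cause four utterances back or one utterance ahead of the target both yield "oth". Rule 1 (choosing the most related cause among several) is an annotation decision and is not modelled here.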

4.2 Annotation Guidelines and Toolkit

To obtain a high-quality dataset, we define a set of annotation settings, shown in Section A.1 of the Appendix.

In the toolkit, we can easily read, annotate, and save the dataset. Context can be viewed by turning the page, and images can be enlarged by clicking. We fix the input format and add input checks to ensure that the annotators do not make unintended errors. We have improved the toolkit many times and will release the final version. The details can be seen in Section A.2 of the Appendix.

Metrics (%)              COS    SHP    Total
Emotion Cohen's Kappa    79.5   70.6   75.4
Cause Cohen's Kappa      75.3   67.7   71.8

Table 2: The inter-annotator agreement for emotion and cause annotations.
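For readers unfamiliar with the agreement measure reported in Table 2, Cohen's Kappa for two annotators is (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. The sketch below is a plain-Python illustration; the function name `cohens_kappa` and the toy labels are assumptions and are not taken from the annotation toolkit.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's Kappa between two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example with made-up labels (not from MECMSC).
annotator_1 = ["Angry", "Neutral", "Neutral", "Happy", "Neutral", "Angry"]
annotator_2 = ["Angry", "Neutral", "Happy", "Happy", "Neutral", "Neutral"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In the toy example the annotators agree on 4 of 6 items, but the Kappa is lower than that raw rate because part of the agreement is expected by chance.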

Figure 2: A sample of the images in the dataset.

4.3 Annotation Quality Assessment

To evaluate the quality of the annotated dataset, we measure the consistency of the annotations. In our annotation, each utterance was first annotated by two workers with professional knowledge backgrounds. Inconsistent annotations were then identified by an expert, who gave the final judgment.

Here, we utilize Cohen's Kappa to estimate the consistency, as exhibited in Table 2. Cohen's Kappa measures the consistency between any two annotators (Cohen, 1960). As shown in the table, 1) the annotators achieve 75.4% and 71.8% Cohen's Kappa for emotion and cause annotation, respectively; 2) agreement on MECMSC-COS is higher than on MECMSC-SHP, mainly because the costume (COS) after-sales issues are relatively simple and unambiguous.

4.4 Dataset Statistics and Analysis

Global View: In Table 1, we collect many emotion and cause datasets and compare them in terms of language, modality, source, size, and format. We can see that we are the first to construct a dataset for emotion and cause analysis in modality-switching conversations (refer to "T/V"). Among ECE datasets in particular, only two (RECCON and ECF) involve conversations. Only one unpublished dataset, ECF, focuses on multi-modal conversations, and it still does not involve modality-switching conversations. Moreover, our dataset is comparatively large, especially among ECE datasets.

Concrete Statistics: As shown in Table 3, MECMSC contains 1562 dialogues and 53464 utterances in total. 5740 utterances were annotated with emotion and cause (excluding Neutral). 63.7% (3658 items) of the causes lie in the target emotion utterance itself, while 36.3% (2082 items) of the causes lie in the contextual utterances. Moreover, a total of 350 target emotion utterances have more than one cause. For simplicity, the model only selects the cause with the highest priority, as defined in the annotation guidelines, as the final cause.

Further statistics show that: 1) Images often appear in dialogues. 6479 images are involved in total, about 4.1 images per dialogue on average. 112 cause utterances and 27 emotion utterances are images. 2) The average number of utterances in each dialogue is about 34.2, and each utterance contains about 14.1 words. 3) About 3.7 utterances have emotions (excluding Neutral) in each dialogue, and each emotion has about 32.0 contextual utterances. 4) The average span between emotion and cause utterances is about 0.68, which means the cause utterance is near the target emotion utterance.

Image Composition. Figure 2 gives a sample of the images in our modality-switching conversations.

Number of items                                                    COS     SHP     Total
Dialogues                                                          790     772     1562
Utterances                                                         27490   25974   53464
Utterances annotated with emotion and cause                        3243    2497    5740
Utterances where cause solely lies in the same utterance           2018    1640    3658
Utterances where cause solely lies in the contextual utterances    1225    857     2082
Utterances containing more than one cause                          146     204     350
Total number of images in dialogues                                3189    3290    6479
Total number of causes from images                                 76      36      112
Total number of emotions from images                               24      3       27
Average number of images in each dialogue                          4.0     4.3     4.1
Average number of utterances in each dialogue                      34.8    33.6    34.2
Average number of emotions in each dialogue                        4.1     3.2     3.7
Average word length of textual utterances in each dialogue         13.8    14.4    14.1
Average word length of textual cause utterances                    15.7    13.8    14.8
Average word length of textual emotion utterances                  13.3    12.9    13.1
Average number of contextual utterances for each emotion           33.6    29.8    32.0
Average length between emotion and cause utterance                 0.74    0.59    0.68

Table 3: The concrete statistics of the dataset.

Name          Neutral   Happy   Frustrated   Angry   Surprised   Sad   Fear
MECMSC-COS    24247     422     735          1898    143         9     36
MECMSC-SHP    23477     483     686          1053    164         50    61
MECMSC        47724     905     1421         2951    307         59    97

Table 4: Emotion distribution of MECMSC-COS, MECMSC-SHP, and MECMSC.
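The degree of class imbalance in Table 4 can be made explicit with a few lines of arithmetic. The sketch below only recomputes shares from the MECMSC row of the table; the variable and function names are illustrative, and the grouping of Frustrated and Angry as negative and Happy and Surprised as positive follows the grouping used in the text.

```python
# Counts copied from the MECMSC row of Table 4.
MECMSC_COUNTS = {
    "Neutral": 47724,
    "Happy": 905,
    "Frustrated": 1421,
    "Angry": 2951,
    "Surprised": 307,
    "Sad": 59,
    "Fear": 97,
}


def class_shares(counts):
    """Return each label's fraction of all utterances."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


total_utterances = sum(MECMSC_COUNTS.values())
non_neutral = total_utterances - MECMSC_COUNTS["Neutral"]
negative = MECMSC_COUNTS["Frustrated"] + MECMSC_COUNTS["Angry"]
positive = MECMSC_COUNTS["Happy"] + MECMSC_COUNTS["Surprised"]
shares = class_shares(MECMSC_COUNTS)

for label, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{label:<10} {share:6.2%}")
```

The totals agree with Table 3: the seven classes sum to 53464 utterances, of which 5740 are non-Neutral, and Neutral alone accounts for roughly 89% of all utterances.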

The images are mainly related to after-sales customer service. They are mainly screenshots of order information, screenshots of products for sale, photos of products after purchase, customer service emoticons, and so on. The sources of the images are concentrated in a few categories, and the content of the images is distinctive. Thus, it is worthwhile to consider how to make the model understand the pictures. A more comprehensive view of the images is given in Section A.3 of the Appendix.

Emotion Distribution: The emotion set contains seven categories, i.e., Neutral, Happy, Frustrated, Angry, Surprised, Sad, and Fear, as exhibited in Table 4. For each emotion, at least one cause utterance connects to it. We can see that the distribution of emotion categories is unbalanced. In particular, the dataset contains a very large number of "Neutral" utterances. This is because customer service conversations contain many emotionless product descriptions and polite expressions. Also, the dataset contains more negative emotions (i.e., Frustrated and Angry) than positive emotions (i.e., Happy and Surprised). This is mainly because the dataset comes from customer service, which has many after-sales complaints.

Cause Distribution: The detailed cause distribution is shown in Tables 3 and 5. We can find that 1) emotion causes tend to occur near the target emotion utterance; 2) the longer the span between the emotion and cause utterances, the fewer emotion-cause pairs exist. When the span is beyond a certain length, the number of emotion-cause pairs is negligible. Thus, emotion-cause pairs with an overly long span are ignored in the experiment.

Cause Positions       COS     SHP     Total
cause at U(>t−0)      2       2       4
cause at U(t−0)       2018    1640    3658
cause at U(t−1)       711     588     1299
cause at U(t−2)       252     151     403
cause at U(t−3)       110     58      168
cause at U(<t−3)      152     60      212

Table 5: Cause distribution of the dataset. U(t−i) denotes that the cause occurs i positions before the current emotion utterance position t.

5 Benchmarking

In this section, we build several benchmarks for emotion and cause analysis in modality-switching conversations. Since few studies have focused on this scenario, we design the approaches from scratch. Inspired by previous deep learning studies, i.e., (Jiao et al., 2020; Vaswani et al., 2017), we propose two approaches named MMSC-ERC and MMSC-ECE for the two tasks, ERMSC and CEMSC, respectively. MMSC-ERC and MMSC-ECE share some modules. Figure 3 shows an overview of our benchmarks.

[Figure 3 (image not reproduced): architecture overview. Each utterance passes through the Interlocutor, Text, and Image Encoders of the Modality-Switching Feature Encoding Module (the missing text or image is fed as a masked feature), then through the Modality-Switching Feature Fusion Module (MSFFM) and the Contextual Feature Extraction Module (CFEM). The ERMSC branch outputs the predicted emotions, and the CEMSC branch outputs cause-position probabilities over −η, ..., −1, 0, oth.]

Figure 3: The overview of our baselines. Details can be seen in Figure 6 of the Appendix.

Modality-Switching Feature Encoding. The task is defined in Section 3, where $u_i = \{s_i, t_i/o_i\}$ is one utterance of a modality-switching conversation. Because textual and visual utterances appear alternately, we first complement each textual utterance with a blank image and each image utterance with a blank textual utterance. We thus obtain a new data group containing three items: one speaker, one existing utterance, and one padding utterance. Finally, we feed the new data group into three encoders (i.e., InterlocEncoder, TextEncoder, and ImageEncoder) and obtain the multi-modal utterance feature group $h_i = [h_i^s, h_i^t, h_i^o]$, $i \in \{1, 2, \ldots, k\}$, where $k$ is the length of the conversation. The InterlocEncoder and TextEncoder are based on BERT (Devlin et al., 2019), while the ImageEncoder is based on ResNet (He et al., 2016). The three encoders are described in Section B.1 of the Appendix.

Modality-Switching Feature Fusion. After modality encoding, we utilize a multi-modal fusion structure to learn the relation between modalities and build a common feature for each utterance. The structure is based on the multi-modal Transformer Encoder (Vaswani et al., 2017) for intra-utterance fusion, shown as MSFFM in Figure 3. The encoder consists of N layers, and each layer contains Multi-Head Attention and FeedForward modules. Details can be seen in Section B.2 of the Appendix.

Contextual Feature Extraction. The previous two modules are shared by the two tasks, and the Contextual Feature Extraction Module (CFEM) is partially shared. Here, we first introduce the CFEM for emotion recognition, shown in Figure 3.

To capture the relations among utterances in a conversation, we apply a bidirectional GRU (Cho et al., 2014) in CFEM, following (Jiao et al., 2020). We can only make use of past contextual utterances, because the task is real-time and future utterances are unknown. The detailed description of CFEM for ERMSC can be seen in Section B.3 of the Appendix.

Emotion Recognition and Cause Extraction in Modality-Switching Conversations. The ERMSC task is shown in Figure 3. We feed the refined vector into a classification module and obtain the emotion probability. Then we calculate the loss with a cross-entropy function. The process of emotion recognition can be seen in Section B.4 of the Appendix.

For the CEMSC task, we reuse the shared Modality-Switching Feature Encoding and Fusion modules to encode and fuse the multi-modal features, which gives the common multi-modal representation of each utterance. Subsequently, we feed the representation into a modified CFEM and obtain the vector with contextual clues. Finally, we predict the position of the cause utterance with attention methods and calculate the loss with a cross-entropy function. A detailed description can be seen in Section B.5 of the Appendix.

6 Experimentation

We evaluate emotion recognition and cause extraction with different settings. We discuss the details of the benchmarks and the hyperparameters for our experiments in Section C.1 of the Appendix.

Approach   Dataset   Neutral   Positive   Negative   w-avg    |  Neutral   Emotion   w-avg
MMSC-ERC   COS       0.9591    0.0280     0.0000     0.9214   |  0.9604    0.0087    0.9238
MMSC-ERC   SHP       0.9674    0.0000     0.0000     0.9221   |  0.9615    0.0296    0.9259
MMSC-ERC   Total     0.9599    0.0137     0.0150     0.9224   |  0.9565    0.0246    0.9166

Table 6: The performance of the ERMSC task by two emotion groups (Neutral/Positive/Negative and Neutral/Emotion) on the MECMSC dataset. The evaluation metric is F1, and w-avg is the weighted-average F1 score.
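To help read the w-avg column, the sketch below computes per-class F1 and the support-weighted average on a toy prediction. It assumes the usual definition in which each class F1 is weighted by that class's share of the gold labels; the function names and the toy data are illustrative and are not the authors' evaluation script.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def per_class_f1(gold: Sequence[Hashable], pred: Sequence[Hashable]) -> Dict[Hashable, float]:
    """F1 of every label that occurs in the gold or predicted labels."""
    scores = {}
    for label in set(gold) | set(pred):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def weighted_f1(gold: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Average of per-class F1, weighted by each class's gold support."""
    if len(gold) != len(pred) or len(gold) == 0:
        raise ValueError("gold and pred must be non-empty and of equal length")
    support = Counter(gold)
    scores = per_class_f1(gold, pred)
    return sum(scores[label] * n for label, n in support.items()) / len(gold)


# Toy example: a model that always predicts the majority class.
gold = ["Neutral"] * 8 + ["Angry"] * 2
pred = ["Neutral"] * 10
f1_by_class = per_class_f1(gold, pred)
w_avg = weighted_f1(gold, pred)
```

In this toy case the Angry F1 is 0 while the weighted average stays above 0.7, which is the same pattern as in Table 6: a dominant Neutral class can keep w-avg high even when the emotion classes are barely recognized.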

Approach   Dataset   Pos.0    Pos.1    Pos.2    |  Happy    Frustrated   Angry    Other    |  Positive   Negative   |  w-avg
MMSC-ECE   COS       0.7465   0.2785   0.0000   |  0.4231   0.7887       0.4948   0.4000   |  0.4333     0.6154     |  0.5879
MMSC-ECE   SHP       0.7273   0.6372   0.1250   |  0.7551   0.6857       0.5469   0.5000   |  0.7451     0.5960     |  0.6467
MMSC-ECE   Total     0.7901   0.6575   0.1212   |  0.7551   0.6933       0.6605   0.2500   |  0.7353     0.6709     |  0.6903

Table 7: The performance of the CEMSC task by three aspects on the MECMSC datasets. The evaluation metric is F1, and w-avg is the weighted-average F1 score. Pos.0 denotes that the span between the cause and emotion utterances is 0. "Other" denotes the emotion samples excluding Happy, Frustrated, and Angry.

6.1 Experimental Results

In this section, we present the results and give an analysis of the two tasks, ERMSC and CEMSC.

Emotion Recognition in Modality-Switching Conversations. Table 6 exhibits the experimental results based on the two emotion groups. We give a detailed analysis in Section C.2 of the Appendix.

Emotion Cause Extraction in Modality-Switching Conversations. Table 7 exhibits the performance on the three datasets from three aspects. The detailed analysis is given in Section C.3 of the Appendix.

6.2 Analysis and Discussion

In this section, we provide some further analysis.

Case Study. We analyze the output cases of our approach for the CEMSC task. Due to space limitations, we put the analysis in Section C.4 of the Appendix.

Context Analysis. We analyze the effect of the context length and put the analysis in Section C.5 of the Appendix.

7 Conclusion and Future Work

In this work, we introduce our research on Multi-modal Emotion and Cause Analysis in Modality-Switching Conversations. This research has great potential for application in real-life scenarios. We first construct and annotate a dataset, named MECMSC, from modality-switching conversations with the help of our annotation system. Secondly, inspired by advanced multi-modal methods proposed in the past few years, we design some strong benchmarks for this dataset. Finally, we report experimental results and find that there is still room for improvement. To the best of our knowledge, we are the first to conduct emotion and cause analysis research in modality-switching conversations.

Emotion recognition and cause extraction in modality-switching conversations are still challenging tasks. Our work focuses on introducing the new tasks and the annotated dataset. Only some concise benchmarks are provided, and we believe that these benchmarks still have weaknesses. In the future, the following hypotheses deserve to be explored for better performance in modality-switching conversations.

• The annotated dataset is still not large enough, and a larger-scale dataset may improve the performance.

• Our dataset contains two interlocutors, a customer and a server. Better modeling of the interlocutor interaction may assist the ERMSC and CEMSC tasks.

• The pre-trained language model BERT may not be entirely suitable for conversation-based tasks. A better conversational pre-trained language model may bring some enhancement.

• Better understanding of the images from customer service in modality-switching conversations may benefit the work.

• Better fusion of and interaction between the textual and visual modalities in modality-switching conversations may enhance the performance.
507 some strong benchmarks for this dataset. Finally,

