Affective Visual Dialog: A Large-Scale Benchmark For Emotional Reasoning Based On Visually Grounded Conversations
{kilichbek.haydarov,xiaoqian.shen,avinash.madasu,mahmoud.salem}@kaust.edu.sa
jia@healthunity.org, gamaleldin.elsayed@google.com, mohamed.elhoseiny@kaust.edu.sa
Abstract

…Dial, consisting of 50K 10-turn visually grounded dialogs, as well as concluding emotion attributions and dialog-informed textual emotion explanations, resulting in a total of 27,180 working hours. We explain our design de-

[Figure: example exchange between the Questioner and the Answerer, e.g. A1: "I seen two people in the image". Q2: "What are they doing in the image". A2: "They are sitting on the boat, and they both are looking some where."]
2. Related Work

Vision-Language datasets. Several notable datasets (e.g., [31, 7, 50, 57]) have been curated to explore the link between visual and linguistic information, facilitating the investigation of tasks such as image captioning, visual question answering (VQA), and visual dialog. Prominent among these is the MSCOCO dataset [31], which contains a vast collection of images, each paired with human-generated captions, enabling the training and evaluation of image captioning models. Another significant dataset is VQA [7], where questions related to images are accompanied by human-generated answers, fostering research on bridging vision and language understanding.

Visual Dialog datasets (e.g., VisDial [15] and GuessWhat?! [16]) focus on generating natural language responses to questions about images, given an image, a dialog history consisting of a sequence of question-answer pairs, and a follow-up question. Compared to other vision-and-language tasks, this task is more challenging and requires visual understanding, reasoning, and text comprehension from ML systems. For diagnostic purposes, synthetic di-

[Figure: example dialog about a landscape painting. Questioner: "This picture shows a landscape, i find a river, some trees and a mountain here. Can you clarify the emplacement of these object then?" Answerer: "The river is in the foreground of the image, these trees are located in the middle of the picture while this mountain is at the background." […] Answerer: "I find a reflection of the trees on the lake, so the lake also is colored in dark green." Affective emotion explanations: "because i like the place like that as it contains many trees and mountain, i think that the environment in that place is still fresh." / "The greenery of this picture is not terrific nor attractive in my eyes as it looks like a algae." / "I feel happy when i am looking at this picture because i like this place, the place is quiet and the environment is clean."]
[Figure 3: two panels; x-axes "# Words in sentence" (a) and "# Unique Answers (x 100,000)" (b), y-axis "Percentage Coverage" in (b), with a "VisDial Answers" curve for comparison.]

Figure 3: a) Distribution of lengths for questions, answers, and explanations. b) Percentage of overall answers from the training dataset covered by the top frequent unique answers, compared to VisDial. Based on the coverage, AffectVisDial's answers are more diverse (i.e., a smaller percentage of AffectVisDial is covered by the same set of unique answers).

…Dial and VisDial. Compared to VisDial, AffectVisDial has a higher frequency of PoS tags in questions and answers, suggesting that the answers are more detailed and descriptive of the image attributes and activities. Emotional explanations in AffectVisDial have higher PoS-tag counts than questions and answers. This is expected, since the agents should be expressive in their explanations in order to convey what constructed their emotion from the dialog. We also break down the results for dialogs concluded with a positive emotion (i.e., amusement, awe, excitement, contentment) vs. a negative emotion (i.e., sadness, anger, disgust, fear). We found that there are not many statistical differences between questions from dialogs with positive conclusions and questions from dialogs with negative conclusions.

Analyzing Questions. Fig. 3a shows the distribution of the lengths of questions for AffectVisDial and VisDial (see the solid red lines). The plot indicates that questions and answers of AffectVisDial are significantly longer. Fig. 4 (a) shows a Sunburst visualization of the distribution of questions based on the first four words in the dataset. Binary Questions vs. Binary Answers: It is interesting to see that the top 10 answers are binary answers with details, e.g., "No there are no trees or plants in the image" or "Yes there is a man in the image." This is a consequence of the questioner not being able to see the image. Thus, they ask contextually relevant questions like "Do you see any trees or plants in the image?" in an attempt to explore the scene. Das et al. [15] observed that the VQA dataset contains binary questions that can be answered by simply "yes"/"no"/"maybe". Regarding answers which contain "yes"/"no" as binary answers, they reported that only 46.96% of all "yes"/"no" responses were binary answers containing "yes" in VisDial, whereas in VQA there was a bias towards "yes", with 61.40% of binary responses being "yes". In our dataset, there is a balance between binary questions containing "yes" and "no", 50.32% and 49.68%, […] is 52.7% for VisDial and 25.5% for AffectVisDial (> 50% reduction).

Analyzing Answers. Answers in AffectVisDial are much longer and more informative, with an average length of 14.85 words; see Tab. 1 and the solid green and dashed green plots in Fig. 3a. The top-1000 frequent answers in VisDial [15] cover 63% of all answers, while for AffectVisDial they cover only 38%. Since the number of dialogs in VisDial is higher than in AffectVisDial, we sampled a subset of VisDial dialogs of the same size as our whole dataset and computed the percentage coverage of answers on this subset. Fig. 3b shows the percentage coverage of unique answers over all answers in the training set. Fig. 4 (b) visualizes the answer distribution based on the starting few words. A large portion of answers are descriptive to convey the existence of objects in the image (e.g., "there are/is" or "there seems like"), and many answers start with "there are" or "is" since questioners are querying whether things are in the image.

Analyzing emotion explanations. We report three types of analysis for explanations: explanations of the Questioner (1) before and (2) after observing the image, and Answerer explanations. Tab. 1 shows that on average explanations have lengths of 27.29 and 28.95 words for the Questioner and Answerer, respectively. Tab. 2 shows that the average number of PoS tags in explanations before observing the image is larger than after observing it. Answerer explanations contain slightly more pronouns, so we speculate that Answerers can easily refer to visual cues since they can see the hidden image.

Emotion Distribution. We report the emotion distribution before observing the image in Fig. 4 (d), with the top words associated with each emotion. The most dominant emotion is clearly "contentment". Fig. 5 (top) shows the percentage of Questioners who changed their emotion choice after observing the hidden image. On average, annotators changed their minds 23.22% of the time after looking at the hidden image. In general, the dialog information is informative enough for the Questioner to conclude their final emotional responses. The emotion category influenced most by the emotion change is "something else", where the Questioner may not have found enough linguistic details associated with the image. Fig. 5 (bottom) shows the distribution of emotions before observing the image across continents. "Contentment" is the most dominant emotion across annotators from all continents. Interestingly, there are some similarities in the emotion distribution between North America and Europe: labels like "contentment", "excitement", and "fear" are almost identical between these groups.
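The top-k coverage statistic used above can be reproduced with a short sketch; the `answers` list and the cutoff k = 1000 are stand-ins for the dataset's actual answer strings, and the normalization here is an assumption rather than the paper's exact preprocessing:

```python
from collections import Counter

def topk_answer_coverage(answers, k=1000):
    """Fraction of all answer occurrences covered by the k most
    frequent unique answers (e.g., roughly 0.63 for VisDial and
    0.38 for AffectVisDial at k=1000, per the analysis above)."""
    counts = Counter(a.strip().lower() for a in answers)
    covered = sum(c for _, c in counts.most_common(k))
    return covered / sum(counts.values())

# Toy example: the top-2 unique answers cover 3 of 4 occurrences.
sample = ["yes", "yes", "no", "there is a man in the image"]
print(topk_answer_coverage(sample, k=2))  # 0.75
```

Comparing two datasets fairly requires equal sample sizes, which is why the text above subsamples VisDial before computing this quantity.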
[Figure 4 (d) pie-chart data, emotion shares with the most affective words: Contentment (45.38%): plants, mountains, human; Sadness (11.12%): sitting, old, alone; Awe (9.09%): weather, beautiful, water; Excitement (7.76%): sky, flowers, river; Disgust (7.41%): red, nothing, animals; Fear (6.31%): dark, black, scary; Amusement (4.71%): red, wearing, bright; Anger (1.99%): red, nothing, animals.]

Figure 4: Distribution of first n-grams for AffectVisDial (a) questions, (b) answers, (c) explanations. Word ordering starts towards the center and radiates outwards, and arc length is proportional to the number of sentences containing the word. (d) Questioner's emotion distribution pie chart before observing the hidden image, along with the most affective words for each specific emotion from the dialogue.
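As a concrete companion to the statistics above, the emotion shares (Fig. 4d) and the emotion-change rate (Fig. 5) reduce to simple tallies; the record fields `before`/`after` are a hypothetical schema for the annotations, not the dataset's actual format:

```python
from collections import Counter

def emotion_distribution(labels):
    """Percentage share of each emotion label (cf. the pie chart in Fig. 4d)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {e: 100.0 * c / total for e, c in counts.most_common()}

def emotion_change_rate(records):
    """Percentage of Questioners whose emotion changed after seeing the
    hidden image (cf. the 23.22% average reported for Fig. 5)."""
    changed = sum(r["before"] != r["after"] for r in records)
    return 100.0 * changed / len(records)

print(emotion_distribution(["contentment"] * 3 + ["sadness"]))
# {'contentment': 75.0, 'sadness': 25.0}
```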
[Figure 6: example dialog about a painting of a bouquet. Questioner: "is the picture realistic or an imagination of the painter?" Answerer: "well it is realistic as it is related to the nature". […] Questioner: "does the painting contain a signature or a mark to recognize the artist?" Answerer: "no it does not have any mark nor the signature of the painter on the artwork". Generated emotion explanation: "I am amazed by the creativity of the artist to come up with a concept of a painting that uses different kinds of flowers in a vase and makes it look realistic and appealing."]

Figure 6: Generated Questioner's explanation from our T5-Large baseline.

[Figure 7: answer-altering example, including a negative explanation ("it is sad to think that all the flowers that bloom naturally today are blooming artificially with difficulty") and altered answers such as "womans eyes and the way is she holding her elbow with other hand" and "there is a mountain in the picture".]

Figure 7: Example of altering answers to evoke opposite emotions and making an "edit" in the original image.

…conducted studies on Amazon Mechanical Turk. We asked participants to evaluate the reasonableness of our dialog-based QA task and conducted a Turing test for emotion explanation generation. Results indicate that over 90% of produced answers were considered reasonable and 55% of explanations were considered human-like, demonstrating the effectiveness of our approach in capturing emotional reasoning in visually grounded conversations (details in the supplementary).

Fig. 6 shows the generated emotion and corresponding explanation from the Questioner. The output closely resembles human explanations (more examples are in the supplementary).

Emotion Guided Answer Generation. In this study, we investigate the influence of dialogue answers on the resulting emotions. To achieve this, we manipulate the answers to elicit specific emotional reactions. For each question in a dialogue, we select the answer candidate from similar questions that maximizes the probability of the target emotion, as determined by our pretrained RoBERTa-based emotion classifier [33]. The initial two or three turns of the dialogue remain unchanged, while subsequent answers are replaced by the selected candidate answers. Fig. 8 shows how gradually altering the answers can shift the original emotion towards an opposing direction, illustrating the role of linguistic details in driving emotions (e.g., the absence of trees increases the probability score of "fear").

Towards Emotionally Reasoned Image Editing. While the above study has provided valuable insights, it is essential to acknowledge certain limitations in the methodology. Firstly, emotions are complex and subjective, and the classifier's output may not always align perfectly with true human emotional experiences. Furthermore, the chosen answers may not necessarily be grounded in the visual input. An interesting avenue for exploration is editing the initial visual input such that it is aligned with the candidate answers. This approach could be valuable, for example, for emotion-driven content creation. To illustrate this, we utilized our RoBERTa-based classifier to select an answer for a given question that increases the probability score of the target emotion. Subsequently, we used an image editing model [9] to modify the original image accordingly (see Fig. 7). In this example, the suggested practice reasoned that the contentment emotion of the visual stimuli can be increased by including a mountain in the image. This approach paves the way for collaborative image editing systems capable of offering emotionally informed suggestions for edits.

Emotion Awareness of ChatGPT and GPT-4. We explored the potential of ChatGPT [1], a foundation model known for its impressive zero-shot question-answering performance, for predicting emotions and generating corresponding explanations on our newly proposed dataset. To this end, we employed the latest API [2], GPT-4, with either {E, C} or {E, C, D}, along with our designed instructions, as inputs to enable ChatGPT to generate emotions and their explanations. The results showed that the Emotion F1 scores were 19.84% and 26.64%, respectively, for the two input combinations.
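The greedy selection rule from the Emotion Guided Answer Generation study can be sketched as follows; `emotion_scores` (a function from dialog text to emotion probabilities) and the per-question candidate pool are hypothetical stand-ins for the paper's RoBERTa classifier and its retrieved similar-question answers:

```python
def render(turns):
    """Flatten (question, answer) pairs into one dialog string."""
    return " ".join(f"Q: {q} A: {a}" for q, a in turns)

def guide_answers(dialog, candidates, target, emotion_scores, keep=2):
    """Keep the first `keep` turns fixed; replace each later answer with
    the candidate maximizing the target emotion's probability under the
    (hypothetical) classifier interface `emotion_scores`."""
    turns = list(dialog)
    for i in range(keep, len(turns)):
        q, original = turns[i]
        pool = candidates.get(q, []) or [original]
        turns[i] = (q, max(
            pool,
            key=lambda a: emotion_scores(render(turns[:i] + [(q, a)]))[target],
        ))
    return turns

# Toy scorer: "fear" score proportional to occurrences of "dark".
scorer = lambda text: {"fear": text.count("dark")}
dialog = [("do you see trees?", "yes many green trees"),
          ("what colors are there?", "green and bright")]
cands = {"what colors are there?": ["green and bright", "dark and gloomy"]}
print(guide_answers(dialog, cands, "fear", scorer, keep=1)[1][1])  # dark and gloomy
```

In the paper's setting the first two or three turns are kept (`keep=2` or `3`) and the scorer is the pretrained RoBERTa classifier [33]; the toy scorer here only illustrates the interface.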
[Figure 8: two dialog examples with answers progressively replaced. Example 1 (initial emotion "excitement", 85.58%): questions such as "do you see humans in the image?" and "do you see any trees or plants in the image?" have their original answers swapped for candidates, with the target-emotion probability rising across turns (20.37%, 77.28%, 86.63%, 92.70%). Example 2 (initial emotion "sadness", 92.34%): questions such as "how many people are there in this image?" and "what is she doing?" are similarly altered, with probabilities 76.67%, 83.71%, 84.94%, 85.54%.]

Figure 8: Examples of altering answers to evoke opposite emotions. The left side of each example shows the original dialog, and the right side shows the replaced answers with the corresponding probability of the target emotion.
While additional dialog input improved the emotion prediction, ChatGPT still struggled to capture emotional information through conversation and to understand artwork descriptions at a human level. In contrast, our baselines trained on our dataset showed superior results, yielding more than 40% in F1 score. These limitations highlight the significance of our dataset in advancing emotion-aware AI systems for future applications. Further details regarding the instructions and generated examples are provided in the supplementary material.

6. Conclusion

Emotions are an integral part of human nature. A deeper comprehension of how emotions arise from natural conversation can aid in the development of more human-compatible machines. In pursuit of this goal, we introduce a novel dataset, called AffectVisDial, which records the dialogue between two agents discussing an image and the resultant emotion they construct from the visually grounded conversation. By adapting and training state-of-the-art models on our AffectVisDial dataset, we show the potential of AI systems for generating affective explanations grounded in dialogues that are similar to human emotions. We also demonstrated that models trained on AffectVisDial can facilitate emotion-guided answer generation and pave the way toward building and advancing emotionally reasoned collaborative image editing systems. We hope that our dataset and models will help ease future research in affective vision and language.

Acknowledgement. This project is funded by KAUST BAS/1/1685-01-01 and the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence. The authors express their appreciation to Jack Urbanek, Sirojiddin Karimov, and Umid Nejmatullayev for their valuable assistance in the data collection setup. Lastly, the authors extend their gratitude to the diligent efforts of the Amazon Mechanical Turkers, DeepenAI, and SmartOne teams, as their contributions were indispensable for the successful completion of this work.

References

[1] Introducing ChatGPT. https://openai.com/blog/chatgpt, 2023. Accessed: March 7, 2023.
[2] OpenAI API: GPT-4. https://platform.openai.com/docs/models/gpt-4, 2023. Accessed: July 25, 2023.
[3] Adnen Abdessaied, Mihai Bâce, and Andreas Bulling. Neuro-symbolic visual dialog. arXiv preprint arXiv:2208.10353, 2022.
[4] Panos Achlioptas, Maks Ovsjanikov, Leonidas Guibas, and Sergey Tulyakov. Affection: Learning affective explanations for real-world visual data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6641–6651, June 2023.
[5] Panos Achlioptas, Maks Ovsjanikov, Kilichbek Haydarov, Mohamed Elhoseiny, and Leonidas J Guibas. ArtEmis: Affective language for visual art. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11569–11579, 2021.
[6] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don't just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4971–4980, 2018.
[7] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
[8] Dmitry Bogdanov, Alastair Porter, Philip Tovstogan, and Minz Won. MediaEval 2019: Emotion and theme recognition in music using Jamendo. In MediaEval'19, Multimedia Benchmark Workshop, Sophia Antipolis, France. CEUR Workshop Proceedings, 2019.
[9] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18392–18402, June 2023.
[10] Sven Buechel and Udo Hahn. EmoBank: Studying the impact of annotation perspective and representation format on dimensional emotion analysis. arXiv preprint arXiv:2205.01996, 2022.
[11] Cheng Chen, Zhenshan Tan, Qingrong Cheng, Xin Jiang, Qun Liu, Yudong Zhu, and Xiaodong Gu. UTC: A unified transformer with inter-task contrastive learning for visual dialog. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18103–18112, 2022.
[12] Sheng-Yeh Chen, Chao-Chun Hsu, Chuan-Chun Kuo, Lun-Wei Ku, et al. EmotionLines: An emotion corpus of multi-party conversations. arXiv preprint arXiv:1802.08379, 2018.
[13] Community. WikiArt. https://www.wikiart.org/, 2020. Accessed: 2020-11-06.
[14] Alan S Cowen, Xia Fang, Disa Sauter, and Dacher Keltner. What music makes us feel: At least 13 dimensions organize subjective experiences associated with music across different cultures. Proceedings of the National Academy of Sciences, 117(4):1924–1934, 2020.
[15] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 326–335, 2017.
[16] Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. GuessWhat?! Visual object discovery through multi-modal dialogue. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[17] Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. GoEmotions: A dataset of fine-grained emotions. arXiv preprint arXiv:2005.00547, 2020.
[18] Ed Diener, Christie Napa Scollon, and Richard E Lucas. The evolving concept of subjective well-being: the multifaceted nature of happiness. 2009.
[19] Paul Ekman. An argument for basic emotions. Cognition & Emotion, 6(3-4):169–200, 1992.
[20] Jianyu Fan, Miles Thorogood, and Philippe Pasquier. Emo-Soundscapes: A dataset for soundscape emotion recognition. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), pages 196–201. IEEE, 2017.
[21] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, pages 5842–5850, 2017.
[22] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017.
[23] Hatice Gunes and Massimo Piccardi. A bimodal face and body gesture database for automatic analysis of human nonverbal affective behavior. In 18th International Conference on Pattern Recognition (ICPR'06), volume 1, pages 1148–1153. IEEE, 2006.
[24] Hsiao-Tzu Hung, Joann Ching, Seungheon Doh, Nabin Kim, Juhan Nam, and Yi-Hsuan Yang. EMOPIA: A multi-modal pop piano dataset for emotion recognition and emotion-based music generation. arXiv preprint arXiv:2108.01374, 2021.
[25] Chen Kong, Dahua Lin, Mohit Bansal, Raquel Urtasun, and Sanja Fidler. What are you talking about? Text-to-image coreference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3558–3565, 2014.
[26] Satwik Kottur, José MF Moura, Devi Parikh, Dhruv Batra, and Marcus Rohrbach. CLEVR-Dialog: A diagnostic dataset for multi-round reasoning in visual dialog. arXiv preprint arXiv:1903.03166, 2019.
[27] Vicki R LeBlanc, Meghan M McConnell, and Sandra D Monteiro. Predictable chaos: a review of the effects of emotions on attention, memory and decision making. Advances in Health Sciences Education, 20(1):265–282, 2015.
[28] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, 2020.
[29] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022.
[30] Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. DailyDialog: A manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957, 2017.
[31] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[32] Xin Liu, Henglin Shi, Haoyu Chen, Zitong Yu, Xiaobai Li, and Guoying Zhao. iMiGUE: An identity-free video dataset for micro-gesture understanding and emotion analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10631–10642, 2021.
[33] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[34] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in Neural Information Processing Systems, 32, 2019.
[35] Jana Machajdik and Allan Hanbury. Affective image classification using features inspired by psychology and art theory. In Proceedings of the 18th ACM International Conference on Multimedia, pages 83–92, 2010.
[36] Youssef Mohamed, Mohamed Abdelfattah, Shyma Alhuwaider, Feifan Li, Xiangliang Zhang, Kenneth Church, and Mohamed Elhoseiny. ArtELingo: A million emotion annotations of WikiArt with emphasis on diversity over language and culture. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8770–8785, 2022.
[37] Youssef Mohamed, Faizan Farooq Khan, Kilichbek Haydarov, and Mohamed Elhoseiny. It is okay to not be okay: Overcoming emotional bias in affective image captioning by contrastive data collection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21263–21272, 2022.
[38] Ali Mollahosseini, Behzad Hasani, and Mohammad H Mahoor. AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1):18–31, 2017.
[39] Vishvak Murahari, Dhruv Batra, Devi Parikh, and Abhishek Das. Large-scale pretraining for visual dialog: A simple state-of-the-art baseline. In European Conference on Computer Vision, pages 336–352. Springer, 2020.
[40] Vishvak Murahari, Prithvijit Chattopadhyay, Dhruv Batra, Devi Parikh, and Abhishek Das. Improving generative visual dialog by answering diverse questions. arXiv preprint arXiv:1909.10470, 2019.
[41] Van-Quang Nguyen, Masanori Suganuma, and Takayuki Okatani. Efficient attention mechanism for visual dialog that can handle all the interactions between multiple inputs. In European Conference on Computer Vision, pages 223–240. Springer, 2020.
[42] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
[43] Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[44] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
[45] Hiranmayi Ranganathan, Shayok Chakraborty, and Sethuraman Panchanathan. Multimodal emotion recognition using deep learning architectures. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–9. IEEE, 2016.
[46] Anna Rohrbach, Marcus Rohrbach, Siyu Tang, Seong Joon Oh, and Bernt Schiele. Generating descriptions with grounded and co-referenced people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4979–4989, 2017.
[47] Stuart Russell. Human Compatible: Artificial Intelligence and the Problem of Control. Penguin, 2019.
[48] Fawaz Sammani, Tanmoy Mukherjee, and Nikos Deligiannis. NLX-GPT: A model for natural language explanations in vision and vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8322–8332, 2022.
[49] Paul Hongsuck Seo, Andreas Lehrmann, Bohyung Han, and Leonid Sigal. Visual reference resolution using attention memory for visual dialog. Advances in Neural Information Processing Systems, 30, 2017.
[50] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, 2018.
[51] Kurt Shuster, Samuel Humeau, Antoine Bordes, and Jason Weston. Image Chat: Engaging grounded conversations. arXiv preprint arXiv:1811.00945, 2018.
[52] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[53] Carlo Strapparava and Rada Mihalcea. SemEval-2007 Task 14: Affective text. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 70–74, 2007.
[54] Jack Urbanek and Pratik Ringshia. Mephisto: A framework for portable, reproducible, and iterative crowdsourcing, 2023.
[55] Victoria Yanulevskaya, Jan C van Gemert, Katharina Roth, Ann-Katrin Herbold, Nicu Sebe, and Jan-Mark Geusebroek. Emotional valence categorization using holistic image features. In 2008 15th IEEE International Conference on Image Processing, pages 101–104. IEEE, 2008.
[56] Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. Building a large scale dataset for image emotion recognition: The fine print and the benchmark. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[57] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
[58] Weizhe Yuan, Graham Neubig, and Pengfei Liu. BARTScore: Evaluating generated text as text generation. Advances in Neural Information Processing Systems, 34:27263–27277, 2021.
[59] Jonathan R Zadra and Gerald L Clore. Emotion and perception: The role of affective information. Wiley Interdisciplinary Reviews: Cognitive Science, 2(6):676–685, 2011.
[60] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations, 2019.