MemeCap: A Dataset for Captioning and Interpreting Memes
Abstract

Memes are a widely popular tool for web users to express their thoughts using visual metaphors.

1 Introduction
Figure 1: A meme and its title. The caption describes what the meme poster was trying to convey.

Web users frequently communicate their thoughts and feelings online using memes (Buchel, 2012; Tanaka et al., 2022). Memes are created by taking an existing widespread image and attaching new meaning to it by altering the text inside the image. For example, in Figure 1, Tom cat is a metaphor for the person who posted the meme, and the cats he is shaking hands with represent his two regular followers who always like his posts. This incongruity between the image and the text makes memes humorous (Tanaka et al., 2022).

Because of their complementary nature, interpreting the meaning of a meme requires understanding both the visual and text modalities. Moreover, memes are often posted on social media platforms along with additional text, such as "one of them is my alt" in Fig. 1, which is further needed to understand the meme.

Recently, there has been a surge of vision and language (VL) models (e.g. Alayrac et al., 2022; Li et al., 2023; OpenAI, 2023). VL models have shown remarkable capabilities in generating detailed and accurate descriptions of images in both zero-shot and in-context setups. Such models are first pretrained on language-only and vision-only datasets, and then trained on tasks such as image captioning and visual question answering, where the redundancy between the vision and language is used to embed them in a shared space. For example, the majority of image captions in existing datasets describe what is depicted in the image, at most adding subjective interpretations or inferences about the story behind the image (Alikhani et al., 2020). In contrast, there is little work on visual metaphors to date (Zhang et al., 2021; Chakrabarty et al., 2023).

In this paper, we investigate whether VL models can successfully interpret memes. We propose the task of meme captioning, in which models are presented with a meme along with its title (e.g. the title of the post containing the meme), and are tasked with generating a concise caption describing the meaning of the meme. This task goes beyond object recognition and language understanding. It is challenging due to the metaphorical role of the visual content of the meme (Scott, 2021). For example, in Fig. 1, the model needs to recognize that Tom cat is merely a metaphor for the meme poster, and that handshaking signals appreciation. The literal content of the image, such as Tom or the handshake, should not be part of the meme caption. Recognizing and interpreting such metaphors involves detecting facial expressions, the tone expressed in the texts, making commonsense inferences, and more (Bitton-Guetta et al., 2023).

                #Memes    #M-Cap    #I-Cap    #Mph
    Train+Val    5,828       1.0       1.0     2.1
    Test           559       3.4       1.0     3.1

Table 1: The number of memes in MemeCap, and the average number of meme captions (M-Cap), image captions (I-Cap), and metaphorical keywords (Mph) per meme.
To that end, we collected a meme captioning dataset, MemeCap, containing 6,384 memes along with their captions. Each meme is also annotated with the literal image description (e.g. "Tom cat is shaking hands with two small cats and smiling") and the visual metaphors (e.g. Tom is a metaphor for the meme poster).
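To make the annotation format concrete, the sketch below shows what a single MemeCap entry might look like, based on the example above. The field names and URL are illustrative, not the released schema.

    # Illustrative structure of one MemeCap entry, derived from the Fig. 1
    # discussion. Field names and values are hypothetical placeholders.
    example_entry = {
        "meme_url": "https://i.redd.it/example.jpg",  # placeholder URL
        "title": "one of them is my alt",
        "image_caption": "Tom cat is shaking hands with two small cats and smiling",
        "meme_captions": [
            "Meme poster is glad that his two regular followers always like his posts."
        ],
        "metaphors": [
            {"vehicle": "Tom", "target": "the meme poster"},
            {"vehicle": "handshake", "target": "appreciation"},
        ],
    }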
We establish comprehensive baseline performances with recent large-scale VL models, in various training setups (zero-shot, few-shot, fine-tuning) and with various inputs (meme, title, literal image captions, and metaphors). Human evaluation of the generated captions shows that models are far from humans in captioning memes. In particular, models tend to ignore important visual or textual elements and instead repeat the text inside the meme or make up fake elements. Our findings merit future research on this task.¹

¹ Our code and data are available at: https://github.com/eujhwang/meme-cap

2 Background

2.1 Metaphors

Most work on metaphors is limited to textual metaphors, and pertains to collecting resources (Dodge et al., 2015), detecting or interpreting metaphorical expressions in context (Choi et al., 2021; Chakrabarty et al., 2021a; Aghazadeh et al., 2022; Chakrabarty et al., 2022), and metaphor generation (Stowe et al., 2021; Chakrabarty et al., 2021b).

Recently, there has been interest in visual metaphors. Visual metaphors occur when a target concept is compared to another visual element (vehicle) (Forceville, 1996). MultiMET (Zhang et al., 2021) and Met-Meme (Xu et al., 2022) are two datasets of text-image pairs with annotations for the existence and types of metaphors, sentiment, and more. Chakrabarty et al. (2023) tested image generation models on prompts involving a visual metaphor, such as "My bedroom is a pigsty". They found that the unsatisfactory performance can be improved by using a large language model (LLM) to interpret the visual metaphors and add details to the prompt, such as "messy bedroom".

2.2 Memes

Prior work on memes focused on detecting hateful or harmful content in memes (Qu et al., 2022; Sharma et al., 2023; Kiela et al., 2021), classifying memes as humorous or not (Tanaka et al., 2022), analyzing the sentiment of memes (Sharma et al., 2020), and generating memes (e.g. the ImgFlip575K dataset²).

² https://github.com/schesa/ImgFlip575K_Dataset

Although MultiMET (Zhang et al., 2021) does not focus specifically on memes, the images were collected from a range of sources including social media, which contains memes. The similar Met-Meme dataset (Xu et al., 2022) focuses on memes. Differently from our work, both datasets contain annotations for visual metaphors, while MemeCap also contains meme captions.

2.3 Other Image Datasets

The WHOOPS benchmark (Bitton-Guetta et al., 2023) consists of unconventional human-created and machine-generated images that defy commonsense (e.g. an image of "Albert Einstein holding a smartphone"), along with their textual descriptions. It is meant to be used for image captioning, image-text matching, visual question answering, and explanation generation. In contrast, our work focuses on memes, and tests models on their ability to interpret real memes posted by web users.

3 The MemeCap Dataset
Figure 2: Overall process of collecting memes, literal image captions, visual metaphors, and meme captions. Memes are scraped from r/memes and manually filtered for quality and to exclude offensive content; literal image captions are crowdsourced after removing the text from the meme; meme captions and visual metaphors are then collected. For example, for a meme titled "Why they gotta be like this", the "intersection" is a metaphor for the relationship between a man and a woman, the "tree of traffic lights" for the woman's complicated signals, and the meme caption reads "Women wonder why men don't understand their signals when they are overly complicated."
The overall data collection and annotation process is illustrated in Figure 2. We collected memes (Sec 3.1) and crowdsourced their captions (Sec 3.2). We present the data splits and statistics in Sec 3.3. To collect the literal image captions, the text was first removed from each meme image (Suvorov et al., 2021). We collected one caption for each meme, which we manually verified.
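As a rough illustration of the scraping step (the paper does not specify its scraping tooling), a PRAW-based sketch might look as follows. The credentials, post limit, and automatic filter are placeholders; the quality and offensiveness filtering described above was performed manually.

    import praw  # Reddit API wrapper; an assumed choice, not the paper's tool

    # Sketch of collecting meme candidates from r/memes.
    reddit = praw.Reddit(
        client_id="YOUR_ID", client_secret="YOUR_SECRET", user_agent="memecap-sketch"
    )

    memes = []
    for post in reddit.subreddit("memes").top(time_filter="all", limit=1000):
        # First-pass automatic filter; offensive/low-quality content was
        # additionally filtered manually in the actual pipeline.
        if post.over_18 or not post.url.endswith((".jpg", ".png")):
            continue
        memes.append({"title": post.title, "url": post.url})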
Figure 3: (1) Meme Type: Percent of memes with no visual metaphors, and with metaphors that can be understood
with the text alone, vision alone, or both (complementary). (2) Metaphor Vehicle Type: Types of visual elements
used to convey a metaphorical meaning. (3) Metaphor Target Type: The intended meanings of the metaphors.
To ensure the diversity of topics in both the training and test sets, we then sampled 10% of the memes from each cluster and assigned them to the test set, and the rest of the memes to the training and validation sets. Table 1 shows the statistics of our dataset.

Most metaphor targets refer to the meme posters themselves (with a person vehicle, such as Drake). Other categories are an approach or a concept (for which the meme poster expresses a certain stance), another person, and a "desire vs. reality" meme such as the drowning meme illustrated in Fig 3.
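A minimal sketch of this stratified split, assuming the memes have already been grouped into topic clusters (the clustering itself is not shown, and the function names are ours):

    import random

    # 10% of the memes from each topic cluster go to the test set, the rest
    # to train/validation. `clusters` maps a cluster id to a list of meme ids.
    def split_by_cluster(clusters, test_frac=0.10, seed=0):
        rng = random.Random(seed)
        train_val, test = [], []
        for members in clusters.values():
            k = max(1, round(test_frac * len(members)))
            sampled = rng.sample(members, k)
            test.extend(sampled)
            train_val.extend(m for m in members if m not in sampled)
        return train_val, test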
Figure 4: An example of the few-shot setup with the following inputs: meme, image description, and the text inside
the meme. The figure shows the last in-context meme and the target meme.
MiniGPT4. Since the technical details of GPT-4 remain a mystery, we utilize MiniGPT4 as an alternative to GPT-4 (OpenAI, 2023). It has capabilities similar to GPT-4 in understanding and generating context (Zhu et al., 2023a). For its language model, MiniGPT4 uses Vicuna (Chiang et al., 2023), which is built on top of LLaMA-13B and performs on par with ChatGPT (OpenAI, 2023). For its vision component, it uses BLIP-2 (Li et al., 2023), which consists of CLIP ViT-G/14 and a Q-Former architecture. MiniGPT4 was trained on various multimodal datasets, including images from LAION (Schuhmann et al., 2022), Conceptual Captions (Sharma et al., 2018), and SBU (Ordonez et al., 2011).

LLaMA. LLaMA (Touvron et al., 2023) is a transformer-based language model that was trained on trillions of tokens from exclusively publicly-available data. The LLaMA-13B model outperforms GPT-3 (Brown et al., 2020) on most benchmarks. We use the LLaMA-7B model, which achieves performance comparable to the LLaMA-13B model on most benchmarks. Since LLaMA is a language model rather than a VL model, its access to the visual content is through the image caption and the OCR text alone.
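For illustration, the OCR step that extracts the text inside a meme could be implemented as follows; the paper refers to "OCR text" without naming a tool, so pytesseract here is purely an assumption.

    from PIL import Image
    import pytesseract  # assumed OCR backend; not specified in the paper

    # Extract the text written inside a meme image.
    def ocr_text(image_path: str) -> str:
        return pytesseract.image_to_string(Image.open(image_path)).strip()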
Inputs. In addition to the meme itself, the models receive the post title, the literal image caption, and the OCR-extracted text inside the image as part of the language input. We incrementally added each of these inputs.

Learning Setups. We evaluate all models in a zero-shot setup. Flamingo and LLaMA enable in-context learning, so we experiment with 4, 8, and 12 shots. An example prompt (including the meme, title, image caption, and text inside the meme) is illustrated in Figure 4. MiniGPT4 works in a chat format, so rather than in-context learning, we use it either in a zero-shot setup or fine-tuned on our training set.

Lastly, motivated by Chakrabarty et al. (2023) and Zhang et al. (2023), we also tested models with Chain of Thought (CoT)-style prompting (Wei et al., 2022). In our case, we elicit multi-step reasoning from the LLM by providing the visual metaphors, using the following prompt:

    <image>This is a meme with the title "{title}". The image description is "{image caption}". The following text is written inside the meme: "{OCR text}".
    What is the meme poster trying to convey?
    Rationale: "{keyword1}" is a metaphor for "{meaning1}". "{keyword2}" is a metaphor for "{meaning2}".
    Answer:
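A small helper that instantiates this template might look as follows; the template text is taken from the prompt above, while the function and argument names are ours.

    # Build the CoT prompt shown above from one meme's annotations.
    # `metaphors` is a list of (vehicle, meaning) pairs.
    def build_cot_prompt(title, image_caption, ocr_text, metaphors):
        rationale = " ".join(
            f'"{vehicle}" is a metaphor for "{meaning}".'
            for vehicle, meaning in metaphors
        )
        return (
            f'<image>This is a meme with the title "{title}". '
            f'The image description is "{image_caption}". '
            f'The following text is written inside the meme: "{ocr_text}". '
            f"What is the meme poster trying to convey? "
            f"Rationale: {rationale} "
            f"Answer:"
        )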
Table 2: Performance in terms of automatic metrics of the various models and learning setups (with 4 shots for the
few-shot setup). We report the full experimental results, including 8 shots and 12 shots, in Appendix A.
BLEU and ROUGE are based on n-gram overlap between the generated captions and the human-written reference captions, while BERTScore measures the semantic similarity between the two.
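For concreteness, the automatic evaluation could be computed along the following lines; the exact metric variants and configurations used in the paper may differ.

    import sacrebleu
    from rouge_score import rouge_scorer
    import bert_score

    # Score generated captions against one reference caption each.
    def evaluate_captions(generated, references):
        bleu = sacrebleu.corpus_bleu(generated, [references]).score
        scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
        rouge_l = sum(
            scorer.score(ref, gen)["rougeL"].fmeasure
            for gen, ref in zip(generated, references)
        ) / len(generated)
        _, _, f1 = bert_score.score(generated, references, lang="en")
        return {"BLEU": bleu, "ROUGE-L": rouge_l, "BERTScore": f1.mean().item()}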
Table 2 shows the performance of the various models and input setups in terms of these metrics. For the few-shot setup, we show the best performance across 4, 8, and 12 shots. See Appendix A for the full results.

Models. Flamingo dominates MiniGPT4 across all metrics, with gaps of 15, 12, and 6 points in BLEU, ROUGE, and BERTScore respectively for the best setups. This is likely due to the lengthy captions generated by MiniGPT4, despite the prompt including the instruction to generate a single sentence. Finally, the LLaMA model is highly competitive with Flamingo despite not having access to the image itself. It appears that the image captions and OCR text provide sufficient information.
Learning Setups. The Flamingo performance significantly improves from the zero-shot to the few-shot setting, and continues to improve from 4 to 8 shots but slightly decreases at 12 shots (see Appendix A). MiniGPT4 achieved better performance in the zero-shot setup, while fine-tuning its last layer significantly decreased the performance. As we show in Sec 5.2, while the fine-tuned model learns to generate short captions, it tends to hallucinate more. We hypothesize that fine-tuning only the last layer is ineffective.

Inputs. In the few-shot setups, the best performance is achieved with as many of the inputs as possible, i.e. including both the image caption and the OCR text, despite the redundancy with the visual inputs. This might be due to suboptimal cross-modal interaction in VL models. While prior work showed that explicitly stating the metaphors helps image generation models generate better images (Chakrabarty et al., 2023), we did not see a similar gain in meme captioning.

5.2 Human Evaluation

We focused on the models with the full set of inputs except for the rationales (meme + title + image caption + OCR text) and evaluated the performance of all models (focusing on 4 shots for the few-shot setups), with respect to the following criteria:
• Correctness: Does the caption correctly convey the meaning the meme poster wanted to convey?

• Appropriate Length: Is the caption length appropriate for conveying the meaning (i.e. it is not too verbose)?

• Visual Completeness: Does the caption describe all the important elements in the image?

• Textual Completeness: Does the caption describe all the important elements in the text inside the meme and the title text?

• Faithfulness: Are all the elements of the caption supported by either the visual or text elements (i.e. there are no made-up elements)?
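Assuming each annotation is recorded as binary judgments over these criteria, the per-criterion scores could be aggregated as in the following sketch (the actual annotation interface is not described here):

    from collections import defaultdict

    CRITERIA = ["correctness", "appropriate_length", "visual_completeness",
                "textual_completeness", "faithfulness"]

    # `annotations` is a list of dicts mapping each criterion to 0 or 1;
    # we report the mean per criterion as a percentage.
    def aggregate(annotations):
        totals = defaultdict(float)
        for ann in annotations:
            for c in CRITERIA:
                totals[c] += ann[c]
        return {c: 100.0 * totals[c] / len(annotations) for c in CRITERIA}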
We randomly sampled 30 memes along with their model-generated and human-written captions. The annotation was performed by students in the lab. Figure 5 shows the performance according to the human evaluation. All models perform significantly worse than humans, except on the appropriate length criterion, with 36.6, 29.3, 24.5, and 18.4 point differences on correctness, textual completeness, visual completeness, and faithfulness respectively.

Figure 5: Performance in terms of human evaluation.

Models. Model performance differs by criteria. Flamingo and LLaMA are more correct and faithful, while MiniGPT4 is more visually complete.
Learning Setups. For Flamingo, the few-shot models improve in textual and visual completeness upon the zero-shot model, but not in terms of correctness and faithfulness. This may suggest that while access to examples improves the model's understanding of the task, it might also confuse it with information irrelevant to the target meme. LLaMA doesn't gain any performance improvement from in-context examples, likely for the same reason: without the visual features, it might struggle even more to separate the text (title, image caption, and OCR) of the different examples.

MiniGPT4 zero-shot is very verbose, but the fine-tuned model learns to output captions of the length of its training examples. Unfortunately, these captions are far worse than those of the zero-shot model on all criteria. We hypothesize that the frozen language and vision models may not have enough information about memes, and simply fine-tuning the last projection layer of the model is not enough to produce high-quality captions. This conclusion is consistent with Zhou et al. (2023), according to which most knowledge in LLMs is learned during the pre-training stage.

Common Errors. Figure 6 shows two examples of meme captions generated by 4-shot Flamingo, along with the types of errors they exhibit. The top example demonstrates an unfaithful caption: neither the meme nor the title conveys anything about being successful in life. The bottom example illustrates a common error in which the model copies text from inside the meme while ignoring important visual elements. In this case, Spongebob's smile indicates the meme poster's positive attitude towards reading old and long forum threads, but the model-generated caption misses it. Another common error (not illustrated here) occurs when the model treats visual elements too literally, failing to interpret the metaphor. Finally, in some cases, the model might lack sufficient background knowledge to correctly interpret the meme.

5.3 Ablation Tests

The analysis in Sec 3.4 shows that interpreting most memes in MemeCap requires understanding both the visual and the text modality. We are interested in the extent to which models make use of each modality. To that end, we perform ablation tests that exclude each modality in turn. Table 3 presents the results in terms of automatic metrics.
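A sketch of this ablation loop, with placeholder generation and evaluation functions standing in for the respective model pipelines (the input combinations shown are illustrative; the paper also ablates the image caption for LLaMA):

    # Build the model input with one modality removed at a time and re-run
    # the same automatic evaluation. `generate` and `evaluate` are
    # placeholders for the model-specific pipeline pieces.
    ABLATIONS = {
        "full": {"meme": True, "title": True},
        "no_meme": {"meme": False, "title": True},
        "no_title": {"meme": True, "title": False},
    }

    def run_ablation(examples, generate, evaluate):
        results = {}
        for name, cfg in ABLATIONS.items():
            preds = [
                generate(
                    image=ex["image"] if cfg["meme"] else None,
                    title=ex["title"] if cfg["title"] else None,
                )
                for ex in examples
            ]
            results[name] = evaluate(preds, [ex["caption"] for ex in examples])
        return results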
Top example (error: unfaithful).
Image caption: This is a poster of Game of Thrones, from the tower scene.
Human-written meme caption: Meme poster abandoned Microsoft Excel in school, but need to use it after they get their white collar job.
Model-generated meme caption: Meme poster is trying to convey that they want to be successful in life.

Bottom example.
Human-written meme caption: Meme poster finds it entertaining to read through long comment threads of arguments that happened in the past.
Model-generated meme caption: Meme poster is trying to convey that they read a 153 comment long argument that happened 7 years ago.

Figure 6: Examples of incorrect meme captions generated by the few-shot Flamingo model.
In most cases, the best performance is achieved with both modalities. For Flamingo (zero-shot and few-shot), excluding the meme results in a larger decrease in performance than excluding the title, indicating that the model relies more on the visual modality than on the information provided by the title. The same is true for LLaMA (in both settings), for which excluding the image caption yields worse performance. This is expected, since the title is typically secondary in informativeness to the meme. In addition, Flamingo still has access to the text inside the meme via visual features.

Conversely, MiniGPT4 exhibits a higher dependency on the textual modality, with a significant decrease in performance when the title is not provided. Since MiniGPT4 shows higher textual and visual completeness when the OCR text is provided (§5.2), we hypothesize that MiniGPT4 makes limited use of the visual modality.

Limitations

Quality of Metaphor Annotations. We put our best efforts into manually verifying the collected data, and indeed the human performance in Section 5.2 shows that the human-written captions are of high quality. With that said, we noticed that the quality of the visual metaphors is inconsistent. We believe that while people are capable of explaining a meme, they don't always know how to map the visual vehicles to textual targets. This likely explains why adding the metaphors as inputs didn't improve performance.
Subjectivity and Background Knowledge. The meme captioning task involves employing background knowledge which may vary between annotators. To that end, we manually checked the meme captions to minimize the number of incorrect captions in the dataset. In addition, there is some level of subjectivity with respect to the evaluation criteria for meme caption quality. For this reason, we ensured a high quality of annotations by having in-house annotators who could ask clarification questions, but some subjectivity still remains.

Ethics Statement

Data. All the datasets used in our work are publicly available. Our dataset is collected from Reddit and may contain offensive, hateful, or sexual content. Despite our best efforts to filter such content out, as described in Section 3, people have different criteria for what they perceive as offensive, hateful, or sexual, and thus such content may still exist in our data.

Data Collection. We used Amazon Mechanical Turk to collect 6.3K image descriptions and 7.7K meme captions. The annotators were compensated with an average hourly wage of $13, which is comparable to the US minimum wage. We did not collect any personal information from annotators.

Models. Our dataset may include some offensive content or mild expletives, and this can amplify potentially biased and unethical answers. In addition, the large pre-trained VL models we used for the experiments are trained on a large-scale publicly available web corpus and may also introduce bias when generating sentences.

References

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. 2022. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems.

Malihe Alikhani, Piyush Sharma, Shengjie Li, Radu Soricut, and Matthew Stone. 2020. Cross-modal coherence modeling for caption generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6525–6535, Online. Association for Computational Linguistics.

Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. 2023. OpenFlamingo.

Nitzan Bitton-Guetta, Yonatan Bitton, Jack Hessel, Ludwig Schmidt, Yuval Elovici, Gabriel Stanovsky, and Roy Schwartz. 2023. Breaking common sense: WHOOPS! A vision-and-language benchmark of synthetic and compositional images.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and efficient foundation language models.

A Additional Experimental Results

We show the full experimental results in Table 4.
Table 4: 0, 4, 8, 12 shot results with Flamingo, LLaMA, and MiniGPT4 models. “-” indicates the model ran out of
memory.