
Pun Generation with Surprise

He He1∗ and Nanyun Peng2∗ and Percy Liang1


1 Computer Science Department, Stanford University
2 Information Sciences Institute, University of Southern California
{hehe,pliang}@cs.stanford.edu, npeng@isi.edu

∗ Equal contribution.

arXiv:1904.06828v1 [cs.CL] 15 Apr 2019

Abstract

We tackle the problem of generating a pun sentence given a pair of homophones (e.g., “died” and “dyed”). Supervised text generation is inappropriate due to the lack of a large corpus of puns, and even if such a corpus existed, mimicry is at odds with generating novel content. In this paper, we propose an unsupervised approach to pun generation using a corpus of unhumorous text and what we call the local-global surprisal principle: we posit that in a pun sentence, there is a strong association between the pun word (e.g., “dyed”) and the distant context, as well as a strong association between the alternative word (e.g., “died”) and the immediate context. This contrast creates surprise and thus humor. We instantiate this principle for pun generation in two ways: (i) as a measure based on the ratio of probabilities under a language model, and (ii) a retrieve-and-edit approach based on words suggested by a skip-gram model. Human evaluation shows that our retrieve-and-edit approach generates puns successfully 31% of the time, tripling the success rate of a neural generation baseline.

[Figure 1]
Global context: “Yesterday I accidentally swallowed some food coloring. The doctor says I’m OK, but I feel like I’ve dyed a little inside.”
Local context: “I’ve dyed a little inside.”
Pun word: dyed. Alternative word: died.

Figure 1: An illustration of a homophonic pun. The pun word appears in the sentence, while the alternative word, which has the same pronunciation but different meaning, is implicated. The local context refers to the immediate words around the pun word, whereas the global context refers to the whole sentence.

1 Introduction

Generating creative content is a key requirement in many natural language generation tasks such as poetry generation (Manurung et al., 2000; Ghazvininejad et al., 2016), story generation (Meehan, 1977; Peng et al., 2018; Fan et al., 2018; Yao et al., 2019), and social chatbots (Weizenbaum, 1966; Hao et al., 2018). In this paper, we explore creative generation with a focus on puns. We follow the definition of puns in Aarons (2017); Miller et al. (2017): “A pun is a form of wordplay in which one sign (e.g., a word or a phrase) suggests two or more meanings by exploiting polysemy, homonymy, or phonological similarity to another sign, for an intended humorous or rhetorical effect.” We focus on a typical class of puns where the ambiguity comes from two (near) homophones. Consider the example in Figure 1: “Yesterday I accidentally swallowed some food coloring. The doctor says I’m OK, but I feel like I’ve dyed (died) a little inside.” The pun word shown in the sentence (“dyed”) indicates one interpretation: the person is colored inside by food coloring. On the other hand, an alternative word (“died”) is implied by the context for another interpretation: the person is sad due to the accident.

Current approaches to text generation require lots of training data, but there is no large corpus of puns. Even if such a corpus existed, learning the distribution of existing data and sampling from it is unlikely to lead to truly novel, creative sentences. Creative composition requires deviating from the norm, whereas standard generation approaches seek to mimic the norm.

Recently, Yu et al. (2018) proposed an unsupervised approach that generates puns from a neural language model by jointly decoding conditioned on both the pun word and the alternative word, thus injecting ambiguity into the output sentence. However, Kao et al. (2015) showed that ambiguity alone is insufficient to bring humor; the two meanings must also be supported by distinct sets of words in the sentence.
Inspired by Kao et al. (2015), we propose a general principle for puns which we call the local-global surprisal principle. Our key observation is that the relative strength of the pun-word interpretation and the alternative-word interpretation flips as one reads the sentence. For example, in Figure 1, “died” is favored by the immediate (local) context, whereas “dyed” is favored by the global context (i.e., “...food coloring...”). Our surprisal principle posits that the pun word is much more surprising in the local context than in the global context, while the opposite is true for the alternative word.

We instantiate our local-global surprisal principle in two ways. First, we develop a quantitative metric for surprise based on the conditional probabilities of the pun word and the alternative word given local and global contexts under a neural language model. However, we find that this metric is not sufficient for generation. We then develop an unsupervised approach to generate puns based on a retrieve-and-edit framework (Guu et al., 2018; Hashimoto et al., 2018) given an unhumorous corpus (Figure 2). We call our system SurGen (SURprisal-based pun GENeration).

We test our approach on 150 pun-alternative word pairs.1 First, we show a strong correlation between our surprisal metric and funniness ratings from crowdworkers. Second, human evaluation shows that our system generates puns successfully 31% of the time, compared to 9% for a neural generation baseline (Yu et al., 2018), and results in higher funniness scores.

1 Our code and data are available at https://github.com/hhexiy/pungen.

2 Problem Statement

We assume access to a large corpus of raw (unhumorous) text. Given a pun word wp (e.g., “dyed”) and an alternative word wa (e.g., “died”) which are (near) homophones, we aim to generate a list of pun sentences. A pun sentence contains only the pun word wp, but both wp and wa should be evoked by the sentence.

3 Approach

3.1 Surprise in Puns

What makes a good pun sentence? Our key observation is that as a reader processes a sentence, he or she expects to see the alternative word at the pun word position, and is tickled by the relation between the pun word and the rest of the sentence. Consider the following cloze test: “Yesterday I accidentally swallowed some food coloring. The doctor says I’m OK, but I feel like I’ve ___ a little inside.” Most people would expect the word in the blank to be “died”, whereas the actual word is “dyed”. Locally, “died a little inside” is much more likely than “dyed a little inside”. However, globally, when looking back at the whole sentence, “dyed” is evoked by “food coloring”.

Formally, wp is more surprising relative to wa in the local context, but much less so in the global context. We hypothesize that this contrast between local and global surprisal creates humor.

3.2 A Local-Global Surprisal Measure

Let us try to formalize the local-global surprisal principle quantitatively. To measure the amount of surprise due to seeing the pun word instead of the alternative word in a certain context c, we define the surprisal S as the log-likelihood ratio of the two events:

S(c) \overset{\text{def}}{=} -\log \frac{p(w_p \mid c)}{p(w_a \mid c)} = -\log \frac{p(w_p, c)}{p(w_a, c)}. \qquad (1)

We define the local surprisal to only consider the context of a span around the pun word, and the global surprisal to consider the context of the whole sentence. Letting x_1, \ldots, x_n be a sequence of tokens and x_p be the pun word w_p, we have

S_{\text{local}} \overset{\text{def}}{=} S(x_{p-d:p-1}, x_{p+1:p+d}), \qquad (2)
S_{\text{global}} \overset{\text{def}}{=} S(x_{1:p-1}, x_{p+1:n}), \qquad (3)

where d is the local window size.

For puns, both the local and the global surprisal should be positive because puns are unusual sentences by nature. However, the global surprisal should be lower than the local surprisal due to topic words hinting at the pun word. We use the following unified metric, the local-global surprisal, to quantify whether a sentence is a pun:

S_{\text{ratio}} \overset{\text{def}}{=} \begin{cases} -1 & \text{if } S_{\text{local}} < 0 \text{ or } S_{\text{global}} < 0, \\ S_{\text{local}} / S_{\text{global}} & \text{otherwise.} \end{cases} \qquad (4)

We hypothesize that a larger Sratio is indicative of a good pun. Note that this hypothesis is invalid when either Slocal or Sglobal is negative, in which case we consider the sentences equally unfunny by setting Sratio to −1.
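The quantities in Equations (1)-(4) reduce to a handful of language-model queries. The sketch below is one minimal way to compute Slocal, Sglobal, and Sratio with a generic pretrained causal language model from the HuggingFace transformers library; the choice of GPT-2 and the helper names (sequence_logprob, surprisal, local_global_surprisal) are illustrative assumptions, not the paper's released code (the paper later uses a language model trained on WikiText-103; see Section 4.2).

# A minimal sketch of the local-global surprisal metric (Equations 1-4),
# assuming a generic pretrained causal LM from HuggingFace transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_logprob(text):
    """Total log p(text) under the LM, summed over tokens (first token skipped)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = logprobs[torch.arange(ids.size(1) - 1), ids[0, 1:]]
    return token_lp.sum().item()

def surprisal(words, pun_idx, pun_word, alt_word):
    """S(c) = log p(alt, c) - log p(pun, c) for the given word context (Eq. 1)."""
    with_pun = " ".join(words[:pun_idx] + [pun_word] + words[pun_idx + 1:])
    with_alt = " ".join(words[:pun_idx] + [alt_word] + words[pun_idx + 1:])
    return sequence_logprob(with_alt) - sequence_logprob(with_pun)

def local_global_surprisal(words, pun_idx, pun_word, alt_word, d=2):
    """S_ratio from Equation 4; returns -1 when either surprisal is negative."""
    lo, hi = max(0, pun_idx - d), min(len(words), pun_idx + d + 1)
    s_local = surprisal(words[lo:hi], pun_idx - lo, pun_word, alt_word)
    s_global = surprisal(words, pun_idx, pun_word, alt_word)
    if s_local < 0 or s_global < 0:
        return -1.0
    return s_local / s_global

sent = ("yesterday i accidentally swallowed some food coloring . the doctor says "
        "i 'm ok , but i feel like i 've dyed a little inside .").split()
print(local_global_surprisal(sent, sent.index("dyed"), "dyed", "died"))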
[Figure 2]
wp = hare, wa = hair
Retrieve using “hair”: the man stopped to get a hair cut.
Swap hair → hare: the man stopped to get a hare cut.
Insert topic man → greyhound: the greyhound stopped to get a hare cut.

Figure 2: Overview of our pun generation approach. Given a pair of pun/alternative words, we first retrieve sentences containing wa from a generic corpus. Next, wa is replaced by wp to increase local surprisal. Lastly, we insert a topic word at the beginning of the sentence to create global associations supporting wp and decrease global surprisal.

3.3 Generating Puns

The surprisal metric above can be used to assess whether a sentence is a pun, but to generate puns, we need a procedure that can ensure grammaticality. Recall that the surprisal principle requires (1) a strong association between the alternative word and the local context; (2) a strong association between the pun word and the distant context; and (3) both words to be interpretable given the local and global context, to maintain ambiguity.

Our strategy is to model puns as deviations from normality. Specifically, we mine seed sentences (sentences with the potential to be transformed into puns) from a large, generic corpus, and edit them to satisfy the three requirements above. Figure 2 gives an overview of our approach.

Suppose we are generating a pun given wp = “hare” and wa = “hair”. To reinforce wa = “hair” in the local context despite the appearance of “hare”, we retrieve sentences containing “hair” and replace occurrences of it with “hare”. Here, the local context strongly favors the alternative word (“hair cut”) relative to the pun word (“hare cut”). Next, to make the pun word “hare” more plausible, we insert a “hare”-related topic word (“greyhound”) near the beginning of the sentence. In summary, we create local surprisal by putting wp in common contexts for wa, and connect wp to a distant topic word by substitution. We describe each step in detail below.

Local surprisal. The first step is to retrieve sentences containing wa. A typical pattern of pun sentences is that the pun word occurs only once, towards the end of the sentence, which separates the local context from pun-related topics at the beginning. Therefore, we retrieve sentences containing exactly one occurrence of wa and rank them by the position of wa in the sentence (later is better). Next, we replace wa in the retrieved sentence with wp. The pun word usually fits in the context, as it often has the same part-of-speech tag as the alternative word. Thus the swap creates local surprisal by putting the pun word in an unusual but acceptable context. We call this step Retrieve+Swap, and use it as a baseline to generate puns (a sketch of these two steps follows below).
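As a concrete illustration, here is a minimal sketch of the retrieval and swap steps over a pre-tokenized corpus; the list-based corpus and the ranking by relative position are simplifying assumptions rather than the exact released implementation.

# A minimal sketch of RETRIEVE+SWAP, assuming `corpus` is a list of
# pre-tokenized sentences (e.g., drawn from BookCorpus).
def retrieve_seeds(corpus, alt_word, k=100):
    """Keep sentences with exactly one occurrence of the alternative word,
    ranked so that later occurrences come first (later is better)."""
    candidates = []
    for words in corpus:
        if words.count(alt_word) == 1:
            rel_pos = words.index(alt_word) / len(words)
            candidates.append((rel_pos, words))
    # a later position separates the local context around the pun word
    # from topic words at the beginning of the sentence
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [words for _, words in candidates[:k]]

def swap(words, alt_word, pun_word):
    """Replace the alternative word with the pun word to create local surprisal."""
    return [pun_word if w == alt_word else w for w in words]

corpus = [
    "the man stopped to get a hair cut .".split(),
    "her hair was long and her hair shone .".split(),  # two occurrences: filtered out
]
for seed in retrieve_seeds(corpus, "hair"):
    print(" ".join(swap(seed, "hair", "hare")))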
Global surprisal. While the pun word is locally unexpected, we need to foreshadow it. This global association must not be so strong that it eliminates the ambiguity. Therefore, we include a single topic word related to the pun word by replacing one word at the beginning of the seed sentence. We see this simple structure in many human-written puns as well, for example, “Old butchers never die, they only meat their fate.”, where the pun word (“meat”) is foreshadowed by its corresponding topic word (“butchers”).

We define the relatedness between two words wi and wj based on a “distant” skip-gram model pθ(wj | wi), where we train pθ to maximize the probability of wj given wi for all wi, wj in the same sentence that are between d1 and d2 words apart. Formally, the objective for a word wi at position i is

\sum_{j=i-d_2}^{i-d_1} \log p_\theta(w_j \mid w_i) + \sum_{j=i+d_1}^{i+d_2} \log p_\theta(w_j \mid w_i). \qquad (5)

We take the top-k predictions from pθ(w | wp), where wp is the pun word, as candidate topic words w, to be further filtered next.
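The training pairs behind Equation (5) are easy to enumerate. The sketch below extracts (wi, wj) pairs that could be fed to a standard skip-gram trainer; the (center, context) pair format and the example window bounds are illustrative assumptions.

# A minimal sketch of extracting training pairs for the "distant" skip-gram
# model of Equation (5): context words between d1 and d2 positions away.
def distant_skipgram_pairs(words, d1=5, d2=10):
    pairs = []
    for i, center in enumerate(words):
        for offset in range(d1, d2 + 1):
            if i - offset >= 0:
                pairs.append((center, words[i - offset]))   # left context
            if i + offset < len(words):
                pairs.append((center, words[i + offset]))   # right context
    return pairs

sentence = "old butchers never die , they only meat their fate .".split()
# each pair (w_i, w_j) is one training example for maximizing log p_theta(w_j | w_i)
print(distant_skipgram_pairs(sentence, d1=5, d2=8)[:5])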
Type consistency constraint. The replacement must maintain the acceptability of the sentence. For example, changing “person” to “ship” in “Each person must pay their fare share” does not make sense even though “ship” and “fare” are related. Therefore, we restrict the deleted word in the seed sentence to nouns and pronouns, as verbs have more constraints on their arguments and replacing them is likely to result in unacceptable sentences. In addition, we select candidate topic words that are type-consistent with the deleted word, e.g., replacing “person” with “passenger” as opposed to “ship”. We define type consistency (for nouns) based on WordNet path similarity.2 Given two words, we get their synsets from WordNet constrained by their POS tags.3 If the path similarity between any pair of senses from the two respective synsets is larger than a threshold, we consider the two words type-consistent. In summary, the first noun or pronoun in the seed sentence is replaced by a type-consistent topic word. We call this baseline Retrieve+Swap+Topic.

2 Path similarity is a score between 0 and 1 that is inversely proportional to the shortest distance between two word senses in WordNet.
3 Pronouns are mapped to the synset person.n.01.
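A minimal sketch of this check with NLTK's WordNet interface is shown below; the threshold of 0.3 is the value reported in Section 4.3, mapping pronouns to person.n.01 follows footnote 3, and the small pronoun list is an illustrative stand-in.

# A minimal sketch of the type-consistency check based on WordNet path
# similarity, using NLTK (requires the WordNet data: nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

PRONOUNS = {"i", "you", "he", "she", "we", "they", "me", "him", "her", "us", "them"}

def noun_synsets(word):
    # pronouns are mapped to the synset person.n.01 (footnote 3)
    if word.lower() in PRONOUNS:
        return [wn.synset("person.n.01")]
    return wn.synsets(word, pos=wn.NOUN)

def type_consistent(deleted_word, topic_word, threshold=0.3):
    """True if any pair of noun senses has path similarity above the threshold."""
    for s1 in noun_synsets(deleted_word):
        for s2 in noun_synsets(topic_word):
            sim = s1.path_similarity(s2)
            if sim is not None and sim >= threshold:
                return True
    return False

print(type_consistent("person", "passenger"))  # expected: True
print(type_consistent("person", "ship"))       # expected: False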
Improve grammaticality. Directly replacing a word with the topic word may result in ungrammatical sentences, e.g., replacing “i” with “negotiator” and getting “negotiator am just a woman trying to peace her life back together.”. Therefore, we use a sequence-to-sequence model to smooth the edited sentence (Retrieve+Swap+Topic+Smoother).

We smooth the sentence by deleting words around the topic word and training a model to fill in the blank. The smoother is trained in a similar fashion to denoising autoencoders: we delete the immediate neighbors of a word in a sentence, and ask the model to reconstruct the sentence by predicting the missing neighbors. A training example is shown below:

Original: the man slowly walked towards the woods .
Input:    <i> man </i> walked towards the woods .
Output:   the man slowly

During training, the word to delete is selected in the same way as the word to replace in a seed sentence, i.e., nouns or pronouns at the beginning of a sentence. At test time, the smoother is expected to fill in words that connect the topic word with the seed sentence in a grammatical way, e.g., “the negotiator is just a woman trying to peace her life back together.” (the smoother rewrote the beginning to “the negotiator is”).
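Constructing such training pairs is mechanical. The sketch below reproduces the example above; the <i> ... </i> markers and the one-word-each-side deletion follow that example, while the function name is a hypothetical stand-in for the actual data-preparation code.

# A minimal sketch of building one denoising training example for the smoother:
# delete the immediate neighbors of a kept word and ask the model to predict them.
def make_smoother_example(words, keep_idx, window=1):
    left = max(0, keep_idx - window)
    right = min(len(words), keep_idx + window + 1)
    # the input marks the kept word and drops its neighbors
    source = words[:left] + ["<i>", words[keep_idx], "</i>"] + words[right:]
    # the target is the original span the model must reconstruct
    target = words[left:right]
    return " ".join(source), " ".join(target)

orig = "the man slowly walked towards the woods .".split()
src, tgt = make_smoother_example(orig, orig.index("man"))
print(src)  # <i> man </i> walked towards the woods .
print(tgt)  # the man slowly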
4 Experiments

We first evaluate how well our surprisal principle predicts the funniness of sentences perceived by humans (Section 4.2), and then compare our pun generation system and its variations with a simple retrieval baseline and a neural generation model (Yu et al., 2018) (Section 4.3). We show that the local-global surprisal scores strongly correlate with human ratings of funniness, and all of our systems outperform the baselines based on human evaluation. In particular, Retrieve+Swap+Topic (henceforth SurGen) achieves the highest success rate and average funniness score among all systems.

4.1 Datasets

We use the pun dataset from SemEval 2017 Task 7 (Doogan et al., 2017). The dataset contains 1099 human-written puns annotated with pun words and alternative words, from which we take 219 for development. We use BookCorpus (Zhu et al., 2015) as the generic corpus for retrieval and for training various components of our system.

4.2 Analysis of the Surprisal Principle

We evaluate the surprisal principle by analyzing how well the local-global surprisal score (Equation (4)) predicts funniness rated by humans. We first give a brief overview of previous computational accounts of humor, and then analyze the correlation between each metric and human ratings.

Prior funniness metrics. Kao et al. (2015) proposed two information-theoretic metrics: ambiguity of meanings and distinctiveness of supporting words. Ambiguity says that the sentence should support both the pun meaning and the alternative meaning. Distinctiveness further requires that the two meanings be supported by distinct sets of words.

In contrast, our metric based on the surprisal principle imposes additional requirements. First, surprisal says that while both meanings are acceptable (indicating ambiguity), the pun meaning is unexpected based on the local context. Second, the local-global surprisal contrast requires the pun word to be well supported in the global context.

Given the anomalous nature of puns, we also consider a metric for unusualness based on normalized log-probabilities under a language model (Pauls and Klein, 2012):

\text{Unusualness} \overset{\text{def}}{=} -\frac{1}{n} \log\left( p(x_1, \ldots, x_n) \Big/ \prod_{i=1}^{n} p(x_i) \right). \qquad (6)
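Equation (6) needs only the language-model probability of the sentence and a unigram baseline. The sketch below assumes a sequence_logprob helper like the one in the surprisal sketch of Section 3.2 and a unigram table estimated from corpus counts; both helpers and the OOV floor are illustrative assumptions.

# A minimal sketch of the unusualness metric (Equation 6): the negated,
# length-normalized log-ratio between the LM probability of the sentence
# and the product of unigram probabilities of its words.
import math
from collections import Counter

def build_unigram_logprob(corpus_tokens):
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {w: math.log(c / total) for w, c in counts.items()}

def unusualness(words, sequence_logprob, unigram_logprob, oov_logprob=math.log(1e-8)):
    lm_lp = sequence_logprob(" ".join(words))                          # log p(x_1, ..., x_n)
    uni_lp = sum(unigram_logprob.get(w, oov_logprob) for w in words)   # log prod_i p(x_i)
    return -(lm_lp - uni_lp) / len(words)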
Type      Example                                                              SemEval              Kao
                                                                               Count   Funniness    Count   Funniness
Pun       Yesterday a cow saved my life—it was bovine intervention.            33      1.13         141     1.09
Swap-pun  Yesterday a cow saved my life—it was divine intervention.            33      0.05         0       —
Non-pun   The workers are all involved in studying the spread of bovine TB.    64      -0.34        257     -0.53

Table 1: Dataset statistics and funniness ratings of SemEval and Kao. Pun or alternative words are underlined in the example sentence. Each worker’s ratings are standardized to z-scores. There is clear separation among the three types in terms of funniness, where pun > swap-pun > non-pun.

Metric               Pun and non-pun              Pun and swap-pun     Pun
                     SemEval        Kao           SemEval              SemEval        Kao
Surprisal (Sratio)   0.46 p=0.00    0.58 p=0.00   0.48 p=0.00          0.26 p=0.15    0.08 p=0.37
Ambiguity            0.40 p=0.00    0.59 p=0.00   0.18 p=0.15          0.00 p=0.98    0.00 p=0.95
Distinctiveness      -0.17 p=0.10   0.29 p=0.00   0.15 p=0.24          0.41 p=0.02    0.27 p=0.00
Unusualness          0.37 p=0.00    0.36 p=0.00   0.19 p=0.12          0.20 p=0.27    0.11 p=0.18

Table 2: Spearman correlation between different metrics and human ratings of funniness. Statistically significant correlations with p-value < 0.05 are bolded. Our surprisal principle successfully differentiates puns from non-puns and swap-puns. Distinctiveness is the only metric that correlates strongly with human ratings within puns. However, no single metric works well across different types of sentences.

Implementation details. Both ambiguity and distinctiveness are based on a generative model of puns. Each sentence has a latent variable z ∈ {wp, wa} corresponding to the pun meaning and the alternative meaning. Each word also has a latent meaning assignment variable f controlling whether it is generated from an unconditional unigram language model or a unigram model conditioned on z. Ambiguity is defined as the entropy of the posterior distribution over z given all the words, and distinctiveness is defined as the symmetrized KL-divergence between the distributions of the assignment variables given the pun meaning and the alternative meaning, respectively. The generative model relies on p(xi | z), which Kao et al. (2015) estimate using human ratings of word relatedness. We instead use the skip-gram model described in Section 3.3, as we are interested in a fully automated system.

For local-global surprisal and unusualness, we estimate probabilities of text spans using a neural language model trained on WikiText-103 (Merity et al., 2016).4 The local context window size (d in Equation (2)) is set to 2.

4 https://dl.fbaipublicfiles.com/fairseq/models/wiki103_fconv_lm.tar.bz2
Human ratings of funniness. Similar to Kao et al. (2015), to test whether a metric can differentiate puns from normal sentences, we collected ratings for both puns from the SemEval dataset and non-puns retrieved from the generic corpus containing either wp or wa. To test the importance of surprisal, we also included swap-puns where wp is replaced by wa, which results in sentences that are ambiguous but not necessarily surprising.

We collected all of our human ratings on Amazon Mechanical Turk (AMT). Workers are asked to answer the question “How funny is this sentence?” on a scale from 1 (not at all) to 7 (extremely). We obtained funniness ratings on 130 sentences from the development set with 33 puns, 33 swap-puns, and 64 non-puns. 48 workers each read roughly 10–20 sentences in random order, counterbalanced for the sentence types of non-puns, swap-puns, and puns. Each sentence is rated by 5 workers, and we removed 10 workers whose maximum Spearman correlation with other people rating the same sentence is lower than 0.2. The average Spearman correlation among all the remaining workers (which captures inter-annotator agreement) is 0.3. We z-scored the ratings of each worker for calibration and took the average z-scored rating of a sentence as its funniness score.

Table 1 shows the statistics of our annotated dataset (SemEval) and Kao et al. (2015)’s dataset (Kao). Note that the two datasets have different numbers and types of sentences, and the human ratings were collected separately. As expected, puns are funnier than both swap-puns and non-puns. Swap-puns are funnier than non-puns, possibly because they have the inherent ambiguity brought by the Retrieve+Swap operation.
Method     Success   Funniness   Grammar
NJD        9.2%      1.4         2.6
R          4.6%      1.3         3.9
R+S        27.0%     1.6         3.5
R+S+T+M    28.8%     1.7         2.9
SurGen     31.4%     1.7         3.0
Human      78.9%     3.0         3.8

Table 3: Human evaluation results of all systems. We show average scores of funniness and grammaticality on a 1-5 scale and success rate computed from yes/no responses. We compare with two baselines: Neural Joint Decoder (NJD) and Retrieve (R). R+S, SurGen, and R+S+T+M are three variations of our method: Retrieve+Swap, Retrieve+Swap+Topic, and Retrieve+Swap+Topic+Smoother, respectively. Overall, SurGen performs the best across the board.

Automatic metrics of funniness. We analyze the following metrics: local-global surprisal (Sratio), ambiguity, distinctiveness, and unusualness, with respect to their correlation with human ratings of funniness. For each metric, we standardized the scores, and outliers beyond two standard deviations are set to +2 or −2 accordingly.5 We then compute the metrics’ Spearman correlation with human ratings. On Kao, we directly took the ambiguity scores and distinctiveness scores from the original implementation, which requires human-annotated word relatedness.6 On SemEval, we used our reimplementation of Kao et al. (2015)’s algorithm, but with the skip-gram model.

5 Since both Sratio and distinctiveness are unbounded, bounding the values gives more reliable correlation results.
6 https://github.com/amoudgl/pun-model
Overall, S UR G EN performs the best across the board.
The results are shown in Table 2. For puns
and non-puns, all metrics correlate strongly with
4.3 Pun Generation Results
human scores, indicating all of them are useful
for pun detection. For puns and swap-puns, only Systems. We compare with a recent neural pun
local-global surprisal (Sratio ) has strong correla- generator (Yu et al., 2018). They proposed an un-
tion, which shows that surprisal is important for supervised approach based on generic language
characterizing puns. Ambiguity and distinctive- models to generate homographic puns.7 Their
ness do not differentiate pun word from the alter- approach takes as input two senses of a tar-
native word, and unusualness only considers prob- get word (e.g., bat.n01, bat.n02 from WordNet
ability of the sentence with the pun word, thus they synsets), and decodes from both senses jointly by
do not correlate as significantly as Sratio . taking a product of the probabilities conditioned
on the two senses respectively (e.g., bat.n01 and
Within puns, only distinctiveness has significant bat.n02), so that both senses are reflected in the
correlation, whereas the other metrics are not fine- output. To ensure that the target word appears
grained enough to differentiate good puns from in the middle of a sentence, they decode back-
mediocre ones. Overall, no single metric is ro- ward from the target word towards the beginning
bust enough to score funniness across all types of and then decode forward to complete the sen-
sentences, which makes it hard to generate puns tence. We adapted their method to generate ho-
by optimizing automatic metrics of funniness di- mophonic puns by considering wp and wa as two
rectly. input senses and decoding from the pun word. We
There is slight inconsistency between results on retrained their forward / backward language mod-
S EM E VAL and K AO. Specifically, for puns and els on the same BookCorpus used for our sys-
non-puns, the distinctiveness metric shows a sig- tem. For comparison, we chose their best model
nificant correlation with human ratings on K AO (N EURAL J OINT D ECODER), which mainly cap-
but not on S EM E VAL. We hypothesize that it is tures ambiguity in puns.
mainly due to differences in the two corpora and In addition, we include a retrieval baseline
noise from the skip-gram approximation. For ex- (R ETRIEVE) which simply retrieves sentences
ample, our dataset contains longer sentences with containing the pun word.
an average length of 20 words versus 11 words for For our systems, we include the entire pro-
K AO. Further, Kao et al. (2015) used human anno- gression of methods described in Section 3
tation of word relatedness while we used the skip- (R ETRIEVE +S WAP, R ETRIEVE +S WAP +T OPIC,
gram model to estimate p(xi | z). and R ETRIEVE +S WAP +T OPIC +S MOOTHER).
Implementation details. The key components
5
of our systems include a retriever, a skip-gram
Since both Sratio and distinctiveness are unbounded,
bounding the values gives more reliable correlation results. 7
Sentences where the pun word and alternative word have
6
https://github.com/amoudgl/pun-model the same written form (e.g., bat) but different senses.
S UR G EN v.s. NJD S UR G EN v.s. Human A breakdown of error types
Aspect
win % lose % win % lose %
14%
Success 48.0 5.3 6.0 78.7 28%
Funniness 56.7 25.3 10.7 85.3
Grammar 60.7 30.0 8.0 82.0
6%
38%
Table 4: Pairwise comparison between S UR G EN and 8%
N EURAL J OINT D ECODER (NJD), and between S UR -
G EN and human written puns. Win % (lose %) is 32%
the percentage among the human-rated 150 sentences topic not fit pun word not fit
where S UR G EN achieves a higher (lower) average topic-pun word not related grammar error
score compared to the other method. The rest are ties. fail to generate bad seed sentence

Figure 3: Error case breakdown shows that the main


model for topic word prediction, a type consis- issues lie in finding seed sentences that accommodates
tency checker, and a neural smoother. Given an both the pun word and the topic word (topic not fit +
alternative word, the retriever returned 500 candi- pun word not fit + bad seed sentence).
dates, among which we took the top 100 as seed
sentences (Section 3.3 local surprisal). For topic in the same group.
words, we took the top 100 words predicted by the We evaluated 150 pun/alternative word pairs.
skip-gram model and filtered them to ensure type Each generated sentence was rated by 5 workers
consistency with the deleted word (Section 3.3 and their scores were averaged. N/A ratings were
global surprisal). The WordNet path similarity excluded unless all ratings of a sentence were N/A,
threshold for type consistency was set to 0.3. in which case we set its score to 0. We attracted
The skip-gram model was trained on BookCor- 65, 93, 66 workers for the success, funniness, and
pus with d1 = 5 and d2 = 10 in Equation (5). We grammaticality surveys respectively, and removed
set the word embedding size to 300 and trained 3, 4, 4 workers because their maximum Spear-
for 15 epochs using Adam (Kingma and Ba, 2014) man correlation with other workers was lower than
with a learning rate of 0.0001. For the neural 0.2. We measure inter-annotator agreement using
smoother, we trained a single-layer LSTM (512 average Spearman correlation among all workers,
hidden units) sequence-to-sequence model with and the average inter-annotator Spearman correla-
attention on BookCorpus. The model was trained tion for success, funniness, and grammaticality are
for 50 epochs using AdaGrad (Duchi et al., 2010) 0.57, 0.36, and 0.32, respectively.
with a learning rate of 0.01 and a dropout rate of Table 3 shows the overall results. All 3 of our
0.1. systems outperform the baselines in terms of suc-
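Putting the pieces together, the sketch below composes the stages of SurGen (Retrieve+Swap+Topic, with an optional smoother). Every argument is a stand-in for the components described above (seed retrieval, topic-word prediction, the type-consistency check, and the smoother), so this is an illustrative outline rather than the released pungen code.

# A minimal end-to-end sketch of SURGEN: swap the alternative word for the pun
# word in each seed sentence, then replace the first noun/pronoun with a
# type-consistent topic word, optionally smoothing the result.
def surgen(pun_word, alt_word, seeds, topic_words, first_slot, type_consistent,
           smoother=None):
    """seeds:           tokenized seed sentences containing alt_word
    topic_words:     candidate topic words for pun_word, best first
    first_slot:      function giving the index of the first noun/pronoun (or None)
    type_consistent: function (deleted_word, topic_word) -> bool
    smoother:        optional function (tokens, edited_index) -> tokens"""
    outputs = []
    for seed in seeds:
        swapped = [pun_word if w == alt_word else w for w in seed]
        slot = first_slot(swapped)
        if slot is None:
            continue
        for topic in topic_words:
            if not type_consistent(swapped[slot], topic):
                continue
            edited = swapped[:slot] + [topic] + swapped[slot + 1:]
            if smoother is not None:
                edited = smoother(edited, slot)
            outputs.append(" ".join(edited))
            break  # keep one topic word per seed sentence
    return outputs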
Aspect      SurGen vs. NJD          SurGen vs. Human
            win %       lose %      win %       lose %
Success     48.0        5.3         6.0         78.7
Funniness   56.7        25.3        10.7        85.3
Grammar     60.7        30.0        8.0         82.0

Table 4: Pairwise comparison between SurGen and Neural Joint Decoder (NJD), and between SurGen and human-written puns. Win % (lose %) is the percentage among the human-rated 150 sentences where SurGen achieves a higher (lower) average score compared to the other method. The rest are ties.

[Figure 3: a pie chart breaking down error cases into six types: topic not fit, pun word not fit, topic-pun word not related, grammar error, fail to generate, and bad seed sentence.]

Figure 3: Error case breakdown shows that the main issues lie in finding seed sentences that accommodate both the pun word and the topic word (topic not fit + pun word not fit + bad seed sentence).

Human evaluation. We hired workers on AMT to rate outputs from all 5 systems together with expert-written puns from the SemEval pun dataset. Each worker was shown a group of sentences generated by all systems (randomly shuffled) given the same pun word and alternative word pair. Workers were asked to rate each sentence on three aspects: (1) success (“Is the sentence a pun?”),8 (2) funniness (“How funny is the sentence?”), and (3) grammaticality (“How grammatical is the sentence?”). Success was rated as yes/no, and funniness and grammaticality were rated on a scale from 1 (not at all) to 5 (very). We also included an N/A choice (does not make sense) for funniness to exclude cases where the sentence is not understandable. Workers were explicitly instructed to try their best to give different scores for sentences in the same group.

8 They were shown the definition from Miller et al. (2017).

We evaluated 150 pun/alternative word pairs. Each generated sentence was rated by 5 workers and their scores were averaged. N/A ratings were excluded unless all ratings of a sentence were N/A, in which case we set its score to 0. We attracted 65, 93, and 66 workers for the success, funniness, and grammaticality surveys respectively, and removed 3, 4, and 4 workers because their maximum Spearman correlation with other workers was lower than 0.2. We measure inter-annotator agreement using the average Spearman correlation among all workers; the average inter-annotator Spearman correlations for success, funniness, and grammaticality are 0.57, 0.36, and 0.32, respectively.

Table 3 shows the overall results. All 3 of our systems outperform the baselines in terms of success rate and funniness. More edits (i.e., swapping and inserting topic words) made the sentences less grammatical, but also much more pun-like (higher success rate). Interestingly, introducing the neural smoother did not improve grammaticality and hurt the success rate slightly. Manual inspection shows that ungrammaticality is often caused by an improper topic word, so fixing its neighboring words does not truly solve the problem. One example is filling in “drum” (related to “lute”) in “if ___ that was it was likely that another body would turn up soon, because someone probably wouldn’t want to share the lute.”. In addition, when the neural model is given a rare topic word, it tends to rewrite it to a common phrase instead, again showing that supervised learning is against the spirit of generating novel content. For example, inserting “gentlewoman” into “___ not allow me to ...” produces “these people did not allow me to ...”. Overall, our SurGen performs the best, tripling the success rate of Neural Joint Decoder with improved funniness and grammaticality scores. Nevertheless, there is still a significant gap between generated puns and expert-written puns across all aspects, indicating that pun generation remains an open challenge.
Method   Example                                                                                           Rating
1. Pun/alternative word pair: butter – better
NJD      He is going to come up with the butter a ‘very good’ approach to the world’s economic crisis,     1
         the world’s biggest economic climate.
SurGen   Well, gourmet did it, he thought, it’d butter be right.                                            2
Human    Why did the dairy churn? The less said, the butter...                                              1.5
2. Pun/alternative word pair: peace – piece
NJD      Further, he said, at the end of the peace, it’s not clear that it will be a good example.          1
SurGen   That’s because negotiator got my car back to me in one peace.                                      1.5
Human    Life is a puzzle; look here for the missing peace.                                                 3
3. Pun/alternative word pair: flour – flower
NJD      Go, and if you are going on the flour.                                                             1
SurGen   Butter want to know who these two girls are, the new members of the holy flour.                    1.5
Human    Betty crocker was a flour child.                                                                   4.5
4. Pun/alternative word pair: wait – weight
NJD      Gordon Brown, Georgia’s prime minister, said he did not have to wait, but he was not sure          0
         whether he had been killed.
SurGen   Even from the outside, I could tell that he’d already lost some wait.                              2
Human    Patience is a virtue heavy in wait.                                                                3

Table 5: Examples of generated puns with average human ratings of funniness (1-5). 0 means that all ratings are N/A (does not make sense).

Table 4 shows the pairwise comparison results among our best model SurGen, Neural Joint Decoder, and expert-written puns. Given the outputs of two systems, we decided win/lose/tie by comparing the average scores of both outputs. We see that SurGen dominates Neural Joint Decoder with a > 50% winning rate on funniness and grammaticality. On success rate, the two methods have many ties since they both have relatively low success rates. Our generated puns were rated funnier than expert-written puns around 10% of the time.

4.4 Error Analysis

In Table 5, we show example outputs of our SurGen, the Neural Joint Decoder baseline, and expert-written puns. SurGen sometimes generates creative puns that are rated even funnier than human-written puns (example 1). In contrast, Neural Joint Decoder at best generates ambiguous sentences (examples 2 and 3), and sometimes the sentences are ungrammatical (example 1) or hard to understand (example 4). The examples also show the current limitations of SurGen. In example 3, it failed to realize that “butter” is not animate and thus cannot “want”, since our type consistency checker is very simple.

To gain further insight into the limitations of our system, we randomly sampled 50 unsuccessful generations (labeled by workers) to analyze the issues. We characterized the issues into 6 non-exclusive categories: (1) weak association between the local context and wa (e.g., “...in the form of a batty (bat)”); (2) wp does not fit in the local context, often due to different POS tags of wa and wp (e.g., “vibrate with a taxed (text)”); (3) the topic word is not related to wp (e.g., “pagan” vs. “fabrication”); (4) the topic word does not fit in its immediate context, often due to inconsistent types (e.g., “slider won’t go...”); (5) grammatical errors; and (6) failure to obtain seed sentences or topic words. A breakdown of these errors is shown in Figure 3. The main issues lie in finding seed sentences that accommodate both the pun word and the topic word. There is also room for improvement in predicting pun-related topic words.

5 Discussion and Related Work

5.1 Humor Theory

Humor involves complex cognitive activities, and many theories attempt to explain what might be considered humorous. Among the leading theories, the incongruity theory (Tony, 2004) is most related to our surprisal principle.
The incongruity theory posits that humor is perceived at the moment of resolving the incongruity between two concepts, often involving unexpected shifts in perspective. Ginzburg et al. (2015) applied the incongruity theory to explain laughter in dialogues. Prior work on formalizing the incongruity theory for puns (Kao et al., 2015) focuses on the ambiguity between two concepts and the heterogeneous nature of the ambiguity. Our surprisal principle further formalizes unexpectedness (local surprisal) and incongruity resolution (global association).

The surprisal principle is also related to studies in psycholinguistics on the relation between surprisal and human comprehension (Levy, 2013; Levy and Gibson, 2013). Our study suggests it could be a fruitful direction to formally study the relationship between human perception of surprisal and humor.

5.2 Humor Generation

Early approaches to joke generation (Binsted, 1996; Ritchie, 2005) largely rely on templates for specific types of puns. For example, JAPE (Binsted, 1996) generates noun phrase puns as question-answer pairs, e.g., “What do you call a [murderer] with [fiber]? A [cereal] [killer].” Petrovic and Matthews (2013) fill in a joke template based on word similarity and uncommonness. Similar to our editing approach, Valitutti et al. (2013) substitute a word with a taboo word based on form similarity and local coherence to generate adult jokes. Recently, Yu et al. (2018) generate puns from a generic neural language model by simultaneously conditioning on two meanings. Most of these approaches leverage some assumptions about joke structure, e.g., incongruity, relations between words, and word types. Our approach also relies on specific pun structures; we have proposed and operationalized a local-global surprisal principle for pun generation.

5.3 Creative Text Generation

Our work is also built upon generic text generation techniques, in particular recent neural generation models. Hashimoto et al. (2018) developed a retrieve-and-edit approach to improve both grammaticality and diversity of the generated text. Shen et al. (2017); Fu et al. (2018) explored adversarial training to manipulate the style of a sentence. Our neural smoother is also closely related to Li et al. (2018)’s delete-retrieve-generate approach to text style transfer.

Creative generation is more challenging as it requires both formality (e.g., grammaticality, rhythm, and rhyme) and novelty. Therefore, many works (including ours) impose strong constraints on the generative process, such as Petrovic and Matthews (2013); Valitutti et al. (2013) for joke generation, Ghazvininejad et al. (2016) for poetry generation, and Yao et al. (2019) for storytelling.

6 Conclusion

In this paper, we tackled pun generation by developing and exploring a local-global surprisal principle. We show that a simple instantiation based on only a language model trained on non-humorous text is effective at detecting puns, though it is not fine-grained enough to detect the degree of funniness within puns. To generate puns, we operationalize the surprisal principle with a retrieve-and-edit framework to create a contrast in the amount of surprise in local and global contexts. While we improve beyond current techniques, we are still far from human-generated puns.

While we believe the local-global surprisal principle is a useful conceptual tool, the principle itself is not yet formalized in a way robust enough that it can be used both as a principle for evaluating sentences and as an objective that can be directly optimized to generate puns. A big challenge in humor, and more generally in creative text generation, is to capture the difference between creativity (novel but well-formed material) and nonsense (ill-formed material). Language models conflate the two, so developing methods that are nuanced enough to recognize this difference is key to future progress.

Acknowledgments

This work was supported by the DARPA CwC program under ISI prime contract no. W911NF-15-1-0543 and ARO prime contract no. W911NF-15-1-0462. We thank Abhinav Moudgil and Justine Kao for sharing their data and results. We also thank members of the Stanford NLP group and USC Plus Lab for insightful discussions.

Reproducibility

All code, data, and experiments for this paper are available on the CodaLab platform: https://worksheets.codalab.org/worksheets/0x5a7d0fe35b144ad68998d74891a31ed6.
References

D. Aarons. 2017. Puns and tacit linguistic knowledge. In The Routledge Handbook of Language and Humor, Routledge Handbooks in Linguistics. Routledge, New York, NY.

K. Binsted. 1996. Machine Humour: An Implemented Model of Puns. Ph.D. thesis, University of Edinburgh.

S. Doogan, A. Ghosh, H. Chen, and T. Veale. 2017. Idiom savant at SemEval-2017 task 7: Detection and interpretation of English puns. In The 11th International Workshop on Semantic Evaluation.

J. Duchi, E. Hazan, and Y. Singer. 2010. Adaptive subgradient methods for online learning and stochastic optimization. In Conference on Learning Theory (COLT).

A. Fan, M. Lewis, and Y. Dauphin. 2018. Hierarchical neural story generation. arXiv preprint arXiv:1805.04833.

Z. Fu, X. Tan, N. Peng, D. Zhao, and R. Yan. 2018. Style transfer in text: Exploration and evaluation. In Association for the Advancement of Artificial Intelligence (AAAI).

M. Ghazvininejad, X. Shi, Y. Choi, and K. Knight. 2016. Generating topical poetry. In Empirical Methods in Natural Language Processing (EMNLP).

J. Ginzburg, E. Breithholtz, R. Cooper, J. Hough, and Y. Tian. 2015. Understanding laughter. In Proceedings of the 20th Amsterdam Colloquium.

K. Guu, T. B. Hashimoto, Y. Oren, and P. Liang. 2018. Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics (TACL).

F. Hao, C. Hao, S. Maarten, C. Elizabeth, H. Ari, C. Yejin, S. N. A, and O. Mari. 2018. Sounding board: A user-centric and content-driven social chatbot. arXiv preprint arXiv:1804.10202.

T. Hashimoto, K. Guu, Y. Oren, and P. Liang. 2018. A retrieve-and-edit framework for predicting structured outputs. In Advances in Neural Information Processing Systems (NeurIPS).

J. T. Kao, R. Levy, and N. D. Goodman. 2015. A computational model of linguistic humor in puns. Cognitive Science.

D. Kingma and J. Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

R. Levy. 2013. Memory and surprisal in human sentence comprehension. In Sentence Processing.

R. Levy and E. Gibson. 2013. Surprisal, the PDC, and the primary locus of processing difficulty in relative clauses. Frontiers in Psychology, 4.

J. Li, R. Jia, H. He, and P. Liang. 2018. Delete, retrieve, generate: A simple approach to sentiment and style transfer. In North American Association for Computational Linguistics (NAACL).

H. Manurung, G. Ritchie, and H. Thompson. 2000. Towards a computational model of poetry generation. The University of Edinburgh Technical Report.

J. R. Meehan. 1977. TALE-SPIN, an interactive program that writes stories. In International Joint Conference on Artificial Intelligence (IJCAI).

S. Merity, C. Xiong, J. Bradbury, and R. Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

T. Miller, C. Hempelmann, and I. Gurevych. 2017. SemEval-2017 task 7: Detection and interpretation of English puns. In Proceedings of the 11th International Workshop on Semantic Evaluation.

A. Pauls and D. Klein. 2012. Large-scale syntactic language modeling with treelets. In Association for Computational Linguistics (ACL).

N. Peng, M. Ghazvininejad, J. May, and K. Knight. 2018. Towards controllable story generation. In NAACL Workshop.

S. Petrovic and D. Matthews. 2013. Unsupervised joke generation from big data. In Association for Computational Linguistics (ACL).

G. Ritchie. 2005. Computational mechanisms for pun generation. In The 10th European Natural Language Generation Workshop.

T. Shen, T. Lei, R. Barzilay, and T. Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems (NeurIPS).

V. Tony. 2004. Incongruity in humor: Root cause or epiphenomenon? Humor: International Journal of Humor Research, 17.

A. Valitutti, H. Toivonen, A. Doucet, and J. M. Toivanen. 2013. “Let everything turn well in your wife”: Generation of adult humor using lexical constraints. In Association for Computational Linguistics (ACL).

J. Weizenbaum. 1966. ELIZA–a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36–45.

L. Yao, N. Peng, R. Weischedel, K. Knight, D. Zhao, and R. Yan. 2019. Plan-and-write: Towards better automatic storytelling. In Association for the Advancement of Artificial Intelligence (AAAI).

Z. Yu, J. Tan, and X. Wan. 2018. A neural approach to pun generation. In Association for Computational Linguistics (ACL).

Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. arXiv preprint arXiv:1506.06724.
