Professional Documents
Culture Documents
2019 - 4 - BERT Has A Mouth, and It Must Speak - BERT As A Markov Random Field Language Model
2019 - 4 - BERT Has A Mouth, and It Must Speak - BERT As A Markov Random Field Language Model
2019 - 4 - BERT Has A Mouth, and It Must Speak - BERT As A Markov Random Field Language Model
a Markov random field language model. This it can take one of M items from a vocabulary
formulation gives way to a natural procedure V = {v1 , . . . , vM }. These random variables form
to sample sentences from BERT. We generate a fully-connected graph with undirected edges, in-
from BERT and find that it can produce high- dicating that each variable xi is dependent on all
quality, fluent generations. Compared to the
the other variables.
generations of a traditional left-to-right lan-
guage model, BERT generates sentences that Joint Distribution To define a Markov random
are more diverse but of slightly worse quality. field (MRF), we start by defining a potential over
cliques. Among all possible cliques, we only con-
sider the clique corresponding to the full graph.
1 Introduction
All other cliques will be assigned a potential of
1 (i.e. exp(0)). The potential for this full-graph
BERT (Devlin et al., 2018) is a recently released
clique decomposes into a sum of T log-potential
sequence model used to achieve state-of-art results
terms:
on a wide range of natural language understanding
T T
!
tasks, including constituency parsing (Kitaev and Y X
Klein, 2018) and machine translation (Lample and φ(X) = φt (X) = exp log φt (X) ,
Conneau, 2019). Early work probing BERT’s lin- t=1 t=1
guistic capabilities has found it surprisingly robust where we use X to denote the fully-connected
(Goldberg, 2019). graph created from the original sequence. Each
BERT is trained on a masked language model- log-potential φt (X) is defined as
ing objective. Unlike a traditional language mod-
>
eling objective of predicting the next word in a se- 1h(xt ) fθ (X\t ), if [MASK] ∈
/
quence given the history, masked language model- log φt (X) = X1:t−1 ∪ Xt+1:T
ing predicts a word given its left and right context.
0, otherwise,
Because the model expects context from both di- (1)
rections, it is not immediately obvious how BERT
can be used as a traditional language model (i.e., where fθ (X\t ) ∈ RM , 1h(xt ) is a one-hot vector
to evaluate the probability of a text sequence) or with index xt set to 1, and
how to sample from it.
We attempt to answer these questions by show- X\t = (x1 , . . . , xt−1 , [MASK] , xt+1 , . . . , xT )
ing that BERT is a combination of a Markov
From this log-potential, we can define a probabil-
random field language model (MRF-LM, Jernite
ity of a given sequence X as
et al., 2015; Mikolov et al., 2013) with pseudo log-
likelihood (Besag, 1977) training. This formula- T
1 Y
tion automatically leads to a sampling procedure pθ (X) = φt (X), (2)
Z(θ)
based on Gibbs sampling. t=1
where estimate it by
T
XY
|X|
Z(θ) = φt (X 0 ), 1 X
X 0 t=1 log p(xt |X\t )
|X|
t=1
X 0.
for all This normalization constant is unfor- =Et∼U ({1,...,|X|}) log p(xt |X\t )
tunately impractical to compute exactly, rendering K
exact maximum log-likelihood intractable. 1 X
≈ log p(xt̃k |X\t̃k ),
K
k=1
Conditional Distribution Given a fixed X\t ,
the conditional probability of xt is derived to be where t̃k ∼ U({1, . . . , |X|}. Let us refer to this as
stochastic pseudo log-likelihood learning.
1
p(xt |X\t ) = exp(1h(xt )> fθ (X\t )), In Reality The stochastic pseudo log-likelihood
Z(X\t ) learning above states that we “mask out” one to-
(3) ken in a sequence at a time and let fθ predict it
based on all the other “observed” tokens in the se-
where quence. Devlin et al. (2018) however proposed to
“mask out” multiple tokens at a time and predict
M
X all of them given both all “observed” and “masked
Z(X\t ) = exp(1h(m)> fθ (X\t )). out” tokens in the sequence. This brings the origi-
m=1
nal BERT closer to a denoising autoencoder (Vin-
cent et al., 2010), which could still be considered
This derivation follows from the peculiar formula-
as training a Markov random field with (approxi-
tion of the log-potential in Eq. (1). It is relatively
mate) score matching (Vincent, 2011).
straightforward to compute, as it is simply softmax
normalization over M terms (Bridle, 1990). 3 Using BERT as an MRF-LM
(Stochastic) Pseudo Log-Likelihood Learning The discussion so far implies that BERT is a
One way to avoid the issue of intractability in com- Markov random field language model (MRF-LM)
puting the normalization constant Z(θ) above1 and that it learns a distribution over sentences (of
is to resort to an approximate learning strategy. some given length). This framing suggests that we
BERT uses pseudo log-likelihood learning, where can use BERT not only as parameter initialization
the pseudo log-likelihood is defined as: for finetuning but as a generative model of sen-
tences to either score a sentence or sample a sen-
|X| tence.
1 XX
PLL(θ; D) = log p(xt |X\t ), . (4)
|D| Ranking Let us fix the length T . Then, we can
X∈D t=1
use BERT to rank a set of sentences. We can-
not compute the exact probabilities of these sen-
where D is a set of training examples. We maxi-
tences, but we can compute their unnormalized
mize the predictability of each token in a sequence
log-probabilities according to Eq. (2):
given all the other tokens, instead of the joint prob-
ability of the entire sequence. T
X
It is still expensive to compute the pseudo log- log φt (X).
likelihood in Eq. (4) for even one example, espe- t=1
cially when fθ is not linear. This is because we
These unnormalized probabilities can be used to
must compute |X| forward passes of fθ for each
find the most likely sentence within the set or to
sequence, when |X| can be long and fθ be compu-
sort the sentences according to their probabilities.
tationally heavy. Instead we could stochastically
Sampling Sampling from a Markov random
1
In BERT it is not intractable in the strictest sense, since field is less trivial than is from a directed graph-
the amount of computation is bounded (by T = 500) each
iteration. It however requires computation up to exp(500) ical model which naturally admits ancestral sam-
which is in practice impossible to compute exactly. pling. One of the most widely used approaches
the nearest regional centre is alemanno , with another connec- for all of thirty seconds , she was n’t going to speak . maybe
tion to potenza and maradona , and the nearest railway station this time , she ’d actually agree to go . thirty seconds later ,
is in bergamo , where the line terminates on its northern end she ’d been speaking to him in her head every
’ let him get away , mrs . nightingale . you could do it again “ oh , i ’m sure they would be of a good service , ” she assured
. ’ ’ he - ’ ’ no , please . i have to touch him . and when you me . “ how are things going in the morning ? is your husband
do , you run . well ? ” “ yes , very well
he also “ turned the tale [ of ] the marriage into a book ” as “ i know . ” she paused .“ did he touch you ? ” “ no . ” “ ah .
he wanted it to “ be elegiac ” . both sagas contain stories of ” “ oh , no , ” i said , confused , not sure why
both couple and their wedding night ;
“ i had a bad dream . ” “ about an alien ship ? who was it ? i watched him through the glass , wondering if he was going
” i check the text message that ’s been only partially restored to attempt to break in on our meeting . but he did n’t seem to
yet, the one that says love . even bother to knock when he entered the room . i was n’t
replaced chris hall ( st . louis area manager ) . june 9 : mike “ how long has it been since you have made yourself an offer
howard ( syndicated “ good morning ” , replaced steve koval like that ? ” asked planner . “ oh ” was the reply . planner
, replaced dan nickolas , and replaced phil smith ) ; had heard of some of his other business associates who had
Table 1: Random sample generations from BERT base (left) and GPT (right).
Table 2: Self-BLEU and percent of generated n-grams that are unique relative to own generations (left) WikiText-
103 test set (middle) a sample of 5000 sentences from Toronto Book Corpus (right). For the WT103 and TBC
rows, we sample 1000 sentences from the respective datasets.
model for evaluating generations and that the GPT a slightly more non-fluent samples. BERT large
generations might have collapsed to fairly generic has a bimodal distribution.
and simple sentences. This observation is further
bolstered by the fact that the GPT generations have 6 Conclusion
a higher corpus-BLEU with TBC than TBC has
We show that BERT is a Markov random field lan-
with itself. The perplexity on BERT samples is
guage model. Formulating BERT in this way gives
not absurdly high, and in reading the samples, we
rise to a practical algorithm for generating from
find that many are fairly coherent. The corpus-
BERT based on Gibbs sampling that does not re-
BLEU between BERT models and the datasets is
quire any additional parameters or training. We
low, particularly with WT103.
verify in experiments that the algorithm produces
We find that BERT generations are more di-
diverse and fairly fluent generations. The power
verse than GPT generations. GPT has high n-gram
of this framework is in allowing the principled ap-
overlap (smaller percent of unique n-grams) with
plication of Gibbs sampling, and potentially other
TBC, but surprisingly also with WikiText-103, de-
MCMC algorithms, for generating from BERT.
spite being trained on different data. Furthermore,
Future work might explore these improved sam-
GPT generations have greater n-gram overlap with
pling methods, especially those that do not need
these datasets than these datasets have with them-
to run the model over the entire sequence each
selves, further suggesting that GPT is relying sig-
iteration and that more robustly handle variable-
nificantly on generic sentences. BERT has lower
length sequences. To facilitate such investigation,
n-gram overlap with both corpora, with similar de-
we release our code on GitHub at https:
grees of n-gram overlap as the samples of the data.
//github.com/nyu-dl/bert-gen and
For a more rigorous evaluation of generation
a demo as a Colab notebook at https://
quality, we collect human judgments on sentence
colab.research.google.com/drive/
fluency for 100 samples from BERT large, BERT
1MxKZGtQ9SSBjTK5ArsZ5LKhkztzg52RV.
base, and GPT using a four point Likert scale.
For each sample we ask three annotators to rate Acknowledgements
the sentence on its fluency and take the average
of the three judgments as the sentence’s fluency We thank Ilya Kulikov and Nikita Nangia for their
score. We present a histogram of the results in help, as well as reviewers for insightful comments.
Figure 1. For BERT large, BERT base, and GPT AW is supported by an NSF Fellowship. KC
we respectively get mean scores over the samples is partly supported by Samsung Advanced Insti-
of 2.37 (σ = 0.83), 2.65 (σ = 0.65), and 2.80 tute of Technology (Next Generation Deep Learn-
(σ = 0.51). All means are within a standard devia- ing: from Pattern Recognition to AI) and Samsung
tion of each other. BERT base and GPT have simi- Electronics (Improving Deep Learning using La-
lar unimodal distributions with BERT base having tent Structure).
References Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. Bleu: a method for automatic eval-
Julian Besag. 1977. Efficiency of pseudolikelihood uation of machine translation. In Proceedings of
estimation for simple gaussian fields. Biometrika, the 40th annual meeting on association for compu-
pages 616–618. tational linguistics, pages 311–318. Association for
John S Bridle. 1990. Probabilistic interpretation of Computational Linguistics.
feedforward classification network outputs, with re- Alec Radford, Karthik Narasimhan, Tim Salimans, and
lationships to statistical pattern recognition. In Neu- Ilya Sutskever. 2018. Improving language under-
rocomputing, pages 227–236. Springer. standing by generative pre-training. URL https://s3-
KyungHyun Cho, Tapani Raiko, and Alexander Ilin. us-west-2. amazonaws. com/openai-assets/research-
2010. Parallel tempering is efficient for learning re- covers/languageunsupervised/language under-
stricted boltzmann machines. In Neural Networks standing paper. pdf.
(IJCNN), The 2010 International Joint Conference Ruslan R Salakhutdinov. 2009. Learning in markov
on, pages 1–8. IEEE. random fields using tempered transitions. In Ad-
Yann N Dauphin, Angela Fan, Michael Auli, and David vances in neural information processing systems,
Grangier. 2016. Language modeling with gated con- pages 1598–1606.
volutional networks. arXiv preprint:1612.08083. Robert H Swendsen and Jian-Sheng Wang. 1986.
Replica monte carlo simulation of spin-glasses.
Guillaume Desjardins, Aaron Courville, Yoshua Ben-
Physical review letters, 57(21):2607.
gio, Pascal Vincent, and Olivier Delalleau. 2010.
Tempered markov chain monte carlo for training of Pascal Vincent. 2011. A connection between score
restricted boltzmann machines. In Proceedings of matching and denoising autoencoders. Neural com-
the thirteenth international conference on artificial putation, 23(7):1661–1674.
intelligence and statistics, pages 145–152.
Pascal Vincent, Hugo Larochelle, Isabelle Lajoie,
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Yoshua Bengio, and Pierre-Antoine Manzagol.
Kristina Toutanova. 2018. Bert: Pre-training of deep 2010. Stacked denoising autoencoders: Learning
bidirectional transformers for language understand- useful representations in a deep network with a lo-
ing. arXiv preprint:1810.04805. cal denoising criterion. Journal of machine learning
research, 11(Dec):3371–3408.
Angela Fan, Mike Lewis, and Yann Dauphin. 2018.
Hierarchical neural story generation. arXiv Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu.
preprint:1805.04833. 2017. Seqgan: Sequence generative adversarial nets
with policy gradient. In AAAI, pages 2852–2858.
Yoav Goldberg. 2019. Assessing bert’s syntactic abili-
ties. Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo,
Weinan Zhang, Jun Wang, and Yong Yu. 2018.
Yacine Jernite, Alexander Rush, and David Sontag.
Texygen: A benchmarking platform for text genera-
2015. A fast variational approach for learning
tion models. arXiv preprint:1802.01886.
markov random field language models. In Inter-
national Conference on Machine Learning, pages Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan
2209–2217. Salakhutdinov, Raquel Urtasun, Antonio Torralba,
and Sanja Fidler. 2015. Aligning books and
Nikita Kitaev and Dan Klein. 2018. Multilingual movies: Towards story-like visual explanations by
constituency parsing with self-attention and pre- watching movies and reading books. In arXiv
training. preprint:1506.06724.
Guillaume Lample and Alexis Conneau. 2019. Cross-
lingual Language Model Pretraining. arXiv e-prints, A Other Sampling Strategies
page arXiv:1901.07291.
We explored two other sampling strategies: left-
Stephen Merity, Caiming Xiong, James Bradbury, and to-right and generating for all positions at each
Richard Socher. 2016. Pointer sentinel mixture time step. See Section 3 for an explanation of
models.
the former. For the latter, we start with an ini-
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey tial sequence of all masks, and at each time step,
Dean. 2013. Efficient estimation of word represen- we would not mask any positions but would gen-
tations in vector space. arXiv preprint:1301.3781. erate for all positions. This strategy is designed
Radford M Neal. 1993. Probabilistic inference using to save on computation. However, we found that
markov chain monte carlo methods. this tended to get stuck in non-fluent sentences that
Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage could not be recovered from. We present sample
re-ranking with bert. arXiv preprint:1901.04085. generations for the left-to-right strategy in Table 4.
all the good people , no more , no less . no more . for ... the kind of better people ... for ... for ... for ... for ... for ... for ...
as they must become again .
sometimes in these rooms , here , back in the castle . but then : and then , again , as if they were turning , and then slowly
, and and then and then , and then suddenly .
other available songs for example are the second and final two complete music albums among the highest played artists ,
including : the one the greatest ... and the last recorded album , ” this sad heart ” respectively .
6 that is i ? ? and the house is not of the lord . i am well ... the lord is ... ? , which perhaps i should be addressing : ya is
then , of ye ? ?
four - cornered rap . big screen with huge screen two of his friend of old age . from happy , happy , happy . left ? left ?
left ? right . left ? right . right ? ?
Table 4: Random sample generations from BERT base using a sequential, left-to-right sampling strategy.