arXiv:2203.12258
Can Prompt Probe Pretrained Language Models? Understanding the Invisible Risks from a Causal View
Boxi Cao1,3, Hongyu Lin1, Xianpei Han1,2,4∗, Fangchao Liu1,3, Le Sun1,2∗
1Chinese Information Processing Laboratory  2State Key Laboratory of Computer Science
Institute of Software, Chinese Academy of Sciences, Beijing, China
3University of Chinese Academy of Sciences, Beijing, China
4Beijing Academy of Artificial Intelligence, Beijing, China
{boxi2020,hongyu,xianpei,fangchao2017,sunle}@iscas.ac.cn
Abstract

Prompt-based probing has been widely used in evaluating the abilities of pretrained language models (PLMs). Unfortunately, recent studies have discovered such an evaluation may be inaccurate, inconsistent, and unreliable. Furthermore, the lack of understanding of its inner workings, combined with its wide applicability, has the potential to lead to unforeseen risks for evaluating and applying PLMs in real-world applications. To discover, understand, and quantify the risks, this paper investigates prompt-based probing from a causal view, highlights three critical biases which could induce biased results and conclusions, and proposes to conduct debiasing via causal intervention. This paper provides valuable insights for the design of unbiased datasets, better probing frameworks, and more reliable evaluations of pretrained language models. Furthermore, our conclusions also echo that we need to rethink the criteria for identifying better pretrained language models.¹

∗ Corresponding Authors
¹ We openly released the source code and data at https://github.com/c-box/causalEval.

[Figure 1: The illustrated procedure for two kinds of evaluation criteria. (a) Conventional evaluation in machine learning. (b) PLM evaluation via prompt-based probing.]

1 Introduction

During the past few years, the great success of pretrained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020; Raffel et al., 2020) has raised extensive attention to evaluating what knowledge PLMs actually entail. One of the most popular approaches is prompt-based probing (Petroni et al., 2019; Davison et al., 2019; Brown et al., 2020; Schick and Schütze, 2020; Ettinger, 2020; Sun et al., 2021), which assesses whether PLMs are knowledgeable for a specific task by querying PLMs with task-specific prompts. For example, to evaluate whether BERT knows the birthplace of Michael Jordan, we could query BERT with "Michael Jordan was born in [MASK]". Recent studies often construct prompt-based probing datasets, and take PLMs' performance on these datasets as their abilities for the corresponding tasks. Such a probing evaluation has been widely used in many benchmarks such as SuperGLUE (Wang et al., 2019; Brown et al., 2020), LAMA (Petroni et al., 2019), oLMpics (Talmor et al., 2020), LM diagnostics (Ettinger, 2020), CAT (Zhou et al., 2020), X-FACTR (Jiang et al., 2020a), and BioLAMA (Sung et al., 2021).

Unfortunately, recent studies have found that evaluating PLMs via prompt-based probing can be inaccurate, inconsistent, and unreliable. For example, Poerner et al. (2020) find that performance may be overestimated because many instances can be easily predicted by relying only on surface-form shortcuts. Elazar et al. (2021) show that semantically equivalent prompts may result in quite different predictions. Cao et al. (2021) demonstrate that PLMs often generate unreliable predictions which are prompt-related but not knowledge-related.

Given these findings, the risks of blindly using prompt-based probing to evaluate PLMs, without understanding its inherent vulnerabilities, are significant. Such biased evaluations will make us overestimate
or underestimate the real capabilities of PLMs, mislead our understanding of models, and result in wrong conclusions. Therefore, to reach a trustworthy evaluation of PLMs, it is necessary to dive into the probing criteria and understand the following two critical questions: 1) What biases exist in current evaluation criteria via prompt-based probing? 2) Where do these biases come from?

To this end, we compare PLM evaluation via prompt-based probing with conventional evaluation criteria in machine learning. Figure 1 shows their divergences. Conventional evaluations aim to evaluate different hypotheses (e.g., algorithms or model structures) for a specific task. The tested hypotheses are raised independently of the training/test data generation. However, this independence no longer holds in prompt-based probing. There exist more complicated implicit connections between pretrained models, probing data, and prompts, mainly because pretraining data are bundled with specific PLMs. These unnoticed connections serve as invisible hands that can even dominate the evaluation criteria, from both linguistic and task aspects. From the linguistic aspect, because pretraining data, probing data, and prompts are all expressed in the form of natural language, there exist inevitable linguistic correlations which can mislead evaluations. From the task aspect, the pretraining data and the probing data are often sampled from correlated distributions. Such invisible task distributional correlations may significantly bias the evaluation. For example, Wikipedia is a widely used pretraining corpus, and many probing datasets are also sampled from Wikipedia or its extensions such as Yago, DBpedia, or Wikidata (Petroni et al., 2019; Jiang et al., 2020a; Sung et al., 2021). As a result, such task distributional correlations will inevitably confound evaluations via domain overlapping, answer leakage, knowledge coverage, etc.

To theoretically identify how these correlations lead to biases, we revisit prompt-based probing from a causal view. Specifically, we describe the evaluation procedure using a structural causal model (SCM) (Pearl et al., 2000), which is shown in Figure 2a. Based on the SCM, we find that the linguistic correlation and the task distributional correlation correspond to three backdoor paths in Figure 2b-d, which lead to three critical biases:

• Prompt Preference Bias, which mainly stems from the underlying linguistic correlations between PLMs and prompts, i.e., the performance may be biased by the fitness of a prompt to PLMs' linguistic preference. For instance, semantically equivalent prompts will lead to different biased evaluation results.

• Instance Verbalization Bias, which mainly stems from the underlying linguistic correlations between PLMs and verbalized probing datasets, i.e., the evaluation results are sensitive and inconsistent to different verbalizations of the same instance (e.g., representing the U.S.A. with the U.S. or America).

• Sample Disparity Bias, which mainly stems from the invisible distributional correlation between pretraining and probing data, i.e., the performance difference between different PLMs may be due to the sample disparity of their pretraining corpora, rather than their ability divergence. Such invisible correlations may mislead evaluation results, and thus lead to implicit, unnoticed risks of applying PLMs in real-world applications.

We further propose to conduct causal intervention via backdoor adjustment, which can reduce bias and ensure a more accurate, consistent, and reliable probing under given assumptions. Note that this paper does not intend to create a "universally correct" probing criterion, but to point out the underlying invisible risks, to understand how spurious correlations lead to biases, and to provide a causal toolkit for debiasing probing under specific assumptions. Besides, we believe that our discoveries not only exist in prompt-based probing, but will also influence all prompt-based applications of pretrained language models. Consequently, our conclusions echo that we need to rethink the criteria for identifying better pretrained language models with the above-mentioned biases in mind.

Generally, the main contributions of this paper are:

• We investigate the critical biases and quantify their risks of evaluating pretrained language models with widely used prompt-based probing, including prompt preference bias, instance verbalization bias, and sample disparity bias.
[Figure 2: The structural causal model for factual knowledge probing and the three backdoor paths in the SCM correspond to three biases. (a) Structural Causal Model; (b) Prompt Preference Bias; (c) Instance Verbalization Bias; (d) Sample Disparity Bias.]
• We revisit prompt-based probing from a causal view, which helps to discover, understand, and eliminate biases in prompt-based probing evaluations.

• We provide valuable insights for the design of unbiased datasets, better probing frameworks, and more reliable evaluations, and echo that we should rethink the evaluation criteria for pretrained language models.

2 Background and Experimental Setup

2.1 Causal Inference

Causal inference is a promising technique for identifying undesirable biases and fairness concerns in benchmarks (Hardt et al., 2016; Kilbertus et al., 2017; Kusner et al., 2017; Vig et al., 2020; Feder et al., 2021). Causal inference usually describes the causal relations between variables via a structural causal model (SCM), then recognizes confounders and spurious correlations for bias analysis, and finally identifies true causal effects by eliminating biases using causal intervention techniques.

SCM The structural causal model (Pearl et al., 2000) describes the relevant features in a system and how they interact with each other. Every SCM is associated with a graphical causal model G = {V, f}, which consists of a set of nodes representing variables V, as well as a set of edges between the nodes representing the functions f that describe the causal relations.

Causal Intervention To identify the true causal effects between an ordered pair of variables (X, Y), causal intervention fixes the value of X = x and removes the correlations between X and its precedent variables, which is denoted as do(X = x). In this way, P(Y = y|do(X = x)) represents the true causal effect of treatment X on outcome Y (Pearl et al., 2016).

Backdoor Path When estimating the causal effect of X on Y, the backdoor paths are the non-causal paths between X and Y with an arrow into X, e.g., X ← Z → Y. Such paths confound the effect that X has on Y but do not transmit causal influences from X, and therefore introduce spurious correlations between X and Y.

Backdoor Criterion The backdoor criterion is an important tool for causal intervention. Given an ordered pair of variables (X, Y) in an SCM, and a set of variables Z where Z contains no descendant of X and blocks every backdoor path between X and Y, the causal effect of X = x on Y can be calculated by:

    P(Y = y|do(X = x)) = Σ_z P(Y = y|X = x, Z = z) P(Z = z),    (1)

where P(Z = z) can be estimated from data or given a priori, and is independent of X.

2.2 Experimental Setup

Task This paper investigates prompt-based probing on one of the most representative and well-studied tasks – factual knowledge probing (Liu et al., 2021b). For example, to evaluate whether BERT knows the birthplace of Michael Jordan, factual knowledge probing queries BERT with "Michael Jordan was born in [MASK]", where Michael Jordan is the verbalized subject mention, "was born in" is the verbalized prompt of relation birthplace, and [MASK] is a placeholder for the target object.

Data We use LAMA (Petroni et al., 2019) as our primary dataset, which is a set of knowledge triples sampled from Wikidata. We remove the N-M relations (Elazar et al., 2021), which are unsuitable for the P@1 metric, and retain 32 probing relations in the dataset. Please refer to the appendix for details.
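The probing setup above can be sketched in a few lines: verbalize each triple with its relation's prompt, query a predictor for the [MASK] slot, and count exact top-1 matches (P@1). This is a minimal illustrative sketch; the toy `predict` function below is a hypothetical stand-in for a real PLM fill-mask head, and its hard-coded answers are made up for the example.

```python
# Minimal sketch of prompt-based factual probing with the P@1 metric.
# `predict` is a toy stand-in for a real PLM (e.g., a BERT fill-mask
# head); its hard-coded "memory" is illustrative only.

def verbalize(prompt_template, subject):
    """Fill the subject slot, leaving [MASK] for the target object."""
    return prompt_template.format(subject=subject)

def predict(query):
    # Stand-in for PLM inference: return the top-1 token for [MASK].
    toy_memory = {
        "Michael Jordan was born in [MASK].": "Brooklyn",
        "Paris is the capital of [MASK].": "France",
    }
    return toy_memory.get(query, "unknown")

def precision_at_1(triples, prompt_templates):
    """triples: list of (subject, relation, object) facts."""
    hits = 0
    for subject, relation, obj in triples:
        query = verbalize(prompt_templates[relation], subject)
        if predict(query) == obj:
            hits += 1
    return hits / len(triples)

triples = [
    ("Michael Jordan", "birthplace", "Brooklyn"),
    ("Paris", "capital-of", "France"),
    ("Alan Turing", "birthplace", "London"),
]
templates = {
    "birthplace": "{subject} was born in [MASK].",
    "capital-of": "{subject} is the capital of [MASK].",
}
print(precision_at_1(triples, templates))  # 2 of 3 facts recovered
```

In a real run, `predict` would be replaced by masked-token inference with an actual PLM; the evaluation loop itself is unchanged.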
Pretrained Models We conduct probing experiments on 4 well-known PLMs: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), GPT-2 (Radford et al., 2019), and BART (Lewis et al., 2020), which correspond to 3 representative PLM architectures: autoencoder (BERT, RoBERTa), autoregressive (GPT-2), and denoising autoencoder (BART).

3 Structural Causal Model for Factual Knowledge Probing

In this section, we formulate the SCM for the factual knowledge probing procedure and describe the key variables and causal relations.

The SCM is shown in Figure 2a, which contains 11 key variables: 1) Pretraining corpus distribution Da; 2) Pretraining corpus C, e.g., WebText for GPT-2, Wikipedia for BERT; 3) Pretrained language model M; 4) Linguistic distribution L, which guides how a concept is verbalized into a natural language expression, e.g., relation to prompt, entity to mention; 5) Relation R, e.g., birthplace, capital; each relation corresponds to a probing task; 6) Verbalized prompt P for each relation, e.g., x was born in y; 7) Task-specific predictor I, which is a PLM combined with a prompt, e.g., <BERT, was born in> as a birthplace predictor; 8) Probing data distribution Db, e.g., the fact distribution in Wikidata; 9) Sampled probing data T such as LAMA, which are sampled entity pairs (e.g., <Q41421, Q18419> in Wikidata) of relation R; 10) Verbalized instances X (e.g., <Michael Jordan, Brooklyn> from <Q41421, Q18419>); 11) Performance E of the predictor I on X.

The causal paths of the prompt-based probing evaluation contain:

• PLM Pretraining. The path {Da, L} → C → M represents the pretraining procedure for language model M, which first samples pretraining corpus C according to pretraining corpus distribution Da and linguistic distribution L, then pretrains M on C.

• Prompt Selection. The path {R, L} → P represents the prompt selection procedure, where each prompt P must exactly express the semantics of relation R, and will be influenced by the linguistic distribution L.

• Verbalized Instances Generation. The path {Db, R} → T → X ← L represents the generation procedure of verbalized probing instances X, which first samples probing data T of relation R according to data distribution Db, then verbalizes the sampled data T into X according to the linguistic distribution L.

• Performance Estimation. The path {M, P} → I → E ← X represents the performance estimation procedure, where the predictor I is first derived by combining PLM M and prompt P, and then the performance E is estimated by applying predictor I to the verbalized instances X.

To evaluate PLMs' ability on fact extraction, we need to estimate P(E|do(M = m), R = r). Such true causal effects are represented by the path M → I → E in the SCM. Unfortunately, there exist three backdoor paths between pretrained language model M and performance E, as shown in Figure 2b-d. These spurious correlations mean that the observed correlation between M and E cannot represent the true causal effect of M on E, and will inevitably lead to biased evaluations. In the following, we identify three critical biases in the prompt-based probing evaluation and describe the manifestations, causes, and causal interventions for each bias.

4 Prompt Preference Bias

In prompt-based probing, the predictor of a specific task (e.g., the knowledge extractor of relation birthplace) is a PLM M combined with a prompt P (e.g., BERT + was born in). However, PLMs are pretrained on specific text corpora, and therefore will inevitably prefer prompts sharing the same linguistic regularities as their pretraining corpus. Such an implicit prompt preference will confound the true causal effect of PLMs on evaluation performance, i.e., the performance will be affected by both the task ability of PLMs and the preference fitness of a prompt. In the following, we investigate prompt preference bias via causal analysis.

4.1 Prompt Preference Leads to Inconsistent Performance

In factual knowledge probing, we commonly assign one prompt to each relation (e.g., X was born in Y for birthplace). However, different PLMs may prefer different prompts.
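This inconsistency can be made concrete by scoring the same predictor under several semantically equivalent prompts and reporting the best/worst/std of P@1, analogous to Table 1. The paraphrases and the stub `predict` function below are hypothetical: the stub mimics a model that only "parses" its preferred phrasing.

```python
# Sketch: P@1 spread across semantically equivalent prompts for one
# relation. The stub `predict` mimics a PLM that answers correctly
# only for the prompt phrasing it "prefers" (a toy stand-in).
from statistics import pstdev

FACTS = [("Michael Jordan", "Brooklyn"), ("Alan Turing", "London")]

PROMPTS = [  # hypothetical paraphrases of the birthplace relation
    "{s} was born in [MASK].",
    "{s}'s birthplace is [MASK].",
    "The birthplace of {s} is [MASK].",
]

def predict(query, gold):
    # Toy preference: this "model" only parses the "was born in" phrasing.
    return gold if "was born in" in query else "unknown"

def p_at_1(template):
    hits = sum(predict(template.format(s=s), o) == o for s, o in FACTS)
    return 100 * hits / len(FACTS)

scores = [p_at_1(t) for t in PROMPTS]
print(f"best={max(scores):.1f} worst={min(scores):.1f} std={pstdev(scores):.2f}")
# → best=100.0 worst=0.0 std=47.14
```

Even though all three prompts express the same relation, the measured "ability" of the toy model swings between 0 and 100, which is exactly the divergence Table 1 quantifies for real PLMs.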
Models          LAMA P@1   Worst P@1   Best P@1   Std
BERT-large        39.08      23.45       46.73    8.75
RoBERTa-large     32.27      15.64       41.35    9.07
GPT2-xl           24.19      11.19       33.52    8.56
BART-large        27.68      16.21       38.93    8.35

Table 1: The P@1 performance divergence of prompt selection averaged over all relations; we can see prompt preference results in inconsistent performance.

[Figure: per-relation P@1 of BERT-large, RoBERTa-large, GPT2-xl, and BART-large on relations such as Language, Continent, Religion, and Owned by.]
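The causal interventions proposed in the paper rest on the backdoor adjustment of Eq. (1) in Section 2.1. A minimal discrete sketch, with a made-up toy joint distribution over a binary confounder Z, treatment X, and outcome Y, shows how the adjusted (interventional) estimate differs from the naive observational one:

```python
# Backdoor adjustment (Eq. 1): P(Y|do(X=x)) = sum_z P(Y|X=x, Z=z) P(Z=z).
# The distributions below are made-up toy numbers in which the
# confounder Z influences both the treatment X and the outcome Y.

P_Z = {0: 0.6, 1: 0.4}                                   # P(Z=z)
P_X_given_Z = {(0, 0): 0.8, (1, 0): 0.2,                 # P(X=x | Z=z),
               (0, 1): 0.3, (1, 1): 0.7}                 # keyed by (x, z)
P_Y_given_XZ = {(0, 0): 0.1, (1, 0): 0.5,                # P(Y=1 | X=x, Z=z),
                (0, 1): 0.4, (1, 1): 0.9}                # keyed by (x, z)

def p_y_do_x(x):
    """P(Y=1 | do(X=x)) via the backdoor adjustment over Z (Eq. 1)."""
    return sum(P_Y_given_XZ[(x, z)] * P_Z[z] for z in P_Z)

def p_y_obs_x(x):
    """Observational P(Y=1 | X=x), which mixes in the confounding."""
    num = sum(P_Y_given_XZ[(x, z)] * P_X_given_Z[(x, z)] * P_Z[z] for z in P_Z)
    den = sum(P_X_given_Z[(x, z)] * P_Z[z] for z in P_Z)
    return num / den

for x in (0, 1):
    print(f"x={x}: observational={p_y_obs_x(x):.3f}  "
          f"interventional={p_y_do_x(x):.3f}")
```

With these numbers the observational contrast (0.78 vs. 0.16) overstates the true causal contrast (0.66 vs. 0.22): the backdoor path through Z inflates the apparent effect, which is the same mechanism by which the paths in Figure 2b-d bias the observed correlation between a PLM and its probing performance.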
Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard Hovy, Hinrich Schütze, and Yoav Goldberg. 2021. Measuring and improving consistency in pretrained language models. Transactions of the Association for Computational Linguistics, 9:1012–1031.

Allyson Ettinger. 2020. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8:34–48.

Amir Feder, Katherine A Keith, Emaad Manzoor, Reid Pryzant, Dhanya Sridhar, Zach Wood-Doughty, Jacob Eisenstein, Justin Grimmer, Roi Reichart, Margaret E Roberts, et al. 2021. Causal inference in natural language processing: Estimation, prediction, interpretation and beyond. ArXiv preprint, abs/2109.00725.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).

Matt J. Kusner, Joshua R. Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfactual fairness. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 4066–4076.

Alice Lai and Julia Hockenmaier. 2014. Illinois-LH: A denotational and distributional approach to semantics. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 329–334, Dublin, Ireland. Association for Computational Linguistics.

Omer Levy and Ido Dagan. 2016. Annotating relation inference in context via question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 249–255, Berlin, Germany. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2157–2169, Copenhagen, Denmark. Association for Computational Linguistics.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online. Association for Computational Linguistics.

Fangyu Liu, Emanuele Bugliarello, Edoardo Maria Ponti, Siva Reddy, Nigel Collier, and Desmond Elliott. 2021a. Visually grounded reasoning across languages and cultures. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10467–10485, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021b. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ArXiv preprint, abs/2107.13586.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021c. GPT understands, too. ArXiv preprint, abs/2103.10385.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv preprint, abs/1907.11692.

Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014. SemEval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 1–8, Dublin, Ireland. Association for Computational Linguistics.

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.

Judea Pearl, Madelyn Glymour, and Nicholas P Jewell. 2016. Causal inference in statistics: A primer. John Wiley & Sons.

Judea Pearl et al. 2000. Models, reasoning and inference. Cambridge, UK: Cambridge University Press.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.

Nina Poerner, Ulli Waltinger, and Hinrich Schütze. 2020. E-BERT: Efficient-yet-effective entity embeddings for BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 803–818, Online. Association for Computational Linguistics.

Adam Poliak, Aparajita Haldar, Rachel Rudinger, J. Edward Hu, Ellie Pavlick, Aaron Steven White, and Benjamin Van Durme. 2018. Collecting diverse natural language inference problems for sentence representation evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 67–81, Brussels, Belgium. Association for Computational Linguistics.

Guanghui Qin and Jason Eisner. 2021. Learning how to ask: Querying LMs with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5203–5212, Online. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.

Hannah Rashkin, Maarten Sap, Emily Allaway, Noah A. Smith, and Yejin Choi. 2018. Event2Mind: Commonsense inference on events, intents, and reactions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 463–473, Melbourne, Australia. Association for Computational Linguistics.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.

Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 8–14, New Orleans, Louisiana. Association for Computational Linguistics.

Ananya B. Sai, Mithun Das Gupta, Mitesh M. Khapra, and Mukundhan Srinivasan. 2019. Re-evaluating ADEM: A deeper look at scoring dialogue responses. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 6220–6227. AAAI Press.

Ananya B. Sai, Akash Kumar Mohankumar, and Mitesh M. Khapra. 2020. A survey of evaluation metrics used for NLG systems. ArXiv preprint, abs/2008.12009.

Timo Schick and Hinrich Schütze. 2020. Few-shot text generation with pattern-exploiting training. ArXiv preprint, abs/2012.11926.

Roy Schwartz, Maarten Sap, Ioannis Konstas, Leila Zilles, Yejin Choi, and Noah A. Smith. 2017. The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 15–25, Vancouver, Canada. Association for Computational Linguistics.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics.

Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. 2020. oLMpics – on what language model pre-training captures. Transactions of the Association for Computational Linguistics, 8:743–758.

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart M. Shieber. 2020. Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 3261–3275.

Zexuan Zhong, Dan Friedman, and Danqi Chen. 2021. Factual probing is [MASK]: Learning vs. learning to recall. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5017–5033, Online. Association for Computational Linguistics.

Xuhui Zhou, Yue Zhang, Leyang Cui, and Dandan Huang. 2020. Evaluating commonsense in pre-trained language models. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, February 7-12, 2020, pages 9733–9740. AAAI Press.