What’s in a Name? Answer Equivalence For Open-Domain Question Answering

Chenglei Si, Computer Science, University of Maryland
Chen Zhao, Computer Science, University of Maryland
Jordan Boyd-Graber, CS, LSC, UMIACS, and iSchool, University of Maryland

Abstract

A flaw in QA evaluation is that annotations often only provide one gold answer. Thus, model predictions semantically equivalent to the answer but superficially different are considered incorrect. This work explores mining alias entities from knowledge bases and using them as additional gold answers (i.e., equivalent answers). We incorporate answers for two settings: evaluation with additional answers and model training with equivalent answers. We analyse three QA benchmarks: Natural Questions, TriviaQA and SQuAD. Answer expansion increases the exact match score on all datasets for evaluation, while incorporating it helps model training over real-world datasets. We ensure the additional answers are valid through a human post hoc evaluation.¹

¹ Our code and data are available at: https://github.com/NoviScl/AnswerEquiv.

1 Introduction: A Name that is the Enemy of Accuracy

In question answering (QA), computers, given a question, provide the correct answer to the question. However, the modern formulation of QA usually assumes that each question has only one answer, e.g., SQuAD (Rajpurkar et al., 2016), HotpotQA (Yang et al., 2018), DROP (Dua et al., 2019). This is often a byproduct of the prevailing framework for modern QA (Chen et al., 2017; Karpukhin et al., 2020): a retriever finds passages that may contain the answer, and then a machine reader identifies the (as in only) answer span.

In a recent position paper, Boyd-Graber and Börschinger (2020) argue that this is at odds with the best practices for human QA. This is also a problem for computer QA. A BERT model (Devlin et al., 2019) trained on Natural Questions (Kwiatkowski et al., 2019, NQ) answers Tim Cook to the question “Who is the Chief Executive Officer of Apple?” (Figure 1), while the gold answer is only Timothy Donald Cook, rendering Tim Cook as wrong as Tim Apple. In the 2020 NeurIPS Efficient QA competition (Min et al., 2021), human annotators rate nearly a third of the predictions that do not match the gold annotation as “definitely correct” or “possibly correct”.

Question: Who is the Chief Executive Officer of Apple?
Answer: Timothy Donald Cook
Equivalent Answers: Tim Cook
Passage: Tim Cook
Timothy Donald Cook (born November 1, 1960) is an American business executive and industrial engineer. Tim Cook is the chief executive officer of Apple inc.

Figure 1: An example from NQ dataset. The correct model (BERT) prediction does not match the gold answer but matches the equivalent answer we mine from a knowledge base.

Despite the near-universal acknowledgement of this problem, there is neither a clear measurement of its magnitude nor a consistent best practice solution. While some datasets provide comprehensive answer sets (e.g., Joshi et al., 2017), subsequent datasets such as NQ have not... and we do not know whether this is a problem. We fill that lacuna.

Section 2 mines knowledge bases for alternative answers to named entities. Even this straightforward approach finds high-precision answers not included in official answer sets. We then incorporate this in both training and evaluation of QA models to accept alternate answers. We focus on three popular open-domain QA datasets: NQ, TriviaQA and SQuAD. Evaluating models with a more permissive evaluation improves exact match (EM) by 4.8 points on TriviaQA, 1.5 points on NQ, and 0.7 points on SQuAD (Section 3). By augmenting training data with answer sets, state-of-the-art models improve on NQ and TriviaQA, but not on SQuAD (Section 4), which was created with a single evidence passage in mind. In contrast, augmenting the answer allows diverse evidence sources to provide an answer.
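The Figure 1 failure, and the effect of accepting an expanded answer set, can be illustrated with a minimal sketch. It is not the released evaluation code: the paper only says that spans are "normalized", so the SQuAD-style normalization below (lowercasing, stripping punctuation and articles) is an assumption, and the helper names `normalize`, `exact_match` and `exact_match_over_set` are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Assumed SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(gold))


def exact_match_over_set(prediction: str, answers: list[str]) -> int:
    """EM against an answer set: the maximum EM over every answer in the set."""
    return max((exact_match(prediction, a) for a in answers), default=0)


gold = ["Timothy Donald Cook"]   # original single gold answer (Figure 1)
aliases = ["Tim Cook"]           # equivalent answer mined from the KB (Figure 1)
prediction = "Tim Cook"          # model prediction (Figure 1)

single = exact_match_over_set(prediction, gold)
expanded = exact_match_over_set(prediction, gold + aliases)
print(single, expanded)  # 0 1
```

Taking the maximum over the set means that expanding the answers can only turn an incorrect prediction into a correct one, never the reverse, which is why the precision of the added aliases matters.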
After reviewing other approaches for incorporating ambiguity in answers (Section 5), we discuss how to further make QA more robust.

2 Method: An Entity by any Other Name

This section reviews the open-domain QA (ODQA) pipeline and introduces how we expand gold answer sets for both training and evaluation.

2.1 ODQA with Single Gold Answer

We follow the state-of-the-art retriever–reader pipeline for ODQA, where a retriever finds a handful of passages from a large corpus (usually Wikipedia), then a reader, often multi-tasked with passage reranking, selects a span as the prediction.

We adopt a dense passage retriever (Karpukhin et al., 2020, DPR) to find passages. DPR encodes questions and passages into dense vectors. DPR searches for passages in this dense space: given an encoded query, it finds the nearest passage vectors in the dense space. We do not train a new retriever but instead use the released DPR checkpoint to query the top-k (in this paper, k = 100) most relevant passages.

Given a question and retrieved passages, a neural reader reranks the top-k passages and extracts an answer span. Specifically, BERT encodes each passage $p_i$ concatenated with the question $q$ as $P_i^{L \times h} = \mathrm{BERT}([p_i; q])$, where $L$ is the maximum sequence length and $h$ is the hidden size of BERT. Three probabilities use this representation to reveal where we can find the answer. The first probability $P_r(p_i)$ encodes whether passage $i$ contains the answer. Because the answer is a span within the longer passage, we must provide the index $j$ where the answer starts and the index $k$ where it ends. Given the encoding of passage $i$, there are three parameter matrices $w$ that produce these probabilities:

$$P_r(p_i) = \mathrm{softmax}(w_r(P_i[0, :])); \tag{1}$$
$$P_s(t_j) = \mathrm{softmax}(w_s(P_i[j, :])); \text{ and} \tag{2}$$
$$P_e(t_k) = \mathrm{softmax}(w_e(P_i[k, :])). \tag{3}$$

where $P_i[0, :]$ represents the [CLS] token, and $w_r$, $w_s$ and $w_e$ are learnable weights for passage selection, start span and end span. Training updates weights with one positive and $m - 1$ negative passages among the top-100 retrieved passages for each question (we use $m = 24$) with log-likelihood of the positive passage for passage selection (Equation 1) and maximum marginal likelihood over all spans in the positive passage for span extraction (Equations 2–3).

To study the effect of equivalent answers in reader training, we focus on the distant supervision setting where we know what the answer is but not where it is (in contrast to full supervision where we know both). To use the answer to discover positive passages, we use string matching: any of the top-k retrieved passages that contains an answer is considered correct. We discard questions without any positive passages. This framework is consistent with modern state-of-the-art ODQA pipelines (Alberti et al., 2019; Karpukhin et al., 2020; Zhao et al., 2020; Asai et al., 2020; Zhao et al., 2021, inter alia).

2.2 Extracting Alias Entities

We expand the original gold answer set by extracting aliases from Freebase (Bollacker et al., 2008), a large-scale knowledge base (KB). Specifically, for each answer in the original dataset (e.g., Sun Life Stadium), if we can find this entry in the KB, we then use the “common.topic.alias” relation to extract all aliases of the entity (e.g., [Joe Robbie Stadium, Pro Player Park, Pro Player Stadium, Dolphins Stadium, Land Shark Stadium]). We expand the answer set by adding all aliases. We next describe how this changes evaluation and training.

2.3 Augmented Evaluation

For evaluation, we report the exact match (EM) score, where a predicted span is correct only if the (normalized) span text matches a gold answer exactly. This is the adopted metric for span-extraction datasets in most QA papers (Karpukhin et al., 2020; Lee et al., 2019; Min et al., 2019, inter alia). When we incorporate the alias entities in evaluation, we get an expanded answer set $A \equiv \{a_1, \ldots, a_n\}$. For a given span $s$ predicted by the model, we count $s$ as correct if it matches any answer $a$ in the set $A$:

$$\mathrm{EM}(s, A) = \max_{a \in A} \{\mathrm{EM}(s, a)\}. \tag{4}$$

2.4 Augmented Training

When we incorporate the alias entities in training, we treat each retrieved passage as positive if it contains either the original answer or the extracted alias entities. As a result, some originally negative passages become positive since they may contain the aliases, and we augment the original training
set. Then, we train on this augmented training set in the same way as in Equations 1–3.

                          NQ      SQuAD   TriviaQA
Avg. Original Answers     1.74    1.00    1.00
Matched Answers (%)       71.63   32.16   88.04
Avg. Augmented Answers    13.04   5.60    14.03
#Original Positives       69205   48135   62707
#Augmented Positives      69989   48615   67526

Table 1: Avg. Original Answers denotes the average number of answers per question in the official test sets. Matched Answers denotes the percentage of original answers that have aliases in the KB. Avg. Augmented Answers denotes the average number of answers in our augmented answer sets. Last two rows: number of positive questions (questions with matched positive passages) in the original / augmented training set for each dataset. NQ and TriviaQA have more augmented answers than SQuAD.

Data       Model             Single Ans   Ans Set
NQ         Baseline          34.9         36.4
           + Augment Train   35.8         37.2
TriviaQA   Baseline          49.9         54.7
           + Augment Train   50.0         55.9
SQuAD      Baseline          18.9         19.6
           + Augment Train   18.3         18.9

Table 2: Evaluation results on QA datasets. Compared to the original “Single Ans” evaluation under the original answer set, using the augmented answer sets (“Ans Set”) improves evaluation. Retraining the reader with augmented answer sets (“Augment Train”) is even better for most datasets, even when evaluated on the datasets’ original answer sets. Results are the average of three random seeds.

              Baseline   +Wiki Train   +FB Train
Single Ans.   49.31      49.42         49.53
+Wiki Eval    54.13      55.27         54.57
+FB Eval      51.75      52.23         52.52

Table 3: Results on TriviaQA. Numbers in brackets indicate the improvement compared to the first column. Each column indicates a different training setup and each row indicates a different evaluation setup. Augmented training with Wikipedia aliases (2nd column) and Freebase aliases (3rd column) improve EM over baseline (1st column).

3 Experiment: Just as Sweet

We present results on three QA datasets (NQ, TriviaQA and SQuAD) on how including aliases as alternative answers impacts evaluation and training. Since the official test sets are not released, we use the original dev sets as the test sets, and randomly split 10% training data as the held-out dev sets. All of these datasets are extractive QA datasets where answers are spans in Wikipedia articles.

Statistics of Augmentation. Both SQuAD and TriviaQA have one single answer in the test set (Table 1). While NQ also has answer sets, these represent annotator ambiguity given a passage, not the full range of possible answers. For example, different annotators might variously highlight Lenin or Chairman Lenin, but there is no expectation to exhaustively enumerate all of his names (e.g., Vladimir Ilyich Ulyanov or Vladimir Lenin). Although the default test set of TriviaQA uses one single gold answer, the authors released answer aliases mined from Wikipedia. Thus, we directly use those aliases for our experiments in Table 2. Overall, a systematic approach to expanding gold answers significantly increases gold answer numbers.

TriviaQA has the most answers that have equivalent answers, while SQuAD has the least. Augmenting the gold answer set increases the positive passages and thus increases the training examples, since questions with no positive passages are discarded (Table 1), particularly for TriviaQA’s entity-centric questions.

Implementation Details. For all experiments, we use the multiset.bert-base-encoder checkpoint of DPR as the retriever and use bert-base-uncased for our reader model. During training, we sample one positive passage and 23 negative passages for each question. During evaluation, we consider the top-10 retrieved passages for answer span extraction. We use batch size of 16 and learning rate of 3e-5 for training on all datasets.

Augmented Evaluation. We train models with the original gold answer set and evaluate under two settings: 1) on the original gold answer test set; 2) on the answer test set augmented with alias entities. On all three datasets, EM score improves (Table 2). TriviaQA shows the largest improvement, as most answers in TriviaQA are entities (93%).

Augmented Training. We incorporate the alias answers in training and compare the results with single-answer training (Table 2). One check that this is encouraging the models to be more robust and not a more permissive evaluation is that augmented training improves EM by about a point even
on the original single answer test set evaluation. However, TriviaQA improves less, and EM decreases on SQuAD with augmented training. The next section inspects examples to understand why augmented training accuracy differs on these datasets.

Freebase vs Wikipedia Aliases. We present the comparison of using Wikipedia entities and Freebase entities for augmented evaluation and training on TriviaQA. We show the augmented evaluation and training results in Table 3. Using Wikipedia entities yields a larger increase in EM score under augmented evaluation (e.g., the baseline model scores 54.13 under Wiki-expanded augmented evaluation, as compared to 51.75 under Freebase-expanded augmented evaluation). This is mainly because TriviaQA answers have more matches in Wikipedia titles than in Freebase entities. On the other hand, the difference between the two alias sources is rather small for augmented training. For example, using Wikipedia for answer expansion improves the baseline from 49.31 to 49.42 under single-answer evaluation, while using Freebase improves it to 49.53.

4 Analysis: Does QA Retain that Dear Perfection with another Name?

A sceptical reader would rightly suspect that accuracy is only going up because we have added more correct answers. Clearly this can go too far... if we enumerate all finite length strings we could get perfect accuracy. This section addresses this criticism by examining whether the new answers found with augmented training and evaluation would still satisfy user information-seeking needs (Voorhees, 2019) for both the training and test sets.

                  NQ   SQuAD   TriviaQA
Correct           48   31      41
Debatable         0    2       3
Wrong             1    16      6
Invalid           1    1       0

Non-equivalent    1    5       2
Wrong context     0    1       1
Wrong alias       0    10      3

Table 4: Annotation of fifty sampled augmented training examples from each dataset. Most training examples are still correct except for SQuAD, where additional answers are incorrect a third of the time. How the new answers are wrong is broken down in the bottom half of the table.

             NQ   SQuAD   TriviaQA
Correct      48   47      50
Wrong        1    1       0
Debatable    1    1       0
Invalid      0    1       0

Table 5: Annotation of fifty test questions that went from incorrect to correct under augmented evaluation. Most changes of correctness are deemed valid by human annotators across all three datasets.

Q: What city in France did the torch relay start at?
P: Title: 1948 summer olympics. The torch relay then run through Switzerland and France . . .
A: Paris
Alias: France
Error Type: Non-equivalent Entity

Q: How many previously-separate phyla did the 2007 study reclassify?
P: Title: celastrales. In the APG III system, the celastraceae family was expanded to consist of these five groups . . .
A: 3
Alias: III
Error Type: Wrong Context

Q: What is Everton football club’s semi-official club nickname?
P: Title: history of Everton F. C. Everton football club have a long and detailed history . . .
A: the people’s club
Alias: Everton F. C.
Error Type: Wrong Alias

Table 6: How adding equivalent answers can go wrong. While errors are rare (Tables 4 and 5), these errors are representative of mistakes. The examples are taken from SQuAD.

Accuracy of Augmented Training Set. We annotate fifty passages that originally lack an answer but do have an answer from the augmented answer set (Table 4). We classify them into four categories: correct, debatable, and wrong answers, as well as invalid questions that are ill-formed or unanswerable due to annotation error. The augmented examples are mostly correct for NQ, consistent with its EM jump with augmented training. However, augmentation often surfaces wrong augmented answers for SQuAD, which explains why the EM score drops with augmented training.

We further categorize why the augmentation is wrong into three categories (Table 6): (1) Non-equivalent entities, where the underlying knowledge base has a mistake, which is rare in high quality KBs; (2) Wrong context, where the corresponding context is not answering the question; (3) Wrong alias, where the question asks about specific alternate forms of an entity but the prediction is another alias of the entity. This is relatively common
in SQuAD. We speculate this is a side-effect of its creation: users write questions given a Wikipedia paragraph, and the first paragraph often contains an entity’s aliases (e.g., “Vladimir Ilyich Ulyanov, better known by his alias Lenin, was a Russian revolutionary, politician, and political theorist”), which are easy questions to write.

Accuracy of Expanded Answer Set. Next, we sample fifty test examples that models get wrong under the original evaluation but that are correct under augmented evaluation. We classify them into four categories: correct, debatable, wrong answers, and the rare cases of invalid questions. Almost all of the examples are indeed correct (Table 5), demonstrating the high precision of our answer expansion for augmented evaluation. In rare cases, for example, for the question “Who sang the song Tell Me Something Good?”, the model prediction Rufus is an alias entity, but the reference answer is Rufus and Chaka Khan. The authors disagree whether that would meet a user’s information-seeking need because Chaka Khan, the vocalist, was part of the band Rufus. Hence, it was labeled as debatable.

5 Related Work: Refuse thy Name

Answer Annotation in QA Datasets. Some QA datasets such as NQ and TyDi (Clark et al., 2020) n-way annotate dev and test sets where they ask different annotators to annotate the dev and test set. However, such annotation is costly and the coverage is still largely lacking (e.g., our alias expansion obtains many more answers than NQ’s original multi-way annotation). AmbigQA (Min et al., 2020) aims to address the problem of ambiguous questions, where there are multiple interpretations of the same question and therefore multiple correct answer classes (which could in turn have many valid aliases for each class). We provide an orthogonal view as we are trying to expand equivalent answers to any given gold answer while AmbigQA aims to cover semantically different but valid answers.

Query Expansion Techniques. Automatic query expansion has been used to improve information retrieval (Carpineto and Romano, 2012). Recently, query expansion has been used in NLP applications such as document re-ranking (Zheng et al., 2020) and passage retrieval in ODQA (Qi et al., 2019; Mao et al., 2021), with the goal of increasing accuracy or recall. Unlike this line of work, our answer expansion aims to improve evaluation of QA models.

Evaluation of QA Models. There are other attempts to improve QA evaluation. Chen et al. (2019) find that current automatic metrics do not correlate well with human judgements, which motivated Chen et al. (2020) to construct a dataset with human annotated scores of candidate answers and use it to train a BERT-based regression model as the scorer. Feng and Boyd-Graber (2019) argue that instead of evaluating QA systems directly, we should evaluate downstream human accuracy when using QA output. Alternatively, Risch et al. (2021) use a cross-encoder to measure the semantic similarity between predictions and gold answers. For the visual QA task, Luo et al. (2021) incorporate alias answers in visual QA evaluation. In this work, instead of proposing new evaluation metrics, we improve the evaluation of ODQA models by augmenting gold answers with aliases from knowledge bases.

6 Conclusion: Wherefore art thou Single

Our approach for matching entities in a KB is a simple approach to improve QA accuracy. We expect future improvements: e.g., entity linking source passages would likely improve precision at the cost of recall. Future work should also investigate the role of context in deciding the correctness of predicted answers. Beyond entities, future work should also consider other types of answers such as non-entity phrases and free-form expressions. As the QA community moves to ODQA and multilingual QA, robust approaches will need to holistically account for unexpected but valid answers. This will better help users, use training data more efficiently, and fairly compare models.

Acknowledgements

We thank members of the UMD CLIP lab, the anonymous reviewers and meta-reviewer for their suggestions and comments. Zhao is supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the BETTER Program contract 2019-19051600005. Boyd-Graber is supported by NSF Grant IIS-1822494. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsors.
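
As a closing illustration of the training-side procedure in Sections 2.2 and 2.4, the sketch below expands a gold answer with its aliases and relabels retrieved passages by string matching. It is only a sketch under stated assumptions: `alias_table` is a hypothetical in-memory stand-in for the Freebase “common.topic.alias” lookup, the case-insensitive substring match is an assumed form of the string matching the paper describes, and the passages are invented for the example.

```python
def expand_answers(answers, alias_table):
    """Add every alias of each gold answer to the answer set (Section 2.2).

    alias_table is a hypothetical dict standing in for the KB alias lookup.
    """
    expanded = list(answers)
    for answer in answers:
        for alias in alias_table.get(answer, []):
            if alias not in expanded:
                expanded.append(alias)
    return expanded


def positive_passages(passages, answers):
    """Indices of passages containing any answer (assumed case-insensitive match)."""
    lowered = [a.lower() for a in answers]
    return [i for i, p in enumerate(passages)
            if any(a in p.lower() for a in lowered)]


# Alias list taken from the Sun Life Stadium example in Section 2.2.
alias_table = {
    "Sun Life Stadium": [
        "Joe Robbie Stadium", "Pro Player Park", "Pro Player Stadium",
        "Dolphins Stadium", "Land Shark Stadium",
    ]
}
gold = ["Sun Life Stadium"]

# Invented passages, for illustration only.
passages = [
    "Sun Life Stadium hosted the game.",
    "The match was played at Joe Robbie Stadium.",
    "The Dolphins are a football team.",
]

original_positives = positive_passages(passages, gold)
augmented_positives = positive_passages(passages, expand_answers(gold, alias_table))
print(original_positives, augmented_positives)  # [0] [0, 1]
```

The second passage is negative under the single gold answer and becomes positive once aliases are accepted, which is the mechanism by which questions with no original positive passage can re-enter the training set (Table 1).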
