The document summarizes a study that explores using equivalent answers from knowledge bases to improve open-domain question answering systems. It finds that considering additional correct answers beyond the single gold answer provided in datasets can increase evaluation metrics and model performance. Specifically:
- It mines knowledge bases like Freebase for alias entities equivalent to answers in Natural Questions, TriviaQA, and SQuAD datasets.
- Incorporating these additional correct answers into evaluation increases exact match scores by up to 4.8 points, as models are not penalized for predicting semantically equivalent answers.
- Augmenting training data with equivalent answers also improves performance of state-of-the-art models on Natural Questions and TriviaQA,
The document summarizes a study that explores using equivalent answers from knowledge bases to improve open-domain question answering systems. It finds that considering additional correct answers beyond the single gold answer provided in datasets can increase evaluation metrics and model performance. Specifically:
- It mines knowledge bases like Freebase for alias entities equivalent to answers in Natural Questions, TriviaQA, and SQuAD datasets.
- Incorporating these additional correct answers into evaluation increases exact match scores by up to 4.8 points, as models are not penalized for predicting semantically equivalent answers.
- Augmenting training data with equivalent answers also improves performance of state-of-the-art models on Natural Questions and TriviaQA,
Original Description:
Latest paper on QA in OD.
Original Title
Answer Equivalence for Open-Domain Question Answering
The document summarizes a study that explores using equivalent answers from knowledge bases to improve open-domain question answering systems. It finds that considering additional correct answers beyond the single gold answer provided in datasets can increase evaluation metrics and model performance. Specifically:
- It mines knowledge bases like Freebase for alias entities equivalent to answers in Natural Questions, TriviaQA, and SQuAD datasets.
- Incorporating these additional correct answers into evaluation increases exact match scores by up to 4.8 points, as models are not penalized for predicting semantically equivalent answers.
- Augmenting training data with equivalent answers also improves performance of state-of-the-art models on Natural Questions and TriviaQA,
The document summarizes a study that explores using equivalent answers from knowledge bases to improve open-domain question answering systems. It finds that considering additional correct answers beyond the single gold answer provided in datasets can increase evaluation metrics and model performance. Specifically:
- It mines knowledge bases like Freebase for alias entities equivalent to answers in Natural Questions, TriviaQA, and SQuAD datasets.
- Incorporating these additional correct answers into evaluation increases exact match scores by up to 4.8 points, as models are not penalized for predicting semantically equivalent answers.
- Augmenting training data with equivalent answers also improves performance of state-of-the-art models on Natural Questions and TriviaQA,
Answer Equivalence For Open-Domain Question Answering
Chenglei Si Chen Zhao Jordan Boyd-Graber
Computer Science Computer Science CS , LSC , UMIACS,and iSchool University of Maryland University of Maryland University of Maryland clsi@terpmail.umd.edu chenz@cs.umd.edu jbg@umiacs.umd.edu
Abstract Question: Who is the Chief Executive Officer of Apple?
Answer: Timothy Donald Cook A flaw in QA evaluation is that annotations of- Equivalent Answers: Tim Cook ten only provide one gold answer. Thus, model Passage: Tim Cook predictions semantically equivalent to the an- Timothy Donald Cook ( born November 1, 1960 ) is an swer but superficially different are considered American business executive and industrial engineer. incorrect. This work explores mining alias en- Tim Cook is the chief executive officer of Apple inc. tities from knowledge bases and using them as additional gold answers (i.e., equivalent an- Figure 1: An example from NQ dataset. The correct swers). We incorporate answers for two set- model (BERT) prediction does not match the gold an- tings: evaluation with additional answers and swer but matches the equivalent answer we mine from a model training with equivalent answers. We knowledge base. analyse three QA benchmarks: Natural Ques- tions, TriviaQA and SQuAD. Answer expansion increases the exact match score on all datasets Donald Cook, rendering Tim Cook as wrong as for evaluation, while incorporating it helps Tim Apple. In the 2020 NeurIPS Efficient QA com- model training over real-world datasets. We petition (Min et al., 2021), human annotators rate ensure the additional answers are valid through nearly a third of the predictions that do not match a human post hoc evaluation. 1 the gold annotation as “definitely correct” or “pos- 1 Introduction: A Name that is the sibly correct”. Enemy of Accuracy Despite the near-universal acknowledgement of this problem, there is neither a clear measurement In question answering (QA), computers—given of its magnitude nor a consistent best practice solu- a question—provide the correct answer to the tion. While some datasets provide comprehensive question. However, the modern formulation of answer sets (e.g., Joshi et al., 2017), subsequent QA usually assumes that each question has only datasets such as NQ have not. . . and we do not know one answer, e.g., SQuAD (Rajpurkar et al., 2016), whether this is a problem. We fill that lacuna. H otpot QA (Yang et al., 2018), DROP (Dua et al., Section 2 mines knowledge bases for alternative 2019). This is often a byproduct of the prevail- answers to named entities. Even this straightfor- ing framework for modern QA (Chen et al., 2017; ward approach finds high-precision answers not in- Karpukhin et al., 2020): a retriever finds passages cluded in official answer sets. We then incorporate that may contain the answer, and then a machine this in both training and evaluation of QA models reader identifies the (as in only) answer span. to accept alternate answers. We focus on three pop- In a recent position paper, Boyd-Graber and ular open-domain QA datasets: NQ, TriviaQA and Börschinger (2020) argue that this is at odds with SQu AD. Evaluating models with a more permissive the best practices for human QA. This is also a prob- evaluation improves exact match (EM) by 4.8 points lem for computer QA. A BERT model (Devlin et al., on TriviaQA, 1.5 points on NQ, and 0.7 points on 2019) trained on Natural Questions (Kwiatkowski SQ u AD (Section 3). By augmenting training data et al., 2019, NQ) answers Tim Cook to the ques- with answer sets, state-of-the-art models improve tion “Who is the Chief Executive Officer of Apple?” on NQ and TriviaQA, but not on SQuAD (Section 4), (Figure 1), while the gold answer is only Timothy which was created with a single evidence passage 1 Our code and data are available at: https://github. in mind. In constrast, augmenting the answer al- com/NoviScl/AnswerEquiv. lows diverse evidence sources to provide an answer. 9623 Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9623–9629 November 7–11, 2021.
2021 c Association for Computational Linguistics After reviewing other approaches for incorporating spans in the positive passage for span extraction ambiguity in answers (Section 5), we discuss how (Equations 2–3). to further make QA more robust. To study the effect of equivalent answers in reader training, we focus on the distant supervi- 2 Method: An Entity by any Other Name sion setting where we know what the answer is but not where it is (in contrast to full supervision This section reviews the open-domain QA (ODQA) where we know both). To use the answer to dis- pipeline and introduces how we expand gold an- cover positive passages, we use string matching: swer sets for both training and evaulation. any of the top-k retrieved passages that contains an 2.1 ODQA with Single Gold Answer answer is considered correct. We discard questions without any positive passages. This framework We follow the state-of-the-art retriever–reader is consistent with modern state-of-the-art ODQA pipeline for ODQA, where a retriever finds a pipelines (Alberti et al., 2019; Karpukhin et al., handful of passages from a large corpus (usually 2020; Zhao et al., 2020; Asai et al., 2020; Zhao Wikipedia), then a reader, often multi-tasked with et al., 2021, inter alia). passage reranking, selects a span as the prediction. We adopt a dense passage retriver (Karpukhin 2.2 Extracting Alias Entities et al., 2020, DPR) to find passages. DPR encodes We expand the original gold answer set by extract- questions and passages into dense vectors. DPR ing aliases from Freebase (Bollacker et al., 2008), searches for passages in this dense space: given a large-scale knowledge base (KB). Specifically, an encoded query, it finds the nearest passage vec- for each answer in the original dataset (e.g., Sun tors in the dense space. We do not train a new Life Stadium), if we can find this entry in the KB, retriever but instead use the released DPR check- we then use the “common.topic.alias” relation to point to query the top-k (in this paper, k = 100) extract all aliases of the entity (e.g., [Joe Robbie most relevant passages. Stadium, Pro Player Park, Pro Player Stadium, Dol- Given a question and retrieved passages, a neu- phins Stadium, Land Shark Stadium]). We expand ral reader reranks the top-k passages and ex- the answer set by adding all aliases. We next de- tracts an answer span. Specifically, BERT encodes scribe how this changes evaluation and training. each passage pi concatenated with the question q as PiL×h = BERT([pi ; q]), where L is the maxi- 2.3 Augmented Evaluation mum sequence length and h is the hidden size of BERT . Three probabilities use this representation For evaluation, we report the exact match (EM) to reveal where we can find the answer. The first score, where a predicted span is correct only if probability Pr (pi ) encodes whether passage i con- the (normalized) span text matches with a gold tains the answer. Because the answer is a subset of answer exactly. This is the adopted metric for span- the longer span, we must provide the index where extraction datasets in most QA papers (Karpukhin the answer starts j and where it ends k. Given the et al., 2020; Lee et al., 2019; Min et al., 2019, encoding of passage i, there are three parameter inter alia). When we incorporate the alias enti- matrices w that produce these probabilities: ties in evaluation, we get an expanded answer set A ≡ {a1 , ..., an }. For a given span s predicted by Pr (pi ) =softmax(wr (Pi [0, :])); (1) the model, we compute EM score of s if the span matches any correct answer a in the set A: Ps (tj ) =softmax(ws (Pi [j, :])); and (2) Pe (tk ) =softmax(we (Pi [k, :])). (3) EM (s, A) = max{EM(s, a)}. (4) a∈A where Pi [0, :] represents the [CLS] token, and wr , ws and we are learnable weights for passage 2.4 Augmented Training selection, start span and end span. Training up- When we incorporate the alias entities in training, dates weights with one positive and m − 1 negative we treat each retrieved passage as positive if it passages among the top-100 retrieved passages for contains either the original answer or the extracted each question (we use m = 24) with log-likelihood alias entities. As a result, some originally negative of the positive passage for passage selection (Equa- passages become positive since they may contain tion 1) and maximum marginal likelihood over all the aliases, and we augment the original training 9624 NQ SQu AD TriviaQA Baseline +Wiki Train +FB Train Avg. Original Answers 1.74 1.00 1.00 Single Ans. 49.31 49.42 49.53 Matched Answers (%) 71.63 32.16 88.04 +Wiki Eval 54.13 55.27 54.57 Avg. Augmented Answers 13.04 5.60 14.03 +FB Eval 51.75 52.23 52.52 #Original Positives 69205 48135 62707 #Augmented Positives 69989 48615 67526 Table 3: Results on TriviaQA. Numbers in brackets in- dicate the improvement compared to the first column. Table 1: Avg. Original Answers denotes the average Each column indicates a different training setup and number of answers per question in the official test sets. each row indicates a different evaluation setup. Aug- Matched Ans. denotes the percentage of original an- mented training with Wikipedia aliases (2nd column) swers that have aliases in the KB. Avg. Augmented and Freebase aliases (3rd column) improve EM over Answers denotes the average number of answers in our baseline (1st column). augmented answer sets. Last two rows: number of positive questions (questions with matched positive pas- sages) in the original / augmented training set for each to exhaustively enumerate all of his names (e.g., dataset. NQ and TriviaQA have more augmented answers Vladimir Ilyich Ulyanov or Vladimir Lenin). Al- than SQuAD. though the default test set of TriviaQA uses one single gold answer, the authors released answer Data Model Single Ans Ans Set aliases minded from Wikipedia. Thus, we directly NQ Baseline 34.9 36.4 use those aliases for our experiments in Table 2. + Augment Train 35.8 37.2 Overall, a systematic approach to expand gold an- T RIVIAQA Baseline 49.9 54.7 swers significantly increases gold answer numbers. + Augment Train 50.0 55.9 SQ UAD Baseline 18.9 19.6 Trivia QA has the most answers that have equiv- + Augment Train 18.3 18.9 alent answers, while SQuAD has the least. Aug- menting the gold answer set increases the positive Table 2: Evaluation results on QA datsets compared to passages and thus increases the training examples, the original “Single Ans” evaluation under the original since questions with no positive passages are dis- answer set, using the augmented answer sets (“Ans Set”) carded (Table 1), particularly for TriviaQA’s entity- improves evaluation. Retraining the reader with aug- mented answer sets (“Augment Train”) is even better centric questions. for most datasets, even when evaluated on the datasets’ Implementation Details. For all experiments, original answer sets. Results are the average of three we use the multiset.bert-base-encoder random seeds. checkpoint of DPR as the retriever and use bert-base-uncased for our reader model. set. Then, we train on this augmented training set During training, we sample one positive passage in the same way as in Equations 1–3. and 23 negative passages for each question. Dur- ing evaluation, we consider the top-10 retrieved 3 Experiment: Just as Sweet passages for answer span extraction. We use batch size of 16 and learning rate of 3e-5 for training on We present results on three QA datasets—NQ, all datasets. Trivia QA and SQu AD—on how including aliases as alternative answers impacts evaluation and train- Augmented Evaluation. We train models with ing. Since the official test sets are not released, we the original gold answer set and evaluate under two use the original dev sets as the test sets, and ran- settings: 1) on the original gold answer test set; 2) domly split 10% training data as the held-out dev on the answer test set augmented with alias entities. sets. All of these datasets are extractive QA datasets On all three datasets, EM score improves (Table 2). where answers are spans in Wikipedia articles. Trivia QA shows the largest improvement, as most answers in TriviaQA are entities (93%). Statistics of Augmentation. Both SQuAD and Trivia QAhave one single answer in the test set (Ta- Augmented Training. We incorporate the alias ble 1). While NQ also has answer sets, these repre- answers in training and compare the results with sent annotator ambiguity given a passage, not the single-answer training (Table 2). One check that full range of possible answers. For example, dif- this is encouraging the models to be more robust ferent annotators might variously highlight Lenin and not a more permissive evaluation is that aug- or Chairman Lenin, but there is no expectation mented training improves EM by about a point even 9625 NQ SQ u AD TriviaQA NQ SQ u AD TriviaQA Correct 48 31 41 Correct 48 47 50 Debatable 0 2 3 Wrong 1 1 0 Wrong 1 16 6 Debatable 1 1 0 Invalid 1 1 0 Invalid 0 1 0 Non-equivalent 1 5 2 Wrong context 0 1 1 Table 5: Annotation of fifty test questions that went Wrong alias 0 10 3 from incorrect to correct under augmented evaluation. Most changes of correctness are deemed valid by human Table 4: Annotation of fifty sampled augmented training annotators across all three datasets. examples from each dataset. Most training examples are still correct except for SQuAD, where additional answers Q: What city in France did the torch relay start at? P: are incorrect a third of the time. How the new answers Title: 1948 summer olympics. The torch relay then run are wrong is broken down in the bottom half of the table. through Switzerland and France . . . A: Paris Alias: France Error Type: Non-equivalent Entity on the original single answer test set evaluation. Q: How many previously-separate phyla did the 2007 However, TriviaQA improves less, and EM decreases study reclassify? on SQuAD with augmented training. The next P: Title: celastrales. In the APG III system, the celas- traceae family was expanded to consist of these five section inspects examples to understand why aug- groups . . . mented training accuracy differs on these datasets. A: 3 Alias: III Freebase vs Wikipedia Aliases. We present the Error Type: Wrong Context Q: What is Everton football club’s semi-official club comparison of using Wikipedia entities and Free- nickname? base entities for augmented evaluation and training P: Title: history of Everton F. C. Everton football club on TriviaQA. We show the augmented evaluation have a long and detailed history . . . A: the people’s club and training results in Table 3. Using Wikipedia en- Alias: Everton F. C. tities increases in EM score under augmented eval- Error Type: Wrong Alias uation (e.g., the baseline model scores 54.13 under Table 6: How adding equivalent answers can go Wiki-expanded augmented evaluation, as compared wrong. While errors are rare (Table 4 and 5), these to 51.75 under Freebase-expanded augmented eval- errors are representatives of mistakes. The exam- uation). This is mainly because TriviaQA answers ples are taken from SQuAD. have more matches in Wikipedia titles than in Free- base entities. On the other hand, the difference be- tween the two alias sources is rather small for aug- mented training. For example, using Wikipedia for but do have an answer from the augmented answer answer expansion improves the baseline from 49.31 set (Table 4). We classify them into four catrgories: to 49.42 under single-answer evaluation, while us- correct, debatable, and wrong answers, as well as ing Freebase improves it to 49.53. invalid questions that are ill-formed or unanswer- able due to annotation error. The augmented ex- 4 Analysis: Does QA Retain that Dear amples are mostly correct for NQ, consistent with Perfection with another Name? its EM jump with augmented training. However, A sceptical reader would rightly suspect that accu- augmentation often surfaces wrong augmented an- racy is only going up because we have added more swers for SQuAD, which explains why the EM score correct answers. Clearly this can go too far. . . if drops with augmented training. we enumerate all finite length strings we could get We further categorize why the augmentation is perfect accuracy. This section addresses this criti- wrong into three categories (Table 6): (1) Non- cism by examining whether the new answers found equivalent entities, where the underlying knowl- with augmented training and evaluation would still edge base has a mistake, which is rare in high satisfy user information-seeking needs (Voorhees, quality KBs; (2) Wrong context, where the corre- 2019) for both the training and test sets. sponding context is not answering the question; (3) Wrong alias, where the question asks about specific Accuracy of Augmented Training Set. We an- alternate forms of an entity but the prediction is an- notate fifty passages that originally lack an answer other alias of the entity. This is relatively common 9626 in SQuAD. We speculate this is a side-effect of its Evaluation of QA Models. There are other at- creation: users write questions given a Wikipedia tempts to improve QA evaluation. Chen et al. paragraph, and the first paragraph often contains (2019) find that current automatic metrics do not an entity’s aliases (e.g., “Vladimir Ilyich Ulyanov, correlate well with human judgements, which moti- better known by his alias Lenin, was a Russian vated Chen et al. (2020) to construct a dataset with revolutionary, politician, and political theorist”), human annotated scores of candidate answers and which are easy questions to write. use it to train a BERT-based regression model as the scorer. Feng and Boyd-Graber (2019) argue Accuracy of Expanded Answer Set. Next, we for instead of evaluating QA systems directly, we sample fifty test examples that models get wrong should instead evaluate downstream human accu- under the original evaluation but that are correct racy when using QA output. Alternatively, Risch under augmented evaluation. We classify them into et al. (2021) use a cross-encoder to measure the four catrgories: correct, debatable, wrong answers, semantic similarity between predictions and gold and the rare cases of invalid questions. Almost answers. For the visual QA task, Luo et al. (2021) all of the examples are indeed correct (Table 5), incorporate alias answers in visual QA evaluation. demonstrating the high precision of our answer ex- In this work, instead of proposing new evaluation pansion for augmented evaluation. In rare cases, for metrics, we improve the evaluation of ODQA mod- example, for the question “Who sang the song Tell els by augmenting gold answers with alias from Me Something Good?”, the model prediction Rufus knowledge bases. is an alias entity, but the reference answer is Rufus and Chaka Khan. The authors disagree whether 6 Conclusion: Wherefore art thou Single that would meet a user’s information-seeking need Answer? because Chaka Khan, the vocalist, was part of the band Rufus. Hence, it was labeled as debatable. Our approach for matching entities in a KB is a sim- ple approach to improve QA accuracy. We expect 5 Related Work: Refuse thy Name future improvements—e.g.,, entity linking source Answer Annotation in QA Datasets. Some QA passages would likely improve precision at the datasets such as NQ and TyDi (Clark et al., 2020) cost of recall. Future work should also investi- n-way annotate dev and test sets where they ask gate the role of context in deciding the correctness different annotators to annotate the dev and test set. of predicted answers. Beyond entities, future work However, such annotation is costly and the cover- should also consider other types of answers such age is still largely lacking (e.g., our alias expan- as non-entity phrases and free-form expressions. sion obtains many more answers than NQ’s original As the QA community moves to ODQA and mul- multi-way annotation). AmbigQA (Min et al., 2020) tilingual QA, robust approaches will need to holis- aims to address the problem of ambiguous ques- tically account for unexpected but valid answers. tions, where there are multiple interpretations of This will better help users, use training data more the same question and therefore multiple correct an- efficiently, and fairly compare models. swer classes (which could in turn have many valid aliases for each class). We provide an orthogonal Acknowledgements view as we are trying to expand equivalent answers to any given gold answer while AmbigQA aims to We thank members of the UMD CLIP lab, the cover semantically different but valid answers. anonymous reviewers and meta-reviewer for their suggestions and comments. Zhao is supported by Query Expansion Techniques. Automatic query the Office of the Director of National Intelligence expansion has been used to improve information (ODNI), Intelligence Advanced Research Projects retrieval (Carpineto and Romano, 2012). Recently, Activity (IARPA), via the BETTER Program contract query expansion has been used in NLP applications 2019-19051600005. Boyd-Graber is supported by such as document re-ranking (Zheng et al., 2020) NSF Grant IIS -1822494. Any opinions, findings, and passage retrieval in ODQA (Qi et al., 2019; Mao conclusions, or recommendations expressed here et al., 2021), with the goal of increasing accuracy are those of the authors and do not necessarily re- or recall. Unlike this work, our answer expansion flect the view of the sponsors. aims to improve evaluation of QA models. 9627 References Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly Chris Alberti, Kenton Lee, and Michael Collins. 2019. supervised challenge dataset for reading comprehen- A BERT Baseline for the Natural Questions. arXiv, sion. In Proceedings of the Association for Computa- abs/1901.08634. tional Linguistics. Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. 2020. Learning Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick to retrieve reasoning paths over wikipedia graph for Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and question answering. In Proceedings of the Interna- Wen-tau Yih. 2020. Dense passage retrieval for open- tional Conference on Learning Representations. domain question answering. In Proceedings of Em- pirical Methods in Natural Language Processing. Kurt Bollacker, Colin Evans, Praveen K. Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collabo- Tom Kwiatkowski, Jennimaria Palomaki, Olivia Red- ratively created graph database for structuring human field, Michael Collins, Ankur Parikh, Chris Alberti, knowledge. In Proceedings of the ACM SIGMOD Danielle Epstein, Illia Polosukhin, Jacob Devlin, Ken- international conference on Management of data. ton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Jordan Boyd-Graber and Benjamin Börschinger. 2020. Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natu- What question answering can learn from trivia nerds. ral questions: A benchmark for question answering In Proceedings of the Association for Computational research. Transactions of the Association for Compu- Linguistics. tational Linguistics. Claudio Carpineto and Giovanni Romano. 2012. A survey of automatic query expansion in information Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. retrieval. ACM Computing Surveys. 2019. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the Anthony Chen, Gabriel Stanovsky, Sameer Singh, and Association for Computational Linguistics. Matt Gardner. 2019. Evaluating question answering evaluation. In Proceedings of the 2nd Workshop on Man Luo, Shailaja Keyur Sampat, Riley Tallman, Machine Reading for Question Answering. Yankai Zeng, Manuha Vancha, Akarshan Sajja, and Chitta Baral. 2021. ‘Just because you are right, Anthony Chen, Gabriel Stanovsky, Sameer Singh, and doesn’t mean I am wrong’: Overcoming a Bottleneck Matt Gardner. 2020. MOCHA: A dataset for train- in the Development and Evaluation of Open-Ended ing and evaluating generative reading comprehension Visual Question Answering (VQA) Tasks. In Pro- metrics. In Proceedings of Empirical Methods in ceedings of the European Chapter of the Association Natural Language Processing. for Computational Linguistics. Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open- Yuning Mao, Pengcheng He, Xiaodong Liu, Ye- domain questions. In Proceedings of the Association long Shen, Jianfeng Gao, Jiawei Han, and Weizhu for Computational Linguistics. Chen. 2021. Generation-Augmented Retrieval for Open-domain Question Answering. arXiv, Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan abs/2009.08553. Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A bench- Sewon Min, Jordan Boyd-Graber, Chris Alberti, mark for information-seeking question answering in Danqi Chen, Eunsol Choi, Michael Collins, Kelvin typologically diverse languages. Transactions of the Guu, Hannaneh Hajishirzi, Kenton Lee, Jenni- Association for Computational Linguistics. maria Palomaki, Colin Raffel, Adam Roberts, Tom Kwiatkowski, Patrick Lewis, Yuxiang Wu, Hein- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and rich Küttler, Linqing Liu, Pasquale Minervini, Pon- Kristina Toutanova. 2019. BERT: Pre-training of tus Stenetorp, Sebastian Riedel, Sohee Yang, Min- deep bidirectional transformers for language under- joon Seo, Gautier Izacard, Fabio Petroni, Lucas Hos- standing. In Conference of the North American Chap- seini, Nicola De Cao, Edouard Grave, Ikuya Ya- ter of the Association for Computational Linguistics. mada, Sonse Shimaoka, Masatoshi Suzuki, Shumpei Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Miyawaki, Shun Sato, Ryo Takahashi, Jun Suzuki, Stanovsky, Sameer Singh, and Matt Gardner. 2019. Martin Fajcik, Martin Docekal, Karel Ondrej, Pavel DROP: A reading comprehension benchmark requir- Smrz, Hao Cheng, Yelong Shen, Xiaodong Liu, ing discrete reasoning over paragraphs. In Confer- Pengcheng He, Weizhu Chen, Jianfeng Gao, Bar- ence of the North American Chapter of the Associa- las Oguz, Xilun Chen, Vladimir Karpukhin, Stan tion for Computational Linguistics. Peshterliev, Dmytro Okhonko, Michael Schlichtkrull, Sonal Gupta, Yashar Mehdad, and Wen-tau Yih. 2021. Shi Feng and Jordan Boyd-Graber. 2019. What AI can NeurIPS 2020 EfficientQA Competition: Systems, do for me: Evaluating machine learning interpreta- Analyses and Lessons Learned. In Proceedings of tions in cooperative play. In International Confer- the NeurIPS 2020 Competition and Demonstration ence on Intelligent User Interfaces. Track. 9628 Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. A discrete hard EM ap- proach for weakly supervised question answering. In Proceedings of Empirical Methods in Natural Lan- guage Processing. Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020. AmbigQA: Answering am- biguous open-domain questions. In Proceedings of Empirical Methods in Natural Language Processing. Peng Qi, Xiaowen Lin, L. Mehr, Zijian Wang, and Christopher D. Manning. 2019. Answering com- plex open-domain questions through iterative query generation. In Proceedings of Empirical Methods in Natural Language Processing. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of Empirical Methods in Natural Language Processing. Association for Computational Linguistics. Julian Risch, Timo Moller, Julian Gutsch, and Malte Pietsch. 2021. Semantic answer similarity for evaluating question answering models. arXiv, abs/2108.06130. Ellen M Voorhees. 2019. The evolution of cranfield. In Information retrieval evaluation in a changing world. Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christo- pher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of Empirical Methods in Natural Lan- guage Processing. Chen Zhao, Chenyan Xiong, Jordan Boyd-Graber, and Hal Daumé III. 2021. Multi-step reasoning over un- structured text with beam dense retrieval. In Confer- ence of the North American Chapter of the Associa- tion for Computational Linguistics. Chen Zhao, Chenyan Xiong, Corby Rosset, Xia Song, Paul N. Bennett, and Saurabh Tiwary. 2020. Transformer-xh: Multi-evidence reasoning with extra hop attention. In Proceedings of the International Conference on Learning Representations.
Zhi Zheng, Kai Hui, Ben He, Xianpei Han, Le Sun,
and Andrew Yates. 2020. BERT-QE: Contextualized Query Expansion for Document Re-ranking. In Find- ings of the Association for Computational Linguistics: EMNLP 2020.