
Covidex: Neural Ranking Models and Keyword Search Infrastructure for the COVID-19 Open Research Dataset

Edwin Zhang,1 Nikhil Gupta,1 Raphael Tang,1 Xiao Han,1 Ronak Pradeep,1 Kuang Lu,2
Yue Zhang,2 Rodrigo Nogueira,1 Kyunghyun Cho,3,4 Hui Fang,2 and Jimmy Lin1

1 University of Waterloo   2 University of Delaware   3 New York University   4 CIFAR Associate Fellow

Abstract

We present Covidex, a search engine that exploits the latest neural ranking models to provide information access to the COVID-19 Open Research Dataset curated by the Allen Institute for AI. Our system has been online and serving users since late March 2020. The Covidex is the user application component of our three-pronged strategy to develop technologies for helping domain experts tackle the ongoing global pandemic. In addition, we provide robust and easy-to-use keyword search infrastructure that exploits mature fusion-based methods as well as standalone neural ranking models that can be incorporated into other applications. These techniques have been evaluated in the ongoing TREC-COVID challenge: Our infrastructure and baselines have been adopted by many participants, including some of the highest-scoring runs in rounds 1, 2, and 3. In round 3, we report the highest-scoring run that takes advantage of previous training data and the second-highest fully automatic run.

1 Introduction

As a response to the worldwide COVID-19 pandemic, on March 13, 2020, the Allen Institute for AI (AI2) released the COVID-19 Open Research Dataset (CORD-19).1 With regular updates since the initial release (first weekly, then daily), the corpus contains around 188,000 scientific articles (as of July 12, 2020), most with full text, about COVID-19 and coronavirus-related research more broadly (for example, SARS and MERS). These articles are gathered from a variety of sources, including PubMed, a curated list of articles from the WHO, as well as preprints from arXiv, bioRxiv, and medRxiv. The goal of the effort is “to mobilize researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease.” We responded to this call to arms.

1 www.semanticscholar.org/cord19

As motivation, we believe that information access capabilities (search, question answering, etc.) can be applied to provide users with high-quality information from the scientific literature, to inform evidence-based decision making and to support insight generation. Examples include public health officials assessing the efficacy of wearing face masks, clinicians conducting meta-analyses to update care guidelines based on emerging studies, and virologists probing the genetic structure of COVID-19 in search of vaccines. We hope to contribute to these efforts via a three-pronged strategy:

1. Despite significant advances in the application of neural architectures to text ranking, keyword search (e.g., with “bag of words” queries) remains an important core technology. Building on top of our Anserini IR toolkit (Yang et al., 2018), we have released robust and easy-to-use open-source keyword search infrastructure that the broader community can build on.

2. Leveraging our own infrastructure, we explored the use of sequence-to-sequence transformer models for text ranking, combined with a simple classification-based feedback approach to exploit existing relevance judgments. We have also open sourced all these models, which can be integrated into other systems.

3. Finally, we package the previous two components into Covidex, an end-to-end search engine and browsing interface deployed at covidex.ai, initially described in Zhang et al. (2020a).
All three efforts have been successful. In the ongoing TREC-COVID challenge, our infrastructure and baselines have been adopted by many teams, which in some cases have submitted runs that scored higher than our own submissions. This illustrates the success of our infrastructure-building efforts (1). In the latest round 3 results, we report the highest-scoring run that exploits relevance judgments in a user feedback setting and the second-highest fully automatic run, affirming the quality of our own ranking models (2). Finally, usage statistics offer some evidence for the success of our deployed Covidex search engine (3).

2 Ranking Components

Multi-stage search architectures represent the most common design for modern search engines, with work in academia dating back over a decade (Matveeva et al., 2006; Wang et al., 2011; Asadi and Lin, 2013). Known production deployments of this design include the Bing web search engine (Pedersen, 2010) as well as Alibaba’s e-commerce search engine (Liu et al., 2017).

The idea behind multi-stage ranking is straightforward: instead of a monolithic ranker, ranking is decomposed into a series of stages. Typically, the pipeline begins with an initial retrieval stage, most often using bag-of-words queries against an inverted index. One or more subsequent stages then rerank and refine the candidate set successively until the final results are presented to the user. The multi-stage design provides a clean interface between keyword search, neural reranking models, and the user application.

This section details the individual components of our architecture. We describe later how these building blocks are assembled in the deployed system (Section 3) and for TREC-COVID (Section 4.2).

2.1 Keyword Search

In our design, initial retrieval is performed by the Anserini IR toolkit (Yang et al., 2017, 2018),2 which we have been developing for several years and which powers a number of our previous systems that incorporate various neural architectures (Yang et al., 2019; Yilmaz et al., 2019). Anserini represents an effort to better align real-world search applications with academic information retrieval research: under the covers, it builds on the popular and widely-deployed open-source Lucene search library, on top of which we provide a number of missing features for conducting research on modern IR test collections.

2 anserini.io

Anserini provides an abstraction for document collections, and comes with a variety of adaptors for different corpora and formats: web pages in WARC containers, XML documents in tarballs, JSON objects in text files, etc. Providing keyword search capabilities over CORD-19 required only writing an adaptor for the corpus that allows Anserini to ingest the documents.

An issue that immediately arose with CORD-19 concerns the granularity of indexing, i.e., what should we consider to be a “document” as the “atomic unit” of indexing and retrieval? One complication is that the corpus contains a mix of articles that vary widely in length, not only in terms of natural variations (scientific articles of varying lengths, book chapters, etc.), but also because the full text is not available for some articles. It is well known in the IR literature, dating back several decades (e.g., Singhal et al. 1996), that length normalization plays an important role in retrieval effectiveness.

Guided by previous work on searching full-text articles (Lin, 2009), we explored three separate indexing schemes:

• An index comprised of only titles and abstracts.

• An index comprised of each full-text article as a single, individual document; articles without full text contained only titles and abstracts.

• A paragraph-level index structured as follows: each full-text article is segmented into paragraphs, and for each paragraph we created a “document” comprising the title, abstract, and that paragraph. The title and abstract alone comprised an additional “document”. Thus, a full-text article with n paragraphs yields n + 1 separate retrieval units in the index.

To be consistent with standard IR parlance, we call each of these retrieval units a document, in a generic sense, despite their composite structure. Following best practice, documents are ranked using BM25 (Robertson et al., 1994). The relative effectiveness of each indexing scheme, however, is an empirical question.

With the paragraph index, a query is likely to retrieve multiple paragraphs from the same underlying article; since the final task is to rank articles, we take the highest-scoring paragraph across all retrieved results to produce a final ranking. Furthermore, we can combine these multiple representations to capture different ranking signals using fusion techniques, which further improves effectiveness; see Section 4.2 for details.
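The article-level aggregation over the paragraph index reduces to taking, per article, the maximum paragraph score. Below is a minimal sketch of this step; the docid convention (a paragraph suffix appended to the article id) and variable names are illustrative rather than the exact format used by our indexes.

```python
from collections import defaultdict

def max_passage_aggregate(hits):
    """Collapse paragraph-level hits into article-level scores by keeping,
    for each article, its highest-scoring paragraph."""
    best = defaultdict(float)
    for article_id, score in hits:
        best[article_id] = max(best[article_id], score)
    # Rank articles by the score of their best paragraph, descending.
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: paragraph-level docids of the form "<article>.<paragraph>".
paragraph_hits = [("abc123.00002", 7.2), ("abc123.00007", 6.1), ("xyz789.00001", 6.8)]
article_hits = [(docid.split(".")[0], score) for docid, score in paragraph_hits]
print(max_passage_aggregate(article_hits))  # [('abc123', 7.2), ('xyz789', 6.8)]
```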
Since Anserini is built on top of Lucene, which is implemented in Java, it is designed to run on the Java Virtual Machine (JVM). However, TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2019), the two most popular neural network toolkits today, use Python as their main language. More broadly, with its diverse and mature ecosystem, Python has emerged as the language of choice for most data scientists today. Anticipating this gap, we have been working on Pyserini,3 Python bindings for Anserini, since late 2019 (Yilmaz et al., 2020). Pyserini is released as a well-documented, easy-to-use Python module distributed via PyPI and easily installable via pip.4

3 pyserini.io
4 pypi.org/project/pyserini/
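As an illustration of how this infrastructure can be used, the sketch below issues a BM25 query against a pre-built CORD-19 index through Pyserini. The pre-built index name is a placeholder; the identifiers of the regularly updated CORD-19 releases are listed in the Pyserini documentation.

```python
from pyserini.search import SimpleSearcher

# Load a pre-built index by name (placeholder identifier; consult the Pyserini
# documentation for the current CORD-19 index releases).
searcher = SimpleSearcher.from_prebuilt_index('cord19-abstract')
searcher.set_bm25(k1=0.9, b=0.4)  # BM25 is the default ranking model

hits = searcher.search('serological tests that detect antibodies of COVID-19', k=10)
for rank, hit in enumerate(hits, start=1):
    print(f'{rank:2} {hit.docid:20} {hit.score:.4f}')
```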
Putting everything together, we provide the community keyword search infrastructure by sharing code, indexes, as well as baseline runs. First, all our code is available open source. Second, we share regularly updated pre-built versions of CORD-19 indexes, so that users can replicate our results with minimal effort. Finally, we provide baseline runs for TREC-COVID that can be directly incorporated into other participants’ submissions.

2.2 Rerankers

In our infrastructure, the output of Pyserini is fed to rerankers that aim to improve ranking quality. We describe three different approaches: two are based on neural architectures, and the third exploits relevance judgments in a feedback setting using a classification approach.

monoT5. Despite the success of BERT for document ranking (Dai and Callan, 2019; MacAvaney et al., 2019; Yilmaz et al., 2019), there is evidence that ranking with sequence-to-sequence models can achieve even better effectiveness, particularly in zero-shot and other settings with limited training data (Nogueira et al., 2020), such as for TREC-COVID. Our “base” reranker, called monoT5, is based on T5 (Raffel et al., 2019).

Given a query q and a set of candidate documents D from Pyserini, for each d ∈ D we construct the following input sequence to feed into our model:

Query: q Document: d Relevant:    (1)

The model is fine-tuned to produce either “true” or “false” depending on whether the document is relevant or not to the query. That is, “true” and “false” are the ground truth predictions in the sequence-to-sequence task, what we call the “target words”. At inference time, to compute probabilities for each query–document pair, we apply softmax only to the logits of the “true” and “false” tokens. We rerank the candidate documents according to the probabilities assigned to the “true” token. See Nogueira et al. (2020) for additional details about this logit normalization trick and the effects of different target words.

Since we did not initially have training data specific to COVID-19, we fine-tuned our model on the MS MARCO passage dataset (Nguyen et al., 2016), which comprises 8.8M passages obtained from the top 10 results retrieved by the Bing search engine (based on around 1M queries). The training set contains approximately 500k pairs of queries and relevant documents, where each query has one relevant passage on average; non-relevant documents for training are also provided as part of the training data. Nogueira et al. (2020) and Yilmaz et al. (2019) have both previously demonstrated that models trained on MS MARCO can be directly applied to other document ranking tasks.

We fine-tuned our monoT5 model with a constant learning rate of 10^-3 for 10k iterations with class-balanced batches of size 128. We used a maximum of 512 input tokens and one output token (i.e., either “true” or “false”, as described above). In the MS MARCO passage dataset, none of the inputs required truncation when using this length limit. Training variants based on T5-base and T5-3B took approximately 4 and 40 hours, respectively, on a single Google TPU v3-8.

At inference time, since output from Pyserini is usually longer than the length restrictions of the model, it is not possible to feed the entire text into our model at once. To address this issue, we first segment each document into spans by applying a sliding window of 10 sentences with a stride of 5. We obtain a probability of relevance for each span by performing inference on it independently, and then select the highest probability among the spans as the relevance score of the document.
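The scoring step can be sketched with the Hugging Face T5 implementation: the input follows the template in (1), the decoder emits a single token, and the relevance score is the softmax over just the “true” and “false” logits. This is a simplified illustration; the checkpoint below is the generic t5-base (a fine-tuned monoT5 checkpoint would be substituted in practice), and our released implementation lives in PyGaggle.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')  # stand-in for a fine-tuned monoT5
model.eval()

def mono_t5_score(query: str, passage: str) -> float:
    """Return the probability of the 'true' target word for one query-passage pair."""
    text = f'Query: {query} Document: {passage} Relevant:'
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    # Decode a single step and inspect the logits of that one output token.
    decoder_input_ids = torch.full((1, 1), model.config.decoder_start_token_id, dtype=torch.long)
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
    true_id = tokenizer.encode('true')[0]
    false_id = tokenizer.encode('false')[0]
    # Softmax restricted to the two target-word logits ("logit normalization trick").
    probs = torch.softmax(logits[[true_id, false_id]], dim=0)
    return probs[0].item()

print(mono_t5_score('incubation period of COVID-19',
                    'The median incubation period was estimated to be 5.1 days.'))
```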
duoT5. A pairwise reranker estimates the probability p_{i,j} that candidate d_i is more relevant than d_j for query q, where i ≠ j. Nogueira et al. (2019) demonstrated that a pairwise BERT reranker running on the output of a pointwise BERT reranker yields statistically significant improvements in ranking metrics. We applied the same intuition to T5 in a pairwise reranker called duoT5, which takes as input the sequence:

Query: q Document0: d_i Document1: d_j Relevant:

where d_i and d_j are unique pairs of candidates from the set D. The model is fine-tuned to predict “true” if candidate d_i is more relevant than d_j to query q and “false” otherwise. We fine-tuned duoT5 using the same hyperparameters as monoT5.

At inference time, we use the top 50 highest-scoring documents according to monoT5 as our candidates {d_i}. We then obtain probabilities p_{i,j} of d_i being more relevant than d_j for all unique candidate pairs {d_i, d_j}, for all i ≠ j. Finally, we compute a single score s_i for candidate d_i as follows:

s_i = \sum_{j \in J_i} \left( p_{i,j} + (1 - p_{j,i}) \right)    (2)

where J_i = {0 ≤ j < 50, j ≠ i}. Based on exploratory studies on the MS MARCO passage dataset, this setting leads to the most stable and effective rankings.
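The aggregation in Equation (2) is a simple sum over the pairwise probabilities. A minimal sketch, assuming the matrix p[i][j] = P(d_i is more relevant than d_j) has already been produced by duoT5:

```python
def aggregate_duo_scores(p):
    """Compute s_i = sum over j != i of (p[i][j] + (1 - p[j][i]))."""
    n = len(p)
    return [sum(p[i][j] + (1.0 - p[j][i]) for j in range(n) if j != i)
            for i in range(n)]

# Toy example with three candidates; diagonal entries are unused.
p = [[0.0, 0.9, 0.8],
     [0.1, 0.0, 0.6],
     [0.2, 0.4, 0.0]]
scores = aggregate_duo_scores(p)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)  # candidate 0 comes out on top
```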
Relevance Feedback. The setup of TREC-COVID (see Section 4.1) provides a feedback setting where systems can exploit a limited number of relevance judgments on a per-query basis. How do we take advantage of such training data? Despite work on fine-tuning transformers in a few-shot setting (Zhang et al., 2020b; Lee et al., 2020), we were wary of the dangers of overfitting on limited data, particularly since there is little guidance on relevance feedback using transformers in the literature. Instead, we implemented a robust approach that treats relevance feedback as a document classification problem using simple linear classifiers, described in Yu et al. (2019) and Lin (2019).

The approach is conceptually simple: for each query, we train a linear classifier (logistic regression) that attempts to distinguish relevant from non-relevant documents for that query. The classifier operates on sparse bag-of-words representations using tf–idf term weighting. At inference time, each candidate document is fed to the classifier, and the classifier score is then linearly interpolated with the original candidate document score to produce a final score. We describe the input source documents in Section 4.2.
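A minimal sketch of this per-query feedback step with scikit-learn is shown below. The variable names and the equal-weight interpolation are illustrative (the mixing weight we actually used is reported in Section 4.2), and the min-max normalization of the retrieval scores is one reasonable choice rather than the exact recipe.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def feedback_rerank(judged_texts, judged_labels, candidate_texts,
                    candidate_scores, alpha=0.5):
    """Train a per-query logistic regression on tf-idf features of judged
    documents, then interpolate its scores with the original retrieval scores."""
    vectorizer = TfidfVectorizer()
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vectorizer.fit_transform(judged_texts), judged_labels)

    clf_scores = clf.predict_proba(vectorizer.transform(candidate_texts))[:, 1]

    # Normalize retrieval scores to [0, 1] before mixing (illustrative choice).
    orig = np.asarray(candidate_scores, dtype=float)
    orig = (orig - orig.min()) / (orig.max() - orig.min() + 1e-9)

    return alpha * orig + (1.0 - alpha) * clf_scores

final_scores = feedback_rerank(
    judged_texts=['antibody test sensitivity and specificity ...',
                  'school closure policy effects ...'],
    judged_labels=[1, 0],   # 1 = relevant, 0 = non-relevant
    candidate_texts=['evaluation of a serology assay ...',
                     'mask mandate compliance survey ...'],
    candidate_scores=[12.3, 9.8],
)
print(final_scores)
```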
All components above have also been open sourced. The two neural reranking modules are available in PyGaggle,5 which is our recently developed neural ranking library designed to work with Pyserini. Our classification-based approach to feedback is implemented in Pyserini directly. These components are available for integration into any system.

5 pygaggle.ai
3 The Covidex

Beyond sharing our keyword search infrastructure and reranking models, we’ve built the Covidex as an operational search engine to demonstrate our capabilities to domain experts who are not interested in individual components. As deployed, we use the paragraph index and monoT5-base as the reranker. An additional highlighting module based on BioBERT is described in Zhang et al. (2020a). To decrease end-to-end latency, we rerank only the top 96 documents per query and truncate reranker input to a maximum of 256 tokens.

The Covidex is built using the FastAPI Python web framework, where all incoming API requests are handled by a service that performs searching, reranking, and text highlighting. Search is performed with Pyserini (Section 2.1), and the results are then reranked with PyGaggle (Section 2.2). The frontend (which is also open source) is built with React to support the use of modular, declarative JavaScript components,6 taking advantage of its vast ecosystem.

6 reactjs.org
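Conceptually, the service is a single search endpoint that chains the two stages before returning results to the frontend. The sketch below is highly simplified and hypothetical: the actual Covidex API surface, request schema, index path, and highlighting step differ.

```python
from fastapi import FastAPI
from pyserini.search import SimpleSearcher

app = FastAPI()
searcher = SimpleSearcher('indexes/cord19-paragraph')  # path to a local index (illustrative)

@app.get('/api/search')
def search(query: str, k: int = 96):
    # Stage 1: BM25 retrieval with Pyserini.
    hits = searcher.search(query, k=k)
    candidates = [{'docid': hit.docid, 'score': hit.score} for hit in hits]
    # Stage 2: neural reranking (e.g., monoT5 via PyGaggle) and text highlighting
    # would be applied here before returning results to the React frontend.
    return {'query': query, 'results': candidates}
```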
[Figure 1: Screenshot of the Covidex.]

A screenshot of our system is shown in Figure 1. Covidex provides standard search capabilities, either based on keyword queries or natural-language input. Users can click “Show more” to reveal the abstract as well as excerpts from the full text, where potentially relevant passages are highlighted. Clicking on the title brings the user to the article’s source on the publisher’s site. In addition, we have implemented a faceted browsing feature. From CORD-19, we were able to easily expose facets corresponding to dates, authors, journals, and sources. Navigating by year, for example, allows a user to focus on older coronavirus research (e.g., on SARS) or the latest research on COVID-19, and a combination of the journal and source facets allows a user to differentiate between preprints and the peer-reviewed literature, and between venues with different reputations.

The system is currently deployed across a small cluster of servers, each with two NVIDIA V100 GPUs, as our pipeline requires neural network inference at query time. Each server runs the complete software stack in a simple replicated setup (no partitioning). On top of this, we leverage Cloudflare as a simple load balancer, which uses a round-robin scheme to dispatch requests across the different servers. The end-to-end latency for a typical query is around two seconds.

The first implementation of our system was deployed in late March, and we have been incrementally adding features since. Based on Cloudflare statistics, our site receives around two hundred unique visitors per day and serves more than one thousand requests each day. Of course, usage statistics were (up to several times) higher when we first launched due to publicity on social media. However, the figures cited above represent a “steady state” that has held up over the past few months, in the absence of any deliberate promotion.

4 TREC-COVID

Reliable, large-scale evaluations of text retrieval methods are a costly endeavour, typically beyond the resources of individual research groups. Fortunately, the community-wide TREC-COVID challenge sponsored by the U.S. National Institute of Standards and Technology (NIST) provides a forum for evaluating our techniques.

4.1 Evaluation Overview

The TREC-COVID challenge, which began in mid-April and is still ongoing, provides an opportunity for researchers to study methods for quickly standing up information access systems, both in response to the current pandemic and to prepare for similar future events.

Both out of logistic necessity in evaluation design and because the body of scientific literature is rapidly expanding, TREC-COVID is organized into a series of “rounds”, each of which uses the CORD-19 collection at a snapshot in time. For a particular round, participating teams develop systems that return results for a number of information needs, called “topics”; one example is “serological tests that detect antibodies of COVID-19”. These results comprise a run or a submission. NIST then gathers, organizes, and evaluates these runs using a standard pooling methodology (Voorhees, 2002).

The product of each round is a collection of relevance judgments, which are annotations by domain experts about the relevance of documents with respect to topics. On average, there are around 300 judgments (both positive and negative) per topic from each round. These relevance judgments are used to evaluate the effectiveness of systems (populating a leaderboard) and can also be used to train machine-learning models in future rounds. Runs that take advantage of these relevance judgments are known as “feedback runs”, in contrast to “automatic” runs that do not. A third category, “manual” runs, can involve human input, but we did not submit any such runs.

Currently, TREC-COVID has completed round 3 and is in the middle of round 4. We present evaluation results from rounds 1, 2, and 3, since results from round 4 are not yet available. Each round contains a number of topics that are persistent (i.e., carried over from previous rounds) as well as new topics. To avoid retrieving duplicate documents, the evaluation adopts a residual collection methodology, where judged documents (either relevant or not) from previous rounds are automatically removed from consideration. Thus, for each topic, future rounds only evaluate documents that have not been examined before (either newly published articles or articles that have never been retrieved). Note that due to the evaluation methodology, scores across rounds are not comparable.

4.2 Results

A selection of results from TREC-COVID is shown in Table 1, where we report standard metrics computed by NIST. We submitted runs under team “covidex” (for neural models) and team “anserini” (for our bag-of-words baselines).
Team              Run                            Type       nDCG@10  P@5     mAP

Round 1: 30 topics
sabir             sabir.meta.docs                automatic  0.6080   0.7800  0.3128
GUIR S2           run2†                          automatic  0.6032   0.6867  0.2601
covidex           T5R1 (= monoT5)                automatic  0.5223   0.6467  0.2838

Round 2: 35 topics
mpiid5            mpiid5 run3†                   manual     0.6893   0.8514  0.3380
CMT               SparseDenseSciBert†            feedback   0.6772   0.7600  0.3115
GUIR S2           GUIR S2 run1†                  automatic  0.6251   0.7486  0.2842
covidex           covidex.t5 (= monoT5)          automatic  0.6250   0.7314  0.2880
anserini          r2.fusion2                     automatic  0.5553   0.6800  0.2725
anserini          r2.fusion1                     automatic  0.4827   0.6114  0.2418

Round 3: 40 topics
covidex           r3.t5 lr                       feedback   0.7740   0.8600  0.3333
BioinformaticsUA  BioInfo-run1                   feedback   0.7715   0.8650  0.3188
SFDC              SFDC-fus12-enc23-tf3†          automatic  0.6867   0.7800  0.3160
covidex           r3.duot5 (= monoT5 + duoT5)    automatic  0.6626   0.7700  0.2676
covidex           r3.monot5 (= monoT5)           automatic  0.6596   0.7800  0.2635
anserini          r3.fusion2                     automatic  0.6100   0.7150  0.2641
anserini          r3.fusion1                     automatic  0.5359   0.6100  0.2293

Table 1: Selected TREC-COVID results. Our submissions are under teams “covidex” and “anserini”. All runs notated with † incorporate our infrastructure components in some way.

In Round 1, there were 143 runs from 56 teams. Our best run T5R1 used BM25 for first-stage retrieval using the paragraph index followed by our monoT5-3B reranker, trained on MS MARCO (as described in Section 2.2). The best automatic neural run was run2 from team GUIR S2 (MacAvaney et al., 2020), which was built on Anserini. This run placed second behind the best automatic run, sabir.meta.docs, which interestingly was based on the vector-space model.

While we did make meaningful infrastructure contributions (e.g., Anserini provided the keyword search results that fed the neural ranking models of team GUIR S2), our own run T5R1 was substantially behind the top-scoring runs. A post-hoc experiment with round 1 relevance judgments showed that using the paragraph index did not turn out to be the best choice: simply replacing it with the abstract index (but retaining the monoT5-3B reranker) improved nDCG@10 from 0.5223 to 0.5702.7

7 Despite this finding, we suspect that there may be evaluation artifacts at play here, because our impressions from the deployed system suggest that results from the paragraph index are better. Thus, the deployed Covidex still uses paragraph indexes.

We learned two important lessons from the results of round 1:

1. The effectiveness of simple rank fusion techniques that can exploit diverse ranking signals by combining multiple ranked lists. Many teams adopted such techniques (including the top-scoring run), which proved both robust and effective. This is not a new observation in information retrieval, but is once again affirmed by TREC-COVID.

2. The importance of building the “right” query representations for keyword search. Each TREC-COVID topic contains three fields: query, question, and narrative. The query field describes the information need using a few keywords, similar to what a user would type into a web search engine. The question field phrases the information need as a well-formed natural language question, and the narrative field contains additional details in a short paragraph. The query field may be missing important keywords, but the other two fields often contain too many “noisy” terms unrelated to the information need.

Thus, it makes sense to leverage information from multiple fields in constructing keyword queries, but to do so selectively. Based on results from round 1, the following query generation technique proved to be effective: when constructing the keyword query for a given topic, we take the non-stopwords from the query field and further expand them with terms belonging to named entities extracted from the question field using ScispaCy (Neumann et al., 2019).
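A sketch of this query generation step is shown below. The en_core_sci_sm ScispaCy model and the spaCy stop-word and punctuation flags are reasonable choices for illustration, though they may not match our exact configuration.

```python
import spacy

# ScispaCy model for biomedical text (requires the scispacy package and the
# separately distributed en_core_sci_sm model).
nlp = spacy.load('en_core_sci_sm')

def generate_keyword_query(query_field: str, question_field: str) -> str:
    """Keep non-stopwords from the query field and expand them with terms from
    named entities extracted from the question field."""
    query_terms = [tok.text for tok in nlp(query_field)
                   if not tok.is_stop and not tok.is_punct]
    entity_terms = [ent.text for ent in nlp(question_field).ents]
    return ' '.join(query_terms + entity_terms)

print(generate_keyword_query(
    'coronavirus immunity',
    'Will SARS-CoV2 infected people develop immunity? Is cross protection possible?'))
```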
We saw these two lessons as an opportunity to further contribute community infrastructure, and starting in round 2 we made two fusion runs from Anserini freely available: fusion1 and fusion2.
In both runs, we combined rankings from the abstract, full-text, and paragraph indexes via reciprocal rank fusion (RRF) (Cormack et al., 2009). The runs differed in their treatment of the query representation. The run fusion1 simply took the query field from the topics as the basis for keyword search, while run fusion2 incorporated the query generator described above to augment the query representation with key phrases. These runs were made available before the deadline so that other teams could use them, and indeed many took advantage of them.
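Reciprocal rank fusion itself is only a few lines of code. A minimal sketch, using the conventional k = 60 from Cormack et al. (2009):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of docids: each docid accumulates 1 / (k + rank)
    across lists, with ranks starting at 1."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, docid in enumerate(ranking, start=1):
            scores[docid] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: fuse runs from the abstract, full-text, and paragraph indexes.
abstract_run  = ['d1', 'd2', 'd3']
fulltext_run  = ['d2', 'd1', 'd4']
paragraph_run = ['d3', 'd2', 'd1']
print(reciprocal_rank_fusion([abstract_run, fulltext_run, paragraph_run]))
```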
In Round 2, there were 136 runs from 51 teams. Our two Anserini baseline fusion runs are shown as r2.fusion1 and r2.fusion2 in Table 1. Comparing these two fusion baselines, we see that our query generation approach yields a large gain in effectiveness. Ablation studies further confirmed that ranking signals from the different indexes do contribute to the overall higher effectiveness of the rank fusion runs. That is, the effectiveness of the fusion results is higher than results from any of the individual indexes.

Our covidex.t5 run takes r2.fusion1 and r2.fusion2, reranks both with monoT5-3B, and then combines (with RRF) the outputs of both. The monoT5-3B model was fine-tuned on MS MARCO and then fine-tuned (again) on a medical subset of MS MARCO (MacAvaney et al., 2020). This run essentially tied with the best automatic run, GUIR S2 run1, which scored just 0.0001 higher.

As additional context, Table 1 shows the best “manual” and “feedback” runs from round 2 (mpiid5 run3 and SparseDenseSciBert, respectively), which were also the top two runs overall. These results show that manual and feedback techniques can achieve quite a bit of gain over fully automatic techniques. Both of these runs and four out of the five top teams in round 2 took advantage of the fusion baselines we provided, which demonstrates our impact not only in developing effective ranking models, but also our service to the community in providing infrastructure.

In Round 3, there were 79 runs from 31 teams. Our Anserini fusion baselines, r3.fusion1 and r3.fusion2, remained the same from the previous round and continued to provide strong baselines.

Our run r3.duot5 represents the first deployment of our monoT5 and duoT5 multi-stage reranking pipeline (see Section 2.2), which uses a fusion of the fusion runs as the first-stage candidates, reranked by monoT5 and then duoT5. From Table 1, we see that duoT5 does indeed improve over just using monoT5 (run r3.monot5), albeit the gains are small (but we found that the duoT5 run has more unjudged documents). The r3.duot5 run ranks second among all teams under the “automatic” condition, and we are about two points behind team SFDC. However, according to Esteva et al. (2020), their general approach incorporates Anserini fusion runs, which bolsters our case that we are providing valuable infrastructure for the community.

Our own feedback run r3.t5 lr implements the classification-based feedback technique (see Section 2.2) with monoT5 results as the input source documents (with a mixing weight of 0.5 to combine monoT5 scores with classifier scores). This was the highest-scoring run across all submissions (all categories), just a bit ahead of BioInfo-run1.

5 Conclusions

Our project has three goals: build community infrastructure, advance the state of the art in neural ranking, and provide a useful application. We believe that our efforts can contribute to the fight against this global pandemic. Beyond COVID-19, the capabilities we’ve developed can be applied to analyzing the scientific literature more broadly.

6 Acknowledgments

This research was supported in part by the Canada First Research Excellence Fund, the Natural Sciences and Engineering Research Council (NSERC) of Canada, CIFAR AI & COVID-19 Catalyst Funding 2019–2020, and the Microsoft AI for Good COVID-19 Grant. We’d like to thank Kyle Lo from AI2 for helpful discussions and Colin Raffel from Google for his assistance with T5.
References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16), pages 265–283.

Nima Asadi and Jimmy Lin. 2013. Effectiveness/efficiency tradeoffs for candidate generation in multi-stage retrieval architectures. In Proceedings of the 36th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2013), pages 997–1000, Dublin, Ireland.

Gordon V. Cormack, Charles L. A. Clarke, and Stefan Büttcher. 2009. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009), pages 758–759, Boston, Massachusetts.

Zhuyun Dai and Jamie Callan. 2019. Deeper text understanding for IR with contextual neural language modeling. In Proceedings of the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), pages 985–988, Paris, France.

Andre Esteva, Anuprit Kale, Romain Paulus, Kazuma Hashimoto, Wenpeng Yin, Dragomir Radev, and Richard Socher. 2020. CO-Search: COVID-19 information retrieval with semantic search, question answering, and abstractive summarization. arXiv:2006.09595.

Cheolhyoung Lee, Kyunghyun Cho, and Wanmo Kang. 2020. Mixout: Effective regularization to finetune large-scale pretrained language models. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020).

Jimmy Lin. 2009. Is searching full text more effective than searching abstracts? BMC Bioinformatics, 10:46.

Jimmy Lin. 2019. The simplest thing that can possibly work: pseudo-relevance feedback using text classification. arXiv:1904.08861.

Shichen Liu, Fei Xiao, Wenwu Ou, and Luo Si. 2017. Cascade ranking for operational e-commerce search. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2017), pages 1557–1565, Halifax, Nova Scotia, Canada.

Sean MacAvaney, Arman Cohan, and Nazli Goharian. 2020. SLEDGE: A simple yet effective baseline for coronavirus scientific knowledge search. arXiv:2005.02365.

Sean MacAvaney, Andrew Yates, Arman Cohan, and Nazli Goharian. 2019. CEDR: Contextualized embeddings for document ranking. In Proceedings of the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), pages 1101–1104, Paris, France.

Irina Matveeva, Chris Burges, Timo Burkard, Andy Laucius, and Leon Wong. 2006. High accuracy retrieval with multiple nested ranker. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006), pages 437–444, Seattle, Washington.

Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv:1902.07669.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: a human-generated machine reading comprehension dataset. arXiv:1611.09268.

Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. 2020. Document ranking with a pretrained sequence-to-sequence model. arXiv:2003.06713.

Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019. Multi-stage document ranking with BERT. arXiv:1910.14424.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035.

Jan Pedersen. 2010. Query understanding at Bing. In Industry Track Keynote at the 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2010), Geneva, Switzerland.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv:1910.10683.

Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. 1994. Okapi at TREC-3. In Proceedings of the 3rd Text REtrieval Conference (TREC-3), pages 109–126, Gaithersburg, Maryland.

Amit Singhal, Chris Buckley, and Mandar Mitra. 1996. Pivoted document length normalization. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1996), pages 21–29, Zürich, Switzerland.

Ellen M. Voorhees. 2002. The philosophy of information retrieval evaluation. In Evaluation of Cross-Language Information Retrieval Systems: Second Workshop of the Cross-Language Evaluation Forum, Lecture Notes in Computer Science Volume 2406, pages 355–370.

Lidan Wang, Jimmy Lin, and Donald Metzler. 2011. A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011), pages 105–114, Beijing, China.

Peilin Yang, Hui Fang, and Jimmy Lin. 2017. Anserini: enabling the use of Lucene for information retrieval research. In Proceedings of the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), pages 1253–1256, Tokyo, Japan.

Peilin Yang, Hui Fang, and Jimmy Lin. 2018. Anserini: reproducible ranking baselines using Lucene. Journal of Data and Information Quality, 10(4):Article 16.

Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. End-to-end open-domain question answering with BERTserini. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 72–77, Minneapolis, Minnesota.

Zeynep Akkalyoncu Yilmaz, Charles L. A. Clarke, and Jimmy Lin. 2020. A lightweight environment for learning experimental IR research practices. In Proceedings of the 43rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020).

Zeynep Akkalyoncu Yilmaz, Wei Yang, Haotian Zhang, and Jimmy Lin. 2019. Cross-domain modeling of sentence-level evidence for document retrieval. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3481–3487, Hong Kong, China.

Ruifan Yu, Yuhao Xie, and Jimmy Lin. 2019. Simple techniques for cross-collection relevance feedback. In Proceedings of the 41st European Conference on Information Retrieval, Part I (ECIR 2019), pages 397–409, Cologne, Germany.

Edwin Zhang, Nikhil Gupta, Rodrigo Nogueira, Kyunghyun Cho, and Jimmy Lin. 2020a. Rapidly deploying a neural search engine for the COVID-19 Open Research Dataset: Preliminary thoughts and lessons learned. arXiv:2004.05125.

Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q. Weinberger, and Yoav Artzi. 2020b. Revisiting few-sample BERT fine-tuning. arXiv:2006.05987.
