Rapidly Deploying A Neural Search Engine For The COVID-19 Open Research Dataset: Preliminary Thoughts and Lessons Learned
Edwin Zhang,1 Nikhil Gupta,1 Rodrigo Nogueira,1 Kyunghyun Cho,2,3,4,5 and Jimmy Lin1

1 David R. Cheriton School of Computer Science, University of Waterloo
2 Courant Institute of Mathematical Sciences, New York University
3 Center for Data Science, New York University
4 Facebook AI Research
5 CIFAR Associate Fellow
Abstract

The Neural Covidex is a search engine that exploits the latest neural ranking architectures to provide information access to the COVID-19 Open Research Dataset curated by the Allen Institute for AI. This web application exists as part of a suite of tools that we have developed over the past few weeks to help domain experts tackle the ongoing global pandemic. We hope that improved information access capabilities to the scientific literature can inform evidence-based decision making and insight generation. This paper describes our initial efforts and offers a few thoughts about lessons we have learned along the way.

1 Introduction

As a response to the worldwide COVID-19 pandemic, on March 13, 2020, the Allen Institute for AI released the COVID-19 Open Research Dataset (CORD-19) in partnership with a coalition of research groups.1 With weekly updates since the initial release, the corpus currently contains over 47,000 scholarly articles, including over 36,000 with full text, about COVID-19 and coronavirus-related research more broadly (for example, SARS and MERS), drawn from a variety of sources including PubMed, a curated list of articles from the WHO, as well as preprints from bioRxiv and medRxiv. The stated goal of the effort is “to mobilize researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease”. We responded to this call to arms.

In approximately two weeks, our team was able to build, deploy, and share with the research community a number of components that support information access to this corpus. We have also assembled these components into two end-to-end applications: a basic keyword search interface as well as the Neural Covidex, a search engine that exploits the latest advances in deep learning and neural architectures for ranking. This paper describes our initial efforts.

We have several goals for this paper: First, we discuss our motivation and approach, articulating how, hopefully, better information access capabilities can contribute to the fight against this global pandemic. Second, we provide a technical description of what we have built. Previously, this information was scattered across different web pages, in tweets, and in ephemeral discussions with colleagues over video conferences and email. Gathering all this information in one place is important for other researchers who wish to evaluate and build on our work. Finally, we reflect on our journey so far, discussing the evaluation of our system and offering some lessons learned that might inform future efforts in building technologies to aid in rapidly developing crises.

2 Motivation and Approach

Our team was assembled on March 21, 2020 over Slack, comprising members of two research groups from the University of Waterloo and New York University. This was a natural outgrowth of existing collaborations, and thus we had rapport from the very beginning. Prior to these discussions, we had known about the CORD-19 dataset, but had not yet undertaken any serious attempt to build a research project around it.

Motivating our efforts, we believed that information access capabilities (search, question answering, etc.)—broadly, the types of technologies that our team works on—could be applied to provide users with high-quality information from the scientific literature, to inform evidence-based decision making and to support insight generation.

1 https://pages.semanticscholar.org/coronavirus-research
Examples might include public health officials assessing the efficacy of population-level interventions, clinicians conducting meta-analyses to update care guidelines based on emerging clinical studies, or virologists probing the genetic structure of COVID-19 in search of vaccines. We hope to contribute to these efforts by building better information access capabilities and packaging them into useful applications.

At the outset, we adopted a two-pronged strategy to build both end-to-end applications as well as modular, reusable components. The intended users of our systems are domain experts (e.g., clinicians and virologists) who would naturally demand responsive web applications with intuitive, easy-to-use interfaces. However, we also wished to build component technologies that could be shared with the research community, so that others can build on our efforts without “reinventing the wheel”. To this end, we have released software artifacts (e.g., a Java package in Maven Central, a Python module on PyPI) that encapsulate some of our capabilities, complete with sample notebooks demonstrating their use. These notebooks support one-click replicability and provide a springboard for extensions.

3 Technical Description

Multi-stage search architectures represent the most common design for modern search engines, with work in academia dating back over a decade (Matveeva et al., 2006; Wang et al., 2011; Asadi and Lin, 2013). Known production deployments of this architecture include the Bing web search engine (Pedersen, 2010) as well as Alibaba’s e-commerce search engine (Liu et al., 2017).

The idea behind multi-stage ranking is straightforward: instead of a monolithic ranker, ranking is decomposed into a series of stages. Typically, the pipeline begins with an initial retrieval stage, most often using “bag of words” queries against an inverted index. One or more subsequent stages rerank and refine the candidate set successively until the final results are presented to the user.

This multi-stage ranking design provides a nice organizing structure for our efforts—in particular, it provides a clean interface between basic keyword search and subsequent neural reranking components. This allowed us to make progress independently in a decoupled manner, but also presents natural integration points.

3.1 Modular and Reusable Keyword Search

In our design, initial retrieval is performed by the Anserini IR toolkit (Yang et al., 2017, 2018),2 which we have been developing for several years and which powers a number of our previous systems that incorporate various neural architectures (Yang et al., 2019; Yilmaz et al., 2019). Anserini represents an effort to better align real-world search applications with academic information retrieval research: under the covers, it builds on the popular and widely-deployed open-source Lucene search library, on top of which we provide a number of missing features for conducting research on modern IR test collections.

Anserini provides an abstraction for document collections, and comes with a variety of adaptors for different corpora and formats: web pages in WARC containers, XML documents in tarballs, JSON objects in text files, etc. Providing simple keyword search over CORD-19 required only writing an adaptor for the corpus that allows Anserini to ingest the documents. We were able to implement such an adaptor in a short amount of time.

However, one important issue that immediately arose with CORD-19 concerned the granularity of indexing, i.e., what should we consider a “document”, as the “atomic unit” of indexing and retrieval? One complication stems from the fact that the corpus contains a mix of articles that vary widely in length, not only in terms of natural variations, but also because the full text is not available for some documents. It is well known in the IR literature, dating back several decades (e.g., Singhal et al. 1996), that length normalization plays an important role in retrieval effectiveness.

Here, however, the literature does provide some guidance: previous work (Lin, 2009) showed that paragraph-level indexing can be more effective than the two other obvious alternatives of (a) indexing only the title and abstract of articles and (b) indexing each full-text article as a single, individual document. Based on this previous work, in addition to the two above conditions (for comparison purposes), we built (c) a paragraph-level index as follows: each full-text article is segmented into paragraphs (based on existing annotations), and for each paragraph, we create a “document” for indexing comprising the title, abstract, and that paragraph. Thus, a full-text article comprising n paragraphs yields n + 1 separate “retrievable units” in the index.

2 http://anserini.io/
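The paragraph-level indexing scheme described above can be sketched in a few lines. This is a hypothetical illustration, not the actual Anserini adaptor; the function and field names are ours, chosen to mirror the JSON-style documents that Anserini-like tools ingest.

```python
def to_retrievable_units(article_id, title, abstract, paragraphs):
    """Expand one article into n + 1 "documents" for paragraph-level indexing.

    The first unit holds only the title and abstract; each remaining unit
    combines the title, abstract, and one full-text paragraph. An article
    without full text simply yields the single title-plus-abstract unit.
    """
    units = [{"id": article_id, "contents": f"{title}\n{abstract}"}]
    for i, paragraph in enumerate(paragraphs):
        units.append({
            "id": f"{article_id}.{i + 1:05d}",  # per-paragraph unit id
            "contents": f"{title}\n{abstract}\n{paragraph}",
        })
    return units
```

Each returned unit is then indexed as an ordinary "document", so BM25 scoring applies to these composite units rather than to whole articles.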
To be consistent with standard IR parlance, we call each of these retrieval units a document, in a generic sense, despite their composite structure. An article for which we do not have the full text is represented by an individual document in this scheme. Note that while fielded search (dividing the text into separate fields and performing scoring separately for each field) can yield better results, for expediency we did not implement this. Following best practice, documents are ranked using the BM25 scoring function.

Based on “eyeballing the results” using sample information needs (manually formulated into keyword queries) from the Kaggle challenge associated with CORD-19,3 results from the paragraph index did appear to be better (see Section 4 for more discussion). In particular, the full-text index, i.e., condition (b) above, overly favored long articles, which were often book chapters and other material of a pedagogical nature, less likely to be relevant in our context. The paragraph index often retrieves multiple paragraphs from the same article, but we consider this to be a useful feature, since duplicates of the same underlying article can provide additional signals for evidence combination by downstream components.

Since Anserini is built on top of Lucene, which is implemented in Java, our tools are designed to run on the Java Virtual Machine (JVM). However, TensorFlow (Abadi et al., 2016) and PyTorch (Paszke et al., 2019), the two most popular neural network toolkits, use Python as their main language. More broadly, Python—with its diverse and mature ecosystem—has emerged as the language of choice for most data scientists today. Anticipating this gap, our team had been working on Pyserini,4 Python bindings for Anserini, since late 2019. Pyserini is released as a Python module on PyPI and is easily installable via pip.5

Putting all the pieces together, by March 23, a scant two days after the formation of our team, we were able to release modular and reusable baseline keyword search components for accessing the CORD-19 collection.6 Specifically, we shared pre-built Anserini indexes for CORD-19 and released an updated version of Anserini (the underlying IR toolkit, as a Maven artifact in the Maven Central Repository) as well as Pyserini (the Python interface, as a Python module on PyPI) that provided basic keyword search. Furthermore, these capabilities were demonstrated in online notebooks, so that other researchers can replicate our results and continue to build on them.

Finally, we demonstrated, also via a notebook, how basic keyword search can be seamlessly integrated with modern neural modeling techniques. On top of initial candidate documents retrieved from Pyserini, we implemented a simple unsupervised sentence highlighting technique to draw a reader’s attention to the most pertinent passages in a document, using the pretrained BioBERT model (Lee et al., 2020) from the HuggingFace Transformers library (Wolf et al., 2019). We used BioBERT to convert sentences from the retrieved candidates and the query (which we treat as a sequence of keywords) into sets of hidden vectors.7 We compute the cosine similarity between every combination of hidden states from the two sets, corresponding to a sentence and the query. We choose the top-K words in the context, and then highlight the top sentences that contain those words. Despite its unsupervised nature, this approach appeared to accurately identify pertinent sentences based on context. Originally meant as a simple demonstration of how keyword search can be seamlessly integrated with neural network components, this notebook provided the basic approach for sentence highlighting that we would eventually deploy in the Neural Covidex (details below).

3.2 Keyword Search with Faceted Browsing

Python modules and notebooks are useful for fellow researchers, but it would be unreasonable to expect end users (for example, clinicians) to use them directly. Thus, we considered it a priority to deploy an end-to-end search application over CORD-19 with an easy-to-use interface.

Fortunately, our team had also been working on this, dating back to early 2019. In Clancy et al. (2019), we described integrating Anserini with Solr, so that we can use Anserini as a frontend to index directly into the Solr search platform. As Solr is also built on Lucene, such integration was not very onerous.

3 https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
4 http://pyserini.io/
5 https://pypi.org/project/pyserini/
6 https://twitter.com/lintool/status/1241881933031841800
7 We used the hidden activations from the penultimate layer immediately before the final softmax layer.
Figure 1: Screenshot of our “basic” Covidex keyword search application, which builds on Anserini, Solr, and Blacklight, providing basic BM25 ranking and faceted browsing.

On top of Solr, we were able to deploy the Blacklight search interface,8 which is an application written in Ruby on Rails. In addition to providing basic support for query entry and results rendering, Blacklight also supports faceted browsing out of the box. With this combination—which had already been implemented for other corpora—our team was able to rapidly create a fully-featured search application on CORD-19, which we shared with the public on March 23 over social media.9

A screenshot of this interface is shown in Figure 1. Beyond standard “type in a query and get back a list of results” capabilities, it is worthwhile to highlight the faceted browsing feature. From CORD-19, we were able to easily expose facets corresponding to year, authors, journal, and source. Navigating by year, for example, would allow a user to focus on older coronavirus research (e.g., on SARS) or the latest research on COVID-19, and a combination of the journal and source facets would allow a user to differentiate between pre-prints and the peer-reviewed literature, and between venues with different reputations.

3.3 The Neural Covidex

The Neural Covidex is a search engine that takes advantage of the latest advances in neural ranking architectures, representing a culmination of our current efforts. Even before embarking on this project, our team had been active in exploring neural architectures for information access problems, particularly deep transformer models that have been pretrained on language modeling objectives: we were the first to apply BERT (Devlin et al., 2019) to the passage ranking problem. BERTserini (Yang et al., 2019) was among the first to apply deep transformer models to retrieval-based question answering directly on large corpora. Birch (Yilmaz et al., 2019) represents the state of the art in document ranking (as of EMNLP 2019). All of these systems were built on Anserini.

In this project, however, we decided to incorporate our latest work based on ranking with sequence-to-sequence models (Nogueira et al., 2020). Our reranker, which consumes the candidate documents retrieved from CORD-19 by Pyserini using BM25 ranking, is based on the T5-base model (Raffel et al., 2019) that has been modified to perform a ranking task.

8 https://projectblacklight.org/
9 https://twitter.com/lintool/status/1242085391123066880
Given a query q and a set of candidate documents d ∈ D, we construct the following input sequence to feed into T5-base:

Query: q Document: d Relevant: (1)

The model is fine-tuned to produce either “true” or “false” depending on whether the document is relevant or not to the query. That is, “true” and “false” are the ground truth predictions in the sequence-to-sequence task, what we call the “target words”.

At inference time, to compute probabilities for each query–document pair (in a reranking setting), we apply a softmax only on the logits of the “true” and “false” tokens. We rerank the candidate documents according to the probabilities assigned to the “true” token. See Nogueira et al. (2020) for additional details about this logit normalization trick and the effects of different target words.

Since we do not have training data specific to CORD-19, we fine-tuned our model on the MS MARCO passage dataset (Nguyen et al., 2016), which comprises 8.8M passages obtained from the top 10 results retrieved by the Bing search engine (based on around 1M queries). The training set contains approximately 500k pairs of query and relevant documents, where each query has one relevant passage on average; non-relevant documents for training are also provided as part of the training data. Nogueira et al. (2020) and Yilmaz et al. (2019) had both previously demonstrated that models trained on MS MARCO can be directly applied to other document ranking tasks. We hoped that this would also be the case for CORD-19.

We fine-tuned our T5-base model with a constant learning rate of 10^−3 for 10k iterations with class-balanced batches of size 256. We used a maximum of 512 input tokens and one output token (i.e., either “true” or “false”, as described above). In the MS MARCO passage dataset, none of the inputs required truncation when using this length limit. Training the model takes approximately 4 hours on a single Google TPU v3-8.

For the Neural Covidex, we used the paragraph index built by Anserini over CORD-19 (see Section 3.1). Since some of the documents are longer than the length restrictions of the model, it is not feasible to directly apply our method to the entire text at once. To address this issue, we first segment each document into spans by applying a sliding window of 10 sentences with a stride of 5. We then obtain a probability of relevance for each span by performing inference on it independently. We select the highest probability among these spans as the relevance probability of the document. Note that with the paragraph index, keyword search might retrieve multiple paragraphs from the same underlying article; our technique essentially takes the highest-scoring span across all these retrieved results as the score for that article to produce a final ranking of articles. That is, in the final interface, we deduplicate paragraphs so that each article only appears once in the results.

A screenshot of the Neural Covidex is shown in Figure 2. By default, the abstract of each article is displayed, but the user can click to reveal the relevant paragraph from that article (for those with full text). The most salient sentence is highlighted, using exactly the technique described in Section 3.1 that we initially prototyped in a notebook.

Architecturally, the Neural Covidex is currently built as a monolith (with future plans to refactor into more modular microservices), where all incoming API requests are handled by a service that performs searching, reranking, and text highlighting. Search is performed with Pyserini (as discussed in Section 3.1), reranking with T5 (discussed above), and text highlighting with BioBERT (also discussed in Section 3.1). The system is built using the FastAPI Python web framework, which was chosen for speed and ease of use.10 The frontend UI is built with React to support the use of modular, declarative JavaScript components,11 taking advantage of its vast ecosystem.

The system is currently deployed across a small cluster of servers, each with two NVIDIA V100 GPUs, as our pipeline requires neural network inference at query time (T5 for reranking, BioBERT for highlighting). Each server runs the complete software stack in a simple replicated setup (no partitioning). On top of this, we leverage Cloudflare as a simple load balancer, which uses a round robin scheme to dispatch requests across the different servers.12 The end-to-end latency for a typical query is around two seconds.

On April 2, 2020, a little more than a week after publicly releasing the basic keyword search interface and associated components, we launched the Neural Covidex on social media.13

10 https://fastapi.tiangolo.com/
11 https://reactjs.org/
12 https://www.cloudflare.com/
13 https://twitter.com/lintool/status/1245749445930688514
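The inference-time scoring just described, a softmax over only the “true”/“false” logits followed by taking the maximum span probability per article, can be sketched as follows. The function names are ours and the logits are assumed to come from a T5 forward pass, which is omitted here.

```python
import math
from collections import defaultdict

def relevance_probability(true_logit, false_logit):
    """Softmax restricted to the 'true'/'false' token logits (the logit
    normalization trick described above); returns P('true')."""
    m = max(true_logit, false_logit)  # subtract max for numerical stability
    e_true = math.exp(true_logit - m)
    e_false = math.exp(false_logit - m)
    return e_true / (e_true + e_false)

def rank_articles(span_scores):
    """span_scores: iterable of (article_id, true_logit, false_logit), one
    entry per sliding-window span. Each article's score is the maximum
    relevance probability across all of its spans, which also deduplicates
    multiple retrieved paragraphs of the same article. Returns article ids
    ordered best-first."""
    best = defaultdict(float)
    for article_id, t, f in span_scores:
        best[article_id] = max(best[article_id], relevance_probability(t, f))
    return sorted(best, key=best.get, reverse=True)
```

Taking the maximum over spans means one strongly relevant passage is enough to promote an article, regardless of how many weakly scored spans it also produced.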
Figure 2: Screenshot of our Neural Covidex application, which builds on BM25 rankings from Pyserini, neural
reranking using T5, and unsupervised sentence highlighting using BioBERT.
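The unsupervised sentence highlighting from Section 3.1 can be approximated by the following simplified sketch: each candidate sentence is scored by its best token-level cosine match against the query. We assume the hidden vectors have already been extracted from BioBERT's penultimate layer; the deployed version additionally selects top-K context words before highlighting, which this sketch omits.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def highlight_sentences(query_vecs, sentence_vecs, top_k=1):
    """query_vecs: token hidden vectors for the query; sentence_vecs: a list
    of sentences, each a list of token hidden vectors. Scores each sentence
    by the best cosine similarity over all (query token, sentence token)
    pairs and returns the indices of the top_k sentences, best first."""
    scores = []
    for vecs in sentence_vecs:
        best = max(cosine(q, s) for q in query_vecs for s in vecs)
        scores.append(best)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]
```

In practice the vectors would come from a forward pass of BioBERT over the query and each retrieved candidate, with highlighting applied to the winning sentences in the UI.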