2407.03618v1
Xing Han Lù
McGill University and Mila Quebec AI Institute
xing.han.lu@mail.mcgill.ca
ing BM25 scores during indexing and storing them into sparse matrices. It also achieves considerable speedups compared to highly optimized Java-based implementations, which are used by popular commercial products. Finally, BM25S reproduces the exact implementation of five BM25 variants based on Kamphuis et al. (2020) by extending eager scoring to non-sparse variants using a novel score shifting method. The code can be found at https://github.com/xhluca/bm25s

1 Background

Sparse lexical search algorithms, such as the BM25 family (Robertson et al., 1995), remain widely used as they do not need to be trained, can be applied to multiple languages, and are generally faster, especially when using highly efficient Java-based implementations. Those Java implementations, usually based on Lucene[3], are accessible inside Python via the Pyserini reproducibility toolkit (Lin et al., 2021), and through HTTP by using the Elasticsearch web client[4]. The Lucene-based libraries are known to be faster than existing Python-based implementations, such as Rank-BM25[5].

This work shows that it is possible to achieve a significant speedup compared to existing Python-based implementations by introducing two improvements: eagerly calculating all possible scores that can be assigned to any future query token when indexing a corpus, and storing those calculations inside sparse matrices to enable faster slicing and summations. The idea of sparse matrices was explored in BM25-PT[6]; BM25S builds on it by simplifying the implementation and introducing a strategy to generalize to other variants of the original BM25. Unlike BM25-PT, BM25S does not rely on PyTorch, and instead uses Scipy's sparse matrix implementation. Whereas BM25-PT multiplies bag-of-words vectors with the document matrix, BM25S instead slices relevant indices and sums across the token dimension, removing the need for matrix multiplications.

At the implementation level, BM25S also introduces a simple but fast Python-based tokenizer that combines Scikit-Learn's text splitting (Pedregosa et al., 2011), Elastic's stopword list[7], and (optionally) integrates a C-based implementation of the Snowball stemmer (Bouchet-Valat, 2014). This achieves better performance compared to the subword tokenizers (Kudo and Richardson, 2018) used by BM25-PT. Finally, it implements top-k retrieval with an average O(n) time complexity when selecting the k most relevant documents from a set of n scores associated with each document.

[1] https://numpy.org/
[2] https://scipy.org/
[3] https://lucene.apache.org/
[4] https://www.elastic.co/elasticsearch
[5] https://github.com/dorianbrown/rank_bm25
[6] https://github.com/jxmorris12/bm25_pt
[7] https://www.elastic.co/guide/en/elasticsearch/guide/current/stopwords.html

2 Implementation

The implementation described below follows the study by Kamphuis et al. (2020).

Calculation of BM25  Many variants of BM25 exist, which could lead to significant confusion about the exact scoring method used in a given implementation (Kamphuis et al., 2020). By default, we use the scoring method proposed by Lucene. Thus, for a given query Q (tokenized into q_1, ..., q_{|Q|}) and document D from collection C,
we compute the following score[8]:

    B(Q, D) = \sum_{i=1}^{|Q|} S(q_i, D)
            = \sum_{i=1}^{|Q|} IDF(q_i, C) \cdot \frac{TF(q_i, D)}{\mathcal{D}}

where \mathcal{D} = TF(q_i, D) + k_1 (1 - b + b \cdot |D| / L_{avg}), L_{avg} is the average length of documents in corpus C (calculated in number of tokens), and TF(q_i, D) is the term frequency of token q_i within the set of tokens in D. The IDF is the inverse document frequency, which is calculated as:

    IDF(q_i, C) = \ln\left( \frac{|C| - DF(q_i, C) + 0.5}{DF(q_i, C) + 0.5} + 1 \right)

where the document frequency DF(q_i, C) is the number of documents in C containing q_i. Although B(Q, D) depends on the query, which is only given during retrieval, we show below how to reformulate the equation to eagerly calculate the TF and IDF during indexing.

Eager index-time scoring  Let's now consider all tokens in a vocabulary V, denoted by t ∈ V. We can reformulate S(t, D) as:

    S(t, D) = TF(t, D) \cdot \frac{IDF(t, C)}{\mathcal{D}}

When t is a token that is not present in document D, then TF(t, D) = 0, leading to S(t, D) = 0 as well. This means that, for most tokens in vocabulary V, we can simply set the relevance score to 0, and only compute values for tokens t that are actually in the document D. This calculation can be done during the indexing process, thus avoiding the need to compute S(q_i, D) at query time, apart from straightforward summations.

Assigning Query Scores  Given our sparse matrix of shape |V| × |C|, we can use the query tokens to select relevant rows, leaving us a matrix of shape |Q| × |C|, which we can then sum across the column dimension, resulting in a single |C|-dimensional vector (representing the score of each document for the query).

Efficient Matrix Sparsity  We implement a sparse matrix in Compressed Sparse Column (CSC) format (scipy.sparse.csc_matrix)[9], which provides an efficient conversion between the coordinate and CSC formats. Since we slice and sum along the column dimension, this implementation is the optimal choice among sparse matrix implementations. In practice, we replicate the sparse operations directly using Numpy arrays.

Tokenization  To split the text, we use the same regular expression pattern used by Scikit-Learn (Pedregosa et al., 2011) for their own tokenizers, which is r"(?u)\b\w\w+\b". This pattern conveniently parses words in UTF-8 (allowing coverage of various languages), with \b handling word boundaries. Then, if stemming is desired, we can stem all words in the vocabulary, which can be used to look up the stemmed version of each word in the collection. Finally, we build a dictionary mapping each unique (stemmed) word to an integer index, which we use to convert the tokens into their corresponding index, thus significantly reducing memory usage and allowing them to be used to slice Scipy matrices and Numpy arrays.

Top-k selection  Upon computing scores for all documents in a collection, we can complete the search process by selecting the top-k most relevant elements. A naive approach would be to sort the score vector and select the last k elements; instead, we take a partition of the array, selecting only the last k documents (unordered). Using an algorithm such as Quickselect (Hoare, 1961), we can accomplish this in an average time complexity of O(n) for n documents in the collection, whereas sorting requires O(n log n). If the user wishes to receive the top-k results in order, sorting the partitioned documents takes an additional O(k log k), which is a negligible increase in time complexity assuming k ≪ n. In practice, BM25S allows the use of two implementations: one based on numpy, which leverages np.argpartition, and another in jax, which relies on XLA's top-k implementation. Numpy's argpartition uses[10] the introspective selection algorithm (Musser, 1997), which modifies the quickselect algorithm to ensure that the worst-case performance remains in O(n). Although this guarantees optimal time complexity, we observe that JAX's implementation achieves better performance in practice.

[8] We follow the notations of Kamphuis et al. (2020).
[9] https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html
[10] https://numpy.org/doc/stable/reference/generated/numpy.argpartition.html
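The index-time pipeline described above can be sketched end-to-end in a few dozen lines. This is a minimal illustration, not the actual BM25S API: the function names, toy corpus, and pure-Python loops are made up for clarity, and the real library replicates the sparse operations with vectorized Numpy code.

```python
import numpy as np
import scipy.sparse as sp

def build_index(corpus_tokens, k1=1.5, b=0.75):
    """Eagerly compute S(t, D) for every token t present in each document D,
    storing the scores in a |V| x |C| sparse CSC matrix (zeros elsewhere)."""
    vocab = {t: i for i, t in enumerate(
        sorted({t for doc in corpus_tokens for t in doc}))}
    n_docs = len(corpus_tokens)
    doc_lens = np.array([len(doc) for doc in corpus_tokens])
    l_avg = doc_lens.mean()

    # Document frequency DF(t, C), then the Lucene IDF given in the text.
    df = np.zeros(len(vocab))
    for doc in corpus_tokens:
        for t in set(doc):
            df[vocab[t]] += 1
    idf = np.log(1 + (n_docs - df + 0.5) / (df + 0.5))

    rows, cols, vals = [], [], []
    for j, doc in enumerate(corpus_tokens):
        for t in set(doc):
            i = vocab[t]
            tf = doc.count(t)
            denom = tf + k1 * (1 - b + b * doc_lens[j] / l_avg)
            rows.append(i)
            cols.append(j)
            vals.append(idf[i] * tf / denom)
    scores = sp.csc_matrix((vals, (rows, cols)), shape=(len(vocab), n_docs))
    return vocab, scores

def retrieve(query_tokens, vocab, scores, k=2):
    """Slice the query rows, sum over the token dimension, then take the
    top-k with an O(n) average-time partition instead of a full sort."""
    q_ids = [vocab[t] for t in query_tokens if t in vocab]
    doc_scores = np.asarray(scores[q_ids].sum(axis=0)).ravel()
    top_k = np.argpartition(doc_scores, -k)[-k:]       # unordered top-k
    return top_k[np.argsort(doc_scores[top_k])[::-1]]  # optional O(k log k) sort
```

Because all S(t, D) values are precomputed, retrieval reduces to row slicing, a column-wise sum, and a partition over the resulting |C|-dimensional score vector.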
Dataset BM25S ES PT Rank present in any given document D, their value in the
ArguAna 573.91 13.67 110.51 2.00 score matrix will be 0). Then, during retrieval, we
Climate-FEVER 13.09 4.02 OOM 0.03 can simply compute S θ (qi ) for each query qi ∈ Q,
CQADupstack 170.91 13.38 OOM 0.77
DBPedia 13.44 10.68 OOM 0.11
and sum it up to get a single scalar that we can add
FEVER 20.19 7.45 OOM 0.06 to the final score (which would not affect the rank).
FiQA 507.03 16.96 20.52 4.46 More formally, for an empty document ∅, we
HotpotQA 20.88 7.11 OOM 0.04
MSMARCO 12.20 11.88 OOM 0.07 define S θ (t) = S(t, ∅) as the nonoccurrence score
NFCorpus 1196.16 45.84 256.67 224.66 for token t. Then, the differential score S ∆ (t, D)
NQ 41.85 12.16 OOM 0.10 is defined as:
Quora 183.53 21.80 6.49 1.18
SCIDOCS 767.05 17.93 41.34 9.01 S ∆ (t, D) = S(t, D) − S θ (t)
SciFact 952.92 20.81 184.30 47.60
TREC-COVID 85.64 7.34 3.73 1.48 Then, we reformulate the BM25 (B) score as:
Touche-2020 60.59 13.53 OOM 1.10
|Q|
X
Table 1: To calculate the throughput, we calculate the number B(Q, D) = S(qi , D)
of queries per second (QPS) that each model can process i=1
for each task in the public section of the BEIR leaderboard; |Q|
instances achieve over 50 QPS are shown in bold. We compare X
BM25S, BM25-PT (PT), Elasticsearch (ES) and Rank-BM25 = S(qi , D) − S θ (qi ) + S θ (qi )
(Rank). OOM indicates failure due to out-of-memory issues. i=1
|Q|
X
= S ∆ (qi , D) + S θ (qi )
Multi-threading We implement optional multi-
i=1
threading capabilities through pooled executors11 |Q| |Q|
to achieve further speed-up during retrieval. X
∆
X
= S (qi , D) + S θ (qi )
Alternative BM25 implementations Above, we i=1 i=1
3
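The score-shifting identity above can be checked numerically. In this sketch the score values and the per-token nonoccurrence scores are invented purely for illustration; the point is that subtracting S^θ(t) from each row makes the stored matrix sparse while leaving the final document scores unchanged.

```python
import numpy as np

# Toy scores S(t, d): rows = 3 query tokens, cols = 4 documents.
# In non-sparse variants, S(t, d) can be nonzero even when t is absent from d.
S = np.array([[0.2, 0.9, 0.2, 0.5],
              [0.1, 0.1, 0.8, 0.1],
              [0.4, 0.3, 0.3, 0.3]])

# Nonoccurrence score S_theta(t) = S(t, empty document): the score a token
# receives in any document that does not contain it (constant per token).
S_theta = np.array([0.2, 0.1, 0.3])

# Differential scores are sparse: zero wherever the token is absent.
S_delta = S - S_theta[:, None]

# B(Q, D) = sum_i S_delta(q_i, D) + sum_i S_theta(q_i)
B_shifted = S_delta.sum(axis=0) + S_theta.sum()
B_direct = S.sum(axis=0)

assert np.allclose(B_shifted, B_direct)    # identical final scores
assert np.count_nonzero(S_delta) < S.size  # but far fewer stored entries
```

Since the correction term Σ S^θ(q_i) depends only on the query, it is a single scalar added to every document's score and therefore never changes the ranking.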
Stop  Stem   Avg.  AG    CD    CF    DB    FQ    FV    HP    MS    NF    NQ    QR    SD    SF    TC    WT
Eng.  None   38.4  48.3  29.4  13.1  27.0  23.3  48.2  56.3  21.2  30.6  27.3  74.8  15.4  66.2  59.5  35.8
Eng.  Snow.  39.7  49.3  29.9  13.6  29.9  25.1  48.1  56.9  21.9  32.1  28.5  80.4  15.8  68.7  62.3  33.1
None  None   38.3  46.8  29.6  13.6  26.6  23.2  48.8  56.9  21.1  30.6  27.8  74.2  15.2  66.1  58.3  35.9
None  Snow.  39.6  47.7  30.2  13.9  29.5  25.1  48.7  57.5  21.7  32.0  29.1  79.7  15.6  68.5  61.6  33.4

Table 2: NDCG@10 results of different tokenization schemes (including and excluding stopwords and the Snowball stemmer) on all BEIR datasets (Appendix A provides a list of datasets). We notice that including both stopwords and stemming modestly improves the performance of the BM25 algorithm.
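The tokenization pipeline whose variants Table 2 compares (regex splitting, stopword removal, stemming, integer vocabulary ids) can be sketched as follows. The tiny stopword list and the suffix-stripping "stemmer" here are illustrative stand-ins only: the real implementation uses Elastic's full stopword list and a C-based Snowball stemmer.

```python
import re

# Illustrative stand-ins, NOT the real resources used by BM25S.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is"}
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # same pattern as Scikit-Learn

def fake_stem(word):
    # Placeholder for a Snowball stemmer; not a real stemming algorithm.
    return word[:-1] if word.endswith("s") else word

def tokenize(text, vocab):
    """Split, drop stopwords, stem, then map each (stemmed) word to an
    integer id, growing the vocabulary dict as new words appear."""
    ids = []
    for word in TOKEN_PATTERN.findall(text.lower()):
        if word in STOPWORDS:
            continue
        stemmed = fake_stem(word)
        if stemmed not in vocab:
            vocab[stemmed] = len(vocab)
        ids.append(vocab[stemmed])
    return ids

vocab = {}
ids = tokenize("The cats and the dogs", vocab)
# "the"/"and" are dropped; "cats" -> "cat", "dogs" -> "dog"
assert vocab == {"cat": 0, "dog": 1} and ids == [0, 1]
```

Storing documents as lists of integer ids (rather than strings) is what allows them to be used directly to slice the Scipy score matrix at query time.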
k1   b     Variant    Avg.   AG    CD    CF    DB    FQ    FV    HP    MS    NF    NQ    QR    SD    SF    TC    WT
1.5  0.75  BM25PT     –      44.9  –     –     –     22.5  –     –     –     31.9  –     75.1  14.7  67.8  58.0  –
1.5  0.75  PSRN       40.0*  48.4  –     14.2  30.0  25.3  50.0  57.6  22.1  32.6  28.6  80.6  15.6  68.8  63.4  33.5
1.5  0.75  R-BM25     39.6   49.5  29.6  13.6  29.9  25.3  49.3  58.1  21.1  32.1  28.5  80.3  15.8  68.5  60.1  32.9
1.5  0.75  Elastic    42.0   47.7  29.8  17.8  31.1  25.3  62.0  58.6  22.1  34.4  31.6  80.6  16.3  69.0  68.0  35.4
1.5  0.75  Lucene     39.7   49.3  29.9  13.6  29.9  25.1  48.1  56.9  21.9  32.1  28.5  80.4  15.8  68.7  62.3  33.1
0.9  0.4   Lucene     41.1   40.8  28.2  16.2  31.9  23.8  63.8  62.9  22.8  31.8  30.5  78.7  15.0  67.6  58.9  44.2
1.2  0.75  Lucene     39.9   48.7  30.1  13.7  30.3  25.3  50.3  58.5  22.6  31.8  29.1  80.5  15.6  68.0  61.0  33.2
1.2  0.75  ATIRE      39.9   48.7  30.1  13.7  30.3  25.3  50.3  58.5  22.6  31.8  29.1  80.5  15.6  68.1  61.0  33.2
1.2  0.75  BM25+      39.9   48.7  30.1  13.7  30.3  25.3  50.3  58.5  22.6  31.8  29.1  80.5  15.6  68.1  61.0  33.2
1.2  0.75  BM25L      39.5   49.6  29.8  13.5  29.4  25.0  46.6  55.9  21.4  32.2  28.1  80.3  15.8  68.7  62.9  33.0
1.2  0.75  Robertson  39.9   49.2  29.9  13.7  30.3  25.4  50.3  58.5  22.6  31.9  29.2  80.4  15.5  68.3  59.0  33.8

Table 3: Comparison of different variants and parameters on all BEIR datasets (Appendix A provides a list of datasets). Following the recommended range of k1 ∈ [1.2, 2] by Schütze et al. (2008), we try both k1 = 1.5 and k1 = 1.2 with b = 0.75. Additionally, we use k1 = 0.9 and b = 0.4 following the parameters recommended in BEIR. We additionally benchmark five of the BM25 variants described in Kamphuis et al. (2020). *Note that Pyserini's average results are estimated, as the experiments for CQADupStack (CD) did not terminate due to OOM errors.
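Several variants in Table 3 differ mainly in their IDF term. As a hedged sketch: only the Lucene IDF is given explicitly in this paper; the "Robertson" form shown below follows the common Okapi formulation in which the +1 inside the logarithm is absent, so its IDF can go negative for very common terms, while the Lucene form stays positive.

```python
import math

def idf_lucene(df, n_docs):
    # As given in the text: ln(1 + (|C| - DF + 0.5) / (DF + 0.5))
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))

def idf_robertson(df, n_docs):
    # Classic Okapi form (no +1 inside the log); assumed here, not from the paper.
    return math.log((n_docs - df + 0.5) / (df + 0.5))

n = 1000
assert idf_robertson(900, n) < 0  # very common term: negative contribution
assert idf_lucene(900, n) > 0     # Lucene variant remains positive
# For rare terms the two forms are nearly identical:
assert abs(idf_lucene(3, n) - idf_robertson(3, n)) < 0.01
```

This helps explain why the Lucene and Robertson rows in Table 3 are so close on most datasets: they only diverge on queries dominated by very frequent terms.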
make lexical search more accessible to a broader audience.

References

C. A. R. Hoare. 1961. Algorithm 65: find. Commun. ACM, 4(7):321–322.

Stephen E. Robertson, Steve Walker, Susan Jones, Micheline M. Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication Sp, 109:109.

François Rousseau and Michalis Vazirgiannis. 2013. Composition of TF normalizations: new insights on scoring functions for ad hoc IR. Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval.

Hinrich Schütze, Christopher D. Manning, and Prabhakar Raghavan. 2008. Introduction to Information Retrieval, volume 39. Cambridge University Press, Cambridge.

A Appendix

Hardware  To calculate the queries per second, we run our experiments using a single-threaded approach. In the interest of reproducibility, our experiments can be reproduced on Kaggle's free CPU instances[18], which are equipped with an Intel Xeon CPU @ 2.20GHz and 30GB of RAM. This setup reflects consumer devices, which tend to have fewer CPU cores and rarely exceed 32GB of RAM.

BEIR Datasets  BEIR (Thakur et al., 2021) contains the following datasets: ArguAna (AG; Wachsmuth et al., 2014), Climate-FEVER (CF; Diggelmann et al., 2020), DBpedia-Entity (DB; Hasibi et al., 2017), FEVER (FV; Thorne et al., 2018), FiQA (FQ; Maia et al., 2018), HotpotQA (HP; Yang et al., 2018), MS MARCO (MS; Campos et al., 2016), NQ (NQ; Kwiatkowski et al., 2019), Quora (QR)[19], SciDocs (SD; Cohan et al., 2020), SciFact (SF; Wadden et al., 2020), TREC-COVID (TC; Roberts et al., 2020), Touche-2020 (WT; Bondarenko et al., 2020).

[18] https://www.kaggle.com/
[19] https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs