Performance Evaluation of Single vs. Batch of Queries On GPUs

Received: 30 March 2019 Revised: 7 June 2019 Accepted: 3 July 2019
DOI: 10.1002/cpe.5474
SPECIAL ISSUE PAPER
Performance evaluation of single vs. batch of queries on GPUs
Roussian Gaioso1 Veronica Gil-Costa2 Helio Guardia1 Hermes Senger1
1 Department of Computer Science, Federal
University of São Carlos, São Carlos, Brazil Summary

2 CONICET, UNSL, San Luis, Argentina
The WAND processing strategy is a dynamic pruning algorithm designed for large scale Web
Correspondence search engines where fast response to queries is a critical service. The WAND is used to reduce
Hermes Senger, Department of Computer the amount of computation by scoring only documents that may become part of the top-k
Science, Federal University of São Carlos,
document results. In this paper, we present two parallel strategies for the WAND algorithm
13565-905 São Carlos-SP, Brazil.
Email: senger.hermes@gmail.com and compare their performance on GPUs. In our first strategy (named size-based), the posting
lists are evenly partitioned among thread blocks. Our second strategy (named range-based)
Funding information partitions the posting lists according to document identifier intervals; thus, partitions may have
CAPES (Coordenação de Aperfeiçoamento de
Pessoal de Nível Superior - Código de
different sizes. We also propose three threshold sharing policies, named Local, Safe-R, and
Financiamento 001); FAPESP, Grant/Award Safe-WR, which emulate the WAND algorithm global pruning technique. We evaluated our
Number: 2015/24461-2; CNPQ, Grant/Award proposals with different amounts of work, from short to extra-large queries, using single query
Number: 305032/2015-1; FAPESP,
processing and batch of queries. Results show that the size-based strategy reports the highest
Grant/Award Number: 2018/00452-2
speedups but at the cost of low quality of results. The range-based algorithm retrievals the
exact top-k documents and maintains a good speedup. Moreover, both strategies are capable
of scaling as the amount of work is increased. In addition, there is no significant difference in
the performance of the three threshold sharing policies.
KEYWORDS
batch of queries, GPUs, top-k query processing, WAND algorithm
1 INTRODUCTION
The web is continuously growing in two important dimensions, which impact the scalability of web search engines. First, the amount of webpages
and documents available on the web is increasing. The more documents there are to be searched, the larger are the document indexes and the
longer it takes for each individual query to be processed. Second, the interest of people in searching these contents is increasing, which increases
the demand for web search engines with higher throughput. In such scenario, a scalable query processing system ideally should be capable of
answering queries with high quality search results and maintaining reasonable throughput and latency.1
To process user queries with high throughput and low response time, current commercial search engines use large clusters consisting of
thousands of nodes, multi-core processors, usually placed on large datacenters. Web documents are commonly mapped by search engines via
inverted indexes on per-document or per-term bases. These indexes are used for finding the most relevant documents in a query using mainly
two strategies. The Term-at-a-time (TAAT) strategy processes query terms one by one to determine their impact from all relevant documents.
The Document-at-a-time (DAAT) strategy evaluates the contributions of each document to all query terms at once.2
The WAND ranking algorithm3 implements a DAAT strategy that first runs a fast approximate evaluation on candidate documents, and then
makes a full slower evaluation limited to promising candidates. Because it allows many documents to be skipped without fully computing their
scores, this optimization was further used in other algorithms such as the block-max WAND (BMW)4 and MaxScore.5 On the other hand, its
step-by-step sequential nature makes this class of algorithms not a clear candidate for parallelization.
Motivated by the high performance delivered by GPUs, we propose and evaluate two index partition strategies to parallelize the query
processing based on the DAAT approach. We have implemented and tested our proposals with the WAND ranking algorithm. Our strategy for
parallelization is based in partitioning the posting lists to be searched in parallel by many GPU threads, while dealing with the information sharing
among threads to minimize processing latency without loss of relevant documents. Moreover, we propose two synchronization strategies to
process batch of queries. Our strategies show promising speedups.
Concurrency Computat Pract Exper. 2019;e5474. wileyonlinelibrary.com/journal/cpe © 2019 John Wiley & Sons, Ltd. 1 of 21
https://doi.org/10.1002/cpe.5474
2 of 21 GAIOSO ET AL.
The remainder of this paper is organized as follows. Section 2 presents the background and related work on query processing algorithms and
parallel processing using GPUs. Section 3 presents our proposed parallel strategies for GPUs. Section 4 presents the results for single queries.
Section 5 presents the synchronization strategies for batch of queries and Section 6 presents their performance. Finally, Section 7 brings the
conclusions.
2 BACKGROUND AND RELATED WORK
Search engines determine the top-k most relevant documents for term queries by executing a ranking algorithm on an inverted index. This index
enables the fast determination of the documents that contain the query terms and contains data to calculate document scores for ranking. The
index is composed of a table of terms and, for each term, there is a posting list with information including the document identifiers (docIDs) where
the term appears, the frequency (number of times the term appears in the document) along with additional data used for ranking purposes. To
solve a query, it is necessary to get from the posting lists the set of documents associated with the query terms and then to perform a ranking of
these documents in order to select the top-k documents as the query answer.
The WAND algorithm was originally conceived having a single execution flow to iterate over an inverted index. It processes each query by
looking for query terms in the inverted index and retrieving each posting list. Documents referenced from the intersection of the posting lists
allow to answer conjunctive queries (AND bag of word query) and documents retrieved from at least one posting list allow to answer disjunctive
queries (OR bag of word query). The algorithm uses static and dynamic values. The static value, called the upper-bound, is calculated for each
term of the inverted index at construction time whereas the dynamic value, called the threshold, starts in zero for each new query and is updated
during the computations across the inverted index.
It uses a standard docID sorted index and it is based on two levels. In the first level, some potential documents are selected as results using an
approximate evaluation. Then, in the second level those potential documents are fully evaluated (eg, using the BM25 or vector model) to obtain
their scores. A heap keeps track of the current top-k documents being the lowest ranked one at the root. The root score provides a threshold
value, which is used to decide on running a full score evaluation for each of the remaining documents in the posting lists associated with the
query terms. The reference to a new document is inserted in the heap when the WAND operator is true, ie, when the score of the new document
is greater than the score of the document with the minimum score stored in the heap. If the heap is full, the document with the minimum score is
replaced, updating the value of the threshold. Documents with a score smaller than the threshold in the heap are skipped.
This scheme allows skipping many documents that would have been evaluated by an exhaustive algorithm. To this end, the algorithm iterates
through posting lists to evaluate them quickly using a pointer movement strategy based on pivoting. In other words, pivot terms and pivot
documents are selected to move forward in the posting lists which allow skipping many documents that would have been evaluated by an
exhaustive algorithm. A larger threshold usually allows skipping more documents, reducing computational costs involved in score calculation.
Figure 1 shows an example of the WAND algorithm for a query with three terms (‘‘tree, cat and house’’). The posting lists of the query terms
are sorted by docIDs upper bounds (UBs). Then we add the upper bounds of the terms until we get a value greater or equal to the threshold.
In this example, the sum of the UBs of the first two terms is 2 + 4 ≥ 6. Thus, the term ‘‘cat’’ is selected as the pivot term. We assume that
the current document in this posting list is ‘‘503,’’ so this document becomes the pivot document. If the first two posting lists do not contain
the document 503, we proceed to select the next pivot. Otherwise, we compute the score of the document. If the score is greater or equal to
the threshold, we update the heap by removing the root document and adding the new document. This process is repeated until there are no
documents to process or until it is no longer possible for the sum of the upper bounds to exceed the current threshold.
2.1 GPUs
The GPU basic architecture consists of a set of streaming multiprocessors (SMs), each one containing several streaming processors (SPs). All
SPs (thin cores) inside a SM (fat core) execute the same instructions, but they do it over different data instances, according to the SIMD (Single
Instruction, Multiple Data) model. The amount of SPs and the number of SMs in a GPU differ from one graphic card to another. The GPU supports
thousands of lightweight concurrent threads and, unlike the CPU threads, the overhead of creating and switching threads is negligible. The
threads on each SM are organized into thread groups that share computation resources such as registers. A thread group is divided into multiple
schedule units, called warps, which are dynamically scheduled on the SM. Because of the SIMD nature of the SP's execution units, if threads in
a schedule unit must perform different operations, such as going through branches, these operations will be executed serially as opposed to in
FIGURE 1 WAND algorithm execution example

GAIOSO ET AL. 3 of 21
parallel. Additionally, if a thread stalls on a memory operation, the entire warp will be stalled until the memory access is completed. In this case,
the SM selects another ready warp and switches to it. The GPU global memory is typically measured in gigabytes of capacity. It is an off-chip
memory and has both a high bandwidth and a high access latency. To compensate the high latency of this memory, it is important to have more
threads than the number of SPs and to have threads in a warp accessing consecutive memory addresses that can be easily coalesced. The GPU
also provides a fast on-chip shared memory, which is accessible by all SPs of a SM. The size of this memory is small, but it has a low latency and it
can be used as a software-controlled cache. Moving data from the CPU to the GPU and vice versa is done through a PCIExpress connection.
A GPU program exposes parallelism through a data-parallel SPMD (Single Program Multiple Data) kernel function. The programmer can
configure the number of threads to be used. Threads execute data parallel computations of the kernel and are organized in groups (thread blocks)
that are further organized into a grid structure. When a kernel is launched, the blocks within a grid are distributed on idle SMs while the threads
are mapped to the SPs. Threads within a thread block are executed by the SPs of a single SM and can communicate through the SM shared
memory. Furthermore, each thread inside a block has its own registers (private local memory) and uses a global thread block index, and a local
thread index within a thread block, to uniquely identify its data. Threads that belong to different blocks cannot communicate explicitly and have
to rely on the global memory to share their results.
2.2 Related work

Ranking algorithms are executed over the inverted index to return the top-k documents for user queries. They scan the posting lists associated
with the query terms to determine the documents that are the most similar ones to the query. One important bottleneck in query ranking is the
length of the inverted index, which is usually kept in compressed format. To avoid processing the entire posting lists and reduce the amount
of computations, the ranking algorithms use dynamic pruning techniques. Some ranking algorithms for inverted indexes have been proposed
in other works.3,4,6 In this work, we focus on the WAND3 algorithm, which uses a pointer movement strategy based on pivoting to skip many
documents that would be evaluated by an exhaustive algorithm.
Rojas et al7 proposed an algorithm to reduce the communication and computation cost of calculating the top-k documents for user queries. The
algorithm was designed for a distributed cluster of multi-core processors. Each processor holds a portion of the inverted index and executes the
BM-WAND algorithm,4 which is a variant of WAND, using a local heap approach or a shared heap approach. Later, the same authors,8 proposed
two multi-threaded approaches for the WAND algorithm. In the first one, all threads collaborate to calculate the top-k documents using a shared
heap. In the second approach, threads work independently using a local heap. Local heaps are merged when the last thread finishes.
The work presented by Bonacic et al9 is devised for multi-cores, and considers read and write conflicts when user queries are processed at the
same time as the inverted index is updated. The proposed algorithms are devised to process batch of queries by assigning an appropriate number
of threads to resolve user queries and avoid idle threads at the end of each batch.
Ding et al10 proposed to build an information retrieval system using GPUs. The authors calculated the BM25 score for ranking purposes and
to compute the top-k documents they create groups of fixed size of candidate documents, selecting the maximum score in each group until they
found the top-k. However, they do not use a dynamic pruning technique like WAND. Similarly, Ao et al11 studied the problem of uncompressed
sorted lists intersection on the GPU and how to efficiently intersect compressed lists.
Tadros12 used GPUs to reduce the documents ranking latency per query. He evaluated a two-phase ranking algorithm. The first-phase runs
on the CPU and selects documents that could potentially be a good match to the query. The second-phase runs on the GPU and it is more
expensive as it extracts documents features and uses them to rank the documents. A speedup of 19× was obtained when using the GPU against
a CPU-based ranking algorithm.
Wu et al13 proposed a collaborative CPU-GPU scheme to perform lists intersection, which is much simpler than executing the ranking WAND
algorithm, and used a scheduler algorithm to determine whether CPU or GPU processes the query faster. Queries are grouped into batches which
are processed by GPU threads in parallel. Later, the work presented by Zhang et al14 proposed new techniques for improving the performance
of the GPU batched algorithm previously presented by Wu et al.13
Huang et al15 proposed a CPU-GPU collaborative algorithm to process batch of queries. Only short queries are processed on the GPU
meanwhile long queries are processed in the CPU. Inside the GPU, the document partition is fixed. Each thread is responsible of processing a
fixed number of documents (#docs/#threads). It limits the threads to work with small posting lists and small values of k. Moreover, each query is
processed by a single thread block. Our work is different from this paper as our goal is to use the GPU to execute queries of any length including
long queries, which are dominant for most real-world search engines.
More recently, Liu et al16 described an information retrieval system that dynamically combines CPU and GPU to process individual queries
according to their characteristics, system overhead and GPU memory. A scheduler algorithm is used to divide the work between the CPU and the
GPU. The GPU-algorithm combines a fast decompression algorithm17 and a load-balancing merge-based parallel list intersection.18 The authors
improved the query processing by 10× compared to a CPU-based query processing.
In a previous work,19 we proposed two strategies for the parallel execution of WAND algorithm to process single queries. In this paper, we
elaborated on our previous work, by proposing improvements on the previous strategies and proposing two novel strategies for batches of queries.
The performance of all strategies were evaluated in a more comprehensive set of experiments with the full TREC ClueWeb09 (categoryB) dataset.
FIGURE 2 The size-based partitioning of the posting lists
3 A PARALLEL IMPLEMENTATION OF WAND FOR SINGLE QUERY EXECUTION
In this section, we present strategies for the parallel processing of single queries using the WAND algorithm on GPUs. WAND performs a DAAT
analysis by iterating over the same docID on the posting lists of all query terms. Our approach to a parallel WAND preserves its DAAT strategy
by having processor groups to work on partitions of all posting lists. In the first strategy, all the posting lists are divided into fixed size partitions,
each containing the same amount of document identifiers (docIDs), at least for the longest term list. This scheme is depicted in Figure 2. The
documents are partitioned according to the desired number of multiprocessors (SMs) and partition size. This size-based partitioning strategy
aims at simplicity and at maximizing thread occupancy through better load balancing. Given each query term may appear in different documents;
however, this partition strategy may lead to a misalignment of docIDs among the posting lists in a partition.
A second partitioning strategy was then proposed, which fragments the posting lists according to the range of docIDs. This strategy aims at
maximizing the effect of the DAAT approach, by allowing each multiprocessor (SM) to determine the exact impact of each relevant document for
each query term. As a result of this range-based partitioning, each partition will contain the same range of docIDs in the respective posting lists
for the query terms. In our implementation, this partitioning is initially performed as with the size-based strategy and, in a second step, document
ranges of the partitions are adjusted according to the docIDs located at the beginning and end of each partition. The initial docID of a partition is
set to the largest docID located in the first position outside the starting edge (leftmost position) of the partition of each inverted list by adding
one unit to its value. The final docID is the highest value of docID located at the rightmost position of the partition of each inverted list. This
process is illustrated in Figure 3. At the end of this process, the processors are not aware of the amount of documents in each partition, but
only know the initial and final documents of the document ranges. As depicted in Figure 4, the range-based strategy produces partitions with
different sizes.
Our strategies can be implemented on GPU architectures and other manycore accelerators composed of a hierarchy of processors described
as follows. The top level of the hierarchy is composed of one or more multiprocessors (fat cores), each one executing its own instruction stream.
Multiprocessors are processing elements intended for coarse-grained parallelism. In the second level of the hierarchy, each multiprocessor is
composed of a set of simple processing units called Stream Processors (thin cores). Stream processors are noted as fine-grained sharing the same
resources within the multiprocessor and work synchronously in the same instruction stream (SIMD). From this consideration, we can formally
define this GPU abstraction as follows:
1. A set of coarse-grained processors P = {p1 , p2 , … , ps }, where s is the number of processors;

2. Each coarse-grained processor pi , where 1 ≤ i ≤ s, has a set of fine-grained processing units Core[1..b] that share the same resources, where
b is the number of SP in each Multiprocessor.
The strategies for the parallel execution of each single query are detailed in Algorithms 1 and 2. Algorithm 1 formalizes the coarse-grained
distributed query processing. The parallelization occurs with function calls on lines 2 and 7. The ParallelMatchProcessing function (kernel)
FIGURE 3 The pre-processing of the range-based partitioning of the posting lists
FIGURE 4 The range-based partitioning of the posting lists
encapsulates the parallel WAND algorithm, whereas the ParallelMerge function combines in parallel all the partial top-k documents to obtain only
the most relevant k documents. The synchronization barrier on line 4 ensures that all partial results are obtained before the results are merged.
These strategies described in Algorithm 1 do not include the processing for partitioning the inverted query lists, which is performed dynamically
on each coarse-grained processor.
The number of processors involved in the algorithm depends on the size of the largest inverted list for the query terms, and each processor
will initially work on one partition. Thus, the time complexity to perform the classification of the documents of the inverted lists of a query is
N
𝜃( |P|m ) in the worst case, where Nm is the number of docIDs in the largest inverted lists for the query terms. The top-k documents resulting from
the partitioning are merged by the log(|P|) calls of the ParallelMerge function. We propose that the merge algorithm be processed in parallel on
N
the GPU, where each call runs in time that depends on the constant k. Therefore, Algorithm 1 runs in 𝜃( |P|m + log(|P|)) time.
Algorithm 2 details the fine-grained parallelization strategy encapsulated by the ParallelMatchProcessing function. This algorithm is similar to
the original WAND presented in the work of Broder et al.3 The input of the algorithm includes the list of query terms and their corresponding
posting lists (terms), the number of top documents to be retrieved (k), and a set of coarse-grained processors (s), each containing Cores[1..b]
(fine-grained processors). For each partition, the algorithm maintains a pointer to each posting list of query terms (terms) and to the next
document (nextDoc) to be evaluated in each list. The algorithm begins by initializing in parallel these list pointers and the docID range in
each posting list in each partition, ie, the smallest and largest docIDs (dmin , dmax ) of each list in each partition. This can be seen in lines 3
and 4, where functions RangeDocs and NextDoc are called. The NextDoc function determines the next docID to be evaluated from these
docIDs.
Our approach for fine-grained parallelization is applied to several steps within the ParallelMatchProcessing function, ie, the analysis of similarity
models for the selected document (FullScore), the management of the list of relevant documents (ManageTopkDocs), the obtainment of the
next document (NextDoc), and the sorting of the query terms (Sort). Nevertheless, the iterations over the partition documents are sequentially
conserved by the loop in line 8. A document is selected at each iteration to be analyzed and classified. The analysis consists in verifying
(line 9) if the document can have sufficient contribution to be among the most relevant documents. The classification is the application of
similarity models in each posting list where a docID is present. This analyzed and classified document is called pivot document. If the pivot
document score is high enough to be in the list of top-k documents at a certain point in the processing, the document will be inserted
in the local data structure that stores the most relevant documents. This detailed parallelization strategy on Algorithm 2 only considers
local list within the coarse-grained processor. This removes the need to implement policies that guarantee the integrity of access to this
structure.
The partitioning of the inverted lists is performed by the RangeDocs function and depends on the selected strategy. For the range-based
strategy, the initial and final docIDs in the partial posting lists of all query terms will be the same, as returned by the RangeDocs function. When
using the size-based strategy, the docID boundaries in each list will be different.
For each partition, a local threshold is set as the lowest score of all docIDs in the current local top-k list. This value is dynamically updated as the
list changes and is used to define which document to use as the pivot (lines 7 and 27) in each iteration of the loop of line 8. As the local threshold
increases, the faster the algorithm instance proceeds, as more documents are skipped. By sharing the local thresholds among the fine-grained
processors, shown in line 28 (ThresholdSharingStrategy), we are able to improve the overall processing performance of our algorithm. This function
encapsulates one of the sharing policies proposed here to ensure that the quality of the final results is not degraded. The advantage of this
sharing is that it can be performed on each selected pivot document, as demonstrated in Algorithm 2, or to each certain amount of documents
analyzed.
3.1 Threshold sharing policies

Because pruning in WAND is based on a threshold value, our parallel implementation of this algorithm includes a mechanism for the processing
elements to share their local values. Three different threshold policies have been considered, ie, Local, Safe-R shared, and Safe-Write-Read Shared.
(1) Using the Local threshold, each thread block relies solely on its local value, at the cost of skipping less documents. (2) In our second policy, all
thread blocks share the current threshold value by means of a global variable. Document skipping is high, but so are the synchronization costs.
In order to avoid relevant documents do be skipped, threshold sharing is deferred until a thread block fulfills its heap. The global threshold is
initially set to zero, so every thread block can safely read its value for evaluating its documents. For this reason, we called this policy Safe-R
shared. A possible drawback of this strategy is the delay to share the threshold, thus causing a slow increment of the global threshold at the
beginning of the algorithm. (3) To circumvent the threshold sharing delay, a third policy is proposed, which causes each thread block to defer
sharing its threshold until a minimum amount of documents have been selected. Although this strategy can anticipate threshold sharing, there is
still a risk of losing relevant documents, so this minimum amount of documents should be high enough to reduce the possibility of propagating a
high threshold value too early. We have defined that each processor puts off the read and write operations on the global variable until it fills at
least half the local top-k document list. This strategy is called Safe-Write-Read Shared (Sh-WR).
3.2 Implementations
In order to evaluate our proposed strategies for the parallelization of the WAND algorithm, we have implemented them on a CUDA GPU. This
architecture consists of a number of SPMD multiprocessors, on which virtualization techniques provide a massive availability of threads. Thread
blocks are executed on multiprocessors and individual threads are executed by scalar processors. On top of this architecture, we can define the
number of partitions for the inverted lists as a function of the number of threads per block. This provides the needed flexibility for the strategies
detailed in Algorithms 1 and 2.
The thread blocks will process the partitions of the inverted lists, and if the number of partitions exceeds the number of thread blocks, each
block will be responsible for more than one partition. In order to improve data reuse, the local top-k list of a thread block can be preserved for all
partitions this block will handle.
Figure 5 illustrates this reuse of data. The threshold in a thread block may preserve its value from a previously processed partition, increasing
the skipping ratio if there are more partitions than the available number of thread blocks. The appropriate number of partitions, however, needs
to be determined according to the characteristics of the GPU used, and in combination with the characteristics of the queries.
FIGURE 5 Reuse of the results of processing of the partitions

TABLE 1 Execution times in milliseconds for the sequential algorithm Number of Operation Query Max. number Execution
terms length of docIDs time (ms)
2 OR Short 89 615 14.01
2 OR Medium 476 771 42.86
2 OR Large 1 032 795 45.52
2 OR Extra 5 494 285 202.79
2 AND Short 89 615 13.83
2 AND Medium 476 771 42.43
2 AND Large 1 032 795 45.44
2 AND Extra 5 494 285 200.77
4 EXPERIMENTS WITH SINGLE QUERIES
In our tests, we experiment with a 50.2 million document corpus TREC ClueWeb09 (category B), which is part of the corpus ClueWeb09,*
indexed with the Terrier IR platform.† The index has 29 GB being that the most frequent terms produced posting lists with 2.7 millions of docIDs.
All experiments were repeated 20 times, removing the highest and the lowest values. The results presented here are the average of the 18
remaining executions.
The baseline (sequential) version of the WAND algorithm was executed on an Intel Core i7-5820K processor of 3.30 GHz and 64 GB of RAM,
using only one core. The queries were composed by two terms with posting lists of different sizes (given by the number of docIDs). Terms were
randomly selected so that the total number of docIDs of each query is illustrated in Table 1. For instance, the first row shows queries composed
of two terms whose posting lists sum up 89 615 docIDs. We set k=128 (ie, select the top-128 most relevant documents). The parallel approaches
presented in this work were executed in a NVIDIA GPU Titan Xp with 3840 CUDA cores, 30 multiprocessors (SM), and 12 GB memory.
4.1 Performance of the baseline (sequential) algorithm

The average execution times for twenty executions of the baseline are presented in Table 1. Notice that the algorithm is very efficient. For
instance, consider OR queries with two terms and k=128. Although the number of docIDs increase from 89 615 to 5 494 285 (61 times larger),
the execution time increases from 14.01 to 202.79 (only 14.47 times longer). Once the algorithm computes the threshold value for a given k, the
threshold is used to prune the search and therefore more documents are skipped. In other words, the longer the posting lists to be searched, the
more documents the algorithm skips. When executing AND queries, the algorithm reports lower running times than when executing OR queries.
To process AND queries, the algorithm has to select the docIDs assigned to all the terms, thus skipping even more docIDs.
The execution times of the sequential implementation of the algorithm WAND for OR queries are shown in Table 1 (in milliseconds). These
times were used as reference for the calculation of speedup. The algorithm spent 14.01 ms to evaluate 897 615 documents, whereas to evaluate
queries with 5.49 million documents it spent 202.79 ms. In all experiments, the queries search for the 128 most relevant documents (k = 128).
4.2 Performance evaluation for OR queries

In the experiments, we evaluate the performance of the size-based and docID range-based partitioning strategies executing short, medium,
large, and extra-large OR queries. Queries with two terms were chosen because they are usual in web search engines. We used 32 threads per
SM to search for the top-128 documents. For merging the heaps, we used four mergers per SM. For each strategy, three approaches of the
implementations of the threshold were realized and evaluated, ie, local, safe-R shared, and safe-WR shared threshold. Both strategies use a global
implementation of the upper bound. Experiments were carried out also varying the partition size (with 32, and 64 docIDs per posting list), and
the number of partitions per thread block (1, 10). As shown in Table 2, such parameters impact the average execution times and speedups of the
parallel WAND for OR queries. The best speedups of each case were highlighted.
As expected, in most scenarios, the size-based partitioning strategy achieved better speedups due to higher occupancy of the SMs as all
partition have the same size. However, for the extra-long queries, the range-based strategy outperforms the size-based strategy. It happens
because the docID algorithm guarantees that the posting lists of different query terms have the same range of docIDs. In this case, more relevant
documents are captured earlier, thus making the threshold value to increase faster, which ultimately leads the algorithm to skip more documents.
This effect revealed to be more advantageous than the better occupancy of resources implemented by the size-based strategy for very long
posting lists.
As in previous experiment the average performance of queries with partitions of 32 docIDs was better, this partition length is set for the next
experiments. Results in Table 2 also suggest that using more (ie, 10) partitions per thread block leads to better results for long queries, while
less (only one) partitions per thread block improves performance of short queries. In these cases, there is an increase in the amount of work
* http://boston.lti.cs.cmu.edu/Data/clueweb09/.
† http://terrier.org/
TABLE 2 The speedup and execution times for OR queries with several query lengths
Partit. Threshold Part. # of Short Medium Large Extra
strat. sharing size Part. Time (ms) Speedup Time (ms) Speedup Time (ms) Speedup Time (ms) Speedup
Size Local 32 1 0.54 25.84 3.28 13.05 11.05 4.12 509.39 0.40
Size Local 32 10 1.73 8.08 2.22 19.34 4.62 9.85 57.14 3.55
Size Local 64 1 0.51 27.58 2.53 16.91 7.34 6.21 271.54 0.75
Size Local 64 10 3.02 4.63 3.41 12.55 4.30 10.58 35.01 5.79
Size Sh-R 32 1 0.57 24.55 3.42 12.54 11.31 4.02 508.60 0.40
Size Sh-R 32 10 1.75 8.03 2.05 20.86 3.50 12.99 49.32 4.11
Size Sh-R 64 1 0.54 25.83 2.62 16.34 7.46 6.10 270.41 0.75
Size Sh-R 64 10 3.19 4.39 3.14 13.63 3.11 14.63 27.12 7.48
Size Sh-WR 32 1 0.55 25.27 3.33 12.85 11.15 4.08 509.39 0.40
Size Sh-WR 32 10 1.74 8.04 2.30 18.60 3.89 11.70 50.08 4.05
Size Sh-WR 64 1 0.56 25.05 2.58 16.58 7.41 6.15 269.57 0.75
Size Sh-WR 64 10 3.22 4.35 3.20 13.38 3.62 12.56 27.70 7.32
Range Local 32 1 1.76 7.94 3.93 10.90 10.07 4.52 225.70 0.90
Range Local 32 10 4.11 3.41 4.23 10.14 7.72 5.90 36.38 5.57
Range Local 64 1 2.50 5.61 3.68 11.64 7.60 5.99 124.15 1.63
Range Local 64 10 5.46 2.57 5.86 7.32 11.75 3.87 31.98 6.34
Range Sh-R 32 1 2.51 5.58 3.79 11.31 7.20 6.32 218.14 0.93
Range Sh-R 32 10 4.19 3.34 3.87 11.08 4.56 9.99 24.78 8.19
Range Sh-R 64 1 2.74 5.11 3.78 11.34 5.10 8.93 118.70 1.71
Range Sh-R 64 10 6.02 2.33 4.96 8.64 4.36 10.44 19.45 10.43
Range Sh-WR 32 1 2.53 5.53 4.10 10.44 6.78 6.71 222.14 0.91
Range Sh-WR 32 10 3.35 4.18 4.26 10.05 4.22 10.78 25.41 7.98
Range Sh-WR 64 1 2.01 6.98 3.56 12.04 6.16 7.39 121.90 1.66
Range Sh-WR 64 10 5.07 2.76 4.08 10.51 4.37 10.41 20.20 10.04
to be executed per thread block, and the reuse of data among partitions executed by the same thread block can lead to better performance.
To investigate this, the next experiment varies the number of partitions per thread block as 1, 5, 10, 100, 150, and 200 (results are shown in
Figures 6 and 7).
As depicted in Figure 6, the longer the length of posting lists for the range-based strategy, the more partitions per thread block are needed
to achieve better speedups. For short queries, one partition per thread block leads to maximum speedup, while 10 partitions per block is more
suitable for medium and long queries, and 100 partitions maximized the performance of extra-large queries. On the other hand, this experiment
shows that all three alternative implementations of the threshold present similar performance. It can be explained as follows.
Consider, for instance, the experiment with medium queries, whose posting lists have near 238 thousands docIDs each (the half of
476 thousands). With 32 docIDs per partition, the query is partitioned into 7450 partitions approximately. With one partition per thread block, the
strategy creates the same amount of thread blocks, which initializes a heap with 128 empty slots to process only one partition. At the end of the
process, each of the 7450 blocks has one resulting heap containing at most 32 selected documents to merge. Furthermore, in this case, each block
computes a local threshold for every partition instead of cumulatively building a global threshold. With local thresholds, the algorithm does not skip
as many documents as expected and the parallel search takes longer to execute. On the other hand, consider the case that each block executes
200 partitions, which share the same heap and threshold. In this case, with only 37 blocks there will be high data reuse among partitions, only
37 heaps to merge, and threshold sharing will skip more documents. However, with only 37 blocks there will insufficient amount of work exposed
to the scheduler, as many of these blocks may not be ready for execution when the current block stalls due to a memory access. Hence, there
is a tradeoff between maximizing the work offer to the processors, and minimizing the overhead of work partitioning. For medium size queries,
best speedups can be achieved around five partitions per thread block (ie, near 1490 thread blocks per query). The increase in the number of
partitions per thread block only leads to performance gains when exist enough amount of work (ie, thread blocks) to maintain the processors
occupied.
Next, we execute similar experiment to evaluate the effect of varying the number of partitions per thread block for the range-based partitioning
strategy. Results presented in Figure 7 exhibit a similar behavior regarding the best number of partitions can be observed. However, the maximum
speedup for short queries decreased because disjunctive queries operate on a smaller amount of documents, so less documents are skipped
compared to conjunctive ones.
In summary, the size-based partitioning strategy lead to higher speedups for short and medium length queries, while the range-based
partitioning lead extra-large queries to better speedups when threshold sharing policies are used. However, the size-based strategy has a
drawback. It does not guarantee that each homogeneous partition takes the same range of docIDs in its posting lists. As a consequence, the
size-based strategy may lose relevant documents during the search process, thus producing less accurate results as shown in Table 3. Hence,
FIGURE 6 Performance for OR queries with the size-based partitioning strategy, three threshold sharing policies, and varying the number of
partitions per thread block
the size-based partitioning strategy provide good approximate results, and better speedups for short and medium length queries. As the parallel
range-based strategy returns exactly the same list of top-k documents returned by the sequential algorithm, its recall is 1.0. For this reason, the
recall for the range-based strategy is not shown in Table 3. The best speedups are summarized for both algorithms in Figure 8.
4.3 Performance evaluation for AND queries

In this section, we present results for AND queries for both range-based and size-based strategies. Table 4 shows the parameter configuration,
the execution time (in milliseconds), and the speedup for AND queries with two terms. We run experiments with one partition and 10 partitions
of sizes 32 and 64. We highlight in bolt the best speedups reported by each strategy and each threshold policy. The speedup obtained by the
size-based strategy was 40× for short queries, while the range-based strategy reached an speedup of 18× for the same kind of queries. The
improvement on the speedup was higher for AND queries than for OR queries. For example, for short queries, the parallel algorithm reports
an improvement of ∼ 35%. This is mainly because the number of documents to be evaluated is further drastically reduced when computing the
intersection of the posting lists. However, the range-based strategy reports lower speedups than the size-based strategy for short, medium,
and large queries due to the partitions assigned to each SM are not equally sized which tends to introduce imbalance among the SMs in order
to keep exact results. However, similarly to OR queries, only for extra-large query length, the range-based strategy reports higher speedup
than the size-based strategy, because the threshold is updated more efficiently reducing the effect of the unbalance workload among the SMs.
Additionally, the speedups tends to be similar with all three threshold policies.
In Figure 9, we evaluated the impact of the number of partition per thread block on the size-based strategy with different threshold policies.
We set the partition size to 32. The y−axis shows the speedup. For short queries, the best speedup is obtained with 1 partition per thread block.
FIGURE 7 Performance for OR queries with the range-based partitioning strategy, three threshold sharing policies, and varying the number of
Partitioning Threshold Short Medium Large Extra TABLE 3 Recall for the
size-based strategy with short,
strategy sharing Recall Std. dev. Recall Std. dev. Recall Std. dev. Recall Std. dev. medium, and long queries
Size Local 0.891 0.030 0.897 0.011 0.833 0.021 0.905 0.019
Size Sh-R 0.890 0.030 0.897 0.011 0.833 0.021 0.905 0.019
Size Sh-WR 0.891 0.030 0.897 0.011 0.833 0.021 0.905 0.019
For medium queries the best results are achieved with five partitions per block. For large queries, 10 partitions per thread block improves the
performance and in the case of extra-large queries the best performance is reported with 100 partitions. In Figure 10, we show the results for
the same experiment with the range-based strategy. In this case, short and medium queries report the best speedup with one partition per thread
block. For large queries, the best results are achieved with 10 partitions per block and, for extra-large queries, the best performance is reported
with 100 partitions.
Similar to OR-queries, these results show that it is not possible to determine a number of partitions that allows obtaining the best performance
for all cases (GPU-based partition strategies and threshold policies). This gives an example of the complexity of the WAND algorithm, which
depends on many parameters like the size of the queries to be processed and how the index is distributed/accessed in the different SMs.
Figure 11 shows a summary of the results achieves by each parallel query processing strategy and each threshold policy. The y-axis shows
the speedup and the x−axis shows the parallel strategy. In each case, we selected the number of partitions per thread block reporting the
best speedup as shown in Figures 10 and 9. In general, the size-based strategy reports higher speedups than the range-based strategy. The
FIGURE 8 Summary of the best

speedups obtained for OR
queries with the range-based and
size-based partitioning strategies
with three threshold sharing
policies (local, safe-R, and
safe-WR)
TABLE 4 Execution time and speedup obtained by the size-based and range-based strategies and different threshold policies for
short, medium, large, and extra-large AND queries with two terms
Partit. Threshold Part. # of Short Medium Large Extra
strat. sharing size Part. Time (ms) Speedup Time (ms) Speedup Time (ms) Speedup Time (ms) Speedup
Size Local 32 1 0.34 40.73 2.39 17.74 9.11 4.99 503.78 0.40
Size Local 32 10 1.59 8.70 2.05 20.69 3.79 11.99 53.67 3.74
Size Local 64 1 0.46 30.23 1.80 23.61 5.54 8.20 264.82 0.76
Size Local 64 10 2.79 4.95 3.27 12.99 4.08 11.14 31.18 6.44
Size Sh-R 32 1 0.36 38.64 2.44 17.37 9.20 4.94 502.65 0.40
Size Sh-R 32 10 1.74 7.97 1.96 21.61 3.12 14.57 49.34 4.07
Size Sh-R 64 1 0.49 28.31 1.88 22.61 5.65 8.04 259.13 0.77
Size Sh-R 64 10 3.17 4.36 3.18 13.35 3.22 14.12 27.32 7.35
Size Sh-WR 32 1 0.35 39.59 2.41 17.63 9.16 4.96 499.57 0.40
Size Sh-WR 32 10 1.76 7.87 2.06 20.63 3.28 13.86 50.53 3.97
Size Sh-WR 64 1 0.49 28.05 1.84 23.10 5.58 8.14 264.19 0.76
Size Sh-WR 64 10 3.23 4.29 3.25 13.04 3.43 13.25 27.81 7.22
Range Local 32 1 0.77 18.04 2.90 14.62 8.23 5.52 217.83 0.92
Range Local 32 10 2.46 5.62 4.35 9.75 6.88 6.61 33.91 5.92
Range Local 64 1 1.17 11.86 2.71 15.68 6.23 7.30 119.35 1.68
Range Local 64 10 4.09 3.38 6.12 6.94 8.90 5.11 30.30 6.63
Range Sh-R 32 1 0.85 16.23 2.90 14.64 8.43 5.39 217.66 0.92
Range Sh-R 32 10 2.83 4.89 4.07 10.43 6.70 6.78 25.00 8.03
Range Sh-R 64 1 1.29 10.69 2.68 15.86 7.11 6.39 116.10 1.73
Range Sh-R 64 10 4.77 2.90 5.57 7.62 8.56 5.31 19.89 10.09
Range Sh-WR 32 1 0.85 16.18 2.97 14.27 8.72 5.21 216.24 0.93
Range Sh-WR 32 10 2.82 4.91 4.13 10.27 5.68 7.99 25.95 7.74
Range Sh-WR 64 1 1.25 11.02 2.81 15.09 7.55 6.02 117.87 1.70
Range Sh-WR 64 10 4.69 2.95 5.70 7.45 8.67 5.24 20.19 9.95
results obtained with different threshold policies tend to be very similar; thus, they have a low impact on the performance. For short queries,
the size-based strategy outperforms the range-based strategy for more than twice of the speedup. This difference is reduced as we increase
the query length to extra-large. This behavior can be explained as follows. The size-based strategy balance the workload among the SMs and,
with short queries, the threshold value does not have a significant pruning effect on the top-k document results. On the other hand, with the
range-based strategy, the workload is unbalanced among the SMs. However, for extra-large queries, the threshold value drastically prune the
non-relevant documents, which helps to compensate the loss of performance caused by the workload imbalance.
FIGURE 9 Performance for AND queries with the size-based partitioning strategy, three threshold sharing policies, and varying the number of
Table 5 shows the recall for AND-queries executed with the size-based strategy. With a larger query size, the recall tends to decrease. However,
in all cases, the strategy provide good approximate results with a recall higher than 80%. The standard deviation is lower for AND queries than
for OR-queries (0.012 versus 0.030 for short queries). In other words, all the values are more similar to each other and closer to the mean.
5 PARALLEL EXECUTION OF BATCHES OF QUERIES
State-of-the-art implementations of single queries can be executed in few milliseconds.15,19 However, because there are many factors that impact
the total latency in a query, longer response times may be acceptable. In an attempt to further explore the use of GPUs in web search queries,
we investigate the possibility of grouping queries prior to processing, even if at the cost of a small increase in latency. In our tests, a batch of
query is produced by buffering requests for a period of time or until a predetermined number of request have arrived. In this section, we propose
and investigate two strategies for the parallel execution of batches of queries on GPUs.
5.1 The asynchronous query batch processing strategy

Our first approach for processing a batch of queries in parallel is shown in Algorithm 3. The idea is to iterate over the batch and split each query
at a time to be handled in parallel by the GPU coarse grain processors. As with our strategy for single queries, the coarse-grained processors
work in parallel on different partitions of the posting lists, and fine-grained processors simultaneously handle different posting lists within a
partition.
Algorithm 3 is inspired by the BSP model, having two super-steps to handle each query, the first one processes the inverted lists, and the
second combines the partial results produced by the coarse-grained processors.
After the coarse-grained processors have each selected its most relevant documents, a merge procedure takes place in 𝜃(log(|P|)) steps, as
seen in the calls to ParallelMerge. All involved coarse-grained processors then await at a synchronization barrier (line 10), before another merge
round. Eventually, one of the processors will have the merged list of the global top k most relevant documents for the query. The iteration process
then moves on to the next query in the batch.
FIGURE 10 Performance for AND queries with the range-based partitioning strategy, three threshold sharing policies, and varying the number of
FIGURE 11 Summary of the best

speedups obtained for AND queries
with the range-based and size-based
partitioning strategies with three
threshold sharing policies (local, safe-R,
and safe-WR)
TABLE 5 Recall for AND-queries executed with the size-based strategy

Partitioning Threshold Short Medium Large Extra
strategy sharing Recall Std. dev. Recall Std. dev. Recall Std. dev. Recall Std. dev.
Size Local 0.911 0.012 0.907 0.011 0.821 0.016 0.812 0.015
Size Sh-R 0.891 0.013 0.887 0.021 0.821 0.016 0.812 0.015
Size Sh-WR 0.911 0.012 0.857 0.019 0.821 0.016 0.812 0.015
5.2 The synchronous query batch processing strategy

Our second batch processing strategy handles multiple queries in parallel, as can be seen in Algorithm 4. Each query from the batch is assigned
to a coarse-grained processor via a call to ParallelMatchProcessing, shown in line 2. All coarse-grained processors work in parallel, handling the
complete posting lists for the corresponding query terms, without partitioning them. The WAND algorithm is applied to the whole document
lists, but a limited number of local fine-grained processors will work in parallel for the execution of selected activities within our implementation.
Because there is no partitioning, there is also no need for synchronization and merging.
A barrier at line 4 ensures all queries in the batch have been processed before returning the results. Thus, the response time in this case will
depend on the query with the longest post list, and larger number of documents to evaluate.
5.3 Implementation details for batch processing

In the asynchronous batch processing strategy, queries are launched one at a time. Each query is divided in many partitions, which can be
assigned for parallel execution in multiple coarse-grained processors. In our CUDA implementation, a mechanism called CUDA Streams allows
multiple queries to proceed in parallel, without having to wait until the previous query is concluded. While the instructions in a stream will run in
the order they are issued, different streams (queries) can be interleaved and run in parallel as illustrated in Figure 12. By using streams in CUDA,
our implementation was able to run several instances of the ParallelMatchProcessing function in parallel and there was no need for an explicit
synchronization between calls when iterating sequentially through the batch of queries. On the other hand, the GPU architecture poses a limit
to the maximum number of concurrent streams. For the Pascal architecture, we used in our tests such limit is 32 streams in parallel, which was
reached by our implementation.
FIGURE 12 Query batch mapping in CUDA Streams
5.4 Query processing using thread blocks

In the implementation of our second approach for batch processing, all queries were launched for simultaneous execution on the GPU. In order
to do so, each query was mapped to a thread block, which logically corresponds to a coarse-grained processor. Multiple thread blocks were
also organized in grids, which allowed all queries to run, asynchronously, in parallel. In this sense, the observed performance for running a batch
depended on the slowest query, probably one whose terms had the longest posting lists.
6 EXPERIMENTS WITH BATCHES OF QUERIES
In this section, we evaluate the strategies for the parallel execution of batches of queries proposed in Section 5. For the experiments, we used
the same hardware and the same 50.2 million document corpus TREC ClueWeb09 (category B) indexed by Terrier as described in Section 4.
However, instead of submitting individual queries, we composed a batch of 500 queries, which was submitted for execution in parallel. The query
log represents real requests in the Web environment. Experiments were replicated ten times and the execution times averaged.
The baseline (sequential) version of the WAND algorithm was executed on an Intel Core i7-5820K processor of 3.30 GHz and 64 GB of RAM,
using only one core. The sequential WAND algorithm was implemented in Java and executed one batch of 500 queries in 17.106 seconds, on
average. The batch algorithms invoke the parallel implementation of the WAND algorithm to execute individual queries, as described in Section 5.
In this experiment, the parallel WAND uses the three threshold propagation approaches, and a global implementation of the upper bound.
6.1 Discussion on the batch size

In real web search engines, the size of the query batch size is a matter of investigation. The batch size can be dynamically adjusted according to
external parameters like the query traffic rate, and internal parameters like availability of resources. When query traffic is low, it is convenient
to process one query at a time. On the other hand, when query traffic is high, it is convenient to process several queries in batch.20 We used
a Yahoo query log21 that corresponds to a 3-month period in 2009. This log contains 2 109 198 distinct queries that correspond to 3 991 719
queries. We used batches of 500 queries for our experiments because its execution time (around 1 second) can be tolerated without degrading
the quality of experience of users.
6.2 The performance of the asynchronous strategy

In a first exploratory experiment, we varied number of partitions per thread block as 1 and 10. The size of inverted lists in each partition varies as
32, 64, and 128 docIDs. The three threshold sharing policies are also investigated. Results are presented in Table 6. The best performance results
are shown in bold type.
As shown in Table 6, the range-based strategy outperforms the size-based strategy in this scenario, because the full scoring of the documents
performed by the range-based strategy lead to skipping more documents than the size-based strategy. Moreover, the increase in the number of
partitions per thread block from 1 to 10 increased the data reuse of the WAND algorithm and the amount of work per thread block, leading to an
increase from 4 to 10 times in speedup. In addition, note that the increase in the number of docIDs per partition lead to a noticeable increase in
performance. Regarding the threshold propagation policies, there is no significant variation in performance for size-based strategies. However,
for the range-based strategy, we can observe an improvement around 30-40% in speedup.
TABLE 6 The speedup and execution times of asynchronous strategy for query batch
Partitioning Threshold Number of 32 docIDs per partition 64 docIDs per partition 128 docIDs per partition
strategy sharing partitions Time (s) Speedup Time (s) Speedup Time (s) Speedup
Size Local 1 52.17 0.33 26.10 0.66 13.51 1.27
Size Local 10 5.58 3.07 3.05 5.60 1.77 9.65
Size Sh-R 1 51.51 0.33 25.97 0.66 13.16 1.30
Size Sh-R 10 5.07 3.37 2.63 6.51 1.33 12.86
Size Sh-WR 1 51.51 0.33 25.82 0.66 12.74 1.34
Size Sh-WR 10 5.06 3.38 2.68 6.39 1.41 12.10
Range Local 1 8.59 1.99 4.90 3.49 2.83 6.05
Range Local 10 1.81 9.43 1.42 12.01 1.16 14.70
Range Sh-R 1 7.54 2.27 4.07 4.20 2.49 6.86
Range Sh-R 10 1.37 12.49 1.02 16.74 0.82 20.76
Range Sh-WR 1 7.51 2.28 4.06 4.21 2.34 7.32
Range Sh-WR 10 1.37 12.45 1.02 16.73 0.85 20.24
As suggested by the results shown in Table 6, the increase from 1 to 10 partitions per thread block increases the performance of the algorithms.
To investigate this, the next experiment varies the number of partitions per thread block from 1 to 250. The result is presented in Figure 13.
As depicted in Figure 13, the best performance can be achieved from 100 to 250 partitions per thread block. However, in some cases, the
increase of the number of partitions above this number can lead to a decrease in the occupation of GPU resources, thus slightly decreasing the
performance of the algorithm.
FIGURE 13 Logarithmic execution times (s) of the asynchronous size-based strategy with the threshold propagation policies for the query batch
Next, in Figure 14, we present the effect of threshold sharing policies for the best performance scenario, which uses 128 docIDs per posting
list in the partition. As shown, the two policies that share threshold lead to better performance than local threshold as global threshold allows the
algorithms to skip more documents, thus achieving better performance.
FIGURE 14 Performance of the asynchronous size-based strategy

with the threshold propagation policies and partitions with 128
docIDs size for the query batch
FIGURE 15 Execution times (s) of the asynchronous range-based strategy with the three threshold sharing policies for the query batch
FIGURE 16 Performance of the asynchronous range-based strategy

with the three threshold sharing policies and partitions with 128
docIDs size for the query batch
32 threads per block 64 threads per block 128 threads per block TABLE 7 The speedup and execution times of synchronous strategy
for query batch
Speedup Time (s) Speedup Time (s) Speedup Time (s)
0.809 21.131 0.799 21.403 0.813 21.021
The previous experiments show the performance of batches of queries with the size-based strategy. Next, we present experiments which vary
the number of partitions per thread block for the range-based strategy. As depicted in Figure 15, the algorithm achieved the best performance
with 128 docIDs per partition in the range between 1 and 100 partitions per block, while above 100 partitions per block of threads, the best
performance was obtained by 32 and 63 docIDs per partition.
Finally, Figure 16 presents the speedup for the three threshold sharing policies for the batch of queries running with 128 docIDs per thread
block. Again, the two global threshold sharing policies have demonstrated to improve performance as global thresholds allow skipping more
documents, thus improving query performance.
6.3 The performance of the synchronous strategy for batches of queries

The second strategy proposed for the execution of batches of queries states that each streaming processor (SM) of the GPU can execute one
individual query of the batch, so that several queries can be executed in parallel by different SMs. It reduces the parameters to investigate to
the amount of threads per processor, as the number of partitions per thread block and threshold propagation policies have no effect when each
processor executes one single query at a time.
Table 7 shows the execution time and speedup of the synchronous strategy for batches of queries based on the parallel implementation of
WAND. As shown, the number of threads per block has low impact on performance. Furthermore, the synchronous strategy for batches of
queries has shown better performance than the asynchronous one. It can be explained as follows. As the asynchronous strategy assigns one
query per processor, there is no overhead for creating, sharing, and merging shared data structures (such as the threshold, and heap) among
processors. As the batch has enough number of queries (larger than the number of processors in the GPU), all processors are likely to have good
occupation during the execution of the batch. On the other hand, although the asynchronous strategy divides the same query into partitions
to be executed asynchronously in parallel, the algorithm executes only one query at a time. In this case, small queries (with less partitions to
execute) lead to low thread occupation, thus limiting speedups.
7 CONCLUSIONS
In this paper, we have proposed parallel strategies for single queries and batch of queries executed with the WAND ranking algorithm on GPUs.
In particular, we presented two strategies to partition the documents among the SMs. The first document partition strategy, named size-based,
evenly partitions the posting lists among thread blocks. This approach tends to balance the workload among the SMs at the cost of low quality of
results. Our second document partition strategy, named range-based, partitions the posting lists according to the document identifier. Partitions
have different sizes and therefore workload tends to be unbalance among the SMs. This second approach retrievals the exact top-k documents
for user queries. We also proposed three threshold sharing policies, named (1) local, (2) Safe-R, and (3) Safe-WR.
To process batch of queries, we proposed two synchronization strategies named (1) Synchronous and (2) Asynchronous. In the Synchronous,
each processor executes a different query. The Asynchronous strategy executes two steps. In the first step, it selects the relevant documents for
the batch of queries by accessing the posting lists. In the second step, it merges the partial results. Each step ends with a barrier synchronization.
We evaluated our proposals with different query lengths, with AND/OR queries and different parameters configurations (ie, partition sizes and
number of partitions). Results for single queries show that the size-based strategy allows better speedups (up to 35x for OR queries, and 40x for
OR queries) through higher occupancy of the SMs. Although, it can lose relevant documents during the search process, it reports a recall higher
than 80%. Thus, the size-based strategy can provide good approximate results. The range-based strategy returns the exact top-k documents
and show promising speedups (of up to 25x for OR queries, and 18x for AND queries). All three policies for threshold sharing reported similar
performance.
The execution of batches of queries has demonstrated to be an interesting strategy in terms of performance. The synchronous strategy
noticeably leads to better performance, achieving speedups of 21 times. The synchronous strategy executes one query per processor at a time,
eliminating the overhead of sharing data among processors, and increasing the occupation of GPU resources.
As future work, we plan to develop new strategies for processing batches of queries in multi-GPU, and heterogeneous (CPUGPU) computing
aiming at better load balancing and maximizing communication-computation overlapping.
ACKNOWLEDGMENTS
The authors thank CAPES (Coordenaaçã de Aperfeiaçoamento de Pessoal de Nível Superior - Código de Financiamento 001), and FAPESP
Contract 2015/24461-2. Hermes Senger also thanks CNPQ (Contract 305032/2015-1) and FAPESP (Contract 2018/00452-2) for their support.
The Titan Xp used for this research was donated by the NVIDIA Corporation.
ORCID
Hermes Senger https://orcid.org/0000-0003-1273-9809
REFERENCES
1. Cambazoglu BB, Baeza-Yates R. Scalability Challenges in Web Search Engines. San Rafael, CA: Morgan and Claypool Publishers; 2016.
2. Turtle H, Flood J. Query evaluation: strategies and optimizations. Inf Process Manag. 1995;31(6):831-850.
3. Broder A, Carmel D, Herscovici M, Soffer A, Zien J. Efficient query evaluation using a two-level retrieval process. In: Proceedings of the 20th
International Conference on Information and Knowledge Management (CIKM); 2003; New Orleans, LA.
4. Ding S, Suel T. Faster top-k document retrieval using block-max indexes. In: Proceedings of the 34th International ACM SIGIR Conference on Research
and Development in Information Retrieval; 2011; Beijing, China.
5. Chakrabarti K, Chaudhuri S, Ganti V. Interval-based pruning for top-k processing over compressed lists. Paper presented at: 2011 IEEE 27th
International Conference on Data Engineering (ICDE); 2011; Hannover, Germany.
6. Altingovde I, Demir E, Can F, Ulusoy Ö. Incremental cluster-based retrieval using compressed cluster-skipping inverted files. ACM Trans Inf Syst.
2008;26(3):1-36.
7. Rojas O, Gil-Costa V, Marin M. Distributing efficiently the Block-Max WAND algorithm. Procedia Comput Sci. 2013;18:120-129.
8. Rojas O, Gil-Costa V, Marin M. Efficient parallel Block-Max WAND algorithm. In: Proceedings of the 19th International Conference on Parallel
Processing (Euro-Par); 2013; Aachen, Germany.
9. Bonacic C, Bustos D, Gil-Costa V, Marin M, Sepulveda V. Multithreaded processing in dynamic inverted indexes for web search engines. In: Proceedings
of the 2015 Workshop on Large-Scale and Distributed System for Information Retrieval (LSDS-IR); 2015; Melbourne, Australia.
10. Ding S, He J, Yan H, Suel T. Using graphics processors for high performance IR query processing. In: Proceedings of the 18th International Conference
on World Wide Web (WWW); 2009; Madrid, Spain.
11. Ao N, Zhang F, Wu D, et al. Efficient parallel lists intersection and index compression algorithms using graphics processing units. Proc VLDB Endow.
2011;4(8):470-481.
12. Tadros R. Accelerating Web Search Using GPUs [PhD thesis]. Vancouver, Canada: University of British Columbia; 2015.
13. Wu D, Zhang F, Ao N, Wang G, Liu X, Liu J. Efficient lists intersection by CPU-GPU cooperative computing. Paper presented at: 2010 IEEE International
Symposium on Parallel and Distributed Processing, Workshops and Phd Forum (IPDPSW); 2010; Atlanta, GA.
14. Zhang F, Wu D, Ao N, Wang G, Liu X, Liu J. Fast lists intersection with Bloom filter using graphics processing units. In: Proceedings of the 2011 ACM
Symposium on Applied Computing (SAC); 2011; TaiChung, Taiwan.
15. Huang H, Ren M, Zhao Y, et al. GPU-accelerated Block-Max query processing. In: Algorithms and Architectures for Parallel Processing: 17th International
Conference, ICA3PP 2017, Helsinki, Finland, August 21-23, 2017, Proceedings. Cham, Switzerland: Springer International Publishing AG; 2017:225-238.
16. Liu Y, Wang J, Swanson S. Griffin: uniting CPU and GPU in information retrieval systems for intra-query parallelism. ACM SIGPLAN Not.
2018;53(1):327-337.
17. Vigna S. Quasi-succinct indices. In: Proceedings of the 6th ACM International Conference on Web Search and Data Mining (WSDM); 2013; Rome, Italy.
18. Green O, McColl R, Bader DA. GPU merge path: a GPU merging algorithm. In: Proceedings of the 26th ACM International Conference on
Supercomputing (ICS); 2012; Venice, Italy.
19. Gaioso R, Gil-Costa V, Guardia H, Senger H. A parallel implementation of WAND on GPUs. Paper presented at: 2018 26th Euromicro International
Conference on Parallel, Distributed and Network-based Processing (PDP); 2018; Cambridge, UK.
20. Marin M, Gil-Costa V, Bonacic C, Baeza-Yates RA, Scherson ID. Sync/Async parallel search for the efficient design and construction of web search
engines. Parallel Computing. 2010;36(4):153-168.
21. Mendoza M, Marín M, Gil-Costa V, Ferrarotti F. Reducing hardware hit by queries in web search engines. Inf Process Manag. 2016;52(6):1031-1052.
How to cite this article: Gaioso R, Gil-Costa V, Guardia H, Senger H. Performance evaluation of single vs. batch of queries on GPUs.
Concurrency Computat Pract Exper. 2019;e5474. https://doi.org/10.1002/cpe.5474

Performance Evaluation of Single vs. Batch of Queries On GPUs

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Performance Evaluation of Single vs. Batch of Queries On GPUs

Uploaded by

Copyright:

Available Formats

Received: 30 March 2019 Revised: 7 June 2019 Accepted: 3 July 2019

SPECIAL ISSUE PAPER

Performance evaluation of single vs. batch of queries on GPUs

Roussian Gaioso1 Veronica Gil-Costa2 Helio Guardia1 Hermes Senger1

1 Department of Computer Science, Federal

University of São Carlos, São Carlos, Brazil Summary

batch of queries, GPUs, top-k query processing, WAND algorithm

2 BACKGROUND AND RELATED WORK

FIGURE 1 WAND algorithm execution example

2.2 Related work

FIGURE 2 The size-based partitioning of the posting lists

3 A PARALLEL IMPLEMENTATION OF WAND FOR SINGLE QUERY EXECUTION

1. A set of coarse-grained processors P = {p1 , p2 , … , ps }, where s is the number of processors;

FIGURE 3 The pre-processing of the range-based partitioning of the posting lists

FIGURE 4 The range-based partitioning of the posting lists

3.1 Threshold sharing policies

FIGURE 5 Reuse of the results of processing of the partitions

4 EXPERIMENTS WITH SINGLE QUERIES

4.1 Performance of the baseline (sequential) algorithm

4.2 Performance evaluation for OR queries

4.3 Performance evaluation for AND queries

FIGURE 8 Summary of the best

5 PARALLEL EXECUTION OF BATCHES OF QUERIES

5.1 The asynchronous query batch processing strategy

FIGURE 11 Summary of the best

TABLE 5 Recall for AND-queries executed with the size-based strategy

5.2 The synchronous query batch processing strategy

5.3 Implementation details for batch processing

FIGURE 12 Query batch mapping in CUDA Streams

5.4 Query processing using thread blocks

6 EXPERIMENTS WITH BATCHES OF QUERIES

6.1 Discussion on the batch size

6.2 The performance of the asynchronous strategy

FIGURE 14 Performance of the asynchronous size-based strategy

FIGURE 16 Performance of the asynchronous range-based strategy

6.3 The performance of the synchronous strategy for batches of queries

Hermes Senger https://orcid.org/0000-0003-1273-9809

You might also like