Performance Evaluation of Single vs. Batch of Queries on GPUs
DOI: 10.1002/cpe.5474
1 INTRODUCTION
The web is continuously growing in two important dimensions, both of which impact the scalability of web search engines. First, the number of webpages and documents available on the web is increasing. The more documents there are to be searched, the larger the document indexes become and the longer it takes for each individual query to be processed. Second, people's interest in searching this content is increasing, which raises the demand for web search engines with higher throughput. In such a scenario, a scalable query processing system should ideally be capable of answering queries with high-quality search results while maintaining reasonable throughput and latency.1
To process user queries with high throughput and low response time, current commercial search engines use large clusters consisting of thousands of multi-core nodes, usually hosted in large datacenters. Search engines commonly map web documents via inverted indexes organized on a per-document or per-term basis. These indexes are used for finding the documents most relevant to a query, mainly via two strategies. The Term-at-a-time (TAAT) strategy processes query terms one by one, accumulating the impact of each term over all relevant documents. The Document-at-a-time (DAAT) strategy evaluates the contributions of each document to all query terms at once.2
The WAND ranking algorithm3 implements a DAAT strategy that first runs a fast approximate evaluation on candidate documents and then performs a slower, full evaluation limited to the promising candidates. Because it allows many documents to be skipped without fully computing their scores, this optimization was further used in other algorithms such as the block-max WAND (BMW)4 and MaxScore.5 On the other hand, its step-by-step sequential nature makes this class of algorithms a less obvious candidate for parallelization.
Motivated by the high performance delivered by GPUs, we propose and evaluate two index partitioning strategies to parallelize query processing based on the DAAT approach. We have implemented and tested our proposals with the WAND ranking algorithm. Our parallelization strategy is based on partitioning the posting lists to be searched in parallel by many GPU threads, while handling the information sharing among threads to minimize processing latency without losing relevant documents. Moreover, we propose two synchronization strategies to process batches of queries. Our strategies show promising speedups.
The remainder of this paper is organized as follows. Section 2 presents the background and related work on query processing algorithms and parallel processing using GPUs. Section 3 presents our proposed parallel strategies for GPUs. Section 4 presents the results for single queries. Section 5 presents the synchronization strategies for batches of queries, and Section 6 presents their performance. Finally, Section 7 presents the conclusions.
2 BACKGROUND AND RELATED WORK
Search engines determine the top-k most relevant documents for term queries by executing a ranking algorithm on an inverted index. This index enables the fast determination of the documents that contain the query terms and holds the data needed to calculate document scores for ranking. The index is composed of a table of terms and, for each term, a posting list with information including the document identifiers (docIDs) in which the term appears and the frequency (number of times the term appears in the document), along with additional data used for ranking purposes. To solve a query, it is necessary to retrieve from the posting lists the set of documents associated with the query terms and then to rank these documents in order to select the top-k documents as the query answer.
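To make this layout concrete, the following minimal sketch shows one possible in-memory representation of such an index, including the per-term score upper bound that WAND relies on (introduced below). All names are illustrative; this is not the paper's implementation.

```cuda
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// One entry of a posting list: a document identifier plus the term
// frequency used for scoring (additional ranking data omitted).
struct Posting {
    uint32_t docID;      // postings are sorted by ascending docID
    uint32_t frequency;  // occurrences of the term in the document
};

// A term's posting list, together with the precomputed static upper
// bound on the score contribution of this term (used by WAND).
struct PostingList {
    float upperBound;
    std::vector<Posting> postings;
};

// The inverted index maps each term to its posting list.
using InvertedIndex = std::unordered_map<std::string, PostingList>;
```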
The WAND algorithm was originally conceived with a single execution flow iterating over an inverted index. It processes each query by looking up the query terms in the inverted index and retrieving each posting list. Documents referenced in the intersection of the posting lists answer conjunctive queries (AND bag-of-words queries), and documents retrieved from at least one posting list answer disjunctive queries (OR bag-of-words queries). The algorithm uses static and dynamic values. The static value, called the upper bound, is calculated for each term of the inverted index at construction time, whereas the dynamic value, called the threshold, starts at zero for each new query and is updated during the computations across the inverted index.
WAND uses a standard docID-sorted index and operates on two levels. In the first level, some potential documents are selected as candidate results using an approximate evaluation. In the second level, those potential documents are fully evaluated (eg, using BM25 or the vector model) to obtain their scores. A heap keeps track of the current top-k documents, with the lowest-ranked one at the root. The root score provides a threshold value, which is used to decide whether to run a full score evaluation for each of the remaining documents in the posting lists associated with the query terms. The reference to a new document is inserted in the heap when the WAND operator is true, ie, when the score of the new document is greater than the score of the document with the minimum score stored in the heap. If the heap is full, the document with the minimum score is replaced, updating the value of the threshold. Documents whose scores are smaller than the threshold are skipped.
This scheme allows skipping many documents that would have been evaluated by an exhaustive algorithm. To this end, the algorithm iterates through the posting lists using a pointer movement strategy based on pivoting: pivot terms and pivot documents are selected to move forward in the posting lists, skipping the documents that cannot qualify. A larger threshold usually allows skipping more documents, reducing the computational cost of score calculation.
Figure 1 shows an example of the WAND algorithm for a query with three terms (‘‘tree, cat, and house’’). The posting lists of the query terms are sorted by their current docIDs. Then, the upper bounds (UBs) of the terms are added, in this order, until the sum reaches a value greater than or equal to the threshold. In this example, the sum of the UBs of the first two terms is 2 + 4 = 6 ≥ 6. Thus, the term ‘‘cat’’ is selected as the pivot term. We assume that the current document in its posting list is ‘‘503,’’ so this document becomes the pivot document. If the first two posting lists do not contain document 503, we proceed to select the next pivot. Otherwise, we compute the score of the document. If the score is greater than or equal to the threshold, we update the heap by removing the root document and adding the new document. This process is repeated until there are no documents left to process or until it is no longer possible for the sum of the upper bounds to exceed the current threshold.
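The pivot selection step of this example can be sketched as follows (a sequential illustration with hypothetical names such as TermCursor and selectPivot; not the paper's code):

```cuda
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-term cursor: the current docID in the term's
// posting list and the term's static score upper bound.
struct TermCursor {
    uint32_t currentDoc;
    float upperBound;
};

// WAND pivot selection: sort cursors by current docID, then accumulate
// upper bounds in that order until they reach the threshold. The term
// where the sum crosses the threshold is the pivot term, and its
// current docID is the pivot document. Returns the pivot index, or -1
// if no remaining document can enter the top-k.
int selectPivot(std::vector<TermCursor>& cursors, float threshold) {
    std::sort(cursors.begin(), cursors.end(),
              [](const TermCursor& a, const TermCursor& b) {
                  return a.currentDoc < b.currentDoc;
              });
    float ubSum = 0.0f;
    for (size_t i = 0; i < cursors.size(); ++i) {
        ubSum += cursors[i].upperBound;
        if (ubSum >= threshold) return static_cast<int>(i);
    }
    return -1;  // remaining documents cannot beat the current top-k
}
```

On the Figure 1 example, the cursors for ‘‘tree’’ (UB 2) and ‘‘cat’’ (UB 4) reach the threshold of 6, so the function returns the index of ‘‘cat,’’ and its current docID, 503, becomes the pivot document.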
2.1 GPUs
The basic GPU architecture consists of a set of streaming multiprocessors (SMs), each containing several streaming processors (SPs). All SPs (thin cores) inside an SM (fat core) execute the same instructions, but over different data instances, according to the SIMD (Single Instruction, Multiple Data) model. The number of SPs and SMs in a GPU differs from one graphics card to another. The GPU supports thousands of lightweight concurrent threads and, unlike CPU threads, the overhead of creating and switching threads is negligible. The threads on each SM are organized into thread groups that share computation resources such as registers. A thread group is divided into multiple schedule units, called warps, which are dynamically scheduled on the SM. Because of the SIMD nature of the SPs' execution units, if threads in a schedule unit must perform different operations, such as taking different branches, these operations are executed serially rather than in parallel. Additionally, if a thread stalls on a memory operation, the entire warp is stalled until the memory access completes; in this case, the SM selects another ready warp and switches to it. The GPU global memory is typically measured in gigabytes of capacity. It is an off-chip memory with both high bandwidth and high access latency. To compensate for the high latency of this memory, it is important to have more threads than SPs and to have threads in a warp access consecutive memory addresses that can be coalesced. The GPU also provides a fast on-chip shared memory, accessible by all SPs of an SM. This memory is small, but it has low latency and can be used as a software-controlled cache. Moving data between the CPU and the GPU is done through a PCI Express connection.
A GPU program exposes parallelism through a data-parallel SPMD (Single Program Multiple Data) kernel function. The programmer can
configure the number of threads to be used. Threads execute data parallel computations of the kernel and are organized in groups (thread blocks)
that are further organized into a grid structure. When a kernel is launched, the blocks within a grid are distributed on idle SMs while the threads
are mapped to the SPs. Threads within a thread block are executed by the SPs of a single SM and can communicate through the SM shared
memory. Furthermore, each thread inside a block has its own registers (private local memory) and uses a global thread block index, and a local
thread index within a thread block, to uniquely identify its data. Threads that belong to different blocks cannot communicate explicitly and have
to rely on the global memory to share their results.
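The following minimal CUDA kernel illustrates this execution model: the grid/block configuration, the combination of block and thread indices into a unique data index, and block-local shared memory. It is a generic illustration, not part of the query processing algorithms of this paper.

```cuda
#include <cstdio>

// SPMD kernel: every thread runs the same code over different data.
__global__ void scaleKernel(const float* in, float* out, int n) {
    __shared__ float tile[256];                     // visible to this block only
    int local  = threadIdx.x;                       // index within the block
    int global = blockIdx.x * blockDim.x + local;   // grid-wide index
    tile[local] = (global < n) ? in[global] : 0.0f; // stage through shared memory
    __syncthreads();                                // intra-block synchronization
    if (global < n) out[global] = 2.0f * tile[local];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = float(i);
    scaleKernel<<<(n + 255) / 256, 256>>>(in, out, n); // grid of thread blocks
    cudaDeviceSynchronize();
    printf("out[42] = %.1f\n", out[42]);               // expect 84.0
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```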
3 PARALLEL STRATEGIES FOR GPUS
In this section, we present strategies for the parallel processing of single queries using the WAND algorithm on GPUs. WAND performs a DAAT analysis by iterating over the same docID on the posting lists of all query terms. Our approach to a parallel WAND preserves its DAAT strategy by having processor groups work on partitions of all posting lists. In the first strategy, all the posting lists are divided into fixed-size partitions, each containing the same number of document identifiers (docIDs), at least for the longest term list. This scheme is depicted in Figure 2. The documents are partitioned according to the desired number of multiprocessors (SMs) and the partition size. This size-based partitioning strategy aims at simplicity and at maximizing thread occupancy through better load balancing. However, given that each query term may appear in different documents, this partitioning strategy may lead to a misalignment of docIDs among the posting lists in a partition.
A second partitioning strategy was then proposed, which fragments the posting lists according to ranges of docIDs. This strategy aims at maximizing the effect of the DAAT approach by allowing each multiprocessor (SM) to determine the exact impact of each relevant document for each query term. As a result of this range-based partitioning, each partition contains the same range of docIDs in the respective posting lists of the query terms. In our implementation, this partitioning is initially performed as in the size-based strategy and, in a second step, the document ranges of the partitions are adjusted according to the docIDs located at the beginning and end of each partition. The initial docID of a partition is set to one plus the largest docID found in the first position outside the starting edge (leftmost position) of the partition in each inverted list. The final docID is the largest docID located at the rightmost position of the partition in each inverted list. This process is illustrated in Figure 3. At the end of this process, the processors are not aware of the number of documents in each partition; they only know the initial and final documents of the document ranges. As depicted in Figure 4, the range-based strategy produces partitions of different sizes.
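Our reading of this adjustment step can be sketched as follows (host-side pseudocode in C++; the names Cut and adjustRange and the handling of edge cases are assumptions for illustration, not the paper's code):

```cuda
#include <algorithm>
#include <cstdint>
#include <vector>

// A size-based cut of one posting list: [begin, end) offsets into the
// list's docID array.
struct Cut { size_t begin, end; };

// Adjusts one size-based partition into a docID range, following the
// description above: the range starts one past the largest docID found
// just before the cut in any list, and ends at the largest docID found
// at the last position of the cut in any list.
void adjustRange(const std::vector<std::vector<uint32_t>>& lists,
                 const std::vector<Cut>& cuts,
                 uint32_t& firstDoc, uint32_t& lastDoc) {
    firstDoc = 0;
    lastDoc  = 0;
    for (size_t t = 0; t < lists.size(); ++t) {
        const std::vector<uint32_t>& docs = lists[t];
        const Cut& c = cuts[t];
        if (c.begin > 0)      // docID just outside the left edge, plus one
            firstDoc = std::max(firstDoc, docs[c.begin - 1] + 1);
        if (c.end > c.begin)  // docID at the rightmost position of the cut
            lastDoc = std::max(lastDoc, docs[c.end - 1]);
    }
}
```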
Our strategies can be implemented on GPU architectures and other manycore accelerators composed of a hierarchy of processors described as follows. The top level of the hierarchy consists of one or more multiprocessors (fat cores), each executing its own instruction stream. Multiprocessors are processing elements intended for coarse-grained parallelism. In the second level of the hierarchy, each multiprocessor is composed of a set of simple processing units called stream processors (thin cores). Stream processors are fine-grained units that share the resources of their multiprocessor and work synchronously on the same instruction stream (SIMD). This two-level abstraction is the basis for the strategies described next.
The strategies for the parallel execution of each single query are detailed in Algorithms 1 and 2. Algorithm 1 formalizes the coarse-grained
distributed query processing. The parallelization occurs with function calls on lines 2 and 7. The ParallelMatchProcessing function (kernel)
encapsulates the parallel WAND algorithm, whereas the ParallelMerge function combines in parallel all the partial top-k documents to obtain only
the most relevant k documents. The synchronization barrier on line 4 ensures that all partial results are obtained before the results are merged.
The strategies described in Algorithm 1 do not include the partitioning of the query's inverted lists, which is performed dynamically on each coarse-grained processor.
The number of processors involved in the algorithm depends on the size of the largest inverted list for the query terms, and each processor will initially work on one partition. Thus, the time complexity to perform the classification of the documents of the inverted lists of a query is 𝜃(Nm/|P|) in the worst case, where Nm is the number of docIDs in the largest inverted list for the query terms and P is the set of coarse-grained processors. The top-k documents resulting from the partitioning are merged by the log(|P|) calls of the ParallelMerge function. We propose that the merge algorithm be processed in parallel on the GPU, where each call runs in time that depends on the constant k. Therefore, Algorithm 1 runs in 𝜃(Nm/|P| + log(|P|)) time.
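Algorithm 1 itself is not reproduced here; the host-side sketch below shows the structure it formalizes. Kernel names mirror the functions cited above, but their bodies and argument lists are illustrative placeholders, not the paper's implementation.

```cuda
// Coarse-grained skeleton of Algorithm 1: one kernel computes local
// top-k lists over the partitions, a device-wide synchronization
// guarantees all partial results exist, and log(|P|) pairwise merge
// rounds combine them.
__global__ void parallelMatchProcessing(float* topkScores, int k) {
    // ...fine-grained WAND over this block's partition (Algorithm 2);
    // block b writes its k local scores to topkScores[b*k .. b*k+k-1].
}

__global__ void parallelMerge(float* topkScores, int k,
                              int stride, int active) {
    // ...block i (if i + stride < active) merges lists i and
    // i + stride into list i.
}

void processQuery(float* topkScores, int k,
                  int numPartitions, int threadsPerBlock) {
    parallelMatchProcessing<<<numPartitions, threadsPerBlock>>>(topkScores, k);
    cudaDeviceSynchronize();            // barrier: all partial top-k ready
    for (int active = numPartitions; active > 1; active = (active + 1) / 2) {
        int stride = (active + 1) / 2;  // merge list i with list i + stride
        parallelMerge<<<stride, threadsPerBlock>>>(topkScores, k, stride, active);
        cudaDeviceSynchronize();        // barrier between merge rounds
    }                                   // list 0 now holds the global top-k
}
```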
Algorithm 2 details the fine-grained parallelization strategy encapsulated by the ParallelMatchProcessing function. This algorithm is similar to
the original WAND presented in the work of Broder et al.3 The input of the algorithm includes the list of query terms and their corresponding
posting lists (terms), the number of top documents to be retrieved (k), and a set of coarse-grained processors (s), each containing Cores[1..b]
(fine-grained processors). For each partition, the algorithm maintains a pointer to each posting list of query terms (terms) and to the next
document (nextDoc) to be evaluated in each list. The algorithm begins by initializing in parallel these list pointers and the docID range in
each posting list in each partition, ie, the smallest and largest docIDs (dmin , dmax ) of each list in each partition. This can be seen in lines 3
and 4, where functions RangeDocs and NextDoc are called. The NextDoc function determines the next docID to be evaluated from these
docIDs.
Our fine-grained parallelization is applied to several steps within the ParallelMatchProcessing function, ie, the evaluation of similarity models for the selected document (FullScore), the management of the list of relevant documents (ManageTopkDocs), the retrieval of the next document (NextDoc), and the sorting of the query terms (Sort). Nevertheless, the iterations over the partition documents remain sequential, driven by the loop in line 8. A document is selected at each iteration to be analyzed and classified. The analysis consists of verifying (line 9) whether the document can contribute enough to be among the most relevant documents. The classification is the application of the similarity models in each posting list where the docID is present. This analyzed and classified document is called the pivot document. If the pivot document's score is high enough to be in the list of top-k documents at a certain point in the processing, the document is inserted in the local data structure that stores the most relevant documents. The parallelization strategy detailed in Algorithm 2 only considers the local list within each coarse-grained processor, which removes the need for policies that guarantee the integrity of concurrent access to this structure.
The partitioning of the inverted lists is performed by the RangeDocs function and depends on the selected strategy. For the range-based
strategy, the initial and final docIDs in the partial posting lists of all query terms will be the same, as returned by the RangeDocs function. When
using the size-based strategy, the docID boundaries in each list will be different.
For each partition, a local threshold is set as the lowest score of all docIDs in the current local top-k list. This value is dynamically updated as the list changes and is used to define which document to use as the pivot (lines 7 and 27) in each iteration of the loop of line 8. The larger the local threshold, the faster the algorithm instance proceeds, as more documents are skipped. By sharing the local thresholds among the fine-grained processors, shown in line 28 (ThresholdSharingStrategy), we are able to improve the overall processing performance of our algorithm. This function encapsulates one of the sharing policies proposed here to ensure that the quality of the final results is not degraded. The advantage of this sharing is that it can be performed on each selected pivot document, as demonstrated in Algorithm 2, or after a certain number of analyzed documents.
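The concrete sharing policies are encapsulated by ThresholdSharingStrategy; as one plausible mechanism (an assumption for illustration, not the paper's code), a monotone maximum over a global-memory threshold can be maintained atomically:

```cuda
// One plausible realization of threshold sharing through global memory.
// For non-negative IEEE-754 floats (scores are non-negative here),
// integer comparison of the bit patterns preserves ordering, so an
// integer atomicMax on the reinterpreted bits implements an atomic
// floating-point maximum.
__device__ float shareThreshold(float* globalThreshold, float localThreshold) {
    int old = atomicMax(reinterpret_cast<int*>(globalThreshold),
                        __float_as_int(localThreshold));
    // The shared value only grows, so skipping with it never discards a
    // document that could still enter the top-k of any partition.
    return fmaxf(localThreshold, __int_as_float(old));
}
```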
3.2 Implementations
In order to evaluate our proposed strategies for the parallelization of the WAND algorithm, we have implemented them on a CUDA GPU. This
architecture consists of a number of SPMD multiprocessors, on which virtualization techniques provide a massive availability of threads. Thread
blocks are executed on multiprocessors and individual threads are executed by scalar processors. On top of this architecture, we can define the
number of partitions for the inverted lists as a function of the number of threads per block. This provides the needed flexibility for the strategies
detailed in Algorithms 1 and 2.
The thread blocks will process the partitions of the inverted lists, and if the number of partitions exceeds the number of thread blocks, each
block will be responsible for more than one partition. In order to improve data reuse, the local top-k list of a thread block can be preserved for all
partitions this block will handle.
Figure 5 illustrates this reuse of data. The threshold in a thread block may preserve its value from a previously processed partition, increasing
the skipping ratio if there are more partitions than the available number of thread blocks. The appropriate number of partitions, however, needs
to be determined according to the characteristics of the GPU used, and in combination with the characteristics of the queries.
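The mapping of several partitions onto one thread block, with the heap and threshold preserved across them, can be sketched as a block-stride loop (kernel name and elided steps are illustrative; the loop body stands for the fine-grained WAND of Algorithm 2):

```cuda
// Sketch of the data reuse of Figure 5: a thread block processes
// several partitions while keeping its top-k heap and threshold.
__global__ void matchManyPartitions(int numPartitions, int k
                                    /*, index data, topk storage */) {
    __shared__ float threshold;             // survives across partitions
    if (threadIdx.x == 0) threshold = 0.0f;
    __syncthreads();
    // Block-stride loop: block b handles partitions b, b + gridDim.x, ...
    for (int p = blockIdx.x; p < numPartitions; p += gridDim.x) {
        // ...run the fine-grained WAND of Algorithm 2 on partition p,
        // reading and raising `threshold` instead of resetting it, and
        // inserting into the same local top-k heap.
    }
    // After the loop: one local top-k list per block, regardless of how
    // many partitions the block consumed.
}
```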
TABLE 1 Execution times in milliseconds for the sequential algorithm
Number of terms | Operation | Query length | Max. number of docIDs | Execution time (ms)
2 | OR | Short | 89 615 | 14.01
2 | OR | Medium | 476 771 | 42.86
2 | OR | Large | 1 032 795 | 45.52
2 | OR | Extra | 5 494 285 | 202.79
2 | AND | Short | 89 615 | 13.83
2 | AND | Medium | 476 771 | 42.43
2 | AND | Large | 1 032 795 | 45.44
2 | AND | Extra | 5 494 285 | 200.77
4 RESULTS FOR SINGLE QUERIES
In our tests, we experiment with the 50.2 million document corpus TREC ClueWeb09 (category B), which is part of the ClueWeb09 corpus,* indexed with the Terrier IR platform.† The index occupies 29 GB, and the most frequent terms produced posting lists with 2.7 million docIDs. All experiments were repeated 20 times, removing the highest and the lowest values. The results presented here are the average of the 18 remaining executions.
The baseline (sequential) version of the WAND algorithm was executed on an Intel Core i7-5820K processor at 3.30 GHz with 64 GB of RAM, using only one core. The queries were composed of two terms with posting lists of different sizes (given by the number of docIDs). Terms were randomly selected so that the total number of docIDs of each query matches the values shown in Table 1. For instance, the first row shows queries composed of two terms whose posting lists sum up to 89 615 docIDs. We set k = 128 (ie, select the top-128 most relevant documents). The parallel approaches presented in this work were executed on an NVIDIA Titan Xp GPU with 3840 CUDA cores, 30 multiprocessors (SMs), and 12 GB of memory.
* http://boston.lti.cs.cmu.edu/Data/clueweb09/.
† http://terrier.org/
TABLE 2 The speedup and execution times for OR queries with several query lengths
Partit. strat. | Threshold sharing | Part. size | # of part. | Short: Time (ms), Speedup | Medium: Time (ms), Speedup | Large: Time (ms), Speedup | Extra: Time (ms), Speedup
Size Local 32 1 0.54 25.84 3.28 13.05 11.05 4.12 509.39 0.40
Size Local 32 10 1.73 8.08 2.22 19.34 4.62 9.85 57.14 3.55
Size Local 64 1 0.51 27.58 2.53 16.91 7.34 6.21 271.54 0.75
Size Local 64 10 3.02 4.63 3.41 12.55 4.30 10.58 35.01 5.79
Size Sh-R 32 1 0.57 24.55 3.42 12.54 11.31 4.02 508.60 0.40
Size Sh-R 32 10 1.75 8.03 2.05 20.86 3.50 12.99 49.32 4.11
Size Sh-R 64 1 0.54 25.83 2.62 16.34 7.46 6.10 270.41 0.75
Size Sh-R 64 10 3.19 4.39 3.14 13.63 3.11 14.63 27.12 7.48
Size Sh-WR 32 1 0.55 25.27 3.33 12.85 11.15 4.08 509.39 0.40
Size Sh-WR 32 10 1.74 8.04 2.30 18.60 3.89 11.70 50.08 4.05
Size Sh-WR 64 1 0.56 25.05 2.58 16.58 7.41 6.15 269.57 0.75
Size Sh-WR 64 10 3.22 4.35 3.20 13.38 3.62 12.56 27.70 7.32
Range Local 32 1 1.76 7.94 3.93 10.90 10.07 4.52 225.70 0.90
Range Local 32 10 4.11 3.41 4.23 10.14 7.72 5.90 36.38 5.57
Range Local 64 1 2.50 5.61 3.68 11.64 7.60 5.99 124.15 1.63
Range Local 64 10 5.46 2.57 5.86 7.32 11.75 3.87 31.98 6.34
Range Sh-R 32 1 2.51 5.58 3.79 11.31 7.20 6.32 218.14 0.93
Range Sh-R 32 10 4.19 3.34 3.87 11.08 4.56 9.99 24.78 8.19
Range Sh-R 64 1 2.74 5.11 3.78 11.34 5.10 8.93 118.70 1.71
Range Sh-R 64 10 6.02 2.33 4.96 8.64 4.36 10.44 19.45 10.43
Range Sh-WR 32 1 2.53 5.53 4.10 10.44 6.78 6.71 222.14 0.91
Range Sh-WR 32 10 3.35 4.18 4.26 10.05 4.22 10.78 25.41 7.98
Range Sh-WR 64 1 2.01 6.98 3.56 12.04 6.16 7.39 121.90 1.66
Range Sh-WR 64 10 5.07 2.76 4.08 10.51 4.37 10.41 20.20 10.04
The results in Table 2 suggest that increasing the number of partitions to be executed per thread block, and thus the reuse of data among partitions executed by the same thread block, can lead to better performance. To investigate this, the next experiment varies the number of partitions per thread block as 1, 5, 10, 100, 150, and 200 (results are shown in Figures 6 and 7).
As depicted in Figure 6, for the size-based strategy, the longer the posting lists, the more partitions per thread block are needed to achieve better speedups. For short queries, one partition per thread block leads to the maximum speedup, while 10 partitions per block are more suitable for medium and large queries, and 100 partitions maximized the performance of extra-large queries. This experiment also shows that all three alternative threshold sharing policies present similar performance. This behavior can be explained as follows.
Consider, for instance, the experiment with medium queries, whose posting lists have nearly 238 thousand docIDs each (half of 476 thousand). With 32 docIDs per partition, the query is partitioned into approximately 7450 partitions. With one partition per thread block, the strategy creates the same number of thread blocks, each of which initializes a heap with 128 empty slots to process only one partition. At the end of the process, each of the 7450 blocks has one resulting heap containing at most 32 selected documents to merge. Furthermore, in this case, each block computes a local threshold for every partition instead of cumulatively building a global threshold. With local thresholds, the algorithm does not skip as many documents as expected and the parallel search takes longer to execute. On the other hand, consider the case in which each block executes 200 partitions, which share the same heap and threshold. In this case, with only 37 blocks there will be high data reuse among partitions, only 37 heaps to merge, and threshold sharing will skip more documents. However, with only 37 blocks there is an insufficient amount of work exposed to the scheduler, as many of these blocks may not be ready for execution when the current block stalls due to a memory access. Hence, there is a tradeoff between maximizing the work offered to the processors and minimizing the overhead of work partitioning. For medium-size queries, the best speedups can be achieved around five partitions per thread block (ie, near 1490 thread blocks per query). Increasing the number of partitions per thread block only leads to performance gains when there is enough work (ie, thread blocks) to keep the processors occupied.
Next, we execute a similar experiment to evaluate the effect of varying the number of partitions per thread block for the range-based partitioning strategy. The results presented in Figure 7 exhibit a similar behavior regarding the best number of partitions. However, the maximum speedup for short queries decreased, because disjunctive queries operate on a smaller number of documents, so fewer documents are skipped compared to conjunctive ones.
In summary, the size-based partitioning strategy leads to higher speedups for short and medium-length queries, while the range-based partitioning leads to better speedups for extra-large queries when threshold sharing policies are used. However, the size-based strategy has a drawback: it does not guarantee that each fixed-size partition covers the same range of docIDs in its posting lists. As a consequence, the size-based strategy may lose relevant documents during the search process, thus producing less accurate results, as shown in Table 3. Hence, the size-based partitioning strategy provides good approximate results, and better speedups for short and medium-length queries. As the parallel range-based strategy returns exactly the same list of top-k documents returned by the sequential algorithm, its recall is 1.0. For this reason, the recall for the range-based strategy is not shown in Table 3. The best speedups of both algorithms are summarized in Figure 8.
FIGURE 6 Performance for OR queries with the size-based partitioning strategy, three threshold sharing policies, and varying the number of partitions per thread block
FIGURE 7 Performance for OR queries with the range-based partitioning strategy, three threshold sharing policies, and varying the number of
partitions per thread block
TABLE 3 Recall for the size-based strategy with short, medium, large, and extra-large queries
Partitioning strategy | Threshold sharing | Short: Recall, Std. dev. | Medium: Recall, Std. dev. | Large: Recall, Std. dev. | Extra: Recall, Std. dev.
Size | Local | 0.891, 0.030 | 0.897, 0.011 | 0.833, 0.021 | 0.905, 0.019
Size | Sh-R | 0.890, 0.030 | 0.897, 0.011 | 0.833, 0.021 | 0.905, 0.019
Size | Sh-WR | 0.891, 0.030 | 0.897, 0.011 | 0.833, 0.021 | 0.905, 0.019
For AND queries with the size-based strategy (Figure 9), the best results for medium queries are achieved with five partitions per block. For large queries, 10 partitions per thread block improve the performance, and for extra-large queries the best performance is reported with 100 partitions. In Figure 10, we show the results for the same experiment with the range-based strategy. In this case, short and medium queries report the best speedup with one partition per thread block. For large queries, the best results are achieved with 10 partitions per block and, for extra-large queries, the best performance is reported with 100 partitions.
Similar to OR queries, these results show that it is not possible to determine a single number of partitions that yields the best performance in all cases (partitioning strategies and threshold policies). This illustrates the complexity of the WAND algorithm, whose behavior depends on many parameters, such as the size of the queries to be processed and how the index is distributed and accessed across the different SMs.
Figure 11 summarizes the results achieved by each parallel query processing strategy and each threshold policy. The y-axis shows the speedup and the x-axis shows the parallel strategy. In each case, we selected the number of partitions per thread block reporting the best speedup, as shown in Figures 9 and 10. In general, the size-based strategy reports higher speedups than the range-based strategy.
TABLE 4 Execution time and speedup obtained by the size-based and range-based strategies and different threshold policies for short, medium, large, and extra-large AND queries with two terms
Partit. strat. | Threshold sharing | Part. size | # of part. | Short: Time (ms), Speedup | Medium: Time (ms), Speedup | Large: Time (ms), Speedup | Extra: Time (ms), Speedup
Size Local 32 1 0.34 40.73 2.39 17.74 9.11 4.99 503.78 0.40
Size Local 32 10 1.59 8.70 2.05 20.69 3.79 11.99 53.67 3.74
Size Local 64 1 0.46 30.23 1.80 23.61 5.54 8.20 264.82 0.76
Size Local 64 10 2.79 4.95 3.27 12.99 4.08 11.14 31.18 6.44
Size Sh-R 32 1 0.36 38.64 2.44 17.37 9.20 4.94 502.65 0.40
Size Sh-R 32 10 1.74 7.97 1.96 21.61 3.12 14.57 49.34 4.07
Size Sh-R 64 1 0.49 28.31 1.88 22.61 5.65 8.04 259.13 0.77
Size Sh-R 64 10 3.17 4.36 3.18 13.35 3.22 14.12 27.32 7.35
Size Sh-WR 32 1 0.35 39.59 2.41 17.63 9.16 4.96 499.57 0.40
Size Sh-WR 32 10 1.76 7.87 2.06 20.63 3.28 13.86 50.53 3.97
Size Sh-WR 64 1 0.49 28.05 1.84 23.10 5.58 8.14 264.19 0.76
Size Sh-WR 64 10 3.23 4.29 3.25 13.04 3.43 13.25 27.81 7.22
Range Local 32 1 0.77 18.04 2.90 14.62 8.23 5.52 217.83 0.92
Range Local 32 10 2.46 5.62 4.35 9.75 6.88 6.61 33.91 5.92
Range Local 64 1 1.17 11.86 2.71 15.68 6.23 7.30 119.35 1.68
Range Local 64 10 4.09 3.38 6.12 6.94 8.90 5.11 30.30 6.63
Range Sh-R 32 1 0.85 16.23 2.90 14.64 8.43 5.39 217.66 0.92
Range Sh-R 32 10 2.83 4.89 4.07 10.43 6.70 6.78 25.00 8.03
Range Sh-R 64 1 1.29 10.69 2.68 15.86 7.11 6.39 116.10 1.73
Range Sh-R 64 10 4.77 2.90 5.57 7.62 8.56 5.31 19.89 10.09
Range Sh-WR 32 1 0.85 16.18 2.97 14.27 8.72 5.21 216.24 0.93
Range Sh-WR 32 10 2.82 4.91 4.13 10.27 5.68 7.99 25.95 7.74
Range Sh-WR 64 1 1.25 11.02 2.81 15.09 7.55 6.02 117.87 1.70
Range Sh-WR 64 10 4.69 2.95 5.70 7.45 8.67 5.24 20.19 9.95
The results obtained with the different threshold policies tend to be very similar; thus, they have a low impact on performance. For short queries, the size-based strategy outperforms the range-based strategy by more than twice the speedup. This difference shrinks as the query length increases toward extra-large. This behavior can be explained as follows. The size-based strategy balances the workload among the SMs and, with short queries, the threshold value does not have a significant pruning effect on the top-k document results. On the other hand, with the range-based strategy, the workload is unbalanced among the SMs. However, for extra-large queries, the threshold value drastically prunes the non-relevant documents, which helps to compensate for the performance loss caused by the workload imbalance.
FIGURE 9 Performance for AND queries with the size-based partitioning strategy, three threshold sharing policies, and varying the number of
partitions per thread block
Table 5 shows the recall for AND queries executed with the size-based strategy. With larger query sizes, the recall tends to decrease. However, in all cases, the strategy provides good approximate results, with a recall higher than 80%. The standard deviation is lower for AND queries than for OR queries (0.012 versus 0.030 for short queries); in other words, the values are more similar to each other and closer to the mean.
5 SYNCHRONIZATION STRATEGIES FOR BATCHES OF QUERIES
State-of-the-art implementations can execute single queries in a few milliseconds.15,19 However, because many factors impact the total latency of a query, longer response times may be acceptable. In an attempt to further explore the use of GPUs in web search, we investigate the possibility of grouping queries prior to processing, even at the cost of a small increase in latency. In our tests, a batch of queries is produced by buffering requests for a period of time or until a predetermined number of requests has arrived. In this section, we propose and investigate two strategies for the parallel execution of batches of queries on GPUs.
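A minimal host-side sketch of this buffering policy follows. The names (QueryBatcher, submitBatchToGpu) and the trigger logic are hypothetical illustrations of the size-or-timeout rule described above, not the paper's implementation; in particular, a production system would also flush on a timer rather than only on arrival of a new request.

```cuda
#include <chrono>
#include <string>
#include <vector>

// Buffers incoming queries and flushes a batch when either a maximum
// batch size or a maximum waiting time is reached.
class QueryBatcher {
public:
    QueryBatcher(size_t maxBatch, std::chrono::milliseconds maxWait)
        : maxBatch_(maxBatch), maxWait_(maxWait),
          windowStart_(std::chrono::steady_clock::now()) {}

    void enqueue(const std::string& query) {
        buffer_.push_back(query);
        if (buffer_.size() >= maxBatch_ || waitedTooLong()) flush();
    }

private:
    bool waitedTooLong() const {
        return std::chrono::steady_clock::now() - windowStart_ >= maxWait_;
    }
    void flush() {
        // submitBatchToGpu(buffer_);  // hypothetical GPU entry point
        buffer_.clear();
        windowStart_ = std::chrono::steady_clock::now();
    }
    size_t maxBatch_;
    std::chrono::milliseconds maxWait_;
    std::chrono::steady_clock::time_point windowStart_;
    std::vector<std::string> buffer_;
};
```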
Algorithm 3 is inspired by the BSP model, with two super-steps to handle each query: the first processes the inverted lists, and the second combines the partial results produced by the coarse-grained processors.
After the coarse-grained processors have each selected their most relevant documents, a merge procedure takes place in 𝜃(log(|P|)) steps, as seen in the calls to ParallelMerge. All involved coarse-grained processors then wait at a synchronization barrier (line 10) before another merge round. Eventually, one of the processors will hold the merged list of the global top-k most relevant documents for the query. The iteration then moves on to the next query in the batch.
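To illustrate one round of this reduction, the sketch below combines two descending-sorted top-k score lists into the k best of both. A single-thread version is shown for clarity (in the actual algorithm the merge runs in parallel on the GPU); each such 𝜃(k) step halves the number of surviving lists, giving the log(|P|) rounds.

```cuda
// Merges two score lists of length k, each sorted in descending order,
// keeping the k largest of the 2k candidates. At iteration i we have
// ia + ib = i <= k - 1, so both reads stay in bounds.
__host__ __device__
void mergeTopK(const float* a, const float* b, float* out, int k) {
    int ia = 0, ib = 0;
    for (int i = 0; i < k; ++i)
        out[i] = (a[ia] >= b[ib]) ? a[ia++] : b[ib++];
}
```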
FIGURE 10 Performance for AND queries with the range-based partitioning strategy, three threshold sharing policies, and varying the number of
partitions per thread block
6 PERFORMANCE OF BATCH QUERY PROCESSING
In this section, we evaluate the strategies for the parallel execution of batches of queries proposed in Section 5. For the experiments, we used the same hardware and the same 50.2 million document corpus TREC ClueWeb09 (category B) indexed with Terrier, as described in Section 4. However, instead of submitting individual queries, we composed a batch of 500 queries, which was submitted for execution in parallel. The query log represents real requests in the Web environment. Experiments were replicated ten times and the execution times averaged.
The baseline (sequential) version of the WAND algorithm was executed on an Intel Core i7-5820K processor at 3.30 GHz with 64 GB of RAM, using only one core. The sequential WAND algorithm was implemented in Java and executed one batch of 500 queries in 17.106 seconds, on average. The batch algorithms invoke the parallel implementation of the WAND algorithm to execute individual queries, as described in Section 5. In this experiment, the parallel WAND uses the three threshold propagation approaches and a global implementation of the upper bound.
TABLE 6 The speedup and execution times of the asynchronous strategy for the query batch
Partitioning strategy | Threshold sharing | Number of partitions | 32 docIDs per partition: Time (s), Speedup | 64 docIDs per partition: Time (s), Speedup | 128 docIDs per partition: Time (s), Speedup
Size Local 1 52.17 0.33 26.10 0.66 13.51 1.27
Size Local 10 5.58 3.07 3.05 5.60 1.77 9.65
Size Sh-R 1 51.51 0.33 25.97 0.66 13.16 1.30
Size Sh-R 10 5.07 3.37 2.63 6.51 1.33 12.86
Size Sh-WR 1 51.51 0.33 25.82 0.66 12.74 1.34
Size Sh-WR 10 5.06 3.38 2.68 6.39 1.41 12.10
Range Local 1 8.59 1.99 4.90 3.49 2.83 6.05
Range Local 10 1.81 9.43 1.42 12.01 1.16 14.70
Range Sh-R 1 7.54 2.27 4.07 4.20 2.49 6.86
Range Sh-R 10 1.37 12.49 1.02 16.74 0.82 20.76
Range Sh-WR 1 7.51 2.28 4.06 4.21 2.34 7.32
Range Sh-WR 10 1.37 12.45 1.02 16.73 0.85 20.24
As suggested by the results shown in Table 6, increasing from 1 to 10 partitions per thread block improves the performance of the algorithms. To investigate this further, the next experiment varies the number of partitions per thread block from 1 to 250. The results are presented in Figure 13. As depicted in Figure 13, the best performance is achieved between 100 and 250 partitions per thread block. However, in some cases, increasing the number of partitions beyond this range reduces the occupation of GPU resources, slightly decreasing the performance of the algorithm.
FIGURE 13 Logarithmic execution times (s) of the asynchronous size-based strategy with the threshold propagation policies for the query batch
18 of 21 GAIOSO ET AL.
Next, in Figure 14, we present the effect of the threshold sharing policies for the best-performing scenario, which uses 128 docIDs per partition. As shown, the two policies that share the threshold lead to better performance than the local threshold, as a global threshold allows the algorithms to skip more documents.
FIGURE 15 Execution times (s) of the asynchronous range-based strategy with the three threshold sharing policies for the query batch
TABLE 7 The speedup and execution times of the synchronous strategy for the query batch
32 threads per block: Speedup, Time (s) | 64 threads per block: Speedup, Time (s) | 128 threads per block: Speedup, Time (s)
0.809, 21.131 | 0.799, 21.403 | 0.813, 21.021
The previous experiments show the performance of batches of queries with the size-based strategy. Next, we present experiments that vary the number of partitions per thread block for the range-based strategy. As depicted in Figure 15, the algorithm achieved the best performance with 128 docIDs per partition in the range between 1 and 100 partitions per block, while above 100 partitions per block the best performance was obtained with 32 and 64 docIDs per partition.
Finally, Figure 16 presents the speedup of the three threshold sharing policies for the batch of queries running with 128 docIDs per partition. Again, the two global threshold sharing policies improve performance, as global thresholds allow skipping more documents.
7 CONCLUSIONS
In this paper, we have proposed parallel strategies for single queries and batches of queries executed with the WAND ranking algorithm on GPUs. In particular, we presented two strategies to partition the documents among the SMs. The first document partitioning strategy, named size-based, evenly partitions the posting lists among thread blocks. This approach tends to balance the workload among the SMs at the cost of lower-quality results. Our second document partitioning strategy, named range-based, partitions the posting lists according to the document identifiers. Partitions have different sizes, and therefore the workload tends to be unbalanced among the SMs. This second approach retrieves the exact top-k documents for user queries. We also proposed three threshold sharing policies, named (1) local, (2) Safe-R, and (3) Safe-WR.
To process batches of queries, we proposed two synchronization strategies, named (1) Synchronous and (2) Asynchronous. In the Synchronous strategy, each processor executes a different query. The Asynchronous strategy executes two steps: in the first step, it selects the relevant documents for the batch of queries by accessing the posting lists; in the second step, it merges the partial results. Each step ends with a barrier synchronization.
We evaluated our proposals with different query lengths, with AND and OR queries, and with different parameter configurations (ie, partition sizes and numbers of partitions). Results for single queries show that the size-based strategy achieves better speedups (up to 35x for OR queries and 40x for AND queries) through a higher occupancy of the SMs. Although it can lose relevant documents during the search process, it reports a recall higher than 80%; thus, the size-based strategy can provide good approximate results. The range-based strategy returns the exact top-k documents and shows promising speedups (up to 25x for OR queries and 18x for AND queries). All three threshold sharing policies reported similar performance.
The execution of batches of queries has proved to be an interesting strategy in terms of performance. The asynchronous strategy noticeably leads to better performance, achieving speedups of up to 21 times, as the global sharing of thresholds allows skipping more documents while keeping the GPU resources occupied. The synchronous strategy, which executes one query per processor at a time and thus avoids the overhead of sharing data among processors, did not outperform the sequential baseline in our experiments (Table 7).
As future work, we plan to develop new strategies for processing batches of queries on multi-GPU and heterogeneous (CPU-GPU) platforms, aiming at better load balancing and at maximizing communication-computation overlap.
ACKNOWLEDGMENTS
The authors thank CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Código de Financiamento 001) and FAPESP (Contract 2015/24461-2). Hermes Senger also thanks CNPq (Contract 305032/2015-1) and FAPESP (Contract 2018/00452-2) for their support.
The Titan Xp used for this research was donated by the NVIDIA Corporation.
REFERENCES
1. Cambazoglu BB, Baeza-Yates R. Scalability Challenges in Web Search Engines. San Rafael, CA: Morgan and Claypool Publishers; 2016.
2. Turtle H, Flood J. Query evaluation: strategies and optimizations. Inf Process Manag. 1995;31(6):831-850.
3. Broder A, Carmel D, Herscovici M, Soffer A, Zien J. Efficient query evaluation using a two-level retrieval process. In: Proceedings of the 12th International Conference on Information and Knowledge Management (CIKM); 2003; New Orleans, LA.
4. Ding S, Suel T. Faster top-k document retrieval using block-max indexes. In: Proceedings of the 34th International ACM SIGIR Conference on Research
and Development in Information Retrieval; 2011; Beijing, China.
5. Chakrabarti K, Chaudhuri S, Ganti V. Interval-based pruning for top-k processing over compressed lists. Paper presented at: 2011 IEEE 27th
International Conference on Data Engineering (ICDE); 2011; Hannover, Germany.
6. Altingovde I, Demir E, Can F, Ulusoy Ö. Incremental cluster-based retrieval using compressed cluster-skipping inverted files. ACM Trans Inf Syst.
2008;26(3):1-36.
7. Rojas O, Gil-Costa V, Marin M. Distributing efficiently the Block-Max WAND algorithm. Procedia Comput Sci. 2013;18:120-129.
8. Rojas O, Gil-Costa V, Marin M. Efficient parallel Block-Max WAND algorithm. In: Proceedings of the 19th International Conference on Parallel
Processing (Euro-Par); 2013; Aachen, Germany.
9. Bonacic C, Bustos D, Gil-Costa V, Marin M, Sepulveda V. Multithreaded processing in dynamic inverted indexes for web search engines. In: Proceedings
of the 2015 Workshop on Large-Scale and Distributed System for Information Retrieval (LSDS-IR); 2015; Melbourne, Australia.
10. Ding S, He J, Yan H, Suel T. Using graphics processors for high performance IR query processing. In: Proceedings of the 18th International Conference
on World Wide Web (WWW); 2009; Madrid, Spain.
11. Ao N, Zhang F, Wu D, et al. Efficient parallel lists intersection and index compression algorithms using graphics processing units. Proc VLDB Endow.
2011;4(8):470-481.
12. Tadros R. Accelerating Web Search Using GPUs [PhD thesis]. Vancouver, Canada: University of British Columbia; 2015.
13. Wu D, Zhang F, Ao N, Wang G, Liu X, Liu J. Efficient lists intersection by CPU-GPU cooperative computing. Paper presented at: 2010 IEEE International
Symposium on Parallel and Distributed Processing, Workshops and Phd Forum (IPDPSW); 2010; Atlanta, GA.
14. Zhang F, Wu D, Ao N, Wang G, Liu X, Liu J. Fast lists intersection with Bloom filter using graphics processing units. In: Proceedings of the 2011 ACM
Symposium on Applied Computing (SAC); 2011; TaiChung, Taiwan.
15. Huang H, Ren M, Zhao Y, et al. GPU-accelerated Block-Max query processing. In: Algorithms and Architectures for Parallel Processing: 17th International
Conference, ICA3PP 2017, Helsinki, Finland, August 21-23, 2017, Proceedings. Cham, Switzerland: Springer International Publishing AG; 2017:225-238.
16. Liu Y, Wang J, Swanson S. Griffin: uniting CPU and GPU in information retrieval systems for intra-query parallelism. ACM SIGPLAN Not.
2018;53(1):327-337.
17. Vigna S. Quasi-succinct indices. In: Proceedings of the 6th ACM International Conference on Web Search and Data Mining (WSDM); 2013; Rome, Italy.
18. Green O, McColl R, Bader DA. GPU merge path: a GPU merging algorithm. In: Proceedings of the 26th ACM International Conference on
Supercomputing (ICS); 2012; Venice, Italy.
19. Gaioso R, Gil-Costa V, Guardia H, Senger H. A parallel implementation of WAND on GPUs. Paper presented at: 2018 26th Euromicro International
Conference on Parallel, Distributed and Network-based Processing (PDP); 2018; Cambridge, UK.
20. Marin M, Gil-Costa V, Bonacic C, Baeza-Yates RA, Scherson ID. Sync/Async parallel search for the efficient design and construction of web search
engines. Parallel Computing. 2010;36(4):153-168.
21. Mendoza M, Marín M, Gil-Costa V, Ferrarotti F. Reducing hardware hit by queries in web search engines. Inf Process Manag. 2016;52(6):1031-1052.
How to cite this article: Gaioso R, Gil-Costa V, Guardia H, Senger H. Performance evaluation of single vs. batch of queries on GPUs.
Concurrency Computat Pract Exper. 2019;e5474. https://doi.org/10.1002/cpe.5474