Efficient SIMD Vectorization for Hashing in OpenCL
Xeon Phi coprocessors). OpenCL programs are implemented in special functions called kernels. We provide vectorized kernels for the data movement primitives Selective Load, Selective Store, Gather, and Scatter [8]. These primitives are essential building blocks of hash-based operators. We use vectorized hashing operations for

1 Source Code: https://github.com/TU-Berlin-DIMA/OpenCL-SIMD-hashing

© 2018 Copyright held by the owner/author(s). Published in Proceedings of the 21st International Conference on Extending Database Technology (EDBT), March 26-29, 2018, ISBN 978-3-89318-078-3 on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

(1) Selective Load. In the first iteration, we load all SIMD lanes as indicated by the green bitmask.

(2) For each probe key ki in the SIMD register, we compute the hash hi and store it in a SIMD register h. We keep separate SIMD registers for probe keys (k) and their hashes (h).

(3) We use the hash values as position pointers in a Gather operation to load buckets from the hash table. We store the found keys in a new SIMD register k′.
(4) We compare the original probe keys (k) with the keys we retrieved from the hash table (k′). This comparison results in a collision mask (c). Light green boxes in the mask indicate matches and dark red boxes indicate collisions.

In our example, we find three expected keys: k1, k3, and k4. However, the hash table contains the key k5 at position h2 instead of k2, which indicates a collision (i.e., k2 and k5 have the same hash). In general, there are three possible cases per probe key: (1) The bucket contains the key. (2) The hash bucket is empty. (3) The key in the hash bucket and the probe key are different. In cases one and two, we replace the matched probe key in k with a new probe key. In case three, we keep the probe key ki but increment its hash value hi in h to probe the next bucket (Figure 2b, step 5). We now continue with the next iteration.

(6) We use the collision mask to load new keys into the SIMD lanes for which there was no collision in the previous iteration, leaving key k2 unchanged as described above.

(7) We compute hashes for the new keys in k, leaving the value h2+1 unchanged. Steps 3 to 7 repeat until all probe keys are processed.

Note that we need to write the payload (matched keys) into an output buffer with Selective Store between steps 4 and 5, e.g., to perform a hash join. For simplicity, we ignore empty hash buckets in the illustration in Figure 2. To handle empty hash buckets, we need to compute three bitmasks. The first bitmask (c′) indicates found keys, the second bitmask (c′′) indicates empty buckets, and the third bitmask (c) indicates collisions (c = ¬c′ ∧ ¬c′′).

If we are building a hash table, instead of using a Gather operation to load payloads, we use a Scatter operation to store keys. Since multiple SIMD lanes can point to the same hash bucket, we need to verify afterwards whether the hash table contains the expected keys using steps 3 and 4. For successfully stored keys, we store the payload in the hash table using a Scatter operation. For conflicting keys, we need to probe the next bucket using step 5.

Note that the hash table is unaware of being accessed with vectorized operations. We can use the described scheme to access hash tables built with scalar operations and vice versa.

2.3 Related Work

Heimel et al. showed that OpenCL is a viable way to run database systems on heterogeneous processors [4]. Our work complements this research by porting vectorization optimizations to OpenCL. Pirk et al. introduced Voodoo, a vector algebra that abstracts from the underlying processor and generates OpenCL code [7]. Breß et al. introduced Hawk, a hardware-tailored code generator which produces custom code for heterogeneous processors [3]. Our work complements Voodoo and Hawk with templates for efficient vectorized hash tables in OpenCL.

Richter et al. presented a seven-dimensional analysis of hash tables [9]. Balkesen et al. [1] and Blanas et al. [2] studied efficient hash joins focusing on radix joins. Jha et al. optimized hash joins for Xeon Phis, but with limited use of SIMD instructions [6]. Zhou and Ross introduced vectorizations for major database operators (selections, joins, aggregations, etc.) [13]. They pointed out opportunities of SIMD in databases but did not apply SIMD to hash tables. Complementary to our work, Ye et al. evaluated different strategies for efficient aggregations on multi-core CPUs [12].

3 PORTABLE VECTORIZED HASHING

In this section, we review vectorization support in OpenCL and present the internals of our OpenCL-based primitives Selective Load, Selective Store, Gather, and Scatter.

3.1 Vectorization Support in OpenCL

To implement data movement primitives, we use several built-in functions. OpenCL natively supports vector data types which represent SIMD registers. OpenCL also provides arithmetic and logical operations on vector types. For example, we compare two vectors containing four values in Listing 1. We can also access individual vector components by their indices, which increase from left to right. For example, probeKeys.s0 selects the left-most component containing the value k1.

    1  uint4 probeKeys = { k1, k2, k3, k4 };
    2  uint4 foundKeys = { k1, k5, k3, k4 };
    3  uint4 mask = probeKeys == foundKeys;  // { -1, 0, -1, -1 }

Listing 1: Vectorized data types and operations in OpenCL.

The function shuffle(input, mask) returns a vector in which each component si contains the value of input.sj that is specified by the corresponding component si in mask, i.e., j = mask.si. The function select(a, b, mask) returns a vector in which each component si contains the value of a.si if mask.si ≥ 0 and b.si otherwise.

3.2 Implementation of Primitives

Selective Load. Listing 2 shows the internals of the Selective Load primitive which we introduce with the other primitives in Section 2.1 (Figure 2a). The algorithm has four parameters: (1) input: source memory buffer, (2) offset: read offset on input, (3) vector: target vector, and (4) mask: indicates the components in vector which will be overwritten. The algorithm uses the parameters as follows: (1) It moves components which will be overwritten to the left of the target vector and adjusts the mask accordingly (Lines 3–6). (2) The algorithm loads the input data into a temporary vector (Line 7). (3) It copies the left-most values from the temporary vector into the target vector according to the mask (Line 8). (4) The algorithm moves the components of the target vector back to their original positions (Lines 11–12). The shuffle functions used in steps 1 and 4 of the algorithm require permutation masks (left and back) to reorder the target vector. To speed up execution, we precompute these masks and store them in two lookup tables (one per step).
    1   // Inputs: input, offset, vector, mask
    2   // Outputs: offset, vector
    3   ushort index = sum_{i=0..n} -2^(n-i) * mask.s_i;
    4   uchar8 left = convert_uchar8(move_left_masks[index]);
    5   vector = shuffle(vector, left);
    6   int8 mask2 = shuffle(mask, left);
    7   uint8 tmp = vload8(0, &input[offset]);
    8   vector = select(vector, tmp, mask2 == -1);
    9   offset += popcount(index);
    10  // lines below can be omitted for optimization
    11  uchar8 back = convert_uchar8(move_back_masks[index]);
    12  vector = shuffle(vector, back);

Listing 2: OpenCL implementation of Selective Load.

    1   // Inputs: output, offset, vector, mask
    2   // Outputs: offset
    3   uchar index = sum_{i=0..n} -2^(n-i) * mask.s_i;
    4   uchar8 left = convert_uchar8(move_left_masks[index]);
    5   uint8 tmp = shuffle(vector, left);
    6   vstore8(tmp, 0, &output[offset]);
    7   offset += popcount(index);

Listing 3: OpenCL implementation of Selective Store.

    1   // Inputs: input, vector, mask
    2   // Output: vector
    3   vector.s0 = input[mask.s0];
    4   vector.s1 = input[mask.s1];
    5   // ... up to vector.s7 = input[mask.s7];

Listing 4: OpenCL implementation of Gather.

    1   // Inputs: output, vector, mask
    2   output[mask.s0] = vector.s0;
    3   output[mask.s1] = vector.s1;
    4   // ... up to output[mask.s7] = vector.s7;

Listing 5: OpenCL implementation of Scatter.

    1   // Inputs: uint64_t *table, __m128i index
    2   __m128i index_R = _mm_shuffle_epi32(index, _MM_SHUFFLE(1, 0, 3, 2));
    3   __m128i i12 = _mm_cvtepi32_epi64(index);
    4   __m128i i34 = _mm_cvtepi32_epi64(index_R);
    5   size_t i1 = _mm_cvtsi128_si64(i12);
    6   size_t i3 = _mm_cvtsi128_si64(i34);
    7   __m128i d1 = _mm_loadl_epi64((__m128i*)&table[i1]);
    8   __m128i d3 = _mm_loadl_epi64((__m128i*)&table[i3]);
    9   i12 = _mm_srli_si128(i12, 8);
    10  i34 = _mm_srli_si128(i34, 8);
    11  size_t i2 = _mm_cvtsi128_si64(i12);
    12  size_t i4 = _mm_cvtsi128_si64(i34);
    13  __m128i d2 = _mm_loadl_epi64((__m128i*)&table[i2]);
    14  __m128i d4 = _mm_loadl_epi64((__m128i*)&table[i4]);
    15  __m256i d12 = _mm256_castsi128_si256(_mm_unpacklo_epi64(d1, d2));
    16  __m256i d34 = _mm256_castsi128_si256(_mm_unpacklo_epi64(d3, d4));
    17  __m256i res = _mm256_permute2x128_si256(d12, d34, _MM_SHUFFLE(0, 2, 0, 0));

Listing 6: SIMD implementation of Gather [8].
We select the required permutation mask depending on the mask parameter (Line 3). The lookup tables together consume 4 KB and fit comfortably in the L1 cache.

Step 4 of the algorithm is only required if the calling code expects the components of the target vector to remain in the original order. In many cases, the calling code does not have this expectation. For example, the hashing scheme in Section 2.2 does not require the original order as long as reordering is mirrored between probe keys and the collision mask. Therefore, we optimize the general implementation shown above by omitting step 4 of the algorithm (Lines 11 and 12). This optimization discards one lookup table and saves the respective space in the L1 cache.

Selective Store. The implementation of Selective Store is provided in Listing 3. It is similar to Selective Load and has the same parameters. Again, we move the components of vector that will be stored according to mask to the left (Line 5). The values are then written to output (Line 6). Since the original vector is unchanged, we do not have to shuffle it back.

Gather and Scatter. We provide the implementations for the Gather and Scatter primitives in Listings 4 and 5. We access the components of vector and mask by their indices (see Section 3.1). Overall, we replace a complex and non-portable implementation based on intrinsics (Listing 6) with a fairly simple and portable implementation in OpenCL. Our implementation offers the same functionality, while improving maintainability and portability.

4 EVALUATION

4.1 Experimental Setup

Execution Environment. We evaluate our implementation on two processors, an Intel Core i7-6700K CPU² and a Xeon Phi 7120P coprocessor³ based on Intel's MIC architecture. As of writing, the newer Xeon Phi Knights Landing (KNL) product line does not yet support OpenCL. On the Xeon Phi, we utilize 60 threads for our measurements to emphasize call overheads. On the CPU, we utilize eight threads to emphasize attainable throughput. We use native implementations based on intrinsics [8] as a baseline. First, we perform microbenchmarks to measure the performance of our vectorized data movement primitives in isolation. We then evaluate the complete hash table implementation by executing a hash join. We separately measure building the hash table on the inner join table, and probing it with keys from the outer table.

Every experiment is executed 20 times. We prevent autovectorization of the scalar and intrinsics implementations. For each experiment, we generate new test data to obtain unbiased results.

² 4 GHz, 4 physical cores, 2 threads/core, 8 MB shared L3, 32 GB RAM
³ 1.24 GHz, 61 physical cores, 4 threads/core, 512 kB L2 per core, 16 GB RAM

Load and Store. To evaluate both primitives, we stream 100 million keys (10^8, 32-bit int) from memory into a SIMD register (or vice versa) and measure the throughput. For each invocation, we change the mask indicating which SIMD lanes are accessed, accessing four out of eight lanes on average. The large number of keys simulates a large outer table of a hash join.

Gather and Scatter. To evaluate Gather and Scatter, we read 100 million keys from discontiguous memory locations into a SIMD register. We use memory regions of different sizes, from 4 kB to 64 MB, to simulate hash tables built on inner tables used in a hash join. The stride between the memory locations depends on the size of the memory region. Especially for large memory regions, we chose the indices so that the accessed data does not fit into the processor cache.

Hash Join Build and Probe. In general, we adopt the experimental setup of Polychroniou et al. [8] in order to obtain comparable results. We build a hash table on the inner table with a load factor of 50% and evaluate hash tables of different sizes, from 4 kB to 64 MB. On the Xeon Phi, we cannot build 60 hash tables of 64 MB in OpenCL due to OpenCL memory allocation restrictions. As the reference [8] does not provide an intrinsics implementation for the build on CPUs, we omit the curve.

To simplify partitioning the workload to different threads, the number of keys in the outer table depends on the processor. On the CPU, the outer table contains 100 million keys. On the Xeon Phi, it contains 245.76 million keys. We chose the keys in the outer table so that on average every tenth key is found in the hash table. Note that our hash join implementation includes writing the matched probe keys to an output buffer.
[Figure 3: charts omitted in this text rendering. Panels: (a) Selective Load & Store on CPU; (b) Build in OpenCL on CPU; (c) Probe on CPU; (d) Gather & Scatter on CPU; (e) Build on Xeon Phi; (f) Probe on Xeon Phi. Series: Scalar (C), Scalar (OpenCL), Vectorized (Intrinsics), Vectorized (OpenCL); dashed lines show scalar builds without call overhead. Throughput is reported in GB/second and billion operations or buckets per second over varying hash table sizes, with L1/L2/L3 cache boundaries marked.]

Figure 3: Evaluation results on the Intel Core i7-6700K CPU and Xeon Phi 7120 coprocessor.
4.2 Results

Selective Load and Store. Figure 3a shows the results of the Selective Load and Store microbenchmarks. The intrinsics-based version of Selective Load outperforms the portable OpenCL implementation by a factor of 1.75. The native Selective Store implementation is 4.6 times faster than the OpenCL version.

Gather and Scatter. Figure 3d shows the results of the Gather and Scatter microbenchmark. If the hash table fits into the L1 cache, the intrinsics implementation of Scatter is 1.5 times faster than the OpenCL version. However, if the hash table size exceeds the L1 cache, the performance of both implementations drops. The native implementation of Gather is between 1.75 and 1.9 times faster than the OpenCL-based implementation.

Hash Join Build. We present the results of the hash build experiment for the CPU in Figure 3b and for the Xeon Phi in Figure 3e. On the CPU, the OpenCL-based vectorized implementation is marginally slower than the OpenCL-based scalar implementation. Both are significantly slower than the C-based build and exhibit a curious rising trend up to a hash table size of 1 MB. This trend is due to overheads of OpenCL kernel invocations. To illustrate this, we also show scalar implementations which build 10000 hash tables inside a single function to minimize call overhead (dashed lines). These curves follow the expected shape. On the Xeon Phi, all implementations show a rising trend due to function call overhead. These overheads cause the bowl-shaped form of the curves, which is most visible for the C-based scalar implementation. However, we cannot fully explain why the curves of the C-based and OpenCL-based scalar implementations cross between the hash table sizes of 64 kB and 128 kB.

Hash Join Probe. Figures 3c and 3f show the results of the hash probe experiment on the CPU and the Xeon Phi. We compare the vectorized OpenCL-based implementation with an intrinsics-based implementation [8] and a scalar implementation. On the CPU, both vectorized implementations outperform the scalar version as long as the hash table fits into the L2 cache. For 4 kB hash tables, the intrinsics-based implementation is twice as fast as the scalar implementation, whereas the OpenCL-based implementation is 1.3 times faster. For small hash tables on the Xeon Phi, the intrinsics-based implementation greatly outperforms the scalar version, whereas the OpenCL-based vectorized implementation is slower. On this processor, the data movement primitives are implemented directly as SIMD instructions which perform much faster than implementations that rely on an emulation.

5 CONCLUSION

Vectorized database operators improve performance but require processor-specific APIs. In this paper, we vectorize the essential primitives Gather, Scatter, Selective Load, and Selective Store in OpenCL to reduce code complexity and to ensure portability. We conduct an evaluation on CPUs and Xeon Phi coprocessors. In general, vectorized hashing based on intrinsics outperforms OpenCL-based hashing. Hash tables usually exceed processor caches. In this case, both variants are memory-bound and perform similarly. However, on CPUs, OpenCL-based vectorized hashing outperforms scalar hashing for moderately sized hash tables that fit into the L2 cache. In this case, our OpenCL-based hashing scheme is competitive with intrinsics-based hashing.

Acknowledgements: This work was funded by the EU projects SAGE (671500) and E2Data (780245), DFG Stratosphere (606902), and the German Ministry for Education and Research as BBDC (01IS14013A) and Software Campus (01IS12056).

REFERENCES
[1] Cagri Balkesen, Jens Teubner, et al. 2013. Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware. In IEEE ICDE. 362–373.
[2] Spyros Blanas, Yinan Li, et al. 2011. Design and evaluation of main memory hash join algorithms for multi-core CPUs. In ACM SIGMOD. 37–48.
[3] Sebastian Breß et al. 2017. Generating Custom Code for Efficient Query Execution on Heterogeneous Processors. CoRR abs/1709.00700 (2017).
[4] Max Heimel, Michael Saecker, Holger Pirk, et al. 2013. Hardware-Oblivious Parallelism for In-Memory Column-Stores. PVLDB 6, 9 (2013), 709–720.
[5] Intel. [n. d.]. Intel C++ Intrinsic Reference. Retrieved September 30, 2017 from https://software.intel.com/sites/default/files/a6/22/18072-347603.pdf
[6] Saurabh Jha, Bingsheng He, et al. 2015. Improving main memory hash joins on Intel Xeon Phi processors: An experimental approach. PVLDB, 642–653.
[7] Holger Pirk, Oscar Moll, Matei Zaharia, et al. 2016. Voodoo - a vector algebra for portable database performance on modern hardware. PVLDB, 1707–1718.
[8] Orestis Polychroniou, Arun Raghavan, and Kenneth A. Ross. 2015. Rethinking SIMD vectorization for in-memory databases. In ACM SIGMOD. 1493–1508.
[9] Stefan Richter, Victor Alvarez, et al. 2015. A Seven-dimensional Analysis of Hashing Methods and Its Implications on Query Processing. PVLDB, 96–107.
[10] Thomas Willhalm et al. 2009. SIMD-scan: Ultra Fast In-memory Table Scan Using On-chip Vector Processing Units. PVLDB 2, 1, 385–394.
[11] Thomas Willhalm, Ismail Oukid, Ingo Müller, and Franz Faerber. 2013. Vectorizing Database Column Scans with Complex Predicates. In ADMS. 1–12.
[12] Yang Ye, Kenneth A. Ross, and Norases Vesdapunt. 2011. Scalable Aggregation on Multicore Processors. In ACM DaMoN. 1–9.
[13] Jingren Zhou and Kenneth Ross. 2002. Implementing database operations using SIMD instructions. In ACM SIGMOD. 145–156.