10.1515_comp-2016-0006
2016; 6:79–90
specifies the execution and branching behavior of a single thread. CUDA requires that thread blocks are independent: a kernel executes its blocks correctly no matter the order in which they are run. This independence of the blocks of a kernel provides scalability [2–6].

In this paper we focus on employing CUDA's fast computing capabilities for solving important numerical problems. Our case study is the eigenvalue problem, described in detail in Section 2. In Section 3 we present our work on transforming typical numerical procedures from the CPU to the GPU in an easy way, together with an evaluation of our transformations. We also describe the difficulties associated with such operations. In Section 4 we summarize and conclude.

2 The generalized algebraic eigenvalue problem

In many fields of physics and engineering the analysis of eigenvalue problems plays an important role, for example in determining the eigenstates of quantum systems. A general eigenvalue problem can be described as follows:

    Hc = εSc,    (1)

where we assume that the matrices H and S are real and symmetric, the matrix S is positive definite, and ε and c are the eigenvalues and eigenvectors, respectively. To solve this type of equation it is convenient to convert it into an equivalent form in which S reduces to the identity matrix I (S → I). This can be done by using the Cholesky decomposition, which factors a symmetric (Hermitian) positive definite matrix into the product of a matrix and its transpose:

    S = LL^T,    (2)

where L is a nonsingular lower triangular matrix and L^T its transpose. Substituting the decomposition of S into equation (1) we obtain:

    Hc = εLL^T c.    (3)

Multiplying both sides from the left by the inverse of L and writing the identity matrix I in the form L^{-T} L^T, we get

    (L^{-1} H L^{-T})(L^T c) = ε (L^{-1} L)(L^T c).    (4)

Now, substituting Ĥ = L^{-1} H L^{-T} and ĉ = L^T c, we get an equivalent standard algebraic eigenvalue problem [7–10]:

    Ĥĉ = εĉ.    (5)

To solve the algebraic eigenvalue problem and to obtain the eigenvalues and/or eigenvectors, one needs to use numerical linear algebra algorithms. One of the libraries that contain the necessary algorithms is the Linear Algebra Package (LAPACK). It offers a number of routines working in single and double precision. In this work we employ: xPOTRF, which computes the Cholesky decomposition (the symbol x refers to the precision: single SPOTRF or double DPOTRF); xTRTRI, which computes the inverse of a real upper or lower triangular matrix; and the blocked xSYTRD, which reduces a real symmetric matrix to symmetric tridiagonal form. To obtain the eigenvalues of a symmetric tridiagonal matrix we used the xSTERF routine, which uses a square-root-free version of the QR algorithm [11]. It is worth emphasizing that our main case study is the efficient computation of eigenvalues and eigenvectors for a quantum mechanical system. For this purpose one needs to use the xSTEQR function on a symmetric tridiagonal matrix; this routine uses the implicitly shifted QR algorithm [11]. Details may be found in [11–14].

3 Results and discussion

3.1 Simple conversion of functions from the CPU to the GPU

Here we present an example of a function converted by us to work on the graphics card using the CUDA Basic Linear Algebra Subroutines (CUBLAS) library. The function SSYTD2 reduces a real symmetric matrix Ĥ to symmetric tridiagonal form T by an orthogonal similarity transformation, Q^T Ĥ Q = T, on the CPU. The function can be found at [12]. Some parts of the function SSYTD2_GPU for the GPU (in the lower triangular case) have been written using the CUBLAS library provided by NVIDIA Corporation.

The code in Procedure 1a (lines 1–15) describes simple kernels whose aim is to combine CUBLAS functions so that all the routines can be performed on the graphics card without data transfer between the CPU and GPU. It has been verified that these simple kernels are faster than the cudaMemcpy function.

The numbers in angle brackets next to the name of a function (see Procedure 1b, line 28) are the execution configuration that tells the system how to launch the kernel: the first specifies the number of parallel blocks, and the second specifies the number of threads per block. In this example one copy of the kernel is sufficient, and thus no parallelization is needed in this case.
Procedures for GPU. Generalized eigenvalue problem | 81
Procedure 1a: Part of function SSYTD2_GPU for GPU in lower triangular case (lines 1–15).
1  ad2ed(...)
2  /* value of the i__ element of the vector ed is replaced by value of the (i__+1+i__*a_dim1)
3     element of the matrix ad */
4  replacing_by_1(...)
5  /* value of the (i__+1+i__*a_dim1) element of the matrix ad is replaced by 1 */
6  ed2ad(...)
7  /* value of the (i__+1+i__*a_dim1) element of the matrix ad is replaced by value of the i__
8     element of the vector ed */
9  ad2dd_taui2taud(...)
10 /* value of the i__ element of the vector dd__ is replaced by value of the (i__+i__*a_dim1)
11    element of the matrix ad and also value of the i__ element of the vector taud is replaced by
12    value of the variable taui */
13 ad2dd(...)
14 /* value of the n-1 element of the vector dd__ is replaced by value of the (n-1+(n-1)*a_dim1)
15    element of the matrix ad */
Procedure 1b: Part of function SSYTD2_GPU for GPU in lower triangular case (lines 16–28).
16 int ssytd2_GPU(...) {
17 /* Allocate device memory for the matrices ad, dd__, ed, taud */
18 /* Initialize the device matrices with the host matrices */
19 if (upper) { }
20 else {
21 /* Reduce the lower triangle of A */
22 for (i__ = 0; i__ < i__1; ++i__) {
23 /* Generate elementary reflector H(i) = I - tau * v * v' to annihilate A(i+2:n,i) */
24 /* Computing MIN */
25 slarfg_GPU(...);
26 /* SLARFG function from the LAPACK library (which generates a real elementary reflector H of
27    order n) has been changed to work on the GPU */
28 ad2ed<<< 1,1 >>>(...);
Procedure 1c: Part of function SSYTD2_GPU for GPU in lower triangular case (lines 29–39).
29 if (taui != 0.f) {
30 /* Apply H(i) from both sides to A(i+1:n,i+1:n) */
31 replacing_by_1<<< 1,1 >>>(...);
32 /* Compute x := tau * A * v storing y in TAU(i:n-1) */
33 cublasSsymv(...);
34 /* Compute w := x - 1/2 * tau * (x'*v) * v */
35 alpha = taui * -.5f * cublasSdot(...);
36 cublasSaxpy(...);
37 /* Apply the transformation as a rank-2 update: */
38 /* A := A - v * w' - w * v' */
39 cublasSsyr2(...);
As can be seen by analyzing the part of the code from lines 29 to 39 (see Procedure 1c), all Basic Linear Algebra Subprograms (BLAS) procedures were swapped for CUBLAS procedures (lines 33, 35, 36, 39).
82 | Ł. Syrocki and G. Pestka
Procedure 1d: Part of function SSYTD2_GPU for GPU in lower triangular case (lines 40–47).
Lines 43 to 46 (see Procedure 1d) describe data transfer from the GPU to the CPU memory (cudaMemcpy); in turn, line 47 (see Procedure 1d) describes the release of the allocated GPU memory (cublasFree).

It can be seen in this example (Procedure 1) that only the main loop is executed on the CPU. This is because the functions in the CUBLAS library are C-wrapper functions and it is not possible to call them inside a kernel (below CUDA 5.0). The three main components of programming on graphics cards are: 1) transferring data from the CPU to the GPU memory, 2) performing the calculations, 3) downloading the result from the GPU memory. In an attempt to minimize the transfer of data, we write the code so as to transmit the data only at the beginning and at the end of the program, as shown in Procedure 1, lines 18 and 43–46. Due to the limited memory of graphics cards (for example, the GeForce GTX 560 has only 1 GB), it is necessary to keep as little data as possible in the memory of the GPU. The presented conversion could be made more sophisticated, whereby the acceleration could be higher. However, in our approach we get a satisfying acceleration without spending too much time on the code conversion. Several transformed functions in single and double precision can be found in the attached SLASfGPU library (HOUSEHOLDER_GPU, xLARFG_GPU, xPOTF2_GPU, xSYTD2_GPU, xSYTD2_GPU2 (for device), xTRTI2_GPU, xSYTRD_GPU, xLATRD_GPU).

3.2 Test runs

In order to enable tests of the presented routines we have prepared two sets of input files, aggregated in the folder Test_runs. In this folder there are two subfolders, Test_float and Test_double, created in the integrated development environment NetBeans. They contain functions written in single or double precision. In the subfolder dist/Debug/CUDA-Linux-x86 there are files with input matrices, whose location is given in the main program newmain.cu. The output files are created in the subdirectories Test_float/dist/Debug/CUDA-Linux-x86 for single precision and Test_double/dist/Debug/CUDA-Linux-x86 for double precision.

All test runs have been performed on a computer equipped with an Intel Core i5-2410M 2.3 GHz processor, 4 GB of RAM and the 64-bit operating system Kubuntu 11.10 with the BLAS and LAPACK libraries installed. The graphics card was a GeForce GT 540M with 96 CUDA cores and 1024 MB DDR3 of main memory. Graphics card driver: 295.41. NVIDIA GPU Computing Toolkit: 4.2.9. We have compiled the above code using -arch sm_21 for single and double precision floating point numbers. To measure the computation time we used the CUTIL library from the SDK for CUDA.

3.3 The advantages of using the GPU

A comparison of the speedup factor (the ratio of the execution times, CPU/GPU) of the algorithms involved in solving the algebraic eigenvalue problem, for a one-core CPU and for the GPU, is given in Figures 1, 2 and 4. In turn, a comparison of the speedup factor for many CPU cores, relative to the slowest case, involving the Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) library and GPU processors, is given in Figure 3.

Figure 1 presents the dependence of the speedup factor of the Householder algorithm (the transformation that converts a real symmetric matrix Ĥ to symmetric tridiagonal form T) on the dimension of the square symmetric matrix [15, 16]. The algorithm for the CPU was taken from [16] and converted to C, and then converted for the GPU, in single and double precision. Calculations were performed on a single core of an Intel Core i7-2600K 3.4 GHz processor with approximate computing power equal to 217.6 Gflops (for 4 cores) and a single GeForce GTX 590 graphics card with peak single precision floating point performance equal to 1244.15 Gflops. The complexity of this algorithm is O(n^3) (cf. Figure 1).

A similar analysis, but for the orthogonal similarity transformation, is presented in Figure 2.
Figure 4: Speedup factor of the data transfer times for different matrix sizes between Intel i7-2600K – GeForce GTX 590 and Intel i5-2410M – GeForce GT 540M.
An optimized LAPACK subroutine has been applied in the CPU program. For the GPU, the appropriate function of the LAPACK library, parallelized and converted by us, has been used. Also here the complexity of the algorithm is O(n^3).

Comparing the Householder transformation and the LAPACK xSYTD2 algorithm, we have found that the execution time in double precision on the CPU is slightly longer than the execution time of the algorithm in single precision, while in the case of the GPU the execution in double precision takes about two times longer than in single precision. It follows that the GPU is relatively worse at handling double precision calculations than the CPU. Our implementation of the Householder transformation based on [16], written for the CPU and GPU, is much slower than the xSYTD2 algorithm taken from the LAPACK library for the CPU and GPU; but in terms of acceleration on the GPU in comparison with the CPU execution time, our Householder algorithm is better in both single and double precision. Both single and double precision calculations on the GPU give a significant acceleration compared to the CPU, particularly for larger dimensions of the matrix (see Figure 1 and Figure 2).

In Figure 3 the speedup factors, relative to the slowest case, are shown for several algorithms which transform a symmetric matrix to symmetric tridiagonal form, taken from four different linear algebra libraries. The transformations were performed in single and in double precision on an Intel Core i7-2600K processor and a single GeForce GTX 590 graphics card. The LAPACK (3.3.1) library is designed for use on a single processor core. The PLASMA (2.5.0beta1) [17, 18] library allows the use of all processor cores (in Figure 3 the calculations were carried out on 4 cores without hyper-threading). The commercial implementation of the LAPACK interface for CUDA, the CULA (R14) library [19], and the open source Matrix Algebra on GPU and Multicore Architectures (MAGMA) (1.4.1-beta2) library [20, 21] are designed for graphics cards. The SSYTRD_GPU and DSYTRD_GPU functions are implemented by us. Comparing the speedup factors, one can see that even in relation to the commercial CULA library and the PLASMA library, our easy way of implementation gives better acceleration. In comparison with the newest version, 1.4.1-beta2, of the professional MAGMA library, our results are almost at the same level. MAGMA xSYTRD was analyzed earlier [22].

The penalties introduced by the data transfer times for different matrix sizes, characterized by the speedup factor, are presented in Figure 4. With a high-speed bus such as PCI Express 2.0 x16 (theoretical bandwidth equal to 8 GB/s in each direction), the data transfer between the CPU and GPU (and GPU and CPU) takes at most hundreds of milliseconds; it therefore does not affect the execution time of the algorithm, being negligible compared to the total execution time. Of course, in the case of shorter algorithms the data transfer time may play a more significant role [6].

A comparison of the technical data of the processors and graphics cards used for matrix diagonalization is presented in Tables 1 and 2 [24–28]. Tables 3, 5 and 7 show a comparison of the execution times for various processors and various graphics cards used in the diagonalization of matrices of different size. The functions used for this purpose are listed in the tables. Some of them were taken from the LAPACK library for the CPU, and some others have been transformed from the LAPACK library so that they can be used on graphics cards. The specified functions perform an initial diagonalization step (on the CPU or GPU), that is, the transformation of a symmetric matrix to tridiagonal form.
Table 1: Specification of the Intel processors used for the matrix diagonalization.

| Model of the Intel processor                                        | Intel Core Duo | Intel Core i5-2410M | Intel Core i7-960 | Intel Core i7-2600K |
| Number of cores                                                     | 2              | 2                   | 4                 | 4                   |
| Number of threads                                                   | -              | 4                   | 8                 | 8                   |
| Core freq. [GHz]                                                    | 1.86           | 2.3                 | 3.2               | 3.4                 |
| Theoretical peak single precision floating point perf. [Gflops]^1   | 29.8           | 73.6                | 102.4             | 217.6               |
| Theoretical peak double precision floating point perf. [Gflops]^1   | 14.9           | 36.8^2              | 51.2              | 108.8               |
| Thermal design power [W]                                            | 65             | 35                  | 130               | 95                  |

^1 The theoretical peak performance of each processor was calculated on the basis of the equation on page 4 of [23], extended to multi-core architectures, and then compared with the data given in [24, 25].
^2 Example: Intel Core i5-2410M: 2.3 GHz (core freq.) * 8 (AVX double precision operations per cycle; 16 for single precision) * 2 (cores) = 36.8 Gflops (theoretical peak double precision floating point performance).
The entry "The entire program" informs about the duration of the whole diagonalization process on the CPU (obtaining all eigenvalues and eigenvectors). As one can see, if we substitute the GPU functions for the CPU ones, the execution time decreases significantly.

All calculations have been performed using the 64-bit operating system Kubuntu 11.10. Graphics card driver: 295.41; NVIDIA GPU Computing Toolkit: 4.2.9. We have compiled the above code using -arch sm_21 for single and double precision floating point numbers. To measure the computation time we used the simple utility (CUTIL) library from the Software Development Kit (SDK) for CUDA.

For some functions (e.g. LAPACK SSYTD2 and DSYTD2) there is a big difference between single and double precision execution times. For some others (e.g. the Householder transformation [16]) the times are nearly the same. This is because the LAPACK library is written from the point of view of performance: each function should act with the speed of the hardware, and therefore single precision should be up to two times faster. Simple codes from handbooks on numerical methods spend most of the time waiting on the main memory, so that the processor speed does not play a major role. For the GPU cards of both the older generation (GeForce 9500 GT) and the new generation of the CUDA architecture, code-named "Fermi" (GeForce GT 540M, GeForce GTX 560, GeForce GTX 590, Tesla C2075), it is hard to see significant differences in the calculation time in double precision. The reason for this is that only one of the tested cards (GeForce 9500 GT) neither has the "Fermi" architecture nor supports double precision [26–28].

Tables 4, 6 and 8 present the speedup factors of the functions used for matrix diagonalization for N = 2 000, 6 000 and 10 000. The speedup factor has been calculated with respect to one core of the best processor tested, the Intel Core i7-2600K, for all the graphics cards tested, in single and double precision. A speedup factor lower than 1 indicates that there was no acceleration. Comparing the acceleration values for different matrix sizes (see Tables 4, 6 and 8) one can see a general trend: the GPU is faster than the CPU. Moreover, the accelerations for the matrix with N = 6 000 (see Table 6) on three graphics cards (GeForce GTX 560, GeForce GTX 590, Tesla C2075) are slightly larger than for the matrix with N = 10 000 (see Table 8), despite the fact that it might seem that the acceleration should increase with increasing matrix size.

The presented approach has also been applied and tested on a real physical problem: it was used for the determination of the helium atom eigenstates by solving the secular equation in the case of the Dirac–Coulomb Hamiltonian. The obtained speedup factor did not change significantly from that shown for the test data presented here.

3.4 Accuracy of tested subroutines

The presented subroutines have been tested for the accuracy of the results obtained for random matrices of dimension N = 1000 and N = 3000. For this purpose we have used the standard backward error formula taken from LAPACK Working Note 41, Section 7.6.4 [29], and the results are summarized in Table 9. For most cases, the error calculated in double precision is smaller
Table 2: Specification of the graphics cards used for the matrix diagonalization.

| Model of the graphics card | GeForce 9500 GT | GeForce GT 540M | GeForce GTX 560 | Tesla C2075 | GeForce GTX 590 [GF110 (x2)] |
| Number of CUDA cores       | 32              | 96              | 336             | 448         | 2 x 512                      |
| Core freq. [MHz]           | 1400            | 1344            | 1800            | 1147        | 1215                         |
Table 3: CPU and GPU matrix diagonalization execution times (in seconds), N = 2 000.

| Function | Precision | Intel Core Duo 1.86 GHz | GeForce 9500 GT | Intel Core i5-2410M 2.3 GHz | GeForce GT 540M | Intel Core i7-960 3.2 GHz | GeForce GTX 560 | Tesla C2075 | Intel Core i7-2600K 3.4 GHz | GeForce GTX 590 [GF110 (x1)] |
| Householder transformation | Single | 384 | 36.6 | 202 | 21 | 169 | 5 | 4.6 | 124.2 | 4.3 |
|  | Double | 401 | - | 209 | 58 | 186 | 12.3 | 5.4 | 129.5 | 7.6 |
| The entire program | Single | 422 | 79.8 | 219 | 36 | 208 | 40 | 35 | 131.6 | 14.4 |
|  | Double | 457.6 | - | 233 | 74 | 217 | 42.3 | 35.2 | 144 | 19 |
| Orthogonal similarity transformation: Q^T Ĥ Q = T | Single | 5.1 | 12.1 | 3.5 | 2.1 | 3.7 | 0.9 | 1 | 2.4 | 0.5 |
|  | Double | 10.3 | - | 4.5 | 3.5 | 4.2 | 1.2 | 1.3 | 3.7 | 0.8 |
| The entire program | Single | 34.4 | 42.4 | 17 | 15.2 | 22 | 18.2 | 18 | 11.7 | 9.8 |
|  | Double | 72.6 | - | 28.1 | 27.5 | 27.3 | 23.3 | 22.3 | 19.2 | 16 |
Table 4: CPU/GPU ratios for the case displayed in Table 3 (N = 2 000).
Table 5: CPU and GPU matrix diagonalization execution times (in seconds), N = 6 000.

| Function | Precision | Intel Core Duo 1.86 GHz | GeForce 9500 GT | Intel Core i5-2410M 2.3 GHz | GeForce GT 540M | Intel Core i7-960 3.2 GHz | GeForce GTX 560 | Tesla C2075 | Intel Core i7-2600K 3.4 GHz | GeForce GTX 590 [GF110 (x1)] |
| Householder transformation | Single | 8951 | 1100 | 4322 | 640 | 4590 | 111 | 101 | 3282 | 90.3 |
|  | Double | 9294 | - | 4497 | 1431 | 4882 | 271.5 | 146 | 3499 | 161 |
| The entire program | Single | 9784 | 2152 | 4682 | 1038 | 5029 | 512 | 481 | 3547 | 346.8 |
|  | Double | 10355 | - | 5103 | 1852 | 5290 | 657 | 649 | 3946 | 598 |
| Orthogonal similarity transformation: Q^T Ĥ Q = T | Single | 135 | 315 | 93 | 72 | 100 | 12 | 13 | 67.7 | 9 |
|  | Double | 253 | - | 100 | 99 | 111 | 20.2 | 20.6 | 105.2 | 14.4 |
| The entire program | Single | 931 | 1083 | 435 | 440 | 540 | 481 | 468 | 312.6 | 213 |
|  | Double | 1856 | - | 700 | 693 | 700 | 605 | 581 | 501.2 | 411.9 |
Table 7: CPU and GPU matrix diagonalization execution times (in seconds), N = 10 000.

| Function | Precision | Intel Core Duo 1.86 GHz | GeForce 9500 GT | Intel Core i5-2410M 2.3 GHz | GeForce GT 540M | Intel Core i7-960 3.2 GHz | GeForce GTX 560 | Tesla C2075 | Intel Core i7-2600K 3.4 GHz | GeForce GTX 590 [GF110 (x1)] |
| Householder transformation | Single | 47498 | 5056 | 20049 | 2577 | 22383 | 585 | 474 | 15169 | 427 |
|  | Double | 48024 | - | 20611 | 5683 | 22897 | 1228 | 670 | 15634 | 762 |
| The entire program | Single | 51361 | 8856 | 21569 | 4208 | 24451 | 2750 | 2571 | 16292 | 1622 |
|  | Double | 52702 | - | 22507 | 7586 | 25067 | 3387 | 2849 | 16902 | 2032 |
| Orthogonal similarity transformation: Q^T Ĥ Q = T | Single | 616 | 1455 | 419 | 242 | 473 | 53 | 55 | 309 | 39.6 |
Table 9: Backwards error ||Ĥ − V T V*|| / (N ε ||Ĥ||).
[21] Matrix Algebra on GPU and Multicore Architectures (MAGMA), http://icl.cs.utk.edu/magma/
[22] Yamazaki I., Dong T., Solcà R., Tomov S., Dongarra J., Schulthess T., Tridiagonalization of a Dense Symmetric Matrix on Multiple GPUs and Its Application to Symmetric Eigenvalue Problems, Concurrency and Computation: Practice and Experience, 2014, 26, 2652–2666
[23] Dongarra J.J., Luszczek P., Petitet A., The LINPACK Benchmark: Past, Present, and Future, Concurrency and Computation: Practice and Experience, 2003, 15, 803–820
[24] Intel Processors, http://www.intel.com/support/processors/sb/cs-017346.htm
[25] ARK | Your Source for Intel® Product Information, http://ark.intel.com/
[26] GeForce 500 Series, http://en.wikipedia.org/wiki/GeForce_500_Series
[27] GeForce 9 Series, http://en.wikipedia.org/wiki/GeForce_9_Series
[28] Fermi Architecture White Paper – Nvidia, http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
[29] Blackford S., Dongarra J.J., LAPACK Working Note 41, UT-CS-92-151, March 1992. Updated: June 30, 1999 (VERSION 3.0), http://www.netlib.org/lapack/lawnspdf/lawn41.pdf