
Open Comput. Sci. 2016; 6:79–90
Research Article, Open Access
Łukasz Syrocki* and Grzegorz Pestka

Implementation of algebraic procedures on the GPU using CUDA architecture on the example of generalized eigenvalue problem
DOI 10.1515/comp-2016-0006
Received Oct 13, 2014; accepted Mar 21, 2016

Abstract: A ready to use set of functions to facilitate solving a generalized eigenvalue problem for symmetric matrices, in order to efficiently calculate eigenvalues and eigenvectors using Compute Unified Device Architecture (CUDA) technology from NVIDIA, is provided. An integral part of CUDA is the high level programming environment enabling tracking of both code executed on the Central Processing Unit and code executed on the Graphics Processing Unit. The presented matrix structures allow for the analysis of the advantages of using graphics processors in such calculations.

Keywords: Eigenvalue problem; CUDA; GPGPU computing; CUBLAS library; LAPACK library

PACS codes: 02.70.Wz, 07.05.Tp

Program summary:
Title of program: SLASfGPU
Program obtainable from: http://www.fizyka.umk.pl/~lukes/
Computer: graphics card with CUDA technology recommended
Operating system: no limits (tested on 32-bit and 64-bit Windows and 64-bit Linux)
Programming language used: C++, C for CUDA
Memory required to execute: dependent on user's parameters, typically between several tens of megabytes and several gigabytes (this also concerns the graphics card's memory)
Number of bits in a word: 32 and 64 bits
Number of processors used: one CPU core and all CUDA cores of the selected graphics card
Number of bytes in distributed program: 72 kB
Number of bytes of the test runs input files: 95.3 MB
Number of bytes of the test runs reference output: 640 kB
Distribution format: zip
Nature of physical problem: algebraic functions facilitate solving the generalized eigenvalue problem
Typical run time: every test example included in the distribution package takes up to 15 minutes
Method of solution: CUBLAS library, LAPACK library, CUDA technology

*Corresponding Author: Łukasz Syrocki: Institute of Physics, Faculty of Physics, Astronomy and Informatics, Nicolaus Copernicus University, Grudziadzka 5, 87-100 Torun, Poland; Email: lukes@fizyka.umk.pl
Grzegorz Pestka: Institute of Physics, Faculty of Physics, Astronomy and Informatics, Nicolaus Copernicus University, Grudziadzka 5, 87-100 Torun, Poland

© 2016 Ł. Syrocki and G. Pestka, published by De Gruyter Open. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License.

1 Introduction

The Compute Unified Device Architecture (CUDA) is a parallel computing architecture for a Graphics Processing Unit (GPU) developed by NVIDIA. This architecture allows the use of the computing power of a GPU to solve general problems numerically much more efficiently than traditional general purpose multi-core processors [1]. An integral part of the CUDA architecture is a software environment using the parallel processing model, which allows developers to use C, C++ or FORTRAN to create applications. As a consequence, a programmer can focus on the important things related to the parallelization, such as the art of creating efficient parallel algorithms, without having to learn the structure and syntax of a new language. This model is designed to allow the user to write highly scalable parallel code that can run on tens of thousands of concurrent threads and hundreds of processor cores. A CUDA program consists of the main program, where the parallel kernel is called by one or more sequential threads running on the Central Processing Unit (CPU), and of the kernel, which is suitable for execution on a parallel processing device such as the GPU. Thread blocks are conceptually organized into 1D, 2D or 3D arrays of threads for convenience. The maximum sizes of the dimensions of a block are 1024 × 1024 × 64. Blocks are grouped in a grid of up to (2^31 - 1) × 65535 × 65535 blocks (depending on the compute capability version of the GPU). To manage such a huge number of threads, the GPU's multiprocessors employ the SIMT (Single-Instruction, Multiple-Thread) architecture, which specifies the execution and branching behavior of a single thread.
CUDA requires that thread blocks are independent. This means that a kernel executes blocks correctly no matter the order in which they are run. This independence of the blocks of a kernel provides scalability [2–6].
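To make the execution configuration concrete, the following minimal, self-contained sketch (our illustration, not part of the SLASfGPU package; all names are ours) scales a vector on the GPU; the <<<grid, block>>> launch parameters select the number of blocks and the number of threads per block discussed above:

#include <cstdio>
#include <cuda_runtime.h>

/* Each thread scales one element; its global index is rebuilt from the
   block index and the thread index within the block. */
__global__ void scale(float *x, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                /* the last block may be only partially used */
        x[i] *= alpha;
}

int main()
{
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    dim3 block(256);                          /* threads per block (limit 1024 x 1024 x 64) */
    dim3 grid((n + block.x - 1) / block.x);   /* enough blocks to cover all n elements      */
    scale<<<grid, block>>>(d_x, 2.0f, n);

    cudaDeviceSynchronize();                  /* wait for the kernel before cleaning up     */
    cudaFree(d_x);
    printf("launched %u blocks of %u threads\n", grid.x, block.x);
    return 0;
}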
In this paper we focus on employing CUDA's fast computing capabilities for solving important numerical problems. Our case study is provided by solving the eigenvalue problem, described in detail in Section 2. In Section 3 we present our work of transforming typical numerical procedures from the CPU to the GPU in an easy way, together with an evaluation of our transformations. We also describe the difficulties associated with such operations. In Section 4 we summarize and conclude.

2 The generalized algebraic eigenvalue problem

In many fields of physics and engineering the analysis of eigenvalue problems plays an important role, for example in determining the eigenstates of quantum systems. A general eigenvalue problem can be described as follows:

Hc = εSc,  (1)

where we assume that the matrices H and S are real and symmetric, the matrix S is positive definite, and ε and c are the eigenvalues and eigenvectors, respectively. To solve this type of equation it is convenient to convert it into an equivalent form in which S reduces to the identity matrix I (S → I). This can be done by using the Cholesky decomposition, which decomposes a symmetric (Hermitian) positive definite matrix into the product of a matrix and its transpose:

S = LL^T,  (2)

so that L is a nonsingular lower triangular matrix and L^T its transpose. Substituting the decomposition of S into equation (1) we obtain:

Hc = εLL^T c.  (3)

Multiplying both sides by the inverse of L and taking the identity matrix I in the form L^{-T}L^T we get

(L^{-1} H L^{-T})(L^T c) = ε (L^{-1} L) L^T c.  (4)

Now, by the substitutions Ĥ = L^{-1} H L^{-T} and ĉ = L^T c, we get an equivalent algebraic eigenvalue problem [7–10]:

Ĥ ĉ = ε ĉ.  (5)

To solve the algebraic eigenvalue problem and to obtain the eigenvalues and/or eigenvectors, one needs to use numerical linear algebra algorithms. One of the libraries that contain the necessary algorithms is the Linear Algebra Package (LAPACK). It offers a number of routines working in single and double precision. In this work we employ: xPOTRF, which computes the Cholesky decomposition (the symbol x refers to the precision: single SPOTRF or double DPOTRF); xTRTRI, which computes the inverse of a real upper or lower triangular matrix; and the blocked xSYTRD, which reduces a real symmetric matrix to symmetric tridiagonal form. To obtain the eigenvalues of a symmetric tridiagonal matrix we used the xSTERF routine, which uses a square-root free version of the QR algorithm [11]. It is worth emphasizing that our main case study is the efficient computation of eigenvalues and eigenvectors for a quantum mechanical system. For this purpose one needs to use the xSTEQR function on a symmetric tridiagonal matrix; this routine uses the implicitly shifted QR algorithm [11]. Details may be found in [11–14].
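Purely as an illustration of this call sequence (our sketch, not code from the SLASfGPU package), the host-only program below performs the S → I reduction described above, followed by tridiagonalization and eigenvalue extraction, for a small hypothetical 3 × 3 pair H, S. It uses the customary C prototypes for the Fortran LAPACK/BLAS symbols (link with -llapack -lblas) and replaces the explicit xTRTRI inversion of L by two triangular solves (STRSM), which is mathematically equivalent:

#include <stdio.h>

/* Fortran LAPACK/BLAS routines used below (single precision, column-major). */
extern void spotrf_(const char *uplo, const int *n, float *a, const int *lda, int *info);
extern void strsm_(const char *side, const char *uplo, const char *transa, const char *diag,
                   const int *m, const int *n, const float *alpha,
                   const float *a, const int *lda, float *b, const int *ldb);
extern void ssytrd_(const char *uplo, const int *n, float *a, const int *lda,
                    float *d, float *e, float *tau, float *work, const int *lwork, int *info);
extern void ssterf_(const int *n, float *d, float *e, int *info);

int main(void)
{
    int n = 3, lda = 3, lwork = 64, info = 0;
    float one = 1.0f;
    /* Hypothetical symmetric test matrices; S must be positive definite. */
    float S[9] = { 4, 1, 0,  1, 3, 1,  0, 1, 2 };
    float H[9] = { 2, 1, 0,  1, 2, 1,  0, 1, 2 };
    float d[3], e[2], tau[2], work[64];

    spotrf_("L", &n, S, &lda, &info);                            /* S = L L^T (L kept in S) */
    strsm_("L", "L", "N", "N", &n, &n, &one, S, &lda, H, &lda);  /* H <- L^{-1} H           */
    strsm_("R", "L", "T", "N", &n, &n, &one, S, &lda, H, &lda);  /* H <- H L^{-T}, i.e. H^  */
    ssytrd_("L", &n, H, &lda, d, e, tau, work, &lwork, &info);   /* H^ -> tridiagonal T     */
    ssterf_(&n, d, e, &info);                                    /* eigenvalues of T into d */

    for (int i = 0; i < n; ++i)
        printf("eps[%d] = %f\n", i, d[i]);
    return info;
}

Replacing SPOTRF, SSYTRD and SSTERF by DPOTRF, DSYTRD and DSTERF (and float by double) gives the double precision variant; SSTEQR would be used instead of SSTERF when the eigenvectors are also required.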
matrix and its transpose:
case) have been written using CUBLAS library provided by
S = LLT , (2) NVIDIA Corporation.
The above piece of code describes simple kernels (see
so that L is a nonsingular lower triangular matrix and LT its Procedure 1a, lines 1-15), whose aim is to combine CUBLAS
transposition. Substituting the decomposition of S to the functions, so that all the routines can be performed on the
equation (1) we obtain: graphics card without data transfer from the CPU to GPU
and vice versa. It has been verified that simple kernels are
Hc = εLLT c. (3)
faster than the cudaMemcpy function.
Multiplying both sides by the inverse of L and taking The numbers in parentheses next to the name of the
the identity matrix I in the form of L−T LT we get function (see Procedure 1b, line 28) are executive informa-
(︁ )︁ (︁ )︁ (︁ )︁ tion for a system about how to call the kernel. The first
L−1 H L−T LT c = ε L−1 LLT c. (4) one specifies the number of parallel blocks, and the sec-
ond one specifies the number of threads. In this example,
Now by substitution of H^ = L−1 HL−T and ^
c = LT c we get an
one copy of the kernel call is sufficient, and thus no paral-
equivalent algebraic eigenvalue problem [7–10]:
lelization is needed in this case.
^ c = ε^
H^ c. (5)

Procedure 1a: Part of function SSYTD2_GPU for GPU in lower triangular case (lines 1–15).

1 ad2ed(. . .)
2 /* value of the i__ element of the vector ed is replaced by value of the (i__+1+i__*a_dim1)
3 element of the matrix ad */
4 replacing_by_1( . . .)
5 /* value of the (i__+1+i__*a_dim1) element of the matrix ad is replaced by 1 */
6 ed2ad(. . .)
7 /* value of the (i__+1+i__*a_dim1) element of the matrix ad is replaced by value of the i__
8 element of the vector ed */
9 ad2dd_taui2taud( . . .)
10 /* value of the i__ element of the vector dd__ is replaced by value of the ( i__+i__*a_dim1)
11 element of the matrix ad and also value of the i__ element of the vector taud is replaced by
12 value of the variable taui */
13 ad2dd(. . .)
14 /* value of the n-1 element of the vector dd__ is replaced by value of the (n-1 + (n-1)* a_dim 1)
15 element of the matrix ad */
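The bodies of these kernels are not reproduced in the listing; as an illustration only, a device-side element copy of the kind ad2ed performs could be written as in the sketch below (the argument list mirrors the comments above but is our guess, not the package's actual signature):

/* One thread copies a single matrix element into a vector on the device,
   so no host round-trip through cudaMemcpy is needed. */
__global__ void ad2ed_sketch(const float *ad, float *ed, int i__, int a_dim1)
{
    /* value of the i__ element of ed is replaced by the (i__+1 + i__*a_dim1)
       element of ad, as described for Procedure 1a */
    ed[i__] = ad[i__ + 1 + i__ * a_dim1];
}

/* launched with a single block of a single thread:
   ad2ed_sketch<<<1, 1>>>(d_ad, d_ed, i__, a_dim1);  */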

Procedure 1b: Part of function SSYTD2_GPU for GPU in lower triangular case (lines 16–28).

16 int ssytd2_GPU{
17 /* Allocate device memory for the matrices ad, dd__, ed, taud */
18 /* Initialize the device matrices with the host matrices */
19 if (upper) { }
20 else {
21 /* Reduce the lower triangle of A */
22 for (i__ = 0; i__ < i__1; ++i__) {
23 /* Generate elementary reflector H(i) = I - tau * v * v’ to annihilate A(i+2:n,i) */
24 /* Computing MIN */
25 slarfg_GPU (. . .);
26 /* SLARFG function from LAPACK library (which generates a real elementary reflector H of
27 order n) has been changed to work for the GPU */
28 ad2ed<<< 1,1 >>> (. . .);

Procedure 1c: Part of function SSYTD2_GPU for GPU in lower triangular case (lines 29–39).

29 if (taui != 0.f) {
30 /* Apply H(i) from both sides to A(i+1:n,i+1:n) */
31 replacing_by_1<<< 1,1 >>> (. . .);
32 /* Compute x := tau * A * v storing y in TAU(i:n-1) */
33 cublasSsymv (. . .);
34 /* Compute w := x - 1/2 * tau * (x’*v) * v */
35 alpha = taui * -.5f * cublasSdot(. . .);
36 cublasSaxpy(. . .);
37 /* Apply the transformation as a rank-2 update: */
38 /* A := A - v * w’ - w * v’ */
39 cublasSsyr2(. . .);

As can be seen by analyzing the part of the code from lines 29 to 39 (see Procedure 1c), all Basic Linear Algebra Subprograms (BLAS) procedures were swapped to CUBLAS procedures (lines 33, 35, 36, 39).
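Procedure 1c is written against the original, handle-free CUBLAS interface shipped with the CUDA 4.2 toolkit. For comparison only, the same SYMV/DOT/AXPY/SYR2 sequence expressed with the current handle-based cuBLAS v2 API could look like the sketch below; the device pointers d_A, d_v, d_x and the leading dimension lda are placeholders of our own, not identifiers from SSYTD2_GPU:

#include <cublas_v2.h>

/* Sketch of the update from Procedure 1c, lines 33-39:
     x = tau * A * v
     w = x - 0.5 * tau * (x' v) * v   (overwrites x)
     A = A - v w' - w v'              (symmetric rank-2 update)  */
void rank2_update_sketch(cublasHandle_t h, int n, float tau,
                         float *d_A, int lda, const float *d_v, float *d_x)
{
    const float zero = 0.0f, minus_one = -1.0f;
    float dot = 0.0f, alpha;

    cublasSsymv(h, CUBLAS_FILL_MODE_LOWER, n, &tau, d_A, lda, d_v, 1, &zero, d_x, 1);
    cublasSdot(h, n, d_x, 1, d_v, 1, &dot);     /* dot = x' v, returned to the host */
    alpha = -0.5f * tau * dot;
    cublasSaxpy(h, n, &alpha, d_v, 1, d_x, 1);  /* x <- x + alpha * v  (this is w)  */
    cublasSsyr2(h, CUBLAS_FILL_MODE_LOWER, n, &minus_one, d_v, 1, d_x, 1, d_A, lda);
}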

Procedure 1d: Part of function SSYTD2_GPU for GPU in lower triangular case (lines 40–47).

40 ed2ad<<< 1,1 >>> (. . .); }


41 ad2dd_taui2taud<<< 1,1 >>>( . . .); }
42 ad2dd<<< 1,1 >>>(. . .); }
43 /* Read the result back */
44 /* d__ - the diagonal elements of the tridiagonal matrix */
45 /* e - the off-diagonal elements of the tridiagonal matrix */
46 /* tau - The scalar factors of the elementary reflectors */
47 /* Memory clean up */ }

Lines 43 to 46 (see Procedure 1d) describe the data transfer from the GPU to the CPU memory (cudaMemcpy); in turn, line 47 (see Procedure 1d) describes the release of the allocated GPU memory (cublasFree).

It can be seen in this example (Procedure 1) that only the main loop is executed on the CPU. This is because the functions in the CUBLAS library are C-wrapper functions and it is not possible to call them inside a kernel (below CUDA 5.0). The three main components of programming on graphics cards are: 1) transferring data from the CPU to the GPU memory, 2) performing the calculations, 3) downloading the result from the GPU memory (a minimal sketch of this pattern is shown below). In an attempt to minimize the transfer of data, we write the code so as to transmit the data only at the beginning and at the end of the program, as shown in Procedure 1, lines 18 and 43–46. Due to the limited memory of graphics cards (for example, the GeForce GTX 560 has only 1 GB), it is necessary to keep as little data as possible in the memory of the GPU. The presented conversion could be more sophisticated, whereby the acceleration could be higher. However, in our approach we get a satisfying acceleration without spending too much time on the code conversion. Several transformed functions in single and double precision can be found in the attached SLASfGPU library (HOUSEHOLDER_GPU, xLARFG_GPU, xPOTF2_GPU, xSYTD2_GPU, xSYTD2_GPU2 (for device), xTRTI2_GPU, xSYTRD_GPU, xLATRD_GPU).
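The three components listed above are the standard CUDA runtime pattern; the following minimal, self-contained sketch (ours, with a trivial kernel standing in for the actual SLASfGPU computation) shows them in order:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void square(float *x, int n)            /* placeholder for the real work */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= x[i];
}

int main()
{
    const int n = 1024;
    float h_x[n];
    for (int i = 0; i < n; ++i) h_x[i] = (float)i;

    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);   /* 1) CPU -> GPU */
    square<<<(n + 255) / 256, 256>>>(d_x, n);                          /* 2) compute    */
    cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);   /* 3) GPU -> CPU */
    cudaFree(d_x);

    printf("h_x[10] = %f\n", h_x[10]);   /* expected 100.0 */
    return 0;
}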
3.2 Test runs

In order to enable tests of the presented routines we have prepared two sets of input files, aggregated in the folder Test_runs. In this folder there are two subfolders, Test_float and Test_double, created in the NetBeans integrated development environment. They contain functions written in single or double precision. In the subfolder dist/Debug/CUDA-Linux-x86 there are files with input matrices, whose location is given in the main program newmain.cu. The output files are created in the subdirectories Test_float/dist/Debug/CUDA-Linux-x86 for single precision and Test_double/dist/Debug/CUDA-Linux-x86 for double precision.

All test runs have been performed on a computer equipped with an Intel Core i5-2410M 2.3 GHz processor, 4 GB of RAM and the 64-bit operating system Kubuntu 11.10 with the BLAS and LAPACK libraries installed. The graphics card used was a GeForce GT 540M with 96 CUDA cores and 1024 MB DDR3 of main memory. The graphics card driver: 295.41. NVIDIA GPU Computing Toolkit: 4.2.9. We have compiled the above code using -arch sm_21 for single and double precision floating point numbers. To measure the computation time we used the CUTIL library from the SDK for CUDA.
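The CUTIL timers ship only with the old CUDA SDK; an equivalent wall-clock measurement can be made with plain CUDA events, as in the hedged sketch below (the kernel being timed is a placeholder of ours):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy(float *x) { x[threadIdx.x] += 1.0f; }   /* placeholder workload */

int main()
{
    float *d_x;
    cudaMalloc(&d_x, 256 * sizeof(float));
    cudaMemset(d_x, 0, 256 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    busy<<<1, 256>>>(d_x);                    /* section being measured            */
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);               /* wait until the stop event is done */

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   /* elapsed time in milliseconds      */
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}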
3.3 The advantages of using the GPU

A comparison of the speedup factor (the ratio of the execution times CPU/GPU) of the algorithms involved in solving the algebraic eigenvalue problem, for the one-core CPU and for the GPU processors, is given in Figures 1, 2 and 4. In turn, a comparison of the speedup factor for many CPU cores, relative to the slowest case, involving the Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA) library and GPU processors, is given in Figure 3.

Figure 1 presents the dependence of the speedup factor of the Householder algorithm (the transformation that converts a real symmetric matrix Ĥ to symmetric tridiagonal form T) on the dimension of the square symmetric matrix [15, 16]. The algorithm for the CPU was taken from [16] and converted to C, and then converted for the GPU, in single and double precision. Calculations were performed on a single core of an Intel core i7-2600K 3.4 GHz processor with approximate computing power equal to 217.6 Gflops (for 4 cores) and on a single graphics card GeForce GTX 590 with peak single precision floating point performance equal to 1244.15 Gflops. The complexity of this algorithm is O(n^3) (cf. Figure 1).

A similar analysis, but for the orthogonal similarity transformation, is presented in Figure 2.

Figure 1: Speedup factor of the Householder algorithm vs. the dimension of a square symmetric matrix Ĥ, for single and double precision programs.

Figure 2: Speedup factor of the orthogonal similarity transformation algorithm Q^T Ĥ Q = T, called xSYTD2, vs. the dimension of a square symmetric matrix Ĥ, for single and double precision programs.

Figure 3: Speedup factors of algorithms vs. the dimension of a square symmetric matrix Ĥ, for CPU and GPU programs that convert a real symmetric matrix Ĥ to symmetric tridiagonal form T (blocked version of xSYTRD).

Figure 4: Speedup factor of data transfer times for different matrix sizes, between the Intel i7-2600K with GeForce GTX 590 and the Intel i5-2410M with GeForce GT 540M.

An optimized LAPACK subroutine has been applied in the CPU program. For the GPU, the appropriate function of the LAPACK library, parallelized and converted by us, has been used. Also here the complexity of the algorithm is O(n^3).

Comparing the Householder transformation and the LAPACK xSYTD2 algorithm, we have found that the execution time in double precision on the CPU is slightly longer than the execution time of the algorithm in single precision, while in the case of the GPU the execution time in double precision is about two times longer than in single precision. It follows that the GPU is relatively worse at handling double precision calculations than the CPU. However, our implementation of the Householder transformation based on [16], written for the CPU and GPU, is much slower than the xSYTD2 algorithm taken from the LAPACK library for the CPU and GPU; but in terms of acceleration on the GPU in comparison to the CPU execution time, our Householder algorithm is better in both single and double precision. Both single and double precision calculations for the GPU give a significant acceleration compared to the CPU, particularly for larger dimensions of the matrix (see Figure 1 and Figure 2).

In Figure 3 the speedup factors, relative to the slowest case, are shown for several algorithms which transform a symmetric matrix to symmetric tridiagonal form, derived from four different linear algebra libraries. The transformations were performed in single and in double precision on an Intel core i7-2600K processor and a single graphics card GeForce GTX 590. The LAPACK (3.3.1) library is designed for use on a single processor core. The PLASMA (2.5.0beta1) [17, 18] library allows the use of all processor cores (in Figure 3 the calculations were carried out on 4 cores without hyper-threading). The commercial implementation of the LAPACK interface for CUDA (CULA) (R14) library [19] and the open source Matrix Algebra on GPU and Multicore Architectures (MAGMA) (1.4.1-beta2) library [20, 21] are designed for graphics cards. The SSYTRD_GPU and DSYTRD_GPU functions are implemented by us. Comparing the speedup factors, one can see that even in relation to the commercial CULA library and the PLASMA library, our easy way of implementation gives better acceleration. In comparison with the newest version 1.4.1-beta2 of the professional MAGMA library our results are almost at the same level. MAGMA xSYTRD was analyzed earlier in [22].

The measurement penalties introduced by the data transfer times for different matrix sizes, characterized by the speedup factor, are presented in Figure 4. Using a high-speed bus such as PCI Express 2.0 x16 (theoretical bandwidth equal to 8 GB/s in each direction), the data transfer between the CPU and GPU (and GPU and CPU) takes hundreds of milliseconds; therefore it does not affect the execution time of the algorithm, because it is negligible compared to the total time of execution. Of course, in the case of shorter algorithms the data transfer time may play a more significant role [6].
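As an order-of-magnitude check (our estimate, not a measurement from the test runs): a single precision matrix with N = 10 000 occupies 10 000^2 × 4 B ≈ 0.4 GB, so at the theoretical 8 GB/s of PCI Express 2.0 x16 one transfer needs at least 0.4 GB / (8 GB/s) = 50 ms per direction; with realistic bus efficiency and several arrays moved each way, this is consistent with the hundreds of milliseconds quoted above and still negligible next to the execution times in Table 7.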
A comparison of the technical data of the processors and graphics cards used for matrix diagonalization is presented in Tables 1 and 2 [24–28]. Tables 3, 5 and 7 show a comparison of the execution times for the various processors and various graphics cards used in the diagonalization of different size matrices. The functions used for this purpose are listed in the Tables. Some of them were taken from the LAPACK library for the CPU, and some others have been transformed from the LAPACK library so that they can be used for graphics cards. The specified functions perform an initial diagonalization process (on the CPU or GPU), that is, the transformation of a symmetric matrix to tridiagonal form.

Table 1: Specification for the Intel processors used for the matrix diagonalization.

Model of the Intel processor: Intel core Duo | Intel core i5-2410M | Intel core i7-960 | Intel core i7-2600K
Number of cores: 2 | 2 | 4 | 4
Number of core threads: - | 4 | 8 | 8
Core freq. [GHz]: 1.86 | 2.3 | 3.2 | 3.4
Theoretical peak single precision floating point performance [Gflops]¹: 29.8 | 73.6 | 102.4 | 217.6
Theoretical peak double precision floating point performance [Gflops]¹: 14.9 | 36.8² | 51.2 | 108.8
Thermal design power [W]: 65 | 35 | 130 | 95

¹ The calculations of the theoretical peak performance for each processor were carried out on the basis of the page 4 equation from [23], extended to multi-core architectures, and then compared with the data given in [24, 25].
² Example, Intel core i5-2410M: 2.3 GHz (core freq.) × 8 (AVX double precision; 16 for single precision) × 2 (cores) = 36.8 Gflops (theoretical peak double precision floating point performance).

The entry "The entire program" informs about the duration of the whole diagonalization process on the CPU (obtaining all eigenvalues and eigenvectors). As one can see, if we substitute the CPU functions by the GPU ones, the execution time decreases significantly.

All calculations have been performed using the 64-bit operating system Kubuntu 11.10, with graphics card driver 295.41 and NVIDIA GPU Computing Toolkit 4.2.9. We have compiled the above code using -arch sm_21 for single and double precision floating point numbers. To measure the computation time we used the simple utility (CUTIL) library from the Software Development Kit (SDK) for CUDA.

For some functions (e.g. LAPACK SSYTD2 and DSYTD2) there is a big difference between the single and double precision execution times. For some others (e.g. the Householder transformation [16]) the times are nearly the same. This is because the LAPACK library is written from the point of view of performance. Each function should act with the speed of the hardware, and therefore single precision should be up to two times faster. Simple codes from handbooks on numerical methods spend most of the time waiting on the main memory, so the processor speed does not play a major role. For the GPU cards of both the older generation (GeForce 9500 GT) and the new generation CUDA architecture, code named "Fermi" (GeForce GT 540M, GeForce GTX 560, GeForce GTX 590, Tesla C2075), it is hard to deduce significant differences in the calculation time in double precision. The reason for this is that only one of the tested cards (GeForce 9500 GT) neither has the "Fermi" architecture nor supports double precision [26–28].

Tables 4, 6 and 8 present the speedup factors of the functions used for matrix diagonalization for N = 2 000, 6 000 and 10 000. The speedup factor has been calculated with respect to one core of the best processor tested, the Intel core i7-2600K, and all standard graphics cards tested, in single and double precision. A speedup factor lower than 1 indicates that there was no acceleration. Comparing the acceleration values for different matrix sizes (see Tables 4, 6 and 8) one can see a general trend: the GPU is faster than the CPU. Moreover, the accelerations for the matrix with N = 6 000 (see Table 6) for three graphics cards (GeForce GTX 560, GeForce GTX 590, Tesla C2075) are slightly larger than for the matrix with N = 10 000 (see Table 8), despite the fact that it might seem that with increasing matrix size the acceleration should increase.

The presented approach has also been applied and tested on a real physical problem, i.e. it was used for the determination of the helium atom eigenstates by solving the secular equation in the case of the Dirac-Coulomb Hamiltonian. The obtained speedup factor has not changed significantly from that shown for the test data presented here.

3.4 Accuracy of tested subroutines

The presented subroutines have been tested for the accuracy of the results obtained for a random matrix of dimension N = 1000 and N = 3000. For this purpose we have used the standard backwards error formula taken from LAPACK Working Note 41, Section 7.6.4 [29], and the results have been summarized in Table 9.
Table 2: Specification for the graphics cards used for the matrix diagonalization.

Model of the graphics card: GeForce 9500 GT | GeForce GT 540M | GeForce GTX 560 | Tesla C2075 | GeForce GTX 590 [GF110 (x2)]
Number of CUDA cores: 32 | 96 | 336 | 448 | 2 × 512
Core freq. [MHz]: 1400 | 1344 | 1800 | 1147 | 1215
Memory [MB]: 512 | 1024 | 1024 | 6144 | 2 × 1536
Theoretical peak single precision floating point performance [Gflops]¹: 134.4 | 258² | 1209.6 | 1030 | 2 × 1244
Theoretical peak double precision floating point performance [Gflops]¹: not supported | 129 | 604.8 | 515 | 2 × 622
Thermal design power [W]: 50 | 35 | 150 | 225 | 365

¹ The calculations of the theoretical peak performance for each graphics card were carried out on the basis of the page 4 formula from [23], extended to multi-core architectures, and then compared with the data given in [26–28].
² Example, GeForce GT 540M: 1.344 GHz (core freq.) × 2 (MAD instructions; 3 for the 9000 series) × 96 (cores) = 258 Gflops (theoretical peak single precision floating point performance).

Table 3: CPU and GPU matrix diagonalization execution times (in seconds), N = 2 000.

Columns: Intel core Duo 1.86 GHz | GeForce 9500 GT | Intel core i5-2410M 2.3 GHz | GeForce GT 540M | Intel core i7-960 3.2 GHz | GeForce GTX 560 | Tesla C2075 | Intel core i7-2600K 3.4 GHz | GeForce GTX 590 [GF110 (x1)]
Householder transformation, single: 384 | 36.6 | 202 | 21 | 169 | 5 | 4.6 | 124.2 | 4.3
Householder transformation, double: 401 | - | 209 | 58 | 186 | 12.3 | 5.4 | 129.5 | 7.6
The entire program, single: 422 | 79.8 | 219 | 36 | 208 | 40 | 35 | 131.6 | 14.4
The entire program, double: 457.6 | - | 233 | 74 | 217 | 42.3 | 35.2 | 144 | 19
Orthogonal similarity transformation Q^T Ĥ Q = T, single: 5.1 | 12.1 | 3.5 | 2.1 | 3.7 | 0.9 | 1 | 2.4 | 0.5
Orthogonal similarity transformation Q^T Ĥ Q = T, double: 10.3 | - | 4.5 | 3.5 | 4.2 | 1.2 | 1.3 | 3.7 | 0.8
The entire program, single: 34.4 | 42.4 | 17 | 15.2 | 22 | 18.2 | 18 | 11.7 | 9.8
The entire program, double: 72.6 | - | 28.1 | 27.5 | 27.3 | 23.3 | 22.3 | 19.2 | 16
Table 4: CPU/GPU speedup ratios for the case displayed in Table 3 (N = 2 000).

Columns (speedup): Intel core i7-2600K / GeForce GT 540M | Intel core i7-2600K / GeForce GTX 560 | Intel core i7-2600K / Tesla C2075 | Intel core i7-2600K / GeForce GTX 590 [GF110 (x1)]
Householder transformation, single: 5.91 | 24.84 | 27 | 28.88
Householder transformation, double: 2.23 | 10.53 | 23.98 | 17.04
Orthogonal similarity transformation Q^T Ĥ Q = T, single: 1.14 | 2.67 | 2.4 | 4.80
Orthogonal similarity transformation Q^T Ĥ Q = T, double: 0.86 | 2.50 | 2.3 | 4.63
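As a cross-check of Tables 3 and 4 (our arithmetic): the single precision Householder entry for the GeForce GTX 590 follows directly from the execution times in Table 3, speedup = t_CPU / t_GPU = 124.2 s / 4.3 s ≈ 28.9, in agreement with the value 28.88 listed above.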

Table 5: CPU and GPU matrix diagonalization execution times (in seconds), N = 6 000.

Columns: Intel core Duo 1.86 GHz | GeForce 9500 GT | Intel core i5-2410M 2.3 GHz | GeForce GT 540M | Intel core i7-960 3.2 GHz | GeForce GTX 560 | Tesla C2075 | Intel core i7-2600K 3.4 GHz | GeForce GTX 590 [GF110 (x1)]
Householder transformation, single: 8951 | 1100 | 4322 | 640 | 4590 | 111 | 101 | 3282 | 90.3
Householder transformation, double: 9294 | - | 4497 | 1431 | 4882 | 271.5 | 146 | 3499 | 161
The entire program, single: 9784 | 2152 | 4682 | 1038 | 5029 | 512 | 481 | 3547 | 346.8
The entire program, double: 10355 | - | 5103 | 1852 | 5290 | 657 | 649 | 3946 | 598
Orthogonal similarity transformation Q^T Ĥ Q = T, single: 135 | 315 | 93 | 72 | 100 | 12 | 13 | 67.7 | 9
Orthogonal similarity transformation Q^T Ĥ Q = T, double: 253 | - | 100 | 99 | 111 | 20.2 | 20.6 | 105.2 | 14.4
The entire program, single: 931 | 1083 | 435 | 440 | 540 | 481 | 468 | 312.6 | 213
The entire program, double: 1856 | - | 700 | 693 | 700 | 605 | 581 | 501.2 | 411.9

Table 6: CPU/GPU speedup ratios for the case displayed in Table 5 (N = 6 000).

Columns (speedup): Intel core i7-2600K / GeForce GT 540M | Intel core i7-2600K / GeForce GTX 560 | Intel core i7-2600K / Tesla C2075 | Intel core i7-2600K / GeForce GTX 590 [GF110 (x1)]
Householder transformation, single: 5.13 | 29.57 | 32.50 | 36.35
Householder transformation, double: 2.45 | 12.89 | 23.97 | 21.73
Orthogonal similarity transformation Q^T Ĥ Q = T, single: 0.94 | 5.64 | 5.21 | 7.52
Orthogonal similarity transformation Q^T Ĥ Q = T, double: 0.81 | 3.96 | 3.88 | 7.31
Table 7: CPU and GPU matrix diagonalization execution times (in seconds), N = 10 000.

Columns: Intel core Duo 1.86 GHz | GeForce 9500 GT | Intel core i5-2410M 2.3 GHz | GeForce GT 540M | Intel core i7-960 3.2 GHz | GeForce GTX 560 | Tesla C2075 | Intel core i7-2600K 3.4 GHz | GeForce GTX 590 [GF110 (x1)]
Householder transformation, single: 47498 | 5056 | 20049 | 2577 | 22383 | 585 | 474 | 15169 | 427
Householder transformation, double: 48024 | - | 20611 | 5683 | 22897 | 1228 | 670 | 15634 | 762
The entire program, single: 51361 | 8856 | 21569 | 4208 | 24451 | 2750 | 2571 | 16292 | 1622
The entire program, double: 52702 | - | 22507 | 7586 | 25067 | 3387 | 2849 | 16902 | 2032
Orthogonal similarity transformation Q^T Ĥ Q = T, single: 616 | 1455 | 419 | 242 | 473 | 53 | 55 | 309 | 39.6
Orthogonal similarity transformation Q^T Ĥ Q = T, double: 1177 | - | 569 | 405 | 802 | 93 | 90 | 471 | 62
The entire program, single: 2411 | 3018 | 1388 | 1194 | 1673 | 1164 | 1246 | 942 | 670
The entire program, double: 4547 | - | 2066 | 1915 | 3073 | 1374 | 1473 | 1357 | 958

Table 8: CPU/GPU speedup ratios for the case displayed in Table 7 (N = 10 000).

Columns (speedup): Intel core i7-2600K / GeForce GT 540M | Intel core i7-2600K / GeForce GTX 560 | Intel core i7-2600K / Tesla C2075 | Intel core i7-2600K / GeForce GTX 590 [GF110 (x1)]
Householder transformation, single: 5.89 | 25.93 | 32 | 35.52
Householder transformation, double: 2.75 | 12.73 | 23.33 | 20.52
Orthogonal similarity transformation Q^T Ĥ Q = T, single: 1.21 | 5.53 | 5.33 | 7.80
Orthogonal similarity transformation Q^T Ĥ Q = T, double: 0.87 | 3.80 | 3.92 | 7.60

Table 9: Backwards error ||Ĥ − V T V^T|| / (N ε ||Ĥ||).

Columns: N = 1 000, single | N = 1 000, double | N = 3 000, single | N = 3 000, double
xSYTD2 (LAPACK): 0.143813 | 0.118277 | 0.370533 | 0.360776
xSYTD2_GPU: 0.087666 | 0.039664 | 0.044049 | 0.028661
xSYTRD (LAPACK): 0.071262 | 0.274496 | 0.315258 | 0.120432
xSYTRD (PLASMA): 0.029757 | 0.032289 | 0.033938 | 0.017719
xSYRDB (CULA): 1.911581 | 0.737640 | 1.949956 | 0.585071
xSYTRD (MAGMA): 0.039892 | 0.069084 | 0.038517 | 0.071060
xSYTRD_GPU: 0.0464725 | 0.070825 | 0.037039 | 0.078054

For most cases, the calculated error for calculations in double precision is smaller than the calculated error for calculations in single precision, which is understandable and natural. As can be expected, one may also notice that the formula strongly depends on the size of the matrix N and on the machine epsilon ε. Another conclusion which arises from the analysis of Table 9 is that the most accurate, of course in this case, is the PLASMA library, and the least accurate is the CULA library. We also checked the differences in the results for some diagonal and off-diagonal elements from the output files. For the matrix with N = 1000 the differences appear already in the 4th significant figure when using single precision, while when using double precision the results are correct up to 12 significant figures. For the matrix with N = 3000, in single precision errors appear already in the first significant figure, but the results obtained in double precision are correct up to 9 significant figures.
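For reference (our restatement of the LAPACK Working Note 41 measure used in Table 9): the tabulated quantity is r = ||Ĥ − V T V^T|| / (N ε ||Ĥ||), where V is the accumulated orthogonal transformation, T the computed tridiagonal matrix and ε the machine epsilon (about 10^-7 in single precision and 10^-16 in double precision); values of order one therefore indicate errors at the level of a few units of round-off in the given precision.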
4 Summary and conclusions

Some of the existing libraries are covered by a license and provide the executable code only, without access to the source code (e.g. CULA, MKL). Therefore, in this paper, our main goal was to show not only the procedures but also a simple way to transfer conventional numerical algorithms from the CPU to the GPU, and to describe the problems associated with such operations. In addition, we have shown that matrix operations performed on the secular equation can be easily parallelized due to the use of a multi-core graphics card, which significantly accelerates the calculations (see Tables 3 and 4). Our procedures provide results in accordance with the reference functions from the LAPACK library and are faster. For example, the execution time of the SSYTD2 function performed on a single graphics card GeForce GTX 590 is about 16 times shorter than on one core of the Intel core Duo 1.86 GHz in single precision (see Table 7). The GPU computing processors with NVIDIA Tesla cards transform standard PCs and workstations into personal supercomputers providing computing performance at a level typical for CPU clusters. This computational power can be easily applied, among others, to physical problems. We would like to add that, in the next step, due to our specific demands, the presented approach will be extended to quadruple precision.

Acknowledgement: We would like to thank J. Matulewski and M. Zieliński for helpful advice and for making their graphics cards available for our tests.

References

[1] Sanders J., Kandrot E., CUDA by Example, NVIDIA Corporation, 2011
[2] NVIDIA CUDA C Programming Guide Version 4.2, NVIDIA Corporation, 2006–2012, 2701 San Tomas Expressway, Santa Clara, CA 95050
[3] Garland M., Grand S.L., Nickolls J., Anderson J., Hardwick J., Morton S., Phillips E., Zhang Y., Volkov V., Parallel computing experiences with CUDA, IEEE Micro, 2008, 28(4), 13–27
[4] Nickolls J., et al., Scalable Parallel Programming with CUDA, ACM Queue, 2008, 6(2), 40–53
[5] Luebke D., CUDA: Scalable parallel programming for high-performance scientific computing, NVIDIA Corp., Santa Clara, CA, in: 5th IEEE International Symposium on Biomedical Imaging: From Nano to Macro (ISBI 2008), 2008
[6] CUDA C Best Practices Guide - CUDA Toolkit Documentation, http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html
[7] Wilkinson J.H., The algebraic eigenvalue problem, Oxford, Clarendon Press, 1965
[8] Pang T., An Introduction to Computational Physics, Cambridge University Press, 1997, 124–143
[9] Stoer J., Bulirsch R., Introduction to Numerical Analysis, Springer-Verlag, New York Inc, 1980, 180–184, 323–408
[10] Piela L., Ideas of quantum chemistry, Elsevier B.V., 2007, 982–985
[11] Greenbaum A., Dongarra J.J., Experiments with QL/QR methods for the symmetric tridiagonal eigenproblem, Computer Science Dept. Technical Report CS-89-92, University of Tennessee, Knoxville, TN, 1989
[12] Anderson E., Bai Z., Bischof C., Blackford S., Demmel J., Dongarra J.J., Du Croz J., Greenbaum A., Hammarling S., McKenney A., Sorensen D., LAPACK Users' Guide, Third edition, Society for Industrial and Applied Mathematics, Philadelphia, PA, 1999, ISBN 0-89871-447-8 (paperback)
[13] Brent R., Strazdins P., Implementation of BLAS Level 3 and LINPACK Benchmark on the AP1000, Fujitsu Scientific and Technical Journal, 1993, 5(1), 61–70
[14] Aliaga J.I., Bientinesi P., Davidović D., Di Napoli E., Igual F.D., Quintana-Ortí E.S., Solving dense generalized eigenproblems on multi-threaded architectures, Applied Mathematics and Computation, 2012, 218, 11279–11289
[15] Martin R.S., Reinsch C., Wilkinson J.H., Householder's Tridiagonalization of a Symmetric Matrix, Numerische Mathematik, 1968, 11, 181–195
[16] Householder Transformation Method, math.fullerton.edu/Mathews/n2003/householdermod.html
[17] Agullo E., Demmel J., Dongarra J.J., Hadri B., Kurzak B.J., Langou J., Ltaief H., Luszczek P., Tomov S., Numerical linear algebra on emerging architectures: the PLASMA and MAGMA projects, Journal of Physics: Conference Series, 2009, 180
[18] Parallel Linear Algebra for Scalable Multi-core Architectures (PLASMA), http://icl.cs.utk.edu/plasma/
[19] Humphrey J.R., Price D.K., Spagnoli K.E., Paolini A.L., Kelmelis E.J., CULA: Hybrid GPU Accelerated Linear Algebra Routines, SPIE Defense and Security Symposium (DSS), April 2010
[20] Bosma W., Cannon J., Playoust C., The Magma algebra system. I. The user language, J. Symbolic Comput., 1997, 24, 235–265

[21] Matrix Algebra on GPU and Multicore Architectures (MAGMA), http://icl.cs.utk.edu/magma/
[22] Yamazaki I., Dong T., Solcà R., Tomov S., Dongarra J., Schulthess T., Tridiagonalization of a Dense Symmetric Matrix on Multiple GPUs and Its Application to Symmetric Eigenvalue Problems, Concurrency and Computation: Practice and Experience, 2014, 26, 2652–2666
[23] Dongarra J.J., Luszczek P., Petitet A., The LINPACK Benchmark: Past, Present, and Future, Concurrency and Computation: Practice and Experience, 2003, 15, 803–820
[24] Intel Processors, http://www.intel.com/support/processors/sb/cs-017346.htm
[25] ARK | Your Source for Intel® Product Information, http://ark.intel.com/
[26] GeForce 500 Series, http://en.wikipedia.org/wiki/GeForce_500_Series
[27] GeForce 9 Series, http://en.wikipedia.org/wiki/GeForce_9_Series
[28] Fermi Architecture White Paper – Nvidia, http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
[29] Blackford S., Dongarra J.J., LAPACK Working Note 41, UT-CS-92-151, March 1992, updated June 30, 1999 (version 3.0), http://www.netlib.org/lapack/lawnspdf/lawn41.pdf
