
Performance Optimization and Evaluation for Linear Codes

Pavel Tvrdík and Ivan Šimeček

Department of Computer Science and Engineering,
Czech Technical University, Prague
{tvrdik, xsimecek}@fel.cvut.cz

Abstract. In this paper, we develop a probabilistic model for estimating the numbers of cache misses during sparse matrix-vector multiplication (for both general and symmetric matrices) and during the Conjugate Gradient algorithm for 3 types of data caches: direct mapped, and s-way set associative with random or with LRU replacement strategies. Using HW cache monitoring tools, we compare the predicted numbers of cache misses with the real numbers on the Intel x86 architecture with L1 and L2 caches. The accuracy of our analytical model is around 96%.

1 Introduction
Sparse matrix-vector multiplication (SpM×V for short) is an important building block in algorithms solving sparse systems of linear equations, e.g., in FEM. Due to matrix sparsity, the memory access patterns are irregular and the utilization of the cache suffers from low spatial and temporal locality. An analytical model for SpM×V is developed in [2], where the dependence of the number of cache misses on data and cache parameters is studied. We designed another analytical model in [1]; here it is further extended to symmetric matrices.
The contribution of this paper is twofold. (1) We have designed source code transformations based on loop reversal and loop fusion for the Conjugate Gradient algorithm (CGA) that improve the temporal cache locality. (2) We have derived probabilistic models for estimating the numbers of cache misses for data caches of 3 types: direct mapped and s-way set associative with random and with LRU replacement strategies. We have derived these models for 3 algorithms: general and symmetric SpM×V and CGA. We have concentrated on the Intel architecture with L1 and L2 caches. Using HW cache monitoring tools, we have verified that the accuracy of our analytical model is around 96%. The errors in the estimations are due to minor simplifying assumptions in our model.

2 Terminology and Notation


A common method for solving sparse systems of linear equations appearing in
FEM is the Conjugate Gradients algorithm (CGA). Such a sparse system of


linear equations with n variables is usually represented by a sparse (n × n)-matrix A stored in the compressed row storage (CSR) format. Let nZ be the total number of nonzero elements in A. Then A is represented by 3 linear arrays A, adr, and c. Array A[1, . . . , nZ] stores the nonzero elements of A, array adr[1, . . . , n] contains the indexes of the initial nonzero elements of the rows of A, and array c[1, . . . , nZ] contains the column indexes of the nonzero elements of A. Hence, the first nonzero element of row j is stored at index adr[j] in array A. Let nzpr denote the average number of nonzero elements in matrix A per row: nzpr = nZ/n. Let wB denote the bandwidth of matrix A, i.e., the maximal difference between the column indexes of 2 nonzero elements in a row of A: li = minj{j : Ai,j ≠ 0}, ri = maxj{j : Ai,j ≠ 0}, wB = maxi{ri − li + 1}.
For a symmetric sparse matrix A, the bandwidth of the matrix (denoted by wB) is defined as the largest distance between a nonzero element and the main diagonal, i.e., ri = maxj{j : Ai,j ≠ 0}, wB = maxi{|ri − i| + 1}.
A suitable format for storing symmetric sparse matrices is the SSS (symmetric sparse skyline) format, in which only the strictly lower triangular submatrix is stored in the CSR format and the diagonal elements are stored separately in array diag[1, . . . , n].
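For concreteness, the two storage schemes can be sketched in C as follows. This is a sketch only: the paper uses 1-based arrays, whereas the declarations below are 0-based and add a conventional sentinel adr[n] = nZ that is not part of the paper's description.

    typedef struct {
        int     n;    /* order of the matrix                                */
        int     nz;   /* number of stored nonzero elements (nZ)             */
        double *A;    /* A[0..nz-1]: values of the nonzero elements         */
        int    *adr;  /* adr[0..n]: index of the first nonzero of each row;
                         adr[n] = nz is used as a sentinel                  */
        int    *c;    /* c[0..nz-1]: column indexes of the nonzero elements */
    } csr_matrix;

    typedef struct {      /* SSS format for symmetric matrices */
        csr_matrix L;     /* strictly lower triangular part stored in CSR   */
        double *diag;     /* diag[0..n-1]: diagonal elements                */
    } sss_matrix;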
The cache model we consider corresponds to the structure of L1 and L2
caches in the Intel x86 architecture. An s-way set-associative cache consists of
h sets and one set consists of s independent blocks (called lines in the Intel
terminology). Let CS denote the size of the data part of a cache in bytes and
BS denote the cache block size in bytes. Then CS = s · BS · h. Let SD denote
the size of type double and SI the size of type integer. Let ndpb denote the
number of items of type double per cache block and nipb the number of items
of type integer per cache block. Obviously, BS = ndpb · SD = nipb · SI.
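For example, for the L1 cache of the Celeron processor used in Section 4.3 (s = 4, BS = 32, h = 128), this gives CS = 4 · 32 · 128 = 16 KB, and under the usual x86 sizes SD = 8 and SI = 4 bytes (an assumption; the paper does not state them), ndpb = 32/8 = 4 and nipb = 32/4 = 8.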
We distinguish 2 types of cache misses: Compulsory misses (sometimes called
intrinsic or cold) that occur when empty cache blocks are loaded with new data
and thrashing misses (also called cross-interference, conflict or capacity misses)
that occur when useful data are loaded into a cache block, but these data are
replaced prematurely.

2.1 Sparse Matrix-Vector Multiplication


Consider a sparse matrix A represented by linear arrays A, adr, c, and diag as
defined in Section 2 and a vector x represented by dense array x[1, . . . , n]. The
goal is to compute the vector y = Ax represented by dense array y[1, . . . , n]. The
multiplication requires indirect addressing, which causes performance degrada-
tion due to the low spatial and temporal locality.
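The paper does not list the MVM_CSR and MVM_SSS pseudocode; the C sketches below show one plausible form of both loops, assuming the csr_matrix and sss_matrix layouts sketched in Section 2. The indirect access x[c[j]] in the innermost loop is the source of the irregular access pattern analyzed in Section 4.

    void mvm_csr(const csr_matrix *m, const double *x, double *y)
    {
        for (int i = 0; i < m->n; i++) {                     /* loop over rows        */
            double sum = 0.0;
            for (int j = m->adr[i]; j < m->adr[i + 1]; j++)  /* innermost loop        */
                sum += m->A[j] * x[m->c[j]];                 /* indirect access to x  */
            y[i] = sum;
        }
    }

    void mvm_sss(const sss_matrix *m, const double *x, double *y)
    {
        const csr_matrix *L = &m->L;
        for (int i = 0; i < L->n; i++)
            y[i] = m->diag[i] * x[i];                        /* diagonal contribution */
        for (int i = 0; i < L->n; i++)
            for (int j = L->adr[i]; j < L->adr[i + 1]; j++) {
                int k = L->c[j];
                y[i] += L->A[j] * x[k];                      /* lower triangular part */
                y[k] += L->A[j] * x[i];                      /* mirrored upper part: the
                                                                symmetric code also streams
                                                                through y, cf. Section 4.1 */
            }
    }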

3 Linear Code Optimizations of CG Algorithm


Consider a sparse symmetric positive definite matrix A in the compressed row
format as defined in Section 2 and an input dense array y[1, . . . , n], representing
vector y. The goal is to compute the output dense array x[1, . . . , n], representing
solution vector x of linear system Ax = y.
568 P. Tvrdı́k and I. Šimeček

A Standard CGA Implementation

Algorithm CGS(in A, adr, c, y; out x) (* CGA, Axelsson's variant without preconditioning *)
Auxiliary: double d[1, . . . , n], p[1, . . . , n], r[1, . . . , n], nom, nory, denom, α;
(1) nory = 0;
(2) for i = 1 to n do
(3) { nory += y[i] ∗ y[i]; r[i] = −y[i]; x[i] = 0.0; d[i] = y[i]; }
(4) nom = nory;
(5) while (residual vector is "large") do {
(6)   call MVM(A, adr, c, d; p);
(7)   call Dot product(d, p; denom); α = nom/denom;
(8)   for i = 1 to n do { x[i] += α ∗ d[i]; r[i] += α ∗ p[i]; }
(9)   denom = nom; call Dot product(r, r; nom); α = nom/denom;
(10)  for i = 1 to n do { d[i] = α ∗ d[i] − r[i]; }
}

This code has a serious drawback. If the data cache size is less than the total memory requirements for storing all input, output, and auxiliary arrays (A, adr, c, d, p, r, x, y), then, due to thrashing misses, part or all of these arrays are flushed out of the cache and must be reloaded during the next iteration of the while loop at codelines (5)–(10). This inefficiency can be reduced by applying loop reversal [3] and loop fusion [3].

An Improved CGA Implementation Based on Loop Restructuring

Algorithm CGM(in A, adr, c, y; out x) (* modified implementation of CGS *)
Auxiliary: double d[1, . . . , n], p[1, . . . , n], r[1, . . . , n], nom, nory, denom, α;
(1) nory = 0;
(2) for i = 1 to n do
(3) { nory += y[i] ∗ y[i]; r[i] = −y[i]; x[i] = 0.0; d[i] = y[i]; }
(4) nom = nory;
(5) while (residual vector is "large") do {
(6')  Loop fusion (MVM(A, adr, c, d; p), Dot product(d, p; denom));
(7')  α = nom/denom; nory = 0.0;
(8')  for i = 1 to n do { x[i] += α ∗ d[i]; r[i] += α ∗ p[i];
(8'')   nory += r[i] ∗ r[i]; }
(9')  denom = nom; nom = nory; α = nom/denom;
(10') for i = n downto 1 do { d[i] = α ∗ d[i] − r[i]; }
}

The CGM code has been obtained from the CGS code by applying 3 transformations:
1. Codelines (6, 7) in CGS are grouped together by loop fusion, which allows the newly computed values of array p to be reused immediately.
2. Similarly, codelines (8, 9) in CGS are grouped together by loop fusion, which allows the newly computed values of array r to be reused immediately (see codelines (8''), (9') in CGM).

3. The loop on codeline (10) in CGS is reversed by loop reversal so that the
last elements of arrays d and r that remain in the cache from the loop on
codelines (8, 9) can be reused.

The CGM code has better temporal locality of data; its quantitative analysis is performed in Section 4.2.
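A C sketch of the loop fusion on codeline (6') under the same assumptions as above: the dot product d · p is accumulated inside the row loop of the multiplication, so each newly produced p[i] is consumed while it is still cached. The paper only names the transformation; the function below is one possible realization.

    void mvm_dot_fused(const csr_matrix *m, const double *d, double *p, double *denom)
    {
        double acc = 0.0;
        for (int i = 0; i < m->n; i++) {
            double sum = 0.0;
            for (int j = m->adr[i]; j < m->adr[i + 1]; j++)
                sum += m->A[j] * d[m->c[j]];
            p[i] = sum;           /* p[i] is produced here ...                     */
            acc += d[i] * sum;    /* ... and reused immediately in the dot product */
        }
        *denom = acc;
    }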

4 Probabilistic Analysis of the Cache Behavior


4.1 Sparse Matrix-Vector Multiplication
Algorithms MVM_CSR and MVM_SSS (see Section 2.1) produce the same sequence of memory access instructions and therefore the analytical model is the same for both. It is based on the following simplifying assumptions (the same as in [1]):

1. There is no correlation among the mappings of arrays A, c, and x into cache blocks. Hence, we can view load operations as mutually independent events.
2. We consider thrashing misses only for loads from arrays A, c, and x.
3. We assume that the whole cache size is used for the data of SpM×V.
4. We assume that each execution of SpM×V starts with an empty cache.

We use the following notation.

– P(Z[i]) denotes the probability of a thrashing miss of the cache block containing element i of array Z.
– NCM denotes the number of cache misses during one execution of SpM×V.
– d denotes the average number of iterations of the innermost loop of MVM_CSR at codeline (4) between 2 reuses of the cache block containing some element of array x[1, . . . , n].

We distinguish 3 relevant types of sparse matrices to estimate the value of d. (1) A symmetric sparse matrix A with bandwidth wB and with a uniform distribution of nonzero elements in rows. Then d can be approximated by d = wB [2]. (2) A symmetric sparse banded matrix A with a similar row structure. Two rows i and i + 1 are said to be similar if row i contains nonzero elements · · · , A[i][i − ∆], A[i][i], A[i][i + ∆], · · ·, whereas row i + 1 contains nonzero elements · · · , A[i + 1][i + 1 − ∆], A[i + 1][i + 1], A[i + 1][i + 1 + ∆], · · ·, where ∆ is a constant. In other words, we assume that the indexes of the nonzero elements of row i + 1 are only "shifted" by one position with respect to row i. For A with structurally similar rows, the cache block containing an element x[i] is reused with high probability during the loading of x[i + 1] after nzpr iterations¹. Hence, d = min(wB, nzpr) = nzpr.

¹ For simplicity, we assume that all rows corresponding to the discretization of internal mesh nodes are similar. But in real applications, boundary mesh nodes produce equations with a slightly different structure. This simplification should not have a significant impact, since the number of boundary nodes is of the order of the square root of the number of all mesh nodes.

(3) A sparse banded matrix A where wB/nzpr ≈ 1. Then x[i] is reused with high probability during the loading of x[i + 1] in the next iteration, and therefore, d = 1.
Due to the assumption that each new execution of SpM×V starts with an empty cache, all nZ elements of arrays A and c and all n elements of arrays x, adr, y must be loaded into the cache once, and the number of compulsory misses is NCM^C = (nZ(SD + SI) + n(2 · SD + SI))/BS.
Since caches always have a limited size, thrashing misses occur: data loaded into a cache set may cause the replacement of other data that will be needed later. Hence, the total number of cache misses is NCM = NCM^C + NCM^T.
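As a hedged illustration, the compulsory-miss estimate can be computed directly from the formula above. The element sizes SD = 8 and SI = 4 bytes and the matrix dimensions in the example comment are assumptions, not values from the paper.

    /* Estimate of the compulsory misses NCM^C of one SpM x V execution:
       arrays A (nZ doubles), c (nZ ints), x and y (n doubles each), adr (n ints). */
    long compulsory_misses_csr(long nz, long n, int SD, int SI, int BS)
    {
        return (nz * (SD + SI) + n * (2L * SD + SI)) / BS;
    }

    /* Example: nz = 100000, n = 10000, SD = 8, SI = 4, BS = 32
       gives (1200000 + 200000) / 32 = 43750 compulsory misses. */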

Symmetric SpM×V. The same assumptions are valid for the symmetric case. We consider the SSS format (see Section 2); let nZ' denote the number of nonzero elements in the strictly lower triangular submatrix. Then the number of compulsory misses is NCM'^C = (nZ'(SD + SI) + n(3 · SD + SI))/BS and NCM' = NCM'^C + NCM'^T.

Direct Mapped Cache (s = 1). The innermost loop of the MVM_CSR algorithm has nZ iterations. In each iteration, each element of arrays A, c, and x is either reused or replaced due to thrashing. Under our assumption of the independence of these replacements for all 3 arrays, the total number of thrashing misses can be approximated by the formula NCM^T = nZ (P(A[j]) + P(c[j]) + P(x[k])); ∀j, k.
The probability that 2 randomly chosen memory blocks are mapped into distinct cache blocks is 1 − BS/CS = 1 − h^{-1}. Hence, P(c[j]) = P(A[j]) = 1 − (1 − h^{-1})^2.
Arrays A and c are accessed linearly, their indexes are incremented in each iteration of the innermost loop, and at a given moment, only one cache block is actively used unless thrashing occurs. The access pattern for array x is more complicated due to the indirect addressing. In the worst case, an element of x, after being loaded into the cache, is reused only after d iterations of the innermost loop, since it is used in each row of matrix A only once (in other words, array x actively uses d cache blocks). Every load during this time can cause cache thrashing. Hence, P(x[k]) = 1 − (1 − h^{-1})^{3d}.
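A small C sketch of the resulting direct-mapped estimate; d is supplied by the caller according to the three matrix types above, and pow comes from <math.h>. This is an illustration of the formulas, not code from the paper.

    #include <math.h>

    /* Thrashing-miss estimate NCM^T for a direct-mapped cache (s = 1). */
    double thrashing_misses_direct(long nz, long h, double d)
    {
        double q  = 1.0 - 1.0 / (double)h;   /* two random blocks map to distinct sets */
        double pA = 1.0 - pow(q, 2.0);       /* P(A[j]) = P(c[j])                      */
        double pX = 1.0 - pow(q, 3.0 * d);   /* P(x[k])                                */
        return (double)nz * (2.0 * pA + pX); /* nZ (P(A) + P(c) + P(x))                */
    }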
Symmetric SpM×V. We can make similar assumptions. Vector y is accessed using the same pattern as vector x, so P(c[j]) = P(A[j]) = 1 − (1 − h^{-1})^3, P(x[k]) = P(y[k]) = 1 − (1 − h^{-1})^{4d}, and NCM'^T = nZ' (P(A[j]) + P(c[j]) + P(x[k]) + P(y[k])); ∀j, k.

s-Way Set-Associative Cache, Random Replacement Strategy


Standard SpM×V. The probabilities can be derived as in the previous section, except that thrashing now occurs with probability 1/s. So, one of the s cache blocks containing elements of A or c can be replaced by the other loads, which are assumed independent, with probability P(A[j]) = P(c[j]) = (1/s)(1 − (1 − h^{-1})^2), and P(x[k]) = (1/s)(1 − (1 − h^{-1})^{3d}).

Symmetric SpM×V. We can make similar assumptions. Vector y is accessed with the same pattern as vector x, so P(A[j]) = P(c[j]) = (1/s)(1 − (1 − h^{-1})^3) and P(x[k]) = P(y[k]) = (1/s)(1 − (1 − h^{-1})^{4d}).
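The same sketch as above covers the random replacement case once the 1/s factor is added; again a hedged illustration of the formulas rather than the paper's code.

    #include <math.h>

    /* Thrashing-miss estimate for an s-way set-associative cache with
       random replacement (general SpM x V). */
    double thrashing_misses_random(long nz, long h, int s, double d)
    {
        double q  = 1.0 - 1.0 / (double)h;
        double pA = (1.0 - pow(q, 2.0)) / s;      /* P(A[j]) = P(c[j]) */
        double pX = (1.0 - pow(q, 3.0 * d)) / s;  /* P(x[k])           */
        return (double)nz * (2.0 * pA + pX);
    }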

s-Way Set-Associative Cache, LRU Replacement Strategy


In the LRU case, a cache block can be replaced only if at least s of the immediately preceding loads were mapped into its cache set. Hence,

P(A[j]) = P(c[j]) = 0 if s > 2,
                    h^{-2} if s = 2,                    (1)
                    1 − (1 − h^{-1})^2 if s = 1.

Arrays A and c are accessed in the linear order of indexes as before, and during d iterations they completely fill ℓ = ⌈d(SD + SI)/(s · BS)⌉ cache sets. We distinguish 2 cases of the sparsity of matrix A:
– wB/nzpr ≈ 1. This corresponds to a sparse banded matrix A of type (3) in Section 4.1, where cache sets are almost completely filled with array x. Then the memory access pattern is the same as for arrays A and c, so P(x[k]) = P(A[j]) = P(c[j]).
– wB/nzpr ≥ ndpb. This corresponds to a sparse matrix A of type (1) or (2) in Section 4.1 such that every cache block contains at most one element of array x during one execution of SpM×V. If the load operations for arrays A or c replace a cache block containing array x, one thrashing miss occurs, so P(x[k]) = 1 − (1 − h^{-1})^ℓ.
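A hedged C sketch of the LRU estimate for the general case, combining formula (1) with the two sparsity cases above. The parameters SD, SI, BS, and ndpb are passed in; matrices with wB/nzpr between 1 and ndpb are not covered by the model and are treated here like the first case, which is an assumption of this sketch.

    #include <math.h>

    /* Thrashing-miss estimate for an s-way set-associative LRU cache
       (general SpM x V), following formula (1) and the two cases above. */
    double thrashing_misses_lru(long nz, long h, int s, double d,
                                double wB, double nzpr,
                                int SD, int SI, int BS, int ndpb)
    {
        double q = 1.0 - 1.0 / (double)h;
        double pA;                                    /* formula (1)    */
        if (s > 2)       pA = 0.0;
        else if (s == 2) pA = 1.0 / ((double)h * h);  /* h^{-2}         */
        else             pA = 1.0 - pow(q, 2.0);      /* s = 1          */

        long l = (long)ceil(d * (SD + SI) / (double)(s * BS)); /* sets filled by A and c */

        double pX;
        if (wB / nzpr >= (double)ndpb)   /* types (1) and (2): at most one x element per block */
            pX = 1.0 - pow(q, (double)l);
        else                             /* wB/nzpr close to 1, type (3)                       */
            pX = pA;

        return (double)nz * (2.0 * pA + pX);
    }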

Symmetric SpM×V. Similarly,

P(A[j]) = P(c[j]) = 0 if s > 3,
                    h^{-3} if s = 3,
                    3h^{-2} if s = 2,                   (2)
                    1 − (1 − h^{-1})^3 if s = 1.

Vector y is accessed with the same pattern as vector x, so P(y[k]) = P(x[k]).

4.2 CGA
We consider the same data structures A, adr, c, x, y as in Algorithms CGS and CGM. Let us further assume the following simplifying conditions.
1. There is no correlation among the mappings of arrays into cache blocks.
2. Thrashing misses occur only within the subroutines for SpM×V. Hence, outside the SpM×V we consider only compulsory cache misses.
3. We again assume that the whole cache size is used for the input, output, and auxiliary arrays.
4. We assume that each iteration of the while loop of the CGA starts with an empty cache.
Define
NS(l) = the predicted number of cache misses that occur on codeline (l) of the CGS algorithm,
NM(l) = the predicted number of cache misses that occur on codeline (l) of the CGM algorithm.

The total number of cache misses for SpM×V on codeline (6) in CGS and on codeline (6') in CGM was evaluated in Section 4.1: NS(6) = NM(6') = NCM. The number of compulsory misses for the dot product of 2 vectors on codeline (7) in CGS is the number of cache blocks needed for 2n doubles, whereas in CGM the dot product is computed directly during the multiplication due to the loop fusion and all elements of arrays d and p are reused. Hence, NS(7) = 2n/ndpb and NM(7') = 0.
Codeline (8) contains 2 linear operations on 4 vectors and the same holds for CGM. Therefore, NS(8) = NM(8' + 8'') = 4n/ndpb. Codeline (9) contains a dot product of vector r with itself, whereas in CGM on codeline (8'') the same dot product is computed directly during those linear vector operations and all elements of array r are reused. So, NS(9) = n/ndpb and NM(9') = 0.
Codeline (10) contains linear operations on 2 vectors. For large n, we can assume that after finishing the loop on codelines (8, 9), the whole cache is filled only with elements of arrays d, p, r, and x. In CGM, the loop on codeline (10') is reversed and so, in the best case, the last CS/(4SD) elements of array r (similarly for array d) can be reused. Therefore, NS(10) = 2n/ndpb and NM(10') = 2n/ndpb − CS/(2BS).
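Summing these contributions gives a simple per-iteration comparison of the two implementations; a sketch, where n_cm_spmv stands for the SpM×V estimate NCM from Section 4.1.

    /* Predicted cache misses per while-loop iteration of CGS and CGM. */
    double misses_cgs(double n_cm_spmv, long n, int ndpb)
    {
        /* NS(6) + NS(7) + NS(8) + NS(9) + NS(10) */
        return n_cm_spmv + 2.0 * n / ndpb + 4.0 * n / ndpb
                         + 1.0 * n / ndpb + 2.0 * n / ndpb;
    }

    double misses_cgm(double n_cm_spmv, long n, int ndpb, long CS, int BS)
    {
        /* NM(6') + NM(7') + NM(8'+8'') + NM(9') + NM(10') */
        return n_cm_spmv + 0.0 + 4.0 * n / ndpb + 0.0
                         + (2.0 * n / ndpb - CS / (2.0 * BS));
    }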

4.3 Evaluation of the Probabilistic Model


Figure 1 gives performance numbers for a Pentium Celeron at 1 GHz with 256 MB of RAM, running Windows 2000 Professional, with the following cache parameters: the L1 cache is a data cache with BS = 32, CS = 16K, s = 4, h = 128, and LRU replacement strategy; the L2 cache is unified with BS = 32, CS = 128K, s = 4, h = 1024, and LRU strategy. The parameters R1 = NCM(L1)/RNCM(L1) and R2 = NCM(L2)/RNCM(L2) denote the ratios of the estimated and real numbers of misses for the L1 and L2 caches, respectively. They represent the accuracy of the probabilistic model.
Figure 1 illustrates that the real number of cache misses was predicted with an average accuracy of 97% for algorithm MVM_CSR and 95% for algorithm MVM_SSS. Figure 1 also shows that the accuracy of the probabilistic cache model is around 96% for algorithms CGS and CGM.

Fig. 1. The accuracy of the model (R1, R2 [%] versus the order of the matrix [in K]): (a) for algorithms MVM_CSR and MVM_SSS, (b) for algorithms CGS and CGM.

The accuracy of the analytical model is influenced by the following assumptions.

1. Successive loads of items of array x are assumed to be mutually independent with respect to their mapping into cache sets; this holds if the structure of matrix A is random, but it is not true in real cases.
2. In the SpM×V, we consider only arrays A, c, and x, but the algorithms MVM_CSR and MVM_SSS (see Section 2.1) also load arrays adr and y into the caches.
3. We assume that both the L1 and L2 caches are data caches. In the Intel architecture, this assumption holds only for the L1 cache, whereas the L2 cache is unified and is used also for storing instructions. This fact is not taken into account in our formulas, but the error is small due to the small code sizes. Similarly, a small part of the error is due to system task code in L2.
4. In the CGAs, we assume that every iteration is independent. This assumption is valid for CS ≤ nZ(SD + SI) + n(5 · SD + SI) (the total memory requirement for storing the arrays A, adr, c, d, p, r, x, y in the CGAs).
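For instance, for a hypothetical system with n = 10,000 and nzpr = 10 (so nZ = 100,000), and assuming SD = 8 and SI = 4 bytes, the arrays occupy nZ(SD + SI) + n(5 · SD + SI) = 1,200,000 + 440,000 bytes, i.e., about 1.6 MB, which already far exceeds the 128 KB L2 cache used in Section 4.3.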

5 Conclusions
Our analytical probabilistic model for predicting cache behavior is similar to the model in [2], but it differs in 2 aspects. We have explicitly developed and verified a model for s-way set associative caches with the LRU block replacement strategy and obtained an average accuracy of the predicted numbers of cache misses of about 97% for both SpM×V and CGA. We have derived models for both general and symmetric matrices. In contrast to [2], (1) our results indicate that the cache miss ratios for these 2 applications are sensitive to the replacement strategy used, and (2) we consider both compulsory and thrashing misses for arrays A and c.

Acknowledgements
This research has been supported by grant GA AV ČR IBS 3086102, by IGA
CTU FEE under CTU0409313, and by MŠMT under research program #J04/98:
212300014.

References
1. P. Tvrdík and I. Šimeček: Analytical modeling of sparse linear code. PPAM 12 (2003) 617–629, Czestochowa, Poland.
2. Olivier Temam and William Jalby: Characterizing the Behavior of Sparse Algorithms on Caches. Supercomputing (1992) 578–587.
3. K. R. Wadleigh and I. L. Crawford: Software Optimization for High Performance Computing. Hewlett-Packard Professional Books (2000).
