Performance Optimization and Evaluation for Linear Codes
1 Introduction
Sparse matrix-vector multiplication (SpM×V for short) is an important building
block in algorithms for solving sparse systems of linear equations, e.g., in FEM.
Due to matrix sparsity, the memory access patterns are irregular and cache
utilization suffers from low spatial and temporal locality. An analytical model for
SpM×V is developed in [2], where the dependence of the number of cache misses
on data and cache parameters is studied. We have already designed another
analytical model in [1]; here it is further extended to symmetric matrices.
The contribution of this paper is twofold. (1) We have designed source code
transformations based on loop reversal and loop fusion for the Conjugate Gra-
dient algorithm (CGA) that improve temporal cache locality. (2) We have
derived probabilistic models for estimating the numbers of cache misses
for data caches of 3 types: direct mapped, and s-way set associative with ran-
dom and with LRU replacement strategies. We have derived these models for 3
algorithms: general SpM×V, symmetric SpM×V, and CGA. We have concentrated
on the Intel architecture with L1 and L2 caches. Using HW cache monitoring tools,
we have verified that the accuracy of our analytical model is around 96%. The
errors in the estimations are due to minor simplifying assumptions in our model.
This code has a serious drawback. If the data cache size is smaller than the
total memory requirement for storing all input, output, and auxiliary arrays
(A, adr, c, d, p, r, x, y), then, due to thrashing misses, part or all of these arrays
are flushed out of the cache and must be reloaded during the next iteration of the
while loop at codelines (5)-(10). This inefficiency can be reduced by applying
loop reversal [3] and loop fusion [3].
The CGM code has been obtained from the CGS code by applying 3 transformations:
1. Codelines (6, 7) in CGS are grouped together by loop fusion, which allows
the newly computed values of array p to be reused immediately.
2. Similarly, codelines (8, 9) in CGS are grouped together by loop fusion, which
allows the newly computed values of array r to be reused immediately (see
codelines (8''), (9') in CGM).
3. The loop on codeline (10) in CGS is reversed so that the
last elements of arrays d and r that remain in the cache from the loop on
codelines (8, 9) can be reused.
The CGM code has better temporal locality of data; in Section 4.2 we
perform its quantitative analysis.
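The effect of the three transformations can be sketched as follows. This is an illustrative Python rendering, not the authors' actual CGS/CGM code; the names d, p, r and the scalars alpha, beta are hypothetical stand-ins for the CG vectors and coefficients, and the loops mimic codelines (8)-(10).

```python
def cgs_style(d, p, r, alpha, beta, n):
    """Separate loops, as in CGS: each loop streams its arrays from memory anew."""
    for i in range(n):              # codeline-(8)-style linear operation on r
        r[i] = r[i] - alpha * d[i]
    rho = 0.0
    for i in range(n):              # codeline-(9)-style dot product: r is re-read
        rho += r[i] * r[i]
    for i in range(n):              # codeline-(10)-style update of p
        p[i] = r[i] + beta * p[i]
    return rho

def cgm_style(d, p, r, alpha, beta, n):
    """Fused and reversed loops, as in CGM: values of r are reused while cached."""
    rho = 0.0
    for i in range(n):              # fusion of (8) and (9): r[i] reused immediately
        r[i] = r[i] - alpha * d[i]
        rho += r[i] * r[i]
    for i in range(n - 1, -1, -1):  # reversal of (10): last-written r[i] reused first
        p[i] = r[i] + beta * p[i]
    return rho
```

Both variants compute identical results; only the order and grouping of memory accesses, and hence the temporal locality, differ.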
– P(Z[i]) denotes the probability of a thrashing miss of the cache block con-
taining element i of array Z.
– N_CM denotes the number of cache misses during one execution of SpM×V.
– d denotes the average number of iterations of the innermost loop of MVM_CSR
at codeline (4) between 2 reuses of the cache block containing some element
of array x[1, . . . , n].
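To make the access pattern behind these definitions concrete, here is a minimal Python sketch of a CSR sparse matrix-vector multiply in the spirit of MVM_CSR (not the paper's exact code); the array names mirror the paper's A (values), adr (row pointers), c (column indices), x, and y. The innermost loop walks A and c linearly but accesses x indirectly through c, which is why the locality of x dominates the cache analysis.

```python
def mvm_csr(A, adr, c, x, n):
    """Compute y = A*x for an n-row sparse matrix stored in CSR format."""
    y = [0.0] * n
    for i in range(n):                     # one iteration per matrix row
        s = 0.0
        for j in range(adr[i], adr[i + 1]):  # innermost loop: nZ iterations in total
            s += A[j] * x[c[j]]              # A, c accessed linearly; x indirectly via c[j]
        y[i] = s
    return y
```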
¹ For simplicity, we assume that all rows corresponding to the discretization of internal
mesh nodes are similar. In real applications, boundary mesh nodes produce
equations with a slightly different structure. This simplification should not have a
significant impact, since the number of boundary nodes is on the order of the square
root of the number of all mesh nodes.
Symmetric SpM×V. The same assumptions are valid for the symmetric case.
We consider the SSS format (see Section 2); let n_Z denote the number of nonzero
elements in the strictly lower triangular submatrix. Then the number of compulsory
misses is N^C_CM = (n_Z(S_D + S_I) + n(3·S_D + S_I)) / B_S, and N_CM = N^C_CM + N^T_CM.
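A numeric sketch of this compulsory-miss count, assuming (from the notation of earlier sections) that S_D and S_I are the sizes of a double and an index in bytes and B_S is the cache block size; the default values below are illustrative, not from the paper.

```python
def compulsory_misses_sss(nZ, n, SD=8, SI=4, BS=64):
    """N^C_CM for the SSS format: each of the nZ strictly-lower nonzeros carries
    a value (S_D bytes) and a column index (S_I bytes); each of the n rows
    contributes a further 3 doubles and one index (one plausible reading of
    the 3*S_D + S_I term). Dividing by the block size gives block loads."""
    return (nZ * (SD + SI) + n * (3 * SD + SI)) / BS
```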
Direct Mapped Cache (s = 1). The innermost loop of the MVM_CSR algorithm
has n_Z iterations. In each iteration, each element of arrays A, c, and x is either
reused or replaced due to thrashing. Under our assumption of independence of
these replacements for all 3 arrays, the total number of thrashing misses can be
approximated by the formula N^T_CM = n_Z(P(A[j]) + P(c[j]) + P(x[k])), ∀j, k.
The probability that 2 randomly chosen cache blocks in the cache are
distinct is 1 − B_S/C_S = 1 − h^{-1}. Hence, P(c[j]) = P(A[j]) = 1 − (1 − h^{-1})^2.
Arrays A and c are accessed linearly, their indexes are incremented in each
iteration of the innermost loop, and in a given moment, only one cache block
is actively used unless thrashing occurs. The access pattern for array x is more
complicated due to indirect addressing. In the worst case, an element of x, after
loading into the cache, is reused only after d iterations of the innermost loop,
since it is used at each row of matrix A only once (said otherwise, array x actively
uses d cache blocks). Every load during this time can cause cache thrashing.
Hence, P(x[k]) = 1 − (1 − h^{-1})^{3d}.
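The direct-mapped thrashing model above can be evaluated numerically as in the following sketch, where h is taken (from the derivation above) to be the number of cache blocks C_S/B_S, d is the average reuse distance of an element of x, and n_Z the number of nonzeros.

```python
def thrashing_misses_csr(nZ, h, d):
    """Estimate N^T_CM for MVM_CSR on a direct-mapped cache with h blocks."""
    p_lin = 1.0 - (1.0 - 1.0 / h) ** 2        # P(A[j]) = P(c[j]): linearly accessed arrays
    p_x = 1.0 - (1.0 - 1.0 / h) ** (3 * d)    # P(x[k]): reused only after d iterations
    return nZ * (2.0 * p_lin + p_x)           # N^T_CM = nZ (P(A) + P(c) + P(x))
```

As expected, a larger cache (larger h) drives every probability, and hence the estimated thrashing miss count, toward zero.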
Symmetric SpM×V. We can make similar assumptions. Vector y is accessed with
the same pattern as vector x, so P(y[k]) = P(x[k]). Here P(c[j]) = P(A[j]) =
1 − (1 − h^{-1})^3, P(x[k]) = P(y[k]) = 1 − (1 − h^{-1})^{4d}, and
N^T_CM = n_Z(P(A[j]) + P(c[j]) + P(x[k]) + P(y[k])), ∀j, k.
4.2 CGA
We consider the same data structures A, adr, c, x, y as in Algorithms CGS and
CGM. Let us further assume the following simplifying conditions.
1. There is no correlation among the mappings of arrays into cache blocks.
2. Thrashing misses occur only within the subroutines for SpM×V; elsewhere,
we consider only compulsory cache misses.
3. We again assume that the whole cache size is used for the input, output, and
auxiliary arrays.
4. We assume that each iteration of the while loop of the CGA starts with an
empty cache.
Define
NS_l = the predicted number of cache misses that occur on codeline (l) of the
CGS algorithm,
NM_l = the predicted number of cache misses that occur on codeline (l) of the
CGM algorithm.
The total number of cache misses for SpM×V on codeline (6) in CGS and on
codeline (6') in CGM was evaluated in Section 4.1: NS_6 = NM_6' = N_CM. The
number of compulsory misses for the dot product of 2 vectors on codeline (7) in
CGS is the number of cache blocks needed for 2n doubles, whereas in CGM the dot
product is computed directly during the multiplication due to the loop fusion and
all elements of arrays d and p are reused. Hence, NS_7 = 2n/DPB (where DPB
denotes the number of doubles per cache block) and NM_7' = 0.
Codeline (8) contains 2 linear operations on 4 vectors and the same holds
for CGM; therefore, NS_8 = NM_(8'+8'') = 4n/DPB. Codeline (9) contains a dot
product of vector r, whereas in CGM, on codeline (8''), the same dot product
is computed directly during those linear vector operations and all elements of
array r are reused. So, NS_9 = n/DPB and NM_9 = 0.
Codeline (10) contains linear operations on 2 vectors. For large n, we can
assume that after finishing the loop on codeline (9), the whole cache is filled
only with elements of arrays d, p, r, and x. In CGM, the loop on codeline (10') is
reversed and so, in the best case, the last C_S/(4S_D) elements of array r (similarly for
array d) can be reused. Therefore, NS_10 = 2n/DPB and NM_10' = 2n/DPB − C_S/(2B_S).
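Summing the per-codeline estimates gives the total per-iteration miss counts for CGS and CGM. The sketch below assumes (from the notation above) that DPB = B_S/S_D is the number of doubles per cache block, and takes the SpM×V miss count N_CM from Section 4.1 as a parameter.

```python
def cgs_iteration_misses(n, DPB, N_CM):
    """Predicted misses for one CGS while-loop iteration, codelines (6)-(10):
    SpMxV, dot product, 2 linear ops, dot product, linear op."""
    return N_CM + 2 * n / DPB + 4 * n / DPB + n / DPB + 2 * n / DPB

def cgm_iteration_misses(n, DPB, CS, BS, N_CM):
    """Predicted misses for one CGM iteration: loop fusion removes the separate
    passes for codelines (7) and (9); loop reversal of (10') saves CS/(2*BS)."""
    return N_CM + 4 * n / DPB + (2 * n / DPB - CS / (2 * BS))
```

The difference, 3n/DPB + C_S/(2B_S) misses per iteration, is the predicted benefit of the fusion and reversal transformations.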
[Figure omitted: two plots of model accuracy R1, R2 [%] versus order of matrix [in K].]
Fig. 1. The accuracy of the model (a) for algorithms MVM_CSR and MVM_SSS, (b) for
algorithms CGS and CGM
is around 96% for algorithms CGS and CGM. The accuracy of the analytical model
is influenced by the assumptions listed above.
5 Conclusions
Our analytical probabilistic model for predicting cache behavior is similar to
the model in [2], but it differs in 2 aspects. We have explicitly developed and
verified a model for s-way set associative caches with the LRU block replacement
strategy and obtained an average accuracy of about 97% in the predicted numbers
of cache misses for both SpM×V and CGA. We have derived models for
both general and symmetric matrices. In contrast to [2], (1) our results indicate
that cache miss ratios for these 2 applications are sensitive to the replacement
strategy used, and (2) we consider both compulsory and thrashing misses for arrays
A and c.
Acknowledgements
This research has been supported by grant GA AV ČR IBS 3086102, by IGA
CTU FEE under CTU0409313, and by MŠMT under research program #J04/98:
212300014.
References
1. Tvrdík, P., Šimeček, I.: Analytical Modeling of Sparse Linear Code. PPAM (2003)
617–629, Czestochowa, Poland
2. Temam, O., Jalby, W.: Characterizing the Behavior of Sparse Algorithms on
Caches. Supercomputing (1992) 578–587
3. Wadleigh, K.R., Crawford, I.L.: Software Optimization for High Performance
Computing. Hewlett-Packard Professional Books (2000)