Cache-Oblivious Algorithms

Extended Abstract

Matteo Frigo Charles E. Leiserson Harald Prokop Sridhar Ramachandran


MIT Laboratory for Computer Science, 545 Technology Square, Cambridge, MA 02139
{athena,cel,prokop,sridhar}@supertech.lcs.mit.edu

(This research was supported in part by the Defense Advanced Research Projects Agency (DARPA) under Grant F30602-97-1-0270. Matteo Frigo was supported in part by a Digital Equipment Corporation fellowship.)

Abstract

This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size Z and cache-line length L, where Z = Ω(L^2), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/L). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/L)(1 + log_Z n)). We also give a Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix that incurs Θ(1 + (mn + np + mp)/L + mnp/(L√Z)) cache faults.

We introduce an "ideal-cache" model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels, and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We also provide preliminary empirical results on the effectiveness of cache-oblivious algorithms in practice.

1. Introduction

Resource-oblivious algorithms that nevertheless use resources efficiently offer advantages of simplicity and portability over resource-aware algorithms whose resource usage must be programmed explicitly. In this paper, we study cache resources, specifically, the hierarchy of memories in modern computers. We exhibit several "cache-oblivious" algorithms that use cache as effectively as "cache-aware" algorithms.

Before discussing the notion of cache obliviousness, we first introduce the (Z, L) ideal-cache model to study the cache complexity of algorithms. This model, which is illustrated in Figure 1, consists of a computer with a two-level memory hierarchy consisting of an ideal (data) cache of Z words and an arbitrarily large main memory. Because the actual size of words in a computer is typically a small, fixed size (4 bytes, 8 bytes, etc.), we shall assume that word size is constant; the particular constant does not affect our asymptotic analyses.

[Figure 1: The ideal-cache model. A CPU performing W work is served by a cache of Z words, organized by an optimal replacement strategy into Z/L cache lines of length L, in front of an arbitrarily large main memory; Q counts the cache misses.]

The cache is partitioned into cache lines, each consisting of L consecutive words which are always moved together between cache and main memory. Cache designers typically use L > 1, banking on spatial locality to amortize the overhead of moving the cache line. We shall generally assume in this paper that the cache is tall:

    Z = Ω(L^2) ,                                                  (1)

which is usually true in practice.

The processor can only reference words that reside in the cache. If the referenced word belongs to a line already in cache, a cache hit occurs, and the word is delivered to the processor. Otherwise, a cache miss occurs, and the line is fetched into the cache. The ideal cache is fully associative [20, Ch. 5]: cache lines can be stored anywhere in the cache. If the cache is full, a cache line must be evicted. The ideal cache uses the optimal off-line strategy of replacing the cache line whose next access is furthest in the future [7], and thus it exploits temporal locality perfectly.

Unlike various other hierarchical-memory models [1, 2, 5, 8] in which algorithms are analyzed in terms of a single measure, the ideal-cache model uses two measures. An algorithm with an input of size n is measured by its work complexity W(n)—its conventional running time in a RAM model [4]—and its cache complexity Q(n; Z, L)—the number of cache misses it incurs as a
function of the size Z and line length L of the ideal cache. When Z and L are clear from context, we denote the cache complexity simply as Q(n) to ease notation.

We define an algorithm to be cache aware if it contains parameters (set at either compile-time or runtime) that can be tuned to optimize the cache complexity for the particular cache size and line length. Otherwise, the algorithm is cache oblivious. Historically, good performance has been obtained using cache-aware algorithms, but we shall exhibit several optimal¹ cache-oblivious algorithms.

(¹For simplicity in this paper, we use the term "optimal" as a synonym for "asymptotically optimal," since all our analyses are asymptotic.)

To illustrate the notion of cache awareness, consider the problem of multiplying two n × n matrices A and B to produce their n × n product C. We assume that the three matrices are stored in row-major order, as shown in Figure 2(a). We further assume that n is "big," i.e., n > L, in order to simplify the analysis. The conventional way to multiply matrices on a computer with caches is to use a blocked algorithm [19, p. 45]. The idea is to view each matrix M as consisting of (n/s) × (n/s) submatrices M_ij (the blocks), each of which has size s × s, where s is a tuning parameter. The following algorithm implements this strategy:

BLOCK-MULT(A, B, C, n)
1  for i ← 1 to n/s
2    do for j ← 1 to n/s
3         do for k ← 1 to n/s
4              do ORD-MULT(A_ik, B_kj, C_ij, s)

The ORD-MULT(A, B, C, s) subroutine computes C ← C + AB on s × s matrices using the ordinary O(s^3) algorithm. (This algorithm assumes for simplicity that s evenly divides n, but in practice s and n need have no special relationship, yielding more complicated code in the same spirit.)
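For concreteness, here is a minimal C sketch of this blocked strategy (our illustration, not code from the paper), assuming row-major matrices of doubles, that s divides n, and that C has been zero-initialized by the caller:

    /* ORD-MULT: C <- C + A*B on s x s blocks embedded in n x n
       row-major matrices; n is the row stride of all three blocks. */
    static void ord_mult(const double *A, const double *B, double *C,
                         int n, int s)
    {
        for (int i = 0; i < s; i++)
            for (int k = 0; k < s; k++)
                for (int j = 0; j < s; j++)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];
    }

    /* BLOCK-MULT: iterate over the (n/s) x (n/s) grid of blocks. */
    void block_mult(const double *A, const double *B, double *C,
                    int n, int s)
    {
        for (int i = 0; i < n; i += s)
            for (int j = 0; j < n; j += s)
                for (int k = 0; k < n; k += s)
                    ord_mult(&A[i*n + k], &B[k*n + j], &C[i*n + j], n, s);
    }

As the next paragraph explains, the tuning parameter s would be chosen so that three s × s blocks fit in cache simultaneously, roughly s = Θ(√Z).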
Depending on the cache size of the machine on which BLOCK-MULT is run, the parameter s can be tuned to make the algorithm run fast, and thus BLOCK-MULT is a cache-aware algorithm. To minimize the cache complexity, we choose s to be the largest value such that the three s × s submatrices simultaneously fit in cache. An s × s submatrix is stored on Θ(s + s^2/L) cache lines. From the tall-cache assumption (1), we can see that s = Θ(√Z). Thus, each of the calls to ORD-MULT runs with at most Z/L = Θ(s^2/L) cache misses needed to bring the three matrices into the cache. Consequently, the cache complexity of the entire algorithm is Θ(1 + n^2/L + (n/√Z)^3 (Z/L)) = Θ(1 + n^2/L + n^3/(L√Z)), since the algorithm has to read n^2 elements, which reside on ⌈n^2/L⌉ cache lines.

The same bound can be achieved using a simple cache-oblivious algorithm that requires no tuning parameters such as the s in BLOCK-MULT. We present such an algorithm, which works on general rectangular matrices, in Section 2. The problems of computing a matrix transpose and of performing an FFT also succumb to remarkably simple algorithms, which are described in Section 3. Cache-oblivious sorting poses a more formidable challenge. In Sections 4 and 5, we present two sorting algorithms, one based on mergesort and the other on distribution sort, both of which are optimal in both work and cache misses.

The ideal-cache model makes the perhaps-questionable assumptions that there are only two levels in the memory hierarchy, that memory is managed automatically by an optimal cache-replacement strategy, and that the cache is fully associative. We address these assumptions in Section 6, showing that to a certain extent, these assumptions entail no loss of generality. Section 7 discusses related work, and Section 8 offers some concluding remarks, including some preliminary empirical results.

2. Matrix multiplication

This section describes and analyzes a cache-oblivious algorithm for multiplying an m × n matrix by an n × p matrix using Θ(mnp) work and incurring Θ(m + n + p + (mn + np + mp)/L + mnp/(L√Z)) cache misses. These results require the tall-cache assumption (1) for matrices stored in row-major layout format, but the assumption can be relaxed for certain other layouts. We also show that Strassen's algorithm [31] for multiplying n × n matrices, which uses Θ(n^{lg 7}) work², incurs Θ(1 + n^2/L + n^{lg 7}/(L√Z)) cache misses.

(²We use the notation lg to denote log_2.)

In [9] with others, two of the present authors analyzed an optimal divide-and-conquer algorithm for n × n matrix multiplication that contained no tuning parameters, but we did not study cache-obliviousness per se. That algorithm can be extended to multiply rectangular matrices. To multiply an m × n matrix A and an n × p matrix B, the REC-MULT algorithm halves the largest of the three dimensions and recurs according to one of the following three cases (brackets denote block matrices, with ";" separating row blocks):

    [A1; A2] B = [A1 B; A2 B] ,                                   (2)

    [A1 A2] [B1; B2] = A1 B1 + A2 B2 ,                            (3)

    A [B1 B2] = [A B1  A B2] .                                    (4)

In case (2), we have m ≥ max{n, p}. Matrix A is split horizontally, and both halves are multiplied by matrix B. In case (3), we have n ≥ max{m, p}. Both matrices are
split, and the two halves are multiplied. In case (4), we have p ≥ max{m, n}. Matrix B is split vertically, and each half is multiplied by A. For square matrices, these three cases together are equivalent to the recursive multiplication algorithm described in [9]. The base case occurs when m = n = p = 1, in which case the two elements are multiplied and added into the result matrix.
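The recursion translates directly into code. The following C sketch is our illustration (the paper gives no implementation); it identifies each row-major submatrix by a pointer and a row stride, and computes C ← C + AB:

    /* REC-MULT sketch: lda, ldb, ldc are row strides of the full
       matrices in which the m x n, n x p, and m x p operands live. */
    void rec_mult(int m, int n, int p,
                  const double *A, int lda,
                  const double *B, int ldb,
                  double *C, int ldc)
    {
        if (m == 1 && n == 1 && p == 1) {
            C[0] += A[0] * B[0];           /* base case: scalar multiply-add */
        } else if (m >= n && m >= p) {     /* case (2): split A horizontally */
            rec_mult(m/2, n, p, A, lda, B, ldb, C, ldc);
            rec_mult(m - m/2, n, p, A + (m/2)*lda, lda, B, ldb,
                     C + (m/2)*ldc, ldc);
        } else if (n >= m && n >= p) {     /* case (3): split both matrices */
            rec_mult(m, n/2, p, A, lda, B, ldb, C, ldc);
            rec_mult(m, n - n/2, p, A + n/2, lda,
                     B + (n/2)*ldb, ldb, C, ldc);
        } else {                           /* case (4): split B vertically */
            rec_mult(m, n, p/2, A, lda, B, ldb, C, ldc);
            rec_mult(m, n, p - p/2, A, lda, B + p/2, ldb, C + p/2, ldc);
        }
    }

Because only the largest dimension is halved, the subproblems stay roughly balanced, which is what eventually lets a subproblem fit entirely in cache.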
Although this straightforward divide-and-conquer algorithm contains no tuning parameters, it uses cache optimally. To analyze the REC-MULT algorithm, we assume that the three matrices are stored in row-major order, as shown in Figure 2(a). Intuitively, REC-MULT uses the cache effectively, because once a subproblem fits into the cache, its smaller subproblems can be solved in cache with no further cache misses.

[Figure 2: Layout of a 16 × 16 matrix in (a) row-major, (b) column-major, (c) 4 × 4-blocked, and (d) bit-interleaved layouts.]

Theorem 1. The REC-MULT algorithm uses Θ(mnp) work and incurs Θ(m + n + p + (mn + np + mp)/L + mnp/(L√Z)) cache misses when multiplying an m × n matrix by an n × p matrix.

Proof. It can be shown by induction that the work of REC-MULT is Θ(mnp). To analyze the cache misses, let α > 0 be the largest constant sufficiently small that three submatrices of sizes m′ × n′, n′ × p′, and m′ × p′, where max{m′, n′, p′} ≤ α√Z, all fit completely in the cache. We distinguish four cases depending on the initial size of the matrices.

Case I: m, n, p > α√Z. This case is the most intuitive. The matrices do not fit in cache, since all dimensions are "big enough." The cache complexity can be described by the recurrence

    Q(m, n, p) ≤                                                  (5)
        Θ((mn + np + mp)/L)     if m, n, p ∈ [α√Z/2, α√Z] ,
        2Q(m/2, n, p) + O(1)    otherwise, if m ≥ n and m ≥ p ,
        2Q(m, n/2, p) + O(1)    otherwise, if n > m and n ≥ p ,
        2Q(m, n, p/2) + O(1)    otherwise .

The base case arises as soon as all three submatrices fit in cache. The total number of lines used by the three submatrices is Θ((mn + np + mp)/L). The only cache misses that occur during the remainder of the recursion are the Θ((mn + np + mp)/L) cache misses required to bring the matrices into cache. In the recursive cases, when the matrices do not fit in cache, we pay for the cache misses of the recursive calls, which depend on the dimensions of the matrices, plus O(1) cache misses for the overhead of manipulating submatrices. The solution to this recurrence is Q(m, n, p) = Θ(mnp/(L√Z)).

Case II: (m ≤ α√Z and n, p > α√Z) or (n ≤ α√Z and m, p > α√Z) or (p ≤ α√Z and m, n > α√Z). Here, we shall present the case where m ≤ α√Z and n, p > α√Z. The proofs for the other cases are only small variations of this proof. The REC-MULT algorithm always divides n or p by 2 according to cases (3) and (4). At some point in the recursion, both are small enough that the whole problem fits into cache. The number of cache misses can be described by the recurrence

    Q(m, n, p) ≤                                                  (6)
        Θ(1 + n + np/L + m)     if n, p ∈ [α√Z/2, α√Z] ,
        2Q(m, n/2, p) + O(1)    otherwise, if n ≥ p ,
        2Q(m, n, p/2) + O(1)    otherwise ,

whose solution is Q(m, n, p) = Θ(np/L + mnp/(L√Z)).

Case III: (n, p ≤ α√Z and m > α√Z) or (m, p ≤ α√Z and n > α√Z) or (m, n ≤ α√Z and p > α√Z). In each of these cases, one of the matrices fits into cache, and the others do not. Here, we shall present the case where n, p ≤ α√Z and m > α√Z. The other cases can be proven similarly. The REC-MULT algorithm always divides m by 2 according to case (2). At some point in the recursion, m falls into the range α√Z/2 ≤ m ≤ α√Z, and the whole problem fits in cache. The number of cache misses can be described by the recurrence

    Q(m, n, p) ≤                                                  (7)
        Θ(1 + m)                if m ∈ [α√Z/2, α√Z] ,
        2Q(m/2, n, p) + O(1)    otherwise ,

whose solution is Q(m, n, p) = Θ(m + mnp/(L√Z)).

Case IV: m, n, p ≤ α√Z. From the choice of α, all three matrices fit into cache. The matrices are stored on Θ(1 + mn/L + np/L + mp/L) cache lines. Therefore, we have Q(m, n, p) = Θ(1 + (mn + np + mp)/L).

We require the tall-cache assumption (1) in these analyses, because the matrices are stored in row-major order. Tall caches are also needed if matrices are stored
in column-major order (Figure 2(b)), but the assumption that Z = Ω(L^2) can be relaxed for certain other matrix layouts. The s × s-blocked layout (Figure 2(c)), for some tuning parameter s, can be used to achieve the same bounds with the weaker assumption that the cache holds at least some sufficiently large constant number of lines. The cache-oblivious bit-interleaved layout (Figure 2(d)) has the same advantage as the blocked layout, but no tuning parameter need be set, since submatrices of size O(√L) × O(√L) are cache-obliviously stored on O(1) cache lines. The advantages of bit-interleaved and related layouts have been studied in [11, 12, 16]. One of the practical disadvantages of bit-interleaved layouts is that index calculations on conventional microprocessors can be costly, a deficiency we hope that processor architects will remedy.

For square matrices, the cache complexity Q(n) = Θ(n + n^2/L + n^3/(L√Z)) of the REC-MULT algorithm is the same as the cache complexity of the cache-aware BLOCK-MULT algorithm and also matches the lower bound by Hong and Kung [21]. This lower bound holds for all algorithms that execute the Θ(n^3) operations given by the definition of matrix multiplication

    c_ij = Σ_{k=1}^{n} a_ik b_kj .

No tight lower bounds for the general problem of matrix multiplication are known.

By using an asymptotically faster algorithm, such as Strassen's algorithm [31] or one of its variants [37], both the work and cache complexity can be reduced. When multiplying n × n matrices, Strassen's algorithm, which is cache oblivious, requires only 7 recursive multiplications of n/2 × n/2 matrices and a constant number of matrix additions, yielding the recurrence

    Q(n) ≤                                                        (8)
        Θ(1 + n + n^2/L)      if n^2 ≤ αZ ,
        7Q(n/2) + O(n^2/L)    otherwise ,

where α is a sufficiently small constant. The solution to this recurrence is Θ(n + n^2/L + n^{lg 7}/(L√Z)).

3. Matrix transposition and FFT

This section describes a recursive cache-oblivious algorithm for transposing an m × n matrix which uses O(mn) work and incurs O(1 + mn/L) cache misses, which is optimal. Using matrix transposition as a subroutine, we convert a variant [36] of the "six-step" fast Fourier transform (FFT) algorithm [6] into an optimal cache-oblivious algorithm. This FFT algorithm uses O(n lg n) work and incurs O(1 + (n/L)(1 + log_Z n)) cache misses.

The problem of matrix transposition is defined as follows. Given an m × n matrix stored in a row-major layout, compute and store A^T into an n × m matrix B also stored in a row-major layout. The straightforward algorithm for transposition that employs doubly nested loops incurs Θ(mn) cache misses on one of the matrices when m ≥ Z/L and n ≥ Z/L, which is suboptimal.

Optimal work and cache complexities can be obtained with a divide-and-conquer strategy, however. If n ≥ m, the REC-TRANSPOSE algorithm partitions

    A = [A1 A2] ,   B = [B1; B2]

and recursively executes REC-TRANSPOSE(A1, B1) and REC-TRANSPOSE(A2, B2). Otherwise, it divides matrix A horizontally and matrix B vertically and likewise performs two transpositions recursively. The next two lemmas provide upper and lower bounds on the performance of this algorithm.
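In code, the recursion looks as follows (a C sketch of ours, not the authors' implementation; Section 8 notes that a practical version coarsens the base case):

    /* REC-TRANSPOSE sketch: writes B = A^T, where A is m x n and B is
       n x m, both row-major, with row strides lda and ldb. */
    void rec_transpose(int m, int n,
                       const double *A, int lda,
                       double *B, int ldb)
    {
        if (m == 1 && n == 1) {
            B[0] = A[0];
        } else if (n >= m) {             /* split the columns of A,
                                            i.e., the rows of B */
            rec_transpose(m, n/2, A, lda, B, ldb);
            rec_transpose(m, n - n/2, A + n/2, lda,
                          B + (n/2)*ldb, ldb);
        } else {                         /* split the rows of A,
                                            i.e., the columns of B */
            rec_transpose(m/2, n, A, lda, B, ldb);
            rec_transpose(m - m/2, n, A + (m/2)*lda, lda,
                          B + m/2, ldb);
        }
    }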
Lemma 2. The REC-TRANSPOSE algorithm involves O(mn) work and incurs O(1 + mn/L) cache misses for an m × n matrix.

Proof. That the algorithm does O(mn) work is straightforward. For the cache analysis, let Q(m, n) be the cache complexity of transposing an m × n matrix. We assume that the matrices are stored in row-major order, the column-major layout having a similar analysis.

Let α be a constant sufficiently small such that two submatrices of size m × n and n × m, where max{m, n} ≤ αL, fit completely in the cache even if each row is stored in a different cache line. We distinguish three cases.

Case I: max{m, n} ≤ αL. Both matrices fit in O(1) + 2mn/L lines. From the choice of α, the number of lines required is at most Z/L. Therefore Q(m, n) = O(1 + mn/L).

Case II: m ≤ αL < n or n ≤ αL < m. Suppose first that m ≤ αL < n. The REC-TRANSPOSE algorithm divides the greater dimension n by 2 and performs divide and conquer. At some point in the recursion, n falls into the range αL/2 ≤ n ≤ αL, and the whole problem fits in cache. Because the layout is row-major, at this point the input array has n rows and m columns, and it is laid out in contiguous locations, requiring at most O(1 + nm/L) cache misses to be read. The output array consists of nm elements in m rows, where in the worst case every row lies on a different cache line. Consequently, we incur at most O(m + nm/L) cache misses for writing the output array. Since n ≥ αL/2, the total cache complexity for this base case is O(1 + m). These observations yield the recurrence

    Q(m, n) ≤
        O(1 + m)              if n ∈ [αL/2, αL] ,
        2Q(m, n/2) + O(1)     otherwise ,

whose solution is Q(m, n) = O(1 + mn/L). The case n ≤ αL < m is analogous.

Case III: m, n > αL. As in Case II, at some point in the recursion both n and m fall into the range [αL/2, αL]. The whole problem fits into cache and can be solved with at most O(m + n + mn/L) cache misses. The cache complexity thus satisfies the recurrence

    Q(m, n) ≤
        O(m + n + mn/L)       if m, n ∈ [αL/2, αL] ,
        2Q(m/2, n) + O(1)     if m ≥ n ,
        2Q(m, n/2) + O(1)     otherwise ,

whose solution is Q(m, n) = O(1 + mn/L).

Theorem 3. The REC-TRANSPOSE algorithm has optimal cache complexity.

Proof. For an m × n matrix, the algorithm must write to mn distinct elements, which occupy at least ⌈mn/L⌉ = Ω(1 + mn/L) cache lines.

As an example of an application of this cache-oblivious transposition algorithm, in the rest of this section we describe and analyze a cache-oblivious algorithm for computing the discrete Fourier transform of a complex array of n elements, where n is an exact power of 2. The basic algorithm is the well-known "six-step" variant [6, 36] of the Cooley-Tukey FFT algorithm [13]. Using the cache-oblivious transposition algorithm, however, the FFT becomes cache-oblivious, and its performance matches the lower bound by Hong and Kung [21].

Recall that the discrete Fourier transform (DFT) of an array X of n complex numbers is the array Y given by

    Y[i] = Σ_{j=0}^{n−1} X[j] ω_n^{−ij} ,                         (9)

where ω_n = e^{2π√(−1)/n} is a primitive nth root of unity, and 0 ≤ i < n. Many algorithms evaluate Equation (9) in O(n lg n) time for all integers n [15]. In this paper, however, we assume that n is an exact power of 2, and we compute Equation (9) according to the Cooley-Tukey algorithm, which works recursively as follows. In the base case where n = O(1), we compute Equation (9) directly. Otherwise, for any factorization n = n1 n2 of n, we have

    Y[i1 + i2 n1] =
        Σ_{j2=0}^{n2−1} [ ( Σ_{j1=0}^{n1−1} X[j1 n2 + j2] ω_{n1}^{−i1 j1} ) ω_n^{−i1 j2} ] ω_{n2}^{−i2 j2} .      (10)

Observe that both the inner and outer summations in Equation (10) are DFT's. Operationally, the computation specified by Equation (10) can be performed by computing n2 transforms of size n1 (the inner sum), multiplying the result by the factors ω_n^{−i1 j2} (called the twiddle factors [15]), and finally computing n1 transforms of size n2 (the outer sum).

We choose n1 to be 2^{⌈lg n/2⌉} and n2 to be 2^{⌊lg n/2⌋}. The recursive step then operates as follows:

1. Pretend that the input is a row-major n1 × n2 matrix A. Transpose A in place, i.e., use the cache-oblivious REC-TRANSPOSE algorithm to transpose A onto an auxiliary array B, and copy B back onto A. Notice that if n1 = 2n2, we can consider the matrix to be made up of records containing two elements.
2. At this stage, the inner sum corresponds to a DFT of the n2 rows of the transposed matrix. Compute these n2 DFT's of size n1 recursively. Observe that, because of the previous transposition, we are transforming a contiguous array of elements.
3. Multiply A by the twiddle factors, which can be computed on the fly with no extra cache misses.
4. Transpose A in place, so that the inputs to the next stage are arranged in contiguous locations.
5. Compute n1 DFT's of the rows of the matrix recursively.
6. Transpose A in place so as to produce the correct output order.
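For concreteness, the six steps assemble into the following self-contained C sketch (ours, not the authors' code). It uses a naive doubly nested transpose where the paper uses REC-TRANSPOSE, and it recomputes each twiddle factor with cexp; n must be a power of 2:

    #include <complex.h>
    #include <math.h>
    #include <stdlib.h>
    #include <string.h>

    /* Transpose the rows x cols row-major matrix a into b (cols x rows).
       The paper would use the cache-oblivious REC-TRANSPOSE here. */
    static void naive_transpose(const double complex *a, double complex *b,
                                int rows, int cols)
    {
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                b[j*rows + i] = a[i*cols + j];
    }

    void fft_six_step(double complex *x, int n)
    {
        if (n == 1) return;
        if (n == 2) {                        /* base case: one butterfly */
            double complex t = x[0];
            x[0] = t + x[1];
            x[1] = t - x[1];
            return;
        }
        int lg = 0;
        while ((1 << lg) < n) lg++;
        int n1 = 1 << ((lg + 1) / 2);        /* n1 = 2^ceil(lg n / 2) */
        int n2 = 1 << (lg / 2);              /* n2 = 2^floor(lg n / 2) */
        double pi = acos(-1.0);
        double complex *t = malloc((size_t)n * sizeof *t);

        naive_transpose(x, t, n1, n2);       /* step 1: x viewed as n1 x n2 */
        for (int j2 = 0; j2 < n2; j2++)      /* step 2: n2 DFTs of size n1 */
            fft_six_step(t + j2*n1, n1);
        for (int j2 = 0; j2 < n2; j2++)      /* step 3: twiddle factors */
            for (int i1 = 0; i1 < n1; i1++)
                t[j2*n1 + i1] *= cexp(-2.0 * pi * I * i1 * j2 / n);
        naive_transpose(t, x, n2, n1);       /* step 4: back to n1 x n2 */
        for (int i1 = 0; i1 < n1; i1++)      /* step 5: n1 DFTs of size n2 */
            fft_six_step(x + i1*n2, n2);
        naive_transpose(x, t, n1, n2);       /* step 6: correct output order */
        memcpy(x, t, (size_t)n * sizeof *t);
        free(t);
    }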
It can be proven by induction that the work complexity of this FFT algorithm is O(n lg n). We now analyze its cache complexity. The algorithm always operates on contiguous data, by construction. Thus, by the tall-cache assumption (1), the transposition operations and the twiddle-factor multiplication require at most O(1 + n/L) cache misses. Thus, the cache complexity satisfies the recurrence

    Q(n) ≤                                                        (11)
        O(1 + n/L)                           if n ≤ αZ ,
        n1 Q(n2) + n2 Q(n1) + O(1 + n/L)     otherwise ,

where α > 0 is a constant sufficiently small that a subproblem of size αZ fits in cache. This recurrence has solution

    Q(n) = O(1 + (n/L)(1 + log_Z n)) ,

which is optimal for a Cooley-Tukey algorithm, matching the lower bound by Hong and Kung [21] when n is an exact power of 2. As with matrix multiplication, no tight lower bounds for cache complexity are known for the general DFT problem.
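As a quick sanity check of this solution (our own unrolling, under the simplifying assumption n1 = n2 = √n), the recurrence becomes Q(n) ≤ 2√n Q(√n) + O(n/L). At depth i of the recursion tree the subproblems have size n^{1/2^i}, and their costs sum to 2^i · O(n/L) per level. The recursion bottoms out at the depth i* where n^{1/2^{i*}} falls to αZ, i.e., 2^{i*} = Θ(lg n / lg Z), and the base cases contribute at the same order. Summing the geometrically growing levels gives

    Q(n) = O(1 + (n/L) · 2^{i*}) = O(1 + (n/L)(1 + log_Z n)) ,

in agreement with the stated bound.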
4. Funnelsort

Cache-oblivious algorithms, such as the familiar two-way merge sort, are not necessarily optimal with respect to cache misses. The Z-way mergesort suggested by Aggarwal and Vitter [3] has optimal cache complexity, but although it apparently works well in practice [23], it is cache aware. This section describes a cache-oblivious sorting algorithm called "funnelsort." This algorithm has optimal
O(n lg n) work complexity, and optimal O(1 + (n/L)(1 + log_Z n)) cache complexity.

Funnelsort is similar to mergesort. In order to sort a (contiguous) array of n elements, funnelsort performs the following two steps:

1. Split the input into n^{1/3} contiguous arrays of size n^{2/3}, and sort these arrays recursively.
2. Merge the n^{1/3} sorted sequences using an n^{1/3}-merger, which is described below.
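The top-level structure is easy to state in code. In the following runnable C sketch (ours), the placeholder merge scans all k run heads per output element; it is correct but is not the cache-optimal k-merger described next, for which it merely stands in:

    #include <stdlib.h>
    #include <string.h>

    /* Placeholder k-way merge: repeatedly pick the smallest head. */
    static void merge_k(long **heads, long *lens, int k,
                        long *out, long total)
    {
        for (long t = 0; t < total; t++) {
            int best = -1;
            for (int i = 0; i < k; i++)
                if (lens[i] > 0 && (best < 0 || *heads[i] < *heads[best]))
                    best = i;
            out[t] = *heads[best]++;
            lens[best]--;
        }
    }

    void funnelsort(long *a, long n)
    {
        if (n < 8) {                       /* tiny base case: insertion sort */
            for (long i = 1; i < n; i++) {
                long v = a[i], j = i - 1;
                while (j >= 0 && a[j] > v) { a[j + 1] = a[j]; j--; }
                a[j + 1] = v;
            }
            return;
        }
        int k = 1;                         /* k = floor(n^(1/3)) runs */
        while ((long)(k + 1) * (k + 1) * (k + 1) <= n) k++;
        long base = n / k;                 /* run length, about n^(2/3) */
        long **heads = malloc(k * sizeof *heads);
        long *lens   = malloc(k * sizeof *lens);
        long *out    = malloc(n * sizeof *out);
        for (int i = 0; i < k; i++) {
            long lo = i * base;
            long hi = (i == k - 1) ? n : lo + base;
            funnelsort(a + lo, hi - lo);   /* step 1: sort the runs */
            heads[i] = a + lo;
            lens[i]  = hi - lo;
        }
        merge_k(heads, lens, k, out, n);   /* step 2: k-merger goes here */
        memcpy(a, out, n * sizeof *a);
        free(heads); free(lens); free(out);
    }

A real implementation replaces merge_k with the k-merger developed below.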
Funnelsort differs from mergesort in the way the merge operation works. Merging is performed by a device called a k-merger, which inputs k sorted sequences and merges them. A k-merger operates by recursively merging sorted sequences which become progressively longer as the algorithm proceeds. Unlike mergesort, however, a k-merger suspends work on a merging subproblem when the merged output sequence becomes "long enough" and resumes work on another merging subproblem.

This complicated flow of control makes a k-merger a bit tricky to describe. Figure 3 shows a representation of a k-merger, which has k sorted sequences as inputs. Throughout its execution, the k-merger maintains the following invariant.

Invariant. Each invocation of a k-merger outputs the next k^3 elements of the sorted sequence obtained by merging the k input sequences.

A k-merger is built recursively out of √k-mergers in the following way. The k inputs are partitioned into √k sets of √k elements, which form the input to the √k √k-mergers L1, L2, ..., L_√k in the left part of the figure. The outputs of these mergers are connected to the inputs of √k buffers. Each buffer is a FIFO queue that can hold 2k^{3/2} elements. Finally, the outputs of the buffers are connected to the √k inputs of the √k-merger R in the right part of the figure. The output of this final √k-merger becomes the output of the whole k-merger. The intermediate buffers are overdimensioned, since each can hold 2k^{3/2} elements, which is twice the number k^{3/2} of elements output by a √k-merger. This additional buffer space is necessary for the correct behavior of the algorithm, as will be explained below. The base case of the recursion is a k-merger with k = 2, which produces k^3 = 8 elements whenever invoked.

[Figure 3: Illustration of a k-merger. A k-merger is built recursively out of √k "left" √k-mergers L1, L2, ..., L_√k, a series of buffers, and one "right" √k-merger R.]

A k-merger operates recursively in the following way. In order to output k^3 elements, the k-merger invokes R k^{3/2} times. Before each invocation, however, the k-merger fills all buffers that are less than half full, i.e., all buffers that contain less than k^{3/2} elements. In order to fill buffer i, the algorithm invokes the corresponding left merger L_i once. Since L_i outputs k^{3/2} elements, the buffer contains at least k^{3/2} elements after L_i finishes.

It can be proven by induction that the work complexity of funnelsort is O(n lg n). We will now analyze the cache complexity. The goal of the analysis is to show that funnelsort on n elements requires at most Q(n) cache misses, where

    Q(n) = O(1 + (n/L)(1 + log_Z n)) .

In order to prove this result, we need three auxiliary lemmas. The first lemma bounds the space required by a k-merger.

Lemma 4. A k-merger can be laid out in O(k^2) contiguous memory locations.

Proof. A k-merger requires O(k^2) memory locations for the buffers, plus the space required by the √k-mergers. The space S(k) thus satisfies the recurrence

    S(k) ≤ (√k + 1) S(√k) + O(k^2) ,

whose solution is S(k) = O(k^2).

In order to achieve the bound on Q(n), the buffers in a k-merger must be maintained as circular queues of size k. This requirement guarantees that we can manage the queue cache-efficiently, in the sense stated by the next lemma.

Lemma 5. Performing r insert and remove operations on a circular queue causes O(1 + r/L) cache misses as long as two cache lines are available for the buffer.

Proof. Associate the two cache lines with the head and tail of the circular queue. If a new cache line is read during an insert (delete) operation, the next L − 1 insert (delete) operations do not cause a cache miss.
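For concreteness, a minimal C sketch of such a queue (ours): with a power-of-two capacity, the head and tail indices only ever increase, so successive operations at each end touch consecutive buffer addresses — exactly the access pattern that the two-cache-line argument above exploits.

    typedef struct {
        long *buf;
        unsigned cap;       /* capacity, a power of 2 */
        unsigned head;      /* next slot to remove from */
        unsigned tail;      /* next slot to insert into */
    } cqueue;

    static void cq_insert(cqueue *q, long x)
    {
        q->buf[q->tail & (q->cap - 1)] = x;   /* wrap around modulo cap */
        q->tail++;
    }

    static long cq_remove(cqueue *q)
    {
        long x = q->buf[q->head & (q->cap - 1)];
        q->head++;
        return x;
    }

    static unsigned cq_size(const cqueue *q) { return q->tail - q->head; }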
The next lemma bounds the cache complexity of a k-merger.
Lemma 6. If Z = Ω(L^2), then a k-merger operates with at most

    Q_M(k) = O(1 + k + k^3/L + k^3 log_Z k / L)

cache misses.

Proof. There are two cases: either k < α√Z or k > α√Z, where α is a sufficiently small constant.

Case I: k < α√Z. By Lemma 4, the data structure associated with the k-merger requires at most O(k^2) = O(Z) contiguous memory locations, and therefore it fits into cache. The k-merger has k input queues from which it loads O(k^3) elements. Let r_i be the number of elements extracted from the ith input queue. Since k < α√Z and the tall-cache assumption (1) implies that L = O(√Z), there are at least Z/L = Ω(k) cache lines available for the input buffers. Lemma 5 applies, whence the total number of cache misses for accessing the input queues is

    Σ_{i=1}^{k} O(1 + r_i/L) = O(k + k^3/L) .

Similarly, Lemma 4 implies that the cache complexity of writing the output queue is O(1 + k^3/L). Finally, the algorithm incurs O(1 + k^2/L) cache misses for touching its internal data structures. The total cache complexity is therefore Q_M(k) = O(1 + k + k^3/L).

Case II: k > α√Z. We prove by induction on k that whenever k > α√Z, we have

    Q_M(k) ≤ c k^3 log_Z k / L − A(k) ,                           (12)

where A(k) = k(1 + 2c log_Z k / L) = o(k^3). This particular value of A(k) will be justified at the end of the analysis. The base case of the induction consists of values of k such that αZ^{1/4} < k < α√Z. (It is not sufficient only to consider k = Θ(√Z), since k can become as small as Θ(Z^{1/4}) in the recursive calls.) The analysis of the first case applies, yielding Q_M(k) = O(1 + k + k^3/L). Because k > α√Z = Ω(L) and k = Ω(1), the last term dominates, which implies Q_M(k) = O(k^3/L). Consequently, a big enough value of c can be found that satisfies Inequality (12).

For the inductive case, suppose that k > α√Z. The k-merger invokes the √k-mergers recursively. Since αZ^{1/4} < √k < k, the inductive hypothesis can be used to bound the number Q_M(√k) of cache misses incurred by the submergers. The "right" merger R is invoked exactly k^{3/2} times. The total number l of invocations of "left" mergers is bounded by l < k^{3/2} + 2√k. To see why, consider that every invocation of a left merger puts k^{3/2} elements into some buffer. Since k^3 elements are output and the buffer space is 2k^2, the bound l < k^{3/2} + 2√k follows.

Before invoking R, the algorithm must check every buffer to see whether it is empty. One such check requires at most √k cache misses, since there are √k buffers. This check is repeated exactly k^{3/2} times, leading to at most k^2 cache misses for all checks. These considerations lead to the recurrence

    Q_M(k) ≤ (2k^{3/2} + 2√k) Q_M(√k) + k^2 .

Application of the inductive hypothesis and the choice A(k) = k(1 + 2c log_Z k / L) yields Inequality (12) as follows:

    Q_M(k) ≤ (2k^{3/2} + 2√k) Q_M(√k) + k^2
           ≤ 2(k^{3/2} + √k) [ c k^{3/2} log_Z k / (2L) − A(√k) ] + k^2
           ≤ c k^3 log_Z k / L + k^2 (1 + c log_Z k / L) − (2k^{3/2} + 2√k) A(√k)
           ≤ c k^3 log_Z k / L − A(k) .

Theorem 7. To sort n elements, funnelsort incurs O(1 + (n/L)(1 + log_Z n)) cache misses.

Proof. If n < αZ for a small enough constant α, then the algorithm fits into cache. To see why, observe that only one k-merger is active at any time. The biggest k-merger is the top-level n^{1/3}-merger, which requires O(n^{2/3}) < O(n) space. The algorithm thus can operate in O(1 + n/L) cache misses.

If n > αZ, we have the recurrence

    Q(n) = n^{1/3} Q(n^{2/3}) + Q_M(n^{1/3}) .

By Lemma 6, we have Q_M(n^{1/3}) = O(1 + n^{1/3} + n/L + n log_Z n / L).

By the tall-cache assumption (1), we have n/L = Ω(n^{1/3}). Moreover, we also have n^{1/3} = Ω(1) and lg n = Ω(lg Z). Consequently, Q_M(n^{1/3}) = O(n log_Z n / L) holds, and the recurrence simplifies to

    Q(n) = n^{1/3} Q(n^{2/3}) + O(n log_Z n / L) .

The result follows by induction on n.

This upper bound matches the lower bound stated by the next theorem, proving that funnelsort is cache-optimal.

Theorem 8. The cache complexity of any sorting algorithm is Q(n) = Ω(1 + (n/L)(1 + log_Z n)).

Proof. Aggarwal and Vitter [3] show that there is an Ω((n/L) log_{Z/L}(n/Z)) bound on the number of cache misses made by any sorting algorithm on their "out-of-core" memory model, a bound that extends to the ideal-cache model. The theorem can be proved by applying the tall-cache assumption Z = Ω(L^2) and the trivial lower bounds of Q(n) = Ω(1) and Q(n) = Ω(n/L).
5. Distribution sort

In this section, we describe another cache-oblivious optimal sorting algorithm based on distribution sort. Like the funnelsort algorithm from Section 4, the distribution-sorting algorithm uses O(n lg n) work to sort n elements, and it incurs O(1 + (n/L)(1 + log_Z n)) cache misses. Unlike previous cache-efficient distribution-sorting algorithms [1, 3, 25, 34, 36], which use sampling or other techniques to find the partitioning elements before the distribution step, our algorithm uses a "bucket splitting" technique to select pivots incrementally during the distribution step.

Given an array A (stored in contiguous locations) of length n, the cache-oblivious distribution sort operates as follows:

1. Partition A into √n contiguous subarrays of size √n. Recursively sort each subarray.
2. Distribute the sorted subarrays into q buckets B_1, ..., B_q of size n_1, ..., n_q, respectively, such that
   (a) max{x | x ∈ B_i} ≤ min{x | x ∈ B_{i+1}} for i = 1, 2, ..., q − 1 ;
   (b) n_i ≤ 2√n for i = 1, 2, ..., q.
   (See below for details.)
3. Recursively sort each bucket.
4. Copy the sorted buckets to array A.

A stack-based memory allocator is used to exploit spatial locality.

The goal of Step 2 is to distribute the sorted subarrays of A into q buckets B_1, B_2, ..., B_q. The algorithm maintains two invariants. First, at any time each bucket holds at most 2√n elements, and any element in bucket B_i is smaller than any element in bucket B_{i+1}. Second, every bucket has an associated pivot. Initially, only one empty bucket exists with pivot ∞.

The idea is to copy all elements from the subarrays into the buckets while maintaining the invariants. We keep state information for each subarray and bucket. The state of a subarray consists of the index next of the next element to be read from the subarray and the bucket number bnum where this element should be copied. By convention, bnum = ∞ if all elements in a subarray have been copied. The state of a bucket consists of the pivot and the number of elements currently in the bucket.

We would like to copy the element at position next of a subarray to bucket bnum. If this element is greater than the pivot of bucket bnum, we would increment bnum until we find a bucket for which the element is smaller than the pivot. Unfortunately, this basic strategy has poor caching behavior, which calls for a more complicated procedure.

The distribution step is accomplished by the recursive procedure DISTRIBUTE(i, j, m), which distributes elements from the ith through (i + m − 1)th subarrays into buckets starting from B_j. Given the precondition that each subarray i, i + 1, ..., i + m − 1 has its bnum ≥ j, the execution of DISTRIBUTE(i, j, m) enforces the postcondition that subarrays i, i + 1, ..., i + m − 1 have their bnum ≥ j + m. Step 2 of the distribution sort invokes DISTRIBUTE(1, 1, √n). The following is a recursive implementation of DISTRIBUTE:

DISTRIBUTE(i, j, m)
1  if m = 1
2    then COPYELEMS(i, j)
3    else DISTRIBUTE(i, j, m/2)
4         DISTRIBUTE(i + m/2, j, m/2)
5         DISTRIBUTE(i, j + m/2, m/2)
6         DISTRIBUTE(i + m/2, j + m/2, m/2)

In the base case, the procedure COPYELEMS(i, j) copies all elements from subarray i that belong to bucket j. If bucket j has more than 2√n elements after the insertion, it can be split into two buckets of size at least √n. For the splitting operation, we use the deterministic median-finding algorithm [14, p. 189] followed by a partition.
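This recursion transcribes directly into C (our sketch; copy_elems stands for the COPYELEMS routine just described and is left as an assumed helper, and m is assumed to be a power of 2):

    void copy_elems(int i, int j);   /* assumed helper: drain subarray i
                                        into bucket j, splitting if needed */

    void distribute(int i, int j, int m)
    {
        if (m == 1) {
            copy_elems(i, j);
        } else {
            /* the four quadrant calls, in the order given above */
            distribute(i,         j,         m / 2);
            distribute(i + m / 2, j,         m / 2);
            distribute(i,         j + m / 2, m / 2);
            distribute(i + m / 2, j + m / 2, m / 2);
        }
    }

The order of the four calls matters: the two calls with bucket offset j precede the two with offset j + m/2, which is what establishes the postcondition on bnum.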
Lemma 9. The median of n elements can be found cache-obliviously using O(n) work and incurring O(1 + n/L) cache misses.

Proof. See [14, p. 189] for the linear-time median-finding algorithm and the work analysis. The cache complexity is given by the same recurrence as the work complexity with a different base case:

    Q(m) =
        O(1 + m/L)                                  if m ≤ αZ ,
        Q(⌈m/5⌉) + Q(7m/10 + 6) + O(1 + m/L)        otherwise ,

where α is a sufficiently small constant. The result follows.

In our case, we have buckets of size 2√n + 1. In addition, when a bucket splits, all subarrays whose bnum is greater than the bnum of the split bucket must have their bnum's incremented. The analysis of DISTRIBUTE is given by the following lemma.

Lemma 10. The distribution step involves O(n) work, incurs O(1 + n/L) cache misses, and uses O(n) stack space to distribute n elements.

Proof. In order to simplify the analysis of the work used by DISTRIBUTE, assume that COPYELEMS uses O(1) work for procedural overhead. We will account for the work due to copying elements and splitting of buckets separately. The work of DISTRIBUTE is described by
the recurrence

    T(c) = 4T(c/2) + O(1) .

It follows that T(c) = O(c^2), where c = √n initially. The work due to copying elements is also O(n).

The total number of bucket splits is at most √n. To see why, observe that there are at most √n buckets at the end of the distribution step, since each bucket contains at least √n elements. Each split operation involves O(√n) work, and so the net contribution to the work is O(n). Thus, the total work used by DISTRIBUTE is W(n) = O(T(√n)) + O(n) + O(n) = O(n).

For the cache analysis, we distinguish two cases. Let α be a sufficiently small constant such that the stack space used fits into cache.

Case I, n ≤ αZ: The input and the auxiliary space of size O(n) fit into cache using O(1 + n/L) cache lines. Consequently, the cache complexity is O(1 + n/L).

Case II, n > αZ: Let R(c, m) denote the cache misses incurred by an invocation of DISTRIBUTE(a, b, c) that copies m elements from subarrays to buckets. We first prove that R(c, m) = O(L + c^2/L + m/L), ignoring the cost of splitting buckets, which we shall account for separately. We argue that R(c, m) satisfies the recurrence

    R(c, m) ≤                                                     (13)
        O(L + m/L)                  if c ≤ αL ,
        Σ_{i=1}^{4} R(c/2, m_i)     otherwise ,

where Σ_{i=1}^{4} m_i = m, whose solution is R(c, m) = O(L + c^2/L + m/L). The recursive case c > αL follows immediately from the algorithm. The base case c ≤ αL can be justified as follows. An invocation of DISTRIBUTE(a, b, c) operates with c subarrays and c buckets. Since there are Ω(L) cache lines, the cache can hold all the auxiliary storage involved and the currently accessed element in each subarray and bucket. In this case there are O(L + m/L) cache misses. The initial access to each subarray and bucket causes O(c) = O(L) cache misses. Copying the m elements to and from contiguous locations causes O(1 + m/L) cache misses.

We still need to account for the cache misses caused by the splitting of buckets. Each split causes O(1 + √n/L) cache misses due to median finding (Lemma 9) and partitioning of √n contiguous elements. An additional O(1 + √n/L) misses are incurred by restoring the cache. As proven in the work analysis, there are at most √n split operations. By adding R(√n, n) to the split complexity, we conclude that the total cache complexity of the distribution step is O(L + n/L + √n (1 + √n/L)) = O(n/L).

The analysis of distribution sort is given in the next theorem. The work and cache complexity match the lower bounds specified in Theorem 8.

Theorem 11. Distribution sort uses O(n lg n) work and incurs O(1 + (n/L)(1 + log_Z n)) cache misses to sort n elements.

Proof. The work done by the algorithm is given by

    W(n) = √n W(√n) + Σ_{i=1}^{q} W(n_i) + O(n) ,

where each n_i ≤ 2√n and Σ n_i = n. The solution to this recurrence is W(n) = O(n lg n).

The space complexity of the algorithm is given by

    S(n) ≤ S(2√n) + O(n) ,

where the O(n) term comes from Step 2. The solution to this recurrence is S(n) = O(n).

The cache complexity of distribution sort is described by the recurrence

    Q(n) ≤
        O(1 + n/L)                                      if n ≤ αZ ,
        √n Q(√n) + Σ_{i=1}^{q} Q(n_i) + O(1 + n/L)      otherwise ,

where α is a sufficiently small constant such that the stack space used by a sorting problem of size αZ, including the input array, fits completely in cache. The base case n ≤ αZ arises when both the input array A and the contiguous stack space of size S(n) = O(n) fit in O(1 + n/L) cache lines of the cache. In this case, the algorithm incurs O(1 + n/L) cache misses to touch all involved memory locations once. In the case where n > αZ, the recursive calls in Steps 1 and 3 cause √n Q(√n) + Σ_{i=1}^{q} Q(n_i) cache misses, and O(1 + n/L) is the cache complexity of Steps 2 and 4, as shown by Lemma 10. The theorem follows by solving the recurrence.

6. Theoretical justifications for the ideal-cache model

How reasonable is the ideal-cache model for algorithm design? The model incorporates four major assumptions that deserve scrutiny:
• optimal replacement,
• exactly two levels of memory,
• automatic replacement,
• full associativity.

Designing algorithms in the ideal-cache model is easier than in models lacking these properties, but are these assumptions too strong? In this section we show that cache-oblivious algorithms designed in the ideal-cache model can be efficiently simulated by weaker models.

The first assumption that we shall eliminate is that of optimal replacement. Our strategy for the simulation is to use an LRU (least-recently used) replacement strategy [20, p. 378] in place of the optimal and omniscient replacement strategy. We start by proving a
bounds specified in Theorem 8. niscient replacement strategy. We start by proving a
lemma that bounds the effectiveness of the LRU simu- never replaced, the multilevel cache maintains the inclu-
lation. We then show that algorithms whose complex- sion property. The next lemma, whose proof is omitted,
ity bounds satisfy a simple regularity condition (includ- asserts that even though a cache in a multilevel model
ing all algorithms heretofore presented) can be ported to does not see accesses that hit at lower levels, it neverthe-
caches incorporating an LRU replacement policy. less behaves like the first-level cache of a simple two-
level model, which sees all the memory accesses.
Lemma 12 Consider an algorithm that causes
Q (n; Z ; L) cache misses on a problem of size n using Lemma 14 A (Zi ; Li )-cache at a given level i of a mul-
a (Z ; L) ideal cache. Then, the same algorithm incurs tilevel LRU model always contains the same cache lines
Q(n; Z ; L)  2Q (n; Z =2; L) cache misses on a (Z ; L) as a simple (Zi ; Li )-cache managed by LRU that serves
cache that uses LRU replacement. the same sequence of memory accesses.
Proof. Sleator and Tarjan [30] have shown that the
Lemma 15 An optimal cache-oblivious algorithm
cache misses on a (Z ; L) cache using LRU replacement
are (Z =L)=((Z ? Z  )=L + 1)-competitive with optimal
whose cache complexity satisifies the regularity condi-
replacement on a (Z  ; L) ideal cache if both caches start tion (14) incurs an optimal number of cache misses on
empty. It follows that the number of misses on a (Z ; L) each level3 of a multilevel cache with LRU replacement.
LRU-cache is at most twice the number of misses on a Proof. Let cache i in the multilevel LRU model be a
(Z =2; L) ideal-cache. (Zi ; Li ) cache. Lemma 14 says that the cache holds ex-
actly the same elements as a (Zi ; Li ) cache in a two-level
Corollary 13 For any algorithm whose cache- LRU model. From Corollary 13, the cache complex-
complexity bound Q(n; Z ; L) in the ideal-cache model ity of a cache-oblivious algorithm working on a (Zi ; Li )
satisfies the regularity condition LRU cache lower-bounds that of any cache-aware algo-
Q(n; Z ; L) = O(Q(n; 2Z ; L)) ; (14) rithm for a (Zi ; Li ) ideal cache. A (Zi ; Li ) level in a mul-
tilevel cache incurs at least as many cache misses as a
the number of cache misses with LRU replacement is
(Zi ; Li ) ideal cache when the same algorithm is executed.
Θ(Q(n; Z ; L)).
Proof. Follows directly from (14) and Lemma 12. Finally, we remove the two assumptions of automatic
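As a concrete check (our own example, not in the paper), the sorting bound Q(n; Z, L) = O(1 + (n/L)(1 + log_Z n)) satisfies (14): for Z ≥ 2,

    log_{2Z} n = lg n / (1 + lg Z) ≥ lg n / (2 lg Z) = (1/2) log_Z n ,

so 1 + log_Z n ≤ 2(1 + log_{2Z} n), and doubling the cache size changes the bound by at most a constant factor. The bounds for matrix multiplication, transposition, and the FFT satisfy (14) similarly, since Z enters them only through √Z or log_Z.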
The second assumption we shall eliminate is the assumption of only two levels of memory. Although models incorporating multiple levels of caches may be necessary to analyze some algorithms, for cache-oblivious algorithms, analysis in the two-level ideal-cache model suffices. Specifically, optimal cache-oblivious algorithms also perform optimally in computers with multiple levels of LRU caches. We assume that the caches satisfy the inclusion property [20, p. 723], which says that the values stored in cache i are also stored in cache i + 1 (where cache 1 is the cache closest to the processor). We also assume that if two elements belong to the same cache line at level i, then they belong to the same line at level i + 1. Moreover, we assume that cache i + 1 has strictly more cache lines than cache i. These assumptions ensure that cache i + 1 includes the contents of cache i plus at least one more cache line.

The multilevel LRU cache operates as follows. A hit on an element in cache i is served by cache i and is not seen by higher-level caches. We consider a line in cache i + 1 to be marked if any element stored on the line belongs to cache i. When cache i misses on an access, it recursively fetches the needed line from cache i + 1, replacing the least-recently accessed unmarked cache line. The replaced cache line is then brought to the front of cache (i + 1)'s LRU list. Because marked cache lines are never replaced, the multilevel cache maintains the inclusion property. The next lemma, whose proof is omitted, asserts that even though a cache in a multilevel model does not see accesses that hit at lower levels, it nevertheless behaves like the first-level cache of a simple two-level model, which sees all the memory accesses.

Lemma 14. A (Z_i, L_i)-cache at a given level i of a multilevel LRU model always contains the same cache lines as a simple (Z_i, L_i)-cache managed by LRU that serves the same sequence of memory accesses.

Lemma 15. An optimal cache-oblivious algorithm whose cache complexity satisfies the regularity condition (14) incurs an optimal number of cache misses on each level³ of a multilevel cache with LRU replacement.

(³Alpern, Carter, and Feig [5] show that optimality on each level of memory in the UMH model does not necessarily imply global optimality. The UMH model incorporates a single cost measure that combines the costs of work and cache faults at each of the levels of memory. By analyzing the levels independently, our multilevel ideal-cache model remains agnostic about the various schemes by which work and cache faults might be combined.)

Proof. Let cache i in the multilevel LRU model be a (Z_i, L_i) cache. Lemma 14 says that the cache holds exactly the same elements as a (Z_i, L_i) cache in a two-level LRU model. From Corollary 13, the cache complexity of a cache-oblivious algorithm working on a (Z_i, L_i) LRU cache lower-bounds that of any cache-aware algorithm for a (Z_i, L_i) ideal cache. A (Z_i, L_i) level in a multilevel cache incurs at least as many cache misses as a (Z_i, L_i) ideal cache when the same algorithm is executed.

Finally, we remove the two assumptions of automatic replacement and full associativity. Specifically, we shall show that a fully associative LRU cache can be maintained in ordinary memory with no asymptotic loss in expected performance.

Lemma 16. A (Z, L) LRU-cache can be maintained using O(Z) memory locations such that every access to a cache line in memory takes O(1) expected time.

Proof. Given the address of the memory location to be accessed, we use a 2-universal hash function [24, p. 216] to maintain a hash table of cache lines present in the memory. The Z/L entries in the hash table point to linked lists in a heap of memory that contains Z/L records corresponding to the cache lines. The 2-universal hash function guarantees that the expected size of a chain is O(1). All records in the heap are organized as a doubly linked list in the LRU order. Thus, the LRU policy can be implemented in O(1) expected time using O(Z/L) records of O(L) words each.
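A C sketch of this structure (ours; a fixed multiplicative hash stands in for the lemma's 2-universal hash function, and the L payload words per record are elided):

    #include <stdlib.h>

    typedef struct line {
        unsigned long tag;          /* line address, i.e., address / L */
        struct line *hnext;         /* hash chain */
        struct line *prev, *next;   /* LRU list, most recent at head */
        /* ... the L words of cached data would live here ... */
    } line_t;

    typedef struct {
        line_t **table;             /* Z/L hash buckets */
        unsigned long nbuckets;
        line_t *head, *tail;        /* LRU order */
    } lru_cache;

    static unsigned long hash(const lru_cache *c, unsigned long tag)
    {
        return (tag * 2654435761UL) % c->nbuckets;
    }

    /* Expected O(1): one chain lookup plus constant-time list splicing. */
    line_t *lru_lookup(lru_cache *c, unsigned long tag)
    {
        for (line_t *l = c->table[hash(c, tag)]; l; l = l->hnext)
            if (l->tag == tag) {
                if (l != c->head) {          /* move to front of LRU list */
                    if (l->prev) l->prev->next = l->next;
                    if (l->next) l->next->prev = l->prev;
                    if (c->tail == l) c->tail = l->prev;
                    l->prev = NULL;
                    l->next = c->head;
                    c->head->prev = l;
                    c->head = l;
                }
                return l;                    /* hit */
            }
        return NULL;   /* miss: caller evicts c->tail and rehashes the line */
    }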
Theorem 17. An optimal cache-oblivious algorithm whose cache-complexity bound satisfies the regularity condition (14) can be implemented optimally in expectation in multilevel models with explicit memory management.

Proof. Combine Lemma 15 and Lemma 16.

Corollary 18. The recursive cache-oblivious algorithms for matrix multiplication, matrix transpose, FFT, and sorting are optimal in multilevel models with explicit memory management.

Proof. Their complexity bounds satisfy the regularity condition (14).

It can also be shown [26] that cache-oblivious algorithms satisfying (14) are also optimal (in expectation) in the previously studied SUMH [5, 34] and HMM [1] models. Thus, all the algorithmic results in this paper apply to these models, matching the best bounds previously achieved.

Other simulation results can be shown. For example, by using the copying technique of [22], cache-oblivious algorithms for matrix multiplication and other problems can be designed that are provably optimal on direct-mapped caches.

7. Related work

In this section, we discuss the origin of the notion of cache-obliviousness. We also give an overview of other hierarchical memory models.

Our research group at MIT noticed as far back as 1994 that divide-and-conquer matrix multiplication was a cache-optimal algorithm that required no tuning, but we did not adopt the term "cache-oblivious" until 1997. This matrix-multiplication algorithm, as well as a cache-oblivious algorithm for LU-decomposition without pivoting, eventually appeared in [9]. Shortly after leaving our research group, Toledo [32] independently proposed a cache-oblivious algorithm for LU-decomposition with pivoting. For n × n matrices, Toledo's algorithm uses Θ(n^3) work and incurs Θ(1 + n^2/L + n^3/(L√Z)) cache misses. More recently, our group has produced an FFT library called FFTW [18], which in its most recent incarnation [17] employs a register-allocation and scheduling algorithm inspired by our cache-oblivious FFT algorithm. The general idea that divide-and-conquer enhances memory locality has been known for a long time [29].

Previous theoretical work on hierarchical memories and the I/O-complexity of algorithms has used cache-aware models lacking an automatic replacement strategy, although [10, 28] are recent exceptions. Hong and Kung [21] use the red-blue pebble game to prove lower bounds on the I/O-complexity of matrix multiplication, FFT, and other problems. The red-blue pebble game models temporal locality using two levels of memory. The model was extended by Savage [27] for deeper memory hierarchies. Aggarwal and Vitter [3] introduced spatial locality and investigated a two-level memory in which a block of P contiguous items can be transferred in one step. They obtained tight bounds for matrix multiplication, FFT, sorting, and other problems. The hierarchical memory model (HMM) by Aggarwal et al. [1] treats memory as a linear array, where the cost of an access to the element at location x is given by a cost function f(x). The BT model [2] extends HMM to support block transfers. The UMH model by Alpern et al. [5] is a multilevel model that allows I/O at different levels to proceed in parallel. Vitter and Shriver introduce parallelism, and they give algorithms for matrix multiplication, FFT, sorting, and other problems in both a two-level model [35] and several parallel hierarchical memory models [36]. Vitter [33] provides a comprehensive survey of external-memory algorithms.

8. Conclusion

The theoretical work presented in this paper opens two important avenues for future research. The first is to determine the range of practicality of cache-oblivious algorithms, or indeed, of any algorithms developed in the ideal-cache model. The second is to resolve, from a complexity-theoretic point of view, the relative strengths of cache-oblivious and cache-aware algorithms. To conclude, we discuss each of these avenues in turn.

Figure 4 compares per-element time to transpose a matrix using the naive iterative algorithm employing a doubly nested loop with the recursive cache-oblivious REC-TRANSPOSE algorithm from Section 3. The two algorithms were evaluated on a 450 megahertz AMD K6III processor with a 32-kilobyte 2-way set-associative L1 cache, a 64-kilobyte 4-way set-associative L2 cache, and a 1-megabyte L3 cache of unknown associativity, all with 32-byte cache lines. The code for REC-TRANSPOSE was the same as presented in Section 3,
except that the divide-and-conquer structure was modified to produce exact powers of 2 as submatrix sizes wherever possible. In addition, the base cases were "coarsened" by inlining the recursion near the leaves to increase their size and overcome the overhead of procedure calls. (A good research problem is to determine an effective compiler strategy for coarsening base cases automatically.)

[Figure 4: Average time to transpose an N × N matrix, divided by N^2, for the iterative and recursive algorithms.]

Although these results must be considered preliminary, Figure 4 strongly indicates that the recursive algorithm outperforms the iterative algorithm throughout the range of matrix sizes. Moreover, the iterative algorithm behaves erratically, apparently due to so-called "conflict" misses [20, p. 390], where limited cache associativity interacts with the regular addressing of the matrix to cause systematic interference. Blocking the iterative algorithm should help with conflict misses [22], but it would make the algorithm cache aware. For large matrices, the recursive algorithm executes in less than 70% of the time used by the iterative algorithm, even though the transpose problem exhibits no temporal locality.

[Figure 5: Average time taken to multiply two N × N matrices, divided by N^3, for the iterative and recursive algorithms.]

Figure 5 makes a similar comparison between the naive iterative matrix-multiplication algorithm, which uses three nested loops, and the O(n^3)-work recursive REC-MULT algorithm described in Section 2. This problem exhibits a high degree of temporal locality, which REC-MULT exploits effectively. As the figure shows, the average time used per integer multiplication in the recursive algorithm is almost constant, which for large matrices is less than 50% of the time used by the iterative variant. A similar study for Jacobi multipass filters can be found in [26].

Several researchers [12, 16] have also observed that recursive algorithms exhibit performance advantages over iterative algorithms for computers with caches. A comprehensive empirical study has yet to be done, however. Do cache-oblivious algorithms perform nearly as well as cache-aware algorithms in practice, where constant factors matter? Does the ideal-cache model capture the substantial caching concerns for an algorithms designer?

An anecdotal affirmative answer to these questions is exhibited by the popular FFTW library [17, 18], which uses a recursive strategy to exploit caches in Fourier transform calculations. FFTW's code generator produces straight-line "codelets," which are coarsened base cases for the FFT algorithm. Because these codelets are cache oblivious, a C compiler can perform its register allocation efficiently, and yet the codelets can be generated without knowing the number of registers on the target architecture.

To close, we mention two theoretical avenues that should be explored to determine the complexity-theoretic relationship between cache-oblivious algorithms and cache-aware algorithms.

Separation: Is there a gap in asymptotic complexity between cache-aware and cache-oblivious algorithms? It appears that cache-aware algorithms should be able to use caches better than cache-oblivious algorithms, since they have more knowledge about the system on which they are running. Do there exist problems for which this advantage is asymptotically significant, for example an Ω(lg Z) advantage? Bilardi and Peserico [8] have recently taken some steps toward proving a separation.

Simulation: Is there a limit to how much better a cache-aware algorithm can be than a cache-oblivious algorithm for the same problem? That is, given a class of optimal cache-aware algorithms to solve a single problem, can we construct a good cache-oblivious algorithm that solves the same problem with only, for example, O(lg Z) loss of efficiency? Perhaps simulation techniques can be used to convert a class of efficient cache-aware algorithms into a comparably efficient cache-oblivious algorithm.

Acknowledgments

Thanks to Bobby Blumofe, now of the University of Texas at Austin, who sparked early discussions at MIT about what we now call cache obliviousness. Thanks to Gianfranco Bilardi of University of Padova, Sid Chatterjee of University of North Carolina, Chris Joerg of Compaq CRL, Martin Rinard of MIT, Bin Song of MIT, Sivan Toledo of Tel Aviv University, and David Wise of Indiana University for helpful discussions. Thanks also to our anonymous reviewers.

References

[1] A. Aggarwal, B. Alpern, A. K. Chandra, and M. Snir. A model for hierarchical memory. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing (STOC), pages 305–314, May 1987.
[2] A. Aggarwal, A. K. Chandra, and M. Snir. Hierarchical memory with block transfer. In 28th Annual Symposium on Foundations of Computer Science (FOCS), pages 204–216, Los Angeles, California, 12–14 Oct. 1987. IEEE.
[3] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116–1127, Sept. 1988.
[4] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley Publishing Company, 1974.
[5] B. Alpern, L. Carter, and E. Feig. Uniform memory hierarchies. In Proceedings of the 31st Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 600–608, Oct. 1990.
[6] D. H. Bailey. FFTs in external or hierarchical memory. Journal of Supercomputing, 4(1):23–35, May 1990.
[7] L. A. Belady. A study of replacement algorithms for virtual storage computers. IBM Systems Journal, 5(2):78–101, 1966.
[8] G. Bilardi and E. Peserico. Efficient portability across memory hierarchies. Unpublished manuscript, 1999.
[9] R. D. Blumofe, M. Frigo, C. F. Joerg, C. E. Leiserson, and K. H. Randall. An analysis of dag-consistent distributed shared-memory algorithms. In Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 297–308, Padua, Italy, June 1996.
[10] L. Carter and K. S. Gatlin. Towards an optimal bit-reversal permutation program. In Proceedings of the 39th Annual Symposium on Foundations of Computer Science, pages 544–555. IEEE Computer Society Press, 1998.
[11] S. Chatterjee, V. V. Jain, A. R. Lebeck, and S. Mundhra. Nonlinear array layouts for hierarchical memory systems. In Proceedings of the ACM International Conference on Supercomputing, Rhodes, Greece, June 1999.
[12] S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In Proceedings of the Eleventh ACM Symposium on Parallel Algorithms and Architectures (SPAA), June 1999.
[13] J. W. Cooley and J. W. Tukey. An algorithm for the machine computation of the complex Fourier series. Mathematics of Computation, 19:297–301, Apr. 1965.
[14] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press and McGraw Hill, 1990.
[15] P. Duhamel and M. Vetterli. Fast Fourier transforms: a tutorial review and a state of the art. Signal Processing, 19:259–299, Apr. 1990.
[16] J. D. Frens and D. S. Wise. Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 206–216, Las Vegas, NV, June 1997.
[17] M. Frigo. A fast Fourier transform compiler. In Proceedings of the ACM SIGPLAN'99 Conference on Programming Language Design and Implementation (PLDI), Atlanta, Georgia, May 1999.
[18] M. Frigo and S. G. Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Seattle, Washington, May 1998.
[19] G. H. Golub and C. F. van Loan. Matrix Computations. Johns Hopkins University Press, 1989.
[20] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 2nd edition, 1996.
[21] J.-W. Hong and H. T. Kung. I/O complexity: the red-blue pebbling game. In Proceedings of the 13th Annual ACM Symposium on Theory of Computing (STOC), pages 326–333, Milwaukee, 1981.
[22] M. S. Lam, E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 63–74, Santa Clara, CA, Apr. 1991. ACM SIGPLAN Notices 26:4.
[23] A. LaMarca and R. E. Ladner. The influence of caches on the performance of sorting. In Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 370–377, 1997.
[24] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
[25] M. H. Nodine and J. S. Vitter. Deterministic distribution sort in shared and distributed memory multiprocessors. In Proceedings of the Fifth ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 120–129, Velen, Germany, 1993.
[26] H. Prokop. Cache-oblivious algorithms. Master's thesis, Massachusetts Institute of Technology, June 1999.
[27] J. E. Savage. Extending the Hong-Kung model to memory hierarchies. In D.-Z. Du and M. Li, editors, Computing and Combinatorics, volume 959 of Lecture Notes in Computer Science, pages 270–281. Springer Verlag, 1995.
[28] S. Sen and S. Chatterjee. Towards a theory of cache-efficient algorithms. Unpublished manuscript, 1999.
[29] R. C. Singleton. An algorithm for computing the mixed radix fast Fourier transform. IEEE Transactions on Audio and Electroacoustics, AU-17(2):93–103, June 1969.
[30] D. D. Sleator and R. E. Tarjan. Amortized efficiency of list update and paging rules. Communications of the ACM, 28(2):202–208, Feb. 1985.
[31] V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13:354–356, 1969.
[32] S. Toledo. Locality of reference in LU decomposition with partial pivoting. SIAM Journal on Matrix Analysis and Applications, 18(4):1065–1081, Oct. 1997.
[33] J. S. Vitter. External memory algorithms and data structures. In J. Abello and J. S. Vitter, editors, External Memory Algorithms and Visualization, DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society Press, Providence, RI, 1999.
[34] J. S. Vitter and M. H. Nodine. Large-scale sorting in uniform memory hierarchies. Journal of Parallel and Distributed Computing, 17(1–2):107–114, January and February 1993.
[35] J. S. Vitter and E. A. M. Shriver. Algorithms for parallel memory I: Two-level memories. Algorithmica, 12(2/3):110–147, August and September 1994.
[36] J. S. Vitter and E. A. M. Shriver. Algorithms for parallel memory II: Hierarchical multilevel memories. Algorithmica, 12(2/3):148–169, August and September 1994.
[37] S. Winograd. On the algebraic complexity of functions. Actes du Congrès International des Mathématiciens, 3:283–288, 1970.
