Arrays in Blitz++

Todd L. Veldhuizen
Indiana University Computer Science Department

Abstract. Numeric arrays in Blitz++ rival the efficiency of Fortran,
but without any extensions to the C++ language. Blitz++ has features
unavailable in Fortran 90/95, such as arbitrary transpose operations,
array renaming, tensor notation, partial reductions, multicomponent ar-
rays and stencil operators. The library handles parsing and analysis of
array expressions on its own using the expression templates technique,
and performs optimizations (such as loop transformations) which have
until now been the responsibility of compilers.

1 Introduction
The goal of the Blitz++ library is to provide a solid "base environment" of
arrays, matrices and vectors for scientific computing in C++. This paper focuses
on arrays in Blitz++, which provide performance competitive with Fortran and
superior functionality. The design of Blitz++ has been influenced by Fortran
90, High-Performance Fortran, the Math.h++ library [3], A++/P++ [4], and
POOMA [5]. It incorporates various features from these environments, and adds
many of its own. This paper concentrates on the unique features of Blitz++
arrays.

2 Overview
Multidimensional arrays in Blitz++ are provided by the class template
Array<T,N>. The template parameter T is the numeric type stored in the array, and N is
its rank (dimensionality). This class supports a variety of array models:
- Arrays of scalar types, such as Array<int,2> and Array<float,3>
- Complex arrays, such as Array<complex<float>,2>
- Arrays of user-defined types. For example, if Polynomial is a class defined
by the user (or another library), Array<Polynomial,2> is a two-dimensional
array of Polynomial objects.
- Nested homogeneous arrays using the Blitz++ classes TinyVector and
TinyMatrix. For example, Array<TinyVector<float,3>,3> is a three-dimensional
vector field.
- Nested heterogeneous arrays, such as Array<Array<int,1>,1>, in which
each element is an array of variable length.
2.1 Storage layout and reference counting
Array objects are lightweight views of a separately allocated data block. This
design permits a single block of data to be represented by several array views
[3]. Each array object contains a descriptor (also called a dope vector) which
specifies the memory layout. The descriptor contains a pointer to the array data,
lower bounds for the indices, a shape vector, a stride vector, reversal flags, and
a storage ordering vector. This last is a permutation of the dimension numbers
$[1, 2, \ldots, N]$ which indicates the order in which dimensions are stored in memory.
Fortran-style column-major arrays correspond to $[1, 2, \ldots, N]$, and C-style
row-major arrays correspond to $[N, N-1, \ldots, 1]$. Reversal flags indicate whether
each dimension is stored in ascending or descending order.
The storage ordering vector and reversal flags allow arrays to be stored in
any one of $N! \, 2^N$ orderings. Only two of these, C-style and Fortran-style arrays,
are frequently used. There are occasional uses for other orderings: some image
formats store rows from bottom to top, which can be handled transparently by
a reversal flag.
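
To make the role of the storage ordering vector concrete, here is a minimal sketch in standard C++ of how strides could be derived from a shape and an ordering. It does not use Blitz++; the names Descriptor, makeDescriptor and linearOffset are hypothetical, and the sketch assumes zero lower bounds, zero-based dimension numbers and no reversal flags.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical descriptor for a rank-N array (illustration only, not the Blitz++ layout).
template <std::size_t N>
struct Descriptor {
    std::array<std::size_t, N> shape;   // extent of each dimension
    std::array<std::size_t, N> stride;  // element step for each dimension
};

// ordering[0] is the dimension that varies fastest in memory,
// ordering[N-1] the one that varies slowest (zero-based dimension numbers).
template <std::size_t N>
Descriptor<N> makeDescriptor(const std::array<std::size_t, N>& shape,
                             const std::array<std::size_t, N>& ordering) {
    Descriptor<N> d{};
    d.shape = shape;
    std::size_t step = 1;
    for (std::size_t n = 0; n < N; ++n) {
        const std::size_t dim = ordering[n];
        assert(dim < N);          // ordering must be a permutation of 0..N-1
        d.stride[dim] = step;
        step *= shape[dim];
    }
    return d;
}

// Linear offset of the element with index idx inside the data block.
template <std::size_t N>
std::size_t linearOffset(const Descriptor<N>& d,
                         const std::array<std::size_t, N>& idx) {
    std::size_t off = 0;
    for (std::size_t n = 0; n < N; ++n) off += idx[n] * d.stride[n];
    return off;
}
```

With a shape of (2,3,4), the ordering {0,1,2} gives the Fortran-style strides (1,2,6), while {2,1,0} gives the C-style strides (12,4,1).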
Arrays are reference-counted: the number of arrays referencing a data block is
monitored, and when no arrays refer to a data block it is deallocated. Reference
counting provides the benefits of garbage collection, and allows functions to
return array objects efficiently:
Array<float,2> someUserFunction(Array<float,2>&);

Reference counting and flexible storage formats support useful O(1) array
operations:
- Arbitrary transpose operations: The dimensions of an array can be permuted
using the transpose(...) member function. This code makes B a shared
view of A, but with the first and second dimensions swapped:
Array<float,3> A(3,3,3); // A 3x3x3 array
Array<float,3> B = A.transpose(secondDim,firstDim,thirdDim);

The integer constants firstDim, secondDim, ... are intended to improve
readability, and to avoid confusion over whether the first dimension is 1 (as in
Fortran) or 0 (as in C).
- Dimension reversals: Each dimension can be independently reversed. If A
contains a two-dimensional colour image, then
Array<RGB24,2> B = A.reverse(firstDim);

flips the image vertically.
- Array relabelling: Since array objects are really lightweight handles, arrays
can be swapped and relabelled in constant time. This is very useful in
time-stepping PDEs. If A1, A2 and A3 represent a field at three consecutive
timesteps, cycleArrays(A1,A2,A3) relabels the arrays for the next time
step: [A1,A2,A3] ← [A2,A3,A1]. This avoids costly copying of the array data.
- Array interlacing: Blitz++ allows arrays of the same shape to be interlaced
in memory. Such an arrangement improves data locality, which can increase
performance in some situations.
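
The transpose and reversal operations listed above cost O(1) because only the descriptor changes. The following is a minimal sketch in standard C++ under simplifying assumptions (rank 2, float elements, std::shared_ptr standing in for the reference count); View2D, makeView, transposed and reversed are hypothetical names, not Blitz++ API.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical 2-D view onto a shared, reference-counted data block.
struct View2D {
    std::shared_ptr<std::vector<float>> block;  // freed when the last view goes away
    std::ptrdiff_t base;                        // offset of element (0,0)
    std::ptrdiff_t extent[2];
    std::ptrdiff_t stride[2];

    float& operator()(std::ptrdiff_t i, std::ptrdiff_t j) const {
        return (*block)[static_cast<std::size_t>(base + i * stride[0] + j * stride[1])];
    }
};

// Row-major view of a freshly allocated rows x cols block.
View2D makeView(std::ptrdiff_t rows, std::ptrdiff_t cols) {
    auto block = std::make_shared<std::vector<float>>(
        static_cast<std::size_t>(rows * cols));
    return View2D{block, 0, {rows, cols}, {cols, 1}};
}

// O(1): swap the two dimensions; no element is copied.
View2D transposed(View2D v) {
    std::swap(v.extent[0], v.extent[1]);
    std::swap(v.stride[0], v.stride[1]);
    return v;
}

// O(1): reverse dimension dim by starting at its last element
// and negating its stride.
View2D reversed(View2D v, int dim) {
    assert(dim == 0 || dim == 1);
    v.base += (v.extent[dim] - 1) * v.stride[dim];
    v.stride[dim] = -v.stride[dim];
    return v;
}
```

Both functions return a view that shares the original data block, so a write through the transposed view is visible through the original one.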
2.2 Subarrays and slicing
Subarrays in Blitz++ are fully functional Array objects which have a shared
view of the array data. Subarrays can either be full rank or lesser rank. Blitz++
supplies Range objects which emulate the Fortran 90 range syntax. Any combi-
nation of Range and integer values can be used to obtain a subarray:
Array<float,3> A(64,64,64); // A 64x64x64 array

// C refers to the 2D slice A(10..63, 15, 0..63)
Array<float,2> C = A(Range(10,toEnd), 15, Range::all());
Array<float,1> D = A(Range(fromStart,30), 15, 20); // A(0..30,15,20)

The use of fromStart and toEnd follows [3]. An optional third parameter to the
Range constructor specifies a stride, so subarrays do not have to be contiguous.

3 Array Expressions
Array expressions in Blitz++ are implemented using the expression templates
technique [6]. Prior to expression templates, use of overloaded operators meant
generating temporary arrays, which caused huge performance losses. In Blitz++,
temporary arrays are never created. Since its original development, the expres-
sion templates technique has grown substantially more complex and powerful [1,
2]. Its present incarnation in Blitz++ supports a wide variety of useful notations
and optimizations. The next sections give an overview of the main features of
the Blitz++ expression templates implementation from a user perspective.
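
For readers unfamiliar with the technique, the following is a deliberately small sketch of expression templates in standard C++. It is not the Blitz++ implementation: Vec and Sum are hypothetical names, only operator+ on one-dimensional arrays of double is handled, and all operands are assumed to have the same length.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Node representing "l + r"; it stores references and computes elements on demand.
template <class L, class R>
struct Sum {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

// Hypothetical 1-D array class (illustration only).
struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i) { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning an expression runs one fused loop, with no temporary Vec.
    template <class L, class R>
    Vec& operator=(const Sum<L, R>& e) {
        assert(e.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
        return *this;
    }
};

// operator+ builds a parse-tree node instead of computing a result.
inline Sum<Vec, Vec> operator+(const Vec& a, const Vec& b) { return {a, b}; }

template <class L, class R>
Sum<Sum<L, R>, Vec> operator+(const Sum<L, R>& a, const Vec& b) { return {a, b}; }
```

A statement such as a = b + c + d then has the type Sum<Sum<Vec,Vec>,Vec> on its right-hand side, and the assignment evaluates b[i] + c[i] + d[i] in a single pass. The nodes hold references, so an expression object should not outlive the statement that creates it.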

3.1 Operators
Any operator which is meaningful for the array elements can be applied to arrays.
For example:
Array<float,2> A, B, C, D; // ...
A = B + (C * D);

Array<int,1> E, F, G, H; // ...
E |= (F & G) >> H;

Operators are always applied in an elementwise manner. Users can create arrays
of their own classes, and use whichever overloaded operators they have provided:
class Polynomial {
// define operators + and *
};

Array<Polynomial,2> A, B, C, D; // ...
A = B + (C*D); // results in appropriate calls
// to Polynomial operators
Math functions provided by the standard C++, IEEE and System V math
libraries may be used on arrays, for example sin(A) and lgamma(B).
Arrays with different storage formats can appear in the same expression; for
example, a user can add a C-style array to a Fortran array. Blitz++ transparently
corrects for the storage formats. Blitz++ allows arrays of different numeric types
to be mixed in an expression. Type promotion follows the standard C rules, with
some modifications to handle complex numbers and user-defined types.
Blitz++ supplies a set of index placeholder objects which allow array indices
to be used in expressions. This code creates a Hilbert matrix:
Array<float,2> A(4,4);

// i and j are index placeholders
firstIndex i;
secondIndex j;

A = 1.0 / (1+i+j); // Sets A(i,j) = 1.0 / (1+i+j) for all (i,j)

3.2 Tensor notation


Blitz++ provides a notation modelled on tensors. Here is an example of
mathematical tensor notation:
$$C^{ijk} = A^{ij} x^k - A^{jk} y^i$$
In Blitz++, this equation can be coded as:
using namespace blitz::tensor;

C = A(i,j) * x(k) - A(j,k) * y(i);

The tensor indices i,j,k,... are special objects concealed in the namespace
blitz::tensor. Users are free to declare their own tensor indices with different
names if they prefer. Tensor indices specify how arrays are oriented in the domain
of the array receiving the expression (Fig. 1). Any missing tensor indices are
interpreted as spread operations; for example, the A(i,j) term in the above
example is spread over the k index.

Fig. 1. Illustration of Blitz++ tensor notation: the indices specify how arrays are
oriented in the domain of the array receiving the result.
Unlike real tensor notation, repeated indices do not imply contraction. For
example, the tensor expression $C^{ij} = A^{ik} B^{kj}$ implies a summation over k. In
Blitz++, contractions must be written explicitly using a partial reduction
(described later):
Array<float,2> A, B, C; // ...
C(i,j) = sum(A(i,k) * B(k,j), k);
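
Written out by hand, the explicit contraction over k corresponds to the loops below. This is a sketch in standard C++ for comparison only; the Matrix alias and the function name contract are hypothetical, and nested std::vector is used merely to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// C(i,j) = sum over k of A(i,k) * B(k,j), written as explicit loops.
// A is m x p and B is p x n; illustration only, not Blitz++ code.
Matrix contract(const Matrix& A, const Matrix& B) {
    const std::size_t m = A.size();
    const std::size_t p = B.size();
    const std::size_t n = B.empty() ? 0 : B[0].size();
    assert(A.empty() || A[0].size() == p);   // inner extents must agree
    Matrix C(m, std::vector<float>(n, 0.0f));
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < p; ++k)   // k is the contracted index
                C[i][j] += A[i][k] * B[k][j];
    return C;
}
```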

3.3 Stencil objects and operators


Blitz++ provides a stencil object mechanism which removes much of the drudgery
from writing finite difference equations. One of the Blitz++ example programs is
a three-dimensional computational fluid dynamics simulation. In each iteration,
the velocity field is time-stepped according to the equation
$$\mathbf{V} \leftarrow \mathbf{V} + \Delta t \left[ \rho^{-1} \left( \eta \nabla^2 \mathbf{V} - \nabla P + \mathbf{F} \right) - \mathbf{A} \right]$$
where V, P, F and A are velocity, pressure, force, and advection. Implementing
this equation using 4th-order accurate finite differencing in Fortran requires a set
of mammoth equations with approximately 70 terms. Using a Blitz++ stencil
object, the equation is written as:
nextV = *V + delta_t * (recip_rho * (eta * Laplacian3DVec4(V, geom)
- grad3D4(P, geom) + *force) - *advect);

The vector fields V, force and advect are implemented as arrays of 3-vectors.
This eliminates the need to represent each vector field as three separate arrays,
common in Fortran implementations. The stencil operators Laplacian3DVec4
and grad3D4 are provided by Blitz++, and implement 4th-order Laplacian and
gradient operators. The Laplacian3DVec4 operator expands into a 45-point
stencil. Blitz++ supplies stencil operators for forward, central and backward
differences of various orders and accuracies; built on top of these are divergence,
gradient, curl, mixed partial, and Laplacian operators.
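
As a point of reference for what a 4th-order stencil computes, the sketch below applies the standard five-point central difference for a second derivative in one dimension. It is written in plain C++ and is not taken from Blitz++; the function name secondDerivative4 is hypothetical, and how Laplacian3DVec4 is actually assembled is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fourth-order central difference approximation of the second derivative
// of the samples f at interior point i, with uniform grid spacing h.
// Requires two neighbours on each side of i.
double secondDerivative4(const std::vector<double>& f, std::size_t i, double h) {
    assert(i >= 2 && i + 2 < f.size());
    return (-f[i - 2] + 16.0 * f[i - 1] - 30.0 * f[i]
            + 16.0 * f[i + 1] - f[i + 2]) / (12.0 * h * h);
}
```

Summing such a term along each coordinate axis is one way to obtain a 4th-order Laplacian on a uniform grid.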
Blitz++ provides special support for vector fields (and in general,
multicomponent/multispectral arrays). The [] operator is overloaded for easy access
to individual components of a multicomponent array. For example, this code
initializes the force field with gravity:
const int x = 0, y = 1, z = 2;
force[x] = 0.0;
force[y] = 0.0;
force[z] = gravity;

3.4 Reductions
Reductions in Blitz++ transform an N-dimensional array (or array expression)
to a scalar value:
Array<int,2> A(4,4); // ...
int result1 = sum(A); // sum all elements
int result2 = count(A == 0); // count zero elements

Available reductions are sum, product, min, max, count, minIndex, maxIndex,
any and all. Partial reductions transform an N-dimensional array (or array
expression) to an N-1 dimensional array expression. The reduction is performed
along a single rank:
Array<int,2> A(2,4);
Array<int,1> B(2);

A = 0, 1, 1, 5,
3, 0, 0, 0;

B = sum(A, j); // Reduce along rows: B = [ 7 3 ]
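
The partial reduction above is equivalent to the explicit loops in the following sketch, written in standard C++ with std::array standing in for the Blitz++ arrays (the function name sumOverJ is hypothetical).

```cpp
#include <array>
#include <cstddef>

// Sum a Rows x Cols array along its second index j, leaving one value per row.
template <std::size_t Rows, std::size_t Cols>
std::array<int, Rows> sumOverJ(const std::array<std::array<int, Cols>, Rows>& a) {
    std::array<int, Rows> b{};               // zero-initialized accumulators
    for (std::size_t i = 0; i < Rows; ++i)
        for (std::size_t j = 0; j < Cols; ++j)
            b[i] += a[i][j];
    return b;
}
```

Applied to the 2x4 array from the example, this yields the two row sums 7 and 3.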

Reductions can be chained: for example, this code finds the row with the
minimum sum of squares:
Array<float,2> A(N,N); // ...
int minRow = minIndex(sum(pow2(A),j));

4 Optimizations
The expression templates technique allows Blitz++ to parse array expressions
and generate customized evaluation kernels at compile time. To achieve good
performance, Blitz++ performs many loop transformations which have tradi-
tionally been the responsibility of optimizing compilers:
- Loop interchange and reversal: Consider this bit of code, which is a naive
implementation of the array operation A = B + C:
for (int i=0; i < N1; ++i)
  for (int j=0; j < N2; ++j)
    for (int k=0; k < N3; ++k)
      A(i,j,k) = B(i,j,k) + C(i,j,k);

The layout of these arrays in memory is unknown at compile time. If the
arrays are stored in column-major order, this code will be very inefficient
because of poor data locality. For large arrays, an entire cache line would
have to be loaded for each element access. To avoid this problem, Blitz++
selects a traversal order at run-time such that the arrays are traversed in
memory-storage order.
- Hoisting stride calculations: The inner loop of the above code fragment would
expand to contain many stride calculations. Blitz++ generates code which
hoists the invariant portion of the stride arithmetic out of the innermost
loop.
- Collapsing inner loops: Suppose that in the above code fragment, N3 is quite
small. Loop overhead and pipeline effects will conspire to cause poor
performance. The solution is to convert the three nested loops into a single loop.
At runtime, Blitz++ collapses the inner loops whenever possible.
- Partial unrolling: Many compilers partially unroll inner loops to expose
low-level parallelism. For compilers that won't, Blitz++ does this unrolling itself.
- Common stride optimizations: Blitz++ tests at run-time to see if all the
arrays in the expression have a unit or common stride. If so, faster evaluation
kernels are used.
- Tiling: Blitz++ detects the presence of stencils, and does tiling to ensure
good cache use.
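
Two of the transformations listed above, the run-time stride test with loop collapsing and the hoisting of stride arithmetic, can be illustrated by hand for A = B + C. The sketch below is plain C++ under simplifying assumptions (rank 3, double elements, non-negative strides, identical shapes); View3, isPackedRowMajor and addArrays are hypothetical names, and the kernels Blitz++ generates are not claimed to look like this.

```cpp
#include <cstddef>

// Hypothetical view of a rank-3 array of doubles (illustration only).
struct View3 {
    double* data;
    std::size_t extent[3];
    std::size_t stride[3];
};

// True if v is packed in row-major order (unit stride, no gaps).
bool isPackedRowMajor(const View3& v) {
    return v.stride[2] == 1 &&
           v.stride[1] == v.extent[2] &&
           v.stride[0] == v.extent[1] * v.extent[2];
}

// A = B + C for views of identical shape.
// Returns true if the collapsed single-loop kernel was used.
bool addArrays(const View3& A, const View3& B, const View3& C) {
    const std::size_t n1 = A.extent[0], n2 = A.extent[1], n3 = A.extent[2];
    if (isPackedRowMajor(A) && isPackedRowMajor(B) && isPackedRowMajor(C)) {
        // Common unit stride: collapse the three loops into one.
        const std::size_t n = n1 * n2 * n3;
        for (std::size_t p = 0; p < n; ++p) A.data[p] = B.data[p] + C.data[p];
        return true;
    }
    // General case: hoist the (i,j) part of the offsets out of the inner loop.
    for (std::size_t i = 0; i < n1; ++i)
        for (std::size_t j = 0; j < n2; ++j) {
            double* a = A.data + i * A.stride[0] + j * A.stride[1];
            const double* b = B.data + i * B.stride[0] + j * B.stride[1];
            const double* c = C.data + i * C.stride[0] + j * C.stride[1];
            for (std::size_t k = 0; k < n3; ++k)
                a[k * A.stride[2]] = b[k * B.stride[2]] + c[k * C.stride[2]];
        }
    return false;
}
```

The test on the strides happens once per expression, so its cost is negligible next to the element loop it selects.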

4.1 Benchmark results


Fig. 2 shows performance of the Blitz++ classes Array and Vector for a DAXPY
operation on the Cray T3E-900 (single PE) using KAI C++. The Blitz++ classes
achieve the same performance as Fortran 90 (footnote 1). Without expression
templates, performance is typically 30% of that of Fortran. The native BLAS
library is able to outperform both Fortran 90 and Blitz++ (footnote 2).

[Plot: Mflops/s (0 to 400) against array length (10^0 to 10^5) for Vector<T>,
Array<T,1>, Native BLAS and Fortran 90.]

Fig. 2. DAXPY Benchmark on the Cray T3E (single PE)

Table 1 shows performance of Blitz++ arrays on 21 loop kernels used by IBM
for benchmarking the RS/6000. Performance is reported as a percentage of Fortran
performance: > 100 is faster, and < 100 is slower. The fastest native Fortran
compiler was used, with typical optimization switches (-O3, -Ofast). The loop
kernels and makefiles are available as part of the Blitz++ distribution.

Footnote 1: Fortran 77 is no longer supported on the T3E, and is actually slower.
The flags used for the f90 compiler were -O3,aggress,unroll2,pipeline3.
Footnote 2: Although not yet implemented, it is possible to do pattern matching to
native BLAS using expression templates, an idea due to Roldan Pozo.

Table 1. Performance of Blitz++ on 21 loop kernels, relative to Fortran

Platform/             Out of cache       In-cache (peak)
Compiler             Median    Mean      Median    Mean
Cray T3E/KCC          95.7%   86.4%       98.1%   88.4%
HPC-160/KCC          100.2%   97.5%       95.1%   93.4%
Origin 2000/KCC       88.1%   87.3%       79.8%   78.6%
Pentium II/egcs       98.4%   98.5%       79.6%   82.6%
RS 6000/KCC           93.5%   90.7%       97.3%   93.2%
UltraSPARC/KCC        91.1%   86.8%       79.0%   78.3%

Acknowledgments. This work was supported in part by NSERC (Canada), and
by the Director, Office of Computational and Technology Research, Division of
Mathematical, Information, and Computational Sciences of the U.S. Department
of Energy under contract number DE-AC03-76SF00098. This research used
resources of NERSC and the Advanced Computing Laboratory (LANL) which are
supported by the Office of Energy Research of the U.S. Department of Energy,
and of ZAM (Research Centre Jülich, Germany).

References
1. Geoffrey Furnish. Disambiguated glommable expression templates. Computers in
Physics, 11(3):263-269, May/June 1997.
2. Scott W. Haney. Beating the abstraction penalty in C++ using expression
templates. Computers in Physics, 10(6):552-557, Nov/Dec 1996.
3. Thomas Keffer and Allan Vermeulen. Math.h++ Introduction and Reference
Manual. Rogue Wave Software, Corvallis, Oregon, 1989.
4. Rebecca Parsons and Daniel Quinlan. A++/P++ array classes for architecture
independent finite difference computations. In Proceedings of the Second Annual
Object-Oriented Numerics Conference (OON-SKI'94), pages 408-418, April 24-27,
1994.
5. John V. W. Reynders, Paul J. Hinker, Julian C. Cummings, Susan R. Atlas,
Subhankar Banerjee, William F. Humphrey, Steve R. Karmesin, Katarzyna Keahey,
M. Srikant, and MaryDell Tholburn. POOMA. In Gregory V. Wilson and Paul Lu,
editors, Parallel Programming Using C++. MIT Press, 1996.
6. Todd L. Veldhuizen. Expression templates. C++ Report, 7(5):26-31, June 1995.
Reprinted in C++ Gems, ed. Stanley Lippman.
7. Todd L. Veldhuizen. The Blitz++ User Guide. 1998.
http://seurat.uwaterloo.ca/blitz/.
