
This article was downloaded by: [Deakin University Library]

On: 14 July 2013, At: 14:47


Publisher: Taylor & Francis
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered
office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Optimization Methods and Software


Publication details, including instructions for authors and
subscription information:
http://www.tandfonline.com/loi/goms20

Fast higher-order derivative tensors with Rapsodia

I. Charpentier^a & J. Utke^b
^a Laboratoire de Physique et Mécanique des Matériaux, UMR CNRS 7554, Ile du Saulcy, Metz Cedex 1, France
^b Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
Published online: 04 Mar 2011.

To cite this article: I. Charpentier & J. Utke (2009) Fast higher-order derivative tensors with
Rapsodia, Optimization Methods and Software, 24:1, 1-14, DOI: 10.1080/10556780802413769

To link to this article: http://dx.doi.org/10.1080/10556780802413769

Optimization Methods & Software
Vol. 24, No. 1, February 2009, 1–14

Fast higher-order derivative tensors with Rapsodia


I. Charpentier^a and J. Utke^b*
^a Laboratoire de Physique et Mécanique des Matériaux, UMR CNRS 7554, Ile du Saulcy, Metz Cedex 1,
France; ^b Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA

(Received 6 June 2007; final version received 17 June 2008)

A number of practical problems in physics can be solved by using accurate higher-order derivatives.
Such derivatives can be obtained with automatic differentiation. However, one has to be concerned with
the complexity of computing higher-order derivative tensors even for a modest order and number of
independents. Initial experiments using univariate Taylor polynomials with interpolation and operator
overloading with unrolled loops showed better runtimes than using other automatic differentiation tools.
Motivated by these results, we developed the Rapsodia code generator that produces Fortran and C++
libraries for the most common intrinsics. Here we explain the algorithmic approach and its implementation and
present test results for a select set of applications. Further details on the Rapsodia tool and an example of
user extensions are given in the Appendix.

Keywords: higher-order derivatives; automatic differentiation; code generator

AMS Subject Classification: 46G05; 68N19; 65-04; 65Z05

1. Introduction

A number of practical problems in physics can be solved by using accurate higher-order derivatives.
One example is the double ionization of a molecule modelled using Brauner’s method [3].
In brief, Brauner’s method is based on the computation of a term $D$, related to the complex wave
function $\psi_i(\mathbf{r}_1, \mathbf{r}_2) = e^{-ar_1} e^{-br_2} e^{-cr_{12}}$, arising from a third-order derivative of Brauner’s term
$e^{-ar_1} e^{-br_2} e^{-cr_{12}}/(r_1 r_2 r_{12})$. The terms $\mathbf{r}_i$ ($i = 1, 2$) and $\mathbf{r}_{12}$ denote vectors and the corresponding
terms $r_i$ and $r_{12}$ are their respective norms. The Brauner term was introduced in $\psi_i$ to enable
the writing of terms $r_1^{\lambda-\mu} r_2^{\mu-\nu} r_{12}^{\nu} e^{-ar_1} e^{-br_2} e^{-cr_{12}}$, appearing when using Brauner’s method, as
derivatives (Equation (1)) of $D$ when $\lambda - \mu \ge 0$, $\mu - \nu \ge 0$, and $\nu \ge 0$:

$$r_1^{\lambda-\mu} r_2^{\mu-\nu} r_{12}^{\nu} e^{-ar_1} e^{-br_2} e^{-cr_{12}} = \frac{\partial^{\lambda-\mu}}{\partial a^{\lambda-\mu}} \left( \frac{\partial^{\mu-\nu}}{\partial b^{\mu-\nu}} \left( \frac{\partial^{\nu} D}{\partial c^{\nu}} \right) \right). \tag{1}$$

We give more details on the applications in Section 4.1. Wave functions involving six parameters
were used [3,15], but as noticed in [16] they are not accurate enough. More accurate wave
functions, for instance, the 18-parameter Hylleraas-type of [17], require up to ninth-order mixed
derivatives and are obviously out of reach of hand-coded differentiation.

*Corresponding author. Email: utke@mes.anl.gov

ISSN 1055-6788 print/ISSN 1029-4937 online


© 2009 Taylor & Francis
DOI: 10.1080/10556780802413769
http://www.informaworld.com

The basic principles of automatic differentiation (AD) have been known for several decades
[22], but only during the past 15 years have the tools implementing AD found significant use in
optimization, data assimilation, and other applications in need of efficient and accurate derivative
information. As a consequence of the wider use of AD, various tools have been developed
that address specific application requirements or programming languages. The AD community’s
website www.autodiff.org provides a comprehensive overview of the available tools. Only a small
subset of these tools is capable of computing higher-order derivatives. Aside from a number of
unfinished or no longer maintained projects, the practically relevant choices are AD02 for Fortran
[19] and Adol-C for C and C++ [10]. The initial impetus for developing Rapsodia [21] came
from the practical need to compute higher-order derivatives of the complex function $\psi_i$. Like
AD02 and Adol-C, Rapsodia relies on operator overloading as the vehicle for attaching derivative
computations to the elementary operations φ provided by the programming language, such as the
arithmetic operators and intrinsic functions $\sin$, $e^x$, and so forth. The commonly taken approach
for derivative tensors of order three and above is the forward propagation of Taylor polynomials
up to order $o$ in $d$ directions with coefficients $a_j^i$, $i = 1, \ldots, d$, $j = 1, \ldots, o$ around a common
point $a_0 \equiv a_0^i$. This is done on the level of the elementary operations. For each $r = \phi(a, b, \ldots)$
with result $r$ and arguments $a, b, \ldots$¹ we compute the result’s Taylor coefficients $r_j^i$ based on
the Taylor coefficients of the arguments $a_j^i, b_j^i, \ldots$. For instance, for the addition $r = a + b$
the formula is simply $r_j^i = a_j^i + b_j^i$, and for the multiplication $r = a \cdot b$, it is the convolution
$r_j^i = \sum_{l=0}^{j} a_l^i \cdot b_{j-l}^i$. The recurrence formulae for the common elementary operations are given in
[9]. The invocation of the overloaded φ is tied to the use of program variables of a specific type.
A program implementing a numerical model $y = f(x) : \mathbb{R}^n \to \mathbb{R}^m$ is viewed as a sequence of
elementary operations. As all AD tools do, we perform the computation of the derivative of $f$ by
applying the chain rule to the sequence of φ comprising $f$. For a given point $x_0$ in the domain and a
given set of $d$ directions $x_1^i = s^i \in \mathbb{R}^n$, $x_j^i = 0|_{j=2,\ldots,o}$ we obtain coefficients $y_j^i$ of the univariate
Taylor series $f(x_0 + h s^i)$. Rather than repeating the principles of AD in this paper we direct the
interested reader to [9] for a detailed introduction. The output variables $y$, input variables $x$, and
all intermediate program variables on any dependency path from the $x$ to the $y$ are called active
variables and need to have their type changed to the specific active type(s), which then triggers
the execution of the overloaded φ. All other variables are called passive. For overloading-based
tools, in practice, often all floating-point variable type declarations are changed to the active type. To
compute the actual derivative tensors, we employ the approach described in [11], which is based
on the propagation of univariate Taylor polynomials in a predefined number of directions and
subsequent interpolation of the tensor elements.
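The addition and convolution formulae above can be illustrated for a single direction with the following minimal Python sketch. The function names `taylor_add` and `taylor_mul` are illustrative assumptions and are not part of Rapsodia; each list holds the value at the common point at index 0, followed by the coefficients of orders 1 to o.

```python
def taylor_add(a, b):
    """Coefficients of r = a + b: r_j = a_j + b_j."""
    return [aj + bj for aj, bj in zip(a, b)]


def taylor_mul(a, b):
    """Coefficients of r = a * b: r_j = sum_{l=0}^{j} a_l * b_{j-l}."""
    order = len(a) - 1
    return [sum(a[l] * b[j - l] for l in range(j + 1)) for j in range(order + 1)]


# Hypothetical example: a(h) = 1 + h and b(h) = 2 + 3h, truncated at order o = 2.
a = [1.0, 1.0, 0.0]
b = [2.0, 3.0, 0.0]
r_sum = taylor_add(a, b)   # coefficients of 3 + 4h
r_prod = taylor_mul(a, b)  # coefficients of 2 + 5h + 3h^2
```

A general-purpose implementation would typically loop over the d directions around these formulae; the sketch shows one direction only.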
Section 2 introduces the features and design of the Rapsodia generator, and Section 3 illustrates the
tool’s use. Section 4 discusses the ionization application in detail and gives performance results
using numerical models for ocean acoustics and a volcano eruption plume. Section 5 provides
a summary and an outlook. Details on the tool implementation and possible user extensions are
given in the Appendix.

2. The Rapsodia generator

Because this is a joint French-American project where one author is actually German, we thought it
proper to use the Spanish/Italian word Rapsodia as an acronym for rapide surcharge d’opérateur
pour la différentiation automatique. The generator and the library are open source and can
be downloaded under the LGPL terms at [21]. Section 1 already explained the central role of
elementary operations φ in the computation of Taylor coefficients. The main idea of Rapsodia is
to combine operator overloading with code generation. The generator creates a library consisting
of active types and operators overloaded for these active types for a given number n of input
variables and a given derivative order o. Because n and o are fixed, the generator can create
specialized code that yields a performance advantage. Not surprisingly, very few differences
exist between general-purpose languages such as Fortran and C/C++ in the representation of the
elementary operations that have to be overloaded. Therefore, the idea of using a single generator to
produce both a Fortran and a C++ library is plausible. A number of differences mostly unrelated
to the core purpose of the library are addressed in the Appendix.

2.1 Design

We selected Python [20] to implement the generator, in part because we were able to use a code
generator in PETSc [18] as a starting point, but mainly because it is readily available in most
computing environments and the Python programming model appears to be a good fit for the size
of this project. The source code of the Rapsodia generator consists of three major parts.
(i) Class definitions for the elements of an abstract syntax tree (AST) (see [8]).
(ii) Methods to generate the Taylor coefficient propagation functionalities as an AST (see the
code in generate.py and Common/genOp*.py).
(iii) Methods to print this AST as source code (see [8]).
We want to highlight the three main features of the Rapsodia code generation.
1. Hand-written code and general-purpose AD tools will typically use loops and arrays to implement
the propagation logic of the Taylor coefficients given as formulae, for instance, in [9]. Our
code generator creates an overloading library in which all loops are unrolled for a fixed order
and fixed number of input variables and in which the active type is represented as a flat structure,
i.e. without the use of arrays. Section 4 presents evidence for the improved performance of the
resulting code. These improvements can be attributed to widening the scope of compiler
optimizations that are normally limited by loops and other control flow constructs, and to
reducing the conservative aliasing overestimate because of the lack of array accesses. In this
paper, we do not attempt to show where exactly the improvement originates. This would be a
subject of low-level compiler optimization and would be hardware-dependent. Instead, we recognize
and exploit the fact that the underlying semantics of the Taylor coefficient formulae can be
rewritten as code-generating logic. This logic has loops and the notion of indices that are similar
to the hand-written Taylor propagation code. However, instead of computing values, the
code generator creates expressions and concatenates indices to variable names while iterating
through these loops. Thus, it creates the exact same semantics but at a syntactic level that
is considerably lower than typical hand-written code and is more amenable to compiler-level
optimization. We are aware of the plethora of different ways to express the propagation semantics
as generated source code. One might consider a range, perhaps beginning from simple
variations of AD02-like source with loops that have compile-time constant loop bounds to
source code that uses additional local variables to aid register allocation. While the former
falls short of the code generator’s potential, the latter would already be somewhat hardware- and
compiler-specific. We believe our approach constitutes a plausible compromise.
2. The generator creates all relevant argument combinations for elementary operations, assignments,
and copy constructors based on a list of precisions and types. Just like floating-point
variables that have not been redeclared with the active type, integers are considered passive.
Multivariate φ have distinct overloaded versions for occurrences of passive arguments of different
types. By default we distinguish the floating-point type and precision² combinations listed
in Table 1. Each combination implies a distinct active type and distinct overloaded versions
when paired with passive arguments and arguments of other active types. The result type is
determined based on the argument types, emulating in a limited fashion the target language’s
built-in typing rules. In particular, for assignments (and copy constructors) we do not generate
versions that would inadvertently permit a loss of information or precision, such as assigning
an active variable to a floating-point variable or an active double variable to an active
float variable. Generating all the combinations prevents implicit conversions, in particular
in C++ via copy constructors from passive to active arguments for elementary operations,
and therefore avoids extraneous computations for variables that are not active.

Table 1. Floating-point type combinations generated by default in Rapsodia for Fortran and C++.

Language   Combinations        Active type
Fortran    real × RAsKind      RArealS
Fortran    real × RAdKind      RArealD
Fortran    complex × RAsKind   RAcomplexS
Fortran    complex × RAdKind   RAcomplexD
C++        float               RAfloatS
C++        double              RAfloatD

3. Rather than making the definitions of the elementary operations type members, we decided to
keep them separate. In particular, for some of the C++ operators one might normally prefer
the member declaration over a separate, non-member declaration, but this approach would not
cover all argument combinations of binary operators³, and Fortran does not have the syntactic
option. A welcome side-effect of separate source files is a less bulky compile step. For a non-trivial
order and number of directions, the compiler optimization can take a noticeable amount
of time and memory. Splitting up the library source files in this fashion reduces the compiler’s
resource requirements.
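The first feature in the list above, code-generating logic that iterates over orders and directions and concatenates indices to variable names, can be sketched as follows. This is a minimal illustration under stated assumptions and not the Rapsodia generator itself: the function name `generate_unrolled_mul` and the member names `d1_1`, `d1_2`, and so on are hypothetical, and only the value component name `v` appears in the text.

```python
def generate_unrolled_mul(order, directions):
    """Return source lines for r = a * b with all loops unrolled.

    The loops below mirror the convolution formula, but they emit
    expressions as text instead of computing values.
    """
    lines = ["r.v = a.v * b.v"]
    for i in range(1, directions + 1):        # direction index
        for j in range(1, order + 1):         # coefficient order
            terms = []
            for l in range(j + 1):            # convolution index
                left = "a.v" if l == 0 else f"a.d{i}_{l}"
                right = "b.v" if j - l == 0 else f"b.d{i}_{j - l}"
                terms.append(f"{left} * {right}")
            lines.append(f"r.d{i}_{j} = " + " + ".join(terms))
    return lines


# Hypothetical example: one direction, order two.
code = generate_unrolled_mul(order=2, directions=1)
print("\n".join(code))
```

The emitted statements contain no loops or array accesses, which is the property the text credits for the wider scope of compiler optimizations.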

The principal building blocks of the generated overloading libraries for C++ and Fortran are as
follows:

• a set of definitions for floating-point precision;


• a set of active types;
• type-specific accessors for the derivative components via integer parameters for direction and
order;
• a set of assignment operators, and for C++ a set of copy constructors;
• for each elementary operation –
◦ declarations for all relevant combinations of active arguments,
◦ definition stubs, and
◦ definition bodies included in the stubs.

2.2 Higher-order tensor drivers

An efficient approach for computing a derivative tensor of order $o \ge 3$, in the following denoted
by $D^o$, at a given point $x_0$, is laid out in [11] and has been previously implemented within
Adol-C. Rather than leaving the reimplementation to the user, we provide a Fortran and a C++
implementation as a convenience together with the Rapsodia sources. Because symmetry increases
in $D$ with $o$, we want to compute and represent just the distinct elements. In Section 1, we
introduced $n$ as the number of input variables and $m$ as the number of output variables. $\mathbb{N}_0$ are the
non-negative integers. Following [11], we use multi-indices $t \in \mathbb{N}_0^n$, where each $t_i$, $i = 1, \ldots, n$
represents the derivative order with respect to input $x_i$. For instance, for $n = 2$, $m = 1$, the two
Hessian entries $H_{12} = \partial^2/(\partial x_1 \partial x_2)$ and $H_{21} = \partial^2/(\partial x_2 \partial x_1)$ are both represented by $t = (1, 1)$.
All distinct elements of $D^o$ are represented by the multi-indices $t$ for which $o \equiv |t| = \sum_{i=1}^{n} t_i$.
There are exactly $\binom{n+o-1}{o}$ such multi-indices. We take each multi-index $t^j$ as a direction for
which we propagate Taylor polynomials of order $o$ at the given point $x_0$. In other words, we have
$d \equiv \binom{n+o-1}{o}$ and $s^i \equiv t^i$, $i = 1, \ldots, d$ (see also Section 1). The resulting Taylor coefficients of the
output $y_j^i$ can then be interpolated to retrieve the elements of any $D^j$, $j = 1, \ldots, o$, again identified
by their respective multi-indices $t$ with $|t| \equiv j$. The precomputed interpolation coefficients depend
only on $o$ and $n$.
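As a small illustration of the counting argument above, the following Python sketch enumerates all multi-indices with |t| = o and compares their number with the binomial coefficient. The function name `multi_indices` is an assumption for this example, and the enumeration order shown here need not match the order used by the Rapsodia drivers.

```python
from itertools import combinations_with_replacement
from math import comb


def multi_indices(n, o):
    """List all multi-indices t in N_0^n with |t| = o."""
    result = []
    # Choosing o input positions with repetition is equivalent to
    # distributing the total order o over the n inputs.
    for combo in combinations_with_replacement(range(n), o):
        t = [0] * n
        for i in combo:
            t[i] += 1
        result.append(tuple(t))
    return result


# Hypothetical example: n = 3 inputs, order o = 4.
directions = multi_indices(n=3, o=4)
expected_count = comb(3 + 4 - 1, 4)  # binomial(n + o - 1, o)
```

For n = 2 and o = 2 the sketch yields (2, 0), (1, 1), and (0, 2), where (1, 1) stands for both mixed Hessian entries.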
The source code for computing the multi-indices/directions $s^j$ and the interpolation routines are
in Rapsodia/hotF90 and Rapsodia/hotCpp. Because of syntax differences, the interfaces
differ slightly between C++ and Fortran. As with the set and get methods for the active type we
did not abandon the more concise style of calling a member function in the C++ implementation
just because Fortran does not provide it. All functionality is tied to a HigherOrderTensor
object, which needs to be initialized with $n$ and $o$. The direction count $d$ as well as the matrix
$S = [s^j] \in \mathbb{N}_0^{n \times d}$ can be obtained by calling getDirectionCount and getSeedMatrix,
respectively. Assuming the Rapsodia library has been generated for the given $o$ and $d$, one can
then compute and set the output Taylor coefficients for a single output $y_i$ (here, $i = 1, \ldots, m$)
as a $\mathbb{R}^{o \times d}$ (or a $\mathbb{C}^{o \times d}$) matrix by calling setTaylorCoefficients. In practice, programs
may not have all inputs and outputs packed into vectors $x$ and $y$ but instead the inputs and
outputs may consist of separate program variables. Consequently, the interface does not require
the packing into vectors. However, that also implies that the initialization of the input coefficients
with the columns of the seed matrix and the transfer of the output coefficients to the interpolation
routine cannot be automated.
For the output variable $y_i$ one retrieves the entries of any $D^j$, $j \in [1, \ldots, o]$ by calling
getCompressedTensor. This returns a vector of length $\binom{n+j-1}{j}$ whose $l$th entry corresponds
to the $l$th multi-index $s^l$ returned by getSeedMatrix computed for order $j$. In particular for
order $j = o$, getCompressedTensor returns a vector of length $d$ whose $l$th entry corresponds
to the $l$th direction/multi-index we propagated through our model. In the current implementation,
the pair of calls to setTaylorCoefficients and getCompressedTensor has to be
repeated for each of the $m$ outputs. Because $d$ becomes fairly large even for moderate $o$ and $n$,
memory constraints may make it necessary to propagate the $s^j$ in smaller slices of size $d^*$, a
technique called stripmining (see, for instance, [2]) previously used for the computation of large
Jacobians.
The Rapsodia regression tests contain examples for using the routines for both Fortran and C++.
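The stripmining mentioned above amounts to splitting the list of directions into consecutive slices of at most d∗ entries and propagating each slice in a separate sweep. The following Python sketch shows only the slicing step; the function name `stripmine` is a hypothetical helper and not a Rapsodia routine, and the integers stand in for the actual seed columns.

```python
def stripmine(directions, slice_size):
    """Yield consecutive slices of at most slice_size directions."""
    for start in range(0, len(directions), slice_size):
        yield directions[start:start + slice_size]


# Hypothetical example: d = 66 directions propagated in slices of d* = 11.
all_directions = list(range(66))
slices = list(stripmine(all_directions, 11))
```

Each slice would then be used to seed one run of the model with a library generated for d∗ directions, and the output coefficients of all runs would be collected before interpolation.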

3. Usage

The usage of Rapsodia follows a pattern similar to that of other overloading-based AD tools.
The generator writes all necessary inclusions into a single file RAinclude⁴, which needs to
be included in all compile units that are part of the computation to be differentiated; e.g. see
line 2 in Figure 1b. To trigger the use of the overloaded operators, one then has to change the
type of all active variables to the active type generated by Rapsodia as illustrated in the change
on lines 4 and 5 of Figure 1. In some cases a simple global type replacement, for instance via
the C preprocessor, is not possible. For instance, in C++ any union containing a floating-point
type cannot be converted. The conversion is not permitted by C++ because the active data type
needs to have copy constructors from passive values, e.g. for the purpose of method calls where
a passive actual parameter is passed for an active formal parameter. This, in turn, requires explicit
specification of a default constructor (e.g. for uninitialized declarations) because the implicit default
constructor is no longer generated, and the explicit default constructor is not permitted for types
in unions.

Figure 1. Original code for head in (a) and typechanged code in (b).

Also problematic is the memory allocation for floating-point variables using malloc. The
active type constructor is empty and the absence of virtual functions still permits the use of
malloc but the size has to be computed, i.e. it must not be specified with a fixed byte count.
A global type change done in Fortran will require manual adjustments, e.g. for the alignment of
data in block I/O operations.
The original model is assumed to have a driver to initialize the inputs and consume the output.
That driver will have to be extended to initialize the input Taylor coefficients and consume the
output Taylor coefficients as is shown in Figure 2. The type of the model’s actual arguments has
to be adjusted. If one assumes line 6 to be the original initialization, nothing is changed here. In
particular, we do not set the value component explicitly as one would by writing x[0].v=0.3;.
The overloaded assignment operator (with the passive right-hand side) also initializes to zero all
Taylor coefficients in x[0] that had, for efficiency reasons, not been initialized until now. Before
calling model we need to initialize the input Taylor coefficients, for simplicity here done only for
$x[0]_1^1$. After model has run, we may extract the output and the Taylor coefficient $y_1^1$ as shown
on lines 11 and 12, respectively. With the given initialization of the input coefficients, the output
coefficient $y_1^1$ will yield $\nabla f \, [1 \; 0]^T$ (see Section 1).
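The effect of seeding only the first input can be reproduced with a toy first-order active type. The Python sketch below is not the Rapsodia API: the class `Active`, its member `d1`, and the two-input function `model` are assumptions made for illustration, and only the value component name `v` is taken from the text.

```python
class Active:
    """Toy active type for one direction and order one (d = o = 1)."""

    def __init__(self, v, d1=0.0):
        self.v = v    # value component
        self.d1 = d1  # first-order Taylor coefficient, zero by default

    def __add__(self, other):
        return Active(self.v + other.v, self.d1 + other.d1)

    def __mul__(self, other):
        return Active(self.v * other.v,
                      self.v * other.d1 + self.d1 * other.v)


def model(x):
    # Hypothetical model f(x0, x1) = x0 * x1 + x0
    return x[0] * x[1] + x[0]


x = [Active(0.3), Active(2.0)]  # passive initialization zeroes the coefficients
x[0].d1 = 1.0                   # seed the direction [1 0]^T
y = model(x)
# y.v holds f(0.3, 2.0); y.d1 holds the directional derivative df/dx0 = x1 + 1
```

Seeding `x[1].d1` instead would yield the derivative with respect to the second input, which is why one direction per seed vector is needed.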
To compile and run the example, one has to generate the overloading library, e.g. by invoking

generate.py -d 1 -o 1 -c ./CppLib

to create all the source and header files for active types and overloaded operations with
d ≡ o ≡ 1 into a subdirectory named ./CppLib. The name list CPPRA_SRC_NAMES in
Rapsodia/Makefile.inc contains all Rapsodia source file names to be compiled and linked
to the driver.
For Fortran the usage is in principle the same. The minor differences relate to the need to
reference the Rapsodia definitions by including RAinclude.i90 in every subroutine
(compile unit) and to the signatures of the Taylor coefficient set and get methods. The Fortran-specific
list of source file names F90RA_SRC_NAMES stipulates a compilation order that satisfies the mod
file dependencies and is defined in Rapsodia/Makefile.inc. This file also contains the
flags for the Fortran compilers we used for testing, including gfortran v4.2.0, g95 v0.91, Intel
v10.0, NAG v5.0, PGI v7.04, and Absoft v10. One can use the Makefiles in the Rapsodia
regression tests as a template.

Figure 2. Simple driver for Figure 1b.

4. Applications

The initial impetus for developing Rapsodia came from the practical need to compute higher-order
derivatives for a number of applications where the model was given in Fortran, which excluded
Adol-C. Some experiments then showed the advantage of generating code for a fixed order and
number of directions. In the following, we briefly highlight these applications.

4.1 Ionization

Double ionization of atoms or molecules by electron impact is of considerable interest in many
fields of physics. The fully differential cross-section (FDCS) on helium depends upon the solid
angles for the scattered electron and the two ejected electrons and on the energies of the ejected
electrons. Assuming a unique interaction between the target and the incoming electron, one
computes it as

$$\mathrm{FDCS} = \frac{k_1 k_2 k_s}{k_i} \, |M|^2, \tag{2}$$

where $\mathbf{k}_i$, $\mathbf{k}_s$, $\mathbf{k}_1$, and $\mathbf{k}_2$ denote vectors (here the momenta of incident, scattered, first ejected,
and second ejected electrons) and the corresponding terms $k_i$, $k_s$, $k_1$, and $k_2$ are their respective
norms. The matrix element $M$ is a nine-dimensional integral

$$M = \frac{1}{\ldots} \int \psi_f^*(\mathbf{r}_1, \mathbf{r}_2) \, e^{i\mathbf{k}_s \cdot \mathbf{r}_0} \, V \, \psi_i(\mathbf{r}_1, \mathbf{r}_2) \, e^{i\mathbf{k}_i \cdot \mathbf{r}_0} \, d\mathbf{r}_0 \, d\mathbf{r}_1 \, d\mathbf{r}_2, \tag{3}$$

where $V = -2 r_0^{-1} + |\mathbf{r}_0 - \mathbf{r}_1|^{-1} + |\mathbf{r}_0 - \mathbf{r}_2|^{-1}$ is the Coulomb interaction between the projectile
and the helium atom, $r_0$ is the distance between the incident electron and the nucleus, and $r_1$ and
$r_2$ are the distances between one of the helium electrons and its nucleus. The terms $\mathbf{r}_0$, $\mathbf{r}_1$, and
$\mathbf{r}_2$ denote the related vectors. The wave functions $\psi_i$ and $\psi_f$ are solutions of the Schrödinger
equation for the helium atom. No exact formulae exist for $\psi_i$ and $\psi_f$. The well-known Bethe
transformation $e^{i\mathbf{k} \cdot \mathbf{r}_0} k^{-2} = (4\pi)^{-1} \int e^{i\mathbf{k} \cdot \mathbf{r}} \, |\mathbf{r} - \mathbf{r}_0|^{-1} \, d\mathbf{r}$ allows for the integration on $\mathbf{r}_0$. Thus, the
computation of Equation (3) needs a six-dimensional integral only.
On the one hand, the bound state wave function $\psi_i$ may be approximated, under the first Born
approximation, by means of a Hylleraas-type wave function:

$$\psi_i(\mathbf{r}_1, \mathbf{r}_2) = e^{-\alpha s/2} \sum_{\lambda,\mu,\nu;\ \nu\ \mathrm{even}}^{0,\infty} c_{\lambda,\mu,\nu} \, s^{\lambda-\mu} u^{\mu-\nu} t^{\nu}, \tag{4}$$

where $\alpha$ is a constant, $s = r_1 + r_2$, $u = r_{12} = |\mathbf{r}_1 - \mathbf{r}_2|$, and $t = -r_1 + r_2$. On the other hand,
the best approximation for the final state $\psi_f$ is that of [3], which satisfies exact asymptotic
boundary conditions. The two numerical approaches available to tackle an accurate Hylleraas
wave function are either a six-dimensional numerical quadrature [14] (expensive in computer
time), or a two-dimensional quadrature applied to high-order derivative tensors [3].
The gain in number of integrals has to be paid for. As presented in the Introduction, Brauner’s
method is based on the computation of a term $D$, related to the simple wave function $\psi_i(\mathbf{r}_1, \mathbf{r}_2) =
e^{-ar_1} e^{-br_2} e^{-cr_{12}}$, arising from a third-order derivative of Brauner’s term $e^{-ar_1} e^{-br_2} e^{-cr_{12}}/(r_1 r_2 r_{12})$.
The latter was introduced in $\psi_i$ to enable the writing of terms $r_1^{\lambda-\mu} r_2^{\mu-\nu} r_{12}^{\nu} e^{-ar_1} e^{-br_2} e^{-cr_{12}}$ appearing
when using Brauner’s method as derivatives (Equation (1)) of $D$ when $\lambda - \mu \ge 0$, $\mu - \nu \ge 0$,
and $\nu \ge 0$.
As an example, let us consider the Kinoshita [17] wave function $\psi_K$ involving 18 parameters
(it has no negative power)

$$\begin{aligned}
\psi_K(s, u, t) = [\, & c_{0,0,0} + c_{1,0,0}\, s + c_{1,1,0}\, u + c_{2,0,0}\, s^2 + c_{2,1,0}\, su + c_{2,2,0}\, u^2 + c_{2,2,2}\, t^2 + c_{3,0,0}\, s^3 \\
& + c_{3,3,0}\, u^3 + c_{3,2,2}\, st^2 + c_{3,3,2}\, ut^2 + c_{4,0,0}\, s^4 + c_{4,4,0}\, u^4 + c_{4,2,2}\, s^2 t^2 \\
& + c_{4,4,2}\, u^2 t^2 + c_{5,5,0}\, u^5 + c_{5,5,2}\, u^3 t^2 + c_{6,6,2}\, u^4 t^2 \,]\, e^{-ks/2}.
\end{aligned} \tag{5}$$

In this formula, the required derivatives are deduced from the monomials. For instance, $s^2 t^2$ indicates
the need for the derivative $\partial^4/(\partial s^2 \partial t^2)\,(\partial^3 D/(\partial a\, \partial b\, \partial c))$. According to Hylleraas wave function
conventions, this term is multiplied by the Kinoshita coefficient $c_{4,2,2}$ corresponding to the choice
$\lambda = 4$, $\mu = 2$, and $\nu = 2$. Because the generic third-order term $\partial^3 D/(\partial a\, \partial b\, \partial c)$ is already implemented
in the original code, AD has to compute sixth-order terms. The special structure of the
problem permits exact computations with only 10 directions [6]. The unrolled loop strategy was
used to compute the physical results presented in [6] that prove the convergence of the Brauner
method.
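The way the required derivative orders follow from the monomials of Equation (5) can be made explicit with a short Python sketch. The index triples are read off the coefficients of Equation (5); the names `KINOSHITA_INDICES` and `monomial_exponents` are illustrative assumptions.

```python
# Index triples (lambda, mu, nu) of the 18 coefficients in Equation (5).
KINOSHITA_INDICES = [
    (0, 0, 0), (1, 0, 0), (1, 1, 0), (2, 0, 0), (2, 1, 0), (2, 2, 0),
    (2, 2, 2), (3, 0, 0), (3, 3, 0), (3, 2, 2), (3, 3, 2), (4, 0, 0),
    (4, 4, 0), (4, 2, 2), (4, 4, 2), (5, 5, 0), (5, 5, 2), (6, 6, 2),
]


def monomial_exponents(lam, mu, nu):
    """Exponents of s, u, and t in the term s^(lam-mu) u^(mu-nu) t^nu."""
    return (lam - mu, mu - nu, nu)


exponents = {idx: monomial_exponents(*idx) for idx in KINOSHITA_INDICES}
# The total degree of each monomial equals lambda, so the highest
# derivative order needed on top of the third-order term is max(lambda).
highest_order = max(sum(e) for e in exponents.values())
```

For the coefficient with indices (4, 2, 2) the sketch returns the exponents (2, 0, 2), i.e. the monomial $s^2 t^2$ discussed above, and the highest total degree over all 18 terms is six.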
Rapsodia was initially conceived to differentiate this application because, at the time, no other
AD tool supported higher-order derivatives of a Fortran program with complex numbers. The
ionization model setup does not easily allow the dimension and the order to be varied and is therefore
not well suited for a performance comparison. Instead, we consider applications that model
ocean acoustics and volcanology. Neither application uses complex numbers and both have previously
been the subject of optimization studies involving adjoint computations. However, both
permit scaling. The higher-order derivatives computed for these models might, outside of our
performance comparison, be used for optimization studies as done in [12] in the context of shape
optimization involving a few parameters. There, higher-order derivative information is used to
avoid some expensive solutions of a PDE system. This justifies our use of these models for the
performance tests.

4.2 Ocean acoustics

One important research field in underwater acoustics consists of determining modes $u_n$ and
wavenumbers $k_n$ in a shallow water environment without any a priori information about the
propagation medium. The shallow zone may be described as a waveguide. The pressure field
$P_\omega$ at a frequency $\omega$ is then decomposed into the propagating modes of the guide [1,13]. The
approximated pressure field $P_\omega(\{u_n, k_n, \alpha_n\}, R)$ between a source and a receiver at depths $z_s$ and
$z_r$, respectively, is written as

$$P_\omega(\{u_n, k_n, \alpha_n\}, R)(z_r, z_s) = \frac{i e^{-i\pi/4} S(\omega)}{\rho(z_s) \sqrt{8\pi}} \sum_{n=1}^{N} \frac{u_n(z_s)\, u_n(z_r)}{\sqrt{k_n R}} \, e^{i k_n R - \alpha_n R}, \tag{6}$$

where $N$ is the number of propagating modes and $R$ is the source–receiver range. In the triplet
$\{u_n, k_n, \alpha_n\}$, we have the parameters $u_n(z_s)$ as the amplitude of mode $n$ at depth $z_s$, $k_n$ as the
propagation wavenumber, and $\alpha_n$ as the bottom attenuation associated with mode $u_n$. $S(\omega)$ denotes
the source spectrum at frequency $\omega$ and $\rho(z_s)$ is the density at the source depth. As presented in
[7], modes and wavenumbers may be identified by means of an adjoint modelling technique.
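For orientation, the modal sum of Equation (6) can be evaluated with a few lines of Python. This is only a sketch under stated assumptions and not the model code used for the experiments: the function name `pressure_field` and its argument layout are hypothetical, and both mode amplitudes in each term are assumed to belong to the same mode n.

```python
import cmath
import math


def pressure_field(modes, R, S, rho_s):
    """Evaluate the modal sum of Equation (6).

    modes -- list of tuples (u_s, u_r, k, alpha): amplitudes of one mode at
             the source and receiver depths, wavenumber, and attenuation
    R     -- source-receiver range
    S     -- source spectrum at the considered frequency
    rho_s -- density at the source depth
    """
    prefactor = (1j * cmath.exp(-1j * math.pi / 4) * S
                 / (rho_s * math.sqrt(8 * math.pi)))
    total = 0j
    for u_s, u_r, k, alpha in modes:
        total += u_s * u_r / math.sqrt(k * R) * cmath.exp(1j * k * R - alpha * R)
    return prefactor * total


# Hypothetical single-mode example with unit amplitudes and no attenuation.
p = pressure_field([(1.0, 1.0, 1.0, 0.0)], R=4.0, S=1.0, rho_s=1.0)
```

In an overloading-based differentiation, the floating-point quantities treated as inputs would be declared with the active type instead of the built-in one.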
This example was used to run a comparison test with AD02 [19]. For varying n and o we recorded
the runtimes in seconds on a Xeon 3.06-GHz processor with 3 GB memory with g95, Intel’s ifort,
and NAG’s f95 compilers. In Table 2, we show the optimization levels that exhibit significant
differences. In the test cases marked with + we did not wait for completion; the cases marked
with ∗ aborted because of lack of memory. Clearly, a large part of the runtime improvement over
AD02 can be attributed to the higher-order tensors approach from Section 2.2, which is currently
not used by AD02. In fairness we should also point out that AD02, as a general-purpose tool, has
other advanced features such as runtime activity tracking in its active type, which in this particular
setup has no opportunity to elicit an advantage.
The actual benefits of our code generation over a general-purpose implementation of the Taylor
Downloaded by [Deakin University Library] at 14:47 14 July 2013

polynomial propagation with loop iterations is of higher interest for this paper than the effects of
the interpolation algorithm. For comparison with the Rapsodia-generated source we wrote a simple
overloading test library denoted by T in which the active type stores all Taylor coefficients in a
simple two-dimensional array. Because we are using Fortran, we coded the Taylor propagations
using implicit loop constructs provided by the Fortran built-in array operations when appropriate.
In the comparison, we give two advantages to T that are fixed o, d combinations and a simplified
implementation of pow. The first is done because constant array sizes and loop bounds improve
compiler optimizations without using actual code generation. We declared o and d ∗ as Fortran
PARAMETERs and recompiled T for the specific (o, d ∗ ) combinations. Figure 3 (top left) shows
the runtime ratios comparing the Rapsodia-generated code with T for several combinations of
compiler and optimization levels on a 2.66-GHz Pentium 4. Not surprisingly, the results vary with
the compiler and optimization levels, but despite the advantages given to T , the Rapsodia code
is still always better in the experiment. Both ifort and gfortran show a trend toward diminishing
returns for growing o and d ∗ . For larger applications, however, one will eventually have to contend
with the data growth, which at o = 10 and d ∗ = 13 already shows a factor of 130 over the original
data size. For growing o the increase in computational complexity even on small problems using
the interpolation approach is considerable, and therefore even relatively small improvements
become worthwhile. Generally speaking, o and d ∗ are limited by application concerns, and we
do not expect to see much use for Rapsodia with parameter values much larger than what we
consider here. We also used the PGI and Absoft compilers (see Figure 3, right); but because of
a problem in the version of the Absoft compiler available to us, it aborted for the higher-order test
cases. However, the PGI compiler with the ratios more favourable to Rapsodia also produced the
better runtimes in absolute numbers compared with the Absoft-generated executables.
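
To make the loop-based reference approach concrete, the following is a minimal C++ sketch of an active type in the spirit of T, with the number of directions and the order fixed at compile time. It is not the authors' Fortran library: the type name Active, the accessor at, and the storage layout are assumptions made for illustration. The multiplication uses the standard product rule for truncated Taylor polynomials, w_k = u_0 v_k + u_k v_0 + sum_{j=1}^{k-1} u_j v_{k-j}, applied independently in each direction.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical active type (illustrative only): a value plus a D x O array of
// Taylor coefficients, with D directions and order O fixed at compile time.
template <std::size_t D, std::size_t O>
struct Active {
    double v = 0.0;                             // zeroth-order coefficient (the value)
    std::array<std::array<double, O>, D> c{};   // c[i][k-1]: order-k coefficient in direction i

    // Order-k coefficient (1 <= k <= O) in direction i.
    double& at(std::size_t i, std::size_t k) {
        assert(i < D && k >= 1 && k <= O);
        return c[i][k - 1];
    }
    double at(std::size_t i, std::size_t k) const {
        assert(i < D && k >= 1 && k <= O);
        return c[i][k - 1];
    }
};

// Product of two truncated Taylor polynomials, direction by direction.
// The loop bounds D and O are compile-time constants, which is the first
// advantage granted to the reference library in the comparison.
template <std::size_t D, std::size_t O>
Active<D, O> operator*(const Active<D, O>& a, const Active<D, O>& b) {
    Active<D, O> r;
    r.v = a.v * b.v;
    for (std::size_t i = 0; i < D; ++i) {
        for (std::size_t k = 1; k <= O; ++k) {
            // Terms that involve the shared zeroth-order coefficients.
            double s = a.v * b.at(i, k) + a.at(i, k) * b.v;
            // Convolution of the higher-order coefficients.
            for (std::size_t j = 1; j < k; ++j) {
                s += a.at(i, j) * b.at(i, k - j);
            }
            r.at(i, k) = s;
        }
    }
    return r;
}
```

For instance, with one direction and order three, seeding x with value 2 and first-order coefficient 1 and forming x * x gives the coefficients of (2 + t)^2, that is, value 4, first-order coefficient 4, and second-order coefficient 1. A generator can replace such loops by code specialized for a given order and number of directions.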

Table 2. Runtimes in seconds for the ocean acoustics example for different o and n combinations. The times for o = 8/10 and
n = 3 for AD02 with g95 were obtained with −O2.

             AD02                                         Rapsodia
         g95     ifort   NAG     NAG                  g95     ifort   NAG     NAG
 o   n   −O3     −O2     −O2     −O4     d∗    d      −O3     −O2     −O2     −O4

 2   5   0.599   0.460   0.543   0.658   15    15     0.072   0.106   0.087   0.086
 4   3   40.97   11.97   13.67   14.41   15    15     0.161   0.255   0.181   0.176
 6   3   185.4   58.88   73.63   71.21   14    28     0.514   0.794   0.538   0.515
 8   2   105.8   36.39   45.41   41.56    9     9     0.250   0.366   0.262   0.257
 8   3   651.1   ∗       289.8   285.2   15    45     1.157   1.762   1.172   1.101
10   3   1958    ∗       +       +       11    66     2.453   3.523   2.474   2.420
13   3   +       ∗       +       +       10   105     5.677   8.656   5.673   5.638
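
The d column of Table 2 agrees with the number of directions used by the univariate Taylor interpolation scheme of [11], which for n inputs and order o is the binomial coefficient C(n + o − 1, o). The short C++ sketch below reproduces that column. The function names are hypothetical, and the formula is taken from the interpolation scheme rather than stated in this section, so it should be read as an assumption that happens to match every row of the table.

```cpp
#include <cassert>
#include <cstdint>

// Binomial coefficient C(n, k) in exact integer arithmetic.
std::uint64_t binomial(std::uint64_t n, std::uint64_t k) {
    if (k > n) return 0;
    if (k > n - k) k = n - k;  // use symmetry to shorten the loop
    std::uint64_t r = 1;
    for (std::uint64_t i = 1; i <= k; ++i) {
        // Before this step r == C(n-k+i-1, i-1); the product is divisible by i.
        r = r * (n - k + i) / i;
    }
    return r;
}

// Assumed direction count for n inputs and derivative order o.
std::uint64_t directions(std::uint64_t n, std::uint64_t o) {
    assert(n >= 1);
    return binomial(n + o - 1, o);
}
```

For example, directions(3, 8) evaluates to 45 and directions(3, 13) to 105, matching the corresponding rows of Table 2.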

Figure 3. Runtime ratios for Rapsodia over T per (o, d ∗ ) pair computed on a 2.66-GHz Pentium 4 (top left), an
Opteron (bottom left), and a 2.4-GHz Pentium 4 CPU (right) where the Absoft and PGI compilers were available at
varying optimization levels.

The same experiment was also conducted on a 1.6-GHz Opteron running 64-bit Linux for which
the runtime ratios are shown in the bottom left diagram of Figure 3. Here, the ifort compiler for
the two highest-order cases shows a slight disadvantage for Rapsodia. However, we should keep
in mind the two advantages given to T and also note that on the Opteron the overall shortest
runtimes at the highest optimization levels were produced by g95 at about 15% less than ifort.
Curiously, the runtime ratios most favourable to Rapsodia were also produced by g95, which
one might interpret as an emphasis on optimizing straight-line code over Fortran’s implicit loop
constructs. Such speculation is, however, beyond the scope of this paper.

4.3 One-phase, one-dimensional Plinian column models

Plinian volcanological models basically consist of the application of the equations of conservation
of mass, momentum, and energy to the motion of one fluid into another. In these terms, the
problem is of relevance in applications of fluid mechanics involving jets or buoyant plumes.
Both behaviours may occur in an eruptive column. In the lower ‘thrust’ region, the motion of the
column is dominated by the high pressure difference between the mixture in the volcanic conduit
and the atmosphere at the vent. The mixture forced into the atmosphere loses kinetic energy
and ingests ambient air heating it. If, at a certain height, the bulk density of the new mixture
is less than that of the ambient air, this mixture is carried upward, forming a ‘buoyant region’.
The reader is referred to [4] for a review. For the tests we use the equations in [5], the ODE
system being solved through a hand-written fourth-order Runge–Kutta scheme. Some sensitivity
analysis and adjoint experiments are presented in [4,5]. The model is written in C++, and we
use it for a runtime comparison with Adol-C. We have n = 5 and compute for o = 5 and d ∗ = 5.
The comparison is not as straightforward as in the previous application. Adol-C creates a recording,
or ‘tape’, of the computations involving active variables and then performs derivative computations
using only the tape. There is a tapeless version in Adol-C, but that is defined only for first-order
propagation. In Adol-C, higher-order propagation is accomplished by a single call to the subroutine
hov_forward. While the Adol-C taping approach is often suspected to be inefficient, it can
be quite fast on a small problem such as this. In Figure 4, we show the runtime comparison.
Because we cannot use the tapeless mode, we show runtimes for hov_forward only and
runtimes including the tape creation. Analysis of the problem shows two major reasons working
to Adol-C’s advantage. First, the maximal number of active variables instantiated at any one time
during the execution is 52, rather small. Therefore, Adol-C can operate on a small data structure
in hov_forward once the tape is generated. Second, because the problem requires relatively
few computations, the tape size is comparatively small, too. The test problem does not lend itself
to scaling up the data size to test for cases with a maximum of instantiated active variables larger
than 52. However, we can increase the number of operations by repeating the model computation
and observe in Figure 4 the effects for problems that take longer than the subsecond runtimes
seen in the original problem. Increased tape lengths lead to diminished performance advantages
until the system file size limits prevent completion altogether (for 500 repetitions). Rapsodia,
as one would expect, does not exhibit such a problem. Considering the advantage of Adol-C
on the original problem with g++, the most plausible explanation is that overloaded methods in
Rapsodia are not always inlined. While Rapsodia can produce C++ code with inline directives,
the runtime effects using g++ appear to be negligible, perhaps because of the size of some of the
member definitions. With the Intel C++ compiler, icpc, we observe noticeable improvements for
the inlined version and shorter runtimes already on the short computations. As evident from Figure 4,
the largest advantage for Rapsodia and the best overall runtimes are achieved with Intel’s icpc
and optimization turned on.

Figure 4. Runtime comparison of Rapsodia (R1), Rapsodia inlined (R2), Adol-C hov_forward (A1), and
Adol-C taping + hov_forward (A2).
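
The general idea of taping can be illustrated by a deliberately small C++ sketch. It is not Adol-C's data structure or API: the names Tape, Traced, input, and replay_tangent are hypothetical, only addition and multiplication are recorded, and the replay propagates first-order tangents rather than the higher-order coefficients handled by hov_forward. The sketch also keeps one slot per recorded operation, whereas a production tool reuses locations so that the working set is bounded by the number of live active variables. What it does show is that the tape grows with the number of executed operations, which is why repeating the model computation lengthens the tape.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal tape (illustrative only, not Adol-C).
enum class Op { Input, Add, Mul };

struct Entry {
    Op op;
    std::size_t lhs;  // index of the first operand (unused for Input)
    std::size_t rhs;  // index of the second operand (unused for Input)
};

struct Tape {
    std::vector<Entry> entries;  // one entry per recorded operation
    std::vector<double> values;  // value computed by each entry while recording

    std::size_t push(Op op, std::size_t l, std::size_t r, double v) {
        entries.push_back({op, l, r});
        values.push_back(v);
        return entries.size() - 1;
    }
};

// Active variable whose overloaded operators record on the tape.
struct Traced {
    Tape* tape;
    std::size_t idx;
    double value() const { return tape->values[idx]; }
};

Traced input(Tape& t, double v) { return {&t, t.push(Op::Input, 0, 0, v)}; }

Traced operator+(const Traced& a, const Traced& b) {
    assert(a.tape == b.tape);
    return {a.tape, a.tape->push(Op::Add, a.idx, b.idx, a.value() + b.value())};
}

Traced operator*(const Traced& a, const Traced& b) {
    assert(a.tape == b.tape);
    return {a.tape, a.tape->push(Op::Mul, a.idx, b.idx, a.value() * b.value())};
}

// Replay: first-order forward propagation that reads only the tape.
// seeds[i] is the tangent of the i-th Input entry in recording order.
std::vector<double> replay_tangent(const Tape& t, const std::vector<double>& seeds) {
    std::vector<double> dot(t.entries.size(), 0.0);
    std::size_t next_seed = 0;
    for (std::size_t i = 0; i < t.entries.size(); ++i) {
        const Entry& e = t.entries[i];
        switch (e.op) {
            case Op::Input:
                assert(next_seed < seeds.size());
                dot[i] = seeds[next_seed++];
                break;
            case Op::Add:
                dot[i] = dot[e.lhs] + dot[e.rhs];
                break;
            case Op::Mul:
                dot[i] = dot[e.lhs] * t.values[e.rhs] + t.values[e.lhs] * dot[e.rhs];
                break;
        }
    }
    return dot;
}
```

Recording z = x * y + x for x = 3 and y = 4 produces four tape entries, and replaying with seeds (1, 0) yields the tangent 5 for z. Recording the same expression again appends further entries, whereas a purely overloading-based propagation keeps no such record.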

5. Summary and outlook

This paper explains the rationale and implementation for the Rapsodia code generator. We use
practical applications to evaluate the runtime advantage to be obtained in comparison with other
AD tools and also to compare Rapsodia with a reference implementation using hand-written
Fortran implicit loop constructs with fixed bounds. Depending on the compiler and platform, the
improvements vary but can, for instance in the case of g95 on Opteron, amount to a factor of 10
over the reference implementation. In addition to its main part, the generated code
for the Taylor polynomial propagation, Rapsodia also provides Fortran and C++ implementations of an
efficient higher-order tensor interpolation approach.
Future work on Rapsodia includes the implementation of intrinsics that have not yet been
required by any practical application and an integration with the source transformation tool
OpenAD. Also of interest are a refinement of the implementation for complex derivative
computations in Fortran and the introduction of a comparable active complex type based on the C99
built-in complex type for C and C++. Ongoing improvements in compiler analysis and
optimization can be expected to reduce the advantage of Rapsodia-style code generation in the long run.
As part of the tool maintenance, the Rapsodia website [21] will be continuously tracking timing
results on compilers and platforms available to the authors. This should give potential users an
estimate for the benefits of the Rapsodia-generated code compared with the general-purpose AD
tools.

Acknowledgements
The authors thank John Reid for giving us the opportunity to perform the runtime comparison with AD02 at Rutherford
Appleton Laboratories and C. Dal Cappello for suggesting the ionization application. Jean Utke was supported by the
Mathematical, Information, and Computational Sciences Division subprogramme of the Office of Advanced Scientific
Computing Research, Office of Science, US Department of Energy under contract DE-AC02-06CH11357.

Notes

1. In practice, most φ are uni- or bivariate.
2. RAsKind and RAdKind are equivalent to F77 single and double precisions, respectively.
3. For example, the unary ‘-’ operator could easily be declared as a member but there is no way to declare the binary
‘*’ operator a member for the case where only the right operand is a class instance and the left is passive.
4. The generated file names have language-specific extensions left off here.

References

[1] A.B. Baggeroer, W.A. Kuperman, and P.N. Mikhalevsky, An overview of matched field methods in ocean acoustics,
IEEE J. Ocean. Eng. 18 (1993), pp. 401–424.
[2] C.H. Bischof, L. Green, K. Haigler, and T. Knauff, Calculation of sensitivity derivatives for aircraft design using
automatic differentiation, in Proceedings of the 5th AIAA/NASA/USAF/ISSMO Symposium on Multidisciplinary
Analysis and Optimization, AIAA 94-4261, American Institute of Aeronautics and Astronautics, 1994, pp. 73–84.
[3] M. Brauner, J. Briggs, and H. Klar, Triply-differential cross sections for ionization of hydrogen atoms by electrons
and positrons, J. Phys. B: At. Mol. Phys. 22 (1989), pp. 2265–2287.
[4] I. Charpentier, Adjoint modelling experiments on eruptive columns, Geophys. J. Int. 169(3) (2007), pp. 1356–1365.
[5] ———, Variational coupling of Plinian column models and data: application to El Chichón volcano, J. Volcanol.
Geotherm. Res. 25 (2008), pp. 501–508.
[6] I. Charpentier and C. Dal Cappello, High order cross derivative computation for the differential cross section of
double ionization of helium by electron impact, Tech. Rep. 5546, INRIA, 2005.
[7] I. Charpentier and P. Roux, Mode and wavenumber inversion in shallow water using an adjoint method, J. Comput.
Acoust. 12 (2004), pp. 521–542.
[8] I. Charpentier and J. Utke, Rapsodia: user manual, Tech. Rep., Argonne National Laboratory, 2008. Available at
http://www.mcs.anl.gov/rapsodia/.
[9] A. Griewank, Evaluating Derivatives. Principles and Techniques of Algorithmic Differentiation, Number 19 in
Frontiers in Applied Mathematics, SIAM, Philadelphia, 2000.
[10] A. Griewank, D. Juedes, and J. Utke, ADOL–C, a package for the automatic differentiation of algorithms written in
C/C++, ACM Trans. Math. Software 22(2) (1996), pp. 131–167.
[11] A. Griewank, J. Utke, and A. Walther, Evaluating higher derivative tensors by forward propagation of univariate
Taylor series, Math. Comput. 69 (2000), pp. 1117–1130.
[12] Ph. Guillaume and M. Masmoudi, Computation of high order derivatives in optimal shape design, Numer. Math.
67(2) (1994), pp. 231–250.
[13] F. Jensen, W. Kuperman, M. Porter, and H. Schmidt, Computational Ocean Acoustics, Springer, New York, 1994.
[14] S. Jones and D. Madison, Single and double ionization of atoms by photons, electrons, and ions, in AIP Conference
Proceedings 697, 2003, pp. 70–73.
[15] B. Joulakian, C. Dal Cappello, and M. Brauner, Double ionization of helium by fast electrons: use of correlated two
electron wavefunctions, J. Phys. B: Atomic, Mol. Opt. Phys. 25 (1992), pp. 2863–2871.
[16] A. Kheifets, I. Bray, J. Berakdar, and C. Dal Cappello, Comparative theoretical study of (e, 3e) on helium: Coulomb-
waves versus close-coupling approach, J. Phys. B: Atomic, Mol. Opt. Phys. 35 (2002), pp. L15–L21.
[17] T. Kinoshita, Ground state of the helium, Phys. Rev. 105(5) (1957), pp. 1490–1502.
[18] Portable, Extensible Toolkit for Scientific Computation. Available at http://www.mcs.anl.gov/petsc.
[19] J. Pryce and J. Reid, AD01, a Fortran 90 code for automatic differentiation, Tech. Rep. RAL-TR-1998-057,
Rutherford Appleton Laboratory, Chilton, Oxfordshire, England, 1998.
[20] Python Programming Language. Available at http://www.python.org/.
[21] Rapsodia website – http://www.mcs.anl.gov/rapsodia.
[22] R.E. Wengert, A simple automatic derivative evaluation program, Comm. ACM 7(8) (1964), pp. 463–464.

Appendix: language-specific concerns in Rapsodia

Numerous differences exist between C++ and Fortran. Here we allude only to those relevant for
the Rapsodia generator. In the previous sections, we already explained the control over precision
and basic floating-point type for the generated active types. We also gave a rationale for separating
the active type declaration from the declaration of the overloaded elementary operations.

Module vs. header and source

A Fortran module file contains both declarations and subroutine definitions, while in typical, non-
inlined C++ the declarations are in some header file and the definitions in a separate source file.
To allow for both concepts, the AST has an ObjectSource class that encapsulates declarations
and definitions. For an ObjectSource instance the Fortran printer produces a single file with
the module source code, while the C++ printer produces a header and a separate source file.
References to such declarations are done in Fortran by a use statement and in C++ by an
#include, both of which are represented in the AST by an ObjectReference instance.

User-defined types

In Fortran, assignments to a user-defined type can be defined in a module different from the module
containing the type definition. In C++, such assignments have to be members of the defining
class. Consequently, the logic to generate this code is attached to the language-specific printer
class. Because the generator covers all argument combinations for the elementary operations,
there remains a need for defining copy constructors for calling user-defined methods. Like the
assignment operators, the copy constructors in C++ have to be members of the type-defining
class.
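
The C++ constraints described above can be seen in a small sketch. The type ActiveD is hypothetical and carries only a value and a single first-order derivative; it is not the generated Rapsodia type. Assignment operators and the copy constructor appear as members because the language requires it, the unary minus is a member by choice, and the product with a passive left operand has to be a non-member, which is the situation described in Note 3.

```cpp
#include <cassert>

// Hypothetical first-order active type (value v, derivative d); illustrative only.
struct ActiveD {
    double v = 0.0;
    double d = 0.0;

    ActiveD() = default;

    // Copy constructor: must be a member; needed when passing by value.
    ActiveD(const ActiveD& other) : v(other.v), d(other.d) {}

    // Assignment operators: in C++ these must be members of the class.
    ActiveD& operator=(const ActiveD& other) {
        v = other.v;
        d = other.d;
        return *this;
    }
    ActiveD& operator=(double passive) {
        v = passive;
        d = 0.0;  // a passive value carries no derivative information
        return *this;
    }

    // Unary minus: may be declared as a member.
    ActiveD operator-() const {
        ActiveD r;
        r.v = -v;
        r.d = -d;
        return r;
    }
};

// passive * active: the left operand is not a class instance,
// so this overload cannot be a member of ActiveD.
ActiveD operator*(double a, const ActiveD& b) {
    ActiveD r;
    r.v = a * b.v;
    r.d = a * b.d;
    return r;
}

// User-defined function taking its argument by value; the call site
// invokes the copy constructor.
ActiveD twice(ActiveD a) { return 2.0 * a; }
```

In Fortran the corresponding defined assignment could instead be supplied by a module other than the one that declares the type, which is why the generator attaches this logic to the language-specific printer.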

Overloading

While overloading in C++ allows declarations with different argument types and the same method
name, Fortran module interfaces contain module procedure declarations with different names
that resemble a manual name mangling. To cover both concepts we generate for each specific
operator/intrinsic overloading two subtrees. One subtree is used for the module contains
section (or the source file for C++). The second subtree is identical to the first except that it does
not contain the actual implementation body. It is used for declarations in module interfaces (or
header files for C++) with a signature-specific appendix (for Fortran) or all arguments with types
(for C++). The declaration and definition subtrees are collected in two groups that are children of
a common ObjectSource node. The printer distinguishes the declaration from the definition
context and produces the proper output.
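
As an illustration of the declaration and definition groups on the C++ side, the sketch below overloads the addition for the three argument combinations of a hypothetical first-order type Active1. The first group contains signatures only, as they would appear in a header; the second group repeats the signatures with bodies, as they would appear in the source file. The type and the single-file layout are assumptions made to keep the example self-contained; in Fortran the same three cases would be module procedures with distinct names collected under one generic interface.

```cpp
#include <cassert>

// Hypothetical first-order active type; illustrative only.
struct Active1 {
    double v;  // value
    double d;  // first-order derivative
};

// Declaration group: same operator name, distinguished by argument types.
Active1 operator+(const Active1& a, const Active1& b);
Active1 operator+(const Active1& a, double b);
Active1 operator+(double a, const Active1& b);

// Definition group: identical signatures, now with implementation bodies.
Active1 operator+(const Active1& a, const Active1& b) {
    return {a.v + b.v, a.d + b.d};
}

Active1 operator+(const Active1& a, double b) {
    return {a.v + b, a.d};  // the passive operand contributes no derivative
}

Active1 operator+(double a, const Active1& b) {
    return {a + b.v, b.d};
}
```

The compiler selects the overload from the argument types, so no manual name mangling is required in C++.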

Elementary operations

With a few exceptions the uses of elementary operations φ are identical in Fortran and C++.
One such exception is φ ≡ a^b, in Fortran expressed as the operator a**b and in C++ as an intrinsic
function pow(a,b). The respective printer classes ensure the proper representation. Another
set of differences relates to the combinations of permitted argument types. While the generator
produces all combinations of arguments, the printer classes filter out all invalid cases. In the same
spirit as the limited generalization of the AST explained in Section 2.1, we view this pragmatic
solution as a good compromise, given the limited scope of the generator.
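
For the power operation, the C++ side of the argument combinations can be sketched as three overloads of pow on a hypothetical first-order type Active1. This is a hand-written illustration restricted to first order, not the generated Rapsodia code. The derivative rules used are the standard ones: for an active base, b a^(b-1) a'; for an active exponent, a^b ln(a) b', which assumes a positive base.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical first-order active type; illustrative only.
struct Active1 {
    double v;  // value
    double d;  // first-order derivative
};

// active base, passive exponent: (a^b)' = b * a^(b-1) * a'
Active1 pow(const Active1& a, double b) {
    return {std::pow(a.v, b), b * std::pow(a.v, b - 1.0) * a.d};
}

// passive base, active exponent: (a^b)' = a^b * ln(a) * b', for a > 0
Active1 pow(double a, const Active1& b) {
    assert(a > 0.0);
    const double r = std::pow(a, b.v);
    return {r, r * std::log(a) * b.d};
}

// active base and active exponent: sum of both contributions, for a > 0
Active1 pow(const Active1& a, const Active1& b) {
    assert(a.v > 0.0);
    const double r = std::pow(a.v, b.v);
    return {r, b.v * std::pow(a.v, b.v - 1.0) * a.d + r * std::log(a.v) * b.d};
}
```

With x = 2 and derivative 1, pow(x, 3.0) yields a value close to 8 and a derivative close to 12. A Fortran printer would emit the same three cases as specific procedures bound to operator(**).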
