
Introduction to

Algorithmic (or Automatic) Differentiation


for Fast Gradient Calculations

Jeff Reinbolt

EML 5595
September 26, 2005
What exactly is AD?
• Automatic Differentiation (AD) is a technique for
accurate and efficient computation of derivatives for
functions defined by computer programs
• AD exploits the fact that every computer program, no
matter how complicated, executes a sequence of
elementary operations and functions (e.g., +, *, sin)
• AD is based on local partial derivatives and repeated
applications of the chain rule
• Approximately 200 researchers worldwide, 6
workshops, 8 short courses, 19 software tools, and
“the book” by Andreas Griewank
• Internet site: http://www.autodiff.org

Introduction
Why do we care about AD?
• The role of derivatives
– Sensitivity analysis
– Error estimation and propagation
– Model parameter identification
– Simulation-based optimization
• Computation of derivatives
Method | Pro | Con | Puzzle
Symbolic Differentiation | Pretty | Messy | Cost?
Hand-coded Derivatives | Fast | Tedious | Correct?
Finite Divided Differences | Simple | Inexact | Step size?
Automatic Differentiation | Easy | Unknown | Implement?

Motivation
AD vs. Finite Difference?
• Finite Difference
  – Approximate
  – Dependent on step size
  – Multiple runs (one variable at a time)
  – Slow
• Automatic Differentiation
  – Analytic (exact)
  – Independent of step size
  – Complete Jacobian in a single run
  – Fast

Matlab Jacobian Example (n = 1000)

F(1) = 3x_1 − 2x_1^2 − 2x_2 + 1
F(i) = 3x_i − 2x_i^2 − x_{i−1} − 2x_{i+1} + 1,  i = 2, …, n−1
F(n) = 3x_n − 2x_n^2 − x_{n−1} + 1
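
A minimal Python sketch of this test problem (the slide's original used Matlab): the function and its hand-derived tridiagonal Jacobian, with the names F, jacobian, and the starting point chosen here only for illustration.

```python
import numpy as np

def F(x):
    """The tridiagonal test function from the slide, vectorized."""
    f = 3 * x - 2 * x**2 + 1
    f[1:] -= x[:-1]        # the -x_{i-1} term for rows 2..n
    f[:-1] -= 2 * x[1:]    # the -2 x_{i+1} term for rows 1..n-1
    return f

def jacobian(x):
    """Hand-derived Jacobian: dF(i)/dx_i = 3 - 4 x_i,
    dF(i)/dx_{i-1} = -1, dF(i)/dx_{i+1} = -2 (tridiagonal)."""
    n = len(x)
    J = np.zeros((n, n))
    np.fill_diagonal(J, 3 - 4 * x)
    J[np.arange(1, n), np.arange(n - 1)] = -1.0   # subdiagonal
    J[np.arange(n - 1), np.arange(1, n)] = -2.0   # superdiagonal
    return J

x = np.full(1000, -1.0)                # an example starting point
print(F(x)[:3], jacobian(x)[0, :3])
```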

Motivation
Chain Rule
If y changes dy/du times as fast as u, and u changes
du/dx times as fast as x, then y changes (dy/du)(du/dx)
times as fast as x.
If y = f (u ) and u = g ( x), then y = f ( g ( x))
dy/dx = (dy/du) ⋅ (du/dx)

d/dx [f(g(x))] = f′(g(x)) ⋅ g′(x)

For example:
d/dx [sin(2x)] = 2 cos(2x)
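
A quick numeric sanity check of this result, comparing the chain-rule derivative with a central difference (illustrative Python):

```python
import math

x, h = 0.7, 1e-6
exact = 2 * math.cos(2 * x)                       # chain rule: d/dx sin(2x)
approx = (math.sin(2 * (x + h)) - math.sin(2 * (x - h))) / (2 * h)
print(exact, approx)                              # agree to ~10 digits
```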

Math Review
AD Modes
• Forward mode
  – Chain rule applied from beginning to end of program
  – Better if # independent < # dependent variables
  – Requires minimal storage and recomputation of intermediate data
• Reverse mode
  – Chain rule applied from end to beginning of program
  – Better if # independent > # dependent variables
  – Requires management of storage and recomputation of intermediate data

Basics of AD
Baby Example
• Given: y(x1, x2) = [sin(x1/x2) + x1/x2 − exp(x2)] ∗ [x1/x2 − exp(x2)]
• Find: y(x1 = 1.5, x2 = 0.5) = ?
• Evaluation Trace:
v−1 = x1 = 1.5000
v0 = x2 = 0.5000
v1 = v−1/v0 = 1.5000/0.5000 = 3.0000
v2 = sin(v1) = sin(3.0000) = 0.1411
v3 = exp(v0) = exp(0.5000) = 1.6487
v4 = v1 − v3 = 3.0000 − 1.6487 = 1.3513
v5 = v2 + v4 = 0.1411 + 1.3513 = 1.4924
v6 = v5 ∗ v4 = 1.4924 ∗ 1.3513 = 2.0167
y = v6 = 2.0167
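
The trace maps one-to-one onto straight-line code; a Python transcription for comparison:

```python
import math

x1, x2 = 1.5, 0.5
v1 = x1 / x2            # 3.0000
v2 = math.sin(v1)       # 0.1411
v3 = math.exp(x2)       # 1.6487
v4 = v1 - v3            # 1.3513
v5 = v2 + v4            # 1.4924
y  = v5 * v4            # 2.0167
```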

Basics of AD
Forward Mode by Baby Example
• Find: ∂y/∂x1 = ?  Using: v̇i = ∂vi/∂x1
• Evaluation Trace:
v−1 = x1 = 1.5000
v̇−1 = ∂v−1/∂x1 = ẋ1 = 1.0000
v0 = x2 = 0.5000
v̇0 = ∂v0/∂x1 = ẋ2 = 0.0000
v1 = v−1/v0 = 1.5000/0.5000 = 3.0000
v̇1 = ∂(v−1/v0)/∂x1 = (v̇−1 − v1 ∗ v̇0)/v0 = 1.0000/0.5000 = 2.0000
v2 = sin(v1) = sin(3.0000) = 0.1411
v̇2 = ∂[sin(v1)]/∂x1 = cos(v1) ∗ v̇1 = −0.9900 ∗ 2.0000 = −1.9800
v3 = exp(v0) = exp(0.5000) = 1.6487
v̇3 = ∂[exp(v0)]/∂x1 = exp(v0) ∗ v̇0 = v3 ∗ v̇0 = 1.6487 ∗ 0.0000 = 0.0000
v4 = v1 − v3 = 3.0000 − 1.6487 = 1.3513
v̇4 = ∂(v1 − v3)/∂x1 = v̇1 − v̇3 = 2.0000 − 0.0000 = 2.0000
v5 = v2 + v4 = 0.1411 + 1.3513 = 1.4924
v̇5 = ∂(v2 + v4)/∂x1 = v̇2 + v̇4 = −1.9800 + 2.0000 = 0.0200
v6 = v5 ∗ v4 = 1.4924 ∗ 1.3513 = 2.0167
v̇6 = ∂(v5 ∗ v4)/∂x1 = v̇5 ∗ v4 + v5 ∗ v̇4 = 0.0200 ∗ 1.3513 + 1.4924 ∗ 2.0000 = 3.0118
y = v6 = 2.0167
ẏ = ∂v6/∂x1 = v̇6 = 3.0118

Basics of AD
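
In code, forward mode simply carries each v̇ alongside its v; a sketch of the same sweep with the seed ẋ1 = 1, ẋ2 = 0 (Python, variable names illustrative):

```python
import math

x1, x2 = 1.5, 0.5
dx1, dx2 = 1.0, 0.0                  # seed: differentiate w.r.t. x1

v1  = x1 / x2
dv1 = (dx1 - v1 * dx2) / x2          # quotient rule      -> 2.0000
v2  = math.sin(v1)
dv2 = math.cos(v1) * dv1             #                    -> -1.9800
v3  = math.exp(x2)
dv3 = v3 * dx2                       #                    -> 0.0000
v4  = v1 - v3
dv4 = dv1 - dv3                      #                    -> 2.0000
v5  = v2 + v4
dv5 = dv2 + dv4                      #                    -> 0.0200
y   = v5 * v4
dy  = dv5 * v4 + v5 * dv4            # product rule       -> 3.0118
print(y, dy)
```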
Reverse Mode by Baby Example
• Find: ∂y/∂x1 = ? and ∂y/∂x2 = ?  Using: v̄i = ∂y/∂vi
• Evaluation Trace:
v−1 = x1 = 1.5000
v0 = x2 = 0.5000
v1 = v−1/v0 = 1.5000/0.5000 = 3.0000
v2 = sin(v1) = sin(3.0000) = 0.1411
v3 = exp(v0) = exp(0.5000) = 1.6487
v4 = v1 − v3 = 3.0000 − 1.6487 = 1.3513
v5 = v2 + v4 = 0.1411 + 1.3513 = 1.4924
v6 = v5 ∗ v4 = 1.4924 ∗ 1.3513 = 2.0167
y = v6 = 2.0167
v̄6 = ∂y/∂v6 = 1.0000
v̄5 = v̄6 ∗ (∂v6/∂v5) = v̄6 ∗ v4 = 1.0000 ∗ 1.3513 = 1.3513
v̄4 = v̄6 ∗ (∂v6/∂v4) = v̄6 ∗ v5 = 1.0000 ∗ 1.4924 = 1.4924
v̄4 = v̄4 + v̄5 ∗ (∂v5/∂v4) = 1.4924 + 1.3513 ∗ 1.0000 = 2.8437 (v4 also feeds v5)
v̄2 = v̄5 ∗ (∂v5/∂v2) = 1.3513 ∗ 1.0000 = 1.3513
v̄3 = v̄4 ∗ (∂v4/∂v3) = 2.8437 ∗ (−1.0000) = −2.8437
v̄1 = v̄4 ∗ (∂v4/∂v1) = 2.8437 ∗ 1.0000 = 2.8437
v̄0 = v̄3 ∗ (∂v3/∂v0) = v̄3 ∗ v3 = −2.8437 ∗ 1.6487 = −4.6884
v̄1 = v̄1 + v̄2 ∗ (∂v2/∂v1) = 2.8437 + 1.3513 ∗ cos(3.0000) = 1.5059 (v1 also feeds v2)
v̄0 = v̄0 + v̄1 ∗ (∂v1/∂v0) = −4.6884 + 1.5059 ∗ (−v1/v0) = −13.7239 (v0 also feeds v1)
v̄−1 = v̄1 ∗ (∂v1/∂v−1) = v̄1/v0 = 1.5059/0.5000 = 3.0118
x̄2 = ∂y/∂x2 = v̄0 = −13.7239
x̄1 = ∂y/∂x1 = v̄−1 = 3.0118

Basics of AD
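
A sketch of the same computation in Python: one forward sweep to record values, then one reverse sweep accumulating adjoints (the names b6 … bm1 stand in for the v̄'s):

```python
import math

# Forward sweep: evaluate and record the intermediate values
x1, x2 = 1.5, 0.5
v1 = x1 / x2; v2 = math.sin(v1); v3 = math.exp(x2)
v4 = v1 - v3; v5 = v2 + v4; y = v5 * v4

# Reverse sweep: b_i is the adjoint dy/dv_i
b6 = 1.0                        # dy/dy
b5 = b6 * v4                    # y = v5 * v4
b4 = b6 * v5 + b5               # v4 feeds both the product and v5
b2 = b5                         # v5 = v2 + v4
b3 = -b4                        # v4 = v1 - v3
b1 = b4 + b2 * math.cos(v1)     # v1 feeds v4 and v2 (v2 = sin(v1))
b0 = b3 * v3 - b1 * v1 / x2     # x2 feeds v3 (exp) and v1 (v1 = x1/x2)
bm1 = b1 / x2                   # x1 feeds v1 only

print(bm1, b0)                  # dy/dx1 = 3.0118, dy/dx2 = -13.7239
```

Both inputs fall out of a single backward pass, which is why reverse mode wins when there are many independent variables and few dependent ones.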
AD Implementation
• Source code transformation
– Input is code with function evaluation
– Output is function evaluation and desired
derivative computations using new code
– Preprocessor generates new algorithm
• Overloading of arithmetic operators (see the sketch below)
– Input is code with function evaluation
– Output is function evaluation and desired
derivative computations using “original” code with
overloaded operators
– Code is linked with libraries including overloaded
versions of intrinsic functions
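
As an illustration of the overloading approach (a toy Python class, not one of the tools listed on the next slide): a value/derivative pair whose overloaded operators apply the chain rule while the "original" expression executes unchanged.

```python
import math

class Dual:
    """A value/derivative pair; each overloaded operator applies the chain rule."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __sub__(self, other):
        return Dual(self.val - other.val, self.dot - other.dot)
    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    def __truediv__(self, other):
        q = self.val / other.val
        return Dual(q, (self.dot - q * other.dot) / other.val)

# Overloaded versions of the intrinsics used by the baby example
def sin(d): return Dual(math.sin(d.val), math.cos(d.val) * d.dot)
def exp(d): return Dual(math.exp(d.val), math.exp(d.val) * d.dot)

# Seed dot = 1 on x1 to get dy/dx1; the expression itself is unchanged
x1, x2 = Dual(1.5, 1.0), Dual(0.5, 0.0)
r = x1 / x2
y = (sin(r) + r - exp(x2)) * (r - exp(x2))
print(y.val, y.dot)   # 2.0167, 3.0118
```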

Basics of AD
AD Tools
• ADIC (C/C++)
• ADIFOR (Fortran77)
• ADiMat (Matlab)
• ADOL-C (C/C++)
• AUTO_DERIV (F77, F95)
• COSY INFINITY (F77, F95, C/C++)
• CppAD (C/C++)
• FADBAD/TADIFF (C/C++)
• FFADLib (C/C++)
• GRESS (Fortran77)
• NAGWare (F77, F95)
• OpenAD (F77, F95, C/C++)
• PCOMP (Fortran77)
• TAF (F77, F95)
• TAMC (Fortran77)
• TAPENADE (F77, F95)
• TOMLAB/MAD (Matlab)
• Treeverse/Revolve (F77, F95, C/C++)
• YAO (C/C++)

AD Software
Summary
• A wide variety of problems require the computation of
derivatives
• There are several different ways of finding derivatives
• AD accurately and efficiently computes derivatives of
computer programs by combining local partial derivatives
with the chain rule
• AD can be implemented using the forward or reverse
mode
• AD can be implemented using source code
transformation or operator overloading
• There are several software tools available for AD

Conclusion
What is AD?
Basic Enhancement Process

[Diagram: Original code + user-specified variables → Adifor preprocessor → Derivative code]
How it works
• AD computes analytic derivatives via symbolic differentiation
• Applies the chain rule to compute derivatives of outputs w.r.t. inputs

L = ε(σ(T)) ⋅ L°(T)
∂L/∂T = (∂ε/∂σ)(∂σ/∂T) ⋅ L°(T) + ε(σ(T)) ⋅ ∂L°/∂T

• Follows loops, conditional statements, subroutines, common blocks, etc.
• Can create the entire sensitivity matrix (Jacobian) in a single run of the code
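
With placeholder choices for ε, σ, and L° (assumptions for illustration only; the slide does not specify them), the rule above can be checked against a finite difference:

```python
import math

# Hypothetical model pieces, chosen only to exercise the chain/product rule
sigma = lambda T: T**2                # stress as a function of T
eps   = lambda s: math.sin(s)         # strain as a function of stress
L0    = lambda T: 2.0 + T             # reference length

dsigma_dT = lambda T: 2 * T
deps_dsig = lambda s: math.cos(s)
dL0_dT    = lambda T: 1.0

def dL_dT(T):
    # dL/dT = (de/ds)(ds/dT) L0(T) + e(s(T)) dL0/dT
    s = sigma(T)
    return deps_dsig(s) * dsigma_dT(T) * L0(T) + eps(s) * dL0_dT(T)

L = lambda T: eps(sigma(T)) * L0(T)
T, h = 1.3, 1e-6
print(dL_dT(T), (L(T + h) - L(T - h)) / (2 * h))   # should agree closely
```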
Automatic Differentiation (AD)
vs. Finite Differences (FD)

[Figure: the exact derivative ∂y/∂x = lim(Δx→0) Δy/Δx versus the finite-difference slope Δy/Δx taken over a finite step Δx]
AD | FD
Derivatives are analytic (exact) | Derivatives are approximate
Independent of step size | Dependent on step size
Complete Jacobian with a single execution | Multiple runs (one variable at a time)
Computationally more efficient | 15-30 times slower than AD
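
The step-size row is easy to demonstrate; a short Python experiment with a forward difference for f(x) = sin(2x):

```python
import math

f = lambda x: math.sin(2 * x)
exact = 2 * math.cos(2 * 1.0)             # analytic derivative at x = 1

for h in (1e-2, 1e-5, 1e-8, 1e-11, 1e-14):
    fd = (f(1.0 + h) - f(1.0)) / h        # forward difference
    print(f"h = {h:.0e}   error = {abs(fd - exact):.2e}")
# The error first shrinks (truncation) and then grows again (round-off):
# there is no universally safe step size.
```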

• Historically, AD-enhanced codes have been difficult to create


