Martins MDO Course Slides PDF

A Short Course on
Multidisciplinary Design Optimization
Joaquim R. R. A. Martins
Multidisciplinary Design Optimization Laboratory
http://mdolab.engin.umich.edu
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 1 / 427

Contents 1
1. Introduction
1.1 About
1.2 Aircraft as Multidisciplinary Systems
1.3 Design Optimization
1.4 Optimization Problem Statement
1.6 Classification of Optimization Problems
1.7 History
2. Line Search Techniques
2.1 Motivation
2.2 Optimality
2.3 Numerics
2.4 Method of Bisection
2.5 Newton’s Method
2.6 Secant Method
2.7 Golden Section Search
2.8 Polynomial Interpolation
2.9 Line Search
Contents 2
3. Gradient-Based Optimization
3.1 Introduction
3.2 Gradients and Hessians
3.3 Optimality Conditions
3.4 Steepest Descent
3.5 Conjugate Gradient
3.7 Quasi-Newton Methods
3.8 Trust Region Methods
4. Computing Derivatives
4.1 Introduction
4.2 Finite Differences
4.3 Complex-Step Method
4.4 C/C++ Implementations
4.5 Unifying Chain Rule
4.6 The Unifying Chain Rule
4.7 Monolithic Differentiation

Contents 3
4.8 Algorithmic Differentiation
4.9 Analytic Methods
5. Constrained Optimization
5.1 Introduction
5.2 Equality Constraints
5.3 Inequality Constraints
5.4 Constraint Qualification
5.5 Penalty Methods
5.6 Sequential Quadratic Programming
6. Gradient-Free Optimization
6.1 Introduction
6.2 Nelder–Mead Simplex
6.3 DIvided RECTangles (DIRECT)
6.4 Genetic Algorithms
6.5 Particle Swarm Optimization
7. Multidisciplinary Design Optimization
7.1 Introduction
Contents 4
7.2 Multidisciplinary Analysis
7.3 Extended Design Structure Matrix
7.4 Monolithic Architectures
Multidisciplinary Feasible (MDF)
Individual Discipline Feasible (IDF)
Simultaneous Analysis and Design (SAND)
The All-at-Once (AAO) Problem Statement
7.5 Distributed Architectures
Classification
7.6 Computing Coupled Derivatives

Introduction
Introduction
1. Introduction
1.1 About
1.2 Aircraft as Multidisciplinary Systems
1.3 Design Optimization
1.6 Classification of Optimization Problems
1.7 History

Introduction About
About Me
Bio
I 1991–1995: M.Eng. in Aeronautics, Imperial College, London
I 1996–2002: M.Sc. and Ph.D. in Aeronautics and Astronautics, Stanford
I 2002–2009: Assistant/Associate Prof., University of Toronto Inst. for
Aerospace Studies
I 2009– : Associate Prof., University of Michigan, Dept. of Aerospace Eng.
Highlights
I Two best papers at the AIAA MA&O Conference (2002, 2006)
I Canada Research Chair in Multidisciplinary Optimization (2002–2009)
I Keynote speaker at the International Forum on Aeroelasticity and Structural
Dynamics (Stockholm, 2007)
I Keynote speaker at the Aircraft Structural Design Conference (London, 2010)
I Associate editor for the AIAA Journal and Optimization and Engineering

Introduction About
About You
I Name
I Title and responsibilities
I Why are you taking this course?
I What do you hope to get from this course?

Introduction About
About the Course

I Introduction to MDO applications and advanced topics
I Assumes no previous knowledge of optimization
I Requires knowledge of multivariable calculus and linear algebra
I Please interrupt!
I Questions
I Share your experience with design and optimization

Introduction About
About the Course Materials

I I will use slides to teach, but please refer to course notes as well.
I Notes include a lot of detail, but if you want more, check the references: I
cite almost 300 papers and books.
I Notes are optimized for electronic reading with hyperlinks.
I History of the notes and slides:
I I originally created the notes in the form of slides in 2003, because I wanted to
cover a range of material in a particular way
I Colleagues at Stanford have used these notes since, and I taught one of the
chapters MIT.
I I have recently separated the notes from the slides.
I Please email if you find any typos or have any suggestions.

Introduction About
Course Content
Introduction
Single
Variable
Minimization
Computing
Derivatives
Gradient-
MDO Based
Optimization
Handling
Constraints
Gradient-Free
Optimization
MDO
Architectures

Introduction Aircraft as Multidisciplinary Systems
Sir George Cayley

Wright Brother’s Flyer

Santos–Dumont’s Demoiselle

100 Years Later . . .

Multidisciplinary Trade-off for Supercritical Airfoils

Why you should not trust an aerodynamicist, even a brilliant one

Introduction Design Optimization
What is MDO?
I We will first cover the “DO” in MDO.
I In industry, problems routinely arise that require making the best possible
design decision.
I However, optimization is still underused in industry. . . Why?
I Numerical optimization and MDO still not part of most undergraduate and
graduate curricula
I Backlash due to “overselling” of numerical optimization
I Inertia in the industrial environment
I Aerospace is one of the leading applications of engineering design
optimization. Why?

Conventional vs. Optimal Design Process

Conventional Optimal
Baseline Baseline
Specifications Specifications
design design
Analyze or
experiment Analyze
Change Evaluate Evaluate

Change
design performance objective and
design
constraints
Is the design Is the design

No good? No optimal?
Yes Yes
Final design Final design

Multidisciplinary Design Optimization (MDO)

I Most modern engineering systems are multidisciplinary and their analysis is
often very complex, involving hundreds of computer programs, and many
people in different locations. This makes it difficult for companies to manage
the design process.
I In the early days, design teams tended to be small and were managed by a
single chief designer who knew most about the design details and could make
all the important decisions.
I Modern design projects are more complex and the problem has to be
decomposed and each part tackled by a different team. The way these teams
should interact is still being debated by managers, engineers and researchers.
I More in the last chapter . . .

Introduction Optimization Problem Statement
Objective Function
I What do we mean by “best”?
I Objective function is a “measure of badness” that enables us to compare two
designs quantitatively — assuming we want to minimize it.
I Need to be able to estimate this measure numerically.
I If we select the wrong goal, it doesn’t matter how good the analysis is, or
how efficient the optimization method is. Therefore, it’s important to select a
good objective function.
I Selecting a good objective function is often overlooked, and not an easy
problem, even for experienced designers.
I Objective function may be linear or nonlinear and may or not be given
explicitly.
I We will represent the objective function by the scalar f .
I There is no such thing as multiobjective optimization!

The “Disciplanes”
Is there one aircraft which is the fastest, most efficient, quietest, most
inexpensive?

Design Variables
I Design variables are also known as design parameters and are represented by
the vector x. They are the variables in the problem that we allow to vary in
the design process.
I Optimization is the process of choosing the design variables that yield an
optimum design.
I Design variables should be independent of each other.
I Design variables can be continuous or discrete. Discrete variables are
sometimes integer variables.

Constraints
I Few practical engineering optimizations problems are unconstrained.
I Constraints on the design variables are called bounds and are easy to enforce.
I Like the objective function, constraints can be linear or nonlinear and may or
not be given in an explicitly form. They may be equality or inequality
constraints.
I At a given design point, constraints may be active of inactive. This
distinction is particularly important at the optimum.

Optimization Problem Statement

Objective function, design variables, and constraints form the optimization
problem statement:
minimize f (x)
with respect to x ∈ Rn
subject to ĉj (x) = 0, j = 1, 2, . . . , m̂
ck (x) ≥ 0, k = 1, 2, . . . , m
f : objective function, output (e.g. structural weight).

x : vector of design variables, inputs (e.g. aerodynamic shape);
bounds can be set on these variables.
ĉ : vector of equality constraints (e.g. lift); in general these are
nonlinear functions of the design variables.
c : vector of inequality constraints (e.g. structural stresses), may also
be nonlinear and implicit.

Example: Trade-off Between Aerodynamics and Structures

Sequential MDF AS
10
Twist (degrees)
0
I Need a truly multidisciplinary objective
−5
Jigtwist
−10 Deflected
0 5 10 15 20 V L Wi
Spanwise Distance (m) Range = ln
0.06 c D Wf
Thickness (m)
0.05
0.04 I Sequential optimization does not lead to

0.03 the true optimum.
0.02
0 5 10
Spanwise Distance (m)
15 20
I Achieving the proper trade-off requires
4
x 10
5
Elliptical
simultaneous optimization
4
I More on this in the MDO chapter . . .
Lift (N)
1
0 5 10 15 20

Introduction Classification of Optimization Problems
Classification of Optimization Problems

Smooth
Discontinuous
Linear
Continuity
Linearity Nonlinear
Static
Continuous
Dynamic Quantitative
Discrete
Optimization
Design
Time Problem Variables
Classification
Qualitative
Deterministic
Data Constraints
Convexity Unconstrained
Stochastic
Constrained
Non-
Convex
Convex

Introduction Classification of Optimization Problems
Optimization Methods for Nonlinear Problems

Steepest
Descent
Gradient Conjugate
Based Gradient
Quasi-
Newton
Optimization
Methods Grid or
Random
Search
Genetic
Algorithms
Simulated
Annealing
Gradient Free
Nelder–
Meade
DIRECT
Particle
Swarm

Introduction History
Historical Developments in Optimization 1

300 bc: Euclid considers the minimal distance between a point a line, and
proves that a square has the greatest area among the rectangles
with given total length of edges.
200 bc: Zenodorus works on “Dido’s Problem”, which involved finding the
figure bounded by a line that has the maximum area for a given
perimeter.
100 bc: Heron proves that light travels between two points through the
path with shortest length when reflecting from a mirror, resulting in
an angle of reflection equal to the angle of incidence.
1615: Johannes Kepler finds the optimal dimensions of wine barrel. He
also formulated an early version of the “marriage problem” (a
classical application of dynamic programming also known as the
“secretary problem”) when he started to look for his second wife.
The problem involved maximizing a utility function based on the
balance of virtues and drawbacks of 11 candidates.


1621 W. van Royen Snell discovers the law of refraction. This law
follows the more general principle of least time (or Fermat’s
principle), which states that a ray of light going from one point to
another will follow the path that takes the least time.
1646: P. de Fermat shows that the gradient of a function is zero at the
extreme point the gradient of a function.
1695: Isaac Newton solves for the shape of a symmetrical body of
revolution that minimizes fluid drag using calculus of variations.
1696: Johann Bernoulli challenges all the mathematicians in the world to
find the path of a body subject to gravity that minimizes the travel
time between two points of different heights — the brachistochrone
problem. Bernoulli already had a solution that he kept secret. Five
mathematicians respond with solutions: Isaac Newton, Jakob
Bernoulli (Johann’s brother), Gottfried Leibniz, Ehrenfried Walther
von Tschirnhaus and Guillaume de l’Hôpital. Newton reportedly
started solving the problem as soon as he received it, did not sleep


that night and took almost 12 hours to solve it, sending back the
solution that same day.
1740: L. Euler’s publication begins the research on general theory of
calculus of variations.
1746: P. L. Maupertuis proposed the principle of least action, which
unifies various laws of physical motion. This is the precursor of the
variation principle of stationary action, which uses calculus of
variations and plays a central role in Lagrangian and Hamiltonian
classical mechanics.
1784: G. Monge investigates a combinatorial optimization problem known
as the transportation problem.
1805: Adrien Legendre describes the method of least squares, which was
used in the prediction of asteroid orbits and curve fitting. Frederich
Gauss publishes a rigorous mathematical foundation for the method
of least squares and claims he used to predict the orbit of the
asteroid Ceres in 1801. Legendre and Gauss engage in a bitter
dispute on who first developed the method.


1815: D. Ricardo publishes the law of diminishing returns for land
cultivation.
1847: A. L. Cauchy presents the steepest descent methods, the first
gradient-based method.
1857: J. W. Gibbs shows that chemical equilibrium is attained when the
energy is a minimum.
1902: Gyula Farkas presents and important lemma that is later used in
the proof of the Karush–Kuhn–Tucker theorem.
1917: H. Hancock publishes the first text book on optimization.
1932: K. Menger presents a general formulation of the traveling salesman
problem, one of the most intensively studied problems in
optimization.
1939: William Karush derives the necessary conditions for the inequality
constrained problem in his Masters thesis. Harold Kuhn and Albert
Tucker rediscover these conditions an publish their seminal paper in
1951. These became known as the Karush–Kuhn–Tucker (KKT)
conditions.

1939 Leonid Kantorovich develops a technique to solve linear
optimization problems after having given the task of optimizing
production in the Soviet government plywood industry.
1947: George Dantzig publishes the simplex algorithm. Dantzig, who
worked for the US Air Force, reinvented and developed linear
programming further to plan expenditures and returns in order to
reduce costs to the army and increase losses to the enemy in World
War II. The algorithm was kept secret until its publication.
1947: John von Neumann develops the theory of duality for linear
problems.
1949: The first international conference on optimization, the International
Symposium on Mathematical Programming, is held in Chicago.
1951: H. Markowitz presents his portfolio theory that is based on
quadratic optimization. He receives the Nobel memorial prize in
economics in 1990.


1954: L. R. Ford and D. R. Fulkerson research network problems,
founding the field of combinatorial optimization.
1957: R. Bellman presents the necessary optimality conditions for
dynamic programming problems. The Bellman equation was first
applied to engineering control theory, and subsequently became an
important principle in the development of economic theory.
1959: Davidon develops the first quasi-Newton method for solving
nonlinear optimization problems. Fletcher and Powell publish
further developments in 1963.
1960: Zoutendijk presents the methods of feasible directions to generalize
the Simplex method for nonlinear programs. Rosen, Wolfe, and
Powell develop similar ideas.
1963: Wilson invents the sequential quadratic programming method for
the first time. Han re-invents it in 1975 and Powell does the same
in 1977.


1975: Pironneau publishes the a seminal paper on aerodynamic shape
optimization, which first proposes the use of adjoint methods for
sensitivity analysis.
1975: John Holland proposed the first genetic algorithm.
1977: Raphael Haftka publishes one of the first multidisciplinary design
optimization (MDO) applications, in a paper entitled
“Optimization of flexible wing structures subject to strength and
induced drag constraints”.
1979: Kachiyan proposes the first polynomial time algorithm for linear
problems. The New York times publishes the front headline “A
Soviet Discovery Rocks World of Mathematics”, saying, “A surprise
discovery by an obscure Soviet mathematician has rocked the world
of mathematics and computer analysis . . . Apart from its profound
theoretical interest, the new discovery may be applicable in weather
prediction, complicated industrial processes, petroleum refining,
t.he scheduling of workers at large factories . . . the theory of secret
codes could eventually be affected by the Russian discovery, and


this fact has obvious importance to intelligence agencies
everywhere.” In 1975, Kantorovich and T.C. Koopmans get the
Nobel memorial price of economics for their contributions on linear
programming.
1984: Narendra Karmarkar starts the age of interior point methods by
proposing a more efficient algorithm for solving linear problems. In
a particular application in communications network optimization,
the solution time was reduced from weeks to days, enabling faster
business and policy decisions. Karmarkar’s algorithm stimulated the
development of several other interior point methods, some of which
are used in current codes for solving linear programs.
1985: The first conference in MDO, the Multidisciplinary Analysis and
Optimization (MA&O) conference, takes place.
1988: Jameson develops adjoint-based aerodynamic shape optimization
for computational fluid dynamics (CFD).
1995: Kennedy and Eberhart propose the particle swarm optimization
algorithm.

Line Search Techniques
Line Search Techniques

1. Introduction

2.1 Motivation
2.2 Optimality
2.3 Numerics
2.4 Method of Bisection
2.6 Secant Method
2.7 Golden Section Search
2.8 Polynomial Interpolation
2.9 Line Search

Line Search Techniques Motivation
Single Variable Minimization — Motivation

x0
I Gradient-based optimization with respect to

Search multiple variables requires a line search
direction
I Not necessary (or advisable) to find the exact
minimum in a line search
I Desired:
Update x Line search
I Low computational cost (few iterations and low
cost per iteration)
I Low memory requirements
I Low failure rate
Is x a
I Computational effort other dominated by the
No minimum? evaluation of objectives, constraints, and their
gradients
Yes
x∗

Line Search Techniques Optimality
Classification of Minima
We can classify a minimum as a:
1. Strong local minimum
2. Weak local minimum
3. Global minimum

Optimality Conditions 1
Taylor’s theorem is the key for identifying local minima
1 1 1
f (x+h) = f (x)+hf 0 (x)+ h2 f 00 (x)+· · ·+ hn−1 f n−1 (x)+ hn f n (x + θh)
2 (n − 1)! |n! {z }
O(hn )
Assume that f is twice-continuously differentiable and that a minimum of f exists

at x∗ . Using n = 2 and x = x∗ ,
1
f (x∗ + ε) = f (x∗ ) + εf 0 (x∗ ) + ε2 f 00 (x∗ + θε)
2
For x∗ to be a local minimizer, we require that f (x∗ + ε) ≥ f (x∗ ) for ε ∈ [−δ, δ].
Therefore we require
1
εf 0 (x∗ ) + ε2 f 00 (x∗ + θε) ≥ 0
2
εf 0 (x∗ ) ≥ 0 ⇒ f 0 (x∗ ) = 0, since the sign of ε is arbitrary. This is the
first-order optimality condition. A point that satisfies the first-order optimality
condition is a stationary point.
Optimality Conditions 2
Since f 0 (x∗ ) = 0, we have to consider the second derivative term.
This term must be non-negative for a local minimum at x∗ .
Since ε2 > 0, then f 00 (x∗ ) ≥ 0. This is the second-order optimality condition.
Thus the necessary conditions for a local minimum are:
f 0 (x∗ ) = 0
f 00 (x∗ ) ≥ 0
We have a strong local minimum if
f 0 (x∗ ) = 0
f 00 (x∗ ) > 0
which are sufficient conditions

What use are the optimality conditions?

The optimality conditions can be used to:
1. Verify that a point is a minimum (sufficient conditions).
2. Realize that a point is not a minimum (necessary conditions).
3. Define equations that can be solved to find a minimum.
Gradient-based minimization methods find local minima by finding points that
satisfy the optimality conditions.

Line Search Techniques Numerics
Numerical Precision
I Finding x∗ such that f 0 (x∗ ) = 0, is equivalent to finding the roots of the first
derivative of the function to be minimized.
I Therefore, root finding methods can be used to find stationary points and are
useful in function minimization.
I Using machine precision, it is not possible find the exact zero, so we will be
satisfied with finding an x∗ that belongs to an interval [a, b] such that the
function g satisfies
g(a)g(b) < 0 and |a − b| < ε
where ε is a small tolerance.

I This tolerance is be dictated by:
I Finite precision arithmetic (for double precision this is usually 1 × 10−16 )
I Precision of the function evaluation
I Limit on number of iterations we can afford to do

Convergence Rate 1
Two questions are important when considering an optimization algorithm:
I Does it converge?
I How fast does it converge?
Suppose you we have a sequence of points xk (k = 1, 2, . . .) converging to a

solution x∗ . For a convergent sequence, we have
lim xk − x∗ = 0
k→∞
The rate of convergence is a measure of how fast an iterative method converges

to the numerical solution. An iterative method is said to converge with order r
when r is the largest positive number such that
kxk+1 − x∗ k
0 ≤ lim < ∞.
k→∞ kxk − x∗ kr
For a sequence with convergence rate r, asymptotic error constant, γ is

kxk+1 − x∗ k
γ = lim .
k→∞ kxk − x∗ kr
Convergence Rate 2
Assume ideal convergence behavior so that the above condition and we do not
have to take the limit. Then,
kxk+1 − x∗ k = γkxk − x∗ kr for all k.
The larger r is, the faster the convergence:

I If r = 1, we have linear convergence, and kxk+1 − x∗ k = γkxk − x∗ k.
Convergence varies widely depending on γ:
I If γ ∈ (0, 1) then norm of error decreases by a constant factor for every
iteration.
I If γ = 0 when r = 1, we have a special case: superlinear convergence.
I If γ = 1, we have sublinear convergence.
I If γ > 1, the sequence diverges.
I If r = 2 we have quadratic convergence. Highly desirable, since convergence
is rapid and independent of γ. For example, if γ = 1 and the initial error is
kx0 − x∗ k = 10−1 , then the sequence of errors will be
10−1 , 10−2 , 10−4 , 10−8 , 10−16 , i.e., the number of digits doubles every
iteration: double precision in four iterations!

Convergence Rate 3
In general, x is an n-vector and we have to rethink the definition of the error.
I We could use, for example, ||xk − x∗ ||.
I But this depends on the scaling of x, so we should normalize it,
||xk − x∗ ||
.
||xk ||
I And . . . xk might be zero, so fix this,
||xk − x∗ ||
.
1 + ||xk ||
I And . . . gradients might be large. Thus, we should use a combined quantity,
||xk − x∗ || |f (xk ) − f (x∗ )|

+
1 + ||xk || 1 + |f (xk )|

Convergence Rate 4
I A final issue: x∗ is usually not known! You can monitor the progress of your
algorithm using the steps,
||xk+1 − xk || |f (xk+1 ) − f (xk )|

+ .
1 + ||xk || 1 + |f (xk )|
Sometimes, you might just use the second fraction in the above term, or the
norm of the gradient. You should plot these quantities in a log-axis versus k.

Line Search Techniques Method of Bisection
Method of Bisection
I Bisection is a bracketing method: it generates a set of nested intervals and
requires an initial interval where is is assumed a solution exists.
I First we find a bracket [x1 , x2 ] such that f (x1 )f (x2 ) < 0
I For an initial interval [x1 , x2 ], bisection yields the following interval at
iteration k,
x1 − x2
δk =
2k
I To achieve a specified tolerance ε, we need log2 (x1 − x2 )/δ evaluations.
I From the definition of rate of convergence, for r = 1,
δk+1 1
lim = =
k→∞ δk 2
I Converges linearly with asymptotic error constant γ = 1/2.
I To find the minimum of a function using bisection, we evaluate the derivative
of f at each iteration, and find a point for which f 0 = 0.

Line Search Techniques Newton’s Method
Newton’s Method
Newton’s method for finding a zero can be derived from the Taylor’s series
expansion about the current iteration xk ,
f (xk+1 ) = f (xk ) + (xk+1 − xk )f 0 (xk ) + O((xk+1 − xk )2 )
Ignoring the terms higher than order two and assuming the function next iteration
to be the root (i.e., f (xk+1 ) = 0), we obtain,
f (xk )
xk+1 = xk − .
f 0 (xk )
This iterative procedure converges quadratically, so
|xk+1 − x∗ |
lim = const.
k→∞ |xk − x∗ |2

9.4 Newton-Raphson Method Using Derivative 357

Newton Method for Root Finding
f (x)
http://www.nr.com or call 1-800-872-7423 (North America only), or

readable files (including this one) to any server computer, is strictly
Permission is granted for internet users to make one paper copy fo
2
3
x
Figure
J.R.R.A.9.4.1.
Martins Newton’s method extrapolates the local
Multidisciplinary Designderivative to find the next estimate of theAugust
Optimization root. 2012
In 49 / 427
ge University Press. Programs Copyright (C) 1986-1992 by Numerical Recipes Software.
ny server computer, is strictly prohibited. To order Numerical Recipes books or CDROMs, visit website
rs to make one paper copy for their own personal use. Further reproduction, or any copying of machine-
423 (North America only), or send email to directcustserv@cambridge.org (outside North America).
IPES IN FORTRAN 77: THE ART OF SCIENTIFIC COMPUTING (ISBN 0-521-43064-X)

Newton Method Failure Examples

Newton’s method is not guaranteed to converge, and only works under certain
9.4.1. Newton’s method extrapolates the local derivative to find the next estimate of the root. In
358 Chapter 9.
Root Finding and Nonlinear Sets of Equations
conditions.
xample it works well and converges quadratically.
f(x)
f(x)
2
3
1
x
x
9.4.2. Unfortunate case where Newton’s method encounters a local extremum and shoots off to
space. Here bracketing bounds, as in rtsafe, would save the day. Figure 9.4.3. Unfortunate case where Newton’s method enters a nonconvergent cycle. This
is often encountered when the function f is obtained, in whole or in part, by table interpolati
a better initial guess, the method would have succeeded.

Newton’s Method for Function Minimization

To minimize a function using Newton’s method, we substitute the function for its
first derivative and the first derivative by the second derivative,
f (xk ) f 0 (xk )
xk+1 = xk − → xk+1 = xk − .
f 0 (xk ) f 00 (xk )

Example: Newton’s Method Applied to Polynomial 1

Consider the following single-variable optimization problem
minimize f (x) = (x − 3)x3 (x − 6)4

w.r.t. x
Newton’s method starting from different initial points

Example: Newton’s Method Applied to Polynomial 2

Line Search Techniques Secant Method
Secant Method
I Newton’s method requires the first derivative for each iteration (and the
second derivative when applied to minimization).
I In some cases, it might not be easy to obtain these derivatives.
I If we use a forward-difference approximation for f 0 (xk ) in Newton’s method
we obtain
xk − xk−1
xk+1 = xk − f (xk ) .
f (xk ) − f (xk−1 )
which is the secant method.
I Also known as “the poor-man’s Newton method”.
I Under favorable conditions, this method has superlinear convergence
(1 < r < 2), with r ≈ 1.6180.

Line Search Techniques Golden Section Search
Golden Section Search 1

I The golden section method does not find roots, it finds minima.
I Starts with an interval that contains minimum and reduces the interval.
I Start with uncertainty interval [0, 1]. Need two evaluations in the interval to
reduce the size of the interval.
I We do not want to bias towards one side, so choose the points symmetrically:
0 1−τ τ 1
0 1−τ τ
1−τ τ 1
I If we evaluate two points such that the two next possible intervals are the
same size and one of the points is reused, we have a more efficient method.

Golden Section Search 2

I Mathematically,
τ 1−τ
= ⇒ τ2 + τ − 1 = 0
1 τ
The positive solution of this equation is the golden ratio,
√
5−1
τ= = 0.618033988749895 . . .
2
I We evaluate the function at 1 − τ and τ , and then the two possible intervals
are [0, τ ] and [1 − τ, 1], which have the same size. If, say [0, τ ] is selected,
then the next two interior points would be τ (1 − τ ) and τ τ . But τ 2 = 1 − τ ,
and we already have this point!
I The golden search convergence rate is linear.

Example: Golden Section Applied to Polynomial
I Converges to different optima, depending on the starting interval

I Might not converge to the best optimum within the starting interval
Line Search Techniques Polynomial Interpolation
Polynomial Interpolation 1
I Idea: use information about f gathered during iteration.
I One way of using this information is to produce an estimate of the function
which we can easily minimize.
I The lowest order function that we can use for this purpose is a quadratic,
since a linear function does not have a minimum.
I Suppose we approximate f by
1
f˜ = ax2 + bx + c.
2
I If a > 0, the minimum of this function is x∗ = −b/a.

Line Search Techniques Polynomial Interpolation
96 Chapter 10. Minimization or Maximization of Functions
Polynomial Interpolation 2
parabola through 1 2 3
parabola through 1 2 4
2
5 4

Line Search Techniques Line Search
Line Search Methods 1

I Line search methods, like single-variable optimization methods, minimize a
function of one variable
I But . . . line search is applied to a line in n-space and does not necessarily find
the exact minimum
I An important procedure in most gradient-based optimization methods
I For a search direction pk , we must decide the step length, i.e., αk in the
equation,
xk+1 = xk + αk pk
xk+1
pk+1
pk
gk+1
xk
gk

Line Search Methods 2

I Gradient-based algorithms find pk such that it is a descent direction, i.e.,
pTk gk < 0, since this guarantees that f can be reduced by stepping along this
direction.
I Want to compute a step length αk that yields a reduction in f , but we do
not want to spend too much computational effort in making the choice.
I Ideally, we would find the global minimum along the line, but this is usually
not worthwhile, as it requires many iterations.
I More practical methods perform an inexact line search that achieves
adequate reductions of f at reasonable cost.

Wolfe Conditions 1
I Typical line search tries a sequence of step lengths, accepting the first that
satisfies certain conditions.
I A common condition requires that αk should yield a sufficient decrease of f ,
f (xk + αpk ) ≤ f (xk ) + µ1 αgkT pk
for a a small value of µ1 , e.g., 10−4 .

I Any sufficiently small step can satisfy the sufficient decrease condition, since
the slope is negative at the start.
I To prevent steps that are too small, we use second requirement called the
curvature condition,
g(xk + αpk )T pk ≥ µ2 gkT pk
where µ1 ≤ µ2 ≤ 1, and g(xk + αpk )T pk is the derivative of f (xk + αpk )
with respect to αk .
I This condition requires that the slope of the function at the new point be
greater than the starting one by a certain amount.

Wolfe Conditions 2
I Since we start with a negative slope, the gradient at the new point must be
either less negative or positive.
I Typical values of µ2 range from 0.1 to 0.9.
I The sufficient decrease and curvature conditions are known collectively as the
Wolfe conditions.

Strong Wolfe Conditions 1

I We can modify the curvature condition to force αk to lie in a neighborhood
of a stationary point,
f (xk + αpk ) ≤ f (xk ) + µ1 αgkT pk .

g(xk + αpk )T pk ≤ µ2 gkT pk ,
where 0 < µ1 < µ2 < 1.

I This condition, together with the sufficient decrease condition, represents the
strong Wolfe conditions.
I The only difference when comparing with the Wolfe conditions is that we do
not allow points where the derivative has a positive value that is too large

Backtracking Algorithm
I One of the simplest line search techniques is backtracking.
I It only checks for the sufficient decrease.
I It is guaranteed to satisfy this condition . . . eventually.
Algorithm 1 Backtracking line search algorithm

1: Input: α > 0, 0 < ρ < 1 . Initial step length and step reduction ratio
2: Output: αk . Step length
3: repeat
4: α ← ρα . Step length reduction
5: until f (xk + αpk ) ≤ f (xk ) + µ1 αgkT pk . Sufficient decrease condition
6: αk ← α

Line Search Satisfying Strong Wolfe Conditions

I This procedure has two stages:
1. Begins with trial α1 , and keeps increasing it until it finds either an acceptable
step length or an interval that brackets the desired step lengths.
2. In the latter case, a second stage (the zoom algorithm below) is performed
that decreases the size of the interval until an acceptable step length is found.
I Define the univariate function φ(α) = f (xk + αpk ), so that φ(0) = f (xk ).
I The first stage that brackets the minimum is as follows . . .

Bracketing Stage Algorithm
1: Input: α1 > 0 and αmax

2: Output: α∗
3: α0 = 0
4: i←1
5: repeat
6: Evaluate φ(αi )
7: if [φ(αi ) > φ(0) + µ1 αi φ0 (0)] or [φ(αi ) > φ(αi−1 ) and i > 1] then
8: α∗ ← zoom(αi−1 , αi ) return α∗
9: end if
10: Evaluate φ0 (αi )
11: if |φ0 (αi )| ≤ −µ2 φ0 (0) then return α∗ ← αi
12: else if φ0 (αi ) ≥ 0 then
13: α∗ ← zoom(αi , αi−1 ) return α∗
14: else
15: Choose αi+1 such that αi < αi+1 < αmax
16: end if
17: i←i+1
18: until

Bracketing Stage Flow Chart

1. Choose
initial point
2. Evaluate function value at

point
3. Bracket interval
3. Does point satisfy sufficient between previous
NO
decrease? point and current
point
YES
4. Evaluate function derivative

at point
5. Does point satisfy the curvature 6. Is derivative 7. Choose new

NO positive? NO point beyond
condition?
current one
YES
YES
6. Bracket interval
between current
point and previous "zoom" function
point
Point is good
enough
End line Call "zoom" function to

search find good point in interval

Zoom Stage Algorithm
1: Input: αlow , αhigh

2: Output: α∗
3: j←0
4: repeat
5: Find a trial point αj between αlow and αhigh
6: Evaluate φ(αj )
7: if φ(αj ) > φ(0) + µ1 αj φ0 (0) or φ(αj ) > φ(αlow ) then
8: αhigh ← αj
9: else
10: Evaluate φ0 (αj )
11: if |φ0 (αj )| ≤ −µ2 φ0 (0) then
12: α∗ = αj return α∗
13: else if φ0 (αj )(αhigh − αlow ) ≥ 0 then
14: αhigh ← αlow
15: end if
16: αlow ← αj
17: end if
18: j ←j+1
19: until

Zoom Stage Flow Chart

1. Interpolate between the low value point
and high value point to find a trial point in
the interval
2. Evaluate function value at

trial point
3. Does trial point satisfy sufficient

decrease and is less or equal than low 3. Set point to be new high
NO
point value point
YES
4.1 Evaluate function derivative

at point
4.2 Does point satisfy the curvature 4.3 Does derivative sign at point 4.3 Replace high point with low
NO agree with interval trend? YES
condition? point
YES
Point is good NO
enough
4.4 Replace low point with trial
point
Exit zoom (end

line search)

Example: Strong Wolfe Algorithm Applied to Polynomial

Gradient-Based Optimization
1. Introduction
3.1 Introduction
3.2 Gradients and Hessians
3.3 Optimality Conditions
3.4 Steepest Descent
3.5 Conjugate Gradient
3.7 Quasi-Newton Methods
3.8 Trust Region Methods

Gradient-Based Optimization Introduction
Gradient-Based Optimization 1
I In previous chapter, described methods to decrease a function of one variable.
I Now, consider problems with multiple design variables
The unconstrained optimization problem is,
minimize f (x)
I x is the n-vector x = [x1 , x2 , . . . , xn ]T

I f can be nonlinear and must have continuous first derivatives, and in some
cases second derivatives

Gradient-Based Optimization 2
I Gradient-based methods use the gradient of the objective function to find the
most promising search directions
I For large numbers of design variables, gradient-based methods are more
efficient
I Assumptions and restrictions:
I No constraints (address these in later chapter)
I Smooth functions (gradient-free methods in later chapter)

General Gradient-Based Optimization Algorithm 1

x0
Search
direction Input: Initial guess, x0
Output: Optimum, x∗
k←0
while Not converged do
Update x Line search Compute a search direction pk
Find a step length αk , such that f (xk + αk pk ) <
f (xk ) (the curvature condition may also be included)
Update the design variables: xk+1 ← xk + αk pk
Is x a
No minimum?
k ←k+1
end while
Yes
x∗

General Gradient-Based Optimization Algorithm 2

I Iterations in “while” loop with index k are major iterations
I Iterations in the line search are minor iterations
I pk is the search direction for major iteration
I αk is the step length from the line search
I The two way a gradient-based algorithm determines the search direction is
the distinguishnig feature.
I Any line search that satisfies sufficient decrease can be used, but one that
satisfies the Strong Wolfe conditions is recommended.

Gradient-Based Optimization Gradients and Hessians
Gradients
Consider a function f (x). The gradient of this function is
 
∂f
 ∂x1 
 ∂f 
 
 

∇f (x) ≡ g(x) ≡  2 ∂x

.
 .. 
 
 ∂f 
∂xn
In the multivariate case, the gradient vector is perpendicular to the the hyperplane
tangent to the contour surfaces of constant f .

Hessians 1
I The second derivative of an n-variable function is defined by n2 partial
derivatives:
∂2f ∂2f
, i 6= jand , i = j.
∂xi ∂xj ∂x2i
I If the partial derivatives ∂f /∂xi , ∂f /∂xj and ∂ 2 f /∂xi ∂xj are continuous
and f is single valued, ∂ 2 f /∂xi ∂xj = ∂ 2 f /∂xj ∂xi .
I The second-order partial derivatives can be represented by a square
symmetric matrix called the Hessian matrix,
 
∂2f ∂2f
···
 ∂ 2 x1 ∂x1 ∂xn 
 .. .. 
∇ f (x) ≡ H(x) ≡ 
2
 . .


 ∂2f ∂ f 
2
··· ,
∂xn ∂x1 ∂ 2 xn
which contains n(n + 1)/2 independent elements.

Hessians 2
I If f is quadratic, the Hessian of f is constant, and the function can be
expressed as
1
f (x) = xT Hx + g T x + α.
2

Gradient-Based Optimization Optimality Conditions
Optimality Conditions
As in single-variable case, optimality conditions derived from the Taylor-series
expansion:
1
f (x∗ + εp) ≈ f (x∗ ) + εpT g(x∗ ) + ε2 pT H(x∗ )p,
2
where ε is a scalar, and p is an n-vector.
I For x∗ to be a local minimum, then
f (x∗ + εp) ≥ f (x∗ ) ⇒ f (x∗ + εp) − f (x∗ ) ≥ 0.
I This means that the sum of the first and second order terms in the
Taylor-series expansion must be greater than or equal to zero.
I Start with first order term: Since p is an arbitrary vector and ε can be positive
or negative, every component of the gradient vector g(x∗ ) must be zero.
I Second order term: For ε2 pT H(x∗ )p to be non-negative, H(x∗ ) has to be
positive semi-definite.

Relation of Hessian to Shape of Quadratic 1

Positive definite Indefinite
Positive semi-definite Negative definite

Relation of Hessian to Shape of Quadratic 2

Assuming H = H T , the Hessian can be classified as:
I Positive definite if pT Hp > 0 for all nonzero vectors p. All the eigenvalues of
H are strictly positive.
I Positive semi-definite if pT Hp ≥ 0 for all vectors p. All eigenvalues of H are
positive or zero.
I Indefinite if there exists p, q such that pT Hp > 0 and q T Hq < 0. H has
eigenvalues of mixed sign.
I Negative definite if pT Hp < 0 for all nonzero vectors p. All eigenvalues of H
are strictly negative.

Optimality Conditions
Necessary conditions (for a local minimum):
kg(x∗ )k = 0 and H(x∗ ) is positive semi-definite.
Sufficient conditions (for a strong local minimum):
kg(x∗ )k = 0 and H(x∗ ) is positive definite.

Example: Critical Points of a Function 1

Consider the function:
f (x) = 1.5x21 + x22 − 2x1 x2 + 2x31 + 0.5x41
Find all stationary points of f and classify them.

Solve ∇f (x) = 0, get three solutions:
(0, 0) local minimum

√ √
1/2(−3 − 7, −3 − 7) global minimum
√ √
1/2(−3 + 7, −3 + 7) saddle point
To establish the type of point:

1. Determine if the Hessian is positive definite.
2. Compare the values of the function at the points.

Example: Critical Points of a Function 2

Gradient-Based Optimization Steepest Descent
Steepest Descent Method 1

I Steepest descent method uses the negative of the gradient vector as the
search direction
I The gradient is the direction of steepest increase, so the opposite direction
give steepest decrease
Input: Initial guess, x0 , convergence tolerances, εg , εa and εr .

k←0
repeat
Compute the gradient of the objective function, g(xk ) ≡ ∇f (xk )
Compute the normalized search direction, pk ← −g(xk )/kg(xk )k
Perform line search to find step length αk
Update the current point, xk+1 ← xk + αk pk
k ←k+1
until |f (xk ) − f (xk−1 )| ≤ εa + εr |f (xk−1 )| and kg(xk−1 )k ≤ εg


I |f (xk+1 ) − f (xk )| ≤ εa + εr |f (xk )| is a check for the successive reductions
of f .
I εa is the absolute tolerance on the change in function value (usually ≈ 10−6 )
I εr is the relative tolerance (usually ≈ 10−2 ).
I If f is order 1, then εr dominates.
I If f gets too small, then the absolute error takes over.
There is a fundamental problem with steepest descent
I With exact line searchs, the steepest descent direction at each iteration is
orthogonal to the previous one,
df (xk+1 )
=0
dα
∂f (xk+1 ) ∂ (xk + αpk )
⇒ =0
∂xk+1 ∂α
⇒ ∇T f (xk+1 )pk = 0
⇒ −g T (xk+1 )g(xk ) = 0

I So the directions “zigzag”, which is inneficient.
I Rate of convergence is linear.
I Substantial decrease in the first few iterations, but then it is slow to converge.
I Guaranteed to converge, but may theoretically take an infinite number of
iterations.

Example: Minimization of Quadratic with Steepest

Descent
Consider this quadratic function of two variables,
f (x) = (1/2)(x21 + 10x22 )

Step-size Scaling
I Since steepest descent and other gradient methods that do not produce
well-scaled search directions, we need to use other information to guess a
step length.
I One strategy is to assume that the first-order change in xk will be the same
as the one obtained in the previous step. i.e, that
ᾱgkT pk = αk−1 gk−1

T
pk−1
and therefore:
T
gk−1 pk−1
ᾱ = αk−1 T
.
gk pk

Example: Steepest Descent 1

Consider the following function.
2 2
f (x1 , x2 ) = 1 − e−(10x1 +x2 ) .
The function f is not quadratic, but, as |x1 | and |x2 | → 0, we see that
f (x1 , x2 ) = 10x21 + x22 + O(x41 ) + O(x42 )
Thus, this function is essentially a quadratic near the minimum (0, 0)T .

Example: Steepest Descent 2

Gradient-Based Optimization Conjugate Gradient
Conjugate Gradient Method 1

I A small and simple modification to the steepest descent method results in
much improved convergence . . .
I . . . but it involves a lengthy derivation!
Suppose we want to minimize a convex quadratic function
1 T
φ(x) = x Ax − bT x
2
where A is an n × n matrix that is symmetric and positive definite. Differentiating
this with respect to x we obtain,
∇φ(x) = Ax − b ≡ r(x).
Minimizing the quadratic is thus equivalent to solving the linear system,
∇φ = 0 ⇒ Ax = b.
The conjugate gradient method is an iterative method for solving linear systems of
equations.

A set of nonzero vectors {p0 , p1 , . . . , pn−1 } is conjugate with respect to A if
pTi Apj = 0, for all i 6= j.
There is a simple interpretation of the conjugate directions.

I If A where diagonal, the isosurfaces would be ellipsoids with axes aligned
with coordinate directions . . .
I . . . in which case we could find the minimum by performing univariate
minimization along each coordinate direction in turn, converging in n
iterations.
I When A not diagonal, the contours are still elliptical, but they are not
aligned with the coordinate axes.
I Minimization along coordinate directions no longer leads to solution in n
iterations (or even a finite n).


I However, we can do a coordinate transformation to align the coordinate axis
with the ellipsoid axes
x̂ = S −1 x,
where S, is a matrix whose columns are the conjugate directions with respect
to A.
I The quadratic now becomes
1 T T T
φ̂(x̂) = x̂ S AS x̂ − S T b x̂
2

I By conjugacy, S T AS is diagonal so we can do a sequence of n line
minimizations along the coordinate directions of x̂. Each univariate
minimization determines a component of x∗ correctly.

Nonlinear Conjugate Gradient Method

When the conjugate-gradient method is adapted to general nonlinear problems,
we obtain the nonlinear conjugate-gradient method, also known as the
Fletcher–Reeves method.
Algorithm 2 Nonlinear conjugate gradient method

k←0
repeat
Compute the gradient of the objective function, g(xk )
if k=0 then
Compute the normalized steepest descent direction, pk ←
−g(xk )/kg(xk )k
else
g T gk
Compute β ← gT k gk−1
k−1
Compute the conjugate gradient direction pk = −gk /kg(xk )k + βk pk−1
end if
Perform line search to find step length αk
k ←k+1

Nonlinear Conjugate Gradient Method

I The only difference relative to the steepest descent is that the each descent
direction is modified by adding a contribution from the previous direction.
I The convergence rate of the nonlinear conjugate gradient is linear, but can be
superlinear, converging in n to 5n iterations.
I Needs to be restarted, usually after n iterations, or when the directions start
being far from orthogonal. Restart with a steepest descent direction.
I Does not produce well-scaled search directions, so we can use same strategy
to choose the initial step size as for steepest descent.
I Several variants exist. Most differ in their definition of βk . For example,
another alternative is
kgk k2
βk = .
(gk − gk−1 )T pk−1
Another variant is the Polak–Ribière formula
gkT (gk − gk−1 )
βk = T g
.
gk−1 k−1
I Since this method is just a minor modification away from steepest descent
and performs much better, there is no excuse for steepest descent!
Example: Conjugate Gradient Method in Action

Gradient-Based Optimization Newton’s Method
Newton’s Method 1
I Steepest descent and conjugate gradient methods only use first order
information to obtain a local model of the function.
I Newton methods use a second-order Taylor series expansion of the function
about the current design point
1
f (xk + sk ) ≈ fk + gkT sk + sTk Hk sk ,
2
where sk is the step to the minimum.
I Differentiating this with respect to sk and setting it to zero, we obtain
Hk sk = −gk .
This is a linear system which yields the Newton step, sk , as a solution.

I If f is a quadratic function and Hk is positive definite, Newton’s method
requires only one iteration to converge from any starting point.
I For a general nonlinear function, Newton’s method converges quadratically if
x0 is sufficiently close to x∗ and the Hessian is positive definite at x∗ .

Newton’s Method 2
I As in the single variable case, difficulties and even failure may occur when the
quadratic model is a poor approximation of f far from the current point.
I If Hk is not positive definite, the quadratic model might not have a minimum
or even a stationary point.
I So for some nonlinear functions, the Newton step might be such that
f (xk + sk ) > f (xk ) and the method is not guaranteed to converge.
I Another disadvantage of Newton’s method is the need to compute not only
the gradient, but also the Hessian, which contains n(n + 1)/2 second order
derivatives.

Modified Newton’s Method 1

A small modification to Newton’s method is to perform a line search along the
Newton direction, rather than accepting the step size that would minimize the
quadratic model.

k←0
repeat
Compute the Hessian of the objective function, H(xk )
Compute the search direction, pk = −H −1 gk
Perform line search to find step length αk , starting with α = 1
k ←k+1

Modified Newton’s Method 2

I Although this increases the probability that f (xk + pk ) < f (xk ), it is still
vulnerable to the problem of having an Hessian that is not positive definite.
I All the other disadvantages of the pure Newton’s method still apply.
I We could also use a symmetric positive definite matrix instead of the real
Hessian to ensure descent
Bk = Hk + γI,
where γ is chosen such that all eigenvalues of Bk are sufficiently positive.
I The starting step length ᾱ is usually set to 1, since Newton’s method already
provides a good guess for the step size.

Example: Modified Newton’s Method in Action

Gradient-Based Optimization Quasi-Newton Methods
Quasi-Newton Methods
I Quasi-Newton methods use only first order information . . .
I . . . but they build second order information — an approximate Hessian —
based on the sequence of function values and gradients from previous
iterations.
I They are the analog of the secant method in multidimensional space.
I The various quasi-Newton methods differ in how they update the
approximate Hessian.
I Most of them force the Hessian to be symmetric and positive definite.

The First Quasi-Newton Method

I A bit of interesting history . . .
I One of the first quasi-Newton methods was devised by Davidon in 1959, who
a physicist at Argonne National Laboratories.
I He was using a coordinate descent method, and had limited computer
resources, so he invented a more efficient method that resulted in the first
quasi-Newton method.
I This was one of the most revolutionary ideas in nonlinear optimization.
I Davidon’s paper was not accepted for publication! It remained a technical
report until 1991.
I Fletcher and Powell later modified the method and showed that it was much
faster than current ones, and hence it became known as the
Davidon–Fletcher–Powell (DFP) method.

The Basis for Quasi-Newton Methods 1

I Suppose we model the objective function as a quadratic
1
φk (p) = fk + gkT p + pT Bk p,
2
where Bk is an n × n symmetric positive definite matrix that is updated
every iteration.
I The step pk that minimizes this convex quadratic model is
pk = −Bk−1 gk .
I This solution is used to compute the search direction to obtain the new
iterate
xk+1 = xk + αk pk
where αk is obtained using a line search.
I This is the same procedure as the Newton method, except that we use an
approximate Hessian Bk instead of the true Hessian.


I Instead of computing Bk “from scratch” at every iteration, a quasi-Newton
method updates it in to account for the curvature estimate for the most
recent step.
I We want to build an updated quadratic model,
T 1
φk+1 (p) = fk+1 + gk+1 p + pT Bk+1 p.
2
I Using the secant method we can find the univariate quadratic function along
the previous direction pk based on the two last two gradients gk+1 and gk ,
and the last function value fk+1 .
I The slope of the univariate function is the gradient of the function projected
onto the p direction, f 0 = g T p. The univariate quadratic is given by
0 θ2 ˜00
φk+1 (θ) = fk+1 + θfk+1 + f
2 k+1
where s = αkpk and f˜k+1

00
is the approximation to the curvature


I This curvature approximation is given by a forward finite difference on the
slopes
f 0 − fk0
f˜k+1
00
= k+1
αk kpk k
These slopes are obtained by projecting the respective gradients onto the last
direction pk .
I The result is a quadratic that slope and value at the current point, and the
slope of the previous point.

fk0
0
fk+1 φ
xk xk+1
Projection of the quadratic model onto the last search direction, illustrating
the secant condition


I Going back to n-dimensional space, after some manipulation we obtain,
Bk+1 αk pk = gk+1 − gk .
which is called the secant condition.

I For convenience, we set the difference of the gradients to yk = gk+1 − gk ,
and sk = xk+1 − xk so the secant condition is then written as
Bk+1 sk = yk .
xk+1
pk+1
pk
xk gk+1
gk

Davidon–Fletcher–Powell (DFP) Method 1

I In the Hessian update, Bk+1 we have n(n + 1)/2 unknowns and only n
equations.
I To determine the solution uniquely, we impose a condition that among all the
matrices that satisfy the secant condition, selects the Bk+1 that is “closest”
to the previous Hessian approximation Bk
I This can be done by solving the optimization problem
minimize kB − Bk k
with respect to B
subject to B = BT , Bsk = yk .
I Using different matrix norms result in different quasi-Newton methods.


I One norm that makes it easy to solve this problem and possesses good
numerical properties is the weighted Frobenius norm
kAkW = kW 1/2 AW 1/2 kF ,

Pn Pn
where the norm is defined as kCkF = i=1 j=1 c2ij . The weights W are
chosen to satisfy certain favorable conditions.
I The norm is adimensional (i.e., does not depend on the units of the problem)
if the weights are chosen appropriately.
I Using this norm and weights, the unique solution of the norm minimization
problem is,

yk sT sk y T yk y T
Bk+1 = I − T k Bk I − T k + T k ,
yk sk yk sk yk sk
which is the DFP updating formula originally proposed by Davidon.


I Using the inverse of Bk is usually more useful, since the search direction can
then be obtained by matrix multiplication. Defining,
Vk = Bk−1 .
I The DFP update for the inverse of the Hessian approximation can be shown
to be
Vk yk y T Vk sk sT
Vk+1 = Vk − T k + T k
yk Vk yk yk sk
I Note that this is a rank 2 update.

Quasi-Newton Algorithm with DFP Update

k←0
V0 ← I
repeat
Compute the search direction, pk ← −Vk gk
Perform line search to find step length αk , starting with α ← 1
Set the step length, sk ← αk pk
Compute the change in the gradient, yk ← gk+1 − gk
V y yT V
Ak ← yk TkVkkyk k
k
s sT
Bk ← sTk ykk
k
Compute the updated approximation to the inverse of the Hessian, Vk+1 ←
V k − Ak + B k
Broyden–Fletcher–Goldfarb–Shanno (BFGS) Method

I The DFP update was soon superseded by the BFGS formula, which is
generally considered to be the most effective quasi-Newton update.
I Instead of solving the norm minimization problem of B we now solve the
same problem for its inverse, V , resulting in

sk y T yk sT sk sT
Vk+1 = I − T k Vk I − T k + T k .
sk yk sk yk sk yk
I The relative performance between the DFP and BFGS methods is problem
dependent.

A Beer-Inspired Algorithm?
Broyden, Fletcher, Goldfarb and Shanno at the NATO Optimization Meeting

(Cambridge, UK, 1983), a seminal meeting for continuous optimization
Example: BFGS Applied to Simple Function

Example: Minimization of the Rosenbrock Function 1
Steepest descent
15
2
0.5
0
15
15
20
10
1.5
20
0
20
0.5
5
1 0.2.2
10
15
2 0
2
5
10
0
10
1
1
20
10
5
10
5
15
15
0.5
20
2
15
0.5
0.5
20 0
x2
20
10
200
1 00
100
2
20
15
40 0
10
1
300
300
0 5
5 00
10 2
15
600
20
-0.5
4 00
20
7 00
20
10
0
0
10
400
500
60 0
30
-1 0
30
0
-1.5 -1 -0.5 0 0.5 1 1.5

x1


Nonlinear conjugate gradient
10
0
20
20
1.5
10
2
0
15
10
0.2 0.22 1
5
0.5
0
5
10 0
10
15
1
10 0
10
2
0.5
15
5
20
5
20
1
10
15
20 0
20
0.5 15
10 0
10
2
x2
1
2
300
5
5
400
0.5
100
200
15
0
10
10
1
5
20
50 0
15
300
20
200
-0.5
0
10
60 0
700 00
40 0
3 00
6
5 00
400
10
0
0
70 0
-1 20
-1.5 -1 -0.5 0 0.5 1 1.5
x1


Modified Newton
0.2
1.5
20
10
0
5
2
15
200
15
2
1
2105
5
20
0.5
0
10
10
10
0
15 2
1
100
10
.2
00.5
10
1
2 00
5
0.2
1 00
5
0.5
20
20
2
x2
15
5
15
15
5
2
2
0.
1
10
300
20
0.5
10
2
0
200
500
10
10
0
3 00
4 00
5
2 00
400
15
6 00
20
0
10
-0.5
7 00
30
0
500
6 00
1 00
-1
-1.5 -1 -0.5 0 0.5 1 1.5
x1


BFGS
300
20
20
0
5
20
0.2
2
10
1.5
5
5
15
10 0
2
10
0.5
1
10
0
10
1
20
0.2
1
10
15
15
0.5
15
20
20 10
200
20
5
0.2
5
2
300
0.5 15
5
x2
1 00
10 2
100
15
0.
10
2 00
40 0
1
0
20
15
15
5 10
20
5 00
20 0
300
40 0
-0.5
30 0
700
500
60 0
600
0
10
10
0
20
0
0
-1
40
-1.5 -1 -0.5 0 0.5 1 1.5
x1


Convergence rate comparison

Symmetric Rank-1 Update Method (SR1) 1

I If we drop the requirement that the approximate Hessian (or its inverse) be
positive definite, we can derive a simple rank-1 update formula for Bk that
maintains the symmetry of the matrix and satisfies the secant equation.
I The symmetric rank-1 update (SR1) is such a formula
(yk − Bk sk )(yk − Bk sk )T
Bk+1 = Bk + .
(yk − Bk sk )T sk
I With this formula, we must have safe-guards:
I If yk = Bk sk then the denominator is zero, and the only update that satisfies
the secant equation is Bk+1 = Bk (i.e., do not change the matrix).
I if yk 6= Bk sk and (yk − Bk sk )T sk = 0 then there is no symmetric rank-1
update that satisfies the secant equation.

Symmetric Rank-1 Update Method (SR1) 2

I To avoid the second case, we update the matrix only if the following
condition is met:
|ykT (sk − Bk yk )| ≥ rksk kkyk − Bk sk k,
where r ∈ (0, 1) is a small number (e.g., r = 10−8 ). Hence, if this condition

is not met, we use Bk+1 = Bk .
I In practice, the matrices produced by SR1 have been found to approximate
the true Hessian matrix well (often better than BFGS)
I This may be useful in trust-region methods or constrained optimization
problems, where the Hessian of the Lagrangian is often indefinite, even at the
minimizer.
I It may be necessary to add a diagonal matrix γI to Bk when calculating the
search direction, as was done in modified Newton’s method.
I A simple back-tracking line search can be used, since the Wolfe conditions
are not required as part of the update — unlike BFGS.

Gradient-Based Optimization Trust Region Methods
Trust Region Methods

I Trust region, or “restricted-step” methods are a different approach to
resolving the weaknesses of the pure form of Newton’s method.
I These weaknesses arise from the fact that we are stepping outside a the
region for which the quadratic approximation is reasonable.
I We can overcome these difficulties by minimizing the quadratic function
within a region around xk within which we trust the quadratic model.
I The reliability index, rk , is the ratio of the actual reduction to the predicted
reduction; the closer it is to unity, the better the agreement. If fk+1 > fk
(new point is worse), rk is negative.

Gradient-Based Optimization Trust Region Methods
Trust Region Algorithm

Algorithm 3 Trust region algorithm
Input: Initial guess x0 , convergence tolerances, εg , εa and εr , initial size of the
trust region, h0
k←0
repeat
Compute the Hessian of the objective function H(xk ), and solve the quadratic
subproblem:
1
minimize q(sk ) = f (xk ) + g(xk )T sk + sTk H(xk )sk
2
w.r.t. sk
s.t. − hk ≤ sk ≤ hk , i = 1, . . . , n
Evaluate f (xk + sk ) and compute the ratio that measures the accuracy of
the quadratic model,
f (xk ) − f (xk + sk ) ∆f
rk ← =
f (xk ) − q(sk ) ∆q
if rk < 0.25 then

hk+1 ← ks4k k . Model is not good; shrink the trust region
else if rk > 0.75 and hk = ksk k then
hk+1 ← 2hk . Model is good and new point on edge; expand trust
region
else
hk+1 ← hk . New point with trust region and the model is reasonable;
keep trust region the same size
end if
if rk ≤ 0 then
xk+1 ← xk . Keep trust region centered about the same point
else
xk+1 ← xk + sk . Move center of trust region to new point
end if
k ←k+1

Computing Derivatives
Computing Derivatives
1. Introduction
4.1 Introduction
4.2 Finite Differences
4.3 Complex-Step Method
4.4 C/C++ Implementations
4.5 Unifying Chain Rule
4.6 The Unifying Chain Rule
4.7 Monolithic Differentiation
4.8 Algorithmic Differentiation
4.9 Analytic Methods

Computing Derivatives Introduction
What’s in a name?
I Derivatives have also been called:
I “Sensitivities” . . . but sensitivity analysis is actually a much broader area of
mathematics.
I “Sensitivity derivatives” — a somewhat redundant term?
I “Design sensitivities” — a fair term to use.
I I have been using the terms “sensitivities” and “sensitivity analysis” up until
this year, but now I prefer “derivatives”, since it is more precise.
I A “gradient” is a vector of derivatives
I A Jacobian is a matrix of derivatives (the gradient of a vector)
I We will focus on first order derivatives of deterministic numerical models.
I A model can be any numerical procedure that given inputs computes some
outputs

What derivatives do we need for optimization?

Consider a general constrained optimization problem of the form:
minimize f (xi ) (1)

w.r.t xi i = 1, 2, . . . , n (2)
subject to cj (xi ) ≥ 0, j = 1, 2, . . . , m (3)
To solve this problem using gradient-based optimization we require:

I Gradient of the objective function, ∇f (x) = ∂f /∂xi , an n-vector.
I Gradient of all active constraints, ∂cj /∂xi , an (m × n) matrix (Jacobian)

The Root of Most Problems in Gradient-Based

Optimization
x0 I The computation of the derivatives can be
Optimizer the bottleneck in gradient-based
optimization
Search
Analysis
direction I Most gradient-based optimizers use finite
x
Gradient differences as the default
Line search computation
I This often leads to long computational
Converged?
times and failure to converge
I Accurate and efficient gradients are
essential for effective optimization

Methods for Computing Derivatives

Symbolic: Exact, but limited to explicit functions
Finite differences: Easy to implement and no source code is needed, but subject
to large errors; cost proportional to the number of design variables
Complex step: Relatively easy to implement, but source code is needed.
Numerically exact. Cost is still proportional to the number of
variables.
Algorithmic differentiation: Requires the source code, memory requirements can
become prohibitive, cost can be independent of the number of
design variables.
Analytic methods: Numerically exact, long development time, source code is
needed, but cost can be independent of the number of design
variables.

Computing Derivatives Finite Differences
Finite Differences 1
I Finite differences are one of the most popular methods for computing
derivatives, mostly because they are extremely easy to implement and do not
require source code
I . . . but they suffer from some serious accuracy and performance issues.
I Finite-difference formulas are derived by combining Taylor series expansions
I It is possible to obtain formulas for arbitrary order derivatives with arbitrary
order truncation error (but it will cost you!)

The simplest finite-difference formula can be directly derived from one Taylor
series expansion,
∂F h2 ∂ 2 F h3 ∂ 3 F
F (x + ej h) = F (x) + h + + + ...,
∂xj 2! ∂x2j 3! ∂x3j
Solving for ∂F /∂xj we get,
∂F F (x + ej h) − F (x)
= + O(h)
∂xj h
where h is the finite-difference interval. This approximation is called a forward

difference and is directly related to the definition of derivative. The truncation
error is O(h), and hence this is a first-order approximation.
I F can be a vector with all the functions if interest
I The forward difference formula requires two function evaluates and yields one
column of the

I Each additional column requires an additional evaluation
I Hence, the cost of computing the complete Jacobian is proportional to the
number of input variables of interest, nx .
For a second-order estimate we use the expansion of f (x − h),
h2 00 h3
f (x − h) = f (x) − hf 0 (x) + f (x) − f 000 (x) + . . . ,
2! 3!
and subtract it from the f (x + h) to get the central-difference formula,
f (x + h) − f (x − h)
f 0 (x) = + O(h2 ).
2h
More accurate estimates can also be derived by combining different Taylor series
expansions.

Formulas for estimating higher-order derivatives can be obtained by nesting
finite-difference formulas. We can use, for example, the central difference formula
to estimate the second derivative instead of the first,
f 0 (x + h) − f 0 (x − h)
f 00 (x) = + O(h2 ).
2h
and use central difference again to estimate both f 0 (x + h) and f 0 (x − h) in the
above equation to obtain,
f (x + 2h) − 2f (x) + f (x − 2h)

f 00 (x) = + O(h).
4h2
I Finite differences are subject to the step-size dilemma:

I Want to use a very small h to reduce the truncation error
I . . . but cannot make h too small because of subtractive cancellation
Subtractive cancellation is due to finite precision arithmetic.

f (x + h) +1.234567890123431
f (x) +1.234567890123456
∆f −0.000000000000025
f(x)
f(x+h)
x x+h
Finite difference approximation

I For functions of several variables, then we have to calculate each component
of the gradient ∇f (x) by perturbing each component of x and recomputing
f.
I Thus the cost of calculating a gradient is proportional to the number of
design variables.

Computing Derivatives Complex-Step Method
The Complex-Step Method

I The complex-step derivative approximation computes derivatives of real
functions using complex variables.
I Originates from a more general method published in 1967 for computing
higher order derivatives with arbitrary precision
I Rediscovered in 1998 as a simple formula for first derivatives
I Generalized for real-world applications soon after that
I Extremely accurate, robust, and relatively easy to implement

Complex-step Method Applications 1

I Gradients and Jacobians in CFD
I Verification of high-fidelity aerostructural derivatives
I Immunology model sensitivities
I Jacobians in liquid chromotography
I First and second derivatives of Kalman filters
I Hessian matrices in statistics
I Sensitivities in biotechnology

Theory 1
I Like finite-difference formulas, the complex-step approximations can also be
derived using a Taylor series expansion.
I Instead of using a real step h, we now use a pure imaginary step, ih.
I If f is a real function in real variables and it is also analytic, we can expand it
in a Taylor series about a real point x as follows,
∂F h2 ∂ 2 F ih3 ∂ 3 F
F (x + ihej ) = F (x) + ih − − + ...
∂xj 2 ∂x2j 6 ∂x3j
Taking the imaginary parts of both sides of this equation and dividing it by h
yields
∂F Im [F (x + ihej )]
= + O(h2 )
∂xj h
We call this the complex-step derivative approximation. Hence the
approximations is a O(h2 ) estimate of the derivative.

Theory 2
I Like finite-differences, each additional evaluation results in a column of the
dF
Jacobian , and the cost of computing the derivatives is proportional to
dx
the number of design variables, nx .
I No subtraction operation in the complex-step approximation, so no
subtractive cancellation error
I the only source of numerical error is the truncation error, O(h2 ).
I By decreasing h to a small enough value, the truncation error can be made to
be of the same order as the numerical precision of the evaluation of f .
I If we take the real part of the Taylor series expansion, we get
f 00 (x)
f (x) = Re [f (x + ih)] + h2 − ...
2!
showing that the real part of the result give the value of f (x) correct to
O(h2 ).

Theory 3
I The second order errors in the function value and the function derivative can
be eliminated when using finite-precision arithmetic by ensuring that h is
sufficiently small.
I If ε is the relative working precision of a given algorithm, to eliminate the
truncation error, we need an h such that
00
f (x)
h2 < ε |f (x)|
2!
I Similarly, for the truncation error of the derivative estimate to vanish we
require that 000
f (x)
h2 < ε |f 0 (x)|
3!
I Although the step h can be very small values, in some cases, it is not possible
to satisfy these conditions, e.g., when f (x), f 0 (x) tend to zero.

Another derivation of the complex-step 1

I Consider a function, f = u + iv, of the complex variable, z = x + iy. If f is
analytic the Cauchy–Riemann equations apply, i.e.,
∂u ∂v
=
∂x ∂y
∂u ∂v
=− .
∂y ∂x
I We can use the definition of a derivative in the right hand side of the first
Cauchy–Riemann to get
∂u v(x + i(y + h)) − v(x + iy)

= lim
∂x h→0 h
where h is a small real number.


I Since the functions are real functions of a real variable, y = 0, u(x) = f (x)
and v(x) = 0 and we can write,
∂f Im [f (x + ih)]
= lim .
∂x h→0 h
I For a small discrete h, this can be approximated by,
∂f Im [f (x + ih)]
≈ .
∂x h


Im
(x, ih)
Re Re
(x, 0) (x + h, 0) (x, 0)
∂F F (x + h) − F (x) ∂F Im[F (x + ih)] − Im[F (x)]

≈ ≈
∂x h ∂x Im[ih]
∂F Im[F (x + ih)]
⇒ ≈
∂x h

Example: The Complex-Step Method Applied to a Simple

Function 1
I Consider the following analytic function:
ex
f (x) = p
sin3 x + cos3 x
I We define the relative error as,

0 0
f − fref
ε = .

0
fref

Example: The Complex-Step Method Applied to a Simple

Function 2
Normalized Error, e
Step Size, h
Relative error of the derivative vs. decreasing step size

Application of the Complex-Step to General Programs

I To what extent can the complex-step method be used in a general numerical
algorithm?
I We had to assumed that the function F is analytic, so we need to examine
this assumption holds in numerical algorithms.

Relational logic operators 1

I Relational logic operators (=, <, >, ≤, ≥) are usually not defined for complex
numbers.
I These operators are used with conditional statements to redirect the
execution thread.
I Original algorithm and its “complexified” version should follow the same
execution thread.
I Therefore, defining these operators to compare only the real parts is the
correct approach.
I max and min are based on relational operators, we should choose a number
based on its real part alone.

Relational logic operators 2

I Algorithms that use conditional statements are likely to be a discontinuous
function of its inputs
I Either the function value itself is discontinuous or the discontinuity is in the
first or higher derivatives.
I Using finite-difference, the estimate is incorrect if the two function evaluations
are within h of the discontinuity location.
I Using the complex-step, the resulting derivative estimate is correct right up to
the discontinuity.

Arithmetic functions
I Arithmetic functions and operators include addition, multiplication, and
trigonometric functions.
I Most of these functions have a standard complex definition that is analytic,
so the complex-step derivative approximation yields the correct result.
I The only standard complex function definition that is non-analytic is the
absolute value function.

Redefining the absolute value function 1

I When the argumentpis a complex number, the function returns the positive
real number, |z| = x2 + y 2 .
I This function is not analytic, so the complex-step does not work.
I To derive an analytic definition of this function, we apply the
Cauchy–Riemann equations to get:
(
∂u ∂v −1, if x < 0,
= =
∂x ∂y +1, if x > 0.
I Since ∂v/∂x = 0 on the real axis, we get ∂u/∂y = 0 on the same axis, so the
real part of the result must be independent of the imaginary part of the
variable.
I Therefore, the new sign of the imaginary part depends only on the sign of the
real part of the complex number, and an analytic “absolute value” function is
(
−x − iy, if x < 0,
abs(x + iy) =
+x + iy, if x > 0.
Redefining the absolute value function 2

I This is not analytic at x = 0 since a derivative does not exist for the real
absolute value.
I In practice, the x > 0 condition is substituted by x ≥ 0, so that we can obtain
a function value for x = 0 and calculate the correct right-hand-side derivative.

Other Issues 1
I Improvements to the complex-step method are necessary because of the way
certain compilers implement the functions.
I For example, the following formula might be used for the arcsin function:
h p i
arcsin(z) = −i log iz + 1 − z 2 ,
which may yield a zero derivative.

I To see how this happens, consider z = x + ih, where x = O(1) and
h = O(10−20 ), then in the addition,
iz + z = (x − h) + i (x + h) ,
h vanishes when using finite precision arithmetic. Therefore, we would like to

keep the real and imaginary parts separate.
I The complex definition of sine also problematic. For example, in
eiz − e−iz
sin(z) = .
2i
Other Issues 2
I The complex trigonometric relation yields a better alternative,
sin(x + ih) = sin(x) cosh(h) + i cos(x) sinh(h).
I Linearizing this last equation (that is for small h) this simplifies to,
sin(x + ih) ≈ sin(x) + ih cos(x).
I From the standard complex definition,

h p i
arcsin(z) = −i log iz + 1 − z 2 .
I We would like the real and imaginary parts to be calculated separately. This
can be achieved by linearizing in h to obtain,
h
arcsin(x + ih) ≡ arcsin(x) + i √ .
1 − x2

Implementation Procedure
The general procedure for the implementation of the complex-step method for an
arbitrary computer program can be summarized as follows:
1. Substitute all real type variable declarations with complex declarations. It is
not strictly necessary to declare all variables complex, but it is much easier to
do so.
2. Define all functions and operators that are not defined for complex
arguments.
3. Add a small complex step (e.g. h = 1 × 10−20 ) to the desired x, run the
algorithm that evaluates f , and then take the imaginary part of the result
and divide by h.
The above procedure is independent of the programming language. We now
describe the details of our Fortran and C/C++ implementations.

Fortran Implementation 1
I complexify.f90: a module that defines additional functions and operators
for complex arguments.
I Complexify.py: Python script that makes necessary changes to source
code, e.g., type declarations.
I Features:
I Script is versatile:
I Compatible with many more platforms and compilers.
I Supports MPI based parallel implementations.
I Resolves some of the input and output issues.
I Some of the function definitions were improved: tangent, inverse and
hyperbolic trigonometric functions.
I complexify.h: defines additional functions and operators for the

complex-step method.
I derivify.h: simple automatic differentiation. Defines a new type which
contains the value and its derivative.

Fortran Implementation 2
Templates, a C++ feature, can be used to create program source code that is
independent of variable type declarations.
I Compared run time with real-valued code:
I Complexified version: ≈ ×3
I Algorithmic differentiation version: ≈ ×2

Other Programming Languages 1

Matlab: As in the case of Fortran, one must redefine functions such as abs,
max and min. All differentiable functions are defined for complex
variables. The standard transpose operation represented by an
apostrophe (’) poses a problem as it takes the complex conjugate
of the elements of the matrix, so one should use the non-conjugate
transpose represented by “dot apostrophe” (.’) instead.
Java: Complex arithmetic is not standardized at the moment but there
are plans for its implementation. Although function overloading is
possible, operator overloading is currently not supported.
Python: A simple implementation of the complex-step method for Python
was also developed in this work. The cmath module must be
imported to gain access to complex arithmetic. Since Python
supports operator overloading, it is possible to define complex
functions and operators as described earlier.

Other Programming Languages 2

I Algorithmic differentiation by overloading can be implemented in any
programming language that supports derived datatypes and operator
overloading.
I For languages that do not have these features, the complex-step method can
be used wherever complex arithmetic is supported.

Example: Application of the complex-step method to an

aerostructural optimization problem 1
I Aerodynamics: SYN107-MB, a
parallel, multiblock Navier–Stokes
flow solver.
I Structures: detailed finite element
model with plates and trusses.
I Coupling: high-fidelity, consistent
and conservative.
I Geometry: centralized database for
exchanges (jig shape, pressure
distributions, displacements.)
I Coupled-adjoint sensitivity analysis


0 CD
10
∂ CD / ∂ b1
−2
10
Reference Error, ε
−4
10
−6
10
−8
10
100 200 300 400 500 600 700 800
Iterations


0
10
Complex−Step
−1
10 Finite−difference
−2
10
Relative Error, ε
−3
10
−4
10
−5
10
−6
10 −5 −10 −15
10 10 10
Step Size, h


Complex−Step, h = 1×10−20
0.15
Finite−Difference, h = 1×10−2
0.1
∂ CD / ∂ bi
0.05
−0.05
2 4 6 8 10 12 14 16 18
Shape variable, i


Computation Type Normalized Cost
aerostructural Solution 1.0
Finite difference 14.2
Complex step 34.4

Example: Application of the complex-step method to a

supersonic viscous-inviscid solver 1
Framework for preliminary design of natural laminar flow supersonic aircraft


I Transition prediction
I Viscous and inviscid drag
I Design optimization
I Wing planform and airfoil design
I Wing-Body intersection design


I Python wrapper defines geometry


I CH GRID automatic grid generator
I Wing only or wing-body
I Complexified with our script
I CFL3D calculates Euler solution
I Version 6 includes complex-step
I New improvements incorporated
I C++ post-processor for the. . .
I Quasi-3D boundary-layer solver
I Laminar and turbulent
I Transition prediction
I C++ automatic differentiation
I Python wrapper collects data and computes structural constraints


2
10
Finite Difference
Complex−Step
0
10
Relative Error, ε
−2
10
−4
10
−6
10
−8
10 0 −5 −10 −15 −20
10 10 10 10 10
Step Size, h


−4
x 10
4.3736
Function Evaluations
Complex−Step Slope
4.3735
4.3734
4.3733
4.3732
Cdf
4.3731
4.373
4.3729
4.3728
22.495 22.5 22.505

Root Chord (ft)

Computing Derivatives Unifying Chain Rule
Total Derivatives of a System 1

I In addition to finite differences, the complex-step method and symbolic
differentiation, there are other methods for computing total derivatives
I We derive these various methods from a single formula . . .
I . . . but first we must go through some assumptions an definitions
I The computational model is assumed to be a deterministic series of
computations
I Any computational model can be defined as a sequence of explicit functions
Vi , where i = 1, . . . , n.
vi = Vi (v1 , v2 , . . . , vi−1 ).
where we adopt the convention that the lower case represents the value of a
variable, and the upper case represents the function that computes that value.
I In the more general case, a given function might require values that have not
been previously computed, i.e.,
vi = Vi (v1 , v2 , . . . , vi , . . . , vn ).


I The solution of such systems require numerical methods that can be
programmed by using loops where variables are updated.
I Numerical methods range from simple fixed-point iterations to sophisticated
Newton-type algorithms.
I Loops are also used to repeat one or more computations over a
computational grid.
I It is always possible to represent any given computation without loops and
dependencies if we unroll all of the loops, and represent all values a variable
might take in the iteration as a separate variable that is never overwritten.
I In cases where the computational model requires iteration, it is helpful to
denote the computation as a vector of residual equations,
r = R(v) = 0
where the algorithm changes certain components of v until all of the

residuals converge to a small tolerance.


I The subset of v that is iterated to achieve the solution of these equations are
called the state variables.
I We now separate the subsets in v into:
Independent variables: x
State variables: y
Quantities of interest: f
I Using this notation, we can write the residual equations as,
r = R(x, y(x)) = 0
where y(x) denotes the fact that y depends implicitly on x through the
solution of the residual equations
I The solution of these equations completely determines y for a given x.
I The functions of interest (usually included in the set of component outputs)
also have the same type of variable dependence in the general case,
f = F (x, y(x)).


I When we compute f , we assume that the state variables y have already been
determined by the solution of the residual equations.
x
x ∈ Rn x
y ∈ Rn y
r ∈ Rn y
R(x, y) = 0 F (x, y) f f ∈ Rn f

Computing Derivatives The Unifying Chain Rule
One Chain to Rule them All 1

I We now derive a single equation that unifies the methods for computing total
derivatives.
I The methods differ in the extent to which they decompose a system, but
they all come from a basic principle: a generalized chain rule.
I We start from the sequence of variables (v1 , . . . , vn ), whose values are
functions of earlier variables,
vi = Vi (v1 , . . . , vi−1 )
For brevity, Vi (v1 , . . . , vi−1 ) is written as Vi (·).

I We define a partial derivative, ∂Vi /∂vj , of a function Vi with respect to a
variable vj as
∂ Vi Vi (v1 , . . . , vj−1 , vj + h, vj+1 , . . . , vi−1 ) − Vi (·)

= .
∂vj h


I Consider a total variation vi due to a perturbation vj , which can be
computed by using the sum of partial derivatives,
i−1
X ∂ Vi
vi = vk
∂vk
k=j
where all intermediate v’s between j and i are computed and used.
I The total derivative is,
dvi vi
= ,
dvj vj
I Using the two equations above, we can write:
X ∂ Vi dvk i−1
dvi
= δij + ,
dvj ∂vk dvj
k=j
which expresses a total derivative in terms of the other total derivatives and
the Jacobian of partial derivatives. The δij term is added to account for the
case in which i = j.

I This represents the chain rule for a system whose variables are v.
I To get a better understanding of the structure of the chain rule, we now
write it in matrix form:
 
0 ···
 ∂V2 0 ··· 
 ∂v1 
∂Vi  ∂V3 ∂V3 0 · · · 
DV = 
=  ∂v1 ∂v2 ,
∂vj 
 .. .. ..
.
..
. 
 . . 
∂Vn ∂Vn ∂Vn
∂v1 ∂v2 · · · ∂vn−1 0
where D is a differential operator.


I The total derivatives of the variables vi form another Jacobian matrix of the
same size that has a unit diagonal,
 
1 0 ···
 dv2 1 0 ··· 
 dv1 
dvi  dv3 dv3
1 0 · · ·
Dv = =  dv. 1 ∂v2 .

dvj  . .. .. .. 
 . . . . 
dvn dvn dvn
dv1 dv2 ··· dvn−1 1
I Both of these matrices are lower triangular matrices, due to our assumption
that we have unrolled all of the loops.
I Using this notation, the chain rule can be written as
Dv = I + DV Dv .


I Rearranging this, we obtain,
(I − DV ) Dv = I.
where all of these matrices are square, with size n × n.

I The matrix (I − DV ) can be formed by finding the partial derivatives, and
then we can solve for the total derivatives Dv .
I Since (I − DV ) and Dv are inverses of each other, we can further rearrange
it to obtain the transposed system:
T
(I − DV ) DvT = I.


I This leads to the following symmetric relationship:
T
(I − DV ) Dv = I = (I − DV ) DvT
− = = −
= =
I We call the left and right hand sides of this equation the forward and reverse
chain rule equations, respectively.
I All methods for derivative computation can be derived from one of the forms
of this chain rule by changing what we mean by “variables”, which can be
seen as a level of decomposition.


I The derivatives of interest, df / dx, are typically the derivatives of some of
the last variables with respect to some of the first variables in the sequence,
   
df1 df1 dv(n−nf ) dv(n−nf )
 dx1 · · · dxn   dv ···
dvnx 
df  x   1 
 .. . .. .  
..  =  . . .. 
= . .. .. . ,
dx    
 dfnf dfnf   dvn dvn 
··· ···
dx1 dxnx dv1 dvnx
This is an nf × nx matrix that corresponds to the lower-left block of Dv , or

the corresponding transposed upper-right block of DvT .
I DV is lower triangular, and therefore we can solve for a column of Dv using
forward substitution.
T
I Conversely, DV is upper triangular, and therefore we can solve for a row of
Dv using back substitution.


I Each of these versions of the chain rule incur different computational costs,
depending on the shape of the Jacobian df / dx:
I If nx > nf it is advantageous to use the forward chain rule
I f nf > nx the reverse chain rule is more efficient.

Unification of Derivative Computation Methods

I The choice of v is the main difference between the various methods for
computing total derivatives.
I A second major difference is the technique used to solve the linear system.
Monolithic Analytic Multidisciplinary analytic AD
Level of decomposition Black box Solver Discipline Line of code

Differentiation method FD/CS Any Any Symbolic
Solution of the linear system Trivial Numerical Numerical (block) Forward substitution
Back substitution

Example: Simple Computational Model 1

I This model can be interpreted as an explicit function, a model with states
constrained by residuals, or a multidisciplinary system.
I Two inputs, x = [x1 , x2 ]T
I Residual equations,

R1 (x1 , x2 , y1 , y2 ) x1 y1 + 2y2 − sin x1
R= =
R2 (x1 , x2 , y1 , y2 ) −y1 + x22 y2
I State variables y = [y1 y2 ]T

I Output functions,

F1 (x1 , x2 , y1 , y2 ) y1
F = =
F2 (x1 , x2 , y1 , y2 ) y2 sin x1
I To drive the residuals to zero, we have to solve the following linear system,

x1 2 y1 sin x1
=
−1 x22 y2 0

Example: Simple Computational Model 2

I The algorithm solves the system directly and there are no loops.
I The v’s introduced above correspond to each variable assignment
v = [x(1) x(2) det y(1) y(2) f(1) f(2)]T
FUNCTION F ( x )
REAL :: x (2) , det , y (2) , f (2)
det = 2 + x (1) * x (2) **2
y (1) = x (2) **2* SIN ( x (1) ) / det
y (2) = SIN ( x (1) ) / det
f (1) = y (1)
f (2) = y (2) * SIN ( x (1) )
RETURN
END FUNCTION F
The objective is to compute the derivatives of both outputs with respect to both
inputs, i.e., the Jacobian, " #
df1 df1
df
= dx 1
df2
dx2
df2
dx dx1 dx2
We will use this example in later sections to show the application of all methods.
Computing Derivatives Monolithic Differentiation
Monolithic Differentiation 1
I In monolithic differentiation, the entire computational model is treated as a
“black box”
I Only track inputs and outputs.
I This is often the only option
I Both the forward and reverse modes of the generalized chain rule reduce to
dfi ∂ Fi
=
dxj ∂xj
for each input xj and output variable fi .

Monolithic Differentiation 2
x
r1 r f
r2
y1 y
y2
v = [v1 , . . . , vnx , v(n−nf ) , . . . , vn ]T

| {z } | {z }
x f

Example: Finite-Difference and Complex-Step Methods

Applied to Simple Model 1
I The monolithic approach treats the entire code as a black box whose internal
variables and computations are unknown.
I Thus, the tracked variables are
v1 = x 1 , v2 = x2 , v3 = f1 , v4 = f2
I forward and reverse chain rule equations yield,

df1 ∂ f1 df1 ∂ f1
= , = ,...
dx1 ∂x1 dx2 ∂x2


I Computing df1 / dx1 simply amounts to computing ∂f1 /∂x1
I Using the forward-difference formula (with step size h = 10−5 ), yields
∂ f1 f1 (x1 + h, x2 ) − f1 (x1 , x2 )
≈ = 0.0866023014079,
∂x1 h
I The complex-step method (with step size h = 10−15 ), yields
∂ f1 Im [f1 (x1 + ih, x2 )]

≈ = 0.0866039925329.
∂x1 h
I The digits that agree with the exact derivative are shown in blue and those
that are incorrect are in red.


102
100
10-2
10-4
Log of relative error
10-6
10-8
10-10
10-12
10-14
10-16 FD
CS
10-18 -20
10 10-18 10-16 10-14 10-12 10-10 10-8 10-6 10-4 10-2 100
Log of step size
Computing Derivatives Algorithmic Differentiation
Algorithmic Differentiation 1
I Algorithmic differentiation (AD) is also known as computational
differentiation or automatic differentiation
I Well known method based on the systematic application of the differentiation
chain rule to computer programs.
I With AD the variables v in the chain rule are all of the variables assigned in
the computer program
I Thus, AD applies the chain rule for every single line in the program.
I The computer program is considered as sequence of explicit functions Vi ,
where i = 1, . . . , n.
I Assume that all of the loops in the program are unrolled, and therefore no
variables are overwritten and each variable only depends on earlier variables
in the sequence.
I This assumption is not restrictive, as programs iterate the chain rule together
with the program variables, converging to the correct total derivatives.

I Typically, the design variables are among the first v’s, and the quantities of
interest are the last quantities.
v = [v1 , . . . , vnx , . . . , vj , . . . , vi , . . . , v(n−nf ) , . . . , vn ]T .

| {z } | {z }
x f

v1 x
v2
v3
v4 r1 r
.
.
.
r2
y1 y
y2
f
vn
v = [v1 , . . . , vnx , . . . , vj , . . . , vi , . . . , v(n−nf ) , . . . , vn ]T

| {z } | {z }
x f

I The chain rule is
X ∂Vi dvk i−1
dvi
= δij + ,
dvj ∂vk dvj
k=j
where the V represent explicit functions, each defined by a single line in the
computer program.
I The partial derivatives, ∂Vi /∂vk can be automatically differentiated
symbolically by applying another chain rule within the function defined by the
respective line in the program.
I The chain rule can be solved in two ways.
Forward mode: choose one vj and keep j fixed. Then we work our way
forward in the index i = 1, 2, . . . , n until we get the desired
total derivative.
Reverse mode: fix vi (the quantity we want to differentiate) and work our
way backward in the index j = n, n − 1, . . . , 1 all of the way to
the independent variables.

I The chain rule in matrix form,
(I − DV ) Dv = I ⇒
  
1 0 ··· 1 0 ···
 − ∂V 2
1 0 ···   dv 2
1 0 ··· 
 ∂V∂v1   dv 1 
 − ∂v 3 − ∂V3
1 0 · · ·  dv31
dv dv3
1 0 · · · =
 1 ∂v2  ∂v2 
 .. .. .. ..   .. .. .. .. 
 . . . .  . . . . 
− ∂Vn
∂v1
− ∂Vn
∂v2
··· − ∂v∂Vn
n−1
1 dvn
dv1
dvn
dv2
··· dvn
dvn−1
1
 
1 0 ···
0 1 0 ··· 
0 0 
· · · .
 1 0
. . .. .. .. 
 .. .. . . .
0 0 0 0 1

I The terms that we ultimately want to compute are the total derivatives of
the quantities of interest with respect to the design variables, corresponding
to a nf × nx block in the Dv matrix in the lower left:
   
df1 df1 dv(n−nf ) dv(n−nf )
 dx1 ··· ···
 dxnx 
 
 dv
1 dvnx  
df  . .. ..   .. .. .. 
=  .. .  =
.    . . . ,
dx  
 dfnf dfnf   dvn dvn 
··· ···
dx1 dxnx dv1 dvnx
which is an nf × nx matrix.
I The forward mode is equivalent to solving the linear system for one column of
Dv .
I Since (I − DV ) is a lower triangular matrix, this solution can be
accomplished by forward substitution.
I In the process, we end up computing the derivative of the chosen quantity
with respect to all of the other variables.

I The cost of this procedure is similar to the cost of the procedure that
computes the v’s.

Example: Forward Mode Applied to Simple Model 1

I The variables in this case are
   
v1 x(1)
v2  x(2)
   
v3   det 
   
v=   
v4  = y(1) .
v5  y(2)
   
v6  f(1)
v7 f(2)
I Performing the partial differentiation using symbolic differentiation we get

 1 0 0 0 0 0 0
  dv1 0 
0 1 0 0 0 0 0 dv1
2 1 1 
 2
 dv3 dv3  0
  dv 
−v2 −2v1 v2 1 0 0 0 0 0 1
 2 cos v
v2 2v2 sin v1
2 sin v
v2
0 
dv4
1 ∂v2
dv4  00 
0
− 1 − 1 1 0 0
 = 0 .

v3 v3 2 dv1 ∂v2 0
 v3
0  dv1 
cos v1 sin v1 dv5 dv5 0
 −
v3
0
v32 0 1 0
 dv6 ∂v2
dv6  0 0
0 0
0 0 0 −1 0 1 0 dv1 ∂v2
−v5 cos v1 0 0 0 − sin v1 0 1 dv7 dv7
dv1 ∂v2

Example: Forward Mode Applied to Simple Model 2

I We only kept the first two columns of the matrices Dv and I, because the
only derivatives of interest are in those two columns.

Reverse Mode Matrix Equations 1

I The matrix representation for the reverse mode of algorithmic differentiation
is
T
(I − DV ) DvT = I ⇒
  
1 − ∂V
∂v1
2
− ∂V
∂v1
3
··· − ∂V
∂v1
n
1 dv2
dv1
dv3
dv1 ··· dvn
dv1
0 1 ∂V3
− ∂v2 ··· ∂Vn
− ∂v2  0 1 dv3
··· dvn 
  dv2 dv2 
. .. .. .. ..  . .. .. .. .. 
.  . 
. . . . .  . . . . . =
. ..  . .. 
. .. ∂Vn   . .. dvn 
. . . 1 − ∂vn−1   . . . 1 dvn−1 
0 0 ··· 0 1 0 0 ··· 0 1
 
1 0 ···
0 1 0 ··· 
 
0 0 1 0 · · ·
 .
 .. .. .. .. .. 
. . . . .
0 0 0 0 1

Reverse Mode Matrix Equations 2

I The block matrix we want to compute is in the upper right section of DvT
and now its size is nx × nf .
I As with the forward mode, we need to solve this linear system one column at
the time, but now each column yields the derivatives of the chosen quantity
with respect to all the other variables.
I Because the matrix (I − DV )T is upper triangular, the system can be solved
using back substitution.

Example: Reverse Mode Applied to Simple Model 1

I Replacing the partial derivatives in the reverse matrix equations, we get
 2 v 2 cos v1 cos v1
  dv
dv1
6 dv7
∂v1

1 0 −v2 − 2
v3
−
v3
0 −v5 cos v1 dv6 dv7 0 
0 2v2 sin v1
 dv2 ∂v2  0
 
1 −2v1 v2 − 0 0 0 dv6 dv7 0 0
 v3
0 2 sin v
v2 1 sin v1  dv3 ∂v3  00
 = 0

0
 
0 1 0 0 dv6 dv7 0
 2 2
 
v3 v3 dv4 ∂v4 0
0 0 0 1 0 −1 0
 dv
dv5
6 dv7
∂v5  1 0
0 0 0 0 1 0 − sin v1 0 1
0 0 0 0 0 1 0 dv7
1
0 0 0 0 0 0 1 dv6
0 1
I The derivatives of interest are the top 2 × 2 block in the Dv matrix.

I In contrast to the forward mode, the derivatives of interest are computed by
performing two back substitutions, through which the derivatives of v6 and
v7 with respect to all variables are computed in the process.

Implementation and Tools

There are two main ways of implementing AD:
I Source code transformation
I The whole source code is processed with a parser and all the derivative
calculations are introduced as additional lines of code.
I Resulting source code for large programs is expanded and it may become
difficult to read.
I Every time the original code changes, must run the parser.
I Derived datatypes and operator overloading
I A new type of data structure is created that contains both the value and its
derivative: each real number v is replaced by v̄ = (v, dv).
I All operations are redefined (overloaded) such that, in addition to the result of
the original operations, they yield the derivative of that operation as well
I Compiler must support derived datatypes and operator overloading (e.g.,
Fortran 90, C++)

Available AD Tools 1
The tools for the various programming languages include:
I Fortran
I ADIFOR: Source transformation; forward and reverse modes; Fortran 77;
non-commercial.
I AD01: Operator overloading; forward and reverse modes; Fortran 90;
commercial.
I OPFAD/OPRAD: Operator overloading; forward and reverse modes; Fortran
90; non-commercial.
I TAMC: Source transformation; forward and reverse modes; Fortran 77;
non-commercial.
I TAF: Source transformation; forward and reverse modes; Fortran 90;
commercial.
I Tapenade: Source transformation; Fortran 90; non-commercial. Developed at
INRIA Sophia-Antipolis. Formerly Odyssée.
I C/C++: Various established tools for automatic differentiation. These
include include ADIC, an implementation mirroring ADIFOR, and ADOL-C, a
free package that uses operator overloading and can operate in the forward or
reverse modes and compute higher order derivatives.

Available AD Tools 2
I Other languages: Tools also exist for other languages, such as Matlab and
Python.

The Connection Between AD and the Complex-Step

Method
One significant connection to make is that the complex-step method is equivalent
to the forward mode of AD with an operator overloading implementation
Automatic Complex-Step
∆x1 = 1 h1 = 10−20
∆x2 = 0 h2 = 0
f = x1 x2 f = (x1 + ih1 )(x2 + ih2 )
∆f = x1 ∆x2 + x2 ∆x1 f = x1 x2 − h1 h2 + i(x1 h2 + x2 h1 )
df /dx1 = ∆f df /dx1 = Im f /h
Complex-step method computes one extra term. Other functions are similar:
I Superfluous calculations are made.
I For h ≤ x × 10−20 they vanish, but still affect speed.

Example: Forward AD Using Source Code Transformation

FUNCTION F_D (x , xd , f )
REAL :: x (2) , xd (2)
REAL :: det , detd
REAL :: y (2) , yd (2)
REAL :: f (2) , f_d (2)
detd = xd (1) * x (2) **2 + x (1) *2* x
(2) * xd (2)
FUNCTION F ( x ) det = 2 + x (1) * x (2) **2
REAL :: x (2) , det , y (2) , f yd = 0.0
(2) yd (1) = ((2* x (2) * xd (2) * SIN ( x (1)
det = 2 + x (1) * x (2) **2 ) + x (2) **2* xd (1) * COS ( x (1) ) ) *
y (1) = x (2) **2* SIN ( x (1) ) / det - x (2) **2*&
det & SIN ( x (1) ) * detd ) / det **2
y (2) = SIN ( x (1) ) / det y (1) = x (2) **2* SIN ( x (1) ) / det
f (1) = y (1) yd (2) = ( xd (1) * COS ( x (1) ) * det -
f (2) = y (2) * SIN ( x (1) ) SIN ( x (1) ) * detd ) / det **2
RETURN y (2) = SIN ( x (1) ) / det
END FUNCTION F f_d = 0.0
f_d (1) = yd (1)
f (1) = y (1)
f_d (2) = yd (2) * SIN ( x (1) ) + y (2)
* xd (1) * COS ( x (1) )
f (2) = y (2) * SIN ( x (1) )
RETURN
END FUNCTION F_D
Example: Reverse AD Using Source Code Transformation

SUBROUTINE F_B (x , xb , fb )
REAL :: x (2) , xb (2) ,
REAL :: y (2) , yb (2)
REAL :: f (2) , fb (2)
REAL :: det , detb , tempb , temp
det = 2 + x (1) * x (2) **2
y (1) = x (2) **2* SIN ( x (1) ) / det
y (2) = SIN ( x (1) ) / det
FUNCTION F ( x ) xb = 0.0
REAL :: x (2) , det , y (2) , f yb = 0.0
(2) yb (2) = yb (2) + SIN ( x (1) ) * fb (2)
det = 2 + x (1) * x (2) **2 xb (1) = xb (1) + y (2) * COS ( x (1) ) *
fb (2)
y (1) = x (2) **2* SIN ( x (1) ) / fb (2) = 0.0
det yb (1) = yb (1) + fb (1)
y (2) = SIN ( x (1) ) / det xb (1) = xb (1) + COS ( x (1) ) * yb (2)
f (1) = y (1) / det
detb = -( SIN ( x (1) ) * yb (2) / det
f (2) = y (2) * SIN ( x (1) ) **2)
RETURN yb (2) = 0.0
END FUNCTION F tempb = SIN ( x (1) ) * yb (1) / det
temp = x (2) **2/ det
xb (2) = xb (2) + 2* x (2) * tempb
detb = detb - temp * tempb
xb (1) = xb (1) + x (2) **2* detb +
temp * COS ( x (1) ) * yb (1)
xb (2) = xb (2) + x (1) *2* x (2) *
detb
END SUBROUTINE F_B
Computing Derivatives Analytic Methods
Analytic Methods 1
I Analytic methods are the most accurate and efficient methods.
I Much more involved, since they require detailed knowledge of the
computational model and a long implementation time.
I Applicable when f depends implicitly on x:
f = F (x, y(x)).
I The implicit relationship between the state variables y and the independent
variables is defined by the solution of a set of residual equations,
r = R(x, y(x)) = 0.
I We assumed a discrete analytic approach. This is in contrast to the

continuous approach, in which the equations are not discretized until later.

Analytic Methods 2
Continuous Discrete
Sensitivity Sensitivity
Equations Equations 1
Continuous
Governing
Equations
Discrete Discrete
Governing Sensitivity
Equations Equations 2

Traditional Derivation 1
I Using the chain rule we can write,
df ∂F ∂ F dy
= + ,
dx ∂x ∂y dx
where the result is an nf × nx matrix.
I The partial derivatives represent the variation of f = F (x) with respect to
changes in x for a fixed y
I The total derivative df / dx takes into account the change in y that is
required to keep the residual equations equal to zero.
I This distinction depends on the context, i.e., what is considered a total or
partial derivative depends on the level that is being considered in the nested
system of components.

I Since the governing equations must always be satisfied, the total derivative of
the residuals (210) with respect to the design variables must be zero. Thus,
using the chain rule
dr ∂ R ∂ R dy
= + = 0.
dx ∂x ∂y dx
I The computation of the total derivative matrix dy/ dx is much more
expensive than any of the partial derivatives, since it requires the solution of
the residual equations.
I The partial derivatives can be computed by differentiating the function F
with respect to x while keeping y constant, and can be computed using
symbolic differentiation, finite differences, complex step, or AD.
I The linearized residual equations provide the means for computing the total
Jacobian matrix dy/ dx, by rewriting them as,
∂ R dy ∂R
=− .
∂y dx ∂x

I Substituting this result into the total derivative equation, we obtain
dy
− dx
}| { z
−1
df ∂F ∂F ∂R ∂R
= − .
dx ∂x ∂y ∂y ∂x
| {z }
ψ
I The inverse of the square Jacobian matrix ∂R/∂y is not necessarily explicitly
calculated.
I There are two ways of computing the total derivative matrix dy/ dx:
Direct method: Factorize the Jacobian nx times with the columns of ∂R/∂x
in the right hand side.
Adjoint method: Factorize the Jacobian nf times with the columns of
∂F /∂y in the right hand side.

1
dfDerivatives
Computing @F @Analytic
F @R @R
Methods
= . (43)
dx @x @y @y @x
| {z }
Direct vs. Adjoint Method
nf > nx nx > nf
 = –
1 = –
df @F @F @R @R
=
dx @x @y @y @x
Direct method nf > nx nx > nf
= + = +
df @F @ F dy
= +
dx @x @y dx
@ R dy @ R – = – =
=
@y dx @x
Adjoint method nf > nx nx > nf
= + = +
df @F df @ R
= +
dx @x dr @x
 T  T  T
@R df @F – = – =
=
@y dr @y

Example: Analytic Adjoint Methods Applied to

Finite-Element Structural Analysis 1
I The discretized governing equations for a finite-element structural model are,
Rk = Kki ui − Fk = 0,
where Kki is the stiffness matrix, ui is the vector of displacement (the state)
and Fk is the vector of applied force (not to be confused with the function of
interest from the previous section!).
I We want the derivatives of the stresses, which are related to the
displacements by the equation,
σm = Smi ui .
I The design variables are the cross-sectional areas of the elements, Aj .

I The Jacobian of the residuals with respect to the displacements is simply the
stiffness matrix:
∂Rk ∂(Kki ui − Fk )
= = Kki .
∂yi ∂ui

I The derivative of the residuals with respect to the design variables is
∂Rk ∂(Kki ui − Fk ) ∂Kki

= = ui
∂xj ∂Aj ∂Aj
I The partial derivative of the stress with respect to the displacements is
simply given by
∂fm ∂σm
= = Smi
∂yi ∂ui
I Finally, the explicit variation of stress with respect to the cross-sectional areas
is zero, since the stresses depends only on the displacement field,
∂fm ∂σm
= = 0.
∂xj ∂Aj


I Substituting these into the total derivative equation we get:
dσm ∂σm −1 ∂Kki
=− K ui
dAj ∂ui ki ∂Aj
I If we were to use the direct method, we would solve,
dui ∂Kki
Kki =− ui
dAj ∂Aj
and then substitute the result in,

dσm ∂σm dui
=
dAj ∂ui dAj
to calculate the desired derivatives.


I The adjoint method is the other alternative, by solving,
T ∂σm
Kki ψk = .
∂ui
Then we would substitute the adjoint vector into the equation,

dσm ∂σm ∂Kki
= + ψkT − ui .
dAj ∂Aj ∂Aj
to calculate the desired derivatives.

Derivation of Analytic Methods from the Unifying Chain

Rule 1
I The assumption that the Jacobians are lower triangular matrices does no
longer apply.
I Therefore, we first linearize the residuals so that it is possible to write explicit
equations for the state variables y.
I We linearize about the converged point [x0 , r0 , y0 , f0 ]T , and divide v into
v1 = x, v2 = r, v3 = y, v4 = f .
I So instead of defining them as every single variable assignment in the

computer program, we defined them as variations in the design variables,
residuals, state variables and quantities of interest.


Rule 2
x
r1 r
r2
y1 y
y2
v = [v1 , . . . , vnx , v(nx +1) , . . . , v(nx +ny ) , v(nx +ny +1) , . . . , v(nx +2ny ) , v(n−nf ) , . . . , tn ]T .
| {z } | {z } | {z } | {z }
x r y f

Rule 3
∆x
∆r
∆y
∆f


Rule 4
I We have an initial perturbation x that leads to a response r.
I However, we require that r = 0 be satisfied when we take a total derivative,
so
∂R ∂R
R=0 ⇒ x+ y=0
∂x ∂y


Rule 5
I The solution vector y from this linear system is used with the original
perturbation vector x to compute the total change
v1 = x
∂R
v2 = r = x
∂x
−1
∂R
v3 = y = (−r)
∂y
∂F ∂F
v4 = f = x+ y
∂x ∂y
I Now, all variables are functions of only previous variables, so we can apply
the forward and reverse chain rule equations to the linearized system

2    3 2
2 32 3 @V2
T
@V3
T
@V4
T
dv4
I 0 0 0 I
Derivation of Analytic Methods from the Unifying Chain 6I 76
6 @V2 7 6 dv2 7 2 3 6 @v1  @v1  @v 7 6 dv1
T 7 6
1
6 I 0 07 6 7 I 6 T
6 @v1 7 6 dv1 7 6 7 6 @V3 @V4 7 6 dv4
Rule 6 @V3
6
6 @V3 I 7 6 dv 7 607
76 37 = 2 5
60
6 @v @v2 3 7
76
6 dv2
62 I 3027 6 3 7 4 0 6 T  T  2 T
3 2 T T 7 6
6 @v
I1 0@v2 0 0 7 I6 dv1 7 6 @V2 @V3 @V4 @V4 4 7
dv 6 dv
4 @V @V4 5 4
@V4 7 6 dv2 dv 5 60I 60 0 @v I @v 7 6 7 7 62 3 4
6 @V42 7 42 3 6 4 @v 7 6 dv
@v 7 5 4 0dv
6 I 0 07I6 7 I 6 1  1 T  1
T 7 6
1
3 T7 3
6 @v
@v11 @v2 @v3 7 6 dv
7 6
7 6 dv1 7 1 607 6 0I 0 @V3 0@V4 7 6 dv
7 6 I4 7 6
7 7
6 @V
6 @V3 7 6 dv3 7 = 6 7 6
0
@v2 76 7=4 5 607 I
 @v2 T 7 6 dv2
3
6 I 07 6 7 405 6 T7 0
6 @v1 (a)2 Forward chain
@v rule
7 6 dv 17 6
(b) @V 7 6 dv4 7
0 I Reverse chain 7
rule I
4 @V4 5 4 dv4 5 60 4 6 7
@V4 @V4 4 0
I @v3 5 4 dv3 5
@v1 @v2 @v3 dv1 0 0 0 I I
2 3
(a) Forward chain rule 2 2 32
3  T  T
3 2 T
3
(b) Reverse
@R chain rule @F df
6 I 0 0 0 7 6 I 7 6I 7 6 I 0 76 7
6 76 7 6 76 @x @x 7 6 dx 7
62 3 27 6 76  376  T7
7 63dr 7
T 3 2
6 @R 2 32607 6
 
@R T 76 2 3 df 7
07 6 7 T T
6 I 0 6 0@R I @F df
0 7 6 7
66 I 07 67I6 7I 66I 7 6 0 776 7
66
@x  0 0
7 67 67 dx 7
6=766 7 @x @y @x
76
7 6 dx 7 6 07dr
66
1
7 67 67 6 76
7dy 7 6 76  7 6 T76 6 7 T7
66 0@R
@R
I 70 67 6
dr 7
6 76
7
6 7 6 7 6
6 @R
T
7 6 df@F T 7
777
6
6 7
6
6 df
7
7
7
66 I
@x  @y
0 07 67 67 7 6007 60I
607 6 0 I0 76 776
7 6 07 7
66
6
@R
1
7 6 dx67dx 7
7 67 47 = 5 6 76
6 76
6 76
4 5
@y
 T7
7 6 dr
6 @y T 7 7=66 7
6 7
dy 7
46
6 @F
0
7
I @F07 I65dy 6
7df 607 6 4 7 6
@F 7 6 df 7 6 7 7 5 4 5
6 @y0 76 7 6 7 600 0 I 7 6 7 607
6 @x @y 7 6 dx 7dx 6 7 6 0 0 0@y 7 6 dy I 7 6 7I
4 @F @F 5 4 df 5 4 5 4 54 5 4 5
0 I 0 0 0 0 I I I
@x(c) Forward chain dx
@y rule (simplified) (d) Reverse chain rule (simplified)
(c) Forward chain rule (simplified) (d) Reverse chain rule (simplified)
dr @R df @F
=
J.R.R.A. Martins Multidisciplinary Design Optimization = August 2012 225 / 427
Example: Direct method applied to simple model 1

I Since there are just two state variables, we get the linear system:
 ∂R ∂ R1   dy1 dy1   ∂ R1 ∂ R1 
1
− −
 ∂y1 ∂y2   dx1 dx2   ∂x1 ∂x2 
 ∂R
2 ∂ R2   dy2 dy2  =  ∂ R2 ∂ R2  .
− −
∂y1 ∂y2 dx1 dx2 ∂x1 ∂x2
I We can use symbolic differentiation to compute each partial derivative of the
residual to obtain
 
dy1 dy1
−x1 −2  dx1 dx2  y1 − cos x1 0
= .
1 −x22  dy2 dy2  0 2x2 y2
dx1 dx2
I In a more realistic example, the computation of the partial derivatives would
not be as easy, since the residuals typically do not have simple analytical
expressions.

Example: Direct method applied to simple model 2

I Since the analytic methods are derived based on a linearization of the system
at a converged state, we must evaluate the system at [x1 , x2 ] = [1, 1] and
[y1 , y2 ] = [ sin3 1 , sin3 1 ].
I The computed values for dy1 /dx1 and dy2 /dx1 can be used to find df1 /dx1
using the following equation:
df1 ∂ F1 ∂ F1 dy1 ∂ F1 dy2
= + + .
dx1 ∂x1 ∂y1 dx1 ∂y2 dx1

Adjoint Method 1
I The linear system involving the Jacobian matrix ∂R/∂y can be solved with
∂f /∂y as the right-hand side.
I This results in the following adjoint equations,
T T
∂R ∂F
ψ=− ,
∂y ∂y
where ψ the adjoint matrix (of size ny × nf ).

I Although ψ is usually expressed as a vector, we obtain a matrix due to our
generalization for the case where f is a vector.
I The solution of this linear system needs to be solved for each column of
T
[∂F /∂y] , and thus the computational cost is proportional to the number of
quantities of interest, nf .
I The adjoint vector can be substituted to find the total derivative,
df ∂F ∂R
= + ψT
dx ∂x ∂x

Adjoint Method 2
I Thus, the cost of computing the total derivative matrix using the adjoint
method is independent of the number of design variables, nx , and instead
proportional to the number of quantities of interest, f .
I The partial derivatives shown in these equations need to be computed using
some other method. They can be differentiated symbolically, computed by
finite differences, the complex-step method or even AD. The use of AD for
these partials has been shown to be particularly effective in the development
of analytic methods for PDE solvers.

Example: Adjoint method applied to simple model

I Applying the adjoint method to compute df1 / dx1 , we get
 ∂R ∂ R2   df1 df2   ∂ F1 ∂ F2 
1
− −
 ∂y1 ∂y1   dr1 dr1   ∂y1 ∂y1 
 ∂R
1 ∂ R2   df1 df2  =  ∂ F1 ∂ F2 
− −
∂y2 ∂y2 dr2 dr2 ∂y2 ∂y2
I Replacing the partial derivatives computed symbolically,
 df2 
df1
−x1 1  dr1 dr1  1 0
−2 −x22  df1 df2  = 0 sin x1
dr2 dr2
I After evaluating the system at [x1 , x2 ] = [1, 1] and [y1 , y2 ] = [ sin3 1 , sin3 1 ], we
can find df1 / dx1 using the computed values for df1 / dr1 and df1 / dr2 :
df1 ∂ F1 df1 ∂ R1 df1 ∂ R2

= + +
dx1 ∂x1 dr1 ∂x1 dr2 ∂x1

Example: Computational Accuracy and Cost Comparison

Method Sample derivative Time Memory
Complex –39.049760045804646 1.00 1.00
ADIFOR –39.049760045809059 2.33 8.09
Analytic –39.049760045805281 0.58 2.42
FD –39.049724352820375 0.88 0.72

Constrained Optimization
1. Introduction
5.1 Introduction
5.2 Equality Constraints
5.3 Inequality Constraints
5.4 Constraint Qualification
5.5 Penalty Methods
5.6 Sequential Quadratic Programming

Constrained Optimization Introduction
I Engineering design optimization problems are rarely unconstrained.
I The constraints that appear in these problems are typically nonlinear.
I Thus, we are interested in general nonlinearly constrained optimization theory
and methods.
Recall the statement of a general optimization problem,
minimize f (x)
subject to ĉj (x) = 0, j = 1, . . . , m̂
ck (x) ≥ 0, k = 1, . . . , m

Example: Graphical Solution of a Constrained

Optimization Problem 1
Suppose we want to solve the following optimization problem,
minimize f (x) = 4x21 − x1 − x2 − 2.5

with respect to x1 , x2
subject to c1 (x) = x22 − 1.5x21 + 2x1 − 1 ≥ 0,
c2 (x) = x22 + 2x21 − 2x1 − 4.25 ≤ 0
How can we solve this?

Example: Graphical Solution of a Constrained

Optimization Problem 2

Constrained Optimization Equality Constraints
Optimality Conditions for Equality Constrained Problems

I The optimality conditions for nonlinearly constrained problems are important
because they form the basis of many algorithms for solving such problems.
I Suppose we have the following optimization problem with equality
constraints,
minimize f (x)
subject to ĉj (x) = 0, j = 1, . . . , m̂
I To solve this problem, we could solve for m̂ components of x by using the

equality constraints to express them in terms of the other components.
I The result would be an unconstrained problem with n − m̂ variables.
I However, this procedure is only feasible for simple explicit functions . . .

Lagrange Multipliers 1
I Joseph Louis Lagrange is credited with developing a more general method to
solve this problem.
I At a stationary point, the total differential of the objective function has to be
equal to zero,
∂f ∂f ∂f
df = dx1 + dx2 + · · · + dxn = ∇f T dx = 0.
∂x1 ∂x2 ∂xn
I Unlike unconstrained optimization, the infinitesimal vector
T
dx = [ dx1 , dx2 , . . . , dxn ] is not arbitrary
I The perturbation x + dx must be feasible: ĉj (x + dx) = 0.
I Therefore, the above equation does not imply that ∇f = 0.

I For a feasible point, the total differential of each of the constraints
(ĉ1 , . . . ĉm̂ ) must also be zero:
∂ĉj ∂ĉj
dĉj = dx1 + · · · + dxn = ∇ĉTj dx = 0, j = 1, . . . , m̂
∂x1 ∂xn
I To interpret the above equation, recall that the gradient of a function is
orthogonal to its contours.
I Thus, since the displacement dx satisfies ĉj (x + dx) = 0 (the equation for a
contour), it follow that dx is orthogonal to the gradient ∇ĉj .
I Lagrange suggested that one could multiply each constraint variation by a
scalar λ̂j and subtract it from the objective function variation,
 
Xm̂ Xn Xm̂
 ∂f ∂ĉj
df − λ̂j dĉj = 0 ⇒ − λ̂j dxi = 0.
j=1 i=1
∂x i j=1
∂x i

I Notice what has happened: the components of the infinitesimal vector dx
have become independent and arbitrary, because we have accounted for the
constraints.
I Thus, for this equation to be satisfied, we need a vector λ̂ such that the
expression inside the parenthesis vanishes, i.e.,
m̂
X ∂ĉj
∂f
− λ̂j = 0, (i = 1, 2, . . . , n)
∂xi j=1 ∂xi
which is a system of n equations and n + m unknowns. To close the system,

we recognize that the m constraints must also be satisfied.

Karush–Kuhn–Tucker (KKT) Conditions 1

I Suppose we define a function as the objective function minus a weighted sum
of the constraints,
m̂
X
L(x, λ̂) = f (x) − λ̂j ĉj (x) ⇒
j=1
L(x, λ̂) = f (x) − λ̂T ĉ(x)
I We call this function the Lagrangian of the constrained problem, and the
weights the Lagrange multipliers. A stationary point of the Lagrangian with
respect to both x and λ̂ will satisfy
X ∂ĉj m̂
∂L ∂f
= − λ̂j = 0, (i = 1, . . . , n)
∂xi ∂xi j=1 ∂xi
∂L
= ĉj = 0, (j = 1, . . . , m̂).
∂ λ̂j


I Thus, a stationary point of the Lagrangian encapsulates our required
conditions: the constraints are satisfied and the gradient conditions are
satisfied.
I These first-order conditions are known as the Karush–Kuhn–Tucker (KKT)
conditions. They are necessary conditions for the optimum of a constrained
problem.


I As in the unconstrained case, the first-order conditions are not sufficient to
guarantee a local minimum.
I For this, we turn to the second-order sufficient conditions (which, as in the
unconstrained case, are not necessary).
I For equality constrained problems we are concerned with the behavior of the
Hessian of the Lagrangian, denoted ∇2xx L(x, λ̂), at locations where the KKT
conditions hold. In particular, we look for positive-definiteness in a subspace
defined by the linearized constraints.
I Geometrically, if we move away from a stationary point (x∗ , λ̂∗ ) along a
direction w that satisfies the linearized constraints, the Lagrangian should
look like a quadratic along this direction.
I More precisely, the second-order sufficient conditions are
wT ∇2xx L(x∗ , λ̂∗ )w > 0,
for all w ∈ Rn such that
∇ĉj (x∗ )T w = 0, j = 1, . . . , m̂.

Example: Problem with Single Equality Constraint 1

Consider the following equality constrained problem:
minimize f (x) = x1 + x2
weight respect to x1 , x2
subject to ĉ1 (x) = x21 + x22 − 2 = 0
-1
-2
-2 -1 0 1 2


I In this example, the Lagrangian is
L = x1 + x2 − λ̂1 (x21 + x22 − 2)
I And the optimality conditions are

" 1 #
1 − 2λ̂1 x1 x
∇x L = = 0 ⇒ 1 = 2λ̂1 1
1 − 2λ̂1 x2 x2 2λ̂1
1
∇λ̂1 L = x21 + x22 − 2 = 0 ⇒ λ̂1 = ±
2
I To establish which are minima as opposed to other types of stationary points,
we need to look at the second-order conditions.
I Directions w = (w1 , w2 )T that satisfy the linearized constraints are given by
1
∇ĉ1 (x∗ )T w = (w1 + w2 ) = 0
λ̂1
⇒ w2 = −w1


I The Hessian of the Lagrangian at the stationary points is

2 −2λ̂1 0
∇x L = .
0 −2λ̂1
I Consequently, the Hessian of the Lagrangian in the subspace defined by w is

−2λ̂1 0 w1
wT ∇2xx L(x∗ )w = w1 −w1 = −4λ̂1 w12
0 −2λ̂1 −w1
I In this case λ̂∗1 = − 21 corresponds to a positive-definite Hessian (in the space

w) and, therefore, the solution to the problem is
(x1 , x2 )T = ( 2λ̂1 , 2λ̂1 )T = (−1, −1)T .
1 1
I At the solution the constraint normal ∇ĉ1 (x∗ ) is parallel to ∇f (x∗ ), i.e.,
there is a scalar λ̂∗1 such that
∇f (x∗ ) = λ̂∗1 ∇ĉ1 (x∗ ).


I We can derive this expression by examining the first-order Taylor series
approximations to the objective and constraint functions. To retain feasibility
with respect to ĉ1 (x) = 0 we require that
ĉ1 (x + d) = 0 ⇒
ĉ1 (x + d) = ĉ1 (x) +∇ĉT1 (x)d + O(dT d).
| {z }
=0
I Linearizing this we get,

∇ĉT1 (x)d = 0 .
I We also know that a direction of improvement must result in a decrease in f ,
i.e.,
f (x + d) − f (x) < 0.


I Thus to first order we require that
f (x) + ∇f T (x)d − f (x) < 0 ⇒

∇f T (x)d < 0 .
I A necessary condition for optimality is that there be no direction satisfying

both of these conditions. The only way that such a direction cannot exist is if
∇f (x) and ∇ĉ1 (x) are parallel, that is, if ∇f (x) = λ̂1 ∇ĉ1 (x) holds.
I By defining the Lagrangian function
L(x, λ̂1 ) = f (x) − λ̂1 ĉ1 (x),
and noting that ∇x L(x, λ̂1 ) = ∇f (x) − λ̂1 ∇ĉ1 (x), we can state the
necessary optimality condition as follows: At the solution x∗ there is a scalar
λ̂∗1 such that ∇x L(x∗ , λ̂∗1 ) = 0.
I Thus we can search for solutions of the equality-constrained problem by
searching for a stationary point of the Lagrangian function. The scalar λ̂1 is
the Lagrange multiplier for the constraint ĉ1 (x) = 0.
Constrained Optimization Inequality Constraints
Optimality for Inequality Constrained Problems 1

I Suppose we now have a general problem with equality and inequality
constraints.
minimize f (x)
w.r.t x ∈ Rn
subject to ĉj (x) = 0, j = 1, . . . , m̂
ck (x) ≥ 0, k = 1, . . . , m
I The optimality (KKT) conditions for this problem can also be obtained for
this case by modifying the Lagrangian to be

L(x, λ̂, λ, s) = f (x) − λ̂T ĉ(x) − λT c(x) − s2 ,
where λ are the Lagrange multipliers associated with the inequality

constraints and s is a vector of slack variables.

First-Order KKT Conditions
m̂
X ∂ĉj X ∂ck m
∂L ∂f
∇x L = 0 ⇒ = − λ̂j − λk = 0, i = 1, . . . , n
∂xi ∂xi j=1 ∂xi ∂xi
k=1
∂L
∇λ̂ L = 0 ⇒ = ĉj = 0, j = 1, . . . , m̂
∂ λ̂j
∂L
∇λ L = 0 ⇒ = ck − s2k = 0 k = 1, . . . , m
∂λk
∂L
∇s L = 0 ⇒ = λk sk = 0, k = 1, . . . , m
∂sk
λk ≥ 0, k = 1, . . . , m.
Now we have n + m̂ + 2m equations and for each constraint:

I sk > 0: the k-th constraint is inactive, and λk = 0.
I sk = 0: the k-th constraint is active, and λk 6= 0. λk must then be
non-negative

Sufficient Optimality Conditions

Sufficient conditions are obtained by examining the second-order requirements.
The set of sufficient conditions is as follows:
1. KKT necessary conditions must be satisfied at x∗ .
2. The Hessian matrix of the Lagrangian,
m̂
X m
X
∇2 L = ∇2 f (x∗ ) − λ̂j ∇2 ĉj − λk ∇2 ck
j=1 k=1
is positive definite in the feasible space. This is a subspace of n-space and is

defined as follows: any direction y that satisfies
y 6= 0
∇ĉTj (x∗ )y = 0, for all j = 1, . . . , m̂
∇cTk (x∗ )y = 0, for all k for which λk > 0.
Then the Hessian of the Lagrangian in feasible space must be positive
definite,
y T ∇2 L(x∗ )y > 0.
Example: Problem with a Single Inequality Constraint 1

I Suppose we now have the same problem, but with an inequality replacing the
equality constraint,
s.t. c1 (x) = 2 − x21 − x22 ≥ 0
I The feasible region is now the circle and its interior. Note that ∇c1 (x) now
points towards the center of the circle.
I Graphically, we can see that the solution is still (−1, −1)T and therefore
λ∗1 = 1/2.


2
-1
-2
-2 -1 0 1 2


I Given a point x that is not optimal, we can find a step d that both stays
feasible and decreases the objective function f , to first order. As in the
equality constrained case, the latter condition is expressed as
∇f T (x)d < 0 .
I The first condition, however is slightly different, since the constraint is not
necessarily zero, i.e.
c1 (x + d) ≥ 0
I Performing a Taylor series expansion we have,
c1 (x + d) ≈ c1 (x) + ∇cT1 (x)d.

| {z }
≥0
I Thus feasibility is retained to a first order if
c1 (x) + ∇cT1 (x)d ≥ 0 .


I In order to find valid steps d it helps two consider two possibilities.
1. Suppose x lies strictly inside the circle (c1 (x) > 0). In this case, any vector d
satisfies the feasibility condition, provided that its length is sufficiently small.
The only situation that will prevent us from finding a descent direction is if
∇f (x) = 0.
2. Consider now the case in which x lies on the boundary, i.e., c1 (x) = 0. The
conditions thus become ∇f T (x)d < 0 and ∇cT1 (x)d ≥ 0. The two regions
defined by these conditions fail to intersect only when ∇f (x) and ∇c1 (x)
point in the same direction, that is, when
∇f (x)T d = λ1 c1 (x), for some λ1 ≥ 0.
I The optimality conditions for these two cases can again be summarized by
using the Lagrangian function, that is,
∇x L(x∗ , λ∗1 ) = 0, for some λ∗1 ≥ 0 and λ∗1 s∗1 = 0.


I The last condition is known as a complementarity condition and implies that
the Lagrange multiplier can be strictly positive only when the constraint is
active.
2
-1
-2
-2 -1 0 1 2

Example: Lagrangian Whose Hessian is Not Positive

Definite
minimize f (x) = −x1 x2

s.t. ĉ1 (x) = 2 − x21 − x22 = 0
x1 ≥ 0, x2 ≥ 0
2
1.5
0.5
0
0 0.5 1 1.5 2

Example: Problem with Two Inequality Constraints 1

Suppose we now add another inequality constraint,
s.t. c1 (x) = 2 − x21 − x22 ≥ 0, c2 (x) = x2 ≥ 0.
The feasible
√ region is now a half disk. Graphically, we can see that the solution is
now (− 2, 0)T and that both constraints are active at this point.
1.5
0.5
0
-2 -1 0 1 2


The Lagrangian for this problem is

L(x, λ, s) = f (x) − λ1 c1 (x) − s21 − λ2 c2 (x) − s22 ,
where λ = (λ1 , λ2 )T is the vector of Lagrange multipliers. The first order
optimality conditions are thus,
∇x L(x∗ , λ∗ ) = 0, for some λ∗ ≥ 0.
Applying the complementarity conditions to both inequality constraints,
λ∗1 s∗1 = 0, and λ∗2 s∗2 = 0.
√
For x∗ = (− 2, 0)T we have,
√
∗ 1 ∗ 2 2 ∗ 0
∇f (x ) = , ∇c1 (x ) = , ∇c2 (x ) = ,
1 0 1
and ∇x L(x∗ , λ∗ ) = 0 when 1

√
∗ 2 2 .
λ =
1

I Now lets consider other feasible points that are not optimal and examine the
Lagrangian and its gradients at these points.
√
I For point x = ( 2, 0)T , both constraints are again active. However, ∇f (x)
no longer lies in the quadrant defined by ∇ci (x)T d ≥ 0, i = 1, 2 and
therefore there are descent directions that are feasible, like for example
d = (−1, 0)T .
1
I ∇x L(x∗ , λ∗ ) = 0 at this point for λ = (− 2√ 2
, 1)T . However, since λ1 is
negative, the first order conditions are not satisfied at this point.
1.5
0.5
0
-2 -1 0 1 2


I Now consider the point x = (1, 0)T , for which only the second constraint is
active. Linearizing f and c as before, d must satisfy the following to be a
feasible descent direction,
c1 (x + d) ≥ 0 ⇒ 1 + ∇c1 (x)T d ≥ 0,
c2 (x + d) ≥ 0 ⇒ ∇c2 (x)T d ≥ 0,
f (x + d) − f (x) < 0 ⇒ 1 + ∇f (x)T d < 0.
I We only need to worry about the last two conditions, since the first is always
satisfied for a small enough step.
I By noting that
1 0
∇f (x∗ ) = , ∇c2 (x∗ ) = ,
1 1

we can see that the vector d = − 12 , 14 , for example satisfies the two
conditions.


I Since c1 (x) > 0, we must have λ1 = 0. In order to satisfy ∇x L(x, λ) = 0 we
would have to find λ2 such that ∇f (x) = λ2 ∇c2 (x). No such λ2 exists and
this point is therefore not an optimum.
1.5
0.5
0
-2 -1 0 1 2

Constrained Optimization Constraint Qualification
Constraint Qualification 1
I The KKT conditions are derived using certain assumptions and depending on
the problem, these assumptions might not hold.
I A point x satisfying a set of constraints is a regular point if the gradient
vectors of the active constraints, ∇cj (x) are linearly independent.
I To illustrate this, suppose we replaced the ĉ1 (x) in the previous example by
the equivalent condition
2
ĉ1 (x) = x21 + x22 − 2 = 0.
I Then we have
4(x21 + x22 − 2)x1
∇ĉ1 (x) = ,
4(x21 + x22 − 2)x2
so ∇ĉ1 (x) = 0 for all feasible points and ∇f (x) = λ̂1 ∇ĉ1 (x) cannot be
satisfied. In other words, there is no (finite) Lagrange multiplier that makes
the objective gradient parallel to the constraint gradient, so we cannot solve
the optimality conditions.

Constrained Optimization Constraint Qualification
Constraint Qualification 2
I This does not imply there is no solution; on the contrary, the solution
remains unchanged for the earlier example.
I Instead, what it means is that most algorithms will fail, because they assume
the constraints are linearly independent.

Constrained Optimization Penalty Methods
Penalty Function Methods

I One of the ways of solving constrained optimization problems, at least
approximately, is by adding a penalty function to the objective function that
depends — in some logical way — on the value of the constraints.
I The idea is to minimize a sequence of unconstrained minimization problems
where the infeasibility of the constraints is minimized together with the
objective function.
I There two main types of penalization methods:
I Exterior penalty functions: These impose a penalty for violation of constraints
I Interior penalty functions: These impose a penalty for approaching the
boundary of an inequality constraint.

Exterior Penalty Functions 1

I Consider the equality-constrained problem:
minimize f (x)
subject to ĉ(x) = 0
where ĉ(x) is an m̂-dimensional vector whose j-th component is ĉj (x).

I We assume that all functions are twice continuously differentiable.
I We require a penalty for constraint violation to be a continuous function φ
with the following properties
φ(x) = 0 if x is feasible
φ(x) > 0 otherwise,
I The new objective function is
π(x, ρ) = f (x) + ρφ(x),
were ρ is positive and is called the penalty parameter.


I The penalty method consists of solving a sequence of unconstrained
minimization problems of the form
minimize π (x, ρk )
w.r.t. x
for an increasing sequence of positive values of ρk tending to infinity.

I For finite values of ρk , the minimizer of the penalty function violate the
equality constraints. The increasing penalty forces the minimizer toward the
feasible region.


General algorithm using exterior penalty functions:
1: Input: x0 , τ . Starting point, penalty multiplier

2: Output: x∗ . Optimum point
3: repeat
4: Solve the following unconstrained subproblem starting from xk :
minimize π(x, ρk )
w.r.t. x
5: xk+1 ← x
6: ρk+1 ← τ ρk . Increase the penalty parameter
7: k ←k+1
8: until xk converges to the desired tolerance
The increase in the penalty parameter for each iteration can range from modest
(ρk+1 = 1.4ρk ), to ambitious (ρk+1 = 10ρk ), depending on the problem.

Quadratic Penalty Method 1

I The quadratic penalty function is defined as
m̂
ρX ρ
π(x, ρ) = f (x) + ĉi (x)2 = f (x) + ĉ(x)T ĉ(x).
2 i=1 2
I The penalty is equal to the sum of the square of all the constraints and is
therefore greater than zero when any constraint is violated and is zero when
the point is feasible.
I We can modify this method to handle inequality constraints by defining the
penalty for these constraints as
m
X 2
φ(x, ρ) = ρ (max [0, −ci (x)]) .
i=1
I Penalty functions suffer from problems of ill conditioning. The solution of the
modified problem approaches the true solution as limρ→+∞ x∗ (ρ) = x∗

Quadratic Penalty Method 2

I However, as the penalty parameter increases, the condition number of the
Hessian matrix of π(x, ρ) increases and tends to ∞. This makes the problem
increasingly difficult to solve numerically.

Interior Penalty Methods 1

I Exterior penalty methods generate infeasible points and are therefore not
suitable when feasibility has to be strictly maintained.
I This might be the case if the objective function is undefined or ill-defined
outside the feasible region.
I Interior point methods also solve a sequence of unconstrained modified
differentiable functions whose unconstrained minima converge to the
optimum solution of the constrained problem in the limit.
I Consider the inequality-constrained problem:
minimize f (x)
subject to c(x) ≥ 0
where c(x) is an m-dimensional vector whose j-th component is cj (x).

I We assume that all functions are twice continuously differentiable.


I The logarithmic barrier function adds a penalty that tends to infinity as x
approaches infeasibility. The function is defined as
m
X
π(x, µ) = f (x) − µ log (cj (x)) ,
j=1
where the positive scalar µ is called the barrier parameter.

I The inverse barrier function is defined as
m
X 1
π(x, µ) = f (x) + µ ,
j=1
cj (x)
and shares many of the same characteristics of the logarithmic barrier.
I The solution of the modified problem for both functions approach the real
solution as limµ→0 x∗ (µ) = x∗ .
I Again, the Hessian matrix becomes increasingly ill conditioned as µ
approaches zero.

I Similarly to the an exterior point method, an algorithm using these barrier
functions finds the minimum of π(x, µk ), for a given starting (feasible) point
and terminates when norm of gradient is close to zero.
I The algorithm then chooses a new barrier parameter µk+1 and a new starting
point, finds the minimum of the new problem and so on.
I A value of 0.1 for the ratio µk+1 /µk is usually considered ambitious.

Example: Quadratic Penalty Function in Action

Constrained Optimization Sequential Quadratic Programming
Sequential Quadratic Programming (SQP) 1

I Consider the the equality-constrained problem,
minimize f (x)
subject to ĉj (x) = 0, j = 1, . . . , m̂
I The idea of SQP is to model this problem at the current point xk by a

quadratic subproblem and to use the solution of this subproblem to find the
new point xk+1 .
I SQP represents the application of Newton’s method to the KKT optimality
conditions.
I The Lagrangian function for this problem is L(x, λ̂) = f (x) − λ̂T ĉ(x). We
define the Jacobian of the constraints by
A(x)T = ∇ĉ(x)T = [∇ĉ1 (x), . . . , ∇ĉm̂ (x)]
which is an n × m matrix and g(x) ≡ ∇f (x) is an n-vector as before. Note

that A is generally not symmetric.

Sequential Quadratic Programming (SQP) 2

I Applying the first order KKT conditions to this problem we obtain

g(x) − A(x)T λ̂
∇L(x, λ̂) = 0 ⇒ =0
ĉ(x)
I This set of nonlinear equations can be solved using Newton’s method,

W (xk , λ̂k ) −A(xk )T pk −gk + ATk λ̂k
=
A(xk ) 0 pλ̂ −ĉk
where the Hessian of the Lagrangian is denoted by W (x, λ̂) = ∇2xx L(x, λ̂).
I The Newton step from the current point is given by

xk+1 xk p
= + k .
λ̂k+1 λ̂k pλ̂

Alternative View of SQP 1

I An alternative way of looking at this formulation of the SQP is to define the
following quadratic problem at (xk , λ̂k )
1 T
minimize p Wk p + gkT p
2
subject to Ak p + ĉk = 0
I This problem has a unique solution that satisfies
Wk p + gk − ATk λ̂k = 0
Ak p + ĉk = 0
I By writing this in matrix form, we see that pk and λ̂k can be identified as the
solution of the Newton equations we derived previously.

Wk −ATk pk −gk
=
Ak 0 λ̂k+1 −ĉk .

Alternative View of SQP 2

I This problem is equivalent, but the second set of variables, is now the actual
vector of Lagrange multipliers λ̂k+1 instead of the Lagrange multiplier step,
pλ̂ .

Quasi-Newton Approximations 1
I Any SQP method relies on a choice of Wk (an approximation of the Hessian
of the Lagrangian) in the quadratic model.
I When Wk is exact, then the SQP becomes the Newton method applied to
the optimality conditions.
I One way to approximate the Hessian of the Lagrangian would be to use a
quasi-Newton approximation, such as the BFGS update formula. We could
define,
sk = xk+1 − xk , yk = ∇x L(xk+1 , λk+1 ) − ∇x L(xk , λk+1 ),
and then compute the new approximation Bk+1 using the same formula used
in the unconstrained case.
I If ∇2xx L is positive definite at the sequence of points xk , the method will
converge rapidly, just as in the unconstrained case. If, however, ∇2xx L is not
positive definite, then using the BFGS update may not work well.

I To ensure that the update is always well-defined the damped BFGS updating
for SQP was devised. Using this scheme, we set
rk = θk yk + (1 − θk )Bk sk ,
where the scalar θk is defined as

(
1 if sTk yk ≥ 0.2sTk Bk sk ,
θk = 0.8sT
k Bk sk
sT T if sTk yk < 0.2sTk Bk sk .
k Bk sk −sk yk
Then we can update Bk+1 using,
Bk sk sTk Bk rk rT
Bk+1 = Bk − T
+ T k,
sk Bk sk sk rk
which is the standard BFGS update formula with yk replaced by rk . This

guarantees that the Hessian approximation is positive definite.
I When θk = 0, we have Bk+1 = Bk

I When θk = 1 we have an unmodified BFGS update.
I The modified method thus produces an interpolation between the current Bk
and the one corresponding to BFGS.
I The choice of θk ensures that the new approximation stays close enough to
the current approximation to guarantee positive definiteness.

Other Modifications 1
I In addition to using a different quasi-Newton update, SQP algorithms also
need modifications to the line search criteria in order to ensure that the
method converges from remote starting points.
I It is common to use a merit function, φ to control the size of the steps in the
line search. The following is one of the possibilities for such a function:
1
φ(xk ; µ) = f (x) + ||ĉ||1
µ
I The penalty parameter µ is positive and the L1 norm of the equality
constraints is
m̂
X
||ĉ||1 = |ĉj |.
j=1

Other Modifications 2
I To determine the sequence of penalty parameters, the following strategy is
often used (
µk−1 if µ−1
k−1 ≥ γ + δ
µk = −1
(γ + 2δ) otherwise,
where γ is set to max(λk+1 ) and δ is a small tolerance that should be larger
that the expected relative precision of your function evaluations.

SQP Algorithm
Input: Initial guess (x0 , λ0 ), parameters 0 < η < 0.5

k←0
Initialize the Hessian estimate, B0 ← I
repeat
Compute pk and pλ̂ by solving the KKT system, with Bk in place of Wk
Choose µk such that pk is a descent direction for φ at xk
αk ← 1
while φ(xk + αk pk , µk ) > φ(xk , µk ) + ηαk D [φ(xk , pk )] do
αk ← τα αk for some 0 < τα < 1
end while
xk+1 ← xk + αk pk
λ̂k+1 ← λ̂k + pλ̂
Evaluate fk+1 , gk+1 , ck+1 and Ak+1
sk ← αk pk , yk ← ∇x L(xk+1 , λk+1 ) − ∇x L(xk , λk+1 )
Obtain Bk+1 by using a quasi-Newton update to Bk
k ←k+1
until Convergence
D denotes the directional derivative in the pk direction.

Inequality Constraints 1
I The SQP method can be extended to handle inequality constraints.
I Consider general nonlinear optimization problem
minimize f (x)
subject to ĉj (x) = 0, j = 1, . . . , m̂
ck (x) ≥ 0, k = 1, . . . , m
I To define the subproblem we now linearize both the inequality and equality
constraints and obtain,
1 T
minimize p Wk p + gkT p
2
subject to ∇ĉj (x)T p + ĉj (x) = 0, j = 1, . . . , m̂
∇ck (x)T p + ck (x) ≥ 0, k = 1, . . . , m
I One of the most common type of strategy to solve this problem, the
active-set method, is to consider only the active constraints at a given
iteration and treat those as equality constraints.
Inequality Constraints 2
I This is a significantly more difficult problem because we do not know a priori
which inequality constraints are active at the solution. If we did, we could just
solve the equality constrained problem considering only the active constraints.
I The most commonly used active-set methods are feasible-point methods.
These start with a feasible solution a never let the new point be infeasible.

Example: Constrained Optimization Using SQP

Gradient-Free Optimization
Gradient-Free Optimization
1. Introduction
6.1 Introduction
6.2 Nelder–Mead Simplex
6.3 DIvided RECTangles (DIRECT)
6.4 Genetic Algorithms
6.5 Particle Swarm Optimization

Gradient-Free Optimization Introduction
Gradient-Free Optimization 1
Using optimization in the solution of practical applications we often encounter one
or more of the following challenges:
I non-differentiable functions and/or constraints
I disconnected and/or non-convex feasible space
I discrete feasible space
I mixed variables (discrete, continuous, permutation)
I large dimensionality
I multiple local minima (multi-modal)
I multiple objectives
Mi
xed
(
Int
eger
-Cont
inuous
)

Gradient-based methods are:
I Efficient in finding local minima for high-dimensional, nonlinearly-constrained,
convex problems
I Sensitive to noisy and discontinuous functions
I Limited to continuous design variables.
Consider, for example, the Griewank function:
n
P Q
n
x2i xi
f (x) = 4000 − cos √
i
+1
i=1 i=1
−600 ≤ xi ≤ 600

How we could find the best solution for this example?
I Multiple point restarts of gradient (local) based optimizer
I Systematically search the design space
I Use gradient-free optimizers
Some comments on gradient-free methods:
I Many mimic mechanisms observed in nature — biomimicry — or use other
heuristics.
I They are not necessarily guaranteed to find the true global optimal solutions
— unlike gradient-based methods in a convex search space . . .
I . . . but they are able to find many good solutions — the mathematician’s
answer vs. the engineer’s answer.
I Their key strength is the ability to solve some problems that are difficult to
solve using gradient-based methods.
I Many of them are designed as global optimizers and thus are able to find
multiple local optima while searching for the global optimum.

A wide variety of gradient-free methods have been developed. We are going to
look at some of the most commonly used algorithms:
I Nelder–Mead Simplex (Nonlinear Simplex)
I Divided Rectangles Method
I Genetic Algorithms
I Particle Swarm Optimization

Gradient-Free Optimization Nelder–Mead Simplex
Nelder–Mead Simplex 1
I The simplex method of Nelder and Mead performs a search in n-dimensional
space using heuristic ideas.
I It is also known as the nonlinear simplex
I Not to be confused with the linear simplex, with which it has nothing in
common.
I Strengths: requires no derivatives to be computed and that it does not
require the objective function to be smooth.
I The weakness: not very efficient, particularly for problems with more than
about 10 design variables; above this number of variables convergence
becomes increasingly difficult.
I A simplex is a structure in n-dimensional space formed by n + 1 points that

are not in the same plane.
I A line segment is a 1-dimensional simplex, a triangle is a 2-dimensional
simplex and a tetrahedron forms a simplex in 3-dimensional space.
I The simplex is also called a hypertetrahedron.

Nelder–Mead Simplex 2
The Nelder–Mead algorithm starts with a simplex (n + 1 sets of design variables
x) and then modifies the simplex at each iteration using four simple operations.
The sequence of operations to be performed is chosen based on the relative values
of the objective function at each of the points.

Nelder–Meade Algorithm 1
I The first step of the simplex algorithm is to find the n + 1 points of the
simplex given an initial guess x0 .
I This can be easily done by simply adding a step to each component of x0 to
generate n new points.
I However, generating a simplex with equal length edges is preferable . . .
I Suppose the length of all sides is required to be c and that the initial guess,
x0 is the (n + 1)th point.
I The remaining points of the simplex, i = 1, . . . , n can be computed by
adding a vector to x0 whose components are all b except for the ith
component which is set to a, where
c √
b= √ n+1−1
n 2
c
a=b+ √ .
2

0.9
0.8 0.9
0.8
0.7
0.7
0.6
0.6
0.5 0.5
x3
x2
0.4
0.4
0.3
0.3 0.2
0.1
0.2
0
0 0.8
0.1 0.2
0.4 0.6
0.6 0.4
0 0.8 0.2
0 0.2 0.4 0.6 0.8 1 1 0
x1 x2
x1
I After generating the initial simplex, we have to evaluate the objective

function at each of its vertices in order to identify three key points:
I The highest value — the worst point, xw
I The second highest value — the lousy point, xl
I The lowest value — the best point, xb

The Nelder–Mead algorithm starts by computing the average of the n points that
exclude exclude the worst,
n+1
1 X
xa = xi .
n
i=1,i6=w
The algorithm then performs five main operations to the simplex:


!"#$%

*+,-.
&'() /012345 6/7189613/7

KLMNOPQRO ABCADE FGBHIJFHAGB :;<=>?=>@
I Reflection
xr = xa + α (xa − xw )

I Expansion
xe = xr + γ (xr − xa ) ,
where the expansion parameter γ is usually set to 1.
I Inside contraction
xc = xa − β (xa − xw ) ,
where the contraction factor is usually set to β = 0.5.
I Outside contraction
xo = xa + β (xa − xw ) .
I Shrinking
xi = xb + ρ (xi − xb ) ,
where the scaling parameter is usually set to ρ = 0.5.
Each of these operations generates a new point and the sequence of operations
performed in one iteration depends on the value of the objective at the new point
relative to the other key points.

Initialize n-simplex,
evaluate n+1 points
Rank vertices:
best, lousy and
worst
Reflect
Is reflected point better

Yes Expand
than best point?
No Keep expanded point
Yes
Perform inside Is reflected point worse Is expanded point better
Yes
contraction than worst point? than best point?
No
No
Keep reflected point
Is new point worse than Is reflected point worse Perform outside

Yes
worst point? than lousy point? contraction
No Yes No
Shrink
Yes
Is new point worse than
Keep new point Shrink Keep reflected point
reflected point? No
Keep new point

Nelder–Meade Algorithm
Input: Initial guess, x0
k←0
Create a simplex with edge length c
repeat
Identify the highest (xw : worst), second highest (xl , lousy) and lowest (xb :
best) value points with function values fw , fl , and fb , respectively
Evaluate xa , the average of the point in simplex excluding xw
Perform reflection to obtain xr , evaluate fr
if fr < fb then
Perform expansion to obtain xe , evaluate fe .
if fe < fb then
xw ← xe , fw ← fe (accept expansion)
else
xw ← xr , fw ← fr (accept reflection)
end if
else if fr ≤ fl then
xw ← xr , fw ← fr (accept reflected point)
else
if fr > fw then
Perform an inside contraction and evaluate fc
if fc < fw then
xw ← xc (accept contraction)
else
Shrink the simplex
end if
else
Perform an outside contraction and evaluate fc
if fc ≤ fr then
xw ← xc (accept contraction)
else
Shrink the simplex
end if
end if
end if
k ←k+1
until (fw − fb ) < (ε1 + ε2 |fb |)

Alternative Convergence Criteria

I The criterion used above is based on the difference between the best and the
worst function value,
(fw − fb ) < (ε1 + ε2 |fb |)
I Alternatively, we can use the size of simplex,
n
X
s= |xi − xn+1 |
i=1
which must be less than a certain tolerance.

I Another measure of convergence that can be used is the standard deviation,
s
Pn+1
¯2
i=1 fi − f
σ=
n+1
where f¯ is the mean of the n + 1 function values.

Variations of the Simplex Algorithm

I Since the simplex method is largely based on heuristics, the original method
has been the subject of many proposed changes . . .
I . . . but none of the proposed changes have replace the original algorithm,
except for one:
I We notice that if fe < fb but fr is even better (i.e., fr < fe ) the algorithm
still accepts the expanded point xe . Now, it is standard practice to accept
the best of fr and fe

Example: Minimizing Rosenbrock with Nelder–Meade

Gradient-Free Optimization DIvided RECTangles (DIRECT)
DIvided RECTangles (DIRECT) Method 1

The DIRECT method uses a hyperdimensional adaptive meshing scheme to search
all the design space to find the optimum.
The overall idea behind DIRECT is as follows.
1. Scale the design box to a n-dimensional unit hypercube and evaluating the
objective function at the center point of the hypercube
2. Divide the potentially optimal hyper-rectangles by sampling the longest
coordinate directions of the hyper-rectangle and trisecting based on the
directions with the smallest function value until the global minimum is found
3. Sampling of the maximum length directions prevents boxes from becoming
overly skewed and trisecting in the direction of the best function value allows
the biggest rectangles contain the best function value. This strategy increases
the attractiveness of searching near points with good function values
4. Iterate the above procedure allow to identify and zoom into the most
promising design space regions


I
DENTIFYPOTENTI
ALLY SAMPLE&DI VI
DE
START
OPTI
MUM RECTANGLES RECTANGLES


I To identify the potentially optimal rectangles we consider the values of f
versus the d for a given group of points.
I The line connecting the points with the lowest f for a given d (or greatest d
for a given f ) represent the points with the most potential.


I Mathematically, assuming that the unit hypercube with center ci is divided
into m hyper-rectangles, a hyper-rectangle j is potentially optimal if there
exists rate-of-change constant K̄ > 0 such that
f (cj ) − K̄dj ≤ f (ci ) − K̄di for all i = 1, ..., m
(4)
f (cj ) − K̄dj ≤ fmin − ε|fmin |,
where
I d is the distance between c and the vertices of the hyper-rectangle
I fmin is the best current value of the objective function
I ε is positive parameter used so that f (cj ) exceeds the current best solution by
a non-trivial amount
I The first equation forces the selection of the rectangles on this line.
I The second equation requires that the function value exceeds the current
best function value by an amount that is not insignificant.
I This prevents the algorithm from becoming too local, wasting precious
function evaluations in search of smaller function improvements.
I The parameter ε balances the search between local and global. A typical
value is ε = 10−4 , and its the range is usually such that 10−2 ≤ ε ≤ 10−7 .
DIRECT Algorithm
Input: Initial guess, x0

k←0
repeat
Normalize the search space to be the unit hypercube. Let c1 be the center
point of this hypercube and evaluate f (c1 ).
Identify the set S of potentially optimal rectangles/cubes, that is all those
rectangles defining the bottom of the convex hull of a scatter plot of rectangle
diameter versus f (ci ) for all rectangle centers ci
for all Rectangles r ∈ S do
Identify the set I of dimensions with the maximum side length
Set δ equal one third of this maximum side length
for all i ∈ I do
Evaluate the rectangle/cube at the point cr ± δei for all i ∈ I, where
cr is the center of the rectangle r, and ei is the ith unit vector
end for
Divide the rectangle r into thirds along the dimensions in I, starting with
the dimension with the lowest value of f (c±δei ) and continuing to the dimension
with the highest f (c ± δei ).
end for
until Converged

Example: Minimization of Rosenbrock with DIRECT

Gradient-Free Optimization Genetic Algorithms
Genetic Algorithms
I Genetic algorithms for optimization are inspired by the process of natural
evolution of organisms.
I First developed by John Holland in the mid 1960’s. Holland was motivated
by a desire to better understand the evolution of life by simulating it in a
computer and the use of this process in optimization.
I Genetic algorithms are based on three essential components:
I Survival of the fittest — Selection
I Reproduction processes where genetic traits are propagated — Crossover
I Variation — Mutation
I We use the term “genetic algorithms” generically to refer to optimization
approaches that use the three components above.
I Depending on the approach they have different names, for example: genetic
algorithms, evolutionary computation, genetic programming, evolutionary
programming, evolutionary strategies.

Genetic Algorithm Nomenclature

We will start by posing the unconstrained optimization problem with design
variable bounds,
minimize f (x)
subject to xl ≤ x ≤ xu
where xl and xu are the vectors of lower and upper bounds on x, respectively.
In the context of genetic algorithms we will call each design variable vector x a
population member. The value of the objective function, f (x) is termed the
fitness.
Genetic algorithms are radically different from the gradient based methods we
have covered so far. Instead of looking at one point at a time and stepping to a
new point for each iteration, a whole population of solutions is iterated towards
the optimum at the same time. Using a population lets us explore multiple
“buckets” (local minima) simultaneously, increasing the likelihood of finding the
global optimum.

The Pros and Cons of Genetic Algorithms

I Pros:
I Uses a coding of the parameter set, not the parameter themselves, so the
algorithm can handle mixed continuous, integer and discrete design variables.
I The population can cover a large range of the design space and is less likely
than gradient based methods to “get stuck” in local minima.
I As with other gradient-free methods, it can handle noisy and discontinuous
objective functions.
I The implementation is straightforward and easily parallelized.
I Can be used for multiobjective optimization.
I There is “no free lunch”, of course, and these methods have some cons: The
main one is that genetic algorithms are expensive when compared to
gradient-based methods, especially for problems with a large number of
design variables.
I However, it is sometimes difficult to make gradient-based methods work and
in some of these problems genetic algorithms work very well with little effort.
I Although genetic algorithms are much better than completely random
methods, they are still “brute force” methods that require a large number of
function evaluations.
Single-Objective Optimization 1
The general procedure of a genetic algorithm can be described as follows:
1. Initialize a population: Each member of the population represents a design
point, x and has a value of the objective (fitness), and information about its
constraint violations associated with it.
2. Determine mating pool: Each population member is paired for reproduction
by using one of the following methods:
I Random selection
I Based on fitness: make the better members to reproduce more often than the
others.
3. Generate offspring: To generate offspring we need a scheme for the crossover
operation. There are various schemes that one can use. When the design
variables are continuous, for example, one offspring can be found by
interpolating between the two parents and the other one can be extrapolated
in the direction of the fitter parent.
4. Mutation: Add some randomness in the offspring’s variables to maintain
diversity.

Single-Objective Optimization 2
5. Compute Offspring’s Fitness
Evaluate the value of the objective function and constraint violations for each
offspring.
6. Tournament
Again, there are different schemes that can be used in this step. One method
involves replacing the worst parent from each “family” with the best offspring.
7. Identify the Best Member
8. Return to step 2 unless converged or computational budget is exceeded.
I Convergence is difficult to determine because the best solution so far may be

maintained for many generations.
I Rule of thumb: if the best solution among the current population has not
changed (much)for about 10 generations, it can be assumed as the
“optimum” for the problem.
I Since GAs are probabilistic methods, it is crucial to run the problem multiple
times when studying its characteristics.

Multi-Objective Optimization 1
I What if we want to investigate the trade-off between two (or more)
conflicting objectives?
I Examples . . .
I In this situation there is no one “best design” . . .
I . . . but there is a set of designs that are the best possible for that
combination of the two objectives.
I For these optimal solutions, the only way to improve one objective is to
worsen the other.
I Genetic algorithms can handle this problem with little modification: We
already evaluate a whole population, so we can use this to our advantage.
I Alternatively, we could use gradient-based optimization with one of two
strategies:
I Use a composite weighted function,
f = αf1 + (1 − α)f2
and do a sweep in α, performing an optimization for each value

I Solve the problem
minimize f1
subject to f2 = fc
for different values of fc

I The choice of a genetic algorithm vs. gradient-based depends on the number
of design variables and the required precision in the result.
I The concept of dominance is the key to the use of GA’s in multi-objective

optimization.
I Assume we have a population of 3 members, A, B and C, and that we want
to minimize two objective functions, f1 and f2 .
Member f1 f2
A 10 12
B 8 13
C 9 14

I Comparing members A and B, we can see that A has a higher (worse) f1 than
B, but has a lower (better) f2 . Hence we cannot determine whether A is
better than B or vice versa.
I On the other hand, B is clearly a fitter member than C since both of B’s
objectives are lower. We say that B dominates C.
I Comparing A and C, once again we are unable to say that one is better than
the other.
I In summary:
I A is non-dominated by either B or C
I B is non-dominated by either A or C
I C is dominated by B but not by A
I The rank of a member is the number of members that dominate it plus one.
In this case the ranks of the three members are:
rank(A) = 1
rank(B) = 1
rank(C) = 2

I In multi-objective optimization the rank is crucial in determining which
population member are the fittest.
I A solution of rank one is said to be Pareto optimal and the set of rank one
points for a given generation is called the Pareto set.
I As the number of generations increases, and the fitness of the population
improves, the size of the Pareto set grows.
I In the case above, the Pareto set includes A and B. The graphical
representation of a Pareto set is called a Pareto front.
I The procedure of a two-objective genetic algorithm is similar to the
single-objective one, with the following modifications:
I Instead of making decisions based on the objective function, we make
decisions based on rank (the lower the better)
I Instead of keeping track of the best member of population, we keep track of
all members with rank one, which should converge to the Pareto set
I One of the problems with this method is that there is no mechanism
“pushing” the Pareto front to a better one.

Example: Pareto Front in Aircraft Design

Coding and Decoding of Variables

There are two main variants in genetic algorithms:
I Bit GAs: represent the design variables with bits.
I Real GAs: keep the design variables as real numbers.
I A bit GA represents each variable as a binary number.

I Suppose we have m bits available for each number.
I To represent a real-valued variable, we have to divide the feasible interval of
xi into 2m − 1 intervals.
I Then each possibility for xi can be represented by any combination of m bits.
I For m = 5, for example, the number of intervals would be 31 and a possible
representation for xi would be 10101, which can be decoded as

xi = xl + si 1 × 24 + 0 × 23 + 1 × 22 + 0 × 21 + 1 × 20 = xl + 21si ,
where si is the size of interval for xi given by

xu i − xli
si = .
31
Creation of the Initial Population

I As a rule of thumb, the population size should be of 15 to 20 times the
number of design variables.
I Using bit encoding, each bit is assigned a 50% chance of being either 1 or 0.
One way of doing this is to generate a random number 0 ≤ r ≤ 1 and setting
the bit to 0 if r ≤ 0.5 and 1 if r > 0.5.
I Each member is chosen at random. For a problem with real design variables
and a given variable x such that xl ≤ x ≤ xu , we could use,
x = xl + r(xu − xl )
where r is a random number such that 0 ≤ r ≤ 1.

Selection: Determining the Mating Pool 1

I Here we assume that we want to maximize f (x).
I Consider the highest (best) and the lowest (worst) values, fh and fl ,
respectively.
I The function values can be converted to a positive quantity by adding,
C = 0.1fh − 1.1fl
to each function value. Thus the new highest value will be 1.1(fh − fl ) and
the new lowest value 0.1(fh − fl ). The values are then normalized as follows,
fi + C
fi0 =
D
where
D = max(1, fh + C).

Selection: Determining the Mating Pool 2

After the fitness values are scaled, they are summed,
N
X
S= fi0
i=1
where N is the number of members in the population.

I We now use roulette wheel selection to make copies of the fittest members
for reproduction.
I A mating pool of N members is created by turning the roulette wheel N
times.
I A random number 0 ≤ r ≤ 1 is generated at each turn. The j th member is
copied to the mating pool if
f10 + . . . + fj−1
0
≤ rS ≤ f10 + . . . + fj0
This ensures that the probability of a member being selected for reproduction
is proportional to its scaled fitness value.

Crossover Operation — Real GA

I Various crossover strategies are possible in genetic algorithms.
I The following crossover strategy is one devised specifically for optimization
problems with real-valued design variables.
I Each member of the population corresponds to a point in n-space, that is, a
vector x
I Let two members of the population that have been mated (parents) be
ordered such that fp1 < fp2 . Two offspring are to be generated:
1. The midpoint between the two parents:
1
xc1 = (xp1 + xp2 )
2
2. A point extrapolated in the direction defined by the two parents beyond the
better parents:
xc2 = 2xp1 + xp2
I Then the tournament is performed by selecting the best parent (xp1 ) and
either the second parent or the best offspring, whichever is the best one of the
two.

Crossover Operation — Bit GA

When the information is stored as bits, the crossover operation involves the steps:
1. Generate a random integer 1 ≤ k ≤ m − 1 that defines the crossover point.
2. For one of the offspring, the first k bits are taken from parent 1 and the
remaining bits from parent 2.
3. For the second offspring, the first k bits are taken from parent 2 and the
remaining ones from parent 1.
Before Crossover After Crossover

11 111 11 000
00 000 00 111

Mutation
I Mutation is a random operation performed to change the genetic information.
I Mutation is needed because even though reproduction and crossover
effectively recombine existing information, occasionally some useful genetic
information might be lost.
I The mutation operation protects against such irrecoverable loss.
I It also introduces additional diversity into the population.
I When using bit representation, every bit is assigned a small permutation
probability, say p = 0.005 ∼ 0.1. This is done by generating a random
number 0 ≤ r ≤ 1 for each bit, which is changed if r < p.
Before Mutation After Mutation
11111 11010
I The mutation of the real representation can be done in a variety of way. A
simple way involves generating a small probability that each design variable
changes by a random amount (within certain bounds). Another more
sophisticated alternative consists in using a probability density function.

Why do genetic algorithms work?

A fundamental question which is still being researched is how the three main
operations (selection, crossover and mutation) are able to find better solutions.
Two main mechanism allow the algorithm to progress towards better solutions:
I Selection + Mutation = Improvement: Mutation makes local changes while
selection accepts better changes, this can be seen as a resilient and general
form of reducing the objective function.
I Selection + Crossover = Innovation: When the information of the best
population members is exchanged, there is a greater chance a new better
combination will emerge.

Jet Engine Design at General Electric 1

Jet Engine Design at General Electric 2

I Genetic algorithm combined with expert system
I Find the most efficient shape for the fan blades in the GE90 jet engines
I 100 design variables
I Found 2% increase in efficiency as compared to previous engines
I Allowed the elimination of one stage of the engine’s compressor reducing
engine weight and manufacturing costs without any sacrifice in performance

ST5 Antenna 1
I The antenna for the ST5 satellite system presented a challenging design
problem, requiring both a wide beam width for a circularly-polarized wave
and a wide bandwidth.
I Two teams were assigned the same design problem: one used a traditional
method, and the other used GAs.
I The GA team found an antenna configuration (ST5-3-10) that was slightly
more difficult to manufacture, but it:
I Used less power
I Removed two steps in design and fabrication

ST5 Antenna 2
I Had more uniform coverage and wider range of operational elevation angle
relative to the ground changes
I Took 3 person-months to design and fabricate the first prototype as compared
to 5 person-months for the conventionally designed antenna.

Gradient-Free Optimization Particle Swarm Optimization
Particle Swarm Optimization (PSO) 1

I PSO is a stochastic, population-based computer algorithm developed in 1995
by James Kennedy (social-psychologist) and Russell Eberhart (electrical
engineer)
I PSO applies the concept of “swarm intelligence” to problem solving.
I “Swarm intelligence” is the property of a system whereby the collective
behaviors of (unsophisticated) agents interacting locally with their
environment cause coherent functional global patterns to emerge (e.g.
self-organization, emergent behavior).
I In other words: Dumb agents, properly connected into a swarm, yield
“smart” results.
I The basic idea of the PSO algorithm is:
I Each agent (or particle) represents a design point and moves in n-dimensional
space looking for the best solution.
I Each agent adjusts its movement according to the effects of “cognitivism”
(self experience) and “sociocognition” (social interaction).

Particle Swarm Optimization (PSO) 2

I The update of particle i’s position is given by:
xik+1 = xik + vk+1

i
∆t
where the velocity of the particle is given by

i i pik − xik pgk − xik
vk+1 = wvk + c1 r1 + c2 r2
∆t ∆t
I r1 and r2 are random numbers in the interval [0, 1]

I pik is particle i’s best position so far, pgk is the swarm’s best particle position at
iteration k
I c1 is the cognitive parameter (confidence in itself), c2 is the social parameter
(confidence in the swarm)
I w is the inertia

How the swarm is updated 1
xki vki

wvki
xki vki
Inertia

pki
Cognitive Learning
wvki
xki vki
Inertia

pki
xki +1 Cognitive Learning

vki +1
c1r1 (pki − xki )
i
wv k
xki vki
Inertia

pki
xki +1 Cognitive Learning

Social Learning vki +1
pkg c1r1 (pki − xki )
i
wv k
xki vki
Inertia

PSO Algorithm
1. Initialize a set of particles positions xio and velocities voi randomly distributed
throughout the design space bounded by specified limits

2. Evaluate the objective function values f xik using the design space positions
xik
3. Update the best particle position pik at current iteration (k) and best particle
position in the complete history pgk
4. Update the position of each particle using its previous position and updated
velocity vector.
5. Repeat steps 2–4 until the stopping criteria is met.

PSO Characteristics
Compared to other global optimization approaches:
I Simple algorithm, extremely easy to implement.
I Still a population based algorithm, however it works well with few particles
(10 to 40 are usual) and there is not such thing as “generations”
I Unlike evolutionary approaches, design variables are directly updated, there
are no chromosomes, survival of the fittest, selection or crossover operations.
I Global and local search behavior can be directly “adjusted” as desired using
the cognitive c1 and social c2 parameters.
I Convergence “balance” is achieved through the inertial weight factor w

Analysis of PSO 1
I If we replace the velocity update equation into the position update the
following expression is obtained:
!
i i i pik − xik pgk − xik
xk+1 = xk + wvk + c1 r1 + c2 r2 ∆t
∆t ∆t
I Factorizing the cognitive and social terms:

c1 r1 pik + c2 r2 pgk
xik+1 = xik + wvki ∆t + (c1 r1 + c2 r2 ) i
− xk
| {z } | {z } c1 r1 + c2 r2
x̂ik αk | {z }
p̂k
I So the behavior of each particle can be viewed as a line-search dependent on

a stochastic step size and search direction.

Analysis of PSO 2
I Re-arranging the position and velocity term in the above equation we have:
xik+1 = xik (1 − c1 r1 − c2 r2 ) + wVki ∆t + c1 r1 pik + c2 r2 pgk

pi pg
i
vk+1 = −xik (c1 r1∆t
+c2 r2 )
+ wVki + c1 r1 ∆tk + c2 r2 ∆tk
I . . . which can be combined and written in a matrix form as:

i i i
xk+1 1 − c1 r1 − c2 r2 w∆t xk c1 r1 c2 r2 pk
= + c1 r1 c2 r2
i
Vk+1 − (c1 r1 +c2 r2 )
∆t w Vk
i
∆t ∆t pgk
where the above representation can be seen as a representation of a

discrete-dynamic system from which we can find stability criteria.
I Assuming constant external inputs, the system reduces to:
i i
0 − (c1 r1 + c2 r2 ) w∆t xk c1 r1 c2 r2 pk
= + c1 r1 c2 r2
0 − (c1 r1∆t
+c2 r2 )
w−1 Vki ∆t ∆t pgk
where the above is true only when Vki = 0 and xik = pik = pgk (equilibrium
point).
Analysis of PSO 3
I The eigenvalues of the dynamic system are:
λ2 − (w − c1 r1 − c2 r2 + 1) λ + w = 0
I Hence, the stability in the PSO dynamic system is guaranteed if

|λi=1,...,n | < 1, which leads to:
0 < (c1 + c2 ) < 4

(c1 +c2 )
2 −1<w <1

Effect of varying c1 and c2

Effect of varying the inertia

PSO Issues and Improvements

Several issues with PSO have been identified:
I Inertia weight updates can be problematic
I Original PSO does not handle constraints

Updating the inertia weight 1

I As k → ∞ particles “cluster” towards the “global” optimum.
I Fixed inertia makes the particles to overshoot the best regions (too much
momentum).
I A better way of controlling the global search is to dynamically update the
inertia weight.

Updating the inertia weight 2

14000
13000 Fixed Inertia Weight

Linear Varying Inertia Weight
12000 Dynamic Varying Inertia Weight
11000
Structure Weight [lbs]
10000
9000
8000
7000
6000
5000
4000
0 50 100 150 200
Iteration

Violated design points redirection 1

We can restrict the velocity vector of a constraint violated particle to a usable
feasible direction:

i pik − xik pgk − xik
vk+1 = c1 r1 + c2 r2
∆t ∆t

Violated design points redirection 2

Constraint Handling 1
The basic PSO algorithm is an unconstrained optimizer, to include constraints we
can use:
I Penalty methods
I Augmented Lagrangian function
I Recall the Lagrangian function:

m
X
Li xik , λi = f xik + λij gj xik
j=1
I The augmented Lagrangian function is:

m m
X X
Li xik , λi , rpi = f xik + λij θj xik + rp,j θj2 xik
j=1 j=1
where:
−λj
θj xik i
= max gj xk ,
2rp,i
Constraint Handling 2
I Multipliers and penalty factors that lead to the optimum are unknown and
problem dependent.
I A sequence of unconstrained minimizations of the augmented Lagrangian
function are required to obtain a solution.
I Multiplier update
λij v+1 = λij v + 2 rp,j |v θj xik
I Penalty factor update (penalizes infeasible movements):

 2 rp,j |v if gj xiv > gj xiv−1 ∧ gj xiv > εg
rp,j |v+1 = 1
rp,j v if gj xiv ≤ εg
 2
rp,j |v otherwise

Augmented Lagrangian PSO Algorithm

1. Initialize a set of particles positions xio and velocities voi randomly distributed
throughout the design space bounded by specified limits.

2. Initialize the Lagrange multipliers and penalty factors, e.g. λij 0 = 0,
rp,j |0 = r0 .
3. Evaluate the objective function values using the initial design space positions.
4. Solve the unconstrained optimization problem (the augmented Lagrange
multiplier equation) using the basic PSO algorithm for kmax iterations.
5. Update the Lagrange multipliers and penalty factors.
6. Repeat steps 4–5 until a stopping criterion is met.

Example: Minimizing the Griewank Function

So how do the different gradient-free methods compare? A simple (but
challenging!) numerical example is the Griewank function for n = 100,
Xn Yn
x2i xi
lf (x) = − cos √ + 1
i=1
4000 i i=1
−600 ≤ xi ≤ 600
Optimizer Evaluations Global optimum? Objective CPU time (s)

PSO (pop 40) 12,001 Yes 6.33e-07 15.9
GA (pop 250) 51,000 No 86.84 86.8438
DIRECT 649,522 Yes 1.47271e-011 321.57

Example: Gradient-based vs. Gradient-Free


1. Introduction

7.1 Introduction
7.2 Multidisciplinary Analysis
7.3 Extended Design Structure Matrix
7.4 Monolithic Architectures
7.5 Distributed Architectures
7.6 Computing Coupled Derivatives
Multidisciplinary Design Optimization Introduction
Introduction 1
I In the last few decades, numerical models that predict the performance of
engineering systems have been developed, and many of these models are now
mature areas of research. For example . . .
I Once engineers can predict the effect that changes in the design have on the
performance of a system, the next logical question is what changes in the
design produced optimal performance. The application of the numerical
optimization techniques described in the preceding chapters address this
question.
I Single-discipline optimization is in some cases quite mature, but the design
and optimization of systems that involve more than one discipline is still in its
infancy.
I When systems are composed of multiple systems, additional issues arise in
both the analysis and design optimization.
I MDO researchers think industry will not adopt MDO more widely because
they do not realize their utility.

Introduction 2
I Industry think that researchers are not presenting anything new, since
industry has already been doing multidisciplinary design.
I There is some truth to each of these perspectives . . .
I Real-world aerospace design problem may involve thousands of variables and
hundreds of analyses and engineers, and it is often difficult to apply the
numerical optimization techniques and solve the mathematically correct
optimization problems.
I The kinds of problems in industry are often of much larger scale, involve
much uncertainty, and include human decisions in the loop, making them
difficult to solve with traditional numerical optimization techniques.
I On the other hand, a better understanding of MDO by engineers in industry
is now contributing a more widespread use in practical design.
Why MDO?
I Parametric trade studies are subject to the “curse of dimensionality”.
I Iterated procedures for which convergence is not guaranteed.
I Sequential optimization that does not lead to the true optimum of the system
Introduction 3
Objectives of MDO:
I Avoid difficulties associated with sequential design or partial optimization.
I Provide more efficient and robust convergence than by simple iteration.
I Aid in the management of the design process.
Difficulties of MDO:
I Communication and translation
I Time
I Scheduling and planning
I Implementation

Typical Aircraft Company Organization
Personnel hierarchy
Design process

MDO Architectures
I MDO focuses on the development of strategies that use numerical analyses
and optimization techniques to enable the automation of the design process
of a multidisciplinary system.
I The big challenge: make such a strategy scalable and practical.
I An MDO architecture is a particular strategy for organizing the analysis
software, optimization software, and optimization subproblem statements to
achieve an optimal design.
I Other terms are used: “method”, “methodology”, “problem formulation”,
“strategy”, “procedure” and “algorithm”.

Nomenclature and Mathematical Notation 1
Symbol Definition
x Vector of design variables
yt Vector of coupling variable targets (inputs to a discipline analysis)
y Vector of coupling variable responses (outputs from a discipline analysis)
ȳ Vector of state variables (variables used inside only one discipline analysis
f Objective function
c Vector of design constraints
cc Vector of consistency constraints
R Governing equations of a discipline analysis in residual form
N Number of disciplines
n() Length of given variable vector
m() Length of given constraint vector
()0 Functions or variables that are shared by more than one discipline
()i Functions or variables that apply only to discipline i
()∗ Functions or variables at their optimal value
˜
() Approximation of a given function or vector of functions
ˆ
() Duplicates of certain variable sets distributed to other disciplines
Nomenclature and Mathematical Notation 2

I In MDO, we make the distinction between:
I Local design variables xi — directly affect only one discipline
I Shared design variables x0 — directly affect more than one discipline.
T
I Full vector of design variables x = xT0 , xT1 , . . . , xTN
I A discipline analysis solves a system of equations that computes the state
variables. Examples?
I In many formulations, independent copies of the coupling variables must be
made to allow discipline analyses to run independently and in parallel.
I These copies are also known as target variables, which we denote by a
superscript t.
I To preserve consistency between the coupling variable inputs and outputs at
the optimal solution, we define consistency constraints
cci = yit − yi
which we add to the optimization problem formulation.

Example: Aerostructural Problem Definition 1

I Common example used throughout this chapter to illustrate the notation and
MDO architectures.
I Suppose we want to design the wing of a business jet using low-fidelity
analysis tools.
I Model the aerodynamics using a panel method
I Model the structure as a single beam using finite elements


Wi=15961.3619lbs Ws=10442.5896lbs α=2.3673o Λ=30o CL=0.13225 CD=0.014797 L/D=8.9376
15
10
x (ft)
5
−30 −20 −10 0 10 20 30

y (ft)
z (ft)
1
0.5
0
30
20
10
0 15
10
−10
5
−20 0
y (ft) −30
x (ft)
I Aerodynamic inputs: angle-of-attack (α), wing twist distribution (γi )

I Aerodynamic outputs: lift (L) and the induced drag (D).


I Structural inputs: thicknesses of the beam (ti )
I Structural output: beam weight, which is added to a fixed weight to obtain
the total weight (W ), and the maximum stresses in each finite-element (σi ).
I In this example, we want to maximize the range of the aircraft, as given by
the Breguet range equation,

V L Wi
f = Range = ln .
c D Wf
I The multidisciplinary analysis consists in the simultaneous solution of the
following equations:
R1 = 0 ⇒ AΓ − v(u, α) = 0
R2 = 0 ⇒ Ku − F (Γ) = 0
R3 = 0 ⇒ L(Γ) − W = 0


I The complete state vector is
   
y1 Γ
y = y2  =  u  .
y3 α
I The angle of attack is considered a state variable here, and helps satisfy
L = W.
I The design variables are the the wing sweep (Λ), structural thicknesses (t)
and twist distribution (γ).
x0 = Λ

t
x= ,
γ
I Sweep is a shared variable because changing the sweep has a direct effect on
both the aerodynamic influence matrix and the stiffness matrix.


I The other two sets of design variables are local to the structures and
aerodynamics, respectively.
I In later examples, we will see the options we have to optimize the wing in
this example.

Multidisciplinary Design Optimization Multidisciplinary Analysis
Multidisciplinary Analysis 1
I To find the coupled state of a multidisciplinary system we need to perform a
multidisciplinary analysis — MDA.
I This is often done by repeating each disciplinary analysis until yit = yir for all
is.
Input: Design variables x

Output: Coupling variables, y
0: Initiate MDA iteration loop
repeat
1: Evaluate Analysis 1 and update y1 (y2 , y3 )
until 4 → 1: MDA has converged

Multidisciplinary Design Optimization Multidisciplinary Analysis
Multidisciplinary Analysis 2
I The design structure matrix (DSM) was originally developed to visualize the
interconnections between the various components of a system.
A B C D E F G H I J K L M N O A L H O D M E G N C B K I F J
Optimization A A Optimization A A
Aerodynamics B B Mission L L
Atmosphere C C Performance H H
Economics D D System O O
Emissions E E Economics D D
Loads F F Reliability M M
Noise G G Emissions E E
Performance H H Noise G G
Sizing I I Propulsion N N
Weight J J Atmosphere C C
Structures K K Aerodynamics B B
Mission L L Structures K K
Reliability M M Sizing I I
Propulsion N N Loads F F
System O O Weight J J
Original ordering Improved ordering

I Fixed-point iteration, such as the Gauss–Seidel algorithm above converge
slowly and sometimes do not converge at all.
I One way to improve the disciplines, is to reorder the sequence and possibly
do some inner loops for more coupled clusters.

Multidisciplinary Design Optimization Extended Design Structure Matrix
Extended Design Structure Matrix (XDSM) Diagrams 1

I DSMs are somewhat ambiguous as to what the connections are: data or
process flow?
I Numerous diagrams can be found in the literature that describe MDO
architectures and other computational procedures.
I We wanted to develop a new diagram type of diagram that could:
I Show both process flow and data dependencies in the same diagram
I Show complex procedures with multiple loops and parallel processes in a
compact manner
I The results was the extended design structure matrix, or XDSM
I We will use XDSM throughout this chapter to explain all the MDO
architectures

Block Gauss–Seidel Iteration
yt x0 , x1 x0 , x2 x0 , x3
0,4→1:
(no data) 1 : y2t , y3t 2 : y3t
MDA
y1 1:
4 : y1 2 : y1 3 : y1
Analysis 1
y2 2:
4 : y2 3 : y2
Analysis 2
y3 3:
4 : y3
Analysis 3

x(0)
0,2→1:
x∗ Optimization
1:x 1:x 1:x
1:
2:f
Objective
1:
2:c
Constraints
1:
2 : df / dx, dc/ dx
Gradients

Example: Aerostructural Optimization — Sequential

Design vs. MDO 1
I One commonly used approach to design is to perform a sequential
“optimization” approach, which consists in optimizing each discipline in
sequence:
1. For example, we could start by optimizing the aerodynamics,
minimize D (α, γi )
w.r.t. α, γi
s.t. L (α, γi ) = W
2. Once the aerodynamic optimization has converged, the twist distribution and
the forces are fixed
3. Then we optimize the structure by minimizing weight subject to stress
constraints at the maneuver condition, i.e.,
minimize W (ti )
w.r.t. ti
s.t. σj (ti ) ≤ σyield


Design vs. MDO 2
4. Repeat until this sequence has converged.
0 , t0
0
⇤
8
, t⇤ Iterator
1,3
7!1
1
Optimization
2,4
3!2
2
3
L/D Aerodynamics F
4
7
t Optimization t
5
6!5
5
6
u W, y Structures


Design vs. MDO 3
I The MDO procedure differs from the sequential approach in that it considers
all variables simultaneously
minimize Range (α, γi , ti )

w.r.t. α, γi , ti
s.t. σyield − σj (ti ) ≥ 0
L (α, γi ) − W = 0


Design vs. MDO 4
0 , t0 w 0 , u0
0
7
⇤
, t⇤ Optimization 5 : ,t 2: 3:t
1
6!1
6
6 : R, y Functions
5
1
5
MDA 2:u
2
4!2
2
5:w 4:w Aerodynamics 3:w

3
3
4
5:u 4:u Structures


Design vs. MDO 5
Sequential MDF AS
10
Twist (degrees)
5
−5
Jigtwist
−10 Deflected
0 5 10 15 20
0.06
Thickness (m)
0.05
0.04
0.03
0.02
0 5 10 15 20
4
x 10
5
Elliptical
4
Lift (N)
1
0 5 10 15 20
0.25
Multidisciplinary Design Optimization Monolithic Architectures
Monolithic Architectures
I Monolithic architectures solve the MDO problem by casting it as single
optimization problem.
I Distributed architectures, on the other hand, decompose the overall problem
into smaller ones.
I Monolithic architectures include:
I Multidisciplinary Feasible — MDF
I Individual Discipline Feasible — IDF
I Simultaneous Analysis and Design — SAND
I All-At-Once — AAO

Multidisciplinary Feasible (MDF) 1

I The MDF architecture is the most intuitive for engineers.
I The optimization problem formulation is identical to the single discipline
case, except the disciplinary analysis is replace by an MDA
minimize f0 (x, y (x, y))

with respect to x
subject to c0 (x, y (x, y)) ≥ 0
ci (x0 , xi , yi (x0 , xi , yj6=i )) ≥ 0 for i = 1, . . . , N.


x(0) y t,(0)
0, 7→1:
x∗ Optimization
2 : x0 , x1 3 : x0 , x2 4 : x0 , x3 6:x
1, 5→2:
2 : y2t , y3t 3 : y3t
MDA
2:
y1∗ 5 : y1 3 : y1 4 : y1 6 : y1
Analysis 1
3:
y2∗ 5 : y2 4 : y2 6 : y2
Analysis 2
4:
y3∗ 5 : y3 6 : y3
Analysis 3
6:
7 : f, c
Functions


I Advantages:
I Optimization problem is as small as it can be for a monolithic architecture
I Always returns a system design that satisfies the consistency constraints, even
if the optimization process is terminated early — good from the practical
engineering point of view
I Disadvantages:
I Intermediate results do not necessarily satisfy the optimization constraints
I Developing the MDA procedure might be time consuming, if not already in
place
I Gradients of the coupled system more challenging to compute (more in later
section)

Example: Aerostructural Optimization with MDF
minimize −R
w.r.t. Λ, γ, t
s.t. σyield − σi (u) ≥ 0
where the aerostructural analysis is as before:
AΓ − v(u, α) = 0
K(t, Λ)u − F (Γ) = 0
L(Γ) − W (t) = 0

Individual Discipline Feasible (IDF) 1

The IDF architecture decouples the MDA, adding consistency constraints, and
giving the optimizer control of the coupling variables.

minimize f0 x, y x, y t
with respect to x, y t

subject to c0 x, y x, y t ≥0

ci x0 , xi , yi x0 , xi , yj6t =i ≥0 for i = 1, . . . , N

cci = yit − yi x0 , xi , yj6t =i =0 for i = 1, . . . , N.
I Advantages:
I Optimizer typically converges the multidisciplinary feasibility better than
fixed-point MDA iterations
I Disadvantages:
I Problem is potentially much larger than MDF, depending on the number of
coupling variables
I Gradient computation can be costly

Individual Discipline Feasible (IDF) 2

I The large problem size can be mitigated to some extent by careful selection
of the disciplinary variable partitions or aggregation of the coupling variables
to reduce information transfer between disciplines.
x(0) , y t,(0)
0,3→1:
x∗ 1 : x0 , xi , yj6t =i 2 : x, y t
Optimization
1:
yi∗ 2 : yi
Analysis i
2:
3 : f, c, cc
Functions

Example: Aerostructural Optimization Using IDF
minimize −R
w.r.t. Λ, γ, t, Γt , αt , ut
s.t. σyield − σi ≥ 0
Γt − Γ = 0
αt − α = 0
ut − u = 0

Simultaneous Analysis and Design (SAND) 1

I SAND makes no distinction between disciplines, and can also be applied to
single discipline problems.
I The governing equations are constraints at the optimizer level.
minimize f0 (x, y)
with respect to x, y, ȳ
subject to c0 (x, y) ≥ 0
ci (x0 , xi , yi ) ≥ 0 for i = 1, . . . , N
Ri (x0 , xi , y, ȳi ) = 0 for i = 1, . . . , N.
I Advantages:
I If implemented well, can be the most efficient architecture
I Disadvantages:
I Intermediate results do not even satisfy the governing equations
I Difficult or impossible to implement for “black-box” components

Simultaneous Analysis and Design (SAND) 2

x(0) , y (0) , ȳ (0)
0,2→1:
x∗ , y ∗ 1 : x, y 1 : x0 , xi , y, ȳi
Optimization
1:
2 : f, c
Functions
1:
2 : Ri
Residual i

Aerostructural Optimization Using SAND 1
minimize −R
w.r.t. Λ, γ, t, Γ, α, u
s.t. σyield − σi (u) ≥ 0
AΓ = v(u, α)
K(t)u = f (Γ)
L(Γ) − W (t) = 0

The All-at-Once (AAO) Problem Statement 1

I AAO is not strictly an architecture, as it is not practical to solve a problem of
this form: the consistency constraints are linear and can be eliminated,
leading to SAND.
I Some inconsistency in the name, in the literature
I We present AAO for completeness, and to relate this to the other monolithic
architectures.
N
X
minimize f0 (x, y) + fi (x0 , xi , yi )
i=1
with respect to x, y t , y, ȳ
subject to c0 (x, y) ≥ 0
ci (x0 , xi , yi ) ≥ 0 for i = 1, . . . , N
cci = yit − yi = 0 for i = 1, . . . , N

Ri x0 , xi , yj6t =i , ȳi , yi =0 for i = 1, . . . , N.


I As we can see, it includes all the constraints that other monolithic
architectures eliminated.
x(0) , y t,(0) , y (0) , ȳ (0)
0, 2→1:
x∗ , y ∗ 1 : x, y, y t 1 : x0 , xi , yi , yj6t =i , ȳi
Optimization
1:
2 : f, c, cc
Functions
1:
2 : Ri
Residual i

Monolithic
Remove cc , y t
AAO SAND
Remove Remove
R, y, ȳ R, y, ȳ
Remove cc , y t
IDF MDF

Multidisciplinary Design Optimization Distributed Architectures
Distributed Architectures
I Monolithic MDO architectures solve a single optimization problem
I Distributed MDO architectures decompose the original problem into multiple
optimization problems
I Some problems have a special structure and can be efficiently decomposed,
but that is usually not the case
I In reality, the primary motivation for decomposing the MDO problem comes
from the structure of the engineering design environment
I Typical industrial practice involves breaking up the design of a large system
and distributing aspects of that design to specific engineering groups.
I These groups may be geographically distributed and may only communicate
infrequently.
I In addition, these groups typically like to retain control of their own design
procedures and make use of in-house expertise

Classification of MDO Architectures
Monolithic
AAO SAND
IDF MDF
Distributed IDF
Penalty Multilevel Distributed MDF
ECO ATC BLISS-2000 QSD CSSO MDOIS
IPD/EPD CO BLISS ASO

Concurrent Subspace Optimization (CSSO) 1

The CSSO system subproblem is given by
minimize f0 (x, ỹ (x, ỹ))

with respect to x
subject to c0 (x, ỹ (x, ỹ)) ≥ 0
ci (x0 , xi , ỹi (x0 , xi , ỹj6=i )) ≥ 0 for i = 1, . . . , N
and the discipline i subproblem is given by
minimize f0 (x, yi (xi , ỹj6=i ) , ỹj6=i )

with respect to x0 , xi
subject to c0 (x, ỹ (x, ỹ)) ≥ 0
ci (x0 , xi , yi (x0 , xi , ỹj6=i )) ≥ 0
cj (x0 , ỹj (x0 , ỹ)) ≥ 0 for j = 1, . . . , N j 6= i.

Concurrent Subspace Optimization (CSSO) 2

x(0) x(0) , y t,(0) x(0) x(0)
0,25→1:
(no data) Convergence
Check
1,6→2:
2 : yt 5 : x0 , xi 3 : x0 , xi
Initial DOE
13,18→14:
t 17 : x0 , xi 15 : x0 , xi
Discipline 14 : y
DOE
2,4→3,14,16→15: 3, 15 : yj6t =i
Exact MDA
19,24→20
x∗ 24 : x 1:x System 23 : x 7:x 21 : x
Optimization
11,23:
24 : f, c 12 : f, c
All Functions
20,22→21:
Metamodel 21 : ỹj6t =i
MDA
7,12→8:
13 : x 11 : x 9 : x0 , xj6=i 9 : x0 , xi
Optimization i
8,10→9: 9 : yj6t =i
9 : yt
Local MDA i
5,9,17,21:
11 : ỹj6=i
yi∗ 1 : ỹi 13 : ỹj6=i 22 : ỹi 10 : ỹj6=i Analysis i
23 : ỹi
Metamodel
3,9,15:
13 : yi 3, 15 : yi 11 : yi 10 : yi 5, 17 : yi
Analysis i

CSSO Algorithm
Input: Initial design variables x
Output: Optimal variables x∗ , objective function f ∗ , and constraint values c∗
0: Initiate main CSSO iteration
repeat
1: Initiate a design of experiments (DOE) to generate design points
for Each DOE point do
2: Initiate an MDA that uses exact disciplinary information
repeat
3: Evaluate discipline analyses
4: Update coupling variables y
5: Update the disciplinary surrogate models with the latest design
end for 6 → 2
7: Initiate independent disciplinary optimizations (in parallel)
for Each discipline i do
repeat
8: Initiate an MDA with exact coupling variables for discipline i and
approximate coupling variables for the other disciplines
repeat
9: Evaluate discipline i outputs yi , and surrogate models for the
other disciplines, ỹj6=i
11: Compute objective f0 and constraint functions c using current
data
until 12 → 8: Disciplinary optimization i has converged
end for
13: Initiate a DOE that uses the subproblem solutions as sample points
for Each subproblem solution i do
14: Initiate an MDA that uses exact disciplinary information
repeat
15: Evaluate discipline analyses.
until 16 → 15 MDA has converged
17: Update the disciplinary surrogate models with the newest design
end for 18 → 14
19: Initiate system-level optimization
repeat
20: Initiate an MDA that uses only surrogate model information
repeat
21: Evaluate disciplinary surrogate models
23: Compute objective f0 , and constraint function values c
until 24 → 20: System level problem has converged
until 25 → 1: CSSO has converged

Collaborative Optimization (CO) 1

The CO2 system subproblem is given by:

minimize f0 x0 , x̂1 , . . . , x̂N , y t
with respect to x0 , x̂1 , . . . , x̂N , y t

subject to c0 x0 , x̂1 , . . . , x̂N , y t ≥ 0
Ji∗ = ||x̂0i − x0 ||22 + ||x̂i − xi ||22 +

||yit − yi x̂0i , xi , yj6t =i ||22 = 0 for i = 1, . . . , N
where x̂0i are duplicates of the global design variables passed to (and manipulated
by) discipline i and x̂i are duplicates of the local design variables passed to the
system subproblem.
The discipline i subproblem in both CO1 and CO2 is

minimize Ji x̂0i , xi , yi x̂0i , xi , yj6t =i
with respect to x̂0i , xi

subject to ci x̂0i , xi , yi x̂0i , xi , yj6t =i ≥ 0.

Collaborative Optimization (CO) 2

(0) (0) (0) (0)
x0 , x̂1···N , y t,(0) x̂0i , xi
0, 2→1:
x∗0 System 1 : x0 , x̂1···N , y t 1.1 : yj6t =i 1.2 : x0 , x̂i , y t
Optimization
1:
2 : f0 , c0 System
Functions
1.0, 1.3→1.1:
x∗i 1.1 : x̂0i , xi 1.2 : x̂0i , xi
Optimization i
1.1:
yi∗ 1.2 : yi
Analysis i
1.2:
2 : Ji∗ 1.3 : fi , ci , Ji Discipline i
Functions

CO Algorithm 1

0: Initiate system optimization iteration
repeat
1: Compute system subproblem objectives and constraints
for Each discipline i (in parallel) do
1.0: Initiate disciplinary subproblem optimization
repeat
1.1: Evaluate disciplinary analysis
1.2: Compute disciplinary subproblem objective and constraints
1.3: Compute new disciplinary subproblem design point and Ji
until 1.3 → 1.1: Optimization i has converged
end for
2: Compute a new system subproblem design point
until 2 → 1: System optimization has converged

Aerostructural Optimization Using CO 1

System-level problem:
minimize −R
w.r.t. Λt , Γt , αt , ut , W t
s.t. J1∗ ≤ 10−6
J2∗ ≤ 10−6
Aerodynamics subproblem:
2 X 2 2
Λ Γi α 2 W
minimize J1 = 1− t + 1− t + 1− t + 1− t
Λ Γi α W
w.r.t. Λ, γ, α
s.t. L−W =0

Aerostructural Optimization Using CO 2

Structures subproblem:
2 X 2
Λ ui
minimize J2 = 1− + 1−
Λt uti
w.r.t. Λ, t
s.t. σyield − σi ≥ 0

Bilevel Integrated System Synthesis (BLISS) 1

The system level subproblem is formulated as
∗
df0
minimize (f0∗ )0 + ∆x0
dx0
with respect to ∆x0

dc∗0
subject to (c∗0 )0 + ∆x0 ≥ 0
dx0
∗
dci
(c∗i )0 + ∆x0 ≥ 0 for i = 1, . . . , N
dx0
∆x0L ≤ ∆x0 ≤ ∆x0U .


The discipline i subproblem is given by

df0
minimize (f0 )0 + ∆xi
dxi
with respect to ∆xi

dc0
subject to (c0 )0 + ∆xi ≥ 0
dxi

dci
(ci )0 + ∆xi ≥ 0
dxi
∆xiL ≤ ∆xi ≤ ∆xiU .
Note the extra set of constraints in both system and discipline subproblems
denoting the design variables bounds.


(0) (0)
x(0) y t,(0) x0 xi
0,11→1:
(no data) Convergence
Check
1,3→2: 6 : yj6t =i 6, 9 : yj6t =i 6 : yj6t =i 2, 5 : yj6t =i

MDA
8,10:
x∗0 11 : x0 System 6, 9 : x0 6, 9 : x0 9 : x0 6 : x0 2, 5 : x0
Optimization
4,7:
x∗i 11 : xi 6, 9 : xi 6, 9 : xi 9 : xi 6 : xi 2, 5 : xi
Optimization i
6,9:
10 : f0 , c0 7 : f0 , c0 System
Functions
6,9:
10 : fi , ci 7 : fi , ci Discipline i
Functions
9:
Shared
10 : df / dx0 , dc/ dx0
Variable
Derivatives
6:
Discipline i
7 : df0,i / dxi , dc0,i / dxi
Variable
Derivatives
2,5:
yi∗ 3 : yi 6, 9 : yi 6, 9 : yi 9 : yi 6 : yi
Analysis i

BLISS Algorithm
0: Initiate system optimization
repeat
1: Initiate MDA
repeat
2: Evaluate discipline analyses
3: Update coupling variables
4: Initiate parallel discipline optimizations
5: Evaluate discipline analysis
6: Compute objective and constraint function values and derivatives with
respect to local design variables
7: Compute the optimal solutions for the disciplinary subproblem
end for
9: Compute objective and constraint function values and derivatives with
respect to shared design variables using post-optimality analysis
10: Compute optimal solution to system subproblem

Analytical Target Cascading (ATC) 1

The ATC system subproblem is given by
N
X
minimize f0 x, y t + Φi x̂0i − x0 , yit − yi x0 , xi , y t +
i=1

Φ0 c0 x, y t
with respect to x0 , y t ,
where Φ0 is a penalty relaxation of the global design constraints and Φi is a

penalty relaxation of the discipline i consistency constraints. The ith discipline
subproblem is:

minimize f0 x̂0i , xi , yi x̂0i , xi , yj6t =i , yj6t =i + fi x̂0i , xi , yi x̂0i , xi , yj6t =i +

Φi yit − yi x̂0i , xi , yj6t =i , x̂0i − x0 +

Φ0 c0 x̂0i , xi , yi x̂0i , xi , yj6t =i , yj6t =i
with respect to x̂0i , xi

subject to ci x̂0i , xi , yi x̂0i , xi , yj6t =i ≥ 0.
Analytical Target Cascading (ATC) 2

(0) (0) (0)
w(0) x0 , y t,(0) x̂0i , xi
0,8→1:
(no data) 6:w 3 : wi
w update
5,7→6:
x∗0 System 6 : x0 , y t 3 : x0 , y t 2 : yj6t =i
Optimization
6:
System and
7 : f0 , Φ0···N
Penalty
Functions
1,4→2:
x∗i 6 : x̂0i , xi 3 : x̂0i , xi 2 : x̂0i , xi
Optimization i
3:
Discipline i
4 : fi , ci , Φ0 , Φi
and Penalty
Functions
2:
yi∗ 6 : yi 3 : yi
Analysis i

ATC Algorithm

0: Initiate main ATC iteration
repeat
1: Initiate discipline optimizer
repeat
2: Evaluate disciplinary analysis
3: Compute discipline objective and constraint functions and penalty
function values
4: Update discipline design variables
until 4 → 2: Discipline optimization has converged
end for
5: Initiate system optimizer
repeat
6: Compute system objective, constraints, and all penalty functions
7: Update system design variables and coupling targets.
8: Update penalty weights
until 8 → 1: Penalty weights are large enough

Asymmetric Subspace Optimization (ASO) 1

The system subproblem in ASO is
X
minimize f0 (x, y (x, y)) + fk (x0 , xk , yk (x0 , xk , yj6=k ))
k
with respect to x0 , xk
subject to c0 (x, y (x, y)) ≥ 0
ck (x0 , xk , yk (x0 , xk , yj6=k )) ≥ 0 for all k,
where subscript k denotes disciplinary information that remains outside of the

MDA. The disciplinary problem for discipline i, which is resolved inside the MDA,
is
minimize f0 (x, y (x, y)) + fi (x0 , xi , yi (x0 , xi , yj6=i ))
with respect to xi
subject to ci (x0 , xi , yi (x0 , xi , yj6=i )) ≥ 0.

Asymmetric Subspace Optimization (ASO) 2

(0) (0)
x0,1,2 y t,(0) x3
0,10→1:
x∗0,1,2 System 9 : x0,1,2 2 : x0 , x1 3 : x0 , x2 6 : x0,1,2 5 : x0
Optimization
9:
10 : f0,1,2 , c0,1,2 Discipline 0, 1,
and 2
Functions
1,8→2:
2 : y2t , y3t 3 : y3t
MDA
2:
y1∗ 9 : y1 8 : y1 3 : y1 6 : y1 5 : y1
Analysis 1
3:
y2∗ 9 : y2 8 : y2 6 : y2 5 : y2
Analysis 2
4,7→5:
x∗3 9 : x3 6 : x3 5 : x3
Optimization 3
6:
Discipline 0
7 : f0 , c0 , f3 , c3
and 3
Functions
5:
y3∗ 9 : y3 8 : y3 6 : y3
Analysis 3

ASO Algorithm

repeat
1: Initiate MDA
repeat
2: Evaluate Analysis 1
4: Initiate optimization of Discipline 3
repeat
6: Compute discipline 3 objectives and constraints
7: Update local design variables
until 7 → 5: Discipline 3 optimization has converged
8: Update coupling variables
until 8 → 2 MDA has converged
9: Compute objective and constraint function values for all disciplines 1 and
2
10: Update design variables

Example: A Framework for Automatic Implementation of

MDO 1

Example: A Framework for Automatic Implementation of

MDO 2
1 N
MDO Discipline Solver
MDF SAND IDF CO CSSO
1
1
1
0..* 1
RS
Optimization Optimizer

Multidisciplinary Design Optimization Computing Coupled Derivatives
Analytic Methods for Computing Coupled Derivatives 1

I We now extend the analytic methods derived in the derivatives chapter to
multidisciplinary systems.
I Each discipline is seen as one component.
I We apply the analytic equations and partition each of the matrices in the
blocks corresponding to each discipline.
I The partitioning is as follows,
R = [R1 , R2 ]T y = [y1 , y2 ]T
where we have assumed two disciplines as an example.


x
r1 r
r2
y1 y
y2
v = [v1 , . . . , vnx , . . . , v(nx +ny1 ) , . . . , v(nx +ny1 +ny2 ) , . . . , v(nx +2ny1 +ny2 ) , . . . , v(nx
| {z } | {z } | {z } | {z } |
x r1 r2 y1


I To derive the direct and adjoint versions of this approach within our
mathematical framework, we define the artificial residual functions
Ri = Yi − yi ,
where the yi vector contains the intermediate variables of the ith discipline,
and Yi is the vector of functions that explicitly define these intermediate
variables.


∆x
∆x
∆r1
∆r2
∆y1
∆y1
∆y2
∆y2
∆f
∆f
(a) (b)
Residual Functional
∆x
∆r1
∆y1
∆y2
∆f
(c)
Hybrid


2 32 3 2 3 2   32 3 2 3
T T
@R @F df
6 I 0 0 7 6 I 7 6I 7 6I 7 6 7 607
6 76 7 6 7 6 @x @x 7 6 dx 7 6 7
6 @R @R 7 6 dy 7 6 7 6   T76
6 07 6 7 6 7 6 @R
T
@F 7 6 df 7 6 7
6 @x 7 6 7 = 607 60 76 7 =6
607
7
6 @y 7 6 dx 7 6 7 6 @y @y 7 6 dr 7
4 @F @F 5 4 df 5 4 5 6 74 5 6 7
4 5
7
I 0 4 5
@x @y dx 0 0 I I I
(a) Direct method

(b) Adjoint method
2 32 3 2 32  T  T  T
32 3 2 3
@R1 @R2 @F df
6 I 0 0 07 6 I 7 6I 76I 76 7 07
6 76 7 6 76 @x @x @x 7 6 dx 7 6
6 @R 76 7 6 76    T76 7 6 7
7 6 7
6
@R1 @R1 T T
6 1 7 6 dy 7 6076 @R1 @R2 @F 76 df
6 07 6 1
6 760 76 7 607
6 @x @y1 @y2 7 6 dx 7 7 = 6 76 @y @y @y 7 6 dr1 7 6 7 7
6 76 6 76  1  1  1 T7 6 7=
6 @R2 @R2 @R2 7 6 dy2 77 6076 @R1
T
@R2
T
@F 7 6 df 7 6 7
6 07 6 760 76 7 6 0 7
6 @x @y1 @y2 7 6 dx 7 6
7
6 76 7 6 dr 7 6 7
6
6 74 6 @y2 @y2 @y2 76 27 4 7
4 @F @F @F 5 df 5 4 54 54 5 5
I 0 0 0 0 I I I
@x @y1 @y2 dx
(c) Coupled direct — residual approach (d) Coupled adjoint — residual approach
2 32 3 2 32  T  T  T
32 3 2 3
@Y1 @Y2 @F df
6 I 0 0 07 6 I 7 I 6 I
6 7 6 Design@x 7 6 7 6 07
6 J.R.R.A. Martins 7 Multidisciplinary Optimization @x @x 7 6 dx
August 7
2012 419 / 427
@x @y1 dx Design Optimization
@y2 Multidisciplinary Computing Coupled Derivatives
(c) Coupled direct — residual approach (d) Coupled adjoint — residual approach
2 32 32 32  T  T  T
32 3 2 3
@Y1 @Y2 @F df
6 I 0 0 0 7 6 I 7 6I 7 6I 76 7 07
6 76 7 6 76 @x @x @x 7 6 dx 7 6
6 @Y 76 6 76   T76 7 6 7
dy1 7 7 6 df 7 6 7
@Y1 T
6 1
07 6 7 607 6 @Y2 @F
6 I 76 76 0 I 76 7 607
6 @x @y2 7 6 dx 7 6
7
76 @y1 @y 7 6 dy1 7 6 7
6 76 =6
6 76   1 T7 6 7= 6 7
6 @Y2 @Y2 7 6 dy2 7
7 607 6 6 @Y1
T
@F 7 6 df 7 6 7
6 I 07 6 76 7 6 07
6 @x @y1 7 6 dx 7 6 7 0
7 6 76
I 7 6 dy 7 6 7
6 74 6 @y2 @y2 76 27 4 7 6
4 @F @F @F 5 df 5 4 5 4 54 5 5
I 0 0 0 0 I I I
@x @y1 @y2 dx
(e) Coupled direct — functional approach (f) Coupled adjoint — functional approach
2 32 32 32  T  T  T
32 3 2 3
@R1 @Y2 @F df
6 I 0 0 07 6 I 7 6I 76I 76 7 07
6 76 7 6 76 @x @x @x 7 6 dx 7 6
6 @R 76 6 76    T76 7 6 7
dy1 7 7 6 df 7 6 7
@R1 @R1 T T
6 1
07 6 7 6076 @R1 @Y2 @F
6 76 76 0 76 7 607
6 @x @y1 @y2 7 6 dx 7 6
7
76 @y @y1 @y 7 6 dr1 7 6 7
6 76 =6
6 76  1  1 T7 6 7= 6 7
6 @Y2 @Y2 7 6 dy2 7
7 6076 6 @R1
T
@F 7 6 df 7 6 7
6 I 07 6 76 7 6 07
6 @x @y1 7 6 dx 7 6 7 0
7 6 76
I 7 6 dy 7 6 7
6 74 6 @y2 @y2 76 27 4 7 6
4 @F @F @F 5 df 5 4 54 54 5 5
I 0 0 0 0 I I I
@x @y1 @y2 dx
(g) Hybrid direct (h) Hybrid adjoint

Numerical Example 1
In most cases, the explicit computation of state variables involves solving the
nonlinear system corresponding to the discipline; however, in this example, this is
simplified because the residuals are linear in the state variables and each discipline
has only one state variable. Thus, the explicit forms are
2y2 sin x1
Y1 (x1 , x2 , y2 ) = − +
x1 x1
y1
Y2 (x1 , x2 , y1 ) = .
x22

Numerical Example 2
Coupled — Residual (Direct)
 ∂R ∂ R1   dy1 dy1   ∂ R1 ∂ R1 
1
− −
 ∂y1 ∂y2   dx1 dx2   ∂x1 ∂x2 
 ∂ R2 ∂ R2   dy2 dy2  =  ∂ R2 ∂ R2 
− −
∂y1 ∂y2 dx1 dx2 ∂x1 ∂x2
 dy dy 
1 1
−x1 −2  dx1 dx2  y1 − cos x1 0
2  dy =
1 −x2 2 dy2 0 2x2 y2

dx1 dx2
df1 ∂ F1 ∂ F1 dy1 ∂ F1 dy2

= + +
dx1 ∂x1 ∂y1 dx1 ∂y2 dx1
df1 dy1 dy2
=0+1× +0×
dx1 dx1 dx1

Numerical Example 3
Coupled — Residual (Adjoint)
 ∂R ∂ R2   df1 df2   ∂ F1 ∂ F2 
1
− −
 ∂y1 ∂y1   dr1 dr1   ∂y1 ∂y1 
 ∂ R1 ∂ R2   df1 df2  =  ∂ F1 ∂ F2 
− −
∂y2 ∂y2 dr2 dr2 ∂y2 ∂y2
 df df 
1 2
−x1 1  dr1 dr1  1 0
2  df =
−2 −x2 1 df2 0 sin x1

dr2 dr2
df1 ∂ F1 df1 ∂ R1 df1 ∂ R2

= + +
dx1 ∂x1 dr1 ∂x1 dr2 ∂x1
df1 df1 df1
=0+ (y1 − cos x1 ) + 0
dx1 dr1 dr2

Numerical Example 4
Coupled — Functional (Direct)
 ∂ Y1   dy1 dy1   ∂ Y1 ∂ Y1 
1 −
∂y2   dx1 dx2   ∂x1 ∂x2 
  dy2 dy2  =  ∂ Y2 ∂ Y2 

 ∂ Y2
− 1
∂y1 dx1 dx2 ∂x1 ∂x2
 dy dy 
1 1
2y2
+ cosx1x1 − sinx2x1
" # " #
1 x21  dx dx  x2
0
1 2 = 1 1
− x12 1  dy2 dy2  0 − 2y1
2 x32
dx1 dx2
df1 ∂ F1 ∂ F1 dy1 ∂ F1 dy2

= + +
dx1 ∂x1 ∂y1 dx1 ∂y2 dx1
df1 dy1 dy2
=0+1 +0
dx1 dx1 dx1

Numerical Example 5
Coupled — Functional (Adjoint)
 ∂ Y2   df1 df2   ∂ F1 ∂ F2 
1 −
∂y1   dy1 dy1   ∂y1 ∂y1 
  df1 df2  =  ∂ F1

 ∂ Y1 ∂ F2 
− 1
∂y2 dy2 dy2 ∂y2 ∂y2
 df df 
" # 1 2
1 − x12  dy dy 

1 0
2
2  df11 df12  =
x1
1 0 sin x1
dy2 dy2
df1 ∂ F1 df1 ∂ Y1 df1 ∂ Y2

= + +
dx1 ∂x1 dy1 ∂x1 dy2 ∂x1

df1 df1 2y2 cos x1 sin x1 df1
=0+ + − + 0
dx1 dy1 x21 x1 x21 dy2

Numerical Example 6
Coupled — Hybrid (Direct)
 ∂R ∂ R1   dy1 dy1   ∂ R1 ∂ R1 
1
− −
 ∂y1 ∂y2   dx1 dx2   ∂x1 ∂x2 
 ∂ Y2   dy2 dy2  =  ∂ Y2 ∂ Y2 
− 1
∂y1 dx1 dx2 ∂x1 ∂x2
# dy1 dy1
 
" " #
−x1 −2  dx dx  y1 − cos x1 0
1 2
− x12 1  dy2 dy2  = 0 − 2y
x3
1
2 2
dx1 dx2
df1 ∂ F1 ∂ F1 dy1 ∂ F1 dy2

= + +
dx1 ∂x1 ∂y1 dx1 ∂y2 dx1
df1 dy1 dy2
=0+1 +0
dx1 dx1 dx1

Numerical Example 7
Coupled — Hybrid (Adjoint)
 ∂R ∂ Y2   df1 df2   ∂ F1 ∂ F2 
1
− −
 ∂y1 ∂y1   dr1 dr1   ∂y1 ∂y1 
 ∂ R1   df1 df2  =  ∂ F1 ∂ F2 
− 1
∂y2 dy2 dy2 ∂y2 ∂y2
# df1 df2
 
"
−x1 − x12  dr dr 

1 0
2  df11 df12  =
−2 1 0 sin x1
dy2 dy2
df1 ∂ F1 df1 ∂ R1 df1 ∂ Y2

= + +
dx1 ∂x1 dr1 ∂x1 dy2 ∂x1
df1 df1 df1
=0+ (y1 − cos x1 ) + 0
dx1 dr1 dy2

Martins MDO Course Slides PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Martins MDO Course Slides PDF

Uploaded by

Copyright:

Available Formats

A Short Course on

Multidisciplinary Design Optimization

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 1 / 427

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 3 / 427

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 5 / 427

2. Line Search Techniques

7. Multidisciplinary Design Optimization

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 7 / 427

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 8 / 427

About the Course

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 9 / 427

About the Course Materials

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 10 / 427

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 11 / 427

Sir George Cayley

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 12 / 427

Wright Brother’s Flyer

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 13 / 427

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 14 / 427

100 Years Later . . .

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 15 / 427

Multidisciplinary Trade-off for Supercritical Airfoils

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 16 / 427

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 17 / 427

Conventional vs. Optimal Design Process

Change Evaluate Evaluate

Is the design Is the design

Final design Final design

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 18 / 427

Multidisciplinary Design Optimization (MDO)

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 19 / 427

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 20 / 427

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 21 / 427

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 22 / 427

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 23 / 427

Optimization Problem Statement

f : objective function, output (e.g. structural weight).

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 24 / 427

Example: Trade-off Between Aerodynamics and Structures

0.04 I Sequential optimization does not lead to

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 25 / 427

Classification of Optimization Problems

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 26 / 427

Optimization Methods for Nonlinear Problems

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 27 / 427

Historical Developments in Optimization 1

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 28 / 427

Historical Developments in Optimization 2

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 29 / 427

Historical Developments in Optimization 3

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 30 / 427

Historical Developments in Optimization 4

Historical Developments in Optimization 5

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 32 / 427

Historical Developments in Optimization 6

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 33 / 427

Historical Developments in Optimization 7

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 34 / 427

Historical Developments in Optimization 8

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 35 / 427

Line Search Techniques

2. Line Search Techniques

7. Multidisciplinary Design Optimization