
A Short Course on

Multidisciplinary Design Optimization

Joaquim R. R. A. Martins
Multidisciplinary Design Optimization Laboratory
http://mdolab.engin.umich.edu

Contents
1. Introduction
1.1 About
1.2 Aircraft as Multidisciplinary Systems
1.3 Design Optimization
1.4 Optimization Problem Statement
1.5 Classification of Optimization Problems
1.6 History
2. Line Search Techniques
2.1 Motivation
2.2 Optimality
2.3 Numerics
2.4 Method of Bisection
2.5 Newton’s Method
2.6 Secant Method
2.7 Golden Section Search
2.8 Polynomial Interpolation
2.9 Line Search
3. Gradient-Based Optimization
3.1 Introduction
3.2 Gradients and Hessians
3.3 Optimality Conditions
3.4 Steepest Descent
3.5 Conjugate Gradient
3.6 Newton’s Method
3.7 Quasi-Newton Methods
3.8 Trust Region Methods

4. Computing Derivatives
4.1 Introduction
4.2 Finite Differences
4.3 Complex-Step Method
4.4 C/C++ Implementations
4.5 The Unifying Chain Rule
4.6 Monolithic Differentiation

4.7 Algorithmic Differentiation
4.8 Analytic Methods
5. Constrained Optimization
5.1 Introduction
5.2 Equality Constraints
5.3 Inequality Constraints
5.4 Constraint Qualification
5.5 Penalty Methods
5.6 Sequential Quadratic Programming
6. Gradient-Free Optimization
6.1 Introduction
6.2 Nelder–Mead Simplex
6.3 DIvided RECTangles (DIRECT)
6.4 Genetic Algorithms
6.5 Particle Swarm Optimization
7. Multidisciplinary Design Optimization
7.1 Introduction
7.2 Multidisciplinary Analysis
7.3 Extended Design Structure Matrix
7.4 Monolithic Architectures
Multidisciplinary Feasible (MDF)
Individual Discipline Feasible (IDF)
Simultaneous Analysis and Design (SAND)
The All-at-Once (AAO) Problem Statement
7.5 Distributed Architectures
Classification
7.6 Computing Coupled Derivatives



Introduction

About Me
Bio
• 1991–1995: M.Eng. in Aeronautics, Imperial College, London
• 1996–2002: M.Sc. and Ph.D. in Aeronautics and Astronautics, Stanford
• 2002–2009: Assistant/Associate Professor, University of Toronto Institute for Aerospace Studies
• 2009–present: Associate Professor, University of Michigan, Dept. of Aerospace Engineering

Highlights
• Two best papers at the AIAA MA&O Conference (2002, 2006)
• Canada Research Chair in Multidisciplinary Optimization (2002–2009)
• Keynote speaker at the International Forum on Aeroelasticity and Structural Dynamics (Stockholm, 2007)
• Keynote speaker at the Aircraft Structural Design Conference (London, 2010)
• Associate editor for the AIAA Journal and Optimization and Engineering

About You
• Name
• Title and responsibilities
• Why are you taking this course?
• What do you hope to get from this course?

About the Course


• Introduction to MDO applications and advanced topics
• Assumes no previous knowledge of optimization
• Requires knowledge of multivariable calculus and linear algebra
• Please interrupt!
  – Questions
  – Share your experience with design and optimization

About the Course Materials


• I will use slides to teach, but please refer to the course notes as well.
• The notes include a lot of detail, but if you want more, check the references: I cite almost 300 papers and books.
• The notes are optimized for electronic reading, with hyperlinks.
• History of the notes and slides:
  – I originally created the notes in the form of slides in 2003, because I wanted to cover a range of material in a particular way.
  – Colleagues at Stanford have used these notes since, and I taught one of the chapters at MIT.
  – I have recently separated the notes from the slides.
• Please email me if you find any typos or have any suggestions.

Course Content

[Course roadmap diagram linking the topics: Introduction, Single-Variable Minimization, Computing Derivatives, Gradient-Based Optimization, Handling Constraints, Gradient-Free Optimization, and MDO Architectures, all centered on MDO]

Sir George Cayley


Wright Brothers’ Flyer


Santos–Dumont’s Demoiselle


100 Years Later . . .


Multidisciplinary Trade-off for Supercritical Airfoils


Why you should not trust an aerodynamicist, even a brilliant one


What is MDO?
• We will first cover the “DO” in MDO.
• In industry, problems routinely arise that require making the best possible design decision.
• However, optimization is still underused in industry. Why?
  – Numerical optimization and MDO are still not part of most undergraduate and graduate curricula
  – Backlash due to “overselling” of numerical optimization
  – Inertia in the industrial environment
• Aerospace is one of the leading applications of engineering design optimization. Why?

Conventional vs. Optimal Design Process


[Flowchart comparing the two design processes:]
Conventional: specifications → baseline design → analyze or experiment → evaluate performance → is the design good? If no, change the design and repeat; if yes, final design.
Optimal: specifications → baseline design → analyze → evaluate objective and constraints → is the design optimal? If no, change the design and repeat; if yes, final design.

Multidisciplinary Design Optimization (MDO)


• Most modern engineering systems are multidisciplinary, and their analysis is often very complex, involving hundreds of computer programs and many people in different locations. This makes it difficult for companies to manage the design process.
• In the early days, design teams tended to be small and were managed by a single chief designer who knew most about the design details and could make all the important decisions.
• Modern design projects are more complex, so the problem has to be decomposed and each part tackled by a different team. How these teams should interact is still being debated by managers, engineers, and researchers.
• More in the last chapter . . .

Objective Function
• What do we mean by “best”?
• The objective function is a “measure of badness” that enables us to compare two designs quantitatively — assuming we want to minimize it.
• We need to be able to estimate this measure numerically.
• If we select the wrong goal, it does not matter how good the analysis is or how efficient the optimization method is. Therefore, it is important to select a good objective function.
• Selecting a good objective function is often overlooked, and it is not an easy problem, even for experienced designers.
• The objective function may be linear or nonlinear and may or may not be given explicitly.
• We will represent the objective function by the scalar f.
• There is no such thing as multiobjective optimization!

The “Disciplanes”
Is there one aircraft that is simultaneously the fastest, most efficient, quietest, and least expensive?

Design Variables
• Design variables are also known as design parameters and are represented by the vector x. They are the variables in the problem that we allow to vary in the design process.
• Optimization is the process of choosing the design variables that yield an optimum design.
• Design variables should be independent of each other.
• Design variables can be continuous or discrete. Discrete variables are sometimes integer variables.

Constraints
• Few practical engineering optimization problems are unconstrained.
• Constraints on the design variables are called bounds and are easy to enforce.
• Like the objective function, constraints can be linear or nonlinear and may or may not be given explicitly. They may be equality or inequality constraints.
• At a given design point, constraints may be active or inactive. This distinction is particularly important at the optimum.

Optimization Problem Statement


The objective function, design variables, and constraints form the optimization problem statement:

minimize  f(x)
with respect to  x ∈ Rⁿ
subject to  ĉⱼ(x) = 0,  j = 1, 2, . . . , m̂
            cₖ(x) ≥ 0,  k = 1, 2, . . . , m

f : objective function, output (e.g., structural weight).
x : vector of design variables, inputs (e.g., aerodynamic shape); bounds can be set on these variables.
ĉ : vector of equality constraints (e.g., lift); in general these are nonlinear functions of the design variables.
c : vector of inequality constraints (e.g., structural stresses); may also be nonlinear and implicit.

Example: Trade-off Between Aerodynamics and Structures


[Plots comparing sequential optimization with MDF aerostructural (AS) optimization: spanwise distributions of twist (jig and deflected), thickness, and lift, with the lift compared against the elliptical distribution]

• Need a truly multidisciplinary objective, e.g., the Breguet range:

  Range = (V/c)(L/D) ln(Wi/Wf)

• Sequential optimization does not lead to the true optimum.
• Achieving the proper trade-off requires simultaneous optimization.
• More on this in the MDO chapter . . .

Classification of Optimization Problems


[Classification tree for optimization problems:]
• Continuity: smooth vs. discontinuous
• Linearity: linear vs. nonlinear
• Time: static vs. dynamic
• Design variables: continuous vs. discrete; quantitative vs. qualitative
• Data: deterministic vs. stochastic
• Constraints: unconstrained vs. constrained
• Convexity: convex vs. non-convex

Optimization Methods for Nonlinear Problems


[Tree of optimization methods for nonlinear problems:]
• Gradient based: steepest descent, conjugate gradient, quasi-Newton
• Gradient free: grid or random search, genetic algorithms, simulated annealing, Nelder–Mead, DIRECT, particle swarm

Historical Developments in Optimization 1


300 bc: Euclid considers the minimal distance between a point and a line, and
proves that a square has the greatest area among the rectangles
with a given total length of edges.
200 bc: Zenodorus works on “Dido’s Problem”, which involved finding the
figure bounded by a line that has the maximum area for a given
perimeter.
100 bc: Heron proves that light travels between two points through the
path with shortest length when reflecting from a mirror, resulting in
an angle of reflection equal to the angle of incidence.
1615: Johannes Kepler finds the optimal dimensions of a wine barrel. He
also formulated an early version of the “marriage problem” (a
classical application of dynamic programming also known as the
“secretary problem”) when he started to look for his second wife.
The problem involved maximizing a utility function based on the
balance of virtues and drawbacks of 11 candidates.


Historical Developments in Optimization 2


1621: W. van Royen Snell discovers the law of refraction. This law
follows from the more general principle of least time (or Fermat’s
principle), which states that a ray of light going from one point to
another will follow the path that takes the least time.
1646: P. de Fermat shows that the gradient of a function is zero at an
extreme point.
1695: Isaac Newton solves for the shape of a symmetrical body of
revolution that minimizes fluid drag using calculus of variations.
1696: Johann Bernoulli challenges all the mathematicians in the world to
find the path of a body subject to gravity that minimizes the travel
time between two points of different heights — the brachistochrone
problem. Bernoulli already had a solution that he kept secret. Five
mathematicians respond with solutions: Isaac Newton, Jakob
Bernoulli (Johann’s brother), Gottfried Leibniz, Ehrenfried Walther
von Tschirnhaus and Guillaume de l’Hôpital. Newton reportedly
started solving the problem as soon as he received it, did not sleep


Historical Developments in Optimization 3


that night and took almost 12 hours to solve it, sending back the
solution that same day.
1740: L. Euler’s publication begins research on the general theory of the
calculus of variations.
1746: P. L. Maupertuis proposes the principle of least action, which
unifies various laws of physical motion. This is the precursor of the
variational principle of stationary action, which uses calculus of
variations and plays a central role in Lagrangian and Hamiltonian
classical mechanics.
1784: G. Monge investigates a combinatorial optimization problem known
as the transportation problem.
1805: Adrien Legendre describes the method of least squares, which was
used in the prediction of asteroid orbits and curve fitting. Friedrich
Gauss publishes a rigorous mathematical foundation for the method
of least squares and claims he used it to predict the orbit of the
asteroid Ceres in 1801. Legendre and Gauss engage in a bitter
dispute over who first developed the method.


Historical Developments in Optimization 4


1815: D. Ricardo publishes the law of diminishing returns for land
cultivation.
1847: A. L. Cauchy presents the steepest descent method, the first
gradient-based method.
1857: J. W. Gibbs shows that chemical equilibrium is attained when the
energy is a minimum.
1902: Gyula Farkas presents an important lemma that is later used in
the proof of the Karush–Kuhn–Tucker theorem.
1917: H. Hancock publishes the first textbook on optimization.
1932: K. Menger presents a general formulation of the traveling salesman
problem, one of the most intensively studied problems in
optimization.
1939: William Karush derives the necessary conditions for the inequality-
constrained problem in his Master’s thesis. Harold Kuhn and Albert
Tucker rediscover these conditions and publish their seminal paper in
1951. These became known as the Karush–Kuhn–Tucker (KKT)
conditions.

Historical Developments in Optimization 5


1939: Leonid Kantorovich develops a technique to solve linear
optimization problems after having been given the task of optimizing
production in the Soviet government plywood industry.
1947: George Dantzig publishes the simplex algorithm. Dantzig, who
worked for the US Air Force, reinvented and developed linear
programming further to plan expenditures and returns in order to
reduce costs to the army and increase losses to the enemy in World
War II. The algorithm was kept secret until its publication.
1947: John von Neumann develops the theory of duality for linear
problems.
1949: The first international conference on optimization, the International
Symposium on Mathematical Programming, is held in Chicago.
1951: H. Markowitz presents his portfolio theory that is based on
quadratic optimization. He receives the Nobel memorial prize in
economics in 1990.


Historical Developments in Optimization 6


1954: L. R. Ford and D. R. Fulkerson research network problems,
founding the field of combinatorial optimization.
1957: R. Bellman presents the necessary optimality conditions for
dynamic programming problems. The Bellman equation was first
applied to engineering control theory, and subsequently became an
important principle in the development of economic theory.
1959: Davidon develops the first quasi-Newton method for solving
nonlinear optimization problems. Fletcher and Powell publish
further developments in 1963.
1960: Zoutendijk presents the methods of feasible directions to generalize
the Simplex method for nonlinear programs. Rosen, Wolfe, and
Powell develop similar ideas.
1963: Wilson invents the sequential quadratic programming method.
Han re-invents it in 1975 and Powell does the same in 1977.


Historical Developments in Optimization 7


1975: Pironneau publishes a seminal paper on aerodynamic shape
optimization, which first proposes the use of adjoint methods for
sensitivity analysis.
1975: John Holland proposes the first genetic algorithm.
1977: Raphael Haftka publishes one of the first multidisciplinary design
optimization (MDO) applications, in a paper entitled
“Optimization of flexible wing structures subject to strength and
induced drag constraints”.
1979: Khachiyan proposes the first polynomial-time algorithm for linear
problems. The New York Times publishes the front-page headline “A
Soviet Discovery Rocks World of Mathematics”, saying, “A surprise
discovery by an obscure Soviet mathematician has rocked the world
of mathematics and computer analysis . . . Apart from its profound
theoretical interest, the new discovery may be applicable in weather
prediction, complicated industrial processes, petroleum refining,
the scheduling of workers at large factories . . . the theory of secret
codes could eventually be affected by the Russian discovery, and


Historical Developments in Optimization 8


this fact has obvious importance to intelligence agencies
everywhere.” In 1975, Kantorovich and T. C. Koopmans receive the
Nobel memorial prize in economics for their contributions to linear
programming.
1984: Narendra Karmarkar starts the age of interior point methods by
proposing a more efficient algorithm for solving linear problems. In
a particular application in communications network optimization,
the solution time was reduced from weeks to days, enabling faster
business and policy decisions. Karmarkar’s algorithm stimulated the
development of several other interior point methods, some of which
are used in current codes for solving linear programs.
1985: The first conference in MDO, the Multidisciplinary Analysis and
Optimization (MA&O) conference, takes place.
1988: Jameson develops adjoint-based aerodynamic shape optimization
for computational fluid dynamics (CFD).
1995: Kennedy and Eberhart propose the particle swarm optimization
algorithm.

Line Search Techniques

Single Variable Minimization — Motivation


[Flowchart: x₀ → compute search direction → line search → update x → is x a minimum? → x∗]

• Gradient-based optimization with respect to multiple variables requires a line search
• It is not necessary (or advisable) to find the exact minimum in a line search
• Desired properties:
  – Low computational cost (few iterations and low cost per iteration)
  – Low memory requirements
  – Low failure rate
• Computational effort is otherwise dominated by the evaluation of objectives, constraints, and their gradients

Classification of Minima
We can classify a minimum as a:
1. Strong local minimum
2. Weak local minimum
3. Global minimum


Optimality Conditions 1
Taylor’s theorem is the key for identifying local minima:

f(x + h) = f(x) + h f′(x) + (h²/2) f″(x) + · · · + (hⁿ⁻¹/(n − 1)!) f⁽ⁿ⁻¹⁾(x) + (hⁿ/n!) f⁽ⁿ⁾(x + θh)

where θ ∈ (0, 1) and the last term is O(hⁿ).

Assume that f is twice-continuously differentiable and that a minimum of f exists at x∗. Using n = 2 and x = x∗,

f(x∗ + ε) = f(x∗) + ε f′(x∗) + (ε²/2) f″(x∗ + θε)

For x∗ to be a local minimizer, we require that f(x∗ + ε) ≥ f(x∗) for ε ∈ [−δ, δ]. Therefore we require

ε f′(x∗) + (ε²/2) f″(x∗ + θε) ≥ 0

The first-order term ε f′(x∗) ≥ 0 must hold while the sign of ε is arbitrary, so f′(x∗) = 0. This is the first-order optimality condition. A point that satisfies the first-order optimality condition is a stationary point.

Optimality Conditions 2
Since f′(x∗) = 0, we have to consider the second-derivative term.
This term must be non-negative for a local minimum at x∗.
Since ε² > 0, this requires f″(x∗) ≥ 0. This is the second-order optimality condition.
Thus the necessary conditions for a local minimum are:

f′(x∗) = 0 and f″(x∗) ≥ 0

We have a strong local minimum if

f′(x∗) = 0 and f″(x∗) > 0

which are sufficient conditions.

What use are the optimality conditions?


The optimality conditions can be used to:
1. Verify that a point is a minimum (sufficient conditions).
2. Realize that a point is not a minimum (necessary conditions).
3. Define equations that can be solved to find a minimum.
Gradient-based minimization methods find local minima by finding points that
satisfy the optimality conditions.


Numerical Precision
• Finding x∗ such that f′(x∗) = 0 is equivalent to finding the roots of the first derivative of the function to be minimized.
• Therefore, root-finding methods can be used to find stationary points and are useful in function minimization.
• Using machine precision, it is not possible to find the exact zero, so we will be satisfied with finding an x∗ that belongs to an interval [a, b] such that the function g satisfies

  g(a)g(b) < 0 and |a − b| < ε

  where ε is a small tolerance.
• This tolerance is dictated by:
  – Finite-precision arithmetic (for double precision this is usually 1 × 10⁻¹⁶)
  – The precision of the function evaluation
  – The limit on the number of iterations we can afford to do

Convergence Rate 1
Two questions are important when considering an optimization algorithm:
• Does it converge?
• How fast does it converge?

Suppose we have a sequence of points xₖ (k = 1, 2, . . .) converging to a solution x∗. For a convergent sequence,

lim (k→∞) ‖xₖ − x∗‖ = 0

The rate of convergence is a measure of how fast an iterative method converges to the numerical solution. An iterative method is said to converge with order r when r is the largest positive number such that

0 ≤ lim (k→∞) ‖xₖ₊₁ − x∗‖ / ‖xₖ − x∗‖ʳ < ∞

For a sequence with convergence rate r, the asymptotic error constant γ is

γ = lim (k→∞) ‖xₖ₊₁ − x∗‖ / ‖xₖ − x∗‖ʳ

Convergence Rate 2
Assume ideal convergence behavior, so that the above relation holds for every k and we do not have to take the limit. Then

‖xₖ₊₁ − x∗‖ = γ‖xₖ − x∗‖ʳ for all k.

The larger r is, the faster the convergence:
• If r = 1, we have linear convergence, and ‖xₖ₊₁ − x∗‖ = γ‖xₖ − x∗‖. Convergence varies widely depending on γ:
  – If γ ∈ (0, 1), the norm of the error decreases by a constant factor every iteration.
  – If γ = 0 when r = 1, we have a special case: superlinear convergence.
  – If γ = 1, we have sublinear convergence.
  – If γ > 1, the sequence diverges.
• If r = 2, we have quadratic convergence. This is highly desirable, since convergence is rapid and independent of γ. For example, if γ = 1 and the initial error is ‖x₀ − x∗‖ = 10⁻¹, then the sequence of errors is 10⁻¹, 10⁻², 10⁻⁴, 10⁻⁸, 10⁻¹⁶, i.e., the number of digits doubles every iteration: double precision in four iterations!

Convergence Rate 3
In general, x is an n-vector and we have to rethink the definition of the error.
• We could use, for example, ‖xₖ − x∗‖.
• But this depends on the scaling of x, so we should normalize it:

  ‖xₖ − x∗‖ / ‖xₖ‖

• And . . . xₖ might be zero, so fix this:

  ‖xₖ − x∗‖ / (1 + ‖xₖ‖)

• And . . . gradients might be large. Thus, we should use a combined quantity:

  ‖xₖ − x∗‖ / (1 + ‖xₖ‖) + |f(xₖ) − f(x∗)| / (1 + |f(xₖ)|)

Convergence Rate 4
• A final issue: x∗ is usually not known! You can monitor the progress of your algorithm using the steps:

  ‖xₖ₊₁ − xₖ‖ / (1 + ‖xₖ‖) + |f(xₖ₊₁) − f(xₖ)| / (1 + |f(xₖ)|)

  Sometimes you might just use the second fraction in the above term, or the norm of the gradient. You should plot these quantities on a log axis versus k.
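As a rough sketch of this kind of monitoring (not from the notes; the helper name and the quadratically converging test sequence are made up for illustration), in Python:

import numpy as np
import matplotlib.pyplot as plt

def convergence_history(xs, fs):
    """Combined step-based convergence measure for iterates xs and values fs."""
    hist = []
    for k in range(len(xs) - 1):
        step = np.linalg.norm(xs[k + 1] - xs[k]) / (1.0 + np.linalg.norm(xs[k]))
        fchg = abs(fs[k + 1] - fs[k]) / (1.0 + abs(fs[k]))
        hist.append(step + fchg)
    return hist

# Illustrative test: a quadratically converging scalar sequence toward x* = 0
xs = [np.array([10.0**(-2**k)]) for k in range(5)]
fs = [float(x[0]**2) for x in xs]
plt.semilogy(convergence_history(xs, fs), "o-")   # log axis versus k
plt.xlabel("iteration k"); plt.ylabel("convergence measure")
plt.show()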

Method of Bisection
• Bisection is a bracketing method: it generates a set of nested intervals and requires an initial interval in which a solution is assumed to exist.
• First we find a bracket [x₁, x₂] such that f(x₁)f(x₂) < 0.
• Starting from the initial interval [x₁, x₂], bisection yields an interval of size

  δₖ = |x₁ − x₂| / 2ᵏ

  at iteration k.
• To achieve a specified tolerance ε, we need log₂(|x₁ − x₂|/ε) evaluations.
• From the definition of the rate of convergence, with r = 1,

  lim (k→∞) δₖ₊₁/δₖ = 1/2

• Bisection converges linearly with asymptotic error constant γ = 1/2.
• To find the minimum of a function using bisection, we evaluate the derivative of f at each iteration and find a point for which f′ = 0.
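A minimal Python sketch of minimization by bisection on the derivative (illustrative only; the function name, tolerances, and test problem are assumptions, not from the notes):

def bisection_minimize(dfdx, a, b, tol=1e-8, max_iter=100):
    """Find a stationary point of f by bisection on its derivative.

    Assumes dfdx(a) and dfdx(b) have opposite signs.
    """
    fa = dfdx(a)
    for _ in range(max_iter):
        c = 0.5 * (a + b)           # midpoint of the current bracket
        fc = dfdx(c)
        if abs(b - a) < tol or fc == 0.0:
            return c
        if fa * fc < 0.0:           # the zero of f' lies in [a, c]
            b = c
        else:                       # the zero of f' lies in [c, b]
            a, fa = c, fc
    return 0.5 * (a + b)

# Minimize f(x) = (x - 2)^2 by finding the zero of f'(x) = 2(x - 2)
print(bisection_minimize(lambda x: 2.0 * (x - 2.0), 0.0, 5.0))   # ~ 2.0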

Newton’s Method
Newton’s method for finding a zero can be derived from the Taylor series expansion about the current iterate xₖ:

f(xₖ₊₁) = f(xₖ) + (xₖ₊₁ − xₖ)f′(xₖ) + O((xₖ₊₁ − xₖ)²)

Ignoring the higher-order terms and assuming the next iterate to be the root (i.e., f(xₖ₊₁) = 0), we obtain

xₖ₊₁ = xₖ − f(xₖ)/f′(xₖ)

This iterative procedure converges quadratically:

lim (k→∞) |xₖ₊₁ − x∗| / |xₖ − x∗|² = const.

Newton Method for Root Finding

[Figure 9.4.1 from Numerical Recipes: Newton’s method extrapolates the local derivative to find the next estimate of the root.]

Newton Method Failure Examples


Newton’s method is not guaranteed to converge, and it only works under certain conditions.

[Figures 9.4.2 and 9.4.3 from Numerical Recipes: in one unfortunate case, Newton’s method encounters a local extremum and shoots off; bracketing bounds, as in rtsafe, would save the day. In another, the method enters a nonconvergent cycle, often encountered when f is obtained, in whole or in part, by table interpolation; with a better initial guess, the method would have succeeded.]

Newton’s Method for Function Minimization


To minimize a function using Newton’s method, we replace the function by its first derivative and the first derivative by the second derivative:

xₖ₊₁ = xₖ − f(xₖ)/f′(xₖ)   →   xₖ₊₁ = xₖ − f′(xₖ)/f″(xₖ)
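A minimal Python sketch of this iteration (illustrative; the function names, stopping test, and example problem are assumptions, not from the notes):

import math

def newton_minimize(df, d2f, x0, tol=1e-10, max_iter=50):
    """Newton iteration x <- x - f'(x)/f''(x) to find a stationary point."""
    x = x0
    for _ in range(max_iter):
        step = df(x) / d2f(x)     # Newton step applied to f'
        x -= step
        if abs(step) < tol:       # stop when the update is negligible
            break
    return x

# Illustrative problem: minimize f(x) = x^2 - sin(x)
df = lambda x: 2.0 * x - math.cos(x)     # f'(x)
d2f = lambda x: 2.0 + math.sin(x)        # f''(x) > 0, so the iteration is well behaved
print(newton_minimize(df, d2f, 0.5))     # ~ 0.4502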

Example: Newton’s Method Applied to Polynomial 1


Consider the following single-variable optimization problem:

minimize f(x) = (x − 3)x³(x − 6)⁴
w.r.t. x

We apply Newton’s method starting from different initial points.


Example: Newton’s Method Applied to Polynomial 2

[Plots: Newton’s method on the example polynomial, starting from different initial points]

Secant Method
• Newton’s method requires the first derivative at each iteration (and the second derivative when applied to minimization).
• In some cases, it might not be easy to obtain these derivatives.
• If we use a forward-difference approximation for f′(xₖ) in Newton’s method, we obtain

  xₖ₊₁ = xₖ − f(xₖ)(xₖ − xₖ₋₁)/(f(xₖ) − f(xₖ₋₁))

  which is the secant method.
• Also known as “the poor-man’s Newton method”.
• Under favorable conditions, this method has superlinear convergence (1 < r < 2), with r ≈ 1.6180.
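A compact Python sketch of the secant iteration (illustrative; the stopping test and the example problem are assumptions):

import math

def secant(f, x0, x1, tol=1e-10, max_iter=50):
    """Secant iteration for f(x) = 0; apply it to f' to find a stationary point."""
    f0 = f(x0)
    for _ in range(max_iter):
        f1 = f(x1)
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)   # secant update
        if abs(x2 - x1) < tol:
            return x2
        x0, f0, x1 = x1, f1, x2
    return x1

# Same stationary-point problem as before: f(x) = x^2 - sin(x), f'(x) = 2x - cos(x)
print(secant(lambda x: 2.0 * x - math.cos(x), 0.0, 1.0))   # ~ 0.4502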

Golden Section Search 1


• The golden section method does not find roots; it finds minima.
• It starts with an interval that contains the minimum and reduces that interval.
• Start with the uncertainty interval [0, 1]. We need two evaluations in the interior to reduce the size of the interval.
• We do not want to bias towards one side, so we choose the points symmetrically at 1 − τ and τ:

[Diagram: the interval [0, 1] with interior points 1 − τ and τ; the two candidate subintervals [0, τ] and [1 − τ, 1] each contain one of the previous interior points]

• If we evaluate two points such that the two next possible intervals are the same size and one of the points is reused, we have a more efficient method.

Golden Section Search 2


• Mathematically,

  τ/1 = (1 − τ)/τ  ⇒  τ² + τ − 1 = 0

  The positive solution of this equation is the golden ratio,

  τ = (√5 − 1)/2 = 0.618033988749895 . . .

• We evaluate the function at 1 − τ and τ, and then the two possible intervals are [0, τ] and [1 − τ, 1], which have the same size. If, say, [0, τ] is selected, then the next two interior points would be τ(1 − τ) and τ·τ. But τ² = 1 − τ, and we already have this point!
• The golden section search converges linearly.
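A minimal Python sketch of the method (illustrative; the function name and tolerance are assumptions), reusing one interior point per iteration as described above:

import math

def golden_section(f, a, b, tol=1e-8):
    """Golden-section search for a minimum of a unimodal f on [a, b]."""
    tau = (math.sqrt(5.0) - 1.0) / 2.0   # 0.6180...
    x1 = b - tau * (b - a)               # interior point at relative position 1 - tau
    x2 = a + tau * (b - a)               # interior point at relative position tau
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 > f2:                      # minimum lies in [x1, b]; reuse x2
            a, x1, f1 = x1, x2, f2
            x2 = a + tau * (b - a)
            f2 = f(x2)
        else:                            # minimum lies in [a, x2]; reuse x1
            b, x2, f2 = x2, x1, f1
            x1 = b - tau * (b - a)
            f1 = f(x1)
    return 0.5 * (a + b)

# The example polynomial on [4, 7] contains the minimum at x = 6
print(golden_section(lambda x: (x - 3.0) * x**3 * (x - 6.0)**4, 4.0, 7.0))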

Example: Golden Section Applied to Polynomial

• Converges to different optima, depending on the starting interval
• Might not converge to the best optimum within the starting interval

Polynomial Interpolation 1
• Idea: use information about f gathered during the iterations.
• One way of using this information is to produce an estimate of the function that we can easily minimize.
• The lowest-order function that we can use for this purpose is a quadratic, since a linear function does not have a minimum.
• Suppose we approximate f by

  f̃(x) = ½ax² + bx + c

• If a > 0, the minimum of this function is x∗ = −b/a.
Polynomial Interpolation 2

[Figure from Numerical Recipes, Chapter 10: successive parabolic interpolation, with one parabola through points 1, 2, 3 and another through points 1, 2, 4.]
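As a sketch of how one such interpolation step can be computed (the three-point formula below is the standard successive-parabolic-interpolation update; the function name is illustrative):

def quadratic_min(x1, x2, x3, f1, f2, f3):
    """Abscissa of the minimum of the parabola through three points."""
    num = (x2 - x1)**2 * (f2 - f3) - (x2 - x3)**2 * (f2 - f1)
    den = (x2 - x1) * (f2 - f3) - (x2 - x3) * (f2 - f1)
    return x2 - 0.5 * num / den   # stationary point of the fitted quadratic

# For f(x) = x^2 the parabola is exact, so one step lands on the minimum:
print(quadratic_min(0.0, 1.0, 2.0, 0.0, 1.0, 4.0))   # 0.0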

Line Search Methods 1


• Line search methods, like single-variable optimization methods, minimize a function of one variable
• But . . . the line search is applied to a line in n-space and does not necessarily find the exact minimum
• An important procedure in most gradient-based optimization methods
• For a search direction pₖ, we must decide the step length, i.e., αₖ in the equation

  xₖ₊₁ = xₖ + αₖpₖ

[Diagram: successive iterates xₖ, xₖ₊₁ with search directions pₖ, pₖ₊₁ and gradients gₖ, gₖ₊₁]

Line Search Methods 2


• Gradient-based algorithms find pₖ such that it is a descent direction, i.e., pₖᵀgₖ < 0, since this guarantees that f can be reduced by stepping along this direction.
• We want to compute a step length αₖ that yields a reduction in f, but we do not want to spend too much computational effort in making the choice.
• Ideally, we would find the global minimum along the line, but this is usually not worthwhile, as it requires many iterations.
• More practical methods perform an inexact line search that achieves adequate reductions of f at reasonable cost.

Wolfe Conditions 1
• A typical line search tries a sequence of step lengths, accepting the first that satisfies certain conditions.
• A common condition requires that αₖ yield a sufficient decrease of f:

  f(xₖ + αpₖ) ≤ f(xₖ) + μ₁αgₖᵀpₖ

  for a small value of μ₁, e.g., 10⁻⁴.
• Any sufficiently small step can satisfy the sufficient decrease condition, since the slope is negative at the start.
• To prevent steps that are too small, we use a second requirement called the curvature condition:

  g(xₖ + αpₖ)ᵀpₖ ≥ μ₂gₖᵀpₖ

  where μ₁ ≤ μ₂ ≤ 1, and g(xₖ + αpₖ)ᵀpₖ is the derivative of f(xₖ + αpₖ) with respect to α.
• This condition requires that the slope of the function at the new point be greater than the starting one by a certain amount.

Wolfe Conditions 2
• Since we start with a negative slope, the gradient at the new point must be either less negative or positive.
• Typical values of μ₂ range from 0.1 to 0.9.
• The sufficient decrease and curvature conditions are known collectively as the Wolfe conditions.

Strong Wolfe Conditions 1


• We can modify the curvature condition to force αₖ to lie in a neighborhood of a stationary point:

  f(xₖ + αpₖ) ≤ f(xₖ) + μ₁αgₖᵀpₖ
  |g(xₖ + αpₖ)ᵀpₖ| ≤ μ₂|gₖᵀpₖ|

  where 0 < μ₁ < μ₂ < 1.
• These two conditions together represent the strong Wolfe conditions.
• The only difference from the Wolfe conditions is that we do not allow points where the derivative has a positive value that is too large.

Backtracking Algorithm
• One of the simplest line search techniques is backtracking.
• It only checks for sufficient decrease.
• It is guaranteed to satisfy this condition . . . eventually.

Algorithm 1 Backtracking line search algorithm
1: Input: α > 0, 0 < ρ < 1   ▷ Initial step length and step reduction ratio
2: Output: αₖ   ▷ Step length
3: repeat
4:   α ← ρα   ▷ Step length reduction
5: until f(xₖ + αpₖ) ≤ f(xₖ) + μ₁αgₖᵀpₖ   ▷ Sufficient decrease condition
6: αₖ ← α
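A minimal Python version of this algorithm (a sketch; argument names and default values are assumptions):

import numpy as np

def backtracking(f, x, p, g, alpha=1.0, rho=0.5, mu1=1e-4):
    """Backtracking line search enforcing the sufficient decrease condition."""
    f0 = f(x)
    slope = g @ p                 # directional derivative g_k^T p_k (< 0 for descent)
    while f(x + alpha * p) > f0 + mu1 * alpha * slope:
        alpha *= rho              # reduce the step length
    return alpha

# Example on f(x) = x1^2 + 10 x2^2 along the steepest-descent direction
f = lambda x: x[0]**2 + 10.0 * x[1]**2
x = np.array([1.0, 1.0]); g = np.array([2.0, 20.0])
print(backtracking(f, x, -g, g))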

Line Search Satisfying Strong Wolfe Conditions


• This procedure has two stages:
  1. Begin with a trial α₁, and keep increasing it until finding either an acceptable step length or an interval that brackets the desired step lengths.
  2. In the latter case, a second stage (the zoom algorithm below) is performed that decreases the size of the interval until an acceptable step length is found.
• Define the univariate function φ(α) = f(xₖ + αpₖ), so that φ(0) = f(xₖ).
• The first stage, which brackets the minimum, is as follows . . .

Bracketing Stage Algorithm

1: Input: α₁ > 0 and αmax
2: Output: α∗
3: α₀ ← 0
4: i ← 1
5: repeat
6:   Evaluate φ(αᵢ)
7:   if [φ(αᵢ) > φ(0) + μ₁αᵢφ′(0)] or [φ(αᵢ) > φ(αᵢ₋₁) and i > 1] then
8:     α∗ ← zoom(αᵢ₋₁, αᵢ); return α∗
9:   end if
10:  Evaluate φ′(αᵢ)
11:  if |φ′(αᵢ)| ≤ −μ₂φ′(0) then return α∗ ← αᵢ
12:  else if φ′(αᵢ) ≥ 0 then
13:    α∗ ← zoom(αᵢ, αᵢ₋₁); return α∗
14:  else
15:    Choose αᵢ₊₁ such that αᵢ < αᵢ₊₁ < αmax
16:  end if
17:  i ← i + 1
18: until a step length is returned

Bracketing Stage Flow Chart


[Flowchart of the bracketing stage:]
1. Choose an initial point.
2. Evaluate the function value at the point.
3. Does the point satisfy sufficient decrease? If not, bracket the interval between the previous point and the current point and call the zoom function on it.
4. Evaluate the function derivative at the point.
5. Does the point satisfy the curvature condition? If yes, the point is good enough: end the line search.
6. Is the derivative positive? If yes, bracket the interval between the current point and the previous point and call the zoom function on it.
7. Otherwise, choose a new point beyond the current one and go back to step 2.

Zoom Stage Algorithm

1: Input: αlow, αhigh
2: Output: α∗
3: j ← 0
4: repeat
5:   Find a trial point αⱼ between αlow and αhigh
6:   Evaluate φ(αⱼ)
7:   if φ(αⱼ) > φ(0) + μ₁αⱼφ′(0) or φ(αⱼ) > φ(αlow) then
8:     αhigh ← αⱼ
9:   else
10:    Evaluate φ′(αⱼ)
11:    if |φ′(αⱼ)| ≤ −μ₂φ′(0) then
12:      α∗ ← αⱼ; return α∗
13:    else if φ′(αⱼ)(αhigh − αlow) ≥ 0 then
14:      αhigh ← αlow
15:    end if
16:    αlow ← αⱼ
17:  end if
18:  j ← j + 1
19: until a step length is returned

Zoom Stage Flow Chart


[Flowchart of the zoom stage:]
1. Interpolate between the low point and the high point to find a trial point in the interval.
2. Evaluate the function value at the trial point.
3. Does the trial point satisfy sufficient decrease, and is its value less than or equal to that of the low point? If not, set the trial point to be the new high point and go back to step 1.
4. Evaluate the function derivative at the trial point. If the point satisfies the curvature condition, it is good enough: exit zoom and end the line search. Otherwise, if the derivative sign at the point agrees with the interval trend, replace the high point with the low point; then replace the low point with the trial point and go back to step 1.
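Putting the two stages together, here is a compact Python sketch (an illustration under simplifying assumptions: the trial point in zoom is chosen by bisection rather than interpolation, and the bracketing stage simply doubles the step):

def wolfe_line_search(phi, dphi, alpha1=1.0, alpha_max=10.0,
                      mu1=1e-4, mu2=0.9, max_iter=20):
    """Two-stage line search for the strong Wolfe conditions.

    phi(a) = f(x_k + a p_k) and dphi(a) = g(x_k + a p_k)^T p_k.
    """
    phi0, dphi0 = phi(0.0), dphi(0.0)

    def zoom(lo, hi):
        for _ in range(max_iter):
            a = 0.5 * (lo + hi)            # bisection as the trial point
            if phi(a) > phi0 + mu1 * a * dphi0 or phi(a) > phi(lo):
                hi = a                     # trial point becomes the new high end
            else:
                da = dphi(a)
                if abs(da) <= -mu2 * dphi0:
                    return a               # strong Wolfe conditions satisfied
                if da * (hi - lo) >= 0.0:
                    hi = lo
                lo = a
        return a

    a_prev, a = 0.0, alpha1
    for i in range(max_iter):
        if phi(a) > phi0 + mu1 * a * dphi0 or (i > 0 and phi(a) > phi(a_prev)):
            return zoom(a_prev, a)         # bracket found: refine it
        da = dphi(a)
        if abs(da) <= -mu2 * dphi0:
            return a
        if da >= 0.0:
            return zoom(a, a_prev)
        a_prev, a = a, min(2.0 * a, alpha_max)   # keep increasing the step
    return a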

Example: Strong Wolfe Algorithm Applied to Polynomial

Gradient-Based Optimization

Gradient-Based Optimization 1
• In the previous chapter, we described methods to minimize a function of one variable.
• Now we consider problems with multiple design variables.

The unconstrained optimization problem is:

minimize  f(x)
with respect to  x ∈ Rⁿ

• x is the n-vector x = [x₁, x₂, . . . , xₙ]ᵀ
• f can be nonlinear and must have continuous first derivatives, and in some cases second derivatives

Gradient-Based Optimization 2
• Gradient-based methods use the gradient of the objective function to find the most promising search directions
• For large numbers of design variables, gradient-based methods are more efficient
• Assumptions and restrictions:
  – No constraints (addressed in a later chapter)
  – Smooth functions (gradient-free methods in a later chapter)

General Gradient-Based Optimization Algorithm 1


[Flowchart: x₀ → compute search direction → line search → update x → is x a minimum? → x∗]

Input: Initial guess, x₀
Output: Optimum, x∗
k ← 0
while not converged do
  Compute a search direction pₖ
  Find a step length αₖ such that f(xₖ + αₖpₖ) < f(xₖ) (the curvature condition may also be included)
  Update the design variables: xₖ₊₁ ← xₖ + αₖpₖ
  k ← k + 1
end while

General Gradient-Based Optimization Algorithm 2


• Iterations of the “while” loop with index k are major iterations
• Iterations within the line search are minor iterations
• pₖ is the search direction for the major iteration
• αₖ is the step length from the line search
• The way a gradient-based algorithm determines the search direction is its distinguishing feature
• Any line search that satisfies sufficient decrease can be used, but one that satisfies the strong Wolfe conditions is recommended

Gradients
Consider a function f(x). The gradient of this function is

∇f(x) ≡ g(x) ≡ [∂f/∂x₁, ∂f/∂x₂, . . . , ∂f/∂xₙ]ᵀ

In the multivariate case, the gradient vector is perpendicular to the hyperplane tangent to the contour surfaces of constant f.

Hessians 1
• The second derivative of an n-variable function is defined by n² partial derivatives: ∂²f/(∂xᵢ∂xⱼ) for i ≠ j, and ∂²f/∂xᵢ² for i = j.
• If the partial derivatives ∂f/∂xᵢ, ∂f/∂xⱼ and ∂²f/(∂xᵢ∂xⱼ) are continuous and f is single valued, then ∂²f/(∂xᵢ∂xⱼ) = ∂²f/(∂xⱼ∂xᵢ).
• The second-order partial derivatives can be represented by a square symmetric matrix called the Hessian matrix,

  ∇²f(x) ≡ H(x) ≡ [∂²f/∂x₁²  · · ·  ∂²f/(∂x₁∂xₙ); . . . ; ∂²f/(∂xₙ∂x₁)  · · ·  ∂²f/∂xₙ²]

  which contains n(n + 1)/2 independent elements.

Hessians 2
• If f is quadratic, the Hessian of f is constant, and the function can be expressed as

  f(x) = ½xᵀHx + gᵀx + α

Optimality Conditions
As in the single-variable case, the optimality conditions are derived from the Taylor series expansion about x∗:

f(x∗ + εp) ≈ f(x∗) + εpᵀg(x∗) + ½ε²pᵀH(x∗)p

where ε is a scalar and p is an n-vector.
• For x∗ to be a local minimum, f(x∗ + εp) ≥ f(x∗) ⇒ f(x∗ + εp) − f(x∗) ≥ 0.
• This means that the sum of the first- and second-order terms in the Taylor series expansion must be greater than or equal to zero.
• Start with the first-order term: since p is an arbitrary vector and ε can be positive or negative, every component of the gradient vector g(x∗) must be zero.
• Second-order term: for ε²pᵀH(x∗)p to be non-negative, H(x∗) has to be positive semi-definite.

Relation of Hessian to Shape of Quadratic 1


[Surface plots of quadratics with positive definite, positive semi-definite, indefinite, and negative definite Hessians]

Relation of Hessian to Shape of Quadratic 2


Assuming H = Hᵀ, the Hessian can be classified as:
• Positive definite if pᵀHp > 0 for all nonzero vectors p. All the eigenvalues of H are strictly positive.
• Positive semi-definite if pᵀHp ≥ 0 for all vectors p. All eigenvalues of H are positive or zero.
• Indefinite if there exist p, q such that pᵀHp > 0 and qᵀHq < 0. H has eigenvalues of mixed sign.
• Negative definite if pᵀHp < 0 for all nonzero vectors p. All eigenvalues of H are strictly negative.
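A small Python check of this classification (illustrative; the function name and tolerance are assumptions), using the eigenvalue characterization above:

import numpy as np

def classify_hessian(H, tol=1e-12):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    w = np.linalg.eigvalsh(H)          # eigenvalues of a symmetric matrix
    if np.all(w > tol):
        return "positive definite"
    if np.all(w >= -tol):
        return "positive semi-definite"
    if np.all(w < -tol):
        return "negative definite"
    return "indefinite"

print(classify_hessian(np.array([[2.0, 0.0], [0.0, 1.0]])))   # positive definite
print(classify_hessian(np.array([[2.0, 0.0], [0.0, -1.0]])))  # indefinite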

Optimality Conditions
Necessary conditions (for a local minimum):

‖g(x∗)‖ = 0 and H(x∗) is positive semi-definite.

Sufficient conditions (for a strong local minimum):

‖g(x∗)‖ = 0 and H(x∗) is positive definite.

Example: Critical Points of a Function 1


Consider the function:

f(x) = 1.5x₁² + x₂² − 2x₁x₂ + 2x₁³ + 0.5x₁⁴

Find all stationary points of f and classify them.

Solving ∇f(x) = 0 yields three solutions:

(0, 0): local minimum
½(−3 − √7, −3 − √7): global minimum
½(−3 + √7, −3 + √7): saddle point

To establish the type of point:
1. Determine if the Hessian is positive definite.
2. Compare the values of the function at the points.

Example: Critical Points of a Function 2


Steepest Descent Method 1


• The steepest descent method uses the negative of the gradient vector as the search direction
• The gradient is the direction of steepest increase, so the opposite direction gives the steepest decrease

Input: Initial guess x₀; convergence tolerances εg, εa and εr
Output: Optimum, x∗
k ← 0
repeat
  Compute the gradient of the objective function, g(xₖ) ≡ ∇f(xₖ)
  Compute the normalized search direction, pₖ ← −g(xₖ)/‖g(xₖ)‖
  Perform line search to find step length αₖ
  Update the current point, xₖ₊₁ ← xₖ + αₖpₖ
  k ← k + 1
until |f(xₖ) − f(xₖ₋₁)| ≤ εa + εr|f(xₖ₋₁)| and ‖g(xₖ₋₁)‖ ≤ εg

Steepest Descent Method 2


• |f(xₖ₊₁) − f(xₖ)| ≤ εa + εr|f(xₖ)| is a check on the successive reductions of f.
• εa is the absolute tolerance on the change in function value (usually ≈ 10⁻⁶).
• εr is the relative tolerance (usually ≈ 10⁻²).
• If f is of order 1, then εr dominates; if f gets too small, then the absolute tolerance takes over.

There is a fundamental problem with steepest descent: with exact line searches, the steepest descent direction at each iteration is orthogonal to the previous one:

df(xₖ₊₁)/dα = 0
⇒ [∂f(xₖ₊₁)/∂xₖ₊₁]ᵀ ∂(xₖ + αpₖ)/∂α = 0
⇒ ∇ᵀf(xₖ₊₁)pₖ = 0
⇒ −gᵀ(xₖ₊₁)g(xₖ) = 0

Steepest Descent Method 3


• So the directions “zigzag”, which is inefficient.
• The rate of convergence is linear.
• There is a substantial decrease in the first few iterations, but then convergence is slow.
• Guaranteed to converge, but may theoretically take an infinite number of iterations.

Example: Minimization of Quadratic with Steepest Descent

Consider this quadratic function of two variables:

f(x) = ½(x₁² + 10x₂²)
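A minimal Python sketch of steepest descent with a backtracking line search, applied to this quadratic (illustrative; names, tolerances, and the starting point are assumptions):

import numpy as np

def steepest_descent(f, grad, x0, tol=1e-6, max_iter=500):
    """Steepest descent with a backtracking (sufficient decrease) line search."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        p = -g / np.linalg.norm(g)        # normalized steepest-descent direction
        alpha, f0, slope = 1.0, f(x), g @ p
        while f(x + alpha * p) > f0 + 1e-4 * alpha * slope:
            alpha *= 0.5                  # backtrack until sufficient decrease
        x = x + alpha * p
    return x, k

f = lambda x: 0.5 * (x[0]**2 + 10.0 * x[1]**2)
grad = lambda x: np.array([x[0], 10.0 * x[1]])
x_star, iters = steepest_descent(f, grad, [10.0, 1.0])
print(x_star, iters)   # converges to (0, 0), slowly, with zigzagging steps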

Step-size Scaling
• Since steepest descent and other gradient methods do not produce well-scaled search directions, we need to use other information to guess a step length.
• One strategy is to assume that the first-order change in xₖ will be the same as the one obtained in the previous step, i.e., that

  ᾱgₖᵀpₖ = αₖ₋₁gₖ₋₁ᵀpₖ₋₁

  and therefore:

  ᾱ = αₖ₋₁ (gₖ₋₁ᵀpₖ₋₁)/(gₖᵀpₖ)

Example: Steepest Descent 1


Consider the following function:

f(x₁, x₂) = 1 − e^(−(10x₁² + x₂²))

The function f is not quadratic, but as |x₁| and |x₂| → 0 we see that

f(x₁, x₂) = 10x₁² + x₂² + O(x₁⁴) + O(x₂⁴)

Thus, this function is essentially quadratic near the minimum (0, 0)ᵀ.

Example: Steepest Descent 2


Conjugate Gradient Method 1


• A small and simple modification to the steepest descent method results in much improved convergence . . .
• . . . but it involves a lengthy derivation!

Suppose we want to minimize a convex quadratic function

φ(x) = ½xᵀAx − bᵀx

where A is an n × n matrix that is symmetric and positive definite. Differentiating this with respect to x, we obtain

∇φ(x) = Ax − b ≡ r(x)

Minimizing the quadratic is thus equivalent to solving the linear system

∇φ = 0 ⇒ Ax = b

The conjugate gradient method is an iterative method for solving linear systems of equations.

Conjugate Gradient Method 2


A set of nonzero vectors {p₀, p₁, . . . , pₙ₋₁} is conjugate with respect to A if

pᵢᵀApⱼ = 0 for all i ≠ j

There is a simple interpretation of the conjugate directions:
• If A were diagonal, the isosurfaces would be ellipsoids with axes aligned with the coordinate directions . . .
• . . . in which case we could find the minimum by performing univariate minimization along each coordinate direction in turn, converging in n iterations.
• When A is not diagonal, the contours are still elliptical, but they are not aligned with the coordinate axes.
• Minimization along coordinate directions no longer leads to the solution in n iterations (or even a finite number of iterations).

Conjugate Gradient Method 3


• However, we can do a coordinate transformation to align the coordinate axes with the ellipsoid axes:

  x̂ = S⁻¹x

  where S is a matrix whose columns are the conjugate directions with respect to A.
• The quadratic now becomes

  φ̂(x̂) = ½x̂ᵀ(SᵀAS)x̂ − (Sᵀb)ᵀx̂

• By conjugacy, SᵀAS is diagonal, so we can do a sequence of n line minimizations along the coordinate directions of x̂. Each univariate minimization determines a component of x∗ correctly.

Nonlinear Conjugate Gradient Method


When the conjugate gradient method is adapted to general nonlinear problems, we obtain the nonlinear conjugate gradient method, also known as the Fletcher–Reeves method.

Algorithm 2 Nonlinear conjugate gradient method
Input: Initial guess x₀; convergence tolerances εg, εa and εr
Output: Optimum, x∗
k ← 0
repeat
  Compute the gradient of the objective function, g(xₖ)
  if k = 0 then
    Compute the normalized steepest descent direction, pₖ ← −g(xₖ)/‖g(xₖ)‖
  else
    Compute βₖ ← gₖᵀgₖ / (gₖ₋₁ᵀgₖ₋₁)
    Compute the conjugate gradient direction, pₖ ← −gₖ/‖g(xₖ)‖ + βₖpₖ₋₁
  end if
  Perform line search to find step length αₖ
  Update the current point, xₖ₊₁ ← xₖ + αₖpₖ
  k ← k + 1
until |f(xₖ) − f(xₖ₋₁)| ≤ εa + εr|f(xₖ₋₁)| and ‖g(xₖ₋₁)‖ ≤ εg

Nonlinear Conjugate Gradient Method


• The only difference relative to steepest descent is that each descent direction is modified by adding a contribution from the previous direction.
• The convergence rate of the nonlinear conjugate gradient method is linear, but it can be superlinear, converging in n to 5n iterations.
• It needs to be restarted, usually after n iterations, or when the directions start being far from orthogonal. Restart with a steepest descent direction.
• It does not produce well-scaled search directions, so we can use the same strategy to choose the initial step size as for steepest descent.
• Several variants exist. Most differ in their definition of βₖ. For example, one alternative is

  βₖ = ‖gₖ‖² / ((gₖ − gₖ₋₁)ᵀpₖ₋₁)

  Another variant is the Polak–Ribière formula,

  βₖ = gₖᵀ(gₖ − gₖ₋₁) / (gₖ₋₁ᵀgₖ₋₁)

• Since this method is just a minor modification away from steepest descent and performs much better, there is no excuse for using steepest descent!
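A Python sketch of the Fletcher–Reeves variant with a periodic restart (illustrative; the backtracking line search, restart rule, and names are assumptions):

import numpy as np

def conjugate_gradient(f, grad, x0, tol=1e-6, max_iter=500):
    """Nonlinear conjugate gradient (Fletcher-Reeves) with restarts."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    p = -g / np.linalg.norm(g)
    for k in range(max_iter):
        if g @ p >= 0.0:                   # safeguard: reset to steepest descent
            p = -g / np.linalg.norm(g)
        alpha, f0, slope = 1.0, f(x), g @ p
        while f(x + alpha * p) > f0 + 1e-4 * alpha * slope:
            alpha *= 0.5                   # backtracking line search
        x = x + alpha * p
        g_new = grad(x)
        if np.linalg.norm(g_new) <= tol:
            break
        # Fletcher-Reeves beta, with a restart (beta = 0) every n iterations
        beta = 0.0 if (k + 1) % x.size == 0 else (g_new @ g_new) / (g @ g)
        p = -g_new / np.linalg.norm(g_new) + beta * p
        g = g_new
    return x

print(conjugate_gradient(lambda x: 0.5 * (x[0]**2 + 10.0 * x[1]**2),
                         lambda x: np.array([x[0], 10.0 * x[1]]),
                         [10.0, 1.0]))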

Example: Conjugate Gradient Method in Action


Newton’s Method 1
• The steepest descent and conjugate gradient methods only use first-order information to obtain a local model of the function.
• Newton methods use a second-order Taylor series expansion of the function about the current design point:

  f(xₖ + sₖ) ≈ fₖ + gₖᵀsₖ + ½sₖᵀHₖsₖ

  where sₖ is the step to the minimum.
• Differentiating this with respect to sₖ and setting the result to zero, we obtain

  Hₖsₖ = −gₖ

  This is a linear system whose solution yields the Newton step sₖ.
• If f is a quadratic function and Hₖ is positive definite, Newton’s method requires only one iteration to converge from any starting point.
• For a general nonlinear function, Newton’s method converges quadratically if x₀ is sufficiently close to x∗ and the Hessian is positive definite at x∗.

Newton’s Method 2
I As in the single variable case, difficulties and even failure may occur when the
quadratic model is a poor approximation of f far from the current point.
I If Hk is not positive definite, the quadratic model might not have a minimum
or even a stationary point.
I So for some nonlinear functions, the Newton step might be such that
f (xk + sk ) > f (xk ) and the method is not guaranteed to converge.
I Another disadvantage of Newton’s method is the need to compute not only
the gradient, but also the Hessian, which contains n(n + 1)/2 second order
derivatives.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 100 / 427


Gradient-Based Optimization Newton’s Method

Modified Newton’s Method 1


A small modification to Newton’s method is to perform a line search along the
Newton direction, rather than accepting the step size that would minimize the
quadratic model.

Input: Initial guess, x0 , convergence tolerances, εg , εa and εr .


Output: Optimum, x∗
k←0
repeat
Compute the gradient of the objective function, g(xk )
Compute the Hessian of the objective function, H(xk )
Compute the search direction, pk ← −H(xk)⁻¹gk (by solving H(xk)pk = −gk)
Perform line search to find step length αk , starting with α = 1
Update the current point, xk+1 ← xk + αk pk
k ←k+1
until |f(xk) − f(xk−1)| ≤ εa + εr|f(xk−1)| and ‖g(xk−1)‖ ≤ εg
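A minimal Python sketch of this modified Newton iteration (illustrative only; it assumes the Hessian is positive definite, and falls back to steepest descent otherwise rather than applying the Bk = Hk + γI modification discussed on the next slide):

import numpy as np

def modified_newton(f, grad, hess, x0, eps_g=1e-6, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps_g:
            break
        p = np.linalg.solve(hess(x), -g)   # Newton direction from H p = -g
        if g.dot(p) >= 0:                  # H not positive definite here:
            p = -g                         # fall back to steepest descent
        alpha = 1.0                        # Newton step is a good first guess
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * g.dot(p):
            alpha *= 0.5                   # backtrack along the direction
        x = x + alpha * p
    return x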

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 101 / 427


Gradient-Based Optimization Newton’s Method

Modified Newton’s Method 2


I Although this increases the probability that f(xk + pk) < f(xk), it is still
vulnerable to the problem of a Hessian that is not positive definite.
I All the other disadvantages of the pure Newton's method still apply.
I We could also use a symmetric positive-definite matrix instead of the real
Hessian to ensure descent,
Bk = Hk + γI,
where γ is chosen such that all eigenvalues of Bk are sufficiently positive.
I The starting step length ᾱ is usually set to 1, since Newton’s method already
provides a good guess for the step size.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 102 / 427


Gradient-Based Optimization Newton’s Method

Example: Modified Newton’s Method in Action

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 103 / 427


Gradient-Based Optimization Quasi-Newton Methods

Quasi-Newton Methods
I Quasi-Newton methods use only first order information . . .
I . . . but they build second order information — an approximate Hessian —
based on the sequence of function values and gradients from previous
iterations.
I They are the analog of the secant method in multidimensional space.
I The various quasi-Newton methods differ in how they update the
approximate Hessian.
I Most of them force the Hessian to be symmetric and positive definite.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 104 / 427


Gradient-Based Optimization Quasi-Newton Methods

The First Quasi-Newton Method


I A bit of interesting history . . .
I One of the first quasi-Newton methods was devised by Davidon in 1959, who
was a physicist at Argonne National Laboratory.
I He was using a coordinate descent method, and had limited computer
resources, so he invented a more efficient method that resulted in the first
quasi-Newton method.
I This was one of the most revolutionary ideas in nonlinear optimization.
I Davidon’s paper was not accepted for publication! It remained a technical
report until 1991.
I Fletcher and Powell later modified the method and showed that it was much
faster than the methods of the time, and hence it became known as the
Davidon–Fletcher–Powell (DFP) method.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 105 / 427


Gradient-Based Optimization Quasi-Newton Methods

The Basis for Quasi-Newton Methods 1


I Suppose we model the objective function as a quadratic
φk(p) = fk + gkᵀp + (1/2) pᵀBk p,
where Bk is an n × n symmetric positive definite matrix that is updated
every iteration.
I The step pk that minimizes this convex quadratic model is

pk = −Bk⁻¹gk.

I This solution is used to compute the search direction to obtain the new
iterate
xk+1 = xk + αk pk
where αk is obtained using a line search.
I This is the same procedure as the Newton method, except that we use an
approximate Hessian Bk instead of the true Hessian.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 106 / 427


Gradient-Based Optimization Quasi-Newton Methods

The Basis for Quasi-Newton Methods 2


I Instead of computing Bk “from scratch” at every iteration, a quasi-Newton
method updates it to account for the curvature estimate from the most
recent step.
I We want to build an updated quadratic model,

φk+1(p) = fk+1 + gk+1ᵀp + (1/2) pᵀBk+1 p.
2
I Using the secant method we can find the univariate quadratic function along
the previous direction pk based on the last two gradients, gk+1 and gk,
and the last function value, fk+1.
I The slope of the univariate function is the gradient of the function projected
onto the p direction, f′ = gᵀp. The univariate quadratic is given by
φk+1(θ) = fk+1 + θ f′k+1 + (θ²/2) f̃″k+1,
where f̃″k+1 is an approximation to the curvature along the last step, sk = αk pk.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 107 / 427


Gradient-Based Optimization Quasi-Newton Methods

The Basis for Quasi-Newton Methods 3


I This curvature approximation is given by a forward finite difference on the
slopes,
f̃″k+1 = (f′k+1 − f′k) / (αk‖pk‖),
where the slopes are obtained by projecting the respective gradients onto the
last direction, pk.
I The result is a quadratic that matches the slope and value at the current
point, and the slope at the previous point.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 108 / 427


Gradient-Based Optimization Quasi-Newton Methods

The Basis for Quasi-Newton Methods 4

[Figure: projection of the quadratic model φ onto the last search direction,
showing the slopes f′k at xk and f′k+1 at xk+1 and illustrating the secant
condition]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 109 / 427


Gradient-Based Optimization Quasi-Newton Methods

The Basis for Quasi-Newton Methods 5


I Going back to n-dimensional space, after some manipulation we obtain,

Bk+1 αk pk = gk+1 − gk .

which is called the secant condition.


I For convenience, we set the difference of the gradients to yk = gk+1 − gk ,
and sk = xk+1 − xk so the secant condition is then written as

Bk+1 sk = yk .

[Figure: successive iterates xk and xk+1 with search directions pk, pk+1 and
gradients gk, gk+1]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 110 / 427


Gradient-Based Optimization Quasi-Newton Methods

Davidon–Fletcher–Powell (DFP) Method 1


I In the Hessian update, Bk+1 we have n(n + 1)/2 unknowns and only n
equations.
I To determine the solution uniquely, we impose the condition that, among all
the matrices that satisfy the secant condition, we select the Bk+1 that is
“closest” to the previous Hessian approximation Bk
I This can be done by solving the optimization problem

minimize ‖B − Bk‖
with respect to B
subject to B = Bᵀ, B sk = yk.

I Using different matrix norms results in different quasi-Newton methods.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 111 / 427


Gradient-Based Optimization Quasi-Newton Methods

Davidon–Fletcher–Powell (DFP) Method 2


I One norm that makes it easy to solve this problem and possesses good
numerical properties is the weighted Frobenius norm

‖A‖W = ‖W^(1/2) A W^(1/2)‖F,


where the Frobenius norm is defined by ‖C‖F² = Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ cᵢⱼ². The weights W are
chosen to satisfy certain favorable conditions.
I The norm is non-dimensional (i.e., does not depend on the units of the problem)
if the weights are chosen appropriately.
I Using this norm and weights, the unique solution of the norm minimization
problem is
Bk+1 = (I − (yk skᵀ)/(ykᵀsk)) Bk (I − (sk ykᵀ)/(ykᵀsk)) + (yk ykᵀ)/(ykᵀsk),

which is the DFP updating formula originally proposed by Davidon.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 112 / 427


Gradient-Based Optimization Quasi-Newton Methods

Davidon–Fletcher–Powell (DFP) Method 3


I Working with the inverse of Bk is usually more useful, since the search
direction can then be obtained by matrix multiplication. Define
Vk = Bk⁻¹.
I The DFP update for the inverse of the Hessian approximation can be shown
to be
Vk+1 = Vk − (Vk yk ykᵀVk)/(ykᵀVk yk) + (sk skᵀ)/(ykᵀsk)
I Note that this is a rank-2 update.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 113 / 427


Gradient-Based Optimization Quasi-Newton Methods

Quasi-Newton Algorithm with DFP Update

Input: Initial guess, x0 , convergence tolerances, εg , εa and εr .


Output: Optimum, x∗
k←0
V0 ← I
repeat
Compute the gradient of the objective function, g(xk )
Compute the search direction, pk ← −Vk gk
Perform line search to find step length αk , starting with α ← 1
Update the current point, xk+1 ← xk + αk pk
Set the step length, sk ← αk pk
Compute the change in the gradient, yk ← gk+1 − gk
Compute Ak ← (Vk yk ykᵀVk)/(ykᵀVk yk)
Compute Bk ← (sk skᵀ)/(skᵀyk)
Compute the updated approximation to the inverse of the Hessian, Vk+1 ←
Vk − Ak + Bk
until |f(xk) − f(xk−1)| ≤ εa + εr|f(xk−1)| and ‖g(xk−1)‖ ≤ εg
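A Python sketch of this algorithm (a simplified illustration; a production implementation would use a line search enforcing the Wolfe conditions so that ykᵀsk > 0 and Vk remains positive definite):

import numpy as np

def quasi_newton_dfp(f, grad, x0, eps_g=1e-6, max_iter=200):
    x = np.asarray(x0, dtype=float)
    V = np.eye(x.size)                    # V0 = I
    g = grad(x)
    for k in range(max_iter):
        p = -V.dot(g)                     # search direction
        alpha = 1.0                       # start the line search at alpha = 1
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * g.dot(p):
            alpha *= 0.5
        s = alpha * p                     # step s_k
        x = x + s
        g_new = grad(x)
        y = g_new - g                     # gradient change y_k
        A = V @ np.outer(y, y) @ V / (y @ V @ y)
        B = np.outer(s, s) / (s @ y)
        V = V - A + B                     # DFP update of the inverse Hessian
        g = g_new
        if np.linalg.norm(g) <= eps_g:
            break
    return x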
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 114 / 427
Gradient-Based Optimization Quasi-Newton Methods

Broyden–Fletcher–Goldfarb–Shanno (BFGS) Method


I The DFP update was soon superseded by the BFGS formula, which is
generally considered to be the most effective quasi-Newton update.
I Instead of solving the norm minimization problem for B, we now solve the
same problem for its inverse, V, resulting in
Vk+1 = (I − (sk ykᵀ)/(skᵀyk)) Vk (I − (yk skᵀ)/(skᵀyk)) + (sk skᵀ)/(skᵀyk).
I The relative performance between the DFP and BFGS methods is problem
dependent.
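As a sketch, swapping the following inverse-Hessian update into the DFP code above (which imports numpy as np) yields a BFGS method; the same caveats about the line search apply:

def bfgs_update(V, s, y):
    # BFGS update of the inverse-Hessian approximation (a rank-2 update)
    rho = 1.0 / s.dot(y)
    E = np.eye(len(s)) - rho * np.outer(s, y)   # I - s y^T / (s^T y)
    return E @ V @ E.T + rho * np.outer(s, s)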

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 115 / 427


Gradient-Based Optimization Quasi-Newton Methods

A Beer-Inspired Algorithm?

[Photo: Broyden, Fletcher, Goldfarb, and Shanno at the NATO Optimization
Meeting (Cambridge, UK, 1983), a seminal meeting for continuous optimization]
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 116 / 427
Gradient-Based Optimization Quasi-Newton Methods

Example: BFGS Applied to Simple Function

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 117 / 427


Gradient-Based Optimization Quasi-Newton Methods

Example: Minimization of the Rosenbrock Function 1

Steepest descent

[Contour plot of the Rosenbrock function (axes x1, x2) showing the
steepest-descent iteration path]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 118 / 427


Gradient-Based Optimization Quasi-Newton Methods

Example: Minimization of the Rosenbrock Function 2


Nonlinear conjugate gradient

[Contour plot of the Rosenbrock function (axes x1, x2) showing the nonlinear
conjugate gradient iteration path]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 119 / 427


Gradient-Based Optimization Quasi-Newton Methods

Example: Minimization of the Rosenbrock Function 3


Modified Newton

[Contour plot of the Rosenbrock function (axes x1, x2) showing the modified
Newton iteration path]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 120 / 427


Gradient-Based Optimization Quasi-Newton Methods

Example: Minimization of the Rosenbrock Function 4


BFGS

[Contour plot of the Rosenbrock function (axes x1, x2) showing the BFGS
iteration path]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 121 / 427


Gradient-Based Optimization Quasi-Newton Methods

Example: Minimization of the Rosenbrock Function 5


Convergence rate comparison

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 122 / 427


Gradient-Based Optimization Quasi-Newton Methods

Symmetric Rank-1 Update Method (SR1) 1


I If we drop the requirement that the approximate Hessian (or its inverse) be
positive definite, we can derive a simple rank-1 update formula for Bk that
maintains the symmetry of the matrix and satisfies the secant equation.
I The symmetric rank-1 update (SR1) is such a formula:
Bk+1 = Bk + ((yk − Bk sk)(yk − Bk sk)ᵀ) / ((yk − Bk sk)ᵀsk).
I With this formula, we must have safeguards:
I If yk = Bk sk, the denominator is zero, and the only update that satisfies
the secant equation is Bk+1 = Bk (i.e., do not change the matrix).
I If yk ≠ Bk sk and (yk − Bk sk)ᵀsk = 0, there is no symmetric rank-1
update that satisfies the secant equation.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 123 / 427


Gradient-Based Optimization Quasi-Newton Methods

Symmetric Rank-1 Update Method (SR1) 2


I To avoid the second case, we update the matrix only if the following
condition is met:
|skᵀ(yk − Bk sk)| ≥ r‖sk‖‖yk − Bk sk‖,
where r ∈ (0, 1) is a small number (e.g., r = 10⁻⁸). If this condition
is not met, we set Bk+1 = Bk.
I In practice, the matrices produced by SR1 have been found to approximate
the true Hessian matrix well (often better than BFGS)
I This may be useful in trust-region methods or constrained optimization
problems, where the Hessian of the Lagrangian is often indefinite, even at the
minimizer.
I It may be necessary to add a diagonal matrix γI to Bk when calculating the
search direction, as was done in modified Newton’s method.
I A simple back-tracking line search can be used, since the Wolfe conditions
are not required as part of the update — unlike BFGS.
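A sketch of the SR1 update with this safeguard, in the same numpy style as the earlier update sketches:

import numpy as np

def sr1_update(B, s, y, r=1e-8):
    # Symmetric rank-1 update of the Hessian approximation B
    d = y - B.dot(s)                          # y_k - B_k s_k
    if abs(s.dot(d)) < r * np.linalg.norm(s) * np.linalg.norm(d):
        return B                              # safeguard: skip the update
    return B + np.outer(d, d) / d.dot(s)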

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 124 / 427


Gradient-Based Optimization Trust Region Methods

Trust Region Methods


I Trust region, or “restricted-step” methods are a different approach to
resolving the weaknesses of the pure form of Newton’s method.
I These weaknesses arise from the fact that we are stepping outside the
region for which the quadratic approximation is reasonable.
I We can overcome these difficulties by minimizing the quadratic function
within a region around xk within which we trust the quadratic model.
I The reliability index, rk , is the ratio of the actual reduction to the predicted
reduction; the closer it is to unity, the better the agreement. If fk+1 > fk
(new point is worse), rk is negative.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 125 / 427


Gradient-Based Optimization Trust Region Methods

Trust Region Algorithm


Algorithm 3 Trust region algorithm
Input: Initial guess x0 , convergence tolerances, εg , εa and εr , initial size of the
trust region, h0
Output: Optimum, x∗
k←0
repeat
Compute the Hessian of the objective function H(xk ), and solve the quadratic
subproblem:
minimize q(sk) = f(xk) + g(xk)ᵀsk + (1/2) skᵀH(xk)sk
w.r.t. sk
s.t. −hk ≤ (sk)i ≤ hk, i = 1, . . . , n

Evaluate f(xk + sk) and compute the ratio that measures the accuracy of
the quadratic model,
rk ← (f(xk) − f(xk + sk)) / (f(xk) − q(sk)) = Δf/Δq

if rk < 0.25 then


hk+1 ← ‖sk‖/4 . Model is not good; shrink the trust region
else if rk > 0.75 and hk = ‖sk‖ then
hk+1 ← 2hk . Model is good and new point on edge; expand trust
region
else
hk+1 ← hk . New point within trust region and the model is reasonable;
keep trust region the same size
end if
if rk ≤ 0 then
xk+1 ← xk . Keep trust region centered about the same point
else
xk+1 ← xk + sk . Move center of trust region to new point
end if
k ←k+1
until |f(xk) − f(xk−1)| ≤ εa + εr|f(xk−1)| and ‖g(xk−1)‖ ≤ εg
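A Python sketch of Algorithm 3. For simplicity, the bound-constrained subproblem is “solved” here by clipping the Newton step to the trust region, which is only a crude stand-in for a proper subproblem solver (and assumes H is positive definite so the model predicts a reduction):

import numpy as np

def trust_region(f, grad, hess, x0, h0=1.0, eps_g=1e-6, max_iter=100):
    x, h = np.asarray(x0, dtype=float), h0
    for k in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) <= eps_g:
            break
        s = np.clip(np.linalg.solve(H, -g), -h, h)   # crude subproblem solve
        pred = -(g.dot(s) + 0.5 * s @ H @ s)          # f(x_k) - q(s_k)
        r = (f(x) - f(x + s)) / pred                  # reliability index
        if r < 0.25:
            h = np.linalg.norm(s) / 4.0               # shrink the trust region
        elif r > 0.75 and np.isclose(np.max(np.abs(s)), h):
            h = 2.0 * h                               # expand the trust region
        if r > 0:
            x = x + s                                 # accept the new point
    return x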

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 126 / 427


Computing Derivatives

Computing Derivatives
1. Introduction

2. Line Search Techniques

3. Gradient-Based Optimization

4. Computing Derivatives
4.1 Introduction
4.2 Finite Differences
4.3 Complex-Step Method
4.4 C/C++ Implementations
4.5 Unifying Chain Rule
4.6 The Unifying Chain Rule
4.7 Monolithic Differentiation
4.8 Algorithmic Differentiation
4.9 Analytic Methods

5. Constrained Optimization

6. Gradient-Free Optimization

7. Multidisciplinary Design Optimization


J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 127 / 427
Computing Derivatives Introduction

What’s in a name?
I Derivatives have also been called:
I “Sensitivities” . . . but sensitivity analysis is actually a much broader area of
mathematics.
I “Sensitivity derivatives” — a somewhat redundant term?
I “Design sensitivities” — a fair term to use.
I I have been using the terms “sensitivities” and “sensitivity analysis” up until
this year, but now I prefer “derivatives”, since it is more precise.
I A “gradient” is a vector of derivatives
I A Jacobian is a matrix of derivatives (the gradient of a vector)
I We will focus on first order derivatives of deterministic numerical models.
I A model can be any numerical procedure that, given inputs, computes some
outputs

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 128 / 427


Computing Derivatives Introduction

What derivatives do we need for optimization?


Consider a general constrained optimization problem of the form:

minimize f(xi)                                   (1)
w.r.t. xi , i = 1, 2, . . . , n                  (2)
subject to cj(xi) ≥ 0, j = 1, 2, . . . , m       (3)

To solve this problem using gradient-based optimization we require:


I Gradient of the objective function, ∇f (x) = ∂f /∂xi , an n-vector.
I Gradient of all active constraints, ∂cj /∂xi , an (m × n) matrix (Jacobian)

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 129 / 427


Computing Derivatives Introduction

The Root of Most Problems in Gradient-Based


Optimization
[Flowchart: optimizer loop — from x0, the optimizer computes a search
direction, performs a line search to update x, runs the analysis, computes
gradients, and checks for convergence]
I The computation of the derivatives can be the bottleneck in gradient-based
optimization
I Most gradient-based optimizers use finite differences as the default
I This often leads to long computational times and failure to converge
I Accurate and efficient gradients are essential for effective optimization

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 130 / 427


Computing Derivatives Introduction

Methods for Computing Derivatives


Symbolic: Exact, but limited to explicit functions
Finite differences: Easy to implement and no source code is needed, but subject
to large errors; cost proportional to the number of design variables
Complex step: Relatively easy to implement, but source code is needed.
Numerically exact. Cost is still proportional to the number of
variables.
Algorithmic differentiation: Requires the source code, memory requirements can
become prohibitive, cost can be independent of the number of
design variables.
Analytic methods: Numerically exact, long development time, source code is
needed, but cost can be independent of the number of design
variables.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 131 / 427


Computing Derivatives Finite Differences

Finite Differences 1
I Finite differences are one of the most popular methods for computing
derivatives, mostly because they are extremely easy to implement and do not
require source code
I . . . but they suffer from some serious accuracy and performance issues.
I Finite-difference formulas are derived by combining Taylor series expansions
I It is possible to obtain formulas for arbitrary order derivatives with arbitrary
order truncation error (but it will cost you!)

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 132 / 427


Computing Derivatives Finite Differences

Finite Differences 2
The simplest finite-difference formula can be derived directly from one Taylor
series expansion,
F(x + ej h) = F(x) + h ∂F/∂xj + (h²/2!) ∂²F/∂xj² + (h³/3!) ∂³F/∂xj³ + . . . ,
Solving for ∂F/∂xj we get
∂F/∂xj = [F(x + ej h) − F(x)]/h + O(h),
where h is the finite-difference interval. This approximation is called a forward
difference and is directly related to the definition of the derivative. The
truncation error is O(h), and hence this is a first-order approximation.
I F can be a vector containing all the functions of interest
I The forward-difference formula requires two function evaluations and yields
one column of the Jacobian

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 133 / 427


Computing Derivatives Finite Differences

Finite Differences 3
I Each additional column requires an additional evaluation
I Hence, the cost of computing the complete Jacobian is proportional to the
number of input variables of interest, nx .
For a second-order estimate we use the expansion of f(x − h),
f(x − h) = f(x) − h f′(x) + (h²/2!) f″(x) − (h³/3!) f‴(x) + . . . ,
and subtract it from f(x + h) to get the central-difference formula,
f′(x) = [f(x + h) − f(x − h)]/(2h) + O(h²).
More accurate estimates can also be derived by combining different Taylor series
expansions.
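In Python, these two estimates are one-liners (for a scalar function; the default h is an illustrative choice, subject to the step-size dilemma discussed below):

def forward_diff(f, x, h=1e-6):
    # First-order forward difference: O(h) truncation error
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h=1e-6):
    # Second-order central difference: O(h^2) truncation error
    return (f(x + h) - f(x - h)) / (2.0 * h)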

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 134 / 427


Computing Derivatives Finite Differences

Finite Differences 4
Formulas for estimating higher-order derivatives can be obtained by nesting
finite-difference formulas. We can use, for example, the central-difference
formula to estimate the second derivative instead of the first,
f″(x) = [f′(x + h) − f′(x − h)]/(2h) + O(h²),
and use central differences again to estimate both f′(x + h) and f′(x − h) in
the above equation to obtain
f″(x) = [f(x + 2h) − 2f(x) + f(x − 2h)]/(4h²) + O(h²).

I Finite differences are subject to the step-size dilemma:


I Want to use a very small h to reduce the truncation error
I . . . but cannot make h too small because of subtractive cancellation
Subtractive cancellation is due to finite precision arithmetic.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 135 / 427


Computing Derivatives Finite Differences

Finite Differences 5
f(x + h)   +1.234567890123431
f(x)       +1.234567890123456
Δf         −0.000000000000025

[Figure: finite-difference approximation — the secant slope between f(x) and
f(x + h)]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 136 / 427


Computing Derivatives Finite Differences

Finite Differences 6
I For functions of several variables, we have to calculate each component
of the gradient ∇f(x) by perturbing the corresponding component of x and
recomputing f.
I Thus the cost of calculating a gradient is proportional to the number of
design variables.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 137 / 427


Computing Derivatives Complex-Step Method

The Complex-Step Method


I The complex-step derivative approximation computes derivatives of real
functions using complex variables.
I Originates from a more general method published in 1967 for computing
higher order derivatives with arbitrary precision
I Rediscovered in 1998 as a simple formula for first derivatives
I Generalized for real-world applications soon after that
I Extremely accurate, robust, and relatively easy to implement

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 138 / 427


Computing Derivatives Complex-Step Method

Complex-step Method Applications 1


I Gradients and Jacobians in CFD
I Verification of high-fidelity aerostructural derivatives
I Immunology model sensitivities
I Jacobians in liquid chromatography
I First and second derivatives of Kalman filters
I Hessian matrices in statistics
I Sensitivities in biotechnology

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 139 / 427


Computing Derivatives Complex-Step Method

Theory 1
I Like finite-difference formulas, the complex-step approximations can also be
derived using a Taylor series expansion.
I Instead of using a real step h, we now use a pure imaginary step, ih.
I If f is a real function of real variables and it is also analytic, we can expand it
in a Taylor series about a real point x as follows,
F(x + ihej) = F(x) + ih ∂F/∂xj − (h²/2) ∂²F/∂xj² − (ih³/6) ∂³F/∂xj³ + . . .
Taking the imaginary parts of both sides of this equation and dividing by h
yields
∂F/∂xj = Im[F(x + ihej)]/h + O(h²)
We call this the complex-step derivative approximation. Hence the
approximation is an O(h²) estimate of the derivative.
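In a language with built-in complex arithmetic, the approximation is a one-liner. A Python sketch, using (for illustration) the test function that appears in the example later in this section:

import numpy as np

def complex_step(f, x, h=1e-20):
    # Complex-step derivative: no subtraction, so no cancellation error
    return f(x + 1j * h).imag / h

# Illustration on f(x) = exp(x) / sqrt(sin(x)^3 + cos(x)^3)
f = lambda x: np.exp(x) / np.sqrt(np.sin(x)**3 + np.cos(x)**3)
dfdx = complex_step(f, 1.5)   # accurate to machine precision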

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 140 / 427


Computing Derivatives Complex-Step Method

Theory 2
I Like finite differences, each additional evaluation results in a column of the
Jacobian dF/dx, and the cost of computing the derivatives is proportional to
the number of design variables, nx.
I No subtraction operation in the complex-step approximation, so no
subtractive cancellation error
I the only source of numerical error is the truncation error, O(h2 ).
I By decreasing h to a small enough value, the truncation error can be made to
be of the same order as the numerical precision of the evaluation of f .
I If we take the real part of the Taylor series expansion, we get
f(x) = Re[f(x + ih)] + (h²/2!) f″(x) − . . . ,
showing that the real part of the result gives the value of f(x) correct to
O(h²).

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 141 / 427


Computing Derivatives Complex-Step Method

Theory 3
I The second order errors in the function value and the function derivative can
be eliminated when using finite-precision arithmetic by ensuring that h is
sufficiently small.
I If ε is the relative working precision of a given algorithm, to eliminate the
truncation error, we need an h such that
h²|f″(x)|/2! < ε|f(x)|
I Similarly, for the truncation error of the derivative estimate to vanish we
require that
h²|f‴(x)|/3! < ε|f′(x)|
I Although h can be made very small, in some cases it is not possible
to satisfy these conditions, e.g., when f(x) or f′(x) tend to zero.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 142 / 427


Computing Derivatives Complex-Step Method

Another derivation of the complex-step 1


I Consider a function, f = u + iv, of the complex variable, z = x + iy. If f is
analytic, the Cauchy–Riemann equations apply, i.e.,
∂u/∂x = ∂v/∂y
∂u/∂y = −∂v/∂x.
I We can use the definition of a derivative in the right-hand side of the first
Cauchy–Riemann equation to get
∂u/∂x = lim_{h→0} [v(x + i(y + h)) − v(x + iy)]/h,
where h is a small real number.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 143 / 427


Computing Derivatives Complex-Step Method

Another derivation of the complex-step 2


I Since the functions are real functions of a real variable, y = 0, u(x) = f(x),
and v(x) = 0, so we can write
∂f/∂x = lim_{h→0} Im[f(x + ih)]/h.
I For a small discrete h, this can be approximated by
∂f/∂x ≈ Im[f(x + ih)]/h.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 144 / 427


Computing Derivatives Complex-Step Method

Another derivation of the complex-step 3


[Diagram: the complex plane — a real step from (x, 0) to (x + h, 0) versus an
imaginary step from (x, 0) to (x, ih)]
∂F/∂x ≈ [F(x + h) − F(x)]/h        ∂F/∂x ≈ (Im[F(x + ih)] − Im[F(x)])/Im[ih]
⇒ ∂F/∂x ≈ Im[F(x + ih)]/h

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 145 / 427


Computing Derivatives Complex-Step Method

Example: The Complex-Step Method Applied to a Simple


Function 1
I Consider the following analytic function:
f(x) = eˣ / √(sin³x + cos³x)
I We define the relative error as
ε = |f′ − f′ref| / |f′ref|.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 146 / 427


Computing Derivatives Complex-Step Method

Example: The Complex-Step Method Applied to a Simple


Function 2

[Plot: relative error of the derivative vs. decreasing step size (axes:
normalized error e, step size h)]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 147 / 427


Computing Derivatives Complex-Step Method

Application of the Complex-Step to General Programs


I To what extent can the complex-step method be used in a general numerical
algorithm?
I We had to assume that the function F is analytic, so we need to examine
whether this assumption holds in numerical algorithms.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 148 / 427


Computing Derivatives Complex-Step Method

Relational logic operators 1


I Relational logic operators (=, <, >, ≤, ≥) are usually not defined for complex
numbers.
I These operators are used with conditional statements to redirect the
execution thread.
I Original algorithm and its “complexified” version should follow the same
execution thread.
I Therefore, defining these operators to compare only the real parts is the
correct approach.
I Since max and min are based on relational operators, we should choose a
number based on its real part alone.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 149 / 427


Computing Derivatives Complex-Step Method

Relational logic operators 2


I Algorithms that use conditional statements are likely to be discontinuous
functions of their inputs
I Either the function value itself is discontinuous or the discontinuity is in the
first or higher derivatives.
I Using finite differences, the estimate is incorrect if the two function
evaluations are within h of the discontinuity location.
I Using the complex-step, the resulting derivative estimate is correct right up to
the discontinuity.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 150 / 427


Computing Derivatives Complex-Step Method

Arithmetic functions
I Arithmetic functions and operators include addition, multiplication, and
trigonometric functions.
I Most of these functions have a standard complex definition that is analytic,
so the complex-step derivative approximation yields the correct result.
I The only standard complex function definition that is non-analytic is the
absolute value function.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 151 / 427


Computing Derivatives Complex-Step Method

Redefining the absolute value function 1


I When the argument is a complex number, the function returns the positive
real number, |z| = √(x² + y²).
I This function is not analytic, so the complex-step does not work.
I To derive an analytic definition of this function, we apply the
Cauchy–Riemann equations to get:
∂u/∂x = ∂v/∂y = −1, if x < 0;  +1, if x > 0.

I Since ∂v/∂x = 0 on the real axis, we get ∂u/∂y = 0 on the same axis, so the
real part of the result must be independent of the imaginary part of the
variable.
I Therefore, the new sign of the imaginary part depends only on the sign of the
real part of the complex number, and an analytic “absolute value” function is
abs(x + iy) = −x − iy, if x < 0;  +x + iy, if x > 0.
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 152 / 427
Computing Derivatives Complex-Step Method

Redefining the absolute value function 2


I This is not analytic at x = 0 since a derivative does not exist for the real
absolute value.
I In practice, the x > 0 condition is substituted by x ≥ 0, so that we can obtain
a function value for x = 0 and calculate the correct right-hand-side derivative.
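A sketch of this redefinition in Python for scalar complex arguments (this mirrors the operator overloading done in complexify.f90, though the actual implementation there may differ in detail):

def cabs(z):
    # Analytic "absolute value" for complex-step arguments: the branch is
    # chosen by the real part alone; the x >= 0 branch yields the correct
    # right-hand-side derivative at x = 0
    return -z if z.real < 0 else z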

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 153 / 427


Computing Derivatives Complex-Step Method

Other Issues 1
I Improvements to the complex-step method are necessary because of the way
certain compilers implement the functions.
I For example, the following formula might be used for the arcsin function:
arcsin(z) = −i log(iz + √(1 − z²)),

which may yield a zero derivative.


I To see how this happens, consider z = x + ih, where x = O(1) and
h = O(10−20 ), then in the addition,

iz + z = (x − h) + i (x + h) ,

h vanishes when using finite precision arithmetic. Therefore, we would like to


keep the real and imaginary parts separate.
I The complex definition of sine is also problematic. For example, in

sin(z) = (e^(iz) − e^(−iz))/(2i).
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 154 / 427
Computing Derivatives Complex-Step Method

Other Issues 2
I The complex trigonometric relation yields a better alternative,

sin(x + ih) = sin(x) cosh(h) + i cos(x) sinh(h).

I Linearizing this last equation (that is for small h) this simplifies to,

sin(x + ih) ≈ sin(x) + ih cos(x).

I From the standard complex definition,


arcsin(z) = −i log(iz + √(1 − z²)).

I We would like the real and imaginary parts to be calculated separately. This
can be achieved by linearizing in h to obtain
arcsin(x + ih) ≈ arcsin(x) + i h/√(1 − x²).
1 − x2

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 155 / 427


Computing Derivatives Complex-Step Method

Implementation Procedure
The general procedure for the implementation of the complex-step method for an
arbitrary computer program can be summarized as follows:
1. Substitute all real type variable declarations with complex declarations. It is
not strictly necessary to declare all variables complex, but it is much easier to
do so.
2. Define all functions and operators that are not defined for complex
arguments.
3. Add a small complex step (e.g., h = 1 × 10⁻²⁰) to the desired x, run the
algorithm that evaluates f , and then take the imaginary part of the result
and divide by h.
The above procedure is independent of the programming language. We now
describe the details of our Fortran and C/C++ implementations.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 156 / 427


Computing Derivatives Complex-Step Method

Fortran Implementation 1
I complexify.f90: a module that defines additional functions and operators
for complex arguments.
I Complexify.py: Python script that makes necessary changes to source
code, e.g., type declarations.
I Features:
I Script is versatile:
I Compatible with many more platforms and compilers.
I Supports MPI based parallel implementations.
I Resolves some of the input and output issues.
I Some of the function definitions were improved: tangent, inverse and
hyperbolic trigonometric functions.

I complexify.h: defines additional functions and operators for the


complex-step method.
I derivify.h: simple automatic differentiation. Defines a new type which
contains the value and its derivative.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 157 / 427


Computing Derivatives Complex-Step Method

Fortran Implementation 2
Templates, a C++ feature, can be used to create program source code that is
independent of variable type declarations.
I Compared run time with real-valued code:
I Complexified version: ≈ ×3
I Algorithmic differentiation version: ≈ ×2

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 158 / 427


Computing Derivatives Complex-Step Method

Other Programming Languages 1


Matlab: As in the case of Fortran, one must redefine functions such as abs,
max and min. All differentiable functions are defined for complex
variables. The standard transpose operation represented by an
apostrophe (’) poses a problem as it takes the complex conjugate
of the elements of the matrix, so one should use the non-conjugate
transpose represented by “dot apostrophe” (.’) instead.
Java: Complex arithmetic is not standardized at the moment but there
are plans for its implementation. Although function overloading is
possible, operator overloading is currently not supported.
Python: A simple implementation of the complex-step method for Python
was also developed in this work. The cmath module must be
imported to gain access to complex arithmetic. Since Python
supports operator overloading, it is possible to define complex
functions and operators as described earlier.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 159 / 427


Computing Derivatives Complex-Step Method

Other Programming Languages 2


I Algorithmic differentiation by overloading can be implemented in any
programming language that supports derived datatypes and operator
overloading.
I For languages that do not have these features, the complex-step method can
be used wherever complex arithmetic is supported.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 160 / 427


Computing Derivatives Complex-Step Method

Example: Application of the complex-step method to an


aerostructural optimization problem 1
I Aerodynamics: SYN107-MB, a
parallel, multiblock Navier–Stokes
flow solver.
I Structures: detailed finite element
model with plates and trusses.
I Coupling: high-fidelity, consistent
and conservative.
I Geometry: centralized database for
exchanges (jig shape, pressure
distributions, displacements.)
I Coupled-adjoint sensitivity analysis

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 161 / 427


Computing Derivatives Complex-Step Method

Example: Application of the complex-step method to an


aerostructural optimization problem 2
[Plot: reference error ε vs. number of iterations, for CD and ∂CD/∂b1]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 162 / 427


Computing Derivatives Complex-Step Method

Example: Application of the complex-step method to an


aerostructural optimization problem 3
[Plot: relative error ε vs. step size h, comparing the complex-step and
finite-difference derivatives]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 163 / 427


Computing Derivatives Complex-Step Method

Example: Application of the complex-step method to an


aerostructural optimization problem 4
[Plot: ∂CD/∂bi vs. shape variable i, comparing the complex step
(h = 1×10⁻²⁰) and finite differences (h = 1×10⁻²)]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 164 / 427


Computing Derivatives Complex-Step Method

Example: Application of the complex-step method to an


aerostructural optimization problem 5
Computation type          Normalized cost
Aerostructural solution   1.0
Finite difference         14.2
Complex step              34.4

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 165 / 427


Computing Derivatives Complex-Step Method

Example: Application of the complex-step method to a


supersonic viscous-inviscid solver 1
Framework for preliminary design of natural laminar flow supersonic aircraft

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 166 / 427


Computing Derivatives Complex-Step Method

Example: Application of the complex-step method to a


supersonic viscous-inviscid solver 2
I Transition prediction
I Viscous and inviscid drag
I Design optimization
I Wing planform and airfoil design
I Wing-Body intersection design

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 167 / 427


Computing Derivatives Complex-Step Method

Example: Application of the complex-step method to a


supersonic viscous-inviscid solver 3

I Python wrapper defines geometry

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 168 / 427


Computing Derivatives Complex-Step Method

Example: Application of the complex-step method to a


supersonic viscous-inviscid solver 4
I CH GRID automatic grid generator
I Wing only or wing-body
I Complexified with our script
I CFL3D calculates Euler solution
I Version 6 includes complex-step
I New improvements incorporated
I C++ post-processor for the. . .
I Quasi-3D boundary-layer solver
I Laminar and turbulent
I Transition prediction
I C++ automatic differentiation
I Python wrapper collects data and computes structural constraints

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 169 / 427


Computing Derivatives Complex-Step Method

Example: Application of the complex-step method to a


supersonic viscous-inviscid solver 5
[Plot: relative error ε vs. step size h for the finite-difference and
complex-step derivatives]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 170 / 427


Computing Derivatives Complex-Step Method

Example: Application of the complex-step method to a


supersonic viscous-inviscid solver 6
[Plot: friction drag coefficient Cdf vs. root chord (ft), showing function
evaluations and the complex-step slope]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 171 / 427


Computing Derivatives Unifying Chain Rule

Total Derivatives of a System 1


I In addition to finite differences, the complex-step method and symbolic
differentiation, there are other methods for computing total derivatives
I We derive these various methods from a single formula . . .
I . . . but first we must go through some assumptions and definitions
I The computational model is assumed to be a deterministic series of
computations
I Any computational model can be defined as a sequence of explicit functions
Vi , where i = 1, . . . , n.

vi = Vi (v1 , v2 , . . . , vi−1 ).

where we adopt the convention that the lower case represents the value of a
variable, and the upper case represents the function that computes that value.
I In the more general case, a given function might require values that have not
been previously computed, i.e.,

vi = Vi (v1 , v2 , . . . , vi , . . . , vn ).

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 172 / 427


Computing Derivatives Unifying Chain Rule

Total Derivatives of a System 2


I The solution of such systems requires numerical methods that can be
programmed by using loops where variables are updated.
I Numerical methods range from simple fixed-point iterations to sophisticated
Newton-type algorithms.
I Loops are also used to repeat one or more computations over a
computational grid.
I It is always possible to represent any given computation without loops and
dependencies if we unroll all of the loops, and represent all values a variable
might take in the iteration as a separate variable that is never overwritten.
I In cases where the computational model requires iteration, it is helpful to
denote the computation as a vector of residual equations,

r = R(v) = 0

where the algorithm changes certain components of v until all of the


residuals converge to a small tolerance.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 173 / 427


Computing Derivatives Unifying Chain Rule

Total Derivatives of a System 3


I The subset of v that is iterated to achieve the solution of these equations is
called the set of state variables.
I We now separate the subsets in v into:
Independent variables: x
State variables: y
Quantities of interest: f
I Using this notation, we can write the residual equations as,

r = R(x, y(x)) = 0

where y(x) denotes the fact that y depends implicitly on x through the
solution of the residual equations
I The solution of these equations completely determines y for a given x.
I The functions of interest (usually included in the set of component outputs)
also have the same type of variable dependence in the general case,

f = F (x, y(x)).

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 174 / 427


Computing Derivatives Unifying Chain Rule

Total Derivatives of a System 4


I When we compute f , we assume that the state variables y have already been
determined by the solution of the residual equations.

[Diagram: inputs x ∈ R^nx feed the residual equations R(x, y) = 0, which
determine the states y ∈ R^ny (with residuals r ∈ R^ny); the outputs are then
computed as f = F(x, y) ∈ R^nf]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 175 / 427


Computing Derivatives The Unifying Chain Rule

One Chain to Rule them All 1


I We now derive a single equation that unifies the methods for computing total
derivatives.
I The methods differ in the extent to which they decompose a system, but
they all come from a basic principle: a generalized chain rule.
I We start from the sequence of variables (v1 , . . . , vn ), whose values are
functions of earlier variables,

vi = Vi (v1 , . . . , vi−1 )

For brevity, Vi (v1 , . . . , vi−1 ) is written as Vi (·).


I We define a partial derivative, ∂Vi/∂vj, of a function Vi with respect to a
variable vj as
∂Vi/∂vj = [Vi(v1, . . . , vj−1, vj + h, vj+1, . . . , vi−1) − Vi(·)]/h.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 176 / 427


Computing Derivatives The Unifying Chain Rule

One Chain to Rule them All 2


I Consider a total variation Δvi due to a perturbation Δvj, which can be
computed by using the sum of partial derivatives,
Δvi = Σ_{k=j}^{i−1} (∂Vi/∂vk) Δvk,
where all intermediate v's between j and i are computed and used.
I The total derivative is
dvi/dvj = Δvi/Δvj.
I Using the two equations above, we can write
dvi/dvj = δij + Σ_{k=j}^{i−1} (∂Vi/∂vk)(dvk/dvj),
which expresses a total derivative in terms of the other total derivatives and
the Jacobian of partial derivatives. The δij term is added to account for the
case in which i = j.
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 177 / 427
Computing Derivatives The Unifying Chain Rule

One Chain to Rule them All 3


I This represents the chain rule for a system whose variables are v.
I To get a better understanding of the structure of the chain rule, we now
write it in matrix form:
DV = ∂Vi/∂vj =
[  0                                         ]
[ ∂V2/∂v1    0                               ]
[ ∂V3/∂v1   ∂V3/∂v2    0                     ]
[   ...        ...    ...    ...             ]
[ ∂Vn/∂v1   ∂Vn/∂v2   · · ·   ∂Vn/∂vn−1   0  ]
where D is a differential operator.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 178 / 427


Computing Derivatives The Unifying Chain Rule

One Chain to Rule them All 4


I The total derivatives of the variables vi form another Jacobian matrix of the
same size that has a unit diagonal,
Dv = dvi/dvj =
[ 1                                          ]
[ dv2/dv1    1                               ]
[ dv3/dv1   dv3/dv2    1                     ]
[   ...        ...    ...    ...             ]
[ dvn/dv1   dvn/dv2   · · ·   dvn/dvn−1   1  ]

I Both of these matrices are lower triangular matrices, due to our assumption
that we have unrolled all of the loops.
I Using this notation, the chain rule can be written as

Dv = I + DV Dv .

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 179 / 427


Computing Derivatives The Unifying Chain Rule

One Chain to Rule them All 5


I Rearranging this, we obtain,

(I − DV ) Dv = I.

where all of these matrices are square, with size n × n.


I The matrix (I − DV ) can be formed by finding the partial derivatives, and
then we can solve for the total derivatives Dv .
I Since (I − DV) and Dv are inverses of each other, we can further rearrange
to obtain the transposed system:
(I − DV)ᵀ Dvᵀ = I.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 180 / 427


Computing Derivatives The Unifying Chain Rule

One Chain to Rule them All 6


I This leads to the following symmetric relationship:
(I − DV) Dv = I = (I − DV)ᵀ Dvᵀ

I We call the left and right hand sides of this equation the forward and reverse
chain rule equations, respectively.
I All methods for derivative computation can be derived from one of the forms
of this chain rule by changing what we mean by “variables”, which can be
seen as a level of decomposition.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 181 / 427


Computing Derivatives The Unifying Chain Rule

One Chain to Rule them All 7


I The derivatives of interest, df / dx, are typically the derivatives of some of
the last variables with respect to some of the first variables in the sequence,
df/dx = [ df1/dx1  · · ·  df1/dxnx  ]   [ dv(n−nf)/dv1  · · ·  dv(n−nf)/dvnx ]
        [   ...     ...     ...     ] = [     ...        ...        ...      ]
        [ dfnf/dx1 · · ·  dfnf/dxnx ]   [   dvn/dv1     · · ·    dvn/dvnx    ]
This is an nf × nx matrix that corresponds to the lower-left block of Dv, or
the corresponding transposed upper-right block of Dvᵀ.
I DV is lower triangular, and therefore we can solve for a column of Dv using
forward substitution.
I Conversely, DVᵀ is upper triangular, and therefore we can solve for a row of
Dv using back substitution.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 182 / 427


Computing Derivatives The Unifying Chain Rule

One Chain to Rule them All 8


I Each of these versions of the chain rule incurs different computational costs,
depending on the shape of the Jacobian df/dx:
I If nx > nf it is advantageous to use the forward chain rule
I If nf > nx the reverse chain rule is more efficient.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 183 / 427


Computing Derivatives The Unifying Chain Rule

Unification of Derivative Computation Methods


I The choice of v is the main difference between the various methods for
computing total derivatives.
I A second major difference is the technique used to solve the linear system.
                                Monolithic   Analytic    Multidisciplinary analytic   AD
Level of decomposition          Black box    Solver      Discipline                   Line of code
Differentiation method          FD/CS        Any         Any                          Symbolic
Solution of the linear system   Trivial      Numerical   Numerical (block)            Forward/back substitution

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 184 / 427


Computing Derivatives The Unifying Chain Rule

Example: Simple Computational Model 1


I This model can be interpreted as an explicit function, a model with states
constrained by residuals, or a multidisciplinary system.
I Two inputs, x = [x1 , x2 ]T
I Residual equations,
R = [ R1(x1, x2, y1, y2) ] = [ x1 y1 + 2y2 − sin x1 ]
    [ R2(x1, x2, y1, y2) ]   [ −y1 + x2² y2         ]

I State variables y = [y1 y2 ]T


I Output functions,
F = [ F1(x1, x2, y1, y2) ] = [ y1        ]
    [ F2(x1, x2, y1, y2) ]   [ y2 sin x1 ]

I To drive the residuals to zero, we have to solve the following linear system,
[ x1   2   ] [ y1 ]   [ sin x1 ]
[ −1   x2² ] [ y2 ] = [ 0      ]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 185 / 427


Computing Derivatives The Unifying Chain Rule

Example: Simple Computational Model 2


I The algorithm solves the system directly and there are no loops.
I The v’s introduced above correspond to each variable assignment

v = [x(1), x(2), det, y(1), y(2), f(1), f(2)]ᵀ

FUNCTION F ( x )
  REAL :: x (2) , det , y (2) , f (2)
  det = 2 + x (1) * x (2) **2               ! direct solve of the 2x2 system
  y (1) = x (2) **2 * SIN ( x (1) ) / det   ! state y1
  y (2) = SIN ( x (1) ) / det               ! state y2
  f (1) = y (1)                             ! output f1 = y1
  f (2) = y (2) * SIN ( x (1) )             ! output f2 = y2 sin(x1)
  RETURN
END FUNCTION F

The objective is to compute the derivatives of both outputs with respect to both
inputs, i.e., the Jacobian,
df/dx = [ df1/dx1   df1/dx2 ]
        [ df2/dx1   df2/dx2 ]

We will use this example in later sections to show the application of all methods.
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 186 / 427
Computing Derivatives Monolithic Differentiation

Monolithic Differentiation 1
I In monolithic differentiation, the entire computational model is treated as a
“black box”
I Only track inputs and outputs.
I This is often the only option
I Both the forward and reverse modes of the generalized chain rule reduce to
dfi/dxj = ∂Fi/∂xj

for each input xj and output variable fi .

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 187 / 427


Computing Derivatives Monolithic Differentiation

Monolithic Differentiation 2
[Diagram: the residuals r1, r2 and states y1, y2 are hidden inside a black box
that maps the inputs x directly to the outputs f]
v = [ v1, . . . , vnx, v(n−nf), . . . , vn ]ᵀ,
where the first block of entries is x and the last block is f.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 188 / 427


Computing Derivatives Monolithic Differentiation

Example: Finite-Difference and Complex-Step Methods


Applied to Simple Model 1
I The monolithic approach treats the entire code as a black box whose internal
variables and computations are unknown.
I Thus, the tracked variables are

v1 = x 1 , v2 = x2 , v3 = f1 , v4 = f2

I The forward and reverse chain rule equations yield
df1/dx1 = ∂f1/∂x1,   df1/dx2 = ∂f1/∂x2,   . . .

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 189 / 427


Computing Derivatives Monolithic Differentiation

Example: Finite-Difference and Complex-Step Methods


Applied to Simple Model 2
I Computing df1 / dx1 simply amounts to computing ∂f1 /∂x1
I Using the forward-difference formula (with step size h = 10⁻⁵) yields
∂f1/∂x1 ≈ [f1(x1 + h, x2) − f1(x1, x2)]/h = 0.0866023014079,

I The complex-step method (with step size h = 10⁻¹⁵) yields
∂f1/∂x1 ≈ Im[f1(x1 + ih, x2)]/h = 0.0866039925329.
I The digits that agree with the exact derivative are shown in blue and those
that are incorrect are in red.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 190 / 427


Computing Derivatives Monolithic Differentiation

Example: Finite-Difference and Complex-Step Methods


Applied to Simple Model 3

[Plot: relative error vs. step size for finite differences (FD) and the
complex step (CS)]
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 191 / 427
Computing Derivatives Algorithmic Differentiation

Algorithmic Differentiation 1
I Algorithmic differentiation (AD) is also known as computational
differentiation or automatic differentiation
I Well known method based on the systematic application of the differentiation
chain rule to computer programs.
I With AD the variables v in the chain rule are all of the variables assigned in
the computer program
I Thus, AD applies the chain rule for every single line in the program.
I The computer program is considered as a sequence of explicit functions Vi,
where i = 1, . . . , n.
I Assume that all of the loops in the program are unrolled, and therefore no
variables are overwritten and each variable only depends on earlier variables
in the sequence.
I This assumption is not restrictive, as programs iterate the chain rule together
with the program variables, converging to the correct total derivatives.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 192 / 427


Computing Derivatives Algorithmic Differentiation

Algorithmic Differentiation 2
I Typically, the design variables are among the first v’s, and the quantities of
interest are the last quantities.

v = [ v1, . . . , vnx, . . . , vj, . . . , vi, . . . , v(n−nf), . . . , vn ]ᵀ,
where the first entries are the design variables x and the last are the outputs f.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 193 / 427


Computing Derivatives Algorithmic Differentiation

Algorithmic Differentiation 3
[Diagram: the program's intermediate variables v1, v2, v3, v4, . . . , vn, with
the residuals r1, r2 and states y1, y2 appearing among them between the
inputs x and the outputs f]
v = [ v1, . . . , vnx, . . . , vj, . . . , vi, . . . , v(n−nf), . . . , vn ]ᵀ,
where the first entries are x and the last are f.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 194 / 427


Computing Derivatives Algorithmic Differentiation

Algorithmic Differentiation 4
I The chain rule is
dvi/dvj = δij + Σ_{k=j}^{i−1} (∂Vi/∂vk)(dvk/dvj),

where the V represent explicit functions, each defined by a single line in the
computer program.
I The partial derivatives, ∂Vi /∂vk can be automatically differentiated
symbolically by applying another chain rule within the function defined by the
respective line in the program.
I The chain rule can be solved in two ways.
Forward mode: choose one vj and keep j fixed. Then we work our way
forward in the index i = 1, 2, . . . , n until we get the desired
total derivative.
Reverse mode: fix vi (the quantity we want to differentiate) and work our
way backward in the index j = n, n − 1, . . . , 1 all of the way to
the independent variables.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 195 / 427


Computing Derivatives Algorithmic Differentiation

Algorithmic Differentiation 5
I The chain rule in matrix form,
(I − DV) Dv = I ⇒
[  1                                            ] [ 1                                        ]   [ 1 0 · · · 0 ]
[ −∂V2/∂v1    1                                 ] [ dv2/dv1    1                             ]   [ 0 1 · · · 0 ]
[ −∂V3/∂v1   −∂V3/∂v2    1                      ] [ dv3/dv1   dv3/dv2    1                   ] = [ 0 0 1 · · · ]
[    ...        ...     ...    ...              ] [   ...        ...    ...   ...            ]   [     ...     ]
[ −∂Vn/∂v1   −∂Vn/∂v2   · · ·   −∂Vn/∂vn−1   1  ] [ dvn/dv1   dvn/dv2   · · ·  dvn/dvn−1  1  ]   [ 0 0 · · · 1 ]
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 196 / 427


Computing Derivatives Algorithmic Differentiation

Algorithmic Differentiation 6
I The terms that we ultimately want to compute are the total derivatives of
the quantities of interest with respect to the design variables, corresponding
to an nf × nx block in the lower left of the Dv matrix:

\frac{df}{dx} =
\begin{bmatrix}
\frac{df_1}{dx_1} & \cdots & \frac{df_1}{dx_{n_x}} \\
\vdots & \ddots & \vdots \\
\frac{df_{n_f}}{dx_1} & \cdots & \frac{df_{n_f}}{dx_{n_x}}
\end{bmatrix}
=
\begin{bmatrix}
\frac{dv_{(n-n_f)}}{dv_1} & \cdots & \frac{dv_{(n-n_f)}}{dv_{n_x}} \\
\vdots & \ddots & \vdots \\
\frac{dv_n}{dv_1} & \cdots & \frac{dv_n}{dv_{n_x}}
\end{bmatrix},

which is an nf × nx matrix.
I The forward mode is equivalent to solving the linear system for one column of
Dv .
I Since (I − DV ) is a lower triangular matrix, this solution can be
accomplished by forward substitution.
I In the process, we end up computing the derivative of the chosen quantity
with respect to all of the other variables.
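The following is a minimal sketch (not from the slides) of the forward mode as forward substitution, in Python/NumPy, using a hypothetical three-line program v1 = x, v2 = sin v1 , v3 = v1 v2 , for which dv3 / dv1 = sin x + x cos x can be checked analytically:

import numpy as np

# Hypothetical three-line "program": v1 = x, v2 = sin(v1), v3 = v1*v2
x = 1.3
v = np.array([x, np.sin(x), x * np.sin(x)])

# Lower-triangular matrix of partial derivatives, (D_V)[i, k] = dV_i/dv_k
D_V = np.zeros((3, 3))
D_V[1, 0] = np.cos(v[0])   # dV2/dv1
D_V[2, 0] = v[1]           # dV3/dv1
D_V[2, 1] = v[0]           # dV3/dv2

# Solve (I - D_V) dv = e_1 by forward substitution: one column of Dv,
# i.e., the derivative of every variable with respect to v1
e1 = np.array([1.0, 0.0, 0.0])
dv = np.zeros(3)
for i in range(3):
    dv[i] = e1[i] + D_V[i, :i] @ dv[:i]

print(dv[2], np.sin(x) + x * np.cos(x))   # both give dv3/dv1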

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 197 / 427


Computing Derivatives Algorithmic Differentiation

Algorithmic Differentiation 7
I The cost of this procedure is similar to the cost of the original computation
of the v’s, and one forward sweep (one column of Dv ) is required per design
variable.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 198 / 427


Computing Derivatives Algorithmic Differentiation

Example: Forward Mode Applied to Simple Model 1


I The variables in this case are

v = [v_1, v_2, v_3, v_4, v_5, v_6, v_7]^T = [x(1),\; x(2),\; \text{det},\; y(1),\; y(2),\; f(1),\; f(2)]^T .

I Performing the partial differentiation using symbolic differentiation we get

\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
-v_2^2 & -2 v_1 v_2 & 1 & 0 & 0 & 0 & 0 \\
-\frac{v_2^2 \cos v_1}{v_3} & -\frac{2 v_2 \sin v_1}{v_3} & \frac{v_2^2 \sin v_1}{v_3^2} & 1 & 0 & 0 & 0 \\
-\frac{\cos v_1}{v_3} & 0 & \frac{\sin v_1}{v_3^2} & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & -1 & 0 & 1 & 0 \\
-v_5 \cos v_1 & 0 & 0 & 0 & -\sin v_1 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
\frac{dv_1}{dv_1} & \frac{dv_1}{dv_2} \\
\frac{dv_2}{dv_1} & \frac{dv_2}{dv_2} \\
\vdots & \vdots \\
\frac{dv_7}{dv_1} & \frac{dv_7}{dv_2}
\end{bmatrix}
=
\begin{bmatrix}
1 & 0 \\
0 & 1 \\
0 & 0 \\
\vdots & \vdots \\
0 & 0
\end{bmatrix}

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 199 / 427


Computing Derivatives Algorithmic Differentiation

Example: Forward Mode Applied to Simple Model 2


I We only kept the first two columns of the matrices Dv and I, because the
only derivatives of interest are in those two columns.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 200 / 427


Computing Derivatives Algorithmic Differentiation

Reverse Mode Matrix Equations 1


I The matrix representation for the reverse mode of algorithmic differentiation
is (I − DV )^T Dv^T = I, i.e.,

\begin{bmatrix}
1 & -\frac{\partial V_2}{\partial v_1} & -\frac{\partial V_3}{\partial v_1} & \cdots & -\frac{\partial V_n}{\partial v_1} \\
0 & 1 & -\frac{\partial V_3}{\partial v_2} & \cdots & -\frac{\partial V_n}{\partial v_2} \\
\vdots & & \ddots & \ddots & \vdots \\
 & & & 1 & -\frac{\partial V_n}{\partial v_{n-1}} \\
0 & 0 & \cdots & 0 & 1
\end{bmatrix}
\begin{bmatrix}
1 & \frac{dv_2}{dv_1} & \frac{dv_3}{dv_1} & \cdots & \frac{dv_n}{dv_1} \\
0 & 1 & \frac{dv_3}{dv_2} & \cdots & \frac{dv_n}{dv_2} \\
\vdots & & \ddots & \ddots & \vdots \\
 & & & 1 & \frac{dv_n}{dv_{n-1}} \\
0 & 0 & \cdots & 0 & 1
\end{bmatrix}
= I

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 201 / 427


Computing Derivatives Algorithmic Differentiation

Reverse Mode Matrix Equations 2


I The block matrix we want to compute is in the upper right section of DvT
and now its size is nx × nf .
I As with the forward mode, we need to solve this linear system one column at
a time, but now each column yields the derivatives of the chosen quantity
with respect to all of the other variables.
I Because the matrix (I − DV )T is upper triangular, the system can be solved
using back substitution.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 202 / 427


Computing Derivatives Algorithmic Differentiation

Example: Reverse Mode Applied to Simple Model 1


I Replacing the partial derivatives in the reverse matrix equations, and keeping
only the last two columns of Dv^T and I, we get

\begin{bmatrix}
1 & 0 & -v_2^2 & -\frac{v_2^2 \cos v_1}{v_3} & -\frac{\cos v_1}{v_3} & 0 & -v_5 \cos v_1 \\
0 & 1 & -2 v_1 v_2 & -\frac{2 v_2 \sin v_1}{v_3} & 0 & 0 & 0 \\
0 & 0 & 1 & \frac{v_2^2 \sin v_1}{v_3^2} & \frac{\sin v_1}{v_3^2} & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & -1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & -\sin v_1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
\frac{dv_6}{dv_1} & \frac{dv_7}{dv_1} \\
\frac{dv_6}{dv_2} & \frac{dv_7}{dv_2} \\
\vdots & \vdots \\
\frac{dv_6}{dv_7} & \frac{dv_7}{dv_7}
\end{bmatrix}
=
\begin{bmatrix}
0 & 0 \\
\vdots & \vdots \\
1 & 0 \\
0 & 1
\end{bmatrix}

I The derivatives of interest are the top 2 × 2 block in the Dv matrix.


I In contrast to the forward mode, the derivatives of interest are computed by
performing two back substitutions, through which the derivatives of v6 and
v7 with respect to all variables are computed in the process.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 203 / 427


Computing Derivatives Algorithmic Differentiation

Implementation and Tools


There are two main ways of implementing AD:
I Source code transformation
I The whole source code is processed with a parser and all the derivative
calculations are introduced as additional lines of code.
I The resulting source code for large programs is much longer and may
become difficult to read.
I Every time the original code changes, the parser must be re-run.
I Derived datatypes and operator overloading
I A new type of data structure is created that contains both the value and its
derivative: each real number v is replaced by v̄ = (v, dv).
I All operations are redefined (overloaded) such that, in addition to the result of
the original operations, they yield the derivative of that operation as well
I Compiler must support derived datatypes and operator overloading (e.g.,
Fortran 90, C++)
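As an illustration of the derived-datatype idea, here is a minimal sketch (not from the slides, and written in Python rather than Fortran or C++ for brevity) of a dual-number type whose overloaded operations carry the derivative along with the value:

import math

class Dual:
    # Value-derivative pair v = (v, dv); operations propagate both
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

def sin(u):
    # Overloaded intrinsic: value and chain-rule derivative
    return Dual(math.sin(u.val), math.cos(u.val) * u.dot)

x = Dual(1.0, 1.0)      # seed dx/dx = 1
f = x * sin(x)          # f = x*sin(x)
print(f.val, f.dot)     # f.dot equals sin(1) + cos(1)

Real tools (e.g., ADOL-C or AD01) follow the same pattern but cover the full set of operators and intrinsics.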

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 204 / 427


Computing Derivatives Algorithmic Differentiation

Available AD Tools 1
The tools for the various programming languages include:
I Fortran
I ADIFOR: Source transformation; forward and reverse modes; Fortran 77;
non-commercial.
I AD01: Operator overloading; forward and reverse modes; Fortran 90;
commercial.
I OPFAD/OPRAD: Operator overloading; forward and reverse modes; Fortran
90; non-commercial.
I TAMC: Source transformation; forward and reverse modes; Fortran 77;
non-commercial.
I TAF: Source transformation; forward and reverse modes; Fortran 90;
commercial.
I Tapenade: Source transformation; Fortran 90; non-commercial. Developed at
INRIA Sophia-Antipolis. Formerly Odyssée.
I C/C++: Various established tools for automatic differentiation exist. These
include ADIC, an implementation mirroring ADIFOR, and ADOL-C, a
free package that uses operator overloading and can operate in the forward or
reverse modes and compute higher-order derivatives.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 205 / 427


Computing Derivatives Algorithmic Differentiation

Available AD Tools 2
I Other languages: Tools also exist for other languages, such as Matlab and
Python.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 206 / 427


Computing Derivatives Algorithmic Differentiation

The Connection Between AD and the Complex-Step


Method
One significant connection to make is that the complex-step method is equivalent
to the forward mode of AD with an operator-overloading implementation.

Automatic (forward)                  Complex-Step
Δx1 = 1                              h1 = 10^−20
Δx2 = 0                              h2 = 0
f = x1 x2                            f = (x1 + ih1 )(x2 + ih2 )
Δf = x1 Δx2 + x2 Δx1                 f = x1 x2 − h1 h2 + i(x1 h2 + x2 h1 )
df / dx1 = Δf                        df / dx1 = Im f / h1

The complex-step method computes one extra term (−h1 h2 above). Other functions are similar:
I Superfluous calculations are made.
I For h ≤ x × 10^−20 they vanish in finite-precision arithmetic, but they still affect speed.
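A short sketch (not from the slides) contrasting the two approaches in Python, where complex arithmetic plays the role of the overloaded type; the function and step sizes are illustrative:

import numpy as np

def f(x):
    return x * np.sin(x)

x = 1.0
exact = np.sin(x) + x * np.cos(x)
cs = np.imag(f(x + 1j * 1e-20)) / 1e-20   # complex step: no subtraction
fd = (f(x + 1e-8) - f(x)) / 1e-8          # forward difference
print(abs(cs - exact), abs(fd - exact))   # CS error ~ machine precision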

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 207 / 427


Computing Derivatives Algorithmic Differentiation

Example: Forward AD Using Source Code Transformation


Original code:

FUNCTION F(x)
  REAL :: x(2), det, y(2), f(2)
  det = 2 + x(1)*x(2)**2
  y(1) = x(2)**2*SIN(x(1))/det
  y(2) = SIN(x(1))/det
  f(1) = y(1)
  f(2) = y(2)*SIN(x(1))
  RETURN
END FUNCTION F

Differentiated code (forward mode):

FUNCTION F_D(x, xd, f)
  REAL :: x(2), xd(2)
  REAL :: det, detd
  REAL :: y(2), yd(2)
  REAL :: f(2), f_d(2)
  detd = xd(1)*x(2)**2 + x(1)*2*x(2)*xd(2)
  det = 2 + x(1)*x(2)**2
  yd = 0.0
  yd(1) = ((2*x(2)*xd(2)*SIN(x(1)) + x(2)**2*xd(1)*COS(x(1)))*det &
          - x(2)**2*SIN(x(1))*detd)/det**2
  y(1) = x(2)**2*SIN(x(1))/det
  yd(2) = (xd(1)*COS(x(1))*det - SIN(x(1))*detd)/det**2
  y(2) = SIN(x(1))/det
  f_d = 0.0
  f_d(1) = yd(1)
  f(1) = y(1)
  f_d(2) = yd(2)*SIN(x(1)) + y(2)*xd(1)*COS(x(1))
  f(2) = y(2)*SIN(x(1))
  RETURN
END FUNCTION F_D
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 208 / 427
Computing Derivatives Algorithmic Differentiation

Example: Reverse AD Using Source Code Transformation


Original code:

FUNCTION F(x)
  REAL :: x(2), det, y(2), f(2)
  det = 2 + x(1)*x(2)**2
  y(1) = x(2)**2*SIN(x(1))/det
  y(2) = SIN(x(1))/det
  f(1) = y(1)
  f(2) = y(2)*SIN(x(1))
  RETURN
END FUNCTION F

Differentiated code (reverse mode):

SUBROUTINE F_B(x, xb, fb)
  REAL :: x(2), xb(2)
  REAL :: y(2), yb(2)
  REAL :: f(2), fb(2)
  REAL :: det, detb, tempb, temp
  det = 2 + x(1)*x(2)**2
  y(1) = x(2)**2*SIN(x(1))/det
  y(2) = SIN(x(1))/det
  xb = 0.0
  yb = 0.0
  yb(2) = yb(2) + SIN(x(1))*fb(2)
  xb(1) = xb(1) + y(2)*COS(x(1))*fb(2)
  fb(2) = 0.0
  yb(1) = yb(1) + fb(1)
  xb(1) = xb(1) + COS(x(1))*yb(2)/det
  detb = -(SIN(x(1))*yb(2)/det**2)
  yb(2) = 0.0
  tempb = SIN(x(1))*yb(1)/det
  temp = x(2)**2/det
  xb(2) = xb(2) + 2*x(2)*tempb
  detb = detb - temp*tempb
  xb(1) = xb(1) + x(2)**2*detb + temp*COS(x(1))*yb(1)
  xb(2) = xb(2) + x(1)*2*x(2)*detb
END SUBROUTINE F_B
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 209 / 427
Computing Derivatives Analytic Methods

Analytic Methods 1
I Analytic methods are the most accurate and efficient methods.
I They are much more involved, however, since they require detailed knowledge
of the computational model and a long implementation time.
I Applicable when f depends implicitly on x:

f = F (x, y(x)).

I The implicit relationship between the state variables y and the independent
variables is defined by the solution of a set of residual equations,

r = R(x, y(x)) = 0.

I We assume a discrete analytic approach here. This is in contrast to the
continuous approach, in which the equations are not discretized until later.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 210 / 427


Computing Derivatives Analytic Methods

Analytic Methods 2
[Diagram: two routes to the discrete sensitivity equations — (1) differentiate the continuous governing equations to obtain continuous sensitivity equations and then discretize them, or (2) discretize the continuous governing equations and then differentiate the discrete governing equations.]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 211 / 427


Computing Derivatives Analytic Methods

Traditional Derivation 1
I Using the chain rule we can write

\frac{df}{dx} = \frac{\partial F}{\partial x} + \frac{\partial F}{\partial y} \frac{dy}{dx},

where the result is an nf × nx matrix.
I The partial derivatives represent the variation of f = F (x, y) with respect to
changes in x with y held fixed.
I The total derivative df / dx takes into account the change in y that is
required to keep the residual equations equal to zero.
I This distinction depends on the context, i.e., what is considered a total or
partial derivative depends on the level that is being considered in the nested
system of components.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 212 / 427


Computing Derivatives Analytic Methods

Traditional Derivation 2
I Since the governing equations must always be satisfied, the total derivative
of the residuals r = R(x, y(x)) = 0 with respect to the design variables must
also be zero. Thus, using the chain rule,

\frac{dr}{dx} = \frac{\partial R}{\partial x} + \frac{\partial R}{\partial y} \frac{dy}{dx} = 0.
I The computation of the total derivative matrix dy/ dx is much more
expensive than any of the partial derivatives, since it requires the solution of
the residual equations.
I The partial derivatives can be computed by differentiating the function F
with respect to x while keeping y constant, and can be computed using
symbolic differentiation, finite differences, complex step, or AD.
I The linearized residual equations provide the means for computing the total
Jacobian matrix dy/ dx, by rewriting them as

\frac{\partial R}{\partial y} \frac{dy}{dx} = -\frac{\partial R}{\partial x}.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 213 / 427


Computing Derivatives Analytic Methods

Traditional Derivation 3
I Substituting this result into the total derivative equation, we obtain

\frac{df}{dx} = \frac{\partial F}{\partial x} - \frac{\partial F}{\partial y} \left(\frac{\partial R}{\partial y}\right)^{-1} \frac{\partial R}{\partial x},

where −(∂R/∂y)^{-1} (∂R/∂x) = dy/ dx (direct method) and
−(∂F /∂y)(∂R/∂y)^{-1} = ψ^T (adjoint method).

I The inverse of the square Jacobian matrix ∂R/∂y is not necessarily explicitly
calculated.
I There are two ways of evaluating this expression:
Direct method: factorize the Jacobian ∂R/∂y once and solve with the nx
columns of ∂R/∂x on the right-hand side to obtain dy/ dx.
Adjoint method: factorize the transposed Jacobian and solve with the nf
columns of (∂F /∂y)^T on the right-hand side.
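The sketch below (hypothetical random data, Python/NumPy) confirms that the two routes produce the same df / dx and differ only in the number and type of linear solves:

import numpy as np

rng = np.random.default_rng(0)
ny, nx, nf = 5, 3, 2
dRdy = rng.standard_normal((ny, ny)) + 5.0 * np.eye(ny)  # nonsingular Jacobian
dRdx = rng.standard_normal((ny, nx))
dFdy = rng.standard_normal((nf, ny))
dFdx = rng.standard_normal((nf, nx))

# Direct method: nx solves (one per column of dR/dx)
dydx = -np.linalg.solve(dRdy, dRdx)
df_direct = dFdx + dFdy @ dydx

# Adjoint method: nf solves with the transposed Jacobian
psi = np.linalg.solve(dRdy.T, -dFdy.T)
df_adjoint = dFdx + psi.T @ dRdx

print(np.allclose(df_direct, df_adjoint))   # True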

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 214 / 427


Computing Derivatives Analytic Methods

Direct vs. Adjoint Method

\frac{df}{dx} = \frac{\partial F}{\partial x} - \frac{\partial F}{\partial y} \left(\frac{\partial R}{\partial y}\right)^{-1} \frac{\partial R}{\partial x} \qquad (43)

Direct method (favorable when nf > nx ):

\frac{\partial R}{\partial y} \frac{dy}{dx} = -\frac{\partial R}{\partial x}, \qquad \frac{df}{dx} = \frac{\partial F}{\partial x} + \frac{\partial F}{\partial y} \frac{dy}{dx}

Adjoint method (favorable when nx > nf ):

\left(\frac{\partial R}{\partial y}\right)^T \left(\frac{df}{dr}\right)^T = -\left(\frac{\partial F}{\partial y}\right)^T, \qquad \frac{df}{dx} = \frac{\partial F}{\partial x} + \frac{df}{dr} \frac{\partial R}{\partial x}

[Diagram: block shapes of these matrix products for the cases nf > nx and nx > nf , showing which method requires fewer linear solves in each case.]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 215 / 427


Computing Derivatives Analytic Methods

Example: Analytic Adjoint Methods Applied to


Finite-Element Structural Analysis 1
I The discretized governing equations for a finite-element structural model are,

Rk = Kki ui − Fk = 0,

where Kki is the stiffness matrix, ui is the vector of displacements (the states),
and Fk is the vector of applied forces (not to be confused with the function of
interest from the previous section!).
I We want the derivatives of the stresses, which are related to the
displacements by the equation,

σm = Smi ui .

I The design variables are the cross-sectional areas of the elements, Aj .


I The Jacobian of the residuals with respect to the displacements is simply the
stiffness matrix:
∂Rk ∂(Kki ui − Fk )
= = Kki .
∂yi ∂ui
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 216 / 427
Computing Derivatives Analytic Methods

Example: Analytic Adjoint Methods Applied to


Finite-Element Structural Analysis 2
I The derivative of the residuals with respect to the design variables is

\frac{\partial R_k}{\partial x_j} = \frac{\partial (K_{ki} u_i - F_k)}{\partial A_j} = \frac{\partial K_{ki}}{\partial A_j} u_i

I The partial derivative of the stress with respect to the displacements is
simply given by

\frac{\partial f_m}{\partial y_i} = \frac{\partial \sigma_m}{\partial u_i} = S_{mi}

I Finally, the explicit variation of the stresses with respect to the
cross-sectional areas is zero, since the stresses depend only on the
displacement field,

\frac{\partial f_m}{\partial x_j} = \frac{\partial \sigma_m}{\partial A_j} = 0.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 217 / 427


Computing Derivatives Analytic Methods

Example: Analytic Adjoint Methods Applied to


Finite-Element Structural Analysis 3
I Substituting these into the total derivative equation we get

\frac{d\sigma_m}{dA_j} = -\frac{\partial \sigma_m}{\partial u_i} K_{ki}^{-1} \frac{\partial K_{ki}}{\partial A_j} u_i

I If we were to use the direct method, we would solve

K_{ki} \frac{du_i}{dA_j} = -\frac{\partial K_{ki}}{\partial A_j} u_i

and then substitute the result into

\frac{d\sigma_m}{dA_j} = \frac{\partial \sigma_m}{\partial u_i} \frac{du_i}{dA_j}

to calculate the desired derivatives.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 218 / 427


Computing Derivatives Analytic Methods

Example: Analytic Adjoint Methods Applied to


Finite-Element Structural Analysis 4
I The adjoint method is the other alternative, obtained by solving

K_{ki}^T \psi_k = \frac{\partial \sigma_m}{\partial u_i}.

Then we would substitute the adjoint vector into the equation

\frac{d\sigma_m}{dA_j} = \frac{\partial \sigma_m}{\partial A_j} + \psi_k^T \left( -\frac{\partial K_{ki}}{\partial A_j} u_i \right)

to calculate the desired derivatives.
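A small sketch (not from the slides) of this recipe in Python/NumPy, on a hypothetical two-element bar (node 0 fixed, tip load P, element stresses σj = E(uj − uj−1 )/L), with a finite-difference check; all numbers are illustrative:

import numpy as np

E, L, P = 70.0e9, 1.0, 1000.0
A = np.array([2.0e-4, 1.0e-4])            # design variables: element areas

def K_of(A):
    # Assembled stiffness matrix for the two free DOFs
    return (E / L) * np.array([[A[0] + A[1], -A[1]],
                               [-A[1],        A[1]]])

F = np.array([0.0, P])
u = np.linalg.solve(K_of(A), F)
S = (E / L) * np.array([[ 1.0, 0.0],      # sigma_1 = (E/L) u_1
                        [-1.0, 1.0]])     # sigma_2 = (E/L)(u_2 - u_1)
dKdA = [(E / L) * np.array([[1.0, 0.0], [0.0, 0.0]]),
        (E / L) * np.array([[1.0, -1.0], [-1.0, 1.0]])]

# Adjoint: K^T psi_m = (d sigma_m / du)^T, one solve per stress
psi = np.linalg.solve(K_of(A).T, S.T)
dsig = np.array([[-psi[:, m] @ (dKdA[j] @ u) for j in range(2)]
                 for m in range(2)])      # d sigma_m / d A_j

# Finite-difference check
h = 1.0e-10
for j in range(2):
    Ap = A.copy(); Ap[j] += h
    fd = (S @ np.linalg.solve(K_of(Ap), F) - S @ u) / h
    print(np.allclose(dsig[:, j], fd, rtol=1.0e-4))   # True, True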

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 219 / 427


Computing Derivatives Analytic Methods

Derivation of Analytic Methods from the Unifying Chain


Rule 1
I The assumption that the Jacobians are lower triangular matrices no longer
applies.
I Therefore, we first linearize the residuals so that it is possible to write explicit
equations for the state variables y.
I We linearize about the converged point [x0 , r0 , y0 , f0 ]T , and divide v into

v1 = x, v2 = r, v3 = y, v4 = f .

I So instead of defining the v’s as every single variable assignment in the
computer program, we now define them as variations in the design variables,
residuals, state variables, and quantities of interest.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 220 / 427


Computing Derivatives Analytic Methods

Derivation of Analytic Methods from the Unifying Chain


Rule 2
[Diagram: the four variable blocks x, r = (r1 , r2 ), y = (y1 , y2 ), and f arranged along the diagonal of the dependency structure.]

v = [\underbrace{v_1, \ldots, v_{n_x}}_{x}, \underbrace{v_{(n_x+1)}, \ldots, v_{(n_x+n_y)}}_{r}, \underbrace{v_{(n_x+n_y+1)}, \ldots, v_{(n_x+2n_y)}}_{y}, \underbrace{v_{(n-n_f)}, \ldots, v_n}_{f}]^T .
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 221 / 427
Computing Derivatives Analytic Methods

Derivation of Analytic Methods from the Unifying Chain


Rule 3

[Diagram: a perturbation ∆x propagating through the residual perturbation ∆r and the state perturbation ∆y to the output perturbation ∆f .]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 222 / 427


Computing Derivatives Analytic Methods

Derivation of Analytic Methods from the Unifying Chain


Rule 4
I We have an initial perturbation x that leads to a response r.
I However, we require that r = 0 be satisfied when we take a total derivative,
so

R = 0 \;\Rightarrow\; \frac{\partial R}{\partial x} x + \frac{\partial R}{\partial y} y = 0

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 223 / 427


Computing Derivatives Analytic Methods

Derivation of Analytic Methods from the Unifying Chain


Rule 5
I The solution vector y from this linear system is used with the original
perturbation vector x to compute the total change:

v_1 = x
v_2 = r = \frac{\partial R}{\partial x} x
v_3 = y = \left(\frac{\partial R}{\partial y}\right)^{-1} (-r)
v_4 = f = \frac{\partial F}{\partial x} x + \frac{\partial F}{\partial y} y

I Now, all variables are functions of only previous variables, so we can apply
the forward and reverse chain rule equations to the linearized system

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 224 / 427


Computing Derivatives Analytic Methods

Derivation of Analytic Methods from the Unifying Chain Rule 6

Applying the forward and reverse chain rule equations to the linearized four-block system, and keeping only the blocks of interest, yields:

(c) Forward chain rule (simplified):

\begin{bmatrix}
I & 0 & 0 & 0 \\
-\frac{\partial R}{\partial x} & I & 0 & 0 \\
0 & \left(\frac{\partial R}{\partial y}\right)^{-1} & I & 0 \\
-\frac{\partial F}{\partial x} & 0 & -\frac{\partial F}{\partial y} & I
\end{bmatrix}
\begin{bmatrix} I \\ \frac{dr}{dx} \\ \frac{dy}{dx} \\ \frac{df}{dx} \end{bmatrix}
=
\begin{bmatrix} I \\ 0 \\ 0 \\ 0 \end{bmatrix}

(d) Reverse chain rule (simplified):

\begin{bmatrix}
I & -\frac{\partial R}{\partial x}^{T} & 0 & -\frac{\partial F}{\partial x}^{T} \\
0 & I & \left(\frac{\partial R}{\partial y}\right)^{-T} & 0 \\
0 & 0 & I & -\frac{\partial F}{\partial y}^{T} \\
0 & 0 & 0 & I
\end{bmatrix}
\begin{bmatrix} \frac{df}{dx}^{T} \\ \frac{df}{dr}^{T} \\ \frac{df}{dy}^{T} \\ I \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ 0 \\ I \end{bmatrix}

The second block row of each system gives dr/ dx = ∂R/∂x (forward) and the
adjoint equation (∂R/∂y)^T ( df / dr)^T = −(∂F /∂y)^T (reverse); eliminating the
remaining blocks recovers the direct and adjoint methods, respectively.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 225 / 427
Computing Derivatives Analytic Methods

Example: Direct method applied to simple model 1


I Since there are just two state variables, we get the linear system

\begin{bmatrix}
\frac{\partial R_1}{\partial y_1} & \frac{\partial R_1}{\partial y_2} \\
\frac{\partial R_2}{\partial y_1} & \frac{\partial R_2}{\partial y_2}
\end{bmatrix}
\begin{bmatrix}
\frac{dy_1}{dx_1} & \frac{dy_1}{dx_2} \\
\frac{dy_2}{dx_1} & \frac{dy_2}{dx_2}
\end{bmatrix}
= -
\begin{bmatrix}
\frac{\partial R_1}{\partial x_1} & \frac{\partial R_1}{\partial x_2} \\
\frac{\partial R_2}{\partial x_1} & \frac{\partial R_2}{\partial x_2}
\end{bmatrix}.

I We can use symbolic differentiation to compute each partial derivative of the
residuals to obtain

\begin{bmatrix}
-x_1 & -2 \\
1 & -x_2^2
\end{bmatrix}
\begin{bmatrix}
\frac{dy_1}{dx_1} & \frac{dy_1}{dx_2} \\
\frac{dy_2}{dx_1} & \frac{dy_2}{dx_2}
\end{bmatrix}
=
\begin{bmatrix}
y_1 - \cos x_1 & 0 \\
0 & 2 x_2 y_2
\end{bmatrix}.
I In a more realistic example, the computation of the partial derivatives would
not be as easy, since the residuals typically do not have simple analytical
expressions.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 226 / 427


Computing Derivatives Analytic Methods

Example: Direct method applied to simple model 2


I Since the analytic methods are derived based on a linearization of the system
at a converged state, we must evaluate the system at [x1 , x2 ] = [1, 1] and
[y1 , y2 ] = [(sin 1)/3, (sin 1)/3].
I The computed values for dy1 / dx1 and dy2 / dx1 can be used to find df1 / dx1
using the following equation:

\frac{df_1}{dx_1} = \frac{\partial F_1}{\partial x_1} + \frac{\partial F_1}{\partial y_1} \frac{dy_1}{dx_1} + \frac{\partial F_1}{\partial y_2} \frac{dy_2}{dx_1}.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 227 / 427


Computing Derivatives Analytic Methods

Adjoint Method 1
I The linear system involving the Jacobian matrix ∂R/∂y can instead be
solved with ∂F /∂y forming the right-hand side.
I This results in the following adjoint equations,

\left(\frac{\partial R}{\partial y}\right)^T \psi = -\left(\frac{\partial F}{\partial y}\right)^T ,

where ψ is the adjoint matrix (of size ny × nf ).
I Although ψ is usually expressed as a vector, we obtain a matrix here due to
our generalization to the case where f is a vector.
I This linear system needs to be solved for each column of [∂F /∂y]^T , and
thus the computational cost is proportional to the number of quantities of
interest, nf .
I The adjoint solution can then be substituted to find the total derivative,

\frac{df}{dx} = \frac{\partial F}{\partial x} + \psi^T \frac{\partial R}{\partial x}

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 228 / 427


Computing Derivatives Analytic Methods

Adjoint Method 2
I Thus, the cost of computing the total derivative matrix using the adjoint
method is independent of the number of design variables, nx , and instead
proportional to the number of quantities of interest, nf .
I The partial derivatives shown in these equations need to be computed using
some other method. They can be differentiated symbolically, computed by
finite differences, the complex-step method or even AD. The use of AD for
these partials has been shown to be particularly effective in the development
of analytic methods for PDE solvers.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 229 / 427


Computing Derivatives Analytic Methods

Example: Adjoint method applied to simple model


I Applying the adjoint method to compute df1 / dx1 , we get

\begin{bmatrix}
\frac{\partial R_1}{\partial y_1} & \frac{\partial R_2}{\partial y_1} \\
\frac{\partial R_1}{\partial y_2} & \frac{\partial R_2}{\partial y_2}
\end{bmatrix}
\begin{bmatrix}
\frac{df_1}{dr_1} & \frac{df_2}{dr_1} \\
\frac{df_1}{dr_2} & \frac{df_2}{dr_2}
\end{bmatrix}
= -
\begin{bmatrix}
\frac{\partial F_1}{\partial y_1} & \frac{\partial F_2}{\partial y_1} \\
\frac{\partial F_1}{\partial y_2} & \frac{\partial F_2}{\partial y_2}
\end{bmatrix}

I Replacing the partial derivatives computed symbolically,

\begin{bmatrix}
-x_1 & 1 \\
-2 & -x_2^2
\end{bmatrix}
\begin{bmatrix}
\frac{df_1}{dr_1} & \frac{df_2}{dr_1} \\
\frac{df_1}{dr_2} & \frac{df_2}{dr_2}
\end{bmatrix}
= -
\begin{bmatrix}
1 & 0 \\
0 & \sin x_1
\end{bmatrix}

I After evaluating the system at [x1 , x2 ] = [1, 1] and
[y1 , y2 ] = [(sin 1)/3, (sin 1)/3], we can find df1 / dx1 using the computed
values for df1 / dr1 and df1 / dr2 :

\frac{df_1}{dx_1} = \frac{\partial F_1}{\partial x_1} + \frac{df_1}{dr_1} \frac{\partial R_1}{\partial x_1} + \frac{df_1}{dr_2} \frac{\partial R_2}{\partial x_1}

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 230 / 427


Computing Derivatives Analytic Methods

Example: Computational Accuracy and Cost Comparison


Method Sample derivative Time Memory
Complex –39.049760045804646 1.00 1.00
ADIFOR –39.049760045809059 2.33 8.09
Analytic –39.049760045805281 0.58 2.42
FD –39.049724352820375 0.88 0.72

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 231 / 427


Constrained Optimization

Constrained Optimization
1. Introduction

2. Line Search Techniques

3. Gradient-Based Optimization

4. Computing Derivatives

5. Constrained Optimization
5.1 Introduction
5.2 Equality Constraints
5.3 Inequality Constraints
5.4 Constraint Qualification
5.5 Penalty Methods
5.6 Sequential Quadratic Programming

6. Gradient-Free Optimization

7. Multidisciplinary Design Optimization


J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 232 / 427
Constrained Optimization Introduction

Constrained Optimization
I Engineering design optimization problems are rarely unconstrained.
I The constraints that appear in these problems are typically nonlinear.
I Thus, we are interested in general nonlinearly constrained optimization theory
and methods.
Recall the statement of a general optimization problem,

minimize f (x)
with respect to x ∈ Rn
subject to ĉj (x) = 0, j = 1, . . . , m̂
ck (x) ≥ 0, k = 1, . . . , m

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 233 / 427


Constrained Optimization Introduction

Example: Graphical Solution of a Constrained


Optimization Problem 1
Suppose we want to solve the following optimization problem,

minimize f (x) = 4x_1^2 − x_1 − x_2 − 2.5

with respect to x_1 , x_2
subject to c_1 (x) = x_2^2 − 1.5 x_1^2 + 2 x_1 − 1 ≥ 0,
           c_2 (x) = x_2^2 + 2 x_1^2 − 2 x_1 − 4.25 ≤ 0

How can we solve this?

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 234 / 427


Constrained Optimization Introduction

Example: Graphical Solution of a Constrained


Optimization Problem 2

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 235 / 427


Constrained Optimization Equality Constraints

Optimality Conditions for Equality Constrained Problems


I The optimality conditions for nonlinearly constrained problems are important
because they form the basis of many algorithms for solving such problems.
I Suppose we have the following optimization problem with equality
constraints,

minimize f (x)
with respect to x ∈ Rn
subject to ĉj (x) = 0, j = 1, . . . , m̂

I To solve this problem, we could solve for m̂ components of x by using the


equality constraints to express them in terms of the other components.
I The result would be an unconstrained problem with n − m̂ variables.
I However, this procedure is only feasible for simple explicit functions . . .

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 236 / 427


Constrained Optimization Equality Constraints

Lagrange Multipliers 1
I Joseph Louis Lagrange is credited with developing a more general method to
solve this problem.
I At a stationary point, the total differential of the objective function has to be
equal to zero,
df = \frac{\partial f}{\partial x_1} dx_1 + \frac{\partial f}{\partial x_2} dx_2 + \cdots + \frac{\partial f}{\partial x_n} dx_n = \nabla f^T dx = 0.
I Unlike unconstrained optimization, the infinitesimal vector
T
dx = [ dx1 , dx2 , . . . , dxn ] is not arbitrary
I The perturbation x + dx must be feasible: ĉj (x + dx) = 0.
I Therefore, the above equation does not imply that ∇f = 0.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 237 / 427


Constrained Optimization Equality Constraints

Lagrange Multipliers 2
I For a feasible point, the total differential of each of the constraints
(ĉ1 , . . . ĉm̂ ) must also be zero:

∂ĉj ∂ĉj
dĉj = dx1 + · · · + dxn = ∇ĉTj dx = 0, j = 1, . . . , m̂
∂x1 ∂xn
I To interpret the above equation, recall that the gradient of a function is
orthogonal to its contours.
I Thus, since the displacement dx satisfies ĉj (x + dx) = 0 (the equation for a
contour), it follows that dx is orthogonal to the gradient ∇ĉj .
I Lagrange suggested that one could multiply each constraint variation by a
scalar λ̂j and subtract it from the objective function variation,

df - \sum_{j=1}^{\hat{m}} \hat{\lambda}_j \, d\hat{c}_j = 0 \;\Rightarrow\; \sum_{i=1}^{n} \left( \frac{\partial f}{\partial x_i} - \sum_{j=1}^{\hat{m}} \hat{\lambda}_j \frac{\partial \hat{c}_j}{\partial x_i} \right) dx_i = 0.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 238 / 427


Constrained Optimization Equality Constraints

Lagrange Multipliers 3
I Notice what has happened: the components of the infinitesimal vector dx
have become independent and arbitrary, because we have accounted for the
constraints.
I Thus, for this equation to be satisfied, we need a vector λ̂ such that the
expression inside the parentheses vanishes, i.e.,

\frac{\partial f}{\partial x_i} - \sum_{j=1}^{\hat{m}} \hat{\lambda}_j \frac{\partial \hat{c}_j}{\partial x_i} = 0, \quad (i = 1, 2, \ldots, n)

which is a system of n equations in n + m̂ unknowns. To close the system,
we recognize that the m̂ constraints must also be satisfied.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 239 / 427


Constrained Optimization Equality Constraints

Karush–Kuhn–Tucker (KKT) Conditions 1


I Suppose we define a function as the objective function minus a weighted sum
of the constraints,

L(x, \hat{\lambda}) = f(x) - \sum_{j=1}^{\hat{m}} \hat{\lambda}_j \hat{c}_j(x) = f(x) - \hat{\lambda}^T \hat{c}(x)

I We call this function the Lagrangian of the constrained problem, and the
weights the Lagrange multipliers. A stationary point of the Lagrangian with
respect to both x and λ̂ will satisfy

\frac{\partial L}{\partial x_i} = \frac{\partial f}{\partial x_i} - \sum_{j=1}^{\hat{m}} \hat{\lambda}_j \frac{\partial \hat{c}_j}{\partial x_i} = 0, \quad (i = 1, \ldots, n)

\frac{\partial L}{\partial \hat{\lambda}_j} = -\hat{c}_j = 0, \quad (j = 1, \ldots, \hat{m}).

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 240 / 427


Constrained Optimization Equality Constraints

Karush–Kuhn–Tucker (KKT) Conditions 2


I Thus, a stationary point of the Lagrangian encapsulates our required
conditions: the constraints are satisfied and the gradient conditions are
satisfied.
I These first-order conditions are known as the Karush–Kuhn–Tucker (KKT)
conditions. They are necessary conditions for the optimum of a constrained
problem.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 241 / 427


Constrained Optimization Equality Constraints

Karush–Kuhn–Tucker (KKT) Conditions 3


I As in the unconstrained case, the first-order conditions are not sufficient to
guarantee a local minimum.
I For this, we turn to the second-order sufficient conditions (which, as in the
unconstrained case, are not necessary).
I For equality constrained problems we are concerned with the behavior of the
Hessian of the Lagrangian, denoted ∇2xx L(x, λ̂), at locations where the KKT
conditions hold. In particular, we look for positive-definiteness in a subspace
defined by the linearized constraints.
I Geometrically, if we move away from a stationary point (x∗ , λ̂∗ ) along a
direction w that satisfies the linearized constraints, the Lagrangian should
look like a quadratic along this direction.
I More precisely, the second-order sufficient conditions are
wT ∇2xx L(x∗ , λ̂∗ )w > 0,

for all w ∈ Rn such that

∇ĉj (x∗ )T w = 0, j = 1, . . . , m̂.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 242 / 427


Constrained Optimization Equality Constraints

Example: Problem with Single Equality Constraint 1


Consider the following equality constrained problem:

minimize f (x) = x_1 + x_2
with respect to x_1 , x_2
subject to ĉ_1 (x) = x_1^2 + x_2^2 − 2 = 0

[Figure: contours of f with the constraint circle x_1^2 + x_2^2 = 2; axes span −2 ≤ x_1 , x_2 ≤ 2.]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 243 / 427


Constrained Optimization Equality Constraints

Example: Problem with Single Equality Constraint 2


I In this example, the Lagrangian is

L = x_1 + x_2 − λ̂_1 (x_1^2 + x_2^2 − 2)

I And the optimality conditions are

\nabla_x L = \begin{bmatrix} 1 - 2\hat{\lambda}_1 x_1 \\ 1 - 2\hat{\lambda}_1 x_2 \end{bmatrix} = 0 \;\Rightarrow\; x_1 = x_2 = \frac{1}{2\hat{\lambda}_1}

\nabla_{\hat{\lambda}_1} L = -(x_1^2 + x_2^2 - 2) = 0 \;\Rightarrow\; \hat{\lambda}_1 = \pm\frac{1}{2}

I To establish which are minima as opposed to other types of stationary points,
we need to look at the second-order conditions.
I Directions w = (w_1 , w_2 )^T that satisfy the linearized constraints are given by

\nabla \hat{c}_1(x^*)^T w = \frac{1}{\hat{\lambda}_1}(w_1 + w_2) = 0 \;\Rightarrow\; w_2 = -w_1

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 244 / 427


Constrained Optimization Equality Constraints

Example: Problem with Single Equality Constraint 3


I The Hessian of the Lagrangian at the stationary points is

\nabla_{xx}^2 L = \begin{bmatrix} -2\hat{\lambda}_1 & 0 \\ 0 & -2\hat{\lambda}_1 \end{bmatrix}.

I Consequently, the Hessian of the Lagrangian in the subspace defined by w is

w^T \nabla_{xx}^2 L(x^*) w = \begin{bmatrix} w_1 & -w_1 \end{bmatrix} \begin{bmatrix} -2\hat{\lambda}_1 & 0 \\ 0 & -2\hat{\lambda}_1 \end{bmatrix} \begin{bmatrix} w_1 \\ -w_1 \end{bmatrix} = -4 \hat{\lambda}_1 w_1^2

I In this case λ̂∗1 = −1/2 corresponds to a positive-definite Hessian (in the
subspace w) and, therefore, the solution to the problem is
(x_1 , x_2 )^T = (1/(2λ̂_1 ), 1/(2λ̂_1 ))^T = (−1, −1)^T .

I At the solution the constraint normal ∇ĉ1 (x∗ ) is parallel to ∇f (x∗ ), i.e.,
there is a scalar λ̂∗1 such that

∇f (x∗ ) = λ̂∗1 ∇ĉ1 (x∗ ).
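A quick numerical sketch (not from the slides, Python/NumPy) verifying these conditions at the candidate solution:

import numpy as np

x = np.array([-1.0, -1.0])
lam = -0.5
grad_f = np.array([1.0, 1.0])
grad_c = 2.0 * x                                 # gradient of x1^2 + x2^2 - 2
print(np.allclose(grad_f - lam * grad_c, 0.0))   # stationarity: True
print(np.isclose(x @ x - 2.0, 0.0))              # feasibility: True
w = np.array([1.0, -1.0])                        # feasible direction
H = -2.0 * lam * np.eye(2)                       # Hessian of the Lagrangian
print(w @ H @ w > 0.0)                           # second order: True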

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 245 / 427


Constrained Optimization Equality Constraints

Example: Problem with Single Equality Constraint 4


I We can derive this expression by examining the first-order Taylor series
approximations to the objective and constraint functions. To retain feasibility
with respect to ĉ1 (x) = 0 we require that

\hat{c}_1(x + d) = 0 \;\Rightarrow\; \hat{c}_1(x + d) = \underbrace{\hat{c}_1(x)}_{=0} + \nabla \hat{c}_1^T(x) d + O(d^T d).

I Linearizing this we get,


∇ĉT1 (x)d = 0 .
I We also know that a direction of improvement must result in a decrease in f ,
i.e.,
f (x + d) − f (x) < 0.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 246 / 427


Constrained Optimization Equality Constraints

Example: Problem with Single Equality Constraint 5


I Thus to first order we require that

f (x) + ∇f T (x)d − f (x) < 0 ⇒


∇f T (x)d < 0 .

I A necessary condition for optimality is that there be no direction satisfying


both of these conditions. The only way that such a direction cannot exist is if
∇f (x) and ∇ĉ1 (x) are parallel, that is, if ∇f (x) = λ̂1 ∇ĉ1 (x) holds.
I By defining the Lagrangian function

L(x, λ̂1 ) = f (x) − λ̂1 ĉ1 (x),

and noting that ∇x L(x, λ̂1 ) = ∇f (x) − λ̂1 ∇ĉ1 (x), we can state the
necessary optimality condition as follows: At the solution x∗ there is a scalar
λ̂∗1 such that ∇x L(x∗ , λ̂∗1 ) = 0.
I Thus we can search for solutions of the equality-constrained problem by
searching for a stationary point of the Lagrangian function. The scalar λ̂1 is
the Lagrange multiplier for the constraint ĉ1 (x) = 0.
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 247 / 427
Constrained Optimization Inequality Constraints

Optimality for Inequality Constrained Problems 1


I Suppose we now have a general problem with equality and inequality
constraints.

minimize f (x)
w.r.t x ∈ Rn
subject to ĉj (x) = 0, j = 1, . . . , m̂
ck (x) ≥ 0, k = 1, . . . , m

I The optimality (KKT) conditions for this problem can also be obtained for
this case by modifying the Lagrangian to be

L(x, \hat{\lambda}, \lambda, s) = f(x) - \hat{\lambda}^T \hat{c}(x) - \lambda^T \left( c(x) - s \odot s \right),

where λ are the Lagrange multipliers associated with the inequality
constraints and s is a vector of slack variables (s ⊙ s denotes the
elementwise square).

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 248 / 427


Constrained Optimization Inequality Constraints

First-Order KKT Conditions


\nabla_x L = 0 \;\Rightarrow\; \frac{\partial L}{\partial x_i} = \frac{\partial f}{\partial x_i} - \sum_{j=1}^{\hat{m}} \hat{\lambda}_j \frac{\partial \hat{c}_j}{\partial x_i} - \sum_{k=1}^{m} \lambda_k \frac{\partial c_k}{\partial x_i} = 0, \quad i = 1, \ldots, n

\nabla_{\hat{\lambda}} L = 0 \;\Rightarrow\; \frac{\partial L}{\partial \hat{\lambda}_j} = \hat{c}_j = 0, \quad j = 1, \ldots, \hat{m}

\nabla_{\lambda} L = 0 \;\Rightarrow\; \frac{\partial L}{\partial \lambda_k} = c_k - s_k^2 = 0, \quad k = 1, \ldots, m

\nabla_s L = 0 \;\Rightarrow\; \frac{\partial L}{\partial s_k} = \lambda_k s_k = 0, \quad k = 1, \ldots, m

\lambda_k \geq 0, \quad k = 1, \ldots, m.

Now we have n + m̂ + 2m equations and unknowns, and for each inequality
constraint either:
I sk > 0: the k-th constraint is inactive, and λk = 0.
I sk = 0: the k-th constraint is active, and λk 6= 0; λk must then be
non-negative.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 249 / 427


Constrained Optimization Inequality Constraints

Sufficient Optimality Conditions


Sufficient conditions are obtained by examining the second-order requirements.
The set of sufficient conditions is as follows:
1. KKT necessary conditions must be satisfied at x∗ .
2. The Hessian matrix of the Lagrangian,

\nabla^2 L = \nabla^2 f(x^*) - \sum_{j=1}^{\hat{m}} \hat{\lambda}_j \nabla^2 \hat{c}_j - \sum_{k=1}^{m} \lambda_k \nabla^2 c_k

is positive definite in the feasible space. This is a subspace of n-space
defined as follows: any direction y that satisfies

y \neq 0
\nabla \hat{c}_j^T(x^*)\, y = 0, \quad \text{for all } j = 1, \ldots, \hat{m}
\nabla c_k^T(x^*)\, y = 0, \quad \text{for all } k \text{ for which } \lambda_k > 0.

Then the Hessian of the Lagrangian in feasible space must be positive
definite,

y^T \nabla^2 L(x^*)\, y > 0.
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 250 / 427
Constrained Optimization Inequality Constraints

Example: Problem with a Single Inequality Constraint 1


I Suppose we now have the same problem, but with an inequality replacing the
equality constraint,

minimize f (x) = x_1 + x_2
s.t. c_1 (x) = 2 − x_1^2 − x_2^2 ≥ 0

I The feasible region is now the circle and its interior. Note that ∇c1 (x) now
points towards the center of the circle.
I Graphically, we can see that the solution is still (−1, −1)T and therefore
λ∗1 = 1/2.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 251 / 427


Constrained Optimization Inequality Constraints

Example: Problem with a Single Inequality Constraint 2


[Figure: the feasible region (the disk x_1^2 + x_2^2 ≤ 2) with contours of f ; the optimum remains at (−1, −1)^T . Axes span −2 ≤ x_1 , x_2 ≤ 2.]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 252 / 427


Constrained Optimization Inequality Constraints

Example: Problem with a Single Inequality Constraint 3


I Given a point x that is not optimal, we can find a step d that both stays
feasible and decreases the objective function f , to first order. As in the
equality constrained case, the latter condition is expressed as

∇f T (x)d < 0 .

I The first condition, however is slightly different, since the constraint is not
necessarily zero, i.e.
c1 (x + d) ≥ 0
I Performing a Taylor series expansion we have

c_1(x + d) \approx \underbrace{c_1(x) + \nabla c_1^T(x)\, d}_{\geq 0}.

I Thus feasibility is retained to a first order if

c1 (x) + ∇cT1 (x)d ≥ 0 .

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 253 / 427


Constrained Optimization Inequality Constraints

Example: Problem with a Single Inequality Constraint 4


I In order to find valid steps d, it helps to consider two possibilities.
1. Suppose x lies strictly inside the circle (c1 (x) > 0). In this case, any vector d
satisfies the feasibility condition, provided that its length is sufficiently small.
The only situation that will prevent us from finding a descent direction is if
∇f (x) = 0.
2. Consider now the case in which x lies on the boundary, i.e., c1 (x) = 0. The
conditions thus become ∇f T (x)d < 0 and ∇cT1 (x)d ≥ 0. The two regions
defined by these conditions fail to intersect only when ∇f (x) and ∇c1 (x)
point in the same direction, that is, when

\nabla f(x) = \lambda_1 \nabla c_1(x), \quad \text{for some } \lambda_1 \geq 0.

I The optimality conditions for these two cases can again be summarized by
using the Lagrangian function, that is,

∇x L(x∗ , λ∗1 ) = 0, for some λ∗1 ≥ 0 and λ∗1 s∗1 = 0.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 254 / 427


Constrained Optimization Inequality Constraints

Example: Problem with a Single Inequality Constraint 5


I The last condition is known as a complementarity condition and implies that
the Lagrange multiplier can be strictly positive only when the constraint is
active.
[Figure: the feasible disk with the constraint active at the optimum (−1, −1)^T , where ∇f and ∇c_1 are parallel. Axes span −2 ≤ x_1 , x_2 ≤ 2.]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 255 / 427


Constrained Optimization Inequality Constraints

Example: Lagrangian Whose Hessian is Not Positive


Definite

minimize f (x) = −x_1 x_2
subject to ĉ_1 (x) = 2 − x_1^2 − x_2^2 = 0
x_1 ≥ 0, x_2 ≥ 0

[Figure: contours of f = −x_1 x_2 in the first quadrant with the constraint circle; axes span 0 ≤ x_1 , x_2 ≤ 2.]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 256 / 427


Constrained Optimization Inequality Constraints

Example: Problem with Two Inequality Constraints 1


Suppose we now add another inequality constraint,

minimize f (x) = x_1 + x_2
s.t. c_1 (x) = 2 − x_1^2 − x_2^2 ≥ 0, c_2 (x) = x_2 ≥ 0.

The feasible region is now a half disk. Graphically, we can see that the solution is
now (−√2, 0)^T and that both constraints are active at this point.

[Figure: the half-disk feasible region with contours of f ; axes span −2 ≤ x_1 ≤ 2, 0 ≤ x_2 ≤ 1.5.]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 257 / 427


Constrained Optimization Inequality Constraints

Example: Problem with Two Inequality Constraints 2


The Lagrangian for this problem is

L(x, \lambda, s) = f(x) - \lambda_1 \left( c_1(x) - s_1^2 \right) - \lambda_2 \left( c_2(x) - s_2^2 \right),

where λ = (λ1 , λ2 )T is the vector of Lagrange multipliers. The first-order
optimality conditions are thus

\nabla_x L(x^*, \lambda^*) = 0, \quad \text{for some } \lambda^* \geq 0.

Applying the complementarity conditions to both inequality constraints,

\lambda_1^* s_1^* = 0, \quad \text{and} \quad \lambda_2^* s_2^* = 0.

For x^* = (-\sqrt{2}, 0)^T we have

\nabla f(x^*) = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad \nabla c_1(x^*) = \begin{bmatrix} 2\sqrt{2} \\ 0 \end{bmatrix}, \quad \nabla c_2(x^*) = \begin{bmatrix} 0 \\ 1 \end{bmatrix},

and ∇x L(x∗ , λ∗ ) = 0 when

\lambda^* = \begin{bmatrix} \frac{1}{2\sqrt{2}} \\ 1 \end{bmatrix}.
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 258 / 427
Constrained Optimization Inequality Constraints

Example: Problem with Two Inequality Constraints 3


I Now let’s consider other feasible points that are not optimal and examine the
Lagrangian and its gradients at these points.
I For the point x = (√2, 0)^T , both constraints are again active. However, ∇f (x)
no longer lies in the quadrant defined by ∇ci (x)^T d ≥ 0, i = 1, 2, and
therefore there are descent directions that are feasible, like for example
d = (−1, 0)^T .
I ∇x L(x, λ) = 0 at this point for λ = (−1/(2√2), 1)^T . However, since λ1 is
negative, the first-order conditions are not satisfied at this point.

[Figure: the half-disk feasible region with the non-optimal point (√2, 0)^T marked.]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 259 / 427


Constrained Optimization Inequality Constraints

Example: Problem with Two Inequality Constraints 4


I Now consider the point x = (1, 0)^T , for which only the second constraint is
active. Linearizing f and c as before, d must satisfy the following to be a
feasible descent direction:

c_1(x + d) \geq 0 \;\Rightarrow\; 1 + \nabla c_1(x)^T d \geq 0,
c_2(x + d) \geq 0 \;\Rightarrow\; \nabla c_2(x)^T d \geq 0,
f(x + d) - f(x) < 0 \;\Rightarrow\; \nabla f(x)^T d < 0.

I We only need to worry about the last two conditions, since the first is always
satisfied for a small enough step.
I By noting that

\nabla f(x) = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad \nabla c_2(x) = \begin{bmatrix} 0 \\ 1 \end{bmatrix},

we can see that the vector d = (−1/2, 1/4)^T , for example, satisfies the two
conditions.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 260 / 427


Constrained Optimization Inequality Constraints

Example: Problem with Two Inequality Constraints 5


I Since c1 (x) > 0, we must have λ1 = 0. In order to satisfy ∇x L(x, λ) = 0 we
would have to find λ2 such that ∇f (x) = λ2 ∇c2 (x). No such λ2 exists and
this point is therefore not an optimum.

[Figure: the half-disk feasible region with the non-optimal point (1, 0)^T marked on the boundary c_2 = 0.]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 261 / 427


Constrained Optimization Constraint Qualification

Constraint Qualification 1
I The KKT conditions are derived using certain assumptions and depending on
the problem, these assumptions might not hold.
I A point x satisfying a set of constraints is a regular point if the gradient
vectors of the active constraints, ∇cj (x) are linearly independent.
I To illustrate this, suppose we replaced ĉ1 (x) in the previous example by the
equivalent condition

\hat{c}_1(x) = \left( x_1^2 + x_2^2 - 2 \right)^2 = 0.

I Then we have

\nabla \hat{c}_1(x) = \begin{bmatrix} 4 (x_1^2 + x_2^2 - 2)\, x_1 \\ 4 (x_1^2 + x_2^2 - 2)\, x_2 \end{bmatrix},

so ∇ĉ1 (x) = 0 for all feasible points and ∇f (x) = λ̂1 ∇ĉ1 (x) cannot be
satisfied. In other words, there is no (finite) Lagrange multiplier that makes
the objective gradient parallel to the constraint gradient, so we cannot solve
the optimality conditions.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 262 / 427


Constrained Optimization Constraint Qualification

Constraint Qualification 2
I This does not imply there is no solution; on the contrary, the solution
remains unchanged for the earlier example.
I Instead, what it means is that most algorithms will fail, because they assume
the constraints are linearly independent.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 263 / 427


Constrained Optimization Penalty Methods

Penalty Function Methods


I One of the ways of solving constrained optimization problems, at least
approximately, is by adding a penalty function to the objective function that
depends — in some logical way — on the value of the constraints.
I The idea is to solve a sequence of unconstrained minimization problems in
which the infeasibility of the constraints is minimized together with the
objective function.
I There are two main types of penalization methods:
I Exterior penalty functions: These impose a penalty for violation of constraints
I Interior penalty functions: These impose a penalty for approaching the
boundary of an inequality constraint.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 264 / 427


Constrained Optimization Penalty Methods

Exterior Penalty Functions 1


I Consider the equality-constrained problem:

minimize f (x)
subject to ĉ(x) = 0

where ĉ(x) is an m̂-dimensional vector whose j-th component is ĉj (x).


I We assume that all functions are twice continuously differentiable.
I We require a penalty for constraint violation to be a continuous function φ
with the following properties

φ(x) = 0 if x is feasible
φ(x) > 0 otherwise,

I The new objective function is

π(x, ρ) = f (x) + ρφ(x),

where ρ is positive and is called the penalty parameter.


J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 265 / 427
Constrained Optimization Penalty Methods

Exterior Penalty Functions 2


I The penalty method consists of solving a sequence of unconstrained
minimization problems of the form

minimize π (x, ρk )
w.r.t. x

for an increasing sequence of positive values of ρk tending to infinity.


I For finite values of ρk , the minimizer of the penalty function violate the
equality constraints. The increasing penalty forces the minimizer toward the
feasible region.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 266 / 427


Constrained Optimization Penalty Methods

Exterior Penalty Functions 3


General algorithm using exterior penalty functions:

1: Input: x0 , τ . Starting point, penalty multiplier


2: Output: x∗ . Optimum point
3: repeat
4: Solve the following unconstrained subproblem starting from xk :

minimize π(x, ρk )
w.r.t. x

5: xk+1 ← x
6: ρk+1 ← τ ρk . Increase the penalty parameter
7: k ←k+1
8: until xk converges to the desired tolerance

The increase in the penalty parameter for each iteration can range from modest
(ρk+1 = 1.4ρk ), to ambitious (ρk+1 = 10ρk ), depending on the problem.
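A compact sketch of this algorithm (not from the slides) on the earlier problem min x1 + x2 subject to x1² + x2² − 2 = 0, using SciPy's BFGS as the unconstrained subproblem solver; the starting point and parameters are illustrative:

import numpy as np
from scipy.optimize import minimize

def chat(x):
    return x[0]**2 + x[1]**2 - 2.0           # equality constraint

def pi(x, rho):
    return x[0] + x[1] + 0.5 * rho * chat(x)**2

x, rho, tau = np.array([-0.5, 0.1]), 1.0, 10.0
for k in range(8):
    x = minimize(lambda z: pi(z, rho), x, method="BFGS").x
    rho *= tau                               # increase the penalty parameter
print(x, chat(x))                            # approaches (-1, -1), c -> 0

Note that each intermediate minimizer is slightly infeasible, as discussed above, and that the subproblems become harder to solve as ρ grows.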

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 267 / 427


Constrained Optimization Penalty Methods

Quadratic Penalty Method 1


I The quadratic penalty function is defined as

\pi(x, \rho) = f(x) + \frac{\rho}{2} \sum_{i=1}^{\hat{m}} \hat{c}_i(x)^2 = f(x) + \frac{\rho}{2} \hat{c}(x)^T \hat{c}(x).

I The penalty is equal to the sum of the squares of all the constraints and is
therefore greater than zero when any constraint is violated and zero when
the point is feasible.
I We can modify this method to handle inequality constraints by defining the
penalty for these constraints as

\phi(x, \rho) = \rho \sum_{i=1}^{m} \left( \max\left[ 0, -c_i(x) \right] \right)^2 .

I Penalty functions suffer from problems of ill conditioning. The solution of the
modified problem approaches the true solution as limρ→+∞ x∗ (ρ) = x∗

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 268 / 427


Constrained Optimization Penalty Methods

Quadratic Penalty Method 2


I However, as the penalty parameter increases, the condition number of the
Hessian matrix of π(x, ρ) increases and tends to ∞. This makes the problem
increasingly difficult to solve numerically.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 269 / 427


Constrained Optimization Penalty Methods

Interior Penalty Methods 1


I Exterior penalty methods generate infeasible points and are therefore not
suitable when feasibility has to be strictly maintained.
I This might be the case if the objective function is undefined or ill-defined
outside the feasible region.
I Interior point methods also minimize a sequence of modified differentiable
functions whose unconstrained minima converge to the optimum solution of
the constrained problem in the limit.
I Consider the inequality-constrained problem:

minimize f (x)
subject to c(x) ≥ 0

where c(x) is an m-dimensional vector whose j-th component is cj (x).


I We assume that all functions are twice continuously differentiable.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 270 / 427


Constrained Optimization Penalty Methods

Interior Penalty Methods 2


I The logarithmic barrier function adds a penalty that tends to infinity as x
approaches infeasibility. The function is defined as

\pi(x, \mu) = f(x) - \mu \sum_{j=1}^{m} \log\left( c_j(x) \right),

where the positive scalar µ is called the barrier parameter.
I The inverse barrier function is defined as

\pi(x, \mu) = f(x) + \mu \sum_{j=1}^{m} \frac{1}{c_j(x)},

and shares many of the characteristics of the logarithmic barrier.
I The solutions of the modified problem for both functions approach the real
solution as limµ→0 x∗ (µ) = x∗ .
I Again, the Hessian matrix becomes increasingly ill-conditioned as µ
approaches zero.
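A matching sketch (not from the slides) of the logarithmic barrier on min x1 + x2 subject to 2 − x1² − x2² ≥ 0; the iterates stay strictly feasible and approach (−1, −1) as µ → 0:

import numpy as np
from scipy.optimize import minimize

def pi(x, mu):
    c = 2.0 - x[0]**2 - x[1]**2
    if c <= 0.0:
        return np.inf                 # keep iterates strictly feasible
    return x[0] + x[1] - mu * np.log(c)

x, mu = np.array([0.0, 0.0]), 1.0
for k in range(6):
    x = minimize(lambda z: pi(z, mu), x, method="Nelder-Mead").x
    mu *= 0.1                         # an ambitious barrier reduction
print(x)                              # close to (-1, -1)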
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 271 / 427
Constrained Optimization Penalty Methods

Interior Penalty Methods 3


I Similarly to an exterior point method, an algorithm using these barrier
functions finds the minimum of π(x, µk ) for a given (feasible) starting point,
and terminates when the norm of the gradient is close to zero.
I The algorithm then chooses a new barrier parameter µk+1 and a new starting
point, finds the minimum of the new problem and so on.
I A value of 0.1 for the ratio µk+1 /µk is usually considered ambitious.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 272 / 427


Constrained Optimization Penalty Methods

Example: Quadratic Penalty Function in Action

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 273 / 427


Constrained Optimization Sequential Quadratic Programming

Sequential Quadratic Programming (SQP) 1


I Consider the equality-constrained problem,

minimize f (x)
subject to ĉj (x) = 0, j = 1, . . . , m̂

I The idea of SQP is to model this problem at the current point xk by a


quadratic subproblem and to use the solution of this subproblem to find the
new point xk+1 .
I SQP represents the application of Newton’s method to the KKT optimality
conditions.
I The Lagrangian function for this problem is L(x, λ̂) = f (x) − λ̂T ĉ(x). We
define the Jacobian of the constraints by

A(x)^T = \nabla \hat{c}(x)^T = \left[ \nabla \hat{c}_1(x), \ldots, \nabla \hat{c}_{\hat{m}}(x) \right],

which is an n × m̂ matrix, and g(x) ≡ ∇f (x) is an n-vector as before. Note
that A is generally not symmetric.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 274 / 427


Constrained Optimization Sequential Quadratic Programming

Sequential Quadratic Programming (SQP) 2


I Applying the first-order KKT conditions to this problem we obtain

\nabla L(x, \hat{\lambda}) = 0 \;\Rightarrow\; \begin{bmatrix} g(x) - A(x)^T \hat{\lambda} \\ \hat{c}(x) \end{bmatrix} = 0

I This set of nonlinear equations can be solved using Newton’s method,

\begin{bmatrix} W(x_k, \hat{\lambda}_k) & -A(x_k)^T \\ A(x_k) & 0 \end{bmatrix} \begin{bmatrix} p_k \\ p_{\hat{\lambda}} \end{bmatrix} = \begin{bmatrix} -g_k + A_k^T \hat{\lambda}_k \\ -\hat{c}_k \end{bmatrix}

where the Hessian of the Lagrangian is denoted by W (x, λ̂) = ∇2xx L(x, λ̂).
I The Newton step from the current point is given by

\begin{bmatrix} x_{k+1} \\ \hat{\lambda}_{k+1} \end{bmatrix} = \begin{bmatrix} x_k \\ \hat{\lambda}_k \end{bmatrix} + \begin{bmatrix} p_k \\ p_{\hat{\lambda}} \end{bmatrix}.
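A minimal sketch (not from the slides) of these Newton iterations on the earlier circle problem, min x1 + x2 subject to x1² + x2² − 2 = 0, using the exact Hessian W = −2λ̂ I that this particular problem happens to have; starting point and iteration count are illustrative:

import numpy as np

x, lam = np.array([-1.5, -0.5]), -1.0
for k in range(8):
    g = np.array([1.0, 1.0])                   # gradient of f
    A = np.array([[2.0 * x[0], 2.0 * x[1]]])   # constraint Jacobian (1 x 2)
    c = np.array([x @ x - 2.0])
    W = -2.0 * lam * np.eye(2)                 # Hessian of the Lagrangian
    KKT = np.block([[W, -A.T], [A, np.zeros((1, 1))]])
    rhs = np.concatenate([-g + lam * A[0], -c])
    p = np.linalg.solve(KKT, rhs)
    x, lam = x + p[:2], lam + p[2]
print(x, lam)                                  # (-1, -1) and lambda = -1/2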

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 275 / 427


Constrained Optimization Sequential Quadratic Programming

Alternative View of SQP 1


I An alternative way of looking at this formulation of SQP is to define the
following quadratic problem at (xk , λ̂k ):

minimize \;\; \frac{1}{2} p^T W_k p + g_k^T p
subject to \;\; A_k p + \hat{c}_k = 0

I This problem has a unique solution that satisfies

W_k p + g_k - A_k^T \hat{\lambda}_k = 0
A_k p + \hat{c}_k = 0

I By writing this in matrix form, we see that pk and λ̂k+1 can be identified as
the solution of the Newton equations we derived previously:

\begin{bmatrix} W_k & -A_k^T \\ A_k & 0 \end{bmatrix} \begin{bmatrix} p_k \\ \hat{\lambda}_{k+1} \end{bmatrix} = \begin{bmatrix} -g_k \\ -\hat{c}_k \end{bmatrix}.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 276 / 427


Constrained Optimization Sequential Quadratic Programming

Alternative View of SQP 2


I This problem is equivalent, but the second set of variables is now the actual
vector of Lagrange multipliers λ̂k+1 instead of the Lagrange multiplier step,
pλ̂ .

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 277 / 427


Constrained Optimization Sequential Quadratic Programming

Quasi-Newton Approximations 1
I Any SQP method relies on a choice of Wk (an approximation of the Hessian
of the Lagrangian) in the quadratic model.
I When Wk is exact, then the SQP becomes the Newton method applied to
the optimality conditions.
I One way to approximate the Hessian of the Lagrangian would be to use a
quasi-Newton approximation, such as the BFGS update formula. We could
define,

sk = xk+1 − xk , yk = ∇x L(xk+1 , λk+1 ) − ∇x L(xk , λk+1 ),

and then compute the new approximation Bk+1 using the same formula used
in the unconstrained case.
I If ∇2xx L is positive definite at the sequence of points xk , the method will
converge rapidly, just as in the unconstrained case. If, however, ∇2xx L is not
positive definite, then using the BFGS update may not work well.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 278 / 427


Constrained Optimization Sequential Quadratic Programming

Quasi-Newton Approximations 2
I To ensure that the update is always well defined, the damped BFGS updating
for SQP was devised. Using this scheme, we set

r_k = \theta_k y_k + (1 - \theta_k) B_k s_k ,

where the scalar θk is defined as

\theta_k =
\begin{cases}
1 & \text{if } s_k^T y_k \geq 0.2\, s_k^T B_k s_k , \\
\dfrac{0.8\, s_k^T B_k s_k}{s_k^T B_k s_k - s_k^T y_k} & \text{if } s_k^T y_k < 0.2\, s_k^T B_k s_k .
\end{cases}

Then we can update Bk+1 using

B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{r_k r_k^T}{s_k^T r_k},

which is the standard BFGS update formula with yk replaced by rk . This
guarantees that the Hessian approximation is positive definite.
I When θk = 0, we have Bk+1 = Bk
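A direct transcription of this update as a sketch (not from the slides, Python/NumPy); the random test data is illustrative:

import numpy as np

def damped_bfgs(B, s, y):
    sBs = s @ B @ s
    if s @ y >= 0.2 * sBs:
        theta = 1.0                      # unmodified BFGS update
    else:
        theta = 0.8 * sBs / (sBs - s @ y)
    r = theta * y + (1.0 - theta) * (B @ s)
    Bs = B @ s
    return B - np.outer(Bs, Bs) / sBs + np.outer(r, r) / (s @ r)

# The update keeps B symmetric positive definite even for steps with
# negative curvature (s^T y < 0):
rng = np.random.default_rng(1)
B = np.eye(3)
for _ in range(5):
    B = damped_bfgs(B, rng.standard_normal(3), rng.standard_normal(3))
print(np.all(np.linalg.eigvalsh(B) > 0.0))   # True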

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 279 / 427


Constrained Optimization Sequential Quadratic Programming

Quasi-Newton Approximations 3
I When θk = 1 we have an unmodified BFGS update.
I The modified method thus produces an interpolation between the current Bk
and the one corresponding to BFGS.
I The choice of θk ensures that the new approximation stays close enough to
the current approximation to guarantee positive definiteness.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 280 / 427


Constrained Optimization Sequential Quadratic Programming

Other Modifications 1
I In addition to using a different quasi-Newton update, SQP algorithms also
need modifications to the line search criteria in order to ensure that the
method converges from remote starting points.
I It is common to use a merit function, φ, to control the size of the steps in the
line search. The following is one of the possibilities for such a function:

\phi(x_k; \mu) = f(x) + \frac{1}{\mu} \| \hat{c} \|_1

I The penalty parameter µ is positive and the L1 norm of the equality
constraints is

\| \hat{c} \|_1 = \sum_{j=1}^{\hat{m}} | \hat{c}_j | .

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 281 / 427


Constrained Optimization Sequential Quadratic Programming

Other Modifications 2
I To determine the sequence of penalty parameters, the following strategy is
often used:

\mu_k =
\begin{cases}
\mu_{k-1} & \text{if } \mu_{k-1}^{-1} \geq \gamma + \delta \\
(\gamma + 2\delta)^{-1} & \text{otherwise,}
\end{cases}

where γ is set to max(λk+1 ) and δ is a small tolerance that should be larger
than the expected relative precision of the function evaluations.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 282 / 427


Constrained Optimization Sequential Quadratic Programming

SQP Algorithm

Input: Initial guess (x0 , λ0 ), parameters 0 < η < 0.5


Output: Optimum, x∗
k←0
Initialize the Hessian estimate, B0 ← I
repeat
Compute pk and pλ̂ by solving the KKT system, with Bk in place of Wk
Choose µk such that pk is a descent direction for φ at xk
αk ← 1
while φ(xk + αk pk , µk ) > φ(xk , µk ) + ηαk D [φ(xk , pk )] do
αk ← τα αk for some 0 < τα < 1
end while
xk+1 ← xk + αk pk
λ̂k+1 ← λ̂k + pλ̂
Evaluate fk+1 , gk+1 , ck+1 and Ak+1
sk ← αk pk , yk ← ∇x L(xk+1 , λk+1 ) − ∇x L(xk , λk+1 )
Obtain Bk+1 by using a quasi-Newton update to Bk
k ←k+1
until Convergence

D denotes the directional derivative in the pk direction.


J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 283 / 427
Constrained Optimization Sequential Quadratic Programming

Inequality Constraints 1
I The SQP method can be extended to handle inequality constraints.
I Consider general nonlinear optimization problem

minimize f (x)
subject to ĉj (x) = 0, j = 1, . . . , m̂
ck (x) ≥ 0, k = 1, . . . , m

I To define the subproblem we now linearize both the inequality and equality
constraints and obtain,
1 T
minimize p Wk p + gkT p
2
subject to ∇ĉj (x)T p + ĉj (x) = 0, j = 1, . . . , m̂
∇ck (x)T p + ck (x) ≥ 0, k = 1, . . . , m

I One of the most common type of strategy to solve this problem, the
active-set method, is to consider only the active constraints at a given
iteration and treat those as equality constraints.
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 284 / 427
Constrained Optimization Sequential Quadratic Programming

Inequality Constraints 2
I This is a significantly more difficult problem because we do not know a priori
which inequality constraints are active at the solution. If we did, we could just
solve the equality constrained problem considering only the active constraints.
I The most commonly used active-set methods are feasible-point methods.
These start with a feasible solution and never let the new point become
infeasible.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 285 / 427


Constrained Optimization Sequential Quadratic Programming

Example: Constrained Optimization Using SQP

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 286 / 427


Gradient-Free Optimization

Gradient-Free Optimization
1. Introduction

2. Line Search Techniques

3. Gradient-Based Optimization

4. Computing Derivatives

5. Constrained Optimization

6. Gradient-Free Optimization
6.1 Introduction
6.2 Nelder–Mead Simplex
6.3 DIvided RECTangles (DIRECT)
6.4 Genetic Algorithms
6.5 Particle Swarm Optimization

7. Multidisciplinary Design Optimization


J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 287 / 427
Gradient-Free Optimization Introduction

Gradient-Free Optimization 1
Using optimization in the solution of practical applications we often encounter one
or more of the following challenges:
I non-differentiable functions and/or constraints
I disconnected and/or non-convex feasible space
I discrete feasible space
I mixed variables (discrete, continuous, permutation)
I large dimensionality
I multiple local minima (multi-modal)
I multiple objectives

[Figure: mixed (integer–continuous) design variables]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 288 / 427


Gradient-Free Optimization Introduction

Gradient-Free Optimization 2
Gradient-based methods are:
I Efficient in finding local minima for high-dimensional, nonlinearly-constrained,
convex problems
I Sensitive to noisy and discontinuous functions
I Limited to continuous design variables.
Consider, for example, the Griewank function:
f(x) = Σ_{i=1}^{n} x_i²/4000 − Π_{i=1}^{n} cos(x_i/√i) + 1
−600 ≤ x_i ≤ 600

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 289 / 427


Gradient-Free Optimization Introduction

Gradient-Free Optimization 3
How could we find the best solution for this example?
I Multiple restarts of a gradient-based (local) optimizer from different points
I Systematically search the design space
I Use gradient-free optimizers
Some comments on gradient-free methods:
I Many mimic mechanisms observed in nature — biomimicry — or use other
heuristics.
I They are not necessarily guaranteed to find the true global optimal solutions
— unlike gradient-based methods in a convex search space . . .
I . . . but they are able to find many good solutions — the mathematician’s
answer vs. the engineer’s answer.
I Their key strength is the ability to solve some problems that are difficult to
solve using gradient-based methods.
I Many of them are designed as global optimizers and thus are able to find
multiple local optima while searching for the global optimum.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 290 / 427


Gradient-Free Optimization Introduction

Gradient-Free Optimization 4
A wide variety of gradient-free methods have been developed. We are going to
look at some of the most commonly used algorithms:
I Nelder–Mead Simplex (Nonlinear Simplex)
I Divided Rectangles Method
I Genetic Algorithms
I Particle Swarm Optimization

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 291 / 427


Gradient-Free Optimization Nelder–Mead Simplex

Nelder–Mead Simplex 1
I The simplex method of Nelder and Mead performs a search in n-dimensional
space using heuristic ideas.
I It is also known as the nonlinear simplex
I Not to be confused with the linear simplex, with which it has nothing in
common.
I Strengths: it requires no derivatives to be computed and does not
require the objective function to be smooth.
I Weakness: it is not very efficient, particularly for problems with more than
about 10 design variables; above this number of variables convergence
becomes increasingly difficult.

I A simplex is a structure in n-dimensional space formed by n + 1 points that
are not in the same plane.
I A line segment is a 1-dimensional simplex, a triangle is a 2-dimensional
simplex and a tetrahedron forms a simplex in 3-dimensional space.
I The simplex is also called a hypertetrahedron.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 292 / 427


Gradient-Free Optimization Nelder–Mead Simplex

Nelder–Mead Simplex 2
The Nelder–Mead algorithm starts with a simplex (n + 1 sets of design variables
x) and then modifies the simplex at each iteration using four simple operations.
The sequence of operations to be performed is chosen based on the relative values
of the objective function at each of the points.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 293 / 427


Gradient-Free Optimization Nelder–Mead Simplex

Nelder–Mead Algorithm 1
I The first step of the simplex algorithm is to find the n + 1 points of the
simplex given an initial guess x0 .
I This can be easily done by simply adding a step to each component of x0 to
generate n new points.
I However, generating a simplex with equal length edges is preferable . . .
I Suppose the length of all sides is required to be c and that the initial guess,
x0 is the (n + 1)th point.
I The remaining points of the simplex, i = 1, . . . , n can be computed by
adding a vector to x0 whose components are all b except for the ith
component which is set to a, where
b = (c / (n√2)) (√(n+1) − 1)
a = b + c/√2.
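A minimal Python sketch of this construction (the function name and interface are illustrative):

import numpy as np

def initial_simplex(x0, c):
    # Build a regular simplex with edge length c from the initial guess x0.
    n = len(x0)
    b = c / (n * np.sqrt(2.0)) * (np.sqrt(n + 1.0) - 1.0)
    a = b + c / np.sqrt(2.0)
    points = [np.asarray(x0, dtype=float)]
    for i in range(n):
        step = np.full(n, b)
        step[i] = a  # the ith component is a, all others are b
        points.append(points[0] + step)
    return np.array(points)  # (n + 1) points, one per row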

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 294 / 427


Gradient-Free Optimization Nelder–Mead Simplex

Nelder–Mead Algorithm 2
[Figure: initial simplices with equal edge length — a triangle in two dimensions (x1, x2) and a tetrahedron in three dimensions (x1, x2, x3)]

I After generating the initial simplex, we have to evaluate the objective
function at each of its vertices in order to identify three key points:
I The highest value — the worst point, xw
I The second highest value — the lousy point, xl
I The lowest value — the best point, xb

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 295 / 427


Gradient-Free Optimization Nelder–Mead Simplex

Nelder–Mead Algorithm 3
The Nelder–Mead algorithm starts by computing the average of the n points that
exclude the worst,

x_a = (1/n) Σ_{i=1, i≠w}^{n+1} x_i.

The algorithm then performs five main operations to the simplex:

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 296 / 427


Gradient-Free Optimization Nelder–Mead Simplex

Nelder–Mead Algorithm 4

[Figure: the simplex operations — reflection, expansion, outside contraction, inside contraction, and shrinking]

I Reflection
xr = xa + α (xa − xw )

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 297 / 427


Gradient-Free Optimization Nelder–Mead Simplex

Nelder–Mead Algorithm 5
I Expansion
xe = xr + γ (xr − xa ) ,
where the expansion parameter γ is usually set to 1.
I Inside contraction
xc = xa − β (xa − xw ) ,
where the contraction factor is usually set to β = 0.5.
I Outside contraction
xo = xa + β (xa − xw ) .
I Shrinking
xi = xb + ρ (xi − xb ) ,
where the scaling parameter is usually set to ρ = 0.5.
Each of these operations generates a new point and the sequence of operations
performed in one iteration depends on the value of the objective at the new point
relative to the other key points.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 298 / 427


Gradient-Free Optimization Nelder–Mead Simplex

Nelder–Mead Algorithm 6
[Flowchart: one iteration of the Nelder–Mead algorithm — initialize the n-simplex and evaluate the n + 1 points; rank the vertices (best, lousy, worst); reflect; if the reflected point is better than the best point, expand and keep the better of the reflected and expanded points; if it is worse than the worst point, perform an inside contraction; if it is worse than the lousy point only, perform an outside contraction; shrink the simplex whenever the contracted point is still worse]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 299 / 427


Gradient-Free Optimization Nelder–Mead Simplex

Nelder–Mead Algorithm
Input: Initial guess, x0
Output: Optimum, x∗
k←0
Create a simplex with edge length c
repeat
Identify the highest (xw: worst), second highest (xl: lousy), and lowest (xb:
best) value points, with function values fw, fl, and fb, respectively
Evaluate xa, the average of the points in the simplex excluding xw
Perform reflection to obtain xr , evaluate fr
if fr < fb then
Perform expansion to obtain xe , evaluate fe .
if fe < fb then
xw ← xe , fw ← fe (accept expansion)
else
xw ← xr , fw ← fr (accept reflection)
end if
else if fr ≤ fl then
xw ← xr , fw ← fr (accept reflected point)
else
if fr > fw then
Perform an inside contraction and evaluate fc
if fc < fw then
xw ← xc (accept contraction)
else
Shrink the simplex
end if
else
Perform an outside contraction and evaluate fc
if fc ≤ fr then
xw ← xc (accept contraction)
else
Shrink the simplex
end if
end if
end if
k ←k+1
until (fw − fb ) < (ε1 + ε2 |fb |)
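In practice a library implementation is often used; for instance, SciPy's minimize exposes a Nelder–Mead method. A minimal sketch on the Rosenbrock function (anticipating the example a few slides ahead):

import numpy as np
from scipy.optimize import minimize

rosen = lambda x: 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

res = minimize(rosen, x0=np.array([-1.2, 1.0]), method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-8})
print(res.x)  # -> approximately [1.0, 1.0]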

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 300 / 427


Gradient-Free Optimization Nelder–Mead Simplex

Alternative Convergence Criteria


I The criterion used above is based on the difference between the best and the
worst function value,
(fw − fb ) < (ε1 + ε2 |fb |)
I Alternatively, we can use the size of the simplex,

s = Σ_{i=1}^{n} |x_i − x_{n+1}|

which must be less than a certain tolerance.


I Another measure of convergence that can be used is the standard deviation,
σ = sqrt( Σ_{i=1}^{n+1} (f_i − f̄)² / (n + 1) )

where f¯ is the mean of the n + 1 function values.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 301 / 427


Gradient-Free Optimization Nelder–Mead Simplex

Variations of the Simplex Algorithm


I Since the simplex method is largely based on heuristics, the original method
has been the subject of many proposed changes . . .
I . . . but none of the proposed changes have replaced the original algorithm,
except for one:
I We notice that if fe < fb but fr is even better (i.e., fr < fe ) the algorithm
still accepts the expanded point xe . Now, it is standard practice to accept
the best of fr and fe

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 302 / 427


Gradient-Free Optimization Nelder–Mead Simplex

Example: Minimizing Rosenbrock with Nelder–Mead

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 303 / 427


Gradient-Free Optimization DIvided RECTangles (DIRECT)

DIvided RECTangles (DIRECT) Method 1


The DIRECT method uses a hyperdimensional adaptive meshing scheme to search
the whole design space for the optimum.
The overall idea behind DIRECT is as follows.
1. Scale the design box to an n-dimensional unit hypercube and evaluate the
objective function at the center point of the hypercube
2. Divide the potentially optimal hyper-rectangles by sampling along their
longest coordinate directions and trisecting, starting with the direction with
the smallest function value, until the global minimum is found
3. Sampling along the maximum-length directions prevents boxes from becoming
overly skewed, and trisecting in the direction of the best function value allows
the biggest rectangles to contain the best function value. This strategy increases
the attractiveness of searching near points with good function values
4. Iterating the above procedure allows the algorithm to identify and zoom into
the most promising regions of the design space

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 304 / 427


Gradient-Free Optimization DIvided RECTangles (DIRECT)

DIvided RECTangles (DIRECT) Method 2


[Figure: one DIRECT iteration — start, identify potentially optimal rectangles, sample & divide rectangles]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 305 / 427


Gradient-Free Optimization DIvided RECTangles (DIRECT)

DIvided RECTangles (DIRECT) Method 3


I To identify the potentially optimal rectangles, we consider the values of f
versus d for a given group of points.
I The line connecting the points with the lowest f for a given d (or the
greatest d for a given f ) represents the points with the most potential.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 306 / 427


Gradient-Free Optimization DIvided RECTangles (DIRECT)

DIvided RECTangles (DIRECT) Method 4


I Mathematically, assuming that the unit hypercube with center c_i is divided
into m hyper-rectangles, a hyper-rectangle j is potentially optimal if there
exists a rate-of-change constant K̄ > 0 such that

f(c_j) − K̄ d_j ≤ f(c_i) − K̄ d_i   for all i = 1, . . . , m        (4)
f(c_j) − K̄ d_j ≤ f_min − ε|f_min|,
where
I d is the distance between c and the vertices of the hyper-rectangle
I fmin is the best current value of the objective function
I ε is a positive parameter used so that f(c_j) improves on the current best
solution by a non-trivial amount
I The first equation forces the selection of the rectangles on this line.
I The second equation requires that the function value exceeds the current
best function value by an amount that is not insignificant.
I This prevents the algorithm from becoming too local, wasting precious
function evaluations in search of smaller function improvements.
I The parameter ε balances the search between local and global. A typical
value is ε = 10⁻⁴, and its range is usually such that 10⁻⁷ ≤ ε ≤ 10⁻².
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 307 / 427
Gradient-Free Optimization DIvided RECTangles (DIRECT)

DIRECT Algorithm

Input: Initial guess, x0


Output: Optimum, x∗
k←0
repeat
Normalize the search space to be the unit hypercube. Let c1 be the center
point of this hypercube and evaluate f (c1 ).
Identify the set S of potentially optimal rectangles/cubes, that is all those
rectangles defining the bottom of the convex hull of a scatter plot of rectangle
diameter versus f (ci ) for all rectangle centers ci
for all Rectangles r ∈ S do
Identify the set I of dimensions with the maximum side length
Set δ equal to one third of this maximum side length
for all i ∈ I do
Evaluate the rectangle/cube at the point cr ± δei for all i ∈ I, where
cr is the center of the rectangle r, and ei is the ith unit vector
end for
Divide the rectangle r into thirds along the dimensions in I, starting with
the dimension with the lowest value of f (c±δei ) and continuing to the dimension
with the highest f (c ± δei ).
end for
until Converged
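For reference, recent versions of SciPy (1.9 and later) ship a DIRECT implementation; a minimal sketch on the 2-D Rosenbrock function, assuming such a version is installed:

from scipy.optimize import Bounds, direct

rosen = lambda x: 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

res = direct(rosen, Bounds([-2.0, -2.0], [2.0, 2.0]), eps=1e-4, maxfun=20000)
print(res.x, res.fun)  # -> near [1.0, 1.0] and 0.0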

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 308 / 427


Gradient-Free Optimization DIvided RECTangles (DIRECT)

Example: Minimization of Rosenbrock with DIRECT

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 309 / 427


Gradient-Free Optimization Genetic Algorithms

Genetic Algorithms
I Genetic algorithms for optimization are inspired by the process of natural
evolution of organisms.
I First developed by John Holland in the mid-1960s. Holland was motivated
by a desire to better understand the evolution of life by simulating it in a
computer, and to use this process for optimization.
I Genetic algorithms are based on three essential components:
I Survival of the fittest — Selection
I Reproduction processes where genetic traits are propagated — Crossover
I Variation — Mutation
I We use the term “genetic algorithms” generically to refer to optimization
approaches that use the three components above.
I Depending on the approach they have different names, for example: genetic
algorithms, evolutionary computation, genetic programming, evolutionary
programming, evolutionary strategies.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 310 / 427


Gradient-Free Optimization Genetic Algorithms

Genetic Algorithm Nomenclature


We will start by posing the unconstrained optimization problem with design
variable bounds,

minimize f (x)
subject to xl ≤ x ≤ xu

where xl and xu are the vectors of lower and upper bounds on x, respectively.
In the context of genetic algorithms we will call each design variable vector x a
population member. The value of the objective function, f (x) is termed the
fitness.
Genetic algorithms are radically different from the gradient based methods we
have covered so far. Instead of looking at one point at a time and stepping to a
new point for each iteration, a whole population of solutions is iterated towards
the optimum at the same time. Using a population lets us explore multiple
“buckets” (local minima) simultaneously, increasing the likelihood of finding the
global optimum.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 311 / 427


Gradient-Free Optimization Genetic Algorithms

The Pros and Cons of Genetic Algorithms


I Pros:
I Uses a coding of the parameter set, not the parameters themselves, so the
algorithm can handle mixed continuous, integer, and discrete design variables.
I The population can cover a large range of the design space and is less likely
than gradient based methods to “get stuck” in local minima.
I As with other gradient-free methods, it can handle noisy and discontinuous
objective functions.
I The implementation is straightforward and easily parallelized.
I Can be used for multiobjective optimization.
I There is “no free lunch”, of course, and these methods have some cons: The
main one is that genetic algorithms are expensive when compared to
gradient-based methods, especially for problems with a large number of
design variables.
I However, it is sometimes difficult to make gradient-based methods work and
in some of these problems genetic algorithms work very well with little effort.
I Although genetic algorithms are much better than completely random
methods, they are still “brute force” methods that require a large number of
function evaluations.
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 312 / 427
Gradient-Free Optimization Genetic Algorithms

Single-Objective Optimization 1
The general procedure of a genetic algorithm can be described as follows:
1. Initialize a population: Each member of the population represents a design
point, x and has a value of the objective (fitness), and information about its
constraint violations associated with it.
2. Determine mating pool: Each population member is paired for reproduction
by using one of the following methods:
I Random selection
I Based on fitness: make the better members reproduce more often than the
others.
3. Generate offspring: To generate offspring we need a scheme for the crossover
operation. There are various schemes that one can use. When the design
variables are continuous, for example, one offspring can be found by
interpolating between the two parents and the other one can be extrapolated
in the direction of the fitter parent.
4. Mutation: Add some randomness in the offspring’s variables to maintain
diversity.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 313 / 427


Gradient-Free Optimization Genetic Algorithms

Single-Objective Optimization 2
5. Compute Offspring’s Fitness
Evaluate the value of the objective function and constraint violations for each
offspring.
6. Tournament
Again, there are different schemes that can be used in this step. One method
involves replacing the worst parent from each “family” with the best offspring.
7. Identify the Best Member
8. Return to step 2 unless converged or computational budget is exceeded.

I Convergence is difficult to determine because the best solution so far may be


maintained for many generations.
I Rule of thumb: if the best solution among the current population has not
changed (much) for about 10 generations, it can be assumed to be the
“optimum” for the problem.
I Since GAs are probabilistic methods, it is crucial to run the problem multiple
times when studying its characteristics.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 314 / 427


Gradient-Free Optimization Genetic Algorithms

Multi-Objective Optimization 1
I What if we want to investigate the trade-off between two (or more)
conflicting objectives?
I Examples . . .
I In this situation there is no one “best design” . . .
I . . . but there is a set of designs that are the best possible for that
combination of the two objectives.
I For these optimal solutions, the only way to improve one objective is to
worsen the other.
I Genetic algorithms can handle this problem with little modification: We
already evaluate a whole population, so we can use this to our advantage.
I Alternatively, we could use gradient-based optimization with one of two
strategies:
I Use a composite weighted function,

f = αf1 + (1 − α)f2

and do a sweep in α, performing an optimization for each value

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 315 / 427


Gradient-Free Optimization Genetic Algorithms

Multi-Objective Optimization 2
I Solve the problem

minimize f1
subject to f2 = fc

for different values of fc


I The choice of a genetic algorithm vs. gradient-based depends on the number
of design variables and the required precision in the result.

I The concept of dominance is the key to the use of GAs in multi-objective
optimization.
I Assume we have a population of 3 members, A, B and C, and that we want
to minimize two objective functions, f1 and f2 .
Member f1 f2
A 10 12
B 8 13
C 9 14

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 316 / 427


Gradient-Free Optimization Genetic Algorithms

Multi-Objective Optimization 3
I Comparing members A and B, we can see that A has a higher (worse) f1 than
B, but has a lower (better) f2 . Hence we cannot determine whether A is
better than B or vice versa.
I On the other hand, B is clearly a fitter member than C since both of B’s
objectives are lower. We say that B dominates C.
I Comparing A and C, once again we are unable to say that one is better than
the other.
I In summary:
I A is non-dominated by either B or C
I B is non-dominated by either A or C
I C is dominated by B but not by A
I The rank of a member is the number of members that dominate it plus one.
In this case the ranks of the three members are:

rank(A) = 1
rank(B) = 1
rank(C) = 2

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 317 / 427


Gradient-Free Optimization Genetic Algorithms

Multi-Objective Optimization 4
I In multi-objective optimization the rank is crucial in determining which
population members are the fittest.
I A solution of rank one is said to be Pareto optimal and the set of rank one
points for a given generation is called the Pareto set.
I As the number of generations increases, and the fitness of the population
improves, the size of the Pareto set grows.
I In the case above, the Pareto set includes A and B. The graphical
representation of a Pareto set is called a Pareto front.
I The procedure of a two-objective genetic algorithm is similar to the
single-objective one, with the following modifications:
I Instead of making decisions based on the objective function, we make
decisions based on rank (the lower the better)
I Instead of keeping track of the best member of population, we keep track of
all members with rank one, which should converge to the Pareto set
I One of the problems with this method is that there is no mechanism
“pushing” the Pareto front to a better one.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 318 / 427


Gradient-Free Optimization Genetic Algorithms

Example: Pareto Front in Aircraft Design

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 319 / 427


Gradient-Free Optimization Genetic Algorithms

Coding and Decoding of Variables


There are two main variants in genetic algorithms:
I Bit GAs: represent the design variables with bits.

I Real GAs: keep the design variables as real numbers.

I A bit GA represents each variable as a binary number.


I Suppose we have m bits available for each number.
I To represent a real-valued variable, we have to divide the feasible interval of
xi into 2m − 1 intervals.
I Then each possibility for xi can be represented by any combination of m bits.
I For m = 5, for example, the number of intervals would be 31 and a possible
representation for xi would be 10101, which can be decoded as

x_i = x_l + s_i (1×2⁴ + 0×2³ + 1×2² + 0×2¹ + 1×2⁰) = x_l + 21 s_i,

where s_i is the size of the interval for x_i, given by

s_i = (x_{u_i} − x_{l_i}) / 31.
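A minimal Python sketch of this decoding (the general case divides the interval into 2^m − 1 steps):

def decode(bits, xl, xu):
    # Decode a bit string such as '10101' into a real value in [xl, xu].
    m = len(bits)
    s = (xu - xl) / (2 ** m - 1)  # interval size
    return xl + s * int(bits, 2)  # int('10101', 2) == 21

print(decode("10101", 0.0, 31.0))  # -> 21.0 for these bounds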
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 320 / 427
Gradient-Free Optimization Genetic Algorithms

Creation of the Initial Population


I As a rule of thumb, the population size should be 15 to 20 times the
number of design variables.
I Using bit encoding, each bit is assigned a 50% chance of being either 1 or 0.
One way of doing this is to generate a random number 0 ≤ r ≤ 1 and set
the bit to 0 if r ≤ 0.5 and to 1 if r > 0.5.
I Each member is chosen at random. For a problem with real design variables
and a given variable x such that xl ≤ x ≤ xu , we could use,

x = xl + r(xu − xl )

where r is a random number such that 0 ≤ r ≤ 1.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 321 / 427


Gradient-Free Optimization Genetic Algorithms

Selection: Determining the Mating Pool 1


I Here we assume that we want to maximize f (x).
I Consider the highest (best) and the lowest (worst) values, fh and fl ,
respectively.
I The function values can be converted to a positive quantity by adding,

C = 0.1fh − 1.1fl

to each function value. Thus the new highest value will be 1.1(fh − fl ) and
the new lowest value 0.1(fh − fl ). The values are then normalized as follows,

f_i′ = (f_i + C) / D,

where

D = max(1, f_h + C).

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 322 / 427


Gradient-Free Optimization Genetic Algorithms

Selection: Determining the Mating Pool 2


After the fitness values are scaled, they are summed,

S = Σ_{i=1}^{N} f_i′,

where N is the number of members in the population.


I We now use roulette wheel selection to make copies of the fittest members
for reproduction.
I A mating pool of N members is created by turning the roulette wheel N
times.
I A random number 0 ≤ r ≤ 1 is generated at each turn. The j th member is
copied to the mating pool if

f_1′ + . . . + f_{j−1}′ ≤ rS ≤ f_1′ + . . . + f_j′

This ensures that the probability of a member being selected for reproduction
is proportional to its scaled fitness value.
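A minimal Python sketch of roulette wheel selection, assuming the scaled fitness values f′ are already available:

import numpy as np

def roulette_select(f_scaled, rng):
    # Return the index of one member, chosen with probability
    # proportional to its scaled fitness.
    r = rng.random()  # 0 <= r <= 1
    cumulative = np.cumsum(f_scaled)
    return int(np.searchsorted(cumulative, r * cumulative[-1]))

# Mating pool of N members: spin the wheel N times.
rng = np.random.default_rng(0)
f_scaled = np.array([0.1, 0.4, 0.3, 0.2])
pool = [roulette_select(f_scaled, rng) for _ in range(len(f_scaled))]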

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 323 / 427


Gradient-Free Optimization Genetic Algorithms

Crossover Operation — Real GA


I Various crossover strategies are possible in genetic algorithms.
I The following crossover strategy is one devised specifically for optimization
problems with real-valued design variables.
I Each member of the population corresponds to a point in n-space, that is, a
vector x
I Let two members of the population that have been mated (parents) be
ordered such that fp1 < fp2 . Two offspring are to be generated:
1. The midpoint between the two parents:
1
xc1 = (xp1 + xp2 )
2
2. A point extrapolated in the direction defined by the two parents, beyond the
better parent:
x_{c2} = 2x_{p1} − x_{p2}
I Then the tournament is performed by selecting the best parent (xp1 ) and
either the second parent or the best offspring, whichever is the best one of the
two.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 324 / 427


Gradient-Free Optimization Genetic Algorithms

Crossover Operation — Bit GA


When the information is stored as bits, the crossover operation involves the steps:
1. Generate a random integer 1 ≤ k ≤ m − 1 that defines the crossover point.
2. For one of the offspring, the first k bits are taken from parent 1 and the
remaining bits from parent 2.
3. For the second offspring, the first k bits are taken from parent 2 and the
remaining ones from parent 1.

Before crossover    After crossover
11|111              11|000
00|000              00|111
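A minimal Python sketch of this single-point crossover on bit strings:

import random

def crossover(parent1, parent2):
    # Single-point crossover: swap the tails after a random point k.
    m = len(parent1)
    k = random.randint(1, m - 1)  # crossover point, 1 <= k <= m - 1
    return parent1[:k] + parent2[k:], parent2[:k] + parent1[k:]

print(crossover("11111", "00000"))  # e.g. ('11000', '00111') for k = 2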

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 325 / 427


Gradient-Free Optimization Genetic Algorithms

Mutation
I Mutation is a random operation performed to change the genetic information.
I Mutation is needed because even though reproduction and crossover
effectively recombine existing information, occasionally some useful genetic
information might be lost.
I The mutation operation protects against such irrecoverable loss.
I It also introduces additional diversity into the population.
I When using bit representation, every bit is assigned a small mutation
probability, say p = 0.005–0.1. This is done by generating a random
number 0 ≤ r ≤ 1 for each bit, which is flipped if r < p.

Before mutation: 11111    After mutation: 11010
I The mutation of the real representation can be done in a variety of ways. A
simple way involves assigning a small probability that each design variable
changes by a random amount (within certain bounds). Another, more
sophisticated alternative consists of using a probability density function.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 326 / 427


Gradient-Free Optimization Genetic Algorithms

Why do genetic algorithms work?


A fundamental question that is still being researched is how the three main
operations (selection, crossover, and mutation) are able to find better solutions.
Two main mechanisms allow the algorithm to progress towards better solutions:
I Selection + Mutation = Improvement: Mutation makes local changes while
selection accepts better changes; this can be seen as a resilient and general
form of reducing the objective function.
I Selection + Crossover = Innovation: When the information of the best
population members is exchanged, there is a greater chance that a new, better
combination will emerge.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 327 / 427


Gradient-Free Optimization Genetic Algorithms

Jet Engine Design at General Electric 1

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 328 / 427


Gradient-Free Optimization Genetic Algorithms

Jet Engine Design at General Electric 2


I Genetic algorithm combined with expert system
I Find the most efficient shape for the fan blades in the GE90 jet engines
I 100 design variables
I Found 2% increase in efficiency as compared to previous engines
I Allowed the elimination of one stage of the engine’s compressor reducing
engine weight and manufacturing costs without any sacrifice in performance

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 329 / 427


Gradient-Free Optimization Genetic Algorithms

ST5 Antenna 1

I The antenna for the ST5 satellite system presented a challenging design
problem, requiring both a wide beam width for a circularly-polarized wave
and a wide bandwidth.
I Two teams were assigned the same design problem: one used a traditional
method, and the other used GAs.
I The GA team found an antenna configuration (ST5-3-10) that was slightly
more difficult to manufacture, but it:
I Used less power
I Removed two steps in design and fabrication

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 330 / 427


Gradient-Free Optimization Genetic Algorithms

ST5 Antenna 2
I Had more uniform coverage and a wider range of operational elevation angles
relative to the ground
I Took 3 person-months to design and fabricate the first prototype as compared
to 5 person-months for the conventionally designed antenna.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 331 / 427


Gradient-Free Optimization Particle Swarm Optimization

Particle Swarm Optimization (PSO) 1


I PSO is a stochastic, population-based computer algorithm developed in 1995
by James Kennedy (social-psychologist) and Russell Eberhart (electrical
engineer)
I PSO applies the concept of “swarm intelligence” to problem solving.
I “Swarm intelligence” is the property of a system whereby the collective
behaviors of (unsophisticated) agents interacting locally with their
environment cause coherent functional global patterns to emerge (e.g.
self-organization, emergent behavior).
I In other words: Dumb agents, properly connected into a swarm, yield
“smart” results.
I The basic idea of the PSO algorithm is:
I Each agent (or particle) represents a design point and moves in n-dimensional
space looking for the best solution.
I Each agent adjusts its movement according to the effects of “cognitivism”
(self experience) and “sociocognition” (social interaction).

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 332 / 427


Gradient-Free Optimization Particle Swarm Optimization

Particle Swarm Optimization (PSO) 2


I The update of particle i's position is given by:

x_{k+1}^i = x_k^i + v_{k+1}^i ∆t

where the velocity of the particle is given by

v_{k+1}^i = w v_k^i + c_1 r_1 (p_k^i − x_k^i)/∆t + c_2 r_2 (p_k^g − x_k^i)/∆t

I r1 and r2 are random numbers in the interval [0, 1]


I pik is particle i’s best position so far, pgk is the swarm’s best particle position at
iteration k
I c1 is the cognitive parameter (confidence in itself), c2 is the social parameter
(confidence in the swarm)
I w is the inertia

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 333 / 427


Gradient-Free Optimization Particle Swarm Optimization

How the swarm is updated 1

[Figure: particle i at position x_k^i with velocity v_k^i]
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 334 / 427


Gradient-Free Optimization Particle Swarm Optimization

How the swarm is updated 2

[Figure: the inertia contribution w v_k^i added to the particle's current velocity v_k^i]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 335 / 427


Gradient-Free Optimization Particle Swarm Optimization

How the swarm is updated 3

[Figure: cognitive learning — the particle is attracted toward its own best position p_k^i, in addition to the inertia term w v_k^i]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 336 / 427


Gradient-Free Optimization Particle Swarm Optimization

How the swarm is updated 4

[Figure: the cognitive term c_1 r_1 (p_k^i − x_k^i) combines with the inertia term to give the new velocity v_{k+1}^i and position x_{k+1}^i]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 337 / 427


Gradient-Free Optimization Particle Swarm Optimization

How the swarm is updated 5

[Figure: social learning — the term c_2 r_2 (p_k^g − x_k^i) attracts the particle toward the swarm's best position p_k^g; the inertia, cognitive, and social contributions sum to v_{k+1}^i]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 338 / 427


Gradient-Free Optimization Particle Swarm Optimization

PSO Algorithm
1. Initialize a set of particle positions x_0^i and velocities v_0^i, randomly
distributed throughout the design space bounded by specified limits
2. Evaluate the objective function values f(x_k^i) using the design space
positions x_k^i
3. Update each particle's best position so far, p_k^i, and the best position in
the complete history of the swarm, p_k^g
4. Update the position of each particle using its previous position and updated
velocity vector
5. Repeat steps 2–4 until the stopping criterion is met
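A minimal Python sketch of these five steps, taking ∆t = 1 (a common simplification); the default parameter values are illustrative:

import numpy as np

def pso(f, lb, ub, n_particles=20, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    # Basic particle swarm: positions x, velocities v, particle bests p,
    # and swarm best g, updated with the inertia/cognitive/social rule.
    rng = np.random.default_rng(seed)
    n = len(lb)
    x = lb + rng.random((n_particles, n)) * (ub - lb)
    v = np.zeros((n_particles, n))
    p = x.copy()
    fp = np.apply_along_axis(f, 1, x)
    g = p[np.argmin(fp)].copy()
    for _ in range(iters):
        r1 = rng.random((n_particles, n))
        r2 = rng.random((n_particles, n))
        v = w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lb, ub)  # keep particles inside the bounds
        fx = np.apply_along_axis(f, 1, x)
        better = fx < fp
        p[better], fp[better] = x[better], fx[better]
        g = p[np.argmin(fp)].copy()
    return g, fp.min()

# Example: 2-D sphere function.
xopt, fopt = pso(lambda x: np.sum(x ** 2),
                 np.array([-5.0, -5.0]), np.array([5.0, 5.0]))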

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 339 / 427


Gradient-Free Optimization Particle Swarm Optimization

PSO Characteristics
Compared to other global optimization approaches:
I Simple algorithm, extremely easy to implement.
I Still a population-based algorithm; however, it works well with few particles
(10 to 40 are usual) and there is no such thing as “generations”
I Unlike evolutionary approaches, design variables are directly updated, there
are no chromosomes, survival of the fittest, selection or crossover operations.
I Global and local search behavior can be directly “adjusted” as desired using
the cognitive c1 and social c2 parameters.
I Convergence “balance” is achieved through the inertial weight factor w

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 340 / 427


Gradient-Free Optimization Particle Swarm Optimization

Analysis of PSO 1
I If we substitute the velocity update equation into the position update, the
following expression is obtained:

x_{k+1}^i = x_k^i + [w v_k^i + c_1 r_1 (p_k^i − x_k^i)/∆t + c_2 r_2 (p_k^g − x_k^i)/∆t] ∆t

I Factorizing the cognitive and social terms:

x_{k+1}^i = x̂_k^i + α_k (p̂_k − x_k^i)

where x̂_k^i = x_k^i + w v_k^i ∆t is the inertia-propagated point,
α_k = c_1 r_1 + c_2 r_2, and p̂_k = (c_1 r_1 p_k^i + c_2 r_2 p_k^g)/(c_1 r_1 + c_2 r_2).

I So the behavior of each particle can be viewed as a line search with a
stochastic step size and search direction.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 341 / 427


Gradient-Free Optimization Particle Swarm Optimization

Analysis of PSO 2
I Re-arranging the position and velocity terms in the above equation we have:

x_{k+1}^i = x_k^i (1 − c_1 r_1 − c_2 r_2) + w v_k^i ∆t + c_1 r_1 p_k^i + c_2 r_2 p_k^g
v_{k+1}^i = −x_k^i (c_1 r_1 + c_2 r_2)/∆t + w v_k^i + c_1 r_1 p_k^i/∆t + c_2 r_2 p_k^g/∆t

I . . . which can be combined and written in matrix form as:

[ x_{k+1}^i ]   [ 1 − c_1 r_1 − c_2 r_2     w ∆t ] [ x_k^i ]   [ c_1 r_1      c_2 r_2    ] [ p_k^i ]
[ v_{k+1}^i ] = [ −(c_1 r_1 + c_2 r_2)/∆t   w    ] [ v_k^i ] + [ c_1 r_1/∆t   c_2 r_2/∆t ] [ p_k^g ]

The above can be seen as a discrete dynamic system, from which we can find
stability criteria.

I Assuming constant external inputs, the system reduces to:

[ 0 ]   [ −(c_1 r_1 + c_2 r_2)       w ∆t  ] [ x_k^i ]   [ c_1 r_1      c_2 r_2    ] [ p_k^i ]
[ 0 ] = [ −(c_1 r_1 + c_2 r_2)/∆t    w − 1 ] [ v_k^i ] + [ c_1 r_1/∆t   c_2 r_2/∆t ] [ p_k^g ]

which holds only when v_k^i = 0 and x_k^i = p_k^i = p_k^g (the equilibrium
point).
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 342 / 427
Gradient-Free Optimization Particle Swarm Optimization

Analysis of PSO 3
I The eigenvalues of the dynamic system are:

λ2 − (w − c1 r1 − c2 r2 + 1) λ + w = 0

I Hence, stability of the PSO dynamic system is guaranteed if
|λ_{i=1,...,n}| < 1, which leads to:

0 < c_1 + c_2 < 4
(c_1 + c_2)/2 − 1 < w < 1

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 343 / 427


Gradient-Free Optimization Particle Swarm Optimization

Effect of varying c1 and c2

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 344 / 427


Gradient-Free Optimization Particle Swarm Optimization

Effect of varying the inertia

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 345 / 427


Gradient-Free Optimization Particle Swarm Optimization

PSO Issues and Improvements


Several issues with PSO have been identified:
I Inertia weight updates can be problematic
I Original PSO does not handle constraints

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 346 / 427


Gradient-Free Optimization Particle Swarm Optimization

Updating the inertia weight 1


I As k → ∞, particles “cluster” towards the “global” optimum.
I A fixed inertia makes the particles overshoot the best regions (too much
momentum).
I A better way of controlling the global search is to dynamically update the
inertia weight.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 347 / 427


Gradient-Free Optimization Particle Swarm Optimization

Updating the inertia weight 2


[Figure: structural weight (lbs) vs. iteration for fixed, linearly varying, and dynamically varying inertia weights]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 348 / 427


Gradient-Free Optimization Particle Swarm Optimization

Violated design points redirection 1


We can restrict the velocity vector of a particle that violates constraints to a
usable feasible direction:

v_{k+1}^i = c_1 r_1 (p_k^i − x_k^i)/∆t + c_2 r_2 (p_k^g − x_k^i)/∆t

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 349 / 427


Gradient-Free Optimization Particle Swarm Optimization

Violated design points redirection 2

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 350 / 427


Gradient-Free Optimization Particle Swarm Optimization

Constraint Handling 1
The basic PSO algorithm is an unconstrained optimizer, to include constraints we
can use:
I Penalty methods

I Augmented Lagrangian function

I Recall the Lagrangian function:

L_i(x_k^i, λ^i) = f(x_k^i) + Σ_{j=1}^{m} λ_j^i g_j(x_k^i)

I The augmented Lagrangian function is:

L_i(x_k^i, λ^i, r_p^i) = f(x_k^i) + Σ_{j=1}^{m} λ_j^i θ_j(x_k^i) + Σ_{j=1}^{m} r_{p,j} θ_j²(x_k^i)

where:

θ_j(x_k^i) = max[ g_j(x_k^i), −λ_j/(2 r_{p,j}) ]
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 351 / 427
Gradient-Free Optimization Particle Swarm Optimization

Constraint Handling 2
I Multipliers and penalty factors that lead to the optimum are unknown and
problem dependent.
I A sequence of unconstrained minimizations of the augmented Lagrangian
function are required to obtain a solution.
I Multiplier update:

λ_j^i|_{v+1} = λ_j^i|_v + 2 r_{p,j}|_v θ_j(x_v^i)

I Penalty factor update (penalizes infeasible movements):

r_{p,j}|_{v+1} = 2 r_{p,j}|_v       if g_j(x_v^i) > g_j(x_{v−1}^i) ∧ g_j(x_v^i) > ε_g
                 (1/2) r_{p,j}|_v   if g_j(x_v^i) ≤ ε_g
                 r_{p,j}|_v         otherwise

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 352 / 427


Gradient-Free Optimization Particle Swarm Optimization

Augmented Lagrangian PSO Algorithm


1. Initialize a set of particles positions xio and velocities voi randomly distributed
throughout the design space bounded by specified limits.

2. Initialize the Lagrange multipliers and penalty factors, e.g., λ_j^i|_0 = 0,
r_{p,j}|_0 = r_0.
3. Evaluate the objective function values using the initial design space positions.
4. Solve the unconstrained optimization problem (the augmented Lagrange
multiplier equation) using the basic PSO algorithm for kmax iterations.
5. Update the Lagrange multipliers and penalty factors.
6. Repeat steps 4–5 until a stopping criterion is met.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 353 / 427


Gradient-Free Optimization Particle Swarm Optimization

Example: Minimizing the Griewank Function


So how do the different gradient-free methods compare? A simple (but
challenging!) numerical example is the Griewank function for n = 100,

f(x) = Σ_{i=1}^{n} x_i²/4000 − Π_{i=1}^{n} cos(x_i/√i) + 1
−600 ≤ x_i ≤ 600

Optimizer Evaluations Global optimum? Objective CPU time (s)


PSO (pop 40) 12,001 Yes 6.33e-07 15.9
GA (pop 250) 51,000 No 86.84 86.8438
DIRECT 649,522 Yes 1.47271e-011 321.57
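For reference, a minimal Python sketch of the Griewank function as defined above:

import numpy as np

def griewank(x):
    # Global minimum f = 0 at x = 0 for any dimension n.
    i = np.arange(1, len(x) + 1)
    return np.sum(x ** 2) / 4000.0 - np.prod(np.cos(x / np.sqrt(i))) + 1.0

print(griewank(np.zeros(100)))  # -> 0.0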

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 354 / 427


Gradient-Free Optimization Particle Swarm Optimization

Example: Gradient-based vs. Gradient-Free

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 355 / 427


Multidisciplinary Design Optimization

Multidisciplinary Design Optimization


1. Introduction

2. Line Search Techniques

3. Gradient-Based Optimization

4. Computing Derivatives

5. Constrained Optimization

6. Gradient-Free Optimization

7. Multidisciplinary Design Optimization


7.1 Introduction
7.2 Multidisciplinary Analysis
7.3 Extended Design Structure Matrix
7.4 Monolithic Architectures
7.5 Distributed Architectures
7.6 Computing Coupled Derivatives
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 356 / 427
Multidisciplinary Design Optimization Introduction

Introduction 1
I In the last few decades, numerical models that predict the performance of
engineering systems have been developed, and many of these models are now
mature areas of research. For example . . .
I Once engineers can predict the effect that changes in the design have on the
performance of a system, the next logical question is what changes in the
design produced optimal performance. The application of the numerical
optimization techniques described in the preceding chapters address this
question.
I Single-discipline optimization is in some cases quite mature, but the design
and optimization of systems that involve more than one discipline is still in its
infancy.
I When systems are composed of multiple systems, additional issues arise in
both the analysis and design optimization.
I MDO researchers think that industry does not adopt MDO more widely
because practitioners do not realize its utility.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 357 / 427


Multidisciplinary Design Optimization Introduction

Introduction 2
I Industry thinks that researchers are not presenting anything new, since
industry has already been doing multidisciplinary design.
I There is some truth to each of these perspectives . . .
I Real-world aerospace design problems may involve thousands of variables and
hundreds of analyses and engineers, and it is often difficult to apply the
numerical optimization techniques and solve the mathematically correct
optimization problems.
I The kinds of problems in industry are often of much larger scale, involve
much uncertainty, and include human decisions in the loop, making them
difficult to solve with traditional numerical optimization techniques.
I On the other hand, a better understanding of MDO by engineers in industry
is now contributing to more widespread use in practical design.
Why MDO?
I Parametric trade studies are subject to the “curse of dimensionality”.
I Iterated procedures for which convergence is not guaranteed.
I Sequential optimization that does not lead to the true optimum of the system
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 358 / 427
Multidisciplinary Design Optimization Introduction

Introduction 3
Objectives of MDO:
I Avoid difficulties associated with sequential design or partial optimization.
I Provide more efficient and robust convergence than by simple iteration.
I Aid in the management of the design process.
Difficulties of MDO:
I Communication and translation
I Time
I Scheduling and planning
I Implementation

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 359 / 427


Multidisciplinary Design Optimization Introduction

Typical Aircraft Company Organization

Personnel hierarchy
Design process

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 360 / 427


Multidisciplinary Design Optimization Introduction

MDO Architectures
I MDO focuses on the development of strategies that use numerical analyses
and optimization techniques to enable the automation of the design process
of a multidisciplinary system.
I The big challenge: make such a strategy scalable and practical.
I An MDO architecture is a particular strategy for organizing the analysis
software, optimization software, and optimization subproblem statements to
achieve an optimal design.
I Other terms are used: “method”, “methodology”, “problem formulation”,
“strategy”, “procedure” and “algorithm”.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 361 / 427


Multidisciplinary Design Optimization Introduction

Nomenclature and Mathematical Notation 1

Symbol Definition
x Vector of design variables
yt Vector of coupling variable targets (inputs to a discipline analysis)
y Vector of coupling variable responses (outputs from a discipline analysis)
ȳ Vector of state variables (variables used inside only one discipline analysis)
f Objective function
c Vector of design constraints
cc Vector of consistency constraints
R Governing equations of a discipline analysis in residual form
N Number of disciplines
n() Length of given variable vector
m() Length of given constraint vector
()0 Functions or variables that are shared by more than one discipline
()i Functions or variables that apply only to discipline i
()∗ Functions or variables at their optimal value
˜
() Approximation of a given function or vector of functions
ˆ
() Duplicates of certain variable sets distributed to other disciplines
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 362 / 427
Multidisciplinary Design Optimization Introduction

Nomenclature and Mathematical Notation 2


I In MDO, we make the distinction between:
I Local design variables xi — directly affect only one discipline
I Shared design variables x0 — directly affect more than one discipline.
I Full vector of design variables x = [x_0^T, x_1^T, . . . , x_N^T]^T
I A discipline analysis solves a system of equations that computes the state
variables. Examples?
I In many formulations, independent copies of the coupling variables must be
made to allow discipline analyses to run independently and in parallel.
I These copies are also known as target variables, which we denote by a
superscript t.
I To preserve consistency between the coupling variable inputs and outputs at
the optimal solution, we define consistency constraints

c_i^c = y_i^t − y_i

which we add to the optimization problem formulation.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 363 / 427


Multidisciplinary Design Optimization Introduction

Example: Aerostructural Problem Definition 1


I Common example used throughout this chapter to illustrate the notation and
MDO architectures.
I Suppose we want to design the wing of a business jet using low-fidelity
analysis tools.
I Model the aerodynamics using a panel method
I Model the structure as a single beam using finite elements

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 364 / 427


Multidisciplinary Design Optimization Introduction

Example: Aerostructural Problem Definition 2


[Figure: aerostructural wing model — Wi = 15961.4 lbs, Ws = 10442.6 lbs, α = 2.37°, Λ = 30°, CL = 0.13225, CD = 0.014797, L/D = 8.9376; planform view in the x–y plane and deflected beam in the y–z plane]

I Aerodynamic inputs: angle-of-attack (α), wing twist distribution (γi )


I Aerodynamic outputs: lift (L) and the induced drag (D).

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 365 / 427


Multidisciplinary Design Optimization Introduction

Example: Aerostructural Problem Definition 3


I Structural inputs: thicknesses of the beam (ti )
I Structural output: beam weight, which is added to a fixed weight to obtain
the total weight (W ), and the maximum stresses in each finite-element (σi ).
I In this example, we want to maximize the range of the aircraft, as given by
the Breguet range equation,
 
V L Wi
f = Range = ln .
c D Wf
I The multidisciplinary analysis consists in the simultaneous solution of the
following equations:

R1 = 0 ⇒ AΓ − v(u, α) = 0
R2 = 0 ⇒ Ku − F (Γ) = 0
R3 = 0 ⇒ L(Γ) − W = 0

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 366 / 427


Multidisciplinary Design Optimization Introduction

Example: Aerostructural Problem Definition 4


I The complete state vector is

y = [y_1, y_2, y_3]^T = [Γ, u, α]^T.

I The angle of attack is considered a state variable here, and helps satisfy
L = W.
I The design variables are the wing sweep (Λ), the structural thicknesses (t),
and the twist distribution (γ):

x_0 = Λ,    x = [t^T, γ^T]^T,

I Sweep is a shared variable because changing the sweep has a direct effect on
both the aerodynamic influence matrix and the stiffness matrix.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 367 / 427


Multidisciplinary Design Optimization Introduction

Example: Aerostructural Problem Definition 5


I The other two sets of design variables are local to the structures and
aerodynamics, respectively.
I In later examples, we will see the options we have to optimize the wing in
this example.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 368 / 427


Multidisciplinary Design Optimization Multidisciplinary Analysis

Multidisciplinary Analysis 1
I To find the coupled state of a multidisciplinary system we need to perform a
multidisciplinary analysis — MDA.
I This is often done by repeating each disciplinary analysis until y_i^t = y_i
for all i.

Input: Design variables x


Output: Coupling variables, y
0: Initiate MDA iteration loop
repeat
1: Evaluate Analysis 1 and update y1 (y2 , y3 )
2: Evaluate Analysis 2 and update y2 (y1 , y3 )
3: Evaluate Analysis 3 and update y3 (y1 , y2 )
until 4 → 1: MDA has converged
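A minimal Python sketch of this fixed-point loop for a made-up two-discipline system (the interface is illustrative):

import numpy as np

def gauss_seidel_mda(analyses, y0, tol=1e-10, max_iter=100):
    # Each entry of `analyses` maps the current coupling vector y to the
    # updated output of that discipline; updates use the latest values.
    y = [float(yi) for yi in y0]
    for _ in range(max_iter):
        y_old = list(y)
        for i, analysis in enumerate(analyses):
            y[i] = analysis(y)
        if max(abs(a - b) for a, b in zip(y, y_old)) < tol:
            break
    return y

# Toy system: y1 = 0.5*y2 + 1 and y2 = 0.3*y1 + 2; converges to the
# coupled state y1 ≈ 2.3529, y2 ≈ 2.7059.
y = gauss_seidel_mda([lambda y: 0.5 * y[1] + 1.0,
                      lambda y: 0.3 * y[0] + 2.0], [0.0, 0.0])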

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 369 / 427


Multidisciplinary Design Optimization Multidisciplinary Analysis

Multidisciplinary Analysis 2
I The design structure matrix (DSM) was originally developed to visualize the
interconnections between the various components of a system.
[Figure: design structure matrix for an aircraft design process with components Optimization, Aerodynamics, Atmosphere, Economics, Emissions, Loads, Noise, Performance, Sizing, Weight, Structures, Mission, Reliability, Propulsion, and System — original ordering (left) and improved ordering (right)]


I Fixed-point iterations, such as the Gauss–Seidel algorithm above, converge
slowly and sometimes do not converge at all.
I One way to improve convergence is to reorder the sequence of disciplines and
possibly add inner loops for more tightly coupled clusters.

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 370 / 427


Multidisciplinary Design Optimization Extended Design Structure Matrix

Extended Design Structure Matrix (XDSM) Diagrams 1


I DSMs are somewhat ambiguous as to what the connections are: data or
process flow?
I Numerous diagrams can be found in the literature that describe MDO
architectures and other computational procedures.
I We wanted to develop a new type of diagram that could:
I Show both process flow and data dependencies in the same diagram
I Show complex procedures with multiple loops and parallel processes in a
compact manner
I The result was the extended design structure matrix, or XDSM
I We will use XDSM throughout this chapter to explain all the MDO
architectures

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 371 / 427


Multidisciplinary Design Optimization Extended Design Structure Matrix

Block Gauss–Seidel Iteration

[XDSM diagram: block Gauss–Seidel MDA — the MDA driver (steps 0, 4→1) passes the coupling targets y_2^t, y_3^t to Analysis 1; Analyses 1–3 execute in sequence, each receiving the shared and local design variables (x_0, x_i) and the latest coupling variables y_j, and returning its own y_i]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 372 / 427


Multidisciplinary Design Optimization Extended Design Structure Matrix

Gradient-Based Optimization

[XDSM diagram: gradient-based optimization — starting from x(0), the optimizer (steps 0, 2→1) sends x to the Objective, Constraints, and Gradients components, which return f, c, and df/dx, dc/dx; the loop ends with the optimum x*]

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 373 / 427


Multidisciplinary Design Optimization Extended Design Structure Matrix

Example: Aerostructural Optimization — Sequential


Design vs. MDO 1
I One commonly used approach to design is sequential “optimization”, which
consists of optimizing each discipline in sequence:
1. For example, we could start by optimizing the aerodynamics,

minimize D (α, γi )
w.r.t. α, γi
s.t. L (α, γi ) = W

2. Once the aerodynamic optimization has converged, the twist distribution and
the forces are fixed
3. Then we optimize the structure by minimizing weight subject to stress
constraints at the maneuver condition, i.e.,

minimize W (ti )
w.r.t. ti
s.t. σj (ti ) ≤ σyield

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 374 / 427


Multidisciplinary Design Optimization Extended Design Structure Matrix

Example: Aerostructural Optimization — Sequential


Design vs. MDO 2
4. Repeat until this sequence has converged.
[XDSM diagram: sequential optimization — an outer iterator alternates between an aerodynamic optimization (maximizing L/D with respect to γ, calling the Aerodynamics analysis) and a structural optimization (minimizing W with respect to t, calling the Structures analysis)]
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 375 / 427


Multidisciplinary Design Optimization Extended Design Structure Matrix

Example: Aerostructural Optimization — Sequential


Design vs. MDO 3
I The MDO procedure differs from the sequential approach in that it considers
all variables simultaneously

maximize   Range (α, γ_i, t_i)
w.r.t.     α, γ_i, t_i
s.t.       σ_yield − σ_j(t_i) ≥ 0
           L(α, γ_i) − W = 0

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 376 / 427


Multidisciplinary Design Optimization Extended Design Structure Matrix

Example: Aerostructural Optimization — Sequential


Design vs. MDO 4
[XDSM diagram: aerostructural MDO — the optimizer drives Λ, γ, and t; an inner MDA loop couples the Aerodynamics and Structures analyses through the coupling variables w and u; the Functions component returns R and σ to the optimizer]
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 377 / 427


Multidisciplinary Design Optimization Extended Design Structure Matrix

Example: Aerostructural Optimization — Sequential


Design vs. MDO 5
[Figure: sequential vs. MDF aerostructural optima — spanwise distributions of twist (jig and deflected), structural thickness, and lift (with an elliptical reference distribution)]
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 378 / 427
Multidisciplinary Design Optimization Monolithic Architectures

Monolithic Architectures
I Monolithic architectures solve the MDO problem by casting it as single
optimization problem.
I Distributed architectures, on the other hand, decompose the overall problem
into smaller ones.
I Monolithic architectures include:
I Multidisciplinary Feasible — MDF
I Individual Discipline Feasible — IDF
I Simultaneous Analysis and Design — SAND
I All-At-Once — AAO

J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 379 / 427


Multidisciplinary Design Optimization Monolithic Architectures

Multidisciplinary Feasible (MDF) 1


I The MDF architecture is the most intuitive for engineers.
I The optimization problem formulation is identical to the single-discipline
case, except that the disciplinary analysis is replaced by an MDA:

minimize         f_0(x, y(x, y))
with respect to  x
subject to       c_0(x, y(x, y)) ≥ 0
                 c_i(x_0, x_i, y_i(x_0, x_i, y_{j≠i})) ≥ 0   for i = 1, . . . , N.


Multidisciplinary Feasible (MDF) 2


[XDSM diagram of MDF with a three-discipline Gauss–Seidel MDA: the optimizer (steps 0, 7→1) passes the design variables to the MDA (steps 1, 5→2), which cycles through Analyses 1–3 until the coupling variables y1, y2, y3 converge; the Functions block (step 6) returns f and c to the optimizer]
Multidisciplinary Feasible (MDF) 3


I Advantages:
I Optimization problem is as small as it can be for a monolithic architecture
I Always returns a system design that satisfies the consistency constraints, even
if the optimization process is terminated early — good from the practical
engineering point of view
I Disadvantages:
I Intermediate results do not necessarily satisfy the optimization constraints
I Developing the MDA procedure might be time consuming, if not already in
place
I Gradients of the coupled system are more challenging to compute (more in a later section)

Example: Aerostructural Optimization with MDF

minimize −R
w.r.t. Λ, γ, t
s.t. σyield − σi (u) ≥ 0

where the aerostructural analysis is as before:

AΓ − v(u, α) = 0
K(t, Λ)u − F (Γ) = 0
L(Γ) − W (t) = 0

Individual Discipline Feasible (IDF) 1


The IDF architecture decouples the MDA, adding consistency constraints, and
giving the optimizer control of the coupling variables.

minimize f0(x, y(x, y^t))
with respect to x, y^t
subject to c0(x, y(x, y^t)) ≥ 0
ci(x0, xi, yi(x0, xi, y^t_{j≠i})) ≥ 0 for i = 1, . . . , N
cci = yi^t − yi(x0, xi, y^t_{j≠i}) = 0 for i = 1, . . . , N.

I Advantages:
I Optimizer typically converges the multidisciplinary feasibility better than
fixed-point MDA iterations
I Disadvantages:
I Problem is potentially much larger than MDF, depending on the number of
coupling variables
I Gradient computation can be costly
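A minimal IDF sketch of the same toy problem: the coupling targets become optimization variables, each discipline is evaluated once per iteration with the targets as inputs, and consistency is enforced by equality constraints (all models invented for illustration):

```python
from scipy.optimize import minimize

def aero(x, u_t):    # evaluated with the *target* displacements, no MDA needed
    return 1.0 / (1.0 + x[0]**2) + 0.1 * u_t

def struct(x, f_t):  # evaluated with the *target* loads
    return 0.5 * f_t / (1.0 + x[1]**2)

# z = [x0, x1, f_t, u_t]: the coupling targets join the design variables
objective = lambda z: -aero(z[:2], z[3]) + 0.2 * struct(z[:2], z[2])
cons = [
    {"type": "eq", "fun": lambda z: z[2] - aero(z[:2], z[3])},    # cc1: f_t - f = 0
    {"type": "eq", "fun": lambda z: z[3] - struct(z[:2], z[2])},  # cc2: u_t - u = 0
]
res = minimize(objective, [1.0, 1.0, 1.0, 0.0], method="SLSQP", constraints=cons)
# The disciplines are mutually consistent only once the optimizer converges.
```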

Individual Discipline Feasible (IDF) 2


I The large problem size can be mitigated to some extent by careful selection
of the disciplinary variable partitions or aggregation of the coupling variables
to reduce information transfer between disciplines.

[XDSM diagram of IDF: the optimizer (steps 0, 3→1) passes x and the coupling targets y^t to the disciplinary Analyses i (step 1, in parallel) and to the Functions block (step 2), which returns f, c, and the consistency constraints cc]

Example: Aerostructural Optimization Using IDF

minimize −R
w.r.t. Λ, γ, t, Γ^t, α^t, u^t
s.t. σyield − σi ≥ 0
Γ^t − Γ = 0
α^t − α = 0
u^t − u = 0

Simultaneous Analysis and Design (SAND) 1


I SAND makes no distinction between disciplines, and can also be applied to
single discipline problems.
I The governing equations are constraints at the optimizer level.

minimize f0 (x, y)
with respect to x, y, ȳ
subject to c0 (x, y) ≥ 0
ci (x0 , xi , yi ) ≥ 0 for i = 1, . . . , N
Ri (x0 , xi , y, ȳi ) = 0 for i = 1, . . . , N.

I Advantages:
I If implemented well, can be the most efficient architecture
I Disadvantages:
I Intermediate results do not even satisfy the governing equations
I Difficult or impossible to implement for “black-box” components
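A minimal SAND sketch of the same toy problem, assuming each discipline exposes its residual instead of a solved output; the optimizer carries the states and drives the residuals to zero as equality constraints:

```python
from scipy.optimize import minimize

# Residual forms R_i(x, y) = 0 of the same two hypothetical disciplines.
R1 = lambda z: z[2] - (1.0 / (1.0 + z[0]**2) + 0.1 * z[3])  # "aero" residual
R2 = lambda z: z[3] - 0.5 * z[2] / (1.0 + z[1]**2)          # "structural" residual

# z = [x0, x1, y1, y2]: the states are optimization variables; no solver is called.
objective = lambda z: -z[2] + 0.2 * z[3]
cons = [{"type": "eq", "fun": R1}, {"type": "eq", "fun": R2}]
res = minimize(objective, [1.0, 1.0, 1.0, 0.0], method="SLSQP", constraints=cons)
# Intermediate iterates satisfy neither governing equation; only the optimum does.
```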

Simultaneous Analysis and Design (SAND) 2


[XDSM diagram of SAND: the optimizer (steps 0, 2→1) passes x and y to the Functions block (which returns f and c) and x0, xi, y, ȳi to each Residual block (which returns Ri); no analysis solves are performed]

Aerostructural Optimization Using SAND 1

minimize −R
w.r.t. Λ, γ, t, Γ, α, u
s.t. σyield − σi (u) ≥ 0
AΓ = v(u, α)
K(t)u = f (Γ)
L(Γ) − W (t) = 0

The All-at-Once (AAO) Problem Statement 1


I AAO is not strictly an architecture, as it is not practical to solve a problem of
this form: the consistency constraints are linear and can be eliminated,
leading to SAND.
I The name is used somewhat inconsistently in the literature
I We present AAO for completeness, and to relate this to the other monolithic
architectures.
minimize f0(x, y) + Σᵢ₌₁ᴺ fi(x0, xi, yi)
with respect to x, y^t, y, ȳ
subject to c0(x, y) ≥ 0
ci(x0, xi, yi) ≥ 0 for i = 1, . . . , N
cci = yi^t − yi = 0 for i = 1, . . . , N
Ri(x0, xi, y^t_{j≠i}, ȳi, yi) = 0 for i = 1, . . . , N.

The All-at-Once (AAO) Problem Statement 2


I As we can see, it includes all the constraints that other monolithic
architectures eliminated.

[XDSM diagram of AAO: the optimizer (steps 0, 2→1) passes x, y, y^t to the Functions block (which returns f, c, cc) and x0, xi, yi, y^t_{j≠i}, ȳi to each Residual block (which returns Ri)]

The All-at-Once (AAO) Problem Statement 3

The monolithic architectures differ in which constraints and variables are eliminated:
I AAO → SAND: remove the consistency constraints cc and the targets y^t
I AAO → IDF: remove the residuals R and the variables y, ȳ
I SAND → MDF: remove the residuals R and the variables y, ȳ
I IDF → MDF: remove the consistency constraints cc and the targets y^t
Distributed Architectures
I Monolithic MDO architectures solve a single optimization problem
I Distributed MDO architectures decompose the original problem into multiple
optimization problems
I Some problems have a special structure and can be efficiently decomposed,
but that is usually not the case
I In reality, the primary motivation for decomposing the MDO problem comes
from the structure of the engineering design environment
I Typical industrial practice involves breaking up the design of a large system
and distributing aspects of that design to specific engineering groups.
I These groups may be geographically distributed and may only communicate
infrequently.
I In addition, these groups typically like to retain control of their own design
procedures and make use of in-house expertise

Classification of MDO Architectures

I Monolithic: AAO, SAND, IDF, MDF
I Distributed IDF:
  I Penalty: ECO, ATC, IPD/EPD
  I Multilevel: BLISS-2000, QSD, CO
I Distributed MDF: CSSO, MDOIS, BLISS, ASO
Concurrent Subspace Optimization (CSSO) 1


The CSSO system subproblem is given by

minimize f0(x, ỹ(x, ỹ))
with respect to x
subject to c0(x, ỹ(x, ỹ)) ≥ 0
ci(x0, xi, ỹi(x0, xi, ỹ_{j≠i})) ≥ 0 for i = 1, . . . , N

and the discipline i subproblem is given by

minimize f0(x, yi(xi, ỹ_{j≠i}), ỹ_{j≠i})
with respect to x0, xi
subject to c0(x, ỹ(x, ỹ)) ≥ 0
ci(x0, xi, yi(x0, xi, ỹ_{j≠i})) ≥ 0
cj(x0, ỹj(x0, ỹ)) ≥ 0 for j = 1, . . . , N, j ≠ i.

Concurrent Subspace Optimization (CSSO) 2


[XDSM diagram of CSSO: an outer convergence check (steps 0, 25→1) drives an initial DOE with exact MDAs to build the disciplinary surrogate models; parallel disciplinary optimizations that combine the exact local analysis with surrogates of the other disciplines; a second DOE at the subproblem solutions to update the surrogates; and a system-level optimization that runs entirely on the surrogate-based MDA]
CSSO Algorithm
Input: Initial design variables x
Output: Optimal variables x∗ , objective function f ∗ , and constraint values c∗
0: Initiate main CSSO iteration
repeat
1: Initiate a design of experiments (DOE) to generate design points
for Each DOE point do
2: Initiate an MDA that uses exact disciplinary information
repeat
3: Evaluate discipline analyses
4: Update coupling variables y
until 4 → 3: MDA has converged
5: Update the disciplinary surrogate models with the latest design
end for 6 → 2
7: Initiate independent disciplinary optimizations (in parallel)
for Each discipline i do
repeat
8: Initiate an MDA with exact coupling variables for discipline i and
approximate coupling variables for the other disciplines
repeat
9: Evaluate discipline i outputs yi , and surrogate models for the
other disciplines, ỹj6=i
until 10 → 9: MDA has converged
11: Compute objective f0 and constraint functions c using current
data
until 12 → 8: Disciplinary optimization i has converged
end for
13: Initiate a DOE that uses the subproblem solutions as sample points
for Each subproblem solution i do
14: Initiate an MDA that uses exact disciplinary information
repeat
15: Evaluate discipline analyses.
until 16 → 15: MDA has converged
17: Update the disciplinary surrogate models with the newest design
end for 18 → 14
19: Initiate system-level optimization
repeat
20: Initiate an MDA that uses only surrogate model information
repeat
21: Evaluate disciplinary surrogate models
until 22 → 21: MDA has converged
23: Compute objective f0 , and constraint function values c
until 24 → 20: System level problem has converged
until 25 → 1: CSSO has converged

Collaborative Optimization (CO) 1


The CO2 system subproblem is given by:

minimize f0(x0, x̂1, . . . , x̂N, y^t)
with respect to x0, x̂1, . . . , x̂N, y^t
subject to c0(x0, x̂1, . . . , x̂N, y^t) ≥ 0
Ji* = ‖x̂0i − x0‖₂² + ‖x̂i − xi‖₂² + ‖yi^t − yi(x̂0i, xi, y^t_{j≠i})‖₂² = 0 for i = 1, . . . , N,

where x̂0i are duplicates of the global design variables passed to (and manipulated by) discipline i and x̂i are duplicates of the local design variables passed to the system subproblem.

The discipline i subproblem in both CO1 and CO2 is

minimize Ji(x̂0i, xi, yi(x̂0i, xi, y^t_{j≠i}))
with respect to x̂0i, xi
subject to ci(x̂0i, xi, yi(x̂0i, xi, y^t_{j≠i})) ≥ 0.
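A sketch of a CO discipline subproblem for an aerodynamics block, with an invented toy analysis; the system-level targets are fixed inputs, and the subproblem returns J1* to the system level:

```python
from scipy.optimize import minimize

def aero(x0_hat, x1, u_t):  # hypothetical disciplinary analysis
    return 1.0 / (1.0 + x0_hat**2) + 0.05 * x1 + 0.1 * u_t

def co_aero_subproblem(x0_target, f_target, u_t):
    """Minimize J1, the squared mismatch with the system-level targets."""
    def J1(v):
        x0_hat, x1 = v
        f = aero(x0_hat, x1, u_t)
        return (x0_hat - x0_target)**2 + (f - f_target)**2
    res = minimize(J1, [x0_target, 0.0])
    return res.fun  # J1* is reported back as a system-level constraint

# The system level optimizes f0 over the targets while driving J1*, J2* to zero.
print(co_aero_subproblem(x0_target=1.0, f_target=0.6, u_t=0.1))
```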

Collaborative Optimization (CO) 2


[XDSM diagram of CO: the system optimization (steps 0, 2→1) sends x0, x̂1···N, y^t to the System Functions block and to the parallel disciplinary optimizations (steps 1.0, 1.3→1.1); each discipline i optimization iterates its Analysis i and Discipline i Functions, and returns Ji* to the system level]

CO Algorithm 1

Input: Initial design variables x


Output: Optimal variables x∗ , objective function f ∗ , and constraint values c∗
0: Initiate system optimization iteration
repeat
1: Compute system subproblem objectives and constraints
for Each discipline i (in parallel) do
1.0: Initiate disciplinary subproblem optimization
repeat
1.1: Evaluate disciplinary analysis
1.2: Compute disciplinary subproblem objective and constraints
1.3: Compute new disciplinary subproblem design point and Ji
until 1.3 → 1.1: Optimization i has converged
end for
2: Compute a new system subproblem design point
until 2 → 1: System optimization has converged

Aerostructural Optimization Using CO 1


System-level problem:

minimize −R
w.r.t. Λ^t, Γ^t, α^t, u^t, W^t
s.t. J1* ≤ 10⁻⁶
J2* ≤ 10⁻⁶

Aerodynamics subproblem:
minimize J1 = (1 − Λ/Λ^t)² + Σi (1 − Γi/Γi^t)² + (1 − α/α^t)² + (1 − W/W^t)²
w.r.t. Λ, γ, α
s.t. L − W = 0

Aerostructural Optimization Using CO 2


Structures subproblem:
minimize J2 = (1 − Λ/Λ^t)² + Σi (1 − ui/ui^t)²
w.r.t. Λ, t
s.t. σyield − σi ≥ 0

Bilevel Integrated System Synthesis (BLISS) 1


The system level subproblem is formulated as

minimize (f0*)0 + (df0*/dx0) Δx0
with respect to Δx0
subject to (c0*)0 + (dc0*/dx0) Δx0 ≥ 0
(ci*)0 + (dci*/dx0) Δx0 ≥ 0 for i = 1, . . . , N
Δx0L ≤ Δx0 ≤ Δx0U.

Bilevel Integrated System Synthesis (BLISS) 2


The discipline i subproblem is given by

minimize (f0)0 + (df0/dxi) Δxi
with respect to Δxi
subject to (c0)0 + (dc0/dxi) Δxi ≥ 0
(ci)0 + (dci/dxi) Δxi ≥ 0
ΔxiL ≤ Δxi ≤ ΔxiU.

Note the extra set of constraints in both the system and discipline subproblems, representing the design variable bounds.
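Because both subproblems are linear in the design-variable steps, each one is a linear program. A minimal sketch of the system subproblem with scipy, where the current values and post-optimality derivatives are invented placeholders:

```python
import numpy as np
from scipy.optimize import linprog

# Current point data (all numbers hypothetical): constraint value plus
# post-optimality derivatives with respect to the shared variables.
df0_dx0 = np.array([1.5, -0.7])
c0, dc0_dx0 = 0.2, np.array([[0.3, 0.1]])

# minimize (df0/dx0) dx0  s.t.  c0 + (dc0/dx0) dx0 >= 0, with step bounds
res = linprog(c=df0_dx0,
              A_ub=-dc0_dx0,          # -(dc0/dx0) dx0 <= c0
              b_ub=[c0],
              bounds=[(-0.1, 0.1)] * 2)
dx0 = res.x  # step in the shared design variables before re-running the MDA
```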

Bilevel Integrated System Synthesis (BLISS) 3


[XDSM diagram of BLISS: a convergence check (steps 0, 11→1) drives an MDA (steps 1, 3→2), parallel disciplinary optimizations (steps 4, 7) using the discipline-variable derivatives (step 6), a shared-variable (post-optimality) derivative computation (step 9), and the system optimization of the shared variables (steps 8, 10)]

BLISS Algorithm
Input: Initial design variables x
Output: Optimal variables x∗ , objective function f ∗ , and constraint values c∗
0: Initiate system optimization
repeat
1: Initiate MDA
repeat
2: Evaluate discipline analyses
3: Update coupling variables
until 3 → 2: MDA has converged
4: Initiate parallel discipline optimizations
for Each discipline i do
5: Evaluate discipline analysis
6: Compute objective and constraint function values and derivatives with
respect to local design variables
7: Compute the optimal solutions for the disciplinary subproblem
end for
8: Initiate system optimization
9: Compute objective and constraint function values and derivatives with
respect to shared design variables using post-optimality analysis
10: Compute optimal solution to system subproblem
until 11 → 1: System optimization has converged

Analytical Target Cascading (ATC) 1


The ATC system subproblem is given by

minimize f0(x, y^t) + Σᵢ₌₁ᴺ Φi(x̂0i − x0, yi^t − yi(x0, xi, y^t)) + Φ0(c0(x, y^t))
with respect to x0, y^t,

where Φ0 is a penalty relaxation of the global design constraints and Φi is a penalty relaxation of the discipline i consistency constraints. The ith discipline subproblem is:

minimize f0(x̂0i, xi, yi(x̂0i, xi, y^t_{j≠i}), y^t_{j≠i}) + fi(x̂0i, xi, yi(x̂0i, xi, y^t_{j≠i}))
  + Φi(yi^t − yi(x̂0i, xi, y^t_{j≠i}), x̂0i − x0)
  + Φ0(c0(x̂0i, xi, yi(x̂0i, xi, y^t_{j≠i}), y^t_{j≠i}))
with respect to x̂0i, xi
subject to ci(x̂0i, xi, yi(x̂0i, xi, y^t_{j≠i})) ≥ 0.
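The relaxation Φ is commonly an augmented-Lagrangian-style penalty whose weights are increased between outer iterations; the specific quadratic form and update factor below are one standard choice from the ATC literature, assumed here for illustration:

```python
import numpy as np

def phi(q, v, w):
    """Phi(q) = v^T q + ||w * q||^2 on a vector of consistency violations q."""
    return v @ q + np.sum((w * q)**2)

def update(v, w, q, beta=2.0):
    """Outer-loop update: shift the multiplier estimates, inflate the weights."""
    return v + 2.0 * w**2 * q, beta * w

q = np.array([0.10, -0.05])   # e.g., [x0i_hat - x0, yi_t - yi] at the last iterate
v, w = np.zeros(2), np.ones(2)
print(phi(q, v, w))
v, w = update(v, w, q)        # repeated until the violations are small enough
```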
Analytical Target Cascading (ATC) 2


[XDSM diagram of ATC: an outer loop (steps 0, 8→1) updates the penalty weights w; within it, each discipline optimization (steps 1, 4→2, with its Analysis i and discipline/penalty functions) and the system optimization (steps 5, 7→6, with the system and penalty functions) exchange x0, y^t, x̂0i, and yi]
ATC Algorithm

Input: Initial design variables x


Output: Optimal variables x∗ , objective function f ∗ , and constraint values c∗
0: Initiate main ATC iteration
repeat
for Each discipline i do
1: Initiate discipline optimizer
repeat
2: Evaluate disciplinary analysis
3: Compute discipline objective and constraint functions and penalty
function values
4: Update discipline design variables
until 4 → 2: Discipline optimization has converged
end for
5: Initiate system optimizer
repeat
6: Compute system objective, constraints, and all penalty functions
7: Update system design variables and coupling targets.
until 7 → 6: System optimization has converged
8: Update penalty weights
until 8 → 1: Penalty weights are large enough

Asymmetric Subspace Optimization (ASO) 1


The system subproblem in ASO is

minimize f0(x, y(x, y)) + Σk fk(x0, xk, yk(x0, xk, y_{j≠k}))
with respect to x0, xk
subject to c0(x, y(x, y)) ≥ 0
ck(x0, xk, yk(x0, xk, y_{j≠k})) ≥ 0 for all k,

where subscript k denotes disciplinary information that remains outside the MDA. The subproblem for discipline i, which is resolved inside the MDA, is

minimize f0(x, y(x, y)) + fi(x0, xi, yi(x0, xi, y_{j≠i}))
with respect to xi
subject to ci(x0, xi, yi(x0, xi, y_{j≠i})) ≥ 0.
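A minimal sketch of the nesting, with toy models in the same spirit as the earlier ones: the local structural variable is re-optimized inside every MDA iteration, so the system optimizer never sees it:

```python
from scipy.optimize import minimize, minimize_scalar

aero = lambda x, u: 1.0 / (1.0 + x**2) + 0.1 * u   # toy outer discipline
struct = lambda t, f: 0.5 * f / (1.0 + t)          # toy inner-discipline analysis

def mda_with_inner_opt(x, tol=1e-10):
    """MDA that re-optimizes the local variable t inside every iteration."""
    f, u = 1.0, 0.0
    for _ in range(200):
        f_new = aero(x, u)
        # inner structural optimization: trade mass (~t) against displacement
        t = minimize_scalar(lambda t: t + 5.0 * struct(t, f_new)**2,
                            bounds=(0.1, 10.0), method="bounded").x
        u_new = struct(t, f_new)
        if abs(f_new - f) + abs(u_new - u) < tol:
            return f_new, u_new, t
        f, u = f_new, u_new
    raise RuntimeError("MDA did not converge")

def system_objective(xv):          # the system optimizer controls only x
    f, u, _ = mda_with_inner_opt(xv[0])
    return -f + 0.2 * u

res = minimize(system_objective, [1.0])
```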

Asymmetric Subspace Optimization (ASO) 2


[XDSM diagram of ASO: the system optimization (steps 0, 10→1) controls x0, x1, x2; the MDA (steps 1, 8→2) iterates Analysis 1, Analysis 2, and an embedded Optimization 3 (steps 4, 7→5) that drives Analysis 3 and the discipline 0 and 3 functions to optimize the local variables x3; the discipline 0, 1, and 2 functions (step 9) close the loop to the system optimizer]

ASO Algorithm

Input: Initial design variables x


Output: Optimal variables x∗ , objective function f ∗ , and constraint values c∗
0: Initiate system optimization
repeat
1: Initiate MDA
repeat
2: Evaluate Analysis 1
3: Evaluate Analysis 2
4: Initiate optimization of Discipline 3
repeat
5: Evaluate Analysis 3
6: Compute discipline 3 objectives and constraints
7: Update local design variables
until 7 → 5: Discipline 3 optimization has converged
8: Update coupling variables
until 8 → 2 MDA has converged
9: Compute objective and constraint function values for disciplines 1 and 2
10: Update design variables
until 10 → 1: System optimization has converged

Example: A Framework for Automatic Implementation of MDO

[UML class diagram of the framework: an MDO formulation class (specialized by MDF, SAND, IDF, CO, and CSSO) composed of N Discipline/Solver pairs and associated with an Optimization/Optimizer class]
Analytic Methods for Computing Coupled Derivatives 1


I We now extend the analytic methods derived in the derivatives chapter to
multidisciplinary systems.
I Each discipline is seen as one component.
I We apply the analytic equations and partition each of the matrices into blocks corresponding to each discipline.
I The partitioning is as follows,

R = [R1 , R2 ]T y = [y1 , y2 ]T

where we have assumed two disciplines as an example.

Analytic Methods for Computing Coupled Derivatives 2


With the variables grouped as x, r = [r1, r2], and y = [y1, y2], the full variable vector is

$$v = [\underbrace{v_1, \ldots, v_{n_x}}_{x}, \underbrace{v_{n_x+1}, \ldots, v_{n_x+n_{y_1}}}_{r_1}, \underbrace{\ldots, v_{n_x+n_{y_1}+n_{y_2}}}_{r_2}, \underbrace{\ldots, v_{n_x+2n_{y_1}+n_{y_2}}}_{y_1}, \underbrace{\ldots, v_{n_x+2n_{y_1}+2n_{y_2}}}_{y_2}]$$
Analytic Methods for Computing Coupled Derivatives 3


I To derive the direct and adjoint versions of this approach within our
mathematical framework, we define the artificial residual functions

Ri = Yi − yi ,

where the yi vector contains the intermediate variables of the ith discipline,
and Yi is the vector of functions that explicitly define these intermediate
variables.

Analytic Methods for Computing Coupled Derivatives 4


[Diagrams of the dependency structure from the perturbations ∆x through ∆r1, ∆r2, ∆y1, ∆y2 to ∆f for the three approaches: (a) residual, (b) functional, and (c) hybrid]

Analytic Methods for Computing Coupled Derivatives 5


$$\begin{bmatrix} I & 0 & 0 \\ -\frac{\partial R}{\partial x} & -\frac{\partial R}{\partial y} & 0 \\ -\frac{\partial F}{\partial x} & -\frac{\partial F}{\partial y} & I \end{bmatrix} \begin{bmatrix} I \\ \frac{dy}{dx} \\ \frac{df}{dx} \end{bmatrix} = \begin{bmatrix} I \\ 0 \\ 0 \end{bmatrix}$$

(a) Direct method

$$\begin{bmatrix} I & -\frac{\partial R}{\partial x}^T & -\frac{\partial F}{\partial x}^T \\ 0 & -\frac{\partial R}{\partial y}^T & -\frac{\partial F}{\partial y}^T \\ 0 & 0 & I \end{bmatrix} \begin{bmatrix} \frac{df}{dx}^T \\ \frac{df}{dr}^T \\ I \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ I \end{bmatrix}$$

(b) Adjoint method

$$\begin{bmatrix} I & 0 & 0 & 0 \\ -\frac{\partial R_1}{\partial x} & -\frac{\partial R_1}{\partial y_1} & -\frac{\partial R_1}{\partial y_2} & 0 \\ -\frac{\partial R_2}{\partial x} & -\frac{\partial R_2}{\partial y_1} & -\frac{\partial R_2}{\partial y_2} & 0 \\ -\frac{\partial F}{\partial x} & -\frac{\partial F}{\partial y_1} & -\frac{\partial F}{\partial y_2} & I \end{bmatrix} \begin{bmatrix} I \\ \frac{dy_1}{dx} \\ \frac{dy_2}{dx} \\ \frac{df}{dx} \end{bmatrix} = \begin{bmatrix} I \\ 0 \\ 0 \\ 0 \end{bmatrix}$$

(c) Coupled direct — residual approach

$$\begin{bmatrix} I & -\frac{\partial R_1}{\partial x}^T & -\frac{\partial R_2}{\partial x}^T & -\frac{\partial F}{\partial x}^T \\ 0 & -\frac{\partial R_1}{\partial y_1}^T & -\frac{\partial R_2}{\partial y_1}^T & -\frac{\partial F}{\partial y_1}^T \\ 0 & -\frac{\partial R_1}{\partial y_2}^T & -\frac{\partial R_2}{\partial y_2}^T & -\frac{\partial F}{\partial y_2}^T \\ 0 & 0 & 0 & I \end{bmatrix} \begin{bmatrix} \frac{df}{dx}^T \\ \frac{df}{dr_1}^T \\ \frac{df}{dr_2}^T \\ I \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ I \end{bmatrix}$$

(d) Coupled adjoint — residual approach
Analytic Methods for Computing Coupled Derivatives 6
$$\begin{bmatrix} I & 0 & 0 & 0 \\ -\frac{\partial Y_1}{\partial x} & I & -\frac{\partial Y_1}{\partial y_2} & 0 \\ -\frac{\partial Y_2}{\partial x} & -\frac{\partial Y_2}{\partial y_1} & I & 0 \\ -\frac{\partial F}{\partial x} & -\frac{\partial F}{\partial y_1} & -\frac{\partial F}{\partial y_2} & I \end{bmatrix} \begin{bmatrix} I \\ \frac{dy_1}{dx} \\ \frac{dy_2}{dx} \\ \frac{df}{dx} \end{bmatrix} = \begin{bmatrix} I \\ 0 \\ 0 \\ 0 \end{bmatrix}$$

(e) Coupled direct — functional approach

$$\begin{bmatrix} I & -\frac{\partial Y_1}{\partial x}^T & -\frac{\partial Y_2}{\partial x}^T & -\frac{\partial F}{\partial x}^T \\ 0 & I & -\frac{\partial Y_2}{\partial y_1}^T & -\frac{\partial F}{\partial y_1}^T \\ 0 & -\frac{\partial Y_1}{\partial y_2}^T & I & -\frac{\partial F}{\partial y_2}^T \\ 0 & 0 & 0 & I \end{bmatrix} \begin{bmatrix} \frac{df}{dx}^T \\ \frac{df}{dy_1}^T \\ \frac{df}{dy_2}^T \\ I \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ I \end{bmatrix}$$

(f) Coupled adjoint — functional approach

$$\begin{bmatrix} I & 0 & 0 & 0 \\ -\frac{\partial R_1}{\partial x} & -\frac{\partial R_1}{\partial y_1} & -\frac{\partial R_1}{\partial y_2} & 0 \\ -\frac{\partial Y_2}{\partial x} & -\frac{\partial Y_2}{\partial y_1} & I & 0 \\ -\frac{\partial F}{\partial x} & -\frac{\partial F}{\partial y_1} & -\frac{\partial F}{\partial y_2} & I \end{bmatrix} \begin{bmatrix} I \\ \frac{dy_1}{dx} \\ \frac{dy_2}{dx} \\ \frac{df}{dx} \end{bmatrix} = \begin{bmatrix} I \\ 0 \\ 0 \\ 0 \end{bmatrix}$$

(g) Hybrid direct

$$\begin{bmatrix} I & -\frac{\partial R_1}{\partial x}^T & -\frac{\partial Y_2}{\partial x}^T & -\frac{\partial F}{\partial x}^T \\ 0 & -\frac{\partial R_1}{\partial y_1}^T & -\frac{\partial Y_2}{\partial y_1}^T & -\frac{\partial F}{\partial y_1}^T \\ 0 & -\frac{\partial R_1}{\partial y_2}^T & I & -\frac{\partial F}{\partial y_2}^T \\ 0 & 0 & 0 & I \end{bmatrix} \begin{bmatrix} \frac{df}{dx}^T \\ \frac{df}{dr_1}^T \\ \frac{df}{dy_2}^T \\ I \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ I \end{bmatrix}$$

(h) Hybrid adjoint

Numerical Example 1
In most cases, the explicit computation of state variables involves solving the nonlinear system corresponding to the discipline; however, in this example, this is simplified because the residuals are linear in the state variables and each discipline has only one state variable. Thus, the explicit forms are

$$Y_1(x_1, x_2, y_2) = -\frac{2y_2}{x_1} + \frac{\sin x_1}{x_1}, \qquad Y_2(x_1, x_2, y_1) = \frac{y_1}{x_2^2}.$$
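Before any derivatives can be computed, the coupled states must satisfy both disciplines. A minimal fixed-point sketch of this two-discipline MDA for one design point:

```python
import numpy as np

x1, x2 = 1.0, 2.0
Y1 = lambda y2: -2.0 * y2 / x1 + np.sin(x1) / x1
Y2 = lambda y1: y1 / x2**2

y1, y2 = 0.0, 0.0
for _ in range(100):
    y1_new = Y1(y2)
    y2_new = Y2(y1_new)
    if abs(y1_new - y1) + abs(y2_new - y2) < 1e-14:
        break
    y1, y2 = y1_new, y2_new
print(y1, y2)  # converged coupled states at this design point
```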

Numerical Example 2
Coupled — Residual (Direct)

$$\begin{bmatrix} -\frac{\partial R_1}{\partial y_1} & -\frac{\partial R_1}{\partial y_2} \\ -\frac{\partial R_2}{\partial y_1} & -\frac{\partial R_2}{\partial y_2} \end{bmatrix} \begin{bmatrix} \frac{dy_1}{dx_1} & \frac{dy_1}{dx_2} \\ \frac{dy_2}{dx_1} & \frac{dy_2}{dx_2} \end{bmatrix} = \begin{bmatrix} \frac{\partial R_1}{\partial x_1} & \frac{\partial R_1}{\partial x_2} \\ \frac{\partial R_2}{\partial x_1} & \frac{\partial R_2}{\partial x_2} \end{bmatrix}$$

$$\begin{bmatrix} -x_1 & -2 \\ 1 & -x_2^2 \end{bmatrix} \begin{bmatrix} \frac{dy_1}{dx_1} & \frac{dy_1}{dx_2} \\ \frac{dy_2}{dx_1} & \frac{dy_2}{dx_2} \end{bmatrix} = \begin{bmatrix} y_1 - \cos x_1 & 0 \\ 0 & 2x_2 y_2 \end{bmatrix}$$

$$\frac{df_1}{dx_1} = \frac{\partial F_1}{\partial x_1} + \frac{\partial F_1}{\partial y_1}\frac{dy_1}{dx_1} + \frac{\partial F_1}{\partial y_2}\frac{dy_2}{dx_1} = 0 + 1 \times \frac{dy_1}{dx_1} + 0 \times \frac{dy_2}{dx_1}$$

Numerical Example 3
Coupled — Residual (Adjoint)

$$\begin{bmatrix} -\frac{\partial R_1}{\partial y_1} & -\frac{\partial R_2}{\partial y_1} \\ -\frac{\partial R_1}{\partial y_2} & -\frac{\partial R_2}{\partial y_2} \end{bmatrix} \begin{bmatrix} \frac{df_1}{dr_1} & \frac{df_2}{dr_1} \\ \frac{df_1}{dr_2} & \frac{df_2}{dr_2} \end{bmatrix} = \begin{bmatrix} \frac{\partial F_1}{\partial y_1} & \frac{\partial F_2}{\partial y_1} \\ \frac{\partial F_1}{\partial y_2} & \frac{\partial F_2}{\partial y_2} \end{bmatrix}$$

$$\begin{bmatrix} -x_1 & 1 \\ -2 & -x_2^2 \end{bmatrix} \begin{bmatrix} \frac{df_1}{dr_1} & \frac{df_2}{dr_1} \\ \frac{df_1}{dr_2} & \frac{df_2}{dr_2} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & \sin x_1 \end{bmatrix}$$

$$\frac{df_1}{dx_1} = \frac{\partial F_1}{\partial x_1} + \frac{df_1}{dr_1}\frac{\partial R_1}{\partial x_1} + \frac{df_1}{dr_2}\frac{\partial R_2}{\partial x_1} = 0 + \frac{df_1}{dr_1}(y_1 - \cos x_1) + \frac{df_1}{dr_2} \times 0$$

Numerical Example 4
Coupled — Functional (Direct)

$$\begin{bmatrix} 1 & -\frac{\partial Y_1}{\partial y_2} \\ -\frac{\partial Y_2}{\partial y_1} & 1 \end{bmatrix} \begin{bmatrix} \frac{dy_1}{dx_1} & \frac{dy_1}{dx_2} \\ \frac{dy_2}{dx_1} & \frac{dy_2}{dx_2} \end{bmatrix} = \begin{bmatrix} \frac{\partial Y_1}{\partial x_1} & \frac{\partial Y_1}{\partial x_2} \\ \frac{\partial Y_2}{\partial x_1} & \frac{\partial Y_2}{\partial x_2} \end{bmatrix}$$

$$\begin{bmatrix} 1 & \frac{2}{x_1} \\ -\frac{1}{x_2^2} & 1 \end{bmatrix} \begin{bmatrix} \frac{dy_1}{dx_1} & \frac{dy_1}{dx_2} \\ \frac{dy_2}{dx_1} & \frac{dy_2}{dx_2} \end{bmatrix} = \begin{bmatrix} \frac{2y_2}{x_1^2} + \frac{\cos x_1}{x_1} - \frac{\sin x_1}{x_1^2} & 0 \\ 0 & -\frac{2y_1}{x_2^3} \end{bmatrix}$$

$$\frac{df_1}{dx_1} = \frac{\partial F_1}{\partial x_1} + \frac{\partial F_1}{\partial y_1}\frac{dy_1}{dx_1} + \frac{\partial F_1}{\partial y_2}\frac{dy_2}{dx_1} = 0 + 1 \times \frac{dy_1}{dx_1} + 0 \times \frac{dy_2}{dx_1}$$

Numerical Example 5
Coupled — Functional (Adjoint)

$$\begin{bmatrix} 1 & -\frac{\partial Y_2}{\partial y_1} \\ -\frac{\partial Y_1}{\partial y_2} & 1 \end{bmatrix} \begin{bmatrix} \frac{df_1}{dy_1} & \frac{df_2}{dy_1} \\ \frac{df_1}{dy_2} & \frac{df_2}{dy_2} \end{bmatrix} = \begin{bmatrix} \frac{\partial F_1}{\partial y_1} & \frac{\partial F_2}{\partial y_1} \\ \frac{\partial F_1}{\partial y_2} & \frac{\partial F_2}{\partial y_2} \end{bmatrix}$$

$$\begin{bmatrix} 1 & -\frac{1}{x_2^2} \\ \frac{2}{x_1} & 1 \end{bmatrix} \begin{bmatrix} \frac{df_1}{dy_1} & \frac{df_2}{dy_1} \\ \frac{df_1}{dy_2} & \frac{df_2}{dy_2} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & \sin x_1 \end{bmatrix}$$

$$\frac{df_1}{dx_1} = \frac{\partial F_1}{\partial x_1} + \frac{df_1}{dy_1}\frac{\partial Y_1}{\partial x_1} + \frac{df_1}{dy_2}\frac{\partial Y_2}{\partial x_1} = 0 + \frac{df_1}{dy_1}\left(\frac{2y_2}{x_1^2} + \frac{\cos x_1}{x_1} - \frac{\sin x_1}{x_1^2}\right) + \frac{df_1}{dy_2} \times 0$$

Numerical Example 6
Coupled — Hybrid (Direct)

$$\begin{bmatrix} -\frac{\partial R_1}{\partial y_1} & -\frac{\partial R_1}{\partial y_2} \\ -\frac{\partial Y_2}{\partial y_1} & 1 \end{bmatrix} \begin{bmatrix} \frac{dy_1}{dx_1} & \frac{dy_1}{dx_2} \\ \frac{dy_2}{dx_1} & \frac{dy_2}{dx_2} \end{bmatrix} = \begin{bmatrix} \frac{\partial R_1}{\partial x_1} & \frac{\partial R_1}{\partial x_2} \\ \frac{\partial Y_2}{\partial x_1} & \frac{\partial Y_2}{\partial x_2} \end{bmatrix}$$

$$\begin{bmatrix} -x_1 & -2 \\ -\frac{1}{x_2^2} & 1 \end{bmatrix} \begin{bmatrix} \frac{dy_1}{dx_1} & \frac{dy_1}{dx_2} \\ \frac{dy_2}{dx_1} & \frac{dy_2}{dx_2} \end{bmatrix} = \begin{bmatrix} y_1 - \cos x_1 & 0 \\ 0 & -\frac{2y_1}{x_2^3} \end{bmatrix}$$

$$\frac{df_1}{dx_1} = \frac{\partial F_1}{\partial x_1} + \frac{\partial F_1}{\partial y_1}\frac{dy_1}{dx_1} + \frac{\partial F_1}{\partial y_2}\frac{dy_2}{dx_1} = 0 + 1 \times \frac{dy_1}{dx_1} + 0 \times \frac{dy_2}{dx_1}$$

Numerical Example 7
Coupled — Hybrid (Adjoint)

$$\begin{bmatrix} -\frac{\partial R_1}{\partial y_1} & -\frac{\partial Y_2}{\partial y_1} \\ -\frac{\partial R_1}{\partial y_2} & 1 \end{bmatrix} \begin{bmatrix} \frac{df_1}{dr_1} & \frac{df_2}{dr_1} \\ \frac{df_1}{dy_2} & \frac{df_2}{dy_2} \end{bmatrix} = \begin{bmatrix} \frac{\partial F_1}{\partial y_1} & \frac{\partial F_2}{\partial y_1} \\ \frac{\partial F_1}{\partial y_2} & \frac{\partial F_2}{\partial y_2} \end{bmatrix}$$

$$\begin{bmatrix} -x_1 & -\frac{1}{x_2^2} \\ -2 & 1 \end{bmatrix} \begin{bmatrix} \frac{df_1}{dr_1} & \frac{df_2}{dr_1} \\ \frac{df_1}{dy_2} & \frac{df_2}{dy_2} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & \sin x_1 \end{bmatrix}$$

$$\frac{df_1}{dx_1} = \frac{\partial F_1}{\partial x_1} + \frac{df_1}{dr_1}\frac{\partial R_1}{\partial x_1} + \frac{df_1}{dy_2}\frac{\partial Y_2}{\partial x_1} = 0 + \frac{df_1}{dr_1}(y_1 - \cos x_1) + \frac{df_1}{dy_2} \times 0$$
