Martins MDO Course Slides PDF
Joaquim R. R. A. Martins
Multidisciplinary Design Optimization Laboratory
http://mdolab.engin.umich.edu
4. Computing Derivatives
4.1 Introduction
4.2 Finite Differences
4.3 Complex-Step Method
4.4 C/C++ Implementations
4.5 Unifying Chain Rule
4.6 The Unifying Chain Rule
4.7 Monolithic Differentiation
Introduction
1. Introduction
1.1 About
1.2 Aircraft as Multidisciplinary Systems
1.3 Design Optimization
1.4 Optimization Problem Statement
1.5 Optimization Problem Statement
1.6 Classification of Optimization Problems
1.7 History
3. Gradient-Based Optimization
4. Computing Derivatives
5. Constrained Optimization
6. Gradient-Free Optimization
About Me
Bio
I 1991–1995: M.Eng. in Aeronautics, Imperial College, London
I 1996–2002: M.Sc. and Ph.D. in Aeronautics and Astronautics, Stanford
I 2002–2009: Assistant/Associate Prof., University of Toronto Inst. for
Aerospace Studies
I 2009– : Associate Prof., University of Michigan, Dept. of Aerospace Eng.
Highlights
I Two best papers at the AIAA MA&O Conference (2002, 2006)
I Canada Research Chair in Multidisciplinary Optimization (2002–2009)
I Keynote speaker at the International Forum on Aeroelasticity and Structural
Dynamics (Stockholm, 2007)
I Keynote speaker at the Aircraft Structural Design Conference (London, 2010)
I Associate editor for the AIAA Journal and Optimization and Engineering
About You
I Name
I Title and responsibilities
I Why are you taking this course?
I What do you hope to get from this course?
Course Content
[Diagram: course topic map centered on MDO — Introduction, Single-Variable Minimization, Computing Derivatives, Gradient-Based Optimization, Handling Constraints, Gradient-Free Optimization, MDO Architectures]
Santos–Dumont’s Demoiselle
What is MDO?
I We will first cover the “DO” in MDO.
I In industry, problems routinely arise that require making the best possible
design decision.
I However, optimization is still underused in industry. . . Why?
I Numerical optimization and MDO still not part of most undergraduate and
graduate curricula
I Backlash due to “overselling” of numerical optimization
I Inertia in the industrial environment
I Aerospace is one of the leading applications of engineering design
optimization. Why?
[Flowchart: design process with “analyze or experiment” and “analyze” steps and yes/no decision points]
Objective Function
I What do we mean by “best”?
I Objective function is a “measure of badness” that enables us to compare two
designs quantitatively — assuming we want to minimize it.
I Need to be able to estimate this measure numerically.
I If we select the wrong goal, it doesn’t matter how good the analysis is, or
how efficient the optimization method is. Therefore, it’s important to select a
good objective function.
I Selecting a good objective function is often overlooked, and not an easy
problem, even for experienced designers.
I Objective function may be linear or nonlinear and may or may not be given explicitly.
I We will represent the objective function by the scalar f .
I There is no such thing as multiobjective optimization!
The “Disciplanes”
Is there one aircraft that is the fastest, most efficient, quietest, and least expensive?
Design Variables
I Design variables are also known as design parameters and are represented by
the vector x. They are the variables in the problem that we allow to vary in
the design process.
I Optimization is the process of choosing the design variables that yield an
optimum design.
I Design variables should be independent of each other.
I Design variables can be continuous or discrete. Discrete variables are
sometimes integer variables.
Constraints
I Few practical engineering optimization problems are unconstrained.
I Constraints on the design variables are called bounds and are easy to enforce.
I Like the objective function, constraints can be linear or nonlinear and may or may not be given in explicit form. They may be equality or inequality constraints.
I At a given design point, constraints may be active or inactive. This distinction is particularly important at the optimum.
minimize f (x)
with respect to x ∈ Rn
subject to ĉj (x) = 0, j = 1, 2, . . . , m̂
ck (x) ≥ 0, k = 1, 2, . . . , m
I Need a truly multidisciplinary objective, e.g., the Breguet range:

    Range = (V/c) (L/D) ln(Wi/Wf)

[Figure: spanwise distributions of twist in degrees (jig twist and deflected) and thickness in m, over spanwise distance 0–20 m]
[Tree diagram: Optimization Problem Classification]
I Continuity: continuous vs. discontinuous
I Linearity: linear vs. nonlinear
I Time: static vs. dynamic
I Design variables: quantitative vs. qualitative, continuous vs. discrete
I Data: deterministic vs. stochastic
I Constraints: unconstrained vs. constrained
I Convexity: convex vs. non-convex
[Tree diagram: Optimization Methods]
I Gradient-based: conjugate gradient, quasi-Newton
I Gradient-free: grid or random search, genetic algorithms, simulated annealing, Nelder–Mead, DIRECT, particle swarm
3. Gradient-Based Optimization
4. Computing Derivatives
5. Constrained Optimization
6. Gradient-Free Optimization
Classification of Minima
We can classify a minimum as a:
1. Strong local minimum
2. Weak local minimum
3. Global minimum
Optimality Conditions 1
Taylor’s theorem is the key for identifying local minima
f(x + h) = f(x) + h f′(x) + (1/2) h² f″(x) + · · · + h^(n−1)/(n−1)! f^(n−1)(x) + hⁿ/n! f^(n)(x + θh)
where the last term is O(hⁿ).
Optimality Conditions 2
Since f′(x∗) = 0, we have to consider the second derivative term.
This term must be non-negative for a local minimum at x∗.
Since ε² > 0, then f″(x∗) ≥ 0. This is the second-order optimality condition.
Thus the necessary conditions for a local minimum are:
f′(x∗) = 0,  f″(x∗) ≥ 0
The sufficient conditions for a strong local minimum are:
f′(x∗) = 0,  f″(x∗) > 0
Numerical Precision
I Finding x∗ such that f′(x∗) = 0 is equivalent to finding the roots of the first derivative of the function to be minimized.
I Therefore, root finding methods can be used to find stationary points and are
useful in function minimization.
I With finite machine precision, it is not possible to find the exact zero, so we will be satisfied with finding an x∗ that belongs to an interval [a, b] such that the function g satisfies
Convergence Rate 1
Two questions are important when considering an optimization algorithm:
I Does it converge?
lim_{k→∞} (xk − x∗) = 0
Convergence Rate 2
Assume ideal convergence behavior, so that the above condition holds at every iteration and we do not have to take the limit. Then,
Convergence Rate 3
In general, x is an n-vector and we have to rethink the definition of the error.
I We could use, for example, ||xk − x∗||.
I But this depends on the scaling of x, so we should normalize it: ||xk − x∗|| / ||xk||.
I And . . . xk might be zero, so fix this: ||xk − x∗|| / (1 + ||xk||).
I And . . . gradients might be large. Thus, we should use a combined quantity,
Convergence Rate 4
I A final issue: x∗ is usually not known! You can monitor the progress of your
algorithm using the steps,
Sometimes, you might just use the second fraction in the above term, or the norm of the gradient. You should plot these quantities on a log axis versus k.
Method of Bisection
I Bisection is a bracketing method: it generates a set of nested intervals and requires an initial interval in which a solution is assumed to exist.
I First we find a bracket [x1 , x2 ] such that f (x1 )f (x2 ) < 0
I For an initial interval [x1, x2], bisection yields the following interval size at iteration k,
δk = |x2 − x1| / 2ᵏ
I To achieve a specified tolerance ε, we need log₂(|x2 − x1|/ε) function evaluations.
I From the definition of rate of convergence, for r = 1,
lim_{k→∞} δk+1/δk = 1/2
I Converges linearly with asymptotic error constant γ = 1/2.
I To find the minimum of a function using bisection, we evaluate the derivative of f at each iteration, and find a point for which f′ = 0.
Newton’s Method
Newton’s method for finding a zero can be derived from the Taylor series expansion about the current iterate xk.
Ignoring terms of second order and higher, and taking the next iterate to be a root (i.e., f(xk+1) = 0), we obtain,
xk+1 = xk − f(xk)/f′(xk).
The convergence is quadratic:
lim_{k→∞} |xk+1 − x∗| / |xk − x∗|² = const.
[Figure: Newton’s method extrapolates the local derivative to find the next estimate of the root.]
J.R.R.A. Martins, Multidisciplinary Design Optimization, August 2012
[Figure: a case where Newton’s method encounters a local extremum and shoots off to space; bracketing bounds, as in rtsafe, would save the day.]
[Figure: a case where Newton’s method enters a nonconvergent cycle. This is often encountered when the function f is obtained, in whole or in part, by table interpolation; with a better initial guess, the method would have succeeded.]
xk+1 = xk − f(xk)/f′(xk)  →  xk+1 = xk − f′(xk)/f″(xk).
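A minimal Python sketch of the Newton iteration (our own illustration, not the slides' code; applying the same update to f′ performs the minimization):

```python
def newton(f, fp, x, tol=1e-12, maxit=50):
    """Newton's method for f(x) = 0: x <- x - f(x)/f'(x)."""
    for _ in range(maxit):
        step = f(x) / fp(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# Minimize f(x) = x**4 - x by finding the zero of f'(x) = 4x**3 - 1,
# supplying f''(x) = 12x**2 as the derivative used in the update
xstar = newton(lambda x: 4.0 * x**3 - 1.0, lambda x: 12.0 * x**2, 1.0)
```

With a reasonable starting point the iterates converge quadratically; the pathological cases in the figures above (local extremum, nonconvergent cycle) are why bracketing safeguards are used in practice.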
Secant Method
I Newton’s method requires the first derivative for each iteration (and the
second derivative when applied to minimization).
I In some cases, it might not be easy to obtain these derivatives.
I If we use a forward-difference approximation for f′(xk) in Newton’s method we obtain
xk+1 = xk − f(xk) (xk − xk−1) / (f(xk) − f(xk−1)),
which is the secant method.
I Also known as “the poor-man’s Newton method”.
I Under favorable conditions, this method has superlinear convergence
(1 < r < 2), with r ≈ 1.6180.
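The secant iteration can be sketched in a few lines of Python (illustrative names, not from the slides):

```python
def secant(f, x0, x1, tol=1e-12, maxit=100):
    """Secant method: Newton's update with f' replaced by a difference quotient."""
    f0, f1 = f(x0), f(x1)
    for _ in range(maxit):
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)   # secant step
        x0, f0 = x1, f1
        x1, f1 = x2, f(x2)
        if abs(x1 - x0) < tol:
            break
    return x1

root = secant(lambda x: x * x - 2.0, 1.0, 2.0)   # converges toward sqrt(2)
```

Note that only one new function evaluation is needed per iteration, since f(xk−1) is reused.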
[Diagram: golden-section interval subdivision — interior points at 1 − τ and τ of the unit interval, nested across iterations]
I If we evaluate two points such that the two next possible intervals are the
same size and one of the points is reused, we have a more efficient method.
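This reuse property gives golden-section search, sketched here in Python (our own illustration; τ = (√5 − 1)/2 ≈ 0.618):

```python
import math

def golden_section(f, a, b, tol=1e-8):
    """Golden-section search: interior points at 1 - tau and tau of the
    bracket reuse one evaluation per iteration; the bracket shrinks by tau."""
    tau = (math.sqrt(5.0) - 1.0) / 2.0
    x1, x2 = b - tau * (b - a), a + tau * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 < f2:                       # minimum lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - tau * (b - a)
            f1 = f(x1)
        else:                             # minimum lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + tau * (b - a)
            f2 = f(x2)
    return 0.5 * (a + b)

xmin = golden_section(lambda x: (x - 1.5) ** 2, 0.0, 4.0)
```

Because τ² = 1 − τ, the surviving interior point falls exactly where the next iteration needs it, so only one new function evaluation is required per step.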
Polynomial Interpolation 1
I Idea: use information about f gathered during iteration.
I One way of using this information is to produce an estimate of the function
which we can easily minimize.
I The lowest order function that we can use for this purpose is a quadratic,
since a linear function does not have a minimum.
I Suppose we approximate f by the quadratic
f̃(x) = (1/2) a x² + b x + c.
I If a > 0, the minimum of this function is at x∗ = −b/a.
[Figure: successive parabolic interpolation — a parabola through points 1, 2, 3 and a refined parabola through points 1, 2, 4, locating new points 4 and 5]
[Figure: line search quantities at xk — search direction pk and gradients gk, gk+1]
Wolfe Conditions 1
I Typical line search tries a sequence of step lengths, accepting the first that
satisfies certain conditions.
I A common condition requires that αk should yield a sufficient decrease of f ,
Wolfe Conditions 2
I Since we start with a negative slope, the gradient at the new point must be
either less negative or positive.
I Typical values of µ2 range from 0.1 to 0.9.
I The sufficient decrease and curvature conditions are known collectively as the
Wolfe conditions.
Backtracking Algorithm
I One of the simplest line search techniques is backtracking.
I It only checks for the sufficient decrease.
I It is guaranteed to satisfy this condition . . . eventually.
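A sketch of backtracking in Python for a one-variable function (a minimal illustration of the idea, not the slides' code; µ1 is the sufficient-decrease parameter):

```python
def backtracking(f, g, x, p, alpha=1.0, rho=0.5, mu1=1e-4):
    """Backtracking line search: shrink alpha until the sufficient-decrease
    (Armijo) condition f(x + a*p) <= f(x) + mu1 * a * (g(x)*p) holds."""
    fx = f(x)
    slope = g(x) * p                  # 1-D case: directional derivative
    while f(x + alpha * p) > fx + mu1 * alpha * slope:
        alpha *= rho                  # backtrack
    return alpha

# Descent direction p = -f'(x) for f(x) = x**2 at x = 1
a = backtracking(lambda x: x * x, lambda x: 2.0 * x, 1.0, -2.0)
```

Since p is a descent direction (slope < 0), the condition must eventually be satisfied for small enough α, which is why the loop terminates.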
[Flowchart: line search with bracketing and “zoom”. Phase 1: try a step; if it fails the sufficient-decrease check, bracket the interval between the previous and current points and call the “zoom” function; if it satisfies sufficient decrease and the curvature condition, the point is good enough; otherwise bracket the interval between the current and previous points. In “zoom”: if the trial point satisfies the curvature condition, it is good enough; otherwise, depending on whether the derivative sign at the point agrees with the interval trend, replace the high point or the low point with the trial point.]
Gradient-Based Optimization
1. Introduction
3. Gradient-Based Optimization
3.1 Introduction
3.2 Gradients and Hessians
3.3 Optimality Conditions
3.4 Steepest Descent
3.5 Conjugate Gradient
3.6 Newton’s Method
3.7 Quasi-Newton Methods
3.8 Trust Region Methods
4. Computing Derivatives
5. Constrained Optimization
6. Gradient-Free Optimization
Gradient-Based Optimization 1
I In the previous chapter, we described methods to decrease a function of one variable.
I Now, consider problems with multiple design variables
The unconstrained optimization problem is,
minimize f (x)
with respect to x ∈ Rn
Gradient-Based Optimization 2
I Gradient-based methods use the gradient of the objective function to find the
most promising search directions
I For large numbers of design variables, gradient-based methods are more
efficient
I Assumptions and restrictions:
I No constraints (address these in later chapter)
I Smooth functions (gradient-free methods in later chapter)
Input: Initial guess, x0
Output: Optimum, x∗
k ← 0
while not converged do
    Compute a search direction pk
    Line search: find a step length αk such that f(xk + αk pk) < f(xk) (the curvature condition may also be included)
    Update the design variables: xk+1 ← xk + αk pk
    k ← k + 1
end while
[Flowchart: search direction → line search → update x → “is x a minimum?” → x∗]
Gradients
Consider a function f(x). The gradient of this function is
∇f(x) ≡ g(x) ≡ [ ∂f/∂x1, ∂f/∂x2, . . . , ∂f/∂xn ]ᵀ
In the multivariate case, the gradient vector is perpendicular to the hyperplane tangent to the contour surfaces of constant f.
Hessians 1
I The second derivative of an n-variable function is defined by n² partial derivatives: ∂²f/∂xi∂xj for i ≠ j, and ∂²f/∂xi² for i = j.
I If the partial derivatives ∂f/∂xi, ∂f/∂xj and ∂²f/∂xi∂xj are continuous and f is single valued, then ∂²f/∂xi∂xj = ∂²f/∂xj∂xi.
I The second-order partial derivatives can be represented by a square symmetric matrix called the Hessian matrix,
∇²f(x) ≡ H(x) ≡
[ ∂²f/∂x1²     · · ·   ∂²f/∂x1∂xn ]
[     ⋮          ⋱         ⋮      ]
[ ∂²f/∂xn∂x1   · · ·   ∂²f/∂xn²   ]
which contains n(n + 1)/2 independent elements.
Hessians 2
I If f is quadratic, the Hessian of f is constant, and the function can be
expressed as
f(x) = (1/2) xᵀ H x + gᵀ x + α.
Optimality Conditions
As in single-variable case, optimality conditions derived from the Taylor-series
expansion:
f(x∗ + εp) ≈ f(x∗) + ε pᵀ g(x∗) + (1/2) ε² pᵀ H(x∗) p,
where ε is a scalar, and p is an n-vector.
I For x∗ to be a local minimum, then
f (x∗ + εp) ≥ f (x∗ ) ⇒ f (x∗ + εp) − f (x∗ ) ≥ 0.
I This means that the sum of the first and second order terms in the
Taylor-series expansion must be greater than or equal to zero.
I Start with first order term: Since p is an arbitrary vector and ε can be positive
or negative, every component of the gradient vector g(x∗ ) must be zero.
I Second order term: For ε2 pT H(x∗ )p to be non-negative, H(x∗ ) has to be
positive semi-definite.
Optimality Conditions
Necessary conditions (for a local minimum):
g(x∗) = 0 and H(x∗) positive semi-definite.
For steepest descent, pk = −gk, and an exact line search requires
df(xk+1)/dα = 0 ⇒ ∇ᵀf(xk+1) ∂(xk + αpk)/∂α = 0 ⇒ ∇ᵀf(xk+1) pk = 0 ⇒ −gᵀ(xk+1) g(xk) = 0,
so successive steepest-descent directions are orthogonal.
Step-size Scaling
I Steepest descent and other gradient methods do not produce well-scaled search directions, so we need to use other information to guess a step length.
I One strategy is to assume that the first-order change in xk will be the same as that obtained in the previous step, i.e., that ᾱ gkᵀ pk = αk−1 gk−1ᵀ pk−1, and therefore:
ᾱ = αk−1 (gk−1ᵀ pk−1) / (gkᵀ pk).
The function f is not quadratic, but, as |x1 | and |x2 | → 0, we see that
Thus, this function is essentially a quadratic near the minimum (0, 0)T .
∇φ(x) = Ax − b ≡ r(x).
∇φ = 0 ⇒ Ax = b.
The conjugate gradient method is an iterative method for solving linear systems of
equations.
I Since this method is just a minor modification away from steepest descent
and performs much better, there is no excuse for steepest descent!
Newton’s Method 1
I Steepest descent and conjugate gradient methods only use first order
information to obtain a local model of the function.
I Newton methods use a second-order Taylor series expansion of the function
about the current design point
f(xk + sk) ≈ fk + gkᵀ sk + (1/2) skᵀ Hk sk,
where sk is the step to the minimum.
I Differentiating this with respect to sk and setting it to zero, we obtain
Hk sk = −gk .
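In Python with NumPy (a sketch of the idea, not the slides' code), one Newton iteration solves the linear system Hk sk = −gk; for a quadratic function it reaches the minimum in a single step:

```python
import numpy as np

def newton_step(g, H, x):
    """One Newton iteration: solve H(x) s = -g(x), then take the full step."""
    s = np.linalg.solve(H(x), -g(x))
    return x + s

# Quadratic f(x) = 0.5 x^T A x - b^T x has gradient A x - b and Hessian A,
# so a single Newton step from any point lands on the solution of A x = b.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
x1 = newton_step(lambda x: A @ x - b, lambda x: A, np.zeros(2))
```

For general nonlinear f, this step is combined with a line search or trust region, since far from the solution the quadratic model may be poor.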
Newton’s Method 2
I As in the single variable case, difficulties and even failure may occur when the
quadratic model is a poor approximation of f far from the current point.
I If Hk is not positive definite, the quadratic model might not have a minimum
or even a stationary point.
I So for some nonlinear functions, the Newton step might be such that
f (xk + sk ) > f (xk ) and the method is not guaranteed to converge.
I Another disadvantage of Newton’s method is the need to compute not only
the gradient, but also the Hessian, which contains n(n + 1)/2 second order
derivatives.
Quasi-Newton Methods
I Quasi-Newton methods use only first order information . . .
I . . . but they build second order information — an approximate Hessian —
based on the sequence of function values and gradients from previous
iterations.
I They are the analog of the secant method in multidimensional space.
I The various quasi-Newton methods differ in how they update the
approximate Hessian.
I Most of them force the Hessian to be symmetric and positive definite.
pk = −Bk⁻¹ gk.
I This solution is used to compute the search direction to obtain the new
iterate
xk+1 = xk + αk pk
where αk is obtained using a line search.
I This is the same procedure as the Newton method, except that we use an
approximate Hessian Bk instead of the true Hessian.
φk+1(p) = fk+1 + gk+1ᵀ p + (1/2) pᵀ Bk+1 p.
I Using the secant method we can find the univariate quadratic function along the previous direction pk based on the last two gradients gk+1 and gk, and the last function value fk+1.
I The slope of the univariate function is the gradient of the function projected onto the p direction, f′ = gᵀp. The univariate quadratic is given by
φk+1(θ) = fk+1 + θ f′k+1 + (θ²/2) f̃″k+1
[Figure: projection of the quadratic model onto the last search direction, illustrating the secant condition via the slopes f′k at xk and f′k+1 at xk+1]
Bk+1 αk pk = gk+1 − gk .
Bk+1 sk = yk .
[Figure: two quasi-Newton iterations, showing points xk, xk+1 with search directions pk, pk+1 and gradients gk, gk+1]
minimize kB − Bk k
with respect to B
subject to B = BT , Bsk = yk .
Vk = Bk⁻¹.
I The DFP update for the inverse of the Hessian approximation can be shown
to be
Vk+1 = Vk − (Vk yk ykᵀ Vk)/(ykᵀ Vk yk) + (sk skᵀ)/(ykᵀ sk)
I Note that this is a rank 2 update.
A Beer-Inspired Algorithm?
[Three contour plots of minimization iteration histories on the (x1, x2) plane; the first panel shows steepest descent.]
Bk+1 = Bk + (yk − Bk sk)(yk − Bk sk)ᵀ / ((yk − Bk sk)ᵀ sk).
I With this formula, we must have safeguards:
I If yk = Bk sk then the denominator is zero, and the only update that satisfies the secant equation is Bk+1 = Bk (i.e., do not change the matrix).
I If yk ≠ Bk sk and (yk − Bk sk)ᵀ sk = 0, then there is no symmetric rank-1 update that satisfies the secant equation.
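The rank-1 update and its safeguards can be sketched in Python/NumPy (illustrative code; the threshold eps for "denominator too small" is an ad hoc choice of ours):

```python
import numpy as np

def sr1_update(B, s, y, eps=1e-8):
    """Symmetric rank-1 update with the safeguards above: skip the update
    when the denominator (y - B s)^T s is zero or dangerously small."""
    r = y - B @ s
    denom = r @ s
    if abs(denom) < eps * max(1.0, np.linalg.norm(r) * np.linalg.norm(s)):
        return B                          # keep B_{k+1} = B_k
    return B + np.outer(r, r) / denom     # rank-1 correction

# For a quadratic with Hessian A, y = A s, so the updated B satisfies
# the secant equation B1 s = y exactly
A = np.array([[3.0, 1.0], [1.0, 2.0]])
B = np.eye(2)
s = np.array([1.0, 0.0])
B1 = sr1_update(B, s, A @ s)
```

Note that, unlike DFP or BFGS, this update does not guarantee that B stays positive definite, which is the price of its simplicity.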
Evaluate f(xk + sk) and compute the ratio that measures the accuracy of the quadratic model,
rk ← (f(xk) − f(xk + sk)) / (f(xk) − q(sk)) = ∆f/∆q
Computing Derivatives
1. Introduction
3. Gradient-Based Optimization
4. Computing Derivatives
4.1 Introduction
4.2 Finite Differences
4.3 Complex-Step Method
4.4 C/C++ Implementations
4.5 Unifying Chain Rule
4.6 The Unifying Chain Rule
4.7 Monolithic Differentiation
4.8 Algorithmic Differentiation
4.9 Analytic Methods
5. Constrained Optimization
6. Gradient-Free Optimization
What’s in a name?
I Derivatives have also been called:
I “Sensitivities” . . . but sensitivity analysis is actually a much broader area of
mathematics.
I “Sensitivity derivatives” — a somewhat redundant term?
I “Design sensitivities” — a fair term to use.
I I have been using the terms “sensitivities” and “sensitivity analysis” up until
this year, but now I prefer “derivatives”, since it is more precise.
I A “gradient” is a vector of derivatives
I A Jacobian is a matrix of derivatives (the gradient of a vector)
I We will focus on first order derivatives of deterministic numerical models.
I A model can be any numerical procedure that given inputs computes some
outputs
Finite Differences 1
I Finite differences are one of the most popular methods for computing
derivatives, mostly because they are extremely easy to implement and do not
require source code
I . . . but they suffer from some serious accuracy and performance issues.
I Finite-difference formulas are derived by combining Taylor series expansions
I It is possible to obtain formulas for arbitrary order derivatives with arbitrary
order truncation error (but it will cost you!)
Finite Differences 2
The simplest finite-difference formula can be directly derived from one Taylor
series expansion,
F(x + ej h) = F(x) + h ∂F/∂xj + (h²/2!) ∂²F/∂xj² + (h³/3!) ∂³F/∂xj³ + . . . ,
which we can solve for the derivative to obtain
∂F/∂xj = (F(x + ej h) − F(x)) / h + O(h)
Finite Differences 3
I Each additional column requires an additional evaluation
I Hence, the cost of computing the complete Jacobian is proportional to the
number of input variables of interest, nx .
For a second-order estimate we use the expansion of f (x − h),
f(x − h) = f(x) − h f′(x) + (h²/2!) f″(x) − (h³/3!) f‴(x) + . . . ,
and subtract it from the expansion of f(x + h) to get the central-difference formula,
f′(x) = (f(x + h) − f(x − h)) / (2h) + O(h²).
More accurate estimates can also be derived by combining different Taylor series
expansions.
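The truncation orders can be checked numerically with a short Python sketch (our own illustration using sin, whose derivative is known):

```python
import math

def forward_diff(f, x, h):
    """O(h) forward-difference estimate of f'(x)."""
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    """O(h^2) central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Exact derivative of sin at x = 1 is cos(1); compare truncation errors
exact = math.cos(1.0)
e_fwd = abs(forward_diff(math.sin, 1.0, 1e-4) - exact)
e_ctr = abs(central_diff(math.sin, 1.0, 1e-4) - exact)
```

With h = 1e-4 the forward-difference error scales like h while the central-difference error scales like h², so the central estimate is several orders of magnitude more accurate; shrinking h much further eventually runs into the subtractive cancellation discussed next.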
Finite Differences 4
Formulas for estimating higher-order derivatives can be obtained by nesting
finite-difference formulas. We can use, for example, the central difference formula
to estimate the second derivative instead of the first,
f″(x) = (f′(x + h) − f′(x − h)) / (2h) + O(h²),
and use central difference again to estimate both f 0 (x + h) and f 0 (x − h) in the
above equation to obtain,
Finite Differences 5
f(x + h) = +1.234567890123431
f(x)     = +1.234567890123456
∆f       = −0.000000000000025
[Figure: finite-difference approximation, showing f(x) and f(x + h) at x and x + h]
Finite Differences 6
I For functions of several variables, we have to calculate each component of the gradient ∇f(x) by perturbing the corresponding component of x and recomputing f.
I Thus the cost of calculating a gradient is proportional to the number of
design variables.
Theory 1
I Like finite-difference formulas, the complex-step approximations can also be
derived using a Taylor series expansion.
I Instead of using a real step h, we now use a pure imaginary step, ih.
I If f is a real function of real variables and it is also analytic, we can expand it in a Taylor series about a real point x as follows,
F(x + ih ej) = F(x) + ih ∂F/∂xj − (h²/2) ∂²F/∂xj² − (ih³/6) ∂³F/∂xj³ + . . .
Taking the imaginary parts of both sides of this equation and dividing by h yields
∂F/∂xj = Im[F(x + ih ej)] / h + O(h²)
We call this the complex-step derivative approximation. Hence the approximation is an O(h²) estimate of the derivative.
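The approximation is essentially one line of Python (a sketch of the method; cmath supplies the complex sin used in the example):

```python
import cmath

def complex_step(f, x, h=1e-20):
    """Complex-step derivative: f'(x) ~= Im[f(x + ih)]/h, with no
    subtraction and hence no subtractive cancellation."""
    return f(x + 1j * h).imag / h

# Derivative of sin at x = 1; accurate to machine precision even for h = 1e-20
d = complex_step(cmath.sin, 1.0)
```

Unlike a finite difference, there is no optimal step size to hunt for: h can simply be made tiny.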
Theory 2
I Like finite differences, each additional evaluation yields a column of the Jacobian dF/dx, and the cost of computing the derivatives is proportional to the number of design variables, nx.
I No subtraction operation in the complex-step approximation, so no
subtractive cancellation error
I the only source of numerical error is the truncation error, O(h2 ).
I By decreasing h to a small enough value, the truncation error can be made to
be of the same order as the numerical precision of the evaluation of f .
I If we take the real part of the Taylor series expansion, we get
f(x) = Re[f(x + ih)] + h² f″(x)/2! − . . .
showing that the real part of the result gives the value of f(x) correct to O(h²).
Theory 3
I The second order errors in the function value and the function derivative can
be eliminated when using finite-precision arithmetic by ensuring that h is
sufficiently small.
I If ε is the relative working precision of a given algorithm, to eliminate the truncation error in the function value, we need an h such that
h² |f″(x)/2!| < ε |f(x)|
I Similarly, for the truncation error of the derivative estimate to vanish we require that
h² |f‴(x)/3!| < ε |f′(x)|
I Although h can be made very small, in some cases it is not possible to satisfy these conditions, e.g., when f(x) or f′(x) tends to zero.
∂f/∂x = lim_{h→0} Im[f(x + ih)] / h.
I For a small discrete h, this can be approximated by,
∂f/∂x ≈ Im[f(x + ih)] / h.
[Diagram: complex plane — real perturbation (x, 0) → (x + h, 0) vs. imaginary perturbation (x, 0) → (x, ih)]
[Plot: normalized error e of the derivative vs. decreasing step size h]
Arithmetic functions
I Arithmetic functions and operators include addition, multiplication, and
trigonometric functions.
I Most of these functions have a standard complex definition that is analytic,
so the complex-step derivative approximation yields the correct result.
I The only standard complex function definition that is non-analytic is the
absolute value function.
I Since ∂v/∂x = 0 on the real axis, we get ∂u/∂y = 0 on the same axis, so the
real part of the result must be independent of the imaginary part of the
variable.
I Therefore, the new sign of the imaginary part depends only on the sign of the
real part of the complex number, and an analytic “absolute value” function is
abs(x + iy) = { −x − iy, if x < 0
             { +x + iy, if x > 0.
Other Issues 1
I Improvements to the complex-step method are necessary because of the way
certain compilers implement the functions.
I For example, the following formula might be used for the arcsin function:
arcsin(z) = −i log[ iz + √(1 − z²) ],
For z = x + ih:  iz + z = (x − h) + i(x + h),
sin(z) = (e^{iz} − e^{−iz}) / (2i).
Other Issues 2
I The complex trigonometric relation yields a better alternative.
I We would like the real and imaginary parts to be calculated separately. This can be achieved by linearizing in h (that is, for small h) to obtain,
arcsin(x + ih) ≈ arcsin(x) + i h/√(1 − x²).
Implementation Procedure
The general procedure for the implementation of the complex-step method for an
arbitrary computer program can be summarized as follows:
1. Substitute all real type variable declarations with complex declarations. It is
not strictly necessary to declare all variables complex, but it is much easier to
do so.
2. Define all functions and operators that are not defined for complex
arguments.
3. Add a small complex step (e.g. h = 1 × 10−20 ) to the desired x, run the
algorithm that evaluates f , and then take the imaginary part of the result
and divide by h.
The above procedure is independent of the programming language. We now
describe the details of our Fortran and C/C++ implementations.
Fortran Implementation 1
I complexify.f90: a module that defines additional functions and operators
for complex arguments.
I Complexify.py: Python script that makes necessary changes to source
code, e.g., type declarations.
I Features:
I Script is versatile:
I Compatible with many more platforms and compilers.
I Supports MPI based parallel implementations.
I Resolves some of the input and output issues.
I Some of the function definitions were improved: tangent, inverse and
hyperbolic trigonometric functions.
C/C++ Implementation
Templates, a C++ feature, can be used to create program source code that is
independent of variable type declarations.
I Compared run time with real-valued code:
I Complexified version: ≈ ×3
I Algorithmic differentiation version: ≈ ×2
[Plot: reference error ε vs. iterations for the iterative solver]
[Plot: relative error ε vs. step size h — complex-step vs. finite-difference]
[Plot: ∂CD/∂bi vs. shape variable i — finite difference vs. complex step]
[Plot: relative error ε vs. step size h]
[Plot: Cdf (values near 4.373)]
vi = Vi (v1 , v2 , . . . , vi−1 ).
where we adopt the convention that the lower case represents the value of a
variable, and the upper case represents the function that computes that value.
I In the more general case, a given function might require values that have not
been previously computed, i.e.,
vi = Vi (v1 , v2 , . . . , vi , . . . , vn ).
r = R(v) = 0
r = R(x, y(x)) = 0
where y(x) denotes the fact that y depends implicitly on x through the
solution of the residual equations
I The solution of these equations completely determines y for a given x.
I The functions of interest (usually included in the set of component outputs)
also have the same type of variable dependence in the general case,
f = F (x, y(x)).
[Diagram: a component with inputs x ∈ R^nx, states y ∈ R^ny, residuals r ∈ R^ny from R(x, y) = 0, and outputs f ∈ R^nf from F(x, y)]
vi = Vi (v1 , . . . , vi−1 )
where all intermediate v’s between j and i are computed and used.
I The total derivative is dvi/dvj,
I Using the two equations above, we can write:
dvi/dvj = δij + Σ_{k=j}^{i−1} (∂Vi/∂vk)(dvk/dvj),
which expresses a total derivative in terms of the other total derivatives and the Jacobian of partial derivatives. The δij term is added to account for the case in which i = j.
I Both of these matrices are lower triangular matrices, due to our assumption
that we have unrolled all of the loops.
I Using this notation, the chain rule can be written as
Dv = I + DV Dv .
(I − DV ) Dv = I.
(I − DV) Dv = I = Dvᵀ (I − DV)ᵀ
I We call the left and right hand sides of this equation the forward and reverse
chain rule equations, respectively.
I All methods for derivative computation can be derived from one of the forms
of this chain rule by changing what we mean by “variables”, which can be
seen as a level of decomposition.
I To drive the residuals to zero, we have to solve the following linear system,
[ x1   2  ] [ y1 ]   [ sin x1 ]
[ −1  x2² ] [ y2 ] = [   0    ]
FUNCTION F(x)
  REAL :: x(2), det, y(2), f(2)
  det = 2 + x(1)*x(2)**2
  y(1) = x(2)**2*SIN(x(1))/det
  y(2) = SIN(x(1))/det
  f(1) = y(1)
  f(2) = y(2)*SIN(x(1))
  RETURN
END FUNCTION F
The objective is to compute the derivatives of both outputs with respect to both inputs, i.e., the Jacobian,
df/dx = [ df1/dx1  df1/dx2 ]
        [ df2/dx1  df2/dx2 ]
We will use this example in later sections to show the application of all methods.
Monolithic Differentiation 1
I In monolithic differentiation, the entire computational model is treated as a
“black box”
I Only track inputs and outputs.
I This is often the only option
I Both the forward and reverse modes of the generalized chain rule reduce to
dfi/dxj = ∂Fi/∂xj
Monolithic Differentiation 2
[Diagram: the model as a black box with input x, residuals r = (r1, r2), states y = (y1, y2), and output f]
v1 = x 1 , v2 = x2 , v3 = f1 , v4 = f2
∂f1/∂x1 ≈ (f1(x1 + h, x2) − f1(x1, x2)) / h = 0.0866023014079,
[Plot: log relative error vs. log step size for finite-difference (FD) and complex-step (CS) estimates]
Algorithmic Differentiation 1
I Algorithmic differentiation (AD) is also known as computational
differentiation or automatic differentiation.
I It is a well-known method based on the systematic application of the
differentiation chain rule to computer programs.
I With AD the variables v in the chain rule are all of the variables assigned in
the computer program
I Thus, AD applies the chain rule for every single line in the program.
I The computer program is considered as a sequence of explicit functions Vi ,
where i = 1, . . . , n.
I Assume that all of the loops in the program are unrolled, and therefore no
variables are overwritten and each variable only depends on earlier variables
in the sequence.
I This assumption is not restrictive, as programs iterate the chain rule together
with the program variables, converging to the correct total derivatives.
Algorithmic Differentiation 2
I Typically, the design variables are among the first v’s, and the quantities of
interest are the last quantities.
Algorithmic Differentiation 3
[Diagram: the sequence of program variables v1, . . . , vn, beginning with the inputs x, passing through the residuals r and states y, and ending with the outputs f.]
Algorithmic Differentiation 4
I The chain rule is
    dvi/dvj = δij + Σ_{k=j}^{i−1} (∂Vi/∂vk)(dvk/dvj) ,
where the V represent explicit functions, each defined by a single line in the
computer program.
I The partial derivatives, ∂Vi /∂vk can be automatically differentiated
symbolically by applying another chain rule within the function defined by the
respective line in the program.
I The chain rule can be solved in two ways.
Forward mode: choose one vj and keep j fixed. Then we work our way
forward in the index i = 1, 2, . . . , n until we get the desired
total derivative.
Reverse mode: fix vi (the quantity we want to differentiate) and work our
way backward in the index j = n, n − 1, . . . , 1 all of the way to
the independent variables.
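The forward mode can be sketched with a minimal operator-overloading class that propagates (value, derivative) pairs, in the spirit of AD tools (Python used for illustration; only the operations needed for the example's f1 are defined):

```python
import math

class Dual:
    """Minimal forward-mode AD: each value carries its derivative with
    respect to one chosen input (a sketch, not a full AD tool)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__
    def __truediv__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val / o.val,
                    (self.dot * o.val - self.val * o.dot) / o.val**2)
    def __pow__(self, n):          # integer powers only, enough for this example
        return Dual(self.val**n, n * self.val**(n - 1) * self.dot)

def dsin(a):
    # chain rule for the sine intrinsic
    return Dual(math.sin(a.val), math.cos(a.val) * a.dot)

# Differentiate f1 of the example with respect to x1 (seed dot = 1)
x1, x2 = Dual(1.0, 1.0), Dual(1.0, 0.0)
det = 2 + x1 * x2**2
f1 = x2**2 * dsin(x1) / det
```

Here f1.val is the function value and f1.dot is df1/dx1, obtained by applying the chain rule line by line.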
Algorithmic Differentiation 5
I The chain rule in matrix form,
(I − DV ) Dv = I ⇒
1 0 ··· 1 0 ···
− ∂V 2
1 0 ··· dv 2
1 0 ···
∂V∂v1 dv 1
− ∂v 3 − ∂V3
1 0 · · · dv31
dv dv3
1 0 · · · =
1 ∂v2 ∂v2
.. .. .. .. .. .. .. ..
. . . . . . . .
− ∂Vn
∂v1
− ∂Vn
∂v2
··· − ∂v∂Vn
n−1
1 dvn
dv1
dvn
dv2
··· dvn
dvn−1
1
1 0 ···
0 1 0 ···
0 0
· · · .
1 0
. . .. .. ..
.. .. . . .
0 0 0 0 1
Algorithmic Differentiation 6
I The terms that we ultimately want to compute are the total derivatives of
the quantities of interest with respect to the design variables, corresponding
to an nf × nx block in the lower left of the Dv matrix:

    df/dx = [ df1/dx1    · · ·   df1/dxnx  ]   [ dv_{n−nf+1}/dv1   · · ·   dv_{n−nf+1}/dvnx ]
            [    ...       ...      ...    ] = [       ...           ...         ...        ]
            [ dfnf/dx1   · · ·   dfnf/dxnx ]   [ dvn/dv1           · · ·   dvn/dvnx         ]

which is an nf × nx matrix.
I The forward mode is equivalent to solving the linear system for one column of
Dv .
I Since (I − DV ) is a lower triangular matrix, this solution can be
accomplished by forward substitution.
I In the process, we end up computing the derivative of the chosen quantity
with respect to all of the other variables.
Algorithmic Differentiation 7
I The cost of this procedure is similar to the cost of the procedure that
computes the v’s.
For the example, define v1 = x1, v2 = x2, v3 = 2 + v1 v2² (= det), v4 = v2² sin v1 / v3 (= y1), v5 = sin v1 / v3 (= y2), v6 = v4 (= f1), and v7 = v5 sin v1 (= f2). The nonzero off-diagonal entries of (I − D_V) come from the partial derivatives

    ∂V3/∂v1 = v2² ,              ∂V3/∂v2 = 2 v1 v2 ,
    ∂V4/∂v1 = v2² cos v1 / v3 ,  ∂V4/∂v2 = 2 v2 sin v1 / v3 ,  ∂V4/∂v3 = −v2² sin v1 / v3² ,
    ∂V5/∂v1 = cos v1 / v3 ,      ∂V5/∂v3 = −sin v1 / v3² ,
    ∂V6/∂v4 = 1 ,
    ∂V7/∂v1 = v5 cos v1 ,        ∂V7/∂v5 = sin v1 .

The forward mode solves the lower triangular system (I − D_V) Dv = I by forward substitution, one column of Dv at a time (e.g., all dvi/dv1); the reverse mode solves the transposed, upper triangular system (I − D_V)ᵀ Dvᵀ = I by back substitution, one row of Dv at a time (e.g., all dv7/dvj).
Available AD Tools 1
The tools for the various programming languages include:
I Fortran
I ADIFOR: Source transformation; forward and reverse modes; Fortran 77;
non-commercial.
I AD01: Operator overloading; forward and reverse modes; Fortran 90;
commercial.
I OPFAD/OPRAD: Operator overloading; forward and reverse modes; Fortran
90; non-commercial.
I TAMC: Source transformation; forward and reverse modes; Fortran 77;
non-commercial.
I TAF: Source transformation; forward and reverse modes; Fortran 90;
commercial.
I Tapenade: Source transformation; Fortran 90; non-commercial. Developed at
INRIA Sophia-Antipolis. Formerly Odyssée.
I C/C++: Various established tools for automatic differentiation exist. These
include ADIC, an implementation mirroring ADIFOR, and ADOL-C, a
free package that uses operator overloading and can operate in the forward or
reverse mode and compute higher-order derivatives.
Available AD Tools 2
I Other languages: Tools also exist for other languages, such as Matlab and
Python.
Automatic Complex-Step
    Forward AD                      Complex step
    ∆x1 = 1                         h1 = 10⁻²⁰
    ∆x2 = 0                         h2 = 0
    f = x1 x2                       f = (x1 + i h1)(x2 + i h2)
    ∆f = x1 ∆x2 + x2 ∆x1            f = x1 x2 − h1 h2 + i (x1 h2 + x2 h1)
    df/dx1 = ∆f                     df/dx1 = Im f / h1

The complex-step method computes one extra term (−h1 h2). Other functions are similar:
I Superfluous calculations are made.
I For h ≲ x × 10⁻²⁰ these terms vanish by underflow, but they still affect speed.
Analytic Methods 1
I Analytic methods are the most accurate and efficient methods.
I They are much more involved, however, since they require detailed knowledge of the
computational model and a longer implementation time.
I Applicable when f depends implicitly on x:
f = F (x, y(x)).
I The implicit relationship between the state variables y and the independent
variables is defined by the solution of a set of residual equations,
r = R(x, y(x)) = 0.
Analytic Methods 2
[Diagram: two paths from the continuous governing equations to sensitivity equations: differentiate first and then discretize (continuous sensitivity equations), or discretize the governing equations first and then differentiate (discrete sensitivity equations).]
Traditional Derivation 1
I Using the chain rule we can write,
df ∂F ∂ F dy
= + ,
dx ∂x ∂y dx
where the result is an nf × nx matrix.
I The partial derivatives represent the variation of f = F (x, y) with respect to
changes in x for a fixed y.
I The total derivative df / dx takes into account the change in y that is
required to keep the residual equations equal to zero.
I This distinction depends on the context, i.e., what is considered a total or
partial derivative depends on the level that is being considered in the nested
system of components.
Traditional Derivation 2
I Since the governing equations must always be satisfied, the total derivative of
the residuals r = R(x, y(x)) = 0 with respect to the design variables must also
be zero. Thus, using the chain rule,

    dr/dx = ∂R/∂x + (∂R/∂y)(dy/dx) = 0 .
I The computation of the total derivative matrix dy/ dx is much more
expensive than any of the partial derivatives, since it requires the solution of
the residual equations.
I The partial derivatives can be computed by differentiating the function F
with respect to x while keeping y constant, and can be computed using
symbolic differentiation, finite differences, complex step, or AD.
I The linearized residual equations provide the means for computing the total
Jacobian matrix dy/dx, by rewriting them as

    (∂R/∂y)(dy/dx) = −∂R/∂x .
Traditional Derivation 3
I Substituting this result into the total derivative equation, we obtain

    df/dx = ∂F/∂x − (∂F/∂y) (∂R/∂y)⁻¹ (∂R/∂x) ,

where −(∂R/∂y)⁻¹(∂R/∂x) = dy/dx, and the product of the first two factors in the
subtracted term is associated with the adjoint vector ψ.
I The inverse of the square Jacobian matrix ∂R/∂y is not necessarily explicitly
calculated.
I There are two ways of computing the total derivative matrix df/dx:
    Direct method: solve the linear system with the nx columns of ∂R/∂x as
    right-hand sides (the Jacobian is factorized once and back-substituted nx times).
    Adjoint method: solve the transposed linear system with the nf columns of
    (∂F/∂y)ᵀ as right-hand sides (nf back-substitutions).
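As a concrete sketch, the direct method applied to the earlier numerical example (Python used for illustration; the residuals R1 = x1 y1 + 2 y2 − sin x1 and R2 = −y1 + x2² y2 are assumed from the 2×2 system shown earlier):

```python
import math

def solve2(A, b):
    # 2x2 linear solve by Cramer's rule
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]

x = [1.0, 1.0]
s, c = math.sin(x[0]), math.cos(x[0])
det = 2.0 + x[0] * x[1]**2
y = [x[1]**2 * s / det, s / det]                    # converged states

dRdy = [[x[0], 2.0], [-1.0, x[1]**2]]               # partial R / partial y
dRdx = [[y[0] - c, 0.0], [0.0, 2.0 * x[1] * y[1]]]  # partial R / partial x
dFdy = [[1.0, 0.0], [0.0, s]]                       # f1 = y1, f2 = y2*sin(x1)
dFdx = [[0.0, 0.0], [y[1] * c, 0.0]]

J = [[0.0, 0.0], [0.0, 0.0]]
for j in range(2):                                  # one linear solve per design variable
    dydxj = solve2(dRdy, [-dRdx[0][j], -dRdx[1][j]])
    for i in range(2):
        J[i][j] = dFdx[i][j] + dFdy[i][0] * dydxj[0] + dFdy[i][1] * dydxj[1]
```

Note that the cost scales with the number of design variables nx: one linearized solve per column of ∂R/∂x.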
In summary:

    df/dx = ∂F/∂x − (∂F/∂y)(∂R/∂y)⁻¹(∂R/∂x)

    Direct (forward):   (∂R/∂y)(dy/dx) = −∂R/∂x ,        then  df/dx = ∂F/∂x + (∂F/∂y)(dy/dx)
    Adjoint (reverse):  (∂R/∂y)ᵀ (df/dr)ᵀ = −(∂F/∂y)ᵀ ,  then  df/dx = ∂F/∂x + (df/dr)(∂R/∂x)
Rk = Kki ui − Fk = 0,
where Kki is the stiffness matrix, ui is the vector of displacement (the state)
and Fk is the vector of applied force (not to be confused with the function of
interest from the previous section!).
I We want the derivatives of the stresses, which are related to the
displacements by the equation,
σm = Smi ui .
    Kkiᵀ ψk = ∂σm/∂ui .
Then we would substitute the adjoint vector into the equation,

    dσm/dAj = ∂σm/∂Aj − ψkᵀ (∂Kki/∂Aj) ui .
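A minimal numerical sketch of this procedure (Python used for illustration; the two-spring stiffness model, load vector, and S matrix below are hypothetical choices, not from the slides):

```python
def solve2(A, b):
    # 2x2 linear solve by Cramer's rule
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]

def K(A):
    # Hypothetical stiffness matrix: two springs with stiffnesses equal to the areas A
    return [[A[0] + A[1], -A[1]], [-A[1], A[1]]]

S = [[1.0, 0.0], [0.0, 1.0]]   # "stresses" taken as the displacements, for simplicity
F = [1.0, 2.0]                 # applied forces
A = [1.0, 2.0]                 # design variables (areas)

u = solve2(K(A), F)            # solve K u = F for the state

# Adjoint for sigma_0: K^T psi = d sigma_0 / d u  (K is symmetric here)
psi = solve2(K(A), S[0])

dK = [[[1.0, 0.0], [0.0, 0.0]],      # dK/dA_0
      [[1.0, -1.0], [-1.0, 1.0]]]    # dK/dA_1
dsig = []
for j in range(2):
    dKu = [dK[j][0][0] * u[0] + dK[j][0][1] * u[1],
           dK[j][1][0] * u[0] + dK[j][1][1] * u[1]]
    # d sigma_0 / dA_j = partial term (= 0 here) - psi^T (dK/dA_j) u
    dsig.append(-(psi[0] * dKu[0] + psi[1] * dKu[1]))

# Finite-difference check of the first derivative
h = 1e-7
u_p = solve2(K([A[0] + h, A[1]]), F)
fd = (u_p[0] - u[0]) / h
```

One adjoint solve gives the derivatives of one stress with respect to all the areas, which is the point of the adjoint method when there are many design variables.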
v1 = x, v2 = r, v3 = y, v4 = f .

[Diagram: the variables grouped as the inputs x, residuals r = (r1, r2), states y = (y1, y2), and outputs f.]

    v = [ v1, . . . , v_nx | v_(nx+1), . . . , v_(nx+ny) | v_(nx+ny+1), . . . , v_(nx+2ny) | v_(n−nf), . . . , v_n ]ᵀ
              x                      r                              y                              f
In terms of the perturbations ∆x, ∆r, ∆y, and ∆f, the linearized system is

    v1 = x
    v2 = r = (∂R/∂x) x
    v3 = y = (∂R/∂y)⁻¹ (−r)
    v4 = f = (∂F/∂x) x + (∂F/∂y) y
I Now, all variables are functions of only previous variables, so we can apply
the forward and reverse chain rule equations to the linearized system
Adjoint Method 1
I The linear system involving the Jacobian matrix ∂R/∂y can be solved with
∂f /∂y as the right-hand side.
I This results in the following adjoint equations,
    (∂R/∂y)ᵀ ψ = −(∂F/∂y)ᵀ ,
Adjoint Method 2
I Thus, the cost of computing the total derivative matrix using the adjoint
method is independent of the number of design variables, nx , and instead
proportional to the number of quantities of interest, nf .
I The partial derivatives shown in these equations need to be computed using
some other method. They can be differentiated symbolically, computed by
finite differences, the complex-step method or even AD. The use of AD for
these partials has been shown to be particularly effective in the development
of analytic methods for PDE solvers.
I After evaluating the system at [x1 , x2 ] = [1, 1] and [y1 , y2 ] = [sin(1)/3, sin(1)/3], we
can find df1 / dx1 using the computed values for df1 / dr1 and df1 / dr2 :
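A sketch of this adjoint computation for the example (Python used for illustration; the residuals R1 = x1 y1 + 2 y2 − sin x1 and R2 = −y1 + x2² y2 are assumed from the 2×2 system shown earlier, with the sign convention (∂R/∂y)ᵀψ = −(∂F/∂y)ᵀ and df/dx = ∂F/∂x + ψᵀ ∂R/∂x):

```python
import math

def solve2(A, b):
    # 2x2 linear solve by Cramer's rule
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - b[0] * A[1][0]) / det]

x = [1.0, 1.0]
s, c = math.sin(x[0]), math.cos(x[0])
det = 2.0 + x[0] * x[1]**2
y = [x[1]**2 * s / det, s / det]

dRdy_T = [[x[0], -1.0], [2.0, x[1]**2]]             # transpose of partial R / partial y
dFdy_f1 = [1.0, 0.0]                                # row of partial F / partial y for f1

# One adjoint solve per output, independent of the number of inputs
psi = solve2(dRdy_T, [-dFdy_f1[0], -dFdy_f1[1]])    # psi^T = df1/dr
dRdx = [[y[0] - c, 0.0], [0.0, 2.0 * x[1] * y[1]]]
df1dx = [psi[0] * dRdx[0][j] + psi[1] * dRdx[1][j] for j in range(2)]
```

Here psi[0] and psi[1] play the role of df1/dr1 and df1/dr2: a single solve yields the derivatives of f1 with respect to both design variables.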
Constrained Optimization
1. Introduction
3. Gradient-Based Optimization
4. Computing Derivatives
5. Constrained Optimization
5.1 Introduction
5.2 Equality Constraints
5.3 Inequality Constraints
5.4 Constraint Qualification
5.5 Penalty Methods
5.6 Sequential Quadratic Programming
6. Gradient-Free Optimization
Constrained Optimization
I Engineering design optimization problems are rarely unconstrained.
I The constraints that appear in these problems are typically nonlinear.
I Thus, we are interested in general nonlinearly constrained optimization theory
and methods.
Recall the statement of a general optimization problem,
minimize f (x)
with respect to x ∈ Rn
subject to ĉj (x) = 0, j = 1, . . . , m̂
ck (x) ≥ 0, k = 1, . . . , m
minimize f (x)
with respect to x ∈ Rn
subject to ĉj (x) = 0, j = 1, . . . , m̂
Lagrange Multipliers 1
I Joseph Louis Lagrange is credited with developing a more general method to
solve this problem.
I At a stationary point, the total differential of the objective function has to be
equal to zero,
    df = (∂f/∂x1) dx1 + (∂f/∂x2) dx2 + · · · + (∂f/∂xn) dxn = ∇f ᵀ dx = 0 .
I Unlike unconstrained optimization, the infinitesimal vector
T
dx = [ dx1 , dx2 , . . . , dxn ] is not arbitrary
I The perturbation x + dx must be feasible: ĉj (x + dx) = 0.
I Therefore, the above equation does not imply that ∇f = 0.
Lagrange Multipliers 2
I For a feasible point, the total differential of each of the constraints
(ĉ1 , . . . ĉm̂ ) must also be zero:
    dĉj = (∂ĉj/∂x1) dx1 + · · · + (∂ĉj/∂xn) dxn = ∇ĉjᵀ dx = 0 ,  j = 1, . . . , m̂
I To interpret the above equation, recall that the gradient of a function is
orthogonal to its contours.
I Thus, since the displacement dx satisfies ĉj (x + dx) = 0 (the equation for a
contour), it follows that dx is orthogonal to the gradient ∇ĉj .
I Lagrange suggested that one could multiply each constraint variation by a
scalar λ̂j and subtract it from the objective function variation,
    df − Σ_{j=1}^{m̂} λ̂j dĉj = 0  ⇒  Σ_{i=1}^{n} ( ∂f/∂xi − Σ_{j=1}^{m̂} λ̂j ∂ĉj/∂xi ) dxi = 0 .
Lagrange Multipliers 3
I Notice what has happened: the components of the infinitesimal vector dx
have become independent and arbitrary, because we have accounted for the
constraints.
I Thus, for this equation to be satisfied, we need a vector λ̂ such that the
expression inside the parentheses vanishes, i.e.,
    ∂f/∂xi − Σ_{j=1}^{m̂} λ̂j ∂ĉj/∂xi = 0 ,  (i = 1, 2, . . . , n)
I Defining L(x, λ̂) = f (x) − Σ_{j=1}^{m̂} λ̂j ĉj (x), we call this function the
Lagrangian of the constrained problem, and the weights λ̂j the Lagrange
multipliers. A stationary point of the Lagrangian with respect to both x and λ̂
will satisfy
    ∂L/∂xi = ∂f/∂xi − Σ_{j=1}^{m̂} λ̂j ∂ĉj/∂xi = 0 ,  (i = 1, . . . , n)
    ∂L/∂λ̂j = ĉj = 0 ,  (j = 1, . . . , m̂).
minimize f (x) = x1 + x2
with respect to x1 , x2
subject to ĉ1 (x) = x21 + x22 − 2 = 0
I At the solution the constraint normal ∇ĉ1 (x∗ ) is parallel to ∇f (x∗ ), i.e.,
there is a scalar λ̂∗1 such that ∇f (x∗ ) = λ̂∗1 ∇ĉ1 (x∗ ).
I A feasible perturbation d must satisfy ĉ1 (x + d) = 0, and expanding,

    ĉ1 (x + d) = ĉ1 (x) + ∇ĉ1ᵀ(x) d + O(dᵀd) ,  with ĉ1 (x) = 0.

I Noting that ∇x L(x, λ̂1 ) = ∇f (x) − λ̂1 ∇ĉ1 (x), we can state the
necessary optimality condition as follows: At the solution x∗ there is a scalar
λ̂∗1 such that ∇x L(x∗ , λ̂∗1 ) = 0.
I Thus we can search for solutions of the equality-constrained problem by
searching for a stationary point of the Lagrangian function. The scalar λ̂1 is
the Lagrange multiplier for the constraint ĉ1 (x) = 0.
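The first-order condition for this example can be checked numerically (a quick sketch, Python used for illustration):

```python
# Check grad f = lambda * grad c1 at x* = (-1, -1), with f = x1 + x2
# and c1(x) = x1^2 + x2^2 - 2
xstar = (-1.0, -1.0)
grad_f = (1.0, 1.0)
grad_c = (2.0 * xstar[0], 2.0 * xstar[1])     # (-2, -2), parallel to grad f
lam = grad_f[0] / grad_c[0]                   # Lagrange multiplier, -0.5
residual = tuple(gf - lam * gc for gf, gc in zip(grad_f, grad_c))
```

The residual of ∇x L is exactly zero, confirming that (−1, −1) with λ̂∗1 = −1/2 is a stationary point of the Lagrangian.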
minimize f (x)
w.r.t x ∈ Rn
subject to ĉj (x) = 0, j = 1, . . . , m̂
ck (x) ≥ 0, k = 1, . . . , m
I The optimality (KKT) conditions for this problem can also be obtained for
this case by modifying the Lagrangian to be

    L(x, λ̂, λ, s) = f (x) − λ̂ᵀ ĉ(x) − λᵀ (c(x) − s²) ,

where the slack variables s turn the inequalities into the equalities ck − sk² = 0.
    ∇x L = 0  ⇒  ∂L/∂xi = ∂f/∂xi − Σ_{j=1}^{m̂} λ̂j ∂ĉj/∂xi − Σ_{k=1}^{m} λk ∂ck/∂xi = 0 ,  i = 1, . . . , n
    ∇λ̂ L = 0  ⇒  ∂L/∂λ̂j = ĉj = 0 ,  j = 1, . . . , m̂
    ∇λ L = 0  ⇒  ∂L/∂λk = ck − sk² = 0 ,  k = 1, . . . , m
    ∇s L = 0  ⇒  ∂L/∂sk = λk sk = 0 ,  k = 1, . . . , m
    λk ≥ 0 ,  k = 1, . . . , m.
minimize f (x) = x1 + x2
s.t. c1 (x) = 2 − x21 − x22 ≥ 0
I The feasible region is now the circle and its interior. Note that ∇c1 (x) now
points towards the center of the circle.
I Graphically, we can see that the solution is still (−1, −1)T and therefore
λ∗1 = 1/2.
I For a descent direction we need ∇f ᵀ(x) d < 0 .
I The first condition, however, is slightly different, since the constraint is not
necessarily zero, i.e., c1 (x + d) ≥ 0.
I Performing a Taylor series expansion we have

    c1 (x + d) = c1 (x) + ∇c1ᵀ(x) d + O(dᵀd) ≥ 0 .

I The optimality conditions for these two cases can again be summarized by
using the Lagrangian function.
minimize f (x) = x1 + x2
s.t. c1 (x) = 2 − x21 − x22 ≥ 0, c2 (x) = x2 ≥ 0.
The feasible region is now a half disk. Graphically, we can see that the solution is
now (−√2, 0)ᵀ and that both constraints are active at this point.
    c1 (x + d) ≥ 0  ⇒  c1 (x) + ∇c1 (x)ᵀ d ≥ 0 ,
    c2 (x + d) ≥ 0  ⇒  ∇c2 (x)ᵀ d ≥ 0 ,
    f (x + d) − f (x) < 0  ⇒  ∇f (x)ᵀ d < 0 .
I We only need to worry about the last two conditions, since the first is always
satisfied for a small enough step.
I By noting that

    ∇f (x∗ ) = (1, 1)ᵀ ,  ∇c2 (x∗ ) = (0, 1)ᵀ ,

we can see that the vector d = (−1/2, 1/4)ᵀ, for example, satisfies the two
conditions.
Constraint Qualification 1
I The KKT conditions are derived using certain assumptions and depending on
the problem, these assumptions might not hold.
I A point x satisfying a set of constraints is a regular point if the gradient
vectors of the active constraints, ∇cj (x) are linearly independent.
I To illustrate this, suppose we replaced the ĉ1 (x) in the previous example by
the equivalent condition
    ĉ1 (x) = (x1² + x2² − 2)² = 0.
I Then we have

    ∇ĉ1 (x) = [ 4(x1² + x2² − 2) x1 ]
              [ 4(x1² + x2² − 2) x2 ] ,
so ∇ĉ1 (x) = 0 for all feasible points and ∇f (x) = λ̂1 ∇ĉ1 (x) cannot be
satisfied. In other words, there is no (finite) Lagrange multiplier that makes
the objective gradient parallel to the constraint gradient, so we cannot solve
the optimality conditions.
Constraint Qualification 2
I This does not imply there is no solution; on the contrary, the solution
remains unchanged for the earlier example.
I Instead, what it means is that most algorithms will fail, because they assume
the constraints are linearly independent.
minimize f (x)
subject to ĉ(x) = 0
φ(x) = 0 if x is feasible
φ(x) > 0 otherwise,
minimize π(x, ρk )
w.r.t. x
5: xk+1 ← x
6: ρk+1 ← τ ρk . Increase the penalty parameter
7: k ←k+1
8: until xk converges to the desired tolerance
The increase in the penalty parameter for each iteration can range from modest
(ρk+1 = 1.4ρk ), to ambitious (ρk+1 = 10ρk ), depending on the problem.
I The penalty is equal to the sum of the square of all the constraints and is
therefore greater than zero when any constraint is violated and is zero when
the point is feasible.
I We can modify this method to handle inequality constraints by defining the
penalty for these constraints as
    φ(x, ρ) = ρ Σ_{i=1}^{m} ( max[0, −ci (x)] )² .
I Penalty functions suffer from problems of ill conditioning. The solution of the
modified problem approaches the true solution as limρ→+∞ x∗ (ρ) = x∗
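A small sketch of the exterior quadratic penalty applied to the earlier example (minimize x1 + x2 subject to 2 − x1² − x2² ≥ 0); Python used for illustration, with a simple Armijo gradient descent as the unconstrained inner solver (an illustrative choice, not the slides' algorithm):

```python
def penalized(x, rho):
    # f = x1 + x2 plus quadratic penalty for the constraint c1 = 2 - x1^2 - x2^2 >= 0
    v = max(0.0, x[0]**2 + x[1]**2 - 2.0)       # constraint violation, max(0, -c1)
    return x[0] + x[1] + rho * v * v

def grad(x, rho):
    v = max(0.0, x[0]**2 + x[1]**2 - 2.0)
    return [1.0 + 4.0 * rho * v * x[0], 1.0 + 4.0 * rho * v * x[1]]

def minimize_gd(rho, x, iters=4000):
    # gradient descent with Armijo backtracking
    for _ in range(iters):
        g = grad(x, rho)
        fx, t = penalized(x, rho), 1.0
        while (penalized([x[0] - t * g[0], x[1] - t * g[1]], rho)
               > fx - 0.5 * t * (g[0]**2 + g[1]**2)) and t > 1e-14:
            t *= 0.5
        x = [x[0] - t * g[0], x[1] - t * g[1]]
    return x

x = [0.0, 0.0]
for rho in [1.0, 10.0, 100.0]:    # increase the penalty, warm-starting each solve
    x = minimize_gd(rho, x)       # iterates approach (-1, -1) from outside
```

The solutions of the penalized subproblems approach (−1, −1) from the infeasible side as ρ grows, which is the exterior-penalty behavior described above.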
minimize f (x)
subject to c(x) ≥ 0
I The solution of the modified problem for both functions approaches the real
solution as limµ→0 x∗ (µ) = x∗ .
I Again, the Hessian matrix becomes increasingly ill conditioned as µ
approaches zero.
minimize f (x)
subject to ĉj (x) = 0, j = 1, . . . , m̂
where the Hessian of the Lagrangian is denoted by W (x, λ̂) = ∇2xx L(x, λ̂).
I The Newton step from the current point is given by

    [ xk+1 ]   [ xk ]   [ pk  ]
    [ λ̂k+1 ] = [ λ̂k ] + [ pλ̂ ] .
Wk p + gk − ATk λ̂k = 0
Ak p + ĉk = 0
I By writing this in matrix form, we see that pk and λ̂k+1 can be identified as
the solution of the Newton equations we derived previously,

    [ Wk  −Akᵀ ] [ pk   ]   [ −gk ]
    [ Ak    0  ] [ λ̂k+1 ] = [ −ĉk ] .
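Applied to the earlier equality-constrained example (minimize x1 + x2 subject to x1² + x2² − 2 = 0), these Newton equations can be iterated directly (a sketch, Python used for illustration):

```python
def solve3(M, b):
    # Gaussian elimination with partial pivoting for a 3x3 system
    M = [row[:] + [bi] for row, bi in zip(M, b)]
    for k in range(3):
        p = max(range(k, 3), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, 3):
            m = M[i][k] / M[k][k]
            for j in range(k, 4):
                M[i][j] -= m * M[k][j]
    x = [0.0, 0.0, 0.0]
    for i in range(2, -1, -1):
        x[i] = (M[i][3] - sum(M[i][j] * x[j] for j in range(i + 1, 3))) / M[i][i]
    return x

x1, x2, lam = -2.0, -1.0, -1.0           # starting point and multiplier guess
for _ in range(15):
    W = -2.0 * lam                       # Hessian of the Lagrangian: -lam * 2I
    A1, A2 = 2.0 * x1, 2.0 * x2          # constraint Jacobian A
    chat = x1**2 + x2**2 - 2.0
    # [ W  0  -A1 ] [p1  ]   [-g1  ]
    # [ 0  W  -A2 ] [p2  ] = [-g2  ]   with g = grad f = (1, 1)
    # [ A1 A2  0  ] [lam+]   [-chat]
    p1, p2, lam = solve3([[W, 0.0, -A1],
                          [0.0, W, -A2],
                          [A1, A2, 0.0]],
                         [-1.0, -1.0, -chat])
    x1, x2 = x1 + p1, x2 + p2
```

The iterates converge quadratically to x∗ = (−1, −1) with λ̂∗ = −1/2.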
Quasi-Newton Approximations 1
I Any SQP method relies on a choice of Wk (an approximation of the Hessian
of the Lagrangian) in the quadratic model.
I When Wk is exact, then the SQP becomes the Newton method applied to
the optimality conditions.
I One way to approximate the Hessian of the Lagrangian would be to use a
quasi-Newton approximation, such as the BFGS update formula. We could
define

    sk = xk+1 − xk ,  yk = ∇x L(xk+1 , λ̂k+1 ) − ∇x L(xk , λ̂k+1 ) ,

and then compute the new approximation Bk+1 using the same formula used
in the unconstrained case.
I If ∇2xx L is positive definite at the sequence of points xk , the method will
converge rapidly, just as in the unconstrained case. If, however, ∇2xx L is not
positive definite, then using the BFGS update may not work well.
Quasi-Newton Approximations 2
I To ensure that the update is always well-defined the damped BFGS updating
for SQP was devised. Using this scheme, we set
    rk = θk yk + (1 − θk ) Bk sk ,

    Bk+1 = Bk − (Bk sk skᵀ Bk)/(skᵀ Bk sk) + (rk rkᵀ)/(skᵀ rk) ,
Quasi-Newton Approximations 3
I When θk = 1 we have an unmodified BFGS update.
I The modified method thus produces an interpolation between the current Bk
and the one corresponding to BFGS.
I The choice of θk ensures that the new approximation stays close enough to
the current approximation to guarantee positive definiteness.
Other Modifications 1
I In addition to using a different quasi-Newton update, SQP algorithms also
need modifications to the line search criteria in order to ensure that the
method converges from remote starting points.
I It is common to use a merit function, φ to control the size of the steps in the
line search. The following is one of the possibilities for such a function:
    φ(xk ; µ) = f (x) + (1/µ) ||ĉ||₁
I The penalty parameter µ is positive and the L1 norm of the equality
constraints is
    ||ĉ||₁ = Σ_{j=1}^{m̂} |ĉj | .
Other Modifications 2
I To determine the sequence of penalty parameters, the following strategy is
often used:

    µk = { µk−1          if 1/µk−1 ≥ γ + δ
         { 1/(γ + 2δ)    otherwise,

where γ is set to max(λk+1 ) and δ is a small tolerance that should be larger
than the expected relative precision of the function evaluations.
SQP Algorithm
Inequality Constraints 1
I The SQP method can be extended to handle inequality constraints.
I Consider the general nonlinear optimization problem
minimize f (x)
subject to ĉj (x) = 0, j = 1, . . . , m̂
ck (x) ≥ 0, k = 1, . . . , m
I To define the subproblem we now linearize both the inequality and equality
constraints and obtain,
    minimize    (1/2) pᵀ Wk p + gkᵀ p
    subject to  ∇ĉj (x)ᵀ p + ĉj (x) = 0 ,  j = 1, . . . , m̂
                ∇ck (x)ᵀ p + ck (x) ≥ 0 ,  k = 1, . . . , m
I One of the most common strategies for solving this problem, the
active-set method, is to consider only the constraints that are active at a given
iteration and treat those as equality constraints.
Inequality Constraints 2
I This is a significantly more difficult problem because we do not know a priori
which inequality constraints are active at the solution. If we did, we could just
solve the equality constrained problem considering only the active constraints.
I The most commonly used active-set methods are feasible-point methods:
these start from a feasible solution and never let the new point be infeasible.
Gradient-Free Optimization
1. Introduction
3. Gradient-Based Optimization
4. Computing Derivatives
5. Constrained Optimization
6. Gradient-Free Optimization
6.1 Introduction
6.2 Nelder–Mead Simplex
6.3 DIvided RECTangles (DIRECT)
6.4 Genetic Algorithms
6.5 Particle Swarm Optimization
Gradient-Free Optimization 1
Using optimization in the solution of practical applications we often encounter one
or more of the following challenges:
I non-differentiable functions and/or constraints
I disconnected and/or non-convex feasible space
I discrete feasible space
I mixed variables (discrete, continuous, permutation)
I large dimensionality
I multiple local minima (multi-modal)
I multiple objectives
Gradient-Free Optimization 2
Gradient-based methods are:
I Efficient in finding local minima for high-dimensional, nonlinearly-constrained,
convex problems
I Sensitive to noisy and discontinuous functions
I Limited to continuous design variables.
Consider, for example, the Griewank function:
    f (x) = Σ_{i=1}^{n} xi²/4000 − Π_{i=1}^{n} cos(xi /√i) + 1 ,    −600 ≤ xi ≤ 600
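The function is easy to code and probe (a quick sketch, Python used for illustration):

```python
import math

def griewank(x):
    # Griewank function: global minimum f = 0 at x = 0, with many local minima
    s = sum(xi * xi for xi in x) / 4000.0
    p = 1.0
    for i, xi in enumerate(x, start=1):
        p *= math.cos(xi / math.sqrt(i))
    return s - p + 1.0
```

The cosine product creates a fine grid of local minima superimposed on a slowly varying quadratic bowl, which is what defeats a single gradient-based run.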
Gradient-Free Optimization 3
How could we find the best solution for this example?
I Multiple restarts of a gradient-based (local) optimizer
I Systematically search the design space
I Use gradient-free optimizers
Some comments on gradient-free methods:
I Many mimic mechanisms observed in nature — biomimicry — or use other
heuristics.
I They are not necessarily guaranteed to find the true global optimal solutions
— unlike gradient-based methods in a convex search space . . .
I . . . but they are able to find many good solutions — the mathematician’s
answer vs. the engineer’s answer.
I Their key strength is the ability to solve some problems that are difficult to
solve using gradient-based methods.
I Many of them are designed as global optimizers and thus are able to find
multiple local optima while searching for the global optimum.
Gradient-Free Optimization 4
A wide variety of gradient-free methods have been developed. We are going to
look at some of the most commonly used algorithms:
I Nelder–Mead Simplex (Nonlinear Simplex)
I Divided Rectangles Method
I Genetic Algorithms
I Particle Swarm Optimization
Nelder–Mead Simplex 1
I The simplex method of Nelder and Mead performs a search in n-dimensional
space using heuristic ideas.
I It is also known as the nonlinear simplex
I Not to be confused with the linear simplex method, with which it has nothing
in common.
I Strengths: it requires no derivatives to be computed and it does not
require the objective function to be smooth.
I The weakness: not very efficient, particularly for problems with more than
about 10 design variables; above this number of variables convergence
becomes increasingly difficult.
Nelder–Mead Simplex 2
The Nelder–Mead algorithm starts with a simplex (n + 1 sets of design variables
x) and then modifies the simplex at each iteration using four simple operations.
The sequence of operations to be performed is chosen based on the relative values
of the objective function at each of the points.
Nelder–Mead Algorithm 1
I The first step of the simplex algorithm is to find the n + 1 points of the
simplex given an initial guess x0 .
I This can be easily done by simply adding a step to each component of x0 to
generate n new points.
I However, generating a simplex with equal length edges is preferable . . .
I Suppose the length of all sides is required to be c and that the initial guess,
x0 is the (n + 1)th point.
I The remaining points of the simplex, i = 1, . . . , n can be computed by
adding a vector to x0 whose components are all b except for the ith
component which is set to a, where
    b = c/(n√2) (√(n + 1) − 1)
    a = b + c/√2 .
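These formulas can be checked numerically (a quick sketch, Python used for illustration): every pair of points in the resulting simplex should be a distance c apart.

```python
import math

def initial_simplex(x0, c):
    # Equilateral simplex with edge length c built from the initial guess x0
    n = len(x0)
    b = c / (n * math.sqrt(2.0)) * (math.sqrt(n + 1.0) - 1.0)
    a = b + c / math.sqrt(2.0)
    # point i adds a to the ith component of x0 and b to all the others
    return [list(x0)] + [[x0[j] + (a if j == i else b) for j in range(n)]
                         for i in range(n)]

pts = initial_simplex([0.0, 0.0, 0.0], 1.0)   # 4 points of a regular tetrahedron
```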
Nelder–Mead Algorithm 2
[Figure: initial simplexes with equal edge lengths: a triangle in the (x1 , x2 ) plane for n = 2 and a tetrahedron in (x1 , x2 , x3 ) for n = 3.]
Nelder–Mead Algorithm 3
The Nelder–Mead algorithm starts by computing xa , the average of the n points
that exclude the worst,

    xa = (1/n) Σ_{i=1, i≠w}^{n+1} xi .
Nelder–Mead Algorithm 4

[Diagram: the basic Nelder–Mead operations on the simplex.]
I Reflection
xr = xa + α (xa − xw )
Nelder–Mead Algorithm 5
I Expansion
xe = xr + γ (xr − xa ) ,
where the expansion parameter γ is usually set to 1.
I Inside contraction
xc = xa − β (xa − xw ) ,
where the contraction factor is usually set to β = 0.5.
I Outside contraction
xo = xa + β (xa − xw ) .
I Shrinking
xi = xb + ρ (xi − xb ) ,
where the scaling parameter is usually set to ρ = 0.5.
Each of these operations generates a new point and the sequence of operations
performed in one iteration depends on the value of the objective at the new point
relative to the other key points.
Nelder–Mead Algorithm 6

[Flowchart: one iteration of the Nelder–Mead algorithm. Initialize the n-simplex and evaluate its n + 1 points; rank the vertices (best, lousy, worst); reflect; then, depending on how the reflected point compares with the best, lousy, and worst points, keep the reflected point, expand, perform an inside contraction, or shrink the simplex.]
Nelder–Mead Algorithm
Input: Initial guess, x0
Output: Optimum, x∗
k←0
Create a simplex with edge length c
repeat
Identify the highest (xw : worst), second highest (xl : lousy), and lowest (xb :
best) value points, with function values fw , fl , and fb , respectively
Evaluate xa , the average of the points in the simplex excluding xw
Perform reflection to obtain xr , evaluate fr
if fr < fb then
Perform expansion to obtain xe , evaluate fe .
if fe < fb then
xw ← xe , fw ← fe (accept expansion)
else
xw ← xr , fw ← fr (accept reflection)
end if
else if fr ≤ fl then
xw ← xr , fw ← fr (accept reflected point)
else
if fr > fw then
Perform an inside contraction and evaluate fc
if fc < fw then
xw ← xc (accept contraction)
else
Shrink the simplex
end if
else
Perform an outside contraction and evaluate fc
if fc ≤ fr then
xw ← xc (accept contraction)
else
Shrink the simplex
end if
end if
end if
k ←k+1
until (fw − fb ) < (ε1 + ε2 |fb |)
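The listing above can be turned into a compact implementation (a sketch, Python used for illustration; the parameter names and operation logic follow the slides):

```python
import math

def nelder_mead(f, x0, c=1.0, alpha=1.0, gamma=1.0, beta=0.5, rho=0.5,
                eps1=1e-10, eps2=1e-10, max_iter=1000):
    n = len(x0)
    # equilateral initial simplex with edge length c (formulas from the slides)
    b = c / (n * math.sqrt(2.0)) * (math.sqrt(n + 1.0) - 1.0)
    a = b + c / math.sqrt(2.0)
    simplex = [list(x0)] + [[x0[j] + (a if j == i else b) for j in range(n)]
                            for i in range(n)]
    fvals = [f(p) for p in simplex]
    for _ in range(max_iter):
        order = sorted(range(n + 1), key=lambda i: fvals[i])
        bidx, lidx, widx = order[0], order[-2], order[-1]  # best, lousy, worst
        fb, fl, fw = fvals[bidx], fvals[lidx], fvals[widx]
        if fw - fb < eps1 + eps2 * abs(fb):
            break
        xb, xw = simplex[bidx], simplex[widx]
        xa = [sum(simplex[i][j] for i in range(n + 1) if i != widx) / n
              for j in range(n)]
        xr = [xa[j] + alpha * (xa[j] - xw[j]) for j in range(n)]  # reflection
        fr = f(xr)
        if fr < fb:
            xe = [xr[j] + gamma * (xr[j] - xa[j]) for j in range(n)]  # expansion
            fe = f(xe)
            if fe < fb:
                simplex[widx], fvals[widx] = xe, fe   # accept expansion
            else:
                simplex[widx], fvals[widx] = xr, fr   # accept reflection
        elif fr <= fl:
            simplex[widx], fvals[widx] = xr, fr       # accept reflected point
        else:
            if fr > fw:
                xc = [xa[j] - beta * (xa[j] - xw[j]) for j in range(n)]  # inside
            else:
                xc = [xa[j] + beta * (xa[j] - xw[j]) for j in range(n)]  # outside
            fc = f(xc)
            if (fr > fw and fc < fw) or (fr <= fw and fc <= fr):
                simplex[widx], fvals[widx] = xc, fc   # accept contraction
            else:
                for i in range(n + 1):                # shrink toward the best point
                    if i != bidx:
                        simplex[i] = [xb[j] + rho * (simplex[i][j] - xb[j])
                                      for j in range(n)]
                        fvals[i] = f(simplex[i])
    i = min(range(n + 1), key=lambda i: fvals[i])
    return simplex[i], fvals[i]

xopt, fopt = nelder_mead(lambda x: (x[0] - 1.0)**2 + (x[1] - 2.0)**2, [0.0, 0.0])
```

On a smooth convex quadratic like this test function, the simplex contracts steadily onto the minimum at (1, 2).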
DIRECT Algorithm
Genetic Algorithms
I Genetic algorithms for optimization are inspired by the process of natural
evolution of organisms.
I First developed by John Holland in the mid-1960s. Holland was motivated
by a desire to better understand the evolution of life by simulating it in a
computer, and by the use of this process in optimization.
I Genetic algorithms are based on three essential components:
I Survival of the fittest — Selection
I Reproduction processes where genetic traits are propagated — Crossover
I Variation — Mutation
I We use the term “genetic algorithms” generically to refer to optimization
approaches that use the three components above.
I Depending on the approach they have different names, for example: genetic
algorithms, evolutionary computation, genetic programming, evolutionary
programming, evolutionary strategies.
minimize f (x)
subject to xl ≤ x ≤ xu
where xl and xu are the vectors of lower and upper bounds on x, respectively.
In the context of genetic algorithms we will call each design variable vector x a
population member. The value of the objective function, f (x) is termed the
fitness.
Genetic algorithms are radically different from the gradient-based methods we
have covered so far. Instead of looking at one point at a time and stepping to a
new point for each iteration, a whole population of solutions is iterated towards
the optimum at the same time. Using a population lets us explore multiple
“buckets” (local minima) simultaneously, increasing the likelihood of finding the
global optimum.
Single-Objective Optimization 1
The general procedure of a genetic algorithm can be described as follows:
1. Initialize a population: Each member of the population represents a design
point, x and has a value of the objective (fitness), and information about its
constraint violations associated with it.
2. Determine mating pool: Each population member is paired for reproduction
by using one of the following methods:
I Random selection
I Based on fitness: make the better members reproduce more often than the
others.
3. Generate offspring: To generate offspring we need a scheme for the crossover
operation. There are various schemes that one can use. When the design
variables are continuous, for example, one offspring can be found by
interpolating between the two parents and the other one can be extrapolated
in the direction of the fitter parent.
4. Mutation: Add some randomness in the offspring’s variables to maintain
diversity.
Single-Objective Optimization 2
5. Compute Offspring’s Fitness
Evaluate the value of the objective function and constraint violations for each
offspring.
6. Tournament
Again, there are different schemes that can be used in this step. One method
involves replacing the worst parent from each “family” with the best offspring.
7. Identify the Best Member
8. Return to step 2 unless converged or computational budget is exceeded.
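The steps above can be sketched as a small real-coded GA (Python used for illustration; the tournament selection, arithmetic crossover, and Gaussian mutation below are illustrative choices among the schemes mentioned):

```python
import random

def ga_minimize(f, lb, ub, pop_size=40, generations=100, p_mut=0.1, seed=0):
    rng = random.Random(seed)
    n = len(lb)
    pop = [[lb[j] + rng.random() * (ub[j] - lb[j]) for j in range(n)]
           for _ in range(pop_size)]                       # 1. initialize population
    fit = [f(x) for x in pop]
    for _ in range(generations):
        best = min(range(pop_size), key=lambda i: fit[i])
        new_pop = [pop[best]]                              # elitism: keep the best member
        while len(new_pop) < pop_size:
            def pick():                                    # 2. binary tournament selection
                i, j = rng.randrange(pop_size), rng.randrange(pop_size)
                return pop[i] if fit[i] < fit[j] else pop[j]
            p1, p2 = pick(), pick()
            w = rng.random()                               # 3. arithmetic crossover
            child = [w * u + (1.0 - w) * v for u, v in zip(p1, p2)]
            for j in range(n):                             # 4. Gaussian mutation
                if rng.random() < p_mut:
                    child[j] += rng.gauss(0.0, 0.1 * (ub[j] - lb[j]))
                    child[j] = min(max(child[j], lb[j]), ub[j])
            new_pop.append(child)
        pop = new_pop
        fit = [f(x) for x in pop]                          # 5. evaluate offspring
    best = min(range(pop_size), key=lambda i: fit[i])      # 7. identify best member
    return pop[best], fit[best]

xbest, fbest = ga_minimize(lambda x: x[0]**2 + x[1]**2, [-5.0, -5.0], [5.0, 5.0])
```

Even this bare-bones version reliably drives the population toward the minimum of a simple test function.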
Multi-Objective Optimization 1
I What if we want to investigate the trade-off between two (or more)
conflicting objectives?
I Examples . . .
I In this situation there is no one “best design” . . .
I . . . but there is a set of designs that are the best possible for that
combination of the two objectives.
I For these optimal solutions, the only way to improve one objective is to
worsen the other.
I Genetic algorithms can handle this problem with little modification: We
already evaluate a whole population, so we can use this to our advantage.
I Alternatively, we could use gradient-based optimization with one of two
strategies:
I Use a composite weighted function,
f = αf1 + (1 − α)f2
Multi-Objective Optimization 2
I Solve the problem
minimize f1
subject to f2 = fc
Multi-Objective Optimization 3
I Comparing members A and B, we can see that A has a higher (worse) f1 than
B, but has a lower (better) f2 . Hence we cannot determine whether A is
better than B or vice versa.
I On the other hand, B is clearly a fitter member than C since both of B’s
objectives are lower. We say that B dominates C.
I Comparing A and C, once again we are unable to say that one is better than
the other.
I In summary:
I A is non-dominated by either B or C
I B is non-dominated by either A or C
I C is dominated by B but not by A
I The rank of a member is the number of members that dominate it plus one.
In this case the ranks of the three members are:
rank(A) = 1
rank(B) = 1
rank(C) = 2
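This rank computation is a one-liner to check (Python used for illustration; the objective values for A, B, and C are hypothetical numbers consistent with the discussion above):

```python
def ranks(points):
    # rank = number of dominating members + 1 (both objectives minimized)
    def dominates(p, q):
        return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))
    return [1 + sum(dominates(q, p) for q in points if q is not p) for p in points]

# Hypothetical objective values: A has worse f1 but better f2 than B,
# and B dominates C while A and C are mutually non-dominated
A, B, C = (3.0, 1.0), (1.0, 2.0), (2.0, 3.0)
r = ranks([A, B, C])
```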
Multi-Objective Optimization 4
I In multi-objective optimization the rank is crucial in determining which
population members are the fittest.
I A solution of rank one is said to be Pareto optimal and the set of rank one
points for a given generation is called the Pareto set.
I As the number of generations increases, and the fitness of the population
improves, the size of the Pareto set grows.
I In the case above, the Pareto set includes A and B. The graphical
representation of a Pareto set is called a Pareto front.
I The procedure of a two-objective genetic algorithm is similar to the
single-objective one, with the following modifications:
I Instead of making decisions based on the objective function, we make
decisions based on rank (the lower the better)
I Instead of keeping track of the best member of population, we keep track of
all members with rank one, which should converge to the Pareto set
I One of the problems with this method is that there is no mechanism
“pushing” the Pareto front to a better one.
I The initial population is generated by setting each design variable to
x = xl + r(xu − xl),
where r is a random number in the interval [0, 1].
I Fitness scaling: so that all fitness values are positive and well scaled, we add the constant
C = 0.1 fh − 1.1 fl
to each function value. Thus the new highest value will be 1.1(fh − fl) and
the new lowest value 0.1(fh − fl). The values are then normalized as follows,
fi′ = (fi + C)/D,
where
D = max(1, fh + C).
I Selection: member j is selected for reproduction when a random number rS on the roulette wheel satisfies
f1′ + . . . + f′(j−1) ≤ rS ≤ f1′ + . . . + fj′.
This ensures that the probability of a member being selected for reproduction
is proportional to its scaled fitness value.
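The scaling and roulette-wheel selection above can be sketched as follows (a minimal illustration; the demo fitness values are hypothetical, and fitness is assumed to be maximized, as in the scaling scheme):

```python
import random

def scaled_fitness(f):
    # Shift by C so the lowest value becomes 0.1*(fh - fl),
    # then normalize by D so the highest value maps near 1
    fh, fl = max(f), min(f)
    C = 0.1 * fh - 1.1 * fl
    D = max(1.0, fh + C)
    return [(fi + C) / D for fi in f]

def select(f, rng=random):
    # Roulette wheel: pick j such that the cumulative scaled
    # fitness brackets a random point rS in [0, sum of fitness]
    fp = scaled_fitness(f)
    rS = rng.random() * sum(fp)
    cum = 0.0
    for j, fj in enumerate(fp):
        cum += fj
        if rS <= cum:
            return j
    return len(fp) - 1

f = [4.0, 7.0, 10.0]  # hypothetical fitness values
print(scaled_fitness(f))
```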
Mutation
I Mutation is a random operation performed to change the genetic information.
I Mutation is needed because even though reproduction and crossover
effectively recombine existing information, occasionally some useful genetic
information might be lost.
I The mutation operation protects against such irrecoverable loss.
I It also introduces additional diversity into the population.
I When using bit representation, every bit is assigned a small mutation
probability, say p = 0.005 ∼ 0.1. This is done by generating a random
number 0 ≤ r ≤ 1 for each bit, which is flipped if r < p.
Before Mutation After Mutation
11111 11010
I The mutation of the real representation can be done in a variety of ways. A
simple way involves assigning a small probability that each design variable
changes by a random amount (within certain bounds). A more
sophisticated alternative consists of using a probability density function.
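Bitwise mutation as described can be sketched as follows (a minimal illustration; the probability value is a hypothetical choice within the range quoted above):

```python
import random

def mutate(bits, p=0.01, rng=random):
    # Flip each bit independently with (small) probability p
    return [1 - b if rng.random() < p else b for b in bits]

print(mutate([1, 1, 1, 1, 1], p=0.05))
```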
ST5 Antenna 1
I The antenna for the ST5 satellite system presented a challenging design
problem, requiring both a wide beam width for a circularly-polarized wave
and a wide bandwidth.
I Two teams were assigned the same design problem: one used a traditional
method, and the other used GAs.
I The GA team found an antenna configuration (ST5-3-10) that was slightly
more difficult to manufacture, but it:
I Used less power
I Removed two steps in design and fabrication
ST5 Antenna 2
I Had more uniform coverage and a wider range of operational elevation
angles relative to the ground
I Took 3 person-months to design and fabricate the first prototype as compared
to 5 person-months for the conventionally designed antenna.
[Figure: PSO particle update — the new velocity of particle i at iteration k combines an inertia term (w v_k^i applied to the current velocity v_k^i at position x_k^i) with a cognitive learning term that pulls the particle toward its own best position p_k^i]
PSO Algorithm
1. Initialize a set of particle positions x_0^i and velocities v_0^i, randomly distributed
throughout the design space bounded by specified limits
2. Evaluate the objective function values f(x_k^i) using the design space positions
x_k^i
3. Update the best position of each particle, p_k^i, at the current iteration k, and the
best position in the complete history of the swarm, p_k^g
4. Update the position of each particle using its previous position and updated
velocity vector
5. Repeat steps 2–4 until the stopping criterion is met
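The five steps above can be sketched as a compact PSO implementation (a minimal sketch; the inertia and learning parameters, particle count, and the quadratic test function are hypothetical choices, not values prescribed here):

```python
import random

def pso(f, lb, ub, n_particles=20, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    dim = len(lb)
    # Step 1: random initial positions, zero initial velocities
    x = [[lb[d] + rng.random() * (ub[d] - lb[d]) for d in range(dim)]
         for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    p = [xi[:] for xi in x]      # best position of each particle
    pf = [f(xi) for xi in x]     # corresponding objective values
    g = min(range(n_particles), key=lambda i: pf[i])  # swarm-best index
    for _ in range(iters):
        for i in range(n_particles):
            # Steps 2-4: inertia + cognitive + social velocity update, then move
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (p[i][d] - x[i][d])
                           + c2 * r2 * (p[g][d] - x[i][d]))
                x[i][d] += v[i][d]
            fx = f(x[i])
            if fx < pf[i]:        # update particle best
                p[i], pf[i] = x[i][:], fx
                if fx < pf[g]:    # update swarm best
                    g = i
    return p[g], pf[g]

# Hypothetical smooth test problem: minimum 0 at (1, 2)
xbest, fbest = pso(lambda x: (x[0] - 1.0)**2 + (x[1] - 2.0)**2,
                   lb=[-5.0, -5.0], ub=[5.0, 5.0])
```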
PSO Characteristics
Compared to other global optimization approaches:
I Simple algorithm, extremely easy to implement.
I Still a population-based algorithm; however, it works well with few particles
(10 to 40 are usual) and there is no such thing as “generations”
I Unlike evolutionary approaches, design variables are directly updated, there
are no chromosomes, survival of the fittest, selection or crossover operations.
I Global and local search behavior can be directly “adjusted” as desired using
the cognitive c1 and social c2 parameters.
I Convergence “balance” is achieved through the inertial weight factor w
Analysis of PSO 1
I If we substitute the velocity update equation into the position update, the
following expression is obtained:
x_{k+1}^i = x_k^i + [ w v_k^i + c1 r1 (p_k^i − x_k^i)/∆t + c2 r2 (p_k^g − x_k^i)/∆t ] ∆t
Analysis of PSO 2
I Rearranging the position and velocity terms in the above equation, the update
can be written as a discrete dynamic system, which is at an equilibrium
point only when v_k^i = 0 and x_k^i = p_k^i = p_k^g.
J.R.R.A. Martins Multidisciplinary Design Optimization August 2012 342 / 427
Gradient-Free Optimization Particle Swarm Optimization
Analysis of PSO 3
I The eigenvalues of the dynamic system are:
λ2 − (w − c1 r1 − c2 r2 + 1) λ + w = 0
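A quick numerical check of this characteristic equation (a sketch; the parameter values are hypothetical, and r1, r2 are fixed at their expected value of 0.5):

```python
import numpy as np

def pso_eigenvalues(w, c1, c2, r1=0.5, r2=0.5):
    # Roots of lambda^2 - (w - c1*r1 - c2*r2 + 1)*lambda + w = 0
    b = -(w - c1 * r1 - c2 * r2 + 1.0)
    return np.roots([1.0, b, w])

lam = pso_eigenvalues(w=0.7, c1=1.5, c2=1.5)
print(max(abs(lam)))  # < 1 implies the particle dynamics are stable
```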
[Figure: PSO convergence history — structure weight (lbs) versus iteration number, over 200 iterations]
Constraint Handling 1
The basic PSO algorithm is an unconstrained optimizer; to include constraints we
can use:
I Penalty methods
where
θj(x_k^i) = max[ gj(x_k^i), −λj/(2 rp,j) ]
Constraint Handling 2
I Multipliers and penalty factors that lead to the optimum are unknown and
problem dependent.
I A sequence of unconstrained minimizations of the augmented Lagrangian
function are required to obtain a solution.
I Multiplier update:
λj|_{v+1} = λj|_v + 2 rp,j|_v θj(x_k^i)
I Penalty factor update (penalizes infeasible movements):
rp,j|_{v+1} = 2 rp,j|_v        if gj(x_v^i) > gj(x_{v−1}^i) ∧ gj(x_v^i) > εg
            = (1/2) rp,j|_v    if gj(x_v^i) ≤ εg
            = rp,j|_v          otherwise
I Example: the Griewank function,
f(x) = Σ_{i=1}^n x_i²/4000 − Π_{i=1}^n cos(x_i/√i) + 1
−600 ≤ xi ≤ 600
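The function above is easy to evaluate directly; a minimal sketch:

```python
import math

def griewank(x):
    # Quadratic term grows slowly; the cosine product
    # creates a large number of local minima
    s = sum(xi * xi for xi in x) / 4000.0
    p = math.prod(math.cos(xi / math.sqrt(i + 1)) for i, xi in enumerate(x))
    return s - p + 1.0

print(griewank([0.0, 0.0, 0.0]))  # global minimum: 0.0 at the origin
```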
3. Gradient-Based Optimization
4. Computing Derivatives
5. Constrained Optimization
6. Gradient-Free Optimization
Introduction 1
I In the last few decades, numerical models that predict the performance of
engineering systems have been developed, and many of these models are now
mature areas of research. For example . . .
I Once engineers can predict the effect that changes in the design have on the
performance of a system, the next logical question is what changes in the
design produced optimal performance. The application of the numerical
optimization techniques described in the preceding chapters address this
question.
I Single-discipline optimization is in some cases quite mature, but the design
and optimization of systems that involve more than one discipline is still in its
infancy.
I When systems are composed of multiple subsystems, additional issues arise in
both the analysis and the design optimization.
I MDO researchers think that industry does not adopt MDO more widely because
practitioners do not realize its utility.
Introduction 2
I Industry thinks that researchers are not presenting anything new, since
industry has already been doing multidisciplinary design.
I There is some truth to each of these perspectives . . .
I Real-world aerospace design problems may involve thousands of variables and
hundreds of analyses and engineers, and it is often difficult to apply the
numerical optimization techniques and solve the mathematically correct
optimization problems.
I The kinds of problems in industry are often of much larger scale, involve
much uncertainty, and include human decisions in the loop, making them
difficult to solve with traditional numerical optimization techniques.
I On the other hand, a better understanding of MDO by engineers in industry
is now contributing to a more widespread use in practical design.
Why MDO?
The alternatives have serious shortcomings:
I Parametric trade studies are subject to the “curse of dimensionality”.
I Iterative procedures offer no guarantee of convergence.
I Sequential optimization does not lead to the true optimum of the system.
Introduction 3
Objectives of MDO:
I Avoid difficulties associated with sequential design or partial optimization.
I Provide more efficient and robust convergence than by simple iteration.
I Aid in the management of the design process.
Difficulties of MDO:
I Communication and translation
I Time
I Scheduling and planning
I Implementation
Personnel hierarchy
Design process
MDO Architectures
I MDO focuses on the development of strategies that use numerical analyses
and optimization techniques to enable the automation of the design process
of a multidisciplinary system.
I The big challenge: make such a strategy scalable and practical.
I An MDO architecture is a particular strategy for organizing the analysis
software, optimization software, and optimization subproblem statements to
achieve an optimal design.
I Other terms are used: “method”, “methodology”, “problem formulation”,
“strategy”, “procedure” and “algorithm”.
Symbol Definition
x Vector of design variables
yt Vector of coupling variable targets (inputs to a discipline analysis)
y Vector of coupling variable responses (outputs from a discipline analysis)
ȳ Vector of state variables (variables used inside only one discipline analysis)
f Objective function
c Vector of design constraints
cc Vector of consistency constraints
R Governing equations of a discipline analysis in residual form
N Number of disciplines
n() Length of given variable vector
m() Length of given constraint vector
()0 Functions or variables that are shared by more than one discipline
()i Functions or variables that apply only to discipline i
()∗ Functions or variables at their optimal value
(˜) Approximation of a given function or vector of functions
(ˆ) Duplicates of certain variable sets distributed to other disciplines
c_i^c = y_i^t − y_i
[Figure: geometry of the example wing, plotted against spanwise coordinate y (ft) and chordwise coordinate x (ft)]
R1 = 0 ⇒ AΓ − v(u, α) = 0
R2 = 0 ⇒ Ku − F (Γ) = 0
R3 = 0 ⇒ L(Γ) − W = 0
I The angle of attack is considered a state variable here, and helps satisfy
L = W.
I The design variables are the wing sweep (Λ), structural thicknesses (t),
and twist distribution (γ).
x0 = Λ,    x = [t, γ]^T
I Sweep is a shared variable because changing the sweep has a direct effect on
both the aerodynamic influence matrix and the stiffness matrix.
Multidisciplinary Analysis 1
I To find the coupled state of a multidisciplinary system we need to perform a
multidisciplinary analysis — MDA.
I This is often done by repeating each disciplinary analysis until y_i^t = y_i^r for
all disciplines i, i.e., until the coupling targets match the responses.
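This fixed-point iteration can be sketched as follows (a Gauss–Seidel-style sweep over two hypothetical disciplines with made-up linear coupling; the functions Y1 and Y2 stand in for the actual discipline analyses):

```python
def mda_fixed_point(tol=1e-10, max_iter=100):
    # Hypothetical two-discipline system:
    #   Discipline 1: y1 = Y1(y2) = 1.0 - 0.3*y2
    #   Discipline 2: y2 = Y2(y1) = 0.5*y1
    y1, y2 = 0.0, 0.0
    for _ in range(max_iter):
        y1 = 1.0 - 0.3 * y2          # evaluate discipline 1
        y2_new = 0.5 * y1            # evaluate discipline 2
        if abs(y2_new - y2) < tol:   # coupling variables are consistent
            return y1, y2_new
        y2 = y2_new
    raise RuntimeError("MDA did not converge")

y1, y2 = mda_fixed_point()
```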
Multidisciplinary Analysis 2
I The design structure matrix (DSM) was originally developed to visualize the
interconnections between the various components of a system.
[Figure: design structure matrix (DSM) for an aircraft design problem with fifteen components — Optimization, Aerodynamics, Atmosphere, Economics, Emissions, Loads, Noise, Performance, Sizing, Weight, Structures, Mission, Reliability, Propulsion, and System — shown in the original ordering (left) and reordered (right)]
[XDSM diagram: multidisciplinary analysis — an MDA driver iterates over Analysis 1, Analysis 2, and Analysis 3, passing the coupling variables y1, y2, and y3 until the targets y^t are matched]
Gradient-Based Optimization
[XDSM diagram: gradient-based optimization — starting from x(0), the optimizer sends x to the objective, constraint, and gradient blocks, which return f, c, df/dx, and dc/dx; the loop repeats until convergence to x*]
1. First we optimize the aerodynamics by minimizing drag, i.e.,
minimize D (α, γi)
w.r.t. α, γi
s.t. L (α, γi) = W
2. Once the aerodynamic optimization has converged, the twist distribution and
the forces are fixed
3. Then we optimize the structure by minimizing weight subject to stress
constraints at the maneuver condition, i.e.,
minimize W (ti )
w.r.t. ti
s.t. σj (ti ) ≤ σyield
[XDSM diagrams: (top) sequential optimization — an iterator alternates between an aerodynamic optimization (with Aerodynamics returning L/D and the forces F) and a structural optimization (thicknesses t, with Structures returning W, u, and y) until Λ*, t* converge; (bottom) MDF — a single optimizer drives an MDA of the disciplines and receives R and y from the functions block]
[Figure: optimization results versus spanwise distance (m) — twist distribution (jig and deflected, degrees), structural thickness (m), and lift distribution (N) compared with the elliptical distribution]
Monolithic Architectures
I Monolithic architectures solve the MDO problem by casting it as a single
optimization problem.
I Distributed architectures, on the other hand, decompose the overall problem
into smaller ones.
I Monolithic architectures include:
I Multidisciplinary Feasible — MDF
I Individual Discipline Feasible — IDF
I Simultaneous Analysis and Design — SAND
I All-At-Once — AAO
[XDSM diagram: MDF architecture — the optimizer sends the design variables to an MDA loop over Analyses 1–3; the converged coupling variables y1, y2, y3 feed the functions block, which returns f and c to the optimizer]
minimize −R
w.r.t. Λ, γ, t
s.t. σyield − σi (u) ≥ 0
AΓ − v(u, α) = 0
K(t, Λ)u − F (Γ) = 0
L(Γ) − W (t) = 0
I Advantages:
I Optimizer typically converges the multidisciplinary feasibility better than
fixed-point MDA iterations
I Disadvantages:
I Problem is potentially much larger than MDF, depending on the number of
coupling variables
I Gradient computation can be costly
[XDSM diagram: IDF architecture — starting from x(0) and y^{t,(0)}, the optimizer passes x and the coupling targets y^t to each discipline Analysis i in parallel; the functions block returns f, c, and the consistency constraints c^c]
minimize −R
w.r.t. Λ, γ, t, Γt , αt , ut
s.t. σyield − σi ≥ 0
Γt − Γ = 0
αt − α = 0
ut − u = 0
minimize f0 (x, y)
with respect to x, y, ȳ
subject to c0 (x, y) ≥ 0
ci (x0 , xi , yi ) ≥ 0 for i = 1, . . . , N
Ri (x0 , xi , y, ȳi ) = 0 for i = 1, . . . , N.
I Advantages:
I If implemented well, can be the most efficient architecture
I Disadvantages:
I Intermediate results do not even satisfy the governing equations
I Difficult or impossible to implement for “black-box” components
[XDSM diagram: SAND architecture — the optimizer passes x, y, and ȳi directly to the functions and residual blocks, which return f, c, and Ri; the loop repeats until convergence to x*, y*]
minimize −R
w.r.t. Λ, γ, t, Γ, α, u
s.t. σyield − σi (u) ≥ 0
AΓ = v(u, α)
K(t)u = F (Γ)
L(Γ) − W (t) = 0
[XDSM diagram: AAO architecture — the optimizer passes x, y, y^t, and ȳi to the functions and residual blocks, which return f, c, c^c, and Ri]
[Diagram: relationships among the monolithic architectures — starting from AAO, removing c^c and y^t yields SAND; removing R, y, and ȳ yields IDF; applying both reductions yields MDF]
Distributed Architectures
I Monolithic MDO architectures solve a single optimization problem
I Distributed MDO architectures decompose the original problem into multiple
optimization problems
I Some problems have a special structure and can be efficiently decomposed,
but that is usually not the case
I In reality, the primary motivation for decomposing the MDO problem comes
from the structure of the engineering design environment
I Typical industrial practice involves breaking up the design of a large system
and distributing aspects of that design to specific engineering groups.
I These groups may be geographically distributed and may only communicate
infrequently.
I In addition, these groups typically like to retain control of their own design
procedures and make use of in-house expertise
[Diagram: architecture classification — the monolithic architectures (AAO, SAND, IDF, MDF), with the distributed architectures derived from IDF]
[XDSM diagram: CSSO architecture — a convergence check drives a loop containing the system optimization, parallel discipline optimizations with local MDAs, exact MDAs at the DOE points, and disciplinary surrogate models (metamodels) that are updated as the iteration proceeds]
CSSO Algorithm
Input: Initial design variables x
Output: Optimal variables x∗ , objective function f ∗ , and constraint values c∗
0: Initiate main CSSO iteration
repeat
1: Initiate a design of experiments (DOE) to generate design points
for Each DOE point do
2: Initiate an MDA that uses exact disciplinary information
repeat
3: Evaluate discipline analyses
4: Update coupling variables y
until 4 → 3: MDA has converged
5: Update the disciplinary surrogate models with the latest design
end for 6 → 2
7: Initiate independent disciplinary optimizations (in parallel)
for Each discipline i do
repeat
8: Initiate an MDA with exact coupling variables for discipline i and
approximate coupling variables for the other disciplines
repeat
9: Evaluate discipline i outputs yi , and surrogate models for the
other disciplines, ỹj6=i
until 10 → 9: MDA has converged
11: Compute objective f0 and constraint functions c using current
data
until 12 → 8: Disciplinary optimization i has converged
end for
13: Initiate a DOE that uses the subproblem solutions as sample points
for Each subproblem solution i do
14: Initiate an MDA that uses exact disciplinary information
repeat
15: Evaluate discipline analyses.
until 16 → 15 MDA has converged
17: Update the disciplinary surrogate models with the newest design
end for 18 → 14
19: Initiate system-level optimization
repeat
20: Initiate an MDA that uses only surrogate model information
repeat
21: Evaluate disciplinary surrogate models
until 22 → 21: MDA has converged
23: Compute objective f0 , and constraint function values c
until 24 → 20: System level problem has converged
until 25 → 1: CSSO has converged
where x̂0i are duplicates of the global design variables passed to (and manipulated
by) discipline i and x̂i are duplicates of the local design variables passed to the
system subproblem.
The discipline i subproblem in both CO1 and CO2 is
minimize Ji(x̂0i, xi, yi(x̂0i, xi, y^t_{j≠i}))
with respect to x̂0i, xi
subject to ci(x̂0i, xi, yi(x̂0i, xi, y^t_{j≠i})) ≥ 0.
[XDSM diagram: CO architecture — the system optimizer sends x0, x̂1···N, and y^t to the system functions block and to each discipline optimization i; each discipline optimizer drives its Analysis i and discipline functions, returning the discrepancy measure Ji* to the system level]
CO Algorithm 1
minimize −R
w.r.t. Λt , Γt , αt , ut , W t
s.t. J1∗ ≤ 10−6
J2∗ ≤ 10−6
Aerodynamics subproblem:
minimize J1 = (1 − Λ/Λ^t)² + Σi (1 − Γi/Γi^t)² + (1 − α/α^t)² + (1 − W/W^t)²
w.r.t. Λ, γ, α
s.t. L − W = 0
Note the extra set of constraints in both system and discipline subproblems
denoting the design variables bounds.
[XDSM diagram: BLISS architecture — a convergence check surrounds the system optimization (shared variables x0) and the parallel discipline optimizations (local variables xi); the system and discipline function blocks supply f0, c0, fi, ci, and the derivative blocks supply df/dx0, dc/dx0 (via post-optimality analysis) and df0,i/dxi, dc0,i/dxi]
BLISS Algorithm
Input: Initial design variables x
Output: Optimal variables x∗ , objective function f ∗ , and constraint values c∗
0: Initiate system optimization
repeat
1: Initiate MDA
repeat
2: Evaluate discipline analyses
3: Update coupling variables
until 3 → 2: MDA has converged
4: Initiate parallel discipline optimizations
for Each discipline i do
5: Evaluate discipline analysis
6: Compute objective and constraint function values and derivatives with
respect to local design variables
7: Compute the optimal solutions for the disciplinary subproblem
end for
8: Initiate system optimization
9: Compute objective and constraint function values and derivatives with
respect to shared design variables using post-optimality analysis
10: Compute optimal solution to system subproblem
until 11 → 1: System optimization has converged
[XDSM diagram: ATC architecture — a penalty weight (w) update loop surrounds the system optimization and the parallel discipline optimizations; the system-and-penalty and discipline-and-penalty function blocks return f0, fi, ci, and the penalty functions Φ0, . . . , ΦN]
ATC Algorithm
[XDSM diagram: ASO architecture — the system optimizer drives an MDA over Analyses 1 and 2 together with a nested Optimization 3 (and its Analysis 3); the function blocks return f and c for disciplines 0, 1, 2, and 3]
ASO Algorithm
I Consider a system of two disciplines, with residuals R = [R1, R2]^T and
coupling variables y = [y1, y2]^T.
I The full variable vector concatenates the design variables, residuals, and
coupling variables:
v = [x, r1, r2, y1, y2],
where x contains nx entries, r1 and y1 contain ny1 entries each, and r2 and
y2 contain ny2 entries each.
Ri = Yi − yi ,
where the yi vector contains the intermediate variables of the ith discipline,
and Yi is the vector of functions that explicitly define these intermediate
variables.
[Figure: propagation of a perturbation ∆x to the output ∆f through (a) the residual approach (via ∆r1, ∆r2), (b) the functional approach (via ∆y1, ∆y2), and (c) the hybrid approach (via ∆r1, ∆y2)]
For the two-discipline system, the residual approach gives the following coupled direct and adjoint equations (written here in compact block form):

(c) Coupled direct — residual approach:
[ ∂R1/∂y1  ∂R1/∂y2 ] [ dy1/dx ]     [ ∂R1/∂x ]
[ ∂R2/∂y1  ∂R2/∂y2 ] [ dy2/dx ] = − [ ∂R2/∂x ]
with df/dx = ∂F/∂x + (∂F/∂y1)(dy1/dx) + (∂F/∂y2)(dy2/dx)

(d) Coupled adjoint — residual approach:
[ (∂R1/∂y1)^T  (∂R2/∂y1)^T ] [ (df/dr1)^T ]     [ (∂F/∂y1)^T ]
[ (∂R1/∂y2)^T  (∂R2/∂y2)^T ] [ (df/dr2)^T ] = − [ (∂F/∂y2)^T ]
with df/dx = ∂F/∂x + (df/dr1)(∂R1/∂x) + (df/dr2)(∂R2/∂x)
Analytic Methods for Computing Coupled Derivatives 6
(e) Coupled direct — functional approach:
[ I           −∂Y1/∂y2 ] [ dy1/dx ]   [ ∂Y1/∂x ]
[ −∂Y2/∂y1    I        ] [ dy2/dx ] = [ ∂Y2/∂x ]
with df/dx = ∂F/∂x + (∂F/∂y1)(dy1/dx) + (∂F/∂y2)(dy2/dx)

(f) Coupled adjoint — functional approach:
[ I              −(∂Y2/∂y1)^T ] [ (df/dy1)^T ]   [ (∂F/∂y1)^T ]
[ −(∂Y1/∂y2)^T   I            ] [ (df/dy2)^T ] = [ (∂F/∂y2)^T ]
with df/dx = ∂F/∂x + (df/dy1)(∂Y1/∂x) + (df/dy2)(∂Y2/∂x)
Coupled direct — hybrid approach:
[ ∂R1/∂y1   ∂R1/∂y2 ] [ dy1/dx ]   [ −∂R1/∂x ]
[ −∂Y2/∂y1  I        ] [ dy2/dx ] = [  ∂Y2/∂x ]

Coupled adjoint — hybrid approach:
[ (∂R1/∂y1)^T  −(∂Y2/∂y1)^T ] [ (df/dr1)^T ]   [ −(∂F/∂y1)^T ]
[ (∂R1/∂y2)^T   I            ] [ (df/dy2)^T ] = [  (∂F/∂y2)^T ]
Numerical Example 1
In most cases, the explicit computation of state variables involves solving the
nonlinear system corresponding to the discipline; however, in this example, this is
simplified because the residuals are linear in the state variables and each discipline
has only one state variable. Thus, the explicit forms are
Y1 (x1 , x2 , y2 ) = −2y2/x1 + sin(x1)/x1
Y2 (x1 , x2 , y1 ) = y1/x2²
Numerical Example 2
Coupled — Residual (Direct)
[ ∂R1/∂y1  ∂R1/∂y2 ] [ dy1/dx1  dy1/dx2 ]     [ ∂R1/∂x1  ∂R1/∂x2 ]
[ ∂R2/∂y1  ∂R2/∂y2 ] [ dy2/dx1  dy2/dx2 ] = − [ ∂R2/∂x1  ∂R2/∂x2 ]

[ −x1   −2  ] [ dy1/dx1  dy1/dx2 ]   [ y1 − cos(x1)   0      ]
[  1   −x2² ] [ dy2/dx1  dy2/dx2 ] = [ 0              2x2 y2 ]
Numerical Example 3
Coupled — Residual (Adjoint)
[ ∂R1/∂y1  ∂R2/∂y1 ] [ df1/dr1  df2/dr1 ]     [ ∂F1/∂y1  ∂F2/∂y1 ]
[ ∂R1/∂y2  ∂R2/∂y2 ] [ df1/dr2  df2/dr2 ] = − [ ∂F1/∂y2  ∂F2/∂y2 ]

[ −x1   1   ] [ df1/dr1  df2/dr1 ]   [ 1   0       ]
[ −2   −x2² ] [ df1/dr2  df2/dr2 ] = [ 0   sin(x1) ]
Numerical Example 4
Coupled — Functional (Direct)
[ 1          −∂Y1/∂y2 ] [ dy1/dx1  dy1/dx2 ]   [ ∂Y1/∂x1  ∂Y1/∂x2 ]
[ −∂Y2/∂y1    1       ] [ dy2/dx1  dy2/dx2 ] = [ ∂Y2/∂x1  ∂Y2/∂x2 ]

[ 1        2/x1 ] [ dy1/dx1  dy1/dx2 ]   [ 2y2/x1² + cos(x1)/x1 − sin(x1)/x1²   0        ]
[ −1/x2²   1    ] [ dy2/dx1  dy2/dx2 ] = [ 0                                    −2y1/x2³ ]
Numerical Example 5
Coupled — Functional (Adjoint)
[ 1          −∂Y2/∂y1 ] [ df1/dy1  df2/dy1 ]   [ ∂F1/∂y1  ∂F2/∂y1 ]
[ −∂Y1/∂y2    1       ] [ df1/dy2  df2/dy2 ] = [ ∂F1/∂y2  ∂F2/∂y2 ]

[ 1       −1/x2² ] [ df1/dy1  df2/dy1 ]   [ 1   0       ]
[ 2/x1     1     ] [ df1/dy2  df2/dy2 ] = [ 0   sin(x1) ]
Numerical Example 6
Coupled — Hybrid (Direct)
[ ∂R1/∂y1   ∂R1/∂y2 ] [ dy1/dx1  dy1/dx2 ]   [ −∂R1/∂x1  −∂R1/∂x2 ]
[ −∂Y2/∂y1   1       ] [ dy2/dx1  dy2/dx2 ] = [  ∂Y2/∂x1   ∂Y2/∂x2 ]

[ −x1      −2 ] [ dy1/dx1  dy1/dx2 ]   [ y1 − cos(x1)   0        ]
[ −1/x2²    1 ] [ dy2/dx1  dy2/dx2 ] = [ 0              −2y1/x2³ ]
Numerical Example 7
Coupled — Hybrid (Adjoint)
[ ∂R1/∂y1   −∂Y2/∂y1 ] [ df1/dr1  df2/dr1 ]   [ −∂F1/∂y1  −∂F2/∂y1 ]
[ ∂R1/∂y2    1       ] [ df1/dy2  df2/dy2 ] = [  ∂F1/∂y2   ∂F2/∂y2 ]

[ −x1   −1/x2² ] [ df1/dr1  df2/dr1 ]   [ 1   0       ]
[ −2     1     ] [ df1/dy2  df2/dy2 ] = [ 0   sin(x1) ]