Lagrange Multipliers Without Permanent Scarring

Dan Klein
1 Introduction
This tutorial assumes that you want to know what Lagrange multipliers are, but are more interested in getting the intuitions and central ideas. It contains nothing which would qualify as a formal proof, but the key ideas needed to read or reconstruct the relevant formal results are provided. If you don't understand Lagrange multipliers, that's fine. If you don't understand vector calculus at all, in particular gradients of functions and surface normal vectors, the majority of the tutorial is likely to be somewhat unpleasant. Understanding of vector spaces, spanned subspaces, and linear combinations is a bonus (a few sections will be somewhat mysterious if these concepts are unclear).

Lagrange multipliers are a mathematical tool for constrained optimization of differentiable functions. In the basic, unconstrained version, we have some (differentiable) function f(x) that we want to maximize (or minimize). We can do this by first finding extreme points of f, which are points where the gradient ∇f is zero, or, equivalently, where each of the partial derivatives is zero. If we're lucky, points like this that we find will turn out to be (local) maxima, but they can also be minima or saddle points. We can tell the different cases apart by a variety of means, including checking properties of the second derivatives or simply inspecting the function values. Hopefully this is all familiar from calculus, though maybe it's more concretely clear when dealing with functions of just one variable.

All kinds of practical problems can crop up in unconstrained optimization, which we won't worry about here. One is that f and its derivative can be expensive to compute, causing people to worry about how many evaluations are needed to find a maximum. A second problem is that there can be (infinitely) many local maxima which are not global maxima, causing people to despair. We're going to ignore these issues, which are as big or bigger problems for the constrained case.
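The zero-gradient recipe above can be made concrete with a tiny numeric sketch (mine, not from the tutorial): gradient ascent on an assumed concave example f(x, y) = −(x − 1)² − 2(y + 2)² climbs until both partials vanish, landing at the maximum (1, −2).

```python
# Sketch (not from the tutorial): finding an unconstrained maximum by
# gradient ascent on the example concave function
# f(x, y) = -(x - 1)**2 - 2*(y + 2)**2, whose gradient is zero exactly
# at the maximum (1, -2).

def grad_f(x, y):
    # Partial derivatives of f(x, y) = -(x - 1)**2 - 2*(y + 2)**2
    return (-2 * (x - 1), -4 * (y + 2))

def gradient_ascent(step=0.1, iters=1000):
    x, y = 0.0, 0.0  # arbitrary starting point
    for _ in range(iters):
        dx, dy = grad_f(x, y)
        x, y = x + step * dx, y + step * dy
    return x, y

x, y = gradient_ascent()
# The iterate converges to the point where both partials vanish.
print(round(x, 6), round(y, 6))  # close to (1, -2)
```

For a concave function like this one, the point where the gradient vanishes is the unique global maximum, which is why the simple ascent suffices.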
In constrained optimization, we have the same function f(x) to maximize as before. However, we also have some restrictions on which points in ℝⁿ we are interested in. The points which satisfy our constraints are referred to as the feasible region. A simple constraint on the feasible region is to add boundaries, such as insisting that each xᵢ be positive. Boundaries complicate matters because extreme points on the boundaries will not, in general, meet the zero-derivative criterion, and so must be searched for in other ways. You probably had to deal with boundaries in calculus class. Boundaries correspond to inequality constraints, which we will say relatively little about in this tutorial. Lagrange multipliers can help deal with both equality constraints and inequality constraints. For the majority of the tutorial, we will be concerned only with equality constraints, which restrict the feasible region to points lying on some surface inside ℝⁿ. Each constraint will be given by a function g(x), and we will only be interested in points where g(x) = c.
[Figure 1: plot lost in extraction.]
2 Trial by Example
Let's do some example maximizations. First, we'll have an example of not using Lagrange multipliers.
2.1 A No-Brainer
Let's say you want to know the maximum value of f(x) = −x² subject to the constraint x = 1 (see figure 1). Here we can just substitute our value for x (1) into f, and get our maximum value of −1. It isn't the most challenging example, but we'll come back to it once the Lagrange multipliers show up. However, it highlights a basic way that we might go about dealing with constraints: substitution.
2.2 Substitution
Let f(x, y) = −x² − 2y². This is the downward cupping paraboloid shown in figure 5. The unconstrained maximum is clearly at (0, 0), while the unconstrained minimum is not even defined (you can find points with f as low as you like). Now let's say we constrain x and y to lie on the unit circle. To do this, we add the constraint g(x, y) = x² + y² = 1. Then, we maximize (or minimize) f by first solving for one of the variables explicitly:
    f(x, y) = −x² − 2y²                  (1)

while satisfying

    x² + y² = 1                          (2)

[Figures: plots lost in extraction.]

    x² = 1 − y²                          (4)
    f(y) = −(1 − y²) − 2y²               (5)
         = −1 − y²                       (6)
Then, we're back to a one-dimensional unconstrained problem, which has a maximum at y = 0, where x = ±1 and f = −1. This shouldn't be too surprising; we're stuck on a circle which trades x² for y² linearly, while y² costs twice as much toward f. Finding the constrained minimum here is slightly more complex, and highlights one weakness of this approach; the one-dimensional problem is still actually somewhat constrained in that y must be in [−1, 1]. The minimum value occurs at both these boundary points, y = ±1, where x = 0 and f = −2.
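If the algebra above is reconstructed correctly (f(x, y) = −x² − 2y² on the unit circle), a brute-force scan along the circle should reproduce the constrained maximum −1 and minimum −2; this sketch assumes that reconstruction.

```python
import math

# Numerical check of the substitution result (assuming, as reconstructed
# above, f(x, y) = -x**2 - 2*y**2 on the unit circle x**2 + y**2 = 1).
# Parameterize the circle by angle t, so the constraint is satisfied
# automatically, and scan f along it.

def f(x, y):
    return -x**2 - 2 * y**2

samples = [2 * math.pi * k / 100000 for k in range(100000)]
values = [(f(math.cos(t), math.sin(t)), math.cos(t), math.sin(t))
          for t in samples]

best = max(values)   # constrained maximum (f, x, y)
worst = min(values)  # constrained minimum (f, x, y)
print(best[0], worst[0])  # approximately -1 and -2
```

The scan finds the maximum at (±1, 0) and the minimum at (0, ±1), matching the substitution argument.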
²Curves which touch but do not cross are tangent, but feel free to verify it by checking derivatives!
Figure 4: Level curves of the paraboloid, intersecting the constraint circle.

This intuition is very important; the entire enterprise of Lagrange multipliers (which are coming soon, really!) rests on it. So here's another, equivalent, way of looking at the tangent requirement, which generalizes better. Consider again the zooms in figure 4. Now think about the normal vectors of the contour and constraint curves. The two curves being tangent at a point is equivalent to the normal vectors being parallel at that point. The contour is a level curve, and so the gradient of f, ∇f, is normal to it. But that means that at an extreme point x*, the gradient of f will be perpendicular to g as well. This should also make sense: the gradient is the direction of steepest ascent. At a solution x*, we must be on g, and, while it is fine for f to have a non-zero gradient, the direction of steepest ascent had better be perpendicular to g. Otherwise, we can project ∇f onto g, get a non-zero direction along g, and nudge x along that direction, increasing f but staying on g. If the directions of steepest increase and decrease take you perpendicularly off of g, then, even if you are not at an unconstrained maximum of f, there is no local move you can make to increase f which does not take you out of the feasible region. Formally, we can write our claim that the normal vectors are parallel at an extreme point as:

    ∇f(x*) = λ∇g(x*)
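As a sanity check on the parallel-normals claim (using the circle-paraboloid example as reconstructed above, f(x, y) = −x² − 2y² and g(x, y) = x² + y²), the 2-D cross product of ∇f and ∇g should vanish at the constrained maximum (1, 0) but not at a generic feasible point:

```python
import math

# Check (assuming the reconstructed example f(x, y) = -x**2 - 2*y**2
# with constraint g(x, y) = x**2 + y**2 = 1): at the constrained maximum
# (1, 0) the gradients of f and g are parallel, so their 2-D cross
# product vanishes; at a non-extreme feasible point it does not.

def grad_f(x, y):
    return (-2 * x, -4 * y)

def grad_g(x, y):
    return (2 * x, 2 * y)

def cross2(u, v):
    # z-component of the 3-D cross product; zero iff u and v are parallel
    return u[0] * v[1] - u[1] * v[0]

at_max = cross2(grad_f(1.0, 0.0), grad_g(1.0, 0.0))
elsewhere = cross2(grad_f(math.cos(1.0), math.sin(1.0)),
                   grad_g(math.cos(1.0), math.sin(1.0)))
print(at_max, abs(elsewhere) > 1e-6)  # 0.0 True
```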
So, our method for finding extreme points³ which satisfy the constraints is to look for points where the following equations hold true:

    ∇f(x) = λ∇g(x)
    g(x) = c

We can capture both conditions as the points where the partial derivatives of the Lagrangian

    Λ(x, λ) = f(x) − λ(g(x) − c)

are all zero. The partial derivatives with respect to x recover the parallel-normals equations, while the partial derivative with respect to λ recovers the constraint g(x) = c. The λ is our first Lagrange multiplier. Let's re-solve the circle-paraboloid problem from above using this method. It was so easy to solve with substitution that the Lagrange multiplier method isn't any easier (in fact it's harder), but at least it illustrates the method. The Lagrangian is:

    Λ(x, y, λ) = −x² − 2y² − λ(x² + y² − 1)
and we want

    ∂Λ/∂x = −2x − 2λx = 0
    ∂Λ/∂y = −4y − 2λy = 0
    ∂Λ/∂λ = −(x² + y² − 1) = 0
³We can sort out after we find them which are minima, maxima, or neither.
From the first two equations, we must have either x = 0 or y = 0. If x = 0, then λ = −2, y = ±1, and f = −2. If y = 0, then λ = −1, x = ±1, and f = −1. These are the minimum and maximum, respectively. Let's say we instead want the constraint that x and y sum to 1 (x + y = 1). Then, we have the situation in figure ??(right). Before we do anything numeric, convince yourself from the picture that the maximum is going to occur in the (+,+) quadrant, at a point where the line is tangent to a level curve of f. Also convince yourself that the minimum will not be defined; the values of f get arbitrarily low in both directions along the line away from the maximum. Formally, we have

    Λ(x, y, λ) = −x² − 2y² − λ(x + y − 1)
and we want

    ∂Λ/∂x = −2x − λ = 0
    ∂Λ/∂y = −4y − λ = 0
    ∂Λ/∂λ = −(x + y − 1) = 0
We can see from the first two equations that 2x = 4y, i.e. x = 2y, which, since they sum to one, means x = 2/3, y = 1/3. At those values, λ = −4/3 and f = −2/3. So what do we have so far? Given a function and a constraint, we can write the Lagrangian, differentiate, and solve for zero. Actually solving that system of equations can be hard, but note that the Lagrangian is a function of n + 1 variables (the n components of x plus λ) and so we do have the right number of equations to hope for unique, existing solutions: n from the x partial derivatives, plus one from the λ partial derivative.
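A quick numeric check of this solution (assuming the reconstructed Lagrangian Λ = −x² − 2y² − λ(x + y − 1)): all three partials should vanish at (x, y, λ) = (2/3, 1/3, −4/3), and f there should be −2/3.

```python
# Verifying the solution of the reconstructed system for the constraint
# x + y = 1: the Lagrangian partials should all vanish at
# (x, y, lam) = (2/3, 1/3, -4/3), and f there should be -2/3.

def lagrangian_partials(x, y, lam):
    # Partials of L(x, y, lam) = -x**2 - 2*y**2 - lam*(x + y - 1)
    dx = -2 * x - lam
    dy = -4 * y - lam
    dlam = -(x + y - 1)
    return dx, dy, dlam

x, y, lam = 2 / 3, 1 / 3, -4 / 3
partials = lagrangian_partials(x, y, lam)
f_value = -x**2 - 2 * y**2

print(all(abs(p) < 1e-9 for p in partials))  # True
print(f_value)  # about -2/3
```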
With multiple constraints, there are fewer directions that x is free to vary along. Moreover, given the constraints, x cannot make any local movement along vectors which have any component perpendicular to any constraint. Therefore, our condition should again be that ∇f, while not necessarily zero, is entirely contained in the subspace spanned by the constraints' normals. We can express this by the equation

    ∇f(x) = Σᵢ λᵢ ∇gᵢ(x)
which asserts that ∇f be a linear combination of the normals, with weights λᵢ. It turns out that tossing all the constraints into a single Lagrangian accomplishes this:

    Λ(x, λ) = f(x) − Σᵢ λᵢ (gᵢ(x) − cᵢ)
It should be clear that differentiating Λ with respect to λᵢ and setting equal to zero recovers the i-th constraint, gᵢ(x) = cᵢ, while differentiating with respect to x recovers the assertion that the gradient of f have no components which aren't spanned by the constraints' normals.

As an example of multiple constraints, consider figure ??. Imagine that f is the distance from the origin. Thus, the level surfaces of f are concentric spheres with the gradient pointing straight out of the spheres. Let's say we want the minimum of f subject to the constraints x = 1 and y = 1, shown as planes in the figure. Again imagine the sphere as expanding from the center, until it makes contact with the planes. The unconstrained minimum is, of course, at the origin, where f is zero. The sphere grows, and f increases. When the sphere's radius reaches one, the sphere touches both planes individually. At the points of contact, the gradient of f is perpendicular to the touching plane. Those points would be solutions if that plane were the only constraint. When the sphere reaches a radius of √2, it is touching both planes along their line of intersection. Note that the gradient is not zero at that point, nor is it perpendicular to either surface. However, it is parallel to an (equal) combination of the two planes' normal vectors, or, equivalently, it lies inside the plane spanned by those vectors (the plane is not shown due to my lacking matlab skills).

A good way to think about the effect of adding constraints is as follows. Before there are any constraints, there are n dimensions for x to vary along when maximizing, and we want to find points where all dimensions have zero gradient. Every time we add a constraint, we restrict one dimension, so we have less freedom in maximizing. However, that constraint also removes a dimension along which the gradient must be zero. So, in the nice case, we should be able to add as many or few constraints (up to n) as we wish, and everything should work out.⁴
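A minimal sketch of the span condition (assuming, for concreteness, f as the squared distance x² + y² + z² and the constraint planes x = 1 and y = 1 described above): at the constrained minimum (1, 1, 0), ∇f is a weighted combination of the two plane normals even though it is parallel to neither alone.

```python
# Sketch of the multiple-constraint condition (assuming, as in the sphere
# example above, f = x**2 + y**2 + z**2 with constraint planes x = 1 and
# y = 1): at the constrained minimum (1, 1, 0), grad f = (2, 2, 0) is a
# combination of the plane normals (1, 0, 0) and (0, 1, 0) with weights
# lam1 = lam2 = 2, even though it is neither zero nor parallel to either
# normal alone.

grad_f = (2.0, 2.0, 0.0)   # gradient of x**2 + y**2 + z**2 at (1, 1, 0)
n1 = (1.0, 0.0, 0.0)       # normal of the plane x = 1
n2 = (0.0, 1.0, 0.0)       # normal of the plane y = 1
lam1, lam2 = 2.0, 2.0      # multipliers

combo = tuple(lam1 * a + lam2 * b for a, b in zip(n1, n2))
print(combo == grad_f)  # True
```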
⁴In the not-nice cases, all sorts of things can go wrong. Constraints may be unsatisfiable, or subtler situations can prevent the Lagrange multipliers from existing [more].
3 The Lagrangian
The Lagrangian is a function of n + m variables (remember that x has n components, plus there is one λᵢ for each of the m constraints). Differentiating gives the corresponding n + m equations, each set to zero, to solve. The n equations from differentiating with respect to each xⱼ recover our gradient conditions. The m equations from differentiating with respect to the λᵢ recover the constraints gᵢ(x) = cᵢ. So the numbers give us some confidence that we have the right number of equations to hope for point solutions. It's helpful to have an idea of what the Lagrangian actually means. There are two intuitions, described below.
Now remember that if your x happens to satisfy the constraints, then Λ(x, λ) = f(x), regardless of what λ is. However, if x does not satisfy the constraints, then some gᵢ(x) − cᵢ ≠ 0. But then the corresponding λᵢ can be fiddled to make Λ as small as desired, and min_λ Λ(x, λ) will be −∞. So max_x min_λ Λ(x, λ) will be the maximum value of f subject to the constraints.
This is part of the full Kuhn-Tucker theorem (cite), which we aren't going to prove rigorously. However, the intuition behind why it's true is important. Before we examine why this reversal should work, let's see what it accomplishes if it's true. We originally had a constrained optimization problem. We would very much like for it to become an unconstrained optimization problem. Once we fix the values of the multipliers λ, Λ becomes a function of x alone. We might be able to maximize that function (it's unconstrained!) relatively easily. If so, we would get a solution for each λ, call it x*(λ). But then we can do an unconstrained minimization of Λ(x*(λ), λ) over the space of λ. We would then have our solution. It might not be clear why that's any different from fixing x and finding a minimizing λ for each x. It's different in two ways. First, unlike max_x Λ(x, λ), min_λ Λ(x, λ) would not be continuous. (Remember that it is negative infinity almost everywhere and jumps to f(x) for x which satisfy the constraints.) Second, it is often the case that we can find a closed-form solution to max_x Λ(x, λ) while we have nothing useful to say about min_λ Λ(x, λ). This is also a general instance of switching to a dual problem when a primal problem is unpleasant in some way. [cites]
3.3 Duality
Let's say we're convinced that it would be a good thing if

    max_x min_λ Λ(x, λ) = min_λ max_x Λ(x, λ)
Now, we'll argue why this is true: examples first, intuition second, formal proof elsewhere. Recall the no-brainer problem: maximize f(x) = −x² subject to x = 1. Let's form the Lagrangian:

    Λ(x, λ) = −x² − λ(x − 1)
This surface is plotted in figure ??. The straight dark line is the value of Λ at x = 1. At that value, the constraint is satisfied and so, as promised, λ has no effect and Λ(1, λ) = f(1) = −1 for all λ. At each x, min_λ Λ(x, λ) will be −∞, except for x = 1, where it is −1. For each λ, the maximizing x is x*(λ), and the curving dark line is the value of Λ at x*(λ) for all λ. The minimum value along this line is at λ = −2, where x*(λ) = 1 and Λ = −1, which is the maximum (and only) value of f among the point(s) satisfying the constraint.
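Tracing the dual numerically (this assumes the reconstruction f(x) = −x², constraint x = 1, and the Lagrangian sign convention used above): for each λ, maximize Λ in closed form, then scan the resulting dual values for their minimum.

```python
# Tracing the no-brainer dual numerically (assuming the reconstruction
# f(x) = -x**2 with constraint x = 1): for each lam, the maximizing x is
# x_star(lam) = -lam / 2, and the dual value L(x_star(lam), lam) =
# lam**2 / 4 + lam is minimized at lam = -2, where x_star = 1 and L = -1.

def L(x, lam):
    return -x**2 - lam * (x - 1)

def x_star(lam):
    # argmax over x of L(x, lam): dL/dx = -2x - lam = 0
    return -lam / 2

lams = [k / 100 for k in range(-400, 401)]  # lam in [-4, 4]
dual = [(L(x_star(l), l), l) for l in lams]
best_value, best_lam = min(dual)
print(best_value, best_lam, x_star(best_lam))  # -1.0 -2.0 1.0
```

The minimum of the (unconstrained) dual lands exactly on the constrained maximum, which is the reversal at work.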
[Figure: four panels of the Lagrangian at lambda = 0, 1.6667, 1.3333, 1; plots lost in extraction.]
If we nudge λ up from its critical value, then the gradient of f is under-cancelled and we can increase the Lagrangian by nudging x toward the origin. Similarly, if we nudge λ down, then the gradient of f is over-cancelled and we can increase the Lagrangian by nudging x away from the origin.
5 Extensions
5.1 Inequality Constraints
The Lagrange multiplier method also covers the case of inequality constraints. Constraints of this form are written g(x) ≥ c. The key observation about how inequality constraints work is that, at any given x, a constraint g has either g(x) = c or g(x) > c, which are qualitatively very different. The two possibilities are shown in figure ??. If g(x) = c then g is said to be active at x, otherwise it is inactive. If g is active at x, then g is a lot like an equality constraint; it allows x to be a maximum if the gradient of f, ∇f, is either zero or pointing towards lower values of g (which violate the constraint). However, if the gradient is pointing towards higher values of g, then there is no reason that we cannot move x in that direction. Recall that we used to write

    ∇f(x) = λ∇g(x)
for a (single) equality constraint. The interpretation was that, if x* is a solution, ∇f(x*) points along the direction of the normal to g, ∇g(x*). For inequality constraints, we write the same equation:

    ∇f(x) = λ∇g(x)
but, if x is a maximum, then if ∇f is non-zero, it not only has to be parallel to ∇g, but it must actually point in the opposite sense along that direction (i.e., out of the feasible side and towards the forbidden side). We can actually enforce this very simply, by restricting the multiplier λ to be negative (or zero). Positive multipliers mean that the direction of increasing g is the same direction as increasing f, but points in that situation certainly aren't solutions, as we want to increase f and we are allowed to increase g. If g is inactive at x (g(x) > c), then we want to be even stricter about what values of λ are acceptable for a solution. In fact, in this case, λ must be zero at x. (Intuitively, if g is inactive, then nothing should change at x if we drop g.) [better explanation] In summary, for inequality constraints, we add them to the Lagrangian just as if they were equality constraints, except that we require that λ ≤ 0 and that, if λ is not zero, then g(x) − c is. The situation that one or the other can be non-zero, but not both, is referred to as complementary slackness, and can be compactly written as λ(g(x) − c) = 0. Bundling it all up, complete with multiple constraints, we get the general Lagrangian:

    Λ(x, λ) = f(x) − Σᵢ λᵢ (gᵢ(x) − cᵢ)
The Kuhn-Tucker theorem (or our intuitive arguments) tells us that if a point x* is a maximum of f subject to the constraints gᵢ(x) ≥ cᵢ, then:

    1. ∇f(x*) = Σᵢ λᵢ ∇gᵢ(x*)
    2. λᵢ ≤ 0 for all i
    3. λᵢ (gᵢ(x*) − cᵢ) = 0 for all i    (37)
The second condition takes care of the restriction on active inequalities. The third condition is a somewhat cryptic way of insisting that, for each i, either λᵢ is zero or gᵢ(x*) − cᵢ is zero. Now is probably a good time to point out that there is more to the Kuhn-Tucker theorem than the above statement. The above conditions are called the first-order conditions. All (local) maxima will satisfy them. The theorem also gives second-order conditions on the second derivative (Hessian) matrices which distinguish local maxima from other situations which can trigger false alarms with the first-order conditions. However, in many situations, one knows in advance that the solution will be a maximum (such as in the maximum entropy example). [Caveat about globals?]
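Here is a small check of the first-order conditions on a toy problem of my own (not from the text): maximize f(x) = −x² subject to x ≥ c, in both the active and the inactive case, using the λ ≤ 0 convention.

```python
# Illustration (my example, not from the text) of the first-order
# conditions for f(x) = -x**2 with the inequality constraint
# g(x) = x >= c, using the text's convention lam <= 0.
#   - Active case   (c =  1): the maximum is x* = 1, with lam = -2.
#   - Inactive case (c = -1): the maximum is x* = 0, with lam = 0.

def check_first_order(x, lam, c, tol=1e-12):
    grad_f = -2 * x   # derivative of -x**2
    grad_g = 1.0      # derivative of g(x) = x
    feasible = x >= c - tol
    stationary = abs(grad_f - lam * grad_g) < tol  # grad f = lam * grad g
    sign_ok = lam <= tol                           # lam <= 0
    slack = abs(lam * (x - c)) < tol               # complementary slackness
    return feasible and stationary and sign_ok and slack

active_ok = check_first_order(x=1.0, lam=-2.0, c=1.0)
inactive_ok = check_first_order(x=0.0, lam=0.0, c=-1.0)
print(active_ok, inactive_ok)  # True True
```

In the active case the multiplier is strictly negative and the constraint is tight; in the inactive case the multiplier is zero, exactly as complementary slackness demands.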
6 Conclusion
This tutorial only introduces the basic concepts of the Lagrange multiplier method. If you are interested, there are many detailed texts on the subject [cites]. The goal of this tutorial was to supply some intuition behind the central ideas so that other, more comprehensive and formal sources become more accessible. Feedback requested!