Karush-Kuhn-Tucker Conditions
Geoff Gordon & Ryan Tibshirani
Optimization 10-725 / 36-725
Remember duality
Given a minimization problem

    min_{x ∈ R^n}   f(x)
    subject to      h_i(x) ≤ 0,   i = 1, ..., m
                    ℓ_j(x) = 0,   j = 1, ..., r
the corresponding dual problem is

    max_{u ∈ R^m, v ∈ R^r}   g(u, v)
    subject to               u ≥ 0

where g(u, v) = min_{x ∈ R^n} L(x, u, v) is the Lagrange dual function, and L(x, u, v) = f(x) + Σ_{i=1}^m u_i h_i(x) + Σ_{j=1}^r v_j ℓ_j(x) is the Lagrangian.
Important properties:
• The dual problem is always convex, i.e., g is always concave (even if the primal problem is not convex)
• The primal and dual optimal values, f⋆ and g⋆, always satisfy weak duality: f⋆ ≥ g⋆
• Slater's condition: for a convex primal, if there is an x such that h_1(x) < 0, ..., h_m(x) < 0 and ℓ_1(x) = 0, ..., ℓ_r(x) = 0 (i.e., x is strictly feasible), then strong duality holds: f⋆ = g⋆
Duality gap
Given a primal feasible x and dual feasible u, v, the quantity

    f(x) − g(u, v)

is called the duality gap. Since f(x) − f⋆ ≤ f(x) − g(u, v), a small duality gap certifies that x is nearly optimal, and a zero duality gap certifies that x is primal optimal (and u, v are dual optimal).
Dual norms
Let ‖x‖ be a norm, e.g.,
• ℓ_p norm: ‖x‖_p = (Σ_{i=1}^n |x_i|^p)^{1/p}, for p ≥ 1
• Nuclear norm: ‖X‖_nuc = Σ_{i=1}^r σ_i(X)
Its dual norm is defined as

    ‖x‖_* = max_{‖z‖ ≤ 1} z^T x

Dual norm of the dual norm: it turns out that ‖x‖_** = ‖x‖ ... more on connections to duality (including this one) in coming lectures
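As a quick illustration of the definition (not from the slides), here is a minimal NumPy sketch computing ‖x‖_* when ‖·‖ is the ℓ_∞ norm: the maximizer of z^T x over ‖z‖_∞ ≤ 1 is z = sign(x), so the dual norm is ‖x‖_1.

```python
# Minimal sketch (illustrative, not part of the lecture) of the dual-norm definition
#   ||x||_* = max_{||z|| <= 1} z^T x
# for ||.|| = l_infinity, whose dual is the l_1 norm, attained at z = sign(x).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)

dual_closed_form = np.sign(x) @ x                  # equals ||x||_1
zs = rng.uniform(-1.0, 1.0, size=(100_000, 5))     # random z with ||z||_inf <= 1
dual_sampled = (zs @ x).max()                      # never exceeds the closed form

print(dual_closed_form, np.abs(x).sum(), dual_sampled)
```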
Outline
Today:
• KKT conditions
• Examples
• Constrained and Lagrange forms
• Uniqueness with 1-norm penalties
Karush-Kuhn-Tucker conditions
Given the general problem

    min_{x ∈ R^n}   f(x)
    subject to      h_i(x) ≤ 0,   i = 1, ..., m
                    ℓ_j(x) = 0,   j = 1, ..., r

the Karush-Kuhn-Tucker (KKT) conditions for x and multipliers u, v are:
• Stationarity: 0 ∈ ∂( f(x) + Σ_{i=1}^m u_i h_i(x) + Σ_{j=1}^r v_j ℓ_j(x) )
• Complementary slackness: u_i · h_i(x) = 0 for all i
• Primal feasibility: h_i(x) ≤ 0, ℓ_j(x) = 0 for all i, j
• Dual feasibility: u_i ≥ 0 for all i
Necessity
Let x⋆ and u⋆, v⋆ be primal and dual solutions with zero duality gap (as holds, e.g., under strong duality). Then

    f(x⋆) = g(u⋆, v⋆)
          = min_{x ∈ R^n}  f(x) + Σ_{i=1}^m u⋆_i h_i(x) + Σ_{j=1}^r v⋆_j ℓ_j(x)
          ≤ f(x⋆) + Σ_{i=1}^m u⋆_i h_i(x⋆) + Σ_{j=1}^r v⋆_j ℓ_j(x⋆)
          ≤ f(x⋆)

In other words, all of these inequalities are actually equalities.
Two things to learn from this:
• The point x⋆ minimizes L(x, u⋆, v⋆) over x ∈ R^n. Hence the subdifferential of L(x, u⋆, v⋆) must contain 0 at x = x⋆; this is exactly the stationarity condition
• We must have Σ_{i=1}^m u⋆_i h_i(x⋆) = 0, and since each term here is ≤ 0, this implies u⋆_i h_i(x⋆) = 0 for every i; this is exactly complementary slackness
Sufficiency
If there exist x⋆, u⋆, v⋆ that satisfy the KKT conditions, then

    g(u⋆, v⋆) = f(x⋆) + Σ_{i=1}^m u⋆_i h_i(x⋆) + Σ_{j=1}^r v⋆_j ℓ_j(x⋆)
              = f(x⋆)

where the first equality holds from stationarity, and the second holds from complementary slackness. Hence the duality gap is zero; since x⋆ and u⋆, v⋆ are also primal and dual feasible, they are primal and dual optimal.
Putting it together
In summary, the KKT conditions are:
• always sufficient
• necessary under strong duality
Putting it together: for a problem with strong duality (e.g., a convex problem satisfying Slater's condition),

    x⋆ and u⋆, v⋆ are primal and dual solutions
    ⟺  x⋆ and u⋆, v⋆ satisfy the KKT conditions
What’s in a name?
Older folks will know these as the KT (Kuhn-Tucker) conditions:
• First appeared in a publication by Kuhn and Tucker in 1951
• Later people found out that Karush had the conditions in his
unpublished master’s thesis of 1939
Quadratic with equality constraints
Consider, for Q ⪰ 0,

    min_{x ∈ R^n}   (1/2) x^T Q x + c^T x
    subject to      Ax = 0

This is a convex problem with only equality constraints, so by the KKT conditions, x is a solution if and only if

    [ Q  A^T ] [ x ]   [ −c ]
    [ A   0  ] [ u ] = [  0 ]

for some u: the first block row is stationarity and the second is primal feasibility. This is just a linear system.
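Here is a minimal NumPy sketch (not from the slides; the random Q, c, A are purely illustrative) that forms and solves this KKT linear system and checks the two block rows.

```python
# Sketch: solve  min (1/2) x^T Q x + c^T x  s.t.  Ax = 0  via the KKT system
#     [Q  A^T] [x]   [-c]
#     [A   0 ] [u] = [ 0]
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 2
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)        # positive definite; with full-row-rank A the KKT matrix is nonsingular
c = rng.standard_normal(n)
A = rng.standard_normal((m, n))

K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
rhs = np.concatenate([-c, np.zeros(m)])
sol = np.linalg.solve(K, rhs)
x, u = sol[:n], sol[n:]

print("stationarity residual:", np.linalg.norm(Q @ x + c + A.T @ u))
print("feasibility residual: ", np.linalg.norm(A @ x))
```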
Water-filling
Example from B & V page 245: consider the problem

    min_{x ∈ R^n}   − Σ_{i=1}^n log(α_i + x_i)
    subject to      x ≥ 0,  1^T x = 1

The KKT conditions are

    −1/(α_i + x_i) − u_i + v = 0,   i = 1, ..., n
    u_i · x_i = 0,  i = 1, ..., n,     x ≥ 0,  1^T x = 1,  u ≥ 0

Eliminate u:

    1/(α_i + x_i) ≤ v,   i = 1, ..., n
    x_i (v − 1/(α_i + x_i)) = 0,  i = 1, ..., n,     x ≥ 0,  1^T x = 1
We can argue directly that stationarity and complementary slackness imply

    x_i = { 1/v − α_i   if v ≤ 1/α_i
          { 0           if v > 1/α_i
        = max{0, 1/v − α_i},   i = 1, ..., n

Together with the constraint 1^T x = 1, this leaves a univariate equation in v, which can be solved, e.g., by bisection.
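A minimal NumPy sketch (not from the slides; the water_filling routine and the example α values are illustrative) that solves for v and recovers x from the KKT characterization:

```python
# Sketch: water-filling via the KKT conditions, x_i = max(0, 1/v - alpha_i),
# with v chosen by bisection so that sum_i x_i = 1.
import numpy as np

def water_filling(alpha, tol=1e-12):
    alpha = np.asarray(alpha, dtype=float)
    x_of_v = lambda v: np.maximum(0.0, 1.0 / v - alpha)
    # sum_i x_i(v) is nonincreasing in v; bracket the value of v where it equals 1
    lo, hi = 1e-12, 1.0 / alpha.min() + 1.0
    while hi - lo > tol:
        v = 0.5 * (lo + hi)
        if x_of_v(v).sum() > 1.0:
            lo = v
        else:
            hi = v
    return x_of_v(0.5 * (lo + hi))

alpha = np.array([0.1, 0.5, 1.0, 2.0])
x = water_filling(alpha)
print(x, x.sum())   # x >= 0 and sums to 1
```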
Lasso
Let's return to the lasso problem: given a response y ∈ R^n and predictors A ∈ R^{n×p} (with columns A_1, ..., A_p), solve

    min_{x ∈ R^p}   (1/2) ‖y − Ax‖_2^2 + λ ‖x‖_1

The KKT conditions are

    A^T (y − Ax) = λ s

where s ∈ ∂‖x‖_1, i.e.,

    s_i ∈ { {1}        if x_i > 0
          { {−1}       if x_i < 0
          { [−1, 1]    if x_i = 0
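To make the condition concrete, here is a rough sketch (not from the slides; the coordinate-descent solver and the random data are illustrative) that fits a lasso and then checks A^T(y − Ax) = λs numerically:

```python
# Sketch: lasso by coordinate descent (soft-thresholding), then a KKT check:
# |A_i^T (y - Ax)| <= lambda for all i, with equality and the sign of x_i
# on the active set.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(A, y, lam, n_iter=500):
    n, p = A.shape
    x = np.zeros(p)
    col_sq = (A ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - A @ x + A[:, j] * x[j]          # partial residual
            x[j] = soft_threshold(A[:, j] @ r_j, lam) / col_sq[j]
    return x

rng = np.random.default_rng(0)
n, p, lam = 50, 10, 5.0
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)

x = lasso_cd(A, y, lam)
g = A.T @ (y - A @ x)                                  # should equal lam * s
active = np.abs(x) > 1e-10
print("max_i |g_i| - lam (roughly <= 0):", np.abs(g).max() - lam)
print("g_i = lam * sign(x_i) on active set:",
      np.allclose(g[active], lam * np.sign(x[active]), atol=1e-4))
```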
For the group lasso variant, where x is split into blocks x_(1), ..., x_(K) of sizes p_(1), ..., p_(K) (with corresponding column blocks A_(i)) and the penalty is λ Σ_i √p_(i) ‖x_(i)‖_2, the KKT conditions imply that any active block x_(i) ≠ 0 satisfies

    x_(i) = ( A_(i)^T A_(i) + (λ √p_(i) / ‖x_(i)‖_2) I )^{−1} A_(i)^T r_{−(i)},
    where r_{−(i)} = y − Σ_{j ≠ i} A_(j) x_(j)
Constrained and Lagrange forms
Often in statistics and machine learning we'll switch back and forth between the constrained form, where t ∈ R is a tuning parameter,

    min_{x ∈ R^n}  f(x)   subject to  h(x) ≤ t        (C)

and the Lagrange (penalized) form, where λ ≥ 0 is a tuning parameter,

    min_{x ∈ R^n}  f(x) + λ h(x)                      (L)

and claim that these are equivalent. Is this true (assuming convex f, h)?

Conclusion:

    ∪_{λ ≥ 0} {solutions of (L)}  ⊆  ∪_{t} {solutions of (C)}

    ∪_{λ ≥ 0} {solutions of (L)}  ⊇  ∪_{t such that (C) is strictly feasible} {solutions of (C)}

(Note that when h is a norm, the only t at which (C) is feasible but not strictly feasible is t = 0, since {x : h(x) ≤ t} ≠ ∅ and {x : h(x) < t} = ∅ force t = 0.)
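A toy sketch of this correspondence (not from the slides; the one-dimensional f, h and the closed-form solvers are illustrative): with f(x) = (1/2)(x − 3)^2 and h(x) = |x|, the (L) solution is a soft-thresholding of 3 and the (C) solution is a projection of 3 onto [−t, t]; taking t to be the value of h at the (L) solution makes the two coincide.

```python
# Sketch: constrained vs. Lagrange forms in one dimension,
#   (L)  min (1/2)(x-3)^2 + lam*|x|      -> soft-thresholding
#   (C)  min (1/2)(x-3)^2  s.t. |x| <= t -> projection onto [-t, t]
import numpy as np

def solve_lagrange(lam, a=3.0):
    return np.sign(a) * max(abs(a) - lam, 0.0)    # soft-threshold of a at lam

def solve_constrained(t, a=3.0):
    return float(np.clip(a, -t, t))               # project a onto [-t, t]

for lam in [0.5, 1.0, 2.5, 4.0]:
    x_L = solve_lagrange(lam)
    x_C = solve_constrained(abs(x_L))             # choose t = h(x_L)
    print(lam, x_L, x_C)                          # x_C matches x_L for every lam
```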
In other words, we've proved that rank(A_S) < |S| implies that s_i A_i is in the affine span of {s_j A_j : j ∈ S \ {i}} (an affine subspace of dimension < n)
[Figure: points A_1, A_2, A_3, A_4 illustrating the affine dependence]
Therefore, if the entries of A are drawn from a continuous probability distribution, then (with probability one) any solution must satisfy rank(A_S) = |S|
Back to duality
One of the most important uses of duality is that, under strong
duality, we can characterize primal solutions from dual solutions
Recall that under strong duality, the KKT conditions are necessary for optimality. Given dual solutions u⋆, v⋆, any primal solution x⋆ satisfies the stationarity condition

    0 ∈ ∂f(x⋆) + Σ_{i=1}^m u⋆_i ∂h_i(x⋆) + Σ_{j=1}^r v⋆_j ∂ℓ_j(x⋆)

In other words, x⋆ minimizes L(x, u⋆, v⋆) over x, which sometimes lets us recover x⋆ explicitly once u⋆, v⋆ are known.
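As a concrete sketch (not from the slides; it reuses the equality-constrained QP example with illustrative random data), we can solve the dual of min (1/2) x^T Q x + c^T x subject to Ax = 0 for u⋆ and then recover x⋆ as the minimizer of the Lagrangian:

```python
# Sketch: characterize the primal solution from the dual solution for
#   min (1/2) x^T Q x + c^T x  s.t.  Ax = 0,  with Q positive definite.
# Dual function: g(u) = -(1/2)(c + A^T u)^T Q^{-1} (c + A^T u), maximized where
#   A Q^{-1} A^T u = -A Q^{-1} c, and then x* = -Q^{-1}(c + A^T u*).
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 2
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)
c = rng.standard_normal(n)
A = rng.standard_normal((m, n))

Qinv = np.linalg.inv(Q)
u_star = np.linalg.solve(A @ Qinv @ A.T, -A @ Qinv @ c)   # dual solution
x_star = -Qinv @ (c + A.T @ u_star)                       # minimizer of L(x, u*)

print("feasibility  ||A x*||:", np.linalg.norm(A @ x_star))
print("stationarity ||Q x* + c + A^T u*||:", np.linalg.norm(Q @ x_star + c + A.T @ u_star))
```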
References
• S. Boyd and L. Vandenberghe (2004), Convex Optimization, Cambridge University Press (Chapter 5)