
Lecture: Convex Cardinality Optimization

Junyu Zhang

Based on Stephen Boyd’s lecture notes

October 28, 2019

Outline

Convex cardinality optimization

Case studies

Hardness of the problem

ℓ1-regularized methods

More example problems

Convex Cardinality Optimization

The cardinality of x ∈ R^n, written card(x), is the number of nonzero elements in x.

Cardinality is not continuous.

Cardinality is non-convex and non-concave. (Think of the 1-d case card(x), x ∈ R.)

Quasi-concave on R^n_+ since

card(λx + (1 − λ)y) ≥ min{card(x), card(y)}

Useful concept: support (sparsity pattern)

supp(x) := {i : x_i ≠ 0}

More about cardinality

It is often called the “ℓ0-norm”, written as ‖x‖_0. However, cardinality is not a norm.
Triangle inequality: ‖x + y‖_0 ≤ ‖x‖_0 + ‖y‖_0. (Yes)
Positive definiteness: ‖x‖_0 = 0 ⇐⇒ x = 0. (Yes)
Homogeneity: ‖a · x‖_0 = |a| · ‖x‖_0, ∀a ∈ R. (No)

Define d(x, y) := ‖x − y‖_0; then d(·, ·) is a metric (distance).

Triangle inequality: d(x, y) ≤ d(x, z) + d(z, y). (Yes)
Symmetry: d(x, y) = d(y, x). (Yes)
Identity of indiscernibles: d(x, y) = 0 ⇐⇒ x = y. (Yes)

If x ∈ {0, 1}^n, d(·, ·) is called the Hamming distance. It has many applications in data mining, e.g., recommendation systems.

Convex-cardinality problems

A convex-cardinality problem is a problem that would be convex without the presence of card(·).

Minimizing cardinality as the objective function:

minimize card(x)
s.t.     x ∈ C

Having cardinality as a constraint:

minimize f(x)
s.t.     x ∈ C
         card(x) ≤ k

Sparse portfolio selection
The problem setup:
n assets, each with expected return r_i, 1 ≤ i ≤ n.
The covariance matrix of the returns is Σ.
The weight invested in asset i is w_i.

The portfolio selection problem:

minimize  w^T Σ w
s.t.      r^T w = µ, 1^T w = 1.

The solution w∗ can be dense; many entries can be small but nonzero. Enforce sparsity with a cardinality constraint:

minimize  w^T Σ w
s.t.      r^T w = µ, 1^T w = 1,
          card(w) ≤ k.
Sparse portfolio selection

This is often formulated as the penalized problem

minimize  w^T Σ w + γ · card(w)
s.t.      r^T w = µ, 1^T w = 1.

In practice, solve

minimize  w^T Σ w + λ‖w‖_1
s.t.      r^T w = µ, 1^T w = 1.
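
A minimal sketch of the ℓ1-regularized portfolio problem, assuming cvxpy is available and Σ is positive semidefinite; the function name and interface are illustrative, not from the lecture:

```python
import cvxpy as cp

def sparse_portfolio_l1(r, Sigma, mu, lam):
    # l1-regularized surrogate of the cardinality-penalized portfolio problem:
    #   minimize w^T Sigma w + lam * ||w||_1  s.t.  r^T w = mu, 1^T w = 1.
    n = len(r)
    w = cp.Variable(n)
    objective = cp.quad_form(w, Sigma) + lam * cp.norm1(w)
    constraints = [r @ w == mu, cp.sum(w) == 1]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return w.value  # increase lam to obtain sparser weights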

High dimensional statistics

Fitting linear models:


y = x^T β∗ + ε,  ε ∼ N(0, σ²).

Collected data points {(x_i, y_i)}_{i=1}^n.

The linear regression:

β̂_n = arg min_β  Σ_{i=1}^n (x_i^T β − y_i)²

Classical scenario, abundant (nondegenerate) data:

lim_{n→∞} β̂_n = β∗

High dimensional statistics

Modern situation: x ∈ R^d, with d ≫ n.


Linear regression can be arbitrarily bad.
No theoretical guarantee.

Special structure, sparsity: card(β∗) ≤ O(n/ log d) ≪ d


Choose a proper k and solve

β̂_n = arg min_β  Σ_{i=1}^n (x_i^T β − y_i)²   s.t. card(β) ≤ k.

In practice, solve

β̂_n = arg min_β  Σ_{i=1}^n (x_i^T β − y_i)² + λ_n ‖β‖_1.

With a proper λ_n, a good bound can be derived for ‖β̂_n − β∗‖_2.
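
As an illustration of choosing λ_n in practice, a small simulated example using scikit-learn's cross-validated lasso (assumed available); the dimensions and noise level are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, d, s = 100, 1000, 10                      # n << d, true support of size s
beta_true = np.zeros(d)
beta_true[:s] = rng.standard_normal(s)
X = rng.standard_normal((n, d))
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Cross-validated choice of the penalty (sklearn's alpha plays the role of
# lambda_n, up to sklearn's 1/(2n) scaling of the squared loss).
model = LassoCV(cv=5).fit(X, y)
print("estimation error:", np.linalg.norm(model.coef_ - beta_true))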

Sparse coding / Dictionary learning
Unsupervised methods for learning sets of over-complete bases to
represent data efficiently.
Belief: natural images are generated by linear combinations of bases.

Figure: picture from Andrew Ng.


Sparse coding / Dictionary learning

The set of over-complete bases: Φ = [φ_1, φ_2, · · · , φ_m].

Assume each data point x_i ∈ R^d can be expressed as

x_i = Σ_{j=1}^m a_i^{(j)} · φ_j + ν_i = Φ a_i + ν_i

ν_i ∈ R^d is additive noise; card(a_i) is small.




Sparse coding: with a := [a_1, ..., a_n],

minimize_{a, Φ}  Σ_{i=1}^n ( ‖x_i − Φ a_i‖² + λ‖a_i‖_1 )

Fix a, convex in Φ; Fix Φ, convex in a.
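
A rough sketch of the alternating scheme this biconvexity suggests, assuming numpy and scikit-learn are available; the initialization and update rules below are one common choice, used here only for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

def dictionary_learning(X, m, lam=0.1, iters=20, seed=0):
    # X: (d, n) data matrix, m: number of atoms (over-complete if m > d).
    rng = np.random.default_rng(seed)
    d, n = X.shape
    Phi = rng.standard_normal((d, m))
    Phi /= np.linalg.norm(Phi, axis=0)                 # unit-norm atoms
    A = np.zeros((m, n))
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    for _ in range(iters):
        # Sparse coding step: Phi fixed, each column a_i solves a lasso problem.
        for i in range(n):
            A[:, i] = lasso.fit(Phi, X[:, i]).coef_
        # Dictionary step: A fixed, Phi solves a least-squares problem.
        Phi = X @ A.T @ np.linalg.pinv(A @ A.T)
        Phi /= np.maximum(np.linalg.norm(Phi, axis=0), 1e-12)
    return Phi, A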

Exact solution to convex-cardinality problems

Consider the case:

minimize_{x∈R^n}  f(x) + λ · card(x)
s.t.              x ∈ C

If supp(x) is given, the problem becomes a solvable convex problem.

Exact solution: enumerate all possible supp(x) and solve the resulting problems (see the sketch below). In the worst case: solve 2^n problems.

The problem is in general (NP-)hard.
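
For intuition, a brute-force sketch for the least-squares case f(x) = ‖Ax − b‖², feasible only for very small n; the function name and interface are illustrative:

```python
import itertools
import numpy as np

def exact_card_least_squares(A, b, lam):
    # Enumerate all 2^n supports and solve the restricted least-squares problem
    # on each, keeping the best value of ||Ax - b||^2 + lam * card(x).
    n = A.shape[1]
    best_val, best_x = np.inf, np.zeros(n)
    for k in range(n + 1):
        for S in itertools.combinations(range(n), k):
            x = np.zeros(n)
            if S:
                idx = list(S)
                x[idx] = np.linalg.lstsq(A[:, idx], b, rcond=None)[0]
            val = np.linalg.norm(A @ x - b) ** 2 + lam * k
            if val < best_val:
                best_val, best_x = val, x
    return best_x, best_val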

Binary linear programming

Binary LP problems:

minimize_x  c^T x
s.t.        Ax ≤ b, x ∈ {0, 1}^n.

Includes hard problems: traveling salesman, Hamiltonian cycle, etc.


Observation (“=” achieved if and only if x ∈ {0, 1}^n):

card(x) + card(1 − x) = Σ_{i=1}^n ( card(x_i) + card(1 − x_i) ) ≥ n.

Reduction:
minimize_x  c^T x
s.t.        Ax ≤ b,
            card(x) + card(1 − x) = n.

ℓ1-norm heuristic

Step 1. Replace card(x) with λ‖x‖_1. Solve the problem to get x̂∗.

Tune λ to get the desired sparsity. The larger λ is, the sparser the solution is.
Or more sophisticated versions: Σ_i w_i |x_i|.

Step 2. Polishing: fix the sparsity pattern supp(x̂∗) and re-solve the problem without regularization to get x∗.

Step 3. In statistics and machine learning, cross-validation is needed to ensure the selection of λ is reasonable.
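
A minimal sketch of Steps 1–2 for the least-squares case, assuming cvxpy and numpy; the tolerance used to read off the support is an illustrative choice:

```python
import cvxpy as cp
import numpy as np

def l1_heuristic_with_polishing(A, b, lam, tol=1e-6):
    n = A.shape[1]
    # Step 1: l1-regularized problem to select a sparsity pattern.
    x = cp.Variable(n)
    cp.Problem(cp.Minimize(cp.sum_squares(A @ x - b) + lam * cp.norm1(x))).solve()
    support = np.flatnonzero(np.abs(x.value) > tol)
    # Step 2: polishing -- re-solve without regularization on the fixed support.
    x_polished = np.zeros(n)
    if support.size > 0:
        x_polished[support] = np.linalg.lstsq(A[:, support], b, rcond=None)[0]
    return x_polished, support
```

Step 3 (cross-validation over λ) would wrap this routine in a loop over candidate values of λ.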

The bias in the solution

Solve  minimize_x f(x)  s.t. card(x) ≤ k  by

x̂_λ = arg min_x  f(x) + λ‖x‖_1

KKT condition: 0 ∈ ∇f(x̂_λ) + λ∂(‖x̂_λ‖_1); element-wise:

∇_i f(x̂_λ) = −λ,        if (x̂_λ)_i > 0
∇_i f(x̂_λ) = λ,         if (x̂_λ)_i < 0
∇_i f(x̂_λ) ∈ [−λ, λ],   if (x̂_λ)_i = 0

Even if we find the correct sparsity pattern, there is a nonzero bias. Namely, defining B = supp(x̂_λ),

‖∇_B f(x̂_λ)‖_2² = |B| · λ² ≠ 0
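
A quick numerical check of this stationarity condition, assuming cvxpy and numpy; the random data and sizes are illustrative:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
lam = 1.0

# f(x) = 0.5 * ||Ax - b||^2; solve the l1-regularized problem.
x = cp.Variable(20)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(A @ x - b) + lam * cp.norm1(x))).solve()

grad = A.T @ (A @ x.value - b)       # gradient of f at the solution
B = np.abs(x.value) > 1e-6           # estimated support
print(grad[B])                       # entries are approximately +/- lambda, not 0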

Solving the ℓ1-regularized problem
Suppose f is convex and smooth; consider

minimize_x  f(x) + λ · ‖x‖_1,  s.t. x ∈ C

ISTA / FISTA method. (Next page)


Subgradient method

x_{k+1} = x_k − η_k · (∇f(x_k) + λ v_k),  v_k ∈ ∂(‖x_k‖_1)

ADMM formulation, for simple f such as ‖Ax − b‖²

minimize_{x, y}  f(x) + λ · ‖y‖_1,  s.t. x = y, x ∈ C

Min-max formulation (‖x‖_1 = max_y {y^T x : ‖y‖_∞ ≤ 1})

minimize_{x∈C}  maximize_{‖y‖_∞ ≤1}  f(x) + λ y^T x

...
Iterative shrinkage-thresholding algorithm (ISTA)

Define x̃_{k+1} = x_k − η · ∇f(x_k). ISTA:

x_{k+1} = arg min_x  (1/(2η)) ‖x − x̃_{k+1}‖² + λ‖x‖_1

Equivalent formulation:

x_{k+1} = arg min_x  f(x_k) + ⟨∇f(x_k), x − x_k⟩ + λ‖x‖_1 + (1/(2η)) ‖x − x_k‖²
Separable: each entry solves arg min_z (z − a)² + b|z|, with b > 0.
Explicit update:

(x_{k+1})_i = (x̃_{k+1})_i − ηλ,   if (x̃_{k+1})_i ∈ (ηλ, ∞)
(x_{k+1})_i = 0,                  if (x̃_{k+1})_i ∈ [−ηλ, ηλ]
(x_{k+1})_i = (x̃_{k+1})_i + ηλ,   if (x̃_{k+1})_i ∈ (−∞, −ηλ)

Applying Nesterov’s acceleration gives fast ISTA (FISTA).
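
A compact sketch of ISTA and FISTA for f(x) = ½‖Ax − b‖² (unconstrained, C = R^n), assuming numpy; the step size 1/L with L = ‖A‖_2² is one standard choice:

```python
import numpy as np

def soft_threshold(v, t):
    # Elementwise soft-thresholding: the explicit update above with threshold t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, iters=500):
    # ISTA for min 0.5*||Ax - b||^2 + lam*||x||_1.
    eta = 1.0 / np.linalg.norm(A, 2) ** 2   # step size 1/L, L = ||A||_2^2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft_threshold(x - eta * A.T @ (A @ x - b), eta * lam)
    return x

def fista(A, b, lam, iters=500):
    # FISTA: ISTA plus Nesterov momentum on an auxiliary point z.
    eta = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    z, t = x.copy(), 1.0
    for _ in range(iters):
        x_new = soft_threshold(z - eta * A.T @ (A @ z - b), eta * lam)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x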


Interpretation as convex relaxation

For R large enough, consider

minimize_x  f(x) + γ · card(x)
s.t.        x ∈ C, ‖x‖_∞ ≤ R

Equivalent mixed-integer programming formulation:

minimize_{x,z}  f(x) + γ · 1^T z
s.t.            |x_i| ≤ R z_i, for i = 1, ..., n
                x ∈ C, z ∈ {0, 1}^n

Interpretation as convex relaxation

Relax z ∈ {0, 1}^n to z ∈ [0, 1]^n:

minimize_{x,z}  f(x) + γ · 1^T z
s.t.            |x_i| ≤ R z_i, for i = 1, ..., n
                x ∈ C, z ∈ [0, 1]^n

Fix x; minimizing over z gives z_i = |x_i|/R:

minimize_x  f(x) + (γ/R) · ‖x‖_1
s.t.        x ∈ C

Its optimal value lower bounds that of the original problem.

Sparse design

Find a design vector x with smallest card(x) such that some set of specifications is satisfied.

Zero entries of x simplify design (components that are not needed).

Problem formulation:

minimize card(x)
s.t.     x ∈ C

Examples:
antenna array beamforming (zero coefficients correspond to unneeded antenna elements)
truss design (zero coefficients correspond to bars that are not needed)
...

Truss design

Figure: Such a truss contains a huge number of steel bars; potentially many of them are not necessary.

Sparse modeling / regressor selection

Fitting data b as a linear combination of k regressors, chosen from n candidate regressors.

minimize_{x∈R^n}  ‖Ax − b‖_2
s.t.              card(x) ≤ k

Reformulations:

minimize_{x∈R^n}  ‖Ax − b‖_2 + λ · card(x)

minimize_{x∈R^n}  ‖Ax − b‖_2² + λ · card(x)

minimize_{x∈R^n}  card(x)   s.t. ‖Ax − b‖_2 ≤ ε

Lasso, feature selection, etc.

Generalized linear models: logistic regression, Poisson regression, etc.


Sparse signal reconstruction (compressed sensing)

For an unknown signal x ∈ R^d:


Noisy measurement y = Ax + v, v ∼ N(0, σ²I). A ∈ R^{n×d} is a known fat matrix, n ≪ d.
Prior knowledge: x is sparse, card(x) ≤ k.

Maximum likelihood estimator (MLE):

minimize_{x∈R^d}  ‖Ax − y‖_2²
s.t.              card(x) ≤ k

Exact recovery is expected under certain conditions (for the ℓ1 heuristic).
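
A small synthetic illustration of the ℓ1 heuristic for recovery, assuming cvxpy and numpy; the dimensions, noise level, and regularization weight are arbitrary illustrative choices:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
d, n, k = 200, 60, 5                          # ambient dim, measurements, sparsity
x_true = np.zeros(d)
x_true[rng.choice(d, k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((n, d)) / np.sqrt(n)  # random fat measurement matrix
y = A @ x_true + 0.01 * rng.standard_normal(n)

# l1 heuristic in place of the card-constrained MLE.
x = cp.Variable(d)
cp.Problem(cp.Minimize(cp.sum_squares(A @ x - y) + 0.1 * cp.norm1(x))).solve()
print("recovery error:", np.linalg.norm(x.value - x_true))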

Estimation with outliers

Data points {(x_i, y_i)}_{i=1}^n:  y_i = x_i^T β∗ + ε_i + w_i∗.

ε_i ∼ N(0, σ²) are i.i.d. noise.

w∗ = [w_1∗, ..., w_n∗]^T is sparse noise: card(w∗) ≤ k.

supp(w∗) corresponds to the set of outliers.

The MLE for β:

minimize_{β∈R^d, w∈R^n}  ‖y − Xβ − w‖_2²
s.t.                     card(w) ≤ k
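
A hedged cvxpy sketch of the usual convex surrogate, replacing card(w) ≤ k with an ℓ1 penalty on w; the function name and λ are illustrative:

```python
import cvxpy as cp

def regression_with_outliers(X, y, lam):
    # l1 surrogate of the sparse-outlier model: penalize ||w||_1 instead of card(w).
    n, d = X.shape
    beta, w = cp.Variable(d), cp.Variable(n)
    objective = cp.sum_squares(y - X @ beta - w) + lam * cp.norm1(w)
    cp.Problem(cp.Minimize(objective)).solve()
    return beta.value, w.value  # data points with large |w_i| flag likely outliers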

Robust principal component analysis∗

Data matrix M ∈ R^{n×n}. M has an approximate low-rank decomposition: M = UV^T + W, with U, V ∈ R^{n×k}, k ≪ n, and the noise matrix W is sparse.

Classical PCA problem:

minimize_{L∈R^{n×n}}  ‖M − L‖
s.t.                  rank(L) ≤ k

Robust PCA problem:

minimize_{L,W∈R^{n×n}}  ‖M − L − W‖
s.t.                    rank(L) ≤ k, card(W) ≤ k′.

Robust principal component analysis∗

Robust PCA problem:

minimize_{L,W∈R^{n×n}}  ‖M − L − W‖
s.t.                    rank(L) ≤ k, card(W) ≤ k′.

Matrix L is low-rank ⇐⇒ card(σ(L)) is small, where σ(L) is the vector of n singular values of L.

Nuclear norm: ‖L‖∗ := ‖σ(L)‖_1.

Practical formulation of robust PCA:

minimize_{L,W∈R^{n×n}}  ‖L‖∗ + λ‖W‖_1
s.t.                    L + W = M.
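
A minimal cvxpy sketch of this practical formulation, assuming cvxpy is available and M is small enough for a generic conic solver; ‖W‖_1 is implemented here as the elementwise sum of absolute values:

```python
import cvxpy as cp

def robust_pca(M, lam):
    # Nuclear norm for low rank, elementwise l1 for sparsity, exact split L + W = M.
    n1, n2 = M.shape
    L = cp.Variable((n1, n2))
    W = cp.Variable((n1, n2))
    objective = cp.normNuc(L) + lam * cp.sum(cp.abs(W))
    cp.Problem(cp.Minimize(objective), [L + W == M]).solve()
    return L.value, W.value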
