
Lecture: Convex Cardinality Optimization

Junyu Zhang

Based on Stephen Boyd’s lecture notes

October 28, 2019

Outline

Convex cardinality optimization

Case studies

Hardness of the problem

ℓ1-regularized methods

More example problems

Convex Cardinality Optimization

The cardinality of x ∈ R^n, written card(x), is the number of nonzero elements in x.

Cardinality is not continuous.

Cardinality is non-convex and non-concave. (Think of the 1-d case card(x), x ∈ R.)

Quasi-concave on R^n_+ since

card(λx + (1 − λ)y) ≥ min{card(x), card(y)}

Useful concept: support (sparsity pattern)

supp(x) := {i : x_i ≠ 0}

More about cardinality

It is often called the “ℓ0-norm”, written as ‖x‖_0. However, cardinality is not a norm.
Triangle inequality: ‖x + y‖_0 ≤ ‖x‖_0 + ‖y‖_0. (Yes)
Positive definiteness: ‖x‖_0 = 0 ⇐⇒ x = 0. (Yes)
Homogeneity: ‖a · x‖_0 = |a| · ‖x‖_0, ∀a ∈ R. (No)

Define d(x, y) := ‖x − y‖_0; then d(·, ·) is a metric (distance).

Triangle inequality: d(x, y) ≤ d(x, z) + d(z, y). (Yes)
Symmetry: d(x, y) = d(y, x). (Yes)
Identity of indiscernibles: d(x, y) = 0 ⇐⇒ x = y. (Yes)

If x ∈ {0, 1}^n, d(·, ·) is called the Hamming distance. It has many applications in data mining, e.g., recommendation systems.

Convex-cardinality problems

A convex-cardinality problem is a problem that would be convex without the presence of card(·).

Minimizing cardinality as the objective function:

minimize card(x)
s.t.     x ∈ C

Having cardinality as a constraint:

minimize f(x)
s.t.     x ∈ C
         card(x) ≤ k

Sparse portfolio selection
The problem setup:
n assets, each with expected return r_i, 1 ≤ i ≤ n.
The covariance matrix of the returns is Σ.
The weight invested in asset i is w_i.

The portfolio selection problem:

minimize  w^T Σ w
s.t.      r^T w = µ, 1^T w = 1.

The solution w∗ can be dense; many entries can be small but nonzero. Enforce sparsity with a cardinality constraint:

minimize  w^T Σ w
s.t.      r^T w = µ, 1^T w = 1,
          card(w) ≤ k.
Sparse portfolio selection

This is often formulated as the penalized problem

minimize  w^T Σ w + γ · card(w)
s.t.      r^T w = µ, 1^T w = 1.

In practice, solve

minimize  w^T Σ w + λ‖w‖_1
s.t.      r^T w = µ, 1^T w = 1.
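
A minimal sketch of the ℓ1-regularized portfolio problem, assuming cvxpy is available and Σ is positive semidefinite; the function name and interface are illustrative, not from the lecture:

```python
import cvxpy as cp

def sparse_portfolio_l1(r, Sigma, mu, lam):
    # l1-regularized surrogate of the cardinality-penalized portfolio problem:
    #   minimize w^T Sigma w + lam * ||w||_1  s.t.  r^T w = mu, 1^T w = 1.
    n = len(r)
    w = cp.Variable(n)
    objective = cp.quad_form(w, Sigma) + lam * cp.norm1(w)
    constraints = [r @ w == mu, cp.sum(w) == 1]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return w.value  # increase lam to obtain sparser weights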

High dimensional statistics

Fitting linear models:


y = x^T β∗ + ε,  ε ∼ N(0, σ²).

Collected data points {(x_i, y_i)}_{i=1}^n.

The linear regression:

β̂_n = arg min_β  Σ_{i=1}^n (x_i^T β − y_i)²

Classical scenario, abundant (nondegenerate) data:

lim_{n→∞} β̂_n = β∗

High dimensional statistics

Modern situation: x ∈ R^d, with d ≫ n.


Linear regression can be arbitrarily bad.
No theoretical guarantee.

Special structure, sparsity: card(β∗) ≤ O(n/ log d) ≪ d


Choose a proper k and solve

β̂_n = arg min_β  Σ_{i=1}^n (x_i^T β − y_i)²   s.t. card(β) ≤ k.

In practice, solve

β̂_n = arg min_β  Σ_{i=1}^n (x_i^T β − y_i)² + λ_n ‖β‖_1.

With a proper λ_n, a good bound can be derived for ‖β̂_n − β∗‖_2.
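
As an illustration of choosing λ_n in practice, a small simulated example using scikit-learn's cross-validated lasso (assumed available); the dimensions and noise level are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, d, s = 100, 1000, 10                      # n << d, true support of size s
beta_true = np.zeros(d)
beta_true[:s] = rng.standard_normal(s)
X = rng.standard_normal((n, d))
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Cross-validated choice of the penalty (sklearn's alpha plays the role of
# lambda_n, up to sklearn's 1/(2n) scaling of the squared loss).
model = LassoCV(cv=5).fit(X, y)
print("estimation error:", np.linalg.norm(model.coef_ - beta_true))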

Sparse coding / Dictionary learning
Unsupervised methods for learning sets of over-complete bases to
represent data efficiently.
Belief: natural images are generated by linear combinations of bases.

Figure: picture from Andrew Ng.


Sparse coding / Dictionary learning

The set of over-complete bases: Φ = [φ_1, φ_2, · · · , φ_m].

Assume each data point x_i ∈ R^d can be expressed as

x_i = Σ_{j=1}^m a_i^{(j)} · φ_j + ν_i = Φ a_i + ν_i

ν_i ∈ R^d is additive noise; card(a_i) is small.




Sparse coding: with a := [a_1, ..., a_n],

minimize_{a, Φ}  Σ_{i=1}^n ( ‖x_i − Φ a_i‖² + λ‖a_i‖_1 )

Fix a, convex in Φ; Fix Φ, convex in a.
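
A rough sketch of the alternating scheme this biconvexity suggests, assuming numpy and scikit-learn are available; the initialization and update rules below are one common choice, used here only for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

def dictionary_learning(X, m, lam=0.1, iters=20, seed=0):
    # X: (d, n) data matrix, m: number of atoms (over-complete if m > d).
    rng = np.random.default_rng(seed)
    d, n = X.shape
    Phi = rng.standard_normal((d, m))
    Phi /= np.linalg.norm(Phi, axis=0)                 # unit-norm atoms
    A = np.zeros((m, n))
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    for _ in range(iters):
        # Sparse coding step: Phi fixed, each column a_i solves a lasso problem.
        for i in range(n):
            A[:, i] = lasso.fit(Phi, X[:, i]).coef_
        # Dictionary step: A fixed, Phi solves a least-squares problem.
        Phi = X @ A.T @ np.linalg.pinv(A @ A.T)
        Phi /= np.maximum(np.linalg.norm(Phi, axis=0), 1e-12)
    return Phi, A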

Exact solution to convex-cardinality problems

Consider the case:

minimize_{x∈R^n}  f(x) + λ · card(x)
s.t.              x ∈ C

If supp(x) is given, the problem becomes a solvable convex problem.

Exact solution: enumerate all possible supp(x) and solve the resulting problems (see the sketch below). In the worst case: solve 2^n problems.

The problem is in general (NP-)hard.
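
For intuition, a brute-force sketch for the least-squares case f(x) = ‖Ax − b‖², feasible only for very small n; the function name and interface are illustrative:

```python
import itertools
import numpy as np

def exact_card_least_squares(A, b, lam):
    # Enumerate all 2^n supports and solve the restricted least-squares problem
    # on each, keeping the best value of ||Ax - b||^2 + lam * card(x).
    n = A.shape[1]
    best_val, best_x = np.inf, np.zeros(n)
    for k in range(n + 1):
        for S in itertools.combinations(range(n), k):
            x = np.zeros(n)
            if S:
                idx = list(S)
                x[idx] = np.linalg.lstsq(A[:, idx], b, rcond=None)[0]
            val = np.linalg.norm(A @ x - b) ** 2 + lam * k
            if val < best_val:
                best_val, best_x = val, x
    return best_x, best_val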

Binary linear programming

Binary LP problems:

minimize_x  c^T x
s.t.        Ax ≤ b, x ∈ {0, 1}^n.

Includes hard problems: traveling salesman, Hamiltonian cycle, etc.


Observation (“=” achieved if and only if x ∈ {0, 1}^n):

card(x) + card(1 − x) = Σ_{i=1}^n ( card(x_i) + card(1 − x_i) ) ≥ n.

Reduction:
minimize_x  c^T x
s.t.        Ax ≤ b,
            card(x) + card(1 − x) = n.

ℓ1-norm heuristic

Step 1. Replace card(x) with λ‖x‖_1. Solve the problem to get x̂∗.

Tune λ to get the desired sparsity. The larger λ is, the sparser the solution is.
Or more sophisticated versions: Σ_i w_i |x_i|.

Step 2. Polishing: fix the sparsity pattern supp(x̂∗) and re-solve the problem without regularization to get x∗.

Step 3. In statistics and machine learning, cross-validation is needed to ensure the selection of λ is reasonable.
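
A minimal sketch of Steps 1–2 for the least-squares case, assuming cvxpy and numpy; the tolerance used to read off the support is an illustrative choice:

```python
import cvxpy as cp
import numpy as np

def l1_heuristic_with_polishing(A, b, lam, tol=1e-6):
    n = A.shape[1]
    # Step 1: l1-regularized problem to select a sparsity pattern.
    x = cp.Variable(n)
    cp.Problem(cp.Minimize(cp.sum_squares(A @ x - b) + lam * cp.norm1(x))).solve()
    support = np.flatnonzero(np.abs(x.value) > tol)
    # Step 2: polishing -- re-solve without regularization on the fixed support.
    x_polished = np.zeros(n)
    if support.size > 0:
        x_polished[support] = np.linalg.lstsq(A[:, support], b, rcond=None)[0]
    return x_polished, support
```

Step 3 (cross-validation over λ) would wrap this routine in a loop over candidate values of λ.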

The bias in the solution

Solve  minimize_x f(x)  s.t. card(x) ≤ k  by

x̂_λ = arg min_x  f(x) + λ‖x‖_1

KKT condition: 0 ∈ ∇f(x̂_λ) + λ∂(‖x̂_λ‖_1); element-wise:

∇_i f(x̂_λ) = −λ,        if (x̂_λ)_i > 0
∇_i f(x̂_λ) = λ,         if (x̂_λ)_i < 0
∇_i f(x̂_λ) ∈ [−λ, λ],   if (x̂_λ)_i = 0

Even if we find the correct sparsity pattern, there is a nonzero bias. Namely, defining B = supp(x̂_λ),

‖∇_B f(x̂_λ)‖_2² = |B| · λ² ≠ 0
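
A quick numerical check of this stationarity condition, assuming cvxpy and numpy; the random data and sizes are illustrative:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
lam = 1.0

# f(x) = 0.5 * ||Ax - b||^2; solve the l1-regularized problem.
x = cp.Variable(20)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(A @ x - b) + lam * cp.norm1(x))).solve()

grad = A.T @ (A @ x.value - b)       # gradient of f at the solution
B = np.abs(x.value) > 1e-6           # estimated support
print(grad[B])                       # entries are approximately +/- lambda, not 0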

Solving the ℓ1-regularized problem
Suppose f is convex and smooth; consider

minimize_x  f(x) + λ · ‖x‖_1,  s.t. x ∈ C

ISTA / FISTA method. (Next page)


Subgradient method

x_{k+1} = x_k − η_k · (∇f(x_k) + λ v_k),  v_k ∈ ∂(‖x_k‖_1)

ADMM formulation, for simple f such as ‖Ax − b‖²

minimize_{x, y}  f(x) + λ · ‖y‖_1,  s.t. x = y, x ∈ C

Min-max formulation (‖x‖_1 = max_y {y^T x : ‖y‖_∞ ≤ 1})

minimize_{x∈C}  maximize_{‖y‖_∞ ≤1}  f(x) + λ y^T x

...
Iterative shrinkage-thresholding algorithm (ISTA)

Define x̃_{k+1} = x_k − η · ∇f(x_k). ISTA:

x_{k+1} = arg min_x  (1/(2η)) ‖x − x̃_{k+1}‖² + λ‖x‖_1

Equivalent formulation:

x_{k+1} = arg min_x  f(x_k) + ⟨∇f(x_k), x − x_k⟩ + λ‖x‖_1 + (1/(2η)) ‖x − x_k‖²
Separable: each entry solves arg min_z (z − a)² + b|z|, with b > 0.
Explicit update:

(x_{k+1})_i = (x̃_{k+1})_i − ηλ,   if (x̃_{k+1})_i ∈ (ηλ, ∞)
(x_{k+1})_i = 0,                  if (x̃_{k+1})_i ∈ [−ηλ, ηλ]
(x_{k+1})_i = (x̃_{k+1})_i + ηλ,   if (x̃_{k+1})_i ∈ (−∞, −ηλ)

Applying Nesterov’s acceleration gives fast ISTA (FISTA).
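
A compact sketch of ISTA and FISTA for f(x) = ½‖Ax − b‖² (unconstrained, C = R^n), assuming numpy; the step size 1/L with L = ‖A‖_2² is one standard choice:

```python
import numpy as np

def soft_threshold(v, t):
    # Elementwise soft-thresholding: the explicit update above with threshold t.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, iters=500):
    # ISTA for min 0.5*||Ax - b||^2 + lam*||x||_1.
    eta = 1.0 / np.linalg.norm(A, 2) ** 2   # step size 1/L, L = ||A||_2^2
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft_threshold(x - eta * A.T @ (A @ x - b), eta * lam)
    return x

def fista(A, b, lam, iters=500):
    # FISTA: ISTA plus Nesterov momentum on an auxiliary point z.
    eta = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    z, t = x.copy(), 1.0
    for _ in range(iters):
        x_new = soft_threshold(z - eta * A.T @ (A @ z - b), eta * lam)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x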


Interpretation as convex relaxation

For R large enough, consider

minimize_x  f(x) + γ · card(x)
s.t.        x ∈ C, ‖x‖_∞ ≤ R

Equivalent mixed-integer programming formulation:

minimize_{x,z}  f(x) + γ · 1^T z
s.t.            |x_i| ≤ R z_i, for i = 1, ..., n
                x ∈ C, z ∈ {0, 1}^n

Interpretation as convex relaxation

Relax z ∈ {0, 1}^n to z ∈ [0, 1]^n:

minimize_{x,z}  f(x) + γ · 1^T z
s.t.            |x_i| ≤ R z_i, for i = 1, ..., n
                x ∈ C, z ∈ [0, 1]^n

Fix x; minimizing over z gives z_i = |x_i|/R:

minimize_x  f(x) + (γ/R) · ‖x‖_1
s.t.        x ∈ C

Its optimal value lower bounds that of the original problem.

Sparse design

Find a design vector x with smallest card(x) such that some set of specifications is satisfied.

Zero entries of x simplify design (components that are not needed).

Problem formulation:

minimize card(x)
s.t.     x ∈ C

Examples:
antenna array beamforming (zero coefficients correspond to unneeded antenna elements)
truss design (zero coefficients correspond to bars that are not needed)
...

Truss design

Figure: Such a truss contains a huge number of steel bars; potentially many of them are not necessary.

Sparse modeling / regressor selection

Fitting data b as a linear combination of k regressors, chosen from n candidate regressors.

minimize_{x∈R^n}  ‖Ax − b‖_2
s.t.              card(x) ≤ k

Reformulations:

minimize_{x∈R^n}  ‖Ax − b‖_2 + λ · card(x)

minimize_{x∈R^n}  ‖Ax − b‖_2² + λ · card(x)

minimize_{x∈R^n}  card(x)   s.t. ‖Ax − b‖_2 ≤ ε

Lasso, feature selection, etc.

Generalized linear models: logistic regression, Poisson regression, etc.


Sparse signal reconstruction (compressed sensing)

For an unknown signal x ∈ R^d:


Noisy measurement y = Ax + v, v ∼ N(0, σ²I). A ∈ R^{n×d} is a known fat matrix, n ≪ d.
Prior knowledge: x is sparse, card(x) ≤ k.

Maximum likelihood estimator (MLE):

minimize_{x∈R^d}  ‖Ax − y‖_2²
s.t.              card(x) ≤ k

Exact recovery is expected under certain conditions (for the ℓ1 heuristic).
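
A small synthetic illustration of the ℓ1 heuristic for recovery, assuming cvxpy and numpy; the dimensions, noise level, and regularization weight are arbitrary illustrative choices:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
d, n, k = 200, 60, 5                          # ambient dim, measurements, sparsity
x_true = np.zeros(d)
x_true[rng.choice(d, k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((n, d)) / np.sqrt(n)  # random fat measurement matrix
y = A @ x_true + 0.01 * rng.standard_normal(n)

# l1 heuristic in place of the card-constrained MLE.
x = cp.Variable(d)
cp.Problem(cp.Minimize(cp.sum_squares(A @ x - y) + 0.1 * cp.norm1(x))).solve()
print("recovery error:", np.linalg.norm(x.value - x_true))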

Estimation with outliers

Data points {(x_i, y_i)}_{i=1}^n:  y_i = x_i^T β∗ + ε_i + w_i∗.

ε_i ∼ N(0, σ²) are i.i.d. noise.

w∗ = [w_1∗, ..., w_n∗]^T is sparse noise: card(w∗) ≤ k.

supp(w∗) corresponds to the set of outliers.

The MLE for β:

minimize_{β∈R^d, w∈R^n}  ‖y − Xβ − w‖_2²
s.t.                     card(w) ≤ k
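
A hedged cvxpy sketch of the usual convex surrogate, replacing card(w) ≤ k with an ℓ1 penalty on w; the function name and λ are illustrative:

```python
import cvxpy as cp

def regression_with_outliers(X, y, lam):
    # l1 surrogate of the sparse-outlier model: penalize ||w||_1 instead of card(w).
    n, d = X.shape
    beta, w = cp.Variable(d), cp.Variable(n)
    objective = cp.sum_squares(y - X @ beta - w) + lam * cp.norm1(w)
    cp.Problem(cp.Minimize(objective)).solve()
    return beta.value, w.value  # data points with large |w_i| flag likely outliers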

Robust principal component analysis∗

Data matrix M ∈ R^{n×n}. M has an approximate low-rank decomposition: M = UV^T + W, with U, V ∈ R^{n×k}, k ≪ n, and the noise matrix W is sparse.

Classical PCA problem:

minimize_{L∈R^{n×n}}  ‖M − L‖
s.t.                  rank(L) ≤ k

Robust PCA problem:

minimize_{L,W∈R^{n×n}}  ‖M − L − W‖
s.t.                    rank(L) ≤ k, card(W) ≤ k′.

Robust principal component analysis∗

Robust PCA problem:

minimize_{L,W∈R^{n×n}}  ‖M − L − W‖
s.t.                    rank(L) ≤ k, card(W) ≤ k′.

Matrix L is low-rank ⇐⇒ card(σ(L)) is small, where σ(L) is the vector of n singular values of L.

Nuclear norm: ‖L‖∗ := ‖σ(L)‖_1.

Practical formulation of robust PCA:

minimize_{L,W∈R^{n×n}}  ‖L‖∗ + λ‖W‖_1
s.t.                    L + W = M.
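
A minimal cvxpy sketch of this practical formulation, assuming cvxpy is available and M is small enough for a generic conic solver; ‖W‖_1 is implemented here as the elementwise sum of absolute values:

```python
import cvxpy as cp

def robust_pca(M, lam):
    # Nuclear norm for low rank, elementwise l1 for sparsity, exact split L + W = M.
    n1, n2 = M.shape
    L = cp.Variable((n1, n2))
    W = cp.Variable((n1, n2))
    objective = cp.normNuc(L) + lam * cp.sum(cp.abs(W))
    cp.Problem(cp.Minimize(objective), [L + W == M]).solve()
    return L.value, W.value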
