
Advanced Machine Learning

(COMP 5328)
Sparse Coding and Regularisation

Tongliang Liu
Quiz results



Review

Dictionary learning
What is a dictionary in machine learning?
Let $x \in \mathbb{R}^d$ and $D \in \mathbb{R}^{d \times k}$,

$$\alpha^* = \arg\min_{\alpha \in \mathbb{R}^k} \|x - D\alpha\|^2.$$

Note that $\|x\| = \sqrt{x^\top x}$ is the $\ell_2$ norm.

Given $x_1, \ldots, x_n \in \mathbb{R}^d$,

$$\{D^*, \alpha_1^*, \ldots, \alpha_n^*\} = \arg\min_{D \in \mathbb{R}^{d \times k},\, \alpha_1, \ldots, \alpha_n \in \mathbb{R}^k} \frac{1}{n} \sum_{i=1}^n \|x_i - D\alpha_i\|^2.$$


Dictionary learning

Note that

$$\frac{1}{n} \sum_{i=1}^n \|x_i - D\alpha_i\|^2 = \frac{1}{n} \|X - DR\|_F^2,$$

where $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$ and $R = [\alpha_1, \alpha_2, \ldots, \alpha_n] \in \mathbb{R}^{k \times n}$.

$$\|X\|_F = \sqrt{\mathrm{trace}(X^\top X)} = \sqrt{\sum_{i=1}^d \sum_{j=1}^n X_{i,j}^2}$$

is the Frobenius norm of $X$.
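A quick numerical check of this identity in NumPy (a minimal sketch; the matrices here are random, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 8, 5, 100
X = rng.standard_normal((d, n))   # data matrix, columns are samples x_i
D = rng.standard_normal((d, k))   # dictionary
R = rng.standard_normal((k, n))   # codes, columns are alpha_i

# Left-hand side: average of per-sample squared ell-2 errors.
lhs = np.mean([np.sum((X[:, i] - D @ R[:, i]) ** 2) for i in range(n)])

# Right-hand side: squared Frobenius norm of the residual matrix, divided by n.
rhs = np.linalg.norm(X - D @ R, "fro") ** 2 / n

print(np.isclose(lhs, rhs))  # True
```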


Dictionary learning

Note that

$$\arg\min_{D \in \mathcal{D},\, R \in \mathcal{R}} \|X - DR\|_F^2,$$

where $\mathcal{D}$ and $\mathcal{R}$ are some specific domains for $D$ and $R$.
Optimisation
Objective:

$$\min_{D \in \mathcal{D},\, R \in \mathcal{R}} \|X - DR\|_F^2$$

The objective is convex with respect to either $R$ or $D$, but not with respect to both jointly.

Fix $R$, solve for $D$:

$$\min_{D \in \mathcal{D}} \|X - DR\|_F^2$$

Fix $D$, solve for $R$:

$$\min_{R \in \mathcal{R}} \|X - DR\|_F^2$$
Engan, Kjersti, Sven Ole Aase, and J. Hakon Husoy. "Method of optimal directions for frame design." Proceedings of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 5. IEEE, 1999.
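The alternating scheme can be sketched in a few lines of NumPy. This is a minimal illustration rather than the full method of optimal directions: both updates are unconstrained least squares (a real sparse coder would constrain R; see the OMP sketch later):

```python
import numpy as np

def alternating_dictionary_learning(X, k, n_iters=50, eps=1e-8):
    """Minimise ||X - D R||_F^2 by alternating least squares.

    A minimal sketch: neither update is constrained, so this only
    illustrates the alternating structure of the optimisation.
    """
    d, n = X.shape
    rng = np.random.default_rng(0)
    D = rng.standard_normal((d, k))
    for _ in range(n_iters):
        # Fix D, solve for R: least squares for each column of R.
        R = np.linalg.lstsq(D, X, rcond=None)[0]
        # Fix R, solve for D: D = X R^T (R R^T)^{-1} (MOD-style update).
        D = X @ R.T @ np.linalg.pinv(R @ R.T + eps * np.eye(k))
        # Normalise dictionary atoms to unit ell-2 norm (common convention).
        D /= np.linalg.norm(D, axis=0, keepdims=True) + eps
    R = np.linalg.lstsq(D, X, rcond=None)[0]  # final codes for the normalised D
    return D, R
```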


Dictionary learning application

Dimension reduction.

[Figure: the data matrix X is factorised into a dictionary and lower-dimensional codes.]


Sparse Coding
Why sparse?
Many signals are sparse in some transform domain.

Image credit: https://datacommandnet.blogspot.com/p/periodic-analog-signals.html



Why sparse?
Many signals are sparse in some transform domain.

Image credit: http://www.princeton.edu/~yc5/ele538b_sparsity/lectures/sparse_representation.pdf



Why sparse?
Many signals are sparse in some transform domain.

[Figures: a musical instrument spectrum and a speech spectrogram.]
Why sparse?
Many signals are sparse in some transform domain.

[Figure: a natural image.]
Image credit: Zaouali, Bouzidi, and Zagrouba: Review of multiscale geometric decompositions in a remote sensing context



Dictionary learning

Note that

$$\arg\min_{D \in \mathcal{D},\, R \in \mathcal{R}} \|X - DR\|_F^2,$$

where $\mathcal{D}$ and $\mathcal{R}$ are some specific domains for $D$ and $R$.
Sparse coding

Note that

$$\arg\min_{D \in \mathcal{D},\, R \in \mathcal{R}} \|X - DR\|_F^2$$

[Figure: in sparse coding, the dictionary $D$ is overcomplete and the code matrix $R$ is sparse.]
$\ell_p$ norm

$$\|\alpha\|_p = \left( \sum_{j=1}^k |\alpha_j|^p \right)^{1/p}, \quad \text{where } \alpha \in \mathbb{R}^k.$$

In other words, $\|\alpha\|_p^p = \sum_{j=1}^k |\alpha_j|^p$.
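As a quick numerical check (a minimal sketch; the example vector is arbitrary):

```python
import numpy as np

alpha = np.array([3.0, 0.0, -4.0, 0.0])

for p in (1, 2):
    print(p, np.sum(np.abs(alpha) ** p) ** (1 / p))  # ell_1 = 7.0, ell_2 = 5.0

print(np.count_nonzero(alpha))  # the ell_0 "norm" counts non-zeros: 2
```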


K-means
K-means clustering:

$$\min_{D \in \mathcal{D},\, R \in \mathcal{R}} \|X - DR\|_F^2$$

[Figure: data points grouped around three cluster centres, the dictionary atoms D:1, D:2, D:3.]

Special requirement: each column of $R$ is a one-hot vector, i.e., $\|R_i\|_0 = 1$ and $\|R_i\|_1 = 1$.
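Viewed this way, Lloyd's algorithm is exactly alternating minimisation of the objective above under the one-hot constraint. A minimal NumPy sketch (random initialisation from the data; columns of X are samples, as before):

```python
import numpy as np

def kmeans_dictionary(X, k, n_iters=20):
    """K-means as dictionary learning: columns of R are one-hot."""
    d, n = X.shape
    rng = np.random.default_rng(0)
    D = X[:, rng.choice(n, size=k, replace=False)]  # atoms = initial centres
    for _ in range(n_iters):
        # Fix D, solve for R: nearest atom for each sample (one-hot code).
        dists = ((X[:, None, :] - D[:, :, None]) ** 2).sum(axis=0)  # (k, n)
        labels = dists.argmin(axis=0)
        R = np.zeros((k, n))
        R[labels, np.arange(n)] = 1.0
        # Fix R, solve for D: each atom is the mean of its assigned samples.
        for j in range(k):
            if R[j].sum() > 0:
                D[:, j] = X[:, R[j] > 0].mean(axis=1)
    return D, R
```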


K-SVD

K-SVD:

$$\min_{D \in \mathcal{D},\, R \in \mathcal{R}} \|X - DR\|_F^2$$

Special requirement: each column of $R$ is a sparse vector, i.e., $\|R_i\|_0 \le k' \ll k$.
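The distinctive part of K-SVD is its dictionary update: each atom and the nonzero entries of its code row are refitted together by a rank-one SVD of the residual, restricted to the samples that use that atom (so the sparsity pattern is preserved). A minimal sketch of that single-atom step (the full algorithm alternates a sparse-coding stage, e.g. the OMP sketched later, with this update over all atoms):

```python
import numpy as np

def ksvd_atom_update(X, D, R, j):
    """Update atom D[:, j] and row R[j, :] by a rank-one SVD (K-SVD step)."""
    omega = np.flatnonzero(R[j, :])       # samples that currently use atom j
    if omega.size == 0:
        return D, R
    # Residual of those samples with atom j's contribution removed.
    E = X[:, omega] - D @ R[:, omega] + np.outer(D[:, j], R[j, omega])
    # Best rank-one fit: atom = leading left singular vector.
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, j] = U[:, 0]
    R[j, omega] = s[0] * Vt[0, :]
    return D, R
```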


Sparse coding applications
Image Compression: Results for 820 bytes per image

Image credit: Compression of facial images using the K-SVD algorithm, J. Vis. Commun. Image Represent.


Data pre-processing for K-SVD

Facial feature points.

Image credit: Low Bit-Rate Compression of Facial Images, IEEE Transactions on Image Processing


Data pre-processing for K-SVD

(Top) Input images and (bottom) their canonically aligned versions.

Image credit: Low Bit-Rate Compression of Facial Images, IEEE Transactions on Image Processing


Data pre-processing for K-SVD

Uniform slicing
Image credit: Compression of facial images using the K-SVD algorithm, J. Vis. Commun. Image Represent.


Sparse coding applications
Inpainting:
70% missing samples: DCT (RMSE = 0.04), Haar (RMSE = 0.045), K-SVD (RMSE = 0.03)

90% missing samples: DCT (RMSE = 0.085), Haar (RMSE = 0.07), K-SVD (RMSE = 0.06)

Image credit: http://www.cs.technion.ac.il/~ronrubin/Talks/K-SVD.ppt

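Inpainting with a learned dictionary reduces to sparse coding with a mask: only the observed pixels constrain the code, and the full atoms then fill in the gaps. A minimal sketch of the per-patch step (a plain masked least-squares fit on a fixed support, for brevity; in practice the support is chosen by a sparse coder such as OMP):

```python
import numpy as np

def inpaint_patch(x_obs, mask, D, support):
    """Fill missing entries of a patch from its observed entries.

    x_obs:   patch vector, with arbitrary values at missing positions
    mask:    boolean array, True where the pixel is observed
    D:       dictionary (patch_dim x k)
    support: indices of the atoms assumed active (fixed here for brevity)
    """
    # Fit the code using the observed rows of the dictionary only.
    alpha, *_ = np.linalg.lstsq(D[mask][:, support], x_obs[mask], rcond=None)
    # Reconstruct the full patch; copy the observed pixels back in.
    x_hat = D[:, support] @ alpha
    x_hat[mask] = x_obs[mask]
    return x_hat
```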


Sparse coding applications
Inpainting:
[Figures: inpainting results at 20%, 50%, and 80% missing samples.]

Image credit: http://www.cs.technion.ac.il/~ronrubin/Talks/K-SVD.ppt



Sparse coding applications
Inpainting for text removal:

Image credit: Sparse representation for color image restoration. IEEE Transactions on Image Processing.



Measure of Sparsity
The objective function:

$$\min_{D \in \mathcal{D},\, R \in \mathcal{R}} \underbrace{\|X - DR\|_F^2}_{\text{data fitting}} + \underbrace{\lambda\, \Omega(R)}_{\text{sparse regularisation}}$$

Question: how can we design the regularisation $\Omega$ to make $R$ sparse?
Measure of Sparsity: $\ell_0$ norm

$$\|\alpha\|_p^p = \sum_{j=1}^k |\alpha_j|^p.$$

As $p \to 0$, $\|\alpha\|_p^p$ tends to a count of the non-zeros in the vector.

So we can employ $\|\alpha\|_0$ to measure sparsity.

However, $\ell_0$ minimisation is not easy: the problem is combinatorial (NP-hard in general). What can we do instead?
Measure of Sparsity: $\ell_1$ norm

2D example (compared with the $\ell_2$ norm):

$$\min_{\alpha} \frac{1}{2} \|x - \alpha\|_2^2 \ \text{ s.t. } \ \|\alpha\|_1 \le \mu \qquad \text{vs.} \qquad \min_{\alpha} \frac{1}{2} \|x - \alpha\|_2^2 \ \text{ s.t. } \ \|\alpha\|_2 \le \mu$$

[Figure: in the $(\alpha_1, \alpha_2)$ plane, the $\ell_1$ ball of radius $\mu$ is a diamond whose corners lie on the axes, so the constrained solution tends to land on an axis (sparse); the $\ell_2$ ball of radius $\mu$ is a circle and does not favour sparse solutions.]


Sparse coding learning algorithms

The $\ell_0$ norm based approaches:

• $\min_{D, R} \|X - DR\|_F^2 \ \text{ s.t. } \ \forall i,\ \|\alpha_i\|_0 \le k'$.

• $\min_{\alpha} \|\alpha\|_0 \ \text{ s.t. } \ \|x - D\alpha\|_2 \le \epsilon$.

Greedy algorithms (see the OMP sketch below):
• Orthogonal matching pursuit (OMP) (Y. Pati, et al. 1993; J. Tropp 2004).
• Subspace pursuit (SP) (W. Dai and O. Milenkovic 2009), CoSaMP (D. Needell and J. Tropp 2009).
• Iterative hard thresholding (IHT) (T. Blumensath and M. Davies 2009).
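A minimal NumPy sketch of OMP for a single signal (assuming unit-norm dictionary atoms, and stopping after k_prime atoms have been selected):

```python
import numpy as np

def omp(x, D, k_prime):
    """Orthogonal matching pursuit: greedy ell_0-constrained sparse coding."""
    k = D.shape[1]
    support = []
    residual = x.copy()
    for _ in range(k_prime):
        # Pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # Refit the coefficients on the current support by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    alpha = np.zeros(k)
    alpha[support] = coef
    return alpha
```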
Sparse coding learning algorithms

The $\ell_1$ norm based approaches (see the ISTA sketch below):

• $\min_{\alpha} \|\alpha\|_1 \ \text{ s.t. } \ \|x - D\alpha\|_2 \le \epsilon$.

• $\min_{\alpha} \|x - D\alpha\|_2^2 + \lambda \|\alpha\|_1$.

Bayesian approach:
• Relevance vector machine (RVM) (M. Tipping 2001).
• Bayesian compressed sensing (BCS) (S. Ji, et al. 2008).
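The second (Lasso-style) problem can be solved by proximal gradient descent, commonly known as ISTA: a gradient step on the data-fitting term followed by soft thresholding, the proximal operator of the $\ell_1$ norm. A minimal sketch (the step size is set from the spectral norm of D; names are illustrative):

```python
import numpy as np

def ista(x, D, lam, n_iters=200):
    """ISTA for min_alpha ||x - D alpha||_2^2 + lam * ||alpha||_1."""
    alpha = np.zeros(D.shape[1])
    eta = 1.0 / (2 * np.linalg.norm(D, 2) ** 2)  # step size from the Lipschitz constant
    for _ in range(n_iters):
        grad = 2 * D.T @ (D @ alpha - x)          # gradient of the quadratic term
        z = alpha - eta * grad
        alpha = np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)  # soft threshold
    return alpha
```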


Regularisation and
algorithmic stability
No-Free-Lunch Theorem

• Sparse algorithms are not stable!

• A learning algorithm is said to be stable if slight


perturbations in the training data result in
small changes in the output of the algorithm,
and these changes vanish as the data set grows
bigger and bigger.

Xu, H., Caramanis, C., & Mannor, S. (2012). Sparse algorithms are not stable: A no-free-lunch theorem. IEEE transactions on pattern analysis
and machine intelligence, 34(1), 187-193.



Algorithmic Stability
We have two different training samples:

$$S = \{(X_1, Y_1), \ldots, (X_{i-1}, Y_{i-1}), (X_i, Y_i), (X_{i+1}, Y_{i+1}), \ldots, (X_n, Y_n)\}$$

$$S^i = \{(X_1, Y_1), \ldots, (X_{i-1}, Y_{i-1}), (X'_i, Y'_i), (X_{i+1}, Y_{i+1}), \ldots, (X_n, Y_n)\}$$

They differ in only one training example.

An algorithm is uniformly stable if for any example $(X, Y)$,

$$|\ell(X, Y, h_S) - \ell(X, Y, h_{S^i})| \le \epsilon(n).$$

Note that $\epsilon(n)$ will vanish as $n$ goes to infinity.
Generalisation error

$$\begin{aligned}
R(h_S) - \min_{h \in H} R(h) &= R(h_S) - R(h^*) \\
&= R(h_S) - R_S(h_S) + R_S(h_S) - R_S(h^*) + R_S(h^*) - R(h^*) \\
&\le R(h_S) - R_S(h_S) + R_S(h^*) - R(h^*) \\
&\le |R(h_S) - R_S(h_S)| + |R(h^*) - R_S(h^*)| \\
&\le \sup_{h \in H} |R(h) - R_S(h)| + \sup_{h \in H} |R(h) - R_S(h)| \\
&= 2 \sup_{h \in H} |R(h) - R_S(h)|,
\end{aligned}$$

where the first inequality uses $R_S(h_S) - R_S(h^*) \le 0$, which holds when $h_S$ minimises the empirical risk. In particular,

$$R(h_S) - R_S(h_S) \le \sup_{h \in H} |R(h) - R_S(h)|.$$


Algorithmic Stability

A good stable algorithm will have a good generalisation ability:

$$\begin{aligned}
\mathbb{E}[R(h_S) - R_S(h_S)]
&= \mathbb{E}_S \left[ \mathbb{E}_{X,Y}[\ell(X, Y, h_S)] - \frac{1}{n} \sum_{i=1}^n \ell(X_i, Y_i, h_S) \right] \\
&= \mathbb{E}_S \left[ \mathbb{E}_{S'} \left[ \frac{1}{n} \sum_{i=1}^n \ell(X'_i, Y'_i, h_S) \right] - \frac{1}{n} \sum_{i=1}^n \ell(X_i, Y_i, h_S) \right] \\
&= \mathbb{E}_{S,S'} \left[ \frac{1}{n} \sum_{i=1}^n \left( \ell(X'_i, Y'_i, h_S) - \ell(X_i, Y_i, h_S) \right) \right] \\
&= \mathbb{E}_{S,S'} \left[ \frac{1}{n} \sum_{i=1}^n \left( \ell(X'_i, Y'_i, h_S) - \ell(X'_i, Y'_i, h_{S^i}) \right) \right] \\
&\le \epsilon'(n),
\end{aligned}$$

where $S' = \{(X'_1, Y'_1), \ldots, (X'_n, Y'_n)\}$ is an independent copy of $S$. The last equality holds because swapping $(X_i, Y_i)$ with $(X'_i, Y'_i)$ turns $h_S$ into $h_{S^i}$ without changing the expectation. $\epsilon'(n)$ would be small if the learning algorithm is stable (because $h_S$ and $h_{S^i}$ are similar). This implies that stable algorithms will have small expected generalisation errors.
Regularisation and Stability

$\ell_2$ norm regularisation will make learning algorithms stable if the employed surrogate loss function is convex:

$$h_S = \arg\min_{h \in H} \frac{1}{n} \sum_{i=1}^n \ell(X_i, Y_i, h) + \lambda \|h\|_2^2.$$

If the convex surrogate loss function is $L$-Lipschitz continuous w.r.t. $h$ and $\|X\|_2 \le B$, we have

$$|\ell(X, Y, h_S) - \ell(X, Y, h_{S^i})| \le \frac{2L^2 B^2}{\lambda n}.$$
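A small empirical illustration of this bound (a hedged sketch, not part of the original slides: ridge-regularised least squares stands in for the abstract $h_S$; the squared loss is only locally Lipschitz, so this is indicative only). We train on $S$ and on $S^i$ with one example replaced; the loss difference at a fixed test point should shrink roughly like $1/(\lambda n)$:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.1

def ridge(X, y, lam):
    # h_S = argmin (1/n) * sum_i (y_i - <h, x_i>)^2 + lam * ||h||^2
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

x_test, y_test = rng.standard_normal(5), 0.0
for n in (100, 1000, 10000):
    X, y = rng.standard_normal((n, 5)), rng.standard_normal(n)
    Xi, yi = X.copy(), y.copy()
    Xi[0], yi[0] = rng.standard_normal(5), rng.standard_normal()  # replace one example
    h_S, h_Si = ridge(X, y, lam), ridge(Xi, yi, lam)
    diff = abs((y_test - x_test @ h_S) ** 2 - (y_test - x_test @ h_Si) ** 2)
    print(n, diff)  # shrinks as n grows, consistent with O(1 / (lam * n))
```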


Proof (optional)

A surrogate loss function is $L$-Lipschitz continuous w.r.t. $h$ if for any $(X, Y)$ in the domain and any $h, h'$,

$$|\ell(X, Y, h) - \ell(X, Y, h')| \le L |h(X) - h'(X)|.$$


Proof (optional)

A function $f$ is $\mu$-strongly convex if

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|x - y\|^2, \quad \forall x, y,$$

or equivalently (for twice-differentiable $f$)

$$\mu I \preceq \nabla^2 f(x), \quad \forall x.$$

[Figure: at every point $(x, f(x))$, the graph of $f$ lies above the tangent $f(x) + \langle \nabla f(x), y - x \rangle$, and even above the tangent plus the quadratic $\frac{\mu}{2} \|x - y\|^2$.]
Proof (optional)

Note that the function $\ell(X, Y, h) + \lambda \|h\|^2$ is always $\lambda$-strongly convex w.r.t. $h$.

Let

$$R_{S,\lambda}(h) = \frac{1}{n} \sum_{i=1}^n \ell(X_i, Y_i, h) + \lambda \|h\|^2.$$

Since $h_S$ minimises $R_{S,\lambda}$ (so $\nabla R_{S,\lambda}(h_S) = 0$) and $h_{S^i}$ minimises $R_{S^i,\lambda}$, we have

$$R_{S,\lambda}(h_{S^i}) \ge R_{S,\lambda}(h_S) + \langle \nabla R_{S,\lambda}(h_S), h_{S^i} - h_S \rangle + \frac{\lambda}{2} \|h_{S^i} - h_S\|^2 = R_{S,\lambda}(h_S) + \frac{\lambda}{2} \|h_{S^i} - h_S\|^2,$$

$$R_{S^i,\lambda}(h_S) \ge R_{S^i,\lambda}(h_{S^i}) + \langle \nabla R_{S^i,\lambda}(h_{S^i}), h_S - h_{S^i} \rangle + \frac{\lambda}{2} \|h_S - h_{S^i}\|^2 = R_{S^i,\lambda}(h_{S^i}) + \frac{\lambda}{2} \|h_S - h_{S^i}\|^2.$$
Proof (optional)

Summing the two inequalities, we have

$$\begin{aligned}
\lambda \|h_{S^i} - h_S\|^2 &\le R_{S^i,\lambda}(h_S) - R_{S,\lambda}(h_S) + R_{S,\lambda}(h_{S^i}) - R_{S^i,\lambda}(h_{S^i}) \\
&\le \frac{1}{n} |\ell(X'_i, Y'_i, h_S) - \ell(X'_i, Y'_i, h_{S^i})| + \frac{1}{n} |\ell(X_i, Y_i, h_S) - \ell(X_i, Y_i, h_{S^i})| \\
&\le \frac{L}{n} |h_S(X'_i) - h_{S^i}(X'_i)| + \frac{L}{n} |h_S(X_i) - h_{S^i}(X_i)| \\
&= \frac{L}{n} |\langle h_S - h_{S^i}, X'_i \rangle| + \frac{L}{n} |\langle h_S - h_{S^i}, X_i \rangle| \qquad \text{(assume } h(X) = \langle h, X \rangle\text{)} \\
&\le \frac{L}{n} \|h_S - h_{S^i}\| \|X'_i\| + \frac{L}{n} \|h_S - h_{S^i}\| \|X_i\| \\
&\le \frac{2LB}{n} \|h_S - h_{S^i}\|,
\end{aligned}$$

where the second-to-last inequality holds because of the Cauchy-Schwarz inequality: $\langle a, b \rangle \le \|a\| \|b\|$.
Proof (optional)

Thus,

$$\|h_S - h_{S^i}\| \le \frac{2LB}{\lambda n}.$$

We then have

$$|\ell(X, Y, h_S) - \ell(X, Y, h_{S^i})| \le L |h_S(X) - h_{S^i}(X)| \le L \|h_S - h_{S^i}\| \|X\| \le \frac{2L^2 B^2}{\lambda n}.$$
