Machine Learning: The Theoretical Minimum
Lorenzo Rosasco
- Università di Genova
- Istituto Italiano di Tecnologia
Machine Learning
Intelligent Systems
Data Science
Intro
PART I
Local methods
Bias-Variance and Cross Validation
PART II
Regularization I: Linear Least Squares
Regularization II: Kernel Least Squares
PART III
a) Variable Selection: OMP
b) Dimensionality Reduction: PCA
PART IV
Morning
Afternoon
PART I
Local methods
Bias-Variance and Cross Validation
CHAPTER 1
Machine Learning deals with systems that are trained from data rather than programmed. Here we describe the data model considered in statistical learning theory. The data, called the training set, is a set of n input-output pairs

S = {(x_1, y_1), \dots, (x_n, y_n)},

with inputs in a space X (for example X = \mathbb{R}^D) and outputs in a space Y. Given the training set, the goal is to learn a corresponding input-output relation. To make the task precise we postulate an output space Y, and we consider several possible situations: regression, Y \subseteq \mathbb{R}, where we have in mind the existence of a model for the regression data (outputs equal to an underlying function of the inputs plus noise); binary classification, Y = \{-1, 1\}; and multi-category (multiclass) classification, Y = \{1, 2, \dots, T\}. The choice reflects possible uncertainty in the task and in the data.

It is convenient to collect the data in the n \times D matrix X_n, whose rows are the input points, and in the n \times 1 vector Y_n of the corresponding outputs:

X_n = \begin{pmatrix} x_1^1 & \cdots & x_1^D \\ \vdots & & \vdots \\ x_n^1 & \cdots & x_n^D \end{pmatrix}, \qquad Y_n = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}.
Local Methods
2.1. Nearest Neighbor
Given the training set S = {(x_1, y_1), \dots, (x_n, y_n)} and a new input \bar{x}, let

i' = \arg\min_{i=1,\dots,n} \|\bar{x} - x_i\|^2,

and define the nearest neighbor (NN) estimator as

\hat{f}(\bar{x}) = y_{i'}.

Every new input point is assigned the same output as its nearest input in the training set. We add a few comments. First, while in the above definition we simply considered the Euclidean norm, other measures of similarity can be used, as discussed below.
How does it work?
[Plot: FIGURE 2.2. The same classification example in two dimensions as in Figure 2.1. The classes are coded as a binary variable (BLUE = 0, ORANGE = 1).]
The definition can be promptly generalized to consider other measures of similarity among inputs. For example, if the inputs are binary strings, i.e. X = \{0, 1\}^D, one could consider the Hamming distance

d_H(x, x') = \frac{1}{D} \sum_{j=1}^{D} 1_{[x^j \neq x'^j]},

where x^j is the j-th component of a string x \in X.
K-Nearest Neighbors
Consider d_{\bar{x}} = (\|\bar{x} - x_i\|^2)_{i=1}^{n}, the array of distances of a new point \bar{x} to the training inputs, let I_{\bar{x}} be the indices that sort this array in increasing order, and let K_{\bar{x}} be the array of the first K entries of I_{\bar{x}}. The K-nearest neighbor (KNN) estimator is defined as

\hat{f}(\bar{x}) = \frac{1}{K} \sum_{i' \in K_{\bar{x}}} y_{i'}.
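As a minimal illustration (not part of the original notes; the function name and interface are made up for this sketch), the NN/KNN estimator takes only a few lines of NumPy:

import numpy as np

def knn_predict(X_train, y_train, x_new, K=1):
    # squared Euclidean distances of the new point to all training inputs
    dists = np.sum((X_train - x_new) ** 2, axis=1)
    # indices of the K closest training points (K = 1 gives the NN estimator)
    nearest = np.argsort(dists)[:K]
    # average of their outputs; for Y = {-1, 1} take the sign for classification
    return np.mean(y_train[nearest])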
KNN implements the principle that nearby input points should have similar/same outputs. The question of how to best choose K will be the subject of a future discussion: there is one parameter controlling fit/stability.

Parzen Windows
A related idea is to let the prediction at \bar{x} depend on all training points, weighted by their similarity to \bar{x}. Define the estimator

\hat{f}(\bar{x}) = \frac{\sum_{i=1}^{n} y_i\, k(\bar{x}, x_i)}{\sum_{i=1}^{n} k(\bar{x}, x_i)},

where k : X \times X \to [0, 1] is a suitable function, which can be seen as a similarity measure on the input points. The function k defines a window around each point and is sometimes called a Parzen window. A classification rule is obtained by taking the sign of \hat{f}(\bar{x}).

Many examples of k depend on the distance \|x - x'\|, x, x' \in X. For example, we can consider

k(x', x) = 1_{\|x - x'\| \le r}.

This choice induces a Parzen window analogous to KNN; here the parameter K is replaced by the radius r. More generally, it is interesting to have a decaying weight for points which are further away, for example

k(x', x) = (1 - \|x - x'\|/r)_+ \, 1_{\|x - x'\| \le r},

where (a)_+ = a if a > 0 and (a)_+ = 0 otherwise. Another possibility is to consider fast-decaying functions, such as a Gaussian

k(x', x) = e^{-\|x - x'\|^2 / 2\sigma^2}

or an exponential

k(x', x) = e^{-\|x - x'\| / (2\sigma)}.

Remarks. In all the above methods there is a parameter (r, \sigma, or K) that controls the fit/stability of the prediction. Moreover, as for nearest neighbors, the Euclidean norm can be replaced by other metrics/similarities (Generalization II: other metrics/similarities). Finally, the complexity of predicting any new point is O(nD); recall that the complexity of multiplying two D-dimensional vectors is O(D).
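A minimal sketch of a Gaussian Parzen-window predictor (illustrative only; the function name and the default value of sigma are assumptions):

import numpy as np

def parzen_predict(X_train, y_train, x_new, sigma=1.0):
    # Gaussian window: weight each training output by its similarity to x_new
    w = np.exp(-np.sum((X_train - x_new) ** 2, axis=1) / (2 * sigma ** 2))
    # weighted average of the outputs; take the sign for binary classification
    return np.sum(w * y_train) / np.sum(w)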
5.1 Bias-Variance Trade-Off

In the KNN algorithm the parameter K controls the achieved (model) complexity. Is there an optimal value? Can we compute it?

To get an insight on how to choose K, we analyze theoretically how this choice influences the expected loss

E_S E_{x,y} (y - \hat{f}_K(x))^2.

For the sake of simplicity we consider a regression model

y_i = f(x_i) + \delta_i, \qquad E[\delta_i] = 0, \quad E[\delta_i^2] = \sigma^2, \qquad i = 1, \dots, n,

and denote by S the corresponding training set. Moreover, we consider the least squares loss function to measure errors, so that the performance of the KNN algorithm is given by the expected loss

E_S E_{x,y} (y - \hat{f}_K(x))^2 = E_x\, E_S E_{y|x} (y - \hat{f}_K(x))^2.

In the following we simplify the analysis by considering the performance of KNN, \varepsilon(K), at a given point x,

\varepsilon(K) = E_S E_{y|x} (y - \hat{f}_K(x))^2.

Bias-Variance Decomposition. First, note that

\varepsilon(K) = \sigma^2 + E_S E_{y|x} (f(x) - \hat{f}_K(x))^2,

where \sigma^2 can be seen as an irreducible error term. Second, to study the latter term, we introduce the expected KNN algorithm,

E_{y|x} \hat{f}_K(x) = \frac{1}{K} \sum_{\ell \in K_x} f(x_\ell).

We have

E_S E_{y|x} (f(x) - \hat{f}_K(x))^2 = \underbrace{(f(x) - E_S E_{y|x} \hat{f}_K(x))^2}_{Bias} + \underbrace{E_S E_{y|x} (E_{y|x} \hat{f}_K(x) - \hat{f}_K(x))^2}_{Variance}.

Finally, we have

\varepsilon(K) = \sigma^2 + \Big( f(x) - \frac{1}{K} \sum_{\ell \in K_x} f(x_\ell) \Big)^2 + \frac{\sigma^2}{K}.

The Bias-Variance Trade-Off. In the KNN algorithm the parameter K controls the achieved complexity: the variance term \sigma^2/K decreases with K, while the bias term typically increases with K, since points further away from x are averaged in.

Tuning. Is there an optimal value? Can we compute it? YES! In the above expression one could in principle minimize \varepsilon(K) over K. Not quite, though: the optimal K depends on the unknown function f and noise level \sigma, so in practice it must be estimated from data.
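A rough numerical illustration of \varepsilon(K) (not from the notes; the one-dimensional data model, sizes, and function names below are all made-up assumptions): averaging the squared error at a fixed test point over many simulated training sets shows the trade-off as K varies.

import numpy as np

def knn_at_point(X, y, x0, K):
    idx = np.argsort(np.abs(X - x0))[:K]       # K nearest 1-D training inputs
    return np.mean(y[idx])

def estimate_eps(K, f, sigma=0.3, n=200, trials=500, x0=0.5):
    errs = []
    for _ in range(trials):
        X = np.random.rand(n)
        y = f(X) + sigma * np.random.randn(n)  # y_i = f(x_i) + noise
        y0 = f(x0) + sigma * np.random.randn() # noisy label at the test point
        errs.append((y0 - knn_at_point(X, y, x0, K)) ** 2)
    return np.mean(errs)

f = lambda x: np.sin(2 * np.pi * x)
for K in (1, 5, 20, 100):
    print(K, estimate_eps(K, f))               # small K: high variance; large K: high bias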
Validation
Hold-Out Validation
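A minimal sketch of hold-out validation for choosing K (illustrative only; the split fraction, candidate grid, and function names are assumptions, not the notes' procedure):

import numpy as np

def knn_predict(X_train, y_train, x_new, K):
    dists = np.sum((X_train - x_new) ** 2, axis=1)
    return np.mean(y_train[np.argsort(dists)[:K]])

def holdout_select_K(X, y, candidate_Ks, frac=0.7, seed=0):
    # split the data once into a training part and a held-out validation part
    perm = np.random.default_rng(seed).permutation(len(X))
    n_tr = int(frac * len(X))
    tr, va = perm[:n_tr], perm[n_tr:]
    errs = []
    for K in candidate_Ks:
        preds = np.array([knn_predict(X[tr], y[tr], x0, K) for x0 in X[va]])
        errs.append(np.mean((y[va] - preds) ** 2))   # validation error for this K
    return candidate_Ks[int(np.argmin(errs))]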
End of PART I
Local methods
Bias-Variance and Cross Validation
Tell me the length of the edge, versus its volume in high dimensions, of a cube containing 1% of the volume of a cube with edge 1. Since a sub-cube of edge \ell has volume \ell^D, we need \ell = 0.01^{1/D}, which is about 0.63 for D = 10 and about 0.955 for D = 100: to capture even 1% of the volume, the edge must be almost as long as that of the whole cube.
Curse of dimensionality!
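A two-line check in plain Python (not from the slides):

# edge length of a sub-cube holding 1% of the unit cube's volume: l = 0.01**(1/D)
for D in (1, 2, 10, 100, 1000):
    print(D, 0.01 ** (1 / D))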
PART II
Regularization I: Linear Least Squares
Regularization II: Kernel Least Squares
7.1 Regularized Least Squares

We introduce a class of learning algorithms based on Tikhonov regularization (Tikhonov '62, Phillips '62, Hoerl et al. '62), focusing on the algorithm defined by the square loss,

\min_{w \in \mathbb{R}^D} \frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2 + \lambda\, w^T w, \qquad \lambda \ge 0,    (7.1)

with the corresponding linear estimator f(x) = w^T x. A motivation for considering the above scheme is to view the empirical error

\frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2

as a proxy for the expected error

\int dx\, dy\, p(x, y)\, (y - w^T x)^2.

Although this seemingly simple example will be the basis for much more complex methods, two questions arise already here: Computations? Statistics?
7.2 Computations

Notation. It is convenient to introduce the n \times D matrix X_n, whose rows are the input points, and the n \times 1 vector Y_n, whose entries are the corresponding outputs, so that

\frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2 = \frac{1}{n} \|Y_n - X_n w\|^2.

A direct computation shows that the gradients with respect to w of the empirical risk and of the regularizer are, respectively,

-\frac{2}{n} X_n^T (Y_n - X_n w) \quad \text{and} \quad 2w.

Setting the gradient to zero, we have that the solution of regularized least squares solves the linear system

(X_n^T X_n + \lambda n I)\, w = X_n^T Y_n.

Several comments are in order. First, several methods can be used to solve the above linear system, Cholesky decomposition being the method of choice, since the matrix X_n^T X_n + \lambda n I is symmetric and positive definite. The complexity of the method is essentially O(nD^2) for training and O(D) for testing. The parameter \lambda controls the invertibility of the matrix X_n^T X_n + \lambda n I.

The term w^T w = \|w\|^2 is called the regularizer; it controls the stability of the solution and helps prevent overfitting. The parameter \lambda balances the error term and the regularizer. Algorithm (7.1) is an instance of Tikhonov regularization, also called penalized empirical risk minimization. We have implicitly chosen the space of possible solutions, called the hypotheses space, to be the space of linear functions.

OK, but what is this doing?
7.3 Interlude: Linear Systems

Consider the problem

M a = b,

where M is a D \times D matrix and a, b vectors in \mathbb{R}^D. We are interested in determining a satisfying the above equation given M, b. If M is invertible, the solution to the problem is

a = M^{-1} b.

If M is symmetric and positive definite, then considering the eigendecomposition

M = V \Sigma V^T, \qquad \Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_D), \qquad V V^T = I,

we have

M^{-1} = V \Sigma^{-1} V^T, \qquad \Sigma^{-1} = \mathrm{diag}(1/\sigma_1, \dots, 1/\sigma_D),

and

(M + \lambda I)^{-1} = V \Sigma_\lambda V^T, \qquad \Sigma_\lambda = \mathrm{diag}(1/(\sigma_1 + \lambda), \dots, 1/(\sigma_D + \lambda)).

The ratio between the largest and the smallest eigenvalue of M is called the condition number of M; adding \lambda I shifts every eigenvalue by \lambda and hence improves the conditioning of the system.
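A small numerical check of the effect of adding \lambda I (illustrative values only):

import numpy as np

# an ill-conditioned symmetric positive definite matrix
M = np.array([[1.0, 0.999],
              [0.999, 1.0]])
lam = 0.1
print(np.linalg.cond(M))                     # large condition number
print(np.linalg.cond(M + lam * np.eye(2)))   # much smaller after adding lambda*I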
7.4 Dealing with an Offset

When considering linear models, especially in relatively low dimensional spaces, it is interesting to consider an offset, that is f(x) = w^T x + b. We shall ask the question of how to estimate b from data. A simple idea is to simply augment the dimension of the input space, considering inputs extended with a constant coordinate, so that the offset becomes one more component of the weight vector.
Statistics? The statistical side of regularized least squares is related to shrinkage and the Stein effect on admissible estimators (Stein '56, James and Stein '61): another story that shall be told another time.

Dictionaries
Linear models can be extended by mapping the inputs through a dictionary of features,

x \mapsto \Phi(x) = (\varphi_1(x), \dots, \varphi_p(x)) \in \mathbb{R}^p, \qquad f(x) = w^T \Phi(x) = \sum_{j=1}^{p} \varphi_j(x)\, w^j,

and solving the same regularized least squares problem on the transformed data,

(X_n^T X_n + \lambda n I)\, w = X_n^T Y_n,

where X_n is now the n \times p matrix with rows \Phi(x_i)^T.

Complexity Vademecum (v, v' vectors in \mathbb{R}^p, M an n \times p matrix):
v^T v' \mapsto O(p), \qquad M v \mapsto O(np), \qquad M M^T \mapsto O(n^2 p), \qquad (M M^T)^{-1} \mapsto O(n^3).
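A minimal sketch of solving the regularized least squares system in the primal, i.e. directly in R^D (or R^p after a feature map); the function names and the use of a generic dense solver are illustrative assumptions, and a Cholesky factorization could be used instead:

import numpy as np

def rls_fit(X, Y, lam):
    # solve (X^T X + lambda * n * I) w = X^T Y
    n, D = X.shape
    A = X.T @ X + lam * n * np.eye(D)
    return np.linalg.solve(A, X.T @ Y)   # O(n D^2) to form A, O(D^3) to solve

def rls_predict(w, X_new):
    return X_new @ w                     # O(D) per test point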
1. Representer Theorem for RLS. The result follows noting that the following equality holds:

(X_n^T X_n + \lambda n I)^{-1} X_n^T = X_n^T (X_n X_n^T + \lambda n I)^{-1},    (7.2)

so that we have

w = X_n^T (X_n X_n^T + \lambda n I)^{-1} Y_n = \sum_{i=1}^{n} x_i c_i, \qquad c = (X_n X_n^T + \lambda n I)^{-1} Y_n.

Equation (7.2) follows from considering the SVD of X_n, that is X_n = U \Sigma V^T. Indeed we have

(X_n^T X_n + \lambda n I)^{-1} X_n^T = V \Sigma (\Sigma^2 + \lambda n I)^{-1} U^T = X_n^T (X_n X_n^T + \lambda n I)^{-1}.

Computing the coefficients c requires solving an n \times n linear system, with complexity O(n^3) + O(p n^2).

2. Representer Theorem Implications. Using Equation (7.2) it is possible to show how the coefficients can be computed considering different loss functions. In particular, for the square loss the coefficients satisfy the linear system

(K_n + \lambda n I)\, c = Y_n,

where K_n is the n \times n matrix with entries (K_n)_{i,j} = x_i^T x_j. The matrix K_n is called the kernel matrix and is symmetric and positive semi-definite.

6.3. Kernels
One of the main advantages of using the representer theorem is that the solution of the problem depends on the input points only through inner products x^T x'. Kernel methods can be seen as replacing the inner product with a more general function K(x, x'). In this case the representer theorem, i.e.

f(x) = x^T w = \sum_{i=1}^{n} x^T x_i\, c_i,

becomes

\hat{f}(x) = \sum_{i=1}^{n} K(x_i, x)\, c_i,

and we can promptly derive kernel versions of the above algorithms (Regularization Networks): for the square loss the coefficients satisfy

(K_n + \lambda n I)\, c = Y_n, \qquad (K_n)_{i,j} = K(x_i, x_j).

The function K is often called a kernel and, to be admissible, it should behave like an inner product. More precisely it should be: 1) symmetric, and 2) positive definite, that is, the kernel matrix K_n should be positive semi-definite for any set of n input points. While the symmetry property is typically easy to check, positive semi-definiteness is trickier. Popular examples of positive definite kernels are:
the linear kernel K(x, x') = x^T x',
the polynomial kernel K(x, x') = (x^T x' + 1)^d,
the Gaussian kernel K(x, x') = e^{-\|x - x'\|^2 / 2\sigma^2},
where the last two kernels have a tuning parameter, the degree d and the Gaussian width \sigma, respectively.
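A minimal kernel regularized least squares sketch with a Gaussian kernel (function names and the choice of sigma are illustrative assumptions):

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # pairwise Gaussian kernel matrix between the rows of A and the rows of B
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def krls_fit(X, Y, lam, sigma=1.0):
    # solve (K_n + lambda * n * I) c = Y_n
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), Y)

def krls_predict(X_train, c, X_new, sigma=1.0):
    # f(x) = sum_i K(x_i, x) c_i
    return gaussian_kernel(X_new, X_train, sigma) @ c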
Integral Equations
Sampling Theory / Inverse Problems
Loss functions: SVM, Logistic
Multi-task: labels, outputs, classes
End of PART II
Regularization I: Linear Least Squares
Regularization II: Kernel Least Squares
PART III
a) Variable Selection: OMP
b) Dimensionality Reduction: PCA
X_n = \begin{pmatrix} x_1^1 & \cdots & x_1^D \\ \vdots & & \vdots \\ x_n^1 & \cdots & x_n^D \end{pmatrix}; \qquad Y_n = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
Beyond predictions, it is often important to obtain interpretable results, for example by detecting which factors have determined our prediction. We look at the problem of variable selection for the linear model

f_w(x) = w^T x = \sum_{j=1}^{D} x^j w^j,

where we can think of the components x^j of an input as specific measurements (pixel values in the case of images, for example), and define

\|w\|_0 = |\{ j \mid w^j \neq 0 \}|.

A brute-force approach would solve the learning problem for each candidate set of variables and eventually select the one for which the corresponding solution achieves the best performance. The approach described has an exponential complexity and becomes unfeasible already for relatively small D. If we consider the square loss, it can be shown that the corresponding problem can be written as

\min_{w \in \mathbb{R}^D} \frac{1}{n} \|Y_n - X_n w\|^2 + \lambda \|w\|_0,    (11.2)

where \|w\|_0, called the \ell_0 norm, counts the number of non-zero components in w. In the following we focus on the least squares loss and consider different approaches to find approximate solutions to the above problem, namely greedy methods and convex relaxation.
11.2. Greedy Methods: (Orthogonal) Matching Pursuit

Greedy approaches are often considered to find approximate solutions to problem (11.2). The general scheme is:
initialize the residual, the coefficient vector, and the index set,
find the variable most correlated with the residual,
update the index set to include the index of such variable,
update/compute the coefficient vector,
update the residual.
The simplest such procedure is called forward stage-wise regression in statistics and matching pursuit (MP) in signal processing. To describe the procedure we need some notation. Let X_n = (X^1, \dots, X^D), where X^j \in \mathbb{R}^n denotes the j-th column of X_n.
The iteration is initialized with

r_0 = Y_n, \qquad w_0 = 0, \qquad I_0 = \emptyset,

and is then repeated for i = 1, \dots, T - 1. The variable most correlated with the residual is

k = \arg\max_{j=1,\dots,D} a_j, \qquad a_j = \frac{(r_{i-1}^T X^j)^2}{\|X^j\|^2},

where

v_k = \arg\min_{v \in \mathbb{R}} \|r_{i-1} - X^k v\|^2, \qquad \|r_{i-1} - X^k v_k\|^2 = \|r_{i-1}\|^2 - a_k.

The above step has two interpretations: we select the variable such that the projection of the residual on the corresponding column is largest, or, equivalently, we select the variable whose column best explains the output vector in a least squares sense.

Then, the index set is updated as I_i = I_{i-1} \cup \{k\}, and the coefficient vector is given by

w_i = w_{i-1} + w_k, \qquad w_k = v_k e_k,    (11.3)

where e_k is the element of the canonical basis in \mathbb{R}^D with k-th component different from zero. Finally, the residual is updated,

r_i = r_{i-1} - X_n w_k.

A variant of the above procedure, called Orthogonal Matching Pursuit (OMP), is also often considered. The corresponding iteration is analogous to that of MP, but the coefficient computation (11.3) is replaced by

w_i = \arg\min_{w \in \mathbb{R}^D} \|Y_n - X_n M_{I_i} w\|^2,

where M_{I_i} is such that (M_{I_i} w)^j = w^j if j \in I_i and (M_{I_i} w)^j = 0 otherwise: at each step the coefficients of all the currently selected variables are recomputed.

Convex relaxation. An alternative to greedy methods is to replace the \ell_0 norm with a convex surrogate, the \ell_1 norm \|w\|_1 = \sum_{j=1}^{D} |w^j|. The problem is now convex and can be solved using convex optimization, in particular so-called proximal methods.
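A minimal OMP sketch in NumPy (not from the notes; the fixed number of iterations T and the function name are illustrative choices):

import numpy as np

def omp(X, Y, T):
    n, D = X.shape
    r = Y.copy()                                  # residual r_0 = Y_n
    w = np.zeros(D)                               # coefficient vector w_0 = 0
    I = []                                        # index set I_0
    for _ in range(T):
        # variable most correlated with the current residual
        a = (X.T @ r) ** 2 / np.sum(X ** 2, axis=0)
        k = int(np.argmax(a))
        if k not in I:
            I.append(k)
        # OMP step: refit the coefficients of all selected variables
        w_I, *_ = np.linalg.lstsq(X[:, I], Y, rcond=None)
        w = np.zeros(D)
        w[I] = w_I
        r = Y - X @ w                             # update the residual
    return w, I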
PART III b)
a) Variable Selection: OMP
b) Dimensionality Reduction: PCA
PCA & Reconstruction

PCA is the most popular (unsupervised) dimensionality reduction procedure: given a sample S = (x_1, \dots, x_n), derive a linear map

M : X = \mathbb{R}^D \to \mathbb{R}^k, \qquad k \ll D,

chosen according to some suitable criterion. PCA can be derived from several perspectives; here we follow a geometric/analytical derivation.

Consider first k = 1. Recall that, if w has unit norm, the (orthogonal) projection of a point x on w is given by (w^T x) w. The problem is finding the direction p which allows the best possible average reconstruction of the training set, that is, the solution of the problem

\min_{w \in S^{D-1}} \frac{1}{n} \sum_{i=1}^{n} \| x_i - (w^T x_i) w \|^2,    (19.1)

where S^{D-1} = \{ w \in \mathbb{R}^D \mid \|w\| = 1 \} is the sphere in D dimensions. The norm \| x_i - (w^T x_i) w \|^2 measures how much we lose by projecting x along the direction w, and the solution p of problem (19.1) is called the first principal component of the data. A direct computation shows that \| x_i - (w^T x_i) w \|^2 = \| x_i \|^2 - (w^T x_i)^2, so that problem (19.1) is equivalent to

\max_{w \in S^{D-1}} \frac{1}{n} \sum_{i=1}^{n} (w^T x_i)^2.    (19.2)

This latter observation is useful for two different reasons that we discuss in the following.

PCA and Maximum Variance
If the data are centered, the term \frac{1}{n}\sum_i (w^T x_i)^2 is the variance of the data in the direction w. If the data are not centered, to keep this interpretation we should replace problem (19.2) with

\max_{w \in S^{D-1}} \frac{1}{n} \sum_{i=1}^{n} (w^T (x_i - \bar{x}))^2,    (19.3)

which corresponds to the original problem on the centered data x^c = x - \bar{x}. In the case of problem (19.1) it is easy to see that this corresponds to considering

\min_{w \in S^{D-1},\, b} \frac{1}{n} \sum_{i=1}^{n} \| x_i - ((w^T (x_i - b)) w + b) \|^2,

where (w^T (x - b)) w + b is an affine transformation (rather than an orthogonal projection).

PCA and Associated Eigenproblem
A further manipulation allows us to write problem (19.2) as an eigenvalue problem. Indeed, using the symmetry of the inner product, we have

\frac{1}{n} \sum_{i=1}^{n} (w^T x_i)^2 = \frac{1}{n} \sum_{i=1}^{n} w^T x_i\, w^T x_i = \frac{1}{n} \sum_{i=1}^{n} w^T x_i x_i^T w = w^T \Big( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T \Big) w,

so that problem (19.2) can be written as

\max_{w \in S^{D-1}} w^T C_n w, \qquad C_n = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T.

We need two observations. First, in matrix notation C_n = \frac{1}{n} X_n^T X_n, and it is easy to see that C_n is symmetric and positive semi-definite; if the data are centered, C_n is the so-called covariance matrix. Second, the objective function in (19.2) is maximized over the sphere by an eigenvector associated with the largest eigenvalue of C_n:

w_1 = maximum eigenvector of C_n.

Computations? Statistics?

What about w_2? The second principal component solves the same problem restricted to directions orthogonal to the first,

\max_{w \in S^{D-1},\; w \perp w_1} w^T C_n w,

so w_2 = second eigenvector of C_n, and so on for k > 2.

Things I won't tell you about: random maps and the Johnson-Lindenstrauss Lemma.
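A minimal PCA sketch via the eigendecomposition of C_n (illustrative only; the function name and the centering convention are assumptions):

import numpy as np

def pca(X, k):
    # center the data so that C_n is the covariance matrix
    Xc = X - X.mean(axis=0)
    C = (Xc.T @ Xc) / X.shape[0]          # C_n = (1/n) X_n^T X_n
    evals, evecs = np.linalg.eigh(C)      # eigendecomposition of the symmetric matrix C_n
    order = np.argsort(evals)[::-1]
    W = evecs[:, order[:k]]               # first k principal components as columns
    return Xc @ W, W                      # projected data and the directions w_1, ..., w_k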
The End
PART IV
Afternoon