Machine Learning: The Theoretical Minimum
Lorenzo Rosasco
- Università di Genova
- Istituto Italiano di Tecnologia
Machine Learning
Intelligent Systems
Data Science
Intro
PART I
Local methods
Bias-Variance and Cross Validation
PART II
Regularization I: Linear Least Squares
Regularization II: Kernel Least Squares
PART III
a) Variable Selection: OMP
b) Dimensionality Reduction: PCA
PART IV
Morning
Afternoon
PART I
Local methods
Bias-Variance and Cross Validation
CHAPTER 1
Machine Learning deals with systems that are trained from data rather than programmed. Here we describe the data model considered in statistical learning theory. The data, called the training set, is a set of n input-output pairs

S = {(x_1, y_1), \dots, (x_n, y_n)},

with inputs in a space X (for example X = \mathbb{R}^D) and outputs in a space Y. Given the training set, the goal is to learn a corresponding input-output relation. To make the task precise we postulate an output space Y, and we consider several possible situations: regression, Y \subseteq \mathbb{R}, where we have in mind the existence of a model for the regression data (outputs equal to an underlying function of the inputs plus noise); binary classification, Y = \{-1, 1\}; and multi-category (multiclass) classification, Y = \{1, 2, \dots, T\}. The choice reflects possible uncertainty in the task and in the data.

It is convenient to collect the data in the n \times D matrix X_n, whose rows are the input points, and in the n \times 1 vector Y_n of the corresponding outputs:

X_n = \begin{pmatrix} x_1^1 & \cdots & x_1^D \\ \vdots & & \vdots \\ x_n^1 & \cdots & x_n^D \end{pmatrix}, \qquad Y_n = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}.
Local Methods
2.1. Nearest Neighbor
Given the training set S = {(x_1, y_1), \dots, (x_n, y_n)} and a new input \bar{x}, let

i' = \arg\min_{i=1,\dots,n} \|\bar{x} - x_i\|^2,

and define the nearest neighbor (NN) estimator as

\hat{f}(\bar{x}) = y_{i'}.

Every new input point is assigned the same output as its nearest input in the training set. We add a few comments. First, while in the above definition we simply considered the Euclidean norm, other measures of similarity can be used, as discussed below.
How does it work?
[Plot: FIGURE 2.2. The same classification example in two dimensions as in Figure 2.1. The classes are coded as a binary variable (BLUE = 0, ORANGE = 1).]
The definition can be promptly generalized to consider other measures of similarity among inputs. For example, if the inputs are binary strings, i.e. X = \{0, 1\}^D, one could consider the Hamming distance

d_H(x, x') = \frac{1}{D} \sum_{j=1}^{D} 1_{[x^j \neq x'^j]},

where x^j is the j-th component of a string x \in X.
K-Nearest Neighbors
Consider d_{\bar{x}} = (\|\bar{x} - x_i\|^2)_{i=1}^{n}, the array of distances of a new point \bar{x} to the training inputs, let I_{\bar{x}} be the indices that sort this array in increasing order, and let K_{\bar{x}} be the array of the first K entries of I_{\bar{x}}. The K-nearest neighbor (KNN) estimator is defined as

\hat{f}(\bar{x}) = \frac{1}{K} \sum_{i' \in K_{\bar{x}}} y_{i'}.
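As a minimal illustration (not part of the original notes; the function name and interface are made up for this sketch), the NN/KNN estimator takes only a few lines of NumPy:

import numpy as np

def knn_predict(X_train, y_train, x_new, K=1):
    # squared Euclidean distances of the new point to all training inputs
    dists = np.sum((X_train - x_new) ** 2, axis=1)
    # indices of the K closest training points (K = 1 gives the NN estimator)
    nearest = np.argsort(dists)[:K]
    # average of their outputs; for Y = {-1, 1} take the sign for classification
    return np.mean(y_train[nearest])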
KNN implements the principle that nearby input points should have similar/same outputs. The question of how to best choose K will be the subject of a future discussion: there is one parameter controlling fit/stability.

Parzen Windows
A related idea is to let the prediction at \bar{x} depend on all training points, weighted by their similarity to \bar{x}. Define the estimator

\hat{f}(\bar{x}) = \frac{\sum_{i=1}^{n} y_i\, k(\bar{x}, x_i)}{\sum_{i=1}^{n} k(\bar{x}, x_i)},

where k : X \times X \to [0, 1] is a suitable function, which can be seen as a similarity measure on the input points. The function k defines a window around each point and is sometimes called a Parzen window. A classification rule is obtained by taking the sign of \hat{f}(\bar{x}).

Many examples of k depend on the distance \|x - x'\|, x, x' \in X. For example, we can consider

k(x', x) = 1_{\|x - x'\| \le r}.

This choice induces a Parzen window analogous to KNN; here the parameter K is replaced by the radius r. More generally, it is interesting to have a decaying weight for points which are further away, for example

k(x', x) = (1 - \|x - x'\|/r)_+ \, 1_{\|x - x'\| \le r},

where (a)_+ = a if a > 0 and (a)_+ = 0 otherwise. Another possibility is to consider fast-decaying functions, such as a Gaussian

k(x', x) = e^{-\|x - x'\|^2 / 2\sigma^2}

or an exponential

k(x', x) = e^{-\|x - x'\| / (2\sigma)}.

Remarks. In all the above methods there is a parameter (r, \sigma, or K) that controls the fit/stability of the prediction. Moreover, as for nearest neighbors, the Euclidean norm can be replaced by other metrics/similarities (Generalization II: other metrics/similarities). Finally, the complexity of predicting any new point is O(nD); recall that the complexity of multiplying two D-dimensional vectors is O(D).
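A minimal sketch of a Gaussian Parzen-window predictor (illustrative only; the function name and the default value of sigma are assumptions):

import numpy as np

def parzen_predict(X_train, y_train, x_new, sigma=1.0):
    # Gaussian window: weight each training output by its similarity to x_new
    w = np.exp(-np.sum((X_train - x_new) ** 2, axis=1) / (2 * sigma ** 2))
    # weighted average of the outputs; take the sign for binary classification
    return np.sum(w * y_train) / np.sum(w)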
5.1 Bias-Variance Trade-Off

In the KNN algorithm the parameter K controls the achieved (model) complexity. Is there an optimal value? Can we compute it?

To get an insight on how to choose K, we analyze theoretically how this choice influences the expected loss

E_S E_{x,y} (y - \hat{f}_K(x))^2.

For the sake of simplicity we consider a regression model

y_i = f(x_i) + \delta_i, \qquad E[\delta_i] = 0, \quad E[\delta_i^2] = \sigma^2, \qquad i = 1, \dots, n,

and denote by S the corresponding training set. Moreover, we consider the least squares loss function to measure errors, so that the performance of the KNN algorithm is given by the expected loss

E_S E_{x,y} (y - \hat{f}_K(x))^2 = E_x\, E_S E_{y|x} (y - \hat{f}_K(x))^2.

In the following we simplify the analysis by considering the performance of KNN, \varepsilon(K), at a given point x,

\varepsilon(K) = E_S E_{y|x} (y - \hat{f}_K(x))^2.

Bias-Variance Decomposition. First, note that

\varepsilon(K) = \sigma^2 + E_S E_{y|x} (f(x) - \hat{f}_K(x))^2,

where \sigma^2 can be seen as an irreducible error term. Second, to study the latter term, we introduce the expected KNN algorithm,

E_{y|x} \hat{f}_K(x) = \frac{1}{K} \sum_{\ell \in K_x} f(x_\ell).

We have

E_S E_{y|x} (f(x) - \hat{f}_K(x))^2 = \underbrace{(f(x) - E_S E_{y|x} \hat{f}_K(x))^2}_{Bias} + \underbrace{E_S E_{y|x} (E_{y|x} \hat{f}_K(x) - \hat{f}_K(x))^2}_{Variance}.

Finally, we have

\varepsilon(K) = \sigma^2 + \Big( f(x) - \frac{1}{K} \sum_{\ell \in K_x} f(x_\ell) \Big)^2 + \frac{\sigma^2}{K}.

The Bias-Variance Trade-Off. In the KNN algorithm the parameter K controls the achieved complexity: the variance term \sigma^2/K decreases with K, while the bias term typically increases with K, since points further away from x are averaged in.

Tuning. Is there an optimal value? Can we compute it? YES! In the above expression one could in principle minimize \varepsilon(K) over K. Not quite, though: the optimal K depends on the unknown function f and noise level \sigma, so in practice it must be estimated from data.
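A rough numerical illustration of \varepsilon(K) (not from the notes; the one-dimensional data model, sizes, and function names below are all made-up assumptions): averaging the squared error at a fixed test point over many simulated training sets shows the trade-off as K varies.

import numpy as np

def knn_at_point(X, y, x0, K):
    idx = np.argsort(np.abs(X - x0))[:K]       # K nearest 1-D training inputs
    return np.mean(y[idx])

def estimate_eps(K, f, sigma=0.3, n=200, trials=500, x0=0.5):
    errs = []
    for _ in range(trials):
        X = np.random.rand(n)
        y = f(X) + sigma * np.random.randn(n)  # y_i = f(x_i) + noise
        y0 = f(x0) + sigma * np.random.randn() # noisy label at the test point
        errs.append((y0 - knn_at_point(X, y, x0, K)) ** 2)
    return np.mean(errs)

f = lambda x: np.sin(2 * np.pi * x)
for K in (1, 5, 20, 100):
    print(K, estimate_eps(K, f))               # small K: high variance; large K: high bias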
Validation
Hold-Out Validation
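A minimal sketch of hold-out validation for choosing K (illustrative only; the split fraction, candidate grid, and function names are assumptions, not the notes' procedure):

import numpy as np

def knn_predict(X_train, y_train, x_new, K):
    dists = np.sum((X_train - x_new) ** 2, axis=1)
    return np.mean(y_train[np.argsort(dists)[:K]])

def holdout_select_K(X, y, candidate_Ks, frac=0.7, seed=0):
    # split the data once into a training part and a held-out validation part
    perm = np.random.default_rng(seed).permutation(len(X))
    n_tr = int(frac * len(X))
    tr, va = perm[:n_tr], perm[n_tr:]
    errs = []
    for K in candidate_Ks:
        preds = np.array([knn_predict(X[tr], y[tr], x0, K) for x0 in X[va]])
        errs.append(np.mean((y[va] - preds) ** 2))   # validation error for this K
    return candidate_Ks[int(np.argmin(errs))]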
End of PART I
Local methods
Bias-Variance and Cross Validation
Tell me the length of the edge, versus its volume in high dimensions, of a cube containing 1% of the volume of a cube with edge 1. Since a sub-cube of edge \ell has volume \ell^D, we need \ell = 0.01^{1/D}, which is about 0.63 for D = 10 and about 0.955 for D = 100: to capture even 1% of the volume, the edge must be almost as long as that of the whole cube.
Curse of dimensionality!
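A two-line check in plain Python (not from the slides):

# edge length of a sub-cube holding 1% of the unit cube's volume: l = 0.01**(1/D)
for D in (1, 2, 10, 100, 1000):
    print(D, 0.01 ** (1 / D))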
PART II
Regularization I: Linear Least Squares
Regularization II: Kernel Least Squares
7.1 Regularized Least Squares

We introduce a class of learning algorithms based on Tikhonov regularization (Tikhonov '62, Phillips '62, Hoerl et al. '62), focusing on the algorithm defined by the square loss,

\min_{w \in \mathbb{R}^D} \frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2 + \lambda\, w^T w, \qquad \lambda \ge 0,    (7.1)

with the corresponding linear estimator f(x) = w^T x. A motivation for considering the above scheme is to view the empirical error

\frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2

as a proxy for the expected error

\int dx\, dy\, p(x, y)\, (y - w^T x)^2.

Although this seemingly simple example will be the basis for much more complex methods, two questions arise already here: Computations? Statistics?
7.2 Computations

Notation. It is convenient to introduce the n \times D matrix X_n, whose rows are the input points, and the n \times 1 vector Y_n, whose entries are the corresponding outputs, so that

\frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2 = \frac{1}{n} \|Y_n - X_n w\|^2.

A direct computation shows that the gradients with respect to w of the empirical risk and of the regularizer are, respectively,

-\frac{2}{n} X_n^T (Y_n - X_n w) \quad \text{and} \quad 2w.

Setting the gradient to zero, we have that the solution of regularized least squares solves the linear system

(X_n^T X_n + \lambda n I)\, w = X_n^T Y_n.

Several comments are in order. First, several methods can be used to solve the above linear system, Cholesky decomposition being the method of choice, since the matrix X_n^T X_n + \lambda n I is symmetric and positive definite. The complexity of the method is essentially O(nD^2) for training and O(D) for testing. The parameter \lambda controls the invertibility of the matrix X_n^T X_n + \lambda n I.

The term w^T w = \|w\|^2 is called the regularizer; it controls the stability of the solution and helps prevent overfitting. The parameter \lambda balances the error term and the regularizer. Algorithm (7.1) is an instance of Tikhonov regularization, also called penalized empirical risk minimization. We have implicitly chosen the space of possible solutions, called the hypotheses space, to be the space of linear functions.

OK, but what is this doing?
7.3 Interlude: Linear Systems

Consider the problem

M a = b,

where M is a D \times D matrix and a, b vectors in \mathbb{R}^D. We are interested in determining a satisfying the above equation given M, b. If M is invertible, the solution to the problem is

a = M^{-1} b.

If M is symmetric and positive definite, then considering the eigendecomposition

M = V \Sigma V^T, \qquad \Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_D), \qquad V V^T = I,

we have

M^{-1} = V \Sigma^{-1} V^T, \qquad \Sigma^{-1} = \mathrm{diag}(1/\sigma_1, \dots, 1/\sigma_D),

and

(M + \lambda I)^{-1} = V \Sigma_\lambda V^T, \qquad \Sigma_\lambda = \mathrm{diag}(1/(\sigma_1 + \lambda), \dots, 1/(\sigma_D + \lambda)).

The ratio between the largest and the smallest eigenvalue of M is called the condition number of M; adding \lambda I shifts every eigenvalue by \lambda and hence improves the conditioning of the system.
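A small numerical check of the effect of adding \lambda I (illustrative values only):

import numpy as np

# an ill-conditioned symmetric positive definite matrix
M = np.array([[1.0, 0.999],
              [0.999, 1.0]])
lam = 0.1
print(np.linalg.cond(M))                     # large condition number
print(np.linalg.cond(M + lam * np.eye(2)))   # much smaller after adding lambda*I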
7.4 Dealing with an Offset

When considering linear models, especially in relatively low dimensional spaces, it is interesting to consider an offset, that is f(x) = w^T x + b. We shall ask the question of how to estimate b from data. A simple idea is to simply augment the dimension of the input space, considering inputs extended with a constant coordinate, so that the offset becomes one more component of the weight vector.
Statistics? The statistical side of regularized least squares is related to shrinkage and the Stein effect on admissible estimators (Stein '56, James and Stein '61): another story that shall be told another time.

Dictionaries
Linear models can be extended by mapping the inputs through a dictionary of features,

x \mapsto \Phi(x) = (\varphi_1(x), \dots, \varphi_p(x)) \in \mathbb{R}^p, \qquad f(x) = w^T \Phi(x) = \sum_{j=1}^{p} \varphi_j(x)\, w^j,

and solving the same regularized least squares problem on the transformed data,

(X_n^T X_n + \lambda n I)\, w = X_n^T Y_n,

where X_n is now the n \times p matrix with rows \Phi(x_i)^T.

Complexity Vademecum (v, v' vectors in \mathbb{R}^p, M an n \times p matrix):
v^T v' \mapsto O(p), \qquad M v \mapsto O(np), \qquad M M^T \mapsto O(n^2 p), \qquad (M M^T)^{-1} \mapsto O(n^3).
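A minimal sketch of solving the regularized least squares system in the primal, i.e. directly in R^D (or R^p after a feature map); the function names and the use of a generic dense solver are illustrative assumptions, and a Cholesky factorization could be used instead:

import numpy as np

def rls_fit(X, Y, lam):
    # solve (X^T X + lambda * n * I) w = X^T Y
    n, D = X.shape
    A = X.T @ X + lam * n * np.eye(D)
    return np.linalg.solve(A, X.T @ Y)   # O(n D^2) to form A, O(D^3) to solve

def rls_predict(w, X_new):
    return X_new @ w                     # O(D) per test point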
1. Representer Theorem for RLS. The result follows noting that the following equality holds:

(X_n^T X_n + \lambda n I)^{-1} X_n^T = X_n^T (X_n X_n^T + \lambda n I)^{-1},    (7.2)

so that we have

w = X_n^T (X_n X_n^T + \lambda n I)^{-1} Y_n = \sum_{i=1}^{n} x_i c_i, \qquad c = (X_n X_n^T + \lambda n I)^{-1} Y_n.

Equation (7.2) follows from considering the SVD of X_n, that is X_n = U \Sigma V^T. Indeed we have

(X_n^T X_n + \lambda n I)^{-1} X_n^T = V \Sigma (\Sigma^2 + \lambda n I)^{-1} U^T = X_n^T (X_n X_n^T + \lambda n I)^{-1}.

Computing the coefficients c requires solving an n \times n linear system, with complexity O(n^3) + O(p n^2).

2. Representer Theorem Implications. Using Equation (7.2) it is possible to show how the coefficients can be computed considering different loss functions. In particular, for the square loss the coefficients satisfy the linear system

(K_n + \lambda n I)\, c = Y_n,

where K_n is the n \times n matrix with entries (K_n)_{i,j} = x_i^T x_j. The matrix K_n is called the kernel matrix and is symmetric and positive semi-definite.

6.3. Kernels
One of the main advantages of using the representer theorem is that the solution of the problem depends on the input points only through inner products x^T x'. Kernel methods can be seen as replacing the inner product with a more general function K(x, x'). In this case the representer theorem, i.e.

f(x) = x^T w = \sum_{i=1}^{n} x^T x_i\, c_i,

becomes

\hat{f}(x) = \sum_{i=1}^{n} K(x_i, x)\, c_i,

and we can promptly derive kernel versions of the above algorithms (Regularization Networks): for the square loss the coefficients satisfy

(K_n + \lambda n I)\, c = Y_n, \qquad (K_n)_{i,j} = K(x_i, x_j).

The function K is often called a kernel and, to be admissible, it should behave like an inner product. More precisely it should be: 1) symmetric, and 2) positive definite, that is, the kernel matrix K_n should be positive semi-definite for any set of n input points. While the symmetry property is typically easy to check, positive semi-definiteness is trickier. Popular examples of positive definite kernels are:
the linear kernel K(x, x') = x^T x',
the polynomial kernel K(x, x') = (x^T x' + 1)^d,
the Gaussian kernel K(x, x') = e^{-\|x - x'\|^2 / 2\sigma^2},
where the last two kernels have a tuning parameter, the degree d and the Gaussian width \sigma, respectively.
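A minimal kernel regularized least squares sketch with a Gaussian kernel (function names and the choice of sigma are illustrative assumptions):

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # pairwise Gaussian kernel matrix between the rows of A and the rows of B
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def krls_fit(X, Y, lam, sigma=1.0):
    # solve (K_n + lambda * n * I) c = Y_n
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), Y)

def krls_predict(X_train, c, X_new, sigma=1.0):
    # f(x) = sum_i K(x_i, x) c_i
    return gaussian_kernel(X_new, X_train, sigma) @ c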
Integral Equations
Sampling Theory / Inverse Problems
Loss functions: SVM, Logistic
Multi-task: labels, outputs, classes
End of PART II
Regularization I: Linear Least Squares
Regularization II: Kernel Least Squares
PART III
a) Variable Selection: OMP
b) Dimensionality Reduction: PCA
X_n = \begin{pmatrix} x_1^1 & \cdots & x_1^D \\ \vdots & & \vdots \\ x_n^1 & \cdots & x_n^D \end{pmatrix}; \qquad Y_n = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
Beyond predictions, it is often important to obtain interpretable results, for example by detecting which factors have determined our prediction. We look at the problem of variable selection for the linear model

f_w(x) = w^T x = \sum_{j=1}^{D} x^j w^j,

where we can think of the components x^j of an input as specific measurements (pixel values in the case of images, for example), and define

\|w\|_0 = |\{ j \mid w^j \neq 0 \}|.

A brute-force approach would solve the learning problem for each candidate set of variables and eventually select the one for which the corresponding solution achieves the best performance. The approach described has an exponential complexity and becomes unfeasible already for relatively small D. If we consider the square loss, it can be shown that the corresponding problem can be written as

\min_{w \in \mathbb{R}^D} \frac{1}{n} \|Y_n - X_n w\|^2 + \lambda \|w\|_0,    (11.2)

where \|w\|_0, called the \ell_0 norm, counts the number of non-zero components in w. In the following we focus on the least squares loss and consider different approaches to find approximate solutions to the above problem, namely greedy methods and convex relaxation.
11.2. Greedy Methods: (Orthogonal) Matching Pursuit

Greedy approaches are often considered to find approximate solutions to problem (11.2). The general scheme is:
initialize the residual, the coefficient vector, and the index set,
find the variable most correlated with the residual,
update the index set to include the index of such variable,
update/compute the coefficient vector,
update the residual.
The simplest such procedure is called forward stage-wise regression in statistics and matching pursuit (MP) in signal processing. To describe the procedure we need some notation. Let X_n = (X^1, \dots, X^D), where X^j \in \mathbb{R}^n denotes the j-th column of X_n.
The iteration is initialized with

r_0 = Y_n, \qquad w_0 = 0, \qquad I_0 = \emptyset,

and is then repeated for i = 1, \dots, T - 1. The variable most correlated with the residual is

k = \arg\max_{j=1,\dots,D} a_j, \qquad a_j = \frac{(r_{i-1}^T X^j)^2}{\|X^j\|^2},

where

v_k = \arg\min_{v \in \mathbb{R}} \|r_{i-1} - X^k v\|^2, \qquad \|r_{i-1} - X^k v_k\|^2 = \|r_{i-1}\|^2 - a_k.

The above step has two interpretations: we select the variable such that the projection of the residual on the corresponding column is largest, or, equivalently, we select the variable whose column best explains the output vector in a least squares sense.

Then, the index set is updated as I_i = I_{i-1} \cup \{k\}, and the coefficient vector is given by

w_i = w_{i-1} + w_k, \qquad w_k = v_k e_k,    (11.3)

where e_k is the element of the canonical basis in \mathbb{R}^D with k-th component different from zero. Finally, the residual is updated,

r_i = r_{i-1} - X_n w_k.

A variant of the above procedure, called Orthogonal Matching Pursuit (OMP), is also often considered. The corresponding iteration is analogous to that of MP, but the coefficient computation (11.3) is replaced by

w_i = \arg\min_{w \in \mathbb{R}^D} \|Y_n - X_n M_{I_i} w\|^2,

where M_{I_i} is such that (M_{I_i} w)^j = w^j if j \in I_i and (M_{I_i} w)^j = 0 otherwise: at each step the coefficients of all the currently selected variables are recomputed.

Convex relaxation. An alternative to greedy methods is to replace the \ell_0 norm with a convex surrogate, the \ell_1 norm \|w\|_1 = \sum_{j=1}^{D} |w^j|. The problem is now convex and can be solved using convex optimization, in particular so-called proximal methods.
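A minimal OMP sketch in NumPy (not from the notes; the fixed number of iterations T and the function name are illustrative choices):

import numpy as np

def omp(X, Y, T):
    n, D = X.shape
    r = Y.copy()                                  # residual r_0 = Y_n
    w = np.zeros(D)                               # coefficient vector w_0 = 0
    I = []                                        # index set I_0
    for _ in range(T):
        # variable most correlated with the current residual
        a = (X.T @ r) ** 2 / np.sum(X ** 2, axis=0)
        k = int(np.argmax(a))
        if k not in I:
            I.append(k)
        # OMP step: refit the coefficients of all selected variables
        w_I, *_ = np.linalg.lstsq(X[:, I], Y, rcond=None)
        w = np.zeros(D)
        w[I] = w_I
        r = Y - X @ w                             # update the residual
    return w, I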
PART III b)
a) Variable Selection: OMP
b) Dimensionality Reduction: PCA
PCA & Reconstruction

PCA is the most popular (unsupervised) dimensionality reduction procedure: given a sample S = (x_1, \dots, x_n), derive a linear map

M : X = \mathbb{R}^D \to \mathbb{R}^k, \qquad k \ll D,

chosen according to some suitable criterion. PCA can be derived from several perspectives; here we follow a geometric/analytical derivation.

Consider first k = 1. Recall that, if w has unit norm, the (orthogonal) projection of a point x on w is given by (w^T x) w. The problem is finding the direction p which allows the best possible average reconstruction of the training set, that is, the solution of the problem

\min_{w \in S^{D-1}} \frac{1}{n} \sum_{i=1}^{n} \| x_i - (w^T x_i) w \|^2,    (19.1)

where S^{D-1} = \{ w \in \mathbb{R}^D \mid \|w\| = 1 \} is the sphere in D dimensions. The norm \| x_i - (w^T x_i) w \|^2 measures how much we lose by projecting x along the direction w, and the solution p of problem (19.1) is called the first principal component of the data. A direct computation shows that \| x_i - (w^T x_i) w \|^2 = \| x_i \|^2 - (w^T x_i)^2, so that problem (19.1) is equivalent to

\max_{w \in S^{D-1}} \frac{1}{n} \sum_{i=1}^{n} (w^T x_i)^2.    (19.2)

This latter observation is useful for two different reasons that we discuss in the following.

PCA and Maximum Variance
If the data are centered, the term \frac{1}{n}\sum_i (w^T x_i)^2 is the variance of the data in the direction w. If the data are not centered, to keep this interpretation we should replace problem (19.2) with

\max_{w \in S^{D-1}} \frac{1}{n} \sum_{i=1}^{n} (w^T (x_i - \bar{x}))^2,    (19.3)

which corresponds to the original problem on the centered data x^c = x - \bar{x}. In the case of problem (19.1) it is easy to see that this corresponds to considering

\min_{w \in S^{D-1},\, b} \frac{1}{n} \sum_{i=1}^{n} \| x_i - ((w^T (x_i - b)) w + b) \|^2,

where (w^T (x - b)) w + b is an affine transformation (rather than an orthogonal projection).

PCA and Associated Eigenproblem
A further manipulation allows us to write problem (19.2) as an eigenvalue problem. Indeed, using the symmetry of the inner product, we have

\frac{1}{n} \sum_{i=1}^{n} (w^T x_i)^2 = \frac{1}{n} \sum_{i=1}^{n} w^T x_i\, w^T x_i = \frac{1}{n} \sum_{i=1}^{n} w^T x_i x_i^T w = w^T \Big( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T \Big) w,

so that problem (19.2) can be written as

\max_{w \in S^{D-1}} w^T C_n w, \qquad C_n = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T.

We need two observations. First, in matrix notation C_n = \frac{1}{n} X_n^T X_n, and it is easy to see that C_n is symmetric and positive semi-definite; if the data are centered, C_n is the so-called covariance matrix. Second, the objective function in (19.2) is maximized over the sphere by an eigenvector associated with the largest eigenvalue of C_n:

w_1 = maximum eigenvector of C_n.

Computations? Statistics?

What about w_2? The second principal component solves the same problem restricted to directions orthogonal to the first,

\max_{w \in S^{D-1},\; w \perp w_1} w^T C_n w,

so w_2 = second eigenvector of C_n, and so on for k > 2.

Things I won't tell you about: random maps and the Johnson-Lindenstrauss Lemma.
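A minimal PCA sketch via the eigendecomposition of C_n (illustrative only; the function name and the centering convention are assumptions):

import numpy as np

def pca(X, k):
    # center the data so that C_n is the covariance matrix
    Xc = X - X.mean(axis=0)
    C = (Xc.T @ Xc) / X.shape[0]          # C_n = (1/n) X_n^T X_n
    evals, evecs = np.linalg.eigh(C)      # eigendecomposition of the symmetric matrix C_n
    order = np.argsort(evals)[::-1]
    W = evecs[:, order[:k]]               # first k principal components as columns
    return Xc @ W, W                      # projected data and the directions w_1, ..., w_k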
The End
PART IV
Afternoon