
Lecture 7: Unsupervised Learning

C19 Machine Learning Hilary 2013 A. Zisserman


• Dimensionality reduction
  - Principal Component Analysis: algorithm, applications
  - Isomap: non-linear map, applications

Unsupervised learning tasks include clustering and dimensionality reduction.
Goal: learn the data distribution p(x)
Dimensionality Reduction

Why reduce dimensionality?

1. Intrinsic dimension of data: often data is measured in high dimensions, but its actual variation lives on a low-dimensional surface (plus noise).

   [Figure: data points in (x1, x2, x3) space lying on a low-dimensional surface]

   Example: a 64 x 64 bitmap is a point in {0, 1}^4096. There is irrelevant noise (variation in stroke width) and a much smaller dimension of variation in the digit.

2. Feature extraction, rather than feature selection: new features are a linear combination of the originals (not a subset).

3. Visualization
Projection to lower dimensions

Dimensionality reduction usually involves determining a projection from R^D to R^d, where D >> d, and often d = 2.

If the projection is linear, then it can be written as a matrix: each D-vector x is mapped to a d-vector y by a d x D matrix W,

       y = W x
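As a toy illustration of the shapes involved (my own sketch, not from the slides; the random matrix stands in for whatever projection is actually chosen):

import numpy as np

D, d = 4096, 2                       # original and reduced dimensions
rng = np.random.default_rng(0)

W = rng.standard_normal((d, D))      # a d x D projection matrix (random here, purely for illustration)
x = rng.standard_normal(D)           # one data point in R^D
y = W @ x                            # its projection, a d-vector
print(y.shape)                       # (2,)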
Principal Component Analysis (PCA)

Determine a set of (orthogonal) axes which best represent the data.

[Figure: data points with principal axes u1 and u2]

PCA gives a linear projection from R^D to R^d, i.e. a d x D matrix applied to each data point.
Principal Component Analysis (PCA)

Determine a set of (orthogonal) axes which best represent the data.

Step 1: compute the data centroid, c.
Step 2: compute the principal axes, u_i.

[Figure: the centroid c and the principal axes u1, u2]
Principal Component Analysis (PCA)

Given a set of N data points x_i in R^D:

1. Centre the data: compute the centroid

       c = \frac{1}{N} \sum_{i=1}^{N} x_i

   and transform the data so that c becomes the new origin:

       x_i \leftarrow x_i - c
Principal Component Analysis (PCA)

2a. Compute the first principal axis: determine the direction that best explains (or approximates) the data. Find a direction (unit vector) u such that

       \sum_i \left\| x_i - (u^\top x_i)\, u \right\|^2

is minimized, or equivalently such that

       \sum_i \left( u^\top x_i \right)^2

is maximized. This is the direction of maximum variation.

Introduce a Lagrange multiplier \lambda to enforce \|u\| = 1, and find the stationary points of

       L = \sum_i \left( u^\top x_i \right)^2 + \lambda \left( 1 - u^\top u \right)

w.r.t. u.
       L = \sum_i u^\top \left( x_i x_i^\top \right) u + \lambda \left( 1 - u^\top u \right)
         = u^\top S u + \lambda \left( 1 - u^\top u \right)

where S is the D x D symmetric matrix S = \sum_i x_i x_i^\top. Then

       \frac{dL}{du} = 2 S u - 2 \lambda u = 0

and hence

       S u = \lambda u

i.e. u is an eigenvector of S. Thus the variation

       \sum_i \left( u^\top x_i \right)^2 = u^\top S u = \lambda\, u^\top u = \lambda

is maximized by the eigenvector u_1 corresponding to the largest eigenvalue \lambda_1 of S. u_1 is the first principal component.

2b. Now compute the next axis, which has the most variation and is orthogonal to u_1.

This must again be an eigenvector of S, since S u = \lambda u gives all the stationary points of the variation, and hence it is given by u_2, the eigenvector corresponding to the second largest eigenvalue of S. Why?

u_2 is the second principal component.

Continuing in this manner, it can be seen that the d principal components of the data are the d eigenvectors of S with the largest eigenvalues.

[Figure: data points with principal axes u1 and u2]
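As a quick numerical check of this derivation (my own sketch, not part of the slides), the top eigenvector of S = \sum_i x_i x_i^\top does achieve the largest value of \sum_i (u^\top x_i)^2 among unit directions:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 200)) * np.array([[3.0], [1.0], [0.2]])  # columns are data points
X = X - X.mean(axis=1, keepdims=True)          # centre the data

S = X @ X.T                                    # S = sum_i x_i x_i^T
eigvals, eigvecs = np.linalg.eigh(S)           # ascending eigenvalues for symmetric S
u1 = eigvecs[:, -1]                            # eigenvector with the largest eigenvalue

def variation(u):
    return np.sum((u @ X) ** 2)                # sum_i (u^T x_i)^2

for _ in range(5):                             # compare against random unit directions
    u = rng.standard_normal(3)
    u /= np.linalg.norm(u)
    assert variation(u1) >= variation(u)

print(variation(u1), eigvals[-1])              # equal: u1^T S u1 = lambda_1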
Example

Data: three points x_1, x_2, x_3 in R^3:

       x_1 = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} \quad
       x_2 = \begin{pmatrix} 2 \\ 2 \\ 1 \end{pmatrix} \quad
       x_3 = \begin{pmatrix} 3 \\ 3 \\ 1 \end{pmatrix}

The centroid is \bar{x} = (2, 2, 1)^\top, and so the centred data is:

       x_1 = \begin{pmatrix} -1 \\ -1 \\ 0 \end{pmatrix} \quad
       x_2 = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} \quad
       x_3 = \begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}

[Figure: the three points lie in the plane z = 1]

Write X = [x_1, x_2, x_3]; then

       S = \sum_i x_i x_i^\top = X X^\top
         = \begin{pmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}
           \begin{pmatrix} -1 & -1 & 0 \\ 0 & 0 & 0 \\ 1 & 1 & 0 \end{pmatrix}
         = \begin{pmatrix} 2 & 2 & 0 \\ 2 & 2 & 0 \\ 0 & 0 & 0 \end{pmatrix}
and its eigendecomposition is:

       S = [u_1, u_2, u_3]
           \begin{pmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{pmatrix}
           [u_1, u_2, u_3]^\top
         = \begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 \\ 0 & 0 & 1 \end{pmatrix}
           \begin{pmatrix} 4 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}
           \begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 0 \\ \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 0 \\ 0 & 0 & 1 \end{pmatrix}^\top

Then y_i = u_1^\top x_i, and y_i = \{ -\sqrt{2},\ 0,\ \sqrt{2} \} for the three points x_i.

[Figure: the centred points x1, x2, x3 with the principal axes u1 and u2]
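The numbers in this example can be checked with a few lines of NumPy (my own sketch; note that eigenvectors are only defined up to sign):

import numpy as np

X = np.array([[1., 2., 3.],
              [1., 2., 3.],
              [1., 1., 1.]])             # columns are the points x_1, x_2, x_3

c = X.mean(axis=1, keepdims=True)        # centroid (2, 2, 1)^T
Xc = X - c                               # centred data

S = Xc @ Xc.T                            # S = sum_i x_i x_i^T
eigvals, U = np.linalg.eigh(S)           # ascending eigenvalues
u1 = U[:, -1]                            # first principal component, (1, 1, 0)/sqrt(2) up to sign

print(eigvals[-1])                       # largest eigenvalue, 4
print(u1 @ Xc)                           # y_i = u1^T x_i: -sqrt(2), 0, sqrt(2) (up to sign)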
The PCA Algorithm

Given data \{ x_1, x_2, \ldots, x_N \} in R^D:

1. Compute the centroid and centre the data:

       c = \frac{1}{N} \sum_{i=1}^{N} x_i

   and transform the data so that c becomes the new origin: x_i \leftarrow x_i - c.

2. Write the centred data as X = [x_1, x_2, \ldots, x_N] and compute the covariance matrix

       S = \frac{1}{N} \sum_i x_i x_i^\top = \frac{1}{N} X X^\top

3. Compute the eigendecomposition of S:

       S = U D U^\top

4. The principal components are the columns u_i of U, ordered by the magnitude of the eigenvalues.

5. The dimensionality of the data is reduced to d by the projection

       y = U_d^\top x

   where U_d contains the first d columns of U, and y is a d-vector.
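These five steps translate directly into a few lines of NumPy. The following is a minimal sketch (the function and variable names are mine, not from the lecture):

import numpy as np

def pca(X, d):
    """PCA as in steps 1-5 above. X has one data point per column (D x N); returns
    the d x N matrix of projected points, the centroid, and the first d principal axes."""
    # 1. centre the data
    c = X.mean(axis=1, keepdims=True)
    Xc = X - c
    # 2. covariance matrix S = (1/N) X X^T
    N = X.shape[1]
    S = (Xc @ Xc.T) / N
    # 3. eigendecomposition (eigh returns ascending eigenvalues for a symmetric matrix)
    eigvals, U = np.linalg.eigh(S)
    # 4. order the principal components by decreasing eigenvalue
    order = np.argsort(eigvals)[::-1]
    U = U[:, order]
    # 5. project onto the first d principal components
    Ud = U[:, :d]
    Y = Ud.T @ Xc
    return Y, c, Ud

# usage: 500 points in R^10 whose variation is concentrated in a few directions
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 500)) * np.linspace(3, 0.1, 10)[:, None]
Y, c, Ud = pca(X, d=2)
print(Y.shape)   # (2, 500)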
Notes

• PCA is a linear transformation that rotates the data so that it is maximally decorrelated.
• Often each coordinate is first transformed independently to have unit variance (a sketch follows below). Why?
• A limitation of PCA is its linearity: it can't fit curved surfaces well. We will return to this problem later.
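One possible way to do that per-coordinate scaling before running PCA (my own sketch, not prescribed by the slides):

import numpy as np

# X has one data point per column (D x N)
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 100)) * np.array([[100.0], [1.0], [0.01], [5.0]])

# standardise each coordinate to zero mean and unit variance,
# so that no coordinate dominates S simply because of its units
mu = X.mean(axis=1, keepdims=True)
sigma = X.std(axis=1, keepdims=True)
Xs = (X - mu) / sigma
print(Xs.std(axis=1))   # all (approximately) 1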
Example: Visualization

Suppose we are given high-dimensional data and want to get some idea of its distribution, e.g. the "iris" dataset: three classes, 50 instances per class, 4 attributes.

       x_1 = \begin{pmatrix} 0.2911 \\ 0.5909 \\ 0.5942 \\ 0.8400 \end{pmatrix} \quad
       x_2 = \begin{pmatrix} 0.2405 \\ 0.3636 \\ 0.5942 \\ 0.8400 \end{pmatrix} \quad \ldots \quad
       x_{150} = \begin{pmatrix} 0.4937 \\ 0.3636 \\ 0.4783 \\ 0.4400 \end{pmatrix}
       \qquad y_i \in \{1, 2, 3\}

[Plot: the 150 samples projected onto the first two principal components]

• the data can be visualized
• in this case the data can (almost) be classified using the 2 principal components as new feature vectors
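One way to reproduce this kind of plot is with scikit-learn's bundled copy of the iris data (my own sketch; the exact scaling of the attributes used on the slide may differ):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)   # 150 x 4, each attribute scaled to unit variance
Y = PCA(n_components=2).fit_transform(X)        # project onto the first two principal components

plt.scatter(Y[:, 0], Y[:, 1], c=iris.target)    # colour by class y_i in {1, 2, 3}
plt.xlabel('first principal component')
plt.ylabel('second principal component')
plt.show()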
How much is lost by the PCA approximation?

The eigenvectors U provide an orthogonal basis for any x in R^D:

       x = \sum_{j=1}^{D} ( u_j^\top x )\, u_j

The PCA approximation with d principal components is

       \tilde{x} = \sum_{j=1}^{d} ( u_j^\top x )\, u_j

and so the error is

       x - \tilde{x} = \sum_{j=d+1}^{D} ( u_j^\top x )\, u_j

Using u_j^\top u_k = \delta_{jk}, the squared error is

       \| x - \tilde{x} \|^2 = \Big\| \sum_{j=d+1}^{D} ( u_j^\top x )\, u_j \Big\|^2 = \sum_{j=d+1}^{D} u_j^\top ( x x^\top )\, u_j
Hence the mean squared error is

       \frac{1}{N} \sum_{i=1}^{N} \| x_i - \tilde{x}_i \|^2
         = \frac{1}{N} \sum_{j=d+1}^{D} u_j^\top \Big( \sum_{i=1}^{N} x_i x_i^\top \Big) u_j
         = \sum_{j=d+1}^{D} u_j^\top S u_j

and since S u_j = \lambda_j u_j,

       \frac{1}{N} \sum_{i=1}^{N} \| x_i - \tilde{x}_i \|^2 = \sum_{j=d+1}^{D} u_j^\top S u_j = \sum_{j=d+1}^{D} \lambda_j

• the (squared reconstruction) error is given by the sum of the eigenvalues of the unused eigenvectors
• the error is minimized by discarding the directions with the smallest eigenvalues (as expected)
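A quick numerical confirmation of this identity (my own sketch, using the covariance convention S = (1/N) X X^T from the algorithm slide):

import numpy as np

rng = np.random.default_rng(0)
D, N, d = 6, 1000, 2
X = rng.standard_normal((D, N)) * np.linspace(4, 0.2, D)[:, None]
Xc = X - X.mean(axis=1, keepdims=True)

S = (Xc @ Xc.T) / N
eigvals, U = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

Ud = U[:, :d]
Xrec = Ud @ (Ud.T @ Xc)                       # reconstruction from d principal components
mse = np.mean(np.sum((Xc - Xrec) ** 2, axis=0))
print(mse, eigvals[d:].sum())                 # equal (up to rounding): sum of unused eigenvalues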
Example: Compression

Natural application: we can choose "how much" of the data to keep.

• Represent the image by patches of size s x s pixels
• Compute the PCA over all patches (each patch is an s^2-vector)
• Project each patch onto d principal components (a sketch follows below)

[Figures: original image; splitting into patches, s = 16, D = s^2 = 256; compressed image with d = 20]

[Plot: reconstruction error (MSE) against output dimension d]

• d = 40: ratio (compressed/original) = 31.25%
• d = 20: ratio (compressed/original) = 15.63%
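A rough sketch of this patch-based compression scheme (my own code; it assumes a greyscale image whose sides are multiples of s, and the helper name compress_patches is mine):

import numpy as np

def compress_patches(img, s=16, d=20):
    """Split a greyscale image into s x s patches, PCA them, and keep d coefficients per patch."""
    H, W = img.shape
    # each column is one flattened s*s patch
    patches = np.stack([img[i:i+s, j:j+s].ravel()
                        for i in range(0, H, s) for j in range(0, W, s)], axis=1)
    c = patches.mean(axis=1, keepdims=True)
    Pc = patches - c
    S = (Pc @ Pc.T) / Pc.shape[1]
    _, U = np.linalg.eigh(S)
    Ud = U[:, ::-1][:, :d]                    # d principal components (largest eigenvalues first)
    coeffs = Ud.T @ Pc                        # d numbers per patch instead of s*s pixels
    recon = Ud @ coeffs + c                   # approximate patches rebuilt from the d coefficients
    return coeffs, recon

img = np.random.rand(256, 256)                # stand-in image, purely for illustration
coeffs, recon = compress_patches(img, s=16, d=20)
print(coeffs.shape)                           # (20, 256): 20 coefficients for each of the 256 patches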
Example: Graphics, PCA for faces

• 3D PCA
• 3D faces (CyberScan faces)
• Thomas Vetter, Sami Romdhani, Volker Blanz

[Figures: original face and the fitted PCA face model]
Isomap

Limitations of linear methods for dimensionality reduction: the images in each row cannot be expressed as a linear combination of the others.

The Swiss Roll Problem

• Would like to unravel the local structure
• Preserve the intrinsic "manifold" structure
• Need more than linear methods
Isomap

Starting point: MDS, a linear method.

Another formulation of PCA (called Multi-Dimensional Scaling) arranges the low-dimensional points so as to minimize the discrepancy between the pairwise distances in the original space and the pairwise distances in the low-d space:

       Cost = \sum_{ij} \big( \| x_i - x_j \| - \| y_i - y_j \| \big)^2

where \| x_i - x_j \| is the high-D distance and \| y_i - y_j \| the low-d distance.

slide credit: Geoffrey Hinton
Isomap

Instead of measuring the actual Euclidean distances between points (in the high-dimensional space), measure the distances along the manifold and then model these intrinsic distances.

The main problem is to find a robust way of measuring distances along the manifold.

If we can measure manifold distances, the global optimisation is easy: it's just PCA.

[Figure: a 1-D manifold embedded in 2-D; if we measure distances along the manifold, d(1,6) > d(1,4)]

slide credit: Geoffrey Hinton
How Isomap measures intrinsic distances

• Connect each datapoint to its K nearest neighbours in the high-dimensional space.
• Put the true Euclidean distance on each of these links.
• Then approximate the manifold distance between any pair of points A, B as the shortest path in this graph (see the sketch after the cost function below).

slide credit: Geoffrey Hinton
Intrinsic distances by shortest paths between neighbours

       Cost = \sum_{ij} \big( \| x_i - x_j \| - \| y_i - y_j \| \big)^2

where \| x_i - x_j \| is now the high-D intrinsic (shortest-path) distance and \| y_i - y_j \| the low-d distance.
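A possible sketch of the intrinsic-distance computation (my own code using scikit-learn and SciPy; the library calls, the value K = 10, and the toy data are my assumptions, not the lecture's):

import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

# toy data: points along a noisy 1-D spiral embedded in 2-D
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 3 * np.pi, 400))
X = np.c_[t * np.cos(t), t * np.sin(t)] + 0.05 * rng.standard_normal((400, 2))

# 1. connect each point to its K nearest neighbours, weighted by Euclidean distance
K = 10
G = kneighbors_graph(X, n_neighbors=K, mode='distance')

# 2. approximate manifold distances by shortest paths in this graph
#    (assumes the K-NN graph is connected, otherwise some distances are infinite)
D_intrinsic = shortest_path(G, directed=False)

# 3. embed with classical MDS on the intrinsic distances (double centring + top eigenvectors)
n = D_intrinsic.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D_intrinsic ** 2) @ J
eigvals, eigvecs = np.linalg.eigh(B)
Y = eigvecs[:, -1:] * np.sqrt(eigvals[-1])     # 1-D embedding of the unrolled curve
print(Y.shape)                                 # (400, 1)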
Example 1
2000 64 x 64 hand images: R^{64 x 64} -> R^2

Example 2
Unsupervised embedding of the digits 0-4 from MNIST (not all the data is displayed): R^{28 x 28} -> R^2

Example 3
Unsupervised embedding of 100 flower classes based on their shape: R^{8000} -> R^2

Example 4
Unsupervised embedding of 188749 tracks (from 10289 artists, 67799 albums); only a sample is shown here.
"Fast Embedding of Sparse Music Similarity Graphs", John Platt, Proc. NIPS 2004.
Background reading

• Bishop, chapter 12
• Other dimensionality reduction methods: Multi-Dimensional Scaling (MDS), Locally Linear Embedding (LLE)
• More on the web page: http://www.robots.ox.ac.uk/~az/lectures/ml