Dimensionality Reduction: PCA, SVD, MDS, ICA, and Friends
kNN
Rocchio
NB
Unsupervised feature selection
Differs from feature selection in two ways:
Instead of choosing a subset of the features,
create new features (dimensions) defined as
functions over all features
Don’t consider class labels, just the data points
Unsupervised feature selection
Idea:
Given data points in d-dimensional space,
project into a lower-dimensional space while preserving
as much information as possible
E.g., find the best planar approximation to 3D data
E.g., find the best planar approximation to 10^4-dimensional data
Principal Components
[Figure: 2D scatter plot showing the 1st and 2nd principal vectors]
The 1st principal vector gives the best axis to project on (minimum RMS error)
Principal vectors are orthogonal
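A minimal numpy sketch of this idea (illustrative only, not from the original slides; the function name is my own): the principal vectors are the eigenvectors of the data's covariance matrix, sorted by eigenvalue, and projecting onto the first one gives the minimum-RMS-error axis.

```python
import numpy as np

def principal_components(X):
    """Eigenvalues/eigenvectors of the covariance of X (rows = points),
    sorted by decreasing eigenvalue."""
    Xc = X - X.mean(axis=0)              # center the data
    cov = Xc.T @ Xc / (len(X) - 1)       # d x d covariance matrix
    vals, vecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
    order = np.argsort(vals)[::-1]       # sort by decreasing eigenvalue
    return vals[order], vecs[:, order]

# project 2-D points onto the 1st principal vector
X = np.random.randn(100, 2) @ np.array([[2.0, 0.5], [0.5, 1.0]])
vals, vecs = principal_components(X)
proj = (X - X.mean(axis=0)) @ vecs[:, :1]   # coordinates along the best axis
```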
How many components?
Check the distribution of eigenvalues
Take enough eigenvectors to cover 80-90%
of the variance
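A small sketch of this rule (illustrative): compute the cumulative fraction of variance and keep the smallest k that clears the threshold.

```python
import numpy as np

def choose_k(eigenvalues, threshold=0.90):
    """Smallest k whose top-k eigenvalues cover `threshold` of the variance."""
    frac = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    return int(np.searchsorted(frac, threshold)) + 1

# eigenvalues 5.0, 3.0, 1.5, 0.4, 0.1 -> cumulative 50%, 80%, 95%, ... -> k = 3
print(choose_k(np.array([5.0, 3.0, 1.5, 0.4, 0.1])))
```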
Sensor networks
Given a 54x54 matrix of pairwise link qualities
Do PCA
Project down to 2 principal dimensions
PCA discovered the map of the lab
Problems and limitations
What if the data has very many dimensions?
e.g., images (d ≥ 10^4)
Problem:
Covariance matrix Σ has size d x d
d = 10^4 gives |Σ| = 10^8 entries
SVD
A = U Σ V^T, e.g.:

    [1 1 1 0 0]
    [2 2 2 0 0]
    [1 1 1 0 0]               [σ1  0 ]   [--v1--]
    [5 5 5 0 0] = [u1 u2]  x  [0   σ2] x [--v2--]
    [0 0 0 2 2]
    [0 0 0 3 3]
    [0 0 0 1 1]
SVD - Interpretation
‘documents’, ‘terms’ and ‘concepts’:
U: document-to-concept similarity matrix
V: term-to-concept similarity matrix
Σ: its diagonal elements give the ‘strength’ of each concept
Projection:
the best axis to project on (‘best’ = minimum sum of
squared projection errors)
SVD - Example
A = U Σ V^T - example: terms are {data, inf., retrieval, brain, lung};
the first four documents are CS papers, the last three are MD papers:

       data inf. retr. brain lung
  CS  [  1    1    1     0    0 ]   [0.18  0   ]
  CS  [  2    2    2     0    0 ]   [0.36  0   ]
  CS  [  1    1    1     0    0 ]   [0.18  0   ]   [9.64  0   ]   [0.58 0.58 0.58 0    0   ]
  CS  [  5    5    5     0    0 ] = [0.90  0   ] x [0     5.29] x [0    0    0    0.71 0.71]
  MD  [  0    0    0     2    2 ]   [0     0.53]
  MD  [  0    0    0     3    3 ]   [0     0.80]
  MD  [  0    0    0     1    1 ]   [0     0.27]

Reading the factors:
U: doc-to-concept similarity matrix (1st column = CS-concept, 2nd column = MD-concept)
Σ: diagonal gives the ‘strength’ of each concept (9.64 for the CS-concept, 5.29 for the MD-concept)
V^T: term-to-concept similarity matrix (the CS-concept loads on data/inf./retrieval, the MD-concept on brain/lung)
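The example can be reproduced numerically; a hedged numpy check (the matrix is the one above, the code itself is illustrative):

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 2))         # [9.64 5.29 0. 0. 0.] - two concepts
print(np.round(U[:, :2], 2))  # doc-to-concept similarities (up to sign)
print(np.round(Vt[:2], 2))    # term-to-concept similarities (up to sign)
```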
SVD – Dimensionality reduction
Q: how exactly is dim. reduction done?
A: set the smallest singular values to zero:
    [1 1 1 0 0]   [0.18  0   ]
    [2 2 2 0 0]   [0.36  0   ]
    [1 1 1 0 0]   [0.18  0   ]   [9.64  0   ]   [0.58 0.58 0.58 0    0   ]
    [5 5 5 0 0] = [0.90  0   ] x [0     5.29] x [0    0    0    0.71 0.71]
    [0 0 0 2 2]   [0     0.53]
    [0 0 0 3 3]   [0     0.80]
    [0 0 0 1 1]   [0     0.27]

(the smallest singular value, 5.29, is the one to zero out)
SVD - Dimensionality reduction
    [1 1 1 0 0]   [0.18]
    [2 2 2 0 0]   [0.36]
    [1 1 1 0 0]   [0.18]
    [5 5 5 0 0] ~ [0.90] x [9.64] x [0.58 0.58 0.58 0 0]
    [0 0 0 2 2]   [0   ]
    [0 0 0 3 3]   [0   ]
    [0 0 0 1 1]   [0   ]
SVD - Dimensionality reduction
    [1 1 1 0 0]   [1 1 1 0 0]
    [2 2 2 0 0]   [2 2 2 0 0]
    [1 1 1 0 0]   [1 1 1 0 0]
    [5 5 5 0 0] ~ [5 5 5 0 0]
    [0 0 0 2 2]   [0 0 0 0 0]
    [0 0 0 3 3]   [0 0 0 0 0]
    [0 0 0 1 1]   [0 0 0 0 0]

(the rank-1 approximation keeps the CS rows exactly and zeroes out the MD rows)
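A short numpy sketch of the truncation step (illustrative): zeroing the smaller singular value is the same as keeping only the leading columns of U, s, and V^T.

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0], [0, 0, 0, 2, 2], [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 1                                     # keep only the strongest concept
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]  # best rank-k approximation
print(np.round(A_k))  # CS rows reproduced exactly, MD rows become 0
```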
LSI (latent semantic indexing)
Q1: How to do queries with LSI?
A: map query vectors into ‘concept space’ – how?
(Recall the example decomposition A = U Σ V^T above: U maps documents to the
CS- and MD-concepts, Σ holds their strengths 9.64 and 5.29, and V^T maps
terms to concepts.)
LSI (latent semantic indexing)
Q: How to do queries with LSI?
A: map query vectors into ‘concept space’ – how?
E.g., the query q = [1 0 0 0 0] asks for the single term ‘data’
(terms: data, inf., retrieval, brain, lung)
[Figure: q and the concept vectors v1, v2 plotted in term space (term1, term2)]
A: inner product (cosine similarity)
with each ‘concept’ vector vi
LSI (latent semantic indexing)
compactly, we have:
q_concept = q V
e.g. (terms: data, inf., retrieval, brain, lung):

                     [0.58  0   ]
                     [0.58  0   ]
q = [1 0 0 0 0]  x   [0.58  0   ]  =  [0.58  0]
                     [0     0.71]
                     [0     0.71]

(V holds the term-to-concept similarities; the first output entry is the CS-concept)
Multi-lingual IR
(English query, on Spanish text?)
Problem:
given many documents, translated into both
languages (e.g., English and Spanish),
answer queries across languages
Little example
How would the document (‘information’, ‘retrieval’)
be handled by LSI? A: the SAME way:
d_concept = d V
e.g. (terms: data, inf., retrieval, brain, lung):

                     [0.58  0   ]
                     [0.58  0   ]
d = [0 1 1 0 0]  x   [0.58  0   ]  =  [1.16  0]
                     [0     0.71]
                     [0     0.71]

(V holds the term-to-concept similarities; the first output entry is the CS-concept)
Little example
Observation: the document (‘information’, ‘retrieval’) will
be retrieved by the query (‘data’), although it does
not contain ‘data’!
In concept space (CS-concept first):
d = [0 1 1 0 0]  ->  d_concept = [1.16  0]
q = [1 0 0 0 0]  ->  q_concept = [0.58  0]
Both vectors lie along the CS-concept axis, so their similarity is high.
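A hedged numpy sketch of this whole effect (the matrix is the example above; the code is illustrative): map both q and d into concept space with V and compare them there.

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0], [0, 0, 0, 2, 2], [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt[:2].T                    # term-to-concept matrix (2 concepts)

q = np.array([1, 0, 0, 0, 0])   # query: 'data'
d = np.array([0, 1, 1, 0, 0])   # document: 'information', 'retrieval'

q_c, d_c = q @ V, d @ V         # map into concept space
cos = q_c @ d_c / (np.linalg.norm(q_c) * np.linalg.norm(d_c))
print(np.round(q_c, 2), np.round(d_c, 2), round(float(cos), 2))
# no shared terms, yet both load on the CS-concept: cosine ~ 1.0
```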
Multi-lingual IR
Solution: ~ LSI
Concatenate each document with its translation, so every row carries both
the English terms (data, inf., retrieval, brain, lung) and the Spanish
terms (datos, informacion, ...)
Do SVD on the concatenated term-document matrix
Now when a new document comes, project it into concept space
Measure similarity in concept space

        English terms     Spanish terms
CS  [ 1 1 1 0 0  |  1 1 1 0 0 ]
CS  [ 2 2 2 0 0  |  1 2 2 0 0 ]
CS  [ 1 1 1 0 0  |  1 1 1 0 0 ]
CS  [ 5 5 5 0 0  |  5 5 4 0 0 ]
MD  [ 0 0 0 2 2  |  0 0 0 2 2 ]
MD  [ 0 0 0 3 3  |  ...       ]
MD  [ 0 0 0 1 1  |  ...       ]
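A sketch of the cross-lingual trick under simplifying assumptions (here the Spanish counts simply mirror the English ones, which is illustrative only):

```python
import numpy as np

A_en = np.array([[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0],
                 [5, 5, 5, 0, 0], [0, 0, 0, 2, 2], [0, 0, 0, 3, 3],
                 [0, 0, 0, 1, 1]], dtype=float)
A_es = A_en.copy()              # assume perfect translations (illustrative)
A = np.hstack([A_en, A_es])     # concatenate English and Spanish columns

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt[:2].T                    # term-to-concept map over BOTH vocabularies

q_en = np.concatenate([[1, 0, 0, 0, 0], np.zeros(5)])  # English-only query
d_es = np.concatenate([np.zeros(5), [1, 1, 1, 0, 0]])  # Spanish-only document
q_c, d_c = q_en @ V, d_es @ V
print(round(float(q_c @ d_c / (np.linalg.norm(q_c) * np.linalg.norm(d_c))), 2))
# ~1.0: the English query matches the Spanish document in concept space
```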
Visualization of text
Given a set of documents, how could we
visualize them over time?
Idea:
Perform PCA
Project documents down to 2 dimensions
See how the cluster centers change – observe the
words in the cluster over time
Example:
Our paper with Andreas and Carlos at ICML 2006
Eigenvectors and eigenvalues on graphs
[Figure: example graph on nodes 1-5]
(Simplified) PageRank algorithm
A^T p = p
where A is the adjacency matrix of the example graph, normalized so that each
column of A^T sums to 1 (‘From’ node -> ‘To’ node), and p = (p1, ..., p5)
[Figure: 5-node example graph with its normalized adjacency matrix]
(Simplified) PageRank algorithm
A^T p = 1 * p
thus, p is the eigenvector that corresponds to
the highest eigenvalue (= 1, since the matrix is column-normalized)
formal definition of eigenvector/value: soon
PageRank: How do I calculate it fast?
If A is an (n x n) square matrix, then
(λ, x) is an eigenvalue/eigenvector pair of A if
A x = λ x
(cf. A^T p = p above: p is an eigenvector of A^T with λ = 1)
[Example matrix: A = [2 1; 1 3]]
Power Iteration - Intuition
By definition, eigenvectors remain parallel to
themselves (‘fixed points’, A x = λ x):

 λ1  *   v1   =    A    x   v1
       [0.52]   [2  1]   [0.52]
3.62 * [0.85] = [1  3] x [0.85]
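A minimal power-iteration sketch (numpy; illustrative): repeatedly multiply by A and re-normalize; the iterate converges to the dominant eigenvector, which for the PageRank matrix A^T is exactly p.

```python
import numpy as np

def power_iteration(A, iters=100):
    """Dominant eigenvalue/eigenvector of A (largest |lambda|)."""
    x = np.ones(A.shape[0])
    for _ in range(iters):
        x = A @ x                  # one matrix-vector product per step
        x /= np.linalg.norm(x)     # re-normalize to avoid overflow
    return x @ A @ x, x            # Rayleigh quotient, eigenvector

A = np.array([[2.0, 1.0], [1.0, 3.0]])
lam, v = power_iteration(A)
print(round(float(lam), 2), np.round(v, 2))
# 3.62 [0.53 0.85] - the slide's 3.62 * (0.52, 0.85), up to rounding
```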
Many PCA-like approaches
Multi-dimensional scaling (MDS):
Given a matrix of pairwise distances between points
We want a lower-dimensional representation that best
preserves the distances (see the sketch after this list)
Independent component analysis (ICA):
Find directions that are most statistically independent
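One common way to realize MDS is classical (Torgerson) scaling; a hedged numpy sketch, assuming Euclidean distances:

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed points in k dimensions from a pairwise-distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:k]      # top-k eigenpairs
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))

# three points on a line with distances 1, 1, 2: recovered up to sign/shift
D = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]])
print(np.round(classical_mds(D, k=1), 2))   # [[ 1.] [ 0.] [-1.]] up to sign
```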
Acknowledgements
Some of the material is borrowed from lectures
of Christos Faloutsos and Tom Mitchell