Dimensionality Reduction: Principal Component Analysis and SVD
Goal is to compress the original representation of the inputs $\mathbf{x}_n \in \mathbb{R}^D$.

Example: approximate each input $\mathbf{x}_n \in \mathbb{R}^D$ as a linear combination of $K$ “basis” vectors $\mathbf{w}_1, \dots, \mathbf{w}_K$, each also in $\mathbb{R}^D$:

$\mathbf{x}_n \approx \sum_{k=1}^{K} z_{nk}\, \mathbf{w}_k = \mathbf{W} \mathbf{z}_n$

(More generally, the mapping from the compressed representation $\mathbf{z}_n$ back to $\mathbf{x}_n$ need not be linear; for example, it can be modeled by a kernel or a deep neural net.)

Note: these “basis” vectors need not necessarily be linearly independent. But for some dim-red techniques, e.g., classic principal component analysis (PCA), they are.

We have represented each $\mathbf{x}_n \in \mathbb{R}^D$ by a $K$-dim vector $\mathbf{z}_n$ (a new feature representation).
To store $N$ such inputs, we need to keep $\mathbf{W}$ (of size $D \times K$) and the $\mathbf{z}_n$'s.
Originally we required $O(ND)$ storage; now we need $O(DK + NK)$ storage.
If $K \ll D$, this yields substantial storage savings, hence good compression.
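A minimal numpy sketch of this compression idea (purely illustrative: random data and a random orthonormal basis, with hypothetical sizes N, D, K):

```python
import numpy as np

# Hypothetical sizes with K << D (purely illustrative)
N, D, K = 1000, 784, 50
rng = np.random.default_rng(0)

X = rng.normal(size=(N, D))                    # original inputs, one per row: O(N*D) numbers
W = np.linalg.qr(rng.normal(size=(D, K)))[0]   # some D x K basis (here: a random orthonormal one)

Z = X @ W                                      # K-dim codes z_n = W^T x_n: O(N*K) numbers
X_hat = Z @ W.T                                # reconstructions x_hat_n = W z_n

print("original storage  :", N * D, "numbers")
print("compressed storage:", N * K + D * K, "numbers (Z plus W)")
print("avg sq. reconstruction error:", np.mean(np.sum((X - X_hat) ** 2, axis=1)))
```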
Dimensionality Reduction
Dim-red for face images: each “basis” image $\mathbf{w}_k$ is like a “template” that captures the common properties of the face images in the dataset.

[Figure: a face image expressed as a weighted combination of $K=4$ “basis” face images]

$\mathbf{x}_n \approx z_{n1}\mathbf{w}_1 + z_{n2}\mathbf{w}_2 + z_{n3}\mathbf{w}_3 + z_{n4}\mathbf{w}_4$

Essentially, each face image in the dataset is now represented by just 4 real numbers $(z_{n1}, z_{n2}, z_{n3}, z_{n4})$.

Different dim-red algorithms differ in how the basis vectors are defined/learned, and, in general, in how the function $f$ in the mapping $\mathbf{x}_n \approx f(\mathbf{z}_n)$ is defined.
Principal Component Analysis (PCA)
A classic linear dim. reduction method (Pearson, 1901; Hotelling, 1933)
Can be seen as:
Learning the directions (co-ordinate axes) that capture the maximum variance in the data
Learning the projection directions that result in the smallest reconstruction error

PCA is essentially doing a change of axes in which we represent the data.

[Figure: a 2-D point cloud shown with the standard co-ordinate axes $(\mathbf{e}_1, \mathbf{e}_2)$ and the new co-ordinate axes $(\mathbf{w}_1, \mathbf{w}_2)$; the data has the largest variance along $\mathbf{w}_1$ and the smallest variance along $\mathbf{w}_2$]

Each input will still have 2 co-ordinates in the new co-ordinate system, equal to the distances measured from the new origin along $\mathbf{w}_1$ and $\mathbf{w}_2$.

To reduce dimension, we can keep only the co-ordinates along the directions that have the largest variances (e.g., in this example, if we want to reduce to one dimension, we can keep each point's co-ordinate along $\mathbf{w}_1$ and throw away its co-ordinate along $\mathbf{w}_2$). We won't lose much information.

The reconstruction-error view corresponds to solving

$\arg\min_{\mathbf{W}, \mathbf{Z}} \sum_{n=1}^{N} \|\mathbf{x}_n - \mathbf{W}\mathbf{z}_n\|^2 = \arg\min_{\mathbf{W}, \mathbf{Z}} \|\mathbf{X} - \mathbf{Z}\mathbf{W}^\top\|_F^2$

subject to orthonormality constraints on the projection directions: $\mathbf{w}_k^\top \mathbf{w}_k = 1$ for all $k$, and $\mathbf{w}_k^\top \mathbf{w}_{k'} = 0$ for $k \neq k'$.
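A small numpy sketch of the change-of-axes view, using assumed synthetic 2-D data and an arbitrary pair of orthonormal directions (not necessarily the PCA directions):

```python
import numpy as np

rng = np.random.default_rng(4)
# Assumed synthetic data: an elongated 2-D point cloud
X = rng.normal(size=(1000, 2)) @ np.array([[3.0, 0.5], [0.5, 0.3]])
Xc = X - X.mean(axis=0)                      # work with centered data

theta = np.pi / 6                            # some new orthonormal axes w_1, w_2
w1 = np.array([np.cos(theta), np.sin(theta)])
w2 = np.array([-np.sin(theta), np.cos(theta)])

z1, z2 = Xc @ w1, Xc @ w2                    # each point's co-ordinates in the new axes
print("variance along w1:", z1.var(), " along w2:", z2.var())

# Keep only the w1 co-ordinate of every point (1-D representation)
X_keep = np.outer(z1, w1)
# The total error equals N times the variance along the dropped direction w2
print("error from dropping w2:", np.sum((Xc - X_keep) ** 2), "=", len(Xc) * z2.var())
```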
Solving PCA by Finding Max. Variance Directions
Consider projecting an input $\mathbf{x}_n \in \mathbb{R}^D$ along a direction $\mathbf{w}_1 \in \mathbb{R}^D$ (with $\|\mathbf{w}_1\| = 1$).

The projection/embedding of $\mathbf{x}_n$ (red points in the figure) will be $\mathbf{w}_1^\top \mathbf{x}_n$ (green points in the figure).

[Figure: 2-D data points (red) and their projections (green) onto the direction $\mathbf{w}_1$]

Variance of the projections: $\frac{1}{N}\sum_{n=1}^{N} (\mathbf{w}_1^\top \mathbf{x}_n - \mathbf{w}_1^\top \boldsymbol{\mu})^2 = \mathbf{w}_1^\top \mathbf{S} \mathbf{w}_1$, where $\boldsymbol{\mu}$ is the data mean and $\mathbf{S} = \frac{1}{N}\sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu})(\mathbf{x}_n - \boldsymbol{\mu})^\top$ is the cov matrix of the data.

We want the $\mathbf{w}_1$ that maximizes this variance, subject to $\mathbf{w}_1^\top \mathbf{w}_1 = 1$. Can construct a Lagrangian for this problem:

$\mathcal{L}(\mathbf{w}_1, \lambda_1) = \mathbf{w}_1^\top \mathbf{S} \mathbf{w}_1 + \lambda_1 (1 - \mathbf{w}_1^\top \mathbf{w}_1)$

Taking the derivative w.r.t. $\mathbf{w}_1$ and setting it to zero gives $\mathbf{S}\mathbf{w}_1 = \lambda_1 \mathbf{w}_1$.

Therefore $\mathbf{w}_1$ is an eigenvector of the cov matrix $\mathbf{S}$ with eigenvalue $\lambda_1$. (Note: in general, $\mathbf{S}$ will have $D$ eigenvectors.)

Claim: $\mathbf{w}_1$ is the eigenvector of $\mathbf{S}$ with the largest eigenvalue. Note that $\mathbf{w}_1^\top \mathbf{S} \mathbf{w}_1 = \lambda_1$, so the variance is maximized when $\lambda_1$ is the largest eigenvalue (and $\mathbf{w}_1$ is the corresponding top eigenvector, also known as the first principal component).

Other large-variance directions can be found likewise (with each being orthogonal to all the previous ones) using the eigendecomposition of the cov matrix $\mathbf{S}$; this is PCA. PCA would keep the top $K$ such directions of largest variances.
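A minimal numpy sketch of this procedure on assumed synthetic data: form the covariance matrix, take its eigendecomposition, and keep the top-K eigenvectors as the projection directions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # assumed data: N=500, D=5, correlated

mu = X.mean(axis=0)
Xc = X - mu                            # center the data
S = (Xc.T @ Xc) / X.shape[0]           # D x D covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns ascending eigenvalues for symmetric S
order = np.argsort(eigvals)[::-1]      # re-order by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

K = 2
W = eigvecs[:, :K]                     # top-K principal components w_1, ..., w_K (columns)
Z = Xc @ W                             # K-dim embeddings z_n = W^T (x_n - mu)
print("fraction of variance captured:", eigvals[:K].sum() / eigvals.sum())
```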
Alternate Basis and Reconstruction
Representing a data point $\mathbf{x}_n \in \mathbb{R}^D$ in the standard orthonormal basis $\{\mathbf{e}_1, \dots, \mathbf{e}_D\}$:

$\mathbf{x}_n = \sum_{d=1}^{D} x_{nd}\, \mathbf{e}_d$

where $\mathbf{e}_d$ is a vector of all zeros except a single 1 at the $d$-th position. Also, $\mathbf{e}_d^\top \mathbf{e}_{d'} = 0$ for $d \neq d'$.

Let's represent the same data point using a new orthonormal basis $\{\mathbf{w}_1, \dots, \mathbf{w}_K\}$:

$\mathbf{x}_n \approx \hat{\mathbf{x}}_n = \sum_{d=1}^{K} z_{nd}\, \mathbf{w}_d = \sum_{d=1}^{K} (\mathbf{x}_n^\top \mathbf{w}_d)\, \mathbf{w}_d = \sum_{d=1}^{K} (\mathbf{w}_d \mathbf{w}_d^\top)\, \mathbf{x}_n$

Note that $\|\mathbf{x}_n - \hat{\mathbf{x}}_n\|^2$ is the reconstruction error on $\mathbf{x}_n$; we would like to minimize it w.r.t. the $\mathbf{w}_d$'s.

Now $\mathbf{x}_n$ is represented by the $K$-dim rep. $\mathbf{z}_n = [z_{n1}, \dots, z_{nK}]^\top$, with $z_{nd} = \mathbf{x}_n^\top \mathbf{w}_d$ (verify).

In matrix form, $\mathbf{z}_n = \mathbf{W}^\top \mathbf{x}_n$, where $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_K]$ is the “projection matrix” of size $D \times K$. Also, $\hat{\mathbf{x}}_n = \mathbf{W}\mathbf{z}_n = \mathbf{W}\mathbf{W}^\top \mathbf{x}_n$.
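A small numpy sketch of projection and reconstruction with an arbitrary (assumed) orthonormal projection matrix W:

```python
import numpy as np

rng = np.random.default_rng(1)
D, K = 10, 3
# Any orthonormal D x K projection matrix (here obtained from a QR decomposition)
W = np.linalg.qr(rng.normal(size=(D, K)))[0]

x = rng.normal(size=D)
z = W.T @ x                  # K-dim representation z_n = W^T x_n
x_hat = W @ z                # reconstruction x_hat_n = W z_n = W W^T x_n

# Nonzero in general for K < D: the part of x outside span(w_1, ..., w_K) is lost
print("reconstruction error:", np.sum((x - x_hat) ** 2))
```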
Minimizing Reconstruction Error
We plan to use only $K < D$ directions $\mathbf{w}_1, \dots, \mathbf{w}_K$, so we would like them to be such that the total reconstruction error is minimized:

$\mathcal{L}(\mathbf{w}_1, \dots, \mathbf{w}_K) = \sum_{n=1}^{N} \|\mathbf{x}_n - \hat{\mathbf{x}}_n\|^2 = \sum_{n=1}^{N} \Big\|\mathbf{x}_n - \sum_{d=1}^{K} (\mathbf{w}_d \mathbf{w}_d^\top)\mathbf{x}_n\Big\|^2$

Using the orthonormality of the $\mathbf{w}_d$'s, this expands (verify) to

$\mathcal{L}(\mathbf{w}_1, \dots, \mathbf{w}_K) = \sum_{n=1}^{N} \mathbf{x}_n^\top \mathbf{x}_n - \sum_{d=1}^{K} \sum_{n=1}^{N} (\mathbf{w}_d^\top \mathbf{x}_n)^2$

The first term is a constant that doesn't depend on the $\mathbf{w}_d$'s; for centered data, the second term is ($N$ times) the variance along $\mathbf{w}_d$, summed over the chosen directions. So minimizing the reconstruction error is the same as maximizing the captured variance, and each optimal $\mathbf{w}_d$ can be found by solving the same maximum-variance (eigenvector) problem as before.

PCA is done by doing SVD on the covariance matrix (the left and right singular vectors are the same and become the eigenvectors, and the singular values become the eigenvalues).
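A small numpy check, on assumed synthetic data, that the SVD of the (symmetric, PSD) covariance matrix coincides with its eigendecomposition:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))   # assumed data
Xc = X - X.mean(axis=0)
S = (Xc.T @ Xc) / X.shape[0]                              # covariance matrix (symmetric PSD)

U, svals, Vt = np.linalg.svd(S)                           # SVD of S
eigvals, eigvecs = np.linalg.eigh(S)                      # eigendecomposition of S

print(np.allclose(svals, np.sort(eigvals)[::-1]))         # singular values == eigenvalues
print(np.allclose(np.abs(U), np.abs(Vt.T)))               # left/right singular vectors match (up to sign)
```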
Dim-Red as Matrix Factorization
If we don't care about the orthonormality constraints, then dim-red can also be achieved by solving a matrix factorization problem on the data matrix $\mathbf{X}$:

$\mathbf{X}\ (N \times D) \;\approx\; \mathbf{Z}\ (N \times K)\ \ \mathbf{W}\ (K \times D)$

where $\mathbf{Z}$ is the matrix containing the low-dim reps $\mathbf{z}_n$ of the inputs $\mathbf{x}_n$.

Sometimes we have data about the same $N$ entities from two sources, $\mathbf{X}_1$ ($N \times D_1$) and $\mathbf{X}_2$ ($N \times D_2$), and we may want to perform a common dim-red for both sources to get a common feature rep that captures the properties of both sources (or fuses their info). This can be done by doing a joint dim-red of both sources. Many methods exist, e.g.,

Canonical Correlation Analysis (CCA): looks at cross-covariances rather than variances
Joint matrix factorization: e.g., $\min_{\mathbf{Z}, \mathbf{W}_1, \mathbf{W}_2} \|\mathbf{X}_1 - \mathbf{Z}\mathbf{W}_1\|_F^2 + \|\mathbf{X}_2 - \mathbf{Z}\mathbf{W}_2\|_F^2$, with a shared $\mathbf{Z}$
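A minimal sketch (assumed synthetic data) of dim-red via unconstrained matrix factorization, fit with alternating least squares; the update rules here are just one standard choice, not prescribed by the slides:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, K = 200, 30, 5
# Assumed data: approximately rank-K plus a little noise
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, D)) + 0.01 * rng.normal(size=(N, D))

Z = rng.normal(size=(N, K))      # low-dim reps (N x K)
W = rng.normal(size=(K, D))      # "basis" matrix (K x D), no orthonormality enforced
for _ in range(50):
    # Fix W and solve the least-squares problem for Z, then fix Z and solve for W
    Z = np.linalg.lstsq(W.T, X.T, rcond=None)[0].T
    W = np.linalg.lstsq(Z, X, rcond=None)[0]

print("relative reconstruction error:", np.linalg.norm(X - Z @ W) / np.linalg.norm(X))
```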
Coming up next
Some methods for computing eigenvectors
Supervised dimensionality reduction
Nonlinear dimensionality reduction
Kernel PCA
Manifold Learning