Dimensionality Reduction: Principal Component Analysis and SVD



CS771: Introduction to Machine Learning


Piyush Rai
Dimensionality Reduction

- A broad class of techniques
- Goal is to compress the original representation of the inputs
- Example: approximate each input $\mathbf{x}_n \in \mathbb{R}^D$ as a linear combination of $K$ "basis" vectors $\mathbf{w}_1, \dots, \mathbf{w}_K$, each also in $\mathbb{R}^D$:
  $\mathbf{x}_n \approx \sum_{k=1}^{K} z_{nk}\,\mathbf{w}_k = \mathbf{W}\mathbf{z}_n$
  - Note: these "basis" vectors need not necessarily be linearly independent, but for some dim-red techniques, e.g., classic principal component analysis (PCA), they are
  - Can think of $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_K]$ as a linear mapping that transforms the low-dim $\mathbf{z}_n$ to the high-dim $\mathbf{x}_n$. Some dim-red techniques assume a nonlinear mapping function $f$ such that $\mathbf{x}_n \approx f(\mathbf{z}_n)$; for example, $f$ can be modeled by a kernel or a deep neural net
- We have represented each $\mathbf{x}_n$ by a $K$-dim vector $\mathbf{z}_n$ (a new feature rep.)
- To store $N$ such inputs, we need to keep $\mathbf{Z} = [\mathbf{z}_1, \dots, \mathbf{z}_N]$ and $\mathbf{W}$
- Originally we required $O(ND)$ storage, now $O(NK + KD)$ storage
- If $K \ll D$, this yields substantial storage savings, hence good compression (a small numerical sketch follows below)
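As a rough illustration of the storage argument (not from the slides; the sizes N, D, K and the random W and Z are arbitrary), a minimal numpy sketch:

    import numpy as np

    N, D, K = 1000, 500, 20            # N inputs, D features, K << D basis vectors
    rng = np.random.default_rng(0)
    W = rng.standard_normal((D, K))    # columns w_1, ..., w_K ("basis" vectors)
    Z = rng.standard_normal((N, K))    # row n holds the K-dim code z_n for input x_n

    x0 = W @ Z[0]                      # x_n ~= sum_k z_nk * w_k  =  W z_n
    X = Z @ W.T                        # all N inputs stacked as an N x D matrix

    # Storage: O(ND) for X versus O(NK + KD) for (Z, W)
    print("original storage:  ", N * D, "numbers")          # 500000
    print("compressed storage:", N * K + K * D, "numbers")  # 30000

How W and Z are actually learned from a given X is what the rest of the slides cover.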
Dimensionality Reduction

- Dim-red for face images: each "basis" image $\mathbf{w}_k$ is like a "template" that captures the common properties of the face images in the dataset
- With $K = 4$ "basis" face images, a face image is approximated as
  $\mathbf{x}_n \approx z_{n1}\mathbf{w}_1 + z_{n2}\mathbf{w}_2 + z_{n3}\mathbf{w}_3 + z_{n4}\mathbf{w}_4$
- In this example, $\mathbf{z}_n = [z_{n1}, z_{n2}, z_{n3}, z_{n4}]$ is a low-dim feature rep. for $\mathbf{x}_n$ (like 4 new features)
- Essentially, each face image in the dataset is now represented by just 4 real numbers
- Different dim-red algos differ in terms of how the basis vectors are defined/learned
  - ... and, in general, how the function $f$ in the mapping $\mathbf{x}_n \approx f(\mathbf{z}_n)$ is defined
Principal Component Analysis (PCA)

- A classic linear dim. reduction method (Pearson, 1901; Hotelling, 1933)
- Can be seen as:
  - Learning directions (co-ordinate axes) that capture the maximum variance in the data
    [Figure: 2D data shown with the standard co-ordinate axes ($e_1$, $e_2$) and new co-ordinate axes ($w_1$, $w_2$); the variance is large along $w_1$ and small along $w_2$]
    PCA is essentially doing a change of axes in which we are representing the data. Each input will still have 2 co-ordinates in the new co-ordinate system, equal to the distances measured from the new origin. To reduce dimension, we can keep only the co-ordinates along the directions that have the largest variances (e.g., in this example, if we want to reduce to one dim., we can keep each point's co-ordinate along $w_1$ and throw away its co-ordinate along $w_2$); we won't lose much information
  - Learning projection directions that result in the smallest reconstruction error
    $\arg\min_{\mathbf{W}, \mathbf{Z}} \sum_{n=1}^{N} \|\mathbf{x}_n - \mathbf{W}\mathbf{z}_n\|^2 = \arg\min_{\mathbf{W}, \mathbf{Z}} \|\mathbf{X} - \mathbf{Z}\mathbf{W}^\top\|_F^2$
    subject to orthonormality constraints on the columns of $\mathbf{W}$: $\mathbf{w}_k^\top\mathbf{w}_k = 1$ and $\mathbf{w}_k^\top\mathbf{w}_j = 0$ for $j \neq k$
    (here $\mathbf{X}$ is $N \times D$ with rows $\mathbf{x}_n^\top$, $\mathbf{Z}$ is $N \times K$, and $\mathbf{W}$ is $D \times K$)
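A small numpy sketch of the change-of-axes picture (an illustration, not from the slides; the 2D data generation is arbitrary): correlated 2D data is re-expressed in the eigenvector axes of its covariance, and keeping only the largest-variance co-ordinate gives a 1D representation.

    import numpy as np

    rng = np.random.default_rng(1)
    # 2D data with strong correlation (large variance along one direction)
    A = np.array([[3.0, 0.0], [1.5, 0.4]])
    X = rng.standard_normal((500, 2)) @ A.T
    X = X - X.mean(axis=0)                 # center the data

    S = (X.T @ X) / len(X)                 # 2x2 covariance matrix
    evals, evecs = np.linalg.eigh(S)       # eigh returns ascending eigenvalues
    w1, w2 = evecs[:, -1], evecs[:, 0]     # new axes: large- and small-variance directions

    Z = X @ evecs[:, ::-1]                 # co-ordinates of each point in the (w1, w2) axes
    print("variance along w1:", Z[:, 0].var())   # close to the largest eigenvalue
    print("variance along w2:", Z[:, 1].var())   # much smaller
    z1 = X @ w1                            # keep only the w1 co-ordinate: a 1D representation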


PCA: From the variance perspective

Solving PCA by Finding Max. Variance Directions

- Consider projecting an input $\mathbf{x}_n$ along a unit-norm direction $\mathbf{w}_1$
- The projection/embedding of $\mathbf{x}_n$ will be $\mathbf{w}_1^\top\mathbf{x}_n$
  [Figure: inputs $\mathbf{x}_n$ (red points) and their projections $\mathbf{w}_1^\top\mathbf{x}_n$ (green points) onto the direction $\mathbf{w}_1$]
- Mean of the projections of all inputs: $\frac{1}{N}\sum_{n=1}^{N}\mathbf{w}_1^\top\mathbf{x}_n = \mathbf{w}_1^\top\boldsymbol{\mu}$, where $\boldsymbol{\mu}$ is the mean of the inputs
- Variance of the projections: $\frac{1}{N}\sum_{n=1}^{N}(\mathbf{w}_1^\top\mathbf{x}_n - \mathbf{w}_1^\top\boldsymbol{\mu})^2 = \mathbf{w}_1^\top\mathbf{S}\mathbf{w}_1$, where $\mathbf{S} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n - \boldsymbol{\mu})(\mathbf{x}_n - \boldsymbol{\mu})^\top$ is the cov matrix of the data
  - For already centered data, $\boldsymbol{\mu} = \mathbf{0}$ and $\mathbf{S} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n\mathbf{x}_n^\top = \frac{1}{N}\mathbf{X}^\top\mathbf{X}$
- Want $\mathbf{w}_1$ such that this variance is maximized:
  $\arg\max_{\mathbf{w}_1}\,\mathbf{w}_1^\top\mathbf{S}\mathbf{w}_1$  s.t.  $\mathbf{w}_1^\top\mathbf{w}_1 = 1$
  (the constraint is needed; otherwise the objective's max will be infinity)
Max. Variance Direction

- Our objective function was $\max_{\mathbf{w}_1}\,\mathbf{w}_1^\top\mathbf{S}\mathbf{w}_1$ s.t. $\mathbf{w}_1^\top\mathbf{w}_1 = 1$
- Can construct a Lagrangian for this problem: $\mathcal{L}(\mathbf{w}_1, \lambda_1) = \mathbf{w}_1^\top\mathbf{S}\mathbf{w}_1 + \lambda_1(1 - \mathbf{w}_1^\top\mathbf{w}_1)$
- Taking the derivative w.r.t. $\mathbf{w}_1$ and setting it to zero gives $\mathbf{S}\mathbf{w}_1 = \lambda_1\mathbf{w}_1$
- Therefore $\mathbf{w}_1$ is an eigenvector of the cov matrix $\mathbf{S}$ with eigenvalue $\lambda_1$ (note: in general, $\mathbf{S}$ will have $D$ eigvecs)
- Claim: $\mathbf{w}_1$ is the eigenvector of $\mathbf{S}$ with the largest eigenvalue $\lambda_1$. Note that the variance along the direction $\mathbf{w}_1$ is $\mathbf{w}_1^\top\mathbf{S}\mathbf{w}_1 = \lambda_1\mathbf{w}_1^\top\mathbf{w}_1 = \lambda_1$
- Thus the variance will be max. if $\lambda_1$ is the largest eigenvalue (and $\mathbf{w}_1$ is the corresponding top eigenvector, also known as the first Principal Component)
- Other large-variance directions can also be found likewise (with each being orthogonal to all others) using the eigendecomposition of the cov matrix $\mathbf{S}$ (this is PCA)
- Note: the total variance of the data is equal to the sum of the eigenvalues of $\mathbf{S}$, i.e., $\sum_{d=1}^{D}\lambda_d$. PCA would keep the top $K$ such directions of largest variances
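A quick numerical sanity check of the claim above (illustrative only; the data generation and the number of random directions are arbitrary choices): the top eigenvector of S attains a larger variance w^T S w than any random unit direction, and the total variance equals the sum of the eigenvalues.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 5))
    X = X - X.mean(axis=0)                       # centered data
    S = (X.T @ X) / len(X)                       # covariance matrix

    evals, evecs = np.linalg.eigh(S)             # ascending eigenvalues
    w1, lam1 = evecs[:, -1], evals[-1]           # top eigenvector / eigenvalue
    print("variance along w1:", w1 @ S @ w1)     # equals lam1

    # No random unit direction should beat it
    dirs = rng.standard_normal((10000, 5))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    print("best random direction:", np.einsum('id,de,ie->i', dirs, S, dirs).max())

    # Total variance = sum of eigenvalues of S
    print(np.isclose(np.trace(S), evals.sum()))  # True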

PCA: From the reconstruction perspective

Alternate Basis and Reconstruction

- Representing a data point $\mathbf{x}_n \in \mathbb{R}^D$ in the standard orthonormal basis $\{\mathbf{e}_1, \dots, \mathbf{e}_D\}$:
  $\mathbf{x}_n = \sum_{d=1}^{D} x_{nd}\,\mathbf{e}_d$
  where $\mathbf{e}_d$ is a vector of all zeros except a single 1 at the $d$-th position; also, $\mathbf{e}_d^\top\mathbf{e}_{d'} = 0$ for $d \neq d'$
- Let's represent the same data point in a new orthonormal basis $\{\mathbf{w}_1, \dots, \mathbf{w}_D\}$:
  $\mathbf{x}_n = \sum_{d=1}^{D} z_{nd}\,\mathbf{w}_d$
  where $z_{nd}$ denotes the co-ordinate of $\mathbf{x}_n$ along $\mathbf{w}_d$ in the new basis; $z_{nd} = \mathbf{x}_n^\top\mathbf{w}_d$ is the projection of $\mathbf{x}_n$ along the direction $\mathbf{w}_d$ since $\mathbf{w}_d^\top\mathbf{w}_d = 1$ (verify)
- Ignoring directions along which the projection is small, we can approximate $\mathbf{x}_n$ as
  $\mathbf{x}_n \approx \hat{\mathbf{x}}_n = \sum_{d=1}^{K} z_{nd}\,\mathbf{w}_d = \sum_{d=1}^{K}(\mathbf{x}_n^\top\mathbf{w}_d)\,\mathbf{w}_d = \Big(\sum_{d=1}^{K}\mathbf{w}_d\mathbf{w}_d^\top\Big)\mathbf{x}_n$
  Note that $\|\mathbf{x}_n - \hat{\mathbf{x}}_n\|^2$ is the reconstruction error on $\mathbf{x}_n$; we would like to minimize it w.r.t. the $\mathbf{w}_d$'s
- Now $\mathbf{x}_n$ is represented by the $K$-dim rep. $\mathbf{z}_n = [z_{n1}, \dots, z_{nK}]^\top = \mathbf{W}_K^\top\mathbf{x}_n$, where $\mathbf{W}_K = [\mathbf{w}_1, \dots, \mathbf{w}_K]$ is the "projection matrix" of size $D \times K$. Also, $\sum_{d=1}^{K}\mathbf{w}_d\mathbf{w}_d^\top = \mathbf{W}_K\mathbf{W}_K^\top$ (verify)
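A minimal numpy sketch of these identities (illustrative; an arbitrary orthonormal basis is obtained from a QR decomposition, and the names W, W_K, z mirror the slide notation): the co-ordinates in the new basis are dot products, the full basis reconstructs x exactly, and keeping K < D columns gives the approximation W_K W_K^T x.

    import numpy as np

    rng = np.random.default_rng(3)
    D, K = 6, 2
    W, _ = np.linalg.qr(rng.standard_normal((D, D)))   # columns w_1..w_D: an orthonormal basis
    x = rng.standard_normal(D)

    z = W.T @ x                       # co-ordinates z_d = x^T w_d in the new basis
    print(np.allclose(W @ z, x))      # full basis: exact reconstruction (True)

    W_K = W[:, :K]                    # "projection matrix", size D x K
    z_K = W_K.T @ x                   # K-dim representation
    x_hat = W_K @ z_K                 # reconstruction from K directions
    print(np.allclose(x_hat, (W_K @ W_K.T) @ x))       # same as (sum_d w_d w_d^T) x (True)
    print("reconstruction error:", np.linalg.norm(x - x_hat))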
Minimizing Reconstruction Error

- We plan to use only $K$ directions, so we would like them to be such that the total reconstruction error is minimized:
  $\mathcal{L}(\mathbf{w}_1, \dots, \mathbf{w}_K) = \sum_{n=1}^{N}\|\mathbf{x}_n - \hat{\mathbf{x}}_n\|^2 = \sum_{n=1}^{N}\Big[\mathbf{x}_n^\top\mathbf{x}_n - \sum_{d=1}^{K}\mathbf{w}_d^\top\mathbf{x}_n\mathbf{x}_n^\top\mathbf{w}_d\Big]$ (verify)
  The first term is a constant (doesn't depend on the $\mathbf{w}_d$'s), and for centered data $\sum_{n=1}^{N}\mathbf{w}_d^\top\mathbf{x}_n\mathbf{x}_n^\top\mathbf{w}_d = N\,\mathbf{w}_d^\top\mathbf{S}\mathbf{w}_d$ is ($N$ times) the variance along $\mathbf{w}_d$
- Each optimal $\mathbf{w}_d$ can be found by solving $\arg\max_{\mathbf{w}_d}\mathbf{w}_d^\top\mathbf{S}\mathbf{w}_d$ s.t. $\mathbf{w}_d^\top\mathbf{w}_d = 1$
- Thus minimizing the reconstruction error is equivalent to maximizing the variance
- The $K$ directions can be found by solving the eigendecomposition of $\mathbf{S}$
- Note: $\sum_{d=1}^{K}\mathbf{w}_d^\top\mathbf{S}\mathbf{w}_d = \text{trace}(\mathbf{W}_K^\top\mathbf{S}\mathbf{W}_K)$. Thus $\arg\max_{\mathbf{W}_K}\text{trace}(\mathbf{W}_K^\top\mathbf{S}\mathbf{W}_K)$ s.t. orthonormality on the columns of $\mathbf{W}_K$ is the same as solving the eigendec. of $\mathbf{S}$ (recall that Spectral Clustering also required solving such a problem)
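As a quick numerical check of this equivalence (not from the slides; the data and K are arbitrary), the sketch below verifies on centered data that the total reconstruction error equals the constant term minus N times the variance captured by the K kept directions, and that it equals N times the sum of the discarded eigenvalues.

    import numpy as np

    rng = np.random.default_rng(4)
    X = rng.standard_normal((200, 8)) @ rng.standard_normal((8, 8))
    X = X - X.mean(axis=0)                    # centered data
    N = len(X)
    S = (X.T @ X) / N

    K = 3
    evals, evecs = np.linalg.eigh(S)          # ascending eigenvalues
    W_K = evecs[:, -K:]                       # top-K variance directions

    X_hat = X @ W_K @ W_K.T                   # reconstructions x_hat_n = W_K W_K^T x_n
    recon_err = np.sum((X - X_hat) ** 2)
    captured = N * sum(w @ S @ w for w in W_K.T)        # N * sum_d w_d^T S w_d

    print(np.isclose(recon_err, np.sum(X ** 2) - captured))   # True
    print(np.isclose(recon_err, N * evals[:-K].sum()))        # error = N * (discarded eigenvalues)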
Principal Component Analysis

- Center the data (subtract the mean $\boldsymbol{\mu} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n$ from each data point)
- Compute the covariance matrix using the centered data matrix $\mathbf{X}$ as
  $\mathbf{S} = \frac{1}{N}\mathbf{X}^\top\mathbf{X}$  (assuming $\mathbf{X}$ is arranged as $N \times D$)
- Do an eigendecomposition of the covariance matrix $\mathbf{S}$ (many methods exist)
- Take the top $K$ leading eigvectors $\{\mathbf{w}_1, \dots, \mathbf{w}_K\}$ with eigvalues $\{\lambda_1, \dots, \lambda_K\}$
- The $K$-dimensional projection/embedding of each input is $\mathbf{z}_n = \mathbf{W}_K^\top\mathbf{x}_n$, where $\mathbf{W}_K = [\mathbf{w}_1, \dots, \mathbf{w}_K]$ is the "projection matrix" of size $D \times K$
- Note: can decide how many eigvecs to use based on how much variance we want to capture (recall that each $\lambda_k$ gives the variance along the direction $\mathbf{w}_k$, and their sum is the total variance); a numpy sketch of these steps follows below
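The steps above translate directly into a few lines of numpy. This is a sketch rather than production code, and the example data and the 95% threshold are arbitrary; for large problems, SVD-based routines (e.g., sklearn.decomposition.PCA) are usually preferred.

    import numpy as np

    def pca(X, K):
        """Return K-dim embeddings Z, the projection matrix W_K (D x K),
        and the top-K eigenvalues (variances along the kept directions)."""
        mu = X.mean(axis=0)
        Xc = X - mu                            # 1. center the data
        S = (Xc.T @ Xc) / len(Xc)              # 2. covariance matrix (D x D)
        evals, evecs = np.linalg.eigh(S)       # 3. eigendecomposition (ascending order)
        idx = np.argsort(evals)[::-1][:K]      # 4. top-K leading eigenvectors
        W_K, lam = evecs[:, idx], evals[idx]
        Z = Xc @ W_K                           # 5. z_n = W_K^T (x_n - mu)
        return Z, W_K, lam

    # Example: how many directions are needed to capture 95% of the total variance?
    rng = np.random.default_rng(5)
    X = rng.standard_normal((300, 10)) @ rng.standard_normal((10, 10))
    Z, W_K, lam = pca(X, K=10)                 # keep all directions to inspect the spectrum
    frac = np.cumsum(lam) / lam.sum()
    print("directions needed for 95% variance:", np.searchsorted(frac, 0.95) + 1)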
Singular Value Decomposition (SVD)

- Any matrix $\mathbf{X}$ of size $N \times D$ can be represented as the following decomposition
  $\mathbf{X} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^\top = \sum_{k=1}^{\min\{N, D\}} \lambda_k\,\mathbf{u}_k\mathbf{v}_k^\top$
- $\mathbf{U} = [\mathbf{u}_1, \dots, \mathbf{u}_N]$ is the $N \times N$ matrix of left singular vectors, each $\mathbf{u}_k \in \mathbb{R}^N$; $\mathbf{U}$ is also orthonormal
- $\mathbf{V} = [\mathbf{v}_1, \dots, \mathbf{v}_D]$ is the $D \times D$ matrix of right singular vectors, each $\mathbf{v}_k \in \mathbb{R}^D$; $\mathbf{V}$ is also orthonormal
- $\boldsymbol{\Lambda}$ is an $N \times D$ diagonal matrix with only diagonal entries, the singular values $\lambda_k$; if $N > D$, the last $N - D$ rows are all zeros, and if $D > N$, the last $D - N$ columns are all zeros
- Note: if $\mathbf{X}$ is symmetric, then this is known as the eigenvalue decomposition (in which case $\mathbf{U} = \mathbf{V}$)
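A small numpy illustration of the decomposition (illustrative data; note that numpy's np.linalg.svd returns V^T rather than V, and with full_matrices=True gives U and V^T of sizes N x N and D x D):

    import numpy as np

    rng = np.random.default_rng(7)
    N, D = 6, 4
    X = rng.standard_normal((N, D))

    U, s, Vt = np.linalg.svd(X, full_matrices=True)   # U: N x N, s: min(N, D), Vt: D x D
    Lam = np.zeros((N, D))
    Lam[:len(s), :len(s)] = np.diag(s)                # N x D matrix with singular values on the diagonal

    print(np.allclose(X, U @ Lam @ Vt))               # X = U Lam V^T (True)
    print(np.allclose(U.T @ U, np.eye(N)))            # U is orthonormal
    print(np.allclose(Vt @ Vt.T, np.eye(D)))          # V is orthonormal

    # Equivalently, X = sum_k lambda_k * u_k v_k^T
    X_sum = sum(s[k] * np.outer(U[:, k], Vt[k]) for k in range(min(N, D)))
    print(np.allclose(X, X_sum))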
Low-Rank Approximation via SVD

- If we just use the top $K$ singular values, we get a rank-$K$ SVD
  $\mathbf{X} \approx \hat{\mathbf{X}} = \sum_{k=1}^{K} \lambda_k\,\mathbf{u}_k\mathbf{v}_k^\top$
- The above SVD approximation can be shown to minimize the reconstruction error $\|\mathbf{X} - \hat{\mathbf{X}}\|_F$
- Fact: SVD gives the best rank-$K$ approximation of a matrix
- PCA is done by doing SVD on the covariance matrix (the left and right singular vectors are the same and become the eigenvectors; the singular values become the eigenvalues)
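A sketch of a rank-K approximation via truncated SVD, plus the SVD-on-covariance route to PCA mentioned above (illustrative code with arbitrary data and K, not from the slides):

    import numpy as np

    rng = np.random.default_rng(8)
    X = rng.standard_normal((100, 20)) @ rng.standard_normal((20, 20))
    Xc = X - X.mean(axis=0)

    # Best rank-K approximation: keep only the top-K singular values/vectors
    K = 5
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    X_hat = U[:, :K] @ np.diag(s[:K]) @ Vt[:K]
    print("rank of X_hat:", np.linalg.matrix_rank(X_hat))                       # K
    print(np.isclose(np.linalg.norm(Xc - X_hat), np.sqrt(np.sum(s[K:] ** 2))))  # error from discarded singular values

    # PCA via SVD of the (symmetric PSD) covariance matrix: singular values = eigenvalues
    S = (Xc.T @ Xc) / len(Xc)
    Us, ss, _ = np.linalg.svd(S)              # descending singular values
    evals, evecs = np.linalg.eigh(S)          # ascending eigenvalues
    print(np.allclose(ss[::-1], evals))       # same spectrum, opposite ordering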
Dim-Red as Matrix Factorization

- If we don't care about the orthonormality constraints, then dim-red can also be achieved by solving a matrix factorization problem on the data matrix:
  $\mathbf{X}_{N \times D} \approx \mathbf{Z}_{N \times K}\,\mathbf{W}_{K \times D}$
  ($\mathbf{Z}$ is the matrix containing the low-dim rep of the inputs)
  $\{\hat{\mathbf{Z}}, \hat{\mathbf{W}}\} = \arg\min_{\mathbf{Z}, \mathbf{W}} \|\mathbf{X} - \mathbf{Z}\mathbf{W}\|^2$
- If $K < \min\{N, D\}$, such a factorization gives a low-rank approximation of the data matrix $\mathbf{X}$
- Can solve such problems using ALT-OPT (alternating optimization); a sketch is given below
- Can impose various constraints on $\mathbf{Z}$ and $\mathbf{W}$ (e.g., sparsity, non-negativity, etc.)
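A minimal ALT-OPT sketch for the unconstrained objective above, assuming the simplest alternating least-squares scheme: fix W and solve for Z, then fix Z and solve for W, each being an ordinary least-squares problem. The data, the number of iterations, and the use of np.linalg.lstsq are arbitrary choices for illustration.

    import numpy as np

    rng = np.random.default_rng(9)
    N, D, K = 200, 30, 5
    # Roughly rank-K data plus a little noise
    X = rng.standard_normal((N, K)) @ rng.standard_normal((K, D)) + 0.01 * rng.standard_normal((N, D))

    # Alternating minimization of ||X - Z W||_F^2  (Z: N x K, W: K x D)
    Z = rng.standard_normal((N, K))
    W = rng.standard_normal((K, D))
    for it in range(50):
        # Fix W, solve for Z: least squares on X^T ~= W^T Z^T
        Z = np.linalg.lstsq(W.T, X.T, rcond=None)[0].T
        # Fix Z, solve for W: least squares on X ~= Z W
        W = np.linalg.lstsq(Z, X, rcond=None)[0]

    print("reconstruction error:", np.linalg.norm(X - Z @ W))   # small, near the noise level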
Joint Dim-Red

- Often we have two or more data sources with a 1-1 correspondence between inputs, e.g., $N$ videos ($\mathbf{X}_1$ of size $N \times D_1$) and the audio of these $N$ videos ($\mathbf{X}_2$ of size $N \times D_2$)
- Sometimes, we may want to perform a common dim-red for both sources to get a common feature rep. which captures properties of both sources (or fuses their info)
- This can be done by doing a joint dim-red of both sources. Many methods exist, e.g.,
  - Canonical Correlation Analysis (CCA): looks at cross-covariances rather than variances
  - Joint Matrix Factorization with a shared $\mathbf{Z}$, e.g., minimizing $\|\mathbf{X}_1 - \mathbf{Z}\mathbf{W}_1\|^2 + \|\mathbf{X}_2 - \mathbf{Z}\mathbf{W}_2\|^2$
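For the CCA route, scikit-learn provides an implementation (sklearn.cross_decomposition.CCA). The sketch below uses made-up paired "video" and "audio" features that share a 2-dim latent factor; the data generation and n_components are arbitrary.

    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(10)
    N = 500
    shared = rng.standard_normal((N, 2))      # latent factors shared by both views
    X1 = shared @ rng.standard_normal((2, 20)) + 0.1 * rng.standard_normal((N, 20))  # "video" features
    X2 = shared @ rng.standard_normal((2, 10)) + 0.1 * rng.standard_normal((N, 10))  # "audio" features

    cca = CCA(n_components=2)
    cca.fit(X1, X2)
    Z1, Z2 = cca.transform(X1, X2)            # common 2-dim reps, one per view
    # The paired embeddings are highly correlated because the two views share structure
    print([np.corrcoef(Z1[:, k], Z2[:, k])[0, 1] for k in range(2)])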
Coming up next
- Some methods for computing eigenvectors
- Supervised dimensionality reduction
- Nonlinear dimensionality reduction
  - Kernel PCA
  - Manifold Learning

