
Principal Component Analysis

Sanjay Singh

Department of Information and Communication Technology
Manipal Institute of Technology, Manipal University
Manipal-576104, INDIA
sanjay.singh@manipal.edu

Centre for Artificial and Machine Intelligence (CAMI)
Manipal University, Manipal-576104, INDIA

October 23, 2018


Factor analysis models data x ∈ Rn as "approximately" lying in some k-dimensional subspace, where k ≪ n.
Here we imagined that each point x(i) was created by first generating some z(i) lying in the k-dimensional affine space {µ + Λz; z ∈ Rk } and then adding Ψ-covariance noise.
Factor analysis is based on a probabilistic model, and parameter estimation uses the EM algorithm.
Principal Component Analysis (PCA) also tries to identify the subspace in which the data approximately lies.
PCA does this directly: it requires only an eigenvector calculation and does not need EM.



PCA has been rediscovered many times in many fields, so it is
also known as
Karhunen-Loève transform
Hotelling transformation
Method of empirical orthogonal functions, and
Singular value decomposition (SVD)


Consider a data set {x(i); i = 1, . . . , m} of m different types of automobiles.
Let x(i) ∈ Rn for each i (n ≪ m).
Suppose, unknown to us, two different attributes, say xi and xj, give a car's maximum speed measured in mph and kph respectively.
These two attributes are almost linearly dependent, with only small differences introduced by rounding to the nearest mph or kph.
Thus, the data really lies approximately on an (n − 1)-dimensional subspace.
How can we automatically detect, and remove, this redundancy?



We’ll now develop PCA. Prior to running PCA, we first pre-process the data to normalize its mean and variance:
1. Let $\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$
2. Replace each $x^{(i)}$ with $x^{(i)} - \mu$
3. Let $\sigma_j^2 = \frac{1}{m}\sum_{i} \big(x_j^{(i)}\big)^2$
4. Replace each $x_j^{(i)}$ with $x_j^{(i)}/\sigma_j$
Steps (1) and (2) zero out the mean of the data (not required for data known to have zero mean).
Steps (3) and (4) rescale each coordinate to have unit variance, which ensures that different attributes are all treated on the same "scale".
Steps (3) and (4) can be omitted if we have a priori knowledge that the different attributes are all on the same scale (a code sketch of steps (1)–(4) follows below).
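A minimal sketch of steps (1)–(4) in NumPy, assuming the data set is stored as an (m, n) array X with one example per row (the function name normalize is only illustrative, not from the slides):

```python
import numpy as np

def normalize(X):
    """Steps (1)-(4): zero out the mean, then rescale each coordinate to unit variance."""
    mu = X.mean(axis=0)                       # step 1: mu = (1/m) sum_i x^(i)
    X = X - mu                                # step 2: replace x^(i) with x^(i) - mu
    sigma = np.sqrt((X ** 2).mean(axis=0))    # step 3: sigma_j^2 = (1/m) sum_i (x_j^(i))^2
    X = X / sigma                             # step 4: replace x_j^(i) with x_j^(i) / sigma_j
    return X
```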

Figure 1: An unnormalized data set
Figure 2: A normalized data set



Having carried out the normalization, how do we compute
the “major axis of variation”, say ~u?
I.e., the direction on which the data approximately lies
We can pose this problem as finding the unit vector ~u so
that when the data is projected onto the direction
corresponding to ~u, the variance of the projected data is
maximized
We would like to choose a direction ~u so that if we were to
approximate the data as lying in the direction/subspace
corresponding to ~u, as much as possible of this variance is
retained


Figure 3: A normalized data set

Figure 4: A normalized data set

Suppose we pick ~u to correspond to the direction shown in Fig. 4. The circles denote the projection of the original data onto this line.
From Fig. 4 we see that the projected data still has a large variance, and the points tend to be far from zero.

Figure 5: Same data with a different direction

Here, the projections have a significantly smaller variance, and are much closer to the origin.
We would like to automatically select the direction ~u corresponding to the first of the two figures, i.e., Fig. 4 rather than Fig. 5.

Mathematics of Principal Components

Consider n-dimensional feature vectors, i.e., x ∈ Rn; we want to summarize them by projecting them into a k-dimensional subspace (k < n).
Our summary will be the projection of the original vectors onto the k directions, the principal components (PCs), which span the subspace.
The simplest way to derive the PCs is to find the projections which maximize the variance. (How do we know this is the right criterion? Just hold on for some time!)



We’ll start by looking for a 1D projection.
I.e., we have n-dimensional feature vectors, and we want to project them onto a line through the origin.
We can specify the line by a unit vector along it, ~u.
The projection of a data point x(i) onto the line is x(i)T~u, which is a scalar.
x(i)T~u is the distance of the projection from the origin; the actual co-ordinate in n-dimensional space is (x(i)T~u)~u.
The mean of the projections will be zero, because µx = 0:
$$\frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)T}\vec{u}\right)\vec{u} = \left(\left(\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right)\cdot\vec{u}\right)\vec{u}$$

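A small numerical illustration of this (a sketch with made-up data; X and u below are illustrative, not from the slides): the scalar projections x(i)T~u of zero-mean data themselves have zero mean.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = X - X.mean(axis=0)              # centre the data so that mu_x = 0

u = np.array([1.0, 2.0, 2.0])
u = u / np.linalg.norm(u)           # unit vector specifying the line

scalars = X @ u                     # x^(i)T u: signed distance of each projection from the origin
points = np.outer(scalars, u)       # (x^(i)T u) u: actual coordinates in R^n

print(scalars.mean())               # approximately 0, since the data has zero mean
```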

If we use our projected or image vectors instead of the original vectors, there will be some error, because the images do not coincide with the original vectors (when do they coincide?).
The difference is the error or residual of the projection.
How big is the residual?
$$\|x^{(i)} - (x^{(i)T}\vec{u})\,\vec{u}\|^2 = \|x^{(i)}\|^2 - 2\left(x^{(i)T}\vec{u}\right)^2 + \left(x^{(i)T}\vec{u}\right)^2\|\vec{u}\|^2 = \|x^{(i)}\|^2 - \left(x^{(i)T}\vec{u}\right)^2,$$
since $\|\vec{u}\|^2 = 1$.
The residual across all vectors is given by
$$\mathrm{RSS}(\vec{u}) = \sum_{i=1}^{m}\left(\|x^{(i)}\|^2 - \left(x^{(i)T}\vec{u}\right)^2\right) = \sum_{i=1}^{m}\|x^{(i)}\|^2 - \sum_{i=1}^{m}\left(x^{(i)T}\vec{u}\right)^2$$
The first term in RSS doesn’t depend on ~u, so we need not consider it when minimizing RSS.
To minimize RSS, we therefore need to maximize the second term, i.e., we want to maximize $\sum_{i=1}^{m}\left(x^{(i)T}\vec{u}\right)^2$.
Since m does not depend on ~u, this is the same as maximizing
$$\frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)T}\vec{u}\right)^2,$$
which can be seen as the sample mean of $(x^{(i)T}\vec{u})^2$.
The mean of the squares is equal to the square of the mean plus the variance:
$$\frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)T}\vec{u}\right)^2 = \underbrace{\left(\frac{1}{m}\sum_{i=1}^{m} x^{(i)T}\vec{u}\right)^2}_{=0} + \mathrm{Var}\left[x^{(i)T}\vec{u}\right]$$
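For reference, this is just the usual (biased) sample-variance identity, rearranged; writing $a_i = x^{(i)T}\vec{u}$ and $\bar{a} = \frac{1}{m}\sum_i a_i$ (symbols introduced only for this aside):
$$\mathrm{Var}[a] = \frac{1}{m}\sum_{i=1}^{m}(a_i-\bar{a})^2 = \frac{1}{m}\sum_{i=1}^{m}a_i^2 - \bar{a}^2 \quad\Longrightarrow\quad \frac{1}{m}\sum_{i=1}^{m}a_i^2 = \bar{a}^2 + \mathrm{Var}[a]$$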


Now, minimizing the residual sum of squares (RSS) is equivalent to maximizing the variance of the projections.
In general, we want to project not just onto one vector but onto multiple principal components.
If those components are orthogonal and have unit vectors ~u1, . . . , ~uk, then the image of x(i) is its projection into the space spanned by these vectors:
$$\sum_{j=1}^{k}\left(x^{(i)T}\vec{u}_j\right)\vec{u}_j$$



Maximizing Variance

Let X ∈ Rm×n be the (normalized) data matrix whose i-th row is x(i)T.
The variance is defined as:
$$\begin{aligned}
\sigma_u^2 &= \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)T}\vec{u}\right)^2\\
&= \frac{1}{m}(X\vec{u})^T(X\vec{u}) \qquad \text{[the $i$-th entry of $X\vec{u}$ is $x^{(i)T}\vec{u}$]}\\
&= \frac{1}{m}\,\vec{u}^T X^T X\,\vec{u}\\
&= \vec{u}^T\,\frac{X^T X}{m}\,\vec{u}\\
&= \vec{u}^T\Sigma\,\vec{u}
\end{aligned}$$
where Σ = XᵀX/m is the sample covariance matrix of the (zero-mean) data.
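As a quick numerical sanity check of this chain of equalities (a sketch with synthetic data; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X = X - X.mean(axis=0)                  # normalized (zero-mean) data matrix, one example per row

u = rng.normal(size=4)
u = u / np.linalg.norm(u)               # an arbitrary unit vector

Sigma = (X.T @ X) / X.shape[0]          # sample covariance matrix X^T X / m

var_direct = np.mean((X @ u) ** 2)      # (1/m) sum_i (x^(i)T u)^2
var_quadratic = u @ Sigma @ u           # u^T Sigma u

print(np.isclose(var_direct, var_quadratic))   # True
```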


σu² = ~uT Σ~u
We want to choose a unit vector ~u so as to maximize σu².
To do this we need to look only at unit vectors, i.e., we need to constrain the maximization:
$$\begin{aligned}
\underset{\vec{u}}{\text{maximize}}\quad & \vec{u}^T \Sigma\, \vec{u}\\
\text{subject to}\quad & \vec{u}^T\vec{u} = 1
\end{aligned}$$
Solving using a Lagrange multiplier (set the gradient of L(~u, λ) = ~uT Σ~u − λ(~uT~u − 1) with respect to ~u to zero), we get Σ~u = λ~u.
The desired vector ~u is an eigenvector of the covariance matrix Σ.
The maximizing vector will be the one associated with the largest eigenvalue λ.
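A sketch of this eigenvector computation in NumPy (assuming X is the normalized (m, n) data matrix; np.linalg.eigh returns eigenvalues in ascending order, so the last column corresponds to the largest λ):

```python
import numpy as np

def first_principal_component(X):
    """Return the unit eigenvector of Sigma = X^T X / m with the largest eigenvalue."""
    Sigma = (X.T @ X) / X.shape[0]              # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)    # eigh: symmetric matrix, ascending eigenvalues
    return eigvecs[:, -1]                       # direction ~u maximizing u^T Sigma u
```

For any unit vector ~u, ~uT Σ~u is at most the largest eigenvalue, and this eigenvector attains it.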



In general, if we wish to project our data into a
k-dimensional subspace (k < n), we should choose
u1 , . . . , uk to be the top k eigenvectors of Σ
The ui ’s now form a new orthogonal basis for the data
To represent x(i) in this basis, we need only compute the
corresponding vector
$$y^{(i)} = \begin{bmatrix} u_1^T x^{(i)} \\ u_2^T x^{(i)} \\ \vdots \\ u_k^T x^{(i)} \end{bmatrix} \in \mathbb{R}^k$$

Since x(i) ∈ Rn, the vector y(i) gives a lower, k-dimensional representation/approximation for x(i).
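In matrix form, stacking the y(i) as rows gives Y = X Uk, where the columns of Uk are u1, . . . , uk; a sketch (the function name pca_transform is illustrative):

```python
import numpy as np

def pca_transform(X, k):
    """Map each row x^(i) of the normalized data matrix X to y^(i) = (u_1^T x^(i), ..., u_k^T x^(i))."""
    Sigma = (X.T @ X) / X.shape[0]              # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)    # ascending eigenvalues, orthonormal eigenvectors
    U_k = eigvecs[:, ::-1][:, :k]               # columns u_1, ..., u_k: top-k eigenvectors
    return X @ U_k                              # row i is y^(i) in R^k
```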


PCA is also referred to as a dimensionality reduction algorithm.
The vectors u1, . . . , uk are called the first k principal components of the data.
Applications of PCA:
Compression (see the sketch below)
Applying PCA as a pre-processing step before supervised learning
many more
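For instance, compression stores the k numbers y(i) in place of the n numbers x(i); an approximate reconstruction is x̂(i) = Σj yj(i) uj. A self-contained sketch with synthetic data (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10))
X = X - X.mean(axis=0)                           # assume normalization has been done

k = 3
Sigma = (X.T @ X) / X.shape[0]
U_k = np.linalg.eigh(Sigma)[1][:, ::-1][:, :k]   # top-k eigenvectors of the covariance matrix

Y = X @ U_k                                      # compressed data: k numbers per example instead of n
X_hat = Y @ U_k.T                                # approximate reconstruction back in R^n

print(np.mean((X - X_hat) ** 2))                 # average squared reconstruction error
```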

