
Principal Component Analysis

Sanjay Singh

Department of Information and Communication Technology
Manipal Institute of Technology, Manipal University
Manipal-576104, INDIA
sanjay.singh@manipal.edu

Centre for Artificial and Machine Intelligence (CAMI)
Manipal University, Manipal-576104, INDIA

October 23, 2018


Factor analysis models data x ∈ Rn as "approximately" lying in some k-dimensional subspace, where k ≪ n.
Here we imagined that each point x(i) was created by first generating some z(i) lying in the k-dimensional affine space {µ + Λz; z ∈ Rk } and then adding Ψ-covariance noise.
Factor analysis is based on a probabilistic model, and parameter estimation uses the EM algorithm.
Principal Component Analysis (PCA) also tries to identify the subspace in which the data approximately lies.
PCA does this directly: it requires only an eigenvector calculation and does not need EM.



PCA has been rediscovered many times in many fields, so it is
also known as
Karhunen-Loève transform
Hotelling transformation
Method of empirical orthogonal functions, and
Singular value decomposition (SVD)


Consider a data set {x(i); i = 1, . . . , m} of m different types of automobiles.
Let x(i) ∈ Rn for each i (n ≪ m).
Suppose, unknown to us, two different attributes, say xi and xj, give a car's maximum speed measured in mph and kph respectively.
These two attributes are almost linearly dependent, with only small differences introduced by rounding to the nearest mph or kph.
Thus, the data really lies approximately on an (n − 1)-dimensional subspace.
How can we automatically detect, and remove, this redundancy?



We’ll now develop PCA. Prior to running PCA, we first pre-process the data to normalize its mean and variance:
1. Let $\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$
2. Replace each $x^{(i)}$ with $x^{(i)} - \mu$
3. Let $\sigma_j^2 = \frac{1}{m}\sum_{i} \big(x_j^{(i)}\big)^2$
4. Replace each $x_j^{(i)}$ with $x_j^{(i)}/\sigma_j$
Steps (1) and (2) zero out the mean of the data (not required for data known to have zero mean).
Steps (3) and (4) rescale each coordinate to have unit variance, which ensures that different attributes are all treated on the same "scale".
Steps (3) and (4) can be omitted if we have a priori knowledge that the different attributes are all on the same scale (a code sketch of steps (1)–(4) follows below).
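A minimal sketch of steps (1)–(4) in NumPy, assuming the data set is stored as an (m, n) array X with one example per row (the function name normalize is only illustrative, not from the slides):

```python
import numpy as np

def normalize(X):
    """Steps (1)-(4): zero out the mean, then rescale each coordinate to unit variance."""
    mu = X.mean(axis=0)                       # step 1: mu = (1/m) sum_i x^(i)
    X = X - mu                                # step 2: replace x^(i) with x^(i) - mu
    sigma = np.sqrt((X ** 2).mean(axis=0))    # step 3: sigma_j^2 = (1/m) sum_i (x_j^(i))^2
    X = X / sigma                             # step 4: replace x_j^(i) with x_j^(i) / sigma_j
    return X
```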

Figure 1: An unnormalized data set
Figure 2: A normalized data set



Having carried out the normalization, how do we compute
the “major axis of variation”, say ~u?
I.e., the direction on which the data approximately lies
We can pose this problem as finding the unit vector ~u so
that when the data is projected onto the direction
corresponding to ~u, the variance of the projected data is
maximized
We would like to choose a direction ~u so that if we were to
approximate the data as lying in the direction/subspace
corresponding to ~u, as much as possible of this variance is
retained


Figure 3: A normalized data set

Figure 4: A normalized data set

Suppose we pick ~u to correspond to the direction shown in Fig. 4. The circles denote the projection of the original data onto this line.
From Fig. 4 we see that the projected data still has a large variance, and the points tend to be far from zero.

Figure 5: Same data with a different direction

Here, the projections have a significantly smaller variance, and are much closer to the origin.
We would like to automatically select the direction ~u corresponding to the first of the two figures, i.e., Fig. 4 rather than Fig. 5.

Mathematics of Principal Components

Consider n-dimensional feature vectors, i.e., x ∈ Rn; we want to summarize them by projecting them into a k-dimensional subspace (k < n).
Our summary will be the projection of the original vectors onto the k directions, the principal components (PCs), which span the subspace.
The simplest way to derive the PCs is to find the projections which maximize the variance. (How do we know this is the right criterion? Just hold on for some time!)



We’ll start by looking for a 1D projection.
I.e., we have n-dimensional feature vectors, and we want to project them onto a line through the origin.
We can specify the line by a unit vector along it, ~u.
The projection of a data point x(i) onto the line is x(i)T~u, which is a scalar.
x(i)T~u is the distance of the projection from the origin; the actual co-ordinate in n-dimensional space is (x(i)T~u)~u.
The mean of the projections will be zero, because µx = 0:
$$\frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)T}\vec{u}\right)\vec{u} = \left(\left(\frac{1}{m}\sum_{i=1}^{m} x^{(i)}\right)\cdot\vec{u}\right)\vec{u}$$

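A small numerical illustration of this (a sketch with made-up data; X and u below are illustrative, not from the slides): the scalar projections x(i)T~u of zero-mean data themselves have zero mean.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = X - X.mean(axis=0)              # centre the data so that mu_x = 0

u = np.array([1.0, 2.0, 2.0])
u = u / np.linalg.norm(u)           # unit vector specifying the line

scalars = X @ u                     # x^(i)T u: signed distance of each projection from the origin
points = np.outer(scalars, u)       # (x^(i)T u) u: actual coordinates in R^n

print(scalars.mean())               # approximately 0, since the data has zero mean
```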

If we use our projected or image vectors instead of the original vectors, there will be some error, because the images do not coincide with the original vectors (when do they coincide?).
The difference is the error or residual of the projection.
How big is the residual?
$$\|x^{(i)} - (x^{(i)T}\vec{u})\,\vec{u}\|^2 = \|x^{(i)}\|^2 - 2\left(x^{(i)T}\vec{u}\right)^2 + \left(x^{(i)T}\vec{u}\right)^2\|\vec{u}\|^2 = \|x^{(i)}\|^2 - \left(x^{(i)T}\vec{u}\right)^2,$$
since $\|\vec{u}\|^2 = 1$.
The residual across all vectors is given by
$$\mathrm{RSS}(\vec{u}) = \sum_{i=1}^{m}\left(\|x^{(i)}\|^2 - \left(x^{(i)T}\vec{u}\right)^2\right) = \sum_{i=1}^{m}\|x^{(i)}\|^2 - \sum_{i=1}^{m}\left(x^{(i)T}\vec{u}\right)^2$$
The first term in RSS doesn’t depend on ~u, so we need not consider it when minimizing RSS.
To minimize RSS, we therefore need to maximize the second term, i.e., we want to maximize $\sum_{i=1}^{m}\left(x^{(i)T}\vec{u}\right)^2$.
Since m does not depend on ~u, this is the same as maximizing
$$\frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)T}\vec{u}\right)^2,$$
which can be seen as the sample mean of $(x^{(i)T}\vec{u})^2$.
The mean of the squares is equal to the square of the mean plus the variance:
$$\frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)T}\vec{u}\right)^2 = \underbrace{\left(\frac{1}{m}\sum_{i=1}^{m} x^{(i)T}\vec{u}\right)^2}_{=0} + \mathrm{Var}\left[x^{(i)T}\vec{u}\right]$$
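For reference, this is just the usual (biased) sample-variance identity, rearranged; writing $a_i = x^{(i)T}\vec{u}$ and $\bar{a} = \frac{1}{m}\sum_i a_i$ (symbols introduced only for this aside):
$$\mathrm{Var}[a] = \frac{1}{m}\sum_{i=1}^{m}(a_i-\bar{a})^2 = \frac{1}{m}\sum_{i=1}^{m}a_i^2 - \bar{a}^2 \quad\Longrightarrow\quad \frac{1}{m}\sum_{i=1}^{m}a_i^2 = \bar{a}^2 + \mathrm{Var}[a]$$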


Now, minimizing the residual sum of squares (RSS) is equivalent to maximizing the variance of the projections.
In general, we want to project not just onto one vector but onto multiple principal components.
If those components are orthogonal and have unit vectors ~u1, . . . , ~uk, then the image of x(i) is its projection into the space spanned by these vectors:
$$\sum_{j=1}^{k}\left(x^{(i)T}\vec{u}_j\right)\vec{u}_j$$



Maximizing Variance

Let X ∈ Rm×n be the (normalized) data matrix whose i-th row is x(i)T.
The variance is defined as:
$$\begin{aligned}
\sigma_u^2 &= \frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)T}\vec{u}\right)^2\\
&= \frac{1}{m}(X\vec{u})^T(X\vec{u}) \qquad \text{[the $i$-th entry of $X\vec{u}$ is $x^{(i)T}\vec{u}$]}\\
&= \frac{1}{m}\,\vec{u}^T X^T X\,\vec{u}\\
&= \vec{u}^T\,\frac{X^T X}{m}\,\vec{u}\\
&= \vec{u}^T\Sigma\,\vec{u}
\end{aligned}$$
where Σ = XᵀX/m is the sample covariance matrix of the (zero-mean) data.
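As a quick numerical sanity check of this chain of equalities (a sketch with synthetic data; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X = X - X.mean(axis=0)                  # normalized (zero-mean) data matrix, one example per row

u = rng.normal(size=4)
u = u / np.linalg.norm(u)               # an arbitrary unit vector

Sigma = (X.T @ X) / X.shape[0]          # sample covariance matrix X^T X / m

var_direct = np.mean((X @ u) ** 2)      # (1/m) sum_i (x^(i)T u)^2
var_quadratic = u @ Sigma @ u           # u^T Sigma u

print(np.isclose(var_direct, var_quadratic))   # True
```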


σu² = ~uT Σ~u
We want to choose a unit vector ~u so as to maximize σu².
To do this we need to look only at unit vectors, i.e., we need to constrain the maximization:
$$\begin{aligned}
\underset{\vec{u}}{\text{maximize}}\quad & \vec{u}^T \Sigma\, \vec{u}\\
\text{subject to}\quad & \vec{u}^T\vec{u} = 1
\end{aligned}$$
Solving using a Lagrange multiplier (set the gradient of L(~u, λ) = ~uT Σ~u − λ(~uT~u − 1) with respect to ~u to zero), we get Σ~u = λ~u.
The desired vector ~u is an eigenvector of the covariance matrix Σ.
The maximizing vector will be the one associated with the largest eigenvalue λ.
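A sketch of this eigenvector computation in NumPy (assuming X is the normalized (m, n) data matrix; np.linalg.eigh returns eigenvalues in ascending order, so the last column corresponds to the largest λ):

```python
import numpy as np

def first_principal_component(X):
    """Return the unit eigenvector of Sigma = X^T X / m with the largest eigenvalue."""
    Sigma = (X.T @ X) / X.shape[0]              # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)    # eigh: symmetric matrix, ascending eigenvalues
    return eigvecs[:, -1]                       # direction ~u maximizing u^T Sigma u
```

For any unit vector ~u, ~uT Σ~u is at most the largest eigenvalue, and this eigenvector attains it.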



In general, if we wish to project our data into a
k-dimensional subspace (k < n), we should choose
u1 , . . . , uk to be the top k eigenvectors of Σ
The ui ’s now form a new orthogonal basis for the data
To represent x(i) in this basis, we need only compute the
corresponding vector
$$y^{(i)} = \begin{bmatrix} u_1^T x^{(i)} \\ u_2^T x^{(i)} \\ \vdots \\ u_k^T x^{(i)} \end{bmatrix} \in \mathbb{R}^k$$

Since x(i) ∈ Rn, the vector y(i) gives a lower, k-dimensional representation/approximation for x(i).
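In matrix form, stacking the y(i) as rows gives Y = X Uk, where the columns of Uk are u1, . . . , uk; a sketch (the function name pca_transform is illustrative):

```python
import numpy as np

def pca_transform(X, k):
    """Map each row x^(i) of the normalized data matrix X to y^(i) = (u_1^T x^(i), ..., u_k^T x^(i))."""
    Sigma = (X.T @ X) / X.shape[0]              # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)    # ascending eigenvalues, orthonormal eigenvectors
    U_k = eigvecs[:, ::-1][:, :k]               # columns u_1, ..., u_k: top-k eigenvectors
    return X @ U_k                              # row i is y^(i) in R^k
```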


PCA is also referred to as a dimensionality reduction algorithm.
The vectors u1, . . . , uk are called the first k principal components of the data.
Applications of PCA:
Compression (see the sketch below)
Applying PCA as a pre-processing step before supervised learning
many more
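For instance, compression stores the k numbers y(i) in place of the n numbers x(i); an approximate reconstruction is x̂(i) = Σj yj(i) uj. A self-contained sketch with synthetic data (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10))
X = X - X.mean(axis=0)                           # assume normalization has been done

k = 3
Sigma = (X.T @ X) / X.shape[0]
U_k = np.linalg.eigh(Sigma)[1][:, ::-1][:, :k]   # top-k eigenvectors of the covariance matrix

Y = X @ U_k                                      # compressed data: k numbers per example instead of n
X_hat = Y @ U_k.T                                # approximate reconstruction back in R^n

print(np.mean((X - X_hat) ** 2))                 # average squared reconstruction error
```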

