14 PCA Max Variance and Min Error

Published in Towards Data Science

Aadhithya Sankar

Sep 29, 2021 · 9 min read

Principal Component Analysis Part 1: The Different Formulations
What is Principal Component Analysis? What are the Maximum
Variance and Minimum Error formulations of PCA? How do we reduce
dimensionality using PCA?

Fig. 1: Principal components of a multivariate Gaussian centered at (1, 3). Image source: [3].

We all know that Principal Component Analysis is one of the standard methods used in dimensionality reduction. There are multiple posts detailing the code and implementation of PCA; in this post, however, we will look into how PCA is formulated and how we arrive at the PCA algorithm. While it is widely known that the principal components are the eigenvectors corresponding to the largest eigenvalues of the covariance matrix, this post will explore why that is the case and how that solution is arrived at. The content of this post is based on the material provided in Chapter 12 of [1] and in [2].

Before we jump into PCA, it is important to know what eigenvalues and eigenvectors are. Let A ∈ R^{n×n} be an n×n matrix. Then, a vector x ∈ C^n is called an eigenvector of A if

Ax = λx,  x ≠ 0

Eq. 1: Eigenvalues and eigenvectors.

The scalar λ ∈ C is called an eigenvalue of A.
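As a quick illustration (this numerical check is my own addition; the matrix A below is an arbitrary example), we can verify Eq. 1 with NumPy:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eig returns the eigenvalues and a matrix whose columns are the eigenvectors.
eigen_values, eigen_vectors = np.linalg.eig(A)

for lam, x in zip(eigen_values, eigen_vectors.T):
    # Eq. 1: A x = λ x holds for every eigenpair.
    assert np.allclose(A @ x, lam * x)

print(eigen_values)  # e.g. [3. 1.] (the order is not guaranteed)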

The Principal Component Analysis (PCA) problem can be formulated in two ways: the Maximum Variance Formulation and the Minimum Error Formulation. In the Maximum Variance Formulation, the goal is to find the orthogonal projection of the data onto a lower-dimensional linear space such that the variance of the projected data is maximised. In the Minimum Error Formulation, PCA is defined as the linear projection that minimises the average projection cost (mean squared error) between the data points and their projections. We will see in the following sections that both formulations lead to the same solution.

PCA: Maximum Variance Formulation


Given the set of observations {x_n}, n = 1, 2, …, N with x_n ∈ R^D, our goal according to the maximum variance formulation is to find an orthogonal projection of x_n onto a space of dimension M < D.

Now let us consider the simplest case where M = 1. We define a vector w1 ∈ R^D as the direction of the lower-dimensional space. Since we are only interested in the direction of the space, we set w1 to be of unit length, i.e.,
w1^T w1 = 1

Eq. 2: w1 is a unit vector.

Then the data observations x_n can be projected onto this new space as

x̂_n = w1^T x_n

Eq. 3: Projection of x_n onto the new space defined by w1.

If x̄ is the mean of the data observations in the original space, then the mean
of the samples in the projected space is given by

w1^T x̄

Eq. 3: Projection of x̄ onto the new space defined by w1.

Now, we can write the variance of the projected data as

(1/N) Σ_{n=1}^{N} (w1^T x_n − w1^T x̄)² = w1^T S w1

Eq. 4: Variance of x̂.

where S is the covariance matrix of the observed data in the original high-dimensional space:

S = (1/N) Σ_{n=1}^{N} (x_n − x̄)(x_n − x̄)^T

Eq. 6: Covariance of x.

Now, according to the maximum variance formulation, we need to maximise the variance of x̂, i.e., maximise Eq. 4. However, Eq. 4 can be made arbitrarily large by letting ||w1|| → ∞. To prevent this, we make use of the unit-norm constraint we set earlier in Eq. 2 and enforce it with a Lagrange multiplier λ1, which gives the following optimisation objective:

w1^T S w1 + λ1 (1 − w1^T w1)

Eq. 7: Maximum variance optimisation problem.

Setting the derivative of Eq. 7 w.r.t. w1 to 0, we get a stationary point when

S w1 = λ1 w1

Eq. 8: Setting the derivative of Eq. 7 w.r.t. w1 to zero.

This shows that at a stationary point, w1 must be an eigenvector of S and λ1 the eigenvalue corresponding to the eigenvector w1. Left-multiplying Eq. 8 with w1^T and using w1^T w1 = 1, we can see that the maximum variance is equal to the eigenvalue λ1.

w1^T S w1 = λ1

Eq. 9: The maximum variance in the lower-dimensional space is equal to the eigenvalue corresponding to the eigenvector w1.

We can identify additional principal components by choosing new directions that maximise the variance while being orthogonal to the ones already found. For the general case of a lower-dimensional space of dimension M < D, the principal components are the eigenvectors w1, w2, …, wM corresponding to the M largest eigenvalues λ1, λ2, …, λM.
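To make Eq. 9 concrete, here is a small numerical sketch (my own illustration; the data and covariance are made up): the variance of the data projected onto the top eigenvector of S equals the largest eigenvalue.

import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1, 3], cov=[[3, 1], [1, 2]], size=2000)

X_centered = X - X.mean(axis=0)
S = np.cov(X_centered.T)                          # sample covariance matrix

eigen_values, eigen_vectors = np.linalg.eigh(S)   # eigh, since S is symmetric
w1 = eigen_vectors[:, np.argmax(eigen_values)]    # unit-length top eigenvector

projected_variance = (X_centered @ w1).var(ddof=1)  # variance of w1^T x_n
print(projected_variance, eigen_values.max())        # the two values match (Eq. 9)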

PCA: Minimum Error Formulation


Let {x_n}, n = 1, 2, …, N with x_n ∈ R^D be the set of data observations. Our goal according to the minimum error formulation of PCA is to find the transformation that minimises the reconstruction error:

J = (1/N) Σ_{n=1}^{N} ||x_n − x̃_n||²

Eq. 10: Minimum error objective.

where x̃_n is the reconstruction generated from a lower-dimensional latent variable. Here, we have a complete D-dimensional orthonormal (orthogonal and unit-length) basis w_i, where i = 1, 2, …, D. Then we have

w_i^T w_j = δᵢⱼ

Eq. 11: Orthonormal basis.

where δᵢⱼ is the Kronecker delta. Since the basis is complete, we can represent any vector as a linear combination of the basis vectors

x_n = Σ_{i=1}^{D} α_ni w_i

Eq. 12

Since the basis is orthonormal, the coefficients α_ni are given by the dot product of x_n and w_i. Then, we can write Eq. 12 as

x_n = Σ_{i=1}^{D} (x_n^T w_i) w_i

Eq. 13: A proof for this can be found in [4].
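As a quick sanity check of Eq. 13 (my own sketch, using a random orthonormal basis built with a QR decomposition), summing the projections onto a complete orthonormal basis recovers the original vector exactly:

import numpy as np

rng = np.random.default_rng(1)
D = 5

# Columns of W form a complete orthonormal basis w_1, ..., w_D.
W, _ = np.linalg.qr(rng.normal(size=(D, D)))

x = rng.normal(size=D)
alpha = W.T @ x              # α_i = x^T w_i
x_reconstructed = W @ alpha  # Σ_i α_i w_i (Eq. 13)

assert np.allclose(x, x_reconstructed)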

With dimensionality reduction, our goal is to find an M-dimensional representation of our D-dimensional data (with M < D) by projecting it onto the lower-dimensional space. We can represent each data point in this M-dimensional space using the first M basis vectors

x̃_n = Σ_{i=1}^{M} z_ni w_i + Σ_{i=M+1}^{D} b_i w_i

Eq. 14

and the remaining (D-M) basis vectors contribute offsets that are shared by all data points (shared offsets). In Eq. 14, z_ni depends on the individual data point, whereas the b_i are constants shared by all data points.

From Eq. 13 and Eq. 14 we can compute the difference between x_n and x̃_n
as

x_n − x̃_n = Σ_{i=1}^{M} (x_n^T w_i − z_ni) w_i + Σ_{i=M+1}^{D} (x_n^T w_i − b_i) w_i

Eq. 15: Difference between x_n and x̃_n.

Now, we can substitute Eq. 15 into Eq. 10 and use the orthonormality of the basis to get the objective function:

J = (1/N) Σ_{n=1}^{N} [ Σ_{i=1}^{M} (x_n^T w_i − z_ni)² + Σ_{i=M+1}^{D} (x_n^T w_i − b_i)² ]

Eq. 16

Taking the derivative w.r.t. z_nj and setting it to zero, we get

z_nj = x_n^T w_j

Eq. 17

Taking the derivative w.r.t. b_j and setting it to zero, we get

b_j = x̄^T w_j

Eq. 18

Now, we can substitute Eq. 17 and Eq. 18 into Eq. 16 and get

J = (1/N) Σ_{n=1}^{N} Σ_{i=M+1}^{D} (x_n^T w_i − x̄^T w_i)² = Σ_{i=M+1}^{D} w_i^T S w_i

Eq. 18: The distortion measure J after substitution.

Our goal is to minimise J(w), but we observe that a trivial solution to this problem exists when w = 0. To overcome this, we again make use of the property of orthonormal bases and impose the normalisation constraint ||w||² = 1.

Now, let us look at a simple example where D = 2 and M = 1. We have to choose a direction w2 such that we minimise the following objective

J = w2^T S w2 + λ2 (1 − w2^T w2)

Eq. 19

As before, taking the derivative w.r.t. w2 and setting it to 0, we get

S w2 = λ2 w2

Eq. 20

where w2 is an eigenvector of S with corresponding eigenvalue λ2. Substituting Eq. 20 into Eq. 19, we get J = λ2, i.e., J is minimised when we choose the eigenvector with the smallest eigenvalue. In the general case with M < D, the minimum reconstruction error J is obtained when the discarded directions w_i are the eigenvectors corresponding to the (D-M) smallest eigenvalues of the covariance matrix given by
S = (1/N) Σ_{n=1}^{N} (x_n − x̄)(x_n − x̄)^T

Eq. 21

The minimum reconstruction error (distortion measure) is then given by

J = Σ_{i=M+1}^{D} λ_i

Eq. 22

i.e., the sum of the (D-M) smallest eigenvalues. Therefore, our goal is to retain the directions with the M largest eigenvalues, so that the distortion measure J, which consists of the (D-M) smallest eigenvalues, is minimised. Thus we can conclude that minimising the reconstruction error maximises the variance of the projection.
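The following sketch (my own; the data is synthetic) checks Eq. 22 numerically: reconstructing the data from the M eigenvectors with the largest eigenvalues leaves a mean squared error equal to the sum of the (D-M) smallest eigenvalues.

import numpy as np

rng = np.random.default_rng(2)
N, D, M = 5000, 4, 2

X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # correlated synthetic data
X_mean = X.mean(axis=0)
X_centered = X - X_mean

S = (X_centered.T @ X_centered) / N                     # covariance with the 1/N factor (Eq. 21)
eigen_values, eigen_vectors = np.linalg.eigh(S)         # eigenvalues in ascending order

W_M = eigen_vectors[:, -M:]                             # eigenvectors of the M largest eigenvalues
X_reconstructed = X_mean + (X_centered @ W_M) @ W_M.T   # project down, then map back

J = np.mean(np.sum((X - X_reconstructed) ** 2, axis=1)) # distortion measure (Eq. 10)
print(J, eigen_values[:-M].sum())                       # both equal the sum of the discarded eigenvalues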

Dimensionality Reduction with PCA


Now that we’ve seen that the M principal components are the M eigenvectors
corresponding to the M largest eigenvalues of the covariance matrix, we move
on to see how PCA is applied to reduce dimensionality.

Dimensionality reduction using PCA consists of 4 steps:

1. Center the data

The first step is to compute the mean of the data points and subtract it, so that the data is centred at the origin and has zero mean.

X̂ = X − x̄

Eq. 23: Centered data X̂ (the mean x̄ is subtracted from every row of X).

2. Compute Covariance Matrix

S = (1/N) X̂^T X̂

Eq. 24: Covariance matrix.

3. Compute Eigen Values and Vectors using Eigen Value Decomposition

The goal of PCA is a transformation of the coordinate system such that the covariance between the new axes is 0.

Fig. 2: The goal of PCA is to find a space W such that the covariance between the new axes is 0.

So we perform an Eigen Value Decomposition of the covariance matrix S:

S = Γ Λ Γ^T

Eq. 25: EVD of the covariance matrix.

Here, Γ ∈ ℝ^(D×D) is the matrix whose columns are the eigenvectors and Λ ∈ ℝ^(D×D) is a diagonal matrix containing the corresponding eigenvalues.

4. Dimensionality Reduction

Now that we have the eigenvectors Γ, to reduce the dimensionality we can truncate Γ by keeping only the columns (eigenvectors) corresponding to the M largest eigenvalues. We call the truncated matrix Γ’. The representation in the reduced space is then obtained by

Y = X̂ Γ’

Eq. 26: Dimensionality reduction.

Code: Dimensionality Reduction using PCA



import numpy as np


def pca_dim_reduction(X, M):
    assert M < X.shape[1], "M has to be less than the dimensionality of X!"

    # * 1. Center Data (subtract the per-feature mean)
    X_mean = np.mean(X, axis=0)
    X_centered = X - X_mean

    # * 2. Compute Covariance Matrix
    X_cov = np.cov(X_centered.T)

    # * 3.1. Get Eigen Values and Vectors
    eigen_values, eigen_vectors = np.linalg.eig(X_cov)
    # * 3.2. Get the indices of the top M eigen values
    top_M_idx = np.argsort(eigen_values)[::-1][:M]

    # * 4.1 Get top M Eigen Vectors (columns of the eigenvector matrix)
    top_M_eigen_vectors = eigen_vectors[:, top_M_idx]

    # * 4.2 Return the projections of the centered data in the reduced space
    return np.dot(X_centered, top_M_eigen_vectors)


Code: Dimensionality Reduction with PCA.
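Here is a quick usage sketch for the function above (the data is random and purely illustrative):

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # 500 samples, D = 4

X_reduced = pca_dim_reduction(X, M=2)
print(X_reduced.shape)                   # (500, 2)

# The retained directions are uncorrelated in the reduced space.
print(np.round(np.cov(X_reduced.T), 6))  # (approximately) diagonal covariance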

Performance of PCA with EVD


PCA using Eigen Value Decomposition (EVD) is very expensive, with a complexity of O(D³), where D is the dimensionality of the input data. EVD computes all eigenvalue and eigenvector pairs, whereas we usually need only the eigenvectors corresponding to the M largest eigenvalues. Therefore, in practice, much more efficient iterative approaches such as the power iteration method are used to compute the required eigenvectors.
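As a rough illustration of the idea (a bare-bones sketch of my own; practical implementations add convergence checks and deflation, or use sparse solvers, to extract several eigenvectors), power iteration finds only the dominant eigenpair of a covariance matrix S:

import numpy as np

def top_eigenpair(S, num_iters=1000, seed=0):
    """Power iteration: repeatedly apply S and renormalise to approach its dominant eigenpair."""
    w = np.random.default_rng(seed).normal(size=S.shape[0])
    for _ in range(num_iters):
        w = S @ w
        w /= np.linalg.norm(w)   # keep the iterate at unit length
    eigen_value = w @ S @ w      # Rayleigh quotient w^T S w with ||w|| = 1
    return eigen_value, w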

PCA and Data Standardisation [2]

Fig. 3: PCA can get misled by unstandardised data. (a) The principal component is skewed because PCA is misled by the unstandardised data. (b) PCA when the scales are standardised. Image generated from code at [5].


The principal directions found by PCA are the ones along which the variance is largest. So, PCA can be misled by directions along which the variance appears high merely because of the measurement scale. We can see this in Fig. 3(a), where the principal component is not aligned properly because it is misled by the unstandardised scale. Fig. 3(b) shows the correct principal component when the scales are standardised. Therefore, care needs to be taken to standardise the data to avoid such issues.
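In code, standardising simply means z-scoring each feature before running PCA (a minimal sketch of my own; pca_dim_reduction is the function defined earlier):

import numpy as np

def standardise(X, eps=1e-12):
    """Z-score every feature so that no axis dominates PCA just because of its measurement scale."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

# PCA on standardised data amounts to an eigendecomposition of the
# correlation matrix instead of the covariance matrix.
# X_reduced = pca_dim_reduction(standardise(X), M=2)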

Conclusion
In this rather long post, we dove deep into the two formulations of PCA: the Maximum Variance and the Minimum Error Formulation. We saw that both formulations lead to the same solution/algorithm: select as the new basis the eigenvectors corresponding to the M largest eigenvalues of the covariance matrix of the data. We saw how PCA can be used for dimensionality reduction and how it can be implemented in Python. Finally, we briefly looked into the importance of standardising the data and how it affects the algorithm. This brings us to the end of this post, which is merely Part 1 of this series on PCA. The next parts will cover Probabilistic PCA, Singular Value Decomposition, Autoencoders, and the relationship between Autoencoders, PCA and SVD.

Follow Aadhithya Sankar to get notified when the next parts are made
available!

If you find any mistakes, please leave a comment and I will fix them! 🙏🏽 ✌🏽

References
[1] Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006.

[2] Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

[3]https://commons.wikimedia.org/wiki/File:GaussianScatterPCA.svg#/media/
File:GaussianScatterPCA.svg

[4] https://www.math.ucdavis.edu/~linear/old/notes21.pdf

[5] Murphy, K., Soliman, M., Duran-Martin, G., Kara, A., Liang Ang, M., Reddy, S., & Patel, D. (2021). PyProbML library for Probabilistic Machine Learning [Computer software].

Resources
Here are some resources that can help you understand the topic better:

1. PCA: Maximum Variance Formulation (University of Amsterdam)


2. PCA: Minimum Error Formulation (University of Amsterdam)

More Works by the Author


If you liked this post, you might also enjoy the following posts:


Real-time Artwork Generation using Deep Learning
Adaptive Instance Normalisation (AdaIN) for style transfer between any arbitrary content-style image pair.
towardsdatascience.com

A Primer on Atrous Convolutions and Depth-wise Separable Convolutions
What are atrous/dilated and depth-wise separable convolutions? How are they different from standard convolutions? What…
towardsdatascience.com

Demystified: Wasserstein GANs (WGAN)


What is the Wasserstein distance? What is the intuition behind using
Wasserstein distance to train GANs? How is it…
towardsdatascience.com

Thanks to Abdullah Farouk and Ben Huberman
