14 PCA Max Variance and Min Error

Published in Towards Data Science

Aadhithya Sankar

Sep 29, 2021 · 9 min read

Principal Component Analysis Part 1: The Different Formulations
What is Principal Component Analysis? What are the Maximum
Variance and Minimum Error formulations of PCA? How do we reduce
dimensionality using PCA?

Fig. 1: Principal components of a multivariate Gaussian centered at (1, 3). Image source: [3].

We all know that Principal Component Analysis is one of the standard methods used in dimensionality reduction. There are multiple posts detailing the code and implementation of PCA; in this post, however, we will look into how PCA is formulated and how we arrive at the PCA algorithm. While it is widely known that the principal components are the eigenvectors corresponding to the largest eigenvalues of the covariance matrix, this post will explore why that is the case and how that solution is arrived at. The content of this post is based on the material provided in Chapter 12 of [1] and in [2].

Before we jump into PCA, it is important to know what eigenvalues and eigenvectors are. Let A ∈ R^{n×n} be an n×n matrix. Then, a vector x ∈ C^n is called an eigenvector of A if

Ax = λx,  x ≠ 0

Eq. 1: Eigenvalues and eigenvectors.

The scalar λ ∈ C is called an eigenvalue of A.
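As a quick illustration (this numerical check is my own addition; the matrix A below is an arbitrary example), we can verify Eq. 1 with NumPy:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eig returns the eigenvalues and a matrix whose columns are the eigenvectors.
eigen_values, eigen_vectors = np.linalg.eig(A)

for lam, x in zip(eigen_values, eigen_vectors.T):
    # Eq. 1: A x = λ x holds for every eigenpair.
    assert np.allclose(A @ x, lam * x)

print(eigen_values)  # e.g. [3. 1.] (the order is not guaranteed)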

The Principal Component Analysis (PCA) problem can be formulated in two ways: the Maximum Variance Formulation and the Minimum Error Formulation. In the Maximum Variance Formulation, the goal is to find the orthogonal projection of the data onto a lower-dimensional linear space such that the variance of the projected data is maximised. In the Minimum Error Formulation, PCA is defined as the linear projection that minimises the average projection cost (mean squared error) between the data points and their projections. We will see in the following sections that both formulations lead to the same solution.

PCA: Maximum Variance Formulation


Given the set of observations {x_n}, n = 1, 2, …, N with x_n ∈ R^D, our goal according to the maximum variance formulation is to find an orthogonal projection of x_n onto a space of dimension M < D.

Now let us consider the simplest case where M = 1. We define a vector w1 ∈ R^D as the direction of the lower-dimensional space. Since we are only interested in the direction of the space, we set w1 to be of unit length, i.e.,
w1^T w1 = 1

Eq. 2: w1 is a unit vector.

Then the data observations x_n can be projected onto this new space as

x̂_n = w1^T x_n

Eq. 3: Projection of x_n onto the new space defined by w1.

If x̄ is the mean of the data observations in the original space, then the mean
of the samples in the projected space is given by

w1^T x̄

Eq. 3: Projection of x̄ onto the new space defined by w1.

Now, we can write the variance of the projected data as

(1/N) Σ_{n=1}^{N} (w1^T x_n − w1^T x̄)² = w1^T S w1

Eq. 4: Variance of x̂.

where S is the covariance matrix of the observed data in the original high-dimensional space:

S = (1/N) Σ_{n=1}^{N} (x_n − x̄)(x_n − x̄)^T

Eq. 6: Covariance of x.

Now, according to the maximum variance formulation, we need to maximise the variance of x̂, i.e., maximise Eq. 4. However, Eq. 4 can be made arbitrarily large by letting ||w1|| → ∞. To prevent this, we make use of the unit-norm constraint we set earlier in Eq. 2 and enforce it with a Lagrange multiplier λ1, which gives the following optimisation objective:

w1^T S w1 + λ1 (1 − w1^T w1)

Eq. 7: Maximum variance optimisation problem.

Setting the derivative of Eq. 7 w.r.t. w1 to 0, we get a stationary point when

S w1 = λ1 w1

Eq. 8: Setting the derivative of Eq. 7 w.r.t. w1 to zero.

This shows that at a stationary point, w1 must be an eigenvector of S and λ1 the eigenvalue corresponding to the eigenvector w1. Left-multiplying Eq. 8 with w1^T and using w1^T w1 = 1, we can see that the maximum variance is equal to the eigenvalue λ1.

w1^T S w1 = λ1

Eq. 9: The maximum variance in the lower-dimensional space is equal to the eigenvalue corresponding to the eigenvector w1.

We can identify additional principal components by choosing new directions that maximise the variance while being orthogonal to the ones already found. For the general case of a lower-dimensional space of dimension M < D, the principal components are the eigenvectors w1, w2, …, wM corresponding to the M largest eigenvalues λ1, λ2, …, λM.
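To make Eq. 9 concrete, here is a small numerical sketch (my own illustration; the data and covariance are made up): the variance of the data projected onto the top eigenvector of S equals the largest eigenvalue.

import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1, 3], cov=[[3, 1], [1, 2]], size=2000)

X_centered = X - X.mean(axis=0)
S = np.cov(X_centered.T)                          # sample covariance matrix

eigen_values, eigen_vectors = np.linalg.eigh(S)   # eigh, since S is symmetric
w1 = eigen_vectors[:, np.argmax(eigen_values)]    # unit-length top eigenvector

projected_variance = (X_centered @ w1).var(ddof=1)  # variance of w1^T x_n
print(projected_variance, eigen_values.max())        # the two values match (Eq. 9)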

PCA: Minimum Error Formulation


Let {x_n}, n = 1, 2, …, N with x_n ∈ R^D be the set of data observations. Our goal according to the minimum error formulation of PCA is to find the transformation that minimises the reconstruction error:

J = (1/N) Σ_{n=1}^{N} ||x_n − x̃_n||²

Eq. 10: Minimum error objective.

where x̃_n is the reconstruction generated from a lower-dimensional latent variable. Here, we have a complete D-dimensional orthonormal (orthogonal and unit-length) basis w_i, where i = 1, 2, …, D. Then we have

w_i^T w_j = δᵢⱼ

Eq. 11: Orthonormal basis.

where δᵢⱼ is the Kronecker delta. Since the basis is complete, we can represent any vector as a linear combination of the basis vectors

x_n = Σ_{i=1}^{D} α_ni w_i

Eq. 12

Since the basis is orthonormal, the coefficients α_ni are given by the dot product of x_n and w_i. Then, we can write Eq. 12 as

x_n = Σ_{i=1}^{D} (x_n^T w_i) w_i

Eq. 13: A proof for this can be found in [4].
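As a quick sanity check of Eq. 13 (my own sketch, using a random orthonormal basis built with a QR decomposition), summing the projections onto a complete orthonormal basis recovers the original vector exactly:

import numpy as np

rng = np.random.default_rng(1)
D = 5

# Columns of W form a complete orthonormal basis w_1, ..., w_D.
W, _ = np.linalg.qr(rng.normal(size=(D, D)))

x = rng.normal(size=D)
alpha = W.T @ x              # α_i = x^T w_i
x_reconstructed = W @ alpha  # Σ_i α_i w_i (Eq. 13)

assert np.allclose(x, x_reconstructed)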

With dimensionality reduction, our goal is to find an M-dimensional representation of our D-dimensional data (with M < D) by projecting it onto the lower-dimensional space. We can represent each data point in this M-dimensional space using the first M basis vectors

x̃_n = Σ_{i=1}^{M} z_ni w_i + Σ_{i=M+1}^{D} b_i w_i

Eq. 14

and the remaining (D-M) basis vectors contribute offsets that are shared by all data points (shared offsets). In Eq. 14, z_ni depends on the individual data point, whereas the b_i are constants shared by all data points.

From Eq. 13 and Eq. 14 we can compute the difference between x_n and x̃_n
as

x_n − x̃_n = Σ_{i=1}^{M} (x_n^T w_i − z_ni) w_i + Σ_{i=M+1}^{D} (x_n^T w_i − b_i) w_i

Eq. 15: Difference between x_n and x̃_n.

Now, we can substitute Eq. 15 into Eq. 10 and use the orthonormality of the basis to get the objective function:

J = (1/N) Σ_{n=1}^{N} [ Σ_{i=1}^{M} (x_n^T w_i − z_ni)² + Σ_{i=M+1}^{D} (x_n^T w_i − b_i)² ]

Eq. 16

Taking the derivative w.r.t. z_nj and setting it to zero, we get

z_nj = x_n^T w_j

Eq. 17

Taking the derivative w.r.t. b_j and setting it to zero, we get

b_j = x̄^T w_j

Eq. 18

Now, we can substitute Eq. 17 and Eq. 18 into Eq. 16 and get

J = (1/N) Σ_{n=1}^{N} Σ_{i=M+1}^{D} (x_n^T w_i − x̄^T w_i)² = Σ_{i=M+1}^{D} w_i^T S w_i

Eq. 18: The distortion measure J after substitution.

Our goal is to minimise J(w), but we observe that a trivial solution to this problem exists when w = 0. To overcome this, we again make use of the property of orthonormal bases and impose the normalisation constraint ||w||² = 1.

Now, let us look at a simple example where D = 2 and M = 1. We have to choose a direction w2 such that we minimise the following objective

J = w2^T S w2 + λ2 (1 − w2^T w2)

Eq. 19

As before, taking the derivative w.r.t. w2 and setting it to 0, we get

S w2 = λ2 w2

Eq. 20

where w2 is an eigenvector of S with corresponding eigenvalue λ2. Substituting Eq. 20 into Eq. 19, we get J = λ2, i.e., J is minimised when we choose the eigenvector with the smallest eigenvalue. In the general case with M < D, the minimum reconstruction error J is obtained when the discarded directions w_i are the eigenvectors corresponding to the (D-M) smallest eigenvalues of the covariance matrix given by
S = (1/N) Σ_{n=1}^{N} (x_n − x̄)(x_n − x̄)^T

Eq. 21

The minimum reconstruction error (distortion measure) is then given by

J = Σ_{i=M+1}^{D} λ_i

Eq. 22

i.e., the sum of the (D-M) smallest eigenvalues. Therefore, our goal is to retain the directions with the M largest eigenvalues, so that the distortion measure J, which consists of the (D-M) smallest eigenvalues, is minimised. Thus we can conclude that minimising the reconstruction error maximises the variance of the projection.
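The following sketch (my own; the data is synthetic) checks Eq. 22 numerically: reconstructing the data from the M eigenvectors with the largest eigenvalues leaves a mean squared error equal to the sum of the (D-M) smallest eigenvalues.

import numpy as np

rng = np.random.default_rng(2)
N, D, M = 5000, 4, 2

X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # correlated synthetic data
X_mean = X.mean(axis=0)
X_centered = X - X_mean

S = (X_centered.T @ X_centered) / N                     # covariance with the 1/N factor (Eq. 21)
eigen_values, eigen_vectors = np.linalg.eigh(S)         # eigenvalues in ascending order

W_M = eigen_vectors[:, -M:]                             # eigenvectors of the M largest eigenvalues
X_reconstructed = X_mean + (X_centered @ W_M) @ W_M.T   # project down, then map back

J = np.mean(np.sum((X - X_reconstructed) ** 2, axis=1)) # distortion measure (Eq. 10)
print(J, eigen_values[:-M].sum())                       # both equal the sum of the discarded eigenvalues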

Dimensionality Reduction with PCA


Now that we’ve seen that the M principal components are the M eigenvectors
corresponding to the M largest eigenvalues of the covariance matrix, we move
on to see how PCA is applied to reduce dimensionality.

Dimensionality reduction using PCA consists of 4 steps:

1. Center the data

The first step is to compute the mean of the data points and subtract it, so that the data is centred at the origin and has zero mean.

X̂ = X − x̄

Eq. 23: Centered data X̂ (the mean x̄ is subtracted from every row of X).

2. Compute Covariance Matrix

S = (1/N) X̂^T X̂

Eq. 24: Covariance matrix.

3. Compute Eigen Values and Vectors using Eigen Value Decomposition

The goal of PCA is a transformation of the coordinate system such that the covariance between the new axes is 0.

Fig. 2: The goal of PCA is to find a space W such that the covariance between the new axes is 0.

So we perform an Eigen Value Decomposition of the covariance matrix S:

S = Γ Λ Γ^T

Eq. 25: EVD of the covariance matrix.

Here, Γ ∈ ℝ^(D×D) is the matrix whose columns are the eigenvectors and Λ ∈ ℝ^(D×D) is a diagonal matrix containing the corresponding eigenvalues.

4. Dimensionality Reduction

Now that we have the eigenvectors Γ, to reduce the dimensionality we can truncate Γ by keeping only the columns (eigenvectors) corresponding to the M largest eigenvalues. We call the truncated matrix Γ’. The representation in the reduced space is then obtained by

Y = X̂ Γ’

Eq. 26: Dimensionality reduction.

Code: Dimensionality Reduction using PCA



import numpy as np


def pca_dim_reduction(X, M):
    assert M < X.shape[1], "M has to be less than the dimensionality of X!"

    # * 1. Center Data (subtract the per-feature mean)
    X_mean = np.mean(X, axis=0)
    X_centered = X - X_mean

    # * 2. Compute Covariance Matrix
    X_cov = np.cov(X_centered.T)

    # * 3.1. Get Eigen Values and Vectors
    eigen_values, eigen_vectors = np.linalg.eig(X_cov)
    # * 3.2. Get the indices of the top M eigen values
    top_M_idx = np.argsort(eigen_values)[::-1][:M]

    # * 4.1 Get top M Eigen Vectors (columns of the eigenvector matrix)
    top_M_eigen_vectors = eigen_vectors[:, top_M_idx]

    # * 4.2 Return the projections of the centered data in the reduced space
    return np.dot(X_centered, top_M_eigen_vectors)


Code: Dimensionality Reduction with PCA.
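Here is a quick usage sketch for the function above (the data is random and purely illustrative):

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # 500 samples, D = 4

X_reduced = pca_dim_reduction(X, M=2)
print(X_reduced.shape)                   # (500, 2)

# The retained directions are uncorrelated in the reduced space.
print(np.round(np.cov(X_reduced.T), 6))  # (approximately) diagonal covariance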

Performance of PCA with EVD


PCA using Eigen Value Decomposition (EVD) is very expensive, with a complexity of O(D³), where D is the dimensionality of the input data. EVD computes all eigenvalue and eigenvector pairs, whereas we usually need only the eigenvectors corresponding to the M largest eigenvalues. Therefore, in practice, much more efficient iterative approaches such as the power iteration method are used to compute the required eigenvectors.
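As a rough illustration of the idea (a bare-bones sketch of my own; practical implementations add convergence checks and deflation, or use sparse solvers, to extract several eigenvectors), power iteration finds only the dominant eigenpair of a covariance matrix S:

import numpy as np

def top_eigenpair(S, num_iters=1000, seed=0):
    """Power iteration: repeatedly apply S and renormalise to approach its dominant eigenpair."""
    w = np.random.default_rng(seed).normal(size=S.shape[0])
    for _ in range(num_iters):
        w = S @ w
        w /= np.linalg.norm(w)   # keep the iterate at unit length
    eigen_value = w @ S @ w      # Rayleigh quotient w^T S w with ||w|| = 1
    return eigen_value, w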

PCA and Data Standardisation [2]

Fig. 3: PCA can get misled by unstandardised data. (a) The principal component is skewed because PCA is misled by the unstandardised data. (b) PCA when the scales are standardised. Image generated from code at [5].


The principal directions found by PCA are the ones along which the variance is largest. So, PCA can be misled by directions along which the variance appears high merely because of the measurement scale. We can see this in Fig. 3(a), where the principal component is not aligned properly because it is misled by the unstandardised scale. Fig. 3(b) shows the correct principal component when the scales are standardised. Therefore, care needs to be taken to standardise the data to avoid such issues.
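In code, standardising simply means z-scoring each feature before running PCA (a minimal sketch of my own; pca_dim_reduction is the function defined earlier):

import numpy as np

def standardise(X, eps=1e-12):
    """Z-score every feature so that no axis dominates PCA just because of its measurement scale."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + eps)

# PCA on standardised data amounts to an eigendecomposition of the
# correlation matrix instead of the covariance matrix.
# X_reduced = pca_dim_reduction(standardise(X), M=2)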

Conclusion
In this rather long post, we dove deep into the two formulations of PCA: the Maximum Variance and the Minimum Error Formulation. We saw that both formulations lead to the same solution/algorithm: select as the new basis the eigenvectors corresponding to the M largest eigenvalues of the covariance matrix of the data. We saw how PCA can be used for dimensionality reduction and how it can be implemented in Python. Finally, we briefly looked into the importance of standardising the data and how it affects the algorithm. This brings us to the end of this post, which is merely Part 1 of this series on PCA. The next parts will cover Probabilistic PCA, Singular Value Decomposition, Autoencoders, and the relationship between Autoencoders, PCA and SVD.

Follow Aadhithya Sankar to get notified when the next parts are made
available!

If you find any mistakes, please leave a comment and I will fix them! 🙏🏽 ✌🏽

References
[1] Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006.

[2] Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

[3]https://commons.wikimedia.org/wiki/File:GaussianScatterPCA.svg#/media/
File:GaussianScatterPCA.svg

[4] https://www.math.ucdavis.edu/~linear/old/notes21.pdf

[5] Murphy, K., Soliman, M., Duran-Martin, G., Kara, A., Liang Ang, M., Reddy, S., & Patel, D. (2021). PyProbML library for Probabilistic Machine Learning [Computer software].

Resources
Here are some resources that can help you understand the topic better:

1. PCA: Maximum Variance Formulation (University of Amsterdam)


2. PCA: Minimum Error Formulation (University of Amsterdam)

More Works by the Author


If you liked this post, you might also enjoy the following posts:


Real-time Artwork Generation using Deep Learning
Adaptive Instance Normalisation (AdaIN) for style transfer between any arbitrary content-style image pair.
towardsdatascience.com

A Primer on Atrous Convolutions and Depth-wise Separable Convolutions
What are atrous/dilated and depth-wise separable convolutions? How are they different from standard convolutions? What…
towardsdatascience.com

Demystified: Wasserstein GANs (WGAN)


What is the Wasserstein distance? What is the intuition behind using
Wasserstein distance to train GANs? How is it…
towardsdatascience.com

Thanks to Abdullah Farouk and Ben Huberman
