DIMENSIONALITY REDUCTION

MODULE-III

Dimensionality Reduction: Introduction, Subset Selection, Principal Components Analysis, Factor Analysis, Multidimensional Scaling, Linear Discriminant Analysis, Isomap, Locally Linear Embedding
Decision Trees: Introduction, Univariate Trees, Pruning, Rule Extraction from Trees, Learning Rules from Data, Multivariate Trees
Introduction to Linear Discrimination, Generalizing the Linear Model



REFERENCES

Ref-3: Ethem Alpaydin, Introduction to Machine Learning, Second Edition, MIT Press, 2010 [Ch. 6, 9, 10]



INTRODUCTION
 Complexity of any classifier or regressor depends on the number of inputs.

 Data contain information that is taken as input and fed to the system for decision making.

 Ideally, the classifier (or regressor) should be able to use whichever features are necessary, discarding the irrelevant ones.



NEED FOR DIMENSIONALITY REDUCTION
 Curse of dimensionality: the number of training samples needed grows rapidly with the number of dimensions.

 Computation time and storage grow with the number of dimensions.

 Low-dimensional data can be visualized easily.

 Reducing dimensionality helps eliminate irrelevant features and reduce noise.



PROJECTING INTO ANY ONE ATTRIBUTE



Why Reduce Dimensionality?
 Complexity depends on the number of input dimensions d and the size of the data sample N.
• To reduce memory and computation, reduce the dimensionality of the problem.
• Decreasing d also decreases the complexity of the inference algorithm during testing.

 When an input is found to be unnecessary, we save the cost of extracting it.

 Simpler models are more robust on small datasets.
• Simpler models have less variance, that is, they vary less depending on the particulars of a sample, including noise, outliers, etc.



Why Reduce Dimensionality? (contd.)

 With fewer features, we get a better idea about the underlying process, which allows knowledge extraction.

 When data can be represented in a few dimensions without loss of information, it can be plotted and analyzed visually for structure and outliers.



Two Methods for Reducing Dimensionality

 Feature Selection
• find k of the d dimensions that give us the most information
• discard the other (d − k) dimensions.

 Feature Extraction
• find a new set of k dimensions that are combinations of the original d
dimensions.



Subset Selection

 Finding the best subset of the set of features.


 The best subset contains the least number of dimensions that most contribute to
accuracy.
 Using a suitable error function, eliminate the irrelevant dimensions.
 This can be used in both regression and classification problems.
 There are 2^d possible subsets of d variables.
 It is not possible to test all of them unless d is small, so we employ heuristics to get a reasonable (but not optimal) solution in reasonable (polynomial) time.



Subset Selection
 Two approaches
• Forward selection

• Backward selection



Forward Selection

 Start with no variables and add them one by one.

 At each step, add the variable that decreases the error the most.

 Stop when any further addition does not decrease the error (or decreases it only slightly); a code sketch follows below.

https://quantifyinghealth.com/stepwise-selection/
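A minimal sketch of greedy forward selection, assuming scikit-learn is available; X, y are a hypothetical dataset, a logistic-regression classifier stands in for whatever model is being fit, and the cross-validated score stands in for the validation-set error mentioned later.

```python
# Greedy forward selection sketch (assumptions: scikit-learn installed; X, y hypothetical).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, cv=5):
    n_features = X.shape[1]
    selected = []            # indices of chosen features, in order of addition
    best_score = -np.inf
    while len(selected) < n_features:
        # Score every candidate feature when added to the current subset.
        candidates = [j for j in range(n_features) if j not in selected]
        scores = {
            j: cross_val_score(LogisticRegression(max_iter=1000),
                               X[:, selected + [j]], y, cv=cv).mean()
            for j in candidates
        }
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:   # no improvement: stop adding
            break
        selected.append(j_best)
        best_score = scores[j_best]
    return selected
```

For example, `forward_selection(X, y)` returns the column indices of the selected features.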



Backward Selection

 Start with all variables and remove them one by one.

 At each step, remove the variable whose removal decreases the error the most (or increases it the least).

 Stop when any further removal increases the error significantly; see the sketch below.

https://quantifyinghealth.com/stepwise-selection/
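Backward elimination mirrors the forward loop above in reverse; as a hedged alternative sketch, scikit-learn's SequentialFeatureSelector covers both directions (same hypothetical X, y; the target of 3 kept features is an arbitrary assumption).

```python
# Backward elimination sketch via scikit-learn (assumptions: X, y as before).
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=3,   # assumed target subset size
                                direction="backward", cv=5)
X_reduced = sfs.fit_transform(X, y)   # keeps only the selected columns
```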



Subset Selection
In either case,

• Error checking should be done on a validation set distinct from the training set, because we want to test generalization accuracy.
• With more features, we generally have lower training error, but not necessarily lower validation error.



Principal Component Analysis
• A projection method: find a mapping from the inputs in the original d-dimensional space to a new k-dimensional space (k < d), with minimum loss of information.

• An unsupervised method: it does not use the output information.

• The criterion to be maximized is the variance.



Principal Component Analysis

• Principal component analysis, or PCA, is a dimensionality-reduction method


• that is often used to reduce the dimensionality of large data sets,
• by transforming a large set of variables into a smaller one
• that still contains most of the information in the large set.



Principal Component Analysis

• A principal component is a straight line (a direction) that captures as much of the variance of the data as possible.
• Principal components have a direction and a magnitude.
• Projecting the data onto the principal components gives an orthogonal (perpendicular) projection onto a lower-dimensional space.

https://www.simplilearn.com/tutorials/machine-learning-tutorial/principal-component-analysis
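A quick end-to-end sketch of the idea, assuming scikit-learn is available and X is a hypothetical data matrix of shape (n_samples, n_features); the step-by-step construction follows in the next slides.

```python
# High-level sketch: project a hypothetical data matrix X onto its first two principal components.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)   # standardize first (Step 1 below)
pca = PCA(n_components=2)
Z = pca.fit_transform(X_std)                # projected data, shape (n_samples, 2)
print(pca.explained_variance_ratio_)        # share of variance carried by each component
```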



PCA Step-1 Standardisation of data
• to standardize the range of continuous initial variables so that each one of them
contributes equally to the analysis.
• If there are large differences between the ranges of the initial variables, the variables with larger ranges will dominate over those with small ranges
• (for example, a variable that ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1),
• which will lead to biased results.
• So, transforming the data to comparable scales prevents this problem.



PCA Step-1 Standardisation of data
• Mathematically, this is done by subtracting the mean and dividing by the standard deviation for each value of each variable: z = (x − μ) / σ.
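A minimal sketch of this standardization with NumPy; the small data matrix X here is hypothetical (rows are samples, columns are variables).

```python
import numpy as np

# Z-score standardization of a hypothetical data matrix X.
X = np.array([[170.0, 0.6],
              [180.0, 0.9],
              [160.0, 0.3]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
# Each column of X_std now has mean 0 and standard deviation 1,
# so variables with large ranges no longer dominate.
```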



PCA Step-2 Covariance Matrix Computation

Aim: to understand how the variables of the input data set vary from the mean with respect to each other, or, in other words, to see if there is any relationship between them.

Sometimes variables are highly correlated in such a way that they contain redundant information.
So, in order to identify these correlations, we compute the covariance matrix.



PCA Step-2 Covariance Matrix Computation
• The covariance matrix is a p × p symmetric matrix (where p is the number of
dimensions) that has as entries the covariances associated with all possible pairs of
the initial variables.

• For example, for a 3-dimensional data set with 3 variables x, y, and z, the covariance matrix is a 3×3 matrix of this form:
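Written out (the symbol Σ for the covariance matrix is our notation, not the slides'):

$$
\Sigma =
\begin{pmatrix}
\mathrm{Cov}(x,x) & \mathrm{Cov}(x,y) & \mathrm{Cov}(x,z)\\
\mathrm{Cov}(y,x) & \mathrm{Cov}(y,y) & \mathrm{Cov}(y,z)\\
\mathrm{Cov}(z,x) & \mathrm{Cov}(z,y) & \mathrm{Cov}(z,z)
\end{pmatrix}
$$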



PCA Step-2 Covariance Matrix Computation

• The covariance of a variable with itself is its variance: Cov(a, a) = Var(a).

• So the main diagonal (top left to bottom right) contains the variances of each initial variable.
• Since covariance is commutative, Cov(a, b) = Cov(b, a), the entries of the covariance matrix are symmetric with respect to the main diagonal, which means that the upper and the lower triangular portions are equal.



PCA Step-2 Covariance Matrix Computation

• If an entry of the covariance matrix is:
• positive: the two variables increase or decrease together (positively correlated);
• negative: one increases when the other decreases (inversely correlated).
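A short sketch computing this matrix with NumPy, assuming X_std is the standardized data from the Step-1 snippet.

```python
import numpy as np

# Covariance matrix of the standardized data: a p x p symmetric matrix.
# rowvar=False because rows are samples and columns are variables.
cov_matrix = np.cov(X_std, rowvar=False)
print(cov_matrix.shape)                         # (p, p)
print(np.allclose(cov_matrix, cov_matrix.T))    # True: symmetric about the main diagonal
```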



PCA Step-3 Compute Eigenvalues and Eigenvectors of the
Covariance Matrix to Find the Principal Components
• Eigenvectors and eigenvalues are the linear algebra concepts that are to be
computed from the covariance matrix in order to determine the principal
components of the data.

• Principal components are new variables that are constructed as linear


combinations or mixtures of the initial variables.

• These combinations are done in such a way that the new variables (i.e., principal
components) are uncorrelated and most of the information within the initial
variables is squeezed or compressed into the first components.



PCA Step-3 Compute Eigenvalues and Eigenvectors of the
Covariance Matrix to Find the Principal Components

A 10-dimensional data set gives 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on.



PCA Step-3 Compute Eigenvalues and Eigenvectors of the
Covariance Matrix to Find the Principal Components
Principal components represent the directions of the data that carry a maximal amount of variance, i.e., the lines that capture the most information in the data.
The relationship between variance and information is that the larger the variance carried by a line, the larger the dispersion of the data points along it, and the larger the dispersion along a line, the more information it carries.

Simply put, principal components are new axes that provide the best angle from which to see and evaluate the data, so that the differences between the observations are more visible.
Constructing Principal Components

• The number of principal components equals the number of variables in the data.

• Principal components are constructed in such a manner that the first principal component accounts for the largest possible variance in the data set.



Constructing Principal Components
Let's assume the scatter plot of our data set is given.

The first principal component is approximately the line that matches the purple marks: it goes through the origin, and it is the line along which the projection of the points (red dots) is the most spread out.

Mathematically, it is the line that maximizes the variance, i.e., the average of the squared distances from the projected points (red dots) to the origin.
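In symbols (our notation, not the slides': w is the direction of the candidate line and Σ the covariance matrix of the centered data), the first principal component direction is

$$
\mathbf{w}_1 = \arg\max_{\lVert\mathbf{w}\rVert = 1} \operatorname{Var}(\mathbf{w}^{\top}\mathbf{x})
= \arg\max_{\lVert\mathbf{w}\rVert = 1} \mathbf{w}^{\top}\Sigma\,\mathbf{w},
$$

and the maximizer is the eigenvector of Σ with the largest eigenvalue.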
Constructing Principal Components

• The second principal component is calculated in the same way, with the condition that it is uncorrelated with (i.e., perpendicular to) the first principal component and that it accounts for the next highest variance.

• This continues until a total of p principal components have been calculated, equal to the original number of variables.



Constructing Principal Components

• Let's go back to eigenvectors and eigenvalues.
• They always come in pairs: every eigenvector has an eigenvalue.
• Their number is equal to the number of dimensions of the data.
• For example, for a 3-dimensional data set, there are 3 variables, therefore there are 3 eigenvectors with 3 corresponding eigenvalues.



PCA Step-3 Compute Eigenvalues and Eigenvectors of the
Covariance Matrix to Find the Principal Components

• The eigenvectors of the covariance matrix are the directions of the axes along which there is the most variance (most information); these directions are what we call the principal components.
• The eigenvalues are simply the coefficients attached to the eigenvectors, giving the amount of variance carried by each principal component.

By ranking the eigenvectors in order of their eigenvalues, highest to lowest, we obtain the principal components in order of significance. A minimal code sketch follows.
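A sketch of this step with NumPy, continuing from the cov_matrix computed in the Step-2 snippet (assumed names).

```python
import numpy as np

# Eigen-decomposition of the symmetric covariance matrix from Step 2.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)   # columns of eigenvectors are the v_i

# Rank the components by eigenvalue, highest to lowest.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
# eigenvectors[:, 0] is now the direction of PC1 and eigenvalues[0] the variance it carries.
```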



Example
Let's suppose that our data set is 2-dimensional with 2 variables x, y, and that the covariance matrix has eigenvectors v1, v2 with corresponding eigenvalues λ1, λ2.

If we rank the eigenvalues in descending order, we get λ1 > λ2, which means that the eigenvector that corresponds to the first principal component (PC1) is v1 and the one that corresponds to the second component (PC2) is v2.



Example

• After obtaining the principal components, to compute the percentage of variance (information) accounted for by each component, divide the eigenvalue of each component by the sum of the eigenvalues.

• Applying this to the example above, we find that PC1 and PC2 carry 96% and 4% of the variance of the data, respectively.
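A small check of this arithmetic, using hypothetical eigenvalues chosen only so that the split matches the 96%/4% figure quoted above.

```python
import numpy as np

eigenvalues = np.array([1.284, 0.049])        # hypothetical values for λ1, λ2
explained = eigenvalues / eigenvalues.sum()
print(explained)                              # ≈ [0.963, 0.037], i.e. about 96% and 4%
```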



Step-4 Feature Vector
• Decide whether to keep all these components or discard those of lesser significance (of low eigenvalues), and

• form with the remaining ones a matrix of vectors that we call the feature vector.
• So, the feature vector is simply a matrix that has as columns the eigenvectors of the components that we decide to keep.

• This is the first step towards dimensionality reduction, because if we choose to keep only k eigenvectors (components) out of the original p, the final data set will have only k dimensions.
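A short sketch of building the feature vector, keeping the top k components (the names continue from the earlier snippets; the value of k is an assumed choice).

```python
k = 1                                   # assumed number of components to keep
# Feature vector: the eigenvectors of the kept components, as columns.
feature_vector = eigenvectors[:, :k]    # shape (p, k)
```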



Step-4 Feature Vector Example

Continuing with the example from the previous step, we can either form a feature vector with both of the eigenvectors v1 and v2, or discard the eigenvector v2, which is the one of lesser significance, and form a feature vector with v1 only.



Step-4 Feature Vector Example

• It is up to you whether to keep all the components or discard the ones of lesser significance, depending on what you are looking for.

• If you just want to describe your data in terms of new, uncorrelated variables (principal components) without seeking to reduce dimensionality, leaving out the less significant components is not needed.



Step-5 RECAST DATA ALONG PRINCIPAL COMPONENTS

• The aim is to use the feature vector formed from the eigenvectors of the covariance matrix

• to reorient the data from the original axes to the ones represented by the principal components (hence the name Principal Components Analysis).

• This can be done by multiplying the transpose of the feature vector by the transpose of the standardized original data set, or equivalently by multiplying the standardized data matrix by the feature vector. A minimal sketch follows.




THANK YOU
