
Unit II

Dimensionality Reduction
Ideally, we should not need feature selection or extraction as a separate process; the classifier (or regressor)
should be able to use whichever features are necessary and discard the irrelevant ones.

However, there are several reasons why we are interested in reducing dimensionality as a separate preprocessing
step:

• In most learning algorithms, the complexity depends on the number of input dimensions, d, as well as
on the size of the data sample, N, and for reduced memory and computation, we are interested in
reducing the dimensionality of the problem. Decreasing d also decreases the complexity of the
inference algorithm during testing.

• When an input is decided to be unnecessary, we save the cost of extracting it.

• Simpler models are more robust on small datasets. Simpler models have less variance, that is, they
vary less depending on the particulars of a sample, including noise, outliers, and so forth.

• When data can be explained with fewer features, we get a better idea about the process that underlies
the data and this allows knowledge extraction. These fewer features may be interpreted as hidden or
latent factors that in combination generate the observed features.

• When data can be represented in a few dimensions without loss of information, it can be plotted and
analyzed visually for structure and outliers.
There are two main methods for reducing dimensionality: feature selection and feature extraction.


1. In feature selection, we are interested in finding k of the d dimensions that give us
the most information, and we discard the other (d − k) dimensions. We discuss subset
selection as a feature selection method.

2. In feature extraction, we are interested in finding a new set of k dimensions that
are combinations of the original d dimensions. These methods may be supervised or
unsupervised depending on whether or not they use the output information.

The best known and most widely used feature extraction methods are principal
component analysis and linear discriminant analysis, which are both linear projection
methods, unsupervised and supervised respectively.
Subset Selection

The best subset contains the fewest dimensions that contribute most to accuracy.

There are two approaches. In forward selection, we start with no variables and add
them one by one, at each step adding the one that decreases the error the most, until
any further addition does not decrease the error (or decreases it only slightly).

In backward selection, we start with all variables and remove them one by one, at
each step removing the one that decreases the error the most (or increases it only
slightly), until any further removal increases the error significantly. In either case,
checking the error should be done on a validation set distinct from the training set
because we want to test the generalization accuracy. With more features, generally we
have lower training error, but not necessarily lower validation error.
Let us denote by F a feature set of input dimensions x_i, i = 1, ..., d. E(F) denotes the error incurred on the validation
sample when only the inputs in F are used. Depending on the application, the error is either the mean square error or
the misclassification error.

In sequential forward selection, we start with no features: F = ∅. At each step, for all possible x_i, we train our model
on the training set and calculate E(F ∪ x_i) on the validation set. Then, we choose the input x_j that causes the least
error,

j = argmin_i E(F ∪ x_i),

and we add x_j to F if E(F ∪ x_j) < E(F). We stop if adding any feature does not decrease E.

This algorithm is also known as the wrapper approach, where the process of feature selection is thought to “wrap”
around the learner it uses as a subroutine (Kohavi and John 1997).
Subset selection is supervised in that outputs are used by the
regressor or classifier to calculate the error, but it can be used
with any regression or classification method.
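As a rough illustration of sequential forward selection used as a wrapper, here is a minimal Python sketch; the function names (validation_error, forward_selection) and the use of scikit-learn-style estimators are my own assumptions, not part of the text.

```python
import numpy as np
from sklearn.base import clone

def validation_error(model, X_train, y_train, X_val, y_val, features):
    """Train on the selected features and return the validation error.
    Here: misclassification error; mean squared error could be used for regression."""
    m = clone(model)
    m.fit(X_train[:, features], y_train)
    return np.mean(m.predict(X_val[:, features]) != y_val)

def forward_selection(model, X_train, y_train, X_val, y_val):
    """Sequential forward selection: start with F = {} and greedily add
    the feature that decreases the validation error E(F) the most."""
    d = X_train.shape[1]
    F = []                       # current feature subset
    best_err = np.inf            # E(F) for the empty set taken as +inf
    while True:
        candidates = [i for i in range(d) if i not in F]
        if not candidates:
            break
        # evaluate E(F ∪ {x_i}) for every remaining feature
        errs = {i: validation_error(model, X_train, y_train, X_val, y_val, F + [i])
                for i in candidates}
        j = min(errs, key=errs.get)
        if errs[j] >= best_err:  # stop if no addition decreases E
            break
        F.append(j)
        best_err = errs[j]
    return F, best_err
```

Because every remaining feature is retrained and re-evaluated at each step, the wrapper approach may train on the order of d² models, which can be costly when d is large.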
Principal Component Analysis-

Principal Component Analysis is a well-known dimension reduction technique.

It transforms the variables into a new set of variables called principal components.
These principal components are linear combinations of the original variables and are
orthogonal to one another.
The first principal component accounts for the largest possible variance in the original
data.
The second principal component captures as much of the remaining variance as possible while staying orthogonal to the first.

There can be only two principal components for a two-dimensional data set.
PCA Algorithm-

The steps involved in PCA Algorithm are as follows-

Step-01: Get data.

Step-02: Compute the mean vector (µ).

Step-03: Subtract mean from the given data.

Step-04: Calculate the covariance matrix.

Step-05: Calculate the eigenvectors and eigenvalues of the covariance matrix.

Step-06: Choosing components and forming a feature vector.

Step-07: Deriving the new data set.
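The seven steps can be sketched in a few lines of NumPy; this is only an illustrative outline (using the small data set from the practice problem later in this unit), not a full implementation:

```python
import numpy as np

X = np.array([[2.0, 1.0], [3.0, 5.0], [4.0, 3.0],
              [5.0, 6.0], [6.0, 7.0], [7.0, 8.0]])   # Step-01: get data

mu = X.mean(axis=0)                                  # Step-02: mean vector
Xc = X - mu                                          # Step-03: subtract mean
C = np.cov(Xc, rowvar=False, bias=True)              # Step-04: covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)                 # Step-05: eigenvalues / eigenvectors
order = np.argsort(eigvals)[::-1]                    # Step-06: rank components
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
W = eigvecs[:, :1]                                   #          keep the top component
Z = Xc @ W                                           # Step-07: derive the new data set
```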


STEP 1: STANDARDIZATION

The aim of this step is to standardize the range of the continuous initial variables so that each one of them contributes equally to the analysis.

More specifically, the reason why it is critical to perform standardization prior to PCA is that PCA is quite sensitive to the variances of the initial variables. If there are large differences between the ranges of the initial variables, those variables with larger ranges will dominate over those with small ranges (for example, a variable that ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1), which will lead to biased results. Transforming the data to comparable scales prevents this problem.

Mathematically, this is done by subtracting the mean and dividing by the standard deviation for each value of each variable.

Once the standardization is done, all the variables will be transformed to the same scale.
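A minimal sketch of this standardization (z-scoring each column) in NumPy, assuming the rows of X are observations and the columns are variables:

```python
import numpy as np

def standardize(X):
    """z = (value - mean) / standard deviation, applied per variable (column)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```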
STEP 2: COVARIANCE MATRIX COMPUTATION

The aim of this step is to understand how the variables of the input data set vary from the mean with respect to each other, or in other words, to see if there is any relationship between them. Sometimes variables are highly correlated in such a way that they contain redundant information. So, in order to identify these correlations, we compute the covariance matrix.

The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables. For example, for a 3-dimensional data set with 3 variables x, y, and z, the covariance matrix is a 3×3 matrix of this form:

Cov(x,x)  Cov(x,y)  Cov(x,z)
Cov(y,x)  Cov(y,y)  Cov(y,z)
Cov(z,x)  Cov(z,y)  Cov(z,z)

Covariance Matrix for 3-Dimensional Data

Since the covariance of a variable with itself is its variance (Cov(a,a) = Var(a)), the main diagonal (top left to bottom right) actually holds the variances of each initial variable. And since covariance is commutative (Cov(a,b) = Cov(b,a)), the entries of the covariance matrix are symmetric with respect to the main diagonal, which means that the upper and lower triangular portions are equal.
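Assuming a standardized data matrix Z with observations in rows and variables in columns, the covariance matrix described above can be obtained as in this sketch:

```python
import numpy as np

def covariance_matrix(Z):
    """p x p symmetric matrix: variances on the diagonal,
    Cov(a, b) = Cov(b, a) off the diagonal."""
    return np.cov(Z, rowvar=False)
```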


STEP 3: COMPUTE THE EIGENVECTORS AND EIGENVALUES OF THE COVARIANCE MATRIX TO IDENTIFY THE PRINCIPAL COMPONENTS

Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the covariance matrix in order to determine the principal components of the data. Before getting to the explanation of these concepts, let's first understand what we mean by principal components.

Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables. These combinations are done in such a way that the new variables (i.e., the principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components. So the idea is: 10-dimensional data gives you 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on, until having something like what is shown in a scree plot.
How PCA Constructs the Principal Components

As there are as many principal components as there are variables in the data, principal components are constructed in such a manner that the first principal component accounts for the largest possible variance in the data set.

Eigenvectors and eigenvalues are what is behind all of this: the eigenvectors of the covariance matrix are the directions of the axes where there is the most variance (most information), and these are what we call the principal components. The eigenvalues are simply the coefficients attached to the eigenvectors, and they give the amount of variance carried by each principal component.

By ranking the eigenvectors in order of their eigenvalues, highest to lowest, you get the principal components in order of significance.
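A sketch of computing and ranking the eigenvectors and eigenvalues of the covariance matrix C with NumPy (numpy.linalg.eigh is suited to symmetric matrices and returns eigenvalues in ascending order, so they are reversed here):

```python
import numpy as np

def principal_directions(C):
    """Return eigenvalues and eigenvectors of C, ranked from the
    largest eigenvalue (most variance carried) to the smallest."""
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]
```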


STEP 4: FEATURE VECTOR

As we saw in the previous step, computing the eigenvectors and ordering them by their eigenvalues in descending order allows us to find the principal components in order of significance. In this step, what we do is choose whether to keep all of these components or discard those of lesser significance (those with low eigenvalues), and form with the remaining ones a matrix of vectors that we call the feature vector.

So, the feature vector is simply a matrix that has as columns the eigenvectors of the components that we decide to keep. This makes it the first step towards dimensionality reduction, because if we choose to keep only k eigenvectors (components) out of the original p, the final data set will have only k dimensions.
Example:

Continuing with the example from the previous step, we can either form a feature
vector with both of the eigenvectors v1 and v2:

Or discard the eigenvector v2, which is the one of lesser significance, and form a
feature vector with v1 only:

Discarding the eigenvector v2 will reduce dimensionality by 1 and will consequently cause a loss of information in the final data set. But given that v2 was carrying only 4% of the information, the loss will therefore not be important and we will still have the 96% of the information that is carried by v1.
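Continuing the sketch, and assuming the eigenvalues and eigenvectors are already ranked as above, forming the feature vector and checking how much information each component carries (for example the 96% / 4% split mentioned here) could look like this:

```python
import numpy as np

def feature_vector(eigvals, eigvecs, k):
    """Keep the k leading eigenvectors as columns of the feature vector
    and report the fraction of total variance each kept component carries."""
    explained = eigvals / eigvals.sum()
    return eigvecs[:, :k], explained[:k]
```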
LAST STEP: RECAST THE DATA ALONG THE PRINCIPAL COMPONENTS AXES

In the previous steps, apart from standardization, you do not make any changes to the data; you just select the principal components and form the feature vector, but the input data set always remains in terms of the original axes (i.e., in terms of the initial variables).

In this step, which is the last one, the aim is to use the feature vector formed from the eigenvectors of the covariance matrix to reorient the data from the original axes to the ones represented by the principal components (hence the name principal component analysis). This can be done by multiplying the transpose of the original data set by the transpose of the feature vector.
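In matrix terms, the recasting step is FinalDataSet = FeatureVector^T × StandardizedDataSet^T, or equivalently each standardized row multiplied by the kept eigenvectors. A one-line sketch, assuming Z is the standardized data (N × p) and W the feature vector (p × k):

```python
def recast(Z, W):
    """Reorient the data onto the principal component axes: result is N x k."""
    return Z @ W  # same result as (W.T @ Z.T).T
```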
PRACTICE PROBLEMS BASED ON PRINCIPAL COMPONENT ANALYSIS-

Problem-01:

Consider the two dimensional patterns (2, 1), (3, 5), (4, 3), (5, 6), (6, 7), (7, 8).

Compute the principal component using the PCA algorithm.

Step-01:

Get data.

The given feature vectors are-

x1 = (2, 1)

x2 = (3, 5)

x3 = (4, 3)

x4 = (5, 6)

x5 = (6, 7)

x6 = (7, 8)
Step-02:

Calculate the mean vector (µ).

Mean vector (µ)

= ((2 + 3 + 4 + 5 + 6 + 7) / 6, (1 + 5 + 3 + 6 + 7 + 8) / 6)

= (4.5, 5)

Thus, µ = (4.5, 5).

Step-03:

Subtract mean vector (µ) from the given feature vectors.

x1 – µ = (2 – 4.5, 1 – 5) = (-2.5, -4)

x2 – µ = (3 – 4.5, 5 – 5) = (-1.5, 0)

x3 – µ = (4 – 4.5, 3 – 5) = (-0.5, -2)

x4 – µ = (5 – 4.5, 6 – 5) = (0.5, 1)

x5 – µ = (6 – 4.5, 7 – 5) = (1.5, 2)

x6 – µ = (7 – 4.5, 8 – 5) = (2.5, 3)

These are the feature vectors (xi – µ) after subtracting the mean vector (µ).


Step-04:

Calculate the covariance matrix.

The covariance matrix is given by-

Cov = (1/N) Σ (xi – µ)(xi – µ)^T,  summed over i = 1, ..., N

Writing mi = (xi – µ)(xi – µ)^T for each feature vector, the covariance matrix

= (m1 + m2 + m3 + m4 + m5 + m6) / 6

On adding the above matrices and dividing by 6, we get-

Cov = [ 2.92  3.67 ]
      [ 3.67  5.67 ]

Step-05:

Calculate the eigenvalues and eigenvectors of the covariance matrix.

λ is an eigenvalue of a matrix M if it is a solution of the characteristic equation |M – λI| = 0.

So, we have-

| 2.92 – λ      3.67     |
| 3.67          5.67 – λ |  = 0

From here,

(2.92 – λ)(5.67 – λ) – (3.67 x 3.67) = 0

16.56 – 2.92λ – 5.67λ + λ2 – 13.47 = 0

λ2 – 8.59λ + 3.09 = 0

Solving this quadratic equation, we get λ = 8.22, 0.38

Thus, the two eigenvalues are λ1 = 8.22 and λ2 = 0.38.

Clearly, the second eigenvalue is very small compared to the first eigenvalue.

So, the second eigenvector can be left out.

The eigenvector corresponding to the greatest eigenvalue is the principal component for the given data set.

So, we find the eigenvector corresponding to eigenvalue λ1.

We use the following equation to find the eigenvector-

MX = λX

where-

M = Covariance matrix

X = Eigenvector

λ = Eigenvalue

Substituting the values in the above equation, we get-

[ 2.92  3.67 ] [ X1 ]          [ X1 ]
[ 3.67  5.67 ] [ X2 ]  =  8.22 [ X2 ]
Solving these, we get-

2.92X1 + 3.67X2 = 8.22X1

3.67X1 + 5.67X2 = 8.22X2

On simplification, we get-

5.3X1 = 3.67X2 ………(1)

3.67X1 = 2.55X2 ………(2)

From (1) and (2), X1 = 0.69X2

From (2), taking X2 = 3.67 gives X1 = 2.55, so the eigenvector is-

X = (2.55, 3.67)

Thus, the principal component for the given data set is the direction of this eigenvector, (2.55, 3.67).


Lastly, we project the data points onto the new subspace spanned by this principal component.

https://www.gatevidyalay.com/tag/principal-component-analysis-questions-and-answers/
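For readers who want to check the arithmetic, here is a hedged NumPy verification of this problem; bias=True makes np.cov divide by N = 6, matching the hand calculation (small differences from the values above come from rounding):

```python
import numpy as np

X = np.array([[2, 1], [3, 5], [4, 3], [5, 6], [6, 7], [7, 8]], dtype=float)

mu = X.mean(axis=0)                            # (4.5, 5.0)
C = np.cov(X - mu, rowvar=False, bias=True)    # approx. [[2.92, 3.67], [3.67, 5.67]]
eigvals, eigvecs = np.linalg.eigh(C)           # eigenvalues in ascending order
print(eigvals)        # approx. [0.38, 8.21]; the rounded hand values give 0.38 and 8.22
print(eigvecs[:, -1]) # principal component, proportional to (2.55, 3.67) up to sign/scale
```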
• For example, instead of using 5 observed variables, the data can be reduced to 2
factors such as quantitative ability and verbal ability.
What is factor analysis?

Factor analysis is a correlational method used to find and describe the underlying factors driving data values for a large set of
variables. In a simple path diagram for a factor analysis model, F1 and F2 are two common factors.

What is the basic purpose of factor analysis?

Factor analysis is a statistical data reduction and analysis technique that strives to explain correlations among multiple outcomes as the
result of one or more underlying explanations, or factors. The technique involves data reduction, as it attempts to represent a set of
variables by a smaller number.

What are the two types of factor analysis?

There are two types of factor analyses, exploratory and confirmatory. Exploratory factor analysis (EFA) is a method to explore the
underlying structure of a set of observed variables, and is a crucial step in the scale development process.

How do you calculate factor analysis?

In SPSS, first go to Analyze – Dimension Reduction – Factor. Move all the observed variables into the Variables: box to be analyzed. Under
Extraction – Method, pick Principal components and make sure to analyze the Correlation matrix. We also request the Unrotated
factor solution and the Scree plot.
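The steps above refer to the SPSS menus; as a rough Python counterpart (a sketch only, and note that scikit-learn's FactorAnalysis uses maximum-likelihood estimation rather than the principal-components extraction described above), one might write:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Placeholder data: rows are respondents, columns are observed variables (e.g. test scores).
X = np.random.RandomState(0).normal(size=(100, 5))

fa = FactorAnalysis(n_components=2)  # e.g. a quantitative and a verbal factor
scores = fa.fit_transform(X)         # factor scores for each respondent
loadings = fa.components_            # (2 x 5) loadings of each variable on each factor
```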
Multidimensional scaling (MDS)

Suppose that for N points we are given only the pairwise distances dij between them, and not their actual coordinates.
Multidimensional scaling (MDS) is the method for placing these points in a low-dimensional space, for example
two-dimensional, such that the Euclidean distance between them there is as close as possible to dij, the given
distances in the original space. Thus it requires a projection from some unknown dimensional space to, for
example, two dimensions.

In the archetypical example of multidimensional scaling, we take the road travel
distances between cities, and after applying MDS, we get an approximation to the
map. The map is distorted such that in parts of the country with geographical
obstacles like mountains and lakes, where the road travel distance deviates much
from the direct bird-flight path (Euclidean distance), the map is stretched out to
accommodate the longer distances.
MDS can be used for dimensionality reduction by calculating
pairwise Euclidean distances in the d-dimensional x space
and giving this as input to MDS, which then projects it to a
lower-dimensional space so as to preserve these distances.
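A minimal sketch of that use of MDS, assuming scikit-learn and SciPy are available: the pairwise Euclidean distances in the original d-dimensional space are computed first and then passed to MDS as a precomputed dissimilarity matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

X = np.random.RandomState(0).normal(size=(50, 10))   # placeholder data in d = 10 dimensions

D = squareform(pdist(X, metric="euclidean"))          # pairwise distances d_ij
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
Z = mds.fit_transform(D)                              # 2-D configuration preserving the d_ij
```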
Linear Discriminant Analysis

Linear discriminant analysis (LDA) is a supervised method for dimensionality reduction
for classification problems. We start with the case where there are two classes, and
then generalize to K > 2 classes.

Given samples from two classes C1 and C2, we want to find the direction, as defined
by a vector w, such that when the data are projected onto w, the examples from the
two classes are as well separated as possible. As we saw before, z = w^T x is the
projection of x onto w and thus is a dimensionality reduction from d to 1.
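For the two-class case, the standard Fisher solution (derived later in the usual treatment) gives w proportional to S_W^{-1}(m1 − m2), where S_W is the within-class scatter matrix. A minimal NumPy sketch, with X1 and X2 holding the samples of the two classes row-wise:

```python
import numpy as np

def fisher_lda_direction(X1, X2):
    """Two-class Fisher LDA: w is proportional to S_W^-1 (m1 - m2).
    Projecting z = w^T x reduces the data from d dimensions to 1."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)       # scatter of class C1
    S2 = (X2 - m2).T @ (X2 - m2)       # scatter of class C2
    Sw = S1 + S2                       # within-class scatter
    w = np.linalg.solve(Sw, m1 - m2)
    return w / np.linalg.norm(w)
```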
