
Business Report

DSBA Data Mining Project – Part 1

Principal Component Analysis
Table of Contents
List of Figures
List of Tables
Problem Statement
Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. The inferences drawn from this should be properly documented.
We choose to treat outliers for this case. Do you think that treating outliers for this case is necessary?
Scale the variables and write the inference for using the type of scaling function for this case study.
Comment on the comparison between the covariance and the correlation matrix after scaling.
Check the dataset for outliers before and after scaling. Draw your inferences from this exercise.
Build the covariance matrix, eigenvalues and eigenvectors.
Write the explicit form of the first PC (in terms of eigenvectors).
Discuss the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate? Perform PCA and export the data of the Principal Component scores into a data frame.
Mention the business implication of using the Principal Component Analysis for this case study.
List of Figures
Figure 1: Skewness values of features
Figure 2: Histplots of features
Figure 3: Heatmap of features
Figure 4: Scaled data using Z-score
Figure 5: Covariance matrix
Figure 6: Correlation matrix
Figure 7: Boxplot
Figure 8: Scree Plot
Figure 9: Extracting components
Figure 10: Correlation among PCs
Figure 11: Influence of PCs

List of Tables
Table 1: First 5 rows of the dataset (head)
Table 2: Dataset info
Table 3: Dataset Summary
Table 4: Covariance Matrix
Problem Statement

The ‘Hair Salon.csv’ dataset contains various variables used in the context of Market Segmentation. This particular case study is based on various parameters of a salon chain of hair products. You are expected to do Principal Component Analysis for this case study according to the instructions given in the rubric. Kindly refer to the PCA_Data_Dictionary.jpg file for the Data Dictionary of the Dataset.

Note: This particular dataset contains the target variable satisfaction as well. Please drop this variable before doing Principal Component Analysis.
Perform Exploratory Data Analysis [both univariate and multivariate analysis to be
performed]. The inferences drawn from this should be properly documented.

We will start analyzing the data by going through the following basic steps:

1. Check head
2. Check info
3. Check summary
4. Check nulls
5. Check duplicates

Let us start by reading the data and extracting basic information:

Table 1: First 5 rows of the dataset (head)

Checking Info about the data:

Table 2: Dataset info


There are 100 rows and 13 columns in the dataset, of which 12 are of float data type and 1 is of integer data type.

Checking summary:

Table 3: Dataset Summary

We will explore more in the EDA section.

Checking Nulls

There are no missing (null) values in the dataset.

Checking Duplicates

There are no duplicate values in the dataset.
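A minimal sketch of these five checks with pandas, using the file name from the problem statement:

    import pandas as pd

    # Load the dataset (file name from the problem statement)
    df = pd.read_csv("Hair Salon.csv")

    print(df.head())              # 1. first 5 rows
    df.info()                     # 2. column types and non-null counts
    print(df.describe().T)        # 3. summary statistics per feature
    print(df.isnull().sum())      # 4. null counts per column
    print(df.duplicated().sum())  # 5. number of duplicate rows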


Figure 1: Skewness values of features

● Ecom and SalesFImage are positively skewed, i.e. more values lie to the left of the mean.
● Advertising, WartyClaim and Satisfaction are very slightly positively skewed.
● ProdQual, TechSup, CompRes, ComPricing, OrdBilling and DelSpeed are negatively skewed.
● The remaining features are very slightly negatively skewed.
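These skewness values can be reproduced with a one-liner (a sketch, assuming df from the loading step above):

    # Skewness per feature; positive values indicate a longer right tail
    print(df.skew(numeric_only=True).sort_values(ascending=False))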

Figure 2: Histplots of features


Figure 3: Heatmap of features

1. There is a strong correlation observed between a few fields: 'TechSup' is highly correlated with 'WartyClaim'.

2. Also, 'CompRes' shows a high correlation with 'DelSpeed' and 'OrdBilling'.
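A sketch of the univariate and multivariate plots above, again assuming df from the loading step:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Univariate view: one histogram per feature (as in Figure 2)
    df.hist(figsize=(12, 10), bins=15)
    plt.tight_layout()
    plt.show()

    # Multivariate view: correlation heatmap (as in Figure 3)
    sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm")
    plt.show()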

We choose to treat outliers for this case. Do you think that treating outliers for this case
is necessary?
In this project, we have chosen to treat outliers before running PCA on the hair-products dataset, which consists of 12 numeric columns. The decision to treat outliers is based on several reasons:

1. Outliers can increase the error variance and reduce the power of statistical tests. If the outliers are non-randomly distributed, they can also violate the assumption of normality.

2. Most machine learning algorithms, including PCA, may not perform well in the presence of outliers. Outliers can significantly impact the results and distort the principal components.

Therefore, treating outliers in this case is necessary to obtain meaningful and accurate results from the PCA analysis.
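The report does not show its treatment code; one common approach, sketched here as an assumption (along with the column names "ID" and "Satisfaction", taken from the data dictionary), is to cap each feature at its 1.5×IQR whiskers:

    import pandas as pd

    def treat_outliers_iqr(s: pd.Series) -> pd.Series:
        # Cap values outside the 1.5*IQR whiskers (one common treatment)
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        return s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

    # Drop the identifier and target columns (names assumed), then treat outliers
    df_num = df.drop(columns=["ID", "Satisfaction"], errors="ignore").apply(treat_outliers_iqr)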
Scale the variables and write the inference for using the type of scaling function for this case
study.
The main objective of scaling is to normalize the data within a particular range. It is a data preprocessing step applied to the independent variables or features of the data. Scaling also helps speed up the calculations in an algorithm.

Before scaling we removed the outliers as shown in the above section.

Figure 4: Scaled data using Z-score

Z-score scaling represents each value as the number of standard deviations away from the mean. We use the z-score to ensure the feature distributions have mean = 0 and standard deviation = 1. It is useful when there are a few outliers, but not ones so extreme that clipping is needed.
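A minimal sketch of this step with scikit-learn, assuming df_num holds the outlier-treated numeric columns from the previous section:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Z-score scaling: (x - mean) / std, so each feature gets mean 0 and std 1
    scaler = StandardScaler()
    df_scaled = pd.DataFrame(scaler.fit_transform(df_num), columns=df_num.columns)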

Comment on the comparison between covariance and correlation matrix after scaling.
Figure 5: Covariance matrix
Figure 6: Correlation matrix

Correlation is a scaled version of covariance; the two always have the same sign (positive, negative or zero). When the sign is positive, the variables are said to be positively correlated; when it is negative, negatively correlated; and when it is zero, uncorrelated. In simple terms, correlation measures both the strength and the direction of the linear relationship between two variables. Since z-score scaling gives every feature unit variance, the covariance matrix of the scaled data essentially coincides with the correlation matrix.
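As a quick check (a sketch, assuming df_scaled from the scaling step), the two matrices can be printed side by side:

    # After z-score scaling, the covariance and correlation matrices agree
    # (up to the small ddof sample-size correction in the variance estimate)
    print(df_scaled.cov().round(2))
    print(df_scaled.corr().round(2))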

Check the dataset for outliers before and after scaling. Draw your inferences from this
exercise.

During univariate analysis we already checked for outliers. Now, after scaling, we check for outliers again.
Figure 7: Boxplot
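A sketch of the comparison, assuming df_num (before scaling) and df_scaled (after scaling) from the previous sections:

    import matplotlib.pyplot as plt

    # Boxplots before and after scaling; scaling shifts and rescales the axes
    # but does not change each feature's shape, so the same points remain
    # relative outliers
    df_num.boxplot(figsize=(12, 5), rot=45)
    plt.title("Before scaling")
    plt.show()

    df_scaled.boxplot(figsize=(12, 5), rot=45)
    plt.title("After scaling")
    plt.show()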
Bartlett's Test of Sphericity

Bartlett's test of sphericity tests the hypothesis that the variables are uncorrelated in the population. If the
null hypothesis cannot be rejected, then PCA is not advisable.
H₀: All variables in the data are uncorrelated.
H₁: At least one pair of variables in the data is correlated.
Inference: Since the p-value is 0.00, we reject the null hypothesis, so PCA is advisable.
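A sketch of the test, assuming the factor_analyzer package and df_scaled from the scaling step:

    from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity

    # H0: the correlation matrix is an identity (variables are uncorrelated)
    chi_square, p_value = calculate_bartlett_sphericity(df_scaled)
    print(chi_square, p_value)  # a p-value near 0 rejects H0, so PCA is sensible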

KMO Test

The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy (MSA) is an index used to examine how appropriate PCA is. Generally, if the MSA is less than 0.5, PCA is not recommended, since no reduction is expected. On the other hand, an MSA > 0.7 is expected to provide a considerable reduction in dimensionality and the extraction of meaningful components.

MSA = 0.6615

This is a mediocre value, so some reduction in the data dimension is still expected.
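A sketch using the same factor_analyzer package:

    from factor_analyzer.factor_analyzer import calculate_kmo

    # Per-variable and overall KMO/MSA values
    kmo_per_variable, kmo_overall = calculate_kmo(df_scaled)
    print(kmo_overall)  # ~0.66 here: mediocre, but usable for PCA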

Build the covariance matrix, eigenvalues and eigenvectors.


Eigenvalues and eigenvectors are mainly used to capture the key information stored in a large matrix:

1. They provide a summary of a large matrix.

2. Performing computations on a large matrix is slow and requires more memory and CPU.

Step 1 - Create the covariance matrix

Table 4: Covariance Matrix


Step 2 - Get the eigenvalues and eigenvectors

The 12 × 12 eigenvector matrix of the covariance matrix (values rounded to four decimals; each row is the loading vector of one principal component on the 12 scaled variables):

PC1:  [-0.1613, -0.1390, -0.1271, -0.4255, -0.1773, -0.3565, -0.2104,  0.1376, -0.1767, -0.3912, -0.4250, -0.4133]
PC2:  [-0.3063,  0.4549, -0.2353,  0.0089,  0.3559, -0.2899,  0.4649,  0.4155, -0.1978,  0.0206,  0.0626,  0.0296]
PC3:  [ 0.0795, -0.2299, -0.6217,  0.1918, -0.0922,  0.1128, -0.2366,  0.0450, -0.6114,  0.1428,  0.2077,  0.0304]
PC4:  [ 0.6165,  0.1838, -0.1665, -0.2799,  0.2147,  0.0985,  0.2130, -0.2369, -0.1755, -0.3034, -0.2939,  0.3370]
PC5:  [-0.2567, -0.1960, -0.0432, -0.0310,  0.7633,  0.0196, -0.1387, -0.4843, -0.0229, -0.0497,  0.0554, -0.2237]
PC6:  [ 0.3497, -0.4721,  0.1190,  0.0227,  0.4105, -0.1943, -0.1703,  0.6007,  0.1370,  0.0762, -0.0267,  0.1372]
PC7:  [ 0.1596,  0.0458, -0.0019, -0.0057, -0.0550, -0.6243, -0.0216, -0.3186, -0.0443,  0.6478, -0.2333,  0.0414]
PC8:  [-0.3288, -0.5096,  0.0557,  0.1366, -0.1422, -0.2709,  0.3525, -0.1804, -0.0900, -0.2793, -0.0222,  0.5229]
PC9:  [-0.1685, -0.1981, -0.5563, -0.4360, -0.0416,  0.2173,  0.1581,  0.0315,  0.5126,  0.2764, -0.0782,  0.1123]
PC10: [ 0.2266,  0.0424, -0.4160,  0.5641, -0.0351, -0.2764,  0.0498, -0.0966,  0.4511, -0.3269, -0.0067, -0.2361]
PC11: [ 0.1979, -0.0020,  0.0006, -0.4185, -0.0836, -0.3444,  0.0108, -0.1017,  0.0625, -0.1497,  0.7881, -0.0475]
PC12: [-0.2305,  0.3507, -0.1121, -0.0121,  0.0551, -0.1515, -0.6616,  0.0157,  0.1598, -0.1500, -0.0050,  0.5470]
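A minimal sketch of both steps with NumPy, assuming df_scaled from the scaling section:

    import numpy as np

    # Step 1: covariance matrix of the 12 scaled features
    cov_matrix = np.cov(df_scaled.T)

    # Step 2: eigen-decomposition of the symmetric covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

    # eigh returns eigenvalues in ascending order; sort descending so PC1 is first
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    print(eigenvalues)
    print(eigenvectors.T)  # row i holds the loading vector of PC(i+1)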

Write the explicit form of the first PC (in terms of eigenvectors).

Explicit form of the first PC

The first principal component is the linear combination of the 12 scaled variables weighted by the entries of the first eigenvector:

PC1 = -0.1613·X1 - 0.1390·X2 - 0.1271·X3 - 0.4255·X4 - 0.1773·X5 - 0.3565·X6 - 0.2104·X7 + 0.1376·X8 - 0.1767·X9 - 0.3912·X10 - 0.4250·X11 - 0.4133·X12,

where X1, …, X12 denote the scaled variables in the order of the data frame columns.

Since the number of variables is large and the MSA value is 0.6615, it is expected that a fairly small number of components will be enough to explain most of the variation in the data.

Discuss the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components? What do the eigenvectors indicate? Perform
PCA and export the data of the Principal Component scores into a data frame.

To decide how many eigenvalues/eigenvectors to keep, we should first clearly define the objective of doing PCA in the first place: are we doing it to reduce storage requirements, to reduce dimensionality for a classification algorithm, or for some other reason?

If we do not have any strict constraints, then we should plot the cumulative sum of the eigenvalues. If we divide each value by the total sum of the eigenvalues prior to plotting, the plot shows the fraction of total variance retained versus the number of eigenvalues. The plot then provides a good indication of when we hit the point of diminishing returns.
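A sketch of that plot, assuming the eigenvalues array from the previous section:

    import numpy as np
    import matplotlib.pyplot as plt

    # Fraction of total variance retained versus number of components
    cum_var = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    plt.plot(range(1, len(cum_var) + 1), cum_var, marker="o")
    plt.xlabel("Number of principal components")
    plt.ylabel("Cumulative fraction of variance explained")
    plt.show()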
Figure 8: Scree Plot

From the above plot and the cumulative explained variance, 5 PCs are chosen.

Scree plot: a scree plot helps the analyst visualize the relative importance of the factors; a sharp drop in the plot signals that the subsequent factors are ignorable.
To find the PCA components we use the PCA class from sklearn, as sketched below.
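A minimal sketch of fitting PCA and exporting the component scores to a data frame, assuming df_scaled from the scaling step:

    import pandas as pd
    from sklearn.decomposition import PCA

    # Fit PCA on the scaled data and keep the first 5 components
    pca = PCA(n_components=5)
    scores = pca.fit_transform(df_scaled)

    # Export the principal component scores into a data frame
    df_pca = pd.DataFrame(scores, columns=[f"PC{i + 1}" for i in range(5)])
    print(pca.explained_variance_ratio_.cumsum())  # cumulative variance explained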
Figure 9: Extracting components

The cumulative % gives the percentage of variance accounted for by the first n components. For example, the cumulative percentage for the second component is the sum of the percentages of variance for the first and second components. It helps in deciding the number of components: we select the components that explain most of the variance.

In the above array we see that the first component explains 33.3% of the variance within our data set, while the first two explain 55%, and so on. If we employ 7 components we capture ~92.5% of the variance within the dataset.
Figure 10: Correlation among PCs

Figure 11: Influence of PCs


Mention the business implication of using the Principal Component Analysis for this case
study.
1. PCA is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables. It is a well-established mathematical technique for reducing multidimensional data to lower dimensions while retaining as much of the variation (and hence the information) as possible.
2. PCA can only be done on continuous variables.
3. There are 13 variables in the dataset; after dropping the target and applying PCA, we reduce them to just 5 components, which capture 84.5% of the variance in the dataset.
