Data Mining Project DSBA PCA Report Final
List of Tables
Table 1: Dataset head
Table 2: Dataset info
Table 3: Dataset summary
Table 4: Covariance matrix
Problem Statement
We will start analyzing the data by going through the following basic steps (sketched in code after the list):
1. Check head
2. Check info
3. Check summary
4. Check nulls
5. Check duplicates
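A minimal sketch of these five checks with pandas, assuming the dataset loads into a DataFrame named `df` (the file name below is hypothetical):

```python
import pandas as pd

# Hypothetical file name; substitute the actual dataset path.
df = pd.read_csv("dataset.csv")

print(df.head())              # 1. first five rows
df.info()                     # 2. dtypes and non-null counts per column
print(df.describe())          # 3. summary statistics
print(df.isnull().sum())      # 4. null count per column
print(df.duplicated().sum())  # 5. number of duplicate rows
```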
Checking Summary
Checking Nulls
Checking Duplicates
● Ecom and SalesFImage are positively skewed, i.e., the tail extends to the right and most values lie to the left of the mean.
● Advertising, WartyClaim, and Satisfaction are very slightly positively skewed.
● ProdQual, TechSup, CompRes, ComPricing, OrdBilling, and DelSpeed are negatively skewed.
● The remaining fields are very slightly negatively skewed (the skewness values can be reproduced as sketched below).
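A one-line check of these readings, assuming `df` from the loading step:

```python
# Skewness per column: positive values indicate a right tail,
# negative values a left tail.
print(df.skew(numeric_only=True).sort_values())
```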
1. A strong correlation is observed between a few fields; 'TechSup' is highly correlated with 'WartyClaim'.
We choose to treat outliers for this case. Do you think that treating outliers for this case
is necessary?
In this project, we have chosen to treat outliers before running PCA on the "PCA - Primary census abstract" dataset, which consists of 12 numeric columns. The decision to treat outliers is based on several reasons:
1. Outliers can increase the error variance and reduce the power of statistical tests. If the
outliers are non-randomly distributed, they can also violate the assumption of normality.
2. Most machine learning algorithms, including PCA, may not perform well in the presence of
outliers. Outliers can significantly impact the results and distort the principal components.
Therefore, treating outliers in this case is necessary to obtain meaningful and accurate results from
the PCA analysis.
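The report does not show the treatment code itself; one common approach is IQR-based capping, sketched below under the assumption that all columns of `df` are numeric (the helper name is ours):

```python
# IQR capping: values beyond 1.5 * IQR from the quartiles are pulled
# back to the whisker limits rather than dropped, so no rows are lost.
def cap_outliers(col):
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return col.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

df_treated = df.apply(cap_outliers)  # column-wise capping
```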
Scale the variables and explain the rationale for the type of scaling function used in this case study.
The main objective of scaling is to normalize the data within a particular range. It is a data-preprocessing step applied to the independent variables (features). Scaling also helps speed up the calculations in many algorithms.
The z-score is a variation of scaling that expresses each value as the number of standard deviations it lies from the mean. Z-scoring ensures the feature distributions have mean = 0 and standard deviation = 1. It is useful when there are a few outliers, but none so extreme that clipping is needed.
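A minimal z-score scaling sketch with scikit-learn's StandardScaler, assuming `df_treated` from the outlier step (any numeric DataFrame works):

```python
from sklearn.preprocessing import StandardScaler
import pandas as pd

scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_treated),
                         columns=df_treated.columns)

# Each column now has mean ~0 and standard deviation ~1.
print(df_scaled.mean().round(2))
print(df_scaled.std().round(2))
```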
Comment on the comparison between covariance and correlation matrix after scaling.
Figure 5: Covariance matrix.
Figure 6: Correlation matrix.
Correlation is a scaled version of covariance: it divides the covariance by the product of the two standard deviations, so the two quantities always have the same sign (positive, negative, or zero). When the sign is positive, the variables are said to be positively correlated; when it is negative, negatively correlated; and when it is zero, uncorrelated. In a simple sense, correlation measures both the strength and the direction of the linear relationship between two variables.
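This relationship is easy to verify numerically: after z-scaling, the covariance matrix of the data coincides with the correlation matrix of the unscaled data (up to the n versus n-1 divisor convention). A quick check, assuming `df_treated` and `df_scaled` from the earlier steps:

```python
import numpy as np

# Covariance of the scaled data vs. correlation of the unscaled data;
# the small tolerance absorbs the n vs. n-1 divisor difference.
print(np.allclose(df_scaled.cov(), df_treated.corr(), atol=0.05))
```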
Check the dataset for outliers before and after scaling. Draw your inferences from this
exercise.
During univariate analysis we already checked for outliers. Now, after scaling, we will check for outliers again.
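A minimal sketch of the post-scaling boxplot check (assuming `df_scaled`):

```python
import matplotlib.pyplot as plt

# All variables share the z-score scale, so remaining outliers
# are directly comparable across columns.
df_scaled.boxplot(rot=45, figsize=(10, 4))
plt.title("Boxplots after scaling")
plt.tight_layout()
plt.show()
```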
Figure 7: Boxplot
Bartlett's Test of Sphericity
Bartlett's test of sphericity tests the hypothesis that the variables are uncorrelated in the population. If the
null hypothesis cannot be rejected, then PCA is not advisable.
𝑯𝟎: All variables in the data are uncorrelated
𝑯𝟏: At least one pair of variables in the data is correlated
Inference: Since the p-value is 0.00, which is below the 0.05 significance level, we reject the null hypothesis; the variables are correlated and PCA is advisable.
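One way to run this test is the factor_analyzer package; the report's exact tooling is not shown, so this is an assumption:

```python
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity

# Bartlett's test on the scaled data; a p-value below 0.05
# rejects the "all variables uncorrelated" null hypothesis.
chi_square, p_value = calculate_bartlett_sphericity(df_scaled)
print(f"chi-square = {chi_square:.2f}, p-value = {p_value:.4f}")
```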
KMO Test
The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy (MSA) is an index used to examine how
appropriate PCA is. Generally, if MSA is less than 0.5, PCA is not recommended, since no reduction is
expected. On the other hand, MSA > 0.7 is expected to provide a considerable reduction in the
dimension and extraction of meaningful components.
MSA = 0.6615
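The same package also provides the KMO measure; a sketch, again assuming factor_analyzer is the tool in use:

```python
from factor_analyzer.factor_analyzer import calculate_kmo

# kmo_model is the overall MSA quoted above (0.6615).
kmo_per_variable, kmo_model = calculate_kmo(df_scaled)
print(f"Overall MSA = {kmo_model:.4f}")
```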
2. Performing computations on a large matrix is slow and requires more memory and CPU.
Covariance Matrix
Eigenvectors
Since the number of variables is large and the MSA value of 0.6615 is above the 0.5 threshold, it is expected that a few components will be enough to explain 90% of the variation in the data.
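A sketch of how the covariance matrix and its eigen-decomposition can be computed with NumPy (assuming `df_scaled`):

```python
import numpy as np

# Rows of df_scaled.T are variables, so np.cov returns the p x p matrix.
cov_matrix = np.cov(df_scaled.T)

# eigh suits symmetric matrices; it returns ascending eigenvalues,
# so we reorder to descending.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
print(eigenvalues)
```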
Discuss the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components? What do the eigenvectors indicate? Perform
PCA and export the data of the Principal Component scores into a data frame.
If we don't have any strict constraints, we should plot the cumulative sum of the eigenvalues. If each eigenvalue is divided by the total sum of the eigenvalues before plotting, the plot shows the fraction of total variance retained versus the number of eigenvalues. The plot then gives a good indication of when the point of diminishing returns is reached.
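A sketch of that cumulative-variance plot, using the eigenvalues computed above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Fraction of total variance retained by the first k components.
cum_var = np.cumsum(eigenvalues) / eigenvalues.sum()
plt.plot(range(1, len(cum_var) + 1), cum_var, marker="o")
plt.axhline(0.9, linestyle="--")  # illustrative 90% retention target
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.show()
```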
Figure 8: Scree Plot
From the above plot and the cumulative explained variance, 5 PCs are chosen.
Scree plot: A scree plot helps the analyst visualize the relative importance of the factors; a sharp drop in the plot signals that the subsequent factors are ignorable.
To find the PCA components we use the PCA class from sklearn, as sketched below.
Figure 9: Extracting components
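A minimal sketch of the extraction and the export of the component scores to a data frame, as the question asks (column and file names are illustrative):

```python
from sklearn.decomposition import PCA
import pandas as pd

# Fit with all components first to inspect the variance ratios,
# then keep the first 5 per the scree-plot choice above.
pca = PCA()
scores = pca.fit_transform(df_scaled)
print(pca.explained_variance_ratio_.cumsum())

n_keep = 5
df_scores = pd.DataFrame(scores[:, :n_keep],
                         columns=[f"PC{i+1}" for i in range(n_keep)])
df_scores.to_csv("pc_scores.csv", index=False)  # illustrative file name
```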
The cumulative % gives the percentage of variance accounted for by the first n components. For example, the cumulative percentage for the second component is the sum of the percentages of variance for the first and second components. It helps in deciding the number of components: we select the components that together explain a high proportion of the variance.
In the above array we see that the first component explains 33.3% of the variance within our dataset, while the first two explain 55%, and so on. If we employ 7 components we capture ~92.5% of the variance within the dataset.
Figure 10: Correlation among PCs
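Since principal components are uncorrelated by construction, the matrix in Figure 10 should be close to the identity; a quick check on the exported scores:

```python
# Off-diagonal correlations between PC scores should be ~0.
print(df_scores.corr().round(3))
```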