AnovaEDA Adv Stats


Project: Advanced Statistics

ANOVA , EDA AND PCA

Student’s Name –SONALI JOSHI | PGP DSBA Online May_D 2021 | August 2021


Contents
1 ANOVA test
1.1 STATE THE NULL AND THE ALTERNATE HYPOTHESIS FOR CONDUCTING ONE-WAY ANOVA FOR BOTH EDUCATION AND OCCUPATION INDIVIDUALLY
1.2 PERFORM ONE-WAY ANOVA FOR EDUCATION WITH RESPECT TO THE VARIABLE ‘SALARY’. STATE WHETHER THE NULL HYPOTHESIS IS ACCEPTED OR REJECTED BASED ON THE ANOVA RESULTS
1.3 PERFORM ONE-WAY ANOVA FOR VARIABLE OCCUPATION WITH RESPECT TO THE VARIABLE ‘SALARY’. STATE WHETHER THE NULL HYPOTHESIS IS ACCEPTED OR REJECTED BASED ON THE ANOVA RESULTS
1.4 IF THE NULL HYPOTHESIS IS REJECTED IN EITHER (1.2) OR IN (1.3), FIND OUT WHICH CLASS MEANS ARE SIGNIFICANTLY DIFFERENT. INTERPRET THE RESULT
1.5 WHAT IS THE INTERACTION BETWEEN THE TWO TREATMENTS? ANALYZE THE EFFECTS OF ONE VARIABLE ON THE OTHER (EDUCATION AND OCCUPATION) WITH THE HELP OF AN INTERACTION PLOT
1.6 PERFORM A TWO-WAY ANOVA BASED ON THE EDUCATION AND OCCUPATION (ALONG WITH THEIR INTERACTION EDUCATION*OCCUPATION) WITH THE VARIABLE ‘SALARY’. STATE THE NULL AND ALTERNATIVE HYPOTHESES AND STATE YOUR RESULTS. HOW WILL YOU INTERPRET THIS RESULT?
1.7 EXPLAIN THE BUSINESS IMPLICATIONS OF PERFORMING ANOVA FOR THIS PARTICULAR CASE STUDY

2 EDA AND PCA
2.1 PERFORM EXPLORATORY DATA ANALYSIS [BOTH UNIVARIATE AND MULTIVARIATE ANALYSIS TO BE PERFORMED]
2.2 IS SCALING NECESSARY FOR PCA IN THIS CASE? GIVE JUSTIFICATION AND PERFORM SCALING
2.3 COMMENT ON THE COMPARISON BETWEEN THE COVARIANCE AND THE CORRELATION MATRICES FROM THIS DATA [ON SCALED DATA]
2.4 CHECK THE DATASET FOR OUTLIERS BEFORE AND AFTER SCALING. WHAT INSIGHT DO YOU DERIVE HERE?
2.5 PERFORM PCA AND EXPORT THE DATA OF THE PRINCIPAL COMPONENT (EIGENVECTORS) INTO A DATA FRAME WITH THE ORIGINAL FEATURES
2.6 WRITE DOWN THE EXPLICIT FORM OF THE FIRST PC (IN TERMS OF THE EIGENVECTORS. USE VALUES WITH TWO PLACES OF DECIMALS ONLY). WRITE THE LINEAR EQUATION OF PC IN TERMS OF EIGENVECTORS AND CORRESPONDING FEATURES
2.7 CONSIDER THE CUMULATIVE VALUES OF THE EIGENVALUES. HOW DOES IT HELP YOU TO DECIDE ON THE OPTIMUM NUMBER OF PRINCIPAL COMPONENTS? WHAT DO THE EIGENVECTORS INDICATE?
2.8 EXPLAIN THE BUSINESS IMPLICATION OF USING THE PRINCIPAL COMPONENT ANALYSIS FOR THIS CASE STUDY. HOW MAY PCS HELP IN THE FURTHER ANALYSIS? [HINT: WRITE INTERPRETATIONS OF THE PRINCIPAL COMPONENTS OBTAINED]

List of Figures
Figure 1.5.1 Interaction between Education and Occupation
Figure 2.1.1 Histplot to check Distribution and Density of each Variable
Figure 2.1.2 Boxplot for checking presence of outliers in each feature or Univariate Analysis of all Variables
Figure 2.1.3 Pair plot to see relationship of all Variables among each other
Figure 2.1.4 Heat Map to check collinearity of Original Data
Figure 2.2.1 Histogram of Variables before performing Scaling
Figure 2.2.2 Histogram of Variables after performing Scaling
Figure 2.3.1 Heatmap of correlation between Variables after performing Scaling
Figure 2.5.1 Scree plot showing distribution of PCs
Figure 2.5.2 Square Graph to check features spread distribution
Figure 2.5.3 Loading of each selected Principal Component
Figure 2.7.1 Comparison between Cumulative and Individual explained Variance

List of Tables
Table 1.2.1 ANOVA for Education
Table 1.3.1 ANOVA for Occupation
Table 1.4.1 Tukey HSD test for Education levels
Table 1.6.1 Two Way ANOVA
Table 1.6.2 Two Way ANOVA
Table 2.1.1 Statistical Description of Original Dataset
Table 2.1.2 Quartile Range for each Variables / Attributes of Dataset
Table 2.2.1 Statistical Description of Scaled Dataset
Table 2.3.2 Least Co-relation strength among Variables after performing Scaling
Table 2.3.3 Highest Co-relation strength among Variables after performing Scaling
Table 2.4.1 Scaled Data summary
Table 2.5.1 Eigenvectors (PC scores) of Original Data (vertical values)
Table 2.5.2 PCA
Table 2.5.4 Heat Map Co-relation Metrics between components and features


1.0 ANOVA test

Problem 1A:
Salary is hypothesized to depend on educational qualification and occupation. To understand the
dependency, the salaries of 40 individuals [SalaryData.csv] are collected and each person’s
educational qualification and occupation are noted. Educational qualification is at three levels,
High school graduate, Bachelor, and Doctorate. Occupation is at four levels, Administrative and
clerical, Sales, Professional or specialty, and Executive or managerial. A different number of
observations are in each level of education – occupation combination.
[Assume that the data follows a normal distribution. In reality, the normality assumption may not
always hold if the sample size is small.]
1. State the null and the alternate hypothesis for conducting one-way ANOVA for both Education
and Occupation individually.
2. Perform a one-way ANOVA on Salary with respect to Education. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.
3. Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.
4. If the null hypothesis is rejected in either (2) or in (3), find out which class means are
significantly different. Interpret the result. (Non-Graded)
Problem 1B:
1. What is the interaction between two treatments? Analyze the effects of one variable on the
other (Education and Occupation) with the help of an interaction plot. [hint: use the ‘pointplot’
function from the ‘seaborn’ library]
2. Perform a two-way ANOVA based on Salary with respect to both Education and Occupation
(along with their interaction Education*Occupation). State the null and alternative hypotheses
and state your results. How will you interpret this result?
3. Explain the business implications of performing ANOVA for this particular case study.

1.1 STATE THE NULL AND THE ALTERNATE HYPOTHESIS FOR CONDUCTING ONE-WAY
ANOVA FOR BOTH EDUCATION AND OCCUPATION INDIVIDUALLY.
Answer:
The null and alternate hypotheses for the one-way ANOVA on Education are:
H0: The mean Salary is the same for every Education level.
Ha1: The mean Salary of at least one Education level differs from the others.

The null and alternate hypotheses for the one-way ANOVA on Occupation are:
H0: The mean Salary is the same for every Occupation type.
Ha2: The mean Salary of at least one Occupation type differs from the others.

The significance level is Alpha = 0.05.

• If the p-value is < 0.05, we reject the null hypothesis.
• If the p-value is >= 0.05, we fail to reject the null hypothesis.

1.2 PERFORM ONE-WAY ANOVA FOR EDUCATION WITH RESPECT TO THE VARIABLE
‘SALARY’. STATE WHETHER THE NULL HYPOTHESIS IS ACCEPTED OR REJECTED
BASED ON THE ANOVA RESULTS.

Answer:
Table 1 : 1.2.1 ANOVA for Education

Since the p-value is less than Alpha we reject Null Hypothesis (H0) for Education.
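
The F statistic behind a one-way ANOVA table can be worked through by hand. The sketch below uses made-up salary figures (not the values in SalaryData.csv) for the three education levels; the numbers and the helper name `one_way_f` are assumptions for illustration only. A library routine such as `scipy.stats.f_oneway` would report the same F together with its p-value.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups.values() for v in g]
    grand_mean = sum(all_values) / len(all_values)

    # Between-group sum of squares: group size times the squared distance
    # of each group mean from the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Within-group sum of squares: spread of values around their own group mean.
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical salaries (in thousands), NOT taken from SalaryData.csv.
salaries = {
    "HS-grad": [50, 60, 70],
    "Bachelors": [150, 160, 170],
    "Doctorate": [200, 210, 220],
}
f_stat, df_b, df_w = one_way_f(salaries)
print(f_stat, df_b, df_w)
```

For these made-up groups the between-group variation is far larger than the within-group variation, giving an F of about 175 on (2, 6) degrees of freedom, which is the kind of result that leads to rejecting H0.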

1.3 PERFORM ONE-WAY ANOVA FOR VARIABLE OCCUPATION WITH RESPECT TO THE
VARIABLE ‘SALARY’. STATE WHETHER THE NULL HYPOTHESIS IS ACCEPTED OR
REJECTED BASED ON THE ANOVA RESULTS.

Answer:
Table 2 : 1.3.1 ANOVA for Occupation

Since the p-value is greater than Alpha we cannot reject Null Hypothesis (H0) for Occupation.

1.4 IF THE NULL HYPOTHESIS IS REJECTED IN EITHER (1.2) OR IN (1.3), FIND OUT WHICH
CLASS MEANS ARE SIGNIFICANTLY DIFFERENT. INTERPRET THE RESULT.

Answer: ANOVA tells us whether our results are significant, but not where the significant
differences lie. A Tukey test lets us interpret the ANOVA result by comparing the group means
pairwise and identifying which specific groups differ. So, whenever an ANOVA rejects the null
hypothesis, we should follow it with a Tukey test to find out where the statistical significance
occurs in our data.

Tukey Test Inference:

The mean Salary differs for every pair of Education levels (reject = True for each pair).

Interpretation of Inference:
The mean Salary differs across Education levels: each of the pairwise comparisons
Bachelors/Doctorate, Bachelors/HS-grad and Doctorate/HS-grad shows a significant difference.

1.5 WHAT IS THE INTERACTION BETWEEN THE TWO TREATMENTS? ANALYZE THE
EFFECTS OF ONE VARIABLE ON THE OTHER (EDUCATION AND OCCUPATION) WITH
THE HELP OF AN INTERACTION PLOT.

Answer:
Figure 1.5.1 Interaction between Education and Occupation

Observation:
• From the above plot we can make out the interaction between Education levels for people in
each Occupation:
  o Adm-clerical: the interaction between Bachelors and Doctorate is fairly strong.
  o Sales: the interaction between Bachelors and Doctorate is strong.
  o Prof-specialty: there is a slight interaction between HS-grad and Bachelors.
  o Across all four occupations there is no interaction at all between HS-grad and Doctorate.
  o Exec-managerial shows no interaction with any Education level.

• From the above plot we can also see how Salary and Occupation vary by Education level:
  o Doctorates: fall in the higher salary brackets and are mostly in Prof-specialty,
    Exec-managerial or Sales roles; very few do Adm-clerical jobs.
  o Bachelors: fall in the mid income range and mostly work in Exec-managerial, Adm-clerical
    or Sales roles; very few are in Prof-specialty roles.
  o HS-grads: fall in the low income brackets, mostly doing Prof-specialty or Adm-clerical
    work; a few are in Sales and hardly any are in Exec-managerial roles.
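
What an interaction plot draws is the mean Salary of each Education and Occupation cell (seaborn's `pointplot` uses the mean as its default point estimate). A minimal sketch of that calculation follows, using made-up rows rather than SalaryData.csv; the rows and the helper `cell_means` are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical (education, occupation, salary) rows, NOT from SalaryData.csv.
rows = [
    ("Bachelors", "Sales", 150), ("Bachelors", "Sales", 170),
    ("Doctorate", "Sales", 200), ("Doctorate", "Sales", 220),
    ("Bachelors", "Adm-clerical", 160), ("Bachelors", "Adm-clerical", 180),
    ("Doctorate", "Adm-clerical", 140), ("Doctorate", "Adm-clerical", 160),
]


def cell_means(rows):
    """Mean salary for every (education, occupation) combination."""
    sums = defaultdict(lambda: [0.0, 0])
    for edu, occ, sal in rows:
        sums[(edu, occ)][0] += sal
        sums[(edu, occ)][1] += 1
    return {cell: total / n for cell, (total, n) in sums.items()}


means = cell_means(rows)

# Salary gap between Doctorate and Bachelors within each occupation.
# If the gap were the same everywhere, the plotted lines would be parallel
# (no interaction); differing gaps mean the lines converge or cross.
gaps = {
    occ: means[("Doctorate", occ)] - means[("Bachelors", occ)]
    for occ in ("Sales", "Adm-clerical")
}
print(gaps)
```

In this made-up example the gap changes sign between the two occupations, so the lines would cross, which is the visual signature of an interaction.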

1.6 PERFORM A TWO-WAY ANOVA BASED ON THE EDUCATION AND OCCUPATION (ALONG
WITH THEIR INTERACTION EDUCATION*OCCUPATION) WITH THE VARIABLE ‘SALARY’.
STATE THE NULL AND ALTERNATIVE HYPOTHESES AND STATE YOUR RESULTS. HOW
WILL YOU INTERPRET THIS RESULT?
Answer:
The null and alternate hypotheses for the two-way ANOVA on Occupation type and Education
level are:
H0: The mean Salary is the same for every combination of Occupation type and Education level.
Ha: The mean Salary differs for at least one combination of Occupation type and Education
level.
The significance level is Alpha = 0.05.
• If the p-value is < 0.05, we reject the null hypothesis.
• If the p-value is >= 0.05, we fail to reject the null hypothesis.

Table 4 : 1.6.1 Two way ANOVA

As we have seen, there is some interaction between the two treatments. So, we introduce an
interaction term when performing the Two Way ANOVA.

Table 5 : 1.6.2 Two Way ANOVA

With the interaction term included, the p-values of the two main treatments change compared
with the Two Way ANOVA without the interaction term.
The p-value of the interaction term between Education and Occupation suggests that the null
hypothesis is rejected in this case.

1.7 EXPLAIN THE BUSINESS IMPLICATIONS OF PERFORMING ANOVA FOR THIS PARTICULAR
CASE STUDY
Answer:

• ANOVA stands for “analysis of variance”. It is a statistical test that compares the means of
groups to determine whether there is a difference between them, relating a categorical
independent variable to a continuous dependent variable. It is used when more than two group
means are compared; for two group means, a t-test can be used.
• In a business context, ANOVA can help manage income/salary decisions; in this case it
compares Salary across Education levels and Occupation types.
• ANOVA can also support forecasting of Salary trends by analysing patterns in the data to
better understand future salary hikes.
• It is also a widely used statistical technique for examining the factors that cause a rise in
Salary. Assuming this report is for an HR department or HR consulting firm, some of the key
takeaways are as below:
i. Salary increases with Education level. On average, Doctorates earn a higher salary than
Bachelors and HS-grads. However, being a Doctorate does not necessarily mean a significantly
higher salary than HS-grad or Bachelors employees. Doctorates may not be suitable for all job
roles, or may not always be preferred over other education levels; they may sometimes be
considered overqualified for certain job roles.
ii. Occupation has less influence on Salary than Education, but at certain levels it does affect
Salary.
iii. We must also note that, for a few occupations, Bachelor’s degree holders are offered higher
salaries than Doctorates. This points to shortcomings in the dataset that reduce the accuracy
of the test and analysis: other important variables, such as years of experience,
specialisation and industry/domain, can also affect salary.
iv. The HR department plays a more comprehensive role when setting salary bands. Similar job
titles in different industries command different salary packages depending on the job profile,
and years of experience also matter when deciding a person's pay scale.
v. The ANOVA test indicates that Education level combined with Occupation has a significant
influence on Salary, more than Occupation type alone.
2 EDA and PCA

Problem 2:
The dataset Education - Post 12th Standard.csv contains information on various colleges. You are
expected to do a Principal Component Analysis for this case study according to the instructions
given. The data dictionary of the 'Education - Post 12th Standard.csv' can be found in the following
file: Data Dictionary.xlsx.
• Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed].
What insight do you draw from the EDA?
• Is scaling necessary for PCA in this case? Give justification and perform scaling.
• Comment on the comparison between the covariance and the correlation matrices from this
data [on scaled data].
• Check the dataset for outliers before and after scaling. What insight do you derive here?
[Please do not treat Outliers unless specifically asked to do so]
• Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both]
• Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame
with the original features.
• Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two
places of decimals only). [hint: write the linear equation of PC in terms of eigenvectors and
corresponding features]
• Consider the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components? What do the eigenvectors indicate?
• Explain the business implication of using the Principal Component Analysis for this case
study. How may PCs help in the further analysis? Write Interpretations of the Principal
Components Obtained.

2.1 PERFORM EXPLORATORY DATA ANALYSIS [BOTH UNIVARIATE AND MULTIVARIATE
ANALYSIS TO BE PERFORMED].
What insight do you draw from the EDA?
Answer:
The first step is to know our data: understand it and get familiar with it. What answers are we
trying to get from the data? What variables are we using, and what do they mean? How does the
data look from a statistical perspective? Is it formatted correctly? Are there missing values?
Duplicates? Outliers? These questions can be answered step by step as below:
Step 1: Import a) all the necessary libraries and b) the data.

Step 2: Describe the data after loading it: check the datatypes, the number of rows and columns
and the number of missing values, and describe the min, max and mean values. Depending on the
requirement, drop or replace missing values.
Step 3: Review the new dataset, identify outliers using the Interquartile Range (IQR), and
visualise them.

EDA

Univariate Analysis

Univariate analysis refers to the analysis of a single variable. Its main purpose is to summarise
the data and find patterns in it.
We use the statistical description of each numeric variable, a histogram or distplot to view the
distribution, and a box plot to view the five-point summary and any outliers.
Table 6 : 2.1.1 Statistical Description of Original Dataset
Observations:
• The data consists of 777 universities with 18 variables: 17 numeric variables and one
categorical variable (Name).
• Index of columns: 'Name', 'Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc', 'F.Undergrad',
'P.Undergrad', 'Outstate', 'Room.Board', 'Books', 'Personal', 'PhD', 'Terminal', 'S.F.Ratio',
'perc.alumni', 'Expend', 'Grad.Rate'.
• perc.alumni has a minimum value of 0. This needs to be cleaned.
• There are no missing values in the data.
• The one categorical field, Name, needs cleanup.
• Very few students are toppers (Top 10% and Top 25%).
• There are no duplicate records.
• P.Undergrad has a minimum value of 1 and a maximum value of 21836. This has to be cleaned.
• Accept has a minimum value of 72 and a maximum value of 26330 and needs cleanup.
• Apps has a minimum value of 81 and a maximum value of 48094 and needs cleanup.
• Books has a minimum value of 96 and a maximum value of 2340. This has to be cleaned.
• Enroll has a minimum value of 35 and a maximum value of 6392. This has to be cleaned.
• F.Undergrad has a minimum value of 139 and a maximum value of 31643 and needs cleanup.
• From the quick summary table we found that the maximum Grad.Rate is 118%, which needs to be
rectified as it should not exceed 100. The error is on row 95 (Cazenovia College) and has to be
corrected using the median value.
• From the quick summary table we found that the maximum PhD value is 103%, which needs to be
rectified as it should not exceed 100. The error is on row 582 and has to be corrected using the
median value.

Table 7 : 2.1.2 Quartile Range for each Variables / Attributes of Dataset

Columns        Lower Range    Upper Range    IQR
Apps           -3496          7896           2848
Accept         -2126          5154           1820
Enroll         -748           1892           660
Top10perc      -15            65             20
Top25perc      -1             111            28
F.Undergrad    -3527.5        8524.5         3013
P.Undergrad    -1213          2275           872
Outstate       -1087.5        21332.5        5605
Room.Board     1417.5         7229.5         1453
Books          275            795            130
Personal       -425           2975           850
PhD            27.5           119.5          23
Terminal       39.5           123.5          21
S.F.Ratio      4              24             5
perc.alumni    -14            58             18
Expend         632.5          16948.5        4079
Grad.Rate      15.5           115.5          25
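
The Lower Range and Upper Range columns are consistent with the usual 1.5 × IQR rule: lower = Q1 - 1.5 × IQR and upper = Q3 + 1.5 × IQR. The sketch below reproduces the Apps row; the quartiles Q1 = 776 and Q3 = 3624 are back-calculated from that row rather than read from the dataset, and the helper name `iqr_fences` is an assumption for illustration.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence, IQR) for the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr, iqr


# Apps quartiles back-calculated from the table row above (assumption).
lower, upper, iqr = iqr_fences(776, 3624)
print(lower, upper, iqr)
```

Any value below the lower fence or above the upper fence is flagged as an outlier. A negative lower fence, as for Apps, simply means no low-side outliers are possible for a count that cannot go below zero.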

Observations:
• There is a drastic difference between the Lower Range and Upper Range values for most of the
variables.
• This indicates the presence of outliers.
• To get the most accurate predictions, outliers should be treated before scaling.
• As mentioned in the FAQs, we have not done outlier treatment for PCA.
Figure 2.1.1 Histplot to check Distribution and Density of each Variables

Observations:
• Left-skewed data (long tail towards low values) in variables: 'PhD', 'Terminal'.
• Right-skewed data (long tail towards high values) in variables:
'Apps', 'Accept', 'Enroll', 'Top10perc', 'F.Undergrad', 'P.Undergrad', 'Room.Board', 'Books', 'Personal',
'S.F.Ratio', 'perc.alumni', 'Expend'.
• Approximately normally distributed (bell curve) data in variables: 'Top25perc', 'Outstate', 'Grad.Rate'.
Figure 2.1.2 Boxplot for checking presence of outliers in each feature or Univariate Analysis of all Variables
Multivariate Analysis:
Figure 2.1.3 Pair plot to see relationship of all Variables among each other
Observation:

A few pairs have very high correlation:

o Application and acceptance
o Students from top 10% schools and from top 25% schools
o Students from top 10% schools and Graduation rate
o Enrollment and Full time undergrad students
o PhD faculty and Terminal.

The heatmap below exhibits a multicollinearity issue, as a significant number of variable pairs are highly correlated.
Multicollinearity arises when independent variables are highly correlated with one another, which undermines the
statistical significance of the individual variables.

Figure 2.1.4 Heat Map to check collinearity of Original Data


2.2 IS SCALING NECESSARY FOR PCA IN THIS CASE? GIVE JUSTIFICATION AND PERFORM
SCALING.

Answer:
• Our dataset initially has 17 numeric attributes (excluding Name), hence we get 17 principal
components.
• Once we get the amount of variance explained by each principal component, we can decide
how many components we need for our model based on the amount of information we want to
retain.
• PCA calculates a new projection of our data set along the directions of greatest variance.
• Without scaling, variables measured on larger scales have larger variances, which skews the
PCA towards high-magnitude features.
• If we normalize our data, all variables have the same standard deviation, so all variables
have the same weight and PCA calculates relevant axes.
• Scaling can also speed up gradient descent or other calculations in algorithms.
• Hence, yes, it is necessary to normalize the data before performing PCA.
• Scaling can be done using the Z-score method or StandardScaler in scikit-learn.
Formula for Z-score: z = (x - μ) / σ, where μ is the mean and σ the standard deviation of the feature.

StandardScaler works best when each feature is roughly normally distributed; it scales each feature such
that its distribution is centred around 0, with a standard deviation of 1.
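
A minimal sketch of column-wise Z-score scaling with NumPy follows. The four rows are made-up values on two very different scales, not rows from the dataset. scikit-learn's `StandardScaler` applies the same formula, also using the population standard deviation (`ddof=0`).

```python
import numpy as np

# Hypothetical values for two features on very different scales.
X = np.array([
    [81.0, 10.0],
    [1500.0, 12.5],
    [3600.0, 14.0],
    [48000.0, 18.0],
])

# Column-wise z-score: subtract the column mean, divide by the column
# standard deviation (NumPy's default ddof=0, the population formula).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z.mean(axis=0).round(6))  # each column is centred on 0
print(Z.std(axis=0).round(6))   # each column has unit standard deviation
```

After scaling, both columns contribute on the same footing even though the first one originally ranged over tens of thousands.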

Figure 2.2.1 Histogram of Variables before performing Scaling


Figure 2.2.2 Histogram of Variables after performing Scaling

Table 8 : 2.2.1 Statistical Description of Scaled Dataset

Observations:
• After scaling, the standard deviation is 1.0 for all variables.
• After scaling, the difference between the Q1 (25%) value and the minimum value is smaller than
in the original dataset for most of the variables.
2.3 COMMENT ON THE COMPARISON BETWEEN THE COVARIANCE AND THE
CORRELATION MATRICES FROM THIS DATA. [ON SCALED DATA]

• Both covariance and correlation matrices measure the relationship and the dependency
between pairs of variables.
• “Covariance” indicates the direction of the linear relationship between variables.
• “Correlation”, on the other hand, measures both the strength and the direction of the linear
relationship between two variables.
• Correlation is the scaled form of covariance. Covariance is affected by a change in scale;
correlation is not.
• Because the data have been standardised (each variable has unit standard deviation), the
covariance matrix of the scaled data is essentially the same as the correlation matrix.
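
The link between the covariance of scaled data and the correlation matrix can be checked numerically. The sketch below uses randomly generated stand-in data (not the college dataset); all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows, 3 features on very different scales.
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])
X[:, 1] += 0.02 * X[:, 2]  # make two of the columns correlated

# Standardise each column (population standard deviation, ddof=0).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# bias=True makes np.cov divide by n, matching the ddof=0 scaling above.
cov_scaled = np.cov(Z, rowvar=False, bias=True)
corr_raw = np.corrcoef(X, rowvar=False)

print(np.allclose(cov_scaled, corr_raw))
```

With the default sample covariance (dividing by n - 1) the two matrices differ only by the constant factor n / (n - 1), which is negligible for several hundred rows.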

Figure 2.3.1 Heatmap of correlation between Variables after performing Scaling


Table 9 : 2.3.2 Least Co-relation strength among Variables after performing Scaling

Lowest (most negative) Correlation between Variables

Variable 1     Variable 2    Correlation
Expend         S.F.Ratio     -0.584
Outstate       S.F.Ratio     -0.555
perc.alumni    S.F.Ratio     -0.403
Top10perc      S.F.Ratio     -0.385
Room.Board     S.F.Ratio     -0.363
S.F.Ratio      Grad.Rate     -0.307

Table 10 : 2.3.3 Highest Co-relation strength among Variables after performing Scaling

Strong Correlation between Variables

Variable 1     Variable 2     Correlation
Enroll         F.Undergrad    0.965
Apps           Accept         0.943
Accept         Enroll         0.912
Top10perc      Top25perc      0.892
Accept         F.Undergrad    0.874
PhD            Terminal       0.85
Enroll         Apps           0.847
F.Undergrad    Apps           0.814
Outstate       Expend         0.673
Top10perc      Expend         0.661
Room.Board     Outstate       0.654
F.Undergrad    P.Undergrad    0.571
Grad.Rate      Outstate       0.571
perc.alumni    Outstate       0.566
Outstate       Top10perc      0.562
Top25perc      PhD            0.546
Top10perc      PhD            0.532
Expend         Top25perc      0.527
Top25perc      Terminal       0.525
P.Undergrad    Enroll         0.513
Expend         Room.Board     0.502

Observations:
• The highest correlations are seen between:
  o Enroll and F.Undergrad
  o Enroll and Accept
  o Apps and Accept
• The most negative correlations all involve S.F.Ratio, paired with Expend, Outstate,
perc.alumni, Top10perc, Room.Board and Grad.Rate.

2.4 CHECK THE DATASET FOR OUTLIERS BEFORE AND AFTER SCALING. WHAT INSIGHT DO
YOU DERIVE HERE?
Answer:
While performing the univariate analysis, we plotted boxplots for all the variables to check for
the presence of outliers.
After standardising the data, we drew the boxplots again on the scaled data and also used the
describe function. Scaling made little difference to the number of outliers.
Figure 2.4.1 Boxplot for Scaled Data for all Variables

Scaling shrinks the range of the feature values. However, the outliers influence the empirical
mean and standard deviation used for scaling, so StandardScaler cannot guarantee balanced
feature scales in the presence of outliers.
Table 11 : 2.4.1 Scaled Data summary

Standardisation is a linear transformation, so it changes the scale of the values but not their
relative positions: the outliers in the data remain outliers after scaling.
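
A small numerical check of this point: the sketch below counts outliers by the 1.5 × IQR rule on a made-up column before and after Z-score scaling. The data and the helper name `count_iqr_outliers` are assumptions for illustration.

```python
import numpy as np


def count_iqr_outliers(values, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return int(mask.sum())


# Hypothetical column with one extreme value.
raw = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
scaled = (raw - raw.mean()) / raw.std()

before = count_iqr_outliers(raw)
after = count_iqr_outliers(scaled)
print(before, after)
```

The quartiles and fences move with the data under scaling, so the same point is flagged both times.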

2.5 PERFORM PCA AND EXPORT THE DATA OF THE PRINCIPAL COMPONENT
(EIGENVECTORS) INTO A DATA FRAME WITH THE ORIGINAL FEATURES
Answer :
In the table below we can see that the first PC (the first eigenvector array) explains 33.12% of
the variance in our dataset, while the first seven components capture 70.12% of the variance.
Figure 2.5.1 Scree plot showing distribution of PCs
After Z-score scaling, we reduced the dimensions from 17 PCs to 9 PCs. Results are shown
below.
Table 12 : 2.5.1 Eigenvectors (PC scores) of Original Data (vertical values)

Table 13 : 2.5.2 PCA

Figure 2.5.2 Square Graph to check features spread distribution

Figure 2.5.4 Heat Map Co-relation Metrics between components and features
Figure 2.5.3 Loading of each selected Principal Component
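
The eigen-decomposition behind these tables can be sketched with NumPy alone. The data below are random stand-ins for three of the columns (the real CSV is not loaded), and all variable names are assumptions for illustration. scikit-learn's `PCA` class exposes the corresponding results as `components_` and `explained_variance_ratio_`; this sketch is a hand-rolled equivalent, not the notebook code that produced the tables.

```python
import numpy as np

rng = np.random.default_rng(42)
feature_names = ["Apps", "Accept", "Enroll"]  # subset of columns, for illustration

# Hypothetical data: two strongly related columns and one independent column.
base = rng.normal(size=(100, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(100, 1)),
    base + 0.2 * rng.normal(size=(100, 1)),
    rng.normal(size=(100, 1)),
])

# Standardise, then take the covariance matrix of the scaled data.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(Z, rowvar=False, bias=True)

# eigh returns eigenvalues in ascending order; reorder to descending.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # column j holds the weights of PC(j+1)

explained_ratio = eigenvalues / eigenvalues.sum()

# Loadings table: one row per original feature, one column per component.
for name, row in zip(feature_names, eigenvectors):
    print(f"{name:8s}", np.round(row, 2))
print("explained variance ratio:", np.round(explained_ratio, 4))
```

The signs of eigenvectors are arbitrary, so a library may return the same component with all weights negated; the explained variance is unaffected.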
2.6 WRITE DOWN THE EXPLICIT FORM OF THE FIRST PC (IN TERMS OF THE
EIGENVECTORS. USE VALUES WITH TWO PLACES OF DECIMALS ONLY). WRITE THE
LINEAR EQUATION OF PC IN TERMS OF EIGENVECTORS AND CORRESPONDING
FEATURES.

In PCA, given a mean-centred dataset X with n samples and p variables, the first principal component PC1 is
given by a linear combination of the original variables X_1, X_2, …, X_p:
PC_1 = w_{11}X_1 + w_{12}X_2 + … + w_{1p}X_p

The first principal component PC1 is the component that retains the maximum variance of
the data. The weight vector w_1 = (w_{11}, w_{12}, …, w_{1p}) is the eigenvector of the covariance matrix
with the largest eigenvalue.

The explicit form of PC1 (its eigenvector weights) is as below:

[ 0.2487656 ,  0.2076015 ,  0.17630359,  0.35427395,  0.34400128,
  0.15464096,  0.0264425 ,  0.29473642,  0.24903045,  0.06475752,
 -0.04252854,  0.31831287,  0.31705602, -0.17695789,  0.20508237,
  0.31890875,  0.25231565]
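
The weights can be turned into the requested linear equation with two decimal places. The sketch below pairs each weight with a feature name, assuming the weights follow the column order listed in section 2.1 with 'Name' excluded; that ordering and the helper name `pc_equation` are assumptions for illustration.

```python
weights = [
    0.2487656, 0.2076015, 0.17630359, 0.35427395, 0.34400128,
    0.15464096, 0.0264425, 0.29473642, 0.24903045, 0.06475752,
    -0.04252854, 0.31831287, 0.31705602, -0.17695789, 0.20508237,
    0.31890875, 0.25231565,
]
# Assumed order: the numeric columns as listed in section 2.1.
features = [
    "Apps", "Accept", "Enroll", "Top10perc", "Top25perc",
    "F.Undergrad", "P.Undergrad", "Outstate", "Room.Board", "Books",
    "Personal", "PhD", "Terminal", "S.F.Ratio", "perc.alumni",
    "Expend", "Grad.Rate",
]


def pc_equation(weights, features, name="PC1"):
    """Format a principal component as a signed sum of weight * feature terms."""
    terms = []
    for i, (w, f) in enumerate(zip(weights, features)):
        term = f"{abs(w):.2f} * {f}"
        if i == 0:
            terms.append(("-" if w < 0 else "") + term)
        else:
            terms.append(("- " if w < 0 else "+ ") + term)
    return f"{name} = " + " ".join(terms)


equation = pc_equation(weights, features)
print(equation)
```

Under that ordering assumption this gives PC1 = 0.25 * Apps + 0.21 * Accept + 0.18 * Enroll + 0.35 * Top10perc + 0.34 * Top25perc + 0.15 * F.Undergrad + 0.03 * P.Undergrad + 0.29 * Outstate + 0.25 * Room.Board + 0.06 * Books - 0.04 * Personal + 0.32 * PhD + 0.32 * Terminal - 0.18 * S.F.Ratio + 0.21 * perc.alumni + 0.32 * Expend + 0.25 * Grad.Rate, where each feature is the scaled variable.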
2.7 CONSIDER THE CUMULATIVE VALUES OF THE EIGENVALUES. HOW DOES IT HELP YOU
TO DECIDE ON THE OPTIMUM NUMBER OF PRINCIPAL COMPONENTS? WHAT DO THE
EIGENVECTORS INDICATE?
Figure 2.7.1 Comparison between Cumulative and Individual explained Variance

Observations:
• The plot shows visually how much of the variance is explained by how many principal
components.
• In the plot above we see that the first PC explains 33.13% of the variance, the first two PCs
together explain 57.19%, and so on.
• Effectively, we can explain a material share of the variance (i.e. 90%) by analysing 9 principal
components instead of all 17 variables (attributes) in the dataset.

PCA uses the eigenvectors of the covariance matrix to figure out how we should rotate the
data. Because rotation is a kind of linear transformation, the new dimensions are weighted sums
of the old ones. The eigenvectors (principal components) determine the directions, or axes, along
which the linear transformation acts, stretching or compressing input vectors. They are the lines
of change that represent the action of the larger matrix, the very “line” in linear transformation.
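
Choosing the number of components from the cumulative explained variance can be sketched as below. The eigenvalues are made-up numbers, not the ones obtained for this dataset, and the helper name `components_for` is an assumption for illustration.

```python
from itertools import accumulate


def components_for(eigenvalues, threshold=0.90):
    """Smallest number of leading components whose cumulative share of
    variance reaches the threshold, plus the cumulative shares themselves."""
    total = sum(eigenvalues)
    cumulative = list(accumulate(ev / total for ev in eigenvalues))
    for k, share in enumerate(cumulative, start=1):
        if share >= threshold:
            return k, cumulative
    return len(eigenvalues), cumulative


# Hypothetical eigenvalues, already sorted in descending order.
eigenvalues = [5.0, 2.0, 1.5, 1.0, 0.5]
k, cumulative = components_for(eigenvalues)
print(k, [round(c, 2) for c in cumulative])
```

Here three components reach only about 85% of the variance, so four are needed to pass the 90% mark; the same rule applied to the real eigenvalues is what leads to keeping 9 of the 17 components.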
2.8 EXPLAIN THE BUSINESS IMPLICATION OF USING THE PRINCIPAL COMPONENT
ANALYSIS FOR THIS CASE STUDY. HOW MAY PCS HELP IN THE FURTHER ANALYSIS?
[HINT: WRITE INTERPRETATIONS OF THE PRINCIPAL COMPONENTS OBTAINED]

The business implication of using the Principal Component Analysis:

• PCA is used in exploratory data analysis and for building predictive models. It can be applied
only to continuous variables.
• PCA is used for dimensionality reduction: each data point is projected onto only the first few
principal components to obtain lower-dimensional data while preserving as much of the data's
variation as possible. In this case we can reduce the dimensions from 17 to 9, which explain
over 90% of the variance.
• The first principal component can equivalently be defined as the direction that maximizes the
variance of the projected data.
• The i-th principal component can be taken as a direction orthogonal (i.e. at 90 degrees) to the
first i-1 principal components that maximizes the variance of the projected data.
