
JULY 18, 2021

Project: Advanced Statistics

SUDHANVA SARALAYA
Contents
Project: Advanced Statistics
Packages and Common procedures followed across this Assignment
1. Problem 1A:
1.1 State the null and the alternate hypothesis for conducting one-way ANOVA for both Education and Occupation individually
1.2 Perform a one-way ANOVA on Salary with respect to Education. State whether the null hypothesis is accepted or rejected based on the ANOVA results
1.3 Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null hypothesis is accepted or rejected based on the ANOVA results
1.4 If the null hypothesis is rejected in either (2) or in (3), find out which class means are significantly different. Interpret the result. (Non-Graded)
2. Problem 1B:
1.1 What is the interaction between two treatments? Analyse the effects of one variable on the other (Education and Occupation) with the help of an interaction plot [hint: use the ‘pointplot’ function from the ‘seaborn’ library]
1.2 Perform a two-way ANOVA based on Salary with respect to both Education and Occupation (along with their interaction Education*Occupation). State the null and alternative hypotheses and state your results. How will you interpret this result?
1.3 Explain the business implications of performing ANOVA for this case study
3. Problem 2:
1.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA?
1.2 Is scaling necessary for PCA in this case? Give justification and perform scaling
1.3 Comment on the comparison between the covariance and the correlation matrices from this data [on scaled data]
1.4 Check the dataset for outliers before and after scaling. What insight do you derive here? [Please do not treat Outliers unless specifically asked to do so]
1.5 Extract the eigenvalues and eigenvectors [Using Sklearn PCA Print Both]
1.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features
1.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only) [hint: write the linear equation of PC in terms of eigenvectors and corresponding features]
1.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate?
1.9 Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components Obtained]
Packages and Common procedures followed across this Assignment.
 Packages
o import numpy as np
o import pandas as pd
o import seaborn as sns
o from statsmodels.formula.api import ols # for n-way ANOVA
o from statsmodels.stats.anova import anova_lm # for n-way ANOVA
o import matplotlib.pyplot as plt
o %matplotlib inline
o import scipy.stats as stats
o from statsmodels.stats.multicomp import pairwise_tukeyhsd
o from statsmodels.stats.multicomp import MultiComparison
o from warnings import filterwarnings # suppress warnings when they show
o filterwarnings("ignore")

 Process followed in common across the project

o Importing the data with pd.read_csv()
o Checking that the data has been imported correctly
o Assigning the dataset to a variable
o Checking the description of the data using describe()
o Checking the information of the data using info()
o Checking for null values in the dataset
o Reading the question and working out the answer
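As a sketch of this common procedure (using a small in-memory stand-in for the CSV, since the actual file path is assignment-specific):

```python
import pandas as pd
from io import StringIO

# Stand-in for pd.read_csv("SalaryData.csv") -- a few illustrative rows only.
csv_data = StringIO(
    "Education,Occupation,Salary\n"
    "Doctorate,Prof-specialty,250000\n"
    "Bachelors,Sales,190000\n"
    "HS-grad,Adm-clerical,80000\n"
)
df = pd.read_csv(csv_data)

print(df.head())                  # check the import by pulling the first records
print(df.describe())              # summary statistics of the numeric columns
df.info()                         # column dtypes and non-null counts
null_counts = df.isnull().sum()   # null values per column
print(null_counts)
```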
1. Problem 1A:
Salary is hypothesized to depend on educational qualification and occupation. To understand
the dependency, the salaries of 40 individuals [SalaryData.csv] are collected and each
person’s educational qualification and occupation are noted. Educational qualification is at
three levels, High school graduate, Bachelor, and Doctorate. Occupation is at four levels,
Administrative and clerical, Sales, Professional or specialty, and Executive or managerial. A
different number of observations are in each level of education – occupation combination.
 [Assume that the data follows a normal distribution. The normality assumption may not
always hold if the sample size is small.]
 We first load the data.
 To check the data, we pull out the first five records.
 We next see the shape and information of the dataset.
o Shape of the data is forty rows and three columns.
o Info – Education, Occupation and Salary with no null values
o Education and Occupation are object type and Salary is integer type
 Next, we check the summary of the data.

 We next check the value counts for Education and Occupation, shown below.
Education
Doctorate 16
Bachelor’s 15
HS-grad 9
Occupation
Prof-specialty 13
Sales 12
Adm-clerical 10
Exec-managerial 5
1.1 State the null and the alternate hypothesis for conducting one-way ANOVA for both
Education and Occupation individually.
For conducting ANOVA, we use the OLS model, that is, the ordinary least squares method.
 The hypotheses for the one-way ANOVA based on Education:
o H0: Salary does not depend on educational qualification (mean salary is the same across education levels).
o Ha: Salary depends on educational qualification (at least one education level has a different mean salary).
 The hypotheses for the one-way ANOVA based on Occupation:
o H0: Salary does not depend on occupation (mean salary is the same across occupations).
o Ha: Salary depends on occupation (at least one occupation has a different mean salary).

1.2 Perform a one-way ANOVA on Salary with respect to Education. State whether the
null hypothesis is accepted or rejected based on the ANOVA results.

 As the p-value 1.257709e-08 is less than alpha = 0.05, we reject the null hypothesis and
conclude that Salary does depend on Education.
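A minimal sketch of the one-way ANOVA call (the toy salary values below are illustrative, not the assignment's data; the same call with `C(Occupation)` in the formula answers 1.3):

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Illustrative stand-in for SalaryData.csv.
df = pd.DataFrame({
    "Education": ["Doctorate", "Doctorate", "Bachelors",
                  "Bachelors", "HS-grad", "HS-grad"],
    "Salary": [250000, 240000, 150000, 160000, 80000, 90000],
})

# Fit an OLS model with Education as a categorical factor, then run ANOVA.
model = ols("Salary ~ C(Education)", data=df).fit()
table = anova_lm(model)
print(table)

# The PR(>F) column holds the p-value that is compared against alpha = 0.05.
p_value = table.loc["C(Education)", "PR(>F)"]
```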

1.3 Perform a one-way ANOVA on Salary with respect to Occupation. State whether the
null hypothesis is accepted or rejected based on the ANOVA results.

 As the p-value 0.458508 is greater than alpha = 0.05, we fail to reject the null hypothesis and
conclude that there is no evidence that Salary depends on Occupation.

1.4 If the null hypothesis is rejected in either (2) or in (3), find out which class means are
significantly different. Interpret the result. (Non-Graded)

 The null hypothesis is rejected for the ANOVA of Salary with respect to Education.
 To check which class means are significantly different, we can use the Tukey HSD method.
 Once the test is completed, we get the results below.
 We see that the reject column is True for every pair, meaning each pairwise comparison
rejects the hypothesis of equal means. Since every pairwise hypothesis is rejected, we can
conclude that the mean salaries are significantly different between all pairs of education
levels.
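A sketch of the Tukey HSD follow-up on the same illustrative stand-in data (not the assignment's actual numbers):

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Illustrative stand-in for SalaryData.csv.
df = pd.DataFrame({
    "Education": ["Doctorate", "Doctorate", "Bachelors",
                  "Bachelors", "HS-grad", "HS-grad"],
    "Salary": [250000, 240000, 150000, 160000, 80000, 90000],
})

# Compare every pair of education levels at the 5% family-wise error rate.
tukey = pairwise_tukeyhsd(endog=df["Salary"], groups=df["Education"], alpha=0.05)
print(tukey.summary())   # the 'reject' column flags significantly different pairs
```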
2. Problem 1B:

1.1 What is the interaction between two treatments? Analyse the effects of one variable on
the other (Education and Occupation) with the help of an interaction plot. [hint: use the
‘pointplot’ function from the ‘seaborn’ library]

 From the point plot we see a slight interaction between Education and Occupation in
terms of salary. A person with a doctorate degree in a Prof-specialty occupation earns
around 250,000, whereas a person with a bachelor's degree in the same occupation
earns around 100,000, and a high-school graduate in a Prof-specialty occupation
likewise earns around 100,000.
 A person with a bachelor's degree in an Exec-managerial occupation earns around
200,000, while a person with a doctorate degree in the same occupation earns around
210,000 to 215,000.
 Next, a person with a bachelor's degree in Sales earns around 190,000; a person with
a doctorate degree in Sales earns around 195,000; and a high-school graduate in a
sales job earns only around 50,000.
 Going further, a high-school graduate in an Adm-clerical job earns around 75,000 to
80,000, a doctorate holder in the same job around 169,000, and a bachelor's degree
holder around 174,000.
 From the above observations we can conclude that there is interaction between the
two treatments.
 This is how one variable, Education, interacts with the other variable, Occupation,
with respect to salary. We also see that as educational qualification rises, salary
generally rises, as does the level of the positions held. A high-school graduate does
not earn as much as a person with a doctorate degree.
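The interaction plot itself can be produced with seaborn's pointplot; a minimal sketch with illustrative data (the Agg backend is used so the script runs headless):

```python
import matplotlib
matplotlib.use("Agg")                     # non-interactive backend for scripts
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative stand-in for the salary data.
df = pd.DataFrame({
    "Education":  ["Doctorate", "Doctorate", "Bachelors", "Bachelors"],
    "Occupation": ["Prof-specialty", "Sales", "Prof-specialty", "Sales"],
    "Salary":     [250000, 195000, 100000, 190000],
})

# One line per occupation level; non-parallel lines suggest interaction.
ax = sns.pointplot(data=df, x="Education", y="Salary", hue="Occupation")
plt.savefig("interaction_plot.png")
```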

1.2 Perform a two-way ANOVA based on Salary with respect to both Education and
Occupation (along with their interaction Education*Occupation). State the null and alternative
hypotheses and state your results. How will you interpret this result?

We will now perform a two-way ANOVA of Salary with respect to both Education and
Occupation along with their interaction, that is, Education*Occupation. Interaction means
that the effect of one variable on Salary changes with the level of the other variable. This
test gives us formal evidence of whether the two factors interact or not.
 H0: There is no interaction between Education and Occupation with respect to Salary.
 Ha: There is interaction between Education and Occupation with respect to Salary.

 ANOVA for Education and Occupation with respect to Salary.


 Here the C(Education) and C(Occupation) rows give different results compared with the
one-way ANOVAs, because the interaction term is now included in the model and
absorbs part of the variation. In the one-way ANOVA for Education the p-value was
1.25770e-08, but in the two-way ANOVA for the same factor it is 5.466264e-12.
 In the same way, for Occupation the one-way p-value was 0.458508, whereas in the
two-way ANOVA it is 7.211580e-02.
 The two-way ANOVA additionally shows how one factor interacts with the other with
respect to a third variable: in our case, how Education and Occupation jointly affect
Salary.
 The p-value for the interaction term is 2.232500e-05, which is less than alpha = 0.05.
Since the p-value is below alpha, we reject the null hypothesis of no interaction and
conclude that Education and Occupation do interact with respect to Salary.
 Here are the final interpretations based on the two-way ANOVA.
 The p-value for Education (5.466264e-12 < 0.05) shows that Salary depends on
Education.
 The p-value for Occupation (7.211580e-02 > 0.05) means there is no conclusive
evidence, at the 5% level, that Salary depends on Occupation on its own.
 Finally, the p-value for Education:Occupation, the interaction between Education and
Occupation, confirms the slight interaction already visible in the point plot: the
combination of education and occupation has a significant effect on salary.
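A sketch of the two-way ANOVA with the interaction term (illustrative data with two replicates per cell, since the interaction needs replication to be estimable):

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Illustrative 2x2 design with two observations per cell.
df = pd.DataFrame({
    "Education":  ["Doctorate"] * 4 + ["Bachelors"] * 4,
    "Occupation": (["Prof-specialty"] * 2 + ["Sales"] * 2) * 2,
    "Salary": [250000, 245000, 195000, 200000,
               100000, 105000, 190000, 185000],
})

# Education*Occupation expands to both main effects plus their interaction.
model = ols("Salary ~ C(Education) * C(Occupation)", data=df).fit()
table = anova_lm(model)
print(table)

p_interaction = table.loc["C(Education):C(Occupation)", "PR(>F)"]
```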

1.3 Explain the business implications of performing ANOVA for this case study.

The business implications of performing ANOVA for this case study are as follows.
 The case study shows what the industry standards are, and how education and the
position or job a person holds affect the salary they earn.
 It can be used to show how education influences the occupations available to a
person, and how occupation in turn influences pay.
 Certain jobs require certain education levels; the higher the level of the job, the
higher the level of education expected and, generally, the higher the salary.
 From the ANOVA results, Education is the stronger driver of salary, while Occupation
on its own does not show a significant effect; it is the combination of education and
occupation that matters. A highly educated person tends to out-earn a less educated
person even in the same occupation.
 New and upcoming businesses can use this case study to understand how the
industry pays, whether based on education or on the occupation the person holds,
and so decide how to set the salaries of new employees.
 The case study also shows that having a particular job or designation requires
particular skill sets, and that education levels open the door to certain kinds of jobs.
Once a person has a very high level of education, they tend to be paid more
regardless of the level of designation or job they hold.
 Finally, comparing the three kinds of education qualifications in this case study
(high-school graduate, bachelor's degree, and doctorate), a doctorate holder is
consistently paid more for the same kind of job. For example, a candidate with a
doctorate degree doing a sales job earns more than a person with a bachelor's
degree, and a high-school graduate doing the same sales job earns less than the
doctorate holder.

3. Problem 2:
The dataset Education - Post 12th Standard.csv contains information on various colleges. You
are expected to do a Principal Component Analysis for this case study according to the
instructions given. The data dictionary of the 'Education - Post 12th Standard.csv' can be
found in the following file: Data Dictionary.xlsx.

1.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be
performed. What insight do you draw from the EDA?

In this step, we perform the below operations to check what the data set comprises:
o Head of the dataset
o Shape of the dataset
o Info of the dataset
o Summary of the dataset
o Inference
o There are a total of 777 rows, indexed from 0 to 776.
o There are eighteen columns, indexed from 0 to 17.
o There are no null values, as every column has 777 entries.
o There are sixteen integer variables, one float variable and one object-type
variable.
o The describe() method shows how the numerical data are spread: the
minimum value, the maximum value, the mean value, the different
percentile values, the counts, and the standard deviation.

 Inference From Summary


 Apps: Number of applications received
There are total of 777 entries
The minimum entry is eighty-one and maximum entry is 48094.
The mean number of applications received is approx. 3002.
 Accept: Number of applications accepted
There are total of 777 entries
The minimum entry is seventy-two and maximum entry is 26330.
The mean number of applications accepted is approx. 2019.
 Enroll: Number of new students enrolled
There are total of 777 entries
The minimum entry is thirty-five and maximum entry is 6392
The mean number of new students enrolled is approx. 778.
 Top10perc: Percentage of new students from top 10% of Higher Secondary class
There are total of 777 entries
The minimum entry is one and maximum entry is ninety-six
The mean Percentage of new students from top 10% of Higher Secondary class is
approx. 27.56
 Top25perc: Percentage of new students from top 25% of Higher Secondary class
o There are total of 777 entries
o The minimum entry is nine and maximum entry is one hundred
o The mean Percentage of new students from top 25% of Higher Secondary class
is approx. fifty-six.
 F.Undergrad: Number of full-time undergraduate students
o There are total of 777 entries
o The minimum entry is 136 and maximum entry is 31643
o The mean number of full-time undergraduate students is approx. 3700.
 P.Undergrad: Number of part-time undergraduate students
o There are total of 777 entries
o The minimum entry is one and maximum entry is 21836
o The mean number of part-time undergraduate students is approx. 855.
 Outstate: Out-of-state tuition
o There are total of 777 entries
o The minimum entry is 2340 and maximum entry is 21700
o The mean out-of-state tuition is approx. 10441.
 Room.Board: Cost of Room and board
o There are total of 777 entries
o The minimum entry is 1780 and maximum entry is 8124
o The mean Cost of Room and board is approx. 4358.
 Books: Estimated book costs for a student
o There are total of 777 entries
o The minimum entry is ninety-six and maximum entry is 2340
o The mean Estimated book costs for a student is approx. 549.
 Personal: Estimated personal spending for a student
o There are total of 777 entries
o The minimum entry is 250 and maximum entry is 6800
o The mean Estimated personal spending for a student is approx. 1341.
 PhD: Percentage of faculties with Ph.D.’s
o There are total of 777 entries
o The minimum entry is eight and maximum entry is 103 (a percentage above
100, which points to a data-entry issue)
o The mean Percentage of faculties with PhD’s is approx. seventy-three.
 Terminal: Percentage of faculties with terminal degree
o There are total of 777 entries
o The minimum entry is twenty-four and maximum entry is one hundred
o The mean Percentage of faculties with terminal degree is approx. eighty.
 S.F.Ratio: Student/faculty ratio
o There are total of 777 entries
o The minimum entry is 2.5 and maximum entry is 39.8
o The mean student/faculty ratio is approx. fourteen.

 perc.alumni: Percentage of alumni who donate


o There are total of 777 entries
o The minimum entry is zero and maximum entry is sixty-four
o The mean Percentage of alumni who donate is approx. twenty-three.
 Expend: The instructional expenditure per student
o There are total of 777 entries
o The minimum entry is 3186 and maximum entry is 56233
o The mean instructional expenditure per student is approx. 9660.
 Grad.Rate: Graduation rate
o There are total of 777 entries
o The minimum entry is ten and maximum entry is 118 (a rate above 100,
which points to a data-entry issue)
o The mean graduation rate is approx. sixty-five.

 Check for duplicate values.


We use the function df.duplicated() to check for duplicate values. The function returns a
Boolean value for each row, and summing it gives the number of duplicate rows. By
duplicate we mean a row where every single column has the same value as an earlier row.
In this dataset we do not have any duplicate values. Since there are zero duplicate records,
we do not have to remove any data from the data set, and all records are distinct.
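The duplicate check can be sketched as follows (with a deliberately duplicated row for illustration; the assignment's dataset itself has zero):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 2], "b": [3, 4, 4]})

# duplicated() marks rows identical to an earlier row in every column.
n_dupes = toy.duplicated().sum()
print(n_dupes)        # 1 for this toy frame; 0 for the assignment's data

# Only needed when duplicates actually exist:
toy = toy.drop_duplicates()
```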

Univariate Analysis of the Data Given


Multivariate Analysis of the Given Data

For multivariate analysis we use a heat map. It shows how each variable correlates with
every other variable: the darker the colour, the stronger the correlation; the lighter the
colour, the weaker it is. Here we see a very strong correlation between applications
received, applications accepted and enrolment, and all three also correlate strongly with
the number of full-time undergraduates. Further, there is a strong correlation between the
top 10% and top 25% student variables. We also see that the percentage of faculty with a
PhD correlates strongly with the percentage holding a terminal degree. These are some of
the points derived from the above analysis.

1.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.

 Often the variables of a data set are on different scales, i.e. one variable is in millions
while another is only in hundreds. When the variables are on different scales, it is
tough to compare them.
 Feature scaling (also known as data normalization) is the method used to standardize
the range of features of data. Since the range of values may vary widely, it becomes
a necessary step in data pre-processing when using machine learning algorithms.
 With this method, we convert variables measured on different scales onto a single
scale.
 StandardScaler standardizes the data using the formula (x − mean) / standard deviation.
 In our case the variables are on various scales: some are in tens and some are in
thousands. So we need to scale the data so that all numerical columns are on the
same scale, which makes them easier to compare.
 We will be doing this only for the numerical variables.
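A sketch of the z-score scaling with sklearn's StandardScaler (the column values below are loosely borrowed from the summary statistics, purely for illustration):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

num = pd.DataFrame({
    "Apps":  [81, 3002, 48094, 1500],
    "Books": [96, 549, 2340, 500],
})

# (x - mean) / standard deviation, applied column by column.
scaled = pd.DataFrame(StandardScaler().fit_transform(num), columns=num.columns)
print(scaled.mean().round(6))   # ~0 for every column
print(scaled.std(ddof=0))       # 1 for every column (population std)
```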

Before Scaling

After Scaling
1.3 Comment on the comparison between the covariance and the correlation matrices from
this data [on scaled data].
 Using the correlation matrix is equivalent to standardizing each of the variables (to mean
zero and standard deviation one). In general, PCA with and without standardizing gives
different results, especially when the scales are different.
 On the scaled data, the covariance matrix and the correlation matrix are numerically the
same: once every variable has unit variance, each covariance is already a correlation.

The heat map of the scaled data confirms this: it is identical to the correlation heat map seen
during the EDA. The strongest relationships remain between applications received, accepted
and enrolled (together with full-time undergraduates), between the top 10% and top 25%
student variables, and between the PhD and terminal-degree percentages.
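The equivalence can be checked numerically; a sketch with random data standing in for the scaled dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([1.0, 100.0, 1000.0])  # wildly different scales

# Standardize with ddof=1 to match np.cov's default denominator.
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_of_scaled = np.cov(Xs, rowvar=False)
corr_of_raw = np.corrcoef(X, rowvar=False)

# Covariance of the standardized data equals the correlation of the raw data.
print(np.allclose(cov_of_scaled, corr_of_raw))   # True
```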
1.4 Check the dataset for outliers before and after scaling. What insight do you derive
here? [Please do not treat Outliers unless specifically asked to do so]
Before Scaling

After Scaling
 Insights

o The main insight here is that the scale of the box plots changes completely
after scaling.
o Before scaling, the axes of the box plots were all over the place: some plots
had five-digit scales and some had two-digit scales, which makes them
difficult to compare.
o After scaling, all the plots share the same scale, i.e. single digits to at most
double digits, so they can be compared directly.
o The outliers themselves are unchanged: scaling shifts and rescales the
values but does not remove or create outliers.
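The before/after box plots can be sketched like this (illustrative values; the Agg backend is used so the script runs headless):

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd

raw = pd.DataFrame({
    "Apps":  [81, 1500, 3002, 9000, 48094],
    "Books": [96, 400, 549, 800, 2340],
})
z = (raw - raw.mean()) / raw.std()      # z-scores, as with StandardScaler (up to ddof)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
raw.boxplot(ax=ax1); ax1.set_title("Before scaling")   # five-digit vs three-digit axes
z.boxplot(ax=ax2);   ax2.set_title("After scaling")    # both columns on one z-score axis
fig.savefig("boxplots.png")
```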

1.5 Extract the eigenvalues and eigenvectors.[Using Sklearn PCA Print Both]
 To extract the eigenvalues and eigenvectors there are two prerequisites. They are:
o First, we need to confirm the statistical significance of the correlations.
o Second, we need to confirm the adequacy of the sample size.
 To confirm the statistical significance of the correlations, we use the
factor_analyzer.factor_analyzer module, from which we import
calculate_bartlett_sphericity.
 This runs a hypothesis test with the null hypothesis that there is no significant
correlation, and the alternative hypothesis that there is significant correlation.
 If the p-value is greater than 0.05, we fail to reject the null hypothesis; if it is less
than 0.05, the null hypothesis is rejected.
 After performing the test, we get a p-value of 0.0. Since 0.0 is less than 0.05, our
alpha, we reject the null hypothesis. We have thereby established the first
prerequisite: there is statistically significant correlation. If there were no correlation,
there would be no point in doing PCA, so this is the first thing we must always
confirm.
 To confirm the adequacy of the sample size, we again use the
factor_analyzer.factor_analyzer module, from which we import calculate_kmo.
 The result of this test should be more than 0.7; if it is less than 0.7, it means we
need more data. In our test we got 0.799, which is more than 0.7. Hence, we have
established both that the correlations are statistically significant and that the sample
size is adequate.
 With both prerequisites accounted for, we can now proceed to extract the eigenvalues
as well as the eigenvectors.
 First, we apply PCA taking all the features into account.
 To get the eigenvectors, we use the pca.components_ attribute.
 Once we run the code, we get the eigenvectors for the data.
 We will now look at the eigenvectors for the data.
Here we see principal components 1 to 8; on the next page we see principal components 9
to 15. These are the eigenvectors extracted from the data.
We will now put them in a table format so that they are easier to read. To do that, we
create a new data frame with the PC components as columns and the original feature
names as the index.
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15
Apps 0.42 0.05 -0.07 0.27 -0.02 0 -0.01 -0.14 0.05 0.04 0.02 0.6 0.08 0.53 0.27
Accept 0.44 -0.01 -0.1 0.25 0 0.01 -0.02 -0.12 0.07 -0.11 -0.15 0.3 0.04 -0.63 -0.44
Enroll 0.44 -0.07 -0.08 0.16 -0.04 -0.1 -0.08 -0.02 0.05 -0.09 0.01 -0.44 -0.07 -0.28 0.68
F.Undergrad 0.44 -0.1 -0.06 0.1 -0.04 -0.07 -0.07 0.03 0.03 -0.08 0.06 -0.53 -0.05 0.46 -0.52
P.Undergrad 0.28 -0.15 0.14 -0.2 -0.2 0.34 -0.09 0.75 -0.26 0.14 -0.06 0.12 0.02 -0.04 0.03
Outstate -0.02 0.44 0.05 0.09 -0.04 0.12 0.1 0.06 0.19 0.12 -0.83 -0.14 -0.02 0.11 0.03
Room.Board 0.05 0.34 0.15 0.1 0.14 0.59 0.38 0.06 0.3 -0.31 0.36 -0.07 -0.06 -0.02 0.01
Books 0.08 0.02 0.68 0.11 0.64 -0.15 -0.26 0.06 -0.07 0 -0.03 0.01 -0.06 0 0
Personal 0.15 -0.18 0.5 -0.19 -0.33 -0.39 0.61 -0.03 0.15 -0.03 -0.04 0.04 0.03 -0.02 0
PhD 0.24 0.27 -0.12 -0.56 0.1 -0.07 -0.01 -0.15 -0.11 -0.01 0.01 0.12 -0.69 0 -0.01
Terminal 0.23 0.28 -0.06 -0.55 0.16 -0.03 -0.07 -0.12 -0.06 -0.14 0 -0.05 0.7 -0.01 0.02
S.F.Ratio 0.09 -0.32 -0.29 -0.14 0.5 0.02 0.29 0.13 0.42 0.52 -0.01 -0.02 0.04 -0.02 0
perc.alumni -0.08 0.33 -0.15 0.04 -0.04 -0.51 -0.18 0.52 0.46 -0.18 0.17 0.11 -0.02 -0.01 -0.02
Expend 0.09 0.38 0.22 0.06 -0.3 0.03 -0.19 -0.16 0.09 0.71 0.33 -0.08 0.05 -0.11 -0.06
Grad.Rate 0.02 0.34 -0.21 0.28 0.22 -0.26 0.47 0.19 -0.6 0.14 0.11 -0.06 0.05 -0.03 -0.02

 To get the eigenvalues, we use the pca.explained_variance_ attribute, which returns
them in descending order. The eigenvalues are as below.

array([4.57703902, 4.17008298, 1.17275484, 1.0046249 , 0.84571571,
       0.7431469 , 0.59827417, 0.57927156, 0.40507509, 0.32517076,
       0.22049554, 0.16795605, 0.14323651, 0.04125279, 0.02523307])
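The extraction itself can be sketched as follows (random correlated data stands in for the scaled college data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated stand-in data
Xs = StandardScaler().fit_transform(X)

pca = PCA(n_components=Xs.shape[1]).fit(Xs)   # keep all components
eigenvectors = pca.components_                # one unit-length row per PC
eigenvalues = pca.explained_variance_         # sorted in descending order

print(eigenvectors.shape)   # (5, 5)
print(eigenvalues)
```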
1.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a
data frame with the original features

PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15
Apps 0.42 0.05 -0.07 0.27 -0.02 0 -0.01 -0.14 0.05 0.04 0.02 0.6 0.08 0.53 0.27
Accept 0.44 -0.01 -0.1 0.25 0 0.01 -0.02 -0.12 0.07 -0.11 -0.15 0.3 0.04 -0.63 -0.44
Enroll 0.44 -0.07 -0.08 0.16 -0.04 -0.1 -0.08 -0.02 0.05 -0.09 0.01 -0.44 -0.07 -0.28 0.68
F.Undergrad 0.44 -0.1 -0.06 0.1 -0.04 -0.07 -0.07 0.03 0.03 -0.08 0.06 -0.53 -0.05 0.46 -0.52
P.Undergrad 0.28 -0.15 0.14 -0.2 -0.2 0.34 -0.09 0.75 -0.26 0.14 -0.06 0.12 0.02 -0.04 0.03
Outstate -0.02 0.44 0.05 0.09 -0.04 0.12 0.1 0.06 0.19 0.12 -0.83 -0.14 -0.02 0.11 0.03
Room.Board 0.05 0.34 0.15 0.1 0.14 0.59 0.38 0.06 0.3 -0.31 0.36 -0.07 -0.06 -0.02 0.01
Books 0.08 0.02 0.68 0.11 0.64 -0.15 -0.26 0.06 -0.07 0 -0.03 0.01 -0.06 0 0
Personal 0.15 -0.18 0.5 -0.19 -0.33 -0.39 0.61 -0.03 0.15 -0.03 -0.04 0.04 0.03 -0.02 0
PhD 0.24 0.27 -0.12 -0.56 0.1 -0.07 -0.01 -0.15 -0.11 -0.01 0.01 0.12 -0.69 0 -0.01
Terminal 0.23 0.28 -0.06 -0.55 0.16 -0.03 -0.07 -0.12 -0.06 -0.14 0 -0.05 0.7 -0.01 0.02
S.F.Ratio 0.09 -0.32 -0.29 -0.14 0.5 0.02 0.29 0.13 0.42 0.52 -0.01 -0.02 0.04 -0.02 0
perc.alumni -0.08 0.33 -0.15 0.04 -0.04 -0.51 -0.18 0.52 0.46 -0.18 0.17 0.11 -0.02 -0.01 -0.02
Expend 0.09 0.38 0.22 0.06 -0.3 0.03 -0.19 -0.16 0.09 0.71 0.33 -0.08 0.05 -0.11 -0.06
Grad.Rate 0.02 0.34 -0.21 0.28 0.22 -0.26 0.47 0.19 -0.6 0.14 0.11 -0.06 0.05 -0.03 -0.02
1.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values
with two places of decimals only). [hint: write the linear equation of PC in terms of
eigenvectors and corresponding features]

PC1
Apps 0.42
Accept 0.44
Enroll 0.44
F.Undergrad 0.44
P.Undergrad 0.28
Outstate -0.02
Room.Board 0.05
Books 0.08
Personal 0.15
PhD 0.24
Terminal 0.23
S.F.Ratio 0.09
perc.alumni -0.08
Expend 0.09
Grad.Rate 0.02

Written explicitly in terms of the (standardized) original features, the first principal
component is:

PC1 = 0.42*Apps + 0.44*Accept + 0.44*Enroll + 0.44*F.Undergrad + 0.28*P.Undergrad
- 0.02*Outstate + 0.05*Room.Board + 0.08*Books + 0.15*Personal + 0.24*PhD
+ 0.23*Terminal + 0.09*S.F.Ratio - 0.08*perc.alumni + 0.09*Expend + 0.02*Grad.Rate

1.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on
the optimum number of principal components? What do the eigenvectors indicate?

 Before checking the cumulative values of the eigenvalues, let us create a scree plot
to look at each principal component and its corresponding eigenvalue.
 We do not want to keep all fifteen principal components. If we had to work with
fifteen principal components, we might as well work with the original data, which had
all the features; there would be no need for principal component analysis.
 Principal component analysis is a dimension-reduction technique, so it is meaningful
only if we are able to cut down the number of features.
 In this case, the features are PC1, PC2, PC3 and so on, so we need a mechanism for
saying that, for example, the first 3, 4 or 5 PCs alone capture about 80% of the
goodness of the data, letting us drop the rest of the PCs.
 A concrete way to find this is to look at the cumulative sum of the explained variance
ratios, i.e. the explained variance ratios calculated earlier, all added up. To calculate
it we use np.cumsum(pca.explained_variance_ratio_).
 After running the code, we get the results below:
 After running the code, we get the below results
array([0.30474322, 0.58239096, 0.660474 , 0.72736279,
0.78367128, 0.83315064, 0.87298425, 0.91155266, 0.93852291,
0.96017306,0.97485384, 0.9860365 , 0.99557331, 0.99831996,
1.])
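Using the cumulative ratios printed above, the smallest number of components reaching a target coverage can be read off programmatically:

```python
import numpy as np

cum = np.array([0.30474322, 0.58239096, 0.660474,   0.72736279, 0.78367128,
                0.83315064, 0.87298425, 0.91155266, 0.93852291, 0.96017306,
                0.97485384, 0.9860365,  0.99557331, 0.99831996, 1.0])

# First index where the cumulative explained-variance ratio reaches 87%.
n_components = int(np.argmax(cum >= 0.87)) + 1
print(n_components)   # 7 -> seven PCs capture ~87% of the variance
```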

 We now see that taking only seven principal components captures almost 87% of
the variance: of whatever information or value is in the entire data, 87% is captured
in these seven principal components, which means the remaining eight can be
dropped. This is the prioritization that PCA lets us achieve. We therefore select only
PC1 to PC7 from the loadings data frame extracted above.
 Eigenvectors represent directions; eigenvalues represent magnitude, or importance.
 Now let us try to visualise how each of the coefficients influences each principal
component.
 Since PC1 has the highest eigenvalue and is therefore the most important
component, whichever feature has the highest coefficient in PC1 is the most
prominent feature, and some features matter more for a given component than for
the others.
 To examine this, we will visualise it.
 In the figure below, we see that each PC has its own distinctive mix of components.
1.9 Explain the business implication of using the Principal Component Analysis for this
case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the
Principal Components Obtained]

 This is the same data, just visualised in a different way. Notice that applications
received has the highest-magnitude loading in PC1, so if PC1 is the most important
component for you, then within PC1 the especially important feature is applications
received. In the other PCs we see that it is not the most important feature there.
 Based on the PCA we can bring down the dimensionality without losing the
effectiveness or goodness of the data.
 PCA helps you interpret your data. Principal component analysis simplifies the
complexity of high-dimensional data while retaining trends and patterns. It does this
by transforming the data into fewer dimensions, which act as summaries of the
original features.
 This overview can uncover relationships between observations and variables, and
among the variables themselves.
 PCA is one of the most widely used tools in exploratory data analysis and in building
machine-learning models for prediction. It is an unsupervised statistical technique
used to examine the interrelations among a set of variables, and is closely related to
factor analysis.

******************************** Thank You********************************
