Advanced Statistics (AS) Project Report
Table of Contents
1 Problem 1 Statement
1.1 Problem 1 A
1.1.1 Preliminary Data Analysis
1.1.2 State the null and the alternate hypothesis for conducting one-way ANOVA for both Education and Occupation individually.
1.1.3 Perform a one-way ANOVA on Salary with respect to Education. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
1.1.4 Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
1.1.5 If the null hypothesis is rejected in either (2) or in (3), find out which class means are significantly different. Interpret the result. (Non-Graded)
1.2 Problem 1B:
1.2.1 What is the interaction between two treatments? Analyze the effects of one variable on the other (Education and Occupation) with the help of an interaction plot. [hint: use the ‘point plot’ function from the ‘Seaborn’ function]
1.2.2 Perform a two-way ANOVA based on Salary with respect to both Education and Occupation (along with their interaction Education*Occupation). State the null and alternative hypotheses and state your results. How will you interpret this result?
1.2.3 Explain the business implications of performing ANOVA for this particular case study.
2 Problem 2 Statement
2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA?
2.1.1 College-wise Top/Bottom Criteria Data Insights
2.1.2 Univariate Analysis
2.1.3 Bi-Variate Analysis
2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.
2.3 Comment on the comparison between the covariance and the correlation matrices from this data [on scaled data].
2.3.1 Correlation Matrix
2.3.2 Covariance Matrix
2.4 Check the dataset for outliers before and after scaling. What insight do you derive here? [Please do not treat Outliers unless specifically asked to do so]
2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both]
2.5.1 Eigen Values
Advanced Statistics- Project
List of Figures
Figure 1-1 Salary Boxplot
Figure 1-2 Salary Histogram
Figure 1-3 Occupation-Salary Class Difference Boxplot
Figure 1-4 Education-Salary Class Difference Boxplot
Figure 1-5 Education-Occupation Interaction Plot
Figure 2-1 Applications, Acceptance and Enrollment Histogram
Figure 2-2 New Students with Top HSC Scores (10% and 25%) Histogram
Figure 2-3 Room-Board, Books and Personal Spends Histogram
Figure 2-4 Fulltime – Part-time Students Histogram
Figure 2-5 Faculty in Colleges – Histogram
Figure 2-6 Instructional Expenditure Per Student and Number of Outstate Students - Histogram
Figure 2-7 Percentage of Alumni Who Donate – Histogram
Figure 2-8 Graduation Rate Histogram
Figure 2-9 Unscaled Dataset Boxplot – See Outliers
Figure 2-10 Data set Pair plot
Figure 2-11 Data set Heatmap
Figure 2-12 Original Data Histogram Plot
Figure 2-13 Scaled Data Histogram Plot
Figure 2-14 Original Data Boxplot – Outliers Identification
Figure 2-15 Scaled Data Boxplot – Outliers Identification
Figure 2-16 Eigen Values Ratio Cumulative Sum Scree Plot
Figure 2-17 Eigen Values Scree Plot
Figure 2-18 Reduced Dimensions Eigen Values Heat Map
Figure 2-19 Transformed Data Set with Reduced Dimensionality
Figure 2-20 Original and Transformed Datasets
List of Tables
Table 1-1 Problem 1 Data Sample
Table 1-2 Education Values Distribution
Table 1-3 Occupation Values Distribution
Table 1-4 Descriptive Statistics of P1 Data
Table 1-5 Education AOV Table
Table 1-6 Occupation AOV Table
Table 1-7 Education Comparison of Means
Table 1-8 Two Way AOV Table
Table 1-9 Two Way AOV Table with Interaction
Table 2-1 Data description (all columns)
Table 2-2 Top College Statistics- Application, Acceptance and Enrollment
Table 2-3 Top College Statistics- Fulltime/Part-time Students and Graduation Rate
Table 2-4 Top College Statistics- Instructional Expenditure and Faculty Quality
Table 2-5 Top College Statistics- Cost to Students
Table 2-6 Top College Statistics-Student HSC Scores
Table 2-7 Top College Statistics-Most Outstate Students
Table 2-8 Descriptive Statistics of P2 Data
Table 2-9 Outlier Presence in all Numerical Columns
Table 2-10 Heat Map Data Correlation Inference
Table 2-11 Data before scaling
Table 2-12 Data after scaling
Table 2-13 Scaled Data Correlation Matrix
Table 2-14 Scaled Data Covariance Matrix
Table 2-15 Eigen Vectors in data frame (Part 1)
Table 2-16 Eigen Vectors in data frame (Part 2)
Table 2-17 Manual PC Calculation (Using Scaled Data)
Table 2-18 Auto PC Calculation with Python sklearn (Using Scaled Data)
1 Problem 1 Statement
Salary is hypothesized to depend on educational qualification and occupation. To understand the
dependency, the salaries of 40 individuals [SalaryData.csv] are collected and each person’s educational
qualification and occupation are noted.
Educational qualification is at three levels, High school graduate, Bachelor, and Doctorate. Occupation is
at four levels, Administrative and clerical, Sales, Professional or specialty, and Executive or managerial. A
different number of observations are in each level of education – occupation combination.
[Assume that the data follows a normal distribution. In reality, the normality assumption may not always
hold if the sample size is small.]
1.1 Problem 1 A
1.1.1 Preliminary Data Analysis
A small sample of the data set is as below:
• The raw data given for analysis has a total of 40 rows and 3 columns.
• The columns ‘Education’ and ‘Occupation’ in the dataset are object data
• Salary column holds numerical data
• The dataset holds no null values; all values are valid
• There is no duplicated data
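The checks listed above can be sketched with pandas as follows. This is a minimal illustration, not the report's actual code: the four rows are hypothetical stand-ins for SalaryData.csv, and the column values are assumptions.

```python
import pandas as pd

# Hypothetical stand-in rows; the real analysis would load the file instead,
# e.g. df = pd.read_csv("SalaryData.csv")
df = pd.DataFrame({
    "Education": ["Doctorate", "Bachelors", "HS-grad", "Bachelors"],
    "Occupation": ["Prof-specialty", "Sales", "Adm-clerical", "Exec-managerial"],
    "Salary": [250000, 195000, 75000, 190000],
})

n_rows, n_cols = df.shape                        # number of rows and columns
null_count = int(df.isnull().sum().sum())        # total missing values
duplicate_count = int(df.duplicated().sum())     # fully duplicated rows

print(n_rows, n_cols)
print(df.dtypes)                                 # data type of every column
print(null_count, duplicate_count)
```

On the real file, the same calls would be expected to report 40 rows, 3 columns, no nulls and no duplicates, as stated in the bullets.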
Statistic            Salary      Understanding
Mean                 162186.9    Average salary for all data entries
Standard Deviation   64860.41    Deviation in salary from the mean salary
Median               169100      Salary which is central, i.e. at this salary 50% of data entries are above and 50% of data entries are below
25%                  99897.5     Salary which is Q1, i.e. 25% of data entries are below this salary
The KDE plot shows a fatter tail on the left (negative skew), as confirmed by the skewness parameter.
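A left-skewed distribution typically has a mean below its median and a negative skewness coefficient. The sketch below illustrates that check with pandas; the salary values are hypothetical and are not taken from the report's data.

```python
import pandas as pd

# Hypothetical left-skewed salaries (a few low values, many high values).
salary = pd.Series([50000, 90000, 150000, 170000, 180000, 190000, 200000])

skewness = salary.skew()          # sample skewness; negative means a longer left tail
mean_salary = salary.mean()
median_salary = salary.median()

print(f"skew = {skewness:.3f}, mean = {mean_salary:.1f}, median = {median_salary:.1f}")
```

The report's own figures follow the same pattern: the mean (162186.9) is below the median (169100).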
1.1.2 State the null and the alternate hypothesis for conducting one-way ANOVA for both
Education and Occupation individually.
1.1.2.1 Education Hypothesis
Null Hypothesis: The mean salary for all education levels is the same.
Ho: µ(Salary, Doctorate) = µ(Salary, Bachelor) = µ(Salary, HS Graduate)
Alternate Hypothesis: The mean salary differs for at least one pair of education levels.
Ha: At least one pair of means is not equal
1.1.3 Perform a one-way ANOVA on Salary with respect to Education. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.
The AOV table resulting from the one-way ANOVA calculation on salary with respect to education is
represented below.
The p-value of 1.26e-08 is lower than the significance level α of 0.05. With this information, we reject the
null hypothesis and conclude, at the 5% significance level, that at least one pair of the population means
differs.
This indicates that the mean salary differs for at least one pair of education classes.
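The decision rule used above (reject the null hypothesis when the p-value is below α) can be sketched with SciPy's one-way ANOVA. The salaries below are hypothetical and only illustrate the mechanics; they are not the report's data, and the report's AOV table may have been produced with a different library.

```python
from scipy import stats

# Hypothetical salaries per education level (not the report's data).
hs_grad = [52000, 75000, 68000, 90000, 71000]
bachelors = [160000, 172000, 155000, 190000, 168000]
doctorate = [210000, 250000, 205000, 230000, 221000]

# One-way ANOVA: H0 is that all group means are equal.
f_stat, p_value = stats.f_oneway(hs_grad, bachelors, doctorate)

alpha = 0.05
reject_null = p_value < alpha
print(f"F = {f_stat:.2f}, p = {p_value:.3g}, reject H0: {reject_null}")
```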
1.1.4 Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null
hypothesis is accepted or rejected based on the ANOVA results.
The AOV table resulting from the one-way ANOVA calculation on salary with respect to occupation is
represented below.
1.1.5 If the null hypothesis is rejected in either (2) or in (3), find out which class means are
significantly different. Interpret the result. (Non-Graded)
We rejected the null hypothesis for the One-Way ANOVA on Salary with respect to Education i.e. the null
hypothesis of the population means of salary for all 3 education levels being equal is rejected as confirmed
in section 1.1.3.
In order to check which pairs of means are equal or not equal, we perform the pair-wise Tukey HSD test.
This test also lets us estimate the mean difference between each pair of population means of salary for
the education classes.
1. It is seen that the mean difference in the salary earned by the education class groups ([Bachelors
- Doctorate], [Bachelors- HS-grad], [Doctorate - HS-grad]) is statistically significant.
2. The p-value for each of the three education class pairs indicates that the null hypothesis (the mean
salary of the two education classes in the pair is the same) is to be rejected.
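A pair-wise Tukey HSD comparison of this kind can be sketched as below. This is an illustration under assumptions: it uses `scipy.stats.tukey_hsd` (available in SciPy 1.8 and later), which may not be the function used for the report, and the salaries are hypothetical.

```python
from scipy import stats

# Hypothetical salaries per education level (not the report's data).
bachelors = [160000, 172000, 155000, 190000, 168000]
doctorate = [210000, 250000, 205000, 230000, 221000]
hs_grad = [52000, 75000, 68000, 90000, 71000]

labels = ["Bachelors", "Doctorate", "HS-grad"]
result = stats.tukey_hsd(bachelors, doctorate, hs_grad)

alpha = 0.05
significant_pairs = []
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        mean_diff = result.statistic[i, j]   # mean of group i minus mean of group j
        p_value = result.pvalue[i, j]        # adjusted p-value for this pair
        if p_value < alpha:
            significant_pairs.append((labels[i], labels[j]))
        print(f"{labels[i]} - {labels[j]}: diff = {mean_diff:.0f}, p = {p_value:.3g}")
```

Each pair whose adjusted p-value falls below α is reported as having significantly different means.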
1. Adm-Clerical occupation
This occupation pays a significantly lower salary to HS Graduates, around 75K.
Bachelor and Doctorate degree holders earn around 160K-170K for the same occupation.
2. Sales occupation
HS Graduates in this occupation earn the least compared to HS Graduates working in any of
the other occupations, around 50K only.
Bachelor and Doctorate degree holders earn much more, at around 195K. This is a 290%
increase over the salary earned by HS Graduates in the same occupation.
3. Prof-Specialty occupation
This occupation pays the highest salary, around 250K, compared to the other occupations.
However, this high salary is earned only by Doctorate degree holders.
Bachelor’s degree holders and HS graduates in this occupation earn a much lower amount in the
95K-105K range.
4. Exec-Managerial
In this dataset, only Bachelor and Doctorate degree holders work in this occupation; no HS
Graduate does.
Bachelor degree holders earn around 190K, which is less than Doctorate degree holders who
earn around 210K.
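The values read off the interaction plot are cell means, i.e. the mean salary for each Education and Occupation combination, which is what a point plot with Education as the hue draws by default. The sketch below computes such cell means with pandas. The salaries are the rounded approximations quoted in the text for two occupations (with 165K standing in for the 160K-170K range), not the underlying data.

```python
import pandas as pd

# Rounded salaries quoted in the text above (approximations, one row per cell).
df = pd.DataFrame({
    "Education":  ["HS-grad", "HS-grad", "Bachelors", "Bachelors", "Doctorate", "Doctorate"],
    "Occupation": ["Sales", "Adm-clerical", "Sales", "Adm-clerical", "Sales", "Adm-clerical"],
    "Salary":     [50000, 75000, 195000, 165000, 195000, 165000],
})

# Mean salary per Occupation (rows) and Education (columns).
cell_means = df.groupby(["Occupation", "Education"])["Salary"].mean().unstack("Education")
print(cell_means)

# Percentage increase from HS-grad to Bachelors within Sales.
hs_sales = cell_means.loc["Sales", "HS-grad"]
bachelors_sales = cell_means.loc["Sales", "Bachelors"]
pct_increase = (bachelors_sales - hs_sales) / hs_sales * 100
print(f"{pct_increase:.0f}%")
```

Lines that are not parallel across occupations in such a table (or plot) suggest an interaction between the two factors.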
1.2.2 Perform a two-way ANOVA based on Salary with respect to both Education and Occupation
(along with their interaction Education*Occupation). State the null and alternative
hypotheses and state your results. How will you interpret this result?
Null Hypothesis, Ho: The mean Salary is the same across all levels of Occupation and Education.
Alternate Hypothesis, Ha: At least one of the means of Salary with respect to Occupation and Education is
unequal.
The AOV table resulting from the two-way ANOVA calculation on salary with respect to education and
occupation is represented below.
When both education and occupation factors are considered, we see that occupation is not a statistically
significant factor as its p-value 0.3545 is greater than the significance level α of 0.05.
However, education plays a significant role in the salary factor as its p-value 1.98E-08 is lower than the
significance level α of 0.05.
The AOV table resulting from the two-way ANOVA calculation on salary with respect to education,
occupation and the interaction between education and occupation is represented below.
We see that the p-value of the education and occupation interaction is 2.23E-05, which is lower than the
significance level α of 0.05. We therefore reject the null hypothesis of no interaction and conclude that
there is a statistically significant interaction between these two factors as far as salary is concerned.
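To show where an interaction p-value comes from, the sketch below computes the interaction F-test by hand for a small, balanced design. It is an illustration under stated assumptions: the data are hypothetical, the design is balanced (the report's data is not), and the report's AOV table was most likely produced by a statistics library rather than by this manual decomposition.

```python
import pandas as pd
from scipy import stats

# Hypothetical, balanced 2 x 2 design with 3 salaries per cell (not the report's data).
cell_centres = {
    ("Bachelors", "Sales"): 190000,
    ("Bachelors", "Prof-specialty"): 100000,
    ("Doctorate", "Sales"): 195000,
    ("Doctorate", "Prof-specialty"): 250000,
}
rows = [
    {"Education": edu, "Occupation": occ, "Salary": centre + offset}
    for (edu, occ), centre in cell_centres.items()
    for offset in (-5000, 0, 5000)
]
df = pd.DataFrame(rows)
n_per_cell = 3

grand_mean = df["Salary"].mean()
edu_means = df.groupby("Education")["Salary"].mean()
occ_means = df.groupby("Occupation")["Salary"].mean()
cells = df.groupby(["Education", "Occupation"])["Salary"]
cell_means = cells.mean()

# Sums of squares (these formulas are valid for a balanced design only).
ss_edu = n_per_cell * len(occ_means) * ((edu_means - grand_mean) ** 2).sum()
ss_occ = n_per_cell * len(edu_means) * ((occ_means - grand_mean) ** 2).sum()
ss_cells = n_per_cell * ((cell_means - grand_mean) ** 2).sum()
ss_interaction = ss_cells - ss_edu - ss_occ
ss_error = ((df["Salary"] - cells.transform("mean")) ** 2).sum()

df_interaction = (len(edu_means) - 1) * (len(occ_means) - 1)
df_error = len(df) - len(cell_means)

# F statistic and p-value for H0: no Education x Occupation interaction.
f_interaction = (ss_interaction / df_interaction) / (ss_error / df_error)
p_interaction = stats.f.sf(f_interaction, df_interaction, df_error)
print(f"F = {f_interaction:.2f}, p = {p_interaction:.3g}")
```

A p-value below α leads to rejecting the hypothesis of no interaction, which is the same decision rule applied to the report's value of 2.23E-05.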
1.2.3 Explain the business implications of performing ANOVA for this particular case study.
1.2.3.1 Education Choices
Choosing to take up further education to earn a Bachelor’s degree or a Doctorate is a time-consuming,
expensive endeavor.
1. ANOVA allows a student to understand the relationship between the salary earned and the
education level. In this case, ANOVA clearly shows that the mean salary earned by persons of
different education backgrounds is dissimilar.
2. This ANOVA conclusion helps the person decide whether pursuing higher education is
the best course of action for him.
3. A HS graduate can decide to pursue a Bachelor’s degree to greatly enhance his earning capacity.
4. A Bachelor’s degree holder may deduce that earning a Doctorate degree may not give the salary
boost that justifies time/effort/money spent to get the additional education.
This proven education-salary dependency can be used by colleges to influence high-school graduates to
pursue college education instead of directly joining the workforce.
1. There is not enough evidence to reject the hypothesis that the mean salary earned by
persons of different occupations is the same.
2. Nevertheless, the education-occupation interaction data will help the person to choose an
occupation which can ensure that he earns a better salary.
3. HS Graduates can opt for occupations like Administrative-Clerical that far out-pay jobs like Sales.
4. Similarly, Bachelor degree holders can avoid the Prof-Specialty occupation, which gives them a lower
earning potential than the other fields.
2 Problem 2 Statement
The dataset Education - Post 12th Standard.csv contains information on various colleges. You are expected
to do a Principal Component Analysis for this case study according to the instructions given. The data
dictionary of the 'Education - Post 12th Standard.csv' can be found in the following file: Data
Dictionary.xlsx.
2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be
performed]. What insight do you draw from the EDA?
The total data info summary for this problem set is as below. The names of the columns, the number of valid
data entries and the data type of each column are detailed below.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Names 777 non-null object
1 Apps 777 non-null int64
2 Accept 777 non-null int64
3 Enroll 777 non-null int64
4 Top10perc 777 non-null int64
5 Top25perc 777 non-null int64
6 F.Undergrad 777 non-null int64
7 P.Undergrad 777 non-null int64
8 Outstate 777 non-null int64
9 Room.Board 777 non-null int64
10 Books 777 non-null int64
11 Personal 777 non-null int64
12 PhD 777 non-null int64
13 Terminal 777 non-null int64
14 S.F.Ratio 777 non-null float64
15 perc.alumni 777 non-null int64
16 Expend 777 non-null int64
17 Grad.Rate 777 non-null int64
dtypes: float64(1), int64(16), object(1)
memory usage: 109.4+ KB
We draw the following conclusions from the initial analysis of raw data.
• The raw data given for analysis has a total of 777 rows and 18 columns.
• The column ‘Names’ holds object data and the remaining 17 columns all hold numerical data
• The dataset holds no null values; all values are valid
• There is no duplicated data
• ##1 - Maximum Acceptance (%) is computed as the percentage of students who were accepted after they
applied to that college
= (Number of Accepted Students / Number of Student Applications) × 100
• ##2 - Maximum Enrollment (%) is computed as the percentage of students who enrolled in that
college after they received acceptance
= (Number of Enrolled Students / Number of Accepted Students) × 100
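The two percentages defined in notes ##1 and ##2 can be computed per college as sketched below. The three rows are the first Apps/Accept/Enroll values shown in Table 2-11; the new column names `Acceptance (%)` and `Enrollment (%)` are illustrative assumptions.

```python
import pandas as pd

# First three rows of Apps/Accept/Enroll as shown in Table 2-11.
df = pd.DataFrame({
    "Apps": [1660, 2186, 1428],
    "Accept": [1232, 1924, 1097],
    "Enroll": [721, 512, 336],
})

# Share of applicants who were accepted, and share of accepted students who enrolled.
df["Acceptance (%)"] = df["Accept"] / df["Apps"] * 100
df["Enrollment (%)"] = df["Enroll"] / df["Accept"] * 100

print(df.round(1))
```

Sorting the full dataset by either column would then give the top and bottom colleges for that criterion.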
• *1 - Rutgers at New Brunswick and Purdue University at West Lafayette received the highest number of
applications. However, the number of applications received by Rutgers at New Brunswick is more than
double what Purdue University at West Lafayette received.
2.1.1.3 Top Colleges – Instructional Expenditure, Student-Faculty Ratio and PhD Faculty
Category                            College Name                           Value
Highest Instructional expenditure   Johns Hopkins University               56233
Highest Instructional expenditure   Washington University                  45702
Highest Instructional expenditure   Antioch University                     42926
Lowest Instructional expenditure    Jamestown College                      3186
Lowest Instructional expenditure    Central Wesleyan College               3365
Lowest Instructional expenditure    Lindenwood College                     3480
Best Student-Faculty Ratio          University of Charleston               2.5
Best Student-Faculty Ratio          Case Western Reserve University        2.9
Best Student-Faculty Ratio          Johns Hopkins University               3.3
Most PhD Faculty (%)                Texas A&M University at Galveston*2    103*3
Most PhD Faculty (%)                Pitzer College                         100
Most PhD Faculty (%)                Bryn Mawr College                      100
Table 2-4 Top College Statistics- Instructional Expenditure and Faculty Quality
• *3 - Validate data sanctity as there seems to be a discrepancy; Percent of PhD faculty is greater than
100
• ##5 - Total cost to the student is total cost for Room-Board, Books and Personal expenditure
Mean Std.Dev. Median 25% 75% Range Maximum Minimum
Apps 3001.638 3870.201 1558 776 3624 48013 48094 81
Accept 2018.804 2451.114 1110 604 2424 26258 26330 72
Enroll 779.973 929.1762 434 242 902 6357 6392 35
Top10perc 27.55856 17.64036 23 15 35 95 96 1
Top25perc 55.79665 19.80478 54 41 69 91 100 9
F.Undergrad 3699.907 4850.421 1707 992 4005 31504 31643 139
P.Undergrad 855.2986 1522.432 353 95 967 21835 21836 1
Outstate 10440.67 4023.016 9990 7320 12925 19360 21700 2340
Room.Board 4357.526 1096.696 4200 3597 5050 6344 8124 1780
Books 549.381 165.1054 500 470 600 2244 2340 96
Personal 1340.642 677.0715 1200 850 1700 6550 6800 250
PhD 72.66023 16.32815 75 62 85 95 103 8
Terminal 79.7027 14.72236 82 71 92 76 100 24
S.F.Ratio 14.0897 3.958349 13.6 11.5 16.5 37.3 39.8 2.5
perc.alumni 22.74389 12.3918 21 13 31 64 64 0
Expend 9660.171 5221.768 8377 6751 10830 53047 56233 3186
Grad.Rate 65.46332 17.17771 65 53 78 108 118 10
Table 2-8 Descriptive Statistics of P2 Data
1. The maximum number of applications a college in the dataset received is 48094 and the minimum number of applications received is only
81. On average a college received around 3002 applications, with the median being 1558 applications.
2. While a college accepts around 2019 applicants on average, only about 780 students actually end up enrolling.
3. On comparing the F Undergrad and P Undergrad numbers, it is evident that more students opt for a full time undergraduate course as
compared to part time undergraduate course.
4. On average, a college has a higher percentage of faculty with terminal degrees than with PhDs.
5. The maximum student-faculty ratio is observed to be 39.8, but the average is better at about 14.1.
6. There seems to be bad data in Grad.Rate as the maximum value of 118 is larger than 100; this needs validation from an SME.
7. There seems to be bad data in PhD as the maximum value of 103 is larger than 100; this needs validation from an SME.
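The descriptive statistics and the two data-quality flags above can be reproduced along the following lines. This is a sketch with hypothetical rows: only the values 103 and 118 mirror the maxima reported in Table 2-8, and the list of percentage columns is an assumption.

```python
import pandas as pd

# Hypothetical rows; only 103 and 118 mirror the maxima reported in Table 2-8.
df = pd.DataFrame({
    "PhD": [70, 29, 103, 92],
    "Grad.Rate": [60, 56, 118, 59],
    "Top25perc": [52, 29, 50, 89],
})

# describe() gives count, mean, std, min, quartiles and max; the range is added by hand.
summary = df.describe().T
summary["range"] = summary["max"] - summary["min"]
print(summary[["mean", "std", "50%", "25%", "75%", "range", "max", "min"]])

# Columns that hold percentages should never exceed 100.
percent_columns = ["PhD", "Grad.Rate", "Top25perc"]
suspect = [col for col in percent_columns if df[col].max() > 100]
print(suspect)
```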
2.1.2.2 New Students with Top HSC Scores (10% and 25%) Histogram
The histograms of the percentage of new students who were in the top 10% and top 25% of their HS are shown below.
Both show a roughly normal distribution, but Top10perc is slightly right skewed with a longer
tail, indicating that fewer colleges have a large share of new students holding the Top 10% HSC scores.
Figure 2-2 New Students with Top HSC Scores (10% and 25%) Histogram
The spends needed by a student are Room-Board, Books and Personal. The Room.Board plot shows an
approximately normal distribution (the mean and median are close in value to each other, Table 2-8 Descriptive Statistics
of P2 Data). Books seems to follow a normal distribution, but there are some peaks in the histogram bars.
The Personal histogram is right skewed, which indicates that the mean is higher than the median (Table
2-8 Descriptive Statistics of P2 Data).
The histogram of fulltime and part-time undergraduate students (discrete data) is right skewed which
indicates that the mean is higher than the median (Table 2-8 Descriptive Statistics of P2 Data). The x-axis
indicates number of students and y axis indicates number of colleges.
By checking the data, we see that on average, only 18.3% of students in a college are part-time and 81.6%
of students are full-time students.
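One plausible way to obtain such shares is to compute the part-time percentage per college and then average it. Whether the report averaged per-college shares or used the overall totals is not stated, so the sketch below is an assumption. The rows are the first five F.Undergrad and P.Undergrad values shown in Table 2-11.

```python
import pandas as pd

# F.Undergrad / P.Undergrad values of the first five rows shown in Table 2-11.
df = pd.DataFrame({
    "F.Undergrad": [2885, 2683, 1036, 510, 249],
    "P.Undergrad": [537, 1227, 99, 63, 869],
})

total_students = df["F.Undergrad"] + df["P.Undergrad"]
part_time_share = df["P.Undergrad"] / total_students * 100   # per-college percentage
full_time_share = 100 - part_time_share

mean_part_time_share = part_time_share.mean()
mean_full_time_share = full_time_share.mean()
print(f"part-time: {mean_part_time_share:.1f}%, full-time: {mean_full_time_share:.1f}%")
```

Computed this way, the two average shares always sum to 100%.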
Figure 2-6 Instructional Expenditure Per Student and Number of Outstate Students - Histogram
The histogram of the instructional expenditure per student is right skewed, which indicates
that the mean is higher than the median (Table 2-8 Descriptive Statistics of P2 Data). The plot also shows
a high peak. The number of outstate students looks to be approximately normally distributed with
a slight right skew.
The skewness seen in the box plot is confirmed by the histogram plots in the previous sections.
Please note that due to large number of data variables, the generated image is very small.
2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.
Scaling is a preprocessing step applied to the variables in order to bring the data onto a common
scale. Most datasets have features which vary highly in magnitude, units and range.
Scaling ensures that all the features are given equal importance. If scaling is not done, algorithms
that are sensitive to magnitude, such as PCA, are dominated by the variables with the largest values and give misleading results.
In the case of the Problem 2 data set, the disproportionate data magnitudes and units are clearly evident
(for example, Apps ranges up to 48094 while S.F.Ratio ranges only up to 39.8).
With these varied data magnitudes/units, scaling is necessary before PCA is performed.
In this case, we scale the data by computing the z-score for every value:
z = (x - mean) / standard deviation
where the mean and standard deviation are those of the column that x belongs to.
After scaling, every column has a mean of approximately 0 and a standard deviation of
1, i.e. the data becomes centered and standardized.
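The z-score calculation can be checked against the report's own numbers. The sketch below assumes that the standard deviation in Table 2-8 is the sample value (ddof=1) and that scaling used the population value (ddof=0), as scikit-learn's StandardScaler does. Under that assumption the first Apps value (1660) scales to about -0.3469, in line with Table 2-12. The small array holds the first five Apps/Accept/Enroll rows of Table 2-11.

```python
import numpy as np

# Apps statistics reported in Table 2-8 (std assumed to be the ddof=1 sample value).
n = 777
mean_apps = 3001.638
sample_std_apps = 3870.201
population_std_apps = sample_std_apps * np.sqrt((n - 1) / n)   # convert to ddof=0

z_first_row = (1660 - mean_apps) / population_std_apps
print(round(z_first_row, 5))

# The same formula applied column-wise to the first five rows of Table 2-11.
X = np.array([
    [1660, 1232, 721],
    [2186, 1924, 512],
    [1428, 1097, 336],
    [417, 349, 137],
    [193, 146, 55],
], dtype=float)
Z = (X - X.mean(axis=0)) / X.std(axis=0)   # np.std defaults to ddof=0

print(Z.mean(axis=0).round(6), Z.std(axis=0).round(6))
```

Because the five-row array is scaled with its own mean and standard deviation, its z-scores differ from those in Table 2-12, which use the statistics of all 777 rows.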
A sample (the complete data has 777 rows and 17 columns) of both the original numerical dataset and the
dataset after scaling is shown below.
Apps Accept Enroll Top10 perc Top25 perc F. Undergrad P. Undergrad Outstate Room. Board
0 1660 1232 721 23 52 2885 537 7440 3300
1 2186 1924 512 16 29 2683 1227 12280 6450
2 1428 1097 336 22 50 1036 99 11250 3750
3 417 349 137 60 89 510 63 12960 5450
4 193 146 55 16 44 249 869 7560 4120
Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
0 450 2200 70 78 18.1 12 7041 60
1 750 1500 29 30 12.2 16 10527 56
2 400 1165 53 66 12.9 30 8735 54
3 450 875 92 97 7.7 37 19016 59
4 800 1500 76 72 11.9 2 10922 15
Table 2-11 Data before scaling
Apps Accept Enroll Top10 perc Top25 perc F. Undergrad P. Undergrad Outstate Room. Board
0 -0.34688 -0.32121 -0.06351 -0.25858 -0.19183 -0.16812 -0.20921 -0.74636 -0.9649
1 -0.21088 -0.0387 -0.28858 -0.65566 -1.35391 -0.20979 0.244307 0.457496 1.909208
2 -0.40687 -0.37632 -0.47812 -0.31531 -0.29288 -0.54957 -0.49709 0.201305 -0.55432
3 -0.66826 -0.68168 -0.69243 1.840231 1.677612 -0.65808 -0.52075 0.626633 0.996791
4 -0.72618 -0.76455 -0.78073 -0.65566 -0.59603 -0.71192 0.009005 -0.71651 -0.21672
Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
0 -0.60231 1.270045 -0.16303 -0.11573 1.013776 -0.86757 -0.50191 -0.31825
1 1.21588 0.235515 -2.67565 -3.37818 -0.4777 -0.54457 0.16611 -0.55126
2 -0.90534 -0.25958 -1.20484 -0.93134 -0.30075 0.585935 -0.17729 -0.66777
3 -0.60231 -0.68817 1.185206 1.175657 -1.61527 1.151188 1.792851 -0.3765
4 1.518912 0.235515 0.204672 -0.52353 -0.55354 -1.67508 0.241803 -2.93961
Table 2-12 Data after scaling
Histogram plots of the original data and the scaled data are displayed below for visual comparison.
Both sets of graphs show the same distribution shapes; only the axes differ, which shows that the data
has been scaled.
2.3 Comment on the comparison between the covariance and the correlation matrices from this data [on scaled data].
2.3.1 Correlation Matrix
Apps Accept Enroll Top10% Top25% F.UG P.UG Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
Apps 1 0.94 0.85 0.34 0.35 0.81 0.4 0.05 0.16 0.13 0.18 0.39 0.37 0.1 -0.09 0.26 0.15
Accept 0.94 1 0.91 0.19 0.25 0.87 0.44 -0.03 0.09 0.11 0.2 0.36 0.34 0.18 -0.16 0.12 0.07
Enroll 0.85 0.91 1 0.18 0.23 0.96 0.51 -0.16 -0.04 0.11 0.28 0.33 0.31 0.24 -0.18 0.06 -0.02
Top10% 0.34 0.19 0.18 1 0.89 0.14 -0.11 0.56 0.37 0.12 -0.09 0.53 0.49 -0.38 0.46 0.66 0.49
Top25% 0.35 0.25 0.23 0.89 1 0.2 -0.05 0.49 0.33 0.12 -0.08 0.55 0.52 -0.29 0.42 0.53 0.48
F.UG 0.81 0.87 0.96 0.14 0.2 1 0.57 -0.22 -0.07 0.12 0.32 0.32 0.3 0.28 -0.23 0.02 -0.08
P.UG 0.4 0.44 0.51 -0.11 -0.05 0.57 1 -0.25 -0.06 0.08 0.32 0.15 0.14 0.23 -0.28 -0.08 -0.26
Outstate 0.05 -0.03 -0.16 0.56 0.49 -0.22 -0.25 1 0.65 0.04 -0.3 0.38 0.41 -0.55 0.57 0.67 0.57
Room.Board 0.16 0.09 -0.04 0.37 0.33 -0.07 -0.06 0.65 1 0.13 -0.2 0.33 0.37 -0.36 0.27 0.5 0.42
Books 0.13 0.11 0.11 0.12 0.12 0.12 0.08 0.04 0.13 1 0.18 0.03 0.1 -0.03 -0.04 0.11 0
Personal 0.18 0.2 0.28 -0.09 -0.08 0.32 0.32 -0.3 -0.2 0.18 1 -0.01 -0.03 0.14 -0.29 -0.1 -0.27
PhD 0.39 0.36 0.33 0.53 0.55 0.32 0.15 0.38 0.33 0.03 -0.01 1 0.85 -0.13 0.25 0.43 0.31
Terminal 0.37 0.34 0.31 0.49 0.52 0.3 0.14 0.41 0.37 0.1 -0.03 0.85 1 -0.16 0.27 0.44 0.29
S.F.Ratio 0.1 0.18 0.24 -0.38 -0.29 0.28 0.23 -0.55 -0.36 -0.03 0.14 -0.13 -0.16 1 -0.4 -0.58 -0.31
perc.alumni -0.09 -0.16 -0.18 0.46 0.42 -0.23 -0.28 0.57 0.27 -0.04 -0.29 0.25 0.27 -0.4 1 0.42 0.49
Expend 0.26 0.12 0.06 0.66 0.53 0.02 -0.08 0.67 0.5 0.11 -0.1 0.43 0.44 -0.58 0.42 1 0.39
Grad.Rate 0.15 0.07 -0.02 0.49 0.48 -0.08 -0.26 0.57 0.42 0 -0.27 0.31 0.29 -0.31 0.49 0.39 1
Table 2-13 Scaled Data Correlation Matrix
The correlation matrix is an N x N square matrix, where N is the number of columns in the original data input (here we have 17 columns, so it is a 17x17
matrix). The column names and the index of the matrix are the same, i.e. the names of the columns also appear as the names of the rows of the correlation matrix.
This matrix is created to check how each variable correlates with every other variable. A positive correlation means that two variables tend to increase
together. A negative correlation means that one variable tends to decrease as the other increases.
The values in this matrix range from -1 to +1. The matrix is symmetric: the halves above and below the diagonal are mirror images of each other, and every
diagonal cell has a value of 1 because each variable is perfectly correlated with itself. This matrix is used to generate the heat map for quick correlation analysis.
See Figure 2-11 Data set Heatmap and Table 2-10 Heat Map Data Correlation Inference.
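A sample of the created covariance matrix is shown below. The full matrix is 777x777, which means the covariance was computed between the 777 observations (rows) rather than between the 17 variables. A covariance matrix that is directly comparable with the correlation matrix above would be 17x17. For z-score scaled data, its entries are practically identical to the correlations, because every scaled variable has unit variance.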
771 -0.19 -0.35 0.02 0.56 -0.24 0.21 0.21 … 0.39 0.52 -0.37 0.05 0.02 0.48 0
772 0.22 0.34 0.01 -0.68 0.21 -0.27 -0.18 … -0.31 -0.37 0.69 -0.04 0.04 -1.07 0.1
773 -0.1 0.18 0.02 -0.01 -0.15 -0.04 0.09 … 0.03 0.05 -0.04 0.15 -0.06 0.05 0.09
774 -0.03 0.01 -0.02 0.06 0.15 0.02 -0.11 … 0.04 0.02 0.04 -0.06 0.16 -0.04 -0.02
775 -0.46 0.04 0.18 1.51 -0.09 0.55 0.24 … 0.31 0.48 -1.07 0.05 -0.04 3.07 -0.52
776 0.14 -0.24 -0.04 -0.29 -0.39 -0.17 -0.18 … 0.11 0 0.1 0.09 -0.02 -0.52 0.57
Table 2-14 Scaled Data Covariance Matrix
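The relationship between the two matrices can be checked numerically. The sketch below is illustrative only: it uses synthetic data (not the college data set) and assumes z-score scaling with the population standard deviation (ddof=0), which is also what scikit-learn's StandardScaler uses. It shows that the variable-wise covariance of scaled data matches the correlation matrix up to the factor n/(n-1), and that omitting rowvar=False produces a 777x777 matrix instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the data: 777 rows, 4 correlated columns on very different scales.
raw = rng.normal(size=(777, 4)) @ rng.normal(size=(4, 4)) * [1, 10, 100, 1000] + 50

# z-score scaling (population standard deviation, ddof=0).
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

n = scaled.shape[0]
corr = np.corrcoef(scaled, rowvar=False)  # 4 x 4, variables in columns
cov = np.cov(scaled, rowvar=False)        # 4 x 4, sample covariance (ddof=1)

# On scaled data the covariance equals the correlation times n / (n - 1).
print(np.allclose(cov, corr * n / (n - 1)))

# With the default rowvar=True, NumPy treats each ROW as a variable.
row_wise = np.cov(scaled)
print(row_wise.shape)
```

Because the two matrices carry the same information after scaling, either one can be fed to the eigen decomposition.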
2.4 Check the dataset for outliers before and after scaling. What insight do you derive
here? [Please do not treat Outliers unless specifically asked to do so]
An outlier is a data point in the dataset which lies an abnormal distance from other values in a random
sample from a population. These unusual data points are problematic for several statistical analyses
because they can cause tests to either miss significant findings or distort real results.
It is important to note that scaling a dataset does NOT eliminate the presence of outliers. Outliers present
in the original dataset will still be present in the scaled dataset. The only way to correctly eliminate outliers
is to identify them and treat them by replacing the outliers with acceptable values or dropping that data.
This can only be done with the support of a subject matter expert.
Box plots graphically display the presence of outliers in all the data. We have plotted boxplots of the
original unscaled data (Figure 2-14 Original Data Boxplot – Outliers Identification) and the scaled data
(Figure 2-15 Scaled Data Boxplot – Outliers Identification).
1. We see that all data columns except the Top25Perc column have outliers.
2. Scaling alone does not affect the presence of outliers. We observe that the box plots of the original data and
of the scaled data are similar, and both show the presence of outliers.
3. The only difference between the unscaled and the scaled box plots is the scale of
the Y-axis. The Y-axis of the scaled data is a z-score scale.
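The observation that scaling leaves the outliers in place can be reproduced with a small sketch. It assumes the usual box plot convention of flagging points beyond 1.5 times the interquartile range from the quartiles, and it uses a synthetic skewed column rather than the college data. The helper name iqr_outlier_count is hypothetical.

```python
import numpy as np

def iqr_outlier_count(column):
    """Count points outside the 1.5 * IQR whiskers, as a box plot would flag them."""
    q1, q3 = np.percentile(column, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(np.sum((column < lower) | (column > upper)))

rng = np.random.default_rng(1)

# Synthetic right-skewed column (illustrative only).
raw = rng.lognormal(mean=8.0, sigma=0.9, size=777)

# z-score scaling is a linear transformation, so the ordering of points is unchanged.
scaled = (raw - raw.mean()) / raw.std()

raw_count = iqr_outlier_count(raw)
scaled_count = iqr_outlier_count(scaled)
print(raw_count, scaled_count)
```

Both counts agree because the quartiles and the whisker limits are shifted and stretched by exactly the same amounts as the data points.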
2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both]
As a precautionary measure, it is prudent to first check if the dataset will benefit from PCA treatment. To
check this, we apply the Bartlett's Test of Sphericity and the Kaiser-Meyer-Olkin (KMO) measure of
sampling adequacy tests.
Bartlett's test checks the hypothesis that the variables are uncorrelated in the population. The p-value should
be small so that we can reject the null hypothesis, in which case PCA can be recommended.
Null Hypothesis, Ho: All variables in the population are uncorrelated
Alternate Hypothesis, Ha: At least one pair of variables in the population is correlated
Conclusion: The calculated p-value is 0.0 (effectively zero), so the null hypothesis is rejected. We continue with the PCA
calculation.
Next, we calculate the KMO measure of sampling adequacy (MSA). For PCA to be recommended, the
value should be greater than 0.7.
Conclusion: The calculated KMO MSA is 0.813, so we continue with the PCA calculation.
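For reference, both checks can be written out from their textbook formulas. The sketch below is not the code used for this report: the function names are hypothetical and the input is synthetic data with one shared factor. The Bartlett statistic is compared against a chi-square distribution with p(p-1)/2 degrees of freedom to obtain the p-value (for example with scipy.stats.chi2.sf).

```python
import numpy as np

def bartlett_statistic(data):
    """Chi-square statistic and degrees of freedom for Bartlett's test of sphericity."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    _, log_det = np.linalg.slogdet(corr)  # log of the determinant of the correlation matrix
    chi_square = -(n - 1 - (2 * p + 5) / 6) * log_det
    dof = p * (p - 1) // 2
    return chi_square, dof

def kmo_overall(data):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)       # partial correlations between variable pairs
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diag] ** 2)
    p2 = np.sum(partial[off_diag] ** 2)
    return r2 / (r2 + p2)

rng = np.random.default_rng(4)

# Synthetic data: five variables driven by one common factor plus noise.
common = rng.normal(size=(777, 1))
data = common + 0.5 * rng.normal(size=(777, 5))

chi_square, dof = bartlett_statistic(data)
kmo = kmo_overall(data)
print(chi_square, dof, kmo)
```

A large statistic relative to its degrees of freedom corresponds to a very small p-value, and a KMO value above 0.7 indicates that the variables share enough common variance for PCA to be worthwhile.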
Principal Component Analysis (PCA) is a statistical procedure which is used to convert a set of data
observations of possibly correlated variables into a set of values of linearly uncorrelated variables. PCA is
applied to only
Eigen Vectors:
[[ 2.48765602e-01, 2.07601502e-01, 1.76303592e-01, 3.54273947e-01, 3.44001279e-01, 1.54640962e-01,
2.64425045e-02, 2.94736419e-01, 2.49030449e-01, 6.47575181e-02, -4.25285386e-02, 3.18312875e-01,
3.17056016e-01, -1.76957895e-01, 2.05082369e-01, 3.18908750e-01, 2.52315654e-01],
2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features
Once the PCA is completed, we get the eigenvectors, which are the loadings (coefficients) of all PCs. These are stored in a data frame whose columns carry the names
from the original data set and whose rows are labeled with the PC numbers.
Please note that, for easier reading, the table has been split into two parts, with the row index values repeated for clarity.
Apps Accept Enroll Top10 perc Top25 perc F. Undergrad P. Undergrad Outstate Room. Board
PCA0 0.248766 0.207602 0.176304 0.354274 0.344001 0.154641 0.026443 0.294736 0.24903
PCA1 0.331598 0.372117 0.403724 -0.08241 -0.04478 0.417674 0.315088 -0.24964 -0.13781
PCA2 -0.06309 -0.10125 -0.08299 0.035056 -0.02415 -0.06139 0.139682 0.046599 0.148967
PCA3 0.281311 0.267817 0.161827 -0.05155 -0.10977 0.100412 -0.15856 0.131291 0.184996
PCA4 0.005741 0.055786 -0.05569 -0.39543 -0.42653 -0.04345 0.302385 0.222532 0.560919
PCA5 -0.01624 0.007535 -0.04256 -0.05269 0.033092 -0.04345 -0.1912 -0.03 0.162755
PCA6 -0.04249 -0.01295 -0.02769 -0.16133 -0.11849 -0.02508 0.061042 0.108529 0.209744
PCA7 -0.10309 -0.05627 0.058662 -0.12268 -0.10249 0.07889 0.570784 0.009846 -0.22145
PCA8 -0.09023 -0.17786 -0.12856 0.3411 0.403712 -0.05944 0.560673 -0.00457 0.275023
PCA9 0.05251 0.04114 0.034488 0.064026 0.014549 0.020847 -0.22311 0.186675 0.298324
PCA10 0.043046 -0.05841 -0.0694 -0.0081 -0.27313 -0.08116 0.100693 0.143221 -0.35932
PCA11 0.024071 -0.1451 0.011143 0.038554 -0.08935 0.056177 -0.06354 -0.82344 0.35456
PCA12 0.595831 0.292642 -0.44464 0.001023 0.021884 -0.52362 0.125998 -0.14186 -0.06975
PCA13 0.080633 0.033467 -0.0857 -0.10783 0.151742 -0.05637 0.019286 -0.03401 -0.05843
PCA14 0.133406 -0.1455 0.02959 0.697723 -0.61727 0.009916 0.020952 0.038354 0.003402
PCA15 0.459139 -0.51857 -0.40432 -0.14874 0.051868 0.560363 -0.05273 0.101595 -0.02593
PCA16 0.35897 -0.54343 0.609651 -0.14499 0.080348 -0.41471 0.009018 0.0509 0.001146
Table 2-15 Eigen Vectors in data frame (Part 1)
Books Personal PhD Terminal S.F. Ratio Perc. Alumni Expend Grad. Rate
PCA0 0.064758 -0.04253 0.318313 0.317056 -0.17696 0.205082 0.318909 0.252316
PCA1 0.056342 0.219929 0.058311 0.046429 0.246665 -0.2466 -0.13169 -0.16924
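A minimal sketch of how such a loadings data frame can be produced with scikit-learn and pandas follows. The data is synthetic and only four illustrative column names are used, so the numbers will not match the tables above. The PCA0, PCA1, ... row labels follow the convention of this report.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Synthetic stand-in for the numeric columns (illustrative only).
feature_names = ["Apps", "Accept", "Enroll", "Outstate"]
raw = pd.DataFrame(rng.normal(size=(777, 4)) @ rng.normal(size=(4, 4)),
                   columns=feature_names)

# Scale first, then fit PCA with as many components as there are features.
scaled = StandardScaler().fit_transform(raw)
pca = PCA(n_components=len(feature_names))
pca.fit(scaled)

eigenvalues = pca.explained_variance_  # one eigenvalue per PC
eigenvectors = pca.components_         # each ROW holds the loadings of one PC

# Label the rows by PC number and the columns by original feature name.
loadings = pd.DataFrame(eigenvectors,
                        columns=feature_names,
                        index=[f"PCA{i}" for i in range(len(feature_names))])
print(eigenvalues)
print(loadings.round(6))
```

Each row of the resulting frame is a unit-length eigenvector, and the rows are mutually orthogonal.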
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only). [hint: write
the linear equation of PC in terms of eigenvectors and corresponding features]
Let the observed (scaled) variables in the data be denoted by x1, x2, ..., xn, where n is the number of variables (here n = 17).
The principal components are linear combinations of the x values. They may be written as follows:
PC1 = a1x1 + a2x2 +...+ an xn
PC2 = b1x1 + b2x2 +...+ bn xn
PC3 = c1x1 + c2x2 +...+ cn xn
where a, b, c, etc. are the extracted coefficients, i.e. the elements of the first, second, third, etc. eigenvectors.
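To make the general form concrete, the short sketch below builds the explicit equation of the first PC from the first eigenvector printed in section 2.5, rounded to two decimal places. The feature order is assumed to follow the column order of the eigenvector tables, and the variable names used in the string are the table headers with spaces removed.

```python
# First eigenvector as printed in section 2.5 (loadings of the first PC).
first_eigenvector = [
    2.48765602e-01, 2.07601502e-01, 1.76303592e-01, 3.54273947e-01,
    3.44001279e-01, 1.54640962e-01, 2.64425045e-02, 2.94736419e-01,
    2.49030449e-01, 6.47575181e-02, -4.25285386e-02, 3.18312875e-01,
    3.17056016e-01, -1.76957895e-01, 2.05082369e-01, 3.18908750e-01,
    2.52315654e-01,
]

# Assumed feature order, taken from the headers of the eigenvector tables.
features = [
    "Apps", "Accept", "Enroll", "Top10perc", "Top25perc", "F.Undergrad",
    "P.Undergrad", "Outstate", "Room.Board", "Books", "Personal", "PhD",
    "Terminal", "S.F.Ratio", "perc.alumni", "Expend", "Grad.Rate",
]

# Each term is "<signed coefficient>*<feature>", with two decimal places.
terms = [f"{coef:+.2f}*{name}" for coef, name in zip(first_eigenvector, features)]
equation = "PC1 = " + " ".join(terms)
print(equation)
```

Every feature in this equation refers to its z-score scaled value, not the raw value.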
1. In our calculation, x1, x2, etc. are the z-score scaled values shown in Table 2-12 Data after scaling. These are placed in column X.
2. The coefficients are the eigenvectors computed for the scaled data set, as seen in Table 2-15 Eigen Vectors in data frame (Part 1) and Table
2-16 Eigen Vectors in data frame (Part 2). These values (rounded to two decimal places) are placed in columns A and B.
3. The full computation results in 777 rows and 17 columns, i.e. one score per college for each PC.
4. The calculation of the first two PCs, with the formula and table values, is shown below:
PC1 Value = Sum of (Column X * Column A)
PC2 Value = Sum of (Column X * Column B)
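The sum-of-products formula can be sketched in a few lines of NumPy. The example uses synthetic scaled data and obtains the eigenvectors with np.linalg.eigh instead of scikit-learn, which is a substitution made here for illustration. The names column_x and column_a mirror the column labels used in the steps above.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in for the scaled data set: 777 rows, 4 columns.
raw = rng.normal(size=(777, 4)) @ rng.normal(size=(4, 4))
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Eigen decomposition of the covariance matrix, sorted by decreasing eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(scaled, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # column k holds the coefficients of PC k+1

# Manual calculation for the first observation: Sum of (Column X * Column A).
column_x = scaled[0]           # scaled values of one row
column_a = eigenvectors[:, 0]  # coefficients of the first PC
pc1_first_row = float(np.sum(column_x * column_a))

# The same calculation for every row and every PC at once.
scores = scaled @ eigenvectors  # shape: 777 rows x 4 PCs
print(pc1_first_row, scores[0, 0])
```

The sign of an eigenvector is arbitrary, so scores from a different routine may match these only up to a sign flip per component.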
The same output is obtained when we create a PCA object from the sklearn.decomposition library and call its fit_transform function. This is preferred as it is faster
and eliminates the possibility of human error, provided that the inputs are specified correctly. A small sample of the consolidated output for the same dataset is
displayed below.
The highlighted green cells show the manually computed and the auto-calculated values.
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide
on the optimum number of principal components? What do the eigenvectors
indicate?
Consider the cumulative values of the eigenvalues; How does it help you to decide on the optimum
number of principal components?
The cumulative sum of the Eigen values approaches the total variance of the data. For scaled data, this total is approximately
equal to the number of variables (17 here; the Eigen values in this report sum to about 17.02). We use this total to compute the
ratio below, which expresses the share of the total variance explained by each principal component. Comparing these shares shows
how many principal components are relevant, which is how the dimensions are reduced.
Ratio = Eigen value of each PC / Sum of the Eigen values of all PCs
Eigen Values Ratio: [0.32020628, 0.26340214, 0.06900917, 0.05922989,
0.05488405, 0.04984701, 0.03558871, 0.03453621,
0.03117234, 0.02375192, 0.01841426, 0.01296041,
0.00985754, 0.00845842, 0.00517126, 0.00215754,
0.00135284]
The cumulative Eigen value ratio is displayed graphically below as a scree plot. The red horizontal line
indicates 80% cumulative variance. The PCs retained should explain between 70% and 90% of the
total variance. Five principal components explain about 76.7%, which lies within this range, so around 5 components are retained; crossing the 80% line itself takes six components (about 81.7%).
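The cumulative percentages can be recomputed directly from the Eigen value ratios listed above. The sketch below uses only the Python standard library. The helper name components_needed is hypothetical.

```python
from itertools import accumulate

# Eigen value ratios as listed in this section.
ratios = [
    0.32020628, 0.26340214, 0.06900917, 0.05922989,
    0.05488405, 0.04984701, 0.03558871, 0.03453621,
    0.03117234, 0.02375192, 0.01841426, 0.01296041,
    0.00985754, 0.00845842, 0.00517126, 0.00215754,
    0.00135284,
]

# Running total of explained variance after 1, 2, 3, ... components.
cumulative = list(accumulate(ratios))

def components_needed(threshold):
    """Smallest number of PCs whose cumulative ratio reaches the threshold."""
    for count, total in enumerate(cumulative, start=1):
        if total >= threshold:
            return count
    return len(cumulative)

for threshold in (0.70, 0.80, 0.90):
    print(threshold, components_needed(threshold))

print(round(cumulative[4], 4))  # cumulative share after five components
```

The 70% mark is reached with 4 components, the 80% mark with 6, and the 90% mark with 9, while five components account for roughly 76.7% of the variance.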
When we perform Eigen decomposition, we compute the Eigen Values and Eigen Vectors. Eigen vectors
indicate the direction of the principal components. The corresponding Eigen values are the magnitudes of
variance capture
2.9 Explain the business implication of using the Principal Component Analysis for this
case study. How may PCs help in the further analysis? [Hint: Write Interpretations of
the Principal Components Obtained]
2.9.1 Dimensionality Reduction
We can check the computed Eigen values to help decide the optimal number of principal components to
retain. Kaiser's rule is simply to retain the components whose Eigen values are
greater than 1. It is based on the assumption that it is not reasonable to retain a component that explains less variance than a single
original variable, i.e. such a PC does not even do the job of a lone original
variable.
Going by this rule, 4 is the ideal number of principal components to retain
in order to reduce the dimensions of the original dataset.
Eigen Values: [5.45052162, 4.48360686, 1.17466761, 1.00820573,
0.93423123, 0.84849117, 0.6057878 , 0.58787222,
0.53061262, 0.4043029 , 0.31344588, 0.22061096,
0.16779415, 0.1439785 , 0.08802464, 0.03672545,
0.02302787]
The computed Eigen values are displayed graphically as a scree plot to show the principal components
that are to be considered based on their Eigen values.
Considering that we started with 17 columns of numeric data, reducing the number of relevant data fields
to only 4-5 is very useful. It reduces computation effort and focuses the dataset on significant variables.
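Kaiser's rule can be applied directly to the Eigen values listed above. The sketch below uses plain Python and the values exactly as printed in this section.

```python
# Eigen values as listed in this section.
eigenvalues = [
    5.45052162, 4.48360686, 1.17466761, 1.00820573,
    0.93423123, 0.84849117, 0.6057878, 0.58787222,
    0.53061262, 0.4043029, 0.31344588, 0.22061096,
    0.16779415, 0.1439785, 0.08802464, 0.03672545,
    0.02302787,
]

# Kaiser's rule: keep the components whose Eigen value exceeds 1.
retained = [index for index, value in enumerate(eigenvalues) if value > 1.0]
print(len(retained), retained)

# Share of the total variance captured by the retained components.
total = sum(eigenvalues)
retained_share = sum(eigenvalues[index] for index in retained) / total
print(round(total, 2), round(retained_share, 4))
```

The fourth Eigen value (about 1.008) only just clears the threshold, so the choice between 4 and 5 components is a judgment call, which is consistent with the 4-5 range stated above.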
The redlined blocks indicate the highest values in each PC. We take the corresponding column names and rename the PC to get meaningful labels.
• PC0 = pc_Expend
• PC1 = pc_application_acceptance_enrollment
• PC2 = pc_books_personal_expenditure
• PC3 = pc_faculty_phd_terminal
• PC4 = pc_top_10_25_hsc
Once the columns have been renamed and the ‘Names’ column has been added back to the dataset, the transformed dataset is ready. This has 777 rows and 6
columns. We show a sample portion of this below.
Index  Names                         pc_Expend  pc_application_acceptance_enrollment  pc_books_personal_expenditure  pc_faculty_phd_terminal  pc_top_10_25_hsc
0      Abilene Christian University  -1.59286   0.767334                              -0.10107                       -0.92175                 -0.74398
1      Adelphi University            -2.1924    -0.57883                              2.278797                       3.588919                 1.059997
2      Adrian College                -1.43096   -1.09282                              -0.43809                       0.677241                 -0.36961
3      Agnes Scott College           2.855557   -2.63061                              0.14172                        -1.29548                 -0.18384
4      Alaska Pacific University     -2.21201   0.021631                              2.387029                       -1.11454                 0.684451
Figure 2-19 Transformed Data Set with Reduced Dimensionality
With this reduced dimensionality (important parameters only), we can isolate the data needed.
2.9.4.2 Colleges with High PhD/Terminal Faculty and Top 10%/25% HSC
With the combination check of colleges having PhD and Terminal faculty above the third quartile and
having new students from top 10%/25% of HSC, we get 54 colleges. This college names list will be helpful
for future students
2.9.5 Conclusion
With reduced dimensionality shining a light on the important data parameters only, the future students
can decide which university to apply to for