Download as pdf or txt
Download as pdf or txt
You are on page 1of 13

Problem Statement 1:

Executive Summary

Salary is hypothesized to depend on educational qualification and occupation. To


understand the dependency, the salaries of 40 individuals [SalaryData.csv] are
collected and each person’s educational qualification and occupation are noted.
Educational qualification is at three levels, High school graduate, Bachelor, and
Doctorate. Occupation is at four levels, Administrative and clerical, Sales,
Professional or specialty, and Executive or managerial. A different number of
observations are in each level of education – occupation combination.
Introduction
The purpose of this whole exercise is to explore the dataset. Do the data analysis. Explore the
dataset using One Way ANOVA & Two way ANOVA. The data consists of 41different salary
grade with respect to Education and qualification. Analyse the different salary grade which can
help in analysing the impact of Education and Qualification.

Data Description

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from statsmodels.formula.api import ols

from statsmodels.stats.anova import _get_covariance,anova_lm

%matplotlib inline
Sample of the dataset:

Education Occupation Salary

0 Doctorate Adm-clerical 153197

1 Doctorate Adm-clerical 115945

2 Doctorate Adm-clerical 175935

3 Doctorate Adm-clerical 220754

4 Doctorate Sales 170769

1.1 State the null and the alternate hypothesis for conducting one-wayANOVA
for both Education and Occupation individually.
❖ For Education:
𝐻0 : The means of 'Education' variable with respect to each Salary is
equal.
𝐻1 : At least one of the means of 'Education' variable with respect to
eachSalary is unequal.
❖ For Occupation:
𝐻0 : The means of 'Occupation' variable with respect to each Salary is
equal.
𝐻1 : At least one of the means of 'Occupation' variable with respect to
eachSalary is unequal.
1.2Perform one-way ANOVA for Education with respect to the variable ‘Salary’. State
whether the null hypothesis is accepted or rejected based on the ANOVA results.

df sum_sq mean_sq F PR(>F)


C(Education) 2.0 1.026955e+11 5.134773e+10 30.95628 1.257709e-08
Residual 37.0 6.137256e+10 1.658718e+09 NaN NaN

➢ Since the p value is less than the significance level, we can reject the null hypothesis
and states that there is a difference in the mean Salary with different Education
Qualification.
1.3 Perform one-way ANOVA for variable Occupation with respect to the variable ‘Salary’
. State whether the null hypothesis is accepted or rejected based on the ANOVA results.

df sum_sq mean_sq F PR(>F)


C(Occupation) 3.0 1.125878e+10 3.752928e+09 0.884144 0.458508
Residual 36.0 1.528092e+11 4.244701e+09 NaN NaN

➢ Since the p value is less than the significance level, we can reject the Alternative
hypothesis/Accept the null hypothesis and states that there is no difference in the
mean Salary with different Occupation.

1.4 If the null hypothesis is rejected in either (1.2) or in (1.3), find out which class means
are significantly different. Interpret the result.
➢ Since the p-values for Education less than .05, this means that both factors have a
statistically significant effect on Salary.

➢ And since the p-value for Occupation (.07) is not less than .05, this tells us that there
is no significant interaction effect between Occupation and Salary.

1.5 What is the interaction between the two treatments? Analyze the effects of one
variable on the other (Education and Occupation) with the help of an interaction plot.

• sns.pointplot(x='Education', y='Salary', data=data, ci=None)


• sns.pointplot(x='Occupation', y='Salary', data=data
• sns.pointplot(x='Occupation', y='Salary', data=data, ci=None)

1.6 Perform a two-way ANOVA based on the Education and Occupation (along with
their interaction Education*Occupation) with the variable ‘Salary’. State the null and
alternative hypotheses and state your results. How will you interpret this result?

• formula = 'Salary ~ C(Education) + C(Occupation)'

model = ols(formula, df).fit()

aov_table = anova_lm(model)

print(aov_table)

df sum_sq mean_sq F PR(>F)


C(Education) 2.0 1.026955e+11 5.134773e+10 31.257677 1.981539e-08
C(Occupation) 3.0 5.519946e+09 1.839982e+09 1.120080 3.545825e-01
Residual 34.0 5.585261e+10 1.642724e+09 NaN NaN

• Since the p value is less than the significance level for Education, we can reject the null
hypothesis and states that there is a difference in the mean Salary with different
Education Qualification.
• Since the p value is greater than the significance level for Occupation, we can fail to
reject the null hypothesis and states that there is no difference in the mean Salary with
different Occupation.

• sns.pointplot(x='Education', y='Salary', data=df, hue='Occupation',ci=None);

• from statsmodels.graphics.factorplots import interaction_plot


interaction_plot(np.array(df['Occupation']),np.array(df['Education']),np.array(df['
Salary']),markers=['D','^','o']);
1.7 Explain the business implications of performing ANOVA for this particular
case study.

• formula = 'Salary ~ C(Education) + C(Occupation) + C(Education):C(Occupation)'

model = ols(formula, df).fit()

aov_table = anova_lm(model)

(aov_table)

Conclusion & Recommendation

➢ Since the p-values for Education less than .05, this means that both factors have a
statistically significant effect on Salary.

➢ And since the p-value for Occupation (.07) is not less than .05, this tells us that there
is no significant interaction effect between Occupation and Salary.

➢ Since the p-value for the interaction effect less than 0.05, this means that there is
significant interaction effect between Education and Occupation.
Problem Statement 2:
Executive Summary
The dataset Education - Post 12th Standard.csv contains information on various colleges. You
are expected to do a Principal Component Analysis for this case study according to the
instructions given. The data dictionary of the 'Education - Post 12th Standard.csv' can be found
in the following file: Data Dictionary.xlsx.
• Perform Exploratory Data Analysis [both univariate and multivariate analysis to be
performed]. What insight do you draw from the EDA?
• Is scaling necessary for PCA in this case? Give justification and perform scaling.
• Comment on the comparison between the covariance and the correlation matrices from
this data [on scaled data].
• Check the dataset for outliers before and after scaling. What insight do you derive here?
[Please do not treat Outliers unless specifically asked to do so]
• Extract the eigenvalues and eigenvectors.[Using Sklearn PCA Print Both]
• Perform PCA and export the data of the Principal Component (eigenvectors) into a data
frame with the original features
• Write down the explicit form of the first PC (in terms of the eigenvectors. Use values
with two places of decimals only). [hint: write the linear equation of PC in terms of
eigenvectors and corresponding features]
• Consider the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components? What do the eigenvectors indicate?
• Explain the business implication of using the Principal Component Analysis for this case
study. How may PCs help in the further analysis? [Hint: Write Interpretations of the
Principal Components Obtained]

Introduction
The purpose of this whole exercise is to explore the dataset. Do the exploratory data analysis.
Explore the dataset using EDA. The data consists of 777 rows and 18 variables. Analyse every
case with the help of exploratory data analysis.

Data Description

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt


import seaborn as sns

%matplotlib inline

import warnings

warnings.filterwarnings('ignore')

Sample of the dataset:

2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be
performed]. What insight do you draw from the EDA?

➢ With help of PCA we have been able to reduce features which is able to explain 85% of
variance in the data
➢ With help of reduced components we have been able to observe some patterns. Using
some rules around business context we are able to shortlist Educational qualification
which need to be monitored closely.

2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.
Yes, it is necessary to normalize data before performing PCA.If you normalize your data, all
variables have the same standard deviation, thus all variables have the same weight and your
PCA calculates relevant axis.
The rule of thumb is that if your data is already on a different scale (e.g. every feature is XX per
100 inhabitants), scaling it will remove the information contained in the fact that your
features have unequal variances. If the data is on different scales, then you should normalize it
before running PCA.
2.3 Comment on the comparison between the covariance and the correlation matrices from
this data.[on scaled data]

Both Correlation and Covariance are very closely related to each other and yet they differ a
lot. When it comes to choosing between Covariance vs Correlation, the latter stands to be the
first choice as it remains unaffected by the change in dimensions, location, and scale, and can
also be used to make a comparison between two pairs of variables. Since it is limited to a
range of -1 to +1, it is useful to draw comparisons between variables across
domains. However, an important limitation is that both these concepts measure the only
linear relationship.

2.4 Check the dataset for outliers before and after scaling. What insight do you derive
here?
2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both]

➢ #Extract eigen vectors

➢ pca.components_

➢ array([[ 0.46457137, 0.4793131 , 0.25493646, 0.10391206, 0.49510403,


➢ 0.48323295],
➢ [-0.30863947, -0.25884109, 0.58743574, 0.69886653, 0.04529286,
➢ 0.04686531],
➢ [ 0.42551692, 0.42729496, 0.30679156, 0.15538098, -0.48014392,
➢ -0.53623863],
➢ [ 0.12993364, 0.1105342 , -0.70319737, 0.69017555, -0.00650881,
➢ -0.00531485],
➢ [-0.12981039, 0.1736953 , 0.0067583 , 0.0020202 , -0.70711239,
➢ 0.67299444],
➢ [-0.68856153, 0.69158614, -0.03568449, -0.01726965, 0.14910425,
➢ -0.15423335]])
➢ eigen values

➢ pca.explained_variance_

➢ array([3.05584486, 1.56711528, 0.74103744, 0.38830943, 0.15167818,


➢ 0.10374678])

➢ pca.explained_variance_ratio_

➢ array([0.508652 , 0.26084973, 0.12334729, 0.06463495, 0.02524716,


➢ 0.01726888])

2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a dat
a frame with the original features
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values
with two places of decimals only). [hint: write the linear equation of PC in terms of
eigenvectors and corresponding features]

The explicit form of first PC is as below:PC1= 0.25*Apps + 0.21*Accept + 0.18*Enroll +


0.35*Top10perc + 0.34*Top25perc + 0.15*F.Undergrad+ 0.03*P.Undergrad + 0.29*Outstate +
0.25*Room.Board + 0.06*Books + -0.04*Personal + 0.32*PhD +0.32*Terminal + -
0.18*S.F.Ratio + 0.21*perc.alumni + 0.32*Expend + 0.25*Grad.Rate

2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components? What do the eigenvectors indicate?

A widely applied approach is to decide on the number of principal components by examining a


scree plot. By eyeballing the scree plot, and looking for a point at which the proportion of
variance explained by each subsequent principal component drops off. This is often referred to as
an elbow in the scree plot.

Conclusion
2.9 Explain the business implication of using the Principal Component Analysis for this
case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the
Principal Components Obtained]

With help of PCA we have been able to reduce features which is able to explain 85% of variance
in the data

With help of reduced components we have been able to observe some patterns. Using some rules
around business context we are able to shortlist Educational qualification which need to be
monitored closely.

Using the components additional rules can be derived and analyzed.

Unsupervised learning like clustering can further be applied on the data to segment the Education
Qualification based on the components created and further analyzed.
**********The End*************

You might also like