
Business Report

Project: Advanced Statistics


Student’s Name –Tanaya Lokhande

Contents
1.1 STATE THE NULL AND THE ALTERNATE HYPOTHESIS FOR CONDUCTING ONE-WAY ANOVA FOR BOTH EDUCATION AND OCCUPATION INDIVIDUALLY
1.2 PERFORM ONE-WAY ANOVA FOR EDUCATION WITH RESPECT TO THE VARIABLE ‘SALARY’. STATE WHETHER THE NULL HYPOTHESIS IS ACCEPTED OR REJECTED BASED ON THE ANOVA RESULTS
1.3 PERFORM ONE-WAY ANOVA FOR VARIABLE OCCUPATION WITH RESPECT TO THE VARIABLE ‘SALARY’. STATE WHETHER THE NULL HYPOTHESIS IS ACCEPTED OR REJECTED BASED ON THE ANOVA RESULTS
1.4 IF THE NULL HYPOTHESIS IS REJECTED IN EITHER (1.2) OR IN (1.3), FIND OUT WHICH CLASS MEANS ARE SIGNIFICANTLY DIFFERENT. INTERPRET THE RESULT
1.5 WHAT IS THE INTERACTION BETWEEN THE TWO TREATMENTS? ANALYZE THE EFFECTS OF ONE VARIABLE ON THE OTHER (EDUCATION AND OCCUPATION) WITH THE HELP OF AN INTERACTION PLOT
1.6 PERFORM A TWO-WAY ANOVA BASED ON THE EDUCATION AND OCCUPATION (ALONG WITH THEIR INTERACTION EDUCATION*OCCUPATION) WITH THE VARIABLE ‘SALARY’. STATE THE NULL AND ALTERNATIVE HYPOTHESES AND STATE YOUR RESULTS. HOW WILL YOU INTERPRET THIS RESULT?
1.7 EXPLAIN THE BUSINESS IMPLICATIONS OF PERFORMING ANOVA FOR THIS PARTICULAR CASE STUDY
2.1 PERFORM EXPLORATORY DATA ANALYSIS [BOTH UNIVARIATE AND MULTIVARIATE ANALYSIS TO BE PERFORMED]
2.2 IS SCALING NECESSARY FOR PCA IN THIS CASE? GIVE JUSTIFICATION AND PERFORM SCALING
2.3 COMMENT ON THE COMPARISON BETWEEN THE COVARIANCE AND THE CORRELATION MATRICES FROM THIS DATA. [ON SCALED DATA]
2.4 CHECK THE DATASET FOR OUTLIERS BEFORE AND AFTER SCALING. WHAT INSIGHT DO YOU DERIVE HERE?
2.5 PERFORM PCA AND EXPORT THE DATA OF THE PRINCIPAL COMPONENT (EIGENVECTORS) INTO A DATAFRAME WITH THE ORIGINAL FEATURES
2.6 WRITE DOWN THE EXPLICIT FORM OF THE FIRST PC (IN TERMS OF THE EIGENVECTORS. USE VALUES WITH TWO PLACES OF DECIMALS ONLY). WRITE THE LINEAR EQUATION OF PC IN TERMS OF EIGENVECTORS AND CORRESPONDING FEATURES
2.7 CONSIDER THE CUMULATIVE VALUES OF THE EIGENVALUES. HOW DOES IT HELP YOU TO DECIDE ON THE OPTIMUM NUMBER OF PRINCIPAL COMPONENTS? WHAT DO THE EIGENVECTORS INDICATE?
2.8 EXPLAIN THE BUSINESS IMPLICATION OF USING THE PRINCIPAL COMPONENT ANALYSIS FOR THIS CASE STUDY. HOW MAY PCS HELP IN THE FURTHER ANALYSIS? [HINT: WRITE INTERPRETATIONS OF THE PRINCIPAL COMPONENTS OBTAINED]

Problem 1A:
Salary is hypothesized to depend on educational qualification and occupation. To understand the dependency, the
salaries of 40 individuals [SalaryData.csv] are collected and each person’s educational qualification and occupation
are noted. Educational qualification is at three levels: High school graduate, Bachelor, and Doctorate. Occupation is
at four levels: Administrative and clerical, Sales, Professional or specialty, and Executive or managerial. A different
number of observations are in each level of education – occupation combination. [Assume that the data follows a
normal distribution. In reality, the normality assumption may not always hold if the sample size is small.]
1. State the null and the alternate hypothesis for conducting one-way ANOVA for both Education and Occupation
individually.
2. Perform a one-way ANOVA on Salary with respect to Education. State whether the null hypothesis is accepted or
rejected based on the ANOVA results.
3. Perform a one-way ANOVA on Salary with respect to Occupation. State whether the null hypothesis is accepted or
rejected based on the ANOVA results.
4. If the null hypothesis is rejected in either (2) or in (3), find out which class means are significantly different.
Interpret the result. (Non-Graded)

1.1 STATE THE NULL AND THE ALTERNATE HYPOTHESIS FOR CONDUCTING ONE-WAY ANOVA FOR
BOTH EDUCATION AND OCCUPATION INDIVIDUALLY.
Answer: The null and alternate hypotheses for the one-way ANOVA for Education are:
H0: The mean Salary is equal across all Education levels.
Ha1: The mean Salary differs for at least one Education level.
The null and alternate hypotheses for the one-way ANOVA for Occupation are:
H0: The mean Salary is equal across all Occupation types.
Ha2: The mean Salary differs for at least one Occupation type.
Where alpha = 0.05.

If the p-value is < 0.05, then we reject the null hypothesis.

If the p-value is >= 0.05, then we fail to reject the null hypothesis.

1.2 PERFORM ONE-WAY ANOVA FOR EDUCATION WITH RESPECT TO THE VARIABLE ‘SALARY’. STATE
WHETHER THE NULL HYPOTHESIS IS ACCEPTED OR REJECTED BASED ON THE ANOVA RESULTS.

Answer:
Table 1 : 1.2.1 ANOVA for Education

Since the p-value is less than alpha (0.05), we reject the null hypothesis (H0) for Education: the mean Salary differs for at least one Education level.
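
As an illustration of the test and the decision rule, here is a minimal Python sketch of a one-way ANOVA. The salary figures are invented for the example (they are not taken from SalaryData.csv), and the use of `scipy.stats.f_oneway` is an assumption about tooling, since the report does not show its code. The F statistic is also computed by hand so that the mechanics are visible.

```python
import numpy as np
from scipy import stats

# Hypothetical salaries per Education level (made-up numbers, not SalaryData.csv)
groups = {
    "HS-grad": np.array([50.0, 55.0, 60.0, 52.0, 58.0]),
    "Bachelors": np.array([150.0, 160.0, 155.0, 165.0, 158.0]),
    "Doctorate": np.array([200.0, 210.0, 205.0, 220.0, 215.0]),
}
samples = list(groups.values())

# Library result: F statistic and p-value
f_stat, p_value = stats.f_oneway(*samples)

# Same F statistic by hand: between-group vs within-group mean squares
all_values = np.concatenate(samples)
grand_mean = all_values.mean()
k, n = len(samples), all_values.size
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in samples)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in samples)
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

# Decision rule at alpha = 0.05
alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"F = {f_stat:.2f}, p = {p_value:.3g}: {decision}")
```

The same call with the Occupation groups in place of the Education groups would give the test for section 1.3.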

1.3 PERFORM ONE-WAY ANOVA FOR VARIABLE OCCUPATION WITH RESPECT TO THE VARIABLE
‘SALARY’. STATE WHETHER THE NULL HYPOTHESIS IS ACCEPTED OR REJECTED BASED ON THE
ANOVA RESULTS.

Answer:
Table 2 : 1.3.1 ANOVA for Occupation

Since the p-value is greater than alpha (0.05), we fail to reject the null hypothesis (H0) for Occupation.

1.4 IF THE NULL HYPOTHESIS IS REJECTED IN EITHER (1.2) OR IN (1.3), FIND OUT WHICH CLASS MEANS
ARE SIGNIFICANTLY DIFFERENT. INTERPRET THE RESULT.
Answer:
ANOVA tells us whether the group means differ significantly, but it does not tell us which groups differ. Knowing
where the difference lies is crucial for interpreting the result. A Tukey HSD test compares every pair of group means
and identifies the specific pairs that are significantly different. So, after an ANOVA that rejects the null hypothesis,
we use a Tukey test to find out where the statistical significance occurs in our data.

Table 3 : 1.4.1 Tukey HSD test for Education levels

Tukey Test Inference:

The mean Salary differs significantly for every pair of Education levels (reject = True for all pairs).
Interpretation of Inference:
The mean Salary differs across Education levels in all three pairwise comparisons: Bachelors vs Doctorate,
Bachelors vs HS-grad, and Doctorate vs HS-grad.
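
A pairwise Tukey HSD comparison can be sketched as follows. This is only an illustration: the salaries are made-up numbers, and `scipy.stats.tukey_hsd` (available in SciPy 1.8 and later) is used here as an assumption, since the report does not say which library produced its table.

```python
import numpy as np
from scipy import stats

# Hypothetical salaries per Education level (made-up numbers, not SalaryData.csv)
labels = ["HS-grad", "Bachelors", "Doctorate"]
hs_grad = np.array([50.0, 55.0, 60.0, 52.0, 58.0])
bachelors = np.array([150.0, 160.0, 155.0, 165.0, 158.0])
doctorate = np.array([200.0, 210.0, 205.0, 220.0, 215.0])

# result.pvalue[i, j] is the adjusted p-value for comparing group i with group j
result = stats.tukey_hsd(hs_grad, bachelors, doctorate)

alpha = 0.05
significant_pairs = []
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        if result.pvalue[i, j] < alpha:
            significant_pairs.append((labels[i], labels[j]))

print(significant_pairs)
```

Each pair in `significant_pairs` corresponds to a row that would be marked reject = True in a Tukey table.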

1.5 WHAT IS THE INTERACTION BETWEEN THE TWO TREATMENTS? ANALYZE THE EFFECTS OF ONE
VARIABLE ON THE OTHER (EDUCATION AND OCCUPATION) WITH THE HELP OF AN INTERACTION
PLOT.

Answer:
Figure 1.5.1 Interaction between Education and Occupation

Observation:
• From the plot above, the interaction between Education and Occupation can be read as follows:
  • Adm-clerical: Bachelors and Doctorates interact fairly strongly.
  • Sales: Bachelors and Doctorates interact.
  • Prof-specialty: HS-grad and Bachelors interact slightly.
  • HS-grad and Doctorate show no interaction in any of the four occupations.
  • Exec-managerial shows no interaction with any educational level.

• From the plot above, by educational level:
  • Doctorates: fall in the higher salary brackets, mostly in Prof-specialty, Exec-managerial or Sales roles; very
    few do Adm-clerical jobs.
  • Bachelors: fall in the mid-income range, mostly working as Exec-managers, Adm-clerks or in Sales; very few
    are in Prof-specialty roles.
  • HS-grads: fall in the low-income brackets, mostly doing Prof-specialty or Adm-clerical work; a few are in
    Sales, and hardly any are in Exec-managerial roles.

1.6 PERFORM A TWO-WAY ANOVA BASED ON THE EDUCATION AND OCCUPATION (ALONG WITH THEIR
INTERACTION EDUCATION*OCCUPATION) WITH THE VARIABLE ‘SALARY’. STATE THE NULL AND ALTERNATIVE
HYPOTHESES AND STATE YOUR RESULTS. HOW WILL YOU INTERPRET THIS RESULT?

ANSWER:
The null and alternate hypotheses for the two-way ANOVA on Occupation type and Education level are:
H0: The mean Salary is equal across all Occupation types and all Education levels.
Ha: The mean Salary differs for at least one Occupation type or Education level.
Where Alpha = 0.05
If the p-value is < 0.05, then we reject the null hypothesis.
If the p-value is >= 0.05, then we fail to reject the null hypothesis.
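
When an interaction term is included, a two-way ANOVA actually runs three tests. The model below spells them out. The symbols are standard textbook notation introduced here for illustration and do not appear in the original report: $\alpha_i$ is the effect of Education level $i$, $\beta_j$ the effect of Occupation type $j$, and $(\alpha\beta)_{ij}$ their interaction.

```latex
\begin{aligned}
Y_{ijk} &= \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
  \qquad i = 1,\dots,3,\quad j = 1,\dots,4 \\[4pt]
H_0^{\text{Education}} &: \alpha_1 = \alpha_2 = \alpha_3 = 0 \\
H_0^{\text{Occupation}} &: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0 \\
H_0^{\text{Interaction}} &: (\alpha\beta)_{ij} = 0 \ \text{for all } i, j
\end{aligned}
```

Each null hypothesis gets its own F statistic and p-value in the ANOVA table, and the same alpha = 0.05 rule applies to each.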

Table 4 : 1.6.1 Two way ANOVA

The interaction plot suggests some interaction between the two treatments, so we add an interaction term when
performing the two-way ANOVA.

Table 5 : 1.6.2 Two Way ANOVA

Including the interaction term changes the p-values of the two main treatments compared with the two-way ANOVA
without the interaction term.
The p-value of the interaction term between Education and Occupation suggests that the null hypothesis of no
interaction is rejected in this case.

1.7 EXPLAIN THE BUSINESS IMPLICATIONS OF PERFORMING ANOVA FOR THIS PARTICULAR CASE STUDY

Answer:

• ANOVA stands for “analysis of variance”. It is a statistical test that compares the means of groups in order to
  determine whether there is a difference between them, linking a dependent variable to one or more independent
  variables. It is used when more than two group means are compared; for two group means, a t-test can be used.

• In a business context, ANOVA can help manage income. In this case it compares Salary across Education
  levels and Occupation types.

• ANOVA can also inform forecasts of Salary trends, by analyzing patterns in the data to better anticipate
  future salary hikes.
• It is also a widely used statistical technique for examining the factors associated with a rise in Salary.
  Assuming this report is for an HR department or an HR consulting firm, some key takeaways are:

1. Salary increases with Education level. On average, Doctorates earn more than Bachelors and HS-grads.
   However, being a Doctorate does not necessarily mean a significantly higher salary than HS-grad or
   Bachelors employees. Doctorates may not be suitable for every job role and are not always preferred over
   other education levels; they may sometimes be considered overqualified for certain roles.
2. Occupation is less significant than Education for Salary, but at certain levels it does affect Salary.
3. For a few occupations, Bachelor’s degree holders are offered higher salaries than Doctorates. This points to
   shortcomings in the dataset that reduce the accuracy of the tests and analysis: other important variables,
   such as years of experience, specialization, and industry/domain, may also affect salary.
4. The HR department plays a broader role when setting salary bands. Similar job titles in different industries
   command different salary packages according to the job profile, and years of experience also matter when
   deciding a person’s pay scale.
5. The ANOVA tests indicate that Education level combined with Occupation has a more significant influence
   on Salary than Occupation type alone.
Problem 2:
The dataset Education - Post 12th Standard.csv contains information on various colleges. You are
expected to do a Principal Component Analysis for this case study according to the instructions given. The
data dictionary of the 'Education - Post 12th Standard.csv' can be found in the following file: Data
Dictionary.xlsx.

2.1. PERFORM EXPLORATORY DATA ANALYSIS [BOTH UNIVARIATE AND MULTIVARIATE ANALYSIS TO
BE PERFORMED]. What insight do you draw from the EDA?

Answer:
The first step is to get to know the data: understand it and become familiar with it. What questions are we trying to
answer with the data? What variables are we using, and what do they mean? How does the data look from a statistical
perspective? Is it formatted correctly? Are there missing values? Duplicates? Outliers? These questions are answered
step by step as below:

Step 1: Import a) all the necessary libraries and b) the data.

Step 2: Describe the data after loading it: check the datatypes, the number of rows and columns, and the number of
missing values, and describe the min, max, and mean values. Depending on the requirement, drop or replace missing
values.
Step 3: Review the resulting dataset, identify outliers using the Interquartile Range (IQR), and visualize the data to
check for outliers.
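
The IQR check in Step 3 can be sketched as below. This is a minimal illustration on a made-up array. The helper name `iqr_outlier_mask` and the conventional 1.5 multiplier are assumptions, as the report does not show the code it used.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Made-up column of values for illustration
sample = np.array([1, 2, 3, 4, 5, 100])
mask = iqr_outlier_mask(sample)
print(sample[mask])  # values flagged as outliers
```

Applied column by column, the mask gives the number of outliers per variable, which is what the box plots show visually.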

Univariate Analysis
Univariate analysis refers to the analysis of a single variable. Its main purpose is to summarize the data and find
patterns in it. We use the statistical description of each numeric variable, a histogram or distribution plot to view the
distribution, and a box plot to view the 5-point summary and any outliers.

Table 6: 2.1.1 Statistical Description of Original Dataset

Observations:
• The data consists of 777 universities with 18 variables. All are numeric except Name.
• Index of columns: 'Name', 'Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc', 'F.Undergrad', 'P.Undergrad',
  'Outstate', 'Room.Board', 'Books', 'Personal', 'PhD', 'Terminal', 'S.F.Ratio', 'Perc.Alumni', 'Expend',
  'Grad.Rate'.
• Perc.Alumni has a minimum value of 0. This has to be cleaned.
• There are no missing values in the data.
• Name is the only categorical field, and it needs cleanup.
• Very few students fall in the Top 10% and Top 25% categories.
• There are no duplicate records.
• P.Undergrad has a minimum value of 1 and a maximum value of 21836. This has to be cleaned.
• Accept needs cleanup: it has a minimum value of 72 and a maximum value of 26330.
• Apps needs cleanup: it has a minimum value of 81 and a maximum value of 48094.
• Books has a minimum value of 96 and a maximum value of 2340. This has to be cleaned.
• Enroll has a minimum value of 35 and a maximum value of 6392. This has to be cleaned.
• F.Undergrad needs cleanup: it has a minimum value of 139 and a maximum value of 31643.
• The quick summary table shows a maximum graduation rate of 118%, which needs to be rectified as it should not
  exceed 100. The error is on row 95, for Cazenovia College, and has to be corrected using the median value.
• The quick summary table shows a maximum PhD percentage of 103%, which needs to be rectified as it should
  not exceed 100. The error is on row 582 and has to be corrected using the median value.

Table 7 : 2.1.2 Quartile Range for each Variable / Attribute of Dataset

Observations:

• There is a drastic difference between the upper and lower range for most of the variables.
• This indicates the presence of outliers.
• To get the most accurate prediction, outlier treatment should be done before scaling.
• As mentioned in the FAQs, we have not done outlier treatment for PCA.
Figure 2.1.1 Histplot to check Distribution and Density of each Variable

Observations:
• Left-skewed data (long left tail) in variables: 'PhD', 'Terminal'
• Right-skewed data (long right tail) in variables: 'Apps', 'Accept', 'Enroll', 'Top10perc', 'F.Undergrad',
  'P.Undergrad', 'Room.Board', 'Books', 'Personal', 'S.F.Ratio', 'Perc.Alumni', 'Expend'
• Approximately normally distributed (bell curve) variables: 'Top25perc', 'Outstate', 'Grad.Rate'
Figure 2.1.2.2 Boxplot for checking presence of outliers in each feature or Univariate Analysis of all Variables
Multivariate Analysis:

Figure 2.1.2.3 Pair plot to see relationship of all Variables among each other

Observation:
• A few pairs have very high correlation:
  • Applications and acceptances
  • Students from the top 10% and from the top 25%
  • Students from the top 10% and graduation rate
  • Enrollment and full-time undergraduate students
  • PhD faculty and Terminal
The heatmap below exhibits a multicollinearity issue, as a significant number of variable pairs are highly correlated.

Multicollinearity undermines the statistical significance of the individual independent variables.


Figure 2.1.2.4 Heat Map to check collinearity of Original Data

2.2 IS SCALING NECESSARY FOR PCA IN THIS CASE? GIVE JUSTIFICATION AND PERFORM SCALING.
Answer:
• Yes, it is necessary to scale the data before performing PCA.
• The variables are on very different scales: for example, Apps ranges from 81 to 48094 while Books ranges from
  96 to 2340. PCA is driven by variance, so without scaling it is skewed towards high-magnitude features.
• If we standardize the data, all variables have the same standard deviation and therefore the same weight, and
  PCA computes the relevant axes.
• PCA computes a new projection of the dataset. The dataset initially has 18 attributes; after dropping Name, 17
  numeric attributes remain, so we get 17 principal components.
• Once we know the amount of variance explained by each principal component, we can decide how many
  components we need for our model based on the amount of information we want to retain.
• Scaling can also speed up gradient descent and other calculations in algorithms.
• Scaling can be done using the Z-score method, i.e. StandardScaler in scikit-learn. Formula for Z-score:
  z = (x − μ) / σ, where μ is the mean and σ is the standard deviation of the feature.
StandardScaler rescales each feature so that its distribution is centered around 0 with a standard deviation of 1. It
works best when the data within each feature is roughly normally distributed.
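
A minimal sketch of Z-score scaling, written with NumPy so that the arithmetic is explicit. The data is made up, and the helper name `zscore_scale` is hypothetical. The population standard deviation (ddof=0) is used because that is what scikit-learn's StandardScaler computes.

```python
import numpy as np

def zscore_scale(X):
    """Standardize each column to mean 0 and standard deviation 1 (population std, ddof=0)."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Made-up rows with two columns on very different scales
X = np.array([
    [81.0, 0.10],
    [1500.0, 0.25],
    [3000.0, 0.40],
    [48094.0, 0.55],
])
X_scaled = zscore_scale(X)

# After scaling, every column has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After this step both columns carry equal weight, regardless of their original units.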
Figure 2.2.1 Histogram of Variables before performing Scaling

Figure 2.2.2 Histogram of Variables after performing Scaling


Table 8 : 2.2.1 Statistical Description of Scaled Dataset

Observations:

• After scaling, the standard deviation is 1.0 for all variables.

• After scaling, the difference between the Q1 (25%) value and the minimum value is smaller than in the original
  dataset for most of the variables.

2.3 COMMENT ON THE COMPARISON BETWEEN THE COVARIANCE AND THE CORRELATION MATRICES
FROM THIS DATA. [ON SCALED DATA]
• Both the covariance and the correlation matrix measure the relationship and the dependency between two
  variables.
• Covariance indicates the direction of the linear relationship between variables.
• Correlation measures both the strength and the direction of the linear relationship between two variables.
• Correlation is the scaled form of covariance. Covariance is affected by a change in scale; correlation is not.
• On scaled data every variable has unit variance, so the covariance matrix is essentially the same as the
  correlation matrix.
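
The link between the two matrices on standardized data can be checked numerically. The sketch below uses synthetic data (not the college dataset) and standardizes with the sample standard deviation (ddof=1) so that it matches the default of `np.cov`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: three columns on very different scales, two of them correlated
a = rng.normal(size=200)
X = np.column_stack([
    1000 * a,
    a + rng.normal(size=200),
    0.01 * rng.normal(size=200),
])

corr_original = np.corrcoef(X, rowvar=False)

# Standardize with the sample standard deviation (ddof=1) to match np.cov's default
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov_scaled = np.cov(X_scaled, rowvar=False)

# Covariance of the standardized data equals the correlation of the original data
print(np.allclose(cov_scaled, corr_original))
```

If the data is standardized with the population standard deviation instead (ddof=0), the two matrices differ only by the constant factor n/(n-1), which is negligible for a dataset of several hundred rows.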
Figure 2.3.1 Heatmap of correlation between Variables after performing Scaling

Table 9 : 2.3.2 Least Correlation strength among Variables after performing Scaling
Table 10 : 2.3.3 Highest Correlation strength among Variables after performing Scaling

Observations:
• The highest correlations are seen between:
  Enroll and F.Undergrad
  Enroll and Accept
  Apps and Accept
• The lowest correlations are observed between S.F.Ratio and: Expend, Outstate, Grad.Rate, Perc.Alumni,
  Room.Board and Top10perc.

2.4 CHECK THE DATASET FOR OUTLIERS BEFORE AND AFTER SCALING. WHAT INSIGHT DO YOU DERIVE HERE?

Answer:
During univariate analysis we plotted box plots for all the variables to check for outliers. After standardizing the
data, we drew the box plots again on the scaled data and also used the describe function.

After scaling, there is not much difference in the number of outliers.

Scaling shrinks the range of the feature values, as shown in the figure below. However, the outliers influence the
empirical mean and standard deviation. StandardScaler therefore cannot guarantee balanced feature scales in the
presence of outliers.

Figure 2.4.1 Boxplot for Scaled Data for all Variables


Table 11 :2.4.1 Scaled Data summary

So, outliers present in the data remain outliers after standardization; standardization does not remove them.
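
The reason is that Z-score scaling is a linear transformation, so a rank-based rule such as the IQR rule flags exactly the same points before and after scaling. A small sketch on made-up values, where the helper `iqr_outlier_mask` is a hypothetical name introduced for the example:

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up feature with two extreme values
feature = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 11.0, 90.0, 120.0])
scaled = (feature - feature.mean()) / feature.std()

before = iqr_outlier_mask(feature)
after = iqr_outlier_mask(scaled)

# The same points are flagged before and after scaling
print(before.sum(), after.sum())
```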

2.5 PERFORM PCA AND EXPORT THE DATA OF THE PRINCIPAL COMPONENT (EIGENVECTORS) INTO A DATA FRAME WITH
THE ORIGINAL FEATURES
Answer:
In the table below we can see that the first PC explains 33.12% of the variance in our dataset, while the first seven
principal components capture 70.12% of the variance.
Figure 2.5.1 Scree plot showing distribution of PCs

After Z-score scaling, PCA reduces the dimensionality from 17 features to 9 principal components. Results are shown
below.
Table 2 : 2.5.1 Eigenvectors (PC scores) of Original Data (vertical values)
Table 3 : 2.5.2 PCA
Figure 2.5.2 Square Graph to check features spread distribution

Figure 2.5.4 Heat Map Correlation Matrix between components and features
Figure 2.5.3 Loading of each selected Principal Component
2.6 WRITE DOWN THE EXPLICIT FORM OF THE FIRST PC (IN TERMS OF THE EIGENVECTORS. USE VALUES WITH TWO
PLACES OF DECIMALS ONLY). WRITE THE LINEAR EQUATION OF PC IN TERMS OF EIGENVECTORS AND
CORRESPONDING FEATURES.

In PCA, given a mean-centered dataset X with n samples and p variables, the first principal component PC1 is given by
the linear combination of the original variables X_1, X_2, …, X_p:
PC_1 = w_{11}X_1 + w_{12}X_2 + … + w_{1p}X_p

The first principal component PC1 is the component that retains the maximum variance of the data. The weight vector
w_1 = (w_{11}, w_{12}, …, w_{1p}) is the eigenvector of the covariance matrix with the largest eigenvalue.

The explicit form of the PC1 is as below:

[ 0.2487656, 0.2076015, 0.17630359, 0.35427395, 0.34400128, 0.15464096, 0.0264425, 0.29473642, 0.24903045,
0.06475752, -0.04252854, 0.31831287, 0.31705602, -0.17695789, 0.20508237, 0.31890875, 0.25231565 ]
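
Rounding the coefficients above to two decimal places and pairing them, in order, with the 17 numeric columns listed in section 2.1 gives the linear form below. The pairing assumes that the eigenvector entries follow that column order, which the report does not state explicitly. Each variable name stands for its scaled value.

```latex
\begin{aligned}
PC_1 ={}& 0.25\,\text{Apps} + 0.21\,\text{Accept} + 0.18\,\text{Enroll}
        + 0.35\,\text{Top10perc} + 0.34\,\text{Top25perc} \\
      & + 0.15\,\text{F.Undergrad} + 0.03\,\text{P.Undergrad}
        + 0.29\,\text{Outstate} + 0.25\,\text{Room.Board} \\
      & + 0.06\,\text{Books} - 0.04\,\text{Personal} + 0.32\,\text{PhD}
        + 0.32\,\text{Terminal} - 0.18\,\text{S.F.Ratio} \\
      & + 0.21\,\text{Perc.Alumni} + 0.32\,\text{Expend} + 0.25\,\text{Grad.Rate}
\end{aligned}
```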

2.7 CONSIDER THE CUMULATIVE VALUES OF THE EIGENVALUES. HOW DOES IT HELP YOU TO DECIDE ON THE OPTIMUM
NUMBER OF PRINCIPAL COMPONENTS? WHAT DO THE EIGENVECTORS INDICATE?

Figure 2.7.1 Comparison between Cumulative and Individual explained Variance

Observations:
• The plot shows how much of the variance is explained by how many principal components.
• In the plot we see that the first PC explains 33.13% of the variance, the first two PCs together explain 57.19%,
  and so on.
• Effectively, we can capture most of the variance (i.e. 90%) by analyzing 9 principal components instead of all
  17 variables (attributes) in the dataset.
PCA uses the eigenvectors of the covariance matrix to determine how to rotate the data. Because rotation is a kind of
linear transformation, the new dimensions are weighted sums of the old ones. The eigenvectors (principal
components) give the directions, or axes, along which the linear transformation acts by stretching or compressing
input vectors, and the corresponding eigenvalues give the variance captured along each direction. They are the lines
of change that represent the action of the larger matrix.
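
The procedure described in sections 2.5 to 2.7 can be sketched end to end with NumPy. This is an illustration under stated assumptions: the data is synthetic (a stand-in for the scaled college data), the 90% threshold is the one quoted above, and the eigen decomposition is done directly rather than through any particular PCA library.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: 200 rows, 6 correlated columns built from 2 latent factors plus noise
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(200, 6))

# Standardize, then eigen-decompose the covariance matrix
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(X_scaled, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; sort descending so PC1 comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Individual and cumulative share of variance explained by each component
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative explained variance reaches 90%
n_components = int(np.searchsorted(cumulative, 0.90) + 1)

# Project the data onto the retained eigenvectors to get the PC scores
scores = X_scaled @ eigenvectors[:, :n_components]
print(n_components, cumulative.round(3))
```

Each column of `eigenvectors` holds the loadings of one principal component on the original features, which is the information that section 2.5 exports to a data frame.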

2.8 EXPLAIN THE BUSINESS IMPLICATION OF USING THE PRINCIPAL COMPONENT ANALYSIS FOR THIS CASE STUDY. HOW
MAY PCS HELP IN THE FURTHER ANALYSIS? [HINT: WRITE INTERPRETATIONS OF THE PRINCIPAL COMPONENTS
OBTAINED]

This business case study concerns an education dataset that contains the names and various details of colleges and
universities. To understand the dataset, we perform univariate and multivariate analysis. Univariate analysis shows
the distribution, skew, and patterns of each variable. Multivariate analysis shows the correlation between variables
and reveals that several variables are highly correlated with each other. Scaling standardizes the variables to one
common scale. Outliers are imputed using IQR values; once the values are imputed we can perform PCA. Principal
component analysis is used to reduce the multicollinearity between the variables. Depending on the variance
explained, we can reduce the number of principal components. The number of PCA components for this business case
is 5, which captures the maximum variance of the dataset. Using these components, we can work with a dataset in
which multicollinearity is reduced.
