
Problem 1:

1.1) State the Null and Alternate Hypothesis for conducting one-way ANOVA for both the variables
‘A’ and ‘B’ individually. [both statement and statistical form like Ho=mu, Ha>mu]
There are three levels for variable A and also three levels for variable B.
For conducting one-way ANOVA for variable ‘A’, the null and alternate hypotheses are as follows:
Null Hypothesis: The mean hours of relief provided by compound A are the same at all three levels.
Alternate Hypothesis: For at least one pair of levels of compound A, the mean hours of relief are not equal.
Statistically, H0: µ1 = µ2 = µ3 and Ha: µi ≠ µj for at least one pair of levels (i, j) of compound A.
For conducting one-way ANOVA for variable ‘B’, the null and alternate hypotheses are as follows:
Null Hypothesis: The mean hours of relief provided by compound B are the same at all three levels.
Alternate Hypothesis: For at least one pair of levels of compound B, the mean hours of relief are not equal.
Statistically, H0: µ1 = µ2 = µ3 and Ha: µi ≠ µj for at least one pair of levels (i, j) of compound B.
1.2) Perform one-way ANOVA for variable ‘A’ with respect to the variable ‘Relief’. State whether the
Null Hypothesis is accepted or rejected based on the ANOVA results.
The summary of one-way ANOVA table for variable ‘A’ is as below:

Table 1.1: One-way ANOVA Table for compound ‘A’

The F-statistic is approximately 23.5 and the p-value is highly significant. Therefore, based on the
ANOVA test, we reject the null hypothesis that the mean hours of relief provided by compound ‘A’ are the
same at all three levels.
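The F-statistic in a one-way ANOVA table is the ratio of the between-group mean square to the within-group mean square. A minimal sketch of that calculation is given here. The three groups are made-up illustrative numbers, not the hay fever data, and `one_way_f` is a hypothetical helper name. In practice a library routine such as `scipy.stats.f_oneway` would also return the p-value.

```python
from statistics import mean

def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: group size times the squared distance
    # of each group mean from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: squared distance of each value
    # from the mean of its own group.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Illustrative values only (NOT the study data): one list per level of a factor.
levels = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_f(levels)
print(f_stat, df_b, df_w)
```

A large F value means the level means are spread out far more than the scatter within each level would explain, which is the basis for rejecting the null hypothesis.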
1.3) Perform one-way ANOVA for variable ‘B’ with respect to the variable ‘Relief’. State whether the
Null Hypothesis is accepted or rejected based on the ANOVA results.
The summary of one-way ANOVA table for variable ‘B’ is as below:

Table 1.2: One-way ANOVA Table for compound ‘B’

The F-statistic is approximately 61.8 and the p-value is highly significant. Therefore, based on the
ANOVA test, we reject the null hypothesis that the mean hours of relief provided by compound ‘B’ are the
same at all three levels.
1.4) Analyse the effects of one variable on another with the help of an interaction plot.
What is the interaction between the two treatments?
The interaction plots for the compounds ‘A’ and ‘B’ are as below:

This study source was downloaded by 100000840671323 from CourseHero.com on 03-06-2022 05:42:48 GMT -06:00

https://www.coursehero.com/file/96998788/Project-2docx/
Since the line segments are not parallel, there is an interaction between the compounds ‘A’ and
‘B’. From the above plot we can interpret that, for the third level of compound B, the hours of relief
increase as the level of compound A increases.

In this case also, since the line segments are not parallel, there is an interaction between the
compounds ‘A’ and ‘B’. From the above plot we can interpret that, for the third level of compound A, the
hours of relief increase as the level of compound B increases.
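Non-parallel lines in an interaction plot mean that the change in mean relief between levels of one compound differs across the levels of the other. A rough numerical version of that check is sketched here. The cell means are hypothetical values chosen for illustration, not numbers read from the plots, and the helper names are assumptions.

```python
# Hypothetical cell means of 'Relief' (illustrative, NOT the study data):
# rows are levels of A, columns are levels of B.
cell_means = [
    [2.0, 3.0, 4.0],
    [3.0, 5.0, 8.0],
    [4.0, 7.0, 13.0],
]

def step_changes(means):
    """For each level of B (column), list the change in mean between successive levels of A."""
    n_rows, n_cols = len(means), len(means[0])
    return [
        [means[i + 1][j] - means[i][j] for i in range(n_rows - 1)]
        for j in range(n_cols)
    ]

def lines_are_parallel(means, tol=1e-9):
    """True when every column shows the same step changes, i.e. no interaction in the cell means."""
    changes = step_changes(means)
    first = changes[0]
    return all(
        abs(c - f) <= tol
        for col in changes[1:]
        for c, f in zip(col, first)
    )

print(step_changes(cell_means))
print(lines_are_parallel(cell_means))
```

Sample cell means are almost never exactly parallel, so the plot only suggests an interaction. Whether it is statistically significant is decided by the interaction term of the two-way ANOVA.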
1.5) Perform a two-way ANOVA based on the different ingredients (variable ‘A’ & ‘B’ along with their
interaction 'A*B') with the variable 'Relief' and state your results.
For the two-way ANOVA, the null hypotheses are:
(i) The mean hours of relief provided by compound A are the same at all three levels.
(ii) The mean hours of relief provided by compound B are the same at all three levels.
(iii) There is no interaction between compounds A and B in their effect on mean hours of relief.
The alternative hypotheses are:
(i) For at least one level of compound A, the mean hours of relief differ from the others.
(ii) For at least one level of compound B, the mean hours of relief differ from the others.
(iii) There is an interaction between A and B.
The summary of two-way ANOVA table is as below:

All three effects are significant at the 5% level. Therefore, we reject the null hypothesis that the mean
hours of relief provided by compound A are the same at all three levels, and we reject the null hypothesis
that the mean hours of relief provided by compound B are the same at all three levels. Similarly, the null
hypothesis of no interaction between compound ‘A’ and compound ‘B’ is rejected. Thus,
there is an interaction between compounds A and B in their effect on mean hours of relief.
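The three F ratios in a two-way ANOVA table come from splitting the total variation into parts due to A, B, the A*B interaction, and error. The sketch below does this by hand under the assumption of a balanced design (the same number of replicates in every cell). The function name and the small 2 x 2 x 2 array are illustrative assumptions, not the study data.

```python
import numpy as np

def two_way_anova_balanced(y):
    """Sums of squares, degrees of freedom and F ratios for a balanced two-way design.

    y has shape (a, b, n): a levels of A, b levels of B, n replicates per cell.
    """
    y = np.asarray(y, dtype=float)
    a, b, n = y.shape
    grand = y.mean()
    mean_a = y.mean(axis=(1, 2))   # one mean per level of A
    mean_b = y.mean(axis=(0, 2))   # one mean per level of B
    cell = y.mean(axis=2)          # one mean per (A, B) cell

    ss = {
        "A": b * n * np.sum((mean_a - grand) ** 2),
        "B": a * n * np.sum((mean_b - grand) ** 2),
        # Interaction: what remains of each cell mean after removing both main effects.
        "A:B": n * np.sum((cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2),
        "Error": np.sum((y - cell[:, :, None]) ** 2),
    }
    df = {"A": a - 1, "B": b - 1, "A:B": (a - 1) * (b - 1), "Error": a * b * (n - 1)}
    ms_error = ss["Error"] / df["Error"]
    f = {k: (ss[k] / df[k]) / ms_error for k in ("A", "B", "A:B")}
    return ss, df, f

# Illustrative data only: 2 levels of A, 2 levels of B, 2 replicates per cell.
demo = [[[1, 3], [2, 4]],
        [[5, 7], [10, 12]]]
ss, df, f = two_way_anova_balanced(demo)
print(ss)
print(df)
print(f)
```

Each F ratio is then compared with the F distribution on the matching degrees of freedom to obtain the p-value reported in the table.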
1.6) Mention the business implications of performing ANOVA for this particular case study.
The mean hours of relief for severe cases of hay fever are significantly affected by both compounds ‘A’
and ‘B’, along with their interaction effect. The mean hours of relief increase with an increase in the
level of compound ‘A’ as well as of compound ‘B’. At the third level of compound B, as the level of
compound A increases, the hours of relief for severe cases of hay fever also increase. Similarly, at the
third level of compound A, as the level of compound B increases, the hours of relief for severe cases of
hay fever also increase.

Problem 2:
2.1) Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed].
The inferences drawn from this should be properly documented.
Univariate Analysis: The histograms for the different variables are as below:

From the above histograms, we can infer the following:


(i) Most of the colleges have both part-time and full-time student counts within the range of 10-5000.
There are very few colleges in which the number of part-time and full-time students is more than 20,000.
(ii) The personal spending of students in most of the colleges is below 2000. There are very few colleges
in which the personal spending of the students is 2000 or more. The mean personal spending of students
across all colleges is 1340, with a standard deviation of 677.
(iii) The graduation rate in most of the colleges is between 50% and 80%. The minimum graduation rate for
a college is 10%.
(iv) Most of the colleges have out of station students between 7000-12,500.
(v) Most of the colleges have 60%-80% faculties with PhD degree.

Multivariate Analysis: The scatterplots for combinations of two variables are as below:

The number of applications accepted increases with the number of applications received.
The graduation rate increases as the Student/Faculty Ratio decreases.
The colleges having a high Percentage of Faculty with PhD degree have a higher graduation rate.
The estimated personal spending of a student increases with the estimated book cost for a
student.

2.2) Scale the variables and write the inference for using the type of scaling function for this case
study.
PCA works on total variance, which is the sum of the variances of the variables in the dataset. If one or
more variances are very high compared to the rest, they will dominate the construction of the PCs and the
other variables will not have proper representation. In this case, the variance for ‘No. of Application
Received’ is 14978459.53 while the variance for ‘Cost of Room and Board’ is 1202743.027. If one variable
(e.g. Apps: Number of applications received) varies more than another (e.g. Room.Board: Estimated Cost of
Room and Board for Student) because of their respective scales (a count vs. a dollar amount),
PCA on unscaled data might determine that the direction of maximal variance corresponds closely with the
‘Apps’ axis, which would be an artefact of the units rather than of the data.

Standard Scaler standardizes the data using the formula (x - mean)/standard deviation. We will be doing
this for the numerical variables. The first five rows of the scaled data are as follows:
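As a sketch of the scaling step itself, the formula (x - mean)/standard deviation can be applied column by column as follows. The array is a small made-up example, not the college data, and `standard_scale` is a hypothetical helper. It uses the population standard deviation (`ddof=0`), which is the convention followed by scikit-learn's `StandardScaler`.

```python
import numpy as np

def standard_scale(x):
    """Column-wise z-scores: (x - mean) / standard deviation."""
    x = np.asarray(x, dtype=float)
    # ddof=0 gives the population standard deviation.
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=0)

# Illustrative numbers only: two columns on very different scales.
raw = np.array([[1000.0, 1.0],
                [3000.0, 2.0],
                [8000.0, 6.0]])
scaled = standard_scale(raw)

print(raw.var(axis=0))      # variances differ by orders of magnitude
print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.var(axis=0))   # approximately 1 for every column
```

After scaling, every column contributes the same variance, so no single variable can dominate the principal components purely because of its units.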
2.3) Comment on the comparison between covariance and the correlation matrix.
Covariance indicates the direction of the linear relationship between two variables. Correlation, on the
other hand, measures both the strength and the direction of the linear relationship between two variables.
Correlation is a function of the covariance. We can obtain the correlation coefficient of two variables by
dividing the covariance of these variables by the product of their standard
deviations.
The covariance matrix of the original data, the covariance matrix of the scaled data and the correlation
matrix of the original data have been calculated in this case. The covariance matrix of the scaled data is
similar to the correlation matrix of the original data. This is on expected lines, as the differences
between values and means (x - mean) had been divided by the standard deviations to arrive at the scaled values.
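The relationship between the two matrices can be checked numerically. The sketch below uses synthetic data (not the college dataset) and assumes that scaling used the population standard deviation while the covariance used the usual n - 1 denominator. Under those assumptions the covariance matrix of the scaled data equals the correlation matrix multiplied by n/(n - 1), which would explain why the two are similar rather than identical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic columns on very different scales (NOT the college dataset).
x = rng.normal(size=(50, 3)) * np.array([1.0, 100.0, 10000.0])
x[:, 1] += 50.0 * x[:, 0]   # make the first two columns correlated
n = x.shape[0]

z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=0)   # scaling with population std
cov_scaled = np.cov(z, rowvar=False)               # n - 1 in the denominator
corr_original = np.corrcoef(x, rowvar=False)

print(np.round(cov_scaled, 3))
print(np.round(corr_original, 3))
print(np.allclose(cov_scaled, corr_original * n / (n - 1)))
```

The same factor would make the eigenvalues of the scaled covariance matrix sum to slightly more than the number of variables. The eigenvalues listed under 2.5 sum to about 17.02 for 17 variables, which is consistent with this.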
2.4) Check the dataset for outliers before and after scaling. Draw your inferences from this exercise.
The variables have been scaled using four methods, viz. (i) Standard Scaler (ii) MinMax Scaler (iii)
Logarithmic Scaler (iv) Exponential Scaler.
Boxplots have been drawn for the variables before and after scaling for all the scaling methods
used. From the boxplots of the variables before and after scaling, it can be concluded that scaling did
not significantly reduce the number of outliers for each variable.
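A boxplot flags a point as an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below counts such points before and after two linear rescalings. The sample is made up for illustration and `iqr_outlier_count` is a hypothetical helper.

```python
import numpy as np

def iqr_outlier_count(values):
    """Count points beyond the usual boxplot whiskers (1.5 * IQR from the quartiles)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return int(np.sum((values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)))

# Illustrative right-skewed sample (NOT the college data).
sample = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 40, 60], dtype=float)
standardized = (sample - sample.mean()) / sample.std()
min_max = (sample - sample.min()) / (sample.max() - sample.min())

print(iqr_outlier_count(sample))
print(iqr_outlier_count(standardized))
print(iqr_outlier_count(min_max))
```

Standard and MinMax scaling only shift and stretch the values, so the quartiles, the whiskers and the data all move together and the same points stay flagged. Non-linear transforms such as the logarithm change the shape of the distribution and can therefore change the count.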
2.5) Build the covariance matrix, eigenvalues, and eigenvector.
The covariance matrix is as below:

The eigenvalues are as below:


[5.45052162, 4.48360686, 1.17466761, 1.00820573, 0.93423123, 0.84849117, 0.6057878, 0.58787222, 0.53061262,
0.4043029 , 0.02302787, 0.03672545, 0.31344588, 0.08802464, 0.1439785 , 0.16779415, 0.22061096]

The corresponding eigenvectors are as below:

2.6) Write the explicit form of the first PC (in terms of Eigen Vectors).
The explicit form of the first PC is as below:
PC1 = 0.25*Apps + 0.21*Accept + 0.18*Enroll + 0.35*Top10perc + 0.34*Top25perc + 0.15*F.Undergrad
+ 0.03*P.Undergrad + 0.29*Outstate + 0.25*Room.Board + 0.06*Books - 0.04*Personal + 0.32*PhD
+ 0.32*Terminal - 0.18*S.F.Ratio + 0.21*perc.alumni + 0.32*Expend + 0.25*Grad.Rate
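The first PC is a weighted sum of the variables, with the entries of the first eigenvector as weights. A small sketch of evaluating it for one college follows, under the assumption that the variable names in the formula stand for the scaled values. The helper `pc1_score` and the example row are hypothetical.

```python
# Loadings of the first principal component, as quoted in the text (rounded to two decimals).
pc1_loadings = {
    "Apps": 0.25, "Accept": 0.21, "Enroll": 0.18, "Top10perc": 0.35,
    "Top25perc": 0.34, "F.Undergrad": 0.15, "P.Undergrad": 0.03,
    "Outstate": 0.29, "Room.Board": 0.25, "Books": 0.06, "Personal": -0.04,
    "PhD": 0.32, "Terminal": 0.32, "S.F.Ratio": -0.18, "perc.alumni": 0.21,
    "Expend": 0.32, "Grad.Rate": 0.25,
}

def pc1_score(scaled_row):
    """Weighted sum of one college's scaled values, using the PC1 loadings as weights."""
    return sum(pc1_loadings[name] * scaled_row[name] for name in pc1_loadings)

# An eigenvector has unit length, so the squared loadings should sum to about 1
# (not exactly, because the quoted values are rounded).
squared_length = sum(w ** 2 for w in pc1_loadings.values())
print(round(squared_length, 4))

# Hypothetical college that sits exactly at the mean of every scaled variable.
average_college = {name: 0.0 for name in pc1_loadings}
print(pc1_score(average_college))
```

A college that is above average on the heavily weighted variables (Top10perc, Top25perc, PhD, Terminal, Expend) gets a high PC1 score, while a high S.F.Ratio pulls the score down.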

2.7) Discuss the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components? What do the eigenvectors indicate?
Perform PCA and export the data of the Principal Component scores into a data frame.
The cumulative values of the eigenvalues are as below:

The cumulative value up to the eighth Principal Component is 88.67%. A general rule of thumb is to choose
the first k PCs such that they explain 70-90% of the total variance. The cumulative values of the
eigenvalues therefore help in selecting the required number of PCs. In this case, the first eight PCs have
been selected, capturing 88.7% of the variation and reducing the dimension by roughly half (from 17
variables to 8 components).
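The cumulative percentages can be reproduced from the eigenvalues listed under 2.5. The sketch below sorts them, accumulates their share of the total, and reports how many components are needed to reach a chosen threshold. The helper `components_needed` is a hypothetical name.

```python
# Eigenvalues as listed under 2.5, sorted from largest to smallest.
eigenvalues = sorted([
    5.45052162, 4.48360686, 1.17466761, 1.00820573, 0.93423123, 0.84849117,
    0.6057878, 0.58787222, 0.53061262, 0.4043029, 0.02302787, 0.03672545,
    0.31344588, 0.08802464, 0.1439785, 0.16779415, 0.22061096,
], reverse=True)

total = sum(eigenvalues)
cumulative_pct = []
running = 0.0
for value in eigenvalues:
    running += value
    cumulative_pct.append(100 * running / total)

def components_needed(threshold_pct):
    """Smallest number of leading components whose cumulative share reaches the threshold."""
    for k, pct in enumerate(cumulative_pct, start=1):
        if pct >= threshold_pct:
            return k
    return len(cumulative_pct)

print([round(p, 2) for p in cumulative_pct])
print(components_needed(70), components_needed(90))
```

With these eigenvalues the 70-90% rule of thumb corresponds to keeping between four and nine components, so a choice of eight sits inside that range.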
The scree plot is as below:

The eigenvector associated with the largest eigenvalue indicates the direction in which the data has the
most variance. Similarly, the eigenvector associated with the second largest eigenvalue indicates the
direction in which the data has the second most variance and so on.
As mentioned above, the first eight PCs have been selected. The PC scores have been exported
into the DataFrame ‘PC Scores’, which has been attached with the assignment.
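The link between eigenvectors, variance and PC scores can be sketched as follows. This uses a small synthetic matrix in place of the scaled college data and keeps two components for illustration. It is one way to carry out the steps, not necessarily the method used in the assignment.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-in for the scaled data matrix (rows = colleges, columns = variables).
scaled = rng.normal(size=(200, 4))
scaled[:, 1] += 0.8 * scaled[:, 0]            # introduce some correlation
scaled = (scaled - scaled.mean(axis=0)) / scaled.std(axis=0)

cov = np.cov(scaled, rowvar=False)
# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]             # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                         # number of components kept (illustrative)
scores = scaled @ eigvecs[:, :k]              # PC scores: projection onto the top-k directions

print(eigvals)
print(scores.var(axis=0, ddof=1))             # matches the top-k eigenvalues
```

The variance of the scores along each retained direction equals the corresponding eigenvalue, which is why the eigenvalues measure how much variation each PC captures. The `scores` array is what would be stored in a data frame, one column per component.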
2.8) Mention the business implication of using the Principal Component Analysis for this case study.
[Hint: Write Interpretations of the Principal Components Obtained]
The first Principal Component can be viewed as a measure of the variables Top10perc, Top25perc, Terminal
and Expend. These four criteria vary together: if one increases, the remaining ones tend to increase as well.
The second Principal Component can be viewed as a measure of the variables Enroll and F.Undergrad. These
two criteria vary together: if one increases, the other tends to increase as well. Thus, we may
conclude that the number of full-time undergraduate students increases as the number of enrolments increases.
The third Principal Component can be viewed as a measure of the variables Books and Personal. These two
criteria vary together. Thus, the Estimated Personal Spending of a student increases with an increase in
the Estimated book cost for a student.
The fourth Principal Component can be viewed as a measure of the variables PhD and Terminal. These two
criteria vary together. The percentage of faculty with a terminal degree increases with the percentage of
faculty with PhDs.
The fifth Principal Component can be viewed as a measure of the variable Room.Board. The colleges having
a high value on this component tend to have a high cost of room and board.
The sixth Principal Component is primarily a measure of the variable Books, i.e. the Estimated book cost
for a student.
The seventh Principal Component can be viewed as a measure of the variables Personal and Grad.Rate.
These two variables vary together. The Graduation rate increases with an increase in the Estimated Personal
Spending for a student.
