
1a.

Estimated Equation and Coefficients interpreted


Ŷ = β0 + β1X1

β0: The value of MBA_GPA when GMAT = 0.

β1: The change in MBA_GPA associated with a one-unit increase in GMAT.

Predicted MBA_GPA = Intercept + (GMAT coefficient)X1 = 1.79915 + 0.01106X1

β0: The predicted MBA_GPA will be 1.79915 when GMAT = 0.

β1: Each one-unit increase in GMAT is associated with a 0.01106 increase in predicted MBA_GPA.

1b. Model fit


We can use various methods such as the F test, R Square, and Adjusted R Square to evaluate how well the regression model fits the data.

The R Square is 40.52% and the Adjusted R Square is 39.84%. In some fields such as psychology this would be considered a good R Square, but in general it is not: the model explains only about 40% of the variation in MBA_GPA, so the fit is modest.

Here we can conduct either an individual t test or the global F test, which will give the same conclusion since the simple linear regression model contains only one slope coefficient.

Conducting the Global F test,


𝐻0 : 𝛽1 = 0
𝐻𝑎 : 𝛽1 ≠ 0

Here the F value is 59.26 and the p-value is <0.0001. Taking α=0.05, p-value < α, so we reject the null hypothesis and conclude that β1 ≠ 0.

1c. When the GMAT score is 650, the predicted MBA performance is 1.79915 + 0.01106(650) ≈ 8.988.
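The 1c prediction is a direct plug-in of the fitted coefficients. A minimal Python sketch, using the coefficients reported above (the function name is illustrative):

```python
# Prediction from the fitted simple linear regression (question 1c).
# Coefficients are taken from the estimated equation above.
intercept = 1.79915
slope = 0.01106  # change in predicted MBA_GPA per GMAT point

def predict_mba_gpa(gmat):
    """Predicted MBA_GPA for a given GMAT score."""
    return intercept + slope * gmat

print(predict_mba_gpa(650))  # ≈ 8.988
```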

2a. Estimated Equation and Coefficients interpreted


Ŷ = β0 + β1X1 + β2X2 + β3X3

β0: The predicted response when all x variables are 0.

β1, β2 and β3: The change in the predicted response associated with a one-unit increase in X1, X2 and X3 respectively, holding all other x variables constant.

Predicted MBA_GPA = Intercept + (UnderGPA)X1 + (GMAT)X2 + (Work)X3
Predicted MBA_GPA = 0.46609 + 0.06283X1 + 0.01128X2 + 0.09259X3

β0: The predicted response will be 0.46609 when all x variables are 0.

β1, β2 and β3: The change in the predicted response (0.06283 for β1, 0.01128 for β2, 0.09259 for β3) associated with a one-unit increase in X1, X2 and X3 individually, holding all other x variables constant.

2b. Model fit


We can use various methods such as the F test, R Square, and Adjusted R Square to evaluate how well the regression model fits the data. We can also use Cook's distance to identify outliers, VIF to check for multicollinearity, and a plot of residuals versus predicted values to check for patterns.

The R Square is 46.35% and the Adjusted R Square is 44.46%. This is still not a strong score: the model explains under half of the variation in the response, so the fit remains modest.

2c. Conducting the Global F test,


𝐻0 : β1 = β2 = β3 = 0
𝐻𝑎 : At least one of the β's ≠ 0

Here the F value is 24.48 and the p-value is <0.0001. Taking α=0.05, p-value < α, so we reject the null hypothesis and conclude that at least one of the β's ≠ 0.

Plotting residuals vs predicted values to check for any patterns, we see that the residuals appear randomly scattered, with no obvious pattern.

We can use VIF and Cook's distance to check for multicollinearity and influential observations, and see whether we get a better Adjusted R Square after removing influential observations and multicollinear terms.

Comparing Model 1, where we used simple linear regression, and Model 2, where we used multiple linear regression, we see that the Adjusted R Square is higher for Model 2, indicating a better fit to the data.
3.

Probabilities
a. Heart disease is present: 69/171=0.4035
b. Blood cholesterol is less than 200mg/100ml: 35/171=0.2047
c. Heart disease is absent given that blood cholesterol is between 200 and 225 mg/100ml: 26/102=0.2549
d. Heart disease is present given that blood cholesterol is greater than 250 mg/100ml: 41/69=0.5942
e. Blood cholesterol is greater than 275 mg/100ml given that heart disease is absent: 10/33=0.3030
Odds
a. Heart disease is present (as opposed to absent): 69/102=0.6765
b. Blood cholesterol is less than 226 mg/100ml: 70/101=0.6931
c. Heart disease is present given that blood cholesterol is greater than 275 mg/100ml: 23/46=0.5
d. Blood cholesterol is greater than 250mg/100ml given that heart disease is present: 41/26=1.5769
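The probability and odds answers above are connected by odds = p/(1 − p). A minimal Python sketch using the counts from part (a) of question 3 (variable names are illustrative):

```python
# Probability vs odds for the same event (heart disease present):
# 69 of the 171 patients have heart disease present.
present, total = 69, 171
absent = total - present  # 102

p = present / total   # probability = 69/171 ≈ 0.4035
odds = p / (1 - p)    # odds = 69/102 ≈ 0.6765

print(round(p, 4), round(odds, 4))
```

The same conversion applies to every conditional entry above: the odds denominator is the count of the complementary outcome, not the whole sample.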

4a.

4b. Estimated logistic regression equation

π(X)/(1 − π(X)) = exp(β0 + β1X1 + β2X2 + β3X3)

π(X)/(1 − π(X)) = exp(Intercept + Education·X1 + Experience·X2 + Sex·X3)

= exp(−11.4464 + 1.1549X1 + 0.9098X2 − 2.8018X3)

The regression coefficients represent changes in the log odds. Keeping all other explanatory variables constant, if we increase an explanatory variable by one unit, the log odds (logit) changes by its regression coefficient; for example, a one-unit increase in Education changes the logit by β1 (Education). If we change an explanatory variable by k units while keeping all other explanatory variables constant, the log odds change by k times its coefficient.

β0 = Intercept: exp(−11.4464) = 1.068788e-05. This means that for a female (Sex = 0) with 0 years of education and experience, the odds of being hired are extremely low, so the chance of not getting hired is very high.

β1 = Education: exp(1.1549) = 3.173706 indicates that a one-unit increase in X1 multiplies the odds of being hired by 3.17, i.e., increases the odds of being hired by 100(3.17 − 1) = 217%.

β2 = Experience: exp(0.9098) = 2.483826 indicates that a one-unit increase in X2 multiplies the odds of being hired by 2.48, i.e., increases the odds of being hired by 100(2.48 − 1) = 148%.

β3 = Sex: exp(−2.8018) = 0.0607007 indicates that a one-unit change in X3 multiplies the odds of being hired by 0.06, i.e., decreases the odds of being hired by 100(1 − 0.06) = 94%. Sex is a categorical variable which can be either 0 or 1.
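These interpretations all come from exponentiating the fitted log-odds coefficients. A short sketch, assuming only the coefficient values reported above:

```python
import math

# Log-odds coefficients from the fitted hiring model (question 4b).
coefs = {"Education": 1.1549, "Experience": 0.9098, "Sex": -2.8018}

for name, beta in coefs.items():
    odds_ratio = math.exp(beta)        # multiplicative change in odds
    pct = 100 * (odds_ratio - 1)       # percent change per one-unit increase
    print(f"{name}: odds ratio {odds_ratio:.4f} ({pct:+.0f}% change in odds)")
```

A negative coefficient gives an odds ratio below 1, which is why Sex corresponds to a 94% decrease rather than an increase.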

4c. Evaluation of model fit

I use the Hosmer-Lemeshow goodness-of-fit test, the ROC curve, and the c-statistic to assess the model fit.

c-stat: the probability that a randomly selected positive observation is ranked ahead of a randomly selected negative observation. Here the c-stat is 0.933, which is very high and indicates that the model discriminates well between the two outcomes.

We can also use the Hosmer-Lemeshow goodness-of-fit test to assess whether the model fits the overall data; however, the result can vary considerably as the number of groups changes, since the test is sensitive to that choice.

𝐻0 : The current model is a good fit to the data


𝐻𝑎 : The current model is not a good fit to the data

We assume α=0.05 and here the p-value = 0.5788, which is greater than 0.05. Since p-value > α, we fail to reject the null hypothesis; there is no evidence that the model is a poor fit to the data.
The ROC curve passes near (0,1), which represents perfect classification.

Therefore we can conclude that the model is adequate to predict the likelihood to be hired.

4d. To check whether gender is a significant predictor of hiring status, we can conduct an individual (Wald) test on its coefficient.
H0: β3 = 0
Ha: β3 ≠ 0

The p-value is 0.0313. We take α=0.05. Since p-value < α, we reject the null hypothesis. Therefore β3 ≠ 0 and there is evidence that gender is a significant predictor of hiring status.

4e. Predicted odds of being hired with 4 years of higher education and no experience, for a male and for a female.

Male (X3 = 1):
Odds = exp(β0 + β1X1 + β2X2 + β3X3)
= exp(−11.4464 + 1.1549(4) + 0 − 2.8018(1))
= exp(−9.6286)
= 6.581913e-05

Female (X3 = 0):
Odds = exp(β0 + β1X1 + β2X2 + β3X3)
= exp(−11.4464 + 1.1549(4) + 0 − 2.8018(0))
= exp(−6.8268)
= 0.001084322

Here we see that the odds of being hired for both males and females with no experience and 4 years of education are close to zero; however, the odds for a female are higher than for a male by a factor of exp(2.8018) ≈ 16.5, although both remain very small.
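The two odds above can be reproduced and compared directly; their ratio is exp(2.8018). A sketch, assuming the coding used in the computation above (X3 = 1 for male, 0 for female); the function name is illustrative:

```python
import math

def hire_odds(education, experience, sex):
    """Odds of being hired under the fitted model (question 4b);
    sex is 1 for male, 0 for female, matching the coding above."""
    logit = -11.4464 + 1.1549 * education + 0.9098 * experience - 2.8018 * sex
    return math.exp(logit)

male = hire_odds(4, 0, 1)    # ≈ 6.58e-05
female = hire_odds(4, 0, 0)  # ≈ 1.08e-03
print(female / male)         # equals exp(2.8018), about 16.5
```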
5a.

5b. Estimated logistic regression equation

π(X)/(1 − π(X)) = exp(β0 + β1X1)

π(X)/(1 − π(X)) = exp(Intercept + Balance·X1)

π(X)/(1 − π(X)) = exp(−10.6513 + 0.00550X1)

The regression coefficient represents the change in the log odds: if we increase X1 (Balance) by one unit, the log odds (logit) changes by β1 (Balance), the regression coefficient associated with the explanatory variable. If we change the explanatory variable by k units, the log odds change by kβ1.

β0 = Intercept: exp(−10.6513) = 2.367005e-05. This means that if a customer's balance is 0, the odds of defaulting are extremely low.

β1 = Balance: exp(0.00550) = 1.005515 indicates that a one-unit increase in X1 multiplies the odds of default by about 1.0055, i.e., increases the odds of default by roughly 0.55% per unit of balance.

5c. Predicted Probabilities vs Balance


This is an S-shaped curve; the predicted value can only be between 0 and 1. As balance gets larger, the predicted probability approaches 1, and as balance gets smaller, it approaches 0.

5d. The values of P̂ for balance between 1537 and 2336

Here we see that the predicted probability increases from 0.1 to 0.9 over the observed range of balance from 1537 to 2336.
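This range can be reproduced from the fitted equation by converting log odds to probability with p = exp(logit)/(1 + exp(logit)). A sketch, assuming the coefficients reported in 5b (the function name is illustrative):

```python
import math

def default_prob(balance):
    """Predicted probability of default from the fitted model (question 5b)."""
    logit = -10.6513 + 0.00550 * balance
    return math.exp(logit) / (1 + math.exp(logit))

print(round(default_prob(1537), 2))  # ≈ 0.10
print(round(default_prob(2336), 2))  # ≈ 0.90
```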

5e. I evaluated the adequacy of the model to predict the likelihood of default using the Hosmer-Lemeshow goodness-of-fit test and the ROC curve.
Here the ROC curve passes quite close to (0,1), approaching perfect classification.

𝐻0 : The current model is a good fit to the data


𝐻𝑎 : The current model is not a good fit to the data

The p-value is 0.8411. We take α=0.05. Since p-value > α, we fail to reject the null hypothesis; there is no evidence that the model is a poor fit to the data.

If the model had more parameters, we could use other methods, such as a likelihood ratio test, to compare model fits.

5f. Summary of the predictive power of the model

c-stat: the probability that a randomly selected positive observation is ranked ahead of a randomly selected negative observation. Here the c-stat is 0.948, which is very high and indicates that the model discriminates well between defaulters and non-defaulters.
This high c-stat, together with the good fit indicated by the Hosmer-Lemeshow test, indicates that the model has good predictive power.

5g. The code and output in SAS are

We can conduct an individual (Wald) test for each of the predictors in this model to check whether it is significant.

To check whether student is a significant predictor of whether a person will default, we test:
H0: β1 = 0
Ha: β1 ≠ 0
The p-value is 0.0062. We take α=0.05. Since p-value < α, we reject the null hypothesis. Therefore β1 ≠ 0 and there is evidence that student is a significant predictor of default status.
To check whether balance is a significant predictor of whether a person will default, we test:
H0: β2 = 0
Ha: β2 ≠ 0
The p-value is <0.0001. We take α=0.05. Since p-value < α, we reject the null hypothesis. Therefore β2 ≠ 0 and there is evidence that balance is a significant predictor of default status.

To check whether income is a significant predictor of whether a person will default, we test:
H0: β3 = 0
Ha: β3 ≠ 0
The p-value is 0.7115. We take α=0.05. Since p-value > α, we fail to reject the null hypothesis. There is insufficient evidence to say that income is a significant predictor of default status.

Therefore, only student and balance are statistically significant predictors at α=0.05 level of significance.

5h.

Since in the previous question we found that income is a statistically insignificant predictor, my final model includes
only student and balance as the independent variables.

Estimated logistic regression equation

π(X)/(1 − π(X)) = exp(β0 + β1X1 + β2X2)

π(X)/(1 − π(X)) = exp(Intercept + Student·X1 + Balance·X2) = exp(−11.1068 + 0.3574X1 + 0.00574X2)

For the intercept, exp(−11.1068) = 1.500991e-05: for a non-student (Student = 0) with a balance of 0, the odds of defaulting are extremely low.

The regression coefficient β1 (Student): exp(0.3574) = 1.43 indicates that if a customer is a student, the odds of default are multiplied by 1.43, i.e., increased by 100(1.43 − 1) = 43%.

This is a categorical variable which is either 0 or 1, indicating the status of the person, where 1 = student and 0 = not a student.

The regression coefficient β2 (Balance): exp(0.00574) = 1.0058 indicates that a one-unit increase in X2 multiplies the odds of default by about 1.0058, i.e., increases the odds of default by roughly 0.58% per unit of balance.
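The final model can be applied in the same way. A sketch comparing a student and a non-student at a hypothetical balance of 1500 (an illustrative value, not taken from the data); the function name is illustrative:

```python
import math

def default_odds(student, balance):
    """Odds of default under the final model (question 5h); student is 1 or 0."""
    return math.exp(-11.1068 + 0.3574 * student + 0.00574 * balance)

# Hypothetical comparison at a balance of 1500 (illustrative value):
odds_student = default_odds(1, 1500)
odds_non = default_odds(0, 1500)
print(odds_student / odds_non)  # equals exp(0.3574), about 1.43
```

At any fixed balance, the ratio of the two odds is exactly exp(0.3574), the student odds ratio discussed above.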

5i
To compare the predictive power of the model with balance alone and the model with balance and student as independent variables, we can compare the two using the c-stat and the Hosmer-Lemeshow goodness-of-fit test.

With balance and student as predictors:

With balance as a predictor:

Although both p-values are higher than 0.05, the model with student and balance as predictors has a slightly higher Hosmer-Lemeshow p-value than the model with only balance as a predictor. The c-stat is also marginally better for the two-predictor model.

Thus we can conclude that the final model is only slightly better than the initial model with just one predictor.
