APFFB Online 8 Advanced Regression
Regression Statistics
Multiple R         0.824518
R Square           0.67983
Adjusted R Square  0.673017
Standard Error     19.11503
Observations       49

ANOVA
             df   SS         MS         F          Significance F
Regression    1   36464.2    36464.2    99.79683   3.31E-13
Residual     47   17173.07   365.3844
Total        48   53637.27
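As a sanity check, every figure in the "Regression Statistics" block can be recomputed from the ANOVA sums of squares and degrees of freedom alone. A minimal sketch (all values copied from the output above):

```python
# Cross-check: recover the summary statistics from the ANOVA table.
ss_reg, ss_res, ss_tot = 36464.2, 17173.07, 53637.27
df_reg, df_res = 1, 47

r_square = ss_reg / ss_tot        # SSR / SST  -> 0.67983
ms_reg = ss_reg / df_reg          # MS regression
ms_res = ss_res / df_res          # MS residual -> 365.3844
f_stat = ms_reg / ms_res          # F = MS_reg / MS_res -> 99.79683
std_error = ms_res ** 0.5         # "Standard Error" = sqrt(MS_res) -> 19.11503

print(r_square, f_stat, std_error)
```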
Regression Statistics
Multiple R         0.895186
R Square           0.801359
Adjusted R Square  0.764113
Standard Error     2.479981
Observations       20

ANOVA
             df   SS         MS         F          Significance F
Regression    3   396.9846   132.3282   21.51571   7.34E-06
Residual     16   98.40489   6.150306
Total        19   495.3895
           Coefficients  Standard Error  t Stat     P-value   Lower 95%  Upper 95%
Intercept  117.0847      99.7824          1.1734    0.257808  -94.4446   328.6139
Triceps      4.334092     3.015511        1.437266  0.169911   -2.05851   10.72669
Thigh       -2.85685      2.582015       -1.10644   0.284894   -8.33048    2.61678
Midarm      -2.18606      1.595499       -1.37014   0.189563   -5.56837    1.196247
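The t Stat, P-value, and confidence limits in each row follow directly from the coefficient and its standard error. A sketch using scipy.stats for the Triceps row (the 16 residual degrees of freedom come from the ANOVA table above):

```python
from scipy import stats

# Reproduce the Triceps row from coefficient and standard error alone.
coef, se, df_res = 4.334092, 3.015511, 16

t_stat = coef / se                                # t Stat
p_value = 2 * stats.t.sf(abs(t_stat), df_res)     # two-sided test of beta = 0
t_crit = stats.t.ppf(0.975, df_res)               # ~2.12 for df = 16
ci = (coef - t_crit * se, coef + t_crit * se)     # Lower 95%, Upper 95%

print(t_stat, p_value, ci)
```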
Regression Statistics
Multiple R         0.602487
R Square           0.362991
Adjusted R Square  0.345458
Observations       113

ANOVA
             df   SS   MS   F   Significance F

           Coefficients  Standard Error  t Stat     P-value   Lower 95%  Upper 95%
Intercept   1.001162     1.314724        0.7615     0.448003  -1.60458   3.606902
Stay        0.308181     0.059396        5.188611   9.88E-07   0.190461  0.425901
Age        -0.02301      0.023516       -0.97829    0.330098  -0.06961   0.023602
Xray        0.019661     0.005759        3.414211   0.000899   0.008248  0.031074
Interpretation?
• The p-value for testing the coefficient that
multiplies Age is 0.330. Thus we cannot reject the null
hypothesis H0: β2 = 0. The variable Age is not a useful
predictor within this model, which includes Stay and Xray.
• For the variables Stay and Xray, the p-values for
testing their coefficients are at a statistically significant
level, so both are useful predictors of infection risk (within
the context of this model!).
• We usually don’t worry about the p-value for the Intercept. It
has to do with the “intercept” of the model and seldom
has any practical meaning. It also doesn’t give
information about how changing an x-variable might
change y-values.
Other Types of Regression
• Logistic Regression
– Dependent variable categorical
• Polynomial Regression
– Independent variable(s) appear in higher powers
• Stepwise Regression
– Algorithm to decide the order of inclusion of
independent variables (forward & backward)
• Ridge, Lasso & ElasticNet Regression
– Penalized methods (trade-off between bias & variance)
What happens when the DV is categorical and the IVs
are quantitative/categorical?
• Disease Outbreak Example:
– Ref: Applied Linear Statistical Models (4th ed) - Neter et al (Irwin)
– Data set available on the Penn State University course web
(https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/DiseaseOutbreak.txt)
– To investigate the epidemic outbreak of a disease spread by
mosquitoes
– Individuals randomly sampled within two sectors in a city
– Binary response variable: Y = 1 if disease present, Y = 0 if not
– 3 predictors (as risk factors): age, socioeconomic status (SES) & sector
within city
• Age (X1) is quantitative;
• SES is categorical with 3 levels (Upper, Middle, Lower) represented
by 2 indicators (X2 and X3): Upper (0, 0); Middle (1, 0) and Lower
(0, 1)
• City Sector is categorical; X4 = 0 for sector 1, X4 = 1 for sector 2
Data Set: 98 data points
Case  Age (X1)  Middle (X2)  Lower (X3)  Sector (X4)  Disease (Y)  Fitted Value
  1      33         0            0           0            0          .209
  2      35         0            0           0            0          .219
  3       6         0            0           0            0          .106
  4      60         0            0           0            0          .371
  5      18         0            1           0            1          .111
  6      26         0            1           0            0          .136
  7       6         0            1           0            0          ..
  8      31         1            0           0            1          ..
  …       …         …            …           …            …           …
 97      11         0            1           0            0          ..
 98      35         0            1           0            0          .171
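If the raw data carry SES as a single text column, the two indicator columns above can be derived with pandas. The small frame here is an illustrative stand-in, not the actual 98-case data set:

```python
import pandas as pd

# Derive the Middle/Lower indicators from a single SES column, using
# the coding above (Upper is the baseline level, coded (0, 0)).
df = pd.DataFrame({
    "Age":    [33, 35, 18, 31],
    "SES":    ["Upper", "Upper", "Lower", "Middle"],
    "Sector": [1, 1, 2, 1],
})
ind = pd.get_dummies(df["SES"]).astype(int)     # one 0/1 column per SES level
df["Middle"] = ind["Middle"]
df["Lower"] = ind["Lower"]
df["X4"] = (df["Sector"] == 2).astype(int)      # 0 for sector 1, 1 for sector 2
print(df)
```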
Usual regression fails here!
• Having a categorical outcome variable (Y) violates the
assumptions of ordinary (normal-error) linear regression
• The relationship between Y and X1 – X4 is not linear
• Predictors X1 – X4 are not necessarily normally distributed
• The idea is still to combine the independent predictors X1 – X4
using coefficients, as a + b1X1 + b2X2 + … + b4X4 + e, to
analyse the dependent variable Y (which is categorical, taking
only the 2 values 0 and 1)
• What we want to predict, knowing the X’s and the coefficients,
is not a numerical value of the outcome variable Y, but
• The probability that it is 1, namely p = P(Y = 1), rather than
0, i.e., 1 − p = P(Y = 0);
– basic idea of Logistic Regression
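Concretely, logistic regression passes the linear combination through the logistic (sigmoid) function, so the output is a probability in (0, 1) rather than an unbounded number. A minimal sketch (the coefficients below are made up for illustration, not the fitted disease model):

```python
import math

# Core idea of logistic regression: the linear combination
# a + b1*x1 + ... + bk*xk is mapped through the sigmoid so the
# result is a probability p = P(Y = 1).
def predict_prob(coefs, intercept, x):
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative coefficients only (not the fitted model):
p = predict_prob([0.03, 0.3, 0.4, 1.6], -2.3, [33, 0, 0, 0])
assert 0.0 < p < 1.0      # always a valid probability
```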
Logistic Regression
Fitting of Logistic Regression using Minitab
• Select Stat > Regression > Binary Logistic Regression > Fit
Binary Logistic Model.
• Select “Disease” for the Response (the response event for
disease is 1 for this data).
• Select the predictor Age as Continuous predictor.
• Select other predictors as Qualitative predictors.
• Click Options and choose Deviance or Pearson residuals for
diagnostic plots.
• Click Graphs and select "Residuals versus order."
• Click Results and change "Display of results" to "Expanded
tables."
• Click Storage and select "Coefficients."
Result: Deviance Table
Source       DF   Seq Dev   Contribution   Adj Dev   Adj Mean   Chi-Square   P-Value
Regression    4    21.263      17.38%       21.263     5.3159      21.26      0.000
Age           1     7.405       6.05%        5.150     5.1495       5.15      0.023
Middle        1     1.804       1.47%        0.467     0.4669       0.47      0.494
Lower         1     1.606       1.31%        0.256     0.2560       0.26      0.613
Sector        1    10.448       8.54%       10.448    10.4481      10.45      0.001
Error        93   101.054      82.62%      101.054     1.0866
Total        97   122.318     100.00%
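Each P-Value in the table is obtained by referring the term's Adj Dev to a chi-square distribution with the term's DF. A quick cross-check with scipy (values copied from the table above):

```python
from scipy import stats

# Recompute the P-Value column from Adj Dev and DF.
pvals = {}
for term, adj_dev, df in [("Regression", 21.263, 4), ("Age", 5.150, 1),
                          ("Middle", 0.467, 1), ("Lower", 0.256, 1),
                          ("Sector", 10.448, 1)]:
    pvals[term] = stats.chi2.sf(adj_dev, df)   # upper-tail chi-square prob.

print({k: round(v, 3) for k, v in pvals.items()})
```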
Fitted Model
Coefficients
Bias-Variance Trade-Off:
• Y = β0 + β1X1 + β2X2 + . . . + βkXk + ε = Xβ + ε
• Least Squares Estimate of β is obtained by minimizing the
sum of squared errors, and the estimated regression is:
  Ŷ = b0 + b1X1 + b2X2 + . . . + bkXk
• The estimates of the parameters β are unbiased, but
their variances can be quite high when
– The predictor variables are highly correlated with each other;
– There are many predictors.
• Can we reduce the variance at the cost of introducing
some bias?
Ridge, Lasso & ElasticNet Regression
• https://www.datacamp.com/community/tutorials/tutorial-ridge-lasso-elastic-net
• https://newonlinecourses.science.psu.edu/stat501/node/374/
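As a concrete illustration of the trade-off question above, the sketch below fits OLS and the three penalized methods to synthetic data with two highly correlated predictors; the alpha values are illustrative, not tuned:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

# Synthetic data: two nearly collinear predictors, true model y = 3*x1 + noise.
# OLS stays unbiased but its coefficients are unstable; the penalized fits
# shrink coefficients toward zero, trading a little bias for lower variance.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)          # x2 nearly duplicates x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(size=200)

coefs = {}
for model in [LinearRegression(), Ridge(alpha=1.0),
              Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)]:
    model.fit(X, y)
    coefs[type(model).__name__] = model.coef_
    print(type(model).__name__, np.round(model.coef_, 2))
```

The individual penalized coefficients differ from OLS, but their sum still estimates the combined effect (about 3) of the correlated pair.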