
Advanced Programme in FinTech and

Financial Blockchain

Data Analysis and Interpretation


(More on Regression)

Prof. Saibal Chattopadhyay


IIM Calcutta
What is Regression?
Main question: Why do we observe what we observe?
– Data on several factors/variables may have been observed
– Which factors matter most? Which can we ignore?
– How do they interact with each other?
– What is a Dependent Variable? The main factor of study
– What are Independent Variables? Factors we suspect have an
impact on our dependent variable
– Regression Analysis? A way of mathematically sorting out which of
these independent variables actually have an impact
– Simple Linear Regression (SLR): One dependent Y, one
independent X
– Multiple Linear Regression (MLR): One dependent Y, k independent
X1, X2, …, Xk
SLR Example: Mortality Versus Latitude
• Source URL: https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/skincancer/index.txt
• The dataset has n = 49 observations: the 48 contiguous states plus the
District of Columbia in the United States (recorded in 1950; w/o Alaska & Hawaii)
• Variable Y = mortality due to skin cancer (#deaths per 10 million);
• X = latitude (degrees North) at the centre of each state
• The higher the latitude, the less exposure to the harmful rays of the Sun,
and the lower the risk of death from skin cancer
• Would expect a negative relationship between latitude and mortality
• Questions?
– Scatter Plot?
– Fitted Linear Regression?
– How effective is the regression in explaining the data?
– How good is it to predict mortality knowing latitude?
Data
State Lat Mort Ocean Long
Alabama 33.0 219 1 87.0
Arizona 34.5 160 0 112.0
Arkansas 35.0 170 0 92.5
California 37.5 182 1 119.5
Colorado 39.0 149 0 105.5
Connecticut 41.8 159 1 72.8
Delaware 39.0 200 1 75.5
Wash,D.C. 39.0 177 0 77.0
Florida 28.0 197 1 82.0
Georgia 33.0 214 1 83.5
Idaho 44.5 116 0 114.0
Illinois 40.0 124 0 89.5
Indiana 40.2 128 0 86.2
Iowa 42.2 128 0 93.8
Kansas 38.5 166 0 98.5
Kentucky 37.8 147 0 85.0
Louisiana 31.2 190 1 91.8
Maine 45.2 117 1 69.0
Maryland 39.0 162 1 76.5
Massachusetts 42.2 143 1 71.8
Michigan 43.5 117 0 84.5
Minnesota 46.0 116 0 94.5
Mississippi 32.8 207 1 90.0
Missouri 38.5 131 0 92.0
Montana 47.0 109 0 110.5
Nebraska 41.5 122 0 99.5
Nevada 39.0 191 0 117.0
NewHampshire 43.8 129 1 71.5
NewJersey 40.2 159 1 74.5
NewMexico 35.0 141 0 106.0
NewYork 43.0 152 1 75.5
NorthCarolina 35.5 199 1 79.5
NorthDakota 47.5 115 0 100.5
Ohio 40.2 131 0 82.8
Oklahoma 35.5 182 0 97.2
Oregon 44.0 136 1 120.5
Pennsylvania 40.8 132 0 77.8
RhodeIsland 41.8 137 1 71.5
SouthCarolina 33.8 178 1 81.0
SouthDakota 44.8 86 0 100.0
Tennessee 36.0 186 0 86.2
Texas 31.5 229 1 98.0
Utah 39.5 142 0 111.5
Vermont 44.0 153 1 72.5
Virginia 37.5 166 1 78.5
Washington 47.5 117 1 121.0
WestVirginia 38.8 136 0 80.8
Wisconsin 44.5 110 0 90.2
Wyoming 43.0 134 0 107.5
Scatter Plot & the fitted linear regression

• Shows a generally linear relationship, on average, with a negative slope
• As latitude increases, mortality decreases
SLR: Latitude and Mortality

• Least squares fitted line: Y^ = 389.2 - 5.978 X
• Interpretation of slope? Every 1-degree increase in latitude is associated
with a decrease in skin cancer mortality of about 5.978 deaths per 10 million
• s = 19.1150? The residual standard error: the typical size of the difference
between observed and predicted Y values (the square root of the error mean
square, √365.4 ≈ 19.115)
• R-Sq and R-Sq(adj) ?
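A minimal R sketch of this fit (the slides use Minitab and Excel; the file name
skincancer.txt and the use of read.table here are assumptions, with column
names taken from the data listing above):

  # Read the skin-cancer data (columns: State, Lat, Mort, Ocean, Long)
  skin <- read.table("skincancer.txt", header = TRUE)
  # Fit the simple linear regression of mortality on latitude
  fit <- lm(Mort ~ Lat, data = skin)
  summary(fit)   # slope, intercept, s, R-sq, R-sq(adj)
  # Scatter plot with the fitted line overlaid
  plot(skin$Lat, skin$Mort, xlab = "Latitude (degrees N)",
       ylab = "Mortality (deaths per 10 million)")
  abline(fit)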
R-square and Adjusted R-square

• R-sq = proportion of total variation in Y that is explained by the model (SLR)
• Latitude explains 67.98% of the variation in skin cancer mortality
• Adjusted R-sq: uses n (number of observations) and p (number of
independent variables) to penalize R-sq for model size, helping judge
how many variables are ‘just right’ in the regression:
R-sq(adj) = 1 – (1 – R-sq)·(n - 1)/(n - p - 1)
• Can be negative as well!
• Used in multiple linear regression models with many
independent variables to choose the right ones
• In SLR, meaningless since there is only one independent variable
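A quick check against the SLR output (n = 49, p = 1, R-sq = 0.67983):
R-sq(adj) = 1 – (1 – 0.67983)·(49 - 1)/(49 - 1 - 1) = 1 – 0.32017·(48/47) ≈ 0.67302,
matching the Excel value 0.673017 reported below.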
Regression Analysis Output
• ANOVA Table?
Source                  DF   SS      MS        F      P-value
Regression (Latitude)    1   36464   36464.2   99.8   0.000
Error                   47   17173     365.4
Total                   48   53637
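• Note: F is the ratio of the regression mean square to the error mean
square: F = 36464.2/365.4 ≈ 99.8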

• Ho: slope = 0 against Ha: slope ≠ 0
• Ho rejected at the 5% (also 1%) level
• The linear regression is significant in explaining mortality
Output using Excel?
SUMMARY OUTPUT

Regression Statistics
Multiple R 0.824518
R Square 0.67983
Adjusted R Square 0.673017
Standard Error 19.11503
Observations 49

ANOVA
df SS MS F Significance F
Regression 1 36464.2 36464.2 99.79683 3.31E-13
Residual 47 17173.07 365.3844
Total 48 53637.27

• 67.98% of the variation in skin cancer mortality is explained by latitude
• The SLR is significant at the 5% (even 1%) level
Multiple Linear Regression Example: Physiological
Measurements Data
• https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/bodyfat/index.txt
• Data from n = 20 individuals
• To assess how body fat is explained by other physiological
measurements such as triceps skinfold thickness, thigh
circumference, and mid-arm circumference
• The variables used are:
– Y = body fat,
– x1 = triceps skinfold thickness,
– x2 = thigh circumference,
– x3 = mid-arm circumference
Data
Triceps Thigh Midarm Bodyfat
19.5 43.1 29.1 11.9
24.7 49.8 28.2 22.8
30.7 51.9 37 18.7
29.8 54.3 31.1 20.1
19.1 42.2 30.9 12.9
25.6 53.9 23.7 21.7
31.4 58.5 27.6 27.1
27.9 52.1 30.6 25.4
22.1 49.9 23.2 21.3
25.5 53.5 24.8 19.3
31.1 56.6 30 25.4
30.4 56.7 28.3 27.2
18.7 46.5 23 11.7
19.7 44.2 28.6 17.8
14.6 42.7 21.3 12.8
29.5 54.4 30.1 23.9
27.7 55.3 25.7 22.6
30.2 58.6 24.6 25.4
22.7 48.2 27.1 14.8
25.2 51 27.5 21.1
MLR of Y on X1, X2 and X3
Output using Excel?
SUMMARY OUTPUT

Regression Statistics
Multiple R 0.895186
R Square 0.801359
Adjusted R Square 0.764113
Standard Error 2.479981
Observations 20

ANOVA
df SS MS F Significance F
Regression 3 396.9846 132.3282 21.51571 7.34E-06
Residual 16 98.40489 6.150306
Total 19 495.3895

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 117.0847 99.7824 1.1734 0.257808 -94.4446 328.6139 -94.4446 328.6139
Triceps 4.334092 3.015511 1.437266 0.169911 -2.05851 10.72669 -2.05851 10.72669
Thigh -2.85685 2.582015 -1.10644 0.284894 -8.33048 2.61678 -8.33048 2.61678
Midarm -2.18606 1.595499 -1.37014 0.189563 -5.56837 1.196247 -5.56837 1.196247

• Adjusted R-sq = 76.41%: the percentage of variation in Y explained by the
MLR of Y on X1, X2 and X3
• Note: the overall F-test is highly significant, yet no individual predictor
has p < 0.05, a classic symptom of highly correlated predictors
(multicollinearity); see the penalized regression methods later
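A hedged R sketch reproducing this fit (assuming the body-fat data are read
into a data frame with columns named as in the listing above; the file name
is an assumption):

  fat <- read.table("bodyfat.txt", header = TRUE)
  # MLR of body fat on the three physiological measurements
  fit <- lm(Bodyfat ~ Triceps + Thigh + Midarm, data = fat)
  summary(fit)   # coefficient table, R-sq and adjusted R-sq as above
  # Correlations among predictors help diagnose the multicollinearity
  cor(fat[, c("Triceps", "Thigh", "Midarm")])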
Another Example of MLR
• Data source: Applied Regression Models (4th edition),
Kutner, Neter, and Nachtsheim
• Data from n = 113 hospitals in the United States
• Used to assess factors related to the likelihood
that a hospital patient acquires an infection
while hospitalized.
• The variables are:
– Y = infection risk,
– x1 = average length of patient stay,
– x2 = average patient age,
– x3 = measure of how many x-rays are given in the hospital
MLR
• Fitted Regression of Y on X1, X2 and X3
and the significance of the predictors X1
(Stay), X2 (Age) and X3 (X-ray) on the
infection risk:
Using Excel?
SUMMARY OUTPUT

Regression Statistics
Multiple R 0.602487
R Square 0.362991
Adjusted R Square 0.345458

Standard Error 1.084845

Observations 113

ANOVA
df SS MS F Significance F
Regression 3 73.09897 24.36632 20.70402 1.09E-10
Residual 109 128.2809 1.176889
Total 112 201.3798

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 1.001162 1.314724 0.7615 0.448003 -1.60458 3.606902 -1.60458 3.606902
Stay 0.308181 0.059396 5.188611 9.88E-07 0.190461 0.425901 0.190461 0.425901
Age -0.02301 0.023516 -0.97829 0.330098 -0.06961 0.023602 -0.06961 0.023602
Xray 0.019661 0.005759 3.414211 0.000899 0.008248 0.031074 0.008248 0.031074
Interpretation?
• The p-value for testing the coefficient that
multiplies Age is 0.330. Thus we cannot reject the null
hypothesis H0: β2 = 0. The variable Age is not a useful
predictor within this model that includes Stay and Xrays.
• For the variables Stay and X-rays, the p-values for
testing their coefficients are at a statistically significant
level so both are useful predictors of infection risk (within
the context of this model!).
• We usually don’t worry about the p-value for Constant. It
has to do with the “intercept” of the model and seldom
has any practical meaning. It also doesn’t give
information about how changing an x-variable might
change y-values.
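Following this interpretation, one might drop Age and refit. A sketch in R
(the file name and the column names InfctRsk, Stay, Age, Xray are
assumptions based on the PSU dataset):

  hosp <- read.table("hospital.txt", header = TRUE)
  full    <- lm(InfctRsk ~ Stay + Age + Xray, data = hosp)
  reduced <- update(full, . ~ . - Age)  # drop the non-significant predictor
  anova(reduced, full)  # partial F-test: does Age add explanatory power?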
Other Types of Regression

• Logistic Regression
– Dependent variable categorical
• Polynomial Regression
– Independent variable(s) appear in higher powers
• Stepwise Regression
– Algorithm to decide the order of inclusion of
independent variables (forward & backward)
• Ridge, Lasso & ElasticNet Regression
– Penalized methods (trade-off between bias & variance)
What happens when DV is categorical and IVs
are quantitative/categorical?
• Disease Outbreak Example:
– Ref: Applied Linear Statistical Models (4th ed) - Neter et al (Irwin)
– Data set available in Penn State University course-web
(https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/DiseaseOutbreak.txt)
– To investigate the epidemic outbreak of a disease spread by
mosquitoes
– Individuals randomly sampled within two sectors in a city
– Binary response variable: Y = 1 if disease present, Y = 0 if not
– 3 predictors (as risk factors): age, socioeconomic status(SES) & sector
within city
• Age (X1) is quantitative;
• SES is categorical with 3 levels (Upper, Middle, Lower) represented
by 2 indicators (X2 and X3): Upper (0, 0); Middle (1, 0) and Lower
(0, 1)
• City Sector is categorical; X4 = 0 for sector 1, X4 = 1 for sector 2
Data Set: 98 data points

Case Age (X1) Middle (X2) Lower (X3) Sector (X4) Disease (Y) Fitted Value
1 33 0 0 0 0 .209
2 35 0 0 0 0 .219
3 6 0 0 0 0 .106
4 60 0 0 0 0 .371
5 18 0 1 0 1 .111
6 26 0 1 0 0 .136
7 6 0 1 0 0 ..
8 31 1 0 0 1 ..
… … … … … … …
97 11 0 1 0 0 ..
98 35 0 1 0 0 .171
Usual regression fails here!
• Having a categorical outcome variable (Y) violates the
assumption of linearity in normal regression
• The relationship between Y and X1 – X4 is not linear
• Predictors X1 – X4 are not necessarily normally distributed
• The idea is still to combine independent predictors X1 – X4
using coefficients as a + b1X1 + b2X2 +…+ b4X4 + e to
analyse the dependent variable Y (which is categorical, taking
only 2 values 0 and 1)
• What we want to predict from the X’s and the coefficients is not a
numerical value of the outcome variable Y, but
• the probability that Y is 1, namely p = P(Y=1), rather than the
probability that it is 0, i.e., 1 - p = P(Y=0);
– the basic idea of Logistic Regression
Logistic Regression
Fitting of Logistic Regression using Minitab

• Select Stat > Regression > Binary Logistic Regression > Fit
Binary Logistic Model.
• Select “Disease" for the Response (the response event for
disease is 1 for this data).
• Select the predictor Age as Continuous predictor.
• Select other predictors as Qualitative predictors.
• Click Options and choose Deviance or Pearson residuals for
diagnostic plots.
• Click Graphs and select "Residuals versus order."
• Click Results and change "Display of results" to "Expanded
tables."
• Click Storage and select "Coefficients."
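The same model can be fitted in R with glm (a sketch; the file name is an
assumption, and the column names follow the data table shown earlier):

  dis <- read.table("DiseaseOutbreak.txt", header = TRUE)
  # Binary logistic regression: family = binomial gives the logit link
  fit <- glm(Disease ~ Age + Middle + Lower + Sector,
             family = binomial, data = dis)
  summary(fit)       # coefficients comparable to the Minitab table below
  head(fitted(fit))  # fitted probabilities, e.g. about .209 for case 1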
Result: Deviance Table
Source DF Seq Dev Contribution Adj Dev Adj Mean Chi-Square P-Value
Regression 4 21.263 17.38% 21.263 5.3159 21.26 0.000
Age 1 7.405 6.05% 5.150 5.1495 5.15 0.023
Middle 1 1.804 1.47% 0.467 0.4669 0.47 0.494
Lower 1 1.606 1.31% 0.256 0.2560 0.26 0.613
Sector 1 10.448 8.54% 10.448 10.4481 10.45 0.001
Error 93 101.054 82.62% 101.054 1.0866
Total 97 122.318 100.00%
Fitted Model
Coefficients

Term Coef SE Coef 95% CI Z-Value P-Value
Constant -2.313 0.643 (-3.572, -1.053) -3.60 0.000
Age 0.0298 0.0135 (0.0033, 0.0562) 2.20 0.028
Middle 1 0.409 0.599 (-0.765, 1.583) 0.68 0.495
Lower 1 -0.305 0.604 (-1.489, 0.879) -0.51 0.613
Sector 1 1.575 0.502 (0.592, 2.558) 3.14 0.002

Estimated logistic response function:
p^ = 1/{1 + exp(2.313 - 0.0298X1 - 0.409X2 + 0.305X3 - 1.575X4)}
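As a check, take Case 1 from the data (Age = 33, Upper SES so X2 = X3 = 0,
Sector 1 so X4 = 0):
p^ = 1/{1 + exp(2.313 - 0.0298·33)} = 1/{1 + exp(1.330)} ≈ 0.209,
matching the fitted value .209 listed for Case 1.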
Ridge, Lasso & ElasticNet Regression

Bias-Variance Trade-Off:
• Model: Y = β0 + β1X1 + β2X2 + … + βkXk + ε, i.e., Y = Xβ + ε
• The least squares estimate of β is obtained by minimizing the sum of
squared errors, and the estimated regression is:
Y^ = b0 + b1X1 + b2X2 + … + bkXk
• The estimates of the parameters β are unbiased, but
their variances can be quite high when
– The predictor variables are highly correlated with each other;
– There are many predictors.
• Can we reduce the variance at the cost of introducing
some bias?
Ridge, Lasso & ElasticNet Regression

• Recall: a good model should ensure
– parameter estimates β^ close to the true β
– the fitted model Y^ = b0 + b1X1 + b2X2 + … + bkXk
should fit future observations well
• Regularization: introduce some bias in
the estimated parameters so that the
variances are controlled better
• Idea of penalized regression
Depicting the Bias-Variance tradeoff
(Source: researchgate.net)
Ridge Regression
• Minimize the penalized loss: Σ(yi - y^i)² + λ·Σβj²
(an L2 penalty on the coefficients)
• Shrinks the coefficients toward zero, but not exactly to zero
Lasso Regression
• Minimize the penalized loss: Σ(yi - y^i)² + λ·Σ|βj|
(an L1 penalty on the coefficients)
• Can shrink some coefficients exactly to zero, performing variable selection
Elastic Net Regression
• Combine the penalties of ridge regression
and lasso to get the best of both worlds
• Minimize the loss function:
Σ(yi - y^i)² + λ·[(1 - α)/2 · Σβj² + α·Σ|βj|]
where α: mixing parameter between Ridge (α = 0) and Lasso (α = 1)
• Can use the glmnet package in R, or caret
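A minimal glmnet sketch (the package is the one suggested above; reusing the
body-fat data frame from earlier is an illustrative assumption):

  library(glmnet)
  x <- as.matrix(fat[, c("Triceps", "Thigh", "Midarm")])  # predictor matrix
  y <- fat$Bodyfat                                        # response
  # alpha = 0.5 mixes the ridge and lasso penalties;
  # alpha = 0 is pure ridge, alpha = 1 is pure lasso
  cvfit <- cv.glmnet(x, y, alpha = 0.5)   # cross-validate over lambda
  coef(cvfit, s = "lambda.min")           # coefficients at the best lambda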
References

• https://www.datacamp.com/community/tutorials/tutorial-ridge-lasso-elastic-net
• https://newonlinecourses.science.psu.edu/stat501/node/374/
