JTMS-03 Applied Statistics with R

Simple linear ordinary least squares regression

Dr. Georgi Dragolov

19.04.2021

Jacobs University Bremen


Linear regression

• Powerful tool with wide application
– Explore associations among variables
– Predict scores on the dependent variable based on the independent variable(s) in the regression model

• Types of linear regression models
– Simple linear regression: one dependent variable regressed on one independent variable
– Multiple linear regression: one dependent variable regressed simultaneously on several independent variables
– Multivariate multiple linear regression: several dependent variables simultaneously regressed on several independent variables

Rationale of simple linear regression

• The method fits a regression model on the basis of the association between independent variable X and dependent variable Y, such that Y is regressed on X

• An observed score yi on dependent variable Y is decomposed into:
– a predicted score ŷi for the dependent variable on the basis of independent variable X
– and a deviation εi (error term, residual) of the actually observed score yi from the predicted score ŷi

yi = ŷi + εi

Rationale of simple linear regression

• The predicted scores ŷ are located along the line of best fit (also
known as regression line)
– Recall the straight line from the correlational scatter plots that
depicts the correlation between the variables involved
– Two points are needed to fit a straight line
ŷi = a + bxi
where:
ŷi is the predicted score for observation i
a is the intercept of the line (point of intersection with Y-axis)
b is the slope (gradient) of the regression line
– Hence, the regression model is:
y = a + bx + ε
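To make the model concrete, here is a minimal R sketch that simulates data from y = a + bx + ε; the values a = 10, b = 0.8 and the error spread sd = 5 are arbitrary illustrative choices:

# Simulate scores from the model y = a + b*x + error
set.seed(123)
x <- runif(100, min = 0, max = 50)   # predictor scores
e <- rnorm(100, mean = 0, sd = 5)    # normally distributed errors
y <- 10 + 0.8 * x + e                # observed scores on the dependent variable
plot(x, y)                           # scatter plot of the simulated data
abline(a = 10, b = 0.8)              # the true line y = a + b*x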

Rationale of simple linear regression

• Graphic illustration of the principle

[Scatter plot: regression line ŷ = a + bx; the residual εi = yi – ŷi is the vertical distance between the observed score yi and the predicted score ŷi]

Rationale of simple linear regression

• Graphic illustration of the principle

[Scatter plot: regression line ŷ = a + bx, with the intercept a marked at the Y-axis and the slope b as the gradient of the line]

Rationale of simple linear regression

• Graphic illustration of the principle

[Example plots: regression lines with the same intercept (a) but different slopes (b), and regression lines with different intercepts (a) but the same slope (b)]

Rationale of simple linear regression

• Fitting the regression line (line of best fit): ŷ = a + bx

This is done using the method of Ordinary Least Squares (OLS).
– The aim of OLS is to minimize the sum of squared residuals Σεi²
Since yi = ŷi + εi, εi = yi – ŷi. Hence:
Σεi² = Σ(yi – ŷi)² = Σ(yi – (a + bxi))²

– The sum is minimized for:
b = Σ(xi – x̅)(yi – y̅) / Σ(xi – x̅)²
a = y̅ – b·x̅

Simple linear regression

• Example 1: Hours of studying and exam performance


We have data from 10 students on the hours they spent studying for an
exam as well as their performance on the exam in percentage points.
Test the hypothesis that the more time students invest in preparing for
an exam, the better they perform on it.
(Note that this is the same example as in the previous lecture.)

Simple linear regression

• Example 1: Hours of studying and exam performance


Student  Hours       Exam score
1        40          58
2        43          73
3        18          56
4        10          47
5        25          58
6        33          54
7        27          45
8        17          32
9        30          68
10       47          69
         x̅ = 29      y̅ = 56
         sx = 12.04  sy = 12.44
Simple linear regression

• Example 1: Hours of studying and exam performance


Student  Hours       xi – x̅  (xi – x̅)²  Exam score  yi – y̅  (xi – x̅)(yi – y̅)
1        40          11       121        58          2        22
2        43          14       196        73          17       238
3        18          -11      121        56          0        0
4        10          -19      361        47          -9       171
5        25          -4       16         58          2        -8
6        33          4        16         54          -2       -8
7        27          -2       4          45          -11      22
8        17          -12      144        32          -24      288
9        30          1        1          68          12       12
10       47          18       324        69          13       234
         x̅ = 29               Σ = 1304   y̅ = 56               Σ = 971
         sx = 12.04                      sy = 12.44
Simple linear regression

• Example 1: Hours of studying and exam performance


– Fitting the regression line ŷ = a + bx

b = Σ(xi – x̅)(yi – y̅) / Σ(xi – x̅)² = 971 / 1304 = 0.745

a = y̅ – b·x̅ = 56 – 0.745*29 = 34.40

– Equation for the prediction:

Exam performance = 34.40 + 0.745*hours of studying
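These hand calculations are easy to reproduce in R; the vectors below use the variable names hrsstudy and exmgrade that appear in the model specification later in this lecture:

# Data from Example 1 (10 students)
hrsstudy <- c(40, 43, 18, 10, 25, 33, 27, 17, 30, 47)
exmgrade <- c(58, 73, 56, 47, 58, 54, 45, 32, 68, 69)

# OLS estimates from the formulas above
b <- sum((hrsstudy - mean(hrsstudy)) * (exmgrade - mean(exmgrade))) /
  sum((hrsstudy - mean(hrsstudy))^2)      # 971 / 1304 = 0.745
a <- mean(exmgrade) - b * mean(hrsstudy)  # 56 - 0.745*29 ≈ 34.4
c(intercept = a, slope = b)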

Simple linear regression

• Example 1: Hours of studying and exam performance


– Significance of the regression coefficients
• Test of the null hypothesis for the intercept (a = 0)
When predictor X is equal to 0, dependent variable Y is equal to 0
(i.e. a = 0). Note that this test is not always meaningful, especially
when there are no observed values of 0 on the predictor or
dependent variable.

• Test of the null hypothesis for the slope (b = 0)


Predictor X has no effect on dependent variable Y.

Simple linear regression

• Example 1: Hours of studying and exam performance


– Significance of the regression coefficients
• Assessed using a Wald test

W = (β̂ – β₀) / se(β̂)

where β̂ is the estimated regression coefficient, se(β̂) is its standard error, and β₀ is the value against which the regression coefficient is tested (0 in a test that the regression coefficient does not differ from 0).

Software packages like SPSS and R assume the Wald statistic to be t-distributed, whereas Stata assumes it to be z-distributed.
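As a sketch of how this looks in R: once the model object ex1.reg (specified a few slides below) is available, the estimates, standard errors and t-scores can be read from the coefficient table, and the Wald statistic recomputed by hand:

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(ex1.reg)$coefficients
coefs

# Wald statistic for the slope, computed by hand (beta0 = 0)
W <- (coefs["hrsstudy", "Estimate"] - 0) / coefs["hrsstudy", "Std. Error"]
W   # identical to the t value reported by summary()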

Simple linear regression

• Interpreting the regression coefficients


– Unstandardized regression coefficients: a one-unit change on independent variable X (also called the predictor) results in a change of b units on dependent variable Y

– Unstandardized estimates in Example 1:


• Regression slope: b = 0.745
One more hour of studying brings about 0.745 points more on the
exam.

• Intercept: a = 34.40
Students who invested 0 hours of studying would get on average
34.40 points on the exam.

Simple linear regression

• Interpreting the regression coefficients


– Standardized regression coefficients: a one-standard-deviation change on predictor X results in a change of β standard deviations on dependent variable Y
• Can be straightforwardly interpreted as correlation
coefficients between the predictor and the dependent
variable
• Procedure for full standardization: β = b * (sx / sy)

Simple linear regression

• Interpreting the regression coefficients


– Standardized estimates in Example 1:
b = 0.745, sx = 12.04, sy = 12.44
β = 0.745 * (12.04 / 12.44) = 0.72

Hence, a one-standard-deviation increase in the hours of studying results in 0.72 standard deviations more on the exam.

Note that the standardized regression coefficient equals the Pearson correlation coefficient for the association of the two variables: r(8) = 0.72 (see the previous lecture). This is always the case with simple linear regression!
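Both the standardization formula and its equivalence with Pearson's r can be verified in R, using the hrsstudy and exmgrade vectors and the slope b computed earlier:

# Fully standardized slope: beta = b * (sx / sy)
b * (sd(hrsstudy) / sd(exmgrade))   # 0.745 * (12.04 / 12.44) = 0.72

# In simple linear regression this equals the Pearson correlation
cor(hrsstudy, exmgrade)             # r = 0.72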

Simple linear regression

• Specifying a linear regression model in R with the command lm()


– General form: model <- lm(dependent ~ predictor(s))

– Some important accessor functions:

summary()   model summary
anova()     overall ANOVA test of the model
plot()      diagnostic plots

– Some important elements:

model$coefficients    vector with model coefficients
model$fitted.values   vector with predicted scores
model$residuals       vector with unstandardized residuals

Simple linear regression

• Example 1: Hours of studying and exam performance


– Specifying the model in R
ex1.reg <- lm(exmgrade ~ hrsstudy)

– R output
summary(ex1.reg)

Simple linear regression

• Example 1: Hours of studying and exam performance


– R output
The last portion of the output from summary() provides the so-called model summary. It contains the result of the overall ANOVA test of model fit and the coefficient of determination R² (multiple R²; adjusted R² corrects R² for the number of predictors in the model).

The output shows that the independent variable (hours of studying) explains 51.94 % of the variation in exam performance (the dependent variable). The same could be rephrased as: hours of studying and exam performance share 51.94 % of common variation.
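The same quantities can also be extracted directly from the fitted model object:

summary(ex1.reg)$r.squared       # multiple R-squared = 0.5194
summary(ex1.reg)$adj.r.squared   # adjusted R-squared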

Simple linear regression

• Example 1: Hours of studying and exam performance


– R output

The overall ANOVA test of model fit shows whether the predictors in
the regression model explain a significant amount of variation in the
dependent variable. (Note the complementarity between ANOVA and
regression).
As F(1,8) = 8.647, p < .05, we can conclude that the explained amount
of variation (R2 = .5194) is statistically significant.
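The p-value of this F-test can be recomputed by hand from the F-distribution:

# Probability of F >= 8.647 with df1 = 1 and df2 = 8
pf(8.647, df1 = 1, df2 = 8, lower.tail = FALSE)   # ≈ 0.019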

Simple linear regression

• Example 1: Hours of studying and exam performance


– R output

The result from the overall ANOVA test of model fit can also be
obtained as follows.
anova(ex1.reg)

Simple linear regression

• Example 1: Hours of studying and exam performance


– R output
The central portion of the output from summary() shows the regression
coefficients in unstandardized form (intercept, called ‘constant’ in
SPSS; predictor’s regression slope b), the standard errors of the
estimates, the corresponding t-scores and significance.

(Estimates may deviate from hand calculations due to rounding.)

Simple linear regression

• Example 1: Hours of studying and exam performance


– R output

The evidence on the intercept (a = 34.406, p < .05) shows that when students invest 0 hours of studying, the expected exam performance is 34.4 points, and this value is significantly different from 0.
As you may realize, this is not particularly useful information, because there is no observed score of 0 on hours of studying. In fact, the intercept is quite often not interpreted at all.

Simple linear regression

• Example 1: Hours of studying and exam performance


– R output

The effect of the hours of studying on exam performance is positive and significant at the 5 % level: b = .745, p = .019. As the hypothesis is one-tailed, the precise significance would be p = .019/2 = .0095. Hence, we find evidence in support of the hypothesis: every additional hour of studying that students invest brings them 0.745 additional points on the exam.

Simple linear regression

• Example 1: Hours of studying and exam performance


– R output
The standardized regression coefficients can be obtained with the
command lm.beta() from package QuantPsyc.

library(QuantPsyc)
lm.beta(ex1.reg)

The size of the standardized regression coefficient (β = .72) renders the association between study time and exam performance very strong.

(Note that the constant/intercept is omitted in the standardized form.)

Simple linear regression

• Predicting scores using the regression equation


Using the regression model fitted to the association between hours of studying and exam performance, what would be the exam score of a student who invested 60 hours in preparation?
ŷ = a + bx
ŷ = 34.406 + 0.745*60 = 79.11
Hence, a student who invests 60 hours of studying would get about 79
points on the exam.

How many hours should a student invest in order to get 100 out of 100
points on the exam?
100 = 34.406 + 0.745*x
x = 88.05 hours
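In R, such predictions are typically obtained with predict(); the reverse question can be answered by solving the regression equation for x:

# Predicted exam score for 60 hours of studying
predict(ex1.reg, newdata = data.frame(hrsstudy = 60))   # ≈ 79.1

# Hours needed for a perfect score: solve 100 = a + b*x for x
(100 - coef(ex1.reg)["(Intercept)"]) / coef(ex1.reg)["hrsstudy"]   # ≈ 88 hours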

Core assumptions of simple linear regression

• Dependent variable
– Continuous (interval or ratio scaled)
• Independent variable (predictor)
– Continuous (not necessarily normally distributed) or dichotomous
– Non-zero variance: scores on the predictor should not be
identical for all observations
– Homoskedasticity: residuals have the same variance at each
level of the predictor
– Linearly related to the dependent variable
• Errors
– Normally distributed
– Lack of autocorrelation: For any two observations, the residuals
should be uncorrelated
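A few of these assumptions can also be probed numerically; the sketch below assumes the add-on package car is installed:

library(car)                       # companion package with regression diagnostics

ncvTest(ex1.reg)                   # score test for non-constant error variance (heteroskedasticity)
durbinWatsonTest(ex1.reg)          # autocorrelation of the residuals
shapiro.test(residuals(ex1.reg))   # normality of the residuals (base R)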
Plots for diagnostics

Source: https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/interpreting-residual-plots-improve-regression/
• Plot of predicted values vs. actual (observed) values
[Two example plots: high model accuracy vs. rather low model accuracy]

Plots for diagnostics

Source: https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/interpreting-residual-plots-improve-regression/
• Plot of predicted values vs. actual (observed) values
[Two example plots: omitted grouping variable vs. omitted interaction effect]

Plots for diagnostics

• Example 1: Hours of studying and exam performance


– Plot of predicted values vs. actual (observed) values
plot(ex1.reg$fitted.values, exmgrade, pch= 16, cex.lab= 1.3,
xlab= "Predicted exam scores", ylab= "Observed exam scores")
abline(lm(exmgrade ~ ex1.reg$fitted.values), lty= "dashed", lwd= 2, col= "red")

Plots for diagnostics

Source: https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/interpreting-residual-plots-improve-regression/
• Plot of predicted values vs. standardized residuals
– Examples of ‘well fitting’ models: Residuals are randomly
scattered, not forming a clear pattern

Plots for diagnostics

Source: https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/interpreting-residual-plots-improve-regression/
• Plot of predicted values vs. standardized residuals
– Example of a problematic model: Heteroskedasticity

Plots for diagnostics

Source: https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/interpreting-residual-plots-improve-regression/
• Plot of predicted values vs. standardized residuals
– Example of a problematic model: Non-linearity

Plots for diagnostics

Source: https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/interpreting-residual-plots-improve-regression/
• Plot of predicted values vs. standardized residuals
– Example of a problematic model: Presence of outliers

Plots for diagnostics

• Example 1: Hours of studying and exam performance


– Plot of predicted values vs. standardized residuals
The residuals can be standardized with the command stdres() from package MASS.
library(MASS)
stdres.exmgrade <- stdres(ex1.reg)
plot(ex1.reg$fitted.values, stdres.exmgrade, pch= 16, cex.lab= 1.3,
     xlab= "Predicted exam scores", ylab= "Standardized residuals")
abline(a= 0, b= 0, lty= "dashed", lwd= 2, col= "red")

A similar plot can also be obtained using the generic accessor function plot(), i.e.: plot(ex1.reg)

Plots for diagnostics

• Normal Q-Q plot


– Used to assess whether the residuals are normally distributed
– The empirical quantiles of the residuals are plotted against the quantiles of the standard normal distribution
– If the residuals follow a normal distribution with a mean of 0, the points fall along the reference line with an intercept of 0 and a slope equal to the estimated standard deviation of the residuals. In short, the data points should closely follow the straight 45-degree reference line (from bottom left to top right)

Plots for diagnostics

• Normal Q-Q plot

– Example plots:
[Two example plots: normally distributed residuals vs. not normally distributed residuals]

Source: https://i.stack.imgur.com/wzOMY.png
Plots for diagnostics

• Example 1: Hours of studying and exam performance


– Normal Q-Q plot
qqnorm(stdres.exmgrade, pch= 16, cex.lab= 1.3,
ylab= "Quantiles of standardized residuals")
qqline(stdres.exmgrade,
lty= "dashed", lwd= 2,
col= "red")

A similar plot can also be obtained using the generic accessor function plot(), i.e.: plot(ex1.reg)

