JTMS-03 Applied Statistics with R

Simple linear ordinary least squares regression

Dr. Georgi Dragolov

19.04.2021

Jacobs University Bremen


Linear regression

• Powerful tool with wide application
– Explore associations among variables
– Predict scores on the dependent variable based on the independent variable(s) in the regression model

• Types of linear regression models
– Simple linear regression: one dependent variable regressed on one independent variable
– Multiple linear regression: one dependent variable regressed simultaneously on several independent variables
– Multivariate multiple linear regression: several dependent variables simultaneously regressed on several independent variables

Rationale of simple linear regression

• The method fits a regression model on the basis of the association between independent variable X and dependent variable Y, such that Y is regressed on X

• An observed score yi on dependent variable Y is decomposed into:
– a predicted score ŷi for the dependent variable on the basis of independent variable X
– and a deviation εi (error term, residual) of the actually observed score yi from the predicted score ŷi

yi = ŷi + εi

Rationale of simple linear regression

• The predicted scores ŷ are located along the line of best fit (also
known as regression line)
– Recall the straight line from the correlational scatter plots that
depicts the correlation between the variables involved
– Two points are needed to fit a straight line
ŷi = a + bxi
where:
ŷi is the predicted score for observation i
a is the intercept of the line (point of intersection with Y-axis)
b is the slope (gradient) of the regression line
– Hence, the regression model is:
y = a + bx + ε
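To make the model concrete, here is a minimal R sketch that simulates data from y = a + bx + ε; the values a = 10, b = 0.8 and the error spread sd = 5 are arbitrary illustrative choices:

# Simulate scores from the model y = a + b*x + error
set.seed(123)
x <- runif(100, min = 0, max = 50)   # predictor scores
e <- rnorm(100, mean = 0, sd = 5)    # normally distributed errors
y <- 10 + 0.8 * x + e                # observed scores on the dependent variable
plot(x, y)                           # scatter plot of the simulated data
abline(a = 10, b = 0.8)              # the true line y = a + b*x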

Rationale of simple linear regression

• Graphic illustration of the principle

[Scatter plot: regression line ŷ = a + bx; the residual εi = yi – ŷi is the vertical distance between the observed score yi and the predicted score ŷi]

Rationale of simple linear regression

• Graphic illustration of the principle

[Scatter plot: regression line ŷ = a + bx, with the intercept a marked at the Y-axis and the slope b as the gradient of the line]

Rationale of simple linear regression

• Graphic illustration of the principle

[Example plots: regression lines with the same intercept (a) but different slopes (b), and regression lines with different intercepts (a) but the same slope (b)]

Rationale of simple linear regression

• Fitting the regression line (line of best fit): ŷ = a + bx

This is done using the method of Ordinary Least Squares (OLS).
– The aim of OLS is to minimize the sum of squared residuals Σεi²
Since yi = ŷi + εi, εi = yi – ŷi. Hence:
Σεi² = Σ(yi – ŷi)² = Σ(yi – (a + bxi))²

– The sum is minimized for:
b = Σ(xi – x̅)(yi – y̅) / Σ(xi – x̅)²
a = y̅ – b·x̅

Simple linear regression

• Example 1: Hours of studying and exam performance


We have data from 10 students on the hours they spent studying for an
exam as well as their performance on the exam in percentage points.
Test the hypothesis that the more time students invest in preparing for
an exam, the better they perform on it.
(Note that this is the same example as in the previous lecture.)

Simple linear regression

• Example 1: Hours of studying and exam performance


Student  Hours       Exam score
1        40          58
2        43          73
3        18          56
4        10          47
5        25          58
6        33          54
7        27          45
8        17          32
9        30          68
10       47          69
         x̅ = 29      y̅ = 56
         sx = 12.04  sy = 12.44
Simple linear regression

• Example 1: Hours of studying and exam performance


Student  Hours       xi – x̅  (xi – x̅)²  Exam score  yi – y̅  (xi – x̅)(yi – y̅)
1        40          11       121        58          2        22
2        43          14       196        73          17       238
3        18          -11      121        56          0        0
4        10          -19      361        47          -9       171
5        25          -4       16         58          2        -8
6        33          4        16         54          -2       -8
7        27          -2       4          45          -11      22
8        17          -12      144        32          -24      288
9        30          1        1          68          12       12
10       47          18       324        69          13       234
         x̅ = 29               Σ = 1304   y̅ = 56               Σ = 971
         sx = 12.04                      sy = 12.44
Simple linear regression

• Example 1: Hours of studying and exam performance


– Fitting the regression line ŷ = a + bx

b = Σ(xi – x̅)(yi – y̅) / Σ(xi – x̅)² = 971 / 1304 = 0.745

a = y̅ – b·x̅ = 56 – 0.745*29 = 34.40

– Equation for the prediction:

Exam performance = 34.40 + 0.745*hours of studying
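These hand calculations are easy to reproduce in R; the vectors below use the variable names hrsstudy and exmgrade that appear in the model specification later in this lecture:

# Data from Example 1 (10 students)
hrsstudy <- c(40, 43, 18, 10, 25, 33, 27, 17, 30, 47)
exmgrade <- c(58, 73, 56, 47, 58, 54, 45, 32, 68, 69)

# OLS estimates from the formulas above
b <- sum((hrsstudy - mean(hrsstudy)) * (exmgrade - mean(exmgrade))) /
  sum((hrsstudy - mean(hrsstudy))^2)      # 971 / 1304 = 0.745
a <- mean(exmgrade) - b * mean(hrsstudy)  # 56 - 0.745*29 ≈ 34.4
c(intercept = a, slope = b)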

Simple linear regression

• Example 1: Hours of studying and exam performance


– Significance of the regression coefficients
• Test of the null hypothesis for the intercept (a = 0)
When predictor X is equal to 0, dependent variable Y is equal to 0
(i.e. a = 0). Note that this test is not always meaningful, especially
when there are no observed values of 0 on the predictor or
dependent variable.

• Test of the null hypothesis for the slope (b = 0)


Predictor X has no effect on dependent variable Y.

Simple linear regression

• Example 1: Hours of studying and exam performance


– Significance of the regression coefficients
• Assessed using a Wald test

W = (β̂ – β₀) / se(β̂)

where β̂ is the estimated regression coefficient, se(β̂) is its standard error, and β₀ is the value against which the regression coefficient is tested (0 in a test that the regression coefficient does not differ from 0).

Software packages like SPSS and R assume the Wald statistic to be t-distributed, whereas Stata assumes it to be z-distributed.
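As a sketch of how this looks in R: once the model object ex1.reg (specified a few slides below) is available, the estimates, standard errors and t-scores can be read from the coefficient table, and the Wald statistic recomputed by hand:

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(ex1.reg)$coefficients
coefs

# Wald statistic for the slope, computed by hand (beta0 = 0)
W <- (coefs["hrsstudy", "Estimate"] - 0) / coefs["hrsstudy", "Std. Error"]
W   # identical to the t value reported by summary()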

Simple linear regression

• Interpreting the regression coefficients


– Unstandardized regression coefficients: a one-unit change on independent variable X (also called the predictor) results in a change of b units on dependent variable Y

– Unstandardized estimates in Example 1:


• Regression slope: b = 0.745
One more hour of studying brings about 0.745 points more on the
exam.

• Intercept: a = 34.40
Students who invested 0 hours of studying would get on average
34.40 points on the exam.

Simple linear regression

• Interpreting the regression coefficients


– Standardized regression coefficients: a one-standard-deviation change on predictor X results in a change of β standard deviations on dependent variable Y
• Can be straightforwardly interpreted as correlation
coefficients between the predictor and the dependent
variable
• Procedure for full standardization: β = b * (sx / sy)

Simple linear regression

• Interpreting the regression coefficients


– Standardized estimates in Example 1:
b = 0.745, sx = 12.04, sy = 12.44
β = 0.745 * (12.04 / 12.44) = 0.72

Hence, a one-standard-deviation increase in the hours of studying results in 0.72 standard deviations more on the exam.

Note that the standardized regression coefficient equals the Pearson correlation coefficient for the association of the two variables: r(8) = 0.72 (see the previous lecture). This is always the case with simple linear regression!
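Both the standardization formula and its equivalence with Pearson's r can be verified in R, using the hrsstudy and exmgrade vectors and the slope b computed earlier:

# Fully standardized slope: beta = b * (sx / sy)
b * (sd(hrsstudy) / sd(exmgrade))   # 0.745 * (12.04 / 12.44) = 0.72

# In simple linear regression this equals the Pearson correlation
cor(hrsstudy, exmgrade)             # r = 0.72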

Simple linear regression

• Specifying a linear regression model in R with the command lm()


– General form: model <- lm(dependent ~ predictor(s))

– Some important accessor functions:

summary()   model summary
anova()     overall ANOVA test of the model
plot()      diagnostic plots

– Some important elements:

model$coefficients    vector with model coefficients
model$fitted.values   vector with predicted scores
model$residuals       vector with unstandardized residuals

Simple linear regression

• Example 1: Hours of studying and exam performance


– Specifying the model in R
ex1.reg <- lm(exmgrade ~ hrsstudy)

– R output
summary(ex1.reg)

Simple linear regression

• Example 1: Hours of studying and exam performance


– R output
The last portion of the output from summary() provides the so-called model summary. It contains the result of the overall ANOVA test of model fit and the coefficient of determination R² (multiple R²; adjusted R² corrects R² for the number of predictors in the model).

The output shows that the independent variable (hours of studying) explains 51.94 % of the variation in exam performance (the dependent variable). The same could be rephrased as: hours of studying and exam performance share 51.94 % of common variation.
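The same quantities can also be extracted directly from the fitted model object:

summary(ex1.reg)$r.squared       # multiple R-squared = 0.5194
summary(ex1.reg)$adj.r.squared   # adjusted R-squared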

Simple linear regression

• Example 1: Hours of studying and exam performance


– R output

The overall ANOVA test of model fit shows whether the predictors in
the regression model explain a significant amount of variation in the
dependent variable. (Note the complementarity between ANOVA and
regression).
As F(1,8) = 8.647, p < .05, we can conclude that the explained amount
of variation (R2 = .5194) is statistically significant.
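The p-value of this F-test can be recomputed by hand from the F-distribution:

# Probability of F >= 8.647 with df1 = 1 and df2 = 8
pf(8.647, df1 = 1, df2 = 8, lower.tail = FALSE)   # ≈ 0.019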

Simple linear regression

• Example 1: Hours of studying and exam performance


– R output

The result from the overall ANOVA test of model fit can also be
obtained as follows.
anova(ex1.reg)

Simple linear regression

• Example 1: Hours of studying and exam performance


– R output
The central portion of the output from summary() shows the regression
coefficients in unstandardized form (intercept, called ‘constant’ in
SPSS; predictor’s regression slope b), the standard errors of the
estimates, the corresponding t-scores and significance.

(Estimates may deviate from hand calculations due to rounding.)

Simple linear regression

• Example 1: Hours of studying and exam performance


– R output

The evidence on the intercept (a = 34.406, p < .05) shows that when students invest 0 hours of studying, the expected exam performance is 34.4 points, and this value is significantly different from 0.
As you may realize, this is not particularly useful information, because there is no observed score of 0 on hours of studying. In fact, the intercept is quite often not interpreted at all.

Simple linear regression

• Example 1: Hours of studying and exam performance


– R output

The effect of the hours of studying on exam performance is positive and significant at the 5 % level: b = .745, p = .019. As the hypothesis is one-tailed, the precise significance would be p = .019/2 = .0095. Hence, we find evidence in support of the hypothesis: every additional hour of studying that students invest brings them 0.745 additional points on the exam.

Simple linear regression

• Example 1: Hours of studying and exam performance


– R output
The standardized regression coefficients can be obtained with the
command lm.beta() from package QuantPsyc.

library(QuantPsyc)
lm.beta(ex1.reg)

The size of the standardized regression coefficient (β = .72) renders the association between study time and exam performance very strong.

(Note that the constant/intercept is omitted in the standardized form.)

Simple linear regression

• Predicting scores using the regression equation


Using the regression model fitted to the association between hours of studying and exam performance, what would be the exam score of a student who invested 60 hours in preparation?
ŷ = a + bx
ŷ = 34.406 + 0.745*60 = 79.11
Hence, a student who invests 60 hours of studying would get about 79
points on the exam.

How many hours should a student invest in order to get 100 out of 100
points on the exam?
100 = 34.406 + 0.745*x
x = 88.05 hours
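In R, such predictions are typically obtained with predict(); the reverse question can be answered by solving the regression equation for x:

# Predicted exam score for 60 hours of studying
predict(ex1.reg, newdata = data.frame(hrsstudy = 60))   # ≈ 79.1

# Hours needed for a perfect score: solve 100 = a + b*x for x
(100 - coef(ex1.reg)["(Intercept)"]) / coef(ex1.reg)["hrsstudy"]   # ≈ 88 hours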

Core assumptions of simple linear regression

• Dependent variable
– Continuous (interval or ratio scaled)
• Independent variable (predictor)
– Continuous (not necessarily normally distributed) or dichotomous
– Non-zero variance: scores on the predictor should not be
identical for all observations
– Homoskedasticity: residuals have the same variance at each
level of the predictor
– Linearly related to the dependent variable
• Errors
– Normally distributed
– Lack of autocorrelation: For any two observations, the residuals
should be uncorrelated
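A few of these assumptions can also be probed numerically; the sketch below assumes the add-on package car is installed:

library(car)                       # companion package with regression diagnostics

ncvTest(ex1.reg)                   # score test for non-constant error variance (heteroskedasticity)
durbinWatsonTest(ex1.reg)          # autocorrelation of the residuals
shapiro.test(residuals(ex1.reg))   # normality of the residuals (base R)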
Plots for diagnostics

Source: https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/interpreting-residual-plots-improve-regression/
• Plot of predicted values vs. actual (observed) values
[Two example plots: high model accuracy vs. rather low model accuracy]

Plots for diagnostics

Source: https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/interpreting-residual-plots-improve-regression/
• Plot of predicted values vs. actual (observed) values
[Two example plots: omitted grouping variable vs. omitted interaction effect]

Plots for diagnostics

• Example 1: Hours of studying and exam performance


– Plot of predicted values vs. actual (observed) values
plot(ex1.reg$fitted.values, exmgrade, pch= 16, cex.lab= 1.3,
xlab= "Predicted exam scores", ylab= "Observed exam scores")
abline(lm(exmgrade ~ ex1.reg$fitted.values), lty= "dashed", lwd= 2, col= "red")

Plots for diagnostics

Source: https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/interpreting-residual-plots-improve-regression/
• Plot of predicted values vs. standardized residuals
– Examples of ‘well fitting’ models: Residuals are randomly
scattered, not forming a clear pattern

Plots for diagnostics

Source: https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/interpreting-residual-plots-improve-regression/
• Plot of predicted values vs. standardized residuals
– Example of a problematic model: Heteroskedasticity

Plots for diagnostics

Source: https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/interpreting-residual-plots-improve-regression/
• Plot of predicted values vs. standardized residuals
– Example of a problematic model: Non-linearity

Plots for diagnostics

Source: https://www.qualtrics.com/support/stats-iq/analyses/regression-guides/interpreting-residual-plots-improve-regression/
• Plot of predicted values vs. standardized residuals
– Example of a problematic model: Presence of outliers

Plots for diagnostics

• Example 1: Hours of studying and exam performance


– Plot of predicted values vs. standardized residuals
The residuals can be standardized with the command stdres() from package MASS.
library(MASS)
stdres.exmgrade <- stdres(ex1.reg)
plot(ex1.reg$fitted.values, stdres.exmgrade, pch= 16, cex.lab= 1.3,
     xlab= "Predicted exam scores", ylab= "Standardized residuals")
abline(a= 0, b= 0, lty= "dashed", lwd= 2, col= "red")

A similar plot can also be obtained using the generic accessor function plot(), i.e.: plot(ex1.reg)

Plots for diagnostics

• Normal Q-Q plot


– Used to assess whether the residuals are normally distributed
– The empirical quantiles of the residuals are plotted against the quantiles of the standard normal distribution
– If the residuals follow a normal distribution with a mean of 0, the points fall along the reference line with an intercept of 0 and a slope equal to the estimated standard deviation of the residuals. In short, the data points should closely follow the straight 45-degree reference line (from bottom left to top right)

Plots for diagnostics

• Normal Q-Q plot

– Example plots:
[Two example plots: normally distributed residuals vs. not normally distributed residuals]

Source: https://i.stack.imgur.com/wzOMY.png
Plots for diagnostics

• Example 1: Hours of studying and exam performance


– Normal Q-Q plot
qqnorm(stdres.exmgrade, pch= 16, cex.lab= 1.3,
ylab= "Quantiles of standardized residuals")
qqline(stdres.exmgrade,
lty= "dashed", lwd= 2,
col= "red")

A similar plot can also be obtained using the generic accessor function plot(), i.e.: plot(ex1.reg)

