
Medical Statistics

Simple linear regression

Hans Burgerhof
Epidemiology
j.g.m.burgerhof@umcg.nl
Semester 1.2
• Three lectures
– Simple linear regression
– Multiple linear regression
– Building regression models and using SPSS
• Workshops/practicals
• Individual assignment
• Exam (multiple choice questions)
Dataset n = 277 patients with diabetes
Today: simple linear regression
What is it?
Why do we need it?

In semester 1.1 we learned how to test whether a continuous outcome variable like Weight is related to an explanatory variable like Sex, Height or Smoking behaviour (three categories).

Assuming normality for the variable Weight, we used respectively
- t-test for independent groups
- Pearson's correlation coefficient
- Oneway ANOVA
Scatterplot of weight against height

Pearson correlation is a
measure of the strength of
a linear relationship

A correlation coefficient does not enable us to predict the weight of an individual if we know his/her height.
Weight = β₀ + β₁ · Height

β₀ is the intercept
β₁ is the slope

Remember high school math:

Y = a·X + b

The intercept (or "constant") is the intersection of the line and the Y-axis
“best fitting line”

Residual = distance
from an observation to
the regression line (in
vertical direction)

Each line has its own set of 277 residuals (e₁, ..., e₂₇₇)

"Ordinary least squares" criterion:
the line for which Σ eᵢ² is smallest is the best fitting line
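As a minimal sketch of this criterion in Python (the course itself uses SPSS): the arrays height_cm and weight_kg below are made-up illustrative values, not the n = 277 diabetes dataset.

```python
import numpy as np

# Made-up illustrative data (not the n = 277 diabetes dataset)
height_cm = np.array([160, 165, 170, 175, 180, 185], dtype=float)
weight_kg = np.array([62, 70, 73, 80, 88, 95], dtype=float)

# Closed-form OLS estimates: slope = S_xy / S_xx, intercept = y_bar - slope * x_bar
x_bar, y_bar = height_cm.mean(), weight_kg.mean()
slope = np.sum((height_cm - x_bar) * (weight_kg - y_bar)) / np.sum((height_cm - x_bar) ** 2)
intercept = y_bar - slope * x_bar

# Residuals = vertical distances to the fitted line; OLS minimises their sum of squares
residuals = weight_kg - (intercept + slope * height_cm)
print(intercept, slope, np.sum(residuals ** 2))
```

Any other line through the same points gives a larger sum of squared residuals than the one printed here.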
Linear regression in SPSS (1)
Linear regression in SPSS (2)
Outcome variable or
Response variable

Explanatory variable(s)
or Predictor(s)
SPSS output of (simple) linear regression
Coefficients

Test for the slope: H₀: β₁ = 0 (the line is a horizontal one)

Significant! The null hypothesis is rejected.

The equation can be used for prediction:

Weight = −59.58 + 0.838 · Height

e.g. for a person with height = 180 cm:
Weight = −59.58 + 0.838·180 ≈ 91.3 kg

Intercept −59.58?
The scatterplot with adjusted axes

Intercept -59.58

Estimated weight -59.58 kg? For a person with height = 0 cm!


The ANOVA table
Null hypothesis: "the total model explains nothing"

Sum of Squares: SS = Σ(Y − Ȳ)²      Variance = SS / df

F = MS_regr / MS_res

(Y − Ȳ) = (Y − Ŷ) + (Ŷ − Ȳ)
 total     unexplained   explained
           (residual)    (regression)

In words: the total SS = residual SS + regression SS
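A sketch of this decomposition and the F statistic of the ANOVA table, again on the made-up toy data from the earlier sketch:

```python
import numpy as np

# Same toy data; refit the OLS line, then split the total sum of squares
height_cm = np.array([160, 165, 170, 175, 180, 185], dtype=float)
weight_kg = np.array([62, 70, 73, 80, 88, 95], dtype=float)
slope, intercept = np.polyfit(height_cm, weight_kg, 1)
fitted = intercept + slope * height_cm

ss_total = np.sum((weight_kg - weight_kg.mean()) ** 2)
ss_res = np.sum((weight_kg - fitted) ** 2)            # unexplained (residual)
ss_regr = np.sum((fitted - weight_kg.mean()) ** 2)    # explained (regression)
assert np.isclose(ss_total, ss_res + ss_regr)         # total SS = residual SS + regression SS

n, p = len(weight_kg), 1                              # one predictor in simple regression
F = (ss_regr / p) / (ss_res / (n - p - 1))            # F = MS_regr / MS_res
print(F)
```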


Model summary

R² is the proportion of variation in the outcome variable that is explained by the model. The higher the R², the better the model.

R is the square root of R². In simple linear regression, R is equal to the absolute value of Pearson's correlation coefficient.

Adjusted R² is a more conservative estimate of the proportion explained in the population.

The "standard error of the estimate" is an estimate of the standard deviation of the residuals. It can be used for confidence intervals and prediction intervals.
F test and t-test for slope in simple linear regression

F = t² (always equal in simple linear regression)
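A sketch using statsmodels (assumed to be available; not part of the course, which uses SPSS) that reproduces the model-summary quantities and the F = t² identity on the same toy data:

```python
import numpy as np
import statsmodels.api as sm

# Same toy data; statsmodels reports the quantities shown in the SPSS model summary
height_cm = np.array([160, 165, 170, 175, 180, 185], dtype=float)
weight_kg = np.array([62, 70, 73, 80, 88, 95], dtype=float)

model = sm.OLS(weight_kg, sm.add_constant(height_cm)).fit()
print(model.rsquared)                       # R²: proportion of variation explained
print(model.rsquared_adj)                   # adjusted R²
print(np.sqrt(model.mse_resid))             # standard error of the estimate (sd of the residuals)
print(model.fvalue, model.tvalues[1] ** 2)  # F equals t² for the slope in simple regression
```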
Assumptions for linear regression

1. Y has to be a continuous variable
2. Linear relationship between Y and X (check in a scatterplot)
3. The cases have to be mutually independent (you need information on the data collection)
4. The residuals have to be normally distributed
5. The variances of the residuals have to be homogeneous (homoscedasticity)

Assumptions 4 and 5: check graphically in SPSS
Checking assumptions

To check the homoscedasticity
To check normality of the residuals

SPSS output for checking the assumptions

Distribution of the residuals seems rather normal
Homogeneous variation of the residuals? OK
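A rough matplotlib equivalent of the two SPSS diagnostic plots, only a sketch on the same toy data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Same toy data; residuals-vs-fitted plot and histogram of residuals
height_cm = np.array([160, 165, 170, 175, 180, 185], dtype=float)
weight_kg = np.array([62, 70, 73, 80, 88, 95], dtype=float)
slope, intercept = np.polyfit(height_cm, weight_kg, 1)
fitted = intercept + slope * height_cm
residuals = weight_kg - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(fitted, residuals)       # even spread around 0 suggests homoscedasticity
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")
ax2.hist(residuals, bins=6)          # roughly bell-shaped suggests normal residuals
ax2.set_xlabel("Residuals")
plt.tight_layout()
plt.show()
```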
Simple linear regression with a non-continuous predictor

We saw a simple linear regression with a continuous outcome variable Y (weight) and a continuous explanatory variable X (height).

Can we perform a simple linear regression with another kind of explanatory variable, like a binary variable?
(In linear regression, Y has to be continuous!)
Is weight related to sex?
T-test for independent groups
Assuming normal distributions of weights: t-test for independent groups.

The null hypothesis of equal variances is not rejected.
The null hypothesis of equal means is rejected: males have on average higher weights than females.
Linear regression of weight on sex

(boxplots of weight for males and females)

Linear regression ↔ t-test

Weight = 87.43 − 8.71 · Sex
Males (Sex = 0): 87.43
Females (Sex = 1): 87.43 − 8.71 = 78.72
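A sketch with scipy showing this equivalence on hypothetical data; the coding 0 = male, 1 = female follows the slide, the numbers are made up:

```python
import numpy as np
from scipy import stats

# Hypothetical data (not the course dataset); Sex coded 0 = male, 1 = female
sex = np.array([0, 0, 0, 1, 1, 1])
weight = np.array([88.0, 90.0, 84.0, 78.0, 80.0, 76.0])

# Simple regression of weight on the 0/1 predictor
res = stats.linregress(sex, weight)
print(res.intercept, res.intercept + res.slope)   # mean weight of males, mean weight of females

# Pooled t-test gives exactly the same p-value as the test for the slope
t_pooled = stats.ttest_ind(weight[sex == 0], weight[sex == 1], equal_var=True)
print(res.pvalue, t_pooled.pvalue)
```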
Conclusion
Performing a linear regression with a continuous Y and
a binary X is a valid analysis and is equivalent to
performing a t-test for independent groups (with equal
variances).

(Pooled) t-test                    Linear regression
Independent groups                 Independent observations
Continuous Y                       Continuous Y
Normally distributed per group     Normally distributed residuals
Equal variances                    Homoscedasticity
Can we also use a categorical variable
with more than two categories as a
predictor in linear regression?

Oneway ANOVA:
SPSS results Oneway ANOVA

The null hypothesis of the Oneway ANOVA is rejected: at least two groups have significantly different means.

Conclusion?
Linear regression?

(scatterplot of weight by smoking group: non-smoking, stopped smoking, current smokers)

SPSS will draw the best fitting line if you ask for it ...

Positive effect of smoking on weight?

The coding used
0 = never smoking
1 = stopped smoking
2 = current smoker
is an arbitrary coding
The effect of a different coding

(the same scatterplot, with the smoking categories in a different order on the X-axis)
Negative effect of smoking?

Or another coding ...

(yet another ordering of the categories on the X-axis)
No effect of smoking?

Conclusion: we cannot use a categorical (nominal) variable with more than two categories as a predictor in linear regression ... unless we use dummy variables.
Dummy coding
A dummy variable is a “helping variable”.
There are several ways to use dummy variables.
Most commonly used: the reference group coding.
Choose one of the groups as reference and make dummy variables for the
other groups to compare those groups to the reference group.

Variable: Smoking    DumSm1   DumSm2
Never smoked            0        0
Stopped smoking         1        0
Current smoker          0        1

The number of dummy variables equals the number of groups minus one
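One possible way to build such reference-group dummies with pandas (a sketch; the column names DumSm1 and DumSm2 follow the slide, the example values are made up):

```python
import pandas as pd

# Reference-group coding with "never" smoked as the reference category
smoking = pd.Series(["never", "stopped", "current", "never", "current"])
dummies = pd.get_dummies(smoking)[["stopped", "current"]].astype(int)
dummies.columns = ["DumSm1", "DumSm2"]   # number of dummies = number of groups - 1
print(pd.concat([smoking.rename("Smoking"), dummies], axis=1))
```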
Linear regression of weight on smoking
behaviour (using dummies)
Variable: Smoking    DumSm1   DumSm2
Never                   0        0
Stopped                 1        0
Current                 0        1

SPSS output:

Weight = 79.08 + 8.50·DumSm1 + 7.68·DumSm2

Never:   Weight = 79.08
Stopped: Weight = 79.08 + 8.50 = 87.58
Current: Weight = 79.08 + 7.68 = 86.76
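A sketch with made-up weights showing that a regression on the two dummies simply reproduces the three group means (reference group = intercept):

```python
import numpy as np

# Made-up weights for three smoking groups (never, stopped, current)
weight = np.array([78, 80, 82, 88, 86, 90, 85, 87, 89], dtype=float)
group = np.array(["never"] * 3 + ["stopped"] * 3 + ["current"] * 3)

# Reference-group dummies: never smoked is the reference
dum1 = (group == "stopped").astype(float)   # DumSm1
dum2 = (group == "current").astype(float)   # DumSm2
X = np.column_stack([np.ones(len(weight)), dum1, dum2])

# Least-squares fit: intercept = mean of the reference group,
# intercept + coefficient = mean of the corresponding group
b0, b1, b2 = np.linalg.lstsq(X, weight, rcond=None)[0]
print(b0, b0 + b1, b0 + b2)   # the three group means
```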
Summary
Linear regression can be used if the response variable Y is
continuous (and some other assumptions have been fulfilled).

The explanatory variable can be
- continuous (relation with Pearson correlation),
- binary (relation with the t-test), or
- categorical with more than two categories (relation with Oneway ANOVA).

Linear regression can be used for prediction.

Simple linear regression can be extended to multiple linear regression: Y = β₀ + β₁·X₁ + β₂·X₂ + ... (next lecture)
