Lecture 9

Regression
• Simple Linear Regression.

• Multiple Linear Regression.
• Simple Logistic Regression.
• Multiple Logistic Regression.
EPHD310 Basic Biostat Lect 9 Dr. Jaffa 1
Simple Linear Regression
• Linear regression is a technique used to assess whether
there is a linear relationship between two (or more)
variables.
• Example: assess the linear relationship between caffeine intake
and blood pressure.
• Is caffeine intake associated with increasing or
decreasing blood pressure?
• Linear regression involves one variable denoted as the
“dependent” or “outcome” variable and referred to as “y”.
• One (or more) variable(s) are denoted as “independent” or
“explanatory” and referred to as “x”(s).
• In the caffeine example: blood pressure is y and
caffeine intake is x.
• The aim is to assess whether caffeine intake (x) is associated
with increasing or decreasing blood pressure (y), and how
strong this association is, using a linear equation called the
“linear regression equation”.

• The linear regression equation is written as:
Y = α + βX + ε
• It is used for predicting the value of y when x is given.
• If there is one x, this is referred to as “simple linear
regression”.
• If there is more than one x, this is referred to as “multiple linear
regression”.
• Assumptions:
1) The observations for the outcome Y are assumed to be
independent of each other, continuous and normally
distributed.
2) The observations for the predictor X can be
continuous or categorical.
• Linear regression equation:
Y = α + βX + ε
• The line Y = α + βX is referred to as the regression line.
• α is the intercept of the line.
• β is the slope: a quantitative measure of the association between X
and Y.
• Interpretation of β:
For each unit increase in X there is, on average, a change of β units
in Y (an increase if β > 0, a decrease if β < 0).
• If β = 0 then there’s no association between X and Y.
• If β > 0 then there’s a positive association between X and Y.
• If β < 0 then there’s a negative association between X and Y.

• ε is the error term: ε ~ N(0, σ²), with all errors assumed independent.
• Outcome Y is assumed to be normally distributed.
• α and β are unknown, so we fit the model to obtain their
estimates, denoted “a” and “b” respectively.
• The fitted model is denoted as follows:
Ŷ = a + bX
• The predicted, or average, value of Y for a particular value of
x is denoted by Ŷ.
• We need a formal test of whether the slope β = 0. We refer to this
test as the “t test”, and the hypotheses to be tested are
Ho: β = 0 vs H1: β ≠ 0
• If the t test is significant (P-value ≤ 0.05) then reject Ho and
deduce that there is a significant association between Y and X,
i.e. X is a significant predictor of Y.
• Otherwise there is no significant association between X and Y.
• The association between X and Y is quantified by
the slope β.
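The estimates a and b and the t statistic for Ho: β = 0 can be sketched in a few lines. This is a minimal illustration of ordinary least squares, not the SPSS procedure used in the lecture. The function name `fit_simple_linear` and the toy data are hypothetical and are not taken from the slides.

```python
import math

def fit_simple_linear(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, t), where t tests Ho: slope = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx                      # slope estimate
    a = y_bar - b * x_bar                # intercept estimate
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(sse / (n - 2) / s_xx)   # standard error of the slope
    t = b / se_b                         # compare with a t distribution on n - 2 df
    return a, b, t

# Hypothetical toy data (NOT the lecture data), used only to exercise the function.
x_toy = [1, 2, 3, 4, 5]
y_toy = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b, t = fit_simple_linear(x_toy, y_toy)
print(f"a = {a:.3f}, b = {b:.3f}, t = {t:.2f}")
```

A large absolute value of t relative to the t distribution with n − 2 degrees of freedom leads to rejecting Ho. Statistical software reports the corresponding P-value directly.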

Example: OBGYN
• Obstetricians sometimes order tests for estriol levels from 24-hour
urine specimens taken from pregnant women who are
near term.
• The level of estriol has been found to be related to the
birthweight of the infant.
• The level of estriol can provide evidence of an abnormally
small fetus.
• Assuming that babies’ birthweights are normally distributed, how
can we quantify the relationship between estriol and infants’
birthweights?
Example: OBGYN (continued)
• To quantify the relationship between estriol levels and
babies’ birthweights, use linear regression, since the outcome
(babies’ birthweight) is normally distributed.
• In this example estriol level is used as the
predictor (X) for the dependent variable, infants’ birthweight
(Y).
• There is only one predictor (estriol), so use simple linear
regression.

Example: OBGYN (continued)
i     Estriol (mg/24 hr), Xi    Birthweight (g/100), Yi
1     7                         25
2     9                         25
3     9                         25
4     12                        27
5     14                        27
6     16                        27
7     16                        24
…     …                         …
29    22                        40
30    25                        39
31    24                        43
Example: OBGYN
• Simple linear regression is used in the estriol and babies’
birthweights example.
• Estriol level is the explanatory or predictive variable (X).
• Babies’ birthweight is the dependent
variable (Y).
• We need to quantify the association between X and Y.
• The output of simple linear regression as generated by
SPSS is the following:
• Intercept estimate (estimate of α): a = 21.52
• Slope estimate (estimate of β): b = 0.608
• The t test of Ho: β = 0 vs H1: β ≠ 0 is significant, with P-value
= 0.000 as displayed by SPSS (i.e. P < 0.001), so reject Ho and deduce
that there is a significant relationship between estriol and
birthweight at significance level α = 0.05.
• The fitted model is: Ŷ = 21.52 + 0.608 X
• With Y = babies’ birthweight (g/100) and X = estriol level (mg/24 hr).
• Thus estriol is a significant predictor of babies’
birthweights and there is a significant association
between them.
• Interpretation of the result: each 1 mg/24 hr increase
in estriol level corresponds to an average increase of 0.608 in a
baby’s birthweight measured in g/100, i.e. about 60.8 grams.
• Fitted model is: Ŷ = 21.52 + 0.608 X
• Predict the average birthweight of a baby whose mother’s
estriol level is 16 mg/24 hr:
Ŷ = 21.52 + 0.608 X = 21.52 + 0.608(16) = 31.248
• So we predict that at an estriol level of 16 the average baby’s
birthweight will be 31.248 in units of g/100, i.e. about 3125 grams.
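The prediction step can be written as a small function. This sketch uses the intercept and slope reported on the slides. The function name `predict_birthweight` is hypothetical, and the conversion to grams assumes the table's unit of g/100.

```python
def predict_birthweight(estriol_mg_per_24h):
    """Predicted mean birthweight in the table's units of g/100."""
    a, b = 21.52, 0.608   # intercept and slope reported on the slides
    return a + b * estriol_mg_per_24h

y_hat = predict_birthweight(16)
# Multiply by 100 to convert from g/100 to grams.
print(f"{y_hat:.3f} (g/100) = {y_hat * 100:.1f} g")
```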
• Now assume that X is binary. How do we interpret the estimate?
• Example: predicted SBP = 2.45 + 5.89 × gender
• Gender = 1 for male and 0 for female.
• Interpretation: going from female to male increases the mean
SBP by 5.89 mm Hg.
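With a 0/1 predictor, the slope is the difference between the two group means predicted by the model. The sketch below uses the coefficients exactly as printed on the slide. The function name `predicted_mean_sbp` is hypothetical.

```python
def predicted_mean_sbp(gender):
    """gender: 1 = male, 0 = female, as coded on the slide."""
    return 2.45 + 5.89 * gender   # coefficients as printed on the slide

# The slope equals the male-minus-female difference in predicted mean SBP.
difference = predicted_mean_sbp(1) - predicted_mean_sbp(0)
print(f"male minus female mean SBP: {difference:.2f} mm Hg")
```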

Multiple Linear Regression
• When more than one explanatory variable is included in the
model, this is referred to as “multiple linear
regression”. Including several explanatory variables lets
multiple regression account for confounding effects.
• The model: Y = α + β1X1 + β2X2 + … + βkXk + ε
• Y is the dependent or outcome variable, which is continuous and
normally distributed; X1 is the 1st explanatory variable, …,
Xk is the kth explanatory variable. The Xs can be continuous
or categorical.
• β1 is the slope associated with the 1st explanatory variable, …,
βk is the slope associated with the kth explanatory variable.
Example: (hypertension pediatrics)
• A study was conducted on 16 infants to assess the
association between hypertension (in terms of SBP) and
babies’ birthweight and age (in days).
• Are babies’ birthweight and age good predictors of
hypertension in babies?
• The data collected on the 16 infants are the following:
i     Birthweight in oz (X1)    Age in days (X2)    Systolic Blood Pressure in mm Hg (Y)
1     135                       3                   89
2     120                       4                   90
3     100                       3                   83
4     105                       2                   77
5     130                       4                   92
6     125                       5                   98
7     125                       2                   82
8     105                       3                   85
9     120                       5                   96
10    90                        4                   95
11    120                       2                   80
12    95                        3                   79
13    120                       3                   86
14    150                       4                   97
15    160                       3                   92
16    125                       3                   88
• SPSS output for the fitted model with SBP (Y) as the
dependent variable, and babies’ birthweight (X1) and age (X2) as
the explanatory (or predictive) variables:
• Estimated α = 53.45,
• Estimated β1 = 0.126 (P-value = 0.003),
• Estimated β2 = 5.88 (P-value = .000)

• The fitted model is:
Ŷ = 53.45 + 0.125 X1 + 5.887 X2
• i.e.: predicted SBP = 53.45 + 0.125(Birthweight) + 5.887(Age)
• Interpretations:
◦ Each 1 ounce increase in birthweight corresponds to a 0.125
mm Hg increase in the baby’s SBP on average, adjusting for the effect
of age in the model.
◦ Each 1 day increase in age corresponds to a 5.887 mm Hg
increase in the baby’s SBP on average, adjusting for the effect of
birthweight in the model.
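The coefficients can be checked against the 16-infant data with a short least-squares sketch. It solves the normal equations for two predictors in closed form. This is an illustration only, not the SPSS routine, and the helper names `cross` and `fit_two_predictors` are hypothetical. With these data the slopes come out at about 0.1256 and 5.888, which agrees with the values quoted on the slides up to rounding in the last digit.

```python
# Data copied from the 16-infant table.
birthweight = [135, 120, 100, 105, 130, 125, 125, 105,
               120, 90, 120, 95, 120, 150, 160, 125]
age = [3, 4, 3, 2, 4, 5, 2, 3, 5, 4, 2, 3, 3, 4, 3, 3]
sbp = [89, 90, 83, 77, 92, 98, 82, 85, 96, 95, 80, 79, 86, 97, 92, 88]

def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    u_bar, v_bar = mean(u), mean(v)
    return sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))

def fit_two_predictors(x1, x2, y):
    """Least-squares estimates (a, b1, b2) for y = a + b1*x1 + b2*x2."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    a = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return a, b1, b2

a, b1, b2 = fit_two_predictors(birthweight, age, sbp)
print(f"a = {a:.2f}, b1 = {b1:.4f}, b2 = {b2:.4f}")
```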
• The overall F test is used in multiple linear regression to test
the overall model’s significance.
• The hypotheses for testing the model’s significance are:
Ho: β1 = β2 = … = βk = 0
versus H1: at least one of the βj ≠ 0
• The overall F test is provided in the ANOVA table.
• The ANOVA table corresponding to the pediatrics
hypertension example (SPSS output) gives the following result:
• Pediatrics hypertension example:
• Ho: βBirthweight = βAge = 0 versus
H1: at least one of the β’s ≠ 0
• F statistic = 48.081; P-value = .000, so the overall F test is
significant. We reject the null hypothesis that both slopes are
zero and deduce that at least one of the predictors (Birthweight
and/or Age) is associated with SBP.
• In multiple linear regression a separate t test is performed on
each individual slope, testing the hypotheses:
Ho: βi = 0, all other βk’s are ≠ 0
H1: βi ≠ 0, all other βk’s are ≠ 0
• Or a different way of stating these hypotheses:
Ho: βi = 0 in a model that contains the other covariates
H1: βi ≠ 0 in a model that contains the other covariates
• Hence we are testing the effect of a specific covariate on the
outcome in the presence of the other covariates.
• In the pediatrics hypertension example the first hypotheses to be
tested are:
Ho: βBirthweight = 0, βAge ≠ 0
H1: βBirthweight ≠ 0, βAge ≠ 0
Or a different way of stating these hypotheses:
Ho: βBirthweight = 0 in a model that contains age
H1: βBirthweight ≠ 0 in a model that contains age
• t statistic = 3.657; P-value = 0.003; thus the t test is significant.
So we reject the null hypothesis that the slope for
birthweight is zero and conclude that birthweight
contributes significantly to explaining the dependent
variable SBP, adjusting for the effect of age in the model.

• The second set of hypotheses to be tested is:
Ho: βAge = 0, βBirthweight ≠ 0
H1: βAge ≠ 0, βBirthweight ≠ 0
• Or a different way of stating these hypotheses:
Ho: βAge = 0 in a model that contains Birthweight
H1: βAge ≠ 0 in a model that contains Birthweight
• t statistic = 8.656; P-value = 0.000; thus the t test is significant and
we conclude that age contributes significantly to explaining
SBP, adjusting for the effect of birthweight in the model.
• From the overall F test we concluded that at least one of the
explanatory variables contributes significantly to predicting
pediatric SBP.
• The separate t tests conducted on the individual slopes
suggested that both slopes are significantly different from
zero.
• Thus we conclude that both birthweight and age of the
baby contribute significantly to predicting SBP.
• The coefficient of determination (R²) corresponding to the
pediatric hypertension example (provided by SPSS) is 0.881.
• Thus 88.1% of the variability in SBP is explained by the
model that includes both birthweight and age.
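The coefficient of determination and the overall F statistic are linked by a standard identity for a model with k predictors and n observations: F = (R²/k) / ((1 − R²)/(n − k − 1)). The sketch below applies it to the reported R² = 0.881 with n = 16 and k = 2. Because R² is rounded to three decimals, the result only approximates the 48.081 given in the SPSS output. The function name `overall_f_from_r2` is hypothetical.

```python
def overall_f_from_r2(r2, n, k):
    """Overall F statistic for a linear model with k predictors and n observations."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# R-squared as reported on the slide; 16 infants, 2 predictors.
f_stat = overall_f_from_r2(0.881, n=16, k=2)
print(f"F = {f_stat:.1f} on 2 and 13 degrees of freedom")
```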

• Predict the average SBP for babies whose birthweight is
135 ounces and who are 2 days old.
Predicted SBP = 53.45 + 0.125(Birthweight) + 5.887(Age)
= 53.45 + 0.125(135) + 5.887(2) = 82.099
• Thus babies who are 2 days old and weigh 135 ounces
are expected to have an average SBP of 82.099 mm Hg.
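The same prediction can be expressed as a function of the two predictors. This is a small sketch that uses the coefficients of the fitted model as written on the slide. The function name `predict_sbp` is hypothetical.

```python
def predict_sbp(birthweight_oz, age_days):
    """Predicted mean SBP (mm Hg) from the fitted model on the slide."""
    return 53.45 + 0.125 * birthweight_oz + 5.887 * age_days

sbp_hat = predict_sbp(135, 2)
print(f"predicted mean SBP: {sbp_hat:.3f} mm Hg")
```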
Simple Logistic Regression
• In linear regression (simple and multiple) the dependent
variable Y was continuous and normally distributed.
• When the dependent variable Y is a binary variable (for example,
lung cancer yes/no), linear regression cannot be used.
• Instead we use what is known as “logistic regression”.
• The association between X and Y is measured using
the “odds ratio” (OR).
• Y = lung cancer status (1 = yes, 0 = no)
• X = smoking status (1 = yes, 0 = no)

                 Smoking status
Lung cancer      yes      no
yes              10       6
no               20       74
• Logistic regression is fitted in SPSS with
Y = lung cancer (1 for yes and 0 for no) as the dependent variable and
X = smoking status (1 for yes and 0 for no) as the explanatory
variable.
• The odds ratio of lung cancer is equal to exp(β)
• Ho: OR = exp(β) = 1 (i.e. no association between X
and Y)
H1: OR = exp(β) ≠ 1
Or equivalently:
• Ho: β = 0
H1: β ≠ 0
• If Ho is true then exp(β) = 1, which means there is no
association between smoking and lung cancer.
• Odds ratio (OR) = exp(1.819) = 6.167 with P-value = 0.002 <
0.05, so the test is significant: we reject the null hypothesis and
conclude that smoking and lung cancer are significantly
associated.

• Interpretation: The odds of having lung cancer for smokers are
6.167 times the odds of having lung cancer among nonsmokers.
• Interpretation of the 95% CI for the OR: we are 95% confident that
the true odds ratio of lung cancer for smokers versus non-smokers
lies between 2.0 and 19.018. This CI doesn’t include
1, so we can say that lung cancer is significantly associated with
smoking.
• If the 95% CI for the OR contains 1 then the association between X
and Y is not statistically significant; otherwise the association is significant.
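For a single binary predictor, the odds ratio and its 95% confidence interval can be reproduced directly from the 2×2 table of lung cancer by smoking status. The sketch below uses the cross-product ratio and the usual Wald interval on the log-odds scale. The function name `odds_ratio_2x2` and its argument names are hypothetical.

```python
import math

def odds_ratio_2x2(a, b, c, d, z=1.96):
    """a, b = cases among exposed / unexposed; c, d = non-cases among exposed / unexposed."""
    or_hat = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of log(OR)
    lower = math.exp(math.log(or_hat) - z * se_log_or)
    upper = math.exp(math.log(or_hat) + z * se_log_or)
    return or_hat, lower, upper

# Counts from the table: lung cancer yes (10 smokers, 6 nonsmokers),
# lung cancer no (20 smokers, 74 nonsmokers).
or_hat, lower, upper = odds_ratio_2x2(a=10, b=6, c=20, d=74)
significant = not (lower <= 1 <= upper)   # CI excluding 1 means a significant association
print(f"OR = {or_hat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f}), significant: {significant}")
```

The result agrees with the SPSS values quoted above (OR 6.167, CI 2.0 to 19.018) up to rounding.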

Multiple Logistic Regression
• If more than one predictor (or explanatory) variable is
involved in explaining the dependent variable then
multiple logistic regression should be used.
• Multiple logistic regression allows us to account for covariate
effects.
• Example: Assume we want to assess the association
between having an accident in the past year (yes/no) as the dependent
variable and driver’s age, vision problem (yes = 1/no = 0), and whether
the driver took a driving education course (yes = 1/no = 0) as
independent explanatory variables.
• Simple logistic regression gives the unadjusted OR, while
multiple logistic regression gives the adjusted OR.
Accident in past year?    Vision problem?    Driver education course?
1                         1                  1
1                         0                  0
1                         1                  0
1                         0                  0
1                         1                  1
0                         0                  1
0                         1                  1
0                         0                  0
0                         0                  1
…                         …                  …
1                         0                  1
1                         1                  0
1                         1                  0
• SPSS output for multiple logistic regression corresponding to
the accident example.
Variables in the Equation
                                                                    95% C.I. for EXP(B)
                            B       S.E.    Wald    df   Sig.   Exp(B)   Lower    Upper
Step 1a  vision             1.710   .706    5.872   1    .015   5.527    1.387    22.036
         drivereducation   -1.494   .705    4.496   1    .034    .224     .056      .893
         age                 .007   .018     .129   1    .719   1.007     .971     1.043
         Constant           -.188   .995     .036   1    .850    .828
a. Variable(s) entered on step 1: vision, drivereducation, age.

• The odds of being in a car accident for drivers with vision
problems are 5.527 times those of drivers without a vision problem,
after adjusting for the effect of age and whether the
driver attended the driver’s education program.
• Hence, adjusting for driving education course and age, since the
OR > 1, vision problem is considered a risk factor for
accidents.
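Each Exp(B) and its 95% confidence interval in the SPSS table follows from the coefficient B and its standard error: OR = exp(B), with limits exp(B ± 1.96 × S.E.). The sketch below recomputes them from the three-decimal values printed in the table, so the last digits can differ slightly from SPSS. The helper name `odds_ratio_with_ci` is hypothetical.

```python
import math

# B and S.E. copied from the SPSS "Variables in the Equation" table.
coefficients = {
    "vision": (1.710, 0.706),
    "drivereducation": (-1.494, 0.705),
    "age": (0.007, 0.018),
}

def odds_ratio_with_ci(b, se, z=1.96):
    """Adjusted odds ratio exp(b) with a Wald 95% confidence interval."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

results = {name: odds_ratio_with_ci(b, se) for name, (b, se) in coefficients.items()}
for name, (or_hat, lower, upper) in results.items():
    significant = not (lower <= 1 <= upper)   # CI excluding 1 means significant
    print(f"{name}: OR = {or_hat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f}), "
          f"significant: {significant}")
```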
• The odds ratio of a car accident for vision problem is equal to
exp(βvision_problem)
• Ho: OR = exp(βvision_problem) = 1 in a model that contains age and
education course
H1: OR = exp(βvision_problem) ≠ 1 in a model that contains age and
education course
Or equivalently:
• Ho: βvision_problem = 0 in a model that contains age and
education course
H1: βvision_problem ≠ 0 in a model that contains age and education
course

• The odds ratio of a car accident for driver education is equal to
exp(βdrivers_education)
• Ho: OR = exp(βdrivers_education) = 1 in a model that contains age
and vision problem
H1: OR = exp(βdrivers_education) ≠ 1 in a model that contains age
and vision problem
Or equivalently:
• Ho: βdrivers_education = 0 in a model that contains age and vision
problem
H1: βdrivers_education ≠ 0 in a model that contains age and vision
problem

• The odds ratio of a car accident for age is equal to exp(βage)
• Ho: OR = exp(βage) = 1 in a model that contains education class
and vision problem
H1: OR = exp(βage) ≠ 1 in a model that contains education class
and vision problem
Or equivalently:
• Ho: βage = 0 in a model that contains education class and
vision problem
H1: βage ≠ 0 in a model that contains education class and
vision problem



• For vision problem, P-value = 0.015, indicating a significant
association between vision problem and the odds of a car accident,
adjusted for age and education course.

• We are 95% confident that the true OR of car accidents for
those with a vision problem compared to those without this
problem lies between 1.387 and 22.036. This CI doesn’t include
1, so we reject Ho of no association and deduce that
vision problem is associated with the odds of a car accident, adjusting
for age and education course.



• The odds of being in a car accident for drivers who attended the
driver’s education program are 0.224 times those of drivers who did
not, adjusting for age and vision problem.
• Hence, the odds of being in a car accident for drivers who did
not attend the driver’s education program are 1/0.224 = 4.46
times those of drivers who did, adjusting for age and vision
problem.
• Since the OR = 0.224 < 1, adjusting for age and vision
problem, taking the driving education course is a protective
factor against car accidents.
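Switching the reference group inverts the odds ratio, and the confidence limits are inverted and swapped as well. A short sketch with the driver education values from the SPSS table follows. The variable names are hypothetical.

```python
# Exp(B) and 95% CI for drivereducation (attended vs did not attend), from the SPSS table.
or_attended = 0.224
ci_attended = (0.056, 0.893)

# Did not attend vs attended: take reciprocals; the limits swap places.
or_not_attended = 1 / or_attended
ci_not_attended = (1 / ci_attended[1], 1 / ci_attended[0])

print(f"OR = {or_not_attended:.2f}, "
      f"95% CI = ({ci_not_attended[0]:.2f}, {ci_not_attended[1]:.2f})")
```

The inverted interval still excludes 1, so the conclusion about significance does not depend on which group is taken as the reference.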



• For driver education, P-value = 0.034, indicating a significant
association between driver education and the odds of a car accident,
adjusted for age and vision problem.

• We are 95% confident that the true OR of car accidents for
those who took the education course versus those who did not
lies between 0.056 and 0.893. This CI does not include 1, so we
reject Ho of no association and deduce that the
education course is associated with the odds of a car accident,
adjusting for age and vision problem.



• Accounting for vision and education, the odds of being in a car
accident are multiplied by 1.007 for each 1-year increase in
age.
• Since P-value = 0.719 > 0.05 and the 95% CI for the OR, (0.971, 1.043),
includes 1, this result is not significant: age is not
associated with car accidents in a model that includes vision
and driver’s education.
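For a continuous predictor, the odds ratio applies per unit and compounds multiplicatively. A c-unit increase multiplies the odds by exp(c × B), which equals the per-unit odds ratio raised to the power c. The sketch below uses B = 0.007 for age from the SPSS table. The 10-year comparison is an illustrative assumption and is not a result reported in the lecture.

```python
import math

b_age = 0.007                            # coefficient for age from the SPSS table
or_per_year = math.exp(b_age)            # odds ratio for a 1-year increase in age
or_per_decade = math.exp(10 * b_age)     # odds ratio for a 10-year increase (illustrative)

print(f"per year: {or_per_year:.3f}, per 10 years: {or_per_decade:.3f}")
```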
• Thus the odds of a car accident are associated with vision
problem and driver’s education class but not with age.
• Specifically, the odds of a car accident increase with vision
problem (risk factor) and decrease when the driver attends
a driver education course (protective factor).
Concluding remarks on Regression
• Regression, whether linear or logistic, can be used when the
epidemiological study design has independent samples.
• Logistic regression is used when the outcome Y is binary, and
linear regression is used when the outcome Y is continuous
and normally distributed.
• Multiple regression is the tool to account for confounders’
effects.
• Simple logistic regression generates the unadjusted OR while
multiple logistic regression generates the adjusted OR.
EPHD310 Basic Biostatistics Course Learning Outcomes Per FHS Catalogue
LO4. Analyze quantitative data using common statistical methods for
inference through computer based statistical software and manual
computation.
LO6. Interpret results of statistical analyses found in public health studies
and biomedical sciences.
LO7. Apply ethical principles to data management and analysis.
