Logistic Regression (Chapter 20) : Example - High Dieldrin Levels in Western Australian Breast Feeding Mothers
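The logistic regression model using New Suburb as the predictor is

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\text{NewSuburb}$$

where,

$$\text{NewSuburb} = \begin{cases} +1 & \text{mother lives in a new suburb} \\ -1 & \text{mother lives in an old suburb} \end{cases}$$

and $p$ = P(High). The log odds for a mother living in a new suburb is given by

$$\ln(\text{odds for High for mothers living in a new suburb}) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1$$

and for a mother living in an old suburb is given by

$$\ln(\text{odds for High for mothers living in an old suburb}) = \ln\left(\frac{p}{1-p}\right) = \beta_0 - \beta_1$$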
The difference in the log odds is equivalent to the log of the odds ratio (OR) because of
the following property of logarithms.
$$\ln\left(\frac{x}{y}\right) = \ln(x) - \ln(y)$$
Applying this property here we have
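$$\begin{aligned}
&\ln(\text{odds for High for mothers in a new suburb}) - \ln(\text{odds for High for mothers in an old suburb}) \\
&\quad = \ln\left(\frac{\text{odds for High for mothers in a new suburb}}{\text{odds for High for mothers in an old suburb}}\right) = \ln(OR) \\
&\quad = (\beta_0 + \beta_1) - (\beta_0 - \beta_1) = 2\beta_1
\end{aligned}$$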
This says that the OR associated with living in a new suburb is given by
$$OR = e^{2\beta_1}$$
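The relationship between the difference in log odds and the OR can be checked numerically. The following is a minimal sketch, assuming New Suburb is coded +1 for a new suburb and -1 for an old suburb (the coding under which the difference in log odds equals twice the coefficient). The coefficient values are hypothetical, chosen only for illustration, and are not estimates from the dieldrin data.

```python
import math

def log_odds(beta0, beta1, x):
    """Log odds ln(p/(1-p)) under the model beta0 + beta1*x."""
    return beta0 + beta1 * x

# Hypothetical coefficients for illustration only (not fitted values).
beta0, beta1 = -0.5, 0.8

# Assumed coding: +1 = new suburb, -1 = old suburb.
diff = log_odds(beta0, beta1, +1) - log_odds(beta0, beta1, -1)
odds_ratio = math.exp(diff)

print(diff)        # difference in log odds, equal to 2 * beta1
print(odds_ratio)  # odds ratio, equal to exp(2 * beta1)
```

Whatever value is used for the intercept, it cancels in the difference, which is why the OR depends only on the slope coefficient.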
Fitting the New Suburb Logistic Regression Model in JMP
Select Fit Model and place High Dieldrin in the Y box and New Suburb in the Model
Effects box.
Resulting output
The estimated OR associated with living in a new suburb is then
We can use JMP to compute the ORs by selecting Nominal Logistic > Odds Ratio
Similarly for House Treated we have the following logistic regression model.
Finding Predicted Probabilities
The logistic regression model can be used to estimate the probability of success given a
set of predictor values as follows:
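$$p = P(\text{success} \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

for situations where we have a single predictor, and is given by

$$p = P(\text{success} \mid X) = \frac{e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k}}{1 + e^{\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k}}$$

for situations where we have $k$ predictors.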
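For the model using Age as the single predictor, the fitted equation is

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\text{Age} = -4.0886156 + 0.1222 \times \text{Age}$$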
Note: The response in logistic regression is the natural log of the odds for success.
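Because the response is the log odds, a predicted probability is recovered by solving the model equation for p. The steps below are ordinary algebra, written with η as shorthand for the linear predictor (η is a symbol introduced only for this sketch, e.g. η = β0 + β1·Age):

```latex
\begin{aligned}
\ln\left(\frac{p}{1-p}\right) &= \eta \\
\frac{p}{1-p} &= e^{\eta} \\
p &= e^{\eta}\,(1-p) \\
p\,\left(1+e^{\eta}\right) &= e^{\eta} \\
p &= \frac{e^{\eta}}{1+e^{\eta}}
\end{aligned}
```

The last line has the same form as the predicted-probability formula used throughout this handout.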
The blue curve added to the plot gives P(High|Age) = p. For example, for a mother 25
years of age the predicted probability of finding a high dieldrin level in her breast milk
is about .25. For a mother 35 years of age this probability increases to around .50. The distance
from the top of the plot to the curve represents P(Low|Age). To attach an odds ratio
to a mother's age we need to pick an incremental increase of interest, e.g. suppose we
wanted to find the odds ratio associated with a 5-year increase in age. The associated
odds ratio is found as follows:
$$\text{OR for 5-year increase in age} = e^{5 \times 0.122} = 1.84$$
Thus for a 5-year increase in age a mother's odds for having high dieldrin are 1.84 times
higher, or alternatively there is an 84% increase in her odds for having high dieldrin
levels in her breast milk.
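The same calculation works for any increment of interest. A small sketch, assuming the rounded Age coefficient of 0.122 used in the text; the function name is made up for this illustration:

```python
import math

def odds_ratio_for_increase(beta, increase):
    """Odds ratio for a given increase in a continuous predictor: exp(increase * beta)."""
    return math.exp(increase * beta)

beta_age = 0.122  # rounded Age coefficient as used in the text

or_5yr = odds_ratio_for_increase(beta_age, 5)
pct_increase = (or_5yr - 1) * 100  # percent increase in the odds

print(round(or_5yr, 2))     # 1.84
print(round(pct_increase))  # 84
```

Changing the second argument gives the OR for a 1-year, 10-year, or any other increase in age.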
Predicted Probabilities for Logistic Model Using Age
We can use the logistic regression model to obtain predicted probabilities of high dieldrin
levels as a function of age using

$$P(\text{High} \mid \text{Age}) = \frac{e^{-4.089 + 0.1222\,\text{Age}}}{1 + e^{-4.089 + 0.1222\,\text{Age}}}$$

For example,

$$P(\text{High} \mid \text{Age}=25) = \frac{e^{-4.089 + 0.1222 \times 25}}{1 + e^{-4.089 + 0.1222 \times 25}} = .2623$$

$$P(\text{High} \mid \text{Age}=35) = \frac{e^{-4.089 + 0.1222 \times 35}}{1 + e^{-4.089 + 0.1222 \times 35}} = .5469$$
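These predicted probabilities can be reproduced outside JMP. A minimal sketch, assuming the rounded coefficients -4.089 and 0.1222 from the formula above; the function name is illustrative only:

```python
import math

def predicted_probability(age, beta0=-4.089, beta1=0.1222):
    """P(High | Age) from the fitted single-predictor logistic model."""
    log_odds = beta0 + beta1 * age
    return math.exp(log_odds) / (1 + math.exp(log_odds))

p25 = predicted_probability(25)
p35 = predicted_probability(35)

print(round(p25, 4))  # 0.2623
print(round(p35, 4))  # 0.5469
```

Evaluating the function over a range of ages traces out the curve drawn on the logistic plot.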
Multiple Logistic Regression Model
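Now we consider a logistic regression model using all three predictors.

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\text{NewSuburb} + \beta_2\,\text{Treated} + \beta_3\,\text{Age}$$

where,

$$\text{NewSuburb} = \begin{cases} +1 & \text{new suburb} \\ -1 & \text{old suburb} \end{cases} \qquad \text{Treated} = \begin{cases} +1 & \text{house treated} \\ -1 & \text{house not treated} \end{cases}$$

and the OR associated with dichotomous risk factor $i$ is $\exp(2\beta_i)$, i.e. $e^{2\beta_i}$.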
Examples:
For New Suburb we have: For House Treated we have:
To find a crude 95% CI for the OR associated with risk factor i we compute

$$\exp\left(2\left(\beta_i \pm (\text{normal or t-table value}) \times SE(\beta_i)\right)\right)$$

which gives lower and upper confidence limits for the true OR associated with the risk
factor.
Examples:
For New Suburb we have:

$$\exp\left(2\,(1.0703 \pm 1.96 \times 0.4678)\right) = (1.359,\ 53.22)$$

For House Treated we have:

$$\exp\left(2\,(1.2984 \pm 1.96 \times 0.4873)\right) = (1.986,\ 90.65)$$
These intervals are very wide because the sample size (n = 45) is not very big. Typically
these types of studies require a larger sample size to get precise CIs for ORs.
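The crude intervals above are easy to recompute. The sketch below assumes the +1/-1 coding of the risk factors (hence the factor of 2) and uses the coefficient and standard error values quoted in the examples; the function name is hypothetical.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Point estimate and crude CI for the OR of a +1/-1 coded risk factor."""
    estimate = math.exp(2 * beta)
    lower = math.exp(2 * (beta - z * se))
    upper = math.exp(2 * (beta + z * se))
    return estimate, lower, upper

new_suburb = odds_ratio_ci(1.0703, 0.4678)
house_treated = odds_ratio_ci(1.2984, 0.4873)

for name, (est, lo, hi) in [("New Suburb", new_suburb),
                            ("House Treated", house_treated)]:
    print(f"{name}: OR = {est:.2f}, 95% CI = ({lo:.3f}, {hi:.2f})")
```

The limits agree with the hand calculations up to rounding of the inputs. The interval is symmetric on the log-odds scale, so it is strongly asymmetric around the OR itself.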
We can obtain both the ORs and their confidence intervals using JMP as follows.
Select both the options
The resulting output is shown on the following page.
Multiple Logistic Regression Model
Odds Ratios calculates the odds ratios for all predictors in the model.
Confidence Intervals provides CIs for the odds ratios, calculated using a method
slightly different from the approach above.
ROC Curve draws an ROC curve, which is shown and discussed later in the
handout. (Professional JMP only!)
The ORs associated with living in a home treated for termites and living in a new suburb
are considerably larger than those found when examining their effects individually. The
differences arise because the factors themselves are potentially related, and as a result
their estimated effects differ when they are placed in a model jointly.
The odds ratio reported for age is found by using Max(Age) - Min(Age) as the
incremental increase. For these data Max(Age) = 37 and Min(Age) = 21, thus a mother
who is 37 has 28.055 times higher odds for having high dieldrin levels in her breast milk
when compared to a mother who is 21 years of age. It is better to use an increment like 5
years instead, i.e. the OR associated with a 5-year increase in age is calculated as follows:
$$OR = \exp(0.2083 \times 5) = \exp(1.042) = 2.833$$
As stated previously, the confidence intervals for all of the ORs are quite broad in this
study because the sample size is small (n = 45).
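As a check on how the reported age OR relates to the coefficient, the 16-year odds ratio can be recomputed by hand. This sketch assumes the rounded Age coefficient 0.2083 quoted above and writes Δ for the chosen increment (a symbol introduced only here):

```latex
\begin{aligned}
OR_{\Delta} &= \exp(\Delta \times \beta_{\text{Age}}) \\
OR_{16} &= \exp(16 \times 0.2083) = \exp(3.3328) \approx 28.0
\end{aligned}
```

This is in line with the reported value of 28.055; the small gap presumably comes from rounding the coefficient to four decimal places.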
Predicted Probabilities Using All Available Predictors
The predicted probabilities of high dieldrin can be found as follows.
$$P(\text{High Dieldrin} \mid \text{House Treated}, \text{New Suburb}, \text{Age}) = \frac{e^{-6.604 + 1.070\,\text{NewSuburb} + 1.298\,\text{HouseTreated} + 0.2084\,\text{Age}}}{1 + e^{-6.604 + 1.070\,\text{NewSuburb} + 1.298\,\text{HouseTreated} + 0.2084\,\text{Age}}}$$
For example, the probability of high dieldrin for a 30 year old mother living in a home
treated for termites in an old suburb is estimated to be:

$$P(\text{High} \mid \text{Old Suburb}, \text{House Treated}, \text{Age} = 30) = \frac{e^{-6.604 - 1.070 + 1.298 + 0.2084 \times 30}}{1 + e^{-6.604 - 1.070 + 1.298 + 0.2084 \times 30}} = .4690$$
For a 25 year old mother living in a home treated for termites located in a new suburb the
probability of high dieldrin is estimated to be:
$$P(\text{High} \mid \text{New Suburb}, \text{House Treated}, \text{Age} = 25) = \frac{e^{-6.604 + 1.070 + 1.298 + 0.2084 \times 25}}{1 + e^{-6.604 + 1.070 + 1.298 + 0.2084 \times 25}} = .7259$$
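Both worked examples can be reproduced with a short function. This is a sketch, assuming the rounded coefficients shown in the formula and the +1/-1 coding of the two dichotomous predictors (+1 = new suburb or treated house, -1 = old suburb or untreated house); the function and argument names are illustrative.

```python
import math

def p_high(new_suburb, house_treated, age):
    """P(High Dieldrin) from the three-predictor model.

    new_suburb, house_treated: +1 for new suburb / treated house, -1 otherwise
    (assumed coding).
    """
    log_odds = (-6.604 + 1.070 * new_suburb
                + 1.298 * house_treated + 0.2084 * age)
    return math.exp(log_odds) / (1 + math.exp(log_odds))

p_old_treated_30 = p_high(new_suburb=-1, house_treated=+1, age=30)
p_new_treated_25 = p_high(new_suburb=+1, house_treated=+1, age=25)

print(round(p_old_treated_30, 4))  # 0.469
print(round(p_new_treated_25, 4))  # 0.7259
```

Note that the sign of each dichotomous term flips with the level of the factor, which is why the old-suburb example subtracts 1.070 rather than dropping the term.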
Estimates of the P(High Dieldrin|New Suburb, House Treated, Age) Using
Professional Version of JMP (FYI)
Selecting Save Probability Formula from the Nominal Logistic Fit pull down menu
places the predicted probabilities of high and low dieldrin levels in the spreadsheet along
with the predicted status. The predicted status is determined by whichever probability is
larger, low dieldrin level or high dieldrin level, given their demographics.
Here is a portion of this output which will appear back in the original data spreadsheet.
P(Low|X) P(High|X)
We can compare the predicted dieldrin status to the actual via a contingency table. Select
Fit Y by X from the Analyze menu and place Most Likely High Dieldrin in the X box and
High Dieldrin in the Y box. The table and mosaic plot are shown below.
[Mosaic plot: Contingency Analysis of High Dieldrin By MostLikely High Dieldrin. Horizontal axis: MostLikely High Dieldrin (High, Low); vertical axis: High Dieldrin (High, Low), proportion scale 0.00 to 1.00.]
From the table we see that 26.7% of mothers classified as having high dieldrin levels
actually had low dieldrin levels, similarly 17.9% of those classified as having low
dieldrin levels actually had high dieldrin levels. In total 9 out of 43 mothers were
misclassified for an estimated overall error rate of 20.9%.
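The error rates quoted here follow directly from the four cell counts in the classification table of predicted versus actual status (11, 4, 5 and 23). A minimal sketch; the dictionary layout and variable names are choices made for this illustration:

```python
# counts[predicted][actual], taken from the classification table
counts = {
    "High": {"High": 11, "Low": 4},
    "Low": {"High": 5, "Low": 23},
}

total = sum(sum(row.values()) for row in counts.values())

# Share of each predicted group whose actual status differs
wrong_among_pred_high = counts["High"]["Low"] / sum(counts["High"].values())
wrong_among_pred_low = counts["Low"]["High"] / sum(counts["Low"].values())

# Overall misclassification rate
misclassified = counts["High"]["Low"] + counts["Low"]["High"]
error_rate = misclassified / total

print(f"{wrong_among_pred_high:.1%}")  # 26.7%
print(f"{wrong_among_pred_low:.1%}")   # 17.9%
print(f"{misclassified} of {total} = {error_rate:.1%}")  # 9 of 43 = 20.9%
```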
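| Predicted Status | Actual High | Actual Low | Total |
|---|---|---|---|
| High | 11 (73.33%) | 4 (26.67%) | 15 |
| Low | 5 (17.86%) | 23 (82.14%) | 28 |
| Total | 16 | 27 | 43 |

Percentages are row percentages.

Receiver Operating Characteristic (ROC) Curve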
The Receiver Operating Characteristic (ROC) curve plots the true positive probability
(sensitivity) vs. the false positive probability (1 - specificity). As the sensitivity
increases the false positive rate increases, as expected. A good classification rule based
upon a logistic model should have an area beneath the ROC curve of .90 or higher. Here we
do not quite meet that standard.
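Each point on an ROC curve is a (false positive rate, sensitivity) pair for one classification cutoff. As a sketch, the pair for the "most likely status" rule (equivalent to a cutoff of 0.5 on P(High|X)) can be computed from the counts in the classification table of predicted versus actual status. The variable names are illustrative, and only this single point is recoverable from the table; the full curve requires the individual predicted probabilities.

```python
# Cell counts from the classification table (predicted vs. actual status)
true_pos = 11   # predicted High, actually High
false_pos = 4   # predicted High, actually Low
false_neg = 5   # predicted Low, actually High
true_neg = 23   # predicted Low, actually Low

sensitivity = true_pos / (true_pos + false_neg)   # true positive rate
specificity = true_neg / (true_neg + false_pos)
false_pos_rate = 1 - specificity                  # x-axis of the ROC plot

print(round(sensitivity, 4))     # 0.6875
print(round(false_pos_rate, 4))  # 0.1481
```

Lowering the cutoff would classify more mothers as High, raising both the sensitivity and the false positive rate, which is the trade-off the curve displays.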
[ROC plot: True Positive (Sensitivity) on the vertical axis vs. False Positive (1-Specificity) on the horizontal axis, both scaled 0.00 to 1.00. Area Under ROC Curve = 0.83449]
Example 2: Risk Factors for Low Birth Weight
These data come from a case-control study where risk factors for having an infant with
low birth weight (< 2500 g) were studied. The following information was recorded for
each mother in the study: (Data File: LowBirth)
Low Birth Weight: indicator of birth weight status (Low or Normal)
Prev?: previous history of premature labor (History or None)
Hyper: hypertension during pregnancy (HT or Normal)
Smoke: mother smoked during pregnancy (Cig or No Cig)
Uterine: uterine irritability during pregnancy (Irritation or None)
Minority: minority status of mother (Nonwhite or White)
Age: age of mother
Lwt: mother's weight at last menstrual cycle
Important JMP Note: For interpretation purposes it is best to code the
outcome so that the adverse outcome is alphabetically first. The same is true
for risk factors, code them so the level that would be associated with
increased risk is alphabetically first.
To fit the multiple logistic regression model select Analyze > Fit Model and set up the
dialog box as shown below.
After using backward elimination to remove non-significant predictors, uterine irritability
and mother's age here, we have the following.
The only predictor which represents something a mother could control or change is
smoking during pregnancy. This is the primary factor of interest in this study and the
other factors, while interesting, are there for control purposes only. In summarizing the
effect of smoking we would use a phrase like: "adjusting for age, pre-pregnancy weight,
race, hypertension, uterine irritability, and previous history of premature labor, we find
the OR associated with smoking is OR = 2.66." This says that, after adjusting for these
factors, the odds for having a low birth weight infant are 2.66 times larger for mothers
who smoked during pregnancy.