Epidemiology and Data Analysis
Lecture 16
Logistic Regression
Tue, 10 July 2012
Masashi Kizuki
Health Promotion/ International Health
Tokyo Medical and Dental University
Today’s Topics
• Contingency table method (review)
• Logistic regression analysis
• Likelihood ratio test
• The Hosmer-Lemeshow goodness-of-fit test

Case 1
We will compare two antibiotics, cefaclor and
amoxicillin, in an RCT of 214 children with
acute otitis media. The primary endpoint is
cure within 14 days.
Are the two antibiotics different? Use children as
the unit of analysis.
2x2 Contingency Table
Table. Crude association
              Cured   Not cured
Cefaclor        89       61
Amoxicillin     56       72
The odds of cure with cefaclor are
89/61 = 1.459
The odds of cure with amoxicillin are
56/72 = 0.778
The odds ratio (OR) of cure comparing cefaclor
with amoxicillin is
OR = 1.459/0.778 = 1.88
95% Confidence Interval for OR
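The interval can be sketched numerically from the crude 2x2 table. The snippet below assumes the usual log-based (Woolf) standard error, sqrt(1/a + 1/b + 1/c + 1/d); the slide does not state which method it uses, so treat this as one reasonable choice rather than the lecture's own formula.

```python
import math

# Counts from the crude 2x2 table.
a, b = 89, 61   # cefaclor: cured, not cured
c, d = 56, 72   # amoxicillin: cured, not cured

odds_ratio = (a / b) / (c / d)

# Woolf standard error of log(OR); the choice of method is an assumption.
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# 95% limits are computed on the log scale, then exponentiated.
lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The standard error from this formula is about 0.244.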
Chi-square Test for 2x2 Table
H0: not different vs. H1: different
Expected table
              Cured                  Not cured              Total
Cefaclor      150*145/278 = 78.24    150*133/278 = 71.76    150
Amoxicillin   128*145/278 = 66.76    128*133/278 = 61.24    128
Total         145                    133                    278
Test statistic: χ² = ∑ (observed - expected)² / expected = 6.72
Because 6.72 > χ²(1 df, 0.95) = 3.84, we reject H0
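As a check on the arithmetic, the expected counts and the Pearson chi-square statistic can be recomputed from the observed table. This is a plain-Python sketch without continuity correction; the variable names are illustrative.

```python
# Observed 2x2 table: rows = cefaclor, amoxicillin; columns = cured, not cured.
observed = [[89, 61], [56, 72]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under H0 = row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square: sum of (O - E)^2 / E over the four cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print([[round(e, 2) for e in row] for row in expected])
print(round(chi_square, 2))
```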

Case 2
We want to repeat the same analysis using a
mathematical model.
We will develop a statistical model to predict the
cure of disease.
Let
Y: cure by 14 days (1=yes, 0=no)
X: antibiotic (1=cefaclor, 2=amoxicillin)
Link Function
In a linear regression analysis, we expect that the relation
between the expected value of Y, E(Y), and X is linear,
i.e. E(Y) = α + β1x1 + β2x2 + …
In reality, there are many non-linear relations.
For example, when the outcome is binary (0 or 1), the
relation cannot be linear.
We will expand the model using a link function.
{Link function of E(Y)} = α + β1x1 + β2x2 + …
A link function is a kind of transformation of the expected value of Y.

Hypothetical Relation Between Probability and Exposure
To analyze a binary outcome variable, we usually use p =
Pr(Y=1) rather than Y (0 or 1) itself.
Empirically, we assume a logistic curve between the
probability of the event and the level of exposure, as below.
[Figure: logistic function. Probability of event p (from 0 to 1) plotted against level of exposure x]
Transformation From Logistic Function to Logit
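Logistic function: p = exp(α + βx) / (1 + exp(α + βx))
Then p / (1 - p) = exp(α + βx)
Take loge of each side:
loge(p / (1 - p)) = α + βx
The two forms are mathematically equivalent.
The left-hand side, loge(p / (1 - p)), is the logit of p.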
Calculation of Logit
The logit of probability p is defined as loge of the odds:
logit(p) = loge(p / (1 - p))
p     logit(p)
0     loge(0/1) = loge(0) = -∞
0.2   loge(0.2/0.8) = loge(0.25) = -1.39
0.5   loge(0.5/0.5) = loge(1) = 0
0.8   loge(0.8/0.2) = loge(4) = 1.39
1     loge(1/0) = loge(∞) = +∞
logit(p) ranges from -∞ (p=0) to +∞ (p=1).
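The table values can be reproduced with a small helper. The function names `logit` and `inverse_logit` are illustrative; the boundary cases p=0 and p=1 are handled explicitly because the logarithm is undefined there.

```python
import math

def logit(p):
    """Natural log of the odds p / (1 - p); infinite at p = 0 and p = 1."""
    if p <= 0:
        return -math.inf
    if p >= 1:
        return math.inf
    return math.log(p / (1 - p))

def inverse_logit(z):
    """Logistic function: maps a logit back to a probability."""
    return 1 / (1 + math.exp(-z))

for p in (0, 0.2, 0.5, 0.8, 1):
    print(p, round(logit(p), 2))

# Round trip: the logistic function undoes the logit.
print(round(inverse_logit(logit(0.3)), 6))
```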
Logistic Model
The relation between a binary outcome variable Y and
independent variables, x1, x2, …, can be expressed using a
logistic model,
logit(p) = α + β1x1 + β2x2 + …
where p = Pr(Y=1) and the link function is logit(p).
Here, we assume that the relation between logit(p) and
the predictors is linear, or equivalently, that the relation
between p and the predictors is a logistic curve.
In logistic regression, the maximum likelihood method is used
to estimate the model parameters, α and the β’s.
Result of Logistic Regression
Because X, antibiotic, is a categorical variable, we
should create a dummy variable for antibiotic.
Antibo(1) = 0 if amoxicillin and = 1 if cefaclor.
The estimated coefficient is β = 0.629: the log(odds) of cure is
0.629 higher when cefaclor is used than when amoxicillin is used.
Interpretation of β for Dummy Variable
Let pA be the probability of event when X1=1,
and pB be the probability when X1=0. Then,
logit(pA) = α + β
logit(pB) = α
β = logit(pA) - logit(pB) = loge(oddsA / oddsB)
Therefore, exp(β) = odds ratio comparing X1=1 with X1=0
(reference).
Interpretation of Results
exp(β) = exp(0.629) = 1.876
The odds ratio of cure comparing cefaclor with
amoxicillin is 1.88.
The 95% confidence interval for β is
0.629 ± 1.96 × 0.244 = (0.1508, 1.1072)
where 0.244 is the standard error of β.
The 95% confidence interval for OR = exp(β) is
(exp(0.1508), exp(1.1072)) = (1.16, 3.03)
Hypothesis Testing in Logistic Regression
Hypotheses are H0: β=0 vs. H1: β≠0
Test statistic is z = β / SE(β) ~ N(0, 1) under H0
The test statistic for β is
z = 0.629 / 0.244 = 2.58
Because 2.58 > z0.975 = 1.96, we reject H0
p-value = 2*Pr(z>2.58) = 0.010
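The Wald test and the confidence interval above can be reproduced from the two reported numbers, β = 0.629 and SE = 0.244. The sketch uses the identity 2*Pr(Z > |z|) = erfc(|z| / sqrt(2)) for a standard normal Z, so no statistics library is needed.

```python
import math

beta = 0.629   # estimated coefficient for cefaclor (from the slide)
se = 0.244     # its standard error (from the slide)

# Wald z statistic and two-sided p-value under H0: beta = 0.
z = beta / se
p_value = math.erfc(abs(z) / math.sqrt(2))

# 95% confidence interval for beta, then for OR = exp(beta).
ci_beta = (beta - 1.96 * se, beta + 1.96 * se)
ci_or = tuple(math.exp(limit) for limit in ci_beta)

print(round(z, 2), round(p_value, 3))
print(tuple(round(v, 4) for v in ci_beta))
print(tuple(round(v, 2) for v in ci_or))
```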
Prediction of Probability
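The logistic model is mathematically equivalent to
p = exp(α + β1x1 + β2x2 + …) / (1 + exp(α + β1x1 + β2x2 + …))
We can estimate the probability of event (e.g. Y=1) for any
combination of independent variables, x’s.
In our example,
For cefaclor (x=1), p = exp(α + β) / (1 + exp(α + β))
This is the same as the observed relative frequency, 89/150 = 0.593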
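A short sketch of the prediction step follows. The slide text does not report the intercept α, so the code assumes α = loge(56/72), the log odds of cure in the amoxicillin (reference) group from the crude table; this follows from the model having a single dummy variable, but it is a derived value, not a number quoted in the lecture.

```python
import math

# Assumption: intercept derived from the reference group (amoxicillin) odds.
alpha = math.log(56 / 72)
beta = 0.629  # coefficient for cefaclor, from the slide

def predicted_probability(x):
    """Probability of cure for x = 1 (cefaclor) or x = 0 (amoxicillin)."""
    linear_predictor = alpha + beta * x
    return math.exp(linear_predictor) / (1 + math.exp(linear_predictor))

print(round(predicted_probability(1), 3), round(89 / 150, 3))   # cefaclor
print(round(predicted_probability(0), 3), round(56 / 128, 3))   # amoxicillin
```

Each predicted probability is printed next to the observed relative frequency for the same group.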

Interpretation of β for Continuous Variable
Let pA be the probability of event when X=x1,
and pB be the probability when X=x1+1. Then
logit(pA) = α + βx1
logit(pB) = α + β(x1+1)
Then
β = logit(pB) - logit(pA)
Therefore, exp(β) = odds ratio for an increase in X by 1.
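The property that a one-unit increase in X multiplies the odds by exp(β), whatever the starting value x1, can be checked numerically. The values of α and β below are hypothetical and chosen only for illustration.

```python
import math

alpha, beta = -2.0, 0.3  # hypothetical coefficients, for illustration only

def odds(x):
    """Odds of the event at X = x under logit(p) = alpha + beta * x."""
    return math.exp(alpha + beta * x)

# The odds ratio for x1 + 1 versus x1 equals exp(beta) at every starting value.
for x1 in (0, 5, 10):
    ratio = odds(x1 + 1) / odds(x1)
    print(x1, round(ratio, 4), round(math.exp(beta), 4))
```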

Case 3
We also measured child age and side of ear. We
want to use this additional information to
improve our model.
Age: 1=0-1 years, 2=2-5 years, 3=6+ years
Side of ear: 1=one side, 2=both sides
Dummy Coding
Antibiotic (reference is amoxicillin)
cef = 1 if cefaclor, 0 if amoxicillin
Age (reference is 0-1 years)
age2_5 = 1 if 2-5 years, 0 otherwise
age6_ = 1 if 6+ years, 0 otherwise
Side of ear (reference is one side)
bothsides = 1 if both sides, 0 if one side
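The dummy coding can be written as a small function. The dummy names (cef, age2_5, age6_, bothsides) are taken from the fitted model in this lecture; the function name and the dictionary layout are illustrative.

```python
def dummy_code(antibiotic, age, side):
    """Convert the original codes into 0/1 dummy variables.

    antibiotic: 1 = cefaclor, 2 = amoxicillin (reference)
    age:        1 = 0-1 years (reference), 2 = 2-5 years, 3 = 6+ years
    side:       1 = one side (reference), 2 = both sides
    """
    return {
        "cef": int(antibiotic == 1),
        "age2_5": int(age == 2),
        "age6_": int(age == 3),
        "bothsides": int(side == 2),
    }

# A child aged 2-5 years on cefaclor with both ears affected.
print(dummy_code(antibiotic=1, age=2, side=2))
# The reference child: amoxicillin, 0-1 years, one side (all dummies are 0).
print(dummy_code(antibiotic=2, age=1, side=1))
```

A characteristic with 3 categories needs only 2 dummy variables, because the reference category is the case where both are 0.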

Multivariable Logistic Regression Model
logit(p) = -1.045 + 0.746cef
+ 1.000age2_5 + 1.605age6_
+ (-0.256)bothsides
The adjusted odds ratio comparing cefaclor with
amoxicillin is exp(0.746) = 2.11.
The difference between the adjusted and crude odds ratio
(1.88) suggests the presence of confounding.
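The fitted multivariable model can be turned into adjusted odds ratios and predicted probabilities. The coefficients below are the ones reported in the model equation; the helper name `predict` is illustrative.

```python
import math

# Coefficients of the multivariable model reported above.
coef = {
    "intercept": -1.045,
    "cef": 0.746,
    "age2_5": 1.000,
    "age6_": 1.605,
    "bothsides": -0.256,
}

def predict(cef, age2_5, age6_, bothsides):
    """Predicted probability of cure for one combination of dummy variables."""
    z = (coef["intercept"] + coef["cef"] * cef + coef["age2_5"] * age2_5
         + coef["age6_"] * age6_ + coef["bothsides"] * bothsides)
    return 1 / (1 + math.exp(-z))

# Adjusted odds ratio for each dummy variable is exp(coefficient).
adjusted_or = {name: math.exp(b) for name, b in coef.items() if name != "intercept"}
print({name: round(value, 2) for name, value in adjusted_or.items()})

# Child aged 2-5 years, one ear affected: cefaclor versus amoxicillin.
print(round(predict(1, 1, 0, 0), 3), round(predict(0, 1, 0, 0), 3))
```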
Number of Variables in Logistic Regression
It is recommended to have ≥10 cases with an event (or
with no event, whichever is fewer) for every independent
variable in the model.
In our example, there are 145 cured and 133 non-cured
children. Because 145 > 133, we will use 133.
133/10 = 13.3
We can include ≤13 independent variables in the
multivariable logistic regression.
Note that all dummy variables should be
counted. So if one characteristic has 3 categories and we
create 2 dummy variables, they are counted as 2 variables.
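The rule of thumb reduces to one line of arithmetic. The function name and the `per_variable` parameter are illustrative; 10 is the threshold recommended above.

```python
def max_independent_variables(events, non_events, per_variable=10):
    """Upper limit on model variables under the 10-per-variable rule of thumb."""
    return min(events, non_events) // per_variable

# 145 cured and 133 non-cured children: the smaller group sets the limit.
limit = max_independent_variables(145, 133)
print(limit)
```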
Likelihood
In logistic regression, likelihood quantifies the probability
of obtaining our sample data given the specified model.
The maximum likelihood method finds the model parameters, α
and the β’s, which maximize the overall likelihood.
We consider that a model with a higher likelihood is a
better model.
We use “-2 log likelihood” as an indicator for assessing
model fit. Because of its minus sign, a lower “-2 log
likelihood” indicates better fit.
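For the model with antibiotic as the only predictor, -2 log likelihood can be computed by hand. The sketch relies on the fact that, with a single dummy variable, the maximum likelihood fitted probabilities equal the observed cure proportions in each group; the function name is illustrative.

```python
import math

# (cured, not cured) for cefaclor and amoxicillin, from the crude 2x2 table.
groups = [(89, 61), (56, 72)]

def minus_two_log_likelihood(groups):
    """-2 log L when each group's fitted probability is its observed proportion."""
    log_likelihood = 0.0
    for events, non_events in groups:
        p = events / (events + non_events)  # fitted probability of cure
        log_likelihood += events * math.log(p) + non_events * math.log(1 - p)
    return -2 * log_likelihood

print(round(minus_two_log_likelihood(groups), 3))
```

The result agrees with the value 378.127 that this lecture uses for the antibiotic-only model in the likelihood ratio test examples.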
Likelihood Ratio Test
We can compare two models by -2 log likelihood.
Model 1 (k1 variables): -2 log L1
Model 2 (k1+k2 variables): -2 log L2
Model 2 is more complicated than model 1.
Hypotheses are
H0: -2 log L1 = -2 log L2
H1: -2 log L1 > -2 log L2
Likelihood ratio test statistic is
LR = (-2 log L1) - (-2 log L2)
Under H0, the LR statistic follows a chi-square distribution
with k2 degrees of freedom (the number of added variables).
Example of Likelihood Ratio Test
Model 1 (simpler model)
1 variable: antibiotic (1)
Model 2 (more complicated model)
1+2=3 variables: antibiotic (1), age (2)
Test statistic is
LR = (-2 log L1) - (-2 log L2)
= 378.127 - 354.078 = 24.05 ~ χ² with 2 df
Because 24.05 > χ²(2 df, 0.95) = 5.99, we reject H0
Model 2 is significantly better than the simpler model 1.
Age is significantly associated with the probability of cure.
-2 Log Likelihood in SPSS
Model 1 (antibiotic): -2 log likelihood = 378.127
Model 2 (antibiotic & age): -2 log likelihood = 354.078
Example of Likelihood Ratio Test
Model 1 (simpler model)
1 variable: antibiotic (1)
Model 2 (more complicated model)
1+1=2 variables: antibiotic (1), side of ear (1)
Test statistic is
LR = (-2 log L1) - (-2 log L2)
= 378.127 - 375.706 = 2.42 ~ χ² with 1 df
Because 2.42 < χ²(1 df, 0.95) = 3.84, we do not reject H0
Model 2 is not significantly better than the simpler model 1.
Side of ear is not significantly associated with the probability of cure.
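Both likelihood ratio tests can be reproduced from the reported -2 log likelihood values. The sketch uses closed-form upper-tail probabilities of the chi-square distribution, which exist for 1 and 2 degrees of freedom; `chi_square_sf` is an illustrative helper, not a library function.

```python
import math

def chi_square_sf(x, df):
    """Upper-tail probability Pr(chi-square > x) for df = 1 or df = 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    raise ValueError("closed form implemented only for df = 1 or 2")

def likelihood_ratio_test(m2ll_simple, m2ll_complex, added_variables):
    """Return the LR statistic and its p-value for two nested models."""
    lr = m2ll_simple - m2ll_complex
    return lr, chi_square_sf(lr, added_variables)

# Antibiotic only versus antibiotic + age (2 added dummy variables).
lr_age, p_age = likelihood_ratio_test(378.127, 354.078, 2)
# Antibiotic only versus antibiotic + side of ear (1 added dummy variable).
lr_side, p_side = likelihood_ratio_test(378.127, 375.706, 1)

print(round(lr_age, 2), p_age < 0.05)
print(round(lr_side, 2), p_side < 0.05)
```

A p-value below 0.05 corresponds to an LR statistic above the critical value (5.99 for 2 df, 3.84 for 1 df).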
-2 Log Likelihood in SPSS
Model 1 (antibiotic): -2 log likelihood = 378.127
Model 2 (antibiotic & side of ear): -2 log likelihood = 375.706
Nested Models
The likelihood ratio test can compare a simpler model 1
with a more complicated model 2.
Model 1 is said to be nested within model 2.
The simpler model should be obtained from the more
complicated model either by
• Removing some independent variables
e.g. remove age from the model, remove x2 from the
model, etc.
• Imposing some assumptions about the relation
e.g. use a continuous variable instead of a categorical
variable, to impose a linear association
Case 4
Now we want to use contingency tables to adjust
for confounders and compare the result with
logistic regression.
We will create 6 tables (3 age groups × 2 sides of ear)
and use the Mantel-Haenszel weight method to estimate
the pooled odds ratio.
Mantel-Haenszel Pooled Odds Ratio
Stratum   ORi                     M-H weight wi
1         14*21/17/6  = 2.88      17*6/58  = 1.76
2         39*25/13/20 = 3.75      13*20/97 = 2.68
3         15*8/8/17   = 0.88      8*17/48  = 2.83
4         8*13/10/2   = 5.20      10*2/33  = 0.61
5         10*4/12/5   = 0.67      12*5/31  = 1.94
6         3*1/1/6     = 0.50      1*6/11   = 0.55
ORpool = ∑(wi × ORi) / ∑wi = 2.16
cf. ORadj = 2.11 in logistic regression
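The pooled estimate can be recomputed from the six stratum tables. The cell layout (a, b, c, d) = (cefaclor cured, cefaclor not cured, amoxicillin cured, amoxicillin not cured) is inferred from the products shown on the slide; summed over strata, these cells reproduce the crude table (89, 61, 56, 72).

```python
# One tuple per stratum: (a, b, c, d); layout inferred from the slide's products.
strata = [
    (14, 17, 6, 21),
    (39, 13, 20, 25),
    (15, 8, 17, 8),
    (8, 10, 2, 13),
    (10, 12, 5, 4),
    (3, 1, 6, 1),
]

def mantel_haenszel_or(strata):
    """Pooled OR = sum(w_i * OR_i) / sum(w_i), with w_i = b*c/n and OR_i = a*d/(b*c)."""
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        numerator += a * d / n      # equals w_i * OR_i
        denominator += b * c / n    # equals w_i
    return numerator / denominator

or_pool = mantel_haenszel_or(strata)
print(round(or_pool, 2))
```

Writing the numerator as a*d/n avoids dividing by b*c, so a stratum with an empty cell would not break the calculation.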
Logistic Regression & Contingency Table Analysis
                        Logistic regression analysis      Contingency table analysis
Transparency of         Black box                         Clear
procedure                                                 Easy to check details
Adjustment for          Easy                              Difficult or inefficient
confounders             Add variables in the model        Number of tables becomes too many
Continuous              OK                                Not OK
independent variables   Use as continuous variables       Variables should be categorized
                        directly
Many independent        Relatively OK                     Number of strata becomes too many
variables               ≥10 events per variable in the    Some strata have too few data
                        model is recommended              to analyze
Case 5
We want to assess the goodness of fit of a logistic
regression model.
Here, we will use the Hosmer-Lemeshow goodness-of-fit
test.
H0: model fits the data
H1: model does not fit the data
We will test if the predicted probabilities from the model
are reasonably close to the observed relative frequencies.
Hosmer-Lemeshow goodness-of-fit test
1. Order each subject based on predicted probability from
the lowest to the highest.
2. Group ordered subjects into g (≥3) groups according to
the predicted probabilities (usually g=10)
3. Compute the following values
Oj = ∑ y (observed frequency for group j)
Ej = ∑ p̂ (expected frequency for group j)
p̄j = Ej / nj (expected mean probability for group j)
4. Compute a test statistic
C = ∑ (Oj - Ej)² / (nj p̄j (1 - p̄j)) ~ χ² with g-2 df under H0
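The four steps can be sketched in plain Python, using the usual form of the statistic, C = ∑ (Oj - Ej)² / (nj p̄j (1 - p̄j)) with g-2 degrees of freedom. The outcome and probability lists below are hypothetical toy data, not the lecture's data set, and real software may form the groups slightly differently (for example when predicted probabilities are tied).

```python
def hosmer_lemeshow(y, p_hat, g=10):
    """Return the Hosmer-Lemeshow statistic and its degrees of freedom (g - 2).

    y: observed outcomes (0 or 1); p_hat: predicted probabilities from the model.
    """
    # Step 1: order subjects from the lowest to the highest predicted probability.
    pairs = sorted(zip(p_hat, y), key=lambda pair: pair[0])
    n = len(pairs)
    statistic = 0.0
    for j in range(g):
        # Step 2: split the ordered subjects into g groups of near-equal size.
        group = pairs[j * n // g:(j + 1) * n // g]
        n_j = len(group)
        if n_j == 0:
            continue
        # Step 3: observed and expected frequencies, and mean predicted probability.
        o_j = sum(outcome for _, outcome in group)
        e_j = sum(prob for prob, _ in group)
        p_bar = e_j / n_j  # must lie strictly between 0 and 1
        # Step 4: add this group's contribution to the statistic.
        statistic += (o_j - e_j) ** 2 / (n_j * p_bar * (1 - p_bar))
    return statistic, g - 2

# Hypothetical toy data for illustration only.
y_toy = [0, 0, 1, 0, 1, 0, 1, 1, 1]
p_toy = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
stat, df = hosmer_lemeshow(y_toy, p_toy, g=3)
print(round(stat, 3), df)
```

The statistic is then compared with the chi-square distribution with g-2 degrees of freedom; a small statistic (large p-value) means observed and expected frequencies are close.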
Example of Hosmer and Lemeshow Test
Independent variables: antibiotic, age, side of ear
Observed and expected frequencies are close.
Because p=0.498 (> 0.05), we do not reject H0.
There is no evidence that the model fits the data poorly.
Today’s Progress
• Contingency table method (review)
• Logistic regression analysis
• Likelihood ratio test
• The Hosmer-Lemeshow goodness-of-fit test

Next Topics
• Interaction
• Variable selection
