CHAPTER 13 LOGISTIC REGRESSION
Table 13.1 Age of 20 patients, with indicator of disease

Patient ID    1   2   3   4   5   6   7   8   9  10
Age (x)      25  29  30  31  32  41  41  42  44  49
Disease (y)   0   0   0   0   0   0   0   0   1   1

Patient ID   11  12  13  14  15  16  17  18  19  20
Age (x)      50  59  60  62  68  72  79  80  81  84
Disease (y)   0   1   0   0   1   0   1   0   1   1
Figure 13.1 Plot of disease versus age, with least squares and logistic regression lines.
The estimation error is greater for the linear regression line, which means that linear regression does a poorer job of estimating the presence of disease than logistic regression for patient 11. The same is true for most of the other patients.
Where does the logistic regression curve come from? Consider the conditional mean of $Y$ given $X = x$, denoted as $E(Y|x)$. This is the expected value of the response variable for a given value of the predictor. Recall that, in linear regression, the response variable is considered to be a random variable defined as $Y = \beta_0 + \beta_1 x + \varepsilon$. Because the error term $\varepsilon$ has mean zero, we obtain $E(Y|x) = \beta_0 + \beta_1 x$ for linear regression, with possible values extending over the entire real number line.
For simplicity, denote the conditional mean $E(Y|x)$ as $\pi(x)$. The conditional mean for logistic regression takes on a different form from that of linear regression. Specifically,

$$\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} \qquad (13.1)$$
Curves of the form in equation (13.1) are called sigmoidal because they are S-shaped and therefore nonlinear. Statisticians have chosen the logistic distribution to model dichotomous data because of its flexibility and interpretability. The minimum for $\pi(x)$ is obtained at $\lim_{a \to -\infty} \frac{e^a}{1 + e^a} = 0$, and the maximum for $\pi(x)$ is obtained at $\lim_{a \to \infty} \frac{e^a}{1 + e^a} = 1$. Thus, $\pi(x)$ is of a form that may be interpreted as a probability, with $0 \leq \pi(x) \leq 1$. That is, $\pi(x)$ may be interpreted as the probability that the positive outcome (e.g., disease) is present for records with $X = x$, and $1 - \pi(x)$ may be interpreted as the probability that the positive outcome is absent for such records.
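To see the behavior of equation (13.1) numerically, here is a minimal Python sketch; the function name `sigmoid` is ours, and the coefficient values anticipate the fit to the disease data reported later in Table 13.2.

```python
import math

def sigmoid(x, b0, b1):
    """Logistic curve pi(x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

# Coefficients from the fit reported in Table 13.2
b0, b1 = -4.372, 0.06696

# pi(x) always lies between 0 and 1, approaching those limits in the tails
for x in (-1000, 25, 55, 85, 1000):
    print(x, round(sigmoid(x, b0, b1), 4))
```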
Linear regression models assume that $Y = \beta_0 + \beta_1 x + \varepsilon$, where the error term $\varepsilon$ is normally distributed with mean zero and constant variance. The model assumption for logistic regression is different. Because the response is dichotomous, the errors can take only one of two possible forms. If $Y = 1$ (e.g., disease is present), which occurs with probability $\pi(x)$ (the probability that the response is positive), then $\varepsilon = 1 - \pi(x)$, the vertical distance between the data point $Y = 1$ and the curve $\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$ directly below it, for $X = x$. However, if $Y = 0$ (e.g., disease is absent), which occurs with probability $1 - \pi(x)$ (the probability that the response is negative), then $\varepsilon = 0 - \pi(x) = -\pi(x)$, the vertical distance between the data point $Y = 0$ and the curve $\pi(x)$ directly above it, for $X = x$. Thus, the variance of $\varepsilon$ is $\pi(x)[1 - \pi(x)]$, which is the variance of a binomial distribution, and the response variable in logistic regression, $Y = \pi(x) + \varepsilon$, is assumed to follow a binomial distribution with probability of success $\pi(x)$.
A useful transformation for logistic regression is the logit transformation, given as follows:

$$g(x) = \ln\left[\frac{\pi(x)}{1 - \pi(x)}\right] = \beta_0 + \beta_1 x$$

The logit transformation $g(x)$ exhibits several attractive properties of the linear regression model, such as its linearity, its continuity, and its range from negative to positive infinity.
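As a quick numerical check of these properties, a short sketch (helper names ours) verifying that the logit transformation recovers the linear term from the sigmoid curve:

```python
import math

def sigmoid(z):
    # pi as a function of the linear predictor z = b0 + b1*x
    return math.exp(z) / (1 + math.exp(z))

def logit(p):
    # g = ln[p / (1 - p)], the log-odds
    return math.log(p / (1 - p))

# logit(sigmoid(z)) returns z, so g(x) = b0 + b1*x is linear in x
for z in (-2.0, 0.0, 1.5):
    print(z, round(logit(sigmoid(z)), 10))
```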
The assumption that the observations are independent allows us to express the likelihood function $l(\boldsymbol{\beta}|\mathbf{x})$ as the product of the individual terms:

$$l(\boldsymbol{\beta}|\mathbf{x}) = \prod_{i=1}^{n} [\pi(x_i)]^{y_i}\,[1 - \pi(x_i)]^{1 - y_i}$$
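Maximum likelihood estimation works with the logarithm of this product. A minimal sketch (function names ours) of the log-likelihood, checked against the fitted values reported below:

```python
import math

def pi(x, b0, b1):
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

def log_likelihood(b0, b1, xs, ys):
    """ln l(beta|x) = sum over i of y_i*ln(pi_i) + (1 - y_i)*ln(1 - pi_i)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = pi(x, b0, b1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

# Table 13.1 data; coefficients are the estimates reported in Table 13.2
ages = [25, 29, 30, 31, 32, 41, 41, 42, 44, 49,
        50, 59, 60, 62, 68, 72, 79, 80, 81, 84]
disease = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
           0, 1, 0, 0, 1, 0, 1, 0, 1, 1]
print(round(log_likelihood(-4.372, 0.06696, ages, disease), 3))  # approx. -10.101
```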
Let us examine the results of the logistic regression of disease on age, shown in Table 13.2. The coefficients, that is, the maximum likelihood estimates of the unknown parameters $\beta_0$ and $\beta_1$, are given as $b_0 = -4.372$ and $b_1 = 0.06696$. Thus, $\pi(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$ is estimated as

$$\pi(x) = \frac{e^{g(x)}}{1 + e^{g(x)}} = \frac{e^{-4.372 + 0.06696(\mathrm{age})}}{1 + e^{-4.372 + 0.06696(\mathrm{age})}},$$

with the estimated logit

$$g(x) = -4.372 + 0.06696(\mathrm{age}).$$
These equations may then be used to estimate the probability that the disease is present in a particular patient, given the patient's age. For example, for a 50-year-old patient, we have

$$g(x) = -4.372 + 0.06696(50) = -1.024$$

and

$$\pi(x) = \frac{e^{g(x)}}{1 + e^{g(x)}} = \frac{e^{-1.024}}{1 + e^{-1.024}} = 0.26$$
Thus, the estimated probability that a 50-year-old patient has the disease is 26%, and the estimated probability that the disease is not present is 100% − 26% = 74%. However, for a 72-year-old patient, we have

$$g(x) = -4.372 + 0.06696(72) = 0.449$$

and

$$\pi(x) = \frac{e^{g(x)}}{1 + e^{g(x)}} = \frac{e^{0.449}}{1 + e^{0.449}} = 0.61$$
Table 13.2 (excerpt) Results of the logistic regression of disease on age:

Log-Likelihood = -10.101
Test that all slopes are zero: G = 5.696, DF = 1, P-Value = 0.017
The estimated probability that a 72-year-old patient has the disease is 61%, and the
estimated probability that the disease is not present is 39%.
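The fit itself can be reproduced in a few lines. A sketch using the statsmodels library (one reasonable choice; any maximum likelihood logistic regression routine should agree) on the 20 patients of Table 13.1:

```python
import numpy as np
import statsmodels.api as sm

age = np.array([25, 29, 30, 31, 32, 41, 41, 42, 44, 49,
                50, 59, 60, 62, 68, 72, 79, 80, 81, 84])
disease = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
                    0, 1, 0, 0, 1, 0, 1, 0, 1, 1])

X = sm.add_constant(age)            # prepend a column of 1s for the intercept
model = sm.Logit(disease, X).fit()  # maximum likelihood fit

print(model.params)   # approx. [-4.372, 0.06696]
print(model.llf)      # log-likelihood, approx. -10.101

# Estimated probabilities of disease at ages 50 and 72
print(model.predict([[1, 50], [1, 72]]))  # approx. [0.26, 0.61]
```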
13.4 INFERENCE: ARE THE PREDICTORS SIGNIFICANT?

Recall from simple linear regression that the regression model was considered significant if the mean square regression (MSR) was large compared to the mean squared error (MSE). The MSR is a measure of the improvement in estimating the response when we include the predictor, as compared to ignoring the predictor. If the predictor variable is helpful for estimating the value of the response variable, then MSR will be large, the test statistic $F = \mathrm{MSR}/\mathrm{MSE}$ will also be large, and the linear regression model will be considered significant.
Significance of the coefficients in logistic regression is determined analogously. Essentially, we examine whether the model that includes a particular predictor provides a substantially better fit to the response variable than a model that does not include this predictor.
Define the saturated model to be the model that contains as many parameters as data points, such as a simple linear regression model with only two data points. Clearly, the saturated model predicts the response variable perfectly, with no prediction error. We may therefore regard the observed values of the response variable as the predicted values from the saturated model. To compare the values predicted by our fitted model (with fewer parameters than data points) to the values predicted by the saturated model, we use the deviance (McCullagh and Nelder3), as defined here:

$$\text{Deviance} = D = -2 \ln\left[\frac{\text{likelihood of the fitted model}}{\text{likelihood of the saturated model}}\right]$$
Here we have a ratio of two likelihoods, so the resulting hypothesis test is called a likelihood ratio test. To generate a measure whose distribution is known, we must take $-2 \ln[\text{likelihood ratio}]$. Denote the estimate of $\pi(x_i)$ from the fitted model as $\pi_i$. Then, for the logistic regression case, and using equation (13.2), we have

$$D = -2 \sum_{i=1}^{n} \left[ y_i \ln\left(\frac{\pi_i}{y_i}\right) + (1 - y_i) \ln\left(\frac{1 - \pi_i}{1 - y_i}\right) \right]$$

To assess whether a predictor is significant, we compare the deviance of the model without the predictor to the deviance of the model with the predictor. For the single-predictor case, this comparison yields the test statistic

$$G = 2\left\{\ln[\text{likelihood of the fitted model}] - [n_1 \ln(n_1) + n_0 \ln(n_0) - n \ln(n)]\right\}$$

where $n_1 = \sum y_i$ is the number of records with $y = 1$, $n_0 = n - n_1$, and the bracketed term is the maximized log-likelihood of the model containing only the intercept.
3 McCullagh and Nelder, Generalized Linear Models, 2nd edition, Chapman and Hall, London, 1989.
For the disease example, note from Table 13.2 that the log-likelihood is given as $-10.101$. Then,

$$G = 2\{-10.101 - [7 \ln(7) + 13 \ln(13) - 20 \ln(20)]\} = 5.696$$

as indicated in Table 13.2.
The test statistic $G$ follows a chi-square distribution with 1 degree of freedom (i.e., $\chi^2_{\nu=1}$), assuming that the null hypothesis $\beta_1 = 0$ is true. The resulting p-value for this hypothesis test is therefore $P(\chi^2_1 > G_{\mathrm{observed}}) = P(\chi^2_1 > 5.696) = 0.017$, as shown in Table 13.2. This fairly small p-value indicates that there is evidence that age is useful in predicting the presence of disease.
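A sketch of this computation, using the counts $n_1 = 7$ and $n_0 = 13$ from Table 13.1 and scipy for the chi-square tail probability:

```python
from math import log
from scipy.stats import chi2

ll_fitted = -10.101   # log-likelihood of the fitted model (Table 13.2)
n1, n0 = 7, 13        # counts of y = 1 and y = 0
n = n1 + n0

ll_null = n1 * log(n1) + n0 * log(n0) - n * log(n)  # intercept-only model
G = 2 * (ll_fitted - ll_null)

print(round(G, 3))                 # approx. 5.696
print(round(chi2.sf(G, df=1), 3))  # p-value, approx. 0.017
```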
Another hypothesis test used to determine whether a particular predictor is significant is the Wald test (e.g., Rao4). Under the null hypothesis that $\beta_1 = 0$, the ratio

$$Z_{\mathrm{Wald}} = \frac{b_1}{\mathrm{SE}(b_1)}$$

follows a standard normal distribution, where SE refers to the standard error of the coefficient, as estimated from the data and reported by the software. Table 13.2 provides the coefficient estimate and the standard error as follows: $b_1 = 0.06696$ and $\mathrm{SE}(b_1) = 0.03223$, giving us

$$Z_{\mathrm{Wald}} = \frac{0.06696}{0.03223} = 2.08$$

as reported under z for the coefficient age in Table 13.2. The p-value is then reported as $P(|z| > 2.08) = 0.038$. This p-value is also fairly small, although not as small as
4 Rao, Linear Statistical Inference and Its Application, 2nd edition, John Wiley and Sons, Inc., 1973.
that of the likelihood ratio test, and therefore concurs in the significance of age for predicting disease.
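A sketch of the Wald computation; scipy's standard normal survival function supplies the two-sided p-value:

```python
from scipy.stats import norm

b1, se_b1 = 0.06696, 0.03223  # estimate and standard error from Table 13.2

z_wald = b1 / se_b1
p_value = 2 * norm.sf(abs(z_wald))  # two-sided: P(|Z| > z)

print(round(z_wald, 2))   # approx. 2.08
print(round(p_value, 3))  # approx. 0.038
```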
We may construct $100(1 - \alpha)\%$ confidence intervals for the logistic regression coefficients as follows:

$$b_0 \pm z \cdot \mathrm{SE}(b_0)$$
$$b_1 \pm z \cdot \mathrm{SE}(b_1)$$

where $z$ represents the $z$-critical value associated with $100(1 - \alpha)\%$ confidence. In our example, a 95% confidence interval for the slope $\beta_1$ may be found as follows:

$$b_1 \pm z \cdot \mathrm{SE}(b_1) = 0.06696 \pm (1.96)(0.03223) = 0.06696 \pm 0.06317 = (0.00379, 0.13013)$$
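A sketch of the interval computation; norm.ppf supplies the critical value (approximately 1.96 for 95% confidence):

```python
from scipy.stats import norm

b1, se_b1 = 0.06696, 0.03223
alpha = 0.05

z_crit = norm.ppf(1 - alpha / 2)  # approx. 1.96
lower = b1 - z_crit * se_b1
upper = b1 + z_crit * se_b1

print(round(lower, 5), round(upper, 5))  # approx. (0.00379, 0.13013)
```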
The above results may be extended from the simple (one predictor) logistic
regression model to the multiple (many predictors) logistic regression model. (See
Hosmer and Lemeshow5 for details.)
5 Hosmer and Lemeshow, Applied Logistic Regression, 3rd edition, John Wiley and Sons, 2013.
13.5 ODDS RATIO AND RELATIVE RISK

Recall that the estimated probabilities that the disease is present and absent for the 50-year-old patient were 26% and 74%, respectively, providing odds for the 50-year-old patient of $\mathrm{odds} = 0.26/0.74 = 0.35$.
Note that when the event is more likely than not to occur, odds > 1; when the event is less likely than not to occur, odds < 1; and when the event is just as likely as not to occur, odds = 1. Note also that the concept of odds differs from the concept of probability: probability ranges from zero to one, while odds can range from zero to infinity. Odds indicate how much more likely it is that an event occurs than that it does not occur.
In binary logistic regression with a dichotomous predictor, the odds that the response variable occurred ($y = 1$) for records with $x = 1$ can be denoted as

$$\frac{\pi(1)}{1 - \pi(1)} = \frac{e^{\beta_0 + \beta_1}/(1 + e^{\beta_0 + \beta_1})}{1/(1 + e^{\beta_0 + \beta_1})} = e^{\beta_0 + \beta_1}$$
Correspondingly, the odds that the response variable occurred for records with $x = 0$ can be denoted as

$$\frac{\pi(0)}{1 - \pi(0)} = \frac{e^{\beta_0}/(1 + e^{\beta_0})}{1/(1 + e^{\beta_0})} = e^{\beta_0}$$
The odds ratio (OR) is defined as the odds that the response variable occurred for records with $x = 1$ divided by the odds that the response variable occurred for records with $x = 0$. That is,

$$\text{Odds ratio} = \mathrm{OR} = \frac{\pi(1)/[1 - \pi(1)]}{\pi(0)/[1 - \pi(0)]} = \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1} \qquad (13.3)$$
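A numerical check of equation (13.3), using hypothetical coefficients $\beta_0 = -2$ and $\beta_1 = 1.6$ for a dichotomous predictor (values chosen purely for illustration):

```python
import math

def pi(x, b0, b1):
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

b0, b1 = -2.0, 1.6  # hypothetical coefficients

odds_x1 = pi(1, b0, b1) / (1 - pi(1, b0, b1))  # equals e^(b0 + b1)
odds_x0 = pi(0, b0, b1) / (1 - pi(0, b0, b1))  # equals e^(b0)

print(round(odds_x1 / odds_x0, 6))  # odds ratio
print(round(math.exp(b1), 6))       # e^(b1): the same value
```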
The OR is sometimes used to estimate the relative risk, defined as the probability that the response occurs for $x = 1$ divided by the probability that the response occurs for $x = 0$. That is,

$$\text{Relative risk} = \frac{\pi(1)}{\pi(0)}$$

For the OR to be an accurate estimate of the relative risk, we must have $\frac{1 - \pi(0)}{1 - \pi(1)} \approx 1$, which we obtain when the probability that the response occurs is small for both $x = 1$ and $x = 0$.
The OR has come into widespread use in the research community because of this simply expressed relationship between the OR and the slope coefficient. For example, if a clinical trial reports that the OR for endometrial cancer among ever-users and never-users of estrogen replacement therapy is 5.0, then this may be interpreted as meaning that ever-users of estrogen replacement therapy are five times more likely to develop endometrial cancer than are never-users. However, this interpretation is valid only when $\frac{1 - \pi(0)}{1 - \pi(1)} \approx 1$.
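A short sketch (probabilities chosen purely for illustration) showing that the OR tracks the relative risk when both probabilities are small, but overstates it when they are not:

```python
def odds_ratio(p1, p0):
    return (p1 / (1 - p1)) / (p0 / (1 - p0))

def relative_risk(p1, p0):
    return p1 / p0

# Rare outcome: OR approximates RR
print(odds_ratio(0.02, 0.01), relative_risk(0.02, 0.01))  # approx. 2.02 vs 2.0

# Common outcome: OR overstates RR
print(odds_ratio(0.60, 0.30), relative_risk(0.60, 0.30))  # 3.5 vs 2.0
```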