
CHAPTER 13

LOGISTIC REGRESSION

Linear regression is used to approximate the relationship between a continuous response variable and a set of predictor variables. However, for many data applications, the response variable is categorical rather than continuous. For such cases, linear regression is not appropriate. Fortunately, the analyst can turn to an analogous method, logistic regression, which is similar to linear regression in many ways.
Logistic regression refers to methods for describing the relationship between
a categorical response variable and a set of predictor variables. In this chapter, we
explore the use of logistic regression for binary or dichotomous variables; those inter-
ested in using logistic regression for response variables with more than two categories
may refer to Hosmer and Lemeshow.1 To motivate logistic regression, and to illustrate
its similarities to linear regression, consider the following example.

13.1 SIMPLE EXAMPLE OF LOGISTIC REGRESSION


Suppose that medical researchers are interested in exploring the relationship between
patient age (x) and the presence (1) or absence (0) of a particular disease (y). The data
collected from 20 patients is shown in Table 13.1, and a plot of the data is shown in
Figure 13.1. The plot shows the least-squares regression line (dotted straight line),
and the logistic regression line (solid curved line), along with the estimation error for
patient 11 (age = 50, disease = 0) for both lines.
Note that the least-squares regression line is linear, which means that linear
regression assumes that the relationship between the predictor and the response is
linear. Contrast this with the logistic regression line that is nonlinear, meaning that
logistic regression assumes the relationship between the predictor and the response
is nonlinear. The scatter plot makes plain the discontinuity in the response variable;
scatter plots that look like this should alert the analyst not to apply linear regression.
Consider the prediction errors for patient 11, indicated in Figure 13.1. The dis-
tance between the data point for patient 11 (x = 50, y = 0) and the linear regression line
is indicated by the dotted vertical line, while the distance between the data point and
the logistic regression line is shown by the solid vertical line. Clearly, the distance

1 Hosmer and Lemeshow, Applied Logistic Regression, 3rd edition, John Wiley and Sons, 2013.


TABLE 13.1 Age of 20 patients, with indicator of disease

Patient ID    1   2   3   4   5   6   7   8   9  10
Age (x)      25  29  30  31  32  41  41  42  44  49
Disease (y)   0   0   0   0   0   0   0   0   1   1

Patient ID   11  12  13  14  15  16  17  18  19  20
Age (x)      50  59  60  62  68  72  79  80  81  84
Disease (y)   0   1   0   0   1   0   1   0   1   1

Figure 13.1 Plot of disease versus age, with least squares and logistic regression lines.

is greater for the linear regression line, which means that linear regression does a poorer job than logistic regression of estimating the presence of disease for patient 11. The same is true for most of the other patients.
Where does the logistic regression curve come from? Consider the conditional mean of Y given X = x, denoted E(Y|x). This is the expected value of the response variable for a given value of the predictor. Recall that, in linear regression, the response variable is considered to be a random variable defined as Y = β0 + β1x + ε. Since the error term ε has mean zero, we obtain E(Y|x) = β0 + β1x for linear regression, with possible values extending over the entire real number line.
For simplicity, denote the conditional mean E(Y|x) as π(x). The conditional mean for logistic regression takes a different form from that of linear regression. Specifically,

    π(x) = e^(β0 + β1x) / (1 + e^(β0 + β1x))                                (13.1)
Curves of the form in equation (13.1) are called sigmoidal because they are S-shaped, and therefore nonlinear. Statisticians have chosen the logistic distribution to model dichotomous data because of its flexibility and interpretability. The minimum of π(x) is obtained as lim_{a→−∞} e^a / (1 + e^a) = 0, and the maximum of π(x) is obtained as lim_{a→∞} e^a / (1 + e^a) = 1. Thus, π(x) has a form that may be interpreted as a probability, with 0 ≤ π(x) ≤ 1. That is, π(x) may be interpreted as the probability that the positive outcome (e.g., disease) is present for records with X = x, and 1 − π(x) may be interpreted as the probability that the positive outcome is absent for such records.
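To make equation (13.1) concrete, here is a minimal Python sketch (not from the text; the function name and the printed ages are illustrative, and the coefficient values are the fitted values reported later in Table 13.2) showing that the logistic curve always returns values strictly between 0 and 1:

```python
import math

def logistic(x, beta0, beta1):
    """Conditional mean pi(x) = e^(beta0 + beta1*x) / (1 + e^(beta0 + beta1*x))."""
    z = beta0 + beta1 * x
    return math.exp(z) / (1.0 + math.exp(z))

# Coefficients taken from the fitted model reported later in the chapter.
beta0, beta1 = -4.372, 0.06696

for age in (25, 50, 72, 84):
    print(f"age = {age:2d}, pi(x) = {logistic(age, beta0, beta1):.3f}")
# Every printed value lies in (0, 1), as required of a probability.
```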
Linear regression models assume that Y = β0 + β1x + ε, where the error term ε is normally distributed with mean zero and constant variance. The model assumption for logistic regression is different. Because the response is dichotomous, the errors can take only one of two possible forms: If Y = 1 (e.g., disease is present), which occurs with probability π(x) (the probability that the response is positive), then ε = 1 − π(x), the vertical distance between the data point Y = 1 and the curve π(x) = e^(β0 + β1x) / (1 + e^(β0 + β1x)) directly below it, for X = x. However, if Y = 0 (e.g., disease is absent), which occurs with probability 1 − π(x) (the probability that the response is negative), then ε = 0 − π(x) = −π(x), the vertical distance between the data point Y = 0 and the curve π(x) directly above it, for X = x. Thus, the variance of ε is π(x)[1 − π(x)], which is the variance of a binomial distribution, and the response variable in logistic regression, Y = π(x) + ε, is assumed to follow a binomial distribution with probability of success π(x).
A useful transformation for logistic regression is the logit transformation, given as follows:

    g(x) = ln[ π(x) / (1 − π(x)) ] = β0 + β1x
The logit transformation g(x) exhibits several attractive properties of the linear regression model, such as its linearity, its continuity, and its range from negative to positive infinity.
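As a quick illustration of this linearity, the hedged sketch below (illustrative coefficients only, not the authors' code) applies the logit transformation to values of π(x) and recovers the straight line β0 + β1x:

```python
import math

def logistic(x, b0, b1):
    # pi(x) from equation (13.1)
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

def logit(p):
    # g = ln[p / (1 - p)] maps a probability back to the whole real line
    return math.log(p / (1.0 - p))

b0, b1 = -4.372, 0.06696   # coefficients used purely for illustration
for x in (30, 50, 70):
    p = logistic(x, b0, b1)
    # The logit of pi(x) and the linear predictor b0 + b1*x agree.
    print(x, round(logit(p), 4), round(b0 + b1 * x, 4))
```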

13.2 MAXIMUM LIKELIHOOD ESTIMATION


One of the most attractive properties of linear regression is that closed-form solutions for the optimal values of the regression coefficients may be obtained, courtesy of the least-squares method. Unfortunately, no such closed-form solution exists for estimating logistic regression coefficients. Thus, we must turn to maximum-likelihood estimation, which finds the values of the parameters for which the likelihood of observing the data is maximized.
The likelihood function l(β|x) is a function of the parameters β = β0, β1, … , βm that expresses the probability of the observed data, x. By finding the values of β = β0, β1, … , βm that maximize l(β|x), we uncover the maximum-likelihood estimators, the parameter values most favored by the observed data.
The probability of a positive response given the data is π(x) = P(Y = 1|x), and the probability of a negative response given the data is 1 − π(x) = P(Y = 0|x). Observations where the response is positive, (Xi = xi, Yi = 1), contribute probability π(xi) to the likelihood, while observations where the response is negative, (Xi = xi, Yi = 0), contribute probability 1 − π(xi) to the likelihood. Thus, as Yi = 0 or 1, the contribution to the likelihood of the ith observation may be expressed as [π(xi)]^(yi) [1 − π(xi)]^(1−yi).
The assumption that the observations are independent allows us to express the likelihood function l(β|x) as the product of the individual terms:

    l(β|x) = ∏_{i=1}^{n} [π(xi)]^(yi) [1 − π(xi)]^(1−yi)

The log-likelihood, L(β|x) = ln[l(β|x)], is computationally more tractable:

    L(β|x) = ln[l(β|x)] = Σ_{i=1}^{n} { yi ln[π(xi)] + (1 − yi) ln[1 − π(xi)] }        (13.2)

The maximum-likelihood estimators may be found by differentiating L(β|x) with respect to each parameter and setting the resulting expressions equal to zero. Unfortunately, unlike linear regression, closed-form solutions to the resulting equations are not available. Therefore, other methods must be applied, such as iterative weighted least squares (see McCullagh and Nelder2).
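As one possible illustration of how such numerical estimation works (this is a sketch, not the authors' implementation, and it uses a general-purpose optimizer rather than iterative weighted least squares), the Python code below maximizes the log-likelihood of equation (13.2) for the Table 13.1 data by minimizing its negative with scipy:

```python
import numpy as np
from scipy.optimize import minimize

# Table 13.1: patient age (x) and disease indicator (y)
age = np.array([25, 29, 30, 31, 32, 41, 41, 42, 44, 49,
                50, 59, 60, 62, 68, 72, 79, 80, 81, 84], dtype=float)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
              0, 1, 0, 0, 1, 0, 1, 0, 1, 1], dtype=float)

def negative_log_likelihood(beta):
    """Negative of L(beta|x) in equation (13.2)."""
    b0, b1 = beta
    pi = 1.0 / (1.0 + np.exp(-(b0 + b1 * age)))   # pi(x) from equation (13.1)
    return -np.sum(y * np.log(pi) + (1.0 - y) * np.log(1.0 - pi))

# Start at (0, 0) and let the optimizer search for the maximum-likelihood estimates.
result = minimize(negative_log_likelihood, x0=[0.0, 0.0], method="BFGS")
b0_hat, b1_hat = result.x
print(b0_hat, b1_hat)      # approximately -4.372 and 0.06696 (compare Table 13.2)
print(-result.fun)         # maximized log-likelihood, approximately -10.101
```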

13.3 INTERPRETING LOGISTIC REGRESSION OUTPUT

Let us examine the results of the logistic regression of disease on age, shown in Table 13.2. The coefficients, that is, the maximum-likelihood estimates of the unknown parameters β0 and β1, are given as b0 = −4.372 and b1 = 0.06696. Thus, π(x) = e^(β0 + β1x) / (1 + e^(β0 + β1x)) is estimated as

    π(x) = e^(g(x)) / (1 + e^(g(x))) = e^(−4.372 + 0.06696(age)) / (1 + e^(−4.372 + 0.06696(age))),

with the estimated logit

    g(x) = −4.372 + 0.06696(age).
These equations may then be used to estimate the probability that the disease is present in a particular patient, given the patient's age. For example, for a 50-year-old patient, we have

    g(x) = −4.372 + 0.06696(50) = −1.024

and

    π(x) = e^(g(x)) / (1 + e^(g(x))) = e^(−1.024) / (1 + e^(−1.024)) = 0.26

Thus, the estimated probability that a 50-year-old patient has the disease is 26%, and the estimated probability that the disease is not present is 100% − 26% = 74%. However, for a 72-year-old patient, we have

    g(x) = −4.372 + 0.06696(72) = 0.449

and

    π(x) = e^(g(x)) / (1 + e^(g(x))) = e^(0.449) / (1 + e^(0.449)) = 0.61

2 McCullagh and Nelder, Generalized Linear Models, 2nd edition, Chapman and Hall, London, 1989.

TABLE 13.2 Logistic regression of disease on age, results from Minitab

Logistic Regression Table

                                                Odds     95% CI
Predictor      Coef      StDev       Z       P  Ratio   Lower   Upper
Constant     -4.372      1.966   -2.22   0.026
Age         0.06696    0.03223    2.08   0.038   1.07    1.00    1.14

Log-Likelihood = -10.101
Test that all slopes are zero: G = 5.696, DF = 1, P-Value = 0.017

The estimated probability that a 72-year-old patient has the disease is 61%, and the
estimated probability that the disease is not present is 39%.
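The arithmetic above is easy to check; a minimal sketch (illustrative only) that plugs the fitted coefficients from Table 13.2 into the estimated logit and logistic curve is shown below:

```python
import math

b0, b1 = -4.372, 0.06696        # fitted coefficients from Table 13.2

def estimated_probability(age):
    g = b0 + b1 * age           # estimated logit g(x)
    return math.exp(g) / (1.0 + math.exp(g))

for age in (50, 72):
    p = estimated_probability(age)
    print(f"age {age}: P(disease) = {p:.2f}, P(no disease) = {1 - p:.2f}")
# age 50: P(disease) = 0.26, P(no disease) = 0.74
# age 72: P(disease) = 0.61, P(no disease) = 0.39
```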

13.4 INFERENCE: ARE THE PREDICTORS SIGNIFICANT?

Recall from simple linear regression that the regression model was considered significant if the mean square regression (MSR) was large compared to the mean squared error (MSE). The MSR is a measure of the improvement in estimating the response when we include the predictor, as compared to ignoring the predictor. If the predictor variable is helpful for estimating the value of the response variable, then MSR will be large, the test statistic F = MSR/MSE will also be large, and the linear regression model will be considered significant.
Significance of the coefficients in logistic regression is determined analogously. Essentially, we examine whether the model that includes a particular predictor provides a substantially better fit to the response variable than a model that does not include this predictor.
Define the saturated model to be the model that contains as many parameters as data points, such as a simple linear regression model with only two data points. Clearly, the saturated model predicts the response variable perfectly, and there is no prediction error. We may then regard the observed values of the response variable as the predicted values from the saturated model. To compare the values predicted by our fitted model (with fewer parameters than data points) to the values predicted by the saturated model, we use the deviance (McCullagh and Nelder3), as defined here:

    Deviance = D = −2 ln [ (likelihood of the fitted model) / (likelihood of the saturated model) ]
Here we have a ratio of two likelihoods, so the resulting hypothesis test is called a likelihood ratio test. In order to generate a measure whose distribution is known, we must take −2 ln[likelihood ratio]. Denote the estimate of π(xi) from the fitted model by πi. Then, for the logistic regression case, using equation (13.2), we have

3 McCullagh and Nelder, Generalized Linear Models, 2nd edition, Chapman and Hall, London, 1989.

deviance equal to:

    Deviance = D = −2 Σ_{i=1}^{n} { yi ln(πi / yi) + (1 − yi) ln[(1 − πi) / (1 − yi)] }
The deviance represents the error left over in the model, after the predictors have been
accounted for. As such it is analogous to the sum of squares error in linear regression.
The procedure for determining whether a particular predictor is significant is to find the deviance of the model without the predictor and subtract the deviance of the model with the predictor, thus:

    G = deviance(model without predictor) − deviance(model with predictor)
      = −2 ln [ (likelihood without predictor) / (likelihood with predictor) ]

Let n1 = Σ yi and n0 = Σ (1 − yi). Then, for the case of a single predictor only, we have:

    G = 2 { Σ_{i=1}^{n} [ yi ln(πi) + (1 − yi) ln(1 − πi) ] − [n1 ln(n1) + n0 ln(n0) − n ln(n)] }

For the disease example, note from Table 13.2 that the log-likelihood is given as
−10.101. Then,
G = 2{−10.101 − [7 ln(7) + 13 ln(13) − 20 ln(20)]} = 5.696
as indicated in Table 13.2.
The test statistic G follows a chi-square distribution with 1 degree of freedom (i.e., χ²_(ν=1)), assuming that the null hypothesis that β1 = 0 is true. The resulting p-value for this hypothesis test is therefore P(χ²_1 > G_observed) = P(χ²_1 > 5.696) = 0.017, as shown in Table 13.2. This fairly small p-value indicates that there is evidence that age is useful in predicting the presence of disease.
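The value of G and its p-value can be reproduced from the quantities reported in Table 13.2; the sketch below is an illustrative check (not the authors' code), using the chi-square survival function from scipy:

```python
import math
from scipy.stats import chi2

log_like_fitted = -10.101          # log-likelihood of the fitted model (Table 13.2)
n1, n0 = 7, 13                     # patients with and without the disease
n = n1 + n0

# Log-likelihood of the model without the predictor (intercept only)
log_like_null = n1 * math.log(n1) + n0 * math.log(n0) - n * math.log(n)

G = 2 * (log_like_fitted - log_like_null)
p_value = chi2.sf(G, df=1)         # P(chi-square with 1 df > G)
print(round(G, 3), round(p_value, 3))   # approximately 5.696 and 0.017
```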
Another hypothesis test used to determine whether a particular predictor is significant is the Wald test (e.g., Rao4). Under the null hypothesis that β1 = 0, the ratio

    Z_Wald = b1 / SE(b1)

follows a standard normal distribution, where SE refers to the standard error of the coefficient, as estimated from the data and reported by the software. Table 13.2 provides the coefficient estimate and the standard error as follows: b1 = 0.06696 and SE(b1) = 0.03223, giving us

    Z_Wald = 0.06696 / 0.03223 = 2.08

as reported under Z for the coefficient age in Table 13.2. The p-value is then reported as P(|Z| > 2.08) = 0.038. This p-value is also fairly small, although not as small as

4 Rao, Linear Statistical Inference and Its Application, 2nd edition, John Wiley and Sons, Inc., 1973.

the likelihood ratio test, and therefore concurs in the significance of age for predicting disease.
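A short illustrative computation of the Wald statistic and its two-sided p-value (not the authors' code; the normal tail probability comes from scipy):

```python
from scipy.stats import norm

b1, se_b1 = 0.06696, 0.03223            # coefficient and standard error (Table 13.2)

z_wald = b1 / se_b1                     # Wald statistic
p_value = 2 * norm.sf(abs(z_wald))      # two-sided P(|Z| > z_wald)
print(round(z_wald, 2), round(p_value, 3))   # approximately 2.08 and 0.038
```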
We may construct 100(1 − α)% confidence intervals for the logistic regression coefficients as follows:

    b0 ± z ⋅ SE(b0)
    b1 ± z ⋅ SE(b1)

where z represents the z-critical value associated with 100(1 − α)% confidence.
In our example, a 95% confidence interval for the slope β1 can be found as follows:

    b1 ± z ⋅ SE(b1) = 0.06696 ± (1.96)(0.03223)
                    = 0.06696 ± 0.06317
                    = (0.00379, 0.13013)

As zero is not included in this interval, we can conclude with 95% confidence that β1 ≠ 0, and that therefore the variable age is significant.
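The interval above can be reproduced with a few lines; the following is a minimal illustrative sketch:

```python
from scipy.stats import norm

b1, se_b1 = 0.06696, 0.03223    # from Table 13.2
z = norm.ppf(0.975)             # z-critical value for 95% confidence, about 1.96

lower, upper = b1 - z * se_b1, b1 + z * se_b1
print(round(lower, 5), round(upper, 5))   # approximately (0.00379, 0.13013)
# Zero lies outside the interval, so age is significant at the 5% level.
```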

The above results may be extended from the simple (one predictor) logistic
regression model to the multiple (many predictors) logistic regression model. (See
Hosmer and Lemeshow5 for details.)

13.5 ODDS RATIO AND RELATIVE RISK


Recall from simple linear regression that the slope coefficient β1 was interpreted as the change in the response variable for every unit increase in the predictor. The slope coefficient β1 is interpreted analogously in logistic regression, but through the logit function. That is, the slope coefficient β1 may be interpreted as the change in the value of the logit for a unit increase in the value of the predictor. In other words,

    β1 = g(x + 1) − g(x)
In this section, we discuss the interpretation of β1 in simple logistic regression
for the following three cases:
1. A dichotomous predictor
2. A polychotomous predictor
3. A continuous predictor.
To facilitate our interpretation, we need to consider the concept of odds. Odds may be defined as the probability that an event occurs divided by the probability that the event does not occur. For example, earlier we found that the estimated probability that a 72-year-old patient has the disease is 61%, and the estimated probability that the 72-year-old patient does not have the disease is 39%. Thus, the odds of a 72-year-old patient having the disease equal odds = 0.61/0.39 = 1.56. We also found that the estimated probabilities of a 50-year-old patient having or not having the disease are 26% and

5 Hosmer and Lemeshow, Applied Logistic Regression, 3rd edition, John Wiley and Sons, 2013.

74%, respectively, providing odds for the 50-year-old patient of odds = 0.26/0.74 = 0.35.
Note that when the event is more likely than not to occur, then odds > 1; when the event is less likely than not to occur, then odds < 1; and when the event is just as likely as not to occur, then odds = 1. Note also that the concept of odds differs from the concept of probability, because probability ranges from zero to one, while odds can range from zero to infinity. Odds indicate how much more likely it is that an event occurs than that it does not occur.
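A brief illustrative computation of the odds for the two patients discussed above (not from the text):

```python
def odds(p):
    """Odds = probability the event occurs / probability it does not occur."""
    return p / (1.0 - p)

print(round(odds(0.61), 2))   # 72-year-old patient: odds of disease, about 1.56
print(round(odds(0.26), 2))   # 50-year-old patient: odds of disease, about 0.35
```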
In binary logistic regression with a dichotomous predictor, the odds that the response variable occurred (y = 1) for records with x = 1 can be denoted as:

    π(1) / [1 − π(1)] = [ e^(β0 + β1) / (1 + e^(β0 + β1)) ] / [ 1 / (1 + e^(β0 + β1)) ] = e^(β0 + β1)

Correspondingly, the odds that the response variable occurred for records with x = 0 can be denoted as:

    π(0) / [1 − π(0)] = [ e^(β0) / (1 + e^(β0)) ] / [ 1 / (1 + e^(β0)) ] = e^(β0)

The odds ratio (OR) is defined as the odds that the response variable occurred for records with x = 1 divided by the odds that the response variable occurred for records with x = 0. That is,

    Odds ratio = OR = { π(1) / [1 − π(1)] } / { π(0) / [1 − π(0)] } = e^(β0 + β1) / e^(β0) = e^(β1)        (13.3)
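The identity OR = e^(β1) in equation (13.3) is easy to verify numerically; the hedged sketch below (arbitrary illustrative coefficients) computes the odds at x = 1 and x = 0 and compares their ratio with e^(β1):

```python
import math

def pi(x, b0, b1):
    # Logistic conditional mean, equation (13.1)
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

b0, b1 = -1.5, 0.8              # arbitrary coefficients, for illustration only

odds_x1 = pi(1, b0, b1) / (1 - pi(1, b0, b1))
odds_x0 = pi(0, b0, b1) / (1 - pi(0, b0, b1))

print(odds_x1 / odds_x0)        # the odds ratio
print(math.exp(b1))             # e^(beta1): the same value, as in equation (13.3)
```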

The OR is sometimes used to estimate the relative risk, defined as the probability that the response occurs for x = 1 divided by the probability that the response occurs for x = 0. That is,

    Relative risk = π(1) / π(0)

For the OR to be an accurate estimate of the relative risk, we must have [1 − π(0)] / [1 − π(1)] ≈ 1, which we obtain when the probability that the response occurs is small for both x = 1 and x = 0.
The OR has come into widespread use in the research community because of this simply expressed relationship between the OR and the slope coefficient. For example, if a clinical trial reports that the OR for endometrial cancer among ever-users and never-users of estrogen replacement therapy is 5.0, then this may be interpreted as meaning that ever-users of estrogen replacement therapy are five times more likely to develop endometrial cancer than are never-users. However, this interpretation is valid only when [1 − π(0)] / [1 − π(1)] ≈ 1.
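The quality of the OR as an estimate of the relative risk can also be checked numerically; in the minimal sketch below (made-up probabilities), the two quantities nearly coincide when the outcome is rare for both groups and diverge when it is not:

```python
def odds_ratio(p1, p0):
    return (p1 / (1 - p1)) / (p0 / (1 - p0))

def relative_risk(p1, p0):
    return p1 / p0

# Rare outcome in both groups: OR is a good estimate of the relative risk.
print(odds_ratio(0.02, 0.01), relative_risk(0.02, 0.01))   # about 2.02 versus 2.00

# Common outcome: OR noticeably overstates the relative risk.
print(odds_ratio(0.61, 0.26), relative_risk(0.61, 0.26))   # about 4.45 versus 2.35
```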
