Logistic Regression
14.1 Introduction
14.2 Binary data
14.2.1 Odds and odds ratio
14.3 Logistic regression
14.3.1 The logistic function
14.3.2 The logistic regression model
14.3.3 Testing hypotheses in logistic regression
14.4 Summary
14.1 Introduction
Many different types of linear models have been discussed in the course so far. But in all
the models considered, the response variable has been a quantitative variable, which has been
assumed to be normally distributed. In this module, we consider situations where the response
variable is a categorical random variable, attaining only two possible outcomes. Examples of
this type of data are very common. For example, the response can be whether or not a patient
is cured of a disease, whether or not an item in a manufacturing process passes the quality
control, whether or not a mouse is killed by toxic exposure in a toxicology experiment, etc.
Since the response variables are dichotomous (that is, they have only two possible outcomes),
it is inappropriate to assume that they are normally distributed; thus the data cannot be
analysed using the methods discussed so far in the course. The most common method for
analysing data with dichotomous response variables is logistic regression.
Data for which the response variable is dichotomous are discussed in more detail in Section
14.2. The most common model for such data is the logistic regression model. This is discussed
in Section 14.3. (The logistic regression model is a special case of a generalised linear model
which is the subject of the MAS course ST112.)
14.2 Binary data
Suppose each observation can result in one of two outcomes. If the outcome of interest occurs
(for example, if a mouse in a toxicology experiment is killed by the exposure), the response
is ‘success’; if not (i.e. if it survives), the response is ‘failure’. It is standard to
let the response variable Y be a binary variable, which attains the value 1, if the outcome
is ‘success’, and 0 if the outcome is ‘failure’. In a regression situation, each response variable
is associated with given values of a set of explanatory variables x1 , x2 , . . . , xk . For example,
whether or not a patient is cured of a disease may depend on the particular medical treatment
the patient is given, the patient’s general state of health, age, gender, etc.; whether or not an
item in a manufacturing process passes the quality control may depend on various conditions
regarding the production process, such as temperature, quality of raw material, time since
last service of the machinery, etc. It is often possible to group the observations such that all
observations within a group have the same values of the explanatory variables. For instance,
we may group the patients in the disease example according to type of medical treatment,
gender and age group (say, up to 30, 31–40, 41–49, 50 and over), such that there are several
patients in each group. When the data can be grouped, it is easier to record the number of
successes and failures for each group than to record a long series of 0s and 1s.
Example 14.1 Oral contraceptives and myocardial infarction
The link between use of an oral contraceptive and the incidence of myocardial infarction was
investigated. The table below gives the number of women in the study using the contraceptive
pill who suffered a myocardial infarction, and the number using the pill who did not suffer
a myocardial infarction. The corresponding numbers for women not using the pill are also
given.
                 Infarction
                 Yes    No
   Pill   Yes     23    34
          No      35   132
♦
In the previous modules, we have analysed regression models where the response variables
are normally distributed. The main concern in such models is to express the value of the
response variable Y as a function of the values of the explanatory variables. However, when
the response variable is binary, its value is not particularly interesting: either it is 0 or it
is 1. Instead, the interest centres on the probability of the response being success, that is,
P (Y = 1), for given values of the explanatory variables.
14.2.1 Odds and odds ratio
The odds of an event E is the ratio between the probability that E occurs and the probability
that it does not occur:

odds(E) = P(E) / P(not E) = P(E) / (1 − P(E)).

For women in the study using the pill, an estimate of the probability of having a myocardial
infarction is given by P(E_pill) = 23/57 = 0.4035, so the odds of having a myocardial
infarction, when using the pill, is given by

odds(E_pill) = 0.4035 / (1 − 0.4035) = 0.6765.
Similarly, for women in the study who are not using the pill, an estimate of the probability of
having a myocardial infarction is given by P(E_no-pill) = 35/167 = 0.2096. The odds of having
a myocardial infarction, when not using the pill, is given by

odds(E_no-pill) = 0.2096 / (1 − 0.2096) = 0.2652.
Thus, the odds are 0.2652 to 1 (or, around 1 to 4) that a woman in the study not using the
pill will have a myocardial infarction.
The odds ratio comparing the odds of having a myocardial infarction for women using the pill
with the odds of having a myocardial infarction for women not using the pill is given by

R = odds(E_pill) / odds(E_no-pill) = 0.6765 / 0.2652 ≈ 2.55.

That is, the odds of having a myocardial infarction are 2.55 times higher for women using the
pill than for women not using the pill.
♦
The odds ratio R_{A,B} that compares the odds of events E_A and E_B (that is, event E
occurring in group A and in group B, respectively) is defined as the ratio between the two
odds; that is,

R_{A,B} = odds(E_A) / odds(E_B) = [P(E_A) / (1 − P(E_A))] / [P(E_B) / (1 − P(E_B))].
In particular, if an odds ratio is equal to one, the odds are the same for the two groups. Note
that, if we define a factor with levels corresponding to groups A and B, respectively, then an
odds ratio equal to one is equivalent to there being no factor-effect.
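The odds and odds-ratio calculations above can be sketched in a few lines of code; here is a minimal check in Python, using only the counts from the table in Example 14.1:

```python
# A minimal sketch: odds and odds ratio for the 2x2 table in Example 14.1.
# The counts 23, 34, 35 and 132 are taken from the table above.

def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

p_pill = 23 / (23 + 34)       # estimated P(infarction) for pill users
p_no_pill = 35 / (35 + 132)   # estimated P(infarction) for non-users

odds_ratio = odds(p_pill) / odds(p_no_pill)
print(round(odds(p_no_pill), 4))  # 0.2652, as in the text
print(round(odds_ratio, 2))       # 2.55, as in the text
```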
When choosing an appropriate transformation for data of the type in Example 14.2, two
problems need to be addressed. One is that the relationship between the variables is curved
rather than linear, the other is that the probability of success can only take values between 0
and 1, while a linear function can attain any real value. It turns out that, often, one can take
care of both of these problems by transforming the probability using the logistic function.
The logistic function will turn a probability into a quantity which can take any real value,
and which is often linearly related to the explanatory variables.
The logistic function is given by

P(x) = 1 / (1 + exp(−β0 − β1 x)),   (14.1)

where β0 and β1 are parameters.

[Figure: the logistic function (14.1), a sigmoid curve increasing from 0 to 1.]
Note that the function (14.1) is not linear in the explanatory variable x. However, if we
transform both sides of the equation using the logit (or logistic) transformation
logit(P(x)) = log( P(x) / (1 − P(x)) ) = β0 + β1 x.   (14.2)

[Figure: four panels of sigmoid curves y = P(x) for different values of β0 and β1, each
plotted for x between −10 and 10.]
And this function is linear in x. Thus, if we transform the probability of success using the
logit transformation, we get a quantity logit(P (x)) which is linear in the explanatory variable
x.
Note that the ratio P(x) / (1 − P(x)) in (14.2) is the ratio between the probability of success
and the probability of failure. That is, it is the odds of success! Hence, logit(P(x)) in (14.2)
is the logarithm of the odds of success, called the log odds of success, given the value x of
the explanatory variable. Further, the log odds ratio (that is, the logarithm of the odds ratio)
corresponding to the probability of success when the explanatory variable has a value x and
the probability of success when the explanatory variable has the value x + 1, is given by
log( P(x+1) / (1 − P(x+1)) ) − log( P(x) / (1 − P(x)) ) = (β0 + β1(x+1)) − (β0 + β1 x) = β1,
which does not depend on the value x. That is, the parameter β1 represents the difference
in log odds given a one-unit increase in the explanatory variable. Since β1 is the logarithm
of the odds ratio, it follows that the odds ratio is given by e^β1. Note that the odds ratio is
sometimes called the odds multiplier, because it is the value the odds are multiplied by
when the value of the explanatory variable is increased by one unit.
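The fact that the odds multiplier e^β1 does not depend on x can be checked numerically; the sketch below uses made-up values of β0 and β1 (they are not taken from any example in the text):

```python
import math

# Sketch: the odds multiplier in simple logistic regression. The values of
# beta0 and beta1 below are hypothetical, chosen only for illustration.

def prob(x, beta0, beta1):
    """P(x) = 1 / (1 + exp(-(beta0 + beta1 * x))), the logistic model."""
    return 1 / (1 + math.exp(-(beta0 + beta1 * x)))

def odds(p):
    return p / (1 - p)

beta0, beta1 = -1.0, 0.5
for x in (0.0, 1.7, 3.2):
    # Increasing x by one unit multiplies the odds by e^beta1, whatever x is
    multiplier = odds(prob(x + 1, beta0, beta1)) / odds(prob(x, beta0, beta1))
    print(round(multiplier, 6), round(math.exp(beta1), 6))
```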
As mentioned earlier, there are other types of functions that produce sigmoid curves. These
functions will lead to different transformations than the logit transformation. (For example
the probit transformation or the complementary log-log transformation.) However, the most
commonly used transformation is the logit transformation, for several reasons. First of all,
the logit transformation often works well in linearising the relationship between P (x) and x.
Another important reason is the useful interpretations one can make in terms of odds and
odds ratios.
The logistic regression model is defined as follows. Suppose that Y1, . . . , Yn are
independent Bernoulli variables, and let pi denote the mean value of Yi, that is,
pi = E[Yi] = P(Yi = 1). The mean value pi can be expressed in terms of the explanatory
variables xi,1, xi,2, . . . , xi,k as
pi = 1 / (1 + exp(−β0 − Σ_{j=1}^k βj xi,j)).   (14.3)

Equivalently, taking the logit of both sides,

logit(pi) = log( pi / (1 − pi) ) = β0 + Σ_{j=1}^k βj xi,j.   (14.4)
The equation (14.4) is sometimes called the logit form of the model. Note that logit(pi) is
the log odds (that is, the logarithm of the odds) of success for the given values
xi,1, xi,2, . . . , xi,k of the explanatory variables.
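Equations (14.3) and (14.4) can be illustrated with a short sketch; the parameter and covariate values below are invented for illustration only:

```python
import math

# Sketch of equations (14.3) and (14.4) for one observation with k = 3
# explanatory variables. All numerical values are invented for illustration.

def p_from_linear_predictor(beta0, betas, xs):
    """Equation (14.3): p = 1 / (1 + exp(-(beta0 + sum_j beta_j x_j)))."""
    eta = beta0 + sum(b * x for b, x in zip(betas, xs))
    return 1 / (1 + math.exp(-eta))

def logit(p):
    """logit(p) = log(p / (1 - p)), the log odds."""
    return math.log(p / (1 - p))

beta0 = -0.5
betas = [0.8, -0.3, 1.2]    # hypothetical beta_1, beta_2, beta_3
xs = [1.0, 2.0, 0.5]        # hypothetical x_{i,1}, x_{i,2}, x_{i,3}

p = p_from_linear_predictor(beta0, betas, xs)
eta = beta0 + sum(b * x for b, x in zip(betas, xs))
# Equation (14.4): the logit of p recovers the linear predictor
print(round(p, 4), round(logit(p), 4))
```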
In some cases the data are grouped (e.g. in Example 14.2 the data are grouped according
to set intervals of the area of third-degree burns on the patient’s body). In such cases, the
probability of success pi in group i corresponds to the proportion of successes in that group.
But the logistic model is defined for the more general setting where each response variable Yi
has its ‘own’ set of explanatory variable values xi,1 , xi,2 , . . . , xi,k . In such cases, the probability
of success, pi , cannot be interpreted as a proportion of successes, but must be interpreted as
the mean of Yi (or the probability that Yi = 1), that is, pi = E[Yi ] = P (Yi = 1).
In logistic regression models, the response variables Yi are Bernoulli distributed with
parameter pi, rather than normally distributed as in linear regression models. Also, the
response variables Yi have different variances, var[Yi] = pi(1 − pi), i = 1, . . . , n, unlike
in linear regression models. Given these distributional differences, it is not surprising that
one cannot apply the same method for estimating the parameters β0, β1, . . . , βk in the two
models. (Recall that in linear regression, we used the least squares method to estimate the
parameters.) In logistic regression, the parameters are estimated using maximum likelihood
estimation. We shall always do this on a computer. (See the Computing document.)
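To give a feel for what the computer does, here is a minimal sketch of maximum likelihood estimation by Newton-Raphson for a one-variable logistic regression. The tiny data set is invented; in practice one would use a statistics package rather than code this by hand:

```python
import math

# A minimal sketch of maximum likelihood estimation for a one-variable
# logistic regression, using Newton-Raphson iterations by hand. The tiny
# data set below is invented purely to show the mechanics.

xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0,   0,   0,   1,   0,   1,   1,   1  ]   # binary responses

def p(x, b0, b1):
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

b0, b1 = 0.0, 0.0
for _ in range(25):
    # Gradient of the log likelihood
    g0 = sum(y - p(x, b0, b1) for x, y in zip(xs, ys))
    g1 = sum((y - p(x, b0, b1)) * x for x, y in zip(xs, ys))
    # Negative Hessian (observed information)
    w = [p(x, b0, b1) * (1 - p(x, b0, b1)) for x in xs]
    h00 = sum(w)
    h01 = sum(wi * x for wi, x in zip(w, xs))
    h11 = sum(wi * x * x for wi, x in zip(w, xs))
    det = h00 * h11 - h01 * h01
    # Newton step: beta <- beta + H^{-1} g
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (-h01 * g0 + h00 * g1) / det

print(round(b0, 3), round(b1, 3))   # the maximum likelihood estimates
```

At the maximum, the gradient of the log likelihood is zero, which is what the iterations drive it towards.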
The fitted logistic regression model relating the probability p that a patient survives a third-
degree burn to the explanatory variable log(area + 1), where ‘area’ is the area of third-degree
burns on the body, is given by

logit(p) = 22.708 − 10.662 log(area + 1).
In particular, logit(p) is the log odds of survival for a patient with burns corresponding to
log(area+1). For example, for a patient with burns corresponding to log(area+1) = 1.90, the
log odds is 2.450, thus the odds of survival are e^2.450 ≈ 11.59 to 1. Further, the value −10.662
in the equation is the log odds ratio for the explanatory variable log(area+1). Thus, the
odds multiplier corresponding to log(area+1) is e^−10.662 = 0.000023. That is, when log(area+1)
is increased by 1, the odds of success (i.e. that the patient survives) is increased by a
factor e^−10.662 (or, rather, reduced by a factor e^10.662). For example, the odds of survival for
a patient with log(area+1) = 1.5 is e^10.662 ≈ 43,000 times higher than the odds of survival for
a patient with log(area+1) = 2.5.
♦
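The numerical claims in the example can be verified directly from the two quoted quantities (the log odds 2.450 at log(area+1) = 1.90, and the slope −10.662):

```python
import math

# Checking the odds calculations in the burns example, starting from the
# two quantities quoted in the text: a log odds of 2.450 when
# log(area + 1) = 1.90, and a log odds ratio (slope) of -10.662.

odds_at_190 = math.exp(2.450)        # odds of survival, roughly 11.59 to 1
odds_multiplier = math.exp(-10.662)  # roughly 0.000023
factor = math.exp(10.662)            # roughly 43,000

print(round(odds_at_190, 2))
print(round(factor, -3))             # e^10.662 to the nearest thousand
```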
In general, when the explanatory variables are quantitative (as in Example 14.2), each of the
regression parameters β1, β2, . . . , βk can be interpreted as a log odds ratio for the corresponding
explanatory variable, when all other explanatory variables are held fixed. That is, the odds
multiplier for xi is equal to e^βi: when the explanatory variable xi is increased by 1 unit, and
all other explanatory variables are held constant, the odds of success is increased by a factor
e^βi. (Note that, if βi is negative, then e^βi < 1, so the odds of success is actually reduced,
by the factor e^−βi, as in Example 14.2.)
The explanatory variables in a logistic regression model can be quantitative, or they can be
indicator variables representing different levels of factors (as in Example 14.1, where the two
factor levels correspond to ‘using the pill’ or ‘not using the pill’, respectively). If the indicator
variables are defined using cell reference coding (see Module 9), the regression parameter βi
for an indicator variable zi can be interpreted as the log odds ratio corresponding to the
factor level referred to by zi and the reference level, when all other explanatory variables are
held constant. (The reference level is the factor level which is represented implicitly by the
indicator variables, as the level where all indicator variables are equal to zero.) The odds
ratio corresponding to the factor level referred to by zi and the reference level is given by
e^βi, when all other explanatory variables are held fixed.
The log odds ratio is given by 0.9366, so the odds ratio comparing the odds of having a
myocardial infarction for women using the pill with the odds of having a myocardial infarction
for women not using the pill is given by e^0.9366 ≈ 2.55 (as we found in Subsection 14.2.1).
♦
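As a consistency check, the odds ratio implied by the fitted log odds ratio 0.9366 can be compared with the one computed directly from the 2×2 table:

```python
import math

# Consistency check: the odds ratio implied by the fitted log odds ratio
# 0.9366 against the odds ratio computed directly from the table counts.

odds_pill = (23 / 57) / (1 - 23 / 57)
odds_no_pill = (35 / 167) / (1 - 35 / 167)
direct_ratio = odds_pill / odds_no_pill   # about 2.55
from_model = math.exp(0.9366)             # e^0.9366, also about 2.55

print(round(direct_ratio, 2), round(from_model, 2))
```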
Sometimes it is of interest to compare the odds of success for two groups (or individuals)
corresponding to different values of the explanatory variables. For example, suppose we wish
to compare the odds of Yi = 1 (with values of explanatory variables xi,1 , xi,2 , . . . , xi,k ) with
the odds of Yj = 1 (with values of explanatory variables xj,1 , xj,2 , . . . , xj,k ). The odds ratio
Rxi ,xj is given by
R_{xi,xj} = odds(xi,1, xi,2, . . . , xi,k) / odds(xj,1, xj,2, . . . , xj,k)
          = e^(β0 + Σ_{l=1}^k βl xi,l) / e^(β0 + Σ_{l=1}^k βl xj,l)
          = e^(Σ_{l=1}^k βl (xi,l − xj,l)).
Note that the parameter β0 cancels out in the expression for the odds ratio. Also, if
xi,l = xj,l for some explanatory variable l, the corresponding regression parameter βl cancels
out.
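A quick numerical illustration of this cancellation, with invented parameter and covariate values:

```python
import math

# Numerical illustration of the odds ratio R_{xi,xj}. The parameters and
# covariate vectors are invented; the point is that beta_0, and any
# coordinate where the two vectors agree, cancel out of the ratio.

def model_odds(beta0, betas, xs):
    """exp(beta0 + sum_l beta_l x_l), the odds under the logistic model."""
    return math.exp(beta0 + sum(b * x for b, x in zip(betas, xs)))

beta0 = 7.3                    # arbitrary: it cancels in the ratio
betas = [0.4, -1.1, 2.0]
xi = [1.0, 3.0, 0.5]
xj = [2.0, 3.0, 0.0]           # second coordinate equal to xi's

ratio = model_odds(beta0, betas, xi) / model_odds(beta0, betas, xj)
direct = math.exp(sum(b * (a - c) for b, a, c in zip(betas, xi, xj)))
print(round(ratio, 6), round(direct, 6))  # the two agree
```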
That is, the odds of survival for a patient with log(area+1) = 1.90 is
e^(10.662 × 0.1) = e^1.0662 ≈ 2.9 times higher than the odds of survival for a patient with
log(area+1) = 2.0.
♦
In this course, we shall not go into any technical details about test statistics, but focus on
interpreting the results of a logistic regression analysis through Example 14.3 below. We shall
use two different types of test statistics: the (log) likelihood ratio statistic (often referred
to as the -2 Log Q statistic) and the Wald statistic. In general, the likelihood ratio statistic
is superior to the Wald statistic (in the sense that it gives more reliable results), so we shall
mainly concentrate on the likelihood ratio statistic. The reason for considering the Wald
statistic too is that it is computationally easy and is given automatically in the output of
most statistical computer packages (e.g. SAS).
It is thought that the probability p of occurrence of vasoconstriction in the skin of the fingers
may depend on the volume of air inspired by a subject, the rate at which it is inspired, and
the second-order interaction between volume and rate. A typical output of a logistic regression
analysis will start by testing for overall regression, that is, testing the null hypothesis
H0: β1 = β2 = β3 = 0:
Both the likelihood ratio test and the Wald test reject H0 at the 5% significance level.
However, you can see that the two p-values are quite different! (In cases
where only one of the two tests rejects H0 , one should always rely on the likelihood ratio
test.) A standard output will also contain the following information on the estimators for the
parameters in the logistic regression model:
It follows that the fitted logistic regression model including the second-order interaction be-
tween the explanatory variables volume and rate, is given by
The Wald statistics for testing significance of the explanatory variables are given in the column
‘Wald statistic’. The Wald statistics are approximately χ² (chi-squared) distributed with the
degrees of freedom given in the column ‘d.f.’. The p-values for the test statistics are given in
the column ‘p-value’. Note that the p-values corresponding to all three explanatory variables
exceed 5%. Since the interaction term x1 × x2 is not significant (when x1 and x2 are in the
model), we can reduce the model to

logit(p) = β0 + β1 x1 + β2 x2,   (14.5)

according to the Wald test. However, rather than relying on the Wald test, we want to use
the likelihood ratio test for testing for no interaction.
To calculate the likelihood ratio statistic for testing the significance of an explanatory
variable x, we compare the model fit (often called -2 Log L) for the full model, containing
all explanatory variables, with the model fit for the reduced model, containing all explanatory
variables except the one of interest, x. The model fit is a measure of how well the given model
explains the data (the lower the better). If the full model explains the data ‘much better’
than the reduced model, the difference between -2 Log L_reduced and -2 Log L_full will be
‘large’; in this case, we reject the null hypothesis that the variable x is non-significant.
However, if the reduced model explains the data almost as well as the full model, the difference
between -2 Log L_reduced and -2 Log L_full will be close to zero; in this case, we accept the
null hypothesis that x is non-significant.
For these data, the model fit for the full model is given by

-2 Log L_full = 26.542.

(The values are taken from the SAS output in the Computing document.) In order to test the
significance of the interaction term, we compare the value -2 Log L_full = 26.542 to the value
of the model fit for the model with only x1 and x2 as explanatory variables. The model fit
for the reduced model is given by

-2 Log L_reduced = 29.733.

Note that the model fit for the reduced model is always higher than the model fit for the full
model.
The likelihood ratio statistic is the difference between the two model fits, that is,

-2 Log L_reduced − (-2 Log L_full) = 29.733 − 26.542 = 3.191,

which is an observation from a χ² distribution with 1 degree of freedom (that is, the number
of explanatory variables in the full model minus the number of explanatory variables in the
reduced model). This gives a p-value of p = 0.0740. Thus, the likelihood ratio test agrees
with the Wald test in not rejecting the hypothesis of no interaction at level 5%.
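The likelihood ratio calculation can be reproduced from the two quoted model fits; for 1 degree of freedom, the upper tail probability of the χ² distribution has a closed form in terms of the complementary error function:

```python
import math

# Reproducing the likelihood ratio test for the interaction term from the
# two model fits quoted in the text.

fit_full = 26.542      # -2 Log L, model with x1, x2 and the interaction
fit_reduced = 29.733   # -2 Log L, model with x1 and x2 only

lr_statistic = fit_reduced - fit_full   # 3.191

# For a chi-squared distribution with 1 degree of freedom, the upper tail
# probability is P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lr_statistic / 2))
print(round(lr_statistic, 3), round(p_value, 3))  # 3.191 0.074
```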
We refit the model for the data, using the reduced model (14.5), and obtain the following
output of the analysis. The table testing for overall regression is given by
Again, both the likelihood ratio test and the Wald test reject at the 5% significance level the
hypothesis of no overall regression. The table on the estimators for the parameters in the
logistic regression model is given by
According to the Wald tests, the model cannot be reduced any further at the 5% significance
level. But we wish to use the likelihood ratio test. In order to calculate the likelihood ratio
statistics for testing significance of x1 and x2 , respectively, we need the model fits for the two
reduced models:
Leaving out x1: -2 Log L_−x1 = 49.641
Leaving out x2: -2 Log L_−x2 = 46.993
These model fits are to be compared to our ‘new’ full model (14.5), that is,
-2 Log L_full = 29.733. The differences between the full model fit and the two reduced model
fits (-2 Log L_−x1 and -2 Log L_−x2) are 19.91 and 17.26, respectively. Both differences are
highly significant in a χ² distribution with 1 degree of freedom. Thus, according to the
likelihood ratio tests, the model cannot be reduced any further at the 5% significance level.
Our final model is (14.6).
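The two likelihood ratio tests can likewise be reproduced from the quoted model fits:

```python
import math

# Reproducing the likelihood ratio tests for x1 and x2 in the final model,
# using the model fits quoted above.

fit_full = 29.733          # -2 Log L for model (14.5), with x1 and x2
fit_without_x1 = 49.641
fit_without_x2 = 46.993

def chi2_upper_tail_1df(x):
    """P(X > x) for X chi-squared with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

lr_x1 = fit_without_x1 - fit_full   # 19.91, as in the text
lr_x2 = fit_without_x2 - fit_full   # 17.26
print(round(lr_x1, 2), round(lr_x2, 2))
print(chi2_upper_tail_1df(lr_x1) < 0.001, chi2_upper_tail_1df(lr_x2) < 0.001)
```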
14.4 Summary
In this module, we have considered situations where the response variables are binary random
variables, taking the values 1 and 0, for ‘success’ and ‘failure’, respectively. When the response
variables are binary, the interest centres on the probability of the response being success, for
given values of the explanatory variables. (The explanatory variables can be quantitative or
dummy variables, indexing different levels of categorical variables.) Two important concepts
associated with probabilities of success are odds and odds ratios. The odds of the response
variable being success, for given values of the explanatory variables, is the ratio between the
probability that the response is a success and the probability that the response is failure, given
the values of the explanatory variables. The odds ratio compares the odds of the response
variable being success for two different sets of values of the explanatory variables.
The most common method to use for analysing data with binary response variables is logistic
regression. In the logistic regression model, the response variable is Bernoulli distributed with
mean value (i.e. the probability of success) related to the explanatory variables through the
logit transformation. In particular, the logit of the mean is linearly related to the
explanatory variables. The parameters in the model can be estimated using the method of
maximum likelihood. Odds and odds ratios for the response variables can easily be calculated
from the parameters of the fitted model. In order to test hypotheses in logistic regression, we
have used the likelihood ratio test and the Wald test.
Keywords: binary variable, odds, odds ratio, sigmoid curve, logit transformation, logistic
transformation, log odds, log odds ratio, odds multiplier, logistic regression model, logit form,
(log) likelihood ratio statistic, -2 Log Q, Wald statistic, model fit, -2 Log L.