
Master of Applied Statistics Pia Veldt Larsen

ST111: Regression and analysis of variance

Module 14: Logistic regression

14.1 Introduction
14.2 Binary data
    14.2.1 Odds and odds ratio
14.3 Logistic regression
    14.3.1 The logistic function
    14.3.2 The logistic regression model
    14.3.3 Testing hypotheses in logistic regression
14.4 Summary

14.1 Introduction
Many different types of linear models have been discussed in the course so far. But in all
the models considered, the response variable has been a quantitative variable, which has been
assumed to be normally distributed. In this module, we consider situations where the response
variable is a categorical random variable, attaining only two possible outcomes. Examples of
this type of data are very common. For example, the response can be whether or not a patient
is cured of a disease, whether or not an item in a manufacturing process passes the quality
control, whether or not a mouse is killed by toxic exposure in a toxicology experiment, etc.
Since the response variables are dichotomous (that is, they have only two possible outcomes),
it is inappropriate to assume that they are normally distributed, so the data cannot be
analysed using the methods discussed so far in the course. The most common method for
analysing data with dichotomous response variables is logistic regression.

Data for which the response variable is dichotomous are discussed in more detail in Section
14.2. The most common model for such data is the logistic regression model. This is discussed
in Section 14.3. (The logistic regression model is a special case of a generalised linear model
which is the subject of the MAS course ST112.)

14.2 Binary data


When the response variable is dichotomous, it is convenient to denote one of the outcomes as
success and the other as failure. For example, if a patient is cured of a disease, the response
is ‘success’, if not, then the response is ‘failure’; if an item passes the quality control, the
response is ‘success’, if not, then the response is ‘failure’; if a mouse dies from toxic exposure,
the response is ‘success’, if not (i.e. if it survives) the response is ‘failure’. It is standard to
let the response variable Y be a binary variable, which attains the value 1, if the outcome
is ‘success’, and 0 if the outcome is ‘failure’. In a regression situation, each response variable
is associated with given values of a set of explanatory variables x1 , x2 , . . . , xk . For example,
whether or not a patient is cured of a disease may depend on the particular medical treatment
the patient is given, the patient’s general state of health, age, gender, etc.; whether or not an
item in a manufacturing process passes the quality control may depend on various conditions
regarding the production process, such as temperature, quality of raw material, time since
last service of the machinery, etc. It is often possible to group the observations such that all
observations within a group have the same values of the explanatory variables. For instance,
we may group the patients in the disease example according to type of medical treatment,
gender, and age group (say, up to 30, 31–40, 41–49, 50 and over), such that there are several
patients in each group. When the data can be grouped, it is easier to record the number of
successes and failures for each group than to record a long series of 0s and 1s.
Example 14.1 Oral contraceptives and myocardial infarction
The link between use of oral contraceptives and the incidence of myocardial infarction was
investigated. The table below gives the number of women in the study using the contraceptive
pill who suffered a myocardial infarction, and the number using the pill who did not suffer
a myocardial infarction. The corresponding numbers for women not using the pill are also
given.
                     Infarction
                     Yes     No
    Pill   Yes        23     34
           No         35    132
Further details on this dataset can be found here.

Example 14.2 Surviving third-degree burns


These data refer to 435 adults who were treated for third-degree burns by the University
of Southern California General Hospital Burn Center. The patients were grouped according
to the area of third-degree burns on the body. In the table below are recorded, for each
midpoint of the groupings ‘log(area +1)’, the number of patients in the corresponding group
who survived, and the number who died from the burns.
log(area+1)   Survived   Died
(midpoint)
   1.35          13        0
   1.60          19        0
   1.75          67        2
   1.85          45        5
   1.95          71        8
   2.05          50       20
   2.15          35       31
   2.25           7       49
   2.35           1       12

Further details on this dataset can be found here.


In the previous modules, we have analysed regression models where the response variables
are normally distributed. The main concern in such models is to express the value of the
response variable Y as a function of the values of the explanatory variables. However, when
the response variable is binary, its value is not particularly interesting: either it is 0 or it
is 1. Instead, the interest centres on the probability of the response being success, that is,
P (Y = 1), for given values of the explanatory variables.

14.2.1 Odds and odds ratio


The odds of some event (e.g. the event that Y = 1) is defined as the ratio of the probability
that the event occurs to the probability that it does not occur. That is, the odds of the
event E is given by

$$\mathrm{odds}(E) = \frac{P(E)}{P(\text{not } E)} = \frac{P(E)}{1 - P(E)}.$$

Example 14.1 (continued) Oral contraceptives and myocardial infarction


In the study on oral contraceptives and the incidence of myocardial infarction, there were 57
women in the study using the pill; of these, 23 had a myocardial infarction, and 34 did not.
An estimate of the probability of having a myocardial infarction for women in the study using
the pill is given by P (Epill ) = 23/57 = 0.4035. Hence, the odds, amongst these women, of
having a myocardial infarction when using the pill, is given by
$$\mathrm{odds}(E_{\text{pill}}) = \frac{0.4035}{1 - 0.4035} = 0.6764.$$

That is, for women using the pill, the probability of having a myocardial infarction is around
two-thirds of the probability of not having one. In other words, the odds are around 3 to 2
that a woman using the pill will not have a myocardial infarction. (To be more precise, the
odds are 0.6764 to 1 that a woman using the pill will have a myocardial infarction.)

Similarly, for women in the study who are not using the pill, an estimate of the probability of
having a myocardial infarction is given by P (Eno-pill ) = 35/167 = 0.2096. The odds of having
a myocardial infarction, when not using the pill, is given by
$$\mathrm{odds}(E_{\text{no-pill}}) = \frac{0.2096}{1 - 0.2096} = 0.2652.$$

Thus, the odds are 0.2652 to 1 (or, around 1 to 4) that a woman in the study not using the
pill will have a myocardial infarction.

The odds ratio comparing the odds of having a myocardial infarction for women using the pill
with the odds of having a myocardial infarction for women not using the pill, is given by

$$R_{\text{pill, no-pill}} = \frac{\mathrm{odds}(E_{\text{pill}})}{\mathrm{odds}(E_{\text{no-pill}})} = \frac{0.6764}{0.2652} = 2.5505.$$

That is, the odds of having a myocardial infarction are about 2.55 times as high for women
using the pill as for women not using the pill.

The odds ratio R_{A,B} that compares the odds of the events E_A and E_B (that is, the
event E occurring in group A and in group B, respectively) is defined as the ratio between
the two odds; that is,

$$R_{A,B} = \frac{\mathrm{odds}(E_A)}{\mathrm{odds}(E_B)} = \frac{P(E_A)}{1 - P(E_A)} \bigg/ \frac{P(E_B)}{1 - P(E_B)}.$$
In particular, if an odds ratio is equal to one, the odds are the same for the two groups. Note
that, if we define a factor with levels corresponding to groups A and B, respectively, then an
odds ratio equal to one is equivalent to there being no factor-effect.
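
To make these definitions concrete, here is a small Python sketch (not part of the original course material; the variable names are purely illustrative) that reproduces the odds and the odds ratio from the 2×2 table of Example 14.1:

```python
# Odds and odds ratio for the oral-contraceptive data (Example 14.1).
# The counts come from the 2x2 table given earlier in this section.
infarction_pill, no_infarction_pill = 23, 34
infarction_nopill, no_infarction_nopill = 35, 132

p_pill = infarction_pill / (infarction_pill + no_infarction_pill)          # 23/57  = 0.4035
p_nopill = infarction_nopill / (infarction_nopill + no_infarction_nopill)  # 35/167 = 0.2096

odds_pill = p_pill / (1 - p_pill)         # 0.6764
odds_nopill = p_nopill / (1 - p_nopill)   # 0.2652
odds_ratio = odds_pill / odds_nopill      # 2.55

print(f"odds (pill)    = {odds_pill:.4f}")
print(f"odds (no pill) = {odds_nopill:.4f}")
print(f"odds ratio     = {odds_ratio:.4f}")
```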

14.3 Logistic regression


For binary data, we are interested in analysing the relationship between the probability of the
response being success and the explanatory variables, rather than analysing the relationship
between the value of the response variable and the explanatory variables.

Example 14.2 (continued) Surviving third-degree burns


Consider the data on patients with third-degree burns. A first idea might be to model the
relationship between the probability of success (that the patient survives) and the explanatory
variable ‘log(area +1)’ as a simple linear regression model. However, the scatterplot in Figure
14.1 of the proportions of patients surviving a third-degree burn against the explanatory
variable shows a distinct curved relationship between the two variables, rather than a linear
one. It seems that a transformation of the data is in order.

When choosing an appropriate transformation for data of the type in Example 14.2, two
problems need to be addressed. One is that the relationship between the variables is curved
rather than linear, the other is that the probability of success can only take values between 0
and 1, while a linear function can attain any real value. It turns out that, often, one can take
care of both of these problems by transforming the probability using the logistic function.
The logistic function will turn a probability into a quantity which can take any real value,
and which is often linearly related to the explanatory variables.

[Figure 14.1: Proportion of surviving patients against log-area of burn]

14.3.1 The logistic function


The curved relationship in Figure 14.1 in Example 14.2 is typical for many situations where
the response variable is binary. The underlying smooth curve is called a sigmoid. A sigmoid
curve has the properties that the y-variable (the probability of success) is constrained to lie
between 0 and 1 such that y tends to 0 when x becomes small and y tends to 1 when x becomes
large (or the other way around), and the relationship between the y-variable and the x-variable
is approximately linear for y between about 0.2 and 0.8. Figure 14.2 shows four examples of different sigmoid
curves.
There are various types of functions that produce sigmoid curves, but the most common one
is of the following form:
$$P(x) = y = \frac{1}{1 + \exp(-\beta_0 - \beta_1 x)}, \qquad (14.1)$$
where P (x) denotes the probability of success when the explanatory variable has the value
x. The two parameters β0 and β1 determine the location, slope and spread of the curve. The
function is symmetric about the point x = −β0 /β1 . In particular, when x = −β0 /β1 , then
P (x) = 0.5.
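
As a small illustration (a sketch only; the parameter values below are made up and do not come from the examples in this module), the following Python snippet evaluates the sigmoid (14.1) and checks the symmetry about x = −β0/β1:

```python
import numpy as np

def logistic(x, beta0, beta1):
    """Sigmoid of the form (14.1): P(x) = 1 / (1 + exp(-beta0 - beta1*x))."""
    return 1.0 / (1.0 + np.exp(-beta0 - beta1 * x))

beta0, beta1 = -2.0, 0.5          # illustrative values only
x_mid = -beta0 / beta1            # point of symmetry

print(logistic(x_mid, beta0, beta1))           # 0.5, as stated above
print(logistic(x_mid - 3, beta0, beta1))       # points equidistant from x_mid
print(1 - logistic(x_mid + 3, beta0, beta1))   # give mirror-image probabilities
```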

Note that the function (14.1) is not linear in the explanatory variable x. However, if we
transform both sides of the equation using the logit (or logistic) transformation

logit (y) = log (y/ (1 − y)) ,

[Figure 14.2: Four different sigmoid curves]

we get the following equation:

$$\mathrm{logit}(P(x)) = \log\!\left(\frac{P(x)}{1 - P(x)}\right) = \beta_0 + \beta_1 x. \qquad (14.2)$$

And this function is linear in x. Thus, if we transform the probability of success using the
logit transformation, we get a quantity logit(P (x)) which is linear in the explanatory variable
x.
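
For readers who want to see the algebra (it is not spelled out in the text), the step from (14.1) to (14.2) is the following: with P(x) as in (14.1),

$$\frac{P(x)}{1 - P(x)} = \frac{\dfrac{1}{1 + e^{-\beta_0 - \beta_1 x}}}{\dfrac{e^{-\beta_0 - \beta_1 x}}{1 + e^{-\beta_0 - \beta_1 x}}} = e^{\beta_0 + \beta_1 x}, \qquad \text{so} \qquad \mathrm{logit}(P(x)) = \log\!\left(\frac{P(x)}{1 - P(x)}\right) = \beta_0 + \beta_1 x.$$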

Example 14.2 (continued) Surviving third-degree burns


The scatterplot in Figure 14.3 shows the logit-transformed proportions of patients surviving
a third-degree burn against the explanatory variable ‘log(area +1)’. You can see that the
points in the plot lie close to a straight line. That is, there appears to be a linear relationship
between the logit-transformed probability of a patient surviving a third-degree burn, and the
variable log(area +1).
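
The empirical logits plotted in Figure 14.3 can be reproduced with a short Python sketch (illustrative only; the 0.5 continuity correction is an assumption on our part, used so that the groups with zero deaths do not give infinite logits, and the least-squares line is merely for eyeballing the linearity rather than a proper fit):

```python
import numpy as np

# Grouped data from Example 14.2: midpoints of log(area+1), survivors, deaths.
log_area = np.array([1.35, 1.60, 1.75, 1.85, 1.95, 2.05, 2.15, 2.25, 2.35])
survived = np.array([13, 19, 67, 45, 71, 50, 35, 7, 1])
died     = np.array([0, 0, 2, 5, 8, 20, 31, 49, 12])

# Empirical proportions and their logits (with a 0.5 continuity correction).
p = (survived + 0.5) / (survived + died + 1.0)
logit_p = np.log(p / (1 - p))

# Straight line through the empirical logits, by least squares.
slope, intercept = np.polyfit(log_area, logit_p, 1)
print(np.round(logit_p, 2))
print(f"logit(p) approx {intercept:.2f} + {slope:.2f} * log(area+1)")
```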

[Figure 14.3: Logit-transformed proportion of surviving patients against log-area of burn]

Note that the ratio P(x)/(1 − P(x)) in (14.2) is the ratio between the probability of success
and the probability of failure; that is, it is the odds of success. Hence, logit(P(x)) in (14.2) is
the logarithm of the odds of success, called the log odds of success, given the value x of the
explanatory variable. Further, the log odds ratio (that is, the logarithm of the odds ratio)
corresponding to the probability of success when the explanatory variable has a value x and
the probability of success when the explanatory variable has the value x + 1, is given by

$$\log\!\left(\frac{P(x+1)}{1 - P(x+1)}\right) - \log\!\left(\frac{P(x)}{1 - P(x)}\right) = (\beta_0 + \beta_1(x+1)) - (\beta_0 + \beta_1 x) = \beta_1,$$

which does not depend on the value x. That is, the parameter β1 represents the difference
in log odds resulting from a one-unit increase in the explanatory variable. Since β1 is the
logarithm of the odds ratio, it follows that the odds ratio is given by e^β1. Note that the odds
ratio is sometimes called the odds multiplier, because it is the value the odds are multiplied
by when the value of the explanatory variable is increased by one unit.

As mentioned earlier, there are other types of functions that produce sigmoid curves. These
functions will lead to different transformations than the logit transformation. (For example
the probit transformation or the complementary log-log transformation.) However, the most
commonly used transformation is the logit transformation, for several reasons. First of all,
the logit transformation often works well in linearising the relationship between P (x) and x.
Another important reason is the useful interpretations one can make in terms of odds and
odds ratios.

14.3.2 The logistic regression model


The logistic regression model describes the relationship between a dichotomous response vari-
able Y , coded to take the values 1 or 0 for ‘success’ and ‘failure’, respectively, and k ex-
planatory variables x1 , x2 , . . . , xk . The explanatory variables can be quantitative or indicator
variables referring to the levels of categorical variables. Since Y is a binary variable, it has a
Bernoulli distribution with parameter p = P (Y = 1), that is, p is the probability of success for
given values x1 , x2 , . . . , xk of the explanatory variables. Recall that, for a Bernoulli variable,
the mean is given by E[Y ] = P (Y = 1) = p.

The logistic regression model is defined as follows. Suppose that Y1 , . . . , Yn are inde-
pendent Bernoulli variables, and let pi denote the mean value of Yi , that is, pi = E[Yi ] =
P (Yi = 1). The mean value pi can be expressed in terms of the explanatory variables
xi,1 , xi,2 , . . . , xi,k as
$$p_i = \frac{1}{1 + \exp\!\left(-\beta_0 - \sum_{j=1}^{k} \beta_j x_{i,j}\right)}. \qquad (14.3)$$

If we apply the logit transformation to (14.3), we get a linear relationship between logit(p_i)
and the explanatory variables:

$$\mathrm{logit}(p_i) = \log\!\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{i,j}. \qquad (14.4)$$

The equation (14.4) is sometimes called the logit form of the model. Note that, logit(pi ) is
the log odds (that is, the logarithm of the odds) of success for the given values xi,1 , xi,2 , . . . , xi,k
of the explanatory variables.

In some cases the data are grouped (e.g. in Example 14.2 the data are grouped according
to set intervals of the area of third-degree burns on the patient’s body). In such cases, the
probability of success pi in group i corresponds to the proportion of successes in that group.
But the logistic model is defined for the more general setting where each response variable Yi
has its ‘own’ set of explanatory variable values xi,1 , xi,2 , . . . , xi,k . In such cases, the probability
of success, pi , cannot be interpreted as a proportion of successes, but must be interpreted as
the mean of Yi (or the probability that Yi = 1), that is, pi = E[Yi ] = P (Yi = 1).

In logistic regression models, the response variables Yi are Bernoulli distributed with param-
eter pi , rather than normally distributed as in linear regression models. Also, the response
variables Yi have different variances, var[Yi] = pi(1 − pi), i = 1, . . . , n, unlike in linear
regression models, where the variance is constant. Given these distributional differences, it is
not surprising that the parameters β0, β1, . . . , βk cannot be estimated by the same method in
the two models. (Recall that in linear regression we used the least squares method to estimate
the parameters.) In logistic regression, the
parameters are estimated using maximum likelihood estimation. We shall always do this on
a computer. (See the Computing document.)
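
The course carries out this estimation in SAS (see the Computing document). Purely as an illustration, here is a minimal Python sketch, assuming the statsmodels package and the grouped counts of Example 14.2, of how the same maximum likelihood fit could be obtained:

```python
import numpy as np
import statsmodels.api as sm

# Grouped data from Example 14.2: midpoints of log(area+1), survivors, deaths.
log_area = np.array([1.35, 1.60, 1.75, 1.85, 1.95, 2.05, 2.15, 2.25, 2.35])
survived = np.array([13, 19, 67, 45, 71, 50, 35, 7, 1])
died     = np.array([0, 0, 2, 5, 8, 20, 31, 49, 12])

# For grouped binomial data, the response is given as (successes, failures).
y = np.column_stack([survived, died])
X = sm.add_constant(log_area)      # intercept column plus log(area+1)

# Binomial GLM with the (default) logit link = logistic regression,
# fitted by maximum likelihood.
result = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(result.params)               # should be close to (22.708, -10.662) quoted below
```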

Example 14.2 (continued) Surviving third-degree burns

The fitted logistic regression model relating the probability p that a patient survives a third-
degree burn to the explanatory variable log(area +1), where ‘area’ is the area of third-degree
burns on the body, is given by

logit (p) = 22.708 − 10.662 × log (area+1).

In particular, logit(p) is the log odds of survival for a patient with burns corresponding to a
given value of log(area+1). For example, for a patient with burns corresponding to
log(area+1) = 1.90, the log odds is 2.450, so the odds of survival are e^2.450 ≈ 11.59 to 1.
Further, the value −10.662 in the equation is the log odds ratio for the explanatory variable
log(area+1). Thus, the odds multiplier corresponding to log(area+1) is e^−10.662 = 0.000023.
That is, when log(area+1) is increased by 1, the odds of success (i.e. that the patient survives)
are multiplied by the factor e^−10.662 (or, equivalently, reduced by the factor e^10.662). For
example, the odds of survival for a patient with log(area+1) = 1.5 are e^10.662 ≈ 43,000 times
as high as the odds of survival for a patient with log(area+1) = 2.5.
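
The figures quoted in this example follow from a few lines of arithmetic (a sketch; the coefficients 22.708 and −10.662 are the fitted values given above):

```python
import numpy as np

b0, b1 = 22.708, -10.662             # fitted intercept and slope from the text

log_odds = b0 + b1 * 1.90            # log odds of survival at log(area+1) = 1.90: 2.450
odds = np.exp(log_odds)              # about 11.59 to 1

multiplier = np.exp(b1)              # odds multiplier per unit increase: about 2.3e-05
ratio = np.exp(b1 * (1.5 - 2.5))     # odds at 1.5 relative to 2.5: exp(10.662), about 43 000

print(log_odds, odds, multiplier, ratio)
```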

In general, when the explanatory variables are quantitative (as in Example 14.2), each regression
parameter β1, β2, . . . , βk can be interpreted as the log odds ratio for the corresponding
explanatory variable, when all other explanatory variables are held fixed. That is, the odds
multiplier for xi is equal to e^βi: when the explanatory variable xi is increased by 1 unit and
all other explanatory variables are held constant, the odds of success are multiplied by the
factor e^βi. (Note that if βi is negative, then e^βi < 1, so the odds of success are actually
reduced by the factor e^−βi, as in Example 14.2.)

The explanatory variables in a logistic regression model can be quantitative, or they can be
indicator variables representing different levels of factors (as in Example 14.1, where the two
factor levels correspond to ‘using the pill’ or ‘not using the pill’, respectively). If the indicator
variables are defined using cell reference coding (see Module 9), the regression parameter βi
for an indicator variable zi can be interpreted as the log odds ratio corresponding to the
factor level referred to by zi and the reference level, when all other explanatory variables are
held constant. (The reference level is the factor level which is represented implicitly by the
indicator variables, as the level where all indicator variables are equal to zero.) The odds
ratio comparing the factor level referred to by zi with the reference level is given by e^βi,
when all other explanatory variables are held fixed.

Example 14.1 (continued) Oral contraceptives and myocardial infarction


For the data on oral contraceptives and the incidence of myocardial infarction, let Y denote
the response variable taking the value 1 if the woman has a myocardial infarction, and 0 if
she does not, and let z denote the indicator variable taking the value 1 if oral contraceptives
were used, and 0 if they were not. The fitted logistic regression model relating p = P (Y = 1) to
z, is given by

logit (p) = −1.3275 + 0.9366 z.

The log odds ratio is given by 0.9366, so the odds ratio comparing the odds of having a
myocardial infarction for women using the pill with the odds of having a myocardial infarction
for women not using the pill is given by e^0.9366 ≈ 2.55 (as we found in Subsection 14.2.1).
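
Because the model contains a single indicator variable, the maximum likelihood estimates can be read off the 2×2 table directly; a short Python check (illustrative only) of the quoted coefficients:

```python
import numpy as np

# Counts from the 2x2 table in Example 14.1.
inf_pill, noinf_pill = 23, 34
inf_nopill, noinf_nopill = 35, 132

# With one indicator variable the fitted probabilities equal the observed
# proportions, so the intercept is the log odds in the reference group
# (no pill) and the coefficient of z is the log odds ratio.
intercept = np.log(inf_nopill / noinf_nopill)                          # about -1.3275
slope = np.log((inf_pill / noinf_pill) / (inf_nopill / noinf_nopill))  # about 0.9366
print(intercept, slope, np.exp(slope))                                 # exp(slope) is about 2.55
```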

Sometimes it is of interest to compare the odds of success for two groups (or individuals)
corresponding to different values of the explanatory variables. For example, suppose we wish
to compare the odds of Yi = 1 (with values of explanatory variables xi,1 , xi,2 , . . . , xi,k ) with
the odds of Yj = 1 (with values of explanatory variables xj,1 , xj,2 , . . . , xj,k ). The odds ratio
R_{x_i, x_j} is given by

$$R_{x_i, x_j} = \frac{\mathrm{odds}(x_{i,1}, x_{i,2}, \ldots, x_{i,k})}{\mathrm{odds}(x_{j,1}, x_{j,2}, \ldots, x_{j,k})} = \frac{e^{\beta_0 + \sum_{l=1}^{k} \beta_l x_{i,l}}}{e^{\beta_0 + \sum_{l=1}^{k} \beta_l x_{j,l}}} = e^{\sum_{l=1}^{k} \beta_l (x_{i,l} - x_{j,l})}.$$

Note that the parameter β0 cancels out in the expression for the odds ratio. Also, if
x_{i,l} = x_{j,l} for some explanatory variable l, the corresponding term involving βl cancels
out as well.

Example 14.2 (continued) Surviving third-degree burns


Suppose we are interested in comparing the odds of surviving third-degree burns for patients
with burns corresponding to log(area +1)= 1.90, and patients with burns corresponding to
log(area+1) = 2.00. The odds ratio R_{1.90, 2.00} is given by

$$R_{1.90,\,2.00} = e^{-10.662 \times (1.90 - 2.00)} = e^{1.0662} = 2.904.$$

That is, the odds of survival for a patient with log(area+1) = 1.90 are about 2.9 times as high
as the odds of survival for a patient with log(area+1) = 2.00.
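
The general formula for R_{x_i, x_j} translates directly into code; a small hedged sketch (the function name is ours, not standard terminology):

```python
import numpy as np

def odds_ratio(beta, x_i, x_j):
    """Odds ratio comparing covariate settings x_i and x_j for a logistic
    regression with slope coefficients beta = (beta_1, ..., beta_k).
    The intercept beta_0 cancels out and is therefore not needed."""
    beta = np.asarray(beta, dtype=float)
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return np.exp(beta @ diff)

# Single covariate, burns example: beta_1 = -10.662, comparing 1.90 with 2.00.
print(odds_ratio([-10.662], [1.90], [2.00]))   # about 2.90
```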

14.3.3 Testing hypotheses in logistic regression


In logistic regression, hypotheses on significance of explanatory variables cannot be tested
in quite the same way as in linear regression. Recall that in linear regression, where the
response variables are normally distributed, we can use t- or F -test statistics for testing
significance of explanatory variables. But in logistic regression, the response variables are
Bernoulli distributed, so we have to use different test statistics, whose exact distributions are
unknown. Fortunately, there exist fairly good approximations to the distributions of these test
statistics.

In this course, we shall not go into any technical details about test statistics, but focus on
interpreting the results of a logistic regression analysis through Example 14.3 below. We shall
use two different types of test statistics: the (log) likelihood ratio statistic (often referred
to as the -2 Log Q statistic) and the Wald statistic. In general, the likelihood ratio statistic
is superior to the Wald statistic (in the sense that it gives more reliable results), so we shall
mainly concentrate on the likelihood ratio statistic. The reason for considering the Wald
statistic too is that it is computationally easy and is given automatically in the output of
most statistical computer packages (e.g. SAS).

Example 14.3 Transient vasoconstriction in skin of fingers


A study was made into the effect of volume (x1 ) and rate (x2 ) of air inspired by human
subjects on the occurrence (Y ) of transient vasoconstriction in the skin of the fingers. The
response variable Y takes the value 1 if transient vasoconstriction occurs, and 0 if it does not.
A total of 39 observations were obtained on these variables from 3 subjects in a laboratory.
The data are assumed to be independent (including those on the same subject). Some of the
data are shown in the table below.
Y:             1     1    ···   0     1
Volume (x1):  3.70  3.50  ···  0.75  1.30
Rate (x2):    0.83  1.09  ···  1.90  1.63

It is thought that the probability p of occurrence of vasoconstriction in the skin of the fingers
may depend on the volume, the rate, and the second-order interaction between volume and
rate, of air inspired by a subject. A typical output of a logistic regression analysis will start
by testing for overall regression, that is, testing the null hypothesis H0 : β1 = β2 = β3 = 0:

Testing overall regression


Test        d.f.   χ²-value   p-value
-2 Log Q     3     27.498     < 0.0001
Wald         3      9.113     0.0278

Both the likelihood ratio test and the Wald test reject H0 at the 5% significance level.
However, you can see that the two p-values are quite different! (In cases
where only one of the two tests rejects H0 , one should always rely on the likelihood ratio
test.) A standard output will also contain the following information on the estimators for the
parameters in the logistic regression model:

Parameter    d.f.   Estimate   Standard error   Wald statistic   p-value
Intercept     1     -7.1335        3.3574            4.5145       0.0336
x1            1      1.2676        2.1512            0.3472       0.5557
x2            1      0.5155        1.5105            0.1165       0.7329
x1 × x2       1      2.4103        1.6636            2.0992       0.1474

It follows that the fitted logistic regression model, including the second-order interaction
between the explanatory variables volume and rate, is given by

logit (p) = −7.134 + 1.268x1 + 0.516x2 + 2.410x1 × x2 .

The Wald statistics for testing significance of the explanatory variables are given in the column
‘Wald statistic’. The Wald statistics are approximately χ2 (Chi-squared) distributed with the
degrees of freedom given in the column ‘d.f.’. The p-values for the test statistics are given in
the column ‘p-value’. Note that the p-values corresponding to all three explanatory variables
exceed 5%. Since there is an interaction term x1 × x2 which is not significant (when x1 and
x2 are in the model), we can reduce the model to

logit (p) = β0 + β1 x1 + β2 x2 , (14.5)

according to the Wald test. However, rather than relying on the Wald test, we want to use
the likelihood ratio test for testing for no interaction.

To calculate the likelihood ratio statistic for testing significance of an explanatory variable x,
we need to compare the model fit (a log-likelihood measure, often called -2 Log L) for the full
model, containing all explanatory variables, with the model fit for the reduced model,
containing all explanatory variables except the one of interest, x. The model fit is a measure
of how well the given model explains the data (the lower the better). For example, if the
full model explains the data ‘much better’ than the reduced model, the difference between
-2 Log L_reduced and -2 Log L_full will be ‘large’; in this case, we reject the null hypothesis
that the variable x is non-significant. However, if the reduced model explains the data almost
as well as the full model, the difference between -2 Log L_reduced and -2 Log L_full will be
close to zero; in this case, we accept the null hypothesis that x is non-significant.

For these data, the model fit for the full model is given by

Model fit for full model: -2 Log L_full = 26.542.

(The values are taken from the SAS output in the Computing document.) In order to test
significance of the interaction term, we compare the value -2 Log L_full = 26.542 to the value
of the model fit for the model with only x1 and x2 as explanatory variables. The model fit
for the reduced model is given by

Model fit for reduced model: -2 Log L_reduced = 29.733.

Note that the model fit for the reduced model is always higher than the model fit for the full
model.

The likelihood ratio statistic is the difference between the two model fits, that is,

D = (-2 Log L_reduced) − (-2 Log L_full) = 29.733 − 26.542 = 3.191,

which is approximately an observation from a χ² distribution with 1 degree of freedom (the
number of explanatory variables in the full model minus the number in the reduced model).
This gives a p-value of p = 0.0740. Thus, the likelihood ratio test agrees with the Wald test
in not rejecting the hypothesis of no interaction at the 5% level.
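
The p-value is simply the upper tail of a χ² distribution; a two-line check in Python (assuming the model fits quoted above):

```python
from scipy.stats import chi2

D = 29.733 - 26.542        # likelihood ratio statistic for the interaction term
print(D, chi2.sf(D, df=1)) # about 3.191 and p = 0.074
```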

We refit the model for the data, using the reduced model (14.5), and obtain the following
output of the analysis. The table testing for overall regression is given by

Testing overall regression


Test        d.f.   χ²-value   p-value
-2 Log Q     2     24.306     < 0.0001
Wald         2      8.998     0.0111

Again, both the likelihood ratio test and the Wald test reject at the 5% significance level the
hypothesis of no overall regression. The table on the estimators for the parameters in the
logistic regression model is given by

Parameter    d.f.   Estimate   Standard error   Wald statistic   p-value
Intercept     1     -9.5533        3.2417            8.6845       0.0032
x1            1      3.8906        1.4318            7.3834       0.0066
x2            1      2.6561        0.9166            8.3972       0.0038

That is, the fitted model is given by

logit (p) = −9.553 + 3.891x1 + 2.656x2 . (14.6)

According to the Wald tests, the model cannot be reduced any further at the 5% significance
level. But we wish to use the likelihood ratio test. In order to calculate the likelihood ratio
statistics for testing significance of x1 and x2 , respectively, we need the model fits for the two
reduced models:
Leaving out x1 : -2 Log L_−x1 = 49.641
Leaving out x2 : -2 Log L_−x2 = 46.993
These model fits are to be compared with that of our ‘new’ full model (14.5), that is,
-2 Log L_full = 29.733. The differences between the full model fit and the two reduced model
fits (-2 Log L_−x1 and -2 Log L_−x2) are 19.91 and 17.26, respectively. Both differences are
highly significant when compared to a χ² distribution with 1 degree of freedom. Thus, according
to the likelihood ratio tests, the model cannot be reduced any further at the 5% significance
level. Our final model is (14.6).
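
The same χ² calculation applies to the two reductions considered here (a sketch, using the model fits quoted above):

```python
from scipy.stats import chi2

full = 29.733                                   # -2 Log L for the model with x1 and x2
for name, reduced in [("x1", 49.641), ("x2", 46.993)]:
    D = reduced - full                          # likelihood ratio statistic, 1 d.f.
    print(name, round(D, 2), chi2.sf(D, df=1))  # both p-values are far below 5%
```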

Further details on this dataset can be found here.


14.4 Summary
In this module, we have considered situations where the response variables are binary random
variables, taking the values 1 and 0, for ‘success’ and ‘failure’, respectively. When the response
variables are binary, the interest centres on the probability of the response being success, for
given values of the explanatory variables. (The explanatory variables can be quantitative or
dummy variables, indexing different levels of categorical variables.) Two important concepts
associated with probabilities of success are odds and odds ratios. The odds of the response
variable being success, for given values of the explanatory variables, is the ratio between the
probability that the response is a success and the probability that the response is failure, given
the values of the explanatory variables. The odds ratio compares the odds of the response
variable being success for two different sets of values of the explanatory variables.

The most common method to use for analysing data with binary response variables is logistic
regression. In the logistic regression model, the response variable is Bernoulli distributed with
mean value (i.e. the probability of success) related to the explanatory variables through the
logit transformation. In particular, the logit of the mean is linearly related to the
explanatory variables. The parameters in the model can be estimated using the method of
maximum likelihood. Odds and odds ratios for the response variables can easily be calculated
from the parameters of the fitted model. In order to test hypotheses in logistic regression, we
have used the likelihood ratio test and the Wald test.

Keywords: binary variable, odds, odds ratio, sigmoid curve, logit transformation, logistic
transformation, log odds, log odds ratio, odds multiplier, logistic regression model, logit form,
(log) likelihood ratio statistic, -2 Log Q, Wald statistic, model fit, -2 Log L.
