
Stat 544, Lecture 12
Introduction to
Generalized Linear
Models
Logit model with categorical predictors. Before leaving the topic of logistic regression, let's look at one more example. No course in categorical data would be complete without repeatedly analyzing the infamous Berkeley graduate admissions dataset. This table shows graduate admissions information for the six largest departments at U.C. Berkeley in the fall of 1973:
            Men       Men     Women     Women
Dept.  rejected  accepted  rejected  accepted
A           313       512        19        89
B           207       353         8        17
C           205       120       391       202
D           278       139       244       131
E           138        53       299        94
F           351        22       317        24
Let D = department, S = sex, and A = admission status (rejected or accepted). Because it's reasonable to regard A as a response and D, S as potential predictors, it's quite natural to analyze these data by logistic regression. In this example, the counts are large enough for goodness-of-fit testing.

Entering the data. Let's define "success" to be granted admission to Berkeley. The cases or rows in the dataset correspond to combinations of D and S. Let's create a data file with columns for the covariates and columns for n_i − y_i (rejected) and y_i (accepted):
A M 313 512
A F 19 89
B M 207 353
B F 8 17
C M 205 120
C F 391 202
D M 278 139
D F 244 131
E M 138 53
E F 299 94
F M 351 22
F F 317 24
Now read it into R:
> admissions <- read.table("admissions.dat",
+ col.names=c("Dept","Sex","Reject","Accept"))
> admissions
Dept Sex Reject Accept
1 A M 313 512
2 A F 19 89
3 B M 207 353
4 B F 8 17
5 C M 205 120
6 C F 391 202
7 D M 278 139
8 D F 244 131
9 E M 138 53
10 E F 299 94
11 F M 351 22
12 F F 317 24
> y <- admissions$Accept
> n <- admissions$Accept+admissions$Reject
> dept <- admissions$Dept
> sex <- admissions$Sex
Overdispersion? First we must decide on the maximal model so that we may address the possibility of overdispersion. Suppose that we define dummy indicators for sex (1) and department (5). The main-effects-only model will have p = 7 coefficients and N − p = 12 − 7 = 5 degrees of freedom for estimating overdispersion. This model says that the effect of sex on the log-odds of success is identical across departments.

Is that the most complicated model that we should consider? Probably not. It's possible, even probable, that gender effects vary by department. So we can define the maximal model to include main effects for sex, department, and their interactions. That model has 1 + 1 + 5 + 5 = 12 parameters and is saturated, leaving us no way to estimate a scale parameter for overdispersion. Our choices are:

- Consider the saturated model to be the maximal model and assume φ = 1.

- Consider the main-effects-only model to be the maximal model, eliminating the possibility of interactions.
Putting it another way, suppose that the main-effects-only model doesn't fit. We must decide whether we are going to attribute that lack of fit to overdispersion or to interactions. Because interactions are entirely believable in this example, I would prefer the latter. Therefore, I will choose the saturated model as the maximal one, setting φ = 1 and proceeding under the untestable assumption that the data are binomial.

Choices like this always need to be made. If there were more predictors, I might be willing to throw away the higher-order interactions a priori and estimate a scale parameter. For example, if there were four predictors, I would be happy to throw away all three- and four-way interactions and consider the model with all two-way interactions to be maximal.
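(A hedged aside, not in the original notes: had we taken the main-effects-only model as maximal, the usual moment estimate of the scale parameter would divide that model's Pearson statistic by its degrees of freedom. Using the X^2 = 18.83 on 5 df reported for that model later in this lecture:)

> 18.83/5   # moment estimate of the scale parameter
[1] 3.766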
Fitting some models. First, let's fit the null model with an intercept only:
> source("lec11.R") # load the R functions from Lecture 11
> x <- matrix(1,nrow=12,ncol=1)
> result <- nr.logit(x,y,n)
1...2...3...4...
> logit.print(result)
The Newton-Raphson algorithm converged in 4 iterations.
coef SE coef/SE pval
[1,] -0.4558088 0.03050388 -14.94 0
Loglikelihood = -3022.62659013274
Pearsons X^2 = 797.068595649145
Deviance G^2 = 876.571862683973
df = 11
50% of cells have expected counts below 5.0
Notice that the fitted values are all equal to the overall proportion accepted. The intercept of the null model, which is the log-odds of acceptance, could have been estimated directly:
> result$fitted
[1] 0.3879806 0.3879806 0.3879806 0.3879806 0.3879806 0.3879806
[7] 0.3879806 0.3879806 0.3879806 0.3879806 0.3879806 0.3879806
> p <- sum(y)/sum(n)
> p
[1] 0.3879806
> log(p/(1-p))
[1] -0.4558088
So we didn't need a logistic regression procedure to fit this model; we could have just collapsed all of the data down to the sufficient statistics Σ_i y_i and Σ_i n_i. But if we collapsed in this way, we would lose the opportunity to test the fit of the model.
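(A hedged aside, not in the original notes: the same fit can be reproduced with R's built-in glm() function, which is introduced later in this lecture; the response is supplied as a two-column matrix of success and failure counts.)

> fit0 <- glm(cbind(y, n - y) ~ 1, family = binomial)
> coef(fit0)      # should match the intercept -0.4558 above
> deviance(fit0)  # should match G^2 = 876.57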
Obviously, the null model doesn't fit; both X^2 and G^2 are huge. Now let's add a main effect for sex:
> x <- cbind(const=1,male=1*(sex=="M"))
> result <- nr.logit(x,y,n)
1...2...3...4...5...
> logit.print(result)
The Newton-Raphson algorithm converged in 5 iterations.
coef SE coef/SE pval
const -0.8304864 0.05077209 -16.36 0
male 0.6118568 0.06389111 9.58 0
Loglikelihood = -2975.66499423578
Pearsons X^2 = 714.303888162303
Deviance G^2 = 782.64867089005
df = 10
50% of cells have expected counts below 5.0
Sex is significant. We could also perform this test by comparing the fit of this model to the last model:

  X^2 = 797.07 − 714.30 = 82.76
  G^2 = 876.57 − 782.65 = 93.92

Notice that these are pretty close to the squared Wald statistic 9.58^2 = 91.78. Once again, we could have derived the estimates for this model without using logistic regression software.
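(Continuing the hedged glm() aside: the same likelihood-ratio comparison is a one-liner with anova(); fit1 here is the glm() version of the sex-only model, and fit0 is from the sketch above.)

> fit1 <- glm(cbind(y, n - y) ~ sex, family = binomial)
> anova(fit0, fit1, test = "Chisq")   # change in deviance: 93.92 on 1 df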
Look at the 2 × 2 table for S × A, collapsed over D:

         Accepted  Rejected
Male         1199      1492
Female        557      1278

The estimated log-odds of admission for females is log(557/1278) = −0.8304864, which is equal to the intercept. The log-odds ratio is

  log( (1199 × 1278) / (557 × 1492) ) = 0.6118568,

exactly equal to the slope. We did not need to run logistic regression to get these estimates. But the 2 × 2 table alone would not allow us to test fit.

Looking at the slope, exp(0.6118) = 1.84 tells us that males appear to be 84% more likely (on the odds scale) to gain admission than females. But we should be wary of that interpretation, because this model does not fit; we have ignored Department.
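(A hedged sketch of these hand calculations in R, using the admissions data frame read in earlier:)

> acc <- tapply(admissions$Accept, admissions$Sex, sum)
> rej <- tapply(admissions$Reject, admissions$Sex, sum)
> rbind(acc, rej)                    # the collapsed 2 x 2 table
> log(557/1278)                      # female log-odds = the intercept
> log((1199 * 1278)/(557 * 1492))    # log-odds ratio = the slope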
The goodness-of-fit statistics X^2 = 714.30, G^2 = 782.65 are precisely the same statistics that one would get by testing whether the model (DS, SA) holds for the three-way table. They are testing the significance of all terms not in the current model. Those terms include the five effects for department and the five effects for the department-by-sex interaction. These test statistics are very large, indicating that at least some of the omitted covariates are significantly related to admission.
Now let's use Department as a predictor. We could define five dummy variables, but that would require choosing one department to serve as a baseline. Instead, let's try "effect coding," which is often used in ANOVA:

     DeptA  DeptB  DeptC  DeptD  DeptE
A        1      0      0      0      0
B        0      1      0      0      0
C        0      0      1      0      0
D        0      0      0      1      0
E        0      0      0      0      1
F       -1     -1     -1     -1     -1
If we include an intercept and these five effects, the intercept β_0 will be the grand average of the log-odds of admission across the departments. The coefficient for DeptA, β_1, will be the log-odds for admission in A minus the grand average; β_2 will be the log-odds for admission in B minus the grand average; and so on. To get the effect for Department F, we would take

  −β_1 − β_2 − ··· − β_5,

because deviations from a grand average must sum to zero.
> DeptA <- 1*(dept=="A")-1*(dept=="F")
> DeptB <- 1*(dept=="B")-1*(dept=="F")
> DeptC <- 1*(dept=="C")-1*(dept=="F")
> DeptD <- 1*(dept=="D")-1*(dept=="F")
> DeptE <- 1*(dept=="E")-1*(dept=="F")
> x <- cbind(int=1,DeptA=DeptA,DeptB=DeptB,DeptC=DeptC,DeptD=DeptD,DeptE=DeptE)
> result <- nr.logit(x,y,n)
1...2...3...4...5...6...7...
> logit.print(result)
The Newton-Raphson algorithm converged in 7 iterations.
coef SE coef/SE pval
int -0.65062620 0.03900137 -16.68 0.000
DeptA 1.24408616 0.06810581 18.27 0.000
DeptB 1.19349118 0.08014789 14.89 0.000
DeptC 0.03493708 0.06862994 0.51 0.611
DeptD -0.00861943 0.07257673 -0.12 0.905
DeptE -0.43887441 0.08707357 -5.04 0.000
Loglikelihood = -2595.17191982568
Pearsons X^2 = 19.8653921025301
Deviance G^2 = 21.6625220698506
df = 6
50% of cells have expected counts below 5.0
Many of the Department effects are significant. To test their joint significance, we may compare this model to the null model:

  X^2 = 797.07 − 19.87 = 777.20
  G^2 = 876.57 − 21.66 = 854.91

Comparing these to χ^2_5 gives p-values of essentially zero, overwhelming evidence that the odds of admission are not the same across departments.
Notice that the overall fit of this model is much better than the previous ones (X^2 = 19.9, G^2 = 21.7), but the 95th percentile of χ^2_6 is 12.6, so the model still does not fit. These are the same statistics that you would get from testing the fit of (DS, DA) for the three-way table.
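(A hedged sketch of these chi-square computations in R:)

> 1 - pchisq(854.91, 5)   # joint test of the department effects: essentially 0
> qchisq(0.95, 6)         # 12.59, the critical value for the fit test
> 1 - pchisq(21.66, 6)    # p-value for the G^2 fit test: small, so the model is rejected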
Now fit the model with main effects for Sex and Department:
> x <- cbind(int=1,male=1*(sex=="M"),DeptA=DeptA,DeptB=DeptB,DeptC=DeptC,
+ DeptD=DeptD,DeptE=DeptE)
> result <- nr.logit(x,y,n)
1...2...3...4...5...6...7...
> logit.print(result)
The Newton-Raphson algorithm converged in 7 iterations.
coef SE coef/SE pval
int -0.59338803 0.06149323 -9.65 0.000
male -0.09672564 0.08081178 -1.20 0.231
DeptA 1.27251950 0.07227144 17.61 0.000
DeptB 1.22889795 0.08555160 14.36 0.000
DeptC 0.01161973 0.07135220 0.16 0.871
DeptD -0.01530050 0.07280923 -0.21 0.834
DeptE -0.46499070 0.08981668 -5.18 0.000
Loglikelihood = -2594.45323057656
Pearsons X^2 = 18.8317115084122
Deviance G^2 = 20.2251435716120
df = 5
50% of cells have expected counts below 5.0
The effect for Sex is not significant, and this model still does not fit (the 95th percentile of χ^2_5 is 11.07). This model says that Sex and Admission could be related, but the SA odds ratio is exactly the same in every department. In other words, it fits a common conditional odds ratio for every department. This is the model of homogeneous association (DS, SA, DA).

The estimate of the conditional SA odds ratio is exp(−0.09672) = 0.91, now suggesting that males are about 9% less likely to gain admission than females. When Department was added to the model, the coefficient for Sex changed its sign. This is an example of Simpson's paradox. The marginal SA odds ratio was greater than one, but the pooled estimate of the conditional SA odds ratio is less than one. It's not significantly different from one, though. So we must say that there is no evidence of gender bias within departments in an overall sense.

But again, we should be very cautious in our interpretation, because this model does not fit. The only covariates that separate this model from the saturated model are the Sex × Department interactions. Therefore, the goodness-of-fit test for this model is testing the joint significance of the five interactions, and they are jointly significant. This means that the conditional SA odds ratios are not identical for all departments. There is variation in the effects of Sex across departments.
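(A hedged aside: because the saturated model fits perfectly, the fit statistics of this model are themselves the test of the five interactions.)

> 1 - pchisq(20.23, 5)   # joint test of the Sex x Department interactions, about 0.001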
To see what may be going on, we could calculate the SA odds ratio for each department (a sketch of that calculation appears after the residual table below). Or we could examine the residuals:
> cbind(dept,sex,result$residuals)
dept sex
[1,] 1 2 -1.25875251
[2,] 1 1 3.53075287
[3,] 2 2 -0.05751478
[4,] 2 1 0.27599253
[5,] 3 2 1.24496732
[6,] 3 1 -0.90816860
[7,] 4 2 0.11808709
[8,] 4 1 -0.12262914
[9,] 5 2 1.22814279
[10,] 5 1 -0.83561275
[11,] 6 2 -0.21334187
[12,] 6 1 0.21392412
Case 2 (Department A females) is an outlier. The residual 3.53 indicates that the observed number of females accepted to Department A is substantially higher than this model predicts.
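(A hedged sketch of the department-specific odds ratios suggested above, computed directly from the data frame; the rows alternate M, F within each department:)

> or <- with(admissions, (Accept[Sex == "M"]/Reject[Sex == "M"]) /
+                        (Accept[Sex == "F"]/Reject[Sex == "F"]))
> round(or, 2)   # Department A is about 0.35; the others range from 0.80 to 1.22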
Now let's fit the saturated model:
> male <- 1*(sex=="M")
> x <- cbind(int=1,male=male,
+ A=DeptA,B=DeptB,C=DeptC,D=DeptD,E=DeptE,
+ male.A=male*DeptA,male.B=male*DeptB,male.C=male*DeptC,
+ male.D=male*DeptD,male.E=male*DeptE)
> result <- nr.logit(x,y,n)
1...2...3...4...5...6...7...
> logit.print(result)
The Newton-Raphson algorithm converged in 7 iterations.
coef SE coef/SE pval
int -0.45373972 0.0951220 -4.77 0.000
male -0.20117699 0.1101725 -1.83 0.068
A 1.99793711 0.2272148 8.79 0.000
B 1.20751152 0.3627633 3.33 0.001
C -0.20670014 0.1185477 -1.74 0.081
D -0.16823118 0.1298826 -1.30 0.195
E -0.70340907 0.1355362 -5.19 0.000
male.A -0.85089896 0.2411397 -3.53 0.000
male.B -0.01884555 0.3738933 -0.05 0.960
male.C 0.32609862 0.1610929 2.02 0.043
male.D 0.13000071 0.1647833 0.79 0.430
male.E 0.40136401 0.1971532 2.04 0.042
Loglikelihood = -2584.34065879075
Pearsons X^2 = 9.7385400855217e-29
Deviance G^2 = 1.34559030584569e-13
df = 0
50% of cells have expected counts below 5.0
The goodness-of-fit statistics are zero except for rounding error.

Under this coding, getting the SA odds ratio for each department is tricky. But we can change the coding to make it easier. Let's remove the intercept, define dummy indicators for all six Departments, and add these dummies and their interactions with sex:
> DeptA <- 1*(dept=="A")
> DeptB <- 1*(dept=="B")
> DeptC <- 1*(dept=="C")
> DeptD <- 1*(dept=="D")
> DeptE <- 1*(dept=="E")
> DeptF <- 1*(dept=="F")
> x <- cbind(DeptA=DeptA, DeptB=DeptB, DeptC=DeptC,
+ DeptD=DeptD, DeptE=DeptE, DeptF=DeptF,
+ male.A=male*DeptA, male.B=male*DeptB, male.C=male*DeptC,
+ male.D=male*DeptD, male.E=male*DeptE, male.F=male*DeptF)
> result <- nr.logit(x,y,n)
1...2...3...4...5...6...7...
> logit.print(result)
The Newton-Raphson algorithm converged in 7 iterations.
coef SE coef/SE pval
DeptA 1.54419739 0.25272027 6.11 0.000
DeptB 0.75377180 0.42874646 1.76 0.079
DeptC -0.66043986 0.08664895 -7.62 0.000
DeptD -0.62197090 0.10831412 -5.74 0.000
DeptE -1.15714879 0.11824880 -9.79 0.000
DeptF -2.58084794 0.21171028 -12.19 0.000
male.A -1.05207596 0.26270810 -4.00 0.000
male.B -0.22002254 0.43759263 -0.50 0.615
male.C 0.12492163 0.14394243 0.87 0.385
male.D -0.07117628 0.15007770 -0.47 0.635
male.E 0.20018702 0.20024255 1.00 0.317
male.F -0.18889583 0.30516354 -0.62 0.536
Loglikelihood = -2584.34065879075
Pearsons X^2 = 1.41849511893867e-29
Deviance G^2 = 6.3948846218409e-14
df = 0
50% of cells have expected counts below 5.0
In this new coding scheme, the coefficient for DeptA is the estimated log-odds of admission for females in A. The coefficient for male.A is the log-odds for males in A minus the log-odds for females in A; that is, the conditional SA log-odds ratio.
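(A hedged sketch: exponentiating the printed male.* coefficients gives the conditional odds ratio in each department; because this model is saturated, these agree exactly with the observed within-department ratios computed earlier.)

> # conditional SA odds ratios: about 0.35, 0.80, 1.13, 0.93, 1.22, 0.83
> exp(c(A = -1.0521, B = -0.2200, C = 0.1249,
+       D = -0.0712, E = 0.2002, F = -0.1889))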
Looking at the output, we see that the only conditional SA log-odds ratio that is significant is the one in Department A. Let's remove all the non-significant ones and re-fit:
> x <- cbind(DeptA=DeptA, DeptB=DeptB, DeptC=DeptC,
+ DeptD=DeptD, DeptE=DeptE, DeptF=DeptF,
+ male.A=male*DeptA)
> result <- nr.logit(x,y,n)
1...2...3...4...5...6...7...
> logit.print(result)
The Newton-Raphson algorithm converged in 7 iterations.
coef SE coef/SE pval
DeptA 1.5441974 0.25272027 6.11 0
DeptB 0.5428650 0.08575468 6.33 0
DeptC -0.6156891 0.06916243 -8.90 0
DeptD -0.6592456 0.07496274 -8.79 0
DeptE -1.0895006 0.09534700 -11.43 0
DeptF -2.6756468 0.15243404 -17.55 0
male.A -1.0520760 0.26270810 -4.00 0
Loglikelihood = -2585.64491485695
Pearsons X^2 = 2.61737866168455
Deviance G^2 = 2.60851213239251
df = 5
50% of cells have expected counts below 5.0
The fit of this model is excellent: P(χ^2_5 > 2.61) = 0.76. And the interpretation is clear. This model says that Sex has no effect except in Department A, where males are substantially less likely to be admitted than females. The odds ratio exp(−1.05) = 0.35 means that males are only about 35% as likely (on the odds scale) to be admitted as females are. An approximate 95% confidence interval for this effect goes from

  exp(−1.05 − 2 × 0.26) = 0.21

to

  exp(−1.05 + 2 × 0.26) = 0.58.
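(A hedged sketch of that interval in R, using the printed estimate and standard error:)

> est <- -1.0520760; se <- 0.26270810
> exp(est + c(-2, 2) * se)   # about 0.21 to 0.58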
Generalized linear models. First, let's clear up some potential misunderstandings about terminology. The term "general linear model" usually refers to conventional linear regression models for a continuous response variable given continuous and/or categorical predictors. It includes multiple linear regression, as well as ANOVA and ANCOVA (with fixed effects only). The form is

  y_i ~ N(x_i^T β, σ^2),

where x_i contains known covariates and β contains the coefficients to be estimated. These models are fit by least squares and weighted least squares using

- SAS Proc GLM
- R functions lsfit() (older, uses matrices) and lm() (newer, uses data frames)
and many other programs.
The term "generalized linear model" refers to a larger class of models popularized by McCullagh and Nelder (1983; second edition 1989). In these models, the response variable y_i is assumed to follow an exponential family distribution with mean μ_i, which is assumed to be some (often nonlinear) function of x_i^T β. Some would call these nonlinear models, because μ_i is often a nonlinear function of the covariates, but McCullagh and Nelder consider them to be linear, because the covariates affect the distribution of y_i only through the linear combination x_i^T β.

The first widely used software package for fitting these models was called GLIM. Because of this program, "GLIM" became a well-accepted abbreviation for generalized linear models, as opposed to "GLM," which often is used for general linear models. Today, GLIMs are fit by many packages, including

- SAS Proc Genmod
- R function glm()

Besides the normal distribution, possible models for the response variable in GLIMs include:
- binomial (common)
- Poisson (common)
- Gamma (less common)
- Inverse Gaussian (rare)
Because this is a course in categorical data, we will
emphasize the binomial and the Poisson versions.
Elements of the GLIM. For now, we will dispense with the subscript i and let y represent the response for a single unit; let x (p × 1) represent the vector of covariates for that unit; and let μ = E(y). Define the linear predictor as

  η = x^T β,

where β (p × 1) is to be estimated. We assume that

  g(μ) = η

for some monotonic function g, which is called the link function.

In most (but not all) cases, we will choose a link function that maps the natural parameter space for μ onto the whole real line. Here are the most common examples of GLIMs.
Normal linear regression. Assume that

  y ~ N(μ, σ^2),   μ = x^T β.

In this model, η = μ is the identity link.

Logistic regression. The response is taken to be the observed proportion of successes, y ~ n^{-1} Bin(n, π), where

  log( π / (1 − π) ) = x^T β.

Notice that μ = π, the support of y is {0, 1/n, ..., 1}, and the link function is η = logit(μ).

Loglinear model. Assume that

  y ~ Poisson(λ),   log λ = x^T β.

The link is η = log μ.
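(A hedged sketch of how these three links are specified in R's glm(); the formulas and data names here are placeholders, not from these notes:)

> glm(y ~ x, family = gaussian)                # identity link, normal response
> glm(cbind(y, n - y) ~ x, family = binomial)  # logit link, binomial response
> glm(counts ~ x, family = poisson)            # log link, Poisson response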
A cautionary note. The normal linear regression model has an additive error, so we may write

  y = x^T β + ε,

where ε is a random residual. But other GLIMs do not have this structure. In a logit model, for example, we cannot write

  log( π / (1 − π) ) = x^T β + ε.

For this model, the random error is contained in y ~ n^{-1} Bin(n, π), and g(μ) = η is a purely functional (deterministic) relationship.
The likelihood function. In a GLIM, the distribution for y is assumed to have the form

  f(y; θ, φ) = exp{ [yθ − b(θ)] / a(φ) + c(y, φ) },

where θ is called the canonical parameter and φ is called the dispersion parameter. The mean E(y) = μ is a function of θ alone, so θ is the parameter of interest; φ is usually regarded as a nuisance.

For the most part, we will not treat φ as a parameter in the same sense as θ. We will carry out estimation and inference under an assumed value for φ. If φ needs to be estimated, we will find a way to estimate it and then treat the estimate as if it were fixed and known to be true.

For any fixed value of φ, f(y; θ) is a one-parameter exponential family in θ. If φ is regarded as unknown, then it may or may not be a two-parameter exponential family in (θ, φ).
First two moments. What can we say about μ = E(y) and Var(y)? If we try to derive moments by brute force,

  μ = ∫ y f(y; θ) dy,

we run into trouble because the support for y varies from one model to another. However, if we rely on well-known properties of loglikelihood derivatives, we can get an answer very easily.

The loglikelihood is

  l(θ; y) = [yθ − b(θ)] / a(φ) + c(y, φ).

The first derivative or score is

  l'(θ; y) = [y − b'(θ)] / a(φ),

and the second derivative is

  l''(θ; y) = −b''(θ) / a(φ).

By the well-known property

  E[ l'(θ; y) ] = 0,

it immediately follows that μ = b'(θ). Then, from the well-known information equality

  E[ l'(θ; y)^2 ] = −E[ l''(θ; y) ],

we see that

  E[ (y − μ)^2 / a(φ)^2 ] = b''(θ) / a(φ),

so that Var(y) = a(φ) b''(θ). Because θ depends on μ but not φ, we may write the variance as

  Var(y) = a(φ) V(μ),

where V(μ) is called the variance function. This function captures the relationship, if any, between the mean and variance of y.

In most cases, a(φ) will have the form

  a(φ) = φ / w,

where φ is the dispersion parameter and w is a known weight. We have already been exposed to the dispersion parameter. The weight arises in certain models (e.g. the binomial) and may be inversely proportional to some measure of sample size.

Now let's look at the most important examples of generalized linear models and identify their parts.
Example: normal response. Under the normal model y ~ N(μ, σ^2), the log-density is

  log f = −(1 / (2σ^2)) (y − μ)^2 − (1/2) log(2πσ^2)
        = [yμ − μ^2/2] / σ^2 − y^2 / (2σ^2) − (1/2) log(2πσ^2).

Therefore, the canonical parameter is θ = μ, and the remaining elements are b(θ) = θ^2/2, φ = σ^2, a(φ) = φ, and

  c(y, φ) = −y^2 / (2φ) − (1/2) log(2πφ).

In a heteroscedastic model y ~ N(μ, σ^2/w), where w is a known weight, we would have φ = σ^2 and a(φ) = φ/w.
Example: binomial response. If y ~ n^{-1} Bin(n, π), then the log-probability mass function is

  log f = log n! − log(ny)! − log(n(1 − y))!
          + ny log π + n(1 − y) log(1 − π)
        = [ y log( π / (1 − π) ) + log(1 − π) ] / (1/n) + c,

where c doesn't involve π. Therefore, the canonical parameter is

  θ = log( π / (1 − π) ),

the b-function is

  b(θ) = −log(1 − π) = log( 1 + e^θ ),

the dispersion parameter is φ = 1, and a(φ) = φ/n.
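(A hedged numerical check of the moment formulas in the binomial case: with θ = log(π/(1 − π)), we should find b'(θ) = π and a(φ) b''(θ) = π(1 − π)/n.)

> p0 <- 0.3; n0 <- 10
> theta <- log(p0/(1 - p0))
> exp(theta)/(1 + exp(theta))          # b'(theta) = 0.3 = p0
> (exp(theta)/(1 + exp(theta))^2)/n0   # a(phi)*b''(theta) = 0.021 = p0*(1 - p0)/n0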
Example: Poisson model. Deriving the analogous expressions for y ~ Poisson(λ) will be left as an exercise.