Analysis of Three-Way Tables

Introducing three-way tables
Suppose that we have three categorical variables, A, B, and C, where

A takes possible values $1, 2, \ldots, I$,
B takes possible values $1, 2, \ldots, J$,
C takes possible values $1, 2, \ldots, K$.

If we collect the triplet (A, B, C) for each unit in a sample of n units, then the data can be summarized as a three-dimensional table. Let $n_{ijk}$ be the number of units having $A = i$, $B = j$, and $C = k$. Then the vector of cell counts $(n_{111}, n_{112}, \ldots, n_{IJK})^T$ can be arranged into a table whose dimensions are $I \times J \times K$. As before, we will use $+$ to indicate summation over a subscript; for example,

$$n_{i+k} \;=\; \sum_{j=1}^{J} n_{ijk}, \qquad n_{++k} \;=\; \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ijk}.$$
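Concretely, these marginal sums are just axis sums on a three-dimensional array. A minimal numpy sketch (the counts here are hypothetical):

```python
import numpy as np

# A hypothetical 2 x 3 x 2 table of counts n[i, j, k] (I = 2, J = 3, K = 2).
n = np.array([[[10, 4], [7, 9], [5, 5]],
              [[8, 2], [6, 6], [3, 11]]])

n_ipk = n.sum(axis=1)       # n_{i+k}: summed over j, shape (I, K)
n_ppk = n.sum(axis=(0, 1))  # n_{++k}: summed over i and j, shape (K,)
print(n_ipk)
print(n_ppk)
```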
To display this table, we must use a set of two-way tables. For example, the $A \times B \times C$ table could be displayed by showing a $B \times C$ table for each level of A.

If the n units in the sample are independently drawn from the same population, then the vector of cell counts $x$ has a multinomial distribution with index n and parameter $\pi = (\pi_{111}, \pi_{112}, \ldots, \pi_{IJK})^T$.

Under the unrestricted (saturated) multinomial model, there are no constraints on $\pi$ other than $\pi_{+++} = 1$, and the ML estimates are $\hat{\pi}_{ijk} = n_{ijk}/n$. The saturated model always fits the data perfectly; the estimated expected frequency $\hat{n}_{ijk}$ equals the observed frequency $n_{ijk}$ for every cell, yielding $X^2 = G^2 = 0$ with zero df. Fitting a saturated model might not reveal any special structure that may exist in the relationships among A, B, and C. To investigate these relationships, we will propose simpler models and perform tests to see whether these simpler models fit the data.
Mutual independence. The simplest model that one might propose is

$$P(A = i, B = j, C = k) \;=\; P(A = i)\, P(B = j)\, P(C = k)$$

for all i, j, k. Define

$$\alpha_i = P(A = i), \quad i = 1, 2, \ldots, I,$$
$$\beta_j = P(B = j), \quad j = 1, 2, \ldots, J,$$
$$\gamma_k = P(C = k), \quad k = 1, 2, \ldots, K,$$

so that $\pi_{ijk} = \alpha_i \beta_j \gamma_k$ for all i, j, k. The unknown parameters are

$$\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_I), \qquad \beta = (\beta_1, \beta_2, \ldots, \beta_J), \qquad \gamma = (\gamma_1, \gamma_2, \ldots, \gamma_K).$$

Because each of these vectors must add up to one, the number of free parameters in the model is $(I - 1) + (J - 1) + (K - 1)$.
Notice that under the independence model,

$$(n_{1++}, n_{2++}, \ldots, n_{I++}) \sim \mathrm{Mult}(n, \alpha),$$
$$(n_{+1+}, n_{+2+}, \ldots, n_{+J+}) \sim \mathrm{Mult}(n, \beta),$$
$$(n_{++1}, n_{++2}, \ldots, n_{++K}) \sim \mathrm{Mult}(n, \gamma),$$

and these three vectors are mutually independent. Thus the three parameter vectors $\alpha$, $\beta$, and $\gamma$ can be estimated independently of one another.

The ML estimates are given by

$$\hat{\alpha}_i = n_{i++}/n, \quad i = 1, 2, \ldots, I,$$
$$\hat{\beta}_j = n_{+j+}/n, \quad j = 1, 2, \ldots, J,$$
$$\hat{\gamma}_k = n_{++k}/n, \quad k = 1, 2, \ldots, K.$$

It follows that the estimates of the expected cell counts are

$$\hat{E}_{ijk} \;=\; n\, \hat{\alpha}_i \hat{\beta}_j \hat{\gamma}_k \;=\; \frac{n_{i++}\, n_{+j+}\, n_{++k}}{n^2}.$$
To test the null hypothesis of full independence against the alternative of the saturated model, we calculate the expected counts $\hat{E}_{ijk}$ and find $X^2$ or $G^2$ in the usual manner,

$$X^2 \;=\; \sum_i \sum_j \sum_k \frac{(n_{ijk} - \hat{E}_{ijk})^2}{\hat{E}_{ijk}}, \qquad G^2 \;=\; 2 \sum_i \sum_j \sum_k n_{ijk} \log\!\left(\frac{n_{ijk}}{\hat{E}_{ijk}}\right).$$
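In code, the entire computation amounts to a few array operations. A minimal sketch, again with a hypothetical table:

```python
import numpy as np

# Hypothetical I x J x K table of counts (I = 2, J = 3, K = 2).
n = np.array([[[10, 4], [7, 9], [5, 5]],
              [[8, 2], [6, 6], [3, 11]]], dtype=float)
N = n.sum()

# Expected counts under mutual independence:
# E_ijk = n_{i++} n_{+j+} n_{++k} / N^2
E = (n.sum(axis=(1, 2))[:, None, None]
     * n.sum(axis=(0, 2))[None, :, None]
     * n.sum(axis=(0, 1))[None, None, :]) / N**2

X2 = ((n - E) ** 2 / E).sum()
G2 = 2 * (n * np.log(n / E)).sum()  # all cells positive here
print(X2, G2)
```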
The degrees of freedom for this test are

$$\nu \;=\; (IJK - 1) \;-\; [\,(I - 1) + (J - 1) + (K - 1)\,].$$
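As a quick check, the $2 \times 2 \times 3$ scouting example at the end of this lecture gives

$$\nu = (2 \cdot 2 \cdot 3 - 1) - [\,(2 - 1) + (2 - 1) + (3 - 1)\,] = 11 - 4 = 7,$$

matching the 7 degrees of freedom reported there.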
Graphically, we can express the model of complete independence as:

[Graph: three nodes A, B, C, with no connecting edges.]

In this graph, the lack of connections between the nodes indicates that no relationships exist among A, B, and C. In the notation of loglinear models, this model is expressed as (A, B, C).
In terms of odds ratios, the model (A, B, C) implies that if we look at the marginal tables $A \times B$, $B \times C$, and $A \times C$, all of the odds ratios in these marginal tables are equal to 1.
Two variables independent of a third. The model

[Graph: nodes A and B joined by an edge; node C with no edges.]

indicates that A and B are jointly independent of C. The line linking A and B indicates that A and B are possibly related, but not necessarily so. Therefore, the model of complete independence is a special case of this one. In loglinear notation, this model is (AB, C).

If the model of complete independence (A, B, C) fits a data set, then the model (AB, C) will also fit, as will (AC, B) and (BC, A). In that case, we will prefer to use (A, B, C) because it is more parsimonious. Our goal is to find the simplest model that fits the data.
Under (AB, C), the probabilities satisfy

$$\pi_{ijk} \;=\; P(A = i, B = j)\, P(C = k) \;=\; \pi_{ij}\, \gamma_k,$$

where $\sum_i \sum_j \pi_{ij} = 1$ and $\sum_k \gamma_k = 1$. The number of free parameters is $(IJ - 1) + (K - 1)$, and their ML estimates are $\hat{\pi}_{ij} = n_{ij+}/n$ and $\hat{\gamma}_k = n_{++k}/n$. The estimated expected frequencies are

$$\hat{E}_{ijk} \;=\; \frac{n_{ij+}\, n_{++k}}{n}.$$

Notice the similarity between this formula and the one for the model of independence in a two-way table, $\hat{E}_{ij} = n_{i+} n_{+j}/n$. If we view A and B as a single categorical variable with IJ levels, then the goodness-of-fit test for (AB, C) is equivalent to the test of independence between the combined variable AB and C.
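That equivalence makes this model easy to test with ordinary two-way machinery. A sketch using scipy (hypothetical counts; chi2_contingency performs the usual two-way independence test):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical I x J x K table of counts (I = 2, J = 3, K = 2).
n = np.array([[[10, 4], [7, 9], [5, 5]],
              [[8, 2], [6, 6], [3, 11]]])
I, J, K = n.shape

# Treat (A, B) as one variable with I*J levels: testing (AB, C) is a
# two-way independence test on the IJ x K table.
ab_by_c = n.reshape(I * J, K)
X2, p, df, expected = chi2_contingency(ab_by_c)
print(X2, df)   # df equals (I*J - 1) * (K - 1)
```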
Conditional independence. The model (AB, AC),

[Graph: edges joining A to B and A to C; no edge between B and C.]

indicates that A and B may be related; A and C may be related; and B and C may be related, but only through their mutual associations with A. In other words, any relationship between B and C can be fully explained by A.

In terms of odds ratios, this model implies that if we look at the $B \times C$ tables at each level of $A = 1, \ldots, I$, the odds ratios in these tables are all equal to 1, so the estimated odds ratios should not differ significantly from 1.

Notice that the odds ratios in the marginal $B \times C$ table, collapsed or summed over A, are not necessarily 1. The conditional B-C odds ratios at the levels of $A = 1, \ldots, I$ can be quite different from the marginal odds ratios. In extreme cases, the marginal relationship between B and C can be in the opposite direction from their conditional relationship given A; this is known as Simpson's paradox.
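A small constructed example (all counts hypothetical) shows how sharply conditional and marginal odds ratios can disagree:

```python
import numpy as np

def odds_ratio(t):
    """Odds ratio of a 2 x 2 table [[a, b], [c, d]], namely ad / bc."""
    return (t[0, 0] * t[1, 1]) / (t[0, 1] * t[1, 0])

# Hypothetical B x C tables at two levels of A, each with odds ratio 1
# (B and C conditionally independent given A).
bc_given_a1 = np.array([[80, 20], [8, 2]])
bc_given_a2 = np.array([[2, 8], [20, 80]])
print(odds_ratio(bc_given_a1))   # 1.0
print(odds_ratio(bc_given_a2))   # 1.0

# The marginal B x C table, collapsed over A, shows strong association.
marginal = bc_given_a1 + bc_given_a2
print(odds_ratio(marginal))      # (82*82)/(28*28), about 8.6
```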
Under the conditional independence model, the probabilities can be written as

$$\pi_{ijk} \;=\; P(A = i)\, P(B = j, C = k \mid A = i) \;=\; P(A = i)\, P(B = j \mid A = i)\, P(C = k \mid A = i) \;=\; \alpha_i\, \beta_{j(i)}\, \gamma_{k(i)},$$

where $\sum_i \alpha_i = 1$, $\sum_j \beta_{j(i)} = 1$ for each i, and $\sum_k \gamma_{k(i)} = 1$ for each i. The number of free parameters is

$$(I - 1) + I(J - 1) + I(K - 1).$$

The ML estimates of these parameters are

$$\hat{\alpha}_i = n_{i++}/n, \qquad \hat{\beta}_{j(i)} = n_{ij+}/n_{i++}, \qquad \hat{\gamma}_{k(i)} = n_{i+k}/n_{i++}.$$

The estimated expected frequencies are

$$\hat{E}_{ijk} \;=\; \frac{n_{ij+}\, n_{i+k}}{n_{i++}}.$$
Notice again the similarity to the formula for independence in a two-way table. The test for conditional independence of B and C given A is equivalent to separating the table by levels of $A = 1, \ldots, I$ and testing for independence within each level. The overall $X^2$ or $G^2$ statistic is found by summing the individual test statistics for $B \times C$ independence across the levels of A. The total degrees of freedom for this test must be $I(J - 1)(K - 1)$.
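Both routes to the statistic, direct computation from $\hat{E}_{ijk}$ and summing the per-level two-way tests, are easy to check in code (hypothetical counts again):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical I x J x K table (I = 2, J = 3, K = 2).
n = np.array([[[10, 4], [7, 9], [5, 5]],
              [[8, 2], [6, 6], [3, 11]]])
I, J, K = n.shape

# Expected counts under (AB, AC): E_ijk = n_{ij+} n_{i+k} / n_{i++}.
nij = n.sum(axis=2)            # n_{ij+}, shape (I, J)
nik = n.sum(axis=1)            # n_{i+k}, shape (I, K)
ni = n.sum(axis=(1, 2))        # n_{i++}, shape (I,)
E = nij[:, :, None] * nik[:, None, :] / ni[:, None, None]
X2_direct = ((n - E) ** 2 / E).sum()

# Equivalent view: sum the B x C independence statistics over levels of A.
X2_total = sum(chi2_contingency(n[i])[0] for i in range(I))
print(X2_direct, X2_total)     # should agree; df = I*(J-1)*(K-1)
```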
Homogeneous association. If we take the previous model (AB, AC) and add a direct link between B and C, we obtain the saturated model:

[Graph: nodes A, B, C with edges joining every pair.]

In loglinear notation, the saturated model is (ABC). This model allows the BC odds ratios at each level of $A = 1, \ldots, I$ to be arbitrary.

There is a model that is intermediate in complexity between (AB, AC) and (ABC). Recall that (AB, AC) requires the BC odds ratios at each level of $A = 1, \ldots, I$ to be equal to one. Suppose that we require the BC odds ratios at each level of A to be identical, but not necessarily one. This model is called homogeneous association. There is no graphical notation for this model, but the loglinear notation is (AB, BC, AC).

The model of homogeneous association says that the conditional relationship between any pair of variables given the third one is the same at each level of the third one. That is, there are no interactions. (An interaction means that the relationship between two variables changes across the levels of a third.) This is similar in spirit to the multivariate normal distribution for continuous variables, which says that the conditional correlation between any two variables given a third is the same for all values of the third.

Under the model of homogeneous association, there are no closed-form estimates for the cell probabilities. ML estimates must be computed by an iterative procedure. The most popular methods are

- iterative proportional fitting (IPF), and
- Newton-Raphson (NR).

We will describe these methods in future lectures.
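As a preview, here is a minimal sketch of IPF for (AB, BC, AC): starting from a constant table, it repeatedly rescales the fitted counts to match each observed two-way margin in turn. The starting table, update order, and stopping rule below are simple illustrative choices, not the only ones.

```python
import numpy as np

def ipf_homogeneous(n, tol=1e-8, max_iter=1000):
    """Sketch of IPF for the homogeneous-association model (AB, BC, AC)."""
    E = np.ones_like(n, dtype=float)  # start from a constant table
    for _ in range(max_iter):
        E_old = E.copy()
        # match the AB margin: scale so E_{ij+} = n_{ij+}
        E *= (n.sum(axis=2) / E.sum(axis=2))[:, :, None]
        # match the AC margin: scale so E_{i+k} = n_{i+k}
        E *= (n.sum(axis=1) / E.sum(axis=1))[:, None, :]
        # match the BC margin: scale so E_{+jk} = n_{+jk}
        E *= (n.sum(axis=0) / E.sum(axis=0))[None, :, :]
        if np.abs(E - E_old).max() < tol:
            break
    return E

# Hypothetical table; the fitted values match all three two-way margins.
n = np.array([[[10, 4], [7, 9], [5, 5]],
              [[8, 2], [6, 6], [3, 11]]], dtype=float)
E = ipf_homogeneous(n)
print(np.allclose(E.sum(axis=2), n.sum(axis=2)))  # AB margin reproduced
```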
Modeling strategy. With three variables, there are nine possible models that we have discussed:

- complete independence: (A, B, C)
- two variables independent of a third: (AB, C), (AC, B), (BC, A)
- conditional independence: (AB, BC), (AC, BC), (AB, AC)
- homogeneous association: (AB, AC, BC)
- saturated: (ABC)

With real data, we may not want to fit all of these models but focus only on those that make sense. For example, suppose that C can be regarded as a response variable, and A and B are predictors. In regression, we do not model the relationships among predictors, but allow arbitrary associations among them. Therefore, the simplest model that we may wish to fit is the null model (AB, C), which says that neither predictor is related to the response.

If the null model doesn't fit, then we should try (AB, AC), which says that A is related to C, but B is not. This is equivalent to a logistic regression for C with a main effect for A but no effect for B. We may also try (AB, BC), which is equivalent to a logistic regression for C with a main effect for B but no effect for A.

If neither of those models fits, we may try the model of homogeneous association, (AB, BC, AC), which is equivalent to a logistic regression for C with main effects for A and for B but no interaction.

The saturated model (ABC) is equivalent to a logistic regression for C with a main effect for A, a main effect for B, and an AB interaction.
Example: The table below classifies n = 800 boys according to socioeconomic status (S), whether a boy scout (B), and juvenile delinquency (D):

    Socioeconomic   Boy      Delinquent
    status          scout    Yes     No
    Low             Yes       11     43
                    No        42    169
    Medium          Yes       14    104
                    No        20    132
    High            Yes        8    196
                    No         2     59
The marginal totals are

$$n_{1++} = 11 + 43 + 14 + 104 + 8 + 196 = 376, \qquad n_{2++} = 42 + 169 + 20 + 132 + 2 + 59 = 424$$

for B;

$$n_{+1+} = 11 + 42 + 14 + 20 + 8 + 2 = 97, \qquad n_{+2+} = 43 + 169 + 104 + 132 + 196 + 59 = 703$$

for D; and

$$n_{++1} = 11 + 43 + 42 + 169 = 265, \qquad n_{++2} = 14 + 104 + 20 + 132 = 270, \qquad n_{++3} = 8 + 196 + 2 + 59 = 265$$

for S.

The test for mutual independence yields $X^2 = 214.9$ and $G^2 = 218.7$ with 7 degrees of freedom; both p-values are essentially zero.
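These numbers can be checked in a few lines, following the earlier sketches. The index order below (B, D, S) matches the subscripts above:

```python
import numpy as np

# n[b, d, s]: b = scout (yes, no), d = delinquent (yes, no),
# s = status (low, medium, high)
n = np.zeros((2, 2, 3))
n[0, 0] = [11, 14, 8]      # scout yes, delinquent yes
n[0, 1] = [43, 104, 196]   # scout yes, delinquent no
n[1, 0] = [42, 20, 2]      # scout no,  delinquent yes
n[1, 1] = [169, 132, 59]   # scout no,  delinquent no
N = n.sum()                # 800

# Expected counts under mutual independence (B, D, S)
E = (n.sum(axis=(1, 2))[:, None, None]
     * n.sum(axis=(0, 2))[None, :, None]
     * n.sum(axis=(0, 1))[None, None, :]) / N**2
X2 = ((n - E) ** 2 / E).sum()
G2 = 2 * (n * np.log(n / E)).sum()
print(round(X2, 1), round(G2, 1))  # should print 214.9 218.7
```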
It makes sense to think of delinquency D as a response, and to think of socioeconomic status S and boy scouting B as potential predictors. The model (SB, D), which says that delinquency is independent of the two predictors, does not fit either ($X^2 = 32.96$, $G^2 = 36.42$, df = 5).
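As noted earlier, this is just the two-way independence test between the combined variable SB (six levels) and D, so it can be checked directly with scipy:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: the six (S, B) combinations; columns: D = yes, no.
sb_by_d = np.array([[11, 43], [42, 169], [14, 104],
                    [20, 132], [8, 196], [2, 59]])
X2, p, df, expected = chi2_contingency(sb_by_d)
G2, _, _, _ = chi2_contingency(sb_by_d, lambda_="log-likelihood")
print(round(X2, 2), df)   # should print 32.96 5
print(round(G2, 2))       # should print 36.42
```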
Fitting other models of interest will be left as a
homework exercise.
Next time: Begin logistic regression