STAT1371: Statistical Data Analysis

Topic 11: Categorical Data Analysis


Semester 2, 2020

Contents

I Testing Categorical Data

I The Goodness of fit test


I With and without unknown parameters

I χ2 distribution

I Two-way classification table/contingency table

I Tests for Independence

DEPARTMENT OF MATHEMATICS & STATISTICS 2


Testing Categorical Data

I In many applications, data are simply classified into distinct


categories.

I These categories need not have a natural numerical ordering.

I For example, in an experiment involving a dihybrid cross of flies, 148


progeny were classified by phenotype as follows:

AB Ab aB ab Total
87 31 25 5 148



I We want to evaluate whether data resemble a particular distribution.

I For instance, theory predicts a ratio of 9:3:3:1 for AB:Ab:aB:ab.

I In terms of probability, the ratio is $\frac{9}{16} : \frac{3}{16} : \frac{3}{16} : \frac{1}{16}$.

I If we use groups 1 to 4 to represent the 4 phenotypes AB, Ab, aB, ab respectively, then the probability model can be represented as
$$p_1 = \frac{9}{16}, \quad p_2 = \frac{3}{16}, \quad p_3 = \frac{3}{16}, \quad p_4 = \frac{1}{16},$$
where $p_i$ is the probability that an observation falls into group $i$, such that
$$\sum_{i=1}^{g} p_i = 1.$$

I Do the data support the theory?



Motivational setting
I We learned how to deal with this when there are only two categories
back in Topic 9.

I Suppose we have n independent trials with X successes and n − X failures, i.e. X ∼ B(n, p).

I We wish to test H0 : p = p0 against H1 : p ≠ p0. If H0 is true then

          Success      Failure
Observe   O1 = X       O2 = n − X
Expect    E1 = np0     E2 = n(1 − p0)

I Recall from Topic 9 that large values of |X − np0| support H1.

I Assuming both np0 > 5 and n(1 − p0) > 5, so that the normal approximation is appropriate, large values of
$$\tau = \frac{(X - np_0)^2}{np_0(1 - p_0)}$$
support H1.
I Notice that
$$(O_2 - E_2)^2 = [(n - X) - (n - np_0)]^2 = (X - np_0)^2 = (O_1 - E_1)^2.$$

I Also
$$\frac{1}{np_0} + \frac{1}{n(1 - p_0)} = \frac{1}{np_0(1 - p_0)}.$$

I Thus
$$\tau = \frac{(X - np_0)^2}{np_0(1 - p_0)} = (X - np_0)^2 \left[\frac{1}{np_0} + \frac{1}{n(1 - p_0)}\right] = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2}.$$

I This is a special case of Pearson’s χ2 statistic.
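
The identity above can be checked numerically. The following is a minimal R sketch; the values of n, p0 and x are hypothetical and chosen only for illustration, they do not come from the notes.

```r
# Hypothetical numbers, for illustration only (not from the notes).
n  <- 50    # number of trials
p0 <- 0.3   # success probability under H0
x  <- 21    # observed number of successes

# Form 1: squared distance of X from its mean under H0, scaled by its variance.
tau_binomial <- (x - n * p0)^2 / (n * p0 * (1 - p0))

# Form 2: Pearson's sum over the two categories (success, failure).
observed <- c(x, n - x)
expected <- c(n * p0, n * (1 - p0))
tau_pearson <- sum((observed - expected)^2 / expected)

# The two forms should agree up to floating-point error.
print(c(binomial = tau_binomial, pearson = tau_pearson))
```

Both entries of the printed vector should be identical, whatever admissible values of n, p0 and x are used.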



Pearson’s χ2 Goodness of Fit test

I Assume we have g categories, not just success/failure and H0


specifies a model giving expected frequencies for each category.

I For instance, we could be testing:
$$H_0 : p_1 = \frac{9}{16},\ p_2 = \frac{3}{16},\ p_3 = \frac{3}{16},\ p_4 = \frac{1}{16} \quad \text{against} \quad H_1 : \text{not } H_0.$$
I The alternative hypothesis is that the true proportions are not all as
specified.

I It is not necessary that all proportions differ from those specified; H0 is false as long as at least one of them is not as specified.

I We can test the claim by comparing the observed frequency with the
expected frequency under H0 .



I Pearson’s χ2 test statistic (without continuity correction) is
$$\tau = \sum_{i=1}^{g} \frac{(O_i - E_i)^2}{E_i},$$
where
I $O_i$ is the observed frequency in the ith category;

I $E_i = np_i$ is the expected frequency if H0 is true;

I $n = \sum O_i = \sum E_i$ is the total number of observations.



I If the data support the proposed model, then we expect (Oi , Ei ) to
be close together and (Oi − Ei )2 wouldn’t be too large.

I If the data do not support the proposed model, then we expect at


least one (Oi , Ei ) would be far apart and (Oi − Ei )2 would be large.

I We reject the model if the χ2 test-statistic is too large.

I The sampling distribution of the statistic is (asymptotically) a chi-squared distribution with g − 1 degrees of freedom, so
$$\text{P-value} = P(\chi^2_{g-1} \geq \tau_{\text{obs}}).$$

I Note: chi is pronounced as KI.

I The χ2 test should only be used when the expected frequencies Ei are all greater than 5.
I Recall this is consistent with the requirement np > 5 for the normal approximation to the binomial!



Continuity correction
I As we are observing count/frequency, categorical data are obviously
discrete.

I However, the test involves comparing the test statistic (discrete) to a


continuous distribution.

I Therefore we should use the continuity correction, and the test statistic of the χ2 Goodness of Fit test becomes:
$$\tau = \sum_{i=1}^{g} \frac{\left(|O_i - E_i| - \frac{1}{2}\right)^2}{E_i}.$$

I This is also known as the Yates’ continuity correction.

I Note that the continuity correction always makes the test statistic smaller, so the test becomes more conservative.
I In the rare cases where the fit is exceptionally good, i.e. $|O_i - E_i| \leq \frac{1}{2}$ for some category, the correction is reduced so that it is never bigger than the difference itself.
The χ2 distribution
I A χ2 random variable can only take non-negative values.

I The distribution is not symmetric but is right-skewed:


[Figure: chi-squared density f(x) against x (0 to 25) for df = 1, 2, 4 and 9.]

I Fun facts:
I $\chi^2_1 = Z^2$, where $Z \sim N(0, 1)$;
I If $X \sim \chi^2_\nu$ then $E(X) = \nu$ and $\mathrm{Var}(X) = 2\nu$.
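
Both facts can be checked numerically in base R. This is only a sketch: the cut-off z = 1.96 and the choice ν = 4 are arbitrary illustrations, and the moments are obtained by numerical integration of the density rather than by any method from the notes.

```r
# Fact 1: chi-squared(1) is Z^2, so P(chi2_1 <= z^2) = P(-z <= Z <= z).
z <- 1.96
p_chisq  <- pchisq(z^2, df = 1)
p_normal <- pnorm(z) - pnorm(-z)
print(c(chisq = p_chisq, normal = p_normal))

# Fact 2: E(X) = nu and Var(X) = 2 * nu, via numerical integration.
nu <- 4  # arbitrary degrees of freedom for the illustration
first_moment  <- integrate(function(x) x   * dchisq(x, df = nu), 0, Inf)$value
second_moment <- integrate(function(x) x^2 * dchisq(x, df = nu), 0, Inf)$value
variance <- second_moment - first_moment^2
print(c(mean = first_moment, variance = variance))
```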
The χ2 Table
I Our principal interest in the χ2 distribution is the calculation of
P-values of the Goodness of Fit test.

I The χ2 Tables typically give
$$P(\chi^2_\nu \geq x) = p$$
(that is, p = 1-pchisq(x, nu) in R), where ν is the degrees of freedom (row), p is the upper-tail probability (column, top header) and x is given in the body of the table.

I You can find the chisq-table.pdf under Table section on iLearn.

I In R, the following functions are helpful:


I pdf: dchisq(x, df = nu);

I cdf: pchisq(q, df = nu);

I quantile or critical values: qchisq(p, df = nu);

I random numbers: rchisq(n, df = nu).


Examples of using χ2 Table

i) $P(\chi^2_1 > 3.841) = 0.05$. (Note $1.96^2 \approx 3.841$ and $P(|Z|^2 > 1.96^2) = 0.05$.)

ii) From the table, we get
$$0.025 < P(\chi^2_{10} > 20) < 0.05.$$
This can be confirmed in R:

1-pchisq(20, 10)

# [1] 0.02925269

iii) $P(13.85 < \chi^2_{24} < 39.36) = 0.925$
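
The three table look-ups above can be reproduced with the functions listed earlier. A small sketch follows; it uses the `lower.tail = FALSE` argument of `pchisq()` and `qchisq()` as an alternative to writing `1 - pchisq(...)`.

```r
# i) Critical value leaving probability 0.05 in the upper tail, 1 df.
crit_i <- qchisq(0.05, df = 1, lower.tail = FALSE)

# ii) Upper-tail probability beyond 20 with 10 df.
p_ii <- pchisq(20, df = 10, lower.tail = FALSE)

# iii) Probability between the two tabulated values with 24 df.
p_iii <- pchisq(39.36, df = 24) - pchisq(13.85, df = 24)

print(c(crit_i = crit_i, p_ii = p_ii, p_iii = p_iii))
```

Because the tabulated values 13.85 and 39.36 are rounded, the probability in iii) agrees with 0.925 only approximately.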



Back to the phenotypes example
I It was an experiment involving a dihybrid cross of flies, 148 progeny
were classified by phenotype as follows:

group 1 2 3 4 Total
phenotype AB Ab aB ab
Oi 87 31 25 5 148

I We are testing
$$H_0 : p_1 = \frac{9}{16},\ p_2 = \frac{3}{16},\ p_3 = \frac{3}{16},\ p_4 = \frac{1}{16} \quad \text{against} \quad H_1 : \text{not } H_0.$$
I Under H0, the model specifies the following expected frequencies

group   1                    2                    3                    4                   Total
Ei      9/16 × 148 = 83.25   3/16 × 148 = 27.75   3/16 × 148 = 27.75   1/16 × 148 = 9.25   148
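I The test statistic of the test is
$$\tau = \sum_{i=1}^{4} \frac{\left(|O_i - E_i| - \frac{1}{2}\right)^2}{E_i} \sim \chi^2_{g-1} = \chi^2_3, \quad \text{under } H_0.$$

I The χ2 test is valid as all Ei > 5.

I The observed value of the test statistic is
$$\tau_{\text{obs}} = \frac{\left(|87 - 83.25| - \frac{1}{2}\right)^2}{83.25} + \frac{\left(|31 - 27.75| - \frac{1}{2}\right)^2}{27.75} + \frac{\left(|25 - 27.75| - \frac{1}{2}\right)^2}{27.75} + \frac{\left(|5 - 9.25| - \frac{1}{2}\right)^2}{9.25}$$
$$= 0.1269 + 0.2725 + 0.1824 + 1.5203 = 2.1021.$$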

I The P-value for testing the fit of the model is
$$\text{P-value} = P(\chi^2_3 \geq 2.1021) > 0.1.$$

I Since the P-value is large, we conclude that the data are consistent with H0, i.e. the observed ratio is not significantly different from 9:3:3:1.
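
The calculation above can be reproduced in R. The sketch below computes the corrected statistic by hand, since `chisq.test()` only applies a continuity correction to 2 × 2 tables; the variable names are illustrative and not part of the notes.

```r
# Observed phenotype counts and the 9:3:3:1 model under H0.
observed <- c(AB = 87, Ab = 31, aB = 25, ab = 5)
p_null   <- c(9, 3, 3, 1) / 16
expected <- sum(observed) * p_null          # 83.25 27.75 27.75 9.25

# Validity check used in the notes: every expected count exceeds 5.
all_valid <- all(expected > 5)

# Continuity-corrected contribution of each category, and their sum.
contrib <- (abs(observed - expected) - 0.5)^2 / expected
tau_obs <- sum(contrib)

# Upper-tail probability with g - 1 = 3 degrees of freedom.
p_value <- pchisq(tau_obs, df = length(observed) - 1, lower.tail = FALSE)

print(round(contrib, 4))
print(c(tau_obs = tau_obs, p_value = p_value))
```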
Car Accidents Example

I The numbers of fatal accidents on NSW roads in the months with 31 days in 1993 were:

Jan Mar May July Aug Oct Dec


44 56 37 42 59 59 63

I Test the claim that the accident rate is the same for all months.

I Let $p_i$ denote the probability that a fatal accident is “allocated” to month i.

I We are testing:
$$H_0 : p_i = \frac{1}{7},\ i = 1, 2, \ldots, 7, \quad \text{against} \quad H_1 : \text{not } H_0.$$

I The total number of accidents is 360. Thus $E_i = \frac{360}{7} = 51.43$.



I The test statistic of the test is
$$\tau = \sum_{i=1}^{7} \frac{\left(|O_i - E_i| - \frac{1}{2}\right)^2}{E_i} \sim \chi^2_{g-1} = \chi^2_6, \quad \text{under } H_0.$$

I We can present the information in a table

Month                   Jan     Mar     May     July    Aug     Oct     Dec     Total
Oi                      44      56      37      42      59      59      63      360
Ei                      51.43   51.43   51.43   51.43   51.43   51.43   51.43   360
(|Oi − Ei| − 1/2)²/Ei   0.93    0.32    3.77    1.55    0.97    0.97    2.38    10.89

I This means τobs = 10.89. Since the upper 10% and 5% points of $\chi^2_6$ are 10.645 and 12.592,
$$\text{P-value} = P(\chi^2_6 \geq 10.89) \in (0.05, 0.1).$$

I As the P-value exceeds 0.05, the data are consistent with H0 at the 5% level, i.e. there is not enough evidence to conclude that the accident rate is not constant across the months based on these data.
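
As a check on the accident example, here is a minimal R sketch. The helper `gof_yates()` is a hypothetical name introduced for this illustration (it is not an R or course function); it wraps the continuity-corrected statistic and its P-value, with `n_estimated` counting parameters estimated from the data.

```r
# Hypothetical helper: continuity-corrected goodness-of-fit test.
# observed: vector of counts; p: category probabilities under H0;
# n_estimated: number of parameters estimated from the data (k in the notes).
gof_yates <- function(observed, p, n_estimated = 0) {
  expected <- sum(observed) * p
  tau <- sum((abs(observed - expected) - 0.5)^2 / expected)
  df <- length(observed) - 1 - n_estimated
  list(expected = expected, statistic = tau, df = df,
       p_value = pchisq(tau, df = df, lower.tail = FALSE))
}

# Fatal accidents in the seven 31-day months of 1993.
accidents <- c(Jan = 44, Mar = 56, May = 37, July = 42,
               Aug = 59, Oct = 59, Dec = 63)
fit <- gof_yates(accidents, p = rep(1 / 7, 7))

print(fit$statistic)  # slightly above 10.89, which sums rounded entries
print(fit$p_value)
```

The unrounded statistic differs from the tabulated 10.89 only in the second decimal place, and the P-value still falls between 0.05 and 0.1.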



Further application of χ2 Goodness of Fit Test

I The tests so far assume H0 is fully specified, i.e. the category probabilities are determined by some outside consideration before the data are investigated.

I If we want to check the fit of a model that involves unknown


parameters, we first have to estimate the parameters with the data.

I Since we use the same data to estimate the parameters and test the
fit, we find the sampling distribution of the χ2 test statistic has to be
adjusted.

I The sampling distribution is still χ2 but the dfs are reduced to


g − k − 1, where
I g is the number of categories

I k is the smallest number of parameters that need to be estimated


using the data.



Example: More on phenotypes
I In a backcross experiment to investigate the genetic linkage between
two factors A and B in a species of flower, some researchers classified
400 offspring by phenotype as follows:

AB Ab aB ab
128 86 74 112

i) Under the no linkage model, the four phenotypes are equally likely.
Show that this model is a poor fit.
ii) If linkage is in the coupling phase, the probabilities of the four phenotypes are

AB            Ab      aB      ab
(1 − p)/2     p/2     p/2     (1 − p)/2

where p is the recombination fraction and is estimated by the overall proportion of Ab and aB. Show that this model fits the data well.
Example part i)

I Model says that all categories are equally likely.

I We are testing

H0 : pi = 1/4, i = 1, . . . , 4, against H1 : not H0 .

I The test statistic of the test is
$$\tau = \sum_{i=1}^{4} \frac{\left(|O_i - E_i| - \frac{1}{2}\right)^2}{E_i} \sim \chi^2_{g-1} = \chi^2_3, \quad \text{under } H_0.$$

I The test can be summarised in the following table

                        AB       Ab       aB       ab       Total
Oi                      128      86       74       112      400
Ei                      100      100      100      100      400
(|Oi − Ei| − 1/2)²/Ei   7.5625   1.8225   6.5025   1.3225   17.21



I As all Ei > 5, χ2 test is valid.

I From the table, we have τobs = 17.21 and
$$\text{P-value} = P(\chi^2_3 \geq 17.21) < 0.005.$$
I As the P-value is small, there is evidence to reject H0, i.e. the data are not consistent with the model.



Example part ii)

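I We are testing
$$H_0 : p_1 = \frac{1-p}{2},\ p_2 = \frac{p}{2},\ p_3 = \frac{p}{2},\ p_4 = \frac{1-p}{2} \quad \text{against} \quad H_1 : \text{not } H_0.$$
I Here we estimate p with $\hat{p} = \frac{86 + 74}{400} = 0.4$.
I The test statistic of the test is
$$\tau = \sum_{i=1}^{4} \frac{\left(|O_i - E_i| - \frac{1}{2}\right)^2}{E_i} \sim \chi^2_{g-1-1} = \chi^2_2, \quad \text{under } H_0.$$

I We lose another df as we estimated an extra parameter from the data.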



I The test can be summarised in the following table

                        AB        Ab         aB         ab        Total
Oi                      128       86         74         112       400
Ei                      120       80         80         120       400
(|Oi − Ei| − 1/2)²/Ei   0.46875   0.378125   0.378125   0.46875   1.69375

I As all Ei > 5, χ2 test is valid.

I From the table, we have τobs = 1.69375 and
$$\text{P-value} = P(\chi^2_2 \geq 1.69375) > 0.1.$$

I As the P-value is large, the data are consistent with H0, i.e. there is no significant difference between the proposed model and the data.
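
Parts i) and ii) can be compared side by side in R. This is a sketch under the same assumptions as the notes (continuity correction applied, one df removed for the estimated recombination fraction); `yates_stat()` is a hypothetical helper name used only here.

```r
observed <- c(AB = 128, Ab = 86, aB = 74, ab = 112)
n <- sum(observed)

# Hypothetical helper: continuity-corrected Pearson statistic.
yates_stat <- function(o, e) sum((abs(o - e) - 0.5)^2 / e)

# Part i): no linkage, all four phenotypes equally likely, df = g - 1 = 3.
expected_i <- n * rep(1 / 4, 4)
tau_i <- yates_stat(observed, expected_i)
p_value_i <- pchisq(tau_i, df = 3, lower.tail = FALSE)

# Part ii): coupling model, with p estimated by the proportion of Ab and aB.
p_hat <- unname(observed["Ab"] + observed["aB"]) / n
expected_ii <- n * c(1 - p_hat, p_hat, p_hat, 1 - p_hat) / 2
tau_ii <- yates_stat(observed, expected_ii)
# One parameter estimated from the data, so df = g - 1 - 1 = 2.
p_value_ii <- pchisq(tau_ii, df = 2, lower.tail = FALSE)

print(c(tau_i = tau_i, p_value_i = p_value_i))
print(c(tau_ii = tau_ii, p_value_ii = p_value_ii))
```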



Example: Infections

I 200 groups of 5 insects each were inspected.

I For each group the number of infected insects (x ) was counted:


3, 2, 5, 1, 0, . . . , 2.

I The data were condensed into the table below, writing xi for the
number of infected and fi for the corresponding frequency:

xi 0 1 2 3 4 5 Total
fi 20 62 55 38 20 5 200

I Does the binomial model fit the data?



I We are testing:
H0 : X ∼ B(5, p) against H1 : not H0 .

I We need to estimate p.

I There were 5 × 200 = 1000 insects in total and 391 of these were infected, i.e. an estimate for p would be
$$\hat{p} = \frac{391}{1000} = 0.391.$$
I The test can be summarised into the following table:

i 0 1 2 3 4 5 Total
pi 0.0838 0.2689 0.3453 0.2217 0.0712 0.0091 1
Oi 20 62 55 38 20 5 200
Ei = npi 16.76 53.78 69.06 44.34 14.24 1.82 200

where
$$p_i = \binom{5}{i} (0.391)^i (1 - 0.391)^{5-i}, \quad i = 0, \ldots, 5.$$
- Wait a minute! One of the cells has an expected value < 5! The χ2 test is not valid!
I If any Ei value falls below 5, we can
I collect a larger sample, i.e. increase n;

I "pool" classes (combine counts) in a sensible way.

I After pooling, we carry out the χ2 test with one fewer category (and hence one fewer df).

I Here we will combine the last two cells together to get a single category for ≥ 4 and the table becomes:

i                       0        1        2        3        ≥4       Total
pi                      0.0838   0.2689   0.3453   0.2217   0.0803   1
Oi                      20       62       55       38       25       200
Ei = npi                16.76    53.78    69.06    44.34    16.06    200
(|Oi − Ei| − 1/2)²/Ei   0.4479   1.1082   2.6625   0.7692   4.4355   9.4233



I The test statistic of the test is
$$\tau = \sum_{i=1}^{5} \frac{\left(|O_i - E_i| - \frac{1}{2}\right)^2}{E_i} \sim \chi^2_{g-1-1} = \chi^2_3, \quad \text{under } H_0,$$
as g = 5 now.

I Hence τobs = 9.4233 and
$$\text{P-value} = P(\chi^2_3 \geq 9.4233) < 0.025.$$

I As the P-value is small, we have evidence against H0, i.e. there is a significant difference between the proposed binomial model and the data.

I A similar procedure can be used to test the fit of other discrete distributions such as the Poisson and negative binomial.
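
The binomial fit for the infection data, including the pooling step, can be sketched in R as follows. Variable names are illustrative. Because the code keeps full precision in the fitted probabilities, its statistic can differ from the tabulated 9.4233 in the third decimal place.

```r
x_values <- 0:5                         # number infected per group
freq     <- c(20, 62, 55, 38, 20, 5)    # observed frequencies
m        <- 5                           # insects per group
n_groups <- sum(freq)                   # 200 groups

# Estimate p by the overall proportion of infected insects (391 / 1000).
p_hat <- sum(x_values * freq) / (m * n_groups)

# Expected frequencies under B(5, p_hat); the last cell falls below 5.
expected <- n_groups * dbinom(x_values, size = m, prob = p_hat)

# Pool the last two categories (4 and 5) into a single ">= 4" category.
obs_pooled <- c(freq[1:4], sum(freq[5:6]))
exp_pooled <- c(expected[1:4], sum(expected[5:6]))

# Continuity-corrected statistic; df = g - 1 - 1 with g = 5 pooled categories.
tau_obs <- sum((abs(obs_pooled - exp_pooled) - 0.5)^2 / exp_pooled)
df <- length(obs_pooled) - 1 - 1
p_value <- pchisq(tau_obs, df = df, lower.tail = FALSE)

print(round(exp_pooled, 2))
print(c(tau_obs = tau_obs, df = df, p_value = p_value))
```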



Testing the fit of a normal model

I Given a dataset $x_1, x_2, \ldots, x_n$ we want to test if the data come from a $N(\mu, \sigma^2)$ population.

1) We first calculate the sample mean, $\bar{x}$, and the sample variance, $s^2$.

2) Form a grouped frequency table and summarise the data with (ideally) 5 to 10 categories.
I Aim to have at least 5 values in each category.

3) To check against a normal population, work out the expected frequencies for each category by fitting $N(\bar{x}, s^2)$.

4) Calculate the χ2 test statistic as usual.

5) To calculate the P-value, use g − 2 − 1 df.



Example: Rainfall

I We have n = 30 observations corresponding to Sydney’s annual


rainfall (in inches) from 1941-1970 (from the 1972 Australian Year
Book):

26.74 48.29 50.74 31.04 46.47 36.05


41.45 38.83 66.26 86.63 53.15 59.19
40.86 41.29 72.46 67.33 27.13 59.19
59.67 51.01 57.08 44.90 80.11 43.30
36.01 48.40 52.78 24.56 56.94 43.42
I Test if the rainfall follows a normal distribution.

I We are testing:
H0 : X ∼ N(µ, σ 2 ), against H1 : not H0 .

I We estimate $\mu$ and $\sigma^2$ with $\bar{x} = 49.71$ and $s^2 = 229.15$, respectively.



b) Grouping the data into a frequency table:

Interval    x ≤ 40   40 < x ≤ 50   50 < x ≤ 60   x > 60   Total
Frequency   7        9             9             5        30

c) We now calculate the expected frequencies using
$$X \sim N(49.71,\ 229.15).$$
Then
$$p_1 = P(X \leq 40) = P\left(Z \leq \frac{40 - 49.71}{\sqrt{229.15}}\right) = P(Z \leq -0.64) = 0.2611.$$

I Thus E1 = 30 × 0.2611 = 7.833.



I To calculate p2:
$$p_2 = P(40 < X \leq 50) = P(-0.64 < Z \leq 0.019) = 0.2469, \qquad E_2 = 30 \times 0.2469 = 7.407.$$
Similarly,
$$E_3 = 30 \times 0.2437 = 7.311 \quad \text{and} \quad E_4 = 30 - 7.833 - 7.407 - 7.311 = 7.449.$$

d)

I The test statistic of the test is
$$\tau = \sum_{i=1}^{4} \frac{\left(|O_i - E_i| - \frac{1}{2}\right)^2}{E_i} \sim \chi^2_{g-2-1} = \chi^2_1, \quad \text{under } H_0.$$

I The χ2 test is valid as all Ei > 5.



I The observed value of the test statistic is
$$\tau_{\text{obs}} = \frac{\left(|7 - 7.833| - \frac{1}{2}\right)^2}{7.833} + \frac{\left(|9 - 7.407| - \frac{1}{2}\right)^2}{7.407} + \frac{\left(|9 - 7.311| - \frac{1}{2}\right)^2}{7.311} + \frac{\left(|5 - 7.449| - \frac{1}{2}\right)^2}{7.449}$$
$$= 0.0142 + 0.1613 + 0.1934 + 0.5099 = 0.8788.$$

I Here g = 4 and k = 2 so we have 1 d.f.

I The P-value is $P(\chi^2_1 \geq 0.8788) = 0.349 > 0.1$, with R.

I As the P-value is large, the data are consistent with H0, i.e. the data are consistent with the normal model.
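
The rainfall example can be reproduced in R from the raw data. The sketch below uses the same cut points (40, 50, 60) and the fitted N(x̄, s²) model; it evaluates the normal probabilities with `pnorm()` rather than rounded table values, so the statistic may differ slightly from 0.8788.

```r
# Sydney annual rainfall (inches), 1941-1970, as listed in the notes.
rain <- c(26.74, 48.29, 50.74, 31.04, 46.47, 36.05,
          41.45, 38.83, 66.26, 86.63, 53.15, 59.19,
          40.86, 41.29, 72.46, 67.33, 27.13, 59.19,
          59.67, 51.01, 57.08, 44.90, 80.11, 43.30,
          36.01, 48.40, 52.78, 24.56, 56.94, 43.42)

# Step 1: estimate the two parameters.
x_bar <- mean(rain)
s2    <- var(rain)

# Step 2: group into (-Inf, 40], (40, 50], (50, 60], (60, Inf).
breaks   <- c(-Inf, 40, 50, 60, Inf)
observed <- as.vector(table(cut(rain, breaks = breaks)))

# Step 3: expected frequencies under the fitted normal model.
probs    <- diff(pnorm(breaks, mean = x_bar, sd = sqrt(s2)))
expected <- length(rain) * probs

# Steps 4 and 5: corrected statistic with g - 2 - 1 = 1 degree of freedom.
tau_obs <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_value <- pchisq(tau_obs, df = length(observed) - 2 - 1, lower.tail = FALSE)

print(observed)
print(round(expected, 3))
print(c(tau_obs = tau_obs, p_value = p_value))
```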



I This procedure can be modified to test the goodness of fit of other continuous distributions such as the exponential and gamma.

I The procedure is not unique as the number of categories is not fixed,


and there are also many ways to define the boundary of these
categories.

I The procedure is computationally intensive, but it is good to recycle


an existing procedure to test new things.



Tests for independence
I If we have data classified according to two attributes, then we can
construct a contingency table or a two-way classification table which
is a convenient way of presenting the group frequencies.

I For example, we have data on 422 drivers and motorcyclists killed in


NSW in 1988. We classify the people by blood alcohol level and sex.

Alc (g/100ml) 0 (0, 0.08) [0.08, 0.15) ≥ 0.15 Total


Male 206 37 35 76 354
Female 53 5 4 6 68
Total 259 42 39 82 422

I Test the claim that sex affects blood alcohol level, i.e. test whether the two categorising variables are dependent (versus independent).

I We would be testing:
H0 : the two variables are independent against H1 : not H0 .
A probability model for contingency tables

I Recall from Topic 2 that independence means that the joint


probabilities equal the product of the marginal probabilities, that is
P(X = x , Y = y ) = P(X = x ) · P(Y = y )

I Let pij denote the probability of a victim being sex i and alcohol level
group j then the independence model says:
pij = pi· × p·j , where
I pi· is the prob. of being of sex i, i = 1, . . . , r with r = 2

I p·j is the prob. of being in alcohol group j, j = 1, . . . , c with c = 4.

I r and c represent the number of rows and columns in the contingency


table.



I We will use the following notation:
I $O_{ij}$, the observed number of victims of sex i in alcohol level group j;
I $O_{i\cdot} = \sum_{j=1}^{c} O_{ij}$, the observed number in row i, i.e. the row marginal total;
I $O_{\cdot j} = \sum_{i=1}^{r} O_{ij}$, the observed number in column j, i.e. the column marginal total.

I We estimate $p_{i\cdot}$ and $p_{\cdot j}$ by the marginal proportions, i.e.
$$\hat{p}_{i\cdot} = \frac{O_{i\cdot}}{n}, \qquad \hat{p}_{\cdot j} = \frac{O_{\cdot j}}{n}.$$
I If H0 is true, that is if row and column variables are independent, then the expected number $E_{ij}$ in cell (i, j) (that is, row i, column j, or in our context, sex i and alcohol level group j) can be estimated by
$$E_{ij} = n\,\hat{p}_{i\cdot}\,\hat{p}_{\cdot j} = n \times \frac{O_{i\cdot}}{n} \times \frac{O_{\cdot j}}{n} = \frac{O_{i\cdot} \times O_{\cdot j}}{n} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{table total}}.$$



I The expected frequencies for the accident data are

Sex/Alcohol Level 0 (0, 0.08) [0.08, 0.15) >=0.15


Male 217.265 35.232 32.716 68.787
Female 41.735 6.768 6.284 13.213

For instance,
$$E_{11} = \frac{259 \times 354}{422} = 217.265; \qquad E_{23} = \frac{39 \times 68}{422} = 6.284.$$

I The test statistic for this test is
$$\tau = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{\left(|O_{ij} - E_{ij}| - \frac{1}{2}\right)^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)}, \quad \text{under } H_0.$$

I We lose 1 df for each factor because we have used the marginal totals in calculating the expected values (i.e. $\sum_i \hat{p}_{i\cdot} = 1 = \sum_j \hat{p}_{\cdot j}$).

I For the χ2 test to be valid, it still requires all Eij > 5.


I There are generally two ways to organise all these calculations:
I Separate tables for $O_{ij}$, $E_{ij}$ and $\left(|O_{ij} - E_{ij}| - \frac{1}{2}\right)^2 / E_{ij}$;

I Put all the information in a single table, with each cell containing $O_{ij}$, $(E_{ij})$ and $\left(|O_{ij} - E_{ij}| - \frac{1}{2}\right)^2 / E_{ij}$.

I We will do the former here and will provide an example of the latter in the SGTA.

I If we calculate $\left(|O_{ij} - E_{ij}| - \frac{1}{2}\right)^2 / E_{ij}$ for each cell, we would get

Sex/Alcohol Level   0        (0, 0.08)   [0.08, 0.15)   >=0.15
Male                0.5334   0.0456      0.0973         0.6551
Female              2.7767   0.2376      0.5065         3.4106
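I For instance,
$$\frac{\left(|O_{11} - E_{11}| - \frac{1}{2}\right)^2}{E_{11}} = \frac{\left(|206 - 217.265| - \frac{1}{2}\right)^2}{217.265} = 0.5334;$$
$$\frac{\left(|O_{23} - E_{23}| - \frac{1}{2}\right)^2}{E_{23}} = \frac{\left(|4 - 6.284| - \frac{1}{2}\right)^2}{6.284} = 0.5065.$$
As a result,
$$\tau_{\text{obs}} = \sum_{i} \sum_{j} \frac{\left(|O_{ij} - E_{ij}| - \frac{1}{2}\right)^2}{E_{ij}} = 8.2628.$$

I In this dataset we have r = 2 and c = 4 so the df for the χ2 test is (2 − 1)(4 − 1) = 3.

I Since the upper 5% and 2.5% points of $\chi^2_3$ are 7.815 and 9.348, this means that
$$\text{P-value} = P(\chi^2_3 \geq 8.2628) \in (0.025, 0.05).$$

I As the P-value is small, we have evidence against H0, i.e. there is evidence to suggest blood alcohol level and sex are related in accident victims.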
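
The independence test can be reproduced in R. The sketch below computes the corrected statistic by hand, because `chisq.test()` applies the continuity correction only to 2 × 2 tables; object names are illustrative. Expected counts are kept at full precision, so the statistic can differ from 8.2628 in the later decimals.

```r
# Observed counts: rows are sex, columns are blood alcohol level (g/100ml).
counts <- matrix(c(206, 37, 35, 76,
                    53,  5,  4,  6),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(sex = c("Male", "Female"),
                                 alcohol = c("0", "(0, 0.08)",
                                             "[0.08, 0.15)", ">= 0.15")))
n <- sum(counts)

# Expected count in cell (i, j): (row i total) * (column j total) / n.
expected <- outer(rowSums(counts), colSums(counts)) / n

# Continuity-corrected contribution of each cell, and the overall statistic.
contrib <- (abs(counts - expected) - 0.5)^2 / expected
tau_obs <- sum(contrib)

# Degrees of freedom (r - 1)(c - 1) and upper-tail P-value.
df <- (nrow(counts) - 1) * (ncol(counts) - 1)
p_value <- pchisq(tau_obs, df = df, lower.tail = FALSE)

print(round(expected, 3))
print(round(contrib, 4))
print(c(tau_obs = tau_obs, df = df, p_value = p_value))
```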
Can we visualise the dataset?

I We will need to create clustered bar charts with the geom_bar() function.

I See Topic 1 for more details.

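library(ggplot2)    # provides ggplot() and geom_bar()
library(patchwork)  # lets us place two plots side by side with `+`

# Deaths by sex and blood alcohol level, in the order of the contingency table
alc_levels <- c("0", "(0, 0.08)", "[0.08, 0.15)", ">= 0.15")
dat <- data.frame(death = c(206, 37, 35, 76, 53, 5, 4, 6),
                  sex = rep(c("Male", "Female"), c(4, 4)),
                  alcohol = factor(rep(alc_levels, 2), levels = alc_levels))

# Bars grouped by alcohol level, one bar per sex within each group
p1 <- ggplot(data = dat) +
  geom_bar(aes(x = alcohol, fill = sex, y = death),
           stat = "identity", position = "dodge")

# Bars grouped by sex, one bar per alcohol level within each group
p2 <- ggplot(data = dat) +
  geom_bar(aes(x = sex, fill = alcohol, y = death),
           stat = "identity", position = "dodge")

p1 + p2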



[Figure: two clustered bar charts of death counts (0 to 200). Left panel: x-axis alcohol level, bars coloured by sex (Female, Male). Right panel: x-axis sex, bars coloured by alcohol level (0; (0, 0.08); [0.08, 0.15); >= 0.15).]

I It is evident that the Male group has a much higher frequency in the ≥ 0.15 blood alcohol level group than the Female group.

I The second point is a bit more subtle: the Female group has a higher than expected frequency in the zero alcohol level group.