STAT1371 Topic11 PDF
Contents
- $\chi^2$ distribution
AB Ab aB ab Total
87 31 25 5 148
           Success          Failure
Observe    $O_1 = X$        $O_2 = n - X$
Expect     $E_1 = np_0$     $E_2 = n(1 - p_0)$
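- Also
  $$\frac{1}{np_0} + \frac{1}{n(1 - p_0)} = \frac{1}{np_0(1 - p_0)}.$$
- Thus, since $O_2 - E_2 = -(X - np_0)$,
  $$\tau = \frac{(X - np_0)^2}{np_0(1 - p_0)} = (X - np_0)^2 \left[ \frac{1}{np_0} + \frac{1}{n(1 - p_0)} \right] = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2}.$$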
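The identity above can be checked numerically. The following is a minimal sketch in R; the values of `n`, `p0` and `X` are made up purely for illustration and do not come from the notes.

```r
# Illustrative values only (assumed, not from the notes)
n  <- 50
p0 <- 0.3
X  <- 21

# Squared z-statistic for a single proportion
tau_z <- (X - n * p0)^2 / (n * p0 * (1 - p0))

# The same quantity written as a sum over the two categories
O <- c(X, n - X)
E <- c(n * p0, n * (1 - p0))
tau_gof <- sum((O - E)^2 / E)

# TRUE if the two forms agree up to floating-point error
all.equal(tau_z, tau_gof)
```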
$$H_0: p_1 = \frac{9}{16},\ p_2 = \frac{3}{16},\ p_3 = \frac{3}{16},\ p_4 = \frac{1}{16}, \quad \text{against } H_1: \text{not } H_0,$$
- The alternative hypothesis is that the true proportions are not all as specified.
- It is not necessary that all proportions differ; $H_0$ is false as soon as at least one proportion is not as specified.
- We can test the claim by comparing the observed frequencies with the expected frequencies under $H_0$.
where
- $O_i$ is the observed frequency in the $i$th category;
- $n = \sum O_i = \sum E_i$ is the total number of observations.
- The $\chi^2$ test should only be used when the expected frequencies $E_i$ are all greater than 5.
- Recall that this is consistent with the condition $np > 5$ for the normal approximation to the binomial!
- Note that the continuity correction always makes the test statistic smaller, so the test becomes more conservative.
- In the rare cases where the fit is exceptionally good, i.e. $|O_i - E_i| \le \frac{1}{2}$, the correction is reduced so that it is no bigger than the differences themselves.
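The two points about the continuity correction can be sketched as a small R function. This is an illustrative helper rather than code from the notes: the name `gof_stat` is hypothetical, the counts are made up, and capping the correction with `pmin` is one reading of "no bigger than the differences themselves".

```r
# Hypothetical helper: continuity-corrected goodness-of-fit statistic.
# O = observed counts, E = expected counts under H0.
gof_stat <- function(O, E) {
  d <- abs(O - E)
  corr <- pmin(0.5, d)      # subtract 1/2, but never more than |O - E|
  sum((d - corr)^2 / E)
}

# Made-up counts for illustration (third cell fits exactly)
O <- c(18, 22, 20)
E <- c(20, 20, 20)

uncorrected <- sum((O - E)^2 / E)
corrected   <- gof_stat(O, E)

c(uncorrected = uncorrected, corrected = corrected)
```

With these counts the corrected statistic is smaller than the uncorrected one, which is the "more conservative" behaviour described above.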
DEPARTMENT OF MATHEMATICS & STATISTICS 10
The $\chi^2$ distribution
- A $\chi^2$ random variable can only take non-negative values.
[Figure: density $f(x)$ of the $\chi^2$ distribution against $x$ for df = 1, 2, 4, 9]
- Fun facts:
  - $\chi^2_1 = Z^2$, where $Z \sim N(0, 1)$;
  - If $X \sim \chi^2_\nu$ then $E(X) = \nu$ and $\mathrm{Var}(X) = 2\nu$.
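Both facts can be checked numerically in R. The sketch below is only an illustration: the choice `nu <- 4` and the evaluation points in `q` are arbitrary, and the moments are obtained by numerical integration, so they agree with the theory only up to integration error.

```r
nu <- 4   # illustrative degrees of freedom (assumed)

# Mean and variance by numerical integration of the chi-squared density
m1 <- integrate(function(x) x   * dchisq(x, df = nu), 0, Inf)$value
m2 <- integrate(function(x) x^2 * dchisq(x, df = nu), 0, Inf)$value
c(mean = m1, variance = m2 - m1^2)

# chi^2_1 = Z^2: P(chi^2_1 <= q) equals P(-sqrt(q) <= Z <= sqrt(q))
q <- c(0.5, 1, 3.841)
lhs <- pchisq(q, df = 1)
rhs <- pnorm(sqrt(q)) - pnorm(-sqrt(q))
cbind(q, lhs, rhs)
```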
The $\chi^2$ Table
- Our principal interest in the $\chi^2$ distribution is the calculation of P-values for the goodness-of-fit test.
i) $P(\chi^2_1 > 3.841) = 0.05$. (Note $1.96^2 \approx 3.841$, so $P(|Z|^2 > 1.96^2) = 0.05$.)
1-pchisq(20, 10)
# [1] 0.02925269
iii) $P(13.85 < \chi^2_{24} < 39.36) = 0.925$
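The table look-ups in i) and iii) can also be reproduced with `pchisq`. This is a minimal sketch; the variable names are illustrative.

```r
# i) upper-tail probability for 1 df; 3.841 is roughly 1.96^2
p_i <- pchisq(3.841, df = 1, lower.tail = FALSE)

# the same probability through the standard normal: P(|Z| > 1.96)
p_z <- 2 * pnorm(1.96, lower.tail = FALSE)

# iii) probability of an interval for 24 df
p_iii <- pchisq(39.36, df = 24) - pchisq(13.85, df = 24)

round(c(p_i = p_i, p_z = p_z, p_iii = p_iii), 3)
```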
group        1     2     3     4     Total
phenotype    AB    Ab    aB    ab
$O_i$        87    31    25    5     148
- We are testing
  $$H_0: p_1 = \frac{9}{16},\ p_2 = \frac{3}{16},\ p_3 = \frac{3}{16},\ p_4 = \frac{1}{16}, \quad \text{against } H_1: \text{not } H_0.$$
- Under $H_0$, the model specifies the following expected frequencies:

group    1                                  2                                  3                                  4                                 Total
$E_i$    $\frac{9}{16} \times 148 = 83.25$  $\frac{3}{16} \times 148 = 27.75$  $\frac{3}{16} \times 148 = 27.75$  $\frac{1}{16} \times 148 = 9.25$  148
- The test statistic is
  $$\tau = \sum_{i=1}^{4} \frac{\left( |O_i - E_i| - \frac{1}{2} \right)^2}{E_i} \sim \chi^2_{g-1} = \chi^2_3, \quad \text{under } H_0.$$
- Since the P-value is large, we conclude that the data are consistent with $H_0$, i.e. the observed ratio is not significantly different from 9:3:3:1.
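The calculation behind this conclusion can be reproduced in R. The sketch below uses the observed counts and the 9:3:3:1 model from the tables above; the variable names are illustrative, and the P-value comes directly from `pchisq` rather than from a table.

```r
# Observed phenotype counts (AB, Ab, aB, ab)
O <- c(AB = 87, Ab = 31, aB = 25, ab = 5)

# Expected counts under H0: 9:3:3:1
p0 <- c(9, 3, 3, 1) / 16
E  <- sum(O) * p0

# Continuity-corrected statistic and its P-value on g - 1 = 3 df
tau <- sum((abs(O - E) - 0.5)^2 / E)
df  <- length(O) - 1
p_value <- pchisq(tau, df = df, lower.tail = FALSE)

round(c(tau = tau, p_value = p_value), 4)
```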
Car Accidents Example
- Test the claim that the accident rate is the same for all months.
- We are testing:
  $$H_0: p_i = \frac{1}{7},\ i = 1, 2, \dots, 7, \quad \text{against } H_1: \text{not } H_0.$$
- The tests so far assume $H_0$ is fully specified, i.e. the category probabilities are determined by some outside consideration before the data are investigated.
- When we use the same data both to estimate the parameters and to test the fit, the sampling distribution of the $\chi^2$ test statistic has to be adjusted.
AB Ab aB ab
128 86 74 112
i) Under the no-linkage model, the four phenotypes are equally likely. Show that this model is a poor fit.
ii) If linkage is in the coupling phase, the probabilities of the four phenotypes are

AB                    Ab               aB               ab
$\frac{1}{2}(1 - p)$  $\frac{1}{2}p$   $\frac{1}{2}p$   $\frac{1}{2}(1 - p)$
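- We are testing
  $$H_0: p_1 = p_2 = p_3 = p_4 = \frac{1}{4}, \quad \text{against } H_1: \text{not } H_0.$$

                                                   AB       Ab       aB       ab       Total
$O_i$                                              128      86       74       112      400
$E_i$                                              100      100      100      100      400
$\frac{(|O_i - E_i| - \frac{1}{2})^2}{E_i}$        7.5625   1.8225   6.5025   1.3225   17.21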
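The last row of the table can be reproduced in R. This is a minimal sketch using the counts above; the variable names are illustrative, and the df of $g - 1 = 3$ assumes the fully specified no-linkage model.

```r
# Observed counts and equal expected counts under no linkage
O <- c(AB = 128, Ab = 86, aB = 74, ab = 112)
E <- rep(sum(O) / 4, 4)

# Per-category contributions with the continuity correction
contrib <- (abs(O - E) - 0.5)^2 / E
tau <- sum(contrib)

# P-value on g - 1 = 3 df
p_value <- pchisq(tau, df = length(O) - 1, lower.tail = FALSE)

contrib
tau
p_value
```

The very small P-value is what "poor fit" means here.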
- We are testing
  $$H_0: p_1 = \frac{1 - p}{2},\ p_2 = \frac{p}{2},\ p_3 = \frac{p}{2},\ p_4 = \frac{1 - p}{2}, \quad \text{against } H_1: \text{not } H_0.$$
- Here we estimate $p$ with $\hat{p} = \frac{86 + 74}{400} = 0.4$.
- The test statistic is
  $$\tau = \sum_{i=1}^{4} \frac{\left( |O_i - E_i| - \frac{1}{2} \right)^2}{E_i} \sim \chi^2_{g-1-1} = \chi^2_2, \quad \text{under } H_0.$$
                                                   AB        Ab         aB         ab        Total
$O_i$                                              128       86         74         112       400
$E_i$                                              120       80         80         120       400
$\frac{(|O_i - E_i| - \frac{1}{2})^2}{E_i}$        0.46875   0.378125   0.378125   0.46875   1.69375
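The coupling-phase fit can be reproduced the same way. The sketch below follows the steps above (estimate $p$, compute expected counts, drop one further df); the variable names are illustrative.

```r
O <- c(AB = 128, Ab = 86, aB = 74, ab = 112)
n <- sum(O)

# Estimate p by the proportion of Ab and aB
p_hat <- unname((O["Ab"] + O["aB"]) / n)

# Expected counts under the coupling model
E <- n * c((1 - p_hat) / 2, p_hat / 2, p_hat / 2, (1 - p_hat) / 2)

# Continuity-corrected statistic; one further df lost for estimating p
tau <- sum((abs(O - E) - 0.5)^2 / E)
df  <- length(O) - 1 - 1
p_value <- pchisq(tau, df = df, lower.tail = FALSE)

round(c(p_hat = p_hat, tau = tau, p_value = p_value), 4)
```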
- The data were condensed into the table below, writing $x_i$ for the number of infected and $f_i$ for the corresponding frequency:

$x_i$    0     1     2     3     4     5    Total
$f_i$    20    62    55    38    20    5    200
- We need to estimate $p$.
- There were $5 \times 200 = 1000$ insects in total and 391 of these were infected, i.e. an estimate for $p$ would be
  $$\hat{p} = \frac{391}{1000} = 0.391.$$
- The test can be summarised in the following table:

$i$              0        1        2        3        4        5        Total
$p_i$            0.0838   0.2689   0.3453   0.2217   0.0712   0.0091   1
$O_i$            20       62       55       38       20       5        200
$E_i = np_i$     16.76    53.78    69.06    44.34    14.24    1.82     200

where
$$p_i = \binom{5}{i} (0.391)^i (1 - 0.391)^{5-i}, \quad i = 0, \dots, 5.$$
- Wait a minute! One of the cells has an expected value < 5! The $\chi^2$ test is not valid!
- If any $E_i$ value falls below 5, we can
  - get a larger $n$, i.e. collect more data; or
  - combine neighbouring categories.
- We can then carry out the $\chi^2$ test with one fewer category (and also one less df).
- Here we will combine the last two cells to get a single category for $\ge 4$, and the table becomes:

$i$                                                0        1        2        3        $\ge 4$   Total
$p_i$                                              0.0838   0.2689   0.3453   0.2217   0.0803    1
$O_i$                                              20       62       55       38       25        200
$E_i = np_i$                                       16.76    53.78    69.06    44.34    16.06     200
$\frac{(|O_i - E_i| - \frac{1}{2})^2}{E_i}$        0.4479   1.1082   2.6625   0.7692   4.4355    9.4233
The statistic is compared with $\chi^2_{g-1-1} = \chi^2_3$, as $g = 5$ now.
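The whole binomial fit, including the pooling step, can be sketched in R. The variable names are illustrative. Because this sketch keeps the binomial probabilities unrounded, the statistic agrees with the table only to about three decimal places.

```r
x <- 0:5
f <- c(20, 62, 55, 38, 20, 5)
n <- sum(f)                          # 200 observations of 5 insects each

# Estimate p and the binomial probabilities under H0
p_hat <- sum(x * f) / (5 * n)        # 391 / 1000
p <- dbinom(x, size = 5, prob = p_hat)

# Pool the last two categories (4 and 5) into ">= 4"
O <- c(f[1:4], sum(f[5:6]))
E <- n * c(p[1:4], sum(p[5:6]))

# Continuity-corrected statistic on g - 1 - 1 df (one parameter estimated)
tau <- sum((abs(O - E) - 0.5)^2 / E)
df  <- length(O) - 1 - 1
p_value <- pchisq(tau, df = df, lower.tail = FALSE)

round(c(tau = tau, df = df, p_value = p_value), 4)
```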
- We are testing:
  $$H_0: X \sim N(\mu, \sigma^2), \quad \text{against } H_1: \text{not } H_0.$$
$X \sim N(49.71, 229.15)$.
Then
$$p_1 = P(X \le 40) = P\left( Z \le \frac{40 - 49.71}{\sqrt{229.15}} \right) = P(Z \le -0.64) = 0.2611.$$
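The probability above can be checked with `pnorm`. In this sketch the variable names are illustrative; `p1_rounded` mimics a table look-up at $z = -0.64$, while `p1_exact` uses the unrounded $z$, so the two differ slightly in the third or fourth decimal place.

```r
mu     <- 49.71
sigma2 <- 229.15

# Standardise the cut-off 40
z <- (40 - mu) / sqrt(sigma2)

p1_exact   <- pnorm(40, mean = mu, sd = sqrt(sigma2))
p1_rounded <- pnorm(round(z, 2))   # table look-up at z rounded to 2 d.p.

round(c(z = z, p1_exact = p1_exact, p1_rounded = p1_rounded), 4)
```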
d)
- As the P-value is large, the data are consistent with $H_0$, i.e. the data are consistent with the normal model.
- Test the claim that gender affects blood alcohol level, i.e. test whether the two categorising variables are dependent (versus independent).
- We would be testing:
  $$H_0: \text{the two variables are independent}, \quad \text{against } H_1: \text{not } H_0.$$
A probability model for contingency tables
- Let $p_{ij}$ denote the probability of a victim being of sex $i$ and in alcohol level group $j$. The independence model says
  $$p_{ij} = p_{i\cdot} \times p_{\cdot j},$$
  where
  - $p_{i\cdot}$ is the probability of being of sex $i$, $i = 1, \dots, r$, with $r = 2$.
- $O_{\cdot j} = \sum_{i=1}^{r} O_{ij}$ is the observed number in column $j$, i.e. the column marginal total.
For instance,
$$E_{11} = \frac{259 \times 354}{422} = 217.265;$$
$$E_{23} = \frac{39 \times 68}{422} = 6.284.$$
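All eight expected counts can be obtained at once in R. The sketch below takes the observed counts from the plotting code later in these notes and assumes the rule $E_{ij} = O_{i\cdot} O_{\cdot j} / n$, which is inferred from the two worked values above; the object names are illustrative.

```r
# Observed deaths: rows = sex, columns = blood alcohol level
O <- matrix(c(206, 37, 35, 76,
               53,  5,  4,  6),
            nrow = 2, byrow = TRUE,
            dimnames = list(sex = c("Male", "Female"),
                            alcohol = c("0", "(0, 0.08)",
                                        "[0.08, 0.15)", ">= 0.15")))

n <- sum(O)                                # 422 deaths in total

# Expected counts: (row total x column total) / n for every cell
E <- outer(rowSums(O), colSums(O)) / n

round(E, 3)
```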
- We lose 1 df for each factor because we have used the marginal totals in calculating the expected values (i.e. $\sum_i \hat{p}_{i\cdot} = 1 = \sum_j \hat{p}_{\cdot j}$).
$O_{ij}$
$(E_{ij})$
$\frac{(|O_{ij} - E_{ij}| - \frac{1}{2})^2}{E_{ij}}$
- We will do the former here and will provide an example of the latter in the SGTA.
- If we calculate $\frac{(|O_{ij} - E_{ij}| - \frac{1}{2})^2}{E_{ij}}$ for each cell, we would get
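The per-cell values can be computed with a short R sketch. It uses the counts from the plotting code in these notes, applies the continuity correction to every cell as in the formula above, and assumes the usual $(r - 1)(c - 1)$ degrees of freedom for an $r \times c$ table; the object names are illustrative.

```r
# Observed deaths: rows = sex, columns = blood alcohol level
O <- matrix(c(206, 37, 35, 76,
               53,  5,  4,  6),
            nrow = 2, byrow = TRUE,
            dimnames = list(sex = c("Male", "Female"),
                            alcohol = c("0", "(0, 0.08)",
                                        "[0.08, 0.15)", ">= 0.15")))

# Expected counts under independence
E <- outer(rowSums(O), colSums(O)) / sum(O)

# Continuity-corrected contribution of each cell
contrib <- (abs(O - E) - 0.5)^2 / E

# Test statistic, df and P-value
tau <- sum(contrib)
df  <- (nrow(O) - 1) * (ncol(O) - 1)
p_value <- pchisq(tau, df = df, lower.tail = FALSE)

round(contrib, 3)
round(c(tau = tau, df = df, p_value = p_value), 4)
```

Large entries in `contrib` point to the cells that depart most from independence.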
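library(ggplot2)    # provides ggplot() and geom_bar()
library(patchwork)  # lets us combine plots with `+`

# Death counts by sex and blood alcohol level
dat <- data.frame(
  death = c(206, 37, 35, 76, 53, 5, 4, 6),
  sex = rep(c("Male", "Female"), c(4, 4)),
  alcohol = factor(rep(c("0", "(0, 0.08)", "[0.08, 0.15)", ">= 0.15"), 2),
                   levels = c("0", "(0, 0.08)", "[0.08, 0.15)", ">= 0.15"))
)

# Deaths by alcohol level, bars split by sex
p1 <- ggplot(data = dat) +
  geom_bar(aes(x = alcohol, fill = sex, y = death),
           stat = "identity", position = "dodge")

# Deaths by sex, bars split by alcohol level
p2 <- ggplot(data = dat) +
  geom_bar(aes(x = sex, fill = alcohol, y = death),
           stat = "identity", position = "dodge")

p1 + p2  # show the two plots side by side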
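[Figure: two side-by-side bar charts of death counts: by alcohol level with bars coloured by sex (Female, Male), and by sex with bars coloured by alcohol level (0, (0, 0.08), [0.08, 0.15), >= 0.15)]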
- The second point is a bit more subtle: females have a higher than expected frequency in the zero alcohol level group.