The Chi-Square Test and the Analysis of Count Data

Introduction
Example 1: Mode of Transportation
A group of social scientists in the Philippines is investigating the distribution of
preferred modes of transportation among commuters in Metro Manila. Based on
previous studies, the distribution follows a specific pattern: 40% for jeepneys, 25%
for public buses, 20% for trains (MRT / LRT), 10% for private cars, and 5% for other
modes. From a sample of 500 commuters, they record the observed frequencies: 210
for jeepneys, 130 for public buses, 90 for trains (MRT / LRT), 50 for private cars, and
20 for other modes. The data are shown in the table below:
Mode      Observed Frequency   Expected Frequency
jeep      210                  200
bus       130                  125
train      90                  100
car        50                   50
others     20                   25
Total     500                  500
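The expected frequencies in the table come from multiplying each hypothesized percentage by the sample size. A minimal Python sketch of that arithmetic follows; the variable names are illustrative and not part of the original handout.

```python
# Hypothesized percentages from the previous study and the sample size.
hypothesized_pct = {"jeep": 40, "bus": 25, "train": 20, "car": 10, "others": 5}
n = 500

# Expected frequency = hypothesized proportion x sample size.
expected = {mode: n * pct / 100 for mode, pct in hypothesized_pct.items()}

for mode, e in expected.items():
    print(f"{mode:<7}{e:>6.0f}")
```

The expected frequencies always sum to the sample size because the hypothesized percentages sum to 100.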
The data in this table are count data; that is, the number of commuters using each
mode of transportation has been counted and tabulated. The question of interest is
whether the observed distribution deviates significantly from the expected
distribution at the 5% level of significance.
Chi-Square Goodness-of-Fit Test (One-Dimensional Count Data)
In a binomial experiment, the qualitative data are classified into two distinct
categories. We now generalize that experiment to cases in which qualitative data
are classified into more than two categories. Such data are usually referred to as
count data, or enumerative data. Many practical experiments result in
one-dimensional count data. We will consider a class of experiments with
characteristics similar to those of the binomial experiment. This type of
experiment, called the multinomial experiment, is an extension of the binomial
experiment.
The characteristics of the multinomial experiment are rarely satisfied exactly for
practical experiments, because populations of interest usually contain a finite
number of observations. Thus, we will consider an experiment to be multinomial if a
random sample of n observations is taken from a very large population.
For example, suppose we want to compare the observed distribution of preferred
modes of transportation with the percentages of commuters in each of the five
modes reported in the previous study, at the 5% level of significance. That is,
we will test the null hypothesis that the observed distribution of preferred modes of
transportation is the same as that of the previous study.
H0: The observed distribution of preferred modes of transportation is the
same as that of the previous study.
HA: The observed distribution of preferred modes of transportation deviates
from that of the previous study.

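In the table below, O is the observed frequency and E is the expected frequency.
The last column is the chi-square subtotal for each mode, and the sum of the
subtotals is the test statistic $\chi^2 = \sum \frac{(O - E)^2}{E}$.

Mode      O     E     O - E   (O - E)^2   (O - E)^2 / E
jeep      210   200    10     100         0.5
bus       130   125     5      25         0.2
train      90   100   -10     100         1
car        50    50     0       0         0
others     20    25    -5      25         1
Total     500   500     0                 2.7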
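The column arithmetic in the table can be reproduced in a few lines. The following is a minimal Python sketch using only the standard library; the function name `chi_square_statistic` is an illustrative choice, not something defined in this handout.

```python
def chi_square_statistic(observed, expected):
    """Return the sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Transportation example, in the order: jeep, bus, train, car, others.
observed = [210, 130, 90, 50, 20]
expected = [200, 125, 100, 50, 25]

# Per-category chi-square subtotals (the last column of the table).
subtotals = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
statistic = chi_square_statistic(observed, expected)

print(subtotals)
print(round(statistic, 4))  # 2.7
```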
Note that the farther the observed values are from their expected values, the larger
$\chi^2$ will become. That is, large values of $\chi^2$ are evidence against the
null hypothesis. We have to know the sampling distribution of $\chi^2$ in order to
decide whether the data indicate a deviation from the previous study. If the null
hypothesis is true, the distribution of $\chi^2$ in repeated sampling is
approximately a chi-square distribution. The approximation for the sampling
distribution of $\chi^2$ is adequate as long as the expected number of observations
in each of the k categories is at least 5. The chi-square distribution is
characterized by a single parameter, called the degrees of freedom associated with
the distribution; for a goodness-of-fit test with k categories and fully specified
proportions, the degrees of freedom are k - 1. Because large values of $\chi^2$
support the alternative hypothesis, the rejection region for the test will be
located in the upper tail of the chi-square distribution.
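To make the upper-tail rejection region concrete, the sketch below computes the upper-tail probability P(chi-square > x) for a whole-number degrees of freedom using only Python's standard library. The function name `chi_square_upper_tail` and the recurrence-based approach are choices made for this illustration; a statistics package such as Jamovi reports the same quantity as the p-value. The transportation example has 5 categories, hence 5 - 1 = 4 degrees of freedom.

```python
import math

def chi_square_upper_tail(x, df):
    """P(X > x) for a chi-square variable with integer df >= 1 and x >= 0.

    Starts from the closed form for df = 1 (odd) or df = 2 (even) and adds
    one term for every two additional degrees of freedom:
        Q(k + 2) = Q(k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
    """
    half_x = x / 2.0
    if df % 2 == 0:
        k, tail = 2, math.exp(-half_x)
    else:
        k, tail = 1, math.erfc(math.sqrt(half_x))
    while k < df:
        tail += half_x ** (k / 2.0) * math.exp(-half_x) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return tail

# 9.488 is the 5% table value for 4 degrees of freedom.
print(round(chi_square_upper_tail(9.488, 4), 3))  # 0.05
# A statistic of 2.7 lies well below the upper-tail rejection region.
print(round(chi_square_upper_tail(2.7, 4), 3))    # 0.609
```

A statistic falls in the rejection region exactly when its upper-tail probability is smaller than the level of significance, so comparing against the table value and comparing the p-value against the significance level lead to the same decision.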
With k = 5 categories, the test has 5 - 1 = 4 degrees of freedom, and the table value
at the 5% level of significance is $\chi^2_{0.05} = 9.488$. Since the computed
$\chi^2 = 2.7$ does not exceed the table value of 9.488, we fail to reject
the null hypothesis. That is, at the 5% level of significance, the data do not provide
sufficient evidence to conclude that the observed distribution of the mode of
transportation significantly deviates from the expected distribution based on the
previous study.
The following are the assumptions that must be met for a
Chi-Square Goodness-of-Fit Test:
1. All expected frequencies are 1 or greater.
2. At most 20% of the expected frequencies are less than 5.
3. The data come from a simple random sample.
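The first two assumptions can be checked mechanically from the expected frequencies. A minimal Python sketch follows; the helper name `check_expected_frequencies` is hypothetical, and the third assumption (simple random sampling) concerns how the data were collected, so it cannot be verified from the counts themselves.

```python
def check_expected_frequencies(expected):
    """Check the two expected-frequency assumptions of the chi-square test."""
    all_at_least_one = all(e >= 1 for e in expected)
    # Share of categories whose expected frequency is below 5.
    share_below_five = sum(e < 5 for e in expected) / len(expected)
    return {
        "all_at_least_1": all_at_least_one,
        "share_below_5": share_below_five,
        "at_most_20pct_below_5": share_below_five <= 0.20,
    }

# Expected frequencies from the transportation example.
result = check_expected_frequencies([200, 125, 100, 50, 25])
print(result)
```

For the transportation example every expected frequency is at least 25, so both checks pass.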

Chi-Square Independence Test (Two-Dimensional Count Data)
We now consider multinomial experiments in which the data are classified according
to two factors. The data are then summarized in a two-way table called a
contingency table, which presents multinomial count data classified on two scales,
or dimensions.
Example 2: SEA Games Medal Tally
The 2023 SEA Games final medal counts for the top five nations are shown below. At
the 0.10 level of significance, can it be concluded that the type of medal won
depends on the competing country?
Rank   Country       Gold   Silver   Bronze   Total
1      Vietnam       136    105      114      355
2      Thailand      108     96      109      313
3      Indonesia      87     80      109      276
4      Cambodia       81     74      127      282
5      Philippines    58     86      116      260
We test the following hypotheses:

H0: The type of medal won is independent of the competing country.
HA: The type of medal won is dependent on the competing country.

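Under the null hypothesis of independence, each expected frequency E is the row
total multiplied by the column total, divided by the grand total. For example, for
Vietnam's gold medals, E = 355 × 470 / 1486 ≈ 112.3. The observed (O) and expected
(E) frequencies are:

Country        Gold            Silver          Bronze          Total
               O      E        O      E        O      E
Vietnam        136    112.3    105    105.4    114    137.4     355
Thailand       108     99.0     96     92.9    109    121.1     313
Indonesia       87     87.3     80     81.9    109    106.8     276
Cambodia        81     89.2     74     83.7    127    109.1     282
Philippines     58     82.2     86     77.2    116    100.6     260
Total          470             441             575             1486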
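The expected frequencies and the resulting test statistic for the medal table can be reproduced with a short script. This is a minimal sketch in plain Python; the data structure and variable names are illustrative, and the expected counts are computed without intermediate rounding, so the result may differ slightly in the second decimal from a hand calculation that uses rounded expected frequencies.

```python
# Observed medal counts (rows: countries; columns: gold, silver, bronze).
observed = {
    "Vietnam":     [136, 105, 114],
    "Thailand":    [108, 96, 109],
    "Indonesia":   [87, 80, 109],
    "Cambodia":    [81, 74, 127],
    "Philippines": [58, 86, 116],
}

row_totals = {country: sum(row) for country, row in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(col_totals)

statistic = 0.0
for country, row in observed.items():
    for j, o in enumerate(row):
        # Expected count under independence:
        # row total x column total / grand total.
        e = row_totals[country] * col_totals[j] / grand_total
        statistic += (o - e) ** 2 / e

# Degrees of freedom: (number of rows - 1) x (number of columns - 1).
df = (len(observed) - 1) * (len(col_totals) - 1)

print(round(statistic, 2), df)
```

With five countries and three medal types, the script reports a statistic of roughly 26.5 on 8 degrees of freedom.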
With 5 rows and 3 columns, the test has (5 - 1)(3 - 1) = 8 degrees of freedom, and
the table value at the 10% level of significance is 13.362. Since the computed
$\chi^2 = 26.50$ exceeds the table value 13.362, we reject the null
hypothesis. That is, at the 10% level of significance, the data do provide sufficient
evidence to conclude that the type of medal won was dependent upon the
competing country.
The following are the assumptions that must be met for a
Chi-Square Independence Test:
1. All expected frequencies are 1 or greater.
2. At most 20% of the expected frequencies are less than 5.
3. The data come from a simple random sample.

Name:
Section:
Perform the following tests of hypothesis.
1. The litter size of Bengal tigers is typically two or three cubs, but it can vary
between one and four. Based on long-term observations, the litter size of
Bengal tigers in the wild has the distribution given in the table provided. A
zoologist believes that Bengal tigers in captivity tend to have different
(possibly smaller) litter sizes from those in the wild. To verify this belief, the
zoologist searched all data sources and found 316 litter size records of Bengal
tigers in captivity. The results are given in the table provided. Test, at the 5%
level of significance, whether there is sufficient evidence in the data to
conclude that the distribution of litter sizes in captivity differs from that in the
wild. [https://saylordotorg.github.io/text_introductory-statistics/s15-02-chi-square-one-sample-goodness.html]
Litter Size   Wild Litter Distribution   Observed Frequency
1             0.11                        41
2             0.69                       243
3             0.18                        27
4             0.02                         5
Hypotheses:
H0: The observed litter distribution of Bengal tigers in captivity is equal to
the wild litter distribution.
Ha: The observed litter distribution of Bengal tigers in captivity is not equal
to the wild litter distribution.

Assumption Check:
1. All expected frequencies are 1 or greater.
2. None of the expected frequencies are less than 5.
3. Simple random sampling was used to get 316 litter size records of Bengal
tigers in captivity.

Level of Significance and Critical Value:
Level of significance: α = 0.05
Critical value (4 - 1 = 3 degrees of freedom): $\chi^2_{0.05} = 7.815$

Test Statistic (you may include a screenshot from Jamovi):
$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(41 - 35)^2}{35} + \frac{(243 - 218)^2}{218} + \frac{(27 - 57)^2}{57} + \frac{(5 - 6)^2}{6} = 19.8517^*$$
*Slightly different from the Jamovi output because the expected frequencies
(316 × 0.11 = 34.76, 316 × 0.69 = 218.04, 316 × 0.18 = 56.88, 316 × 0.02 = 6.32)
were rounded to whole numbers.

Decision:
We reject the null hypothesis since the test statistic from Jamovi
($\chi^2 = 19.9496$) is greater than the critical value (7.815).
Moreover, the p-value is less than the level of significance.

Conclusion:
At a 5% level of significance, the data provide sufficient
evidence to conclude that the observed distribution of litter
size of Bengal tigers in captivity differs from the litter size
distribution of those in the wild.
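The gap between the hand-computed statistic and the software value in Problem 1 can be traced to rounding the expected frequencies to whole numbers. The sketch below, in plain Python with illustrative names, computes the statistic both ways from the wild litter distribution and the observed frequencies.

```python
wild_distribution = [0.11, 0.69, 0.18, 0.02]  # litter sizes 1 to 4
observed = [41, 243, 27, 5]
n = sum(observed)  # 316 litter records

def chi_square_statistic(observed, expected):
    """Return the sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Unrounded expected frequencies: 34.76, 218.04, 56.88, 6.32.
exact_expected = [n * p for p in wild_distribution]
# Whole-number expected frequencies used in the hand calculation: 35, 218, 57, 6.
rounded_expected = [round(e) for e in exact_expected]

stat_rounded = chi_square_statistic(observed, rounded_expected)
stat_exact = chi_square_statistic(observed, exact_expected)

print(round(stat_rounded, 4))  # 19.8517
print(round(stat_exact, 4))    # about 19.95
```

Both values exceed the critical value of 7.815, so the decision does not depend on the rounding.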
2. Is being left-handed hereditary? To answer this question, 250 adults are
randomly selected and their handedness and their parents’ handedness are
noted. The results are summarized in the table provided. Test, at the 1% level
of significance, whether there is sufficient evidence in the data to conclude
that there is a hereditary element in handedness.
https://saylordotorg.github.io/text_introductory-statistics/s15-01-chi-square-tests-for-independe.html
Hypotheses:
H0: A person’s handedness is independent of their parents’ handedness.
Ha: A person’s handedness is dependent on their parents’ handedness.

Assumption Check:
1. All expected frequencies are 1 or greater.
2. The second assumption is not satisfied, as more than 20% of the expected
frequencies are less than 5. Notice that two (2) expected frequencies are less
than 5. This is more than 20% of the 6 expected frequencies.
3. Simple random sampling was used to get a sample of 250 adults.
We will proceed with the chi-square test of independence, but the results should
be interpreted with caution because not all assumptions are satisfied.

Level of Significance and Critical Value:
Level of significance: α = 0.01
Critical value ((2 - 1)(3 - 1) = 2 degrees of freedom): $\chi^2_{0.01} = 9.210$

Test Statistic (you may include a screenshot from Jamovi):
$$\chi^2 = \sum \frac{(O - E)^2}{E} = \sum \frac{O^2}{E} - n = \frac{8^2}{22.32} + \frac{10^2}{3.72} + \frac{12^2}{3.96} + \frac{178^2}{163.68} + \frac{21^2}{27.28} + \frac{21^2}{29.04} - 250 = 41.0372$$

Decision:
We reject the null hypothesis since the test statistic ($\chi^2 = 41.0372$) is
greater than the critical value (9.210). Moreover, the p-value is less than the
level of significance.

Conclusion:
At a 1% level of significance, the data provide sufficient
evidence to conclude that there is a hereditary element in
handedness.
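The test statistic in Problem 2 is computed with a shortcut: sum O²/E over all cells and subtract the sample size. This equals the usual sum of (O - E)²/E whenever the observed and expected totals agree. The Python sketch below checks this numerically. Because the data table is not reproduced in this text, the 2 × 3 layout of observed counts is inferred from the numbers in the worked solution, and the row and column meanings are left unlabeled.

```python
# Observed counts as they appear in the worked solution (layout inferred).
observed = [[8, 10, 12],
            [178, 21, 21]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)  # 250 adults

# Expected count for each cell: row total x column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Definition: sum of (O - E)^2 / E over all cells.
definition = sum((o - e) ** 2 / e
                 for o_row, e_row in zip(observed, expected)
                 for o, e in zip(o_row, e_row))

# Shortcut: sum of O^2 / E over all cells, minus the sample size.
shortcut = sum(o ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row)) - n

print(round(definition, 4), round(shortcut, 4))
```

The two forms agree up to floating-point error, which is why either can be used in a hand calculation.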
Thanks to Bea Abalos for the answers.
