
The Chi-Square Test and the Analysis of Count Data

Introduction
Example 1: Mode of Transportation

A group of social scientists in the Philippines is investigating the distribution of
preferred modes of transportation among commuters in Metro Manila. Based on
previous studies, the distribution follows a specific pattern: 40% for jeepneys, 25%
for public buses, 20% for trains (MRT/LRT), 10% for private cars, and 5% for other
modes. From a sample of 500 commuters, they record the observed frequencies: 210
for jeepneys, 130 for public buses, 90 for trains (MRT/LRT), 50 for private cars, and
20 for other modes. The data are shown in the table below:

Mode      Observed Frequency   Expected Frequency
jeep      210                  200
bus       130                  125
train      90                  100
car        50                   50
others     20                   25
Total     500                  500
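The expected frequencies in the table follow from multiplying the sample size by each hypothesized proportion. A minimal Python sketch of that arithmetic is given here; the variable names are illustrative only and are not part of the original handout.

```python
# Hypothesized proportions from the previous studies and the observed counts.
proportions = {"jeep": 0.40, "bus": 0.25, "train": 0.20, "car": 0.10, "others": 0.05}
observed = {"jeep": 210, "bus": 130, "train": 90, "car": 50, "others": 20}

n = sum(observed.values())  # total number of commuters sampled

# Expected frequency for each mode: E = n * p.
expected = {mode: n * p for mode, p in proportions.items()}

for mode in observed:
    print(f"{mode:<7} O = {observed[mode]:>3}   E = {expected[mode]:>6.1f}")
```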

The data in this table are count data: the number of commuters using each mode
of transportation has been counted and tabulated. The question of interest is whether
the observed distribution deviates significantly from the expected distribution at the
5% level of significance.

Chi-Square Goodness-of-Fit Test (One-Dimensional Count Data)
In a binomial experiment, the qualitative data are classified into two distinct
categories. We now generalize that experiment to the case in which qualitative data
are classified into more than two categories. Such data are usually referred to as
count data, or enumerative data. Many practical experiments result in
one-dimensional count data. We will consider a class of experiments with
characteristics similar to those of the binomial experiment. This type of experiment,
called the multinomial experiment, is an extension of the binomial experiment.

The characteristics of the multinomial experiment are rarely satisfied exactly for
practical experiments, because populations of interest usually contain a finite
number of observations. Thus, we will consider an experiment to be multinomial if a
random sample of n observations is taken from a very large population.

For example, suppose we want to compare the observed distribution of preferred
modes of transportation with the percentages of commuters reported for each of the
five modes in the previous study, at the 5% level of significance. That is, we will test
the null hypothesis that the observed distribution of preferred modes of
transportation is the same as in the previous study.

$H_0$: The observed distribution of preferred modes of transportation is the
same as in the previous study.

$H_A$: The observed distribution of preferred modes of transportation deviates
from that of the previous study.



Mode      Observed       Expected       Difference   Square of the        Chi-square subtotal
          Frequency, O   Frequency, E   O-E          Difference (O-E)^2   (O-E)^2 / E
jeep      210            200             10          100                  0.5
bus       130            125              5           25                  0.2
train      90            100            -10          100                  1
car        50             50              0            0                  0
others     20             25             -5           25                  1
Total     500            500              0                               2.7

The test statistic is the sum of the last column: $\chi^2 = \sum \frac{(O-E)^2}{E} = 2.7$.
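The column-by-column arithmetic of the table can be reproduced with a few lines of Python. This is only a sketch of the hand computation; the list names are illustrative.

```python
# Observed and expected frequencies for jeep, bus, train, car, others.
observed = [210, 130, 90, 50, 20]
expected = [200, 125, 100, 50, 25]

# Chi-square subtotal for each category: (O - E)^2 / E.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]

# The test statistic is the sum of the subtotals.
chi_square = sum(contributions)

print(contributions)
print(round(chi_square, 4))
```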

Note that the farther the observed values are from their expected values, the larger
$\chi^2$ will become. That is, large values of $\chi^2$ are evidence against the null
hypothesis. We have to know the sampling distribution of $\chi^2$ in order to decide
whether the data indicate a deviation from the previous study. If the null hypothesis
is true, the distribution of the test statistic in repeated sampling is approximately a
$\chi^2$ distribution. The approximation for the sampling distribution of $\chi^2$ is
adequate as long as the expected number of observations in each of the k categories
is at least 5. The $\chi^2$ distribution is characterized by a single parameter, called
the degrees of freedom associated with the distribution. Because large values of
$\chi^2$ support the alternative hypothesis, the rejection region for the test will be
located in the upper tail of the $\chi^2$ distribution.
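To make the upper-tail idea concrete, the sketch below evaluates the upper-tail area of a chi-square distribution using the standard closed-form expressions for integer degrees of freedom. The function name chi_square_upper_tail is hypothetical, and the example assumes k - 1 = 4 degrees of freedom for the five transportation categories. It is an illustration, not a replacement for a statistical table or for Jamovi output.

```python
import math

def chi_square_upper_tail(x, df):
    """Upper-tail area P(X >= x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over j = 0 .. df/2 - 1 of (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus
    # exp(-x/2) * sum over j = 1 .. (df-1)/2 of (x/2)^(j - 1/2) / Gamma(j + 1/2)
    total = 0.0
    for j in range(1, (df - 1) // 2 + 1):
        total += half ** (j - 0.5) / math.gamma(j + 0.5)
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total

# Upper-tail area beyond the statistic from Example 1 (assumed df = 4).
p_value = chi_square_upper_tail(2.7, 4)
# Upper-tail area beyond the 5% table value used in this example.
tail_at_critical = chi_square_upper_tail(9.488, 4)
print(round(p_value, 4), round(tail_at_critical, 4))
```

A statistic falls in the rejection region exactly when its upper-tail area is smaller than the level of significance, which is the same comparison as checking the statistic against the table value.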

Since the computed $\chi^2 = 2.7$ does not exceed the table value of 9.488 (the upper
5% point of the $\chi^2$ distribution with k - 1 = 4 degrees of freedom), we fail to
reject the null hypothesis. That is, at the 5% level of significance, the data do not
provide sufficient evidence to conclude that the observed distribution of the mode of
transportation significantly deviates from the expected distribution based on the
previous study.

The following are the assumptions that must be met for a Chi-Square
Goodness-of-Fit Test:

1. All expected frequencies are 1 or greater.

2. At most 20% of the expected frequencies are less than 5.

3. Simple random sample.
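The first two assumptions can be checked mechanically from the expected frequencies. The helper below is a hypothetical sketch (the name check_expected_frequencies is not from the handout); the third assumption concerns how the sample was drawn and cannot be verified from the counts.

```python
def check_expected_frequencies(expected):
    """Return (assumption_1_ok, assumption_2_ok) for a list of expected frequencies."""
    # Assumption 1: every expected frequency is 1 or greater.
    all_at_least_one = all(e >= 1 for e in expected)
    # Assumption 2: at most 20% of the expected frequencies are below 5.
    share_below_five = sum(1 for e in expected if e < 5) / len(expected)
    return all_at_least_one, share_below_five <= 0.20

# Expected frequencies from Example 1 (jeep, bus, train, car, others).
print(check_expected_frequencies([200, 125, 100, 50, 25]))
```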



Chi-Square Independence Test (Two-Dimensional Count Data)
We now consider multinomial experiments in which the data are classified according
to two factors. The data are then summarized in a two-way table called a
contingency table, which presents multinomial count data classified on two scales, or
dimensions.

Example 2: SEA Games Medal Tally

The 2023 SEA Games final medal counts for the top five nations are shown below. At
the 0.10 level of significance, can it be concluded that the type of medal won was
dependent upon the competing country?

Rank   Country       Gold   Silver   Bronze   Total
1      Vietnam       136    105      114      355
2      Thailand      108     96      109      313
3      Indonesia      87     80      109      276
4      Cambodia       81     74      127      282
5      Philippines    58     86      116      260

We test the following hypotheses:

$H_0$: The type of medal won is independent of the competing country.

$H_A$: The type of medal won is dependent on the competing country.



Country        Gold           Silver         Bronze         Total
               O      E       O      E       O      E
Vietnam        136    112.3   105    105.4   114    137     355
Thailand       108     99      96     92.9   109    121     313
Indonesia       87     87.3    80     81.9   109    107     276
Cambodia        81     89.2    74     83.7   127    109     282
Philippines     58     82.2    86     77.2   116    101     260
Total          470            441            575            1486
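Each expected frequency in the table is the product of its row total and column total divided by the grand total, for example 355 × 470 / 1486 ≈ 112.3 for Vietnam's gold medals. The Python sketch below reproduces the table of expected values; the variable names are illustrative.

```python
countries = ["Vietnam", "Thailand", "Indonesia", "Cambodia", "Philippines"]
# Observed medal counts per country: gold, silver, bronze.
observed = [
    [136, 105, 114],
    [108, 96, 109],
    [87, 80, 109],
    [81, 74, 127],
    [58, 86, 116],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for name, row in zip(countries, expected):
    print(f"{name:<12}", "  ".join(f"{e:6.1f}" for e in row))
```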

Since the computed $\chi^2 = 26.50$ exceeds the table value of 13.362 (the upper 10%
point of the $\chi^2$ distribution with (5 - 1)(3 - 1) = 8 degrees of freedom), we
reject the null hypothesis. That is, at the 10% level of significance, the data provide
sufficient evidence to conclude that the type of medal won was dependent upon the
competing country.

The following are the assumptions that must be met for a Chi-Square
Independence Test:

1. All expected frequencies are 1 or greater.

2. At most 20% of the expected frequencies are less than 5.

3. Simple random sample.
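The independence test for Example 2 can be sketched end to end in Python as follows. The critical value 13.362 is taken from the text rather than computed, and the variable names are illustrative. Because the expected frequencies are not rounded here, the statistic may differ slightly in the last digits from the value reported above.

```python
# Observed medal counts (gold, silver, bronze) for the top five nations.
observed = {
    "Vietnam": [136, 105, 114],
    "Thailand": [108, 96, 109],
    "Indonesia": [87, 80, 109],
    "Cambodia": [81, 74, 127],
    "Philippines": [58, 86, 116],
}

rows = list(observed.values())
row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]
grand_total = sum(row_totals)

# Sum (O - E)^2 / E over every cell, with E = row total * column total / grand total.
chi_square = 0.0
for row, r_total in zip(rows, row_totals):
    for o, c_total in zip(row, col_totals):
        e = r_total * c_total / grand_total
        chi_square += (o - e) ** 2 / e

# Degrees of freedom: (number of rows - 1) * (number of columns - 1).
df = (len(rows) - 1) * (len(col_totals) - 1)

critical_value = 13.362  # table value quoted in the text for alpha = 0.10
reject_null = chi_square > critical_value

print(f"chi-square = {chi_square:.2f}, df = {df}, reject H0: {reject_null}")
```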


Name:
Section:

Perform the following tests of hypothesis.

1. The litter size of Bengal tigers is typically two or three cubs, but it can vary
between one and four. Based on long-term observations, the litter size of
Bengal tigers in the wild has the distribution given in the table provided. A
zoologist believes that Bengal tigers in captivity tend to have different
(possibly smaller) litter sizes from those in the wild. To verify this belief, the
zoologist searched all data sources and found 316 litter size records of Bengal
tigers in captivity. The results are given in the table provided. Test, at the 5%
level of significance, whether there is sufficient evidence in the data to
conclude that the distribution of litter sizes in captivity differs from that in the
wild. [https://saylordotorg.github.io/text_introductory-statistics/s15-02-chi-square-one-sample-goodness.html]

Litter Size   Wild Litter Distribution   Observed Frequency
1             0.11                        41
2             0.69                       243
3             0.18                        27
4             0.02                         5

Hypotheses:
$H_0$: The observed litter distribution of Bengal tigers in captivity is equal to the
wild litter distribution.
$H_a$: The observed litter distribution of Bengal tigers in captivity is not equal to
the wild litter distribution.

Assumption Check:
1. All expected frequencies are 1 or greater.
2. None of the expected frequencies are less than 5.
3. Simple random sampling was used to get 316 Bengal tigers in captivity.

Level of Significance and Critical Value:
Level of significance: $\alpha = 0.05$
Critical value: $\chi^2_{0.05} = 7.815$
Test Statistic (you may include a screenshot from Jamovi):
$$\sum \frac{(O-E)^2}{E} = \frac{(41-35)^2}{35} + \frac{(243-218)^2}{218} + \frac{(27-57)^2}{57} + \frac{(5-6)^2}{6} = 19.8517^{*}$$

*Slightly different from Jamovi output due to rounding error.
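The rounding effect mentioned in the footnote can be seen by computing the statistic twice: once with expected frequencies rounded to whole numbers, as in the hand computation, and once with the unrounded values n × p. The sketch below uses illustrative variable names and the data from the litter-size table.

```python
# Wild litter-size proportions and observed captive counts for litter sizes 1 to 4.
proportions = [0.11, 0.69, 0.18, 0.02]
observed = [41, 243, 27, 5]

n = sum(observed)  # 316 litter records

# Unrounded expected frequencies, E = n * p.
exact_expected = [n * p for p in proportions]
# Expected frequencies rounded to whole numbers, as in the hand computation.
rounded_expected = [round(e) for e in exact_expected]

def chi_square(obs, exp):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

chi_rounded = chi_square(observed, rounded_expected)
chi_exact = chi_square(observed, exact_expected)

print(f"rounded E: {chi_rounded:.4f}   unrounded E: {chi_exact:.4f}")
```

Both values are well above the critical value of 7.815, so the rounding does not change the decision.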

Decision:
We reject the null hypothesis since the test statistic $\chi^2 = 19.9496$ is greater
than the critical value (7.815). Moreover, the p-value is less than the level of
significance.

Conclusion:
At a 5% level of significance, the data provide sufficient evidence to conclude that
the observed distribution of litter size of Bengal tigers in captivity differs from the
litter size distribution of those in the wild.

2. Is being left-handed hereditary? To answer this question, 250 adults are
randomly selected and their handedness and their parents’ handedness are
noted. The results are summarized in the table provided. Test, at the 1% level
of significance, whether there is sufficient evidence in the data to conclude
that there is a hereditary element in handedness.
[https://saylordotorg.github.io/text_introductory-statistics/s15-01-chi-square-tests-for-independe.html]

Hypotheses:
$H_0$: A person’s handedness is independent of their parents’ handedness.
$H_a$: A person’s handedness is dependent on their parents’ handedness.

Assumption Check:
1. All expected frequencies are 1 or greater.
2. The second assumption is not satisfied, as more than 20% of the expected
frequencies are less than 5. Notice that two (2) expected frequencies are less
than 5. This is more than 20% of the 6 expected frequencies.
3. Simple random sampling was used to get a sample of 250 adults.

We will proceed with the chi-square test of independence, but the results should be
interpreted with caution because not all assumptions are satisfied.

Level of Significance and Critical Value:
Level of significance: $\alpha = 0.01$
Critical value: $\chi^2_{0.01} = 9.210$
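Test Statistic (you may include a screenshot from Jamovi):
$$\sum \frac{(O-E)^2}{E} = \sum \frac{O^2}{E} - n = \frac{8^2}{22.32} + \frac{10^2}{3.72} + \frac{12^2}{3.96} + \frac{178^2}{163.68} + \frac{21^2}{27.28} + \frac{21^2}{29.04} - 250 = 41.0372$$
Here n = 250 is the sample size; the shortcut form holds because the observed and
expected frequencies both sum to n.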
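The test statistic for this problem can be cross-checked in Python. The sketch assumes that the six observed counts form a 2 × 3 table with rows (8, 10, 12) and (178, 21, 21); that arrangement is an assumption made here for illustration, chosen because it yields expected frequencies such as 3.72, 3.96, and 163.68 from row and column totals. The code also confirms that the shortcut form, the sum of O²/E minus n, agrees with the definition.

```python
# Assumed 2 x 3 arrangement of the observed counts (see the note above).
observed = [
    [8, 10, 12],
    [178, 21, 21],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Flatten both tables so the cells can be paired up.
obs_flat = [o for row in observed for o in row]
exp_flat = [e for row in expected for e in row]

# Definition: sum of (O - E)^2 / E.
chi_square = sum((o - e) ** 2 / e for o, e in zip(obs_flat, exp_flat))
# Shortcut form: sum of O^2 / E, minus n.
chi_square_shortcut = sum(o ** 2 / e for o, e in zip(obs_flat, exp_flat)) - n

# Share of expected frequencies below 5 (second assumption).
share_below_five = sum(1 for e in exp_flat if e < 5) / len(exp_flat)

print(f"{chi_square:.4f} {chi_square_shortcut:.4f} {share_below_five:.2f}")
```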

Decision:
We reject the null hypothesis since the test statistic ($\chi^2 = 41.0372$) is greater
than the critical value (9.210). Moreover, the p-value is less than the level of
significance.

Conclusion:
At a 1% level of significance, the data provide sufficient evidence to conclude that
there is a hereditary element in handedness.

Thanks to Bea Abalos for the answers.
