Chi-Square Probability Distribution
In some instances, the data is summarized in a table. A table entry contains the number of
observations from the sample that have a particular characteristic and a particular response. The
formulas that follow assume that we've summarized the data in a table.
Intuition
The idea behind the chi-square test is to compare what we observe in the random sample to what
we expect to observe when we assume that there is no relationship between the characteristic and
the response. For example, suppose that 40% of the people in the random sample are men. Then
we'd expect (if there is no relationship) 40% of the apple juice drinkers to be male, 40% of the orange
juice drinkers to be male, and so on. Equivalently, suppose that 10% of the people in the sample
drink cranberry juice. Then we'd expect 10% of the men to drink cranberry juice and 10% of the
women to drink cranberry juice.
We'll compare the observed percentages to the expected percentages. If the observed and expected
percentages differ by more than is implied by random chance, we'll conclude that there is a
relationship between the characteristic and the response. In the example, suppose that 15% of the
men drink cranberry juice and 5% of the women drink cranberry juice. This difference could lead us
to conclude that there is a significant difference between the percentage of women and men who
prefer cranberry juice.
Hypothesis
The null hypothesis says that the characteristic (gender) does not affect the response (choice of
juice). The alternative says that the response is affected by the characteristic.
Note the following incorrect statement of H0: All the proportions are equal. This statement is vague
(what proportions?) and incomplete (are all proportions equal?).
Formulas
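Let $O_{c,r}$ be the observed count for characteristic (category) c and response r, and let $E_{c,r}$ be
the corresponding expected count.
The observed count comes directly from the sample data. The expected count can be calculated as
follows:
$$E_{c,r} = \frac{(\text{total for characteristic } c) \times (\text{total for response } r)}{n}$$
where n is the total number of observations in the sample.
Under the null hypothesis, the following statistic has (approximately) a chi-square distribution with
degrees of freedom equal to (C-1)(R-1), where C is the number of possible characteristics and R is the
number of possible responses:
$$\chi^2 = \sum_{c=1}^{C} \sum_{r=1}^{R} \frac{(O_{c,r} - E_{c,r})^2}{E_{c,r}}$$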
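The calculation described above (expected counts from the row and column totals, then the sum of squared differences divided by the expected counts) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `chi_square_statistic` and the small 2x2 table are made up for the illustration and do not come from the text.

```python
def chi_square_statistic(observed):
    """Return (statistic, degrees_of_freedom, expected) for a table of counts.

    `observed` is a list of rows; each row holds the counts for one response,
    with one entry per characteristic (column).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    # Expected count under "no relationship": row total * column total / n.
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

    # Sum (O - E)^2 / E over every cell of the table.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(observed) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Hypothetical 2x2 table (made-up counts, not from the text).
toy_table = [[30, 10],
             [20, 40]]
toy_stat, toy_dof, toy_expected = chi_square_statistic(toy_table)
print(toy_expected)
print(round(toy_stat, 4), toy_dof)
```

A large statistic relative to the chi-square distribution with `toy_dof` degrees of freedom would count as evidence against the null hypothesis.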
The null hypothesis states that there is no association between the characteristic and the response
(or, equivalently, between the row variable and the column variable). The alternative hypothesis is
that a relationship exists between the characteristic and the response. Equivalently, the alternative
hypothesis says that the response differs depending on the value of the characteristic.
There are three Khan Academy videos that you might find helpful:
Chi-square Introduction
Example 1.
An on-line music service company wants to know (for the purposes of designing a marketing
campaign) if customer age is important in deciding to subscribe, and, if so, which age groups are
most likely to subscribe to its services. The company has gathered a random sample of 1000 people
from the population, and asked each person whether he or she would subscribe to the service. The
company knows which age group the person falls into: under 18 years of age, 18 - 34 years of age,
and 35 years or older. The company could use a chi-square test to answer its questions. How do we
know this? We know that a chi-square test is useful because we are asking a question about the
differences in outcomes (subscribe or don't subscribe) across people who differ by some
characteristic (age group).
Here, we have two responses (rows) (Yes, No), so R = 2. We have three characteristics (columns)
(Under 18, 18-34, 35 and Over), so C = 3.
Overall, 619 people said they would subscribe, and 381 said that they would not subscribe. In
percentage terms, 61.9% of the people said "yes" and 38.1% said "no". Did that percentage differ by
age group?
In the Under-18 age group, 120 out of 161 (or 74.5%) of the people said "yes"
For the 18-34 year olds, 262 out of 365 (71.8%) said "yes"
For the 35 and older group, 237 out of 474 (50%) said "yes"
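The percentages quoted above can be checked directly from the counts. The short sketch below uses the group sizes and "yes" counts given in the example; the "no" counts are not stated in the text and are derived here as group size minus "yes" count. The variable names are chosen only for this illustration.

```python
# Observed counts from the example; the "no" counts are group size minus "yes".
group_sizes = {"Under 18": 161, "18-34": 365, "35 and Over": 474}
yes_counts = {"Under 18": 120, "18-34": 262, "35 and Over": 237}

no_counts = {g: group_sizes[g] - yes_counts[g] for g in group_sizes}
yes_share = {g: yes_counts[g] / group_sizes[g] for g in group_sizes}
overall_yes_share = sum(yes_counts.values()) / sum(group_sizes.values())

for g in group_sizes:
    print(f"{g}: {yes_share[g]:.1%} yes")
print(f"Overall: {overall_yes_share:.1%} yes")
```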
The null hypothesis of the chi-square test is that the characteristic (age group) does not affect the
outcome (subscribing to the service). More specifically, the null hypothesis says that the percentage
of people overall who said "yes" is the same as the percentage who said "yes" in each age group. If
the null hypothesis is true, then we should see 61.9% of the people in each group saying "yes". In
other words, we'd expect to see the following data:
          Under 18    18-34    35 and Over    Total
Yes          99.66   225.94         293.41      619
No           61.34   139.07         180.59      381
Total       161      365            474        1000
For example, 99.66 is 61.9% of the 161 Under 18 year olds who were in the sample. These are called
the expected counts. Since we are applying percentages to whole numbers, we won't get whole
numbers out of the calculation. This is fine.
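As a quick check on the expected counts, the sketch below applies the overall "yes" and "no" totals to each age group, which is the same as multiplying the response total by the group size and dividing by the sample size. The dictionary layout and names are assumptions made for this illustration only.

```python
group_sizes = {"Under 18": 161, "18-34": 365, "35 and Over": 474}
response_totals = {"Yes": 619, "No": 381}
n = sum(group_sizes.values())  # 1000 people in the sample

# Expected count = response total * group size / n for every cell.
expected = {
    response: {g: total * size / n for g, size in group_sizes.items()}
    for response, total in response_totals.items()
}

for response, row in expected.items():
    print(response, {g: round(e, 2) for g, e in row.items()})
```

Note that the expected counts in each column add up to the group size, and the expected counts in each row add up to the response total, just as the observed counts do.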
To calculate the chi-square statistic, take the difference between the Observed Count and the
Expected Count, square the difference, and divide by the Expected Count, producing the following
table:
The p-value for this statistic is effectively zero (1.4874E-12), as illustrated in the following graph.
So we reject the null hypothesis and conclude that the age groups differ in their decision to
subscribe to the service.
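The statistic and p-value for this example can be reproduced with a short sketch using only the Python standard library. The "no" counts (41, 103, 237) are derived from the group sizes and "yes" counts given earlier. The p-value line relies on a special case: with exactly 2 degrees of freedom, the upper-tail probability of the chi-square distribution is exp(-x/2). For other degrees of freedom this shortcut does not apply.

```python
import math

# Observed counts by response (rows) and age group (columns:
# Under 18, 18-34, 35 and Over). "No" counts are derived, not quoted.
observed = {"Yes": [120, 262, 237], "No": [41, 103, 237]}
col_totals = [sum(col) for col in zip(*observed.values())]
n = sum(col_totals)

chi_square = 0.0
for response, row in observed.items():
    row_total = sum(row)
    for o, ct in zip(row, col_totals):
        e = row_total * ct / n          # expected count for this cell
        chi_square += (o - e) ** 2 / e  # this cell's contribution

dof = (len(observed) - 1) * (len(col_totals) - 1)  # (2-1)*(3-1) = 2

# Closed form valid only for 2 degrees of freedom: P(X > x) = exp(-x / 2).
p_value = math.exp(-chi_square / 2)
print(round(chi_square, 2), dof, p_value)
```

For tables with other dimensions, a general-purpose routine such as `scipy.stats.chi2.sf(chi_square, dof)` could be used instead, assuming SciPy is available.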
Specifically, rejection of the null hypothesis leads us to conclude that there is a relationship between
the row and column variables. In this case, age is related to whether or not a person would subscribe
to the music service.