Chi-Square Probability Distribution

Introduction

A chi-square test is a test based on the chi-square probability distribution. Here, we discuss
Pearson's chi-square test for two-way tables. The data here is qualitative: we have data which is
grouped by a characteristic and by a response or outcome. For example, we might ask a group of
people what type of juice they prefer - orange, apple, cranberry or other. We might be curious about
whether juice preference differs by gender. Is the percentage of men who buy orange juice different
from the percentage of women who buy orange juice? The characteristic is gender (male or female)
and the response is the preferred juice (orange, apple, cranberry, other). In the random sample, an
observation is the gender and juice preference of a person. The response must have two or more
possible values; the characteristic must have two or more possible values.

In some instances, the data is summarized in a table. A table entry contains the number of
observations from the sample that have a particular characteristic and a particular response. The
formulas that follow assume that we've summarized the data in a table.

Intuition

The idea behind the chi-square test is to compare what we observe in the random sample to what
we expect to observe when we assume that there is no relationship between the characteristic and
the response. For example, suppose that 40% of the people in the random sample are men. Then
we'd expect (if there is no relationship) 40% of the apple juice drinkers to be male, 40% of the orange
juice drinkers to be male, and so on. Equivalently, suppose that 10% of the people in the sample
drink cranberry juice. Then we'd expect 10% of the men to drink cranberry juice and 10% of the
women to drink cranberry juice.
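
As a rough illustration of this reasoning, the following sketch turns the two percentages above into expected counts. The sample size of 500 is a made-up number used only to make the arithmetic concrete; it is not part of the example.

```python
# Expected counts when there is no relationship between gender and juice choice.
# The sample size is hypothetical; the 40% and 10% shares come from the text.
n = 500                  # hypothetical number of people sampled
share_men = 0.40         # 40% of the sample are men
share_cranberry = 0.10   # 10% of the sample prefer cranberry juice

men = n * share_men
women = n - men

# With no relationship, each gender shows the overall cranberry share.
expected_men_cranberry = men * share_cranberry
expected_women_cranberry = women * share_cranberry

print(f"expected men who prefer cranberry:   {expected_men_cranberry:.1f}")
print(f"expected women who prefer cranberry: {expected_women_cranberry:.1f}")
```

If the observed counts land far from these expected counts, that is evidence against the "no relationship" assumption.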

We'll compare the observed percentages to the expected percentages. If the observed and expected
percentages differ by more than is implied by random chance, we'll conclude that there is a
relationship between the characteristic and the response. In the example, suppose that 15% of the
men drink cranberry juice and 5% of the women drink cranberry juice. This difference could lead us
to conclude that there is a significant difference between the percentage of women and men who
prefer cranberry juice.

Hypothesis

H0: There is no relationship between the response and the characteristic

HA: There is a relationship between the response and the characteristic

The null hypothesis says that the characteristic (gender) does not affect the response (choice of
juice). The alternative says that the response is affected by the characteristic.

Note the following incorrect statement of H0: All the proportions are equal. This statement is vague
(what proportions?) and incomplete (are all proportions equal?).

Formulas
Let $O_{cr}$ be the observed count for category c and response r, and let $E_{cr}$ be the expected
count for category c and response r.

The observed count comes directly from the sample data. The expected count can be calculated as
follows:

Expected Count = (Category Total) x (Response Total)/(Total Observations)

When the null hypothesis is true, the following statistic has (approximately) a chi-square
distribution with degrees of freedom equal to (C-1)(R-1), where C is the number of possible
characteristics and R is the number of possible responses:

$$\chi^2 = \sum_{c=1}^{C} \sum_{r=1}^{R} \frac{(O_{cr} - E_{cr})^2}{E_{cr}}$$

The null hypothesis states that there is no association between the characteristic and the response
(or, equivalently, between the row variable and the column variable). The alternative hypothesis is
that a relationship exists between the characteristic and the response. Put differently, the alternative
hypothesis says that the response differs depending on the value of the characteristic.

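The p-value of the test statistic is

$$p = P(X^2 \geq \chi^2)$$

where $X^2$ is a chi-square random variable with df = (R-1)(C-1) and $\chi^2$ is the value of the
test statistic computed from the sample. Here R is the number of possible responses and C is the
number of possible characteristics.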
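
The formulas in this section can be sketched in a few lines of standard-library Python. This is only one possible way to organize the calculation: the function names are our own, the 2 x 2 table is made up for illustration, and the p-value routine assumes the degrees of freedom are a positive whole number.

```python
import math

def expected_counts(observed):
    """Expected count = (row total) x (column total) / (total observations)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[rt * ct / total for ct in col_totals] for rt in row_totals]

def chi_square_statistic(observed):
    """Sum of (observed - expected)^2 / expected over every cell."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

def chi_square_p_value(statistic, df):
    """Upper-tail probability P(X2 >= statistic) for a chi-square variable.

    Assumes df is a positive integer. Uses the recurrence
    Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / gamma(a + 1)
    for the regularized upper incomplete gamma function, with h = statistic / 2.
    """
    h = statistic / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)              # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))   # closed form for df = 1
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Hypothetical 2 x 2 table of counts, for illustration only.
table = [[30, 20],
         [20, 30]]
df = (len(table) - 1) * (len(table[0]) - 1)
stat = chi_square_statistic(table)
p = chi_square_p_value(stat, df)
print(f"chi-square = {stat:.3f}, df = {df}, p-value = {p:.4f}")
```

In practice, statistical software would normally be used for the p-value; the hand-rolled version here only avoids an external dependency.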

Khan Academy Videos

There are three Khan Academy videos that you might find helpful:

Chi-square Introduction

Example 1.

An on-line music service company wants to know (for the purposes of designing a marketing
campaign) if customer age is important in deciding to subscribe, and, if so, which age groups are
most likely to subscribe to its services. The company has gathered a random sample of 1000 people
from the population, and asked each person whether he or she would subscribe to the service. The
company knows which age group the person falls into: under 18 years of age, 18 - 34 years of age,
and 35 years or older. The company could use a chi-square test to answer its questions. How do we
know this? A chi-square test is useful because we are asking a question about the
differences in outcomes (subscribe or don't subscribe) across people who differ by some
characteristic (age group).

The data for this example is:

              Under 18    18-34    35 and over    Total
Yes                120      262            237      619
No                  41      103            237      381
Total              161      365            474     1000

Here, we have two responses (rows) (Yes, No), so R = 2. We have three characteristics (columns)
(Under 18, 18-34, 35 and Over), so C = 3.

Overall, 619 people said they would subscribe, and 381 said that they would not subscribe. In
percentage terms, 61.9% of the people said "yes" and 38.1% said "no". Did that percentage differ by
age group?

- In the Under-18 age group, 120 out of 161 (74.5%) said "yes".
- For the 18-34 year olds, 262 out of 365 (71.8%) said "yes".
- For the 35 and older group, 237 out of 474 (50%) said "yes".
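
These percentages are easy to reproduce. The short sketch below divides the number of "yes" answers by the size of each age group, using the counts quoted in the list above; the variable names are our own.

```python
# (number who said "yes", group size) for each age group, from the list above.
groups = {
    "Under 18": (120, 161),
    "18-34": (262, 365),
    "35 and over": (237, 474),
}

share_yes = {name: yes / size for name, (yes, size) in groups.items()}
for name, share in share_yes.items():
    print(f"{name}: {share:.1%} said yes")

# Overall share of "yes" answers across all age groups.
total_yes = sum(yes for yes, _ in groups.values())
total_people = sum(size for _, size in groups.values())
overall = total_yes / total_people
print(f"Overall: {overall:.1%} said yes")
```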

The null hypothesis of the chi-square test is that the characteristic (age group) does not affect the
outcome (subscribing to the service). More specifically, the null hypothesis says that the percentage
of people overall who said "yes" is the same as the percentage who said "yes" in each age group. If
the null hypothesis is true, then we should see 61.9% of the people in each group saying "yes". In
other words, we'd expect to see the following data:

              Under 18    18-34    35 and over    Total
Yes              99.66   225.94         293.41      619
No               61.34   139.06         180.59      381
Total              161      365            474     1000

For example, 99.66 is 61.9% of the 161 Under 18 year olds who were in the sample. These are called
the expected counts. Since we are applying percentages to whole numbers, we won't get whole
numbers out of the calculation. This is fine.
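
As a check on the arithmetic, the expected counts can be recomputed from the observed cells. The sketch below is only illustrative: the column totals are summed from the "Yes" and "No" cells rather than typed in, and the variable names are our own.

```python
# Observed counts from the example, one list per response (Yes / No).
observed = {
    "Yes": [120, 262, 237],
    "No": [41, 103, 237],
}
age_groups = ["Under 18", "18-34", "35 and over"]

# Column totals are derived from the cells themselves.
column_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(column_totals)

expected = {}
for response, counts in observed.items():
    response_share = sum(counts) / grand_total   # e.g. 619 / 1000 for "Yes"
    expected[response] = [response_share * total for total in column_totals]

for response, row in expected.items():
    cells = ", ".join(f"{g}: {v:.2f}" for g, v in zip(age_groups, row))
    print(f"{response}: {cells}")
```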

To calculate the chi-square statistic, take the difference between the Observed Count and the
Expected Count, square the difference, and divide by the Expected Count, producing the following
table:

              Under 18    18-34    35 and over    Total
Yes              4.152    5.757         10.844      619
No               6.745    9.353         17.618      381
Total              161      365            474     1000

To obtain the chi-square statistic, add the six cell values in this table (not the row and column
totals, which are the observed totals). The result is χ² = 54.468. There are (3-1)x(2-1) = 2 degrees
of freedom. The p-value for this test can be calculated using statistical software, by looking at a
chi-square table, or using a chi-square p-value calculator. The critical value for this test at the
5% significance level is 5.991. Since 54.468 > 5.991, we reject H0.
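
The cell-by-cell calculation described above can be sketched as follows. The observed counts are taken from the example; the critical value 5.991 is copied from the text rather than computed, and the variable names are our own.

```python
observed = [[120, 262, 237],   # Yes
            [41, 103, 237]]    # No

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Each cell contributes (observed - expected)^2 / expected,
# where expected = row total x column total / n.
contributions = [
    [(o - rt * ct / n) ** 2 / (rt * ct / n) for o, ct in zip(row, col_totals)]
    for row, rt in zip(observed, row_totals)
]
chi_square = sum(sum(row) for row in contributions)
df = (len(observed) - 1) * (len(col_totals) - 1)

critical_value = 5.991   # value quoted in the text for 2 degrees of freedom
print(f"chi-square = {chi_square:.3f} with {df} degrees of freedom")
print("reject H0" if chi_square > critical_value else "fail to reject H0")
```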

The p-value for this statistic is effectively zero (1.4874E-12). So we reject
the null hypothesis and conclude that the age groups differ in their decision to purchase the
service.
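
For the special case of 2 degrees of freedom, the upper-tail probability of a chi-square variable has the simple closed form exp(-x/2), so both the p-value and the critical value can be checked without statistical tables. The sketch below assumes a 5% significance level, which is the level at which 5.991 is the critical value; the text does not name the level explicitly.

```python
import math

chi_square = 54.468   # test statistic from the text
alpha = 0.05          # assumed significance level (not stated in the text)

# With 2 degrees of freedom, P(X2 >= x) = exp(-x / 2).
p_value = math.exp(-chi_square / 2)

# Inverting the same expression gives the critical value for a given alpha.
critical_value = -2 * math.log(alpha)

print(f"p-value        = {p_value:.4e}")
print(f"critical value = {critical_value:.3f}")
```

This shortcut applies only when df = 2; other degrees of freedom need software or a chi-square table.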

Specifically, rejection of the null hypothesis leads us to conclude that there is a relationship between
the row and column variables. In this case, age is related to whether or not a person would subscribe
to the music service.
