Professional Documents
Culture Documents
Lecture 14
Lecture 14
Tests
Categorical Data
Goodness-of-fit test
Examining independence between variables:
the test of independence
Comparing several populations: the test of
homogeneity
Example: Market Research for Toothpaste
wants to test the hypothesis:
Ideas
Goodness of Fit Tests
A goodness-of-fit test is an inferential
procedure used to determine whether a
frequency distribution follows a claimed
distribution.
Hypotheses
Test Statistic
The Chi-Square Distribution
1. It is not symmetric.
2. The shape of the chi-square distribution depends
upon the degrees of freedom, just like Students t-
distribution.
3. As the number of degrees of freedom increases,
the chi-square distribution becomes more
symmetric.
4. The values are non-negative.
Characteristics of the Chi-Square Distribution
P-value
A study of 667 drivers who were using a cell phone when they were involved in a
collision on a weekday examined the relationship between these accidents and the
day of the week.
Car accidents and day of the week
Are the accidents equally likely to occur on any day of the working week?
H
0
specifies that all 5 days are equally likely for car accidents each p
i
= 1/5.
H
0
specifies that all days are equally likely for car
accidents each p
i
= 1/5.
Car accidents and day of the week
The expected count for each of the five days is np
i
= 667(1/5) = 133.4.
Following the chi-square distribution with 5 1 = 4 degrees of freedom. From table X
2
2
day 2
(count - 133.4)
(observed - expected)
8.49
expected 133.4
The p-value is thus between 0.1 and 0.05 since 7.779<8.49<9.488, which is not
significant at 5%.
There is no significant evidence of different car accident rates for different weekdays
when the driver was using a cell phone.
.1 .05 .025 .01 .005
df
4 7.779 9.488 11.143 13.227 14.860
Combining Categories
A Market Survey
The chi-square test is an overall technique for comparing any number of
population proportions, testing for evidence of a relationship between two
categorical variables. We can either:
Test for independence: Take one SRS and classify the individuals in the
sample according to two categorical variables
Compare several populations: Randomly select several SRSs each
from a different population (or from a population subjected to
different treatments)
Both models use the X
2
test to test the hypothesis of no relationship.
Models for two-way tables
Testing for independence (no association)
We now have a single sample from a single population. For each
individual in this SRS of size n we measure two categorical
variables. The results are then summarized in a two-way table.
The null hypothesis is that the row and column variables are
independent. The alternative hypothesis is that the row and
column variables are dependent.
Computing expected counts
When testing the null hypothesis that there is no relationship between
both categorical variables of a two-way table, we compare actual
counts from the sample data with expected counts given H
0
.
The expected count in any cell of a two-way table when H
0
is true is:
Although in real life counts must be whole numbers, an expected
count need not be. The expected count is the mean over many
repetitions of the study, assuming no relationship.
In a study conducted in a Northern Ireland supermarket, researchers counted
the number of bottles of French, Italian, and other wine
purchased while shoppers were subject to one of three treatments: no
music, French accordion music, and Italian string music.
Background Music and Consumer Behavior
What is the expected count in the upper-left cell of
the two-way table, under H
0
?
Column total 84: Number of bottles sold without
music
Row total 99: Number of bottles of French wine
sold
Table total 243: all bottles sold during the study
The null hypothesis is that there is no relationship between music and wine
sales. The alternative is that these two variables are related.
Nine similar calculations
produce the table of
expected counts:
Music and wine purchase decision
This expected cell count is thus
(84)(99) / 243 = 34.222
The chi-square statistic (c
2
) is a measure of how much the
observed cell counts in a two-way table diverge from the
expected cell counts.
The formula for the c
2
statistic is:
(summed over all r * c cells in the table)
Tip: First, calculate the c
2
components, (observed-expected)
2
/expected, for each cell
of the table, and then sum them up to arrive at the c
2
statistic.
c
2
observed count - expected count
2
expected count