INTRODUCTION TO
STATISTICS & PROBABILITY

Chapter 9: Analysis of Two-Way Tables

Dr. Nahid Sultana

1/21/2023 Copyright© Nahid Sultana 2017-2018


Chapter 9: Analysis of Two-Way Tables

9.1 Inference for Two-Way Tables
9.2 Goodness of Fit


Inference for Two-Way Tables

➢ Two-Way Tables
➢ Expected Cell Counts
➢ The Chi-Square Statistic
➢ The Chi-Square Distributions
➢ The Chi-Square Test

Two-Way Tables

Objectives:
➢ Given a two-way table, test whether two categorical variables are
associated.
Association arises in two forms:
➢ Compare two or more populations to see if they have the same
distribution of a categorical variable.
➢ Examine two categorical variables of one population to see if they are
independent.
Testing for association of two categorical variables:
➢ In Section 2.6, we examined association informally by looking at the joint
distribution and the conditional distributions.
➢ Here we will perform formal hypothesis tests to determine whether two
categorical variables are associated.

Expected Cell Counts

Two-way tables sort the data according to two categorical variables.


We want to test the hypothesis that there is no relationship between
these two categorical variables (H0).

To test this hypothesis, we compare actual counts from the sample data
with expected counts, given the null hypothesis of no relationship.

The expected count in any cell of a two-way table when H0 is true is:

Expected count = (row total × column total) / table total


The Problem of Multiple Comparisons
H0: There is no difference in the distribution of a categorical variable
for several populations or treatments.
Ha: There is a difference in the distribution of a categorical variable
for several populations or treatments.

To see if the data give convincing evidence against the null hypothesis,
we compare the observed counts in a two-way table with the counts we
would expect if H0 were true.

The test statistic that makes the comparison is the chi-square statistic.


The Chi-Square Statistic
The chi-square statistic is a measure of how far the observed
counts are from the expected counts. The formula for the statistic is:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

where “observed” represents an observed cell count, “expected”
represents the expected count for the same cell, and the sum is over
all r × c cells in the table.

➢ A large value of χ² provides evidence against the null hypothesis.
➢ The P-value for a χ² test comes from comparing the value of the χ²
statistic with critical values for a chi-square distribution.
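
As a minimal sketch of the formula, the statistic can be computed by summing (observed − expected)² / expected over every cell. The function name `chi_square_statistic` and the small 2 × 2 table are illustrative assumptions, not data from these slides.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over every cell of two same-shaped tables."""
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for o, e in zip(obs_row, exp_row):
            total += (o - e) ** 2 / e
    return total


# Hypothetical 2 x 2 table of observed counts (illustration only).
observed = [[10, 20], [30, 40]]
# Expected counts under H0: row total * column total / table total.
expected = [[12.0, 18.0], [28.0, 42.0]]

chi2 = chi_square_statistic(observed, expected)
print(round(chi2, 3))
```

Each cell contributes a non-negative term, so cells whose observed count is far from the expected count dominate the total.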



The Chi-Square Distributions
The chi-square distributions are a family of distributions that take only
positive values and are skewed to the right.
A particular χ² distribution is specified by giving its degrees of freedom.
The χ² test for a two-way table with r rows and c columns uses critical values
from the χ² distribution with (r – 1)(c – 1) degrees of freedom.
The P-value is the area under the density curve of this χ² distribution to the
right of the value of the test statistic.
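
The right-tail area can also be computed directly instead of being read from a table. The sketch below uses only the Python standard library and relies on the finite-sum expressions that hold when the degrees of freedom are a whole number. The function name `chi_square_p_value` is an assumption made for illustration, and the two calls use statistic values that appear in this chapter's worked examples (18.28 with df = 4, and 2.8 with df = 5).

```python
import math


def chi_square_p_value(x, df):
    """Right-tail area P(X >= x) for a chi-square distribution.

    Uses the closed-form finite sums that hold for whole-number df.
    """
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite correction sum.
    term = math.sqrt(x)  # first term of the sum: x^(1/2) / 1
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)  # next term: multiply by x / (2r + 1)
    density = math.exp(-half) / math.sqrt(2 * math.pi)
    return math.erfc(math.sqrt(half)) + 2 * density * total


print(chi_square_p_value(18.28, 4))
print(chi_square_p_value(2.8, 5))
```

In practice a statistics library would normally be used for this step; the hand-written version is only meant to show that the P-value is a tail area of the distribution with the stated degrees of freedom.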
Computations for Two-Way Tables
When analyzing relationships between two categorical variables,
follow this procedure:
1. Calculate descriptive statistics that convey the important information
in the table. Usually these will be column or row percents.
2. Find the expected counts and use them to compute the χ² statistic.
3. Compare your χ² statistic to the chi-square critical values from
Table F to find the approximate P-value for your test.
4. Draw a conclusion about the association between the row and
column variables.

Example: Music and chocolate purchase decision

What is the relationship between the type of music played in supermarkets
and the type of chocolate purchased? Are the distributions of chocolate
purchases under the three music treatments similar or different?

Observed counts (rows: chocolate purchased; columns: music played)

Chocolate   None   French   Italian   Total
French        30       39        30      99
Italian       11        1        19      31
Other         43       35        35     113
Total         84       75        84     243

If the conditional distributions are the same, there is no association.

Comparing Conditional Distributions
✓ Sales of Italian chocolate are very low (1.3%) when French music is
playing but are higher when Italian music (22.6%) or no music (13.1%) is
playing.
✓ French chocolate appears popular in this market, selling well under all
music conditions but notably better when French music is playing.
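
The percentages quoted for this example are column percents: each count divided by the total for its music condition. A minimal sketch, assuming the table is stored as a plain dictionary (the variable names are illustrative):

```python
# Observed counts: rows are chocolate types, columns are music conditions.
music = ["None", "French", "Italian"]
observed = {
    "French": [30, 39, 30],
    "Italian": [11, 1, 19],
    "Other": [43, 35, 35],
}

# Column totals: number of purchases under each music condition.
column_totals = [
    sum(row[j] for row in observed.values()) for j in range(len(music))
]

# Conditional distribution of chocolate type given the music played (percent).
percents = {
    chocolate: [100 * count / total for count, total in zip(row, column_totals)]
    for chocolate, row in observed.items()
}

for chocolate, row in percents.items():
    print(chocolate, [round(p, 1) for p in row])
```

If the three columns of percents were nearly identical, the data would give little evidence of an association between music and chocolate choice.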
Example: Music and chocolate purchase decision (Cont…)
Observed count and expected counts

H0: There is no difference in the distributions of chocolate purchases at
this store when no music, French music, or Italian music is played.
Ha: There is a difference in the distributions of chocolate purchases at
this store when no music, French music, or Italian music is played.

We obtain the observed counts from the data.
We compute the expected counts from the row and column totals.
➢ “Expected” counts are the counts we “expect” to see if H0 is true, i.e.,
if there is no relationship.
➢ Expected count = (row total × column total) / table total

Example: Music and chocolate purchase decision (Cont…)
Observed count and expected counts

Observed counts
Chocolate   None   French   Italian   Total
French        30       39        30      99
Italian       11        1        19      31
Other         43       35        35     113
Total         84       75        84     243

Expected counts
Chocolate    None   French   Italian   Total
French      34.22    30.56     34.22      99
Italian     10.72     9.57     10.72      31
Other       39.06    34.88     39.06     113
Total          84       75        84     243

The expected count of French chocolate bought when no music was
playing: (99 × 84) / 243 = 34.22
➢ Expected counts typically have decimals.
➢ The expected counts and the observed counts have the same marginal totals.
➢ The larger the discrepancy between them, the more evidence we have that the
null hypothesis is incorrect.
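
The expected counts for the music and chocolate table can be reproduced from the row and column totals alone. This is a minimal sketch; the helper name `expected_counts` is an assumption made for illustration.

```python
def expected_counts(observed):
    """Expected count for each cell: row total * column total / table total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    table_total = sum(row_totals)
    return [[r * c / table_total for c in col_totals] for r in row_totals]


# Rows: French, Italian, Other chocolate; columns: no, French, Italian music.
observed = [[30, 39, 30], [11, 1, 19], [43, 35, 35]]
expected = expected_counts(observed)

for row in expected:
    print([round(e, 2) for e in row])
```

Summing any row or column of the result returns the same marginal total as the observed table, which is a quick check on the arithmetic.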
Example: Music and chocolate purchase decision (Cont…)
Chi-square statistic and P-value

The χ² statistic is the sum of nine terms, one for each cell of the table:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} = \frac{(30 - 34.22)^2}{34.22} + \frac{(39 - 30.56)^2}{30.56} + \cdots + \frac{(35 - 39.06)^2}{39.06}$$

$$\chi^2 = 0.52 + 2.33 + \cdots + 0.42 = 18.28$$

To find the P-value using a chi-square table, look in the row for
df = (3 − 1)(3 − 1) = 4:

        P
df   .0025    .001
4    16.42   18.47

The small P-value (between 0.001 and 0.0025) gives us convincing evidence
to reject H0 and conclude that there is a difference in the distributions of
chocolate purchases at this store when no music, French music, or Italian music
is played.
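
The whole calculation for this example can be sketched in a few lines. The code recomputes the expected counts without rounding, sums the nine terms, and checks where the statistic falls relative to the two critical values for df = 4 quoted from the chi-square table. Variable names are illustrative.

```python
# Rows: French, Italian, Other chocolate; columns: no, French, Italian music.
observed = [[30, 39, 30], [11, 1, 19], [43, 35, 35]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum (O - E)^2 / E over all nine cells, with E = row total * column total / n.
chi2 = 0.0
for i, r in enumerate(row_totals):
    for j, c in enumerate(col_totals):
        e = r * c / n
        chi2 += (observed[i][j] - e) ** 2 / e

df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Critical values for df = 4, keyed by right-tail area, as quoted above.
critical = {0.0025: 16.42, 0.001: 18.47}
p_between = critical[0.0025] < chi2 < critical[0.001]

print(round(chi2, 2), df, p_between)
```

Because the statistic lies between the two critical values, the P-value lies between the two corresponding tail areas.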
Example: Smoking Habits

                        Student smokes   Student doesn’t smoke   Row total
Both parents smoke                 400                    1380        1780
One parent smokes                  416                    1823        2239
Neither parent smokes              188                    1168        1356
Column total                      1004                    4371        5375

Conditional distribution of student smokers for different
parental smoking statuses:
➢ Percent of students who smoke when both parents
smoke = 400/1780 = 22.5%
➢ Percent of students who smoke when one parent
smokes = 416/2239 = 18.6%
➢ Percent of students who smoke when neither parent
smokes = 188/1356 = 13.9%
“Is student smoking associated with parent’s smoking?”
a) H0: There is NO association between student smoking and parent’s smoking.
b) Ha: There is an association between student smoking and parent’s smoking.

Example: Smoking Habits (Cont…)

p1 = proportion of students who smoke when both parents smoke
p2 = proportion of students who smoke when one parent smokes
p3 = proportion of students who smoke when neither parent smokes

If the conditional distributions are the same, there is no association.

If p1 = p2 = p3, then there is no relationship between student smoking and
whether both, one, or neither parent smokes.
H0: There is NO association between student smoking and parent’s
smoking.
Ha: There is an association between student smoking and parent’s smoking.

Example: Smoking Habits (Cont…)

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

Expected count = (row total × column total)/table total
Expected count of students who smoke when both parents smoke
= (1004 × 1780)/5375 = 332.5

Observed counts, with expected counts in parentheses:

                        Student smokes   Student doesn’t smoke   Row total
Both parents smoke         400 (332.5)           1380 (1447.5)        1780
One parent smokes          416 (418.2)           1823 (1820.8)        2239
Neither parent smokes      188 (253.3)           1168 (1102.7)        1356
Column total                      1004                    4371        5375

χ² = 37.57, df = (r − 1)(c − 1) = (3 − 1)(2 − 1) = 2
The small P-value (0.0000 to four decimal places, so P < 0.001) gives us
convincing evidence to reject H0 and conclude that
there is an association between student smoking and
parent’s smoking.
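
The numbers in this example can be checked with a short sketch. Note that (1004 × 1356)/5375 gives about 253.3 for the expected count of student smokers with no smoking parent. For df = 2 specifically, the right-tail area of the chi-square distribution is exp(−x/2), so the P-value needs no table here. Variable names are illustrative.

```python
import math

# Rows: both, one, neither parent smokes; columns: student smokes, doesn't.
observed = [[400, 1380], [416, 1823], [188, 1168]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under H0 of no association.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Closed form valid only for df = 2: P(X >= x) = exp(-x / 2).
p_value = math.exp(-chi2 / 2)

print([[round(e, 1) for e in row] for row in expected])
print(round(chi2, 2), df, p_value)
```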
Conditions of a Chi-Square Test:

Expected counts, music and chocolate example:

Chocolate    None   French   Italian   Total
French      34.22    30.56     34.22      99
Italian     10.72     9.57     10.72      31
Other       39.06    34.88     39.06     113
Total          84       75        84     243

Expected counts, smoking habits example:

                        Student smokes   Student doesn’t smoke   Row total
Both parents smoke             (332.5)                (1447.5)        1780
One parent smokes              (418.2)                (1820.8)        2239
Neither parent smokes          (253.3)                (1102.7)        1356
Column total                      1004                    4371        5375

➢ The average of the expected counts is greater than 5.
➢ All individual expected counts are 1 or greater.
➢ For 2×2 tables, all four expected counts should be 5 or greater.
If these requirements fail, then two or more groups must be
combined to form a “smaller” two-way table.
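
These rule-of-thumb conditions are easy to check mechanically once the expected counts are known. The sketch below encodes the three conditions as stated on this slide; the function name `chi_square_conditions_met` is an assumption made for illustration.

```python
def chi_square_conditions_met(expected):
    """Check the slide's conditions on a table of expected counts."""
    cells = [e for row in expected for e in row]
    is_two_by_two = len(expected) == 2 and all(len(row) == 2 for row in expected)
    if is_two_by_two:
        # 2 x 2 tables: all four expected counts should be 5 or greater.
        return all(e >= 5 for e in cells)
    # Larger tables: average above 5 and every cell at least 1.
    average_ok = sum(cells) / len(cells) > 5
    smallest_ok = min(cells) >= 1
    return average_ok and smallest_ok


# Expected counts from the music and chocolate example.
chocolate_expected = [
    [34.22, 30.56, 34.22],
    [10.72, 9.57, 10.72],
    [39.06, 34.88, 39.06],
]
print(chi_square_conditions_met(chocolate_expected))
```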
The Chi-Square Test for Goodness of Fit

Example: M&M’S milk chocolate candies. Here’s what the company’s Consumer Affairs
Department says about the color distribution of its M&M’S milk chocolate
candies:
On average, the new mix of colors of M&M’S milk chocolate candies will
contain 13 percent of each of browns and reds, 14 percent yellows, 16
percent greens, 20 percent oranges, and 24 percent blues.
➢ The one-way table below summarizes the data from a sample bag of
M&M’S milk chocolate candies. In general, one-way tables display the
distribution of a categorical variable for the individuals in a sample.

Color   Blue   Orange   Green   Yellow   Red   Brown   Total
Count      9        8      12       15    10       6      60

The sample proportion of blue M&M’S is p̂ = 9/60 = 0.15.

The Chi-Square Test for Goodness of Fit (Cont…)
We can write the hypotheses in symbols as:
H0: p_blue = 0.24, p_orange = 0.20, p_green = 0.16,
    p_yellow = 0.14, p_red = 0.13, p_brown = 0.13
Ha: At least one of the p_i’s is incorrect
where p_color = the true population proportion of M&M’S milk chocolate
candies of that color.
The idea of the chi-square test for goodness of fit is this: we compare the
observed counts from our sample with the counts that would be expected if
H0 is true. The more the observed counts differ from the expected counts,
the more evidence we have against the null hypothesis.

In general, the expected counts can be obtained by multiplying the
proportion of the population distribution in each category by the sample
size.

The Chi-Square Test for Goodness of Fit (Cont…)
Assuming that the color distribution stated by Mars, Inc. is true, 24% of all
M&M’S milk chocolate candies produced are blue.
For random samples of 60 candies, the average number of blue M&M’S
should be (0.24)(60) = 14.40. This is our expected count of blue M&M’S.
Using this same method, we can find the expected counts for the other color
categories:
Orange: (0.20)(60) = 12.00
Green: (0.16)(60) = 9.60
Yellow: (0.14)(60) = 8.40
Red: (0.13)(60) = 7.80
Brown: (0.13)(60) = 7.80
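
For a goodness-of-fit test, each expected count is simply the claimed proportion times the sample size. A minimal sketch with the claimed M&M’S color distribution (variable names are illustrative):

```python
# Claimed population proportions for each color (from the company's statement).
claimed = {
    "Blue": 0.24,
    "Orange": 0.20,
    "Green": 0.16,
    "Yellow": 0.14,
    "Red": 0.13,
    "Brown": 0.13,
}
sample_size = 60

# Expected count = claimed proportion * sample size.
expected = {color: p * sample_size for color, p in claimed.items()}

for color, count in expected.items():
    print(f"{color}: {count:.2f}")
```

The expected counts add up to the sample size, just as the claimed proportions add up to 1.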


The Chi-Square Test for Goodness of Fit (Cont…)
To calculate the chi-square statistic, use the same formula as you did
earlier in the chapter.

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

$$\chi^2 = \frac{(9 - 14.40)^2}{14.40} + \frac{(8 - 12.00)^2}{12.00} + \frac{(12 - 9.60)^2}{9.60} + \frac{(15 - 8.40)^2}{8.40} + \frac{(10 - 7.80)^2}{7.80} + \frac{(6 - 7.80)^2}{7.80}$$

$$\chi^2 = 2.025 + 1.333 + 0.600 + 5.186 + 0.621 + 0.415 = 10.180$$
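
The six terms of this sum can be reproduced as follows. This is a sketch using the sample counts and claimed proportions from the example; variable names are illustrative.

```python
# Observed counts from the sample bag and the claimed proportions.
observed = {"Blue": 9, "Orange": 8, "Green": 12, "Yellow": 15, "Red": 10, "Brown": 6}
claimed = {"Blue": 0.24, "Orange": 0.20, "Green": 0.16,
           "Yellow": 0.14, "Red": 0.13, "Brown": 0.13}

n = sum(observed.values())

# One term (O - E)^2 / E per color, with E = claimed proportion * n.
terms = {}
for color, o in observed.items():
    e = claimed[color] * n
    terms[color] = (o - e) ** 2 / e

chi2 = sum(terms.values())
df = len(observed) - 1  # number of categories minus one

for color, t in terms.items():
    print(f"{color}: {t:.3f}")
print(round(chi2, 3), df)
```

Looking at the individual terms shows which categories drive the statistic; here the yellow candies contribute about half of the total.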
The Chi-Square Test for Goodness of Fit (Cont…)

We computed the chi-square statistic for our sample of 60 M&M’S to be
χ² = 10.180. Because all of the expected counts are at least 5, the χ²
statistic will follow a chi-square distribution with df = 6 − 1 = 5 reasonably
well when H0 is true.

          P
df    .15     .10     .05
4    6.74    7.78    9.49
5    8.12    9.24   11.07
6    9.45   10.64   12.59

The value χ² = 10.180 falls between the critical values 9.24 and 11.07. The
corresponding areas in the right tail of the chi-square distribution with df = 5
are 0.10 and 0.05. So, the P-value for a test based on our sample data is
between 0.05 and 0.10.

Since our P-value is between 0.05 and 0.10, it is greater than α = 0.05.
Therefore, we fail to reject H0. We don’t have sufficient evidence to
conclude that the company’s claimed color distribution is incorrect.
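
Reading a P-value range off a critical-value table amounts to finding the two neighbouring critical values that bracket the statistic. A minimal sketch, assuming one table row is stored as a mapping from tail area to critical value (the function name `bracket_p_value` is hypothetical):

```python
def bracket_p_value(statistic, table_row):
    """Return (lower, upper) bounds on the P-value from one table row.

    table_row maps a right-tail area to its critical value.
    """
    lower, upper = 0.0, 1.0
    for tail_area, critical_value in table_row.items():
        if statistic >= critical_value:
            # Statistic reaches this critical value, so P is at most tail_area.
            upper = min(upper, tail_area)
        else:
            # Statistic falls short of it, so P is greater than tail_area.
            lower = max(lower, tail_area)
    return lower, upper


# Row for df = 5 from the table excerpt in this example.
row_df5 = {0.15: 8.12, 0.10: 9.24, 0.05: 11.07}
bounds = bracket_p_value(10.180, row_df5)
print(bounds)
```

The bounds only narrow the P-value down to a range; comparing the lower bound with α = 0.05 is enough to decide that H0 is not rejected.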
Examples

In the table below, we examine the relationship between final grade and the
reported hours per week each student said they studied for the course.

The expected count of those who studied between 5 and 10 hours per week
and earned a B for the course is:
a. 4.672.
b. 4.266.
c. 8.265.

Answer: a. (13 × 23)/64 = 4.671875

Examples

A die is tossed 60 times and the number of dots appearing on the
top face is recorded in the table below.

Top Face           1    2    3    4    5    6
# of occurrences   8   12   13    9   11    7

An investigator wants to know if there is enough evidence to indicate
that the die is not fair. What is the most appropriate test?
a. Chi-square: Independence test
b. Chi-square: Comparing several proportions test
c. Chi-square: Goodness of fit test

Examples

A die is tossed 60 times and the number of dots appearing on the top
face is recorded in the table below.

Top Face           1    2    3    4    5    6
# of occurrences   8   12   13    9   11    7

An investigator wants to know if there is enough evidence to indicate that
the die is not fair. What is the expected number of 3’s if the die is fair?
a. 6
b. 7
c. 10
d. 11

Answer: c. (1/6) × 60 = 10

Examples

A die is tossed 60 times and the number of dots appearing on the top
face is recorded in the table below.

Top Face           1    2    3    4    5    6
# of occurrences   8   12   13    9   11    7

An investigator wants to know if there is enough evidence to indicate
that the die is not fair. What are the appropriate degrees of freedom
for the chi-square test?
a. 5
b. 6
c. 10
d. 12

Answer: a. df = k − 1 = 6 − 1 = 5

Examples

A die is tossed 60 times and the number of dots appearing on the top
face is recorded in the table below.

Top Face           1    2    3    4    5    6
# of occurrences   8   12   13    9   11    7

An investigator wants to know if there is enough evidence to indicate
that the die is not fair.

The value of the chi-square statistic is 2.8. Is there sufficient evidence to
conclude that the die is not fair? Use α = 0.05.
a. Yes
b. No

Answer: b. Critical value = 11.07; P-value = 0.7308
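
The die questions can be checked together with a short sketch: a fair die gives the same expected count for every face, the statistic is the usual sum of (observed − expected)² / expected, and the decision compares it with the critical value 11.07 for df = 5 at the 0.05 level. Variable names are illustrative.

```python
# Observed number of occurrences of faces 1 through 6 in 60 tosses.
counts = [8, 12, 13, 9, 11, 7]

n = sum(counts)
expected = n / len(counts)  # fair die: each face expected n * (1/6) times

chi2 = sum((o - expected) ** 2 / expected for o in counts)
df = len(counts) - 1

critical_value = 11.07  # chi-square critical value for df = 5, tail area 0.05
reject_h0 = chi2 > critical_value

print(expected, round(chi2, 1), df, reject_h0)
```

Since the statistic is well below the critical value, the data give no convincing evidence that the die is unfair.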