Cap 15 Doane

A PowerPoint Presentation Package to Accompany
Applied Statistics in Business & Economics, 6th edition
David P. Doane and Lori E. Seward
Prepared by Lloyd R. Jaisingh
Copyright ©2019 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.
Chapter 15
Chi-Square Tests
Chapter Contents
15.1 Chi-Square Test for Independence
15.2 Chi-Square Tests for Goodness-of-Fit
15.3 Uniform Goodness-of-Fit Test
15.4 Poisson Goodness-of-Fit Test
15.5 Normal Chi-Square Goodness-of-Fit Test
15.6 ECDF Tests (Optional)
Chapter Learning Objectives (LOs)
LO15-1: Recognize a contingency table and understand how it is created.

LO15-2: Find degrees of freedom and use the chi-square table of critical
values.
LO15-3: Perform a chi-square test for independence on a contingency
table.
LO15-4: Perform a goodness-of-fit (GOF) test for a multinomial
distribution.
LO15-5: Perform a GOF test for a uniform distribution.
LO15-6: Explain the GOF test for a Poisson distribution.

LO15-7: Explain the chi-square GOF test for normality.
LO15-8: Interpret ECDF tests and know their advantages compared to
chi-square GOF tests.
15.1 Chi-Square Test for Independence
LO15-1: Recognize a contingency table and understand
how it is created.
Contingency Tables
• A contingency table is a cross-tabulation of n paired observations into
categories.
• Each cell shows the count of observations that fall into the
category defined by its row and column heading as shown in Table 15.2.
For example: Marketing researchers surveyed 291 websites in
three nations (France, U.K., U.S.) and obtained the contingency
table shown here as Table 15.1. Is the location of the privacy disclaimer
independent of the website’s nationality? This question can be answered
by using a test based on the frequencies in this contingency table.
Chi-Square Test
• In a test of independence for an r × c contingency table, the
hypotheses are
H0: Variable A is independent of variable B
H1: Variable A is not independent of variable B
• Use the chi-square test for independence to test these hypotheses.
• This non-parametric test is based on frequencies.
• The n data pairs are classified into c columns and r rows, and then the
observed frequency fjk is compared with the expected frequency ejk
under the assumption of independence.
• The chi-square test statistic measures the relative difference between
expected and observed frequencies:
$$\chi^2_{\text{calc}} = \sum_{j=1}^{r} \sum_{k=1}^{c} \frac{(f_{jk} - e_{jk})^2}{e_{jk}}$$
• If the two variables are independent, then fjk should be close to ejk, leading to
a chi-square test statistic near zero. Conversely, large differences
between fjk and ejk will lead to a large chi-square test statistic.
• The chi-square test statistic cannot be negative (due to squaring), so the test
is always right-tailed. If the test statistic is far enough in the right tail, we
reject the hypothesis of independence.
LO15-2: Find degrees of freedom and use the chi-square
table of critical values.
Chi-Square Distribution
• The critical value comes from the chi-square probability distribution
with (r – 1)(c – 1) degrees of freedom.
df = degrees of freedom = (r – 1)(c – 1)
where r = number of rows in the table
c = number of columns in the table
• Appendix E contains critical values for right-tail areas of the
chi-square distribution.
• The Excel function =CHISQ.INV.RT(α, df) also gives the right-tail
critical value.
• Consider the shape of the chi-square distribution. As the degrees
of freedom increase, the shape begins to resemble a normal,
bell-shaped curve.
• However, for any contingency table you are likely to encounter,
degrees of freedom will not be large enough to assume normality.
Expected Frequencies
• Assuming that H0 is true, the expected frequency of row j and
column k is:
$$e_{jk} = \frac{R_j C_k}{n}$$
where
Rj = total for row j (j = 1, 2, …, r)
Ck = total for column k (k = 1, 2, …, c)
n = sample size
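As a rough illustration of the expected-frequency rule (row total times column total, divided by n), here is a minimal Python sketch. The function name `expected_frequencies` and the 2 × 3 table of counts are hypothetical, chosen only to make the arithmetic easy to follow; they are not taken from the textbook.

```python
def expected_frequencies(observed):
    """Return expected counts e_jk = R_j * C_k / n under independence."""
    row_totals = [sum(row) for row in observed]        # R_j
    col_totals = [sum(col) for col in zip(*observed)]  # C_k
    n = sum(row_totals)                                # sample size
    return [[r_j * c_k / n for c_k in col_totals] for r_j in row_totals]


# Hypothetical 2 x 3 contingency table (illustrative counts only).
observed = [[20, 30, 50],
            [30, 20, 50]]
expected = expected_frequencies(observed)
```

Here both row totals are 100 and the column totals are 50, 50, and 100, so each row has expected counts 25, 25, and 50.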
LO15-3: Perform a chi-square test for independence
on a contingency table.
Steps in Testing the Hypotheses

Step 1: State the Hypotheses
• H0: Variable A is independent of variable B
• H1: Variable A is not independent of variable B
Step 2: Specify the Decision Rule
• Calculate df = (r – 1)(c – 1).
• For a given α, look up the right-tail critical value χ²_R from
Appendix E or by using Excel.
• Reject H0 if the test statistic > χ²_R.
• Instead of using Appendix E, you can use the Excel function
=CHISQ.INV.RT(α, df) to get the right-tail critical value.
• For example, for df = 6 and α = .05, χ²_.05 = 12.59.
[Figure: chi-square distribution with the rejection region in the right tail]
Step 3: Calculate the Test Statistic
• The expected frequencies are computed from
$$e_{jk} = \frac{R_j C_k}{n}$$
• The chi-square test statistic is
$$\chi^2_{\text{calc}} = \sum_{j=1}^{r} \sum_{k=1}^{c} \frac{(f_{jk} - e_{jk})^2}{e_{jk}}$$
Step 4: Make the Decision
• Reject H0 if χ²_calc > χ²_R or if the p-value ≤ α.
Step 5: Take Action
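The steps can be sketched in a few lines of Python. This is only an illustration: the function name `chi_square_independence` and the 4 × 3 table of counts are hypothetical, and the critical value 12.59 is the df = 6, α = .05 value quoted earlier in the slides rather than something the code computes.

```python
def chi_square_independence(observed):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for j, row in enumerate(observed):
        for k, f_jk in enumerate(row):
            e_jk = row_totals[j] * col_totals[k] / n  # expected count under H0
            stat += (f_jk - e_jk) ** 2 / e_jk
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df


# Hypothetical 4 x 3 table of counts (not the textbook's Table 15.1).
observed = [[10, 20, 30],
            [20, 20, 20],
            [30, 20, 10],
            [20, 20, 20]]
stat, df = chi_square_independence(observed)

CRITICAL_05_DF6 = 12.59            # right-tail critical value for df = 6, alpha = .05
reject_h0 = stat > CRITICAL_05_DF6  # Step 4: right-tailed decision rule
```

Every expected count in this table is 20, so the statistic is 5 + 5 + 5 + 5 = 20 with df = 6, which exceeds 12.59 and leads to rejecting independence for these made-up counts.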
Test of Two Proportions

• For a 2 × 2 contingency table, the chi-square test is equivalent to a
two-tailed z test for two proportions, if the samples are large
enough to ensure normality.
• The hypotheses for a two-tailed test are:
H0: π1 = π2
H1: π1 ≠ π2
Figure 14.6
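• The z test statistic is computed from the pooled proportion:
$$z_{\text{calc}} = \frac{p_1 - p_2}{\sqrt{\bar{p}(1 - \bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}, \qquad \bar{p} = \frac{x_1 + x_2}{n_1 + n_2}$$
• Reject H0 if |z_calc| > z_α/2.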
Figure 14.6
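To see the equivalence numerically, the sketch below computes the pooled two-proportion z statistic and the chi-square statistic for the same 2 × 2 table and checks that z² matches chi-square (no continuity correction is applied). The counts and function names are hypothetical and serve only as an illustration.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic for H0: pi1 = pi2."""
    p1, p2 = x1 / n1, x2 / n2
    p_bar = (x1 + x2) / (n1 + n2)  # pooled proportion
    se = math.sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


def chi_square_2x2(x1, n1, x2, n2):
    """Chi-square statistic for the 2 x 2 table of successes and failures."""
    observed = [[x1, n1 - x1], [x2, n2 - x2]]
    row_totals = [n1, n2]
    col_totals = [x1 + x2, (n1 - x1) + (n2 - x2)]
    n = n1 + n2
    return sum((observed[j][k] - row_totals[j] * col_totals[k] / n) ** 2
               / (row_totals[j] * col_totals[k] / n)
               for j in range(2) for k in range(2))


# Hypothetical samples: 30 successes in 100 trials versus 20 in 100.
z_calc = two_proportion_z(30, 100, 20, 100)
chi2_calc = chi_square_2x2(30, 100, 20, 100)
```

For these counts z is about 1.633 and both z² and the chi-square statistic are about 2.667.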
Small Expected Frequencies

• The chi-square test is unreliable if the expected frequencies are too
small.
• Rules of thumb:
  - Cochran’s Rule requires that ejk > 5 for all cells.
  - Another rule of thumb is that up to 20% of the cells may have ejk < 5.
• Most agree that a chi-square test is infeasible if ejk < 1 in any cell.
• If this happens, try combining adjacent rows or columns to enlarge
the expected frequencies.
Cross-Tabulating Raw Data

• Chi-square tests for independence can also be used to analyze
quantitative variables by coding them into categories.
• For example, the variables Infant Deaths per 1,000 and Doctors
per 100,000 can each be coded into various categories:
Why Do a Chi-Square Test on Numerical Data?

• The researcher may believe there’s a relationship between X and
Y, but doesn’t want to make an assumption on its form (linear,
quadratic etc.) as required by regression.
• There are outliers or anomalies that prevent us from assuming that
the data came from a normal population. Unlike correlation and
regression, the chi-square test does not require any normality
assumptions.
• The researcher has numerical data for one variable but not the
other. A chi-square test can be used if we convert the numerical
variable into categories.
3-Way Tables and Higher

• More than two variables can be compared using contingency
tables.
• However, it is difficult to visualize a higher order table.
• For example, you could visualize a cube as a stack of tiled 2-way
contingency tables.
• Major computer packages permit 3-way tables.
15.2 Chi-Square Tests for Goodness-of-Fit
LO15-4: Perform a goodness-of-fit (GOF) test for a
multinomial distribution.
Purpose of the Test
• The goodness-of-fit (GOF) test helps you decide whether your
sample resembles a particular kind of population.
• The chi-square test will be used because it is versatile and easy
to understand.
Multinomial GOF Test

• A multinomial distribution is defined by any k probabilities π1, π2, …, πk
that sum to unity.
• For example, consider the following “official” proportions of M&M
colors.
• The hypotheses are
H0: π1 = .13, π2 = .13, π3 = .24, π4 = .20, π5 = .16, π6 = .14
H1: At least one of the πj differs from the hypothesized value.
• No parameters are estimated (m = 0) and there are c = 6 classes, so
the degrees of freedom are df = c – m – 1 = 6 – 0 – 1 = 5.
Test Statistic and Degrees of Freedom for GOF
• Assuming n observations, the observations are grouped into c
classes and then the chi-square test statistic is found using:
$$\chi^2_{\text{calc}} = \sum_{j=1}^{c} \frac{(f_j - e_j)^2}{e_j}$$
where fj = the observed frequency of observations in class j
ej = the expected frequency in class j if H0 were true
• If the proposed distribution gives a good fit to the sample, the test
statistic will be near zero.
• The test statistic follows the chi-square distribution with
df = c – m – 1 degrees of freedom, where c is the number of classes
(bins) used in the test and m is the number of parameters estimated.
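A minimal Python sketch of the multinomial GOF calculation follows. The six hypothesized proportions are the ones listed in H0 on the slides; the observed counts (n = 200) are hypothetical, and the critical value 11.07 for df = 5 at α = .05 is taken from a standard chi-square table rather than computed.

```python
def gof_statistic(observed, proportions, m=0):
    """Return (chi-square statistic, df) for a goodness-of-fit test.

    observed    -- observed counts f_j, one per class
    proportions -- hypothesized class probabilities (must sum to 1)
    m           -- number of parameters estimated from the sample
    """
    n = sum(observed)
    expected = [n * p for p in proportions]  # e_j = n * pi_j
    stat = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
    df = len(observed) - m - 1
    return stat, df


hypothesized = [0.13, 0.13, 0.24, 0.20, 0.16, 0.14]  # proportions from H0
observed = [30, 22, 50, 38, 35, 25]                  # hypothetical counts, n = 200

stat, df = gof_statistic(observed, hypothesized)
CRITICAL_05_DF5 = 11.07      # table value for df = 5, alpha = .05
reject_h0 = stat > CRITICAL_05_DF5
```

With these made-up counts the statistic is roughly 2.02 on 5 degrees of freedom, well below 11.07, so the hypothesized proportions would not be rejected.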
Small Expected Frequencies

• Goodness-of-fit tests may lack power in small samples. As a guideline, a
chi-square goodness-of-fit test should be avoided if n < 25. Cochran’s Rule
that expected frequencies should be at least 5 (i.e., all ej ≥ 5) also provides
a guideline, although some experts would weaken the rule to require
only ej ≥ 2.
GOF Test for Other Distributions
• The hypotheses are:
H0: The population follows a _____ distribution
H1: The population does not follow a ______ distribution
• The blank may contain the name of any theoretical distribution (e.g.,
uniform, Poisson, normal).
• In a GOF test, if we use sample data to estimate the distribution’s
parameters, then our degrees of freedom would be as follows:
df = c – m – 1, where c is the number of classes and m is the number of
parameters estimated.
Data-Generating Situations
• Instead of “fishing” for a good-fitting model, visualize a priori the
characteristics of the underlying data-generating process.
• It is undoubtedly true that the most common GOF test is for the
normal distribution, simply because so many parametric tests
assume normality, and that assumption must be tested. Also, the
normal distribution may be used as a default benchmark for any
mound-shaped data that have centrality and tapering tails, as long
as you have reason to believe that a constant mean and variance
would be reasonable.
• However, you would not consider a Poisson distribution for
continuous data or certain integer variables because a Poisson
model only applies to integer data on arrivals or rare, independent
events.
• We remind you of this because software makes it possible to fit
inappropriate distributions all too easily.
Mixtures: A Problem
• Your sample may not resemble any known distribution. One common
problem is mixtures.
• A mixture occurs when the sample was created by more than one
data-generating process, one superimposed on another.
• For example, adult heights of either sex would follow a normal distribution,
but a combined sample of both sexes will be bimodal, and its mean and
standard deviation may be unrepresentative of either sex.
• Obtaining a good fit is not sufficient justification for assuming a particular
model. Each probability distribution has its own logic about the nature of
the underlying process, so we also must examine the data-generating
situation and be convinced that the proposed model is both
logical and empirically apt.
Eyeball Tests
• A simple “eyeball” inspection of the histogram or dot plot may suffice
to rule out a hypothesized population.
• For example, if the sample is strongly bimodal or skewed, or if
outliers are present, we would anticipate a poor fit to a normal
distribution. The shape of the histogram can give you a rough idea
whether a normal distribution is a likely candidate for a good fit.
• You can be fairly sure that a formal test will agree with what your
common sense tells you, as long as the sample size is not too small.
• Yet a limitation of eyeball tests is that we may be unsure just how
much variation is expected for a given sample size. If anything, the
human eye is overly sensitive, causing us to commit α error
(rejecting a true null hypothesis) too often.
• People are sometimes unduly impressed by a small departure from
the hypothesized distribution, when actually it is within chance.
15.3 Uniform Goodness-of-Fit Test
LO15-5: Perform a goodness-of-fit (GOF) test for a
uniform distribution.
Uniform Distribution
• The uniform goodness-of-fit test is a special case of the multinomial
in which every value has the same chance of occurrence.
• The chi-square test for a uniform distribution compares all c groups
simultaneously.
• The hypotheses are:
H0: π1 = π2 = … = πc = 1/c
H1: Not all πj are equal
Uniform GOF Test: Grouped Data

• The test can be performed on data that are already tabulated into
groups.
• Calculate the expected frequency ej for each cell.
• The degrees of freedom are df = c – 1 because there are no
parameters for fitting the uniform distribution.
• Obtain the critical value χ²_α from Appendix E for the desired level
of significance α.
• The p-value can be obtained from the Excel function
=CHISQ.DIST.RT(χ²_calc, df).
• Reject H0 if p-value ≤ α.
Uniform GOF Test: Raw Data

• First form c bins of equal width and create a frequency distribution.
• Calculate the observed frequency fj for each bin.
• Define ej = n/c.
• Perform the chi-square calculations.
• The degrees of freedom are df = c – 1 since there are no parameters
for the uniform distribution.
• Obtain the critical value from Appendix E for a given significance
level α and make the decision.
• Maximize the test’s power by defining bin width as
$$\text{bin width} = \frac{x_{\max} - x_{\min}}{c}$$
• As a result, the expected frequencies will be as large as possible.
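The raw-data procedure can be sketched as follows, assuming the bins run from the sample minimum to the sample maximum in c equal widths. The 40 data values are hypothetical, the function name `uniform_gof` is illustrative, and the critical value 7.815 (df = 3, α = .05) comes from a standard chi-square table.

```python
def uniform_gof(data, c):
    """Bin raw data into c equal-width bins and test against e_j = n / c."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / c                 # assumed bin width: (x_max - x_min) / c
    observed = [0] * c
    for x in data:
        j = min(int((x - lo) / width), c - 1)  # x_max goes in the last bin
        observed[j] += 1
    e = len(data) / c                     # equal expected frequency per bin
    stat = sum((f - e) ** 2 / e for f in observed)
    return observed, stat, c - 1          # df = c - 1 (no parameters estimated)


# Hypothetical sample of n = 40 values between 0 and 100.
data = [0, 3, 7, 11, 12, 15, 18, 19, 21, 22, 23, 24,
        26, 29, 31, 35, 38, 41, 44, 47,
        52, 55, 58, 61, 63, 66, 68, 71, 73,
        76, 79, 82, 85, 88, 91, 93, 95, 97, 99, 100]

observed, stat, df = uniform_gof(data, c=4)
reject_h0 = stat > 7.815                  # table value for df = 3, alpha = .05
```

The four bins hold 12, 8, 9, and 11 observations against an expected 10 each, giving a statistic of 1.0, so uniformity is not rejected for this made-up sample.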
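• Calculate the mean and standard deviation of the uniform
distribution from:
$$\mu = \frac{a + b}{2}, \qquad \sigma = \sqrt{\frac{(b - a)^2}{12}}$$
where a and b are the lower and upper limits of the hypothesized uniform distribution.
• If the data are not skewed and the sample size is large (n > 30),
then the sample mean is approximately normally distributed.
• So, test the hypothesized uniform mean using
$$z_{\text{calc}} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$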
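A small sketch of this mean test, assuming the standard continuous-uniform formulas μ = (a + b)/2 and σ = √((b − a)²/12) and a z statistic of the form (x̄ − μ)/(σ/√n). The sample mean, sample size, and limits below are hypothetical, and 1.96 is the usual two-tailed critical value at α = .05.

```python
import math


def uniform_mean_z(xbar, n, a, b):
    """z statistic for testing the mean of a hypothesized uniform(a, b)."""
    mu = (a + b) / 2                      # mean of the uniform distribution
    sigma = math.sqrt((b - a) ** 2 / 12)  # its standard deviation
    return (xbar - mu) / (sigma / math.sqrt(n))


# Hypothetical sample: n = 40 observations with mean 48.0, limits 0 and 100.
z_calc = uniform_mean_z(xbar=48.0, n=40, a=0.0, b=100.0)
reject_h0 = abs(z_calc) > 1.96            # two-tailed test at alpha = .05
```

Here z is about -0.44, so the hypothesized mean of 50 would not be rejected.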
15.4 Poisson Goodness-of-Fit Test
LO15-6: Explain the GOF test for a Poisson distribution.
Poisson Data-Generating Situations

• In a Poisson distribution model, X represents the number of events
per unit of time or space.
• X is a discrete nonnegative integer (X = 0, 1, 2, …)
• Event arrivals must be independent of each other.
• Sometimes called a model of rare events because X typically has a
small mean.
Poisson Goodness-of-Fit Test

• The mean λ is the only parameter. The initial steps for the test are:
• Step 1: Tally the observed frequency fj of each x-value.
• Step 2: If λ is unknown, estimate it from the sample.
• Step 3: Use the estimated λ to find the Poisson probability P(X = x)
for each value of X.
• Step 4: Multiply P(X = x) by the sample size n to get the expected
frequencies ej.
• Step 5: Perform the chi-square calculations.
• Step 6: Make the decision.
• You may need to combine classes until expected frequencies
become large enough for the test (at least until ej > 2).
Poisson GOF Test: Tabulated Data
• Calculate the sample mean as:
$$\hat{\lambda} = \frac{\sum_{j=1}^{c} x_j f_j}{n}$$
• Using this estimated mean, calculate the Poisson probabilities
either by using the Poisson formula $P(x) = \lambda^x e^{-\lambda} / x!$ or Excel.
• For c classes with m = 1 parameter estimated, the degrees of
freedom are df = c – m – 1 = c – 2.
• Obtain the critical value for a given α from Appendix E.
• Make the decision.
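The tabulated-data procedure can be sketched in Python as below. The frequency table is hypothetical, and two simplifying assumptions are made and flagged in the comments: the top class is treated as open-ended ("4 or more") so that the probabilities sum to 1, and that class is counted as x = 4 when estimating λ. The critical value 7.815 (df = 3, α = .05) is from a standard chi-square table.

```python
import math


def poisson_gof(x_values, freqs):
    """Poisson GOF on tabulated data; the last class is treated as open-ended."""
    n = sum(freqs)
    # Estimate lambda by the sample mean (assumption: top class counted as its x).
    lam = sum(x * f for x, f in zip(x_values, freqs)) / n
    # Poisson probabilities for all but the last class.
    probs = [lam ** x * math.exp(-lam) / math.factorial(x) for x in x_values[:-1]]
    probs.append(1.0 - sum(probs))        # open-ended top class gets the remainder
    expected = [n * p for p in probs]     # e_j = n * P(X = x_j)
    stat = sum((f - e) ** 2 / e for f, e in zip(freqs, expected))
    df = len(freqs) - 1 - 1               # c - m - 1 with m = 1 (lambda estimated)
    return lam, expected, stat, df


# Hypothetical tally: counts of x = 0, 1, 2, 3 and "4 or more" events (n = 100).
x_values = [0, 1, 2, 3, 4]
freqs = [30, 36, 22, 9, 3]

lam, expected, stat, df = poisson_gof(x_values, freqs)
reject_h0 = stat > 7.815                  # table value for df = 3, alpha = .05
```

If any expected frequency came out too small, adjacent classes would be combined before computing the statistic, which also reduces c and therefore df.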
15.5 Normal Chi-Square Goodness-of-Fit Test
LO15-7: Explain the chi-square GOF test for normality.
Normal Data Generating Situations

• Two parameters, the mean μ and the standard deviation σ, fully describe
the normal distribution.
• Unless μ and σ are known a priori, they must be estimated from a
sample.
• Using these statistics, the chi-square goodness-of-fit test can be used.
Method 1: Standardizing the Data

• There are various ways to calculate the frequencies for a chi-square
test. One way is to transform the sample
observations x1, x2, . . . , xn into standardized values:
$$z_i = \frac{x_i - \bar{x}}{s}$$
• We could count the sample observations fj within intervals of the form
$\bar{x} \pm ks$ and compare them with the known frequencies ej based on the
normal distribution, as illustrated in Figure 15.13.
• Advantage is a standardized scale.
• Disadvantage is that data are no longer in the original units.
Method 2: Equal Bin Widths

• To obtain equal-width bins, divide the exact data range into c groups of
equal width.
• Step 1: Count the sample observations in each bin to get observed
frequencies fj.
• Step 2: Convert the bin limits into standardized z-values by using the
formula $z = (x - \bar{x})/s$.
• Step 3: Find the normal area within each bin assuming a normal
distribution.
• Step 4: Find expected frequencies ej by multiplying each normal area by
the sample size n.
• Classes may need to be collapsed from the ends inward to enlarge
expected frequencies.
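The four steps of the equal-bin-width method can be sketched with Python's standard library. The data values and the function name are hypothetical. One assumption is made explicit in the comments: the first and last bins are treated as open-ended when computing normal areas, so that the expected frequencies sum to n.

```python
from statistics import NormalDist, mean, stdev


def normal_gof_equal_width(data, c):
    """Chi-square GOF for normality using c equal-width bins."""
    n = len(data)
    xbar, s = mean(data), stdev(data)     # estimates of mu and sigma (m = 2)
    lo, hi = min(data), max(data)
    width = (hi - lo) / c
    # Step 1: observed frequencies f_j.
    observed = [0] * c
    for x in data:
        observed[min(int((x - lo) / width), c - 1)] += 1
    # Step 2: standardize the interior bin limits.
    z_limits = [((lo + j * width) - xbar) / s for j in range(1, c)]
    # Step 3: normal area per bin (assumption: end bins are open-ended).
    cdf = [0.0] + [NormalDist().cdf(z) for z in z_limits] + [1.0]
    areas = [cdf[j + 1] - cdf[j] for j in range(c)]
    # Step 4: expected frequencies e_j = n * area_j.
    expected = [n * a for a in areas]
    stat = sum((f - e) ** 2 / e for f, e in zip(observed, expected))
    return observed, expected, stat, c - 2 - 1   # df = c - m - 1


# Hypothetical sample of 30 measurements.
data = [61, 64, 66, 67, 68, 69, 70, 70, 71, 71, 72, 72, 72, 73, 73,
        74, 74, 74, 75, 75, 76, 76, 77, 78, 79, 80, 81, 83, 85, 88]
observed, expected, stat, df = normal_gof_equal_width(data, c=5)
```

In practice the end bins often have small expected frequencies, which is where collapsing classes from the ends inward would come in; the sketch leaves that step out.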
Method 3: Equal Expected Frequencies

• Define histogram bins in such a way that an equal number of observations
would be expected within each bin under the null hypothesis.
• Define bin limits so that ej = n/c
• A normal area of 1/c in each of the c bins is desired.
• The first and last classes must be open-ended for a normal distribution, so
to define c bins, we need c – 1 cut-points.
• The upper limit of bin j can be found directly by using Excel.
• Alternatively, find zj for bin j using Excel and then calculate the upper limit
for bin j as $\bar{x} + z_j s$.
• Once the bins are defined, count the observations fj within each bin and
compare them with the expected frequencies ej = n/c.
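As an illustration of the equal-expected-frequency method, the sketch below uses Python's `statistics.NormalDist` in place of Excel to find the c – 1 cut-points, converts them to the data scale as x̄ + z·s, and counts observations per bin. The data and the function name are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def equal_frequency_bins(data, c):
    """Return z cut-points, x cut-points, and observed counts for c equal-area bins."""
    xbar, s = mean(data), stdev(data)
    # c - 1 cut-points, each leaving a normal area of j / c to its left.
    z_cuts = [NormalDist().inv_cdf(j / c) for j in range(1, c)]
    x_cuts = [xbar + z * s for z in z_cuts]   # upper limits on the data scale
    observed = [0] * c
    for x in data:
        j = sum(x > cut for cut in x_cuts)    # number of cut-points below x
        observed[j] += 1
    return z_cuts, x_cuts, observed


# Hypothetical sample of 30 measurements.
data = [61, 64, 66, 67, 68, 69, 70, 70, 71, 71, 72, 72, 72, 73, 73,
        74, 74, 74, 75, 75, 76, 76, 77, 78, 79, 80, 81, 83, 85, 88]
z_cuts, x_cuts, observed = equal_frequency_bins(data, c=4)

e = len(data) / 4                             # e_j = n / c for every bin
stat = sum((f - e) ** 2 / e for f in observed)
```

For c = 4 the standard normal cut-points are about -0.674, 0, and 0.674, and every bin has the same expected frequency n/c = 7.5.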
• Table 15.17 shows some standard normal cutpoints for equal area
bins.
Table 15.16
Histograms
• The fitted normal histogram gives visual clues as to the likely
outcome of the GOF test.
• Histograms reveal any outliers or other non-normality issues.
• Further tests are needed since histograms vary.
15.6 ECDF Tests (Optional)
LO15-8: Interpret ECDF tests and know their advantages
compared to chi-square GOF tests.
• There are many alternatives to the chi-square test for goodness-of-fit.
These alternatives are based on the Empirical Cumulative Distribution
Function (ECDF).
• The Anderson-Darling (A-D) test is the most widely used for non-normality
because of its power.
• The A-D test is based on a probability plot. When the data fit the
hypothesized distribution closely, the probability plot will be close to a
straight line.
• The A-D test is more powerful than a chi-square test if raw data are
available because it treats the observations individually. Also, the probability
plot has the attraction of revealing discrepancies between the sample and
the hypothesized distribution, and it is usually easy to spot outliers.
• Another such test is the Kolmogorov-Smirnov (K-S) test, which uses the
largest absolute difference between the actual and expected cumulative
relative frequency of the n data values.
• The K-S test assumes that no parameters are estimated. If parameters are
estimated, use a Lilliefors test whose test statistic is the same but with a
different table of critical values. Both tests are done by computer.
• The K-S test can be illustrated in the same probability plot as the A-D test
as shown in Figure 15.15 (see the next slide).
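The K-S statistic described above, the largest absolute gap between the empirical and hypothesized cumulative distributions, can be sketched as follows. The three-value sample and the fully specified standard normal are hypothetical. The code computes only the statistic; critical values (K-S or Lilliefors) would still come from tables or software.

```python
from statistics import NormalDist


def ks_statistic(data, dist):
    """Largest absolute difference between the ECDF and the hypothesized CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = dist.cdf(x)
        # The ECDF jumps from (i - 1)/n to i/n at x, so check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Hypothetical sample tested against a standard normal with known parameters.
sample = [-1.0, 0.0, 1.0]
d_stat = ks_statistic(sample, NormalDist(mu=0.0, sigma=1.0))
```

For this tiny sample the statistic is about 0.175. Because each observation is used individually rather than being grouped into bins, no binning choices are needed.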