THE CHI-SQUARE TEST

The Chi-square test is intended to test how likely it is that an observed
distribution is due to chance. It is also called a "goodness of fit" statistic,
because it measures how well the observed distribution of data fits the
distribution that is expected if the variables are independent.

A Chi-square test is designed to analyze categorical data. That means that the
data has been counted and divided into categories. It will not work with
continuous measurement data (such as height in inches). For example, if you
want to test whether attending class influences how students perform on an
exam, using test scores (from 0-100) as data would not be appropriate for a Chi-
square test. However, arranging students into the categories "Pass" and "Fail"
would. Additionally, the data in a Chi-square grid should not be in the form of
percentages, or anything other than frequency (count) data. Thus, by dividing a
class of 54 into groups according to whether they attended class and whether
they passed the exam, you might construct a data set like this:

            Pass    Fail
Attended     25       6
Skipped       8      15
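
The same grid can be written down in code. The following is a minimal Python sketch (the variable names and the dictionary layout are illustrative assumptions, not part of the test) that stores the table as raw counts and checks that the four cells add up to the class of 54:

```python
# Observed counts from the attendance example
# (rows: attendance, columns: exam result).
# The layout and names here are illustrative assumptions.
observed = {
    "Attended": {"Pass": 25, "Fail": 6},
    "Skipped": {"Pass": 8, "Fail": 15},
}

# A Chi-square grid must hold frequencies (whole-number counts),
# not percentages or scores.
all_counts = [n for row in observed.values() for n in row.values()]
assert all(isinstance(n, int) and n >= 0 for n in all_counts)

total_students = sum(all_counts)
print(total_students)  # 54
```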

Importance
Be very careful when constructing your categories! A Chi-square test can only
tell you about the data as you have divided it up. It cannot tell you whether
the categories you constructed are meaningful. For example, if you
are working with data on groups of people, you can divide them into age groups
(18-25, 26-40, 41-60...) or income level, but the Chi-square test will treat the
divisions between those categories exactly the same as the divisions between
male and female, or alive and dead! It's up to you to assess whether your
categories make sense, and whether the difference (for example) between age
25 and age 26 is enough to make the categories 18-25 and 26-40 meaningful.
This does not mean that categories based on age are a bad idea, but only that
you need to be aware of the control you have over organizing data of that sort.

Another way to describe the Chi-square test is that it tests the null
hypothesis that the variables are independent. The test compares the observed
data to a model that distributes the data according to the expectation that the
variables are independent. The further the observed data departs from the model,
the stronger the evidence that the variables are dependent, and the stronger the
case for rejecting the null hypothesis. Note that the test never strictly proves
the null hypothesis incorrect; it only tells you how unlikely data like yours
would be if the null hypothesis were true.

The following table would represent a possible input to the Chi-square test,
using 2 variables to divide the data: gender and party affiliation. 2x2 grids like
this one are often the basic example for the Chi-square test, but in actuality any
size grid would work as well: 3x3, 4x2, etc.

          Democrat   Republican
Male         20          30
Female       30          20

This shows the basic 2x2 grid. However, this is actually incomplete, in a sense;
generally, the data table should include "marginal" information giving the total
counts for each column and row, as well as for the whole data set:

          Democrat   Republican   Total
Male         20          30         50
Female       30          20         50
Total        50          50        100
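
The marginal totals are just row sums, column sums, and the overall sum. As a small Python sketch (the list-of-lists layout and the variable names are my own assumptions for illustration), they can be computed from the 2x2 grid like this:

```python
# Observed 2x2 grid: rows are Male, Female; columns are Democrat, Republican.
observed = [
    [20, 30],  # Male
    [30, 20],  # Female
]

row_totals = [sum(row) for row in observed]        # one total per gender
col_totals = [sum(col) for col in zip(*observed)]  # one total per party
grand_total = sum(row_totals)                      # everyone in the data set

print(row_totals, col_totals, grand_total)  # [50, 50] [50, 50] 100
```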

We now have a complete data set on the distribution of 100 individuals into
categories of gender (Male/Female) and party affiliation
(Democrat/Republican). A Chi-square test would allow you to test whether
gender and party affiliation are independent; in other words, how likely it is
that a distribution of males and females across the parties like this one
would arise by chance alone.

So, as implied, the null hypothesis in this case would be that gender and party
affiliation are independent of one another. To test this hypothesis, we need to
construct a model which estimates how the data should be distributed if our
hypothesis of independence is correct. This is where the totals we put in the
margins will come in handy: later on, I'll show how you can calculate your
estimated data using the marginals. Meanwhile, however, I've constructed an
example which will allow very easy calculations. Assuming that there's a 50/50
chance of males or females being in either party, we get the very simple
distribution shown below.

          Democrat   Republican   Total
Male         25          25         50
Female       25          25         50
Total        50          50        100
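
For this example the expected table can be read off by eye, but it can also be derived from the marginals. One standard way to do this under independence is to take, for each cell, (row total x column total) / grand total. The Python sketch below assumes that rule; the variable names are illustrative only:

```python
# Marginal totals from the gender / party affiliation example.
row_totals = [50, 50]  # Male, Female
col_totals = [50, 50]  # Democrat, Republican
grand_total = 100

# Expected count for each cell if the two variables are independent:
# (row total * column total) / grand total.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

print(expected)  # [[25.0, 25.0], [25.0, 25.0]]
```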

This is the information we would need to test whether gender and party
affiliation are independent. I will discuss the next steps in calculating
a Chi-square value later, but for now I'll focus on the background information.
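
As a rough preview of that calculation, the Python sketch below applies the usual Pearson formula, summing (observed - expected)^2 / expected over every cell. It is only an illustration under stated assumptions: the names are mine, no continuity correction is applied, and the p-value shortcut using math.erfc holds only for 1 degree of freedom, which is what a 2x2 grid has.

```python
import math

# Observed and expected counts from the gender / party affiliation example.
observed = [[20, 30], [30, 20]]
expected = [[25, 25], [25, 25]]

# Pearson Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom for a grid: (rows - 1) * (columns - 1).
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# For exactly 1 degree of freedom, the upper-tail probability of the
# Chi-square distribution equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print(chi_square, dof, round(p_value, 4))  # 4.0 1 0.0455
```

A p-value of about 0.0455 means that, if gender and party affiliation really were independent, a table at least this far from the expected counts would turn up in roughly 4.6% of samples.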

Note: you can assume a different null hypothesis for a Chi-square test. Using
the scenario suggested above, you could test the hypothesis that women are
twice as likely as men to register as Democrats, and a Chi-square test would
tell you how likely it is that the observed data reflects that relationship between
your variables. In this case, you would simply run the test using a model of
expected data built under the assumption that this hypothesis is true, and the
formula will (as before) test how well that distribution fits the observed data. I
will not discuss this in more detail, but it is important to know that the null
hypothesis is not some abstract "fact" about the test, but rather a choice you
make when calculating your model.
