
ADVANCED STATISTICS

CHI-SQUARE TEST
NONPARAMETRIC TESTS

OUTLINE OF TOPICS

Introduction to Chi-Square
Terms
Types of Chi-Square Test
Goodness-of-fit Test
Test of Independence
Somers' D
Yates' Correction for Continuity
Fisher-Irwin Exact Test
Exercise
"Statistics is the grammar of science."
KARL PEARSON
PARAMETRIC TEST
A test that concerns population constants (parameters) such as the
mean, standard deviation, standard error, correlation coefficient, or
proportion, and in which the data are assumed to follow a known
distribution such as the normal, binomial, or Poisson.

NONPARAMETRIC TEST
A test in which no population constant is used. The data are not
assumed to follow any specific distribution, and no assumptions are
made about the form of that distribution. E.g., to classify items as
good, better, and best, we simply allocate arbitrary numbers or marks
to each category.

Terms To Take Note

HYPOTHESIS
A definite statement about the population parameters.

NULL HYPOTHESIS (H0)
States that no association exists between the two cross-tabulated
variables in the population, and therefore the variables are
statistically independent. E.g., if we want to compare two methods,
A and B, and we assume that both methods are equally good, this
assumption is called the NULL HYPOTHESIS.

ALTERNATIVE HYPOTHESIS (H1 or Ha)
Proposes that the two variables are related in the population. If
we assume that, of the two methods, method A is superior to
method B, this assumption is called the ALTERNATIVE HYPOTHESIS.

DEGREES OF FREEDOM
The degrees of freedom denote the extent of independence (freedom)
enjoyed by a given set of observed frequencies. Suppose we are given
a set of n observed frequencies that are subject to k independent
constraints (restrictions). Then:

Degrees of Freedom = Number of Frequencies – Number of Independent
Constraints on them = n – k

For a contingency table,

df = (r – 1)(c – 1)
where
r = the number of rows
c = the number of columns
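
The two degrees-of-freedom formulas can be sketched in a few lines of Python. The function names below are illustrative and do not come from the slides.

```python
def df_from_constraints(n_frequencies: int, n_constraints: int) -> int:
    """Degrees of freedom = number of frequencies - number of independent constraints."""
    return n_frequencies - n_constraints


def df_contingency(rows: int, cols: int) -> int:
    """Degrees of freedom of an r x c contingency table: (r - 1)(c - 1)."""
    return (rows - 1) * (cols - 1)


# Six die faces with one constraint (the total number of throws is fixed).
print(df_from_constraints(6, 1))   # 5
# A 2 x 2 contingency table.
print(df_contingency(2, 2))        # 1
# A 3 x 4 contingency table.
print(df_contingency(3, 4))        # 6
```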
CONTINGENCY TABLE
A table prepared by enumerating qualitative data and entering the
actual frequencies. If the table represents the occurrence of two
sets of events, it is called a contingency table. It is also called
an association table.

Chi-Square (χ²) Statistics
Introduction to "Chi-Square Statistics"

A chi-square (χ²) statistic measures how well a model compares to
actual observed data.

It is a test of significance and was developed by Karl Pearson in 1900.

The data used in calculating a chi-square statistic must be random,
raw, mutually exclusive, drawn from independent variables, and drawn
from a large enough sample. For example, the results of tossing a
fair coin meet these criteria.

Chi-square tests are often used in hypothesis testing. The statistic
compares the size of any discrepancies between the expected results
and the actual results, given the size of the sample and the number
of variables in the relationship.

For these tests, degrees of freedom are used to determine whether a
given null hypothesis can be rejected, based on the total number of
variables and samples within the experiment. As with any statistic,
the larger the sample size, the more reliable the result.
Application of Chi-Square Test

This test can be used for:

1. Goodness of fit of a distribution

2. Test of independence of attributes

Two Main Kinds of Chi-Square Test

Test of Independence
Is there a relationship between a student's sex and course choice?

Goodness-of-fit Test
How well does the coin in my hand match a theoretically fair coin?
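Formula for Chi-Square

χ² = Σ (O − E)² / E

where O is an observed frequency, E is the corresponding expected
frequency, and the sum runs over all categories (cells).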
Goodness-of-fit Test

χ² provides a way to test how well a sample of data matches the
(known or assumed) characteristics of the larger population that the
sample is intended to represent. This is known as goodness of fit. If
the sample data do not fit the expected properties of the population
that we are interested in, then we would not want to use this sample
to draw conclusions about the larger population.

Example

For example, consider an imaginary coin with exactly a 50/50 chance
of landing heads or tails and a real coin that you toss 100 times. If
the real coin is fair, then it will also have an equal probability of
landing on either side, and the expected result of tossing the coin
100 times is that heads will come up 50 times and tails will come up
50 times.

The goodness-of-fit test enables us to see how well an assumed
theoretical distribution (such as the binomial, Poisson, or normal
distribution) fits the observed data.

The chi-square test formula for goodness of fit is:

χ² = Σ (O − E)² / E

If chi-square (calculated) > chi-square (tabulated), with (n − 1) d.f.,
where n is the number of categories, then the null hypothesis is
rejected; otherwise it is not rejected.

If the null hypothesis is not rejected, it can be concluded that the
observed data are consistent with the theoretical distribution.
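
The decision rule above can be sketched in Python for the coin example. The observed counts (58 heads, 42 tails) are hypothetical and are not data from the slides; 3.841 is the standard tabulated chi-square value for 1 degree of freedom at the 5% level.

```python
def chi_square_gof(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 100 tosses giving 58 heads and 42 tails.
observed = [58, 42]
expected = [50, 50]          # fair coin: 100 * 0.5 for each side

statistic = chi_square_gof(observed, expected)
df = len(observed) - 1       # (n - 1) degrees of freedom
critical_5pct_df1 = 3.841    # tabulated chi-square value, df = 1, 5% level

print(f"chi-square = {statistic:.2f}, df = {df}")   # chi-square = 2.56, df = 1
if statistic > critical_5pct_df1:
    print("Reject the null hypothesis: the coin does not look fair.")
else:
    print("Do not reject the null hypothesis: the data are consistent with a fair coin.")
```

With these hypothetical counts the calculated value (2.56) is below the tabulated value (3.841), so the null hypothesis of a fair coin is not rejected.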
Test of Independence

This test enables us to determine whether two attributes are
associated.

For instance, if we are interested in knowing whether a new medicine
is effective in controlling fever, the chi-square test is useful.

In such a situation, we proceed with the null hypothesis that the two
attributes (viz., new medicine and control of fever) are independent,
which means that the new medicine is not effective in controlling
fever.

If chi-square (calculated) > chi-square (tabulated) at a certain level
of significance for the given degrees of freedom, the null hypothesis
is rejected, i.e., the two variables are dependent (the new medicine
is effective in controlling the fever). If chi-square (calculated) <
chi-square (tabulated), the null hypothesis is not rejected, i.e.,
there is no evidence that the two variables are dependent (the new
medicine is not shown to be effective in controlling the fever).
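
A minimal sketch of the test of independence for the medicine example follows. The 2 x 2 counts are hypothetical, the function names are illustrative, and each expected count is computed as (row total x column total) / grand total, which is the usual expected count under independence.

```python
def expected_counts(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    expected = expected_counts(table)
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical counts: rows = new medicine / no medicine,
# columns = fever controlled / fever not controlled.
table = [[40, 10],
         [25, 25]]

statistic, df = chi_square_independence(table)
critical_5pct_df1 = 3.841    # tabulated chi-square value, df = 1, 5% level

print(f"chi-square = {statistic:.2f}, df = {df}")   # chi-square = 9.89, df = 1
print("reject H0" if statistic > critical_5pct_df1 else "do not reject H0")
```

Here the calculated value exceeds the tabulated value, so for these hypothetical counts the null hypothesis of independence would be rejected.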
Somers' Delta

Somers' Delta (Somers' D) is a measure of the strength and direction
of the association between an ordinal dependent variable and an
ordinal independent variable.

An ordinal variable is one in which the values have a natural order
(e.g. "bad", "neutral", "good").

The value of Somers' D ranges between -1 and 1, where:

-1 : indicates that all pairs of the variables disagree

1 : indicates that all pairs of the variables agree

Somers’ D is used in practice for many


Given two variables, X and Y, we can say:

Two pairs (Xi, Yi) and (Xj, Yj) are concordant
if the ranks of both elements agree.

Two pairs (Xi, Yi) and (Xj, Yj) are discordant
if the ranks of both elements disagree.

In calculating Somers' D, use the formula:

NOTE: The resulting value will always be between -1 and 1.
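
As a hedged sketch, the pair counting can be written in Python using the standard definition of Somers' D for a dependent variable Y given an independent variable X: D = (N_C − N_D) / (N_C + N_D + N_T), where N_C and N_D are the numbers of concordant and discordant pairs and N_T is the number of pairs tied on Y but not on X. This is an assumption about which form is intended; the ratings below are hypothetical.

```python
from itertools import combinations


def somers_d(x, y):
    """Somers' D of y (dependent) given x (independent)."""
    concordant = discordant = tied_on_y = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        if xi == xj:
            continue                  # pairs tied on x are excluded
        if yi == yj:
            tied_on_y += 1            # tied on y only
        elif (xi - xj) * (yi - yj) > 0:
            concordant += 1           # both ranks move in the same direction
        else:
            discordant += 1           # ranks move in opposite directions
    denominator = concordant + discordant + tied_on_y
    if denominator == 0:
        raise ValueError("all pairs are tied on x; Somers' D is undefined")
    return (concordant - discordant) / denominator


# Hypothetical ordinal ratings on a 1-3 scale.
x = [1, 1, 2, 2, 3, 3]
y = [1, 2, 2, 3, 3, 3]
print(somers_d(x, y))                    # 0.75
print(somers_d([1, 2, 3], [1, 2, 3]))    # 1.0  (all pairs agree)
print(somers_d([1, 2, 3], [3, 2, 1]))    # -1.0 (all pairs disagree)
```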
Somers' Delta: Example

Suppose a grocery store would like to assess the relationship between
the following two ordinal variables:

The overall niceness of the cashier (ranked from 1 to 3)

The overall satisfaction of the customer's experience (also ranked
from 1 to 3)

The store collects the following ratings from a sample of 10 customers:
Somers' Delta: Computation

First, compute the following. Check whether customer number 1 has the
same (equal), lower, or higher value for niceness and satisfaction
than each of the other customers, from number 2 to 10. This gives the
counts needed for the formula.

Based on the computation, there are:

5 concordant
0 discordant
4 tied

Answer: 0.667
Rounded off to a whole number: 1
Yates' Correction for Continuity

Theory of Frank Yates (1902-1994), who was one of the pioneers of
20th century statistics.

In statistics, Yates' correction for continuity (or Yates' chi-square
test) is used in certain situations when testing for independence in
a contingency table.

The effect of Yates' correction is to prevent overestimation of
statistical significance for small data sets.

The correction is chiefly used when at least one cell of the table
has an expected count smaller than 5.

Unfortunately, Yates' correction may tend to overcorrect (adjust too
far). This can produce an overly conservative result that fails to
reject the null hypothesis when it should be rejected, and so its
current use is limited.
Yates' Correction for Continuity: Example

When this problem arises (small expected counts), particularly in a
2x2 table with 1 degree of freedom, the procedure is to subtract 0.5
from the absolute value of the difference between the observed
frequency and the expected frequency.

So each observed frequency that is larger than its expected frequency
is decreased by 0.5, and each observed frequency that is smaller than
its expected frequency is increased by 0.5.
Yates' Correction for Continuity: Computation

The following is Yates' corrected version of Pearson's chi-squared
statistic:

χ²(Yates) = Σ (|O − E| − 0.5)² / E
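
The effect of subtracting 0.5 can be illustrated with a short Python sketch that computes both the uncorrected and the Yates-corrected statistic for a 2 x 2 table. The counts and the function name are hypothetical; the sketch applies the correction exactly as described, without any extra safeguard for cells where |O − E| is already below 0.5.

```python
def chi_square_2x2(table, yates=False):
    """Pearson chi-square for a 2 x 2 table, optionally with Yates' correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                diff -= 0.5          # move each observed count 0.5 toward its expected count
            statistic += diff ** 2 / expected
    return statistic


# Hypothetical 2 x 2 table; two expected counts are 4.5, i.e. below 5.
table = [[8, 2],
         [3, 7]]

pearson = chi_square_2x2(table)
corrected = chi_square_2x2(table, yates=True)
print(f"uncorrected = {pearson:.2f}")    # uncorrected = 5.05
print(f"corrected   = {corrected:.2f}")  # corrected   = 3.23
```

Against the tabulated value 3.841 (1 degree of freedom, 5% level), the uncorrected statistic would reject the null hypothesis while the corrected one would not, which shows how the correction makes the test more conservative.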
Fisher-Irwin Exact Test

It is a test for independence in a 2 x 2 table. It is most useful when
the total sample size and the expected values are small. The test
holds the marginal totals fixed and computes the hypergeometric
probability that N11 (the count in the first cell) is at least as
large as the observed value.

It is useful when E (expected cell count) < 5.
Fisher-Irwin Exact Test: Example

Consider a 2 x 2 table with cell counts a, b, c, d. Assuming the
marginal totals are fixed:

M1 = a + b, M2 = c + d, N1 = a + c, N2 = b + d.
For convenience, assume N1 < N2 and M1 < M2.
The possible values of a are: 0, 1, ..., min(M1, N1).

The probability distribution of cell count a follows a hypergeometric
distribution:

N = a + b + c + d = N1 + N2 = M1 + M2

Pr(x = a) = N1! N2! M1! M2! / [N! a! b! c! d!]

Mean(x) = M1 N1 / N
Var(x) = M1 M2 N1 N2 / [N² (N − 1)]

The Fisher exact test is based on this hypergeometric distribution.
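
As a sanity check on the formulas above, the following Python sketch enumerates every table with a given set of fixed margins and confirms that the probabilities sum to 1 and reproduce the stated mean and variance. The margins (M1 = 4, M2 = 6, N1 = 3, N2 = 7) are hypothetical and chosen only to keep the numbers small.

```python
from math import factorial


def table_probability(a, b, c, d):
    """Pr(x = a) = N1! N2! M1! M2! / [N! a! b! c! d!] for a 2 x 2 table."""
    m1, m2 = a + b, c + d
    n1, n2 = a + c, b + d
    n = a + b + c + d
    numerator = factorial(n1) * factorial(n2) * factorial(m1) * factorial(m2)
    denominator = (factorial(n) * factorial(a) * factorial(b)
                   * factorial(c) * factorial(d))
    return numerator / denominator


# Hypothetical fixed margins.
M1, M2, N1, N2 = 4, 6, 3, 7
N = M1 + M2

# Enumerate a = 0, 1, ..., min(M1, N1); the other cells follow from the margins.
probs = {}
for a in range(min(M1, N1) + 1):
    b, c = M1 - a, N1 - a
    d = M2 - c
    probs[a] = table_probability(a, b, c, d)

total = sum(probs.values())
mean = sum(a * p for a, p in probs.items())
variance = sum((a - mean) ** 2 * p for a, p in probs.items())

print(f"total = {total:.4f}")         # total = 1.0000
print(f"mean = {mean:.4f}")           # mean = 1.2000, equal to M1*N1/N
print(f"variance = {variance:.4f}")   # variance = 0.5600, equal to M1*M2*N1*N2/[N^2 (N-1)]
```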
Fisher-Irwin Exact Test: Computation

Is HIV infection related to a history (Hx) of STDs in sub-Saharan
African countries? Test at the 5% level.

The probability of observing this specific table, given fixed
marginal totals, is

Pr(3, 7, 5, 10) = 10! 15! 8! 17! / [25! 3! 7! 5! 10!]
= 0.3332

Note that the above is not the p-value. Why? It is not the cumulative
probability, that is, not the tail probability.

Tail probability = sum of the probabilities for all values
(a = 3, 2, 1, 0).
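
The table probability and the tail probability for the table (3, 7, 5, 10) can be checked with a short Python sketch. It uses the binomial-coefficient form of the hypergeometric probability, which is algebraically the same as the factorial formula, and it assumes that the tail sum runs over a = 3, 2, 1, 0. The function name is illustrative.

```python
from math import comb


def hypergeom_pmf(a, m1, n1, n):
    """P(first cell = a) given row total m1, column total n1, grand total n."""
    return comb(n1, a) * comb(n - n1, m1 - a) / comb(n, m1)


# Observed table: a = 3, b = 7, c = 5, d = 10.
a, b, c, d = 3, 7, 5, 10
m1 = a + b            # first row total = 10
n1 = a + c            # first column total = 8
n = a + b + c + d     # grand total = 25

p_table = hypergeom_pmf(a, m1, n1, n)
# Lower-tail probability: tables with first cell 0, 1, 2, or 3.
p_tail = sum(hypergeom_pmf(k, m1, n1, n) for k in range(a + 1))

print(f"Pr(3, 7, 5, 10) = {p_table:.4f}")     # Pr(3, 7, 5, 10) = 0.3332
print(f"tail probability = {p_tail:.4f}")     # tail probability = 0.6069
print("reject H0" if p_tail < 0.05 else "do not reject H0 at the 5% level")
```

Under that assumption the tail probability is well above 0.05, so the null hypothesis of no association would not be rejected at the 5% level.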
Exercises

1. An instructor makes out his final grades for 200 students in his
subject, Introduction to Statistics. He is curious to see if his grade
distribution resembles the "normal curve" and notes from the college
catalog that in a normal distribution of grades 45% of them would be
C's, 24% would be B's, 24% D's, 3.5% of them would be A's and 3.5%
F's. The instructor compared the frequency given in his class to the
normal curve. The frequency of each grade is given.

Equivalence   Observed    Percentage   Expected    (E-O)   (E-O)²   (E-O)²/E
              Frequency                Frequency
A             15          3.5%
B             53          24%
C             87          45%
D             33          24%
F             12          3.5%

2. A die is thrown 132 times with the following results. Is the die
unbiased?

No.           Observed    Computed     (E-O)   (E-O)²   (E-O)²/E
Turned Up     Frequency   Frequency
1             16
2             20
3             25
4             14
5             29
6             28

Let us take the hypothesis that the die is unbiased. If that is so,
the probability of obtaining any one of the six numbers is 1/6, and
as such the expected frequency of any one number coming upward is
132 × 1/6 = 22.
Exercises
Use the Fisher-Irwin Exact Test          Use Yates' Correction for Continuity

3 1

1 3
