CH IV - Chi-Square
CHI-SQUARE DISTRIBUTIONS
A chi-square ($\chi^2$) distribution is a continuous distribution ordinarily derived as the sampling
distribution of a sum of squares of independent standard normal variables.
A $\chi^2$ test of independence is used to analyze the frequencies of two variables with multiple
categories to determine whether the two variables are independent. That is, the test uses sample
data to test for the independence of two variables. The sample data are given in a two-way table
called a contingency table. Because the $\chi^2$ test of independence uses a contingency table,
the test is sometimes referred to as CONTINGENCY ANALYSIS (contingency table test).
The $\chi^2$ test is used to analyze, for example, the following cases:
Whether employee absenteeism is independent of job classification
Whether beer preference is independent of sex (gender)
Whether favorite sport is independent of nationality
Whether type of financial investment is independent of geographic region.
Solution:
I. $H_0$: The choice of TV program an individual watches is independent of the
individual's income.
$H_1$: Income and choice of TV program are not independent.
II. Decision rule:
$\alpha = 0.05$
$v = (R-1)(C-1) = (3-1)(3-1) = 4$
$\chi^2_{\alpha,v} = \chi^2_{0.05,4} = 9.49$
Reject $H_0$ if the sample $\chi^2$ is greater than 9.49.
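$e_{11} = \frac{250 \times 250}{500} = 125$    $e_{12} = \frac{250 \times 150}{500} = 75$    $e_{13} = \frac{250 \times 100}{500} = 50$
$e_{21} = \frac{200 \times 250}{500} = 100$    $e_{22} = \frac{200 \times 150}{500} = 60$    $e_{23} = \frac{200 \times 100}{500} = 40$
$e_{31} = \frac{50 \times 250}{500} = 25$    $e_{32} = \frac{50 \times 150}{500} = 15$    $e_{33} = \frac{50 \times 100}{500} = 10$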
A test of the null hypothesis that variables are independent of one another is based on the
magnitude of the differences between the observed frequencies and the expected frequencies.
Large differences between $O_{ij}$ and $e_{ij}$ provide evidence that the null hypothesis is false. The test is
based on the following chi-square test statistic.
$\chi^2 = \sum \frac{(O_{ij} - e_{ij})^2}{e_{ij}}$  or  $\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$
where $O_{ij}$ ($f_o$) = observed frequency for the contingency table category in row $i$ and column $j$, and
$e_{ij}$ ($f_e$) = expected frequency for the contingency table category in row $i$ and column $j$.
$\chi^2 = \frac{(143-125)^2}{125} + \frac{(70-75)^2}{75} + \frac{(37-50)^2}{50} + \frac{(90-100)^2}{100} + \frac{(67-60)^2}{60} + \frac{(17-25)^2}{25} + \frac{(13-15)^2}{15} + \frac{(20-10)^2}{10} + \frac{(43-40)^2}{40} \approx 21.17$
IV. Reject the null hypothesis that choice of TV program is independent of income level.
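The steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the observed counts are read off the numerators of the worked statistic, the row/column orientation is an assumption, and the helper names `expected_counts` and `chi_square` are hypothetical.

```python
# Observed counts taken from the worked statistic above. Which variable
# sits on the rows and which on the columns is an assumption.
observed = [
    [143, 70, 37],
    [90, 67, 43],
    [17, 13, 20],
]

def expected_counts(table):
    """Expected count for cell (i, j) = row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

expected = expected_counts(observed)
statistic = chi_square(observed)
critical = 9.49  # tabulated value for alpha = 0.05, v = 4 (from step II)

print(expected)
print(round(statistic, 2), "reject H0" if statistic > critical else "do not reject H0")
```

The first row of `expected` reproduces $e_{11}$, $e_{12}$, $e_{13}$, and the statistic exceeds 9.49, in line with the decision in step IV.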
2. A human resource manager at EAGEL Inc. was interested in knowing whether the
voluntary absence behavior of the firm's employees was independent of marital status.
The employee files contained data on marital status and on voluntary absenteeism
behavior. The data for a sample of 500 employees are shown below.
                              Marital status
Absence behavior    Married   Divorced   Widowed   Single   Total
Often absent             36         16        14       34     100
Seldom absent            64         34        20       82     200
Never absent             50         50        16       84     200
Total                   150        100        50      200     500
Test the hypothesis that the absence behavior is independent of marital status at a significance
level of 1 %.
Solution:
I. $H_0$: Voluntary absence behavior is independent of marital status.
$H_1$: Voluntary absence behavior and marital status are dependent.
II. Decision rule:
$\alpha = 0.01$
$v = (R-1)(C-1) = (3-1)(4-1) = 6$
$\chi^2_{\alpha,v} = \chi^2_{0.01,6} = 16.81$
Reject $H_0$ if the sample $\chi^2 > 16.81$.
III. Compute the test statistic
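As a hedged sketch of this step, the statistic for the absenteeism table can be computed directly from the observed counts given in the problem. The variable names are illustrative, and the resulting value is produced by this sketch rather than quoted from the notes.

```python
# Observed counts from the absenteeism table (columns: Married, Divorced,
# Widowed, Single).
observed = {
    "Often absent": [36, 16, 14, 34],
    "Seldom absent": [64, 34, 20, 82],
    "Never absent": [50, 50, 16, 84],
}

rows = list(observed.values())
row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]
n = sum(row_totals)

statistic = 0.0
for row, row_total in zip(rows, row_totals):
    for o, col_total in zip(row, col_totals):
        e = row_total * col_total / n  # expected count under independence
        statistic += (o - e) ** 2 / e

df = (len(rows) - 1) * (len(col_totals) - 1)
critical = 16.81  # tabulated value for alpha = 0.01, v = 6 (from step II)

print(df, round(statistic, 2), statistic > critical)
```

Under these inputs the sketch gives a statistic of about 10.88, which is below 16.81, so the decision rule in step II would not reject $H_0$.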
A. The $\chi^2$ test of independence was suggested as a way of determining if the decision to hire
7 males and 5 females should be interpreted as having a selection bias in favor of males.
Conduct the test of independence using $\alpha = 0.10$. What is your conclusion?
B. Using the same test, would the decision to hire 8 males and 4 females suggest concern for
a selection bias?
C. How many males could be hired for the 12 open positions before the procedure would
raise concern for a selection bias?
Solution:
A.
I. $H_0$: There is no selection bias in favor of males (selection status and gender of the applicant are
independent).
$H_1$: There is selection bias in favor of males (selection status and gender of the applicant are not
independent).
II. Decision rule:
$\alpha = 0.10$
$v = (R-1)(C-1) = (2-1)(2-1) = 1$
$\chi^2_{\alpha,v} = \chi^2_{0.10,1} = 2.71$
Reject $H_0$ if the sample $\chi^2 > 2.71$.
III. Sample $\chi^2$
Observed frequency ($f_o$)   Expected frequency ($f_e$)   $(f_o - f_e)^2$   $(f_o - f_e)^2 / f_e$
 7                            6                            1                 0.1667
33                           34                            1                 0.0294
 5                            6                            1                 0.1667
35                           34                            1                 0.0294
$\sum \frac{(f_o - f_e)^2}{f_e}$                                             0.3922
III. Sample $\chi^2$
Observed frequency ($f_o$)   Expected frequency ($f_e$)   $(f_o - f_e)^2$   $(f_o - f_e)^2 / f_e$
 8                            6                            4                 0.6667
32                           34                            4                 0.1176
 4                            6                            4                 0.6667
36                           34                            4                 0.1176
$\sum \frac{(f_o - f_e)^2}{f_e}$                                             1.5686
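The two tables above differ only in the number of males hired, so question C can be explored by sweeping that number. The sketch below assumes an applicant pool of 40 males and 40 females, which is inferred from the row sums of the tables (7 + 33 and 5 + 35) rather than stated in the surviving text; `hiring_chi_square` is a hypothetical helper name.

```python
def hiring_chi_square(males_hired, positions=12, males=40, females=40):
    """Chi-square statistic for the 2 x 2 hired / not-hired by gender table.

    The pool sizes (40 and 40) are an assumption inferred from the tables.
    """
    females_hired = positions - males_hired
    observed = [
        males_hired, males - males_hired,
        females_hired, females - females_hired,
    ]
    n = males + females
    exp_males_hired = males * positions / n
    exp_females_hired = females * positions / n
    expected = [
        exp_males_hired, males - exp_males_hired,
        exp_females_hired, females - exp_females_hired,
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

critical = 2.71  # tabulated value for alpha = 0.10, v = 1

for m in range(6, 13):
    stat = hiring_chi_square(m)
    print(m, round(stat, 4), stat > critical)

# Smallest number of males hired at which the statistic exceeds 2.71.
first_flagged = next(m for m in range(6, 13) if hiring_chi_square(m) > critical)
print(first_flagged)
```

With these assumptions, hiring 7 or 8 males stays below the critical value, and the statistic first exceeds 2.71 when 9 males are hired.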
The chi-square test for independence is useful in helping to determine whether a relationship
exists between two variables, but it does not enable us to estimate or predict the values of one
variable based on the value of the other. If it is determined that a dependence does exist between
two quantitative variables, then the techniques of regression analysis are useful in helping to find
a mathematical formula that expresses the nature of the relationship.
Small expected frequencies can lead to inordinately large chi-square values with the chi-square
test of independence. Hence contingency tables should not be used with expected cell values of
less than 5. One way to avoid small expected values is to combine columns or rows whenever
possible and whenever doing so makes sense.
$H_0: P_1 = P_{10};\ P_2 = P_{20};\ P_3 = P_{30};\ \ldots;\ P_k = P_{k0}$, and the alternative hypothesis takes the following form:
$H_1$: the population proportions are not equal to the hypothesized values.
The degrees of freedom are determined as $v = k - 1$, where $k$ refers to the number of proportions,
and all expected cell values must be greater than or equal to 5.
Example:
1. In the business credit institution industry the accounts receivable for companies are
classified as being "current", "moderately late", "very late" and "uncollectible". Industry
figures show that the ratio of these four classes is 9 : 3 : 3 : 1.
ENDURANCE firm has 800 accounts receivable, with 439, 168, 133, and 60 falling in
each class, respectively. Are these proportions in agreement with the industry ratio? Let $\alpha = 0.05$.
Solution:
I. $H_0: P_1 = \frac{9}{16};\ P_2 = \frac{3}{16};\ P_3 = \frac{3}{16};\ P_4 = \frac{1}{16}$
$H_1$: At least one proportion differs from its hypothesized value.
II. $\alpha = 0.05$
$v = (K-1) = (4-1) = 3$
$\chi^2_{\alpha,v} = \chi^2_{0.05,3} = 7.81$
Reject $H_0$ if the sample $\chi^2 > 7.81$.
III. Test statistic (sample $\chi^2$)
Class   Observed frequency ($f_o$)   Expected frequency ($f_e$)   $(f_o - f_e)^2$   $(f_o - f_e)^2 / f_e$
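For the accounts receivable data, the expected counts under the 9 : 3 : 3 : 1 ratio and the resulting statistic can be sketched as follows. This is an illustrative computation from the figures stated in the problem, not a reproduction of the original table; the variable names are assumptions.

```python
ratio = [9, 3, 3, 1]               # industry ratio of the four classes
observed = [439, 168, 133, 60]     # current, moderately late, very late, uncollectible

n = sum(observed)
expected = [n * r / sum(ratio) for r in ratio]  # f_e = n * p_i

terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
statistic = sum(terms)
critical = 7.81  # tabulated value for alpha = 0.05, v = 3 (from step II)

for o, e, t in zip(observed, expected, terms):
    print(o, e, round(t, 4))
print(round(statistic, 4), statistic > critical)
```

With these inputs the sketch gives a statistic of about 6.36, which does not exceed 7.81.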
IV. Reject $H_0$, because 20 > 5.99. This means that customers do have a color preference. It
appears that red is the most popular color and blue is the least popular.
3. Rating sciences, Inc., a TV program-rating service, surveyed 600 families where the
television was turned on during prime time on week nights. They found the following
numbers of people tuned to the various networks.
Name of the network   Type             Number of viewers
EBS                   Commercial       210
Arts                  Commercial       170
Balageru              Commercial       165
EBC                   Non-commercial    55
Total                                  600
A. Test the hypothesis that all four networks have equal proportions of viewers during this
prime time period, using $\alpha = 0.05$.
B. Eliminate the results for EBC and repeat the test of hypothesis for the three commercial
networks, using $\alpha = 0.05$.
C. Test the hypothesis that each of the three major networks has 30% of the weeknight prime
time market and EBC has 10%, using $\alpha = 0.005$.
Solution:
A.
I. $H_0$: All four networks have equal proportions of viewers; $P_1 = P_2 = P_3 = P_4 = \frac{1}{4}$
$H_1$: The four networks do not all have equal proportions of viewers.
II. $\alpha = 0.05$
$v = (K-1) = (4-1) = 3$
$\chi^2_{\alpha,v} = \chi^2_{0.05,3} = 7.81$
Reject $H_0$ if the sample $\chi^2 > 7.81$.
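Class      Observed frequency ($f_o$)   Expected frequency ($f_e = np_i$)   $(f_o - f_e)^2$   $(f_o - f_e)^2 / f_e$
EBS        210                          150                                 3,600             24.0000
Arts       170                          150                                   400              2.6667
Balageru   165                          150                                   225              1.5000
EBC         55                          150                                 9,025             60.1667
$\sum \frac{(f_o - f_e)^2}{f_e}$                                                              88.3334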
Class      Observed frequency ($f_o$)   Expected frequency ($f_e = np_i$)   $(f_o - f_e)^2$   $(f_o - f_e)^2 / f_e$
EBS        210                          180                                 900               5.0000
Arts       170                          180                                 100               0.5556
Balageru   165                          180                                 225               1.2500
EBC         55                           60                                  25               0.4167
$\sum \frac{(f_o - f_e)^2}{f_e}$                                                              7.2222
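All three parts of the network question use the same goodness-of-fit statistic with different hypothesized proportions, which the following sketch makes explicit. The function name `goodness_of_fit` is hypothetical, and the values it prints are computed here rather than quoted from the notes.

```python
def goodness_of_fit(observed, proportions):
    """Chi-square goodness-of-fit statistic with f_e = n * p_i."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(observed, proportions))

viewers = {"EBS": 210, "Arts": 170, "Balageru": 165, "EBC": 55}
counts = list(viewers.values())

# A: all four networks have equal shares.
part_a = goodness_of_fit(counts, [0.25] * 4)
# B: EBC dropped, the three commercial networks have equal shares.
part_b = goodness_of_fit(counts[:3], [1 / 3] * 3)
# C: 30% for each commercial network and 10% for EBC.
part_c = goodness_of_fit(counts, [0.30, 0.30, 0.30, 0.10])

print(round(part_a, 4), round(part_b, 4), round(part_c, 4))
```

Each statistic is then compared with the tabulated $\chi^2$ value for the stated $\alpha$ and $v = K - 1$ (for example 7.81 for part A).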
Class    Observed frequency ($f_o$)   Expected frequency ($f_e$)   $(f_o - f_e)^2$   $(f_o - f_e)^2 / f_e$
A        95                           90                            25               0.2778
B        85                           80                            25               0.3125
Others   20                           30 (i)                       100               3.3333
$\sum \frac{(f_o - f_e)^2}{f_e}$                                                     3.9236
The chi-square test is widely used for a variety of analyses. One of the more important uses of
chi-square is the goodness-of-fit test. That is, it can be used to decide whether a particular probability
distribution, such as the binomial, Poisson, or normal distribution, is the appropriate distribution
for a set of data. This is an important ability, because as decision makers using statistics, we will
need to choose a certain probability distribution to represent the distribution of the data we happen
to be considering.
In tests of hypothesis (previous chapter), we assumed that the population was normal and tested
the hypotheses $\mu = \mu_0$, $\rho = \rho_0$, etc. But what if we want to check the assumption of normality
itself? The multinomial $\chi^2$ goodness-of-fit test can be applied.
The null hypothesis for a goodness-of-fit test is that the distribution of the population from
which a sample is taken is the one specified. The alternative hypothesis is that the actual
distribution is not the specified distribution. Generally, a researcher specifies only the name of the
distribution and uses the sample data to estimate the particular parameters of the distribution. In
this situation one degree of freedom is lost for each parameter that has to be estimated. However,
if the researcher completely specifies the distribution, including parameter values, then no additional
degrees of freedom are lost.
(i) For the $R \times C$ contingency table, the degrees of freedom are calculated as $(R-1)(C-1)$. The degrees of freedom
refer to the number of expected frequencies that can be chosen freely provided the row and column totals of
expected frequencies are identical to the row and column totals of the observed frequency table.
Null hypothesis                                               Parameters to be estimated   Degrees of freedom lost
$H_0$: population is normal                                   $\mu$, $\sigma$              2
$H_0$: population is normal with $\mu = x$                    $\sigma$                     1
$H_0$: population is normal with $\sigma = y$                 $\mu$                        1
$H_0$: population is normal with $\mu = x$, $\sigma = y$      None                         0
$H_0$: population is Poisson                                  $\lambda$                    1
$H_0$: population is Poisson with $\lambda = z$               None                         0
$H_0$: population is binomial with $p$, $q = w$               None                         0
Example (Binomial)
1. Miss Tsion, saleswoman for Moon paper company, has five accounts to visit per day. It is
suggested that sales by Miss Tsion may be described by the binomial distribution, with the
probability of selling each account being 0.40. Given the following frequency distribution
of Miss Tsion's number of sales per day, can we conclude that the data do in fact follow
the binomial distribution? Use the 0.05 significance level.
No of sales per day 0 1 2 3 4 5
Frequency 10 41 60 20 6 3
Solution:
I. $H_0$: The frequency distribution is binomial with $n = 5$ and $p = 0.40$.
$H_1$: The frequency distribution is not binomial with $n = 5$ and $p = 0.40$.
II. $\alpha = 0.05$
$v = K - 1 - m = 5 - 1 - 0 = 4$
$\chi^2_{\alpha,v} = \chi^2_{0.05,4} = 9.49$
Reject $H_0$ if the sample $\chi^2 > 9.49$.
Because of the above change $H_0$ is translated as:
IV. Do not reject $H_0$. The data are well described by the binomial distribution with
$n = 5$, $p = 0.40$.
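A minimal sketch of the binomial goodness-of-fit computation follows. It assumes that the drop from six outcomes to $K = 5$ classes comes from pooling the last class, whose expected count is below 5, into its neighbor; the helper names `binomial_probabilities` and `pool_tail` are hypothetical, and the pooling routine only handles small cells at the upper end.

```python
from math import comb

def binomial_probabilities(n, p):
    """P(X = k) for k = 0..n under a binomial(n, p) model."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

def pool_tail(observed, expected, minimum=5.0):
    """Merge the last class into its neighbor while its expected count < minimum."""
    observed, expected = list(observed), list(expected)
    while len(expected) > 1 and expected[-1] < minimum:
        last_e = expected.pop()
        last_o = observed.pop()
        expected[-1] += last_e
        observed[-1] += last_o
    return observed, expected

sales = [10, 41, 60, 20, 6, 3]  # days with 0, 1, ..., 5 sales
n_days = sum(sales)
expected = [n_days * q for q in binomial_probabilities(5, 0.40)]

pooled_obs, pooled_exp = pool_tail(sales, expected)
statistic = sum((o - e) ** 2 / e for o, e in zip(pooled_obs, pooled_exp))
df = len(pooled_obs) - 1 - 0  # no parameters estimated from the data

print(pooled_obs, [round(e, 3) for e in pooled_exp])
print(df, round(statistic, 2), statistic > 9.49)
```

Under this pooling assumption the statistic is about 8.96, below 9.49, which agrees with the decision in step IV.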
2. A professional baseball player, Philippos, was at bat five times in each of 100 games.
Philippos claims that he has a probability of 0.40 of getting a hit each time he goes to bat.
Test his claim at the 0.05 level by seeing if the following data are distributed binomially.
No of hits/game 0 1 2 3 4 5
No of games with that number of hits 12 38 27 17 5 1
Solution:
I. $H_0$: The frequency distribution can be best described by a binomial distribution with $n = 5$ and $p = 0.40$.
$H_1$: The frequency distribution cannot be best described by a binomial distribution with $n = 5$ and $p = 0.40$.
II. $\alpha = 0.05$
$v = K - 1 - m = 5 - 1 - 0 = 4$
$\chi^2_{\alpha,v} = \chi^2_{0.05,4} = 9.49$
Reject $H_0$ if the sample $\chi^2 > 9.49$.
Because of the above change $H_0$ is translated as:
IV. Reject $H_0$. The number of hits per game is not binomially distributed with $n = 5$ and $p = 0.40$.
3. The Ethiopian postal service is interested in modeling the “mangled letter” problem. It has
been suggested that any letter sent to a certain area has a 0.15 chance of being mangled.
Since the post office is so big, it can be assumed that two letters' chances of being mangled
are independent. A sample of 310 people was selected, and two test letters were mailed to
each of them. The number of people receiving zero, one, or two mangled letters was
260, 40 and 10, respectively. At the 0.10 level of significance, is it reasonable to conclude that
the number of mangled letters received by people follows a binomial distribution with
$p = 0.15$?
Solution:
I. $H_0$: The number of mangled letters received by people follows a binomial distribution with $n = 2$ and $p = 0.15$.
$H_1$: The number of mangled letters received by people does not follow a binomial distribution with $n = 2$ and $p = 0.15$.
II. $\alpha = 0.10$
$v = K - 1 - m = 3 - 1 - 0 = 2$
$\chi^2_{\alpha,v} = \chi^2_{0.10,2} = 4.61$
Reject $H_0$ if the sample $\chi^2 > 4.61$.
III. Test statistic (sample $\chi^2$)
Mangled letters   Probability   $f_o$   $f_e$      $(f_o - f_e)^2$   $(f_o - f_e)^2 / f_e$
0                 0.7225        260     223.9750   1297.8006          5.7944
1                 0.2550         40      79.0500   1524.9025         19.2904
2                 0.0225         10       6.9750      9.1506          1.3119
$\sum \frac{(f_o - f_e)^2}{f_e}$                                     26.3967
IV. Reject $H_0$.
The number of mangled letters received by people does not follow a binomial distribution with $n = 2$ and $p = 0.15$.
Example (Poisson)
1. It is hypothesized that the number of breakdowns per month of a computer system at a
major university follows a Poisson distribution with μ=2. The data below show the
observed number of breakdowns per month during a sample of 100 months. Use a 5%
level of significance and test the null hypothesis.
Breakdowns 0 1 2 3 4 5 and above
Observed frequency 14 20 34 22 5 3
Solution:
I. $H_0$: The population distribution is Poisson with $\lambda = 2$.
$H_1$: The population distribution is not Poisson with $\lambda = 2$.
II. $\alpha = 0.05$
$v = K - 1 - m = 6 - 1 - 0 = 5$
$\chi^2_{\alpha,v} = \chi^2_{0.05,5} = 11.07$
Reject $H_0$ if the sample $\chi^2 > 11.07$.
III. Test statistic (sample $\chi^2$)
IV. Do not reject $H_0$. The number of breakdowns per month of a computer system at the
university follows a Poisson distribution with $\mu = 2$.
2. Suppose that a teller supervisor believes that the distribution of random arrivals at a local
bank is Poisson and sets out to test the hypothesis by gathering information. The following
data represent a distribution of the frequency of arrivals during one-minute intervals at a bank.
Use $\alpha = 0.05$ to test these data in an effort to determine whether they are Poisson
distributed.
Solution:
Before we solve the question, first we have to compute the arrival rate per minute, and
hence one degree of freedom is lost.
$\lambda = \frac{\sum (\text{number of arrivals} \times \text{observed frequency})}{\sum (\text{observed frequency})}$
$= \frac{(0 \times 7) + (1 \times 18) + (2 \times 25) + (3 \times 17) + (4 \times 12) + (5 \times 5)}{84} = \frac{192}{84} \approx 2.3$ customers per minute
I. $H_0$: The arrival of customers at a bank is Poisson distributed with $\lambda = 2.3$.
$H_1$: The arrival of customers at a bank is not Poisson distributed with $\lambda = 2.3$.
II. $\alpha = 0.05$
$v = K - 1 - m = 6 - 1 - 1 = 4$
$\chi^2_{\alpha,v} = \chi^2_{0.05,4} = 9.488$
Reject $H_0$ if the sample $\chi^2 > 9.488$.
III. Test statistic (sample $\chi^2$)
IV. Do not reject $H_0$. The arrival of customers at a bank follows a Poisson distribution with
$\lambda = 2.3$.
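The bank example differs from the previous one because $\lambda$ is estimated from the sample, which costs one degree of freedom. The sketch below reads the frequencies off the $\lambda$ computation (7, 18, 25, 17, 12, 5 for 0 to 5 arrivals) and assumes that the last class stands for "5 or more" arrivals; that assumption, like the variable names, is not confirmed by the surviving text.

```python
from math import exp, factorial

arrivals = [0, 1, 2, 3, 4, 5]
frequency = [7, 18, 25, 17, 12, 5]

n = sum(frequency)
lam_hat = sum(x * f for x, f in zip(arrivals, frequency)) / n  # 192 / 84
lam = round(lam_hat, 1)  # the notes work with the rounded rate

# Assumption: the last class is open-ended ("5 or more").
exact = [exp(-lam) * lam**k / factorial(k) for k in arrivals[:-1]]
probabilities = exact + [1.0 - sum(exact)]
expected = [n * p for p in probabilities]

statistic = sum((o - e) ** 2 / e for o, e in zip(frequency, expected))
df = len(frequency) - 1 - 1  # one parameter (lambda) estimated from the data

print(round(lam_hat, 4), lam, df)
print(round(statistic, 2), statistic > 9.488)
```

Under these assumptions the statistic is far below 9.488, which is consistent with the decision not to reject $H_0$.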
3. The number of automobile accidents occurring per day in a particular city is believed to
have a Poisson distribution. A sample of 80 days during the past year gives the data shown
below. Do the data support the belief that the number of accidents per day has a Poisson
distribution? Use $\alpha = 0.05$.
No of accidents 0 1 2 3 4
Observed frequency(days) 34 25 11 7 3
Solution:
Before we solve the question, first we have to compute the occurrence rate per day, and
hence one degree of freedom is lost.
$\lambda = \frac{\sum (\text{number of accidents} \times \text{observed frequency})}{\sum (\text{observed frequency})}$
$= \frac{(0 \times 34) + (1 \times 25) + (2 \times 11) + (3 \times 7) + (4 \times 3)}{80} = \frac{80}{80} = 1$ accident per day
I. $H_0$: The occurrence of accidents per day follows a Poisson distribution with $\lambda = 1.0$.
$H_1$: The occurrence of accidents per day does not follow a Poisson distribution with $\lambda = 1.0$.
II. $\alpha = 0.05$
$v = K - 1 - m = 4 - 1 - 1 = 2$
$\chi^2_{\alpha,v} = \chi^2_{0.05,2} = 5.99$
Reject $H_0$ if the sample $\chi^2 > 5.99$.
III. Test statistic (sample $\chi^2$)
IV. Do not reject $H_0$. The occurrence of accidents follows a Poisson distribution with $\lambda = 1.0$.
Example (Normal)
1. Suppose that Ato Paulos developed an overall attitude scale to determine how his
company’s employees feel toward their company. In theory the scores can vary from 0 to
50. Ato Paulos retests his measurement instrument on a randomly selected group of 100
employees. He tallies the scores and summarizes them into six categories as shown
below. Are these retest scores approximately normally distributed with
$\mu = 24.9$ and $\sigma = 7.194$? Use $\alpha = 0.05$.
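For category 30-35:
$z_{30} = \frac{30 - 24.9}{7.194} = +0.71$, probability 0.26115
$z_{35} = \frac{35 - 24.9}{7.194} = +1.40$, probability 0.41924
Expected probability: 0.41924 - 0.26115 = 0.15809
For category 35-40:
$z_{35} = \frac{35 - 24.9}{7.194} = +1.40$, probability 0.41924
$z_{40} = \frac{40 - 24.9}{7.194} = +2.10$, probability 0.48214
Expected probability: 0.48214 - 0.41924 = 0.06290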
The six probabilities do not sum to 1.00. Even though observed frequencies were obtained only
for these six categories, getting a score less than 10 or greater than 40 was also possible. Because
0.50 of the probability lies in each half of a normal distribution, utilizing the sum of expected
probabilities on each side of the mean, 24.9, we can obtain the probability of the
< 10 category: 0.5 - (0.06456 + 0.16446 + 0.25175) = 0.01923. Similarly, we can obtain the
probability of the > 40 category. The expected frequencies can then be obtained by multiplying
each expected probability by the total frequency (100), as shown below.
As the < 10 and > 40 categories have values of less than 5, each must be combined with the
adjacent category. As a result, the < 10 category becomes part of the 10-15 category and
the > 40 category becomes part of the 35-40 category.
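The class probabilities and the pooling of the two tails can be sketched with the standard library's error function instead of a printed normal table. Because the notes round each $z$ to two decimals before reading the table, the values below differ slightly in the third or fourth decimal place; the class boundaries are taken from the text and the variable names are illustrative.

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """Cumulative probability P(X <= x) for a normal(mu, sigma) variable."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

mu, sigma, n = 24.9, 7.194, 100
edges = [10, 15, 20, 25, 30, 35, 40]  # class boundaries of the six categories

cdf = [normal_cdf(e, mu, sigma) for e in edges]
inner = [hi - lo for lo, hi in zip(cdf, cdf[1:])]  # 10-15, ..., 35-40
below = cdf[0]          # P(X < 10)
above = 1.0 - cdf[-1]   # P(X > 40)

# Both tails have expected counts below 5, so pool them into the end classes.
pooled = list(inner)
pooled[0] += below
pooled[-1] += above
expected = [n * p for p in pooled]

print([round(p, 5) for p in inner], round(below, 5), round(above, 5))
print([round(e, 3) for e in expected])
```

The probabilities for 30-35 and for the pooled 35-40 class come out close to the tabulated 0.15809 and 0.08076.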
30−35 0.15809 15.809
35−40 0.08076 8.076
IV. Do not reject $H_0$. The attitude scores are normally distributed with mean 24.9 and standard
deviation 7.194.
2. The director of a major soccer team believes that the ages of purchasers of game tickets are
normally distributed. If the following data represent the distribution of ages for a sample
of observed purchasers of major soccer game tickets, use the chi-square goodness-of-fit
test to determine whether this distribution is significantly different from the normal
distribution. Assume that $\alpha = 0.05$.
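Category   Frequency ($f$)   Midpoint ($m$)   $fm$     $fm^2$
10-20       16               15                 240      3,600
20-30       44               25               1,100     27,500
30-40       61               35               2,135     74,725
40-50       56               45               2,520    113,400
50-60       35               55               1,925    105,875
60-70       19               65               1,235     80,275
Total      231                        $\sum fm = 9,155$   $\sum fm^2 = 405,375$
$\bar{x} = \frac{\sum fm}{n} = \frac{9155}{231} = 39.63$
$s = \sqrt{\frac{\sum fm^2 - \frac{(\sum fm)^2}{n}}{n - 1}} = \sqrt{\frac{405375 - \frac{(9155)^2}{231}}{230}} \approx 13.60$
With $z = \frac{X - \mu}{\sigma}$, the expected probability of each category can be obtained as follows: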
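The grouped-data mean and standard deviation used for the $z$ values can be checked with a short sketch. It assumes the sample form of the standard deviation (divisor $n - 1$), which reproduces the value 12.63 that appears in the grades example later in these notes; `grouped_mean_and_sd` is a hypothetical helper name.

```python
from math import sqrt

def grouped_mean_and_sd(midpoints, frequencies):
    """Mean and sample standard deviation from class midpoints and frequencies."""
    n = sum(frequencies)
    sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))
    sum_fm2 = sum(f * m * m for f, m in zip(frequencies, midpoints))
    mean = sum_fm / n
    sd = sqrt((sum_fm2 - sum_fm**2 / n) / (n - 1))
    return mean, sd

# Ages of soccer ticket purchasers (classes 10-20, ..., 60-70).
age_midpoints = [15, 25, 35, 45, 55, 65]
age_frequencies = [16, 44, 61, 56, 35, 19]

mean_age, sd_age = grouped_mean_and_sd(age_midpoints, age_frequencies)
print(round(mean_age, 2), round(sd_age, 2))
```

The same helper applies unchanged to any other grouped frequency table of midpoints and counts.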
The six probabilities do not sum to 1.00. Even though the observed frequencies were obtained
only for these six categories, an age less than 10 or greater than 70 was also possible.
For < 10
Probability between 10 and the mean = 0.06030 + 0.16392 + 0.26115 = 0.48537
Probability < 10 = 0.5 - 0.48537 = 0.01463
For > 70
Probability between 70 and the mean = 0.05394 + 0.15682 + 0.26440 + 0.01197 = 0.48713
Probability > 70 = 0.5 - 0.48713 = 0.01287
Then, the expected frequencies can be obtained by multiplying each expected probability by the
total frequency (231), as shown below.
As the < 10 and > 70 categories have values of less than 5, each must be combined with the
adjacent category. As a result, the < 10 category becomes part of the 10-20 category and
the > 70 category becomes part of the 60-70 category.
10-20 0.07493 0.07493
20-30 0.16392 0.16392
30-40 0.27312 0.27312
40-50 0.26440 0.26440
50-60 0.15682 0.15682
60-70 0.06681 0.06681
IV. Do not reject $H_0$. The ages of purchasers of soccer game tickets are normally distributed.
3. The instructor for an introductory statistics course attempts to construct the final examination
so that the grades are normally distributed with a mean of 65. From the sample of grades
appearing in the accompanying frequency distribution table, can you conclude that the
instructor has achieved this objective? Use $\alpha = 0.05$.
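Grade    Frequency ($f$)   Midpoint ($m$)   $fm$     $fm^2$
30-40      4               35                 140      4,900
40-50     17               45                 765     34,425
50-60     29               55               1,595     87,725
60-70     49               65               3,185    207,025
70-80     33               75               2,475    185,625
80-90     18               85               1,530    130,050
Total    150                        $\sum fm = 9,690$   $\sum fm^2 = 649,750$
$\bar{x} = \frac{\sum fm}{n} = \frac{9690}{150} = 64.60 \approx 65$
$s = \sqrt{\frac{\sum fm^2 - \frac{(\sum fm)^2}{n}}{n - 1}} = \sqrt{\frac{649750 - \frac{(9690)^2}{150}}{149}} \approx 12.63$
With $z = \frac{X - \mu}{\sigma}$, the expected probability of each category can be obtained as follows: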
Expected probability: 0.22756
For category 80-90:
$z_{80} = \frac{80 - 65}{12.63} = +1.19$, probability 0.38298
$z_{90} = \frac{90 - 65}{12.63} = +1.98$, probability 0.47615
Expected probability: 0.47615 - 0.38298 = 0.09317
The six probabilities do not sum to 1.00. Even though the observed frequencies were obtained
only for these six categories, getting a score less than 30 or greater than 90 was also possible.
For < 30
Probability between 30 and the mean = 0.02105 + 0.09317 + 0.22756 + 0.15542 = 0.49720
Probability < 30 = 0.5 - 0.49720 = 0.00280
For > 90
Probability between 90 and the mean = 0.15542 + 0.22756 + 0.09317 = 0.47615
Probability > 90 = 0.5 - 0.47615 = 0.02385
Then, the expected frequencies can be obtained by multiplying each expected probability by the
total frequency (150), as shown below.
Grade    Probability   Expected frequency ($f_e = np_i$)   After combining
< 30     0.00280         0.4200
30-40    0.02105         3.1575
40-50    0.09317        13.9755                            17.553
50-60    0.22756        34.134                             34.134
60-70    0.31084        46.626                             46.626
70-80    0.22756        34.134                             34.134
80-90    0.09317        13.9755                            17.553
> 90     0.02385         3.5775
Since the < 30, 30-40 and > 90 categories have values of less than 5, each must be combined with
the adjacent category. As a result, the < 30 and 30-40 categories become part of the 40-50 category
and the > 90 category becomes part of the 80-90 category.
Grade    Probability   Expected frequency ($f_e = np_i$)
40-50    0.11702       17.553
50-60    0.22756       34.134
60-70    0.31084       46.626
70-80    0.22756       34.134
80-90    0.11702       17.553
The value of the chi-square can then be computed.
IV. Do not reject $H_0$. The grades of students are normally distributed with a mean of 65.
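The final computation for the grades example can be sketched as follows. The observed frequencies are pooled to match the five combined classes, and the expected counts are $n p_i$ with $n = 150$. The degrees of freedom are an assumption here ($v = K - 1 - m$ with $m = 1$, taking the mean of 65 as specified and only the standard deviation as estimated), since the decision rule for this example does not survive in the text.

```python
# Observed grades per class (30-40, 40-50, ..., 80-90).
observed_by_class = [4, 17, 29, 49, 33, 18]

# Pool 30-40 into 40-50 to match the combined classes.
observed = [observed_by_class[0] + observed_by_class[1]] + observed_by_class[2:]

# Combined class probabilities for 40-50, 50-60, 60-70, 70-80, 80-90.
probabilities = [0.11702, 0.22756, 0.31084, 0.22756, 0.11702]
n = 150
expected = [n * p for p in probabilities]

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Assumption: one estimated parameter, so v = 5 - 1 - 1 = 3.
df = len(observed) - 1 - 1
critical = 7.81  # tabulated value for alpha = 0.05, v = 3

print(observed, [round(e, 3) for e in expected])
print(df, round(statistic, 2), statistic > critical)
```

Under these assumptions the statistic is about 1.62, well below the critical value, which agrees with the decision not to reject $H_0$.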