
CHI-SQUARE TEST

A. CHI-SQUARE TEST AS A PARAMETRIC TEST


HYPOTHESIS TESTING FOR COMPARING A VARIANCE TO SOME
HYPOTHESISED POPULATION VARIANCE

The test used for comparing a sample variance with some theoretical or hypothesised population
variance is different from the z-test or the t-test. The test used for this purpose is known as the
chi-square test, and the test statistic, symbolised as χ² and known as the chi-square value, is worked
out. The chi-square value to test the null hypothesis, viz.,

H0: σs² = σp²

is worked out as under:

χ² = (σs² / σp²) × (n − 1)

Where:
σs² = variance of the sample
σp² = variance of the population
(n − 1) = degrees of freedom, n being the number of items in the sample.

Then, by comparing the calculated value of χ² with its table value for (n − 1) degrees of freedom at a
given level of significance, we may either accept H0 or reject it. If the calculated value of χ² is equal
to or less than the table value, the null hypothesis is accepted; otherwise it is rejected. This test is
based on the chi-square distribution, which is not symmetrical and whose values are all positive; one
must simply know the degrees of freedom in order to use the distribution.

Example: A sample of 10 is drawn randomly from a certain population. The sum of the squared
deviations from the mean of the given sample is 50. Test the hypothesis that the variance of the
population is 5 at 5 per cent level of significance.

Given: n = 10, Σ(Xi − X̄)² = 50, σp² = 5, significance level = 5%

Answer: Take the null hypothesis as H0: σs² = σp²

σs² = Σ(Xi − X̄)² / (n − 1) = 50/9 = 5.56

χ² = (σs² / σp²) × (n − 1) = (5.56/5) × 9 ≈ 10

The table value of χ² at the 5 per cent level for 9 d.f. is 16.92. The calculated value of χ² (10) is less
than this table value (calculated value < table value, so the null hypothesis is not rejected); hence we
accept the null hypothesis and conclude that the variance of the population is 5, as given in the
question.
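The worked example above can be sketched in Python (an assumption of mine: the notes compute it by hand); scipy supplies the table value that would otherwise be looked up.

```python
# Chi-square test for a hypothesised population variance, following the
# worked example: n = 10, sum of squared deviations = 50, σp² = 5, α = 0.05.
from scipy.stats import chi2

n = 10
sum_sq_dev = 50.0        # Σ(Xi − X̄)² from the sample
sigma_p_sq = 5.0         # hypothesised population variance
alpha = 0.05

sample_var = sum_sq_dev / (n - 1)             # σs² = 50/9 ≈ 5.56
chi_sq = (sample_var / sigma_p_sq) * (n - 1)  # = 50/5 = 10
critical = chi2.ppf(1 - alpha, df=n - 1)      # table value, 16.92 for 9 d.f.

# calculated value <= table value -> do not reject H0
print(round(chi_sq, 2), round(critical, 2), chi_sq <= critical)
```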
B. CHI-SQUARE TEST AS A NON-PARAMETRIC TEST

As a non-parametric test, the chi-square test can be used as:


(i) Test of goodness of fit:
a. The χ² test enables us to see how well an assumed theoretical distribution (such as the
Binomial, Poisson or Normal distribution) fits the observed data.
b. When some theoretical distribution is fitted to the given data, we are always interested in
knowing how well this distribution fits the observed data.
c. If the calculated value of χ² < the table value, the fit is considered a good one, which means
that the divergence between the observed and expected frequencies is attributable to
fluctuations of sampling.
d. If the calculated value of χ² > the table value, the fit is not considered a good one.
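Rules (c) and (d) can be sketched with scipy; the die-roll counts below are hypothetical, chosen only for illustration.

```python
# Goodness of fit: do hypothetical die-roll counts fit a fair (uniform) die?
from scipy.stats import chisquare, chi2

observed = [22, 17, 18, 25, 21, 17]            # 120 rolls of a die
expected = [20] * 6                            # fair die: equal frequencies

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
critical = chi2.ppf(0.95, df=len(observed) - 1)   # table value, 5 d.f., 5% level

# calculated value < table value -> the fit is considered a good one
print(round(stat, 2), round(critical, 2), stat < critical)
```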

(ii) Test of independence:


a. This test enables us to determine whether or not two attributes are associated.
b. For instance, we may be interested in knowing whether a new medicine is effective in
controlling fever or not. In such a situation, we proceed with the null hypothesis that the two
attributes (viz., new medicine and control of fever) are independent, which means that the new
medicine is not effective in controlling fever.
c. On this basis we first calculate the expected frequencies and then work out the value of χ².
d. If the calculated value of χ² < the table value, the null hypothesis stands, which means that the
two attributes are independent or not associated (i.e., the new medicine is not effective in
controlling the fever).
e. If the calculated value of χ² > the table value, the null hypothesis does not hold good, which
means the two attributes are associated and the association is not due to some chance factor but
exists in reality (i.e., the new medicine is effective in controlling the fever and as such may be
prescribed).
f. It may, however, be stated here that χ² is not a measure of the degree or form of the relationship
between two attributes; it is simply a technique for judging the significance of such association
or relationship between two attributes.

 CONDITIONS FOR THE APPLICATION OF THE χ² TEST


The following conditions should be satisfied before the χ² test can be applied:
(i) The observations recorded and used are collected on a random basis.
(ii) All the items in the sample must be independent.
(iii) No group should contain very few items, say fewer than 10. Where frequencies are less than
10, regrouping is done by combining the frequencies of adjoining groups so that the new frequencies
become greater than 10. Some statisticians take this number as 5, but 10 is regarded as better by
most statisticians.
(iv) The overall number of items must also be reasonably large. It should normally be at least 50,
howsoever small the number of groups may be.
(v) The constraints must be linear. Constraints which involve linear equations in the cell frequencies of a
contingency table (i.e., equations containing no squares or higher powers of the frequencies) are known
as linear constraints.
 IMPORTANT CHARACTERISTICS OF THE χ² TEST
(i) This test (as a non-parametric test) is based on frequencies and not on parameters like the
mean and standard deviation.
(ii) The test is used for testing hypotheses and is not useful for estimation.
(iii) This test possesses the additive property, as has already been explained.
(iv) This test can also be applied to a complex contingency table with several classes and as
such is a very useful test in research work.
(v) This test is an important non-parametric test, as no rigid assumptions are necessary regarding
the type of population, no parameter values are needed, and relatively little mathematical detail
is involved.

CALCULATION STEPS FOR χ²

χ² = Σ (fo − fe)² / fe

Where,
fo = observed frequency in each of the response categories
fe = expected frequency in each of the response categories

Step 1: Expected frequency of any cell = (Row total for the row of that cell × Column total for the
column of that cell) / Grand total
Step 2: Find the difference between the observed and expected frequencies and square these
differences, i.e., calculate (fo − fe)².
Step 3: Divide each (fo − fe)² obtained as stated above by the corresponding expected frequency to
get (fo − fe)²/fe, and do this for all the cell frequencies or the group frequencies.
Step 4: Find the summation of the (fo − fe)²/fe values, i.e., Σ (fo − fe)²/fe. This is the required χ² value.
The χ² value obtained in this way should be compared with the relevant table value of χ² and the
inference then drawn.
Example: Mr. George Mcmohan, president of National General Health Insurance Company, is opposed
to national health insurance. He argues that it would be too costly to implement, particularly since
the existence of such a system would, among other effects, tend to encourage people to spend more
time in hospitals. George believes that the lengths of stays in hospitals are dependent on the types of
health insurance that people have. He asked Donna, his staff statistician, to check the matter. Donna
collected data on a random sample of 660 hospital stays and summarised them in the table below.
Test at the 99 per cent confidence level.

                                Days in hospital
                          <5     5-10     >10    Total
Fraction of     <25%      40       75      65      180
costs covered   25-50%    30       45      75      150
by insurance    >50%      40      100     190      330
                Total    110      220     330      660

Answer: Null hypothesis H0: Length of stay and type of insurance are independent.
Alternative hypothesis Ha: Length of stay depends on the type of insurance.

S.no.   Row   Column   fo     fe = (RT × CT)/GT   fo − fe   (fo − fe)²   (fo − fe)²/fe
1       1     1         40     30                   10        100          3.333
2       1     2         75     60                   15        225          3.750
3       1     3         65     90                  −25        625          6.944
4       2     1         30     25                    5         25          1.000
5       2     2         45     50                   −5         25          0.500
6       2     3         75     75                    0          0          0.000
7       3     1         40     55                  −15        225          4.091
8       3     2        100    110                  −10        100          0.909
9       3     3        190    165                   25        625          3.788
                                                             Chi-square   24.316

As the value of χ² is 24.32, which is greater than the table value of 13.27 at 4 degrees of freedom at
the 1% significance level, the null hypothesis is not accepted, and thus there is a significant
association between insurance type and duration of stay in hospital (insurance coverage and length
of hospital stay are dependent on each other).
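The same result can be reproduced with scipy's built-in independence test (an assumption of mine: the notes compute it by hand or in SPSS).

```python
# The hospital-stay example via scipy.stats.chi2_contingency.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[40, 75, 65],
                  [30, 45, 75],
                  [40, 100, 190]])

stat, p_value, df, expected = chi2_contingency(table)
print(round(stat, 2), df)      # chi-square statistic and degrees of freedom
print(p_value < 0.01)          # True -> reject H0 at the 1% level
```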

Chi-Square in SPSS

Applicable: when both the DV and the IDV are non-metric (measured on a nominal/ordinal scale)

Steps in SPSS: Analyze → Descriptive Statistics → Crosstabs


Output: Table 1: Statistical summary

Case Processing Summary
                                                Cases
                                 Valid             Missing           Total
                                 N      Percent    N     Percent     N      Percent
Income category in thousands *
Job satisfaction                 6400   100.0%     0     0.0%        6400   100.0%

Table 2: Cross tabulation

Income category in thousands * Job satisfaction Crosstabulation
Count
                          Job satisfaction                                            Total
                          Highly        Somewhat      Neutral   Somewhat   Highly
                          dissatisfied  dissatisfied            satisfied  satisfied
Income     Under $25      350           316           250       157        101        1174
category   $25 - $49      531           534           542       496        285        2388
in         $50 - $74      137           215           261       262        245        1120
thousands  $75+            91           203           340       491        593        1718
Total                     1109          1268          1393      1406       1224       6400

Table 3: Chi-square test

Null hypothesis H0: There is no significant association between income and job satisfaction of the
employees.
Ha: There is a significant association between income and job satisfaction of the employees.

Chi-Square Tests
                                Value      df    Asymp. Sig. (2-sided)
Pearson Chi-Square              823.879a   12    .000
Likelihood Ratio                854.177    12    .000
Linear-by-Linear Association    792.550     1    .000
N of Valid Cases                6400

a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 194.08.

Interpretation: The chi-square test statistic of 823.87 (significance value < 0.05) indicates that there
is a significant association between the two variables (income and job satisfaction).
1) For checking the assumption, go to Analyze → Descriptive Statistics → Crosstabs and click on Cells.
Output table:

Income category in thousands * Job satisfaction Crosstabulation
                                     Job satisfaction                                          Total
                           Highly        Somewhat      Neutral   Somewhat   Highly
                           dissatisfied  dissatisfied            satisfied  satisfied
Income     Under $25
category     Count         350           316           250       157        101                1174
in           Expected      203.4         232.6         255.5     257.9      224.5              1174.0
thousands  $25 - $49
             Count         531           534           542       496        285                2388
             Expected      413.8         473.1         519.8     524.6      456.7              2388.0
           $50 - $74
             Count         137           215           261       262        245                1120
             Expected      194.1         221.9         243.8     246.1      214.2              1120.0
           $75+
             Count          91           203           340       491        593                1718
             Expected      297.7         340.4         373.9     377.4      328.6              1718.0
Total        Count         1109          1268          1393      1406       1224               6400
             Expected      1109.0        1268.0        1393.0    1406.0     1224.0             6400.0

Look at the expected frequencies: if any expected frequency is < 5, the chi-square assumption is
violated and the chi-square test cannot be applied. If the assumption is fulfilled, go ahead.
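This check can be sketched in Python; the small 2×2 table below is hypothetical, chosen so that one expected count falls below 5.

```python
# Check the "no expected frequency below 5" condition before applying the test.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[12, 3],
                     [18, 7]])          # hypothetical 2x2 table

_, _, _, expected = chi2_contingency(observed)
violated = (expected < 5).any()
print(expected.round(2))
print("assumption violated" if violated else "OK to apply chi-square")
```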

ANALYSIS OF VARIANCE (ANOVA)

ANOVA is essentially a procedure for testing the difference among different groups of data for
homogeneity. Earlier, we noted that the t-test can be used to study the means of one or two samples.
But if there are more than two samples, multiple t-tests would need to be applied, and this process
can become very complex. Instead, ANOVA can be used to study the means of two or more
populations. In ANOVA, the dependent variable should be measured on an interval or ratio scale,
and the independent variable(s) should be categorical (measured on a nominal or ordinal scale).
Through ANOVA technique one can, in general, investigate any number of factors which are
hypothesized or said to influence the dependent variable. One may as well investigate the differences
amongst various categories within each of these factors which may have a large number of possible
values. If we take only one factor and investigate the differences amongst its various categories
having numerous possible values, we are said to use one-way ANOVA and in case we investigate
two factors at the same time, then we use two-way ANOVA. In a two or more way ANOVA, the
interaction (i.e., inter-relation between two independent variables/factors), if any, between two
independent variables affecting a dependent variable can as well be studied for better decisions.
Two estimates of population variance are worked out: one based on the between-samples variance
and the other based on the within-samples variance. The two estimates are then compared using the
F-test:

F = (estimate of population variance based on between-samples variance) /
    (estimate of population variance based on within-samples variance)
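The F-ratio can be sketched with scipy's one-way ANOVA; the three groups of sales figures below are hypothetical.

```python
# One-way ANOVA: F is the ratio of the between-samples variance estimate to
# the within-samples estimate.  Hypothetical sales figures for three groups.
from scipy.stats import f_oneway

group_a = [3, 2, 4, 3, 5]
group_b = [5, 6, 4, 5, 6]
group_c = [4, 5, 5, 4, 6]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(round(f_stat, 2), round(p_value, 3))
if p_value < 0.05:
    print("At least one group mean differs significantly")
```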

N-WAY ANOVA IN SPSS


N= NUMBER OF INDEPENDENT VARIABLES
IF IDV=1 THEN, ONE WAY ANOVA
IF IDV= 2 THEN, TWO WAY ANOVA
First two steps are similar to the application of the t-test:
1) Test for normality
2) Test for homogeneity of variances

Step 3: a) If the null hypothesis is accepted (p value > 0.05), there is no significant difference
between the mean scores of the different categories.
b) If the null hypothesis is rejected (p value < 0.05), at least one of the categories of the IDV differs
significantly from the rest in its mean scores; then apply post-hoc analysis to identify the differing
category.
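The post-hoc step can be sketched with scipy's Tukey HSD (available from SciPy 1.8; the three groups below are hypothetical).

```python
# Tukey HSD post-hoc comparison of three hypothetical groups.
from scipy.stats import tukey_hsd

group_a = [3, 2, 4, 3, 5]
group_b = [5, 6, 4, 5, 6]
group_c = [4, 5, 5, 4, 6]

result = tukey_hsd(group_a, group_b, group_c)
print(result)   # pairwise mean differences, p-values and confidence intervals
# pairs whose confidence interval excludes 0 (p < 0.05) differ significantly
```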

TWO WAY ANOVA


WHEN, IDV=2 (Non metric), DV=1 (Metric)
Example: DV= sales (metric), IDV 1= Territory (3 categories; A,B,C), IDV 2= Season= (3 categories:
Summer, Rainy, winter)

Claim: to check the combined effect of season and territory on sales


Apply 2-way ANOVA
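The combined (interaction) effect can be sketched from first principles with NumPy for a small balanced design; all sales figures below are hypothetical, and SPSS's Tests of Between-Subjects Effects table reports the same sums of squares for a balanced design.

```python
# Two-way ANOVA for a balanced 3x3 design with 2 replicates per cell:
# data[territory, season, replicate] holds hypothetical sales.
import numpy as np
from scipy.stats import f as f_dist

data = np.array([[[3, 4], [2, 3], [4, 5]],
                 [[5, 6], [5, 4], [4, 5]],
                 [[5, 4], [4, 5], [5, 6]]], dtype=float)
a, b, r = data.shape
grand = data.mean()

# main-effect sums of squares
ss_terr = b * r * ((data.mean(axis=(1, 2)) - grand) ** 2).sum()
ss_seas = a * r * ((data.mean(axis=(0, 2)) - grand) ** 2).sum()

# interaction (combined effect) and error sums of squares
cell_means = data.mean(axis=2)
ss_cells = r * ((cell_means - grand) ** 2).sum()
ss_inter = ss_cells - ss_terr - ss_seas
ss_error = ((data - cell_means[..., None]) ** 2).sum()

df_inter, df_error = (a - 1) * (b - 1), a * b * (r - 1)
f_inter = (ss_inter / df_inter) / (ss_error / df_error)
p_inter = f_dist.sf(f_inter, df_inter, df_error)
print(round(f_inter, 2), round(p_inter, 3))
```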

Steps in SPSS:
1. Analyze → General Linear Model → Univariate
2. Specify the Plots, Post Hoc and Options settings as required.

OUTPUT
Table 1: This table represents the count of each category for each IDV variable
Between-Subjects Factors
             Value    Label     N
Territory    1        A         9
             2        B         8
             3        C        13
Season       1        Summer   10
             2        Winter   10
             3        Rainy    10

Table 2: Shows the assumption of homogeneity of variances, where Levene's test is applied to test
the hypothesis.

Levene's Test of Equality of Error Variancesa

Dependent Variable: Sales
F        df1    df2    Sig.
1.871    8      21     .119

Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a. Design: Intercept + teritory + season + teritory * season

Table3: The Main ANOVA table


 It tells us whether any of the independent variables have had an effect on the dependent variable.
 The important things to look at in the table are the significance values of the independent
variables.
 From the table we can say that there is no significant effect of territory on sales (p value = 0.159
> 0.05).
 From the table we can say that there is no significant effect of season on sales (p value = 0.680 >
0.05).
 There is also no significant interaction effect of territory and season on sales (p value = 0.836 >
0.05).

Tests of Between-Subjects Effects

Dependent Variable: Sales
Source              Type III Sum of Squares   df    Mean Square   F         Sig.
Corrected Model     26.133a                    8    3.267         .809      .603
Intercept           521.550                    1    521.550       129.107   .000
Territory           16.251                     2    8.126         2.011     .159
Season              3.179                      2    1.589         .393      .680
teritory * season   5.781                      4    1.445         .358      .836
Error               84.833                    21    4.040
Total               683.000                   30
Corrected Total     110.967                   29

a. R Squared = .236 (Adjusted R Squared = -.056)

Table 4: Summary statistics

Territory * Season
Dependent Variable: Sales
Territory   Season    Mean     Std. Error   95% Confidence Interval
                                            Lower Bound   Upper Bound
A           Summer    3.333    1.160        .920          5.747
            Winter    2.000    1.160        -.413         4.413
            Rainy     4.333    1.160        1.920         6.747
B           Summer    5.000    1.160        2.587         7.413
            Winter    5.000    1.160        2.587         7.413
            Rainy     4.500    1.421        1.544         7.456
C           Summer    5.000    1.005        2.910         7.090
            Winter    4.500    1.005        2.410         6.590
            Rainy     5.000    .899         3.131         6.869

Table 5: POST HOC ANALYSIS

Multiple Comparisons
Dependent Variable: Sales
Tukey HSD
(I) Territory   (J) Territory   Mean Difference (I-J)   Std. Error   Sig.   95% Confidence Interval
                                                                            Lower Bound   Upper Bound
A               B               -1.65                   .977         .231   -4.11         .81
                C               -1.62                   .872         .174   -3.82         .57
B               A               1.65                    .977         .231   -.81          4.11
                C               .03                     .903         .999   -2.25         2.31
C               A               1.62                    .872         .174   -.57          3.82
                B               -.03                    .903         .999   -2.31         2.25

Based on observed means.
The error term is Mean Square(Error) = 4.040.

Multiple Comparisons
Dependent Variable: Sales
Tukey HSD
(I) Season   (J) Season   Mean Difference (I-J)   Std. Error   Sig.   95% Confidence Interval
                                                                      Lower Bound   Upper Bound
Summer       Winter       .60                     .899         .785   -1.67         2.87
             Rainy        -.20                    .899         .973   -2.47         2.07
Winter       Summer       -.60                    .899         .785   -2.87         1.67
             Rainy        -.80                    .899         .652   -3.07         1.47
Rainy        Summer       .20                     .899         .973   -2.07         2.47
             Winter       .80                     .899         .652   -1.47         3.07

Based on observed means.
The error term is Mean Square(Error) = 4.040.

Table 6: Homogeneous subsets

Sales
Tukey HSD
Territory    N     Subset
                   1
A            9     3.22
C            13    4.85
B            8     4.88
Sig.               .194
Means for groups in homogeneous subsets are displayed. Based on observed means.
The error term is Mean Square(Error) = 4.040.
a. Uses Harmonic Mean Sample Size = 9.584.
b. The group sizes are unequal. The harmonic mean of the group sizes is used. Type I error levels
are not guaranteed.
c. Alpha = .05.

Sales
Tukey HSD
Season       N     Subset
                   1
Winter       10    3.90
Summer       10    4.50
Rainy        10    4.70
Sig.               .652
Means for groups in homogeneous subsets are displayed. Based on observed means.
The error term is Mean Square(Error) = 4.040.
a. Uses Harmonic Mean Sample Size = 10.000.
b. Alpha = .05.
PROFILE PLOTS: Estimated Marginal Means of Sales for Territory and Season
