Chi Squared Test: Goodness of Fit and Independence of Attributes


Chi-squared Test
▪ A Chi-squared test is a statistical hypothesis test that is valid when the
test statistic follows a chi-squared distribution under the null hypothesis
▪ The Chi-squared distribution with k degrees of freedom is the distribution
of the sum of squares of k independent standard normal random variables
▪ The Chi-squared distribution is one of the most widely used probability
distributions in inferential statistics and hypothesis testing
▪ The test can also be used to determine whether two categorical variables in
our data are associated. It helps to find out whether an observed relationship
between two categorical variables is due to chance or reflects a real
association between them.
▪ The Chi-squared test is only applicable to categorical data, for example
counts of men and women falling into age, height, or weight categories.
Properties
The following are the important properties of
the chi-square distribution:
• The variance is equal to two times the
number of degrees of freedom.
• The mean of the distribution is equal to the
number of degrees of freedom.
• The chi-square distribution curve approaches
the normal distribution as the number of
degrees of freedom increases.
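
The mean and variance properties can be checked numerically. The sketch below is only an illustration, not part of the original slides: it builds chi-squared samples directly from the definition (sum of squares of k independent standard normals) and compares the sample mean and variance with k and 2k. The function name `simulate_chi_squared`, the choice k = 5, the sample size, and the seed are all assumptions made for the example.

```python
import random
import statistics


def simulate_chi_squared(k, n_samples, seed=0):
    """Draw n_samples values, each the sum of squares of k standard normals."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))
        for _ in range(n_samples)
    ]


k = 5  # degrees of freedom (illustrative choice)
samples = simulate_chi_squared(k, n_samples=20000)

sample_mean = statistics.fmean(samples)
sample_var = statistics.variance(samples)

# Theory: mean = k, variance = 2k. The simulated values should be close.
print(f"mean     ~ {sample_mean:.2f} (theory {k})")
print(f"variance ~ {sample_var:.2f} (theory {2 * k})")
```

Because the values are simulated, they only approximate k and 2k. The agreement tightens as the number of samples grows.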
Why and Where Do We Use the Chi-Squared Test?

• The Chi-squared test can be used to see if your data follows a well-known
theoretical probability distribution like the Normal or Poisson distribution.
• The Chi-squared test allows you to assess your trained regression
model's goodness of fit on the training, validation, and test data sets.
▪ Chi-square is most commonly used by researchers who are studying
survey response data because it applies to categorical variables.
▪ Demography, consumer and marketing research, political science, and
economics are all examples of this type of research.
Independence of Attributes
▪ The chi-square test of independence, also known as the chi-square test of
association, is used to determine whether two categorical variables are
associated.
▪ It is considered a non-parametric test.
▪ Non-parametric tests do not require assumptions about the distribution of the
underlying population. They do not rely on the data belonging to any particular
parametric family of probability distributions. For this reason, non-parametric
methods are also called distribution-free tests.
▪ It is mostly used to test statistical independence.
▪ For this test, the data must meet the following requirements:
• Two categorical variables
• Relatively large sample size
• Categories of variables (two or more)
• Independence of observations
Formulas

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

with $(r-1)(c-1)$ degrees of freedom, where r is the number of rows and c is the number of columns.

$O_{11}, O_{12}, O_{13}, \ldots, O_{rc}$ – Observed values for every cell

$E_{11}, E_{12}, E_{13}, \ldots, E_{rc}$ – Expected values for every cell

Null Hypothesis – H0 : The two categorical variables are independent of each other

Alternate Hypothesis – H1 : The two categorical variables are dependent on each other

Rejection Criterion: H0 is rejected when $\chi^2 > \chi^2_{\alpha,(r-1)(c-1)}$.



Application Problem

Grades in the Fluid Mechanics course and the Dynamics of Machinery course, taken
simultaneously by a group of students, are given below. Are the grades in the two
courses related? Use the α = 0.01 level of significance to reach your conclusion.

Fluid         Dynamics of Machinery
Mechanics     A      B      C      Other
A             25     6      17     13
B             17     16     15     6
C             18     4      18     10
Other         10     8      11     20
Solution:
H0 : Grades in Fluid Mechanics and Dynamics of Machinery are independent of each other

H1 : Grades in Fluid Mechanics and Dynamics of Machinery are dependent on each other

Table of Observed Values

Fluid         Dynamics of Machinery
Mechanics     A      B      C      Other    Total
A             25     6      17     13       61
B             17     16     15     6        54
C             18     4      18     10       50
Other         10     8      11     20       49
Total         70     34     61     49       214

Table of Observed (O) and Expected (E) Values

Fluid         Dynamics of Machinery
Mechanics     A      B      C      Other    Total
O1j           25     6      17     13       61
E1j           19.95  9.69   17.39  13.97
O2j           17     16     15     6        54
E2j           17.66  8.58   15.39  12.36
O3j           18     4      18     10       50
E3j           16.36  7.94   14.25  11.45
O4j           10     8      11     20       49
E4j           16.03  7.79   13.97  11.22
Total         70     34     61     49       214

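Expected value calculation: each expected value is the product of its row total and column total divided by the grand total N,

$$E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N}$$

For example, $E_{11} = 61 \times 70 / 214 = 19.95$.

Summing $(O_{ij} - E_{ij})^2 / E_{ij}$ over all 16 cells gives $\chi^2 \approx 25.6$. With (4-1)(4-1) = 9 degrees of freedom, the critical value is $\chi^2_{0.01,9} = 21.67$, so $\chi^2 > \chi^2_{0.01,9}$.

Therefore, H0 is rejected. This means that the grades
of the two courses are dependent on each other.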
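
The calculation for the grades problem can be reproduced with a short script. This is a minimal sketch using only the Python standard library. The function name `chi_square_independence` is a hypothetical helper introduced for illustration, and the critical value 21.666 is hard-coded from a standard chi-squared table (α = 0.01, 9 degrees of freedom) rather than computed.

```python
def chi_square_independence(observed):
    """Return (statistic, degrees of freedom, expected table) for an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    # E_ij = (row i total) * (column j total) / N
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # Sum of (O_ij - E_ij)^2 / E_ij over every cell
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Rows: Fluid Mechanics grade (A, B, C, Other)
# Columns: Dynamics of Machinery grade (A, B, C, Other)
grades = [
    [25, 6, 17, 13],
    [17, 16, 15, 6],
    [18, 4, 18, 10],
    [10, 8, 11, 20],
]

# Assumption: value taken from a chi-squared table for alpha = 0.01, 9 d.o.f.
CRITICAL_ALPHA_001_DOF_9 = 21.666

statistic, dof, expected = chi_square_independence(grades)
reject_h0 = statistic > CRITICAL_ALPHA_001_DOF_9

print(f"chi-squared = {statistic:.2f}, degrees of freedom = {dof}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

The same helper works for any table of counts with at least two rows and two columns, but the critical value must be looked up again for the new degrees of freedom and significance level.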
Goodness of Fit
▪ The Chi-square goodness of fit test is a statistical hypothesis test used to
determine whether a variable is likely to come from a specified
distribution or not. It is often used to evaluate whether sample data is
representative of the full population.
▪ For the goodness of fit test, we need one variable. We also need an
idea, or hypothesis, about how that variable is distributed.
▪ To apply the goodness of fit test to a data set we need:
• Data values that are a simple random sample from the full population.
• Categorical or nominal data. The Chi-square goodness of fit test is not
appropriate for continuous data.
• A data set that is large enough so that at least five values are expected in
each of the observed data categories. 
Formulas

$$\chi_0^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

with $k - p - 1$ degrees of freedom, where k is the number of intervals and p is the number of parameters estimated from the sample.

$O_1, O_2, O_3, \ldots, O_k$ – Observed values for every cell

$E_1, E_2, E_3, \ldots, E_k$ – Expected values for every cell

Null Hypothesis – H0 : The sample data follows a particular distribution

Alternate Hypothesis – H1 : The sample data does not follow the particular distribution

Rejection Criterion: H0 is rejected when $\chi_0^2 > \chi^2_{\alpha,k-p-1}$.



Application Problem

This is the distribution of the number of observed flaws in a box of gears
from the quality control department of Green Enterprises Ltd. 75 machined
workpieces were inspected and the following data were observed. Check whether
the data follow a Poisson distribution at a significance level of 0.01.

Flaws                 1    2    3    4    5    6    7    8
Observed Frequency    1    11   8    13   11   12   10   9
H0 : The distribution of the flaws follows a Poisson distribution
H1 : The distribution of the flaws does not follow a Poisson distribution
The expected frequencies can be calculated using the Poisson distribution,

$$P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}$$

where $\lambda$ is the parameter of the Poisson distribution and can be estimated from
the sample mean. Here $\lambda = 368/75 \approx 4.91$.
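
The estimate of λ and the resulting Poisson probabilities can be sketched as follows. This is an illustrative script using only the standard library. The helper name `poisson_pmf` is an assumption introduced for the example, and the expected frequency is taken as probability times the 75 specimens, as in the table that follows.

```python
import math

flaws = [1, 2, 3, 4, 5, 6, 7, 8]
observed = [1, 11, 8, 13, 11, 12, 10, 9]

n = sum(observed)  # number of inspected workpieces

# Estimate lambda with the sample mean: sum(x * frequency) / n
lam = sum(x * f for x, f in zip(flaws, observed)) / n


def poisson_pmf(x, lam):
    """Poisson probability P(X = x) = exp(-lam) * lam**x / x!"""
    return math.exp(-lam) * lam ** x / math.factorial(x)


probabilities = [poisson_pmf(x, lam) for x in flaws]
expected = [p * n for p in probabilities]

print(f"lambda = {lam:.3f}")
for x, o, p, e in zip(flaws, observed, probabilities, expected):
    print(f"{x}  {o:>3}  {p:.3f}  {e:.3f}")
```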
Flaws   Observed     Probability by          Expected Frequency =
        Frequency    Poisson Distribution    (prob. x no. of specimens)
1       1            0.036                   2.721
2       11           0.089                   6.677
3       8            0.145                   10.921
4       13           0.179                   13.397
5       11           0.175                   13.146
6       12           0.143                   10.753
7       10           0.101                   7.538
8       9            0.062                   4.624
Classes 1 and 2, and classes 7 and 8, are pooled so that every expected frequency is at least 5.

Flaws   Observed     Probability by    Expected Frequency (Ei) =     |Oi-Ei|    (Oi-Ei)^2/Ei
        Frequency    Poisson           (prob. x no. of specimens)
        (Oi)         Distribution
1,2     12           0.125             9.398                         2.602      0.720
3       8            0.145             10.921                        2.921      0.782
4       13           0.179             13.397                        0.397      0.012
5       11           0.175             13.146                        2.146      0.351
6       12           0.143             10.753                        1.247      0.145
7,8     19           0.163             12.162                        6.838      3.845
Total                                                                           5.8545
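The degrees of freedom for the data are k - p - 1 = 6 - 1 - 1 = 4 (p = 1
for Poisson).

The critical value is $\chi^2_{0.01,4} = 13.28$, and $\chi_0^2 = 5.8545 < 13.28$.

Therefore H0 is not rejected, which means the given flaw data are consistent
with a Poisson distribution.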
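
The pooled goodness-of-fit calculation can be checked with a short script. This is a sketch under stated assumptions: the observed and expected frequencies are copied from the pooled table above, the critical value 13.277 is hard-coded from a standard chi-squared table (α = 0.01, 4 degrees of freedom), and the helper `chi_squared_sf_even_dof` is a hypothetical name for the closed-form upper-tail probability that holds only when the degrees of freedom are even.

```python
import math

# Pooled classes: flaws {1,2}, 3, 4, 5, 6, {7,8}
observed = [12, 8, 13, 11, 12, 19]
expected = [9.398, 10.921, 13.397, 13.146, 10.753, 12.162]

# Sum of (O_i - E_i)^2 / E_i over the pooled classes
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

p_estimated = 1  # one parameter (lambda) estimated from the sample
dof = len(observed) - p_estimated - 1

# Assumption: value taken from a chi-squared table for alpha = 0.01, 4 d.o.f.
CRITICAL_ALPHA_001_DOF_4 = 13.277
reject_h0 = statistic > CRITICAL_ALPHA_001_DOF_4


def chi_squared_sf_even_dof(x, dof):
    """Upper-tail probability P(X > x) for a chi-squared variable, even dof only."""
    if dof % 2 != 0:
        raise ValueError("closed form used here requires even degrees of freedom")
    half = x / 2
    terms = sum(half ** i / math.factorial(i) for i in range(dof // 2))
    return math.exp(-half) * terms


p_value = chi_squared_sf_even_dof(statistic, dof)

print(f"chi-squared = {statistic:.3f}, degrees of freedom = {dof}")
print(f"p-value = {p_value:.3f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

Comparing the p-value with α = 0.01 leads to the same decision as comparing the statistic with the critical value.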
Thank You
20M134 – Neelakantan V
20M135 – Nikhil Nathanael Ilango
