chi-SquaredTest - Vishal (21DM217) - Vatsal (21DM216) - Preeti (21DM242) - Absent On4th&11th July.

Machine Learning Group Assignment
Chi-Square, RFE, IG
Submitted by: Submitted to:

Vishal Poduri 21DM217 Dr. Mahak Sharma
Vatsal Mittal 21DM 216
Preeti Mittal 21DM242
Chi-Squared Test
What is it? Types
A Pearson’s chi-square test is a statistical test

The chi-square goodness of fit test is used to
for categorical data. It is used to determine
test whether the frequency distribution of a
whether your data are significantly different
categorical variable is different from your
from what you expected.
expectations.
The chi-square test of independence is used
to test whether two categorical variables are
related to each other.

Chi-Squared Test
When to use it? How to perform a chi-square test
1. Create a table of the observed and expected

A Pearson’s chi-square test may be an frequencies.
appropriate option for your data if all of the 2. Calculate the chi-square value from your
following are true: observed and expected frequencies using the
1. You want to test a hypothesis about one chi-square formula.
or more categorical variables. 3. Find the critical chi-square value in a chi-
2. The sample was randomly selected from square critical value table or using statistical
the population. software.
3. There are a minimum of five observations 4. Compare the chi-square value to the critical
expected in each group or combination of value to determine which is larger.
groups. 5. Decide whether to reject the null hypothesis.
Chi-Squared Test
What is a Chi-Square Statistic? Comparing the critical value
You could take your calculated chi-square

value and compare it to a critical value from
a chi-square table. If the chi-square value is
The subscript “c” is the degrees of freedom. more than the critical value, then there is a
“O” is your observed value and E is your significant difference.
expected value.
A low value for chi-square means there is a high You could also use a p-value. First state
correlation between your two sets of data. In the null hypothesis and the alternate
theory, if your observed and expected values hypothesis. Then generate a chi-square
were equal (“no difference”) then chi-square curve for your results along with a p-
would be zero — highly unlikely to happen value
Chi-Squared Test
Consider an array arr[] = {2, 10, 8, 7}
Example question: 256 visual artists were surveyed to find out their zodiac sign. The results were: Aries (29), Taurus
(24), Gemini (22), Cancer (19), Leo (21), Virgo (18), Libra (19), Scorpio (20), Sagittarius (23), Capricorn (18), Aquarius
(20), Pisces (23). Test the hypothesis that zodiac signs are evenly distributed across visual artists.
H0: Equal distribution of across the 12 signs

H1: Not equally distributed
α = 0.05
Step 1: Make a table with columns for “Categories,”

“Observed,” “Expected,” “Residual (Obs-Exp)”, “(Obs-
Exp)2” and “Component (Obs-Exp)2 / Exp.” Don’t worry
what these mean right now; We’ll cover that in the
following steps.
Chi-Squared Test
Step 2: Fill in your categories. Categories should be Step 3: Write your counts. Counts are the number of
given to you in the question. There are 12 zodiac signs, each items in each category in column 2. You’re given
so: the counts in the question:
Chi-Squared Test
Step 4: Calculate your expected value for column 3. In Step 5: Subtract the expected value (Step 4) from the
this question, we would expect the 12 zodiac signs to be Observed value (Step 3) and place the result in the
evenly distributed for all 256 people, so 256/12=21.333. “Residual” column. For example, the first row is Aries:
Write this in column 3. 29-21.333=7.667.
Chi-Squared Test
Step 6: Square your results from Step 5 and place the Step 7: Divide the amounts in Step 6 by the expected
amounts in the (Obs-Exp)2 column. value (Step 4) and place those results in the final
column. Finally add up (sum) all the values in the last
column.
Chi-Squared Test
H0: Equal distribution of across the 12 signs

This is the chi-square statistic: 5.094.
H1: Not equally distributed
dof = 11 α = 0.05
critical value =4.575

the chi-square value is more than the critical value, then there is a
significant difference in the zodiac signs.
Information Gain
What is it?
It measures the reduction in entropy or surprise by splitting a
dataset according to a given value of a random variable.
Information...
information quantifies how surprising an event is in bits.
Lower probability events have more information, higher
probability events have less information.
Entropy
It quantifies how much information there is in a random

variable, or more specifically its probability
distribution.
Information Gain
Example
For example, in a binary classification problem (two
classes), we can calculate the entropy of the data sample
as follows:
Entropy = -(p(0) * log(P(0)) + p(1) * log(P(1)))
Syntax
Information Gain
Example
Predicting the Gender of an unborn baby.
1. At the first step, when the mother gets pregnant, gender of

the foetus can be male or female and the uncertainty is high
or at best you are 50% certain or uncertain about the
prediction.
2. However, at the time of first ultrasound test in the first trimester,

the certainty of prediction becomes better, say it becomes 75%. The
drop in uncertainty is the loss of entropy and that is also your
Information or Knowledge Gain, because the loss in entropy has
resulted in an equal gain of certainty.
3. Now after the ultrasound in third trimester of pregnancy, there

is no uncertainty and the prediction can be close to 100% certain.
Information Gain
Example
Predict whether there will be a golf game today or not.
1. The result here depends on the outlook or weather forecast.
So the gain here while predicting PlayGolf in presence of Outlook

will be the difference between
“entropy when trying to predict PlayGolf alone”

and
“entropy when trying to predict PlayGolf with knowledge of
Outlook” for the day.
Recursive Feature Elimination Method
What is it? RFE can be used to handle problems presented by the two
models listed below:
Classification: Classification predicts the class of selected
Feature elimination in machine learning is
data points. Classes are also known as targets, labels, or
referred to as choosing a subset of relevant
categories. Classification predictive modeling involves
features from the dataset to use in further model
approximating a mapping function (f) from input variables (X)
construction achieving the optimum number to discrete output variables (y).
needed to assure peak performance. Regression: Regression models supply a function describing
the relationship between one (or more) independent
Steps variables and a response, dependent, or target variable.
We make the feature selection from the sklearn

SelectFromModel class
We do the feature selection using RFE with cross-
validation to prevent overfitting.
The estimator is nothing but the tree-based algorithms.

Implementing RFE algorithm in Python Real World decision-making scenario
Data Preparation You and your five friends are trying to decide whether to go
To start with, we will import the following libraries. out to eat or not?
factors come up for consideration:
Who is hungry enough to eat a full meal
How people’s available funds are holding up
How late people can stay up
What kind of food people do want
The location and types of local eateries
How late do people want to stay out
Who has a car
Read the dataset Result

After spending way too much time debating, two features were selected:
Who is hungry?
The location and types of local eateries
So, we have eliminated many features and have reduced the amount of
time needed to decide.
Example
1.To show how this works in practice, we’ll start with a
contrived example using a dataset that has only 3
informative features out of 25.
2. This figure shows an ideal RFECV curve, the curve

jumps to an excellent accuracy when the three
informative features are captured, then gradually
decreases in accuracy as the non informative features
are added into the model.
3. The shaded area represents the variability of cross-

validation, one standard deviation above and below the
mean accuracy score drawn by the curve.
Thank
you!!

chi-SquaredTest - Vishal (21DM217) - Vatsal (21DM216) - Preeti (21DM242) - Absent On4th&11th July.

Uploaded by

Copyright:

Available Formats

You might also like

chi-SquaredTest - Vishal (21DM217) - Vatsal (21DM216) - Preeti (21DM242) - Absent On4th&11th July.

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

chi-SquaredTest - Vishal (21DM217) - Vatsal (21DM216) - Preeti (21DM242) - Absent On4th&11th July.

Uploaded by

Copyright:

Available Formats

Machine Learning Group Assignment

Submitted by: Submitted to:

What is it? Types

A Pearson’s chi-square test is a statistical test

When to use it? How to perform a chi-square test

1. Create a table of the observed and expected

What is a Chi-Square Statistic? Comparing the critical value

You could take your calculated chi-square

H0: Equal distribution of across the 12 signs

Step 1: Make a table with columns for “Categories,”

H0: Equal distribution of across the 12 signs

critical value =4.575

It quantifies how much information there is in a random

Entropy = -(p(0) * log(P(0)) + p(1) * log(P(1)))

1. At the first step, when the mother gets pregnant, gender of

2. However, at the time of first ultrasound test in the first trimester,

3. Now after the ultrasound in third trimester of pregnancy, there

1. The result here depends on the outlook or weather forecast.

So the gain here while predicting PlayGolf in presence of Outlook

“entropy when trying to predict PlayGolf alone”

We make the feature selection from the sklearn

The estimator is nothing but the tree-based algorithms.

Read the dataset Result

2. This figure shows an ideal RFECV curve, the curve

3. The shaded area represents the variability of cross-

You might also like