Chi-Squared Test - Vishal (21DM217) - Vatsal (21DM216) - Preeti (21DM242) - Absent on 4th & 11th July.


Machine Learning Group Assignment

Chi-Square, RFE, IG

Submitted by:
Vishal Poduri (21DM217)
Vatsal Mittal (21DM216)
Preeti Mittal (21DM242)

Submitted to:
Dr. Mahak Sharma

Chi-Squared Test

What is it?

A Pearson’s chi-square test is a statistical test for categorical data. It is used to determine whether your data are significantly different from what you expected.

Types

The chi-square goodness of fit test is used to test whether the frequency distribution of a categorical variable is different from your expectations.
The chi-square test of independence is used to test whether two categorical variables are related to each other.
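
To make the test of independence concrete, here is a minimal pure-Python sketch. The 2x2 table is made up for illustration (it is not from the assignment), and the helper names `expected_counts` and `chi_square_independence` are hypothetical. Expected counts are computed as row total × column total / grand total, which is the usual convention.

```python
def expected_counts(table):
    """Expected cell counts under independence: row_total * col_total / grand_total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    expected = expected_counts(table)
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Made-up counts: rows are two groups, columns are two answer categories.
observed = [[20, 30],
            [30, 20]]
statistic, dof = chi_square_independence(observed)
print(f"chi-square = {statistic:.3f}, dof = {dof}")
```

The goodness of fit variant differs only in where the expected counts come from: they are specified by the hypothesis rather than derived from the table margins.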

When to use it?

A Pearson’s chi-square test may be an appropriate option for your data if all of the following are true:
1. You want to test a hypothesis about one or more categorical variables.
2. The sample was randomly selected from the population.
3. There is a minimum of five observations expected in each group or combination of groups.

How to perform a chi-square test

1. Create a table of the observed and expected frequencies.
2. Calculate the chi-square value from your observed and expected frequencies using the chi-square formula.
3. Find the critical chi-square value in a chi-square critical value table or using statistical software.
4. Compare the chi-square value to the critical value to determine which is larger.
5. Decide whether to reject the null hypothesis.
What is a Chi-Square Statistic?

$$\chi^2_c = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

The subscript “c” is the degrees of freedom. “O” is your observed value and “E” is your expected value.

A low value for chi-square means there is close agreement between your two sets of data (observed and expected). In theory, if your observed and expected values were equal (“no difference”), then chi-square would be zero, which is highly unlikely to happen in practice.

Comparing the critical value

You could take your calculated chi-square value and compare it to a critical value from a chi-square table. If the chi-square value is more than the critical value, then there is a significant difference.

You could also use a p-value. First state the null hypothesis and the alternate hypothesis. Then generate a chi-square curve for your results along with a p-value. A p-value below the significance level α indicates a significant difference.
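
As a small sketch of the statistic described above, the function below sums (O − E)² / E over all categories. The function name `chi_square_statistic` and the sample numbers are illustrative assumptions, not part of the assignment.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Identical observed and expected values give a statistic of zero.
no_difference = chi_square_statistic([10, 20, 30], [10, 20, 30])

# Illustrative numbers: the further observed drifts from expected, the larger the value.
some_difference = chi_square_statistic([12, 18, 30], [10, 20, 30])

print(no_difference, round(some_difference, 3))
```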

Example question: 256 visual artists were surveyed to find out their zodiac sign. The results were: Aries (29), Taurus
(24), Gemini (22), Cancer (19), Leo (21), Virgo (18), Libra (19), Scorpio (20), Sagittarius (23), Capricorn (18), Aquarius
(20), Pisces (23). Test the hypothesis that zodiac signs are evenly distributed across visual artists.

H0: Zodiac signs are equally distributed across the 12 signs.
H1: Zodiac signs are not equally distributed.
α = 0.05

Step 1: Make a table with columns for “Categories,” “Observed,” “Expected,” “Residual (Obs-Exp),” “(Obs-Exp)^2” and “Component (Obs-Exp)^2 / Exp.” Don’t worry what these mean right now; we’ll cover that in the following steps.
Step 2: Fill in your categories. Categories should be given to you in the question. There are 12 zodiac signs, so: Aries, Taurus, Gemini, Cancer, Leo, Virgo, Libra, Scorpio, Sagittarius, Capricorn, Aquarius, Pisces.

Step 3: Write your counts in column 2. Counts are the number of items in each category. You’re given the counts in the question: 29, 24, 22, 19, 21, 18, 19, 20, 23, 18, 20, 23.
Step 4: Calculate your expected value for column 3. In this question, we would expect the 12 zodiac signs to be evenly distributed for all 256 people, so 256/12=21.333. Write this in column 3.

Step 5: Subtract the expected value (Step 4) from the observed value (Step 3) and place the result in the “Residual” column. For example, the first row is Aries: 29-21.333=7.667.
Step 6: Square your results from Step 5 and place the amounts in the (Obs-Exp)^2 column.

Step 7: Divide the amounts in Step 6 by the expected value (Step 4) and place those results in the final column. Finally, add up (sum) all the values in the last column.
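
The table-building steps above can be sketched in a few lines of pure Python. The counts are the ones given in the example question; the variable names are illustrative.

```python
# Observed counts from the example question (column 2 of the table).
observed = {
    "Aries": 29, "Taurus": 24, "Gemini": 22, "Cancer": 19,
    "Leo": 21, "Virgo": 18, "Libra": 19, "Scorpio": 20,
    "Sagittarius": 23, "Capricorn": 18, "Aquarius": 20, "Pisces": 23,
}

total = sum(observed.values())      # number of artists surveyed
expected = total / len(observed)    # Step 4: even split across the 12 signs

rows = []
for sign, obs in observed.items():
    residual = obs - expected       # Step 5
    squared = residual ** 2         # Step 6
    component = squared / expected  # Step 7
    rows.append((sign, obs, expected, residual, squared, component))

chi_square = sum(row[-1] for row in rows)  # sum of the last column
dof = len(observed) - 1                    # degrees of freedom

for sign, obs, exp, res, sq, comp in rows:
    print(f"{sign:<12}{obs:>4}{exp:>9.3f}{res:>9.3f}{sq:>9.3f}{comp:>8.3f}")
print(f"chi-square = {chi_square:.3f}, dof = {dof}")
```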
This is the chi-square statistic: 5.094.

H0: Zodiac signs are equally distributed across the 12 signs.
H1: Zodiac signs are not equally distributed.
dof = 11, α = 0.05

critical value = 19.675

The chi-square value (5.094) is less than the critical value, so we fail to reject H0: there is no significant difference in the distribution of zodiac signs.
Information Gain

What is it?
It measures the reduction in entropy or surprise by splitting a
dataset according to a given value of a random variable.

Information
Information quantifies how surprising an event is, in bits. Lower probability events have more information; higher probability events have less information.

Entropy
Entropy quantifies how much information there is in a random variable, or more specifically its probability distribution.
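
A minimal sketch of these two definitions, assuming base-2 logarithms so that results are in bits. The function names `information` and `entropy` and the example probabilities are illustrative choices.

```python
import math


def information(p):
    """Information (surprise) of an event with probability p, in bits."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)


def entropy(probabilities):
    """Expected information of a distribution, in bits (zero-probability terms contribute 0)."""
    return sum(p * information(p) for p in probabilities if p > 0)


# A rarer event carries more information than a common one.
print(information(0.5))    # fair coin flip
print(information(0.125))  # one-in-eight event

# A balanced distribution is more uncertain than a skewed one.
print(entropy([0.5, 0.5]))
print(round(entropy([0.9, 0.1]), 3))
```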
Example
For example, in a binary classification problem (two classes), we can calculate the entropy of the data sample as follows:

Entropy = -(p(0) * log2(p(0)) + p(1) * log2(p(1)))

Example
Predicting the gender of an unborn baby.

1. At the first step, when the mother gets pregnant, the gender of the foetus can be male or female and the uncertainty is high: at best you are 50% certain about the prediction.

2. However, at the time of the first ultrasound test in the first trimester, the certainty of the prediction improves, say to 75%. The drop in uncertainty is the loss of entropy, and that is also your information or knowledge gain, because the loss in entropy has resulted in an equal gain of certainty.

3. After the ultrasound in the third trimester of pregnancy, there is almost no uncertainty and the prediction can be close to 100% certain.
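
The three stages above can be turned into numbers with the binary entropy formula. This is a sketch under two assumptions: "X% certain" is read as the probability of the predicted gender, and the final stage is treated as the limiting case of exactly 100%.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a two-outcome variable with probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


# Certainty at each stage (assumed reading of the percentages in the example).
stages = {
    "conception": 0.50,
    "first ultrasound": 0.75,
    "third-trimester ultrasound": 1.00,  # limiting case of "close to 100%"
}

entropies = {name: binary_entropy(p) for name, p in stages.items()}
gain_first = entropies["conception"] - entropies["first ultrasound"]
gain_total = entropies["conception"] - entropies["third-trimester ultrasound"]

for name, h in entropies.items():
    print(f"{name}: entropy = {h:.3f} bits")
print(f"gain after first ultrasound = {gain_first:.3f} bits")
print(f"gain after third-trimester ultrasound = {gain_total:.3f} bits")
```

Under these assumptions the entropy drops from 1 bit to about 0.811 bits after the first ultrasound, so the information gained at that stage is about 0.189 bits.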
Example
Predict whether there will be a golf game today or not.

The result here depends on the outlook or weather forecast.

So the gain here while predicting PlayGolf in the presence of Outlook will be the difference between “entropy when trying to predict PlayGolf alone” and “entropy when trying to predict PlayGolf with knowledge of Outlook” for the day.
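
The difference described above can be computed directly. The slides do not give the underlying data, so this sketch assumes the counts of the widely used 14-day play-golf toy dataset (9 "yes" and 5 "no" days overall); the function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Assumed toy data: (Outlook, number of "yes" days, number of "no" days).
counts = [("sunny", 2, 3), ("overcast", 4, 0), ("rainy", 3, 2)]
outlook, play_golf = [], []
for value, yes, no in counts:
    outlook += [value] * (yes + no)
    play_golf += ["yes"] * yes + ["no"] * no

h_alone = entropy(play_golf)
gain = information_gain(outlook, play_golf)
print(f"entropy of PlayGolf alone      = {h_alone:.3f} bits")
print(f"entropy given Outlook          = {h_alone - gain:.3f} bits")
print(f"information gain from Outlook  = {gain:.3f} bits")
```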
Recursive Feature Elimination Method

What is it?
Feature elimination in machine learning refers to choosing a subset of relevant features from the dataset to use in further model construction, arriving at the optimum number needed to assure peak performance.

RFE can be used to handle problems presented by the two models listed below:
Classification: Classification predicts the class of selected data points. Classes are also known as targets, labels, or categories. Classification predictive modeling involves approximating a mapping function (f) from input variables (X) to discrete output variables (y).
Regression: Regression models supply a function describing the relationship between one (or more) independent variables and a response, dependent, or target variable.

Steps
We make the feature selection with the sklearn SelectFromModel class.
We do the feature selection using RFE with cross-validation to prevent overfitting.
The estimator is a tree-based algorithm.
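
As a rough sketch of RFE with a tree-based estimator, the example below uses scikit-learn's `RFE` class with a `DecisionTreeClassifier`. The synthetic dataset, the choice of three features to keep, and the specific estimator are assumptions for illustration and are not taken from the assignment's own code.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumed for illustration): 10 features, only 3 informative.
X, y = make_classification(
    n_samples=300,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)

# RFE repeatedly fits the estimator and drops the least important feature
# (step=1) until the requested number of features remains.
estimator = DecisionTreeClassifier(random_state=0)
selector = RFE(estimator, n_features_to_select=3, step=1)
selector.fit(X, y)

selected = [i for i, keep in enumerate(selector.support_) if keep]
print("Selected feature indices:", selected)
print("Ranking (1 = kept):", selector.ranking_.tolist())
```

RFE works with any estimator that exposes feature importances or coefficients after fitting, which is why tree-based models are a common choice.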


Implementing RFE algorithm in Python

Data Preparation
To start with, we will import the following libraries.

Read the dataset

Real World decision-making scenario

You and your five friends are trying to decide whether to go out to eat or not. The following factors come up for consideration:
Who is hungry enough to eat a full meal
How people’s available funds are holding up
How late people can stay up
What kind of food people want
The location and types of local eateries
How late people want to stay out
Who has a car

Result
After spending way too much time debating, two features were selected:
Who is hungry?
The location and types of local eateries
So, we have eliminated many features and have reduced the amount of time needed to decide.
Example
1. To show how this works in practice, we’ll start with a contrived example using a dataset that has only 3 informative features out of 25.

2. This figure shows an ideal RFECV curve: the curve jumps to an excellent accuracy when the three informative features are captured, then gradually decreases in accuracy as the non-informative features are added into the model.

3. The shaded area represents the variability of cross-validation, one standard deviation above and below the mean accuracy score drawn by the curve.
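
A contrived dataset like the one described above can be approximated with scikit-learn's `make_classification`, and `RFECV` can then pick the number of features by cross-validation. This is only a sketch: the sample size, the decision tree estimator, and the 5-fold setup are assumptions, so the resulting curve and selected count need not match the figure from the slides.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Contrived data: 25 features, of which only 3 are informative.
X, y = make_classification(
    n_samples=500,
    n_features=25,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)

# RFECV runs recursive elimination inside each cross-validation fold and keeps
# the number of features with the best mean accuracy.
selector = RFECV(
    estimator=DecisionTreeClassifier(random_state=0),
    step=1,
    cv=StratifiedKFold(n_splits=5),
    scoring="accuracy",
)
selector.fit(X, y)

selected = [i for i, keep in enumerate(selector.support_) if keep]
print("Number of features chosen:", selector.n_features_)
print("Selected feature indices:", selected)
```

In recent scikit-learn versions the per-step mean and standard deviation of the cross-validation scores are available in `selector.cv_results_`, which is the information such a curve and its shaded band are drawn from.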
Thank you!!
