Daily Task 9 - Statistical Tests - Jupyter Notebook

7/20/22, 11:36 AM Daily task 9 - Statistical Tests - Jupyter Notebook

I. T Test
1 Sample
2 Sample
Paired T Test
1 Sample proportion test
2 Sample proportion test
Anova Test

1. One Sample T test

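In this example, we sample the time (in minutes) that mobile users spend on their phones to check whether the average time differs from 120 minutes.
Our null hypothesis is: the mean time is exactly 120 minutes
Our alternative hypothesis is: the mean time is not 120 minutes (a two-sided test)
Note that the call to stats.ttest_1samp in cell In [5] passes popmean = 122, the rounded population mean from Out[2], so the test as run compares the sample mean against 122 rather than 120.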

In [1]:

#Time spending by mobile users in minutes


minutes = [180,122,100,120,60,90,120,110,140,175,120,130,120,225,125,30,200,158,120,135,40,
len(minutes) # Population size

Out[1]:

30

In [2]:

import numpy as np
np.mean(minutes) # Population mean

Out[2]:

122.06666666666666

In [3]:

random_selection_from_pop = np.random.choice(a = minutes, size=5)


random_selection_from_pop

Out[3]:

array([120, 146, 90, 225, 120])

In [4]:

np.mean(random_selection_from_pop) # Sample mean

Out[4]:

140.2

localhost:8888/notebooks/Python by John/Daily Tasks/Daily task 9 - Statistical Tests.ipynb 1/12



In [5]:

from scipy import stats

# ttest_1samp returns the t statistic and the two-sided p-value
t_stat, pval = stats.ttest_1samp(a = random_selection_from_pop, popmean = 122)

In [6]:

print(t_stat)
print(pval)

0.7920233342827273

0.47266930756565606
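
As a cross-check, the t statistic printed above can be reproduced by hand from t = (sample mean - popmean) / (s / sqrt(n)), where s is the sample standard deviation with n - 1 in the denominator. The following is a minimal sketch using only the standard library and the five sampled values shown in Out[3]; the helper name `one_sample_t` is hypothetical and not part of SciPy.

```python
import math
import statistics

def one_sample_t(sample, popmean):
    """Return (t statistic, degrees of freedom) for a one-sample t test."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    sample_sd = statistics.stdev(sample)       # n - 1 in the denominator
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - popmean) / standard_error, n - 1

# Values copied from Out[3]; 122 is the popmean passed to stats.ttest_1samp.
sample = [120, 146, 90, 225, 120]
t_stat, dof = one_sample_t(sample, popmean=122)
print(t_stat, dof)
```

The p-value reported by stats.ttest_1samp is then the two-sided tail probability of this statistic under a t distribution with `dof` degrees of freedom.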

In [7]:

# Level of significance: 10%, i.e., at the 10% level of significance, do we reject or not reject?
if pval < 0.1:
    print('We can reject the Null Hypothesis and we can claim that there is a significant d
else:
    print('We do not reject the Null Hypothesis and we can claim that there is no significant difference in the population mean and sample mean')

We do not reject the Null Hypothesis and we can claim that there is no significant difference in the population mean and sample mean

2. Two Sample T test

In this example, we compare the number of mangos produced by trees treated with two different fertilizers.
Our null hypothesis is: the mean number of mangos is the same with organic and chemical fertilizer
Our alternative hypothesis is: the mean number of mangos differs between organic and chemical fertilizer

In [8]:

import numpy as np
from scipy import stats

Organic_mangos = [50,55,45,30,20,25,40,42,36,48] # No. of mangos with organic fertilizer f


Chemical_mangos = [80,60,75,83,85,75,65,100,90,95] # No. of mangos with chemical fertilizer

In [9]:

np.mean(Organic_mangos)

Out[9]:

39.1

In [10]:

np.mean(Chemical_mangos)

Out[10]:

80.8


In [11]:

_,pval = stats.ttest_ind(a = Organic_mangos,b = Chemical_mangos)


pval

Out[11]:

3.5677940572369257e-07
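
To make the mechanics of the two-sample test concrete, here is a minimal sketch of the pooled-variance t statistic, which is what stats.ttest_ind computes under its default equal_var=True setting. It uses only the standard library; the helper name `pooled_two_sample_t` and the shortened variable names are hypothetical.

```python
import math
import statistics

def pooled_two_sample_t(a, b):
    """Return (t statistic, degrees of freedom) assuming equal variances."""
    n_a, n_b = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    dof = n_a + n_b - 2
    # Weighted average of the two sample variances
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error, dof

organic = [50, 55, 45, 30, 20, 25, 40, 42, 36, 48]
chemical = [80, 60, 75, 83, 85, 75, 65, 100, 90, 95]
t_stat, dof = pooled_two_sample_t(organic, chemical)
print(round(t_stat, 4), dof)
```

The statistic is large and negative because the organic mean (39.1) is far below the chemical mean (80.8) relative to the pooled spread, which is why the p-value above is so small.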

In [12]:

# Level of significance: 5%, i.e., at the 5% level of significance, do we reject or not reject?
if pval < 0.05:
    print('We reject the Null Hypothesis and we can claim that there is a significant difference in the average number of mangos with chemical fertilizer and organic fertilizer for production')
else:
    print('We do not reject the Null Hypothesis and we can claim that there is no significa

We reject the Null Hypothesis and we can claim that there is a significant difference in the average number of mangos with chemical fertilizer and organic fertilizer for production

3. Paired T Test

Here we compare how long (in minutes) the same people can work with focus before and after a meditation program.
Our null hypothesis is: the mean focused working time is the same before and after the program (the mean paired difference is zero)
Our alternative hypothesis is: the mean focused working time differs before and after the program

In [13]:

pre_meditation_program = [480,420,400,360,300,320,340,500,380,420]
post_meditation_program = [540,600,720,650,700,630,560,670,800,740]

In [14]:

np.mean(pre_meditation_program)

Out[14]:

392.0

In [15]:

np.mean(post_meditation_program)

Out[15]:

661.0


In [16]:

_,pval = stats.ttest_rel(a = pre_meditation_program, b = post_meditation_program)


pval

Out[16]:

3.137229239240941e-05
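
A paired t test is a one-sample t test on the per-person differences, tested against a mean difference of zero. The sketch below shows that reduction with the standard library only; the names `pre`, `post` and `differences` are shortened, hypothetical stand-ins for the notebook's variables.

```python
import math
import statistics

pre = [480, 420, 400, 360, 300, 320, 340, 500, 380, 420]
post = [540, 600, 720, 650, 700, 630, 560, 670, 800, 740]

# Same order as stats.ttest_rel(a=pre, b=post): a minus b
differences = [a - b for a, b in zip(pre, post)]
n = len(differences)
mean_diff = statistics.mean(differences)
standard_error = statistics.stdev(differences) / math.sqrt(n)

t_stat = mean_diff / standard_error
dof = n - 1
print(mean_diff, round(t_stat, 4), dof)
```

Because each person serves as their own control, only the spread of the differences enters the standard error, not the spread between people.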

In [17]:

# Level of significance: 10%, i.e., at the 10% level of significance, do we reject or not reject?
if pval < 0.1:
    print('We reject the Null Hypothesis and we can claim that there is a significant difference in the average working hrs with & without meditation')
else:
    print('We do not reject the Null Hypothesis and we can claim that there is no significa

We reject the Null Hypothesis and we can claim that there is a significant difference in the average working hrs with & without meditation

4. One Sample proportion test

Example:
Null hypothesis is: 80% of the tests pass
Alternative hypothesis is: more than 80% of the tests pass
We sampled 500 tests, and found 410 passed

In [18]:

from statsmodels.stats.proportion import proportions_ztest

In [19]:

sample_success = 410
sample_size = 500
null_hypothesis = 0.80

In [20]:

stat, p_value = proportions_ztest(count=sample_success, nobs=sample_size, value=null_hypoth


p_value

Out[20]:

0.12220177493249235
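
The p-value above can be reproduced by hand. The call in In [20] is cut off in this copy, so the sketch below assumes a one-sided ('larger') alternative and a standard error based on the sample proportion; under those assumptions it gives the same value as Out[20]. The helper name `one_sample_proportion_z` is hypothetical.

```python
import math

def one_sample_proportion_z(successes, n, p0):
    """Return (z statistic, one-sided p-value for H1: p > p0)."""
    p_hat = successes / n
    # Standard error from the sample proportion (assumed, see lead-in)
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    z = (p_hat - p0) / standard_error
    # Upper tail of the standard normal: P(Z >= z)
    p_larger = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_larger

z, p_value = one_sample_proportion_z(successes=410, n=500, p0=0.80)
print(z, p_value)
```

With 82% of the sampled tests passing, z is only about 1.16 standard errors above the hypothesized 80%, which is not enough to fall below a 10% significance level.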


In [21]:

# Level of significance: 10%, i.e., at the 10% level of significance, do we reject or not reject?
significance = 0.1
if p_value < significance:
    print("Reject the null hypothesis - suggest the alternative hypothesis is true")
else:
    print("Fail to reject the null hypothesis - we have nothing else to say")

Fail to reject the null hypothesis - we have nothing else to say
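
A small p-value is evidence against the null hypothesis, so the decision rule is to reject H0 when the p-value is below the significance level and to fail to reject it otherwise. A minimal sketch of that rule as a reusable helper (the name `decide` is hypothetical), applied to two p-values reported in this notebook:

```python
def decide(p_value, alpha):
    """Reject H0 only when the p-value is below the significance level."""
    if p_value < alpha:
        return "Reject the null hypothesis"
    return "Fail to reject the null hypothesis"

# One-sample proportion test (Out[20]) at the 10% level
proportion_decision = decide(0.12220177493249235, alpha=0.10)
# Paired t test (Out[16]) at the 10% level
paired_decision = decide(3.137229239240941e-05, alpha=0.10)

print(proportion_decision)
print(paired_decision)
```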

5. One Sample proportion test


H0 : 40% of the customers are smokers.
Ha : More than 40% of the customers are smokers.
We use the tips dataset from seaborn as the sample.

In [22]:

import seaborn as sns


from seaborn import load_dataset
from statsmodels.stats.proportion import proportions_ztest

In [23]:

sns.get_dataset_names()

Out[23]:

['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'exercise',
 'flights',
 'fmri',
 'gammas',
 'geyser',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'taxis',
 'tips',
 'titanic']


In [24]:

tips_data = load_dataset('tips')
tips_data

Out[24]:

     total_bill   tip     sex smoker   day    time  size
0         16.99  1.01  Female     No   Sun  Dinner     2
1         10.34  1.66    Male     No   Sun  Dinner     3
2         21.01  3.50    Male     No   Sun  Dinner     3
3         23.68  3.31    Male     No   Sun  Dinner     2
4         24.59  3.61  Female     No   Sun  Dinner     4
...         ...   ...     ...    ...   ...     ...   ...
239       29.03  5.92    Male     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2

244 rows × 7 columns

In [25]:

tips_data['smoker'].value_counts()

Out[25]:

No     151
Yes     93
Name: smoker, dtype: int64

In [26]:

per_yes = round(93/244*100,2)
print("per_yes is",per_yes,'%')

per_yes is 38.11 %

In [27]:

per_no = round(151/244*100,2)
print("per_no is",per_no,'%')

per_no is 61.89 %


In [28]:

stat, p_value = proportions_ztest(count=93, nobs=244, value=0.4, alternative='larger')


p_value

Out[28]:

0.7278585473640354
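
The direction of the alternative matters here: the sample proportion of smokers (93/244, about 38%) lies below the hypothesized 40%, so a 'larger' alternative yields a large p-value. The sketch below computes the p-value for each direction by hand, assuming the standard error is based on the sample proportion; the helper name `proportion_z_pvalues` is hypothetical.

```python
import math

def proportion_z_pvalues(successes, n, p0):
    """Return one-sample z-test p-values for each alternative direction."""
    p_hat = successes / n
    z = (p_hat - p0) / math.sqrt(p_hat * (1 - p_hat) / n)
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z >= z)
    return {
        "larger": upper_tail,
        "smaller": 1 - upper_tail,
        "two-sided": 2 * min(upper_tail, 1 - upper_tail),
    }

p_values = proportion_z_pvalues(successes=93, n=244, p0=0.40)
for alternative, p in p_values.items():
    print(alternative, round(p, 4))
```

The 'larger' entry agrees with Out[28]. None of the three directions gives a p-value below 0.05 for this sample.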

In [29]:

# Level of significance: 5%, i.e., at the 5% level of significance, do we reject or not reject?
significance = 0.05
if p_value < significance:
    print("Reject the null hypothesis & consider the alternative hypothesis is true")
else:
    print("Fail to reject the null hypothesis")

Fail to reject the null hypothesis

6. Two Sample proportion test


H0 : The proportion of depressed people is the same with the old therapy and the new therapy.
Ha : The proportion of depressed people differs between the old therapy and the new therapy.

In [30]:

import pandas as pd
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

In [31]:

depression_data = pd.read_csv("Depression_status.csv")
depression_data.head(10)

Out[31]:

  Old Therapy New Therapy
0         Yes          No
1         Yes          No
2         Yes          No
3         Yes          No
4         Yes          No
5         Yes          No
6         Yes          No
7         Yes          No
8          No          No
9         Yes          No


In [32]:

depression_data.columns = [column.replace(" ","_") for column in depression_data.columns]


depression_data.head(10)

Out[32]:

  Old_Therapy New_Therapy
0         Yes          No
1         Yes          No
2         Yes          No
3         Yes          No
4         Yes          No
5         Yes          No
6         Yes          No
7         Yes          No
8          No          No
9         Yes          No

In [33]:

depression_data.value_counts()

Out[33]:

Old_Therapy  New_Therapy
Yes          No             37
No           No              7
Yes          Yes             4
No           Yes             2
dtype: int64

In [34]:

per_old_depression = 41/50*100
print("Old Therapy Depression level is",per_old_depression,"%")

Old Therapy Depression level is 82.0 %

In [35]:

per_new_depression = 6/50*100
print("New Therapy Depression level is",per_new_depression,"%")

New Therapy Depression level is 12.0 %

In [36]:

depressed_people_a, sample_size_a = (41, 50)


depressed_people_b, sample_size_b = (6, 50)


In [37]:

depressed_people = np.array([depressed_people_a, depressed_people_b])


samples = np.array([sample_size_a, sample_size_b])

In [38]:

stat, p_value = proportions_ztest(count=depressed_people, nobs=samples, alternative='two-si


p_value

Out[38]:

2.3387247876563156e-12
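
The two-sample version pools both groups to estimate a common proportion under H0 and uses it in the standard error. A minimal standard-library sketch of that calculation, using the counts from In [36] (41 of 50 and 6 of 50) and assuming a two-sided alternative; the helper name `two_proportion_z` is hypothetical.

```python
import math

def two_proportion_z(count_a, n_a, count_b, n_b):
    """Return (z statistic, two-sided p-value) using a pooled proportion."""
    p_a, p_b = count_a / n_a, count_b / n_b
    pooled = (count_a + count_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # Two-sided tail of the standard normal: P(|Z| >= |z|)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

z, p_value = two_proportion_z(41, 50, 6, 50)
print(z, p_value)
```

A difference of 82% versus 12% is about seven standard errors wide, which explains the extremely small p-value.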

In [39]:

# Level of significance: 5%, i.e., at the 5% level of significance, do we reject or not reject?
if p_value < 0.05:
    print('We reject the Null Hypothesis and we can claim that there is a significant difference in the depression level with old therapy & new therapy, so people can feel less depressed with new therapy')
else:
    print('We do not reject the Null Hypothesis and we can claim that there is no significa

We reject the Null Hypothesis and we can claim that there is a significant difference in the depression level with old therapy & new therapy, so people can feel less depressed with new therapy

7. ANOVA (Analysis of Variance) Test

Example: test whether the means of three independent data samples differ significantly.

Hypothesis Formulation
H0: the means of all samples are equal.
H1: at least one sample mean differs from the others.

In [40]:

from scipy.stats import f_oneway

In [41]:

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204]

In [42]:

stat, p_value = f_oneway(data1, data2, data3)


p_value

Out[42]:

0.9083957433926546
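
One-way ANOVA compares the variation between group means with the variation inside the groups: F = (SS_between / (k - 1)) / (SS_within / (N - k)). The following standard-library sketch computes that ratio for the three samples above; the helper name `one_way_anova_f` is hypothetical and only illustrates what f_oneway reports as its statistic.

```python
import statistics

def one_way_anova_f(*groups):
    """Return (F statistic, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [statistics.mean(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204]

f_stat, df_between, df_within = one_way_anova_f(data1, data2, data3)
print(round(f_stat, 3), df_between, df_within)
```

An F statistic far below 1 means the group means differ less than the noise within the groups would suggest, which is consistent with the large p-value above.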


In [43]:

# Level of significance: 5%, i.e., at the 5% level of significance, do we reject or not reject?
if p_value > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')

Probably the same distribution

II. Chi-squared Test


H0 : There is no association between the sex and the species of a penguin.
Ha : There is an association between the sex and the species of a penguin.

In [44]:

import seaborn as sns


from seaborn import load_dataset

In [45]:

sns.get_dataset_names()

Out[45]:

['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'exercise',
 'flights',
 'fmri',
 'gammas',
 'geyser',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'taxis',
 'tips',
 'titanic']


In [46]:

penguins_data = load_dataset('penguins')
penguins_data

Out[46]:

    species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  s
0    Adelie  Torgersen            39.1           18.7              181.0       3750.0  M
1    Adelie  Torgersen            39.5           17.4              186.0       3800.0  Fem
2    Adelie  Torgersen            40.3           18.0              195.0       3250.0  Fem
3    Adelie  Torgersen             NaN            NaN                NaN          NaN  N
4    Adelie  Torgersen            36.7           19.3              193.0       3450.0  Fem
...     ...        ...             ...            ...                ...          ...
339  Gentoo     Biscoe             NaN            NaN                NaN          NaN  N
340  Gentoo     Biscoe            46.8           14.3              215.0       4850.0  Fem
341  Gentoo     Biscoe            50.4           15.7              222.0       5750.0  M
342  Gentoo     Biscoe            45.2           14.8              212.0       5200.0  Fem
343  Gentoo     Biscoe            49.9           16.1              213.0       5400.0  M

344 rows × 7 columns

In [47]:

import pandas as pd
observed_table = pd.crosstab(index = penguins_data['sex'], columns = penguins_data['species
observed_table

Out[47]:

species  Adelie  Chinstrap  Gentoo
sex
Female       73         34      58
Male         73         34      61


In [48]:

from scipy import stats


chi2_score,pval,dof,expected_table = stats.chi2_contingency(observed = observed_table)
print('**************************************************************')
print('Chi-squared value : ',round(chi2_score,5))
print('P-val : ',round(pval,5))
print('Degree of Freedom : ',dof)
print('Expected Table :\n',expected_table)

**************************************************************
Chi-squared value :  0.04861
P-val :  0.97599
Degree of Freedom :  2
Expected Table :
 [[72.34234234 33.69369369 58.96396396]
 [73.65765766 34.30630631 60.03603604]]
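
The expected table and the chi-squared value above follow directly from the observed counts: each expected cell is (row total × column total) / grand total, and the statistic sums (observed - expected)² / expected over all cells. The sketch below reproduces both with the standard library. The p-value line relies on a closed form that holds only for 2 degrees of freedom, P(X ≥ x) = exp(-x / 2); it is not a general chi-squared tail function.

```python
import math

# Observed counts from Out[47]: rows are Female, Male;
# columns are Adelie, Chinstrap, Gentoo.
observed = [[73, 34, 58],
            [73, 34, 61]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected counts under independence of sex and species
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

# Closed form valid only because dof == 2 here
p_value = math.exp(-chi2 / 2)
print(round(chi2, 5), round(p_value, 5), dof)
```

The observed counts sit very close to the expected ones in every cell, so the statistic is tiny and the p-value is close to 1.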

In [49]:

# Level of significance: 10%, i.e., at the 10% level of significance, do we reject or not reject?
if pval < 0.1:
    print('We can reject the Null Hypothesis and we can claim that there is an association b
else:
    print('We do not reject the Null Hypothesis and we can claim that there is no association between male penguins and female penguins.')

We do not reject the Null Hypothesis and we can claim that there is no association between male penguins and female penguins.

THE END!!
