Daily Task 9 - Statistical Tests - Jupyter Notebook

7/20/22, 11:36 AM Daily task 9 - Statistical Tests - Jupyter Notebook
I. T Test
1 Sample
2 Sample
Paired T Test
1 Sample proportion test
2 Sample proportion test
Anova Test
1. One Sample T test
In this example, suppose we sample the time that mobile users spend on their phones and check whether the average time differs from 120 minutes.
Our null hypothesis is: the mean time is 120 minutes
Our alternative hypothesis is: the mean time is not 120 minutes
Note that the test in In [5] uses popmean = 122, the rounded mean of the full list, as its reference value.
In [1]:
# Time spent by mobile users, in minutes

minutes = [180,122,100,120,60,90,120,110,140,175,120,130,120,225,125,30,200,158,120,135,40,
len(minutes)  # Population size
Out[1]:
30
In [2]:
import numpy as np
np.mean(minutes) # Population mean
Out[2]:
122.06666666666666
In [3]:
random_selection_from_pop = np.random.choice(a = minutes, size=5)

random_selection_from_pop
Out[3]:
array([120, 146, 90, 225, 120])
In [4]:
np.mean(random_selection_from_pop) # Sample mean
Out[4]:
140.2
localhost:8888/notebooks/Python by John/Daily Tasks/Daily task 9 - Statistical Tests.ipynb 1/12

In [5]:
from scipy import stats

t_stat, pval = stats.ttest_1samp(a=random_selection_from_pop, popmean=122)
In [6]:
print(t_stat)
print(pval)
0.7920233342827273
0.47266930756565606
In [7]:
# Level of significance: 10%, i.e. at the 10% level of significance, do we reject or not reject?
if pval < 0.1:
    print('We can reject the Null Hypothesis and we can claim that there is a significant difference in the population mean and sample mean')
else:
    print('We do not reject the Null Hypothesis and we can claim that there is no significant difference in the population mean and sample mean')
We do not reject the Null Hypothesis and we can claim that there is no significant difference in the population mean and sample mean
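To see what `stats.ttest_1samp` computes, the t statistic can be reproduced by hand. The following is a minimal sketch using only the standard library. It reuses the five values shown in Out[3] and the reference mean 122 from In [5]; the variable names are illustrative and not part of the notebook.

```python
import math
import statistics

# Sample shown in Out[3] and the reference mean used in In [5]
sample = [120, 146, 90, 225, 120]
popmean = 122

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)      # sample standard deviation (n - 1 denominator)
standard_error = sample_sd / math.sqrt(n)

# t = (sample mean - hypothesised mean) / standard error, with n - 1 degrees of freedom
t_stat = (sample_mean - popmean) / standard_error
dof = n - 1

print(t_stat, dof)
```

The printed statistic should agree with the first value printed in In [6]. Turning it into a p-value needs the t distribution with `dof` degrees of freedom, which is the part SciPy handles.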
2. Two Sample T test
In this example, suppose we sample mango trees grown with two different fertilizers and compare the number of mangos per tree.
Our null hypothesis is: the mean number of mangos is the same with organic and chemical fertilizer
Our alternative hypothesis is: the mean number of mangos differs between organic and chemical fertilizer
In [8]:
import numpy as np
Organic_mangos = [50,55,45,30,20,25,40,42,36,48] # No. of mangos with organic fertilizer f

Chemical_mangos = [80,60,75,83,85,75,65,100,90,95] # No. of mangos with chemical fertilizer
In [9]:
np.mean(Organic_mangos)
Out[9]:
39.1
In [10]:
np.mean(Chemical_mangos)
Out[10]:
80.8

In [11]:
_,pval = stats.ttest_ind(a = Organic_mangos,b = Chemical_mangos)

pval
Out[11]:
3.5677940572369257e-07
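In [12]:
# Level of significance: 5%, i.e. at the 5% level of significance, do we reject or not reject?
if pval < 0.05:
    print('We reject the Null Hypothesis and we can claim that there is a significant difference in the average nos of mangos with chemical fertilizer and organic fertilizer for production')
else:
    print('We do not reject the Null Hypothesis and we can claim that there is no significant difference in the average nos of mangos with chemical fertilizer and organic fertilizer for production')
We reject the Null Hypothesis and we can claim that there is a significant difference in the average nos of mangos with chemical fertilizer and organic fertilizer for production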
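By default `stats.ttest_ind` assumes equal variances in both groups and uses a pooled variance estimate. As a rough sketch of that calculation (standard library only, variable names are illustrative), the statistic for the mango data can be computed as follows:

```python
import math
import statistics

organic = [50, 55, 45, 30, 20, 25, 40, 42, 36, 48]
chemical = [80, 60, 75, 83, 85, 75, 65, 100, 90, 95]

n1, n2 = len(organic), len(chemical)
var1 = statistics.variance(organic)    # sample variances (n - 1 denominator)
var2 = statistics.variance(chemical)

# Pooled variance: the two sample variances weighted by their degrees of freedom
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Difference of means in the same order as the call (organic minus chemical)
t_stat = (statistics.mean(organic) - statistics.mean(chemical)) / standard_error
dof = n1 + n2 - 2

print(round(t_stat, 3), dof)
```

A large negative statistic on 18 degrees of freedom is consistent with the very small p-value in Out[11]. If equal variances are doubtful, passing `equal_var=False` to `stats.ttest_ind` runs Welch's version of the test instead.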
3. Paired T Test
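Here we compare how long people can work with focus, in minutes, before and after a meditation program. The same people are measured both times, so the two samples are paired.
Our null hypothesis is: the mean focused working time is the same before and after the meditation program
Our alternative hypothesis is: the mean focused working time differs before and after the meditation program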
In [13]:
pre_meditation_program = [480,420,400,360,300,320,340,500,380,420]
post_meditation_program = [540,600,720,650,700,630,560,670,800,740]
In [14]:
np.mean(pre_meditation_program)
Out[14]:
392.0
In [15]:
np.mean(post_meditation_program)
Out[15]:
661.0

In [16]:
_,pval = stats.ttest_rel(a = pre_meditation_program, b = post_meditation_program)

pval
Out[16]:
3.137229239240941e-05
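In [17]:
if pval < 0.1:
    print('We reject the Null Hypothesis and we can claim that there is a significant difference in the average working hrs with & without meditation')
else:
    print('We do not reject the Null Hypothesis and we can claim that there is no significant difference in the average working hrs with & without meditation')
We reject the Null Hypothesis and we can claim that there is a significant difference in the average working hrs with & without meditation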
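A paired t test is a one-sample t test on the per-person differences, tested against a mean difference of zero. The sketch below illustrates this with the standard library only; the short variable names are illustrative.

```python
import math
import statistics

pre = [480, 420, 400, 360, 300, 320, 340, 500, 380, 420]
post = [540, 600, 720, 650, 700, 630, 560, 670, 800, 740]

# Per-person differences, in the same order as the call (a minus b, i.e. pre minus post)
diffs = [a - b for a, b in zip(pre, post)]
n = len(diffs)

mean_diff = statistics.mean(diffs)
standard_error = statistics.stdev(diffs) / math.sqrt(n)

# One-sample t statistic on the differences, with n - 1 degrees of freedom
t_stat = mean_diff / standard_error
dof = n - 1

print(mean_diff, round(t_stat, 3), dof)
```

The mean difference equals the gap between the two means in Out[14] and Out[15], with a negative sign because the values after the program are larger.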
4. One Sample proportion test
Example:
Null hypothesis is: 80% of the tests pass
Alternative hypothesis is: more than 80% of the tests pass
We sampled 500 tests, and found 410 passed
In [18]:
from statsmodels.stats.proportion import proportions_ztest
In [19]:
sample_success = 410
sample_size = 500
null_hypothesis = 0.80
In [20]:
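# One-sided test, matching the alternative hypothesis "more than 80% of the tests pass"
stat, p_value = proportions_ztest(count=sample_success, nobs=sample_size, value=null_hypothesis, alternative='larger')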

p_value
Out[20]:
0.12220177493249235

In [21]:
significance = 0.1
if p_value < significance:
    print("Reject the null hypothesis - suggest the alternative hypothesis is true")
else:
    print("Fail to reject the null hypothesis - we have nothing else to say")
Fail to reject the null hypothesis - we have nothing else to say
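The p-value in Out[20] can be checked by hand. This sketch assumes a one-sided alternative (more than 80% pass) and a standard error based on the sample proportion, which is my understanding of the `proportions_ztest` default; both are assumptions rather than something the notebook states. Only the standard library is used.

```python
import math
from statistics import NormalDist

successes, n, p0 = 410, 500, 0.80
p_hat = successes / n                      # observed pass rate

# Standard error computed from the sample proportion (assumed default behaviour)
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
z = (p_hat - p0) / standard_error

# One-sided p-value for Ha: true proportion > p0
p_value = 1 - NormalDist().cdf(z)

print(round(z, 4), round(p_value, 4))
```

Under these assumptions the result should match Out[20], and it lies above the 10% significance level used in In [21].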
5. One Sample proportion test

H0 : 40% of the customers are smokers
Ha : More than 40% of the customers are smokers
We use the seaborn tips dataset as the sample
In [22]:
import seaborn as sns

from seaborn import load_dataset
In [23]:
sns.get_dataset_names()
Out[23]:
['anagrams',
'anscombe',
'attention',
'brain_networks',
'car_crashes',
'diamonds',
'dots',
'exercise',
'flights',
'fmri',
'gammas',
'geyser',
'iris',
'mpg',
'penguins',
'planets',
'taxis',
'tips',
'titanic']

In [24]:
tips_data = load_dataset('tips')
tips_data
Out[24]:
     total_bill   tip     sex smoker   day    time  size
0         16.99  1.01  Female     No   Sun  Dinner     2
1         10.34  1.66    Male     No   Sun  Dinner     3
2         21.01  3.50    Male     No   Sun  Dinner     3
3         23.68  3.31    Male     No   Sun  Dinner     2
4         24.59  3.61  Female     No   Sun  Dinner     4
...         ...   ...     ...    ...   ...     ...   ...
239       29.03  5.92    Male     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2
244 rows × 7 columns
In [25]:
tips_data['smoker'].value_counts()
Out[25]:
No 151
Yes 93
Name: smoker, dtype: int64
In [26]:
per_yes = round(93/244*100,2)
print("per_yes is",per_yes,'%')
per_yes is 38.11 %
In [27]:
per_no = round(151/244*100,2)
print("per_no is",per_no,'%')
per_no is 61.89 %

In [28]:
stat, p_value = proportions_ztest(count=93, nobs=244, value=0.4, alternative='larger')

p_value
Out[28]:
0.7278585473640354
In [29]:
# Level of significance: 5%, i.e. at the 5% level of significance, do we reject or not reject?
significance = 0.05
if p_value < significance:
    print("Reject the null hypothesis & consider the alternative hypothesis is true")
else:
    print("Fail to reject the null hypothesis")
Fail to reject the null hypothesis
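The `alternative` argument in In [28] decides which tail of the normal distribution is counted. The sketch below is a hypothetical helper, not the statsmodels implementation. It shows how the p-value for the smoker counts changes across the three options, assuming a standard error based on the sample proportion.

```python
import math
from statistics import NormalDist


def one_sample_prop_ztest(count, nobs, value, alternative="two-sided"):
    """Hypothetical helper: z-test for a single proportion."""
    p_hat = count / nobs
    z = (p_hat - value) / math.sqrt(p_hat * (1 - p_hat) / nobs)
    cdf = NormalDist().cdf
    if alternative == "larger":        # Ha: proportion > value
        p = 1 - cdf(z)
    elif alternative == "smaller":     # Ha: proportion < value
        p = cdf(z)
    elif alternative == "two-sided":   # Ha: proportion != value
        p = 2 * (1 - cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'larger', 'smaller' or 'two-sided'")
    return z, p


# Smoker counts from Out[25]: 93 smokers out of 244 customers
results = {
    alt: one_sample_prop_ztest(93, 244, 0.4, alt)[1]
    for alt in ("larger", "smaller", "two-sided")
}
for alt, p in results.items():
    print(alt, round(p, 4))
```

Because the observed share of smokers (about 38%) is below 40%, the 'larger' alternative gives the biggest of the three p-values.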
6. Two Sample proportion test

H0 : The proportion of depressed people is the same with the old therapy and the new therapy
Ha : The proportion of depressed people differs between the old therapy and the new therapy
In [30]:
import pandas as pd
import numpy as np
In [31]:
depression_data = pd.read_csv("Depression_status.csv")
depression_data.head(10)
Out[31]:
Old Therapy New Therapy
0 Yes No
1 Yes No
2 Yes No
3 Yes No
4 Yes No
5 Yes No
6 Yes No
7 Yes No
8 No No
9 Yes No

In [32]:
depression_data.columns = [column.replace(" ","_") for column in depression_data.columns]

depression_data.head(10)
Out[32]:
Old_Therapy New_Therapy
0 Yes No
1 Yes No
2 Yes No
3 Yes No
4 Yes No
5 Yes No
6 Yes No
7 Yes No
8 No No
9 Yes No
In [33]:
depression_data.value_counts()
Out[33]:
Old_Therapy New_Therapy
Yes No 37
No No 7
Yes Yes 4
No Yes 2
dtype: int64
In [34]:
per_old_depression = 41/50*100
print("Old Therapy Depression level is",per_old_depression,"%")
Old Therapy Depression level is 82.0 %
In [35]:
per_new_depression = 6/50*100
print("New Therapy Depression level is",per_new_depression,"%")
New Therapy Depression level is 12.0 %
In [36]:
depressed_people_a, sample_size_a = (41, 50)

depressed_people_b, sample_size_b = (6, 50)

In [37]:
depressed_people = np.array([depressed_people_a, depressed_people_b])

samples = np.array([sample_size_a, sample_size_b])
In [38]:
stat, p_value = proportions_ztest(count=depressed_people, nobs=samples, alternative='two-sided')

p_value
Out[38]:
2.3387247876563156e-12
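In [39]:
# Level of significance: 5%, i.e. at the 5% level of significance, do we reject or not reject?
if p_value < 0.05:
    print('We reject the Null Hypothesis and we can claim that there is a significant difference in the depression level with old therapy & new therapy, So people can feel less depressed with new therapy')
else:
    print('We do not reject the Null Hypothesis and we can claim that there is no significant difference in the depression level with old therapy & new therapy')
We reject the Null Hypothesis and we can claim that there is a significant difference in the depression level with old therapy & new therapy, So people can feel less depressed with new therapy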
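The counts used in In [36] follow from Out[33]: 37 + 4 = 41 "Yes" answers under the old therapy and 4 + 2 = 6 under the new one. A two-sample proportion z-test can then be sketched by hand as below. This assumes the pooled-proportion standard error, which I believe is what `proportions_ztest` uses for two samples; the variable names are illustrative.

```python
import math

depressed_old, n_old = 41, 50
depressed_new, n_new = 6, 50

p_old = depressed_old / n_old
p_new = depressed_new / n_new

# Pooled proportion under H0 (both therapies share one depression rate)
pooled = (depressed_old + depressed_new) / (n_old + n_new)
standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
z = (p_old - p_new) / standard_error

# Two-sided normal p-value; erfc keeps precision for very small tail probabilities
p_value = math.erfc(abs(z) / math.sqrt(2))

print(round(z, 3), p_value)
```

A z statistic of about 7 is far out in the tail of the normal distribution, which is why the p-value is on the order of 1e-12, as in Out[38].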
7. ANOVA (Analysis of Variance) Test
Example: check whether the means of three independent data samples differ significantly.
Hypothesis Formulation
H0: the means of all three samples are equal.
H1: at least one sample mean differs from the others.
In [40]:
from scipy.stats import f_oneway
In [41]:
data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204]
In [42]:
stat, p_value = f_oneway(data1, data2, data3)

p_value
Out[42]:
0.9083957433926546

In [43]:
# Level of significance: 5%, i.e. at the 5% level of significance, do we reject or not reject?
if p_value > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
Probably the same distribution
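The F statistic behind `f_oneway` compares the variation between group means with the variation within groups. The following is a minimal standard-library sketch of that calculation for the three samples above. The closed-form p-value at the end is a shortcut that only holds when the numerator has exactly 2 degrees of freedom (three groups); it is not a general formula.

```python
import statistics

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204]

groups = [data1, data2, data3]
k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = statistics.fmean([x for g in groups for x in g])

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)

df_between = k - 1
df_within = n_total - k
f_stat = (ss_between / df_between) / (ss_within / df_within)

# Survival function of F(2, df_within); valid only because df_between == 2
p_value = (1 + 2 * f_stat / df_within) ** (-df_within / 2)

print(round(f_stat, 4), round(p_value, 4))
```

An F statistic well below 1 means the group means vary less than would be expected from the spread inside the groups, which matches the large p-value in Out[42].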
II. Chi-squared Test

H0 : There is no association between the sex and the species of a penguin.
Ha : There is an association between the sex and the species of a penguin.
In [44]:
import seaborn as sns

from seaborn import load_dataset
In [45]:
sns.get_dataset_names()
Out[45]:
['anagrams',
'anscombe',
'attention',
'brain_networks',
'car_crashes',
'diamonds',
'dots',
'exercise',
'flights',
'fmri',
'gammas',
'geyser',
'iris',
'mpg',
'penguins',
'planets',
'taxis',
'tips',
'titanic']

In [46]:
penguins_data = load_dataset('penguins')
penguins_data
Out[46]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g s
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 M
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Fem
3 Adelie Torgersen NaN NaN NaN NaN N
... ... ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN NaN N
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Fem
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 M
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Fem
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 M
344 rows × 7 columns
In [47]:
import pandas as pd
observed_table = pd.crosstab(index = penguins_data['sex'], columns = penguins_data['species'])
observed_table
Out[47]:
species Adelie Chinstrap Gentoo
sex
Female 73 34 58
Male 73 34 61

In [48]:

chi2_score,pval,dof,expected_table = stats.chi2_contingency(observed = observed_table)
print('**************************************************************')
print('Chi-squared value : ',round(chi2_score,5))
print('P-val : ',round(pval,5))
print('Degree of Freedom : ',dof)
print('Expected Table :\n',expected_table)
**************************************************************
Chi-squared value : 0.04861
P-val : 0.97599
Degree of Freedom : 2
Expected Table :
[[72.34234234 33.69369369 58.96396396]
[73.65765766 34.30630631 60.03603604]]
In [49]:
if pval < 0.1:
    print('We can reject the Null Hypothesis and we can claim that there is an association between sex and species.')
else:
    print('We do not reject the Null Hypothesis and we can claim that there is no association between sex and species.')
We do not reject the Null Hypothesis and we can claim that there is no association between sex and species.
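The expected table and the chi-squared value printed in In [48] can be rebuilt from the observed counts. The sketch below uses only the standard library and illustrative variable names. The p-value shortcut at the end applies only to exactly 2 degrees of freedom, where the chi-squared survival function reduces to exp(-x/2).

```python
import math

# Observed counts from Out[47]: rows are Female, Male; columns are Adelie, Chinstrap, Gentoo
observed = [[73, 34, 58],
            [73, 34, 61]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# Survival function of the chi-squared distribution, valid only for dof == 2
p_value = math.exp(-chi2 / 2)

print(round(chi2, 5), round(p_value, 5), dof)
```

The observed and expected counts are nearly identical, so the statistic is close to zero and the p-value is close to 1.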
THE END!!
