Tamur Khan: Import As From Import Import As
In [1]: import pandas as pd
In [2]: df = pd.read_csv("recipe_1608275.csv")
In [3]: df
Out[3]:
meat carbs veg fruit spice outcome
[10 rows x 6 columns]
Out[6]: (6, 6)
Out[7]: (4, 6)
Meat Attribute
Take the fraction of each value in the attribute against the total of all values. Multiplying each fraction by the
entropy of that value, and summing the terms, gives the weighted entropy of the whole attribute. For example,
pork accounts for 4 of the 10 values in the attribute, so multiply 4/10 by the entropy of pork. Do the same for
the remaining two values and add the three terms together.
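The per-value entropy used in this weighting can be sketched as a small helper. As a sanity check, the outcome column itself has 6 "yumyum" and 4 "yukyuk" rows, which reproduces the 0.970 parent entropy used in the information-gain calculations below (the function name is illustrative, not from the original notebook):

```python
from math import log2

def entropy(counts):
    """Shannon entropy (bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Parent entropy of the outcome column: 6 yumyum vs 4 yukyuk
parent = entropy([6, 4])
print(round(parent, 3))  # 0.971
```

A pure class (e.g. all "yumyum") gives entropy 0, and a 50/50 split gives 1.0, which is why the weighted sums below only pick up terms from the mixed values.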
Carbs Attribute
Veg Attribute
Fruit Attribute
In [18]: df[['fruit','outcome']]
Out[18]:
fruit outcome
0 orange yumyum
1 plum yukyuk
2 fig yukyuk
3 pineapple yumyum
4 peach yukyuk
5 banana yukyuk
6 cherry yumyum
7 apple yumyum
8 cherry yumyum
9 plum yumyum
In [20]: FruitEntropy = (1/10 * 0 + 2/10 * 1.0 + 1/10 * 0 + 1/10 * 0 + 1/10 * 0 + 1/10 * 0
    ...:                 + 2/10 * 0 + 1/10 * 0)
    ...: infoGainFruit = 0.970 - FruitEntropy
    ...: print('Entropy for the fruit attribute is', FruitEntropy)
    ...: print('Information gain for the fruit attribute is', infoGainFruit)
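The same calculation can be sketched directly from the two columns shown in Out[18], grouping on each fruit value and weighting by how many rows it covers (helper names are illustrative, not from the original notebook):

```python
import pandas as pd
from math import log2

def entropy(counts):
    """Shannon entropy (bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# The fruit/outcome pairs from Out[18]
fruit_df = pd.DataFrame({
    "fruit": ["orange", "plum", "fig", "pineapple", "peach",
              "banana", "cherry", "apple", "cherry", "plum"],
    "outcome": ["yumyum", "yukyuk", "yukyuk", "yumyum", "yukyuk",
                "yukyuk", "yumyum", "yumyum", "yumyum", "yumyum"],
})

n = len(fruit_df)
parent = entropy(fruit_df["outcome"].value_counts().tolist())

# Weight each fruit value's entropy by the fraction of rows it covers;
# only plum (one yumyum, one yukyuk) contributes a non-zero term
weighted = sum(
    len(grp) / n * entropy(grp["outcome"].value_counts().tolist())
    for _, grp in fruit_df.groupby("fruit")
)
print(round(weighted, 3), round(parent - weighted, 3))  # 0.2 0.771
```

With the unrounded parent entropy the gain comes out as 0.771 rather than the 0.770 obtained from the rounded 0.970 above.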
Spice Attribute
In [21]: df[['spice','outcome']]
Out[21]:
spice outcome
0 lavender yumyum
1 rosemary yukyuk
2 fenugreek yukyuk
3 jalapeno yumyum
4 capsicum yukyuk
5 fennel yukyuk
6 rosemary yumyum
7 capsicum yumyum
8 jalapeno yumyum
9 dill yumyum
In [23]: SpiceEntropy = (1/10 * 0 + 2/10 * 1.0 + 1/10 * 0 + 2/10 * 0 + 2/10 * 1.0 + 1/10 * 0
    ...:                 + 1/10 * 0)
    ...: infoGainSpice = 0.970 - SpiceEntropy
    ...: print(SpiceEntropy, 'is the entropy for the spice attribute')
    ...: print(infoGainSpice, 'is the information gain for the spice attribute')
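As with fruit, a generic information-gain helper applied to the spice column from Out[21] reproduces these numbers (names are illustrative, not from the original notebook):

```python
import pandas as pd
from math import log2

def entropy(counts):
    """Shannon entropy (bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(df, attr, target="outcome"):
    """Parent entropy minus the weighted entropy after splitting on attr."""
    parent = entropy(df[target].value_counts().tolist())
    weighted = sum(
        len(grp) / len(df) * entropy(grp[target].value_counts().tolist())
        for _, grp in df.groupby(attr)
    )
    return weighted, parent - weighted

# The spice/outcome pairs from Out[21]
spice_df = pd.DataFrame({
    "spice": ["lavender", "rosemary", "fenugreek", "jalapeno", "capsicum",
              "fennel", "rosemary", "capsicum", "jalapeno", "dill"],
    "outcome": ["yumyum", "yukyuk", "yukyuk", "yumyum", "yukyuk",
                "yukyuk", "yumyum", "yumyum", "yumyum", "yumyum"],
})

# rosemary and capsicum are each a 50/50 split, so each adds 2/10 * 1.0
weighted, gain = info_gain(spice_df, "spice")
print(round(weighted, 3), round(gain, 3))  # 0.4 0.571
```

Again the unrounded parent entropy gives a gain of 0.571 rather than the 0.570 from the rounded 0.970.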
The attribute with the lowest entropy and highest information gain is Fruit.
The best attribute to use as the root node of the decision tree would normally be the one with the lowest
weighted entropy, which in turn has the highest information gain. From the results we can see that the fruit
attribute has the lowest entropy and therefore the highest information gain. However, it would not be suitable
as the root node, because the dataset is small and the fruit attribute contains 8 distinct values across the 10
rows of the dataset. Using fruit as the root node would mean splitting on 8 values, producing a complex
decision tree that mostly memorises individual rows. If the dataset were larger we could get a more reliable
entropy estimate for the fruit attribute. Therefore the best attribute to use as the root node of the decision
tree is the meat attribute. Splitting meat into three subsets of the values it contains (pork, sausage and duck),
we can examine further attributes to see whether the outcome is "yumyum" or "yukyuk". For sausage the
outcome is always "yumyum" (entropy 0), so the ingredients in the other attributes do not matter.
The next meat value with the lowest entropy is pork, but the outcome for pork is not uniform, so another
attribute is needed to decide it. For the pork subset we look at the remaining attribute with the highest
information gain, which is Veg: if the row contains cabbage the outcome is "yumyum", and if it contains
artichoke the outcome is "yukyuk". Similarly, for the duck value the only attribute left to decide on is carbs,
which has three different values determining the outcome: dumplings = yumyum, turnip = yumyum,
pasta = yukyuk.
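The tree described above can be written out directly as a sketch, a hand-built classifier following the splits chosen here rather than anything from the original notebook:

```python
def predict(meat, veg=None, carbs=None):
    """Decision tree: root on meat, then veg for pork, carbs for duck."""
    if meat == "sausage":
        return "yumyum"          # sausage is always yumyum (entropy 0)
    if meat == "pork":
        # pork splits on the veg attribute: cabbage -> yumyum, artichoke -> yukyuk
        return "yumyum" if veg == "cabbage" else "yukyuk"
    if meat == "duck":
        # duck splits on the carbs attribute: pasta -> yukyuk, otherwise yumyum
        return "yukyuk" if carbs == "pasta" else "yumyum"
    return None  # meat value not seen in the dataset

print(predict("sausage"))                  # yumyum
print(predict("pork", veg="artichoke"))    # yukyuk
print(predict("duck", carbs="dumplings"))  # yumyum
```

Each branch terminates in a pure leaf on the 10 training rows, which is exactly why no further attributes are needed below veg and carbs.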