Tamur Khan: Import As From Import Import As
In [1]: import pandas as pd
In [2]: df = pd.read_csv("recipe_1608275.csv")
In [3]: df
Out[3]:
meat carbs veg fruit spice outcome
[10 rows x 6 columns]
Out[6]: (6, 6)
Out[7]: (4, 6)
Meat Attribute
Take the fraction of each value in the attribute against the total of all values. Multiplying each fraction by the
entropy of that value, and summing the terms, gives the weighted entropy of the whole attribute. For example,
pork accounts for 4 of the 10 values in the attribute, so multiply 4/10 by the entropy of pork. Do the same for
the remaining two values and add the three terms together.
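The per-value entropy used in this weighting can be sketched as a small helper. As a sanity check, the outcome column itself has 6 "yumyum" and 4 "yukyuk" rows, which reproduces the 0.970 parent entropy used in the information-gain calculations below (the function name is illustrative, not from the original notebook):

```python
from math import log2

def entropy(counts):
    """Shannon entropy (bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Parent entropy of the outcome column: 6 yumyum vs 4 yukyuk
parent = entropy([6, 4])
print(round(parent, 3))  # 0.971
```

A pure class (e.g. all "yumyum") gives entropy 0, and a 50/50 split gives 1.0, which is why the weighted sums below only pick up terms from the mixed values.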
Carbs Attribute
Veg Attribute
Fruit Attribute
In [18]: df[['fruit','outcome']]
Out[18]:
fruit outcome
0 orange yumyum
1 plum yukyuk
2 fig yukyuk
3 pineapple yumyum
4 peach yukyuk
5 banana yukyuk
6 cherry yumyum
7 apple yumyum
8 cherry yumyum
9 plum yumyum
In [20]: FruitEntropy = (1/10 * 0 + 2/10 * 1.0 + 1/10 * 0 + 1/10 * 0 + 1/10 * 0 + 1/10 * 0
    ...:                 + 2/10 * 0 + 1/10 * 0)
    ...: infoGainFruit = 0.970 - FruitEntropy
    ...: print('Entropy for the fruit attribute is', FruitEntropy)
    ...: print('Information gain for the fruit attribute is', infoGainFruit)
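The same calculation can be sketched directly from the two columns shown in Out[18], grouping on each fruit value and weighting by how many rows it covers (helper names are illustrative, not from the original notebook):

```python
import pandas as pd
from math import log2

def entropy(counts):
    """Shannon entropy (bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# The fruit/outcome pairs from Out[18]
fruit_df = pd.DataFrame({
    "fruit": ["orange", "plum", "fig", "pineapple", "peach",
              "banana", "cherry", "apple", "cherry", "plum"],
    "outcome": ["yumyum", "yukyuk", "yukyuk", "yumyum", "yukyuk",
                "yukyuk", "yumyum", "yumyum", "yumyum", "yumyum"],
})

n = len(fruit_df)
parent = entropy(fruit_df["outcome"].value_counts().tolist())

# Weight each fruit value's entropy by the fraction of rows it covers;
# only plum (one yumyum, one yukyuk) contributes a non-zero term
weighted = sum(
    len(grp) / n * entropy(grp["outcome"].value_counts().tolist())
    for _, grp in fruit_df.groupby("fruit")
)
print(round(weighted, 3), round(parent - weighted, 3))  # 0.2 0.771
```

With the unrounded parent entropy the gain comes out as 0.771 rather than the 0.770 obtained from the rounded 0.970 above.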
Spice Attribute
In [21]: df[['spice','outcome']]
Out[21]:
spice outcome
0 lavender yumyum
1 rosemary yukyuk
2 fenugreek yukyuk
3 jalapeno yumyum
4 capsicum yukyuk
5 fennel yukyuk
6 rosemary yumyum
7 capsicum yumyum
8 jalapeno yumyum
9 dill yumyum
In [23]: SpiceEntropy = (1/10 * 0 + 2/10 * 1.0 + 1/10 * 0 + 2/10 * 0 + 2/10 * 1.0 + 1/10 * 0
    ...:                 + 1/10 * 0)
    ...: infoGainSpice = 0.970 - SpiceEntropy
    ...: print(SpiceEntropy, 'is the entropy for the spice attribute')
    ...: print(infoGainSpice, 'is the information gain for the spice attribute')
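As with fruit, a generic information-gain helper applied to the spice column from Out[21] reproduces these numbers (names are illustrative, not from the original notebook):

```python
import pandas as pd
from math import log2

def entropy(counts):
    """Shannon entropy (bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(df, attr, target="outcome"):
    """Parent entropy minus the weighted entropy after splitting on attr."""
    parent = entropy(df[target].value_counts().tolist())
    weighted = sum(
        len(grp) / len(df) * entropy(grp[target].value_counts().tolist())
        for _, grp in df.groupby(attr)
    )
    return weighted, parent - weighted

# The spice/outcome pairs from Out[21]
spice_df = pd.DataFrame({
    "spice": ["lavender", "rosemary", "fenugreek", "jalapeno", "capsicum",
              "fennel", "rosemary", "capsicum", "jalapeno", "dill"],
    "outcome": ["yumyum", "yukyuk", "yukyuk", "yumyum", "yukyuk",
                "yukyuk", "yumyum", "yumyum", "yumyum", "yumyum"],
})

# rosemary and capsicum are each a 50/50 split, so each adds 2/10 * 1.0
weighted, gain = info_gain(spice_df, "spice")
print(round(weighted, 3), round(gain, 3))  # 0.4 0.571
```

Again the unrounded parent entropy gives a gain of 0.571 rather than the 0.570 from the rounded 0.970.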
The attribute with the lowest entropy and highest information gain is Fruit.
The best attribute to use as the root node of the decision tree would normally be the one with the lowest
weighted entropy, which in turn has the highest information gain. From the results we can see that the fruit
attribute has the lowest entropy and therefore the highest information gain. However, it would not be suitable
as the root node, because the dataset is small and the fruit attribute contains 8 distinct values across the 10
rows of the dataset. Using fruit as the root node would mean splitting on 8 values, producing a complex
decision tree that mostly memorises individual rows. If the dataset were larger we could get a more reliable
entropy estimate for the fruit attribute. Therefore the best attribute to use as the root node of the decision
tree is the meat attribute. Splitting meat into three subsets of the values it contains (pork, sausage and duck),
we can examine further attributes to see whether the outcome is "yumyum" or "yukyuk". For sausage the
outcome is always "yumyum" (entropy 0), so the ingredients in the other attributes do not matter.
The next meat value with the lowest entropy is pork, but the outcome for pork is not uniform, so another
attribute is needed to decide it. For the pork subset we look at the remaining attribute with the highest
information gain, which is Veg: if the row contains cabbage the outcome is "yumyum", and if it contains
artichoke the outcome is "yukyuk". Similarly, for the duck value the only attribute left to decide on is carbs,
which has three different values determining the outcome: dumplings = yumyum, turnip = yumyum,
pasta = yukyuk.
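The tree described above can be written out directly as a sketch, a hand-built classifier following the splits chosen here rather than anything from the original notebook:

```python
def predict(meat, veg=None, carbs=None):
    """Decision tree: root on meat, then veg for pork, carbs for duck."""
    if meat == "sausage":
        return "yumyum"          # sausage is always yumyum (entropy 0)
    if meat == "pork":
        # pork splits on the veg attribute: cabbage -> yumyum, artichoke -> yukyuk
        return "yumyum" if veg == "cabbage" else "yukyuk"
    if meat == "duck":
        # duck splits on the carbs attribute: pasta -> yukyuk, otherwise yumyum
        return "yukyuk" if carbs == "pasta" else "yumyum"
    return None  # meat value not seen in the dataset

print(predict("sausage"))                  # yumyum
print(predict("pork", veg="artichoke"))    # yukyuk
print(predict("duck", carbs="dumplings"))  # yumyum
```

Each branch terminates in a pure leaf on the 10 training rows, which is exactly why no further attributes are needed below veg and carbs.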