Root Split For The Iris Data Set
2. Show that the Gini impurity function for binary classification with two
classes C1 and C2 can be simplified to Gini(T) = 2p(1−p), where p is the relative
frequency of class C1 in T.
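A numerical check can complement the algebraic proof asked for in Problem 2. The sketch below assumes the usual definition of Gini impurity, one minus the sum of squared class frequencies, and compares it with 2p(1−p) for a few values of p. The function names gini and gini_binary are illustrative only and do not come from the assignment.

```python
def gini(class_probs):
    """Gini impurity, assuming the usual definition: 1 - sum of squared class frequencies."""
    return 1.0 - sum(q * q for q in class_probs)


def gini_binary(p):
    """Simplified two-class form from Problem 2, where p is the frequency of class C1."""
    return 2.0 * p * (1.0 - p)


# Compare the general definition with the simplified form on a few values of p.
for p in [0.0, 0.1, 0.25, 0.5, 0.9, 1.0]:
    general = gini([p, 1.0 - p])
    simplified = gini_binary(p)
    print(f"p={p:.2f}  general={general:.4f}  simplified={simplified:.4f}")
```

Agreement on sample values is not a proof, but it is a quick way to catch an algebra slip before writing up the derivation.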
3. Calculate the maximum entropy for a set containing a mixture of four
classes. Repeat the calculation for n classes; simplify the final formula as
much as possible (down to a single “log”).
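For Problem 3, a closed-form answer can be sanity-checked numerically. The sketch below assumes entropy is measured in bits (base-2 logarithm), which the problem does not specify, and relies on the fact that entropy is maximized when all classes are equally frequent. The helper names entropy and uniform are illustrative.

```python
import math


def entropy(class_probs):
    """Shannon entropy in bits (base 2 is an assumption); zero-probability classes contribute nothing."""
    return -sum(q * math.log2(q) for q in class_probs if q > 0)


def uniform(n):
    """Class frequencies for n equally frequent classes."""
    return [1.0 / n] * n


# Entropy of the uniform mixture next to a single-log candidate formula.
for n in [2, 4, 8, 10]:
    print(f"n={n}  entropy={entropy(uniform(n)):.4f}  log2(n)={math.log2(n):.4f}")

# A skewed four-class mixture for comparison with the uniform one.
print(entropy([0.7, 0.1, 0.1, 0.1]), entropy(uniform(4)))
```

Changing the logarithm base only rescales the values, so the comparison holds for natural logarithms as well.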
4. Explain why gain ratio is the preferred measure for selecting the best
split when using Gini or entropy as the purity function. What problem can
arise if we don’t use it, and why is that problem bad? Is gain ratio needed in
all types of decision trees that use Gini or entropy, or are some versions
immune?
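A small experiment can make the issue behind Problem 4 concrete. The sketch below assumes the standard definition of gain ratio (information gain divided by the split information, which is the entropy of the partition sizes) and uses a made-up eight-record data set. The attribute names record_id and coarse, and all values, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits of the empirical distribution of a list of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def partition(labels, attribute_values):
    """Group the class labels by attribute value (one group per branch of a multiway split)."""
    groups = {}
    for label, value in zip(labels, attribute_values):
        groups.setdefault(value, []).append(label)
    return list(groups.values())


def information_gain(labels, attribute_values):
    """Parent entropy minus the size-weighted entropy of the child partitions."""
    n = len(labels)
    children = partition(labels, attribute_values)
    remainder = sum(len(child) / n * entropy(child) for child in children)
    return entropy(labels) - remainder


def gain_ratio(labels, attribute_values):
    """Information gain divided by split information (undefined if the attribute has one value)."""
    split_info = entropy(attribute_values)
    return information_gain(labels, attribute_values) / split_info


# Hypothetical data: four positive and four negative records.
labels = ["+", "+", "+", "+", "-", "-", "-", "-"]
record_id = list(range(8))                          # unique value per record
coarse = ["a", "a", "a", "a", "b", "b", "b", "a"]   # two branches, one of them impure

for name, attr in [("record_id", record_id), ("coarse", coarse)]:
    print(name, information_gain(labels, attr), gain_ratio(labels, attr))
```

In this toy data the unique-valued attribute obtains the larger information gain even though it cannot generalize to unseen records, while gain ratio ranks the two-branch attribute higher. The effect depends on multiway splits with many branches, which is worth keeping in mind when answering the last part of the question.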
5. The split in Figure 1 is the split at the root of the decision tree for the Iris
data set generated by the Scikit-learn DecisionTreeClassifier().
Please calculate the following measures for this split. Note: Keep in mind that
the Iris data set has three classes. In the figure, they are reported using the
order [setosa, versicolor, virginica]