Root Split For The Iris Data Set

1. Explain what precisely is being decreased when we select the best split
using the Gini index. Be precise and to the point. ("Decrease in impurity" or
any variation of that is not the answer.)

2. Show that the Gini impurity function for binary classification with two
classes C1 and C2 can be simplified to Gini(T) = 2p(1 − p), where p is the
relative frequency of class C1 in T.
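
For reference, a minimal Python sketch (not part of the required answer) that
numerically checks the identity against the general definition of Gini
impurity:

    def gini(probs):
        """General Gini impurity: 1 minus the sum of squared class probabilities."""
        return 1.0 - sum(p ** 2 for p in probs)

    # Numerically confirm Gini(T) = 2p(1 - p) for a few values of p.
    for p in [0.0, 0.1, 0.25, 0.5, 0.9]:
        assert abs(gini([p, 1 - p]) - 2 * p * (1 - p)) < 1e-12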
3. Calculate the maximum entropy for a set containing a mixture of four
classes. Repeat the calculation for n classes; simplify the final formula as
much as possible (down to a single "log").
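
A short sketch for sanity-checking the hand calculation (entropy measured in
bits, i.e. base-2 logarithms):

    import math

    def entropy(probs):
        """Shannon entropy in bits; terms with p = 0 contribute nothing."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Entropy is maximized by the uniform distribution over the n classes.
    for n in [2, 4, 8]:
        print(n, entropy([1.0 / n] * n))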
4. Explain why gain ratio is the preferred measure for selecting the best
split when using Gini or entropy as the purity function. What is the possible
problem if we don't use it, and why is that problem bad? Is gain ratio needed
in all types of decision trees that use Gini or entropy, or are some versions
immune?
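
For context, a hedged sketch of how gain ratio normalizes a raw purity gain
(the function names are illustrative, not from any particular library):

    import math

    def split_info(branch_sizes):
        """Entropy of the branch-size proportions of a candidate split."""
        total = sum(branch_sizes)
        return -sum(s / total * math.log2(s / total) for s in branch_sizes if s)

    def gain_ratio(gain, branch_sizes):
        """Purity gain normalized by split information, as in C4.5-style trees."""
        si = split_info(branch_sizes)
        return gain / si if si > 0 else 0.0

    # The same raw gain is penalized far more for a 10-way split than for a
    # balanced 2-way split:
    print(gain_ratio(0.5, [50, 50]))   # 0.5   (split info = 1 bit)
    print(gain_ratio(0.5, [10] * 10))  # ~0.15 (split info = log2(10) bits)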

5. The split in Figure 1 is the split at the root of the decision tree for the
Iris data set generated by the Scikit-learn DecisionTreeClassifier().
Please calculate the following measures for this split. Note: keep in mind
that the Iris data set has three classes; in the figure, they are reported in
the order [setosa, versicolor, virginica].

a. Purity gain using the Gini impurity function.
b. Information gain.

Figure 1: Root split for the Iris data set
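
If you want to reproduce a figure like Figure 1 yourself, a minimal sketch
using scikit-learn (assuming it is installed; the exact thresholds reported
may depend on the library version):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    # max_depth=1 keeps only the root split; the default criterion is "gini".
    clf = DecisionTreeClassifier(max_depth=1, random_state=0)
    clf.fit(iris.data, iris.target)
    print(export_text(clf, feature_names=list(iris.feature_names)))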
6. Use the Gini index to build a decision tree with multi-way splits using the
training examples in Figure 2 below. Solve the problem by providing
responses to the following prompts.
a. Explain why Customer ID should not be used as an attribute test
condition.
b. Select the best split for the root. List/show all the splits you considered
together with their corresponding values of the Gini index. Justify your
selection of the root split condition.
c. Find all the remaining splits needed to construct a full decision tree in
which every leaf contains only a single class. Show all the splits you
considered, including the Gini index for each one. Assign a class to each leaf.
d. Did you have difficulty assigning classes to leaves in part (c)? Note that
we cannot split a leaf if its records have identical attributes (see slide 78,
"Pre-pruning: Stopping Criteria"). This situation is called a "clash", and
there are various methods of dealing with it. In this assignment we just leave
it as is, but how would you handle it if you encountered it in a real project?
e. Use the final tree to classify the record (F, Family, Small).
f. Suppose we use a stopping criterion that disallows leaves with fewer than
two examples. Modify the tree accordingly and reclassify the record (F,
Family, Small). (Note that there is no need to re-build the tree from scratch!)
g. Find the overall impurity of the tree in part (f).
Figure 2: Training set for Problem 6
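
As an aid for parts (b) and (c), an illustrative helper (the data layout is an
assumption, not prescribed by the assignment) for scoring a candidate
multi-way split from its per-branch class counts:

    def gini(counts):
        """Gini impurity computed from raw class counts."""
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    def split_gini(branches):
        """Weighted Gini of a multi-way split; each branch is a class-count list."""
        n = sum(sum(b) for b in branches)
        return sum(sum(b) / n * gini(b) for b in branches)

    # Hypothetical 3-way split over 12 records with two classes [C1, C2]:
    print(split_gini([[4, 0], [2, 2], [0, 4]]))  # 0.1667; only the middle branch is impure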
7. Consider the set of training examples in Figure 3 below.
Figure 3: Training set for Problem 7

a. Compute a two-level decision tree (use tree depth/height as the stopping
criterion) using the classification error rate as the splitting criterion.
Calculate the overall error rate of the induced tree.
b. Repeat part (a) using X as the first splitting attribute and then choose the
best remaining attribute for splitting at each of the two successor nodes. What
is the error rate of the induced tree?
c. Compare the error rates of the trees induced in parts (a) and (b), and
comment on the difference. [Hint: the error rates should be different.] What
important property of decision tree algorithms does this illustrate?
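
For reference, a short sketch of the splitting criterion named in part (a);
the function names are illustrative:

    def error_rate(counts):
        """Classification error of a node: 1 minus the majority-class frequency."""
        return 1.0 - max(counts) / sum(counts)

    def split_error(branches):
        """Weighted error rate of a split; each branch is a class-count list."""
        n = sum(sum(b) for b in branches)
        return sum(sum(b) / n * error_rate(b) for b in branches)

    # Hypothetical binary split over 8 records with counts [class1, class2]:
    print(split_error([[3, 1], [0, 4]]))  # 0.125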

8. As a data scientist, you will have to use software and programming
documentation on a daily basis. And if you have not used online documentation
in the past, then the Scikit-learn documentation is a great one to start with!
It is truly excellent! Thus, for this problem, you will consult the
Scikit-learn online documentation and describe the type of decision tree that
it implements. Make sure to list which algorithm it implements, the accepted
type of input, the type of target variable, the split selection function
(impurity or error), the stopping criteria, the pruning method, the type of
splits, etc.
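
One practical starting point (assuming scikit-learn is installed) is to list
the estimator's parameters, which map directly onto several of the items
above:

    from sklearn.tree import DecisionTreeClassifier

    # The printed dictionary includes the split criterion ("gini" by default),
    # stopping criteria such as max_depth, min_samples_split and
    # min_samples_leaf, and the cost-complexity pruning parameter ccp_alpha.
    print(DecisionTreeClassifier().get_params())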
