Data Mining Homework
2. List two advantages and two disadvantages of k-nearest neighbors for classification.
Advantages:
• There are only two main hyperparameters to tune: the number of neighbors k and the way distance is handled (e.g., the Euclidean distance metric, inverse-distance weighting).
• K-nearest neighbors can produce more flexible decision boundaries than algorithms restricted to rectilinear decision boundaries (e.g., decision trees).
Disadvantages:
• The choice of the hyperparameter k is delicate. If k is too small, the model is prone to overfitting; if k is too large, the model may underfit the dataset.
• The k-nearest neighbors algorithm is a lazy learner: no model is built during training, so every prediction must search the stored training data again, which can be expensive with large datasets (see the sketch after this list).
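As a concrete illustration of both points, here is a minimal scikit-learn sketch (assuming scikit-learn is available; the dataset and parameter values are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy data standing in for any labeled dataset (values are arbitrary).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The two hyperparameters from the list above: k and the distance weighting.
for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance")
    knn.fit(X_train, y_train)  # "fit" only stores the data: lazy learning
    # Each call to score/predict searches the stored training set again.
    print(k, knn.score(X_test, y_test))
```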
3. List one limitation of “Accuracy” as a metric used for the performance evaluation of a classifier.
Accuracy is misleading on imbalanced datasets: a classifier that always predicts the majority class can achieve high accuracy while performing poorly on the rare class.
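For example, on a hypothetical dataset with 95 negative and 5 positive examples, a classifier that always predicts the majority class reaches 95% accuracy while never detecting a positive case; a minimal sketch:

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a classifier that always predicts the majority class

print(accuracy_score(y_true, y_pred))                 # 0.95 -- looks strong
print(recall_score(y_true, y_pred, zero_division=0))  # 0.0 -- misses every rare case
```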
4. Consider the training dataset shown in the following table. Predict the class label (stolen?) for a
test example (Color=Red, Type=SUV, Origin=Domestic) using the naïve Bayes approach. Show
your calculations.
No.  Color   Type    Origin    Stolen?
1    Red     Sports  Domestic  Yes
2    Red     Sports  Domestic  No
3    Red     Sports  Domestic  Yes
4    Yellow  Sports  Domestic  No
5    Yellow  Sports  Imported  Yes
6    Yellow  SUV     Imported  No
7    Yellow  SUV     Imported  Yes
8    Yellow  SUV     Domestic  No
9    Red     SUV     Imported  No
10   Red     Sports  Imported  Yes
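A worked sketch of the calculation, with the priors and class-conditional likelihoods read directly off the per-attribute counts in the table (which match the counts used in question 5):

$P(Yes) = P(No) = \frac{5}{10}$

$P(Red \mid Yes) = \frac{3}{5}, \quad P(SUV \mid Yes) = \frac{1}{5}, \quad P(Domestic \mid Yes) = \frac{2}{5}$

$P(Red \mid No) = \frac{2}{5}, \quad P(SUV \mid No) = \frac{3}{5}, \quad P(Domestic \mid No) = \frac{3}{5}$

$P(Yes \mid x) \propto \frac{5}{10} \cdot \frac{3}{5} \cdot \frac{1}{5} \cdot \frac{2}{5} = 0.024$

$P(No \mid x) \propto \frac{5}{10} \cdot \frac{2}{5} \cdot \frac{3}{5} \cdot \frac{3}{5} = 0.072$

Since $0.072 > 0.024$, naïve Bayes predicts Stolen = No for (Color=Red, Type=SUV, Origin=Domestic).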
5. Consider the training examples shown in the previous question. Which attribute would a decision
tree induction algorithm choose as the first splitting attribute (root node)? Use Entropy as the
measure of node impurity for selecting the best split. Show your calculations.
a. Because there are 5 stolen vehicles (stolen=yes) and 5 non-stolen vehicles (stolen=no), the initial entropy is $-\frac{5}{10}\log_2\left(\frac{5}{10}\right) - \frac{5}{10}\log_2\left(\frac{5}{10}\right) = 1$.
b. Next, we compute the information gain for each categorical feature. The results are as follows:
Attribute Color (10 cases)

Value    #cases  Yes  No  MI
Red         5     3    2  0.970951
Yellow      5     2    3  0.970951

$MI_1 = -\frac{3}{5}\log_2\left(\frac{3}{5}\right) - \frac{2}{5}\log_2\left(\frac{2}{5}\right) = 0.971$

$MI_2 = -\frac{2}{5}\log_2\left(\frac{2}{5}\right) - \frac{3}{5}\log_2\left(\frac{3}{5}\right) = 0.971$

$Entropy = \frac{5}{10}\,MI_1 + \frac{5}{10}\,MI_2 = 0.971$

$Gain = 1 - 0.971 = 0.029$
Attribute Type (10 cases)

Value    #cases  Yes  No  MI
Sports      6     4    2  0.918296
SUV         4     1    3  0.811278

$MI_1 = -\frac{4}{6}\log_2\left(\frac{4}{6}\right) - \frac{2}{6}\log_2\left(\frac{2}{6}\right) = 0.918$

$MI_2 = -\frac{1}{4}\log_2\left(\frac{1}{4}\right) - \frac{3}{4}\log_2\left(\frac{3}{4}\right) = 0.811$

$Entropy = \frac{6}{10}\,MI_1 + \frac{4}{10}\,MI_2 = 0.875$

$Gain = 1 - 0.875 = 0.1245$
Attribute Origin (10 cases)

Value      #cases  Yes  No  MI
Domestic      5     2    3  0.970951
Imported      5     3    2  0.970951

$Entropy = \frac{5}{10}\,MI_1 + \frac{5}{10}\,MI_2 = 0.971$, so $Gain = 1 - 0.971 = 0.029$ (the same computation as for Color, with the class counts swapped between values).
Given that Type gives the largest gain (0.1245 vs. 0.029 for both Color and Origin), this feature is chosen as the first split (root node). For details of the calculations, please check the annexed spreadsheet.
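As a cross-check of these gains, a short Python sketch using the table from question 4:

```python
from math import log2
from collections import Counter

# The ten training rows from question 4: (Color, Type, Origin, Stolen?).
rows = [
    ("Red", "Sports", "Domestic", "Yes"), ("Red", "Sports", "Domestic", "No"),
    ("Red", "Sports", "Domestic", "Yes"), ("Yellow", "Sports", "Domestic", "No"),
    ("Yellow", "Sports", "Imported", "Yes"), ("Yellow", "SUV", "Imported", "No"),
    ("Yellow", "SUV", "Imported", "Yes"), ("Yellow", "SUV", "Domestic", "No"),
    ("Red", "SUV", "Imported", "No"), ("Red", "Sports", "Imported", "Yes"),
]

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def gain(rows, i):
    """Information gain of splitting on attribute column i."""
    base = entropy([r[-1] for r in rows])
    weighted = 0.0
    for v in {r[i] for r in rows}:
        subset = [r[-1] for r in rows if r[i] == v]
        weighted += len(subset) / len(rows) * entropy(subset)
    return base - weighted

for i, name in enumerate(("Color", "Type", "Origin")):
    print(name, round(gain(rows, i), 6))
# Color 0.029049, Type 0.124511, Origin 0.029049 -> Type is the root split
```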