
DATA MINING HOMEWORK

DUE DATE: 03/06/2020

Please submit the assignment at the beginning of the lecture.

1. Explain why data exploration is important for data mining.


Data exploration is useful because it is a preliminary investigation of the data that reveals its
specific characteristics. It aids in choosing appropriate preprocessing and data analysis techniques.
Furthermore, it can directly answer some data mining questions, such as revealing obvious patterns in the
data. Finally, it helps to better understand and interpret data mining results.

2. List two advantages and two disadvantages of k-nearest neighbors for classification.
Advantages:
• There are only two hyperparameters to tune: the number of neighbors k and the distance/weighting
scheme (e.g. Euclidean distance, inverse-distance weighting).
• K-nearest neighbors can produce more flexible decision boundaries than methods restricted to
rectilinear decision boundaries.

Disadvantages:
• The choice of the hyperparameter k. If it is too small, the model is prone to overfitting.
If k is too large, the model could underfit the data set.
• The k-nearest neighbors algorithm is a lazy learner. That implies that for every prediction, the
algorithm must be applied again from scratch; recomputing can be expensive with large
datasets (see the sketch after this list).
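
As a minimal sketch of the lazy-learning point (using made-up 2-D points, not the homework data), the predict function below scans the entire training set on every call, which is why prediction becomes expensive on large datasets:

from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    # Distance from the query to every training point (the whole set is scanned).
    dists = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Keep the k closest neighbors and take a majority vote over their labels.
    neighbors = sorted(dists, key=lambda d: d[0])[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Hypothetical toy data, purely for illustration.
X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
y = ["A", "A", "B", "B"]
print(knn_predict(X, y, (1.1, 0.9), k=3))   # expected output: A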

3. List one limitation of “Accuracy” as a metric used for the performance evaluation of a classifier.
Accuracy is inadequate for imbalanced datasets: a classifier that performs poorly on the rare class can
still achieve high accuracy and therefore look good under this metric.
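
A toy numerical illustration (hypothetical counts, not from the homework): on a dataset with 95 negative and 5 positive examples, a classifier that always predicts the majority class reaches 95% accuracy while never detecting the rare class.

y_true = [0] * 95 + [1] * 5      # 95 majority-class examples, 5 rare positives
y_pred = [0] * 100               # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall_rare = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / 5

print(accuracy)      # 0.95 -- looks good
print(recall_rare)   # 0.0  -- the rare class is never detected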

4. Consider the training dataset shown in the following table. Predict the class label (stolen?) for a
test example (Color=Red, Type=SUV, Origin=Domestic) using the naïve Bayes approach. Show
your calculations.

Example No.   Color    Type    Origin    Stolen?
1             Red      Sports  Domestic  Yes
2             Red      Sports  Domestic  No
3             Red      Sports  Domestic  Yes
4             Yellow   Sports  Domestic  No
5             Yellow   Sports  Imported  Yes
6             Yellow   SUV     Imported  No
7             Yellow   SUV     Imported  Yes
8             Yellow   SUV     Domestic  No
9             Red      SUV     Imported  No
10            Red      Sports  Imported  Yes

P(stolen = yes) = 5/10 = 0.5
P(stolen = no)  = 5/10 = 0.5
P(color = red | stolen = yes) = 3/5
P(color = red | stolen = no)  = 2/5
P(type = SUV | stolen = yes) = 1/5
P(type = SUV | stolen = no)  = 3/5
P(origin = domestic | stolen = yes) = 2/5
P(origin = domestic | stolen = no)  = 3/5

X: (color = red, type = SUV, origin = domestic)

P(stolen = yes | X) ∝ P(X | yes) P(yes) = (3/5)(1/5)(2/5)(1/2) = 0.024 = 2.4%
P(stolen = no | X)  ∝ P(X | no) P(no)   = (2/5)(3/5)(3/5)(1/2) = 0.072 = 7.2%

Since 0.072 > 0.024, the naïve Bayes classifier predicts stolen = no for the test example.
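
The counting above can be reproduced with a short Python sketch over the 10 training examples; this is only an illustration of the relative-frequency estimates used here, not a general naïve Bayes implementation.

data = [
    ("Red", "Sports", "Domestic", "Yes"),
    ("Red", "Sports", "Domestic", "No"),
    ("Red", "Sports", "Domestic", "Yes"),
    ("Yellow", "Sports", "Domestic", "No"),
    ("Yellow", "Sports", "Imported", "Yes"),
    ("Yellow", "SUV", "Imported", "No"),
    ("Yellow", "SUV", "Imported", "Yes"),
    ("Yellow", "SUV", "Domestic", "No"),
    ("Red", "SUV", "Imported", "No"),
    ("Red", "Sports", "Imported", "Yes"),
]
test = ("Red", "SUV", "Domestic")

def cond_prob(attr_index, value, label):
    # P(attribute = value | stolen = label), estimated by relative frequency.
    rows = [r for r in data if r[3] == label]
    return sum(r[attr_index] == value for r in rows) / len(rows)

for label in ("Yes", "No"):
    prior = sum(r[3] == label for r in data) / len(data)
    score = prior
    for i, value in enumerate(test):
        score *= cond_prob(i, value, label)
    print(label, round(score, 3))   # Yes -> 0.024, No -> 0.072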

5. Consider the training examples shown in the previous question. Which attribute would a decision
tree induction algorithm choose as the first splitting attribute (root node)? Use Entropy as the
measure of node impurity for selecting the best split. Show your calculations.

a. Because there are 5 stolen vehicles (stolen = yes) and 5 non-stolen vehicles (stolen = no), the initial
entropy of the root node is -(5/10) log2(5/10) - (5/10) log2(5/10) = 1.
b. Next, we compute the information gain for each categorical attribute. The results are the following:

Attribute: Color (10 cases)
  Red:    5 cases (3 Yes, 2 No), MI1 = 0.970951
  Yellow: 5 cases (2 Yes, 3 No), MI2 = 0.970951

MI1 = -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.971
MI2 = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971
Entropy = (5/10) * MI1 + (5/10) * MI2 = 0.971
Gain = 1 - 0.971 = 0.029

Attribute: Type (10 cases)
  Sports: 6 cases (4 Yes, 2 No), MI1 = 0.918296
  SUV:    4 cases (1 Yes, 3 No), MI2 = 0.811278

MI1 = -(4/6) log2(4/6) - (2/6) log2(2/6) = 0.918
MI2 = -(1/4) log2(1/4) - (3/4) log2(3/4) = 0.811
Entropy = (6/10) * MI1 + (4/10) * MI2 = 0.875
Gain = 1 - 0.875 = 0.1245

Attribute: Origin (10 cases)
  Domestic: 5 cases (2 Yes, 3 No), MI1 = 0.970951
  Imported: 5 cases (3 Yes, 2 No), MI2 = 0.970951

The computation is analogous to that for the COLOR attribute:
Entropy = (5/10) * MI1 + (5/10) * MI2 = 0.971
Gain = 1 - 0.971 = 0.029

Given that TYPE yields the largest information gain (0.1245), the decision tree induction algorithm will
choose it as the first splitting attribute (root node). For details of the calculations, please check the
annexed spreadsheet.
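
For reference, the information-gain computation can also be verified with a short Python sketch over the same 10 examples (a minimal illustration, not the annexed spreadsheet):

from math import log2

data = [
    ("Red", "Sports", "Domestic", "Yes"),
    ("Red", "Sports", "Domestic", "No"),
    ("Red", "Sports", "Domestic", "Yes"),
    ("Yellow", "Sports", "Domestic", "No"),
    ("Yellow", "Sports", "Imported", "Yes"),
    ("Yellow", "SUV", "Imported", "No"),
    ("Yellow", "SUV", "Imported", "Yes"),
    ("Yellow", "SUV", "Domestic", "No"),
    ("Red", "SUV", "Imported", "No"),
    ("Red", "Sports", "Imported", "Yes"),
]

def entropy(labels):
    # H = -sum p * log2(p) over the class proportions in `labels`.
    n = len(labels)
    counts = {l: labels.count(l) for l in set(labels)}
    return -sum((c / n) * log2(c / n) for c in counts.values())

root = entropy([r[3] for r in data])            # 1.0 (5 Yes vs 5 No)

for name, idx in (("Color", 0), ("Type", 1), ("Origin", 2)):
    weighted = 0.0
    for value in set(r[idx] for r in data):
        subset = [r[3] for r in data if r[idx] == value]
        weighted += len(subset) / len(data) * entropy(subset)
    print(name, round(root - weighted, 4))
# Color 0.029, Type 0.1245, Origin 0.029 -> Type has the largest gain.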
