
Task 1

Try a Random Forest Classifier on the Iris dataset example and, when creating decision trees from the
bootstrapped dataset, use different numbers of features:
• a subset of 2 features
• a subset of 3 features
• a subset of 4 features
And compare the accuracy of your Random Forest Classifier. Which one is better? Discuss your
observations and justify your answer.
What is the largest number of features you may have?

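The Python script itself is not reproduced in this document; the following is a minimal sketch of such an experiment, assuming scikit-learn. The max_features parameter limits how many features each tree may consider at a split; the train/test split and random_state values below are assumptions, so the exact accuracy scores may differ from the ones reported next.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the Iris dataset (150 samples, 4 features, 3 classes).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Try subsets of 2, 3, and 4 features at each split of each tree.
for n_features in (2, 3, 4):
    clf = RandomForestClassifier(max_features=n_features, random_state=42)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"Feature subset of {n_features}: accuracy = {acc:.5f}")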
After running the Python script, we get three accuracy scores, one for each feature-subset size. With a
subset of 2 features the accuracy is 0.9375, while a subset of 3 features gives an accuracy of 0.96875.
The 3-feature subset is therefore better than the 2-feature subset, because it achieves the higher accuracy.
With a subset of 4 features the accuracy is again 0.96875, the same score as with 3 features. Since the
Iris dataset has only 4 features in total, the largest number of features we may have is 4; the results show
that accuracy stops improving after a subset of 3.

Task 2
You have been given a binary classification problem (positive/negative) where the original dataset
contains 29 positive and 35 negative samples. We have two features, A1 and A2, which can be used for
splitting the data. We would like to build a decision tree, and the figure below provides the resulting splits
for each feature. Which feature would you use to split the data, A1 or A2? Justify your answer.
Tip: use information gain
After the calculation for Task 2, we get an information gain of ~0.2659 for A1 and ~0.1214 for A2.
That means A1 gives more information than A2, so I would use A1 to split the data.
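These information gain values can be checked with a short script. Information gain is the entropy of the parent node minus the weighted average entropy of the child nodes after the split. The sketch below is a minimal version of that calculation; the per-branch (positive, negative) counts must be read off the figure, so the branch tuples passed in below are placeholders (consistent with the 29/35 totals, but not the figure's actual values).

from math import log2

def entropy(pos, neg):
    # Shannon entropy of a node holding pos positive and neg negative samples.
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * log2(p)
    return h

def information_gain(parent, branches):
    # parent and each branch are (pos, neg) tuples of sample counts.
    total = sum(parent)
    weighted = sum((p + n) / total * entropy(p, n) for p, n in branches)
    return entropy(*parent) - weighted

parent = (29, 35)  # counts from the task description
# Placeholder branch counts; replace with the counts read off the figure.
print(f"IG(A1) = {information_gain(parent, [(21, 5), (8, 30)]):.4f}")
print(f"IG(A2) = {information_gain(parent, [(18, 16), (11, 19)]):.4f}")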
