LAB4


BI12-215 Đàm Hữu Khoa

BI12-206 Lê Quang Khánh

REPORT LABWORK 4

I, Decision Tree - DT
Iris Dataset

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4              0.2
1                4.9               3.0                1.4              0.2
2                4.7               3.2                1.3              0.2
3                4.6               3.1                1.5              0.2
4                5.0               3.6                1.4              0.2
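
The head above can be reproduced with a short loading step. A minimal sketch, assuming the Iris data is loaded through sklearn.datasets.load_iris and wrapped in a pandas DataFrame (the names X and y match the split below):

import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris dataset bundled with scikit-learn: 150 samples, 4 features, 3 classes
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

print(X.head())   # prints the five rows shown above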

Divide the original dataset into two subsets: one for training (80%) and one for testing (20%).

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.20)
print("Shape of input - training set", X_train.shape)
print("Shape of output - training set", y_train.shape)
print("Shape of input - testing set", X_test.shape)
print("Shape of output - testing set", y_test.shape)
Shape of input - training set (120, 4)
Shape of output - training set (120,)
Shape of input - testing set (30, 4)
Shape of output - testing set (30,)

Build a DT on the training subset and test the built model on data from the testing subset.

from sklearn.tree import DecisionTreeClassifier

# Fit a decision tree on the training subset
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)
DecisionTreeClassifier()

# Test the model on the testing set
y_pred = dt_model.predict(X_test)
print("Model score:", dt_model.score(X_test, y_test))  # mean accuracy on the testing set
Model score: 0.9666666666666667
Try the “tree” package from sklearn:
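
A minimal sketch of one way to use the tree package, assuming the goal is to inspect the fitted model; export_text prints the tree as text rules (the exact call used in the lab is not shown):

from sklearn import tree

# Print the decision rules of the fitted tree, using the feature names from the loading sketch above
print(tree.export_text(dt_model, feature_names=iris.feature_names))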

Calculate the classification error:
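
The figures below can be computed along these lines; a sketch, assuming accuracy_score, confusion_matrix, and mean_squared_error from sklearn.metrics (the report's exact code is not shown):

from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

# Predictions on the training set (y_pred on the testing set was computed above)
y_pred_train = dt_model.predict(X_train)

print("Accuracy score on train data:", accuracy_score(y_train, y_pred_train))
print("Accuracy score on test data:", accuracy_score(y_test, y_pred))
print("Confusion Matrix - Train:")
print(confusion_matrix(y_train, y_pred_train))
print("Confusion Matrix - Test:")
print(confusion_matrix(y_test, y_pred))
print("MSE - train dataset:")
print(mean_squared_error(y_train, y_pred_train))
print("MSE - test dataset:")
print(mean_squared_error(y_test, y_pred))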


Accuracy score on train data: 0.9583333333333334
Accuracy score on test data: 0.9666666666666667
Confusion Matrix - Train:
[[39 0 0]
[ 0 36 1]
[ 0 4 40]]
Confusion Matrix - Test:
[[11 0 0]
[ 0 13 0]
[ 0 1 5]]
MSE - train dataset:
0.041666666666666664
MSE - test dataset:
0.03333333333333333
II, Random Forests
Seeds Dataset

    area  perimeter  compactness  lengthofkernel  widthofkernel  asymmetrycoefficient  lengthofkernelgroove  selector
0  15.26      14.84       0.8710           5.763          3.312                 2.221                 5.220         1
1  14.88      14.57       0.8811           5.554          3.333                 1.018                 4.956         1
2  14.29      14.09       0.9050           5.291          3.337                 2.699                 4.825         1
3  13.84      13.94       0.8955           5.324          3.379                 2.259                 4.805         1
4  16.14      14.99       0.9034           5.658          3.562                 1.355                 5.175         1

Create K = 100 training sets (using cross-validation):

- Split the dataset into training and testing sets:

X_train_seed, X_test_seed, y_train_seed, y_test_seed = train_test_split(X_seed, y_seed, test_size=0.20)

- Create a list to store cross-validation scores:

cross_scores = []

- Perform 10-fold cross-validation for a KNN classifier with 100 neighbors:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

knn = KNeighborsClassifier(n_neighbors=100)
scores = cross_val_score(knn, X_train_seed, y_train_seed, cv=10, scoring='accuracy')
cross_scores.append(scores.mean())
cross_scores : [0.8801470588235294]

Build a DT for each training set:

seed_dt = DecisionTreeClassifier()
seed_dt.fit(X_train_seed,y_train_seed)

- Test on the test set:

y_predict_seed = seed_dt.predict(X_test_seed)

- Test on the train set:

y_predict_seed_train = seed_dt.predict(X_train_seed)
Accuracy score on train data: 0.9404761904761905
Accuracy score on test data: 0.8571428571428571
Confusion Matrix - Train:
[[58 0 1]
[ 2 53 0]
[ 7 0 47]]
Confusion Matrix - Test:
[[ 9 1 1]
[ 0 15 0]
[ 4 0 12]]
MSE - train dataset:
0.20238095238095238
MSE - test dataset:
0.5

Classify data from the testing set using one DT:
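
The target/Predicted comparison below can be laid out with pandas; a sketch, assuming the single-tree predictions computed above (the report's exact code is not shown):

import pandas as pd
from sklearn.metrics import accuracy_score

# One row of true labels and one row of predictions, columns 0..41 of the test set
comparison = pd.DataFrame({"target": list(y_test_seed), "Predicted": list(y_predict_seed)}).T
print(comparison)
print("Accuracy :", accuracy_score(y_test_seed, y_predict_seed))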

0 1 2 3 4 5 6 7 8 9 ... 32 33 34 35 36 37 38 39 40 41
target 0 0 2 2 1 1 2 1 2 2 ... 1 1 2 2 2 1 2 1 1 1
Predicted 0 0 2 0 1 1 2 1 2 0 ... 1 1 2 0 2 1 2 1 1 1

Accuracy: 0.8571428571428571

Classify data from the testing set using all DTs:
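
One way to combine the predictions of many trees is scikit-learn's RandomForestClassifier; a sketch, assuming 100 trees to match K = 100 (the lab may instead have aggregated the individual DTs by majority vote), with the error metrics then computed exactly as in the single-tree case:

from sklearn.ensemble import RandomForestClassifier

# Ensemble of 100 decision trees, each trained on a bootstrap sample of the training set
rf_model = RandomForestClassifier(n_estimators=100)
rf_model.fit(X_train_seed, y_train_seed)

y_rf_train = rf_model.predict(X_train_seed)
y_rf_test = rf_model.predict(X_test_seed)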


Accuracy score on train data: 1.0
Accuracy score on test data: 0.9285714285714286
Confusion Matrix - Train:
[[59 0 0]
[ 0 50 0]
[ 0 0 59]]
Confusion Matrix - Test:
[[10 0 1]
[ 2 18 0]
[ 0 0 11]]
MSE - train dataset:
0.0
MSE - test dataset:
0.14285714285714285

0 1 2 3 4 5 6 7 8 9 ... 32 33 34 35 36 37 38 39 40 41
target 2 1 2 1 1 1 2 2 0 0 ... 0 1 0 2 1 1 1 1 0 1
Predicted 2 1 2 1 1 1 2 2 0 0 ... 0 1 0 2 1 0 1 1 0 1

Accuracy: 0.9285714285714286

Conclusion:

The Decision Tree model on the Iris dataset demonstrated high accuracy on both training and testing sets,
with minimal classification errors.

The Random Forest analysis on the Seeds dataset achieved higher test accuracy when aggregating predictions from multiple Decision Trees (0.93 versus 0.86 for a single tree), highlighting the ensemble's strength in improving predictive performance.

Overall, this labwork illustrates the effectiveness of Decision Trees and Random Forests for classification tasks, and shows how combining many trees can outperform a single tree on unseen data.
