Lab - 7 - 21130616 - TranhThanhVu - Ipynb - Colab

06/05/2024, 23:57 Lab_7_21130616_TranhThanhVu.ipynb - Colab

This lab covers GridSearchCV for tuning the hyper-parameters of an estimator, and applies
vectorization techniques to the movie reviews dataset for a classification task.

Import libraries


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from prettytable import PrettyTable
from sklearn import svm, datasets
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

from google.colab import drive


drive.mount('/content/drive')

Mounted at /content/drive

Task 1. With iris dataset


1.1. Apply GridSearchCV for SVM to find the best hyperparameters using the following param_grid.

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf','linear']}

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf','linear']}
iris = datasets.load_iris()
X = iris.data
Y = iris.target
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.3)
# name the estimator svc so the imported sklearn `svm` module is not overwritten
svc = SVC()
grid_search_svm = GridSearchCV(param_grid=param_grid,estimator=svc,cv = 3,scoring='accuracy',n_jobs=2)
grid_search_svm.fit(X_train,y_train)
grid_search_svm.predict(X_test)
print(f"Best param: {grid_search_svm.best_params_}")
print(f"CV results: {grid_search_svm.cv_results_}")

Best param: {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}


CV results: {'mean_fit_time': array([0.00394201, 0.00194025, 0.00213122, 0.00178782, 0.00175977,
0.00142312, 0.00162784, 0.00131456, 0.00171169, 0.00151682,
0.00158453, 0.00129326, 0.00144958, 0.00161902, 0.00146437,
0.00297554, 0.00179935, 0.0047915 , 0.00278982, 0.00140007,
0.00144506, 0.0013903 , 0.0013717 , 0.00132434, 0.00158079,
0.00142956, 0.00159621, 0.00170469, 0.00441432, 0.00225147,
0.00149473, 0.00285451, 0.00308839, 0.00198825, 0.00216699,
0.00250498, 0.00477314, 0.00167116, 0.00391547, 0.00233269,
0.00686407, 0.00225139, 0.00246263, 0.00209705, 0.00141033,
0.00222683, 0.00159399, 0.00188096, 0.00165097, 0.00349728]), 'std_fit_time': array([1.48778401e-03, 1.05259035e-04, 1.82140040
1.75816527e-05, 2.85087215e-05, 9.76143496e-05, 1.17470834e-04,
7.72678123e-05, 8.04916095e-05, 7.86138040e-05, 2.43423177e-05,
1.41754633e-04, 2.01766018e-04, 6.51370147e-05, 1.13451346e-03,
9.70805791e-05, 4.82880031e-03, 1.07405610e-03, 6.11955469e-05,

https://colab.research.google.com/drive/11B5XYEKV70Nqx6d8GtVj727wfVDZ22w5?hl=vi#printMode=true 1/10
1.12554446e-05, 5.70660512e-05, 8.41844565e-05, 5.11892827e-05,
1.92707740e-04, 9.79794971e-05, 5.80580653e-05, 2.54616532e-04,
2.19482172e-03, 8.56727682e-04, 5.07304590e-05, 2.09923166e-03,
2.24146313e-03, 5.51595706e-04, 4.44685868e-04, 3.89296211e-04,
2.98484326e-03, 3.39075666e-04, 1.82325689e-03, 1.17155954e-03,
3.26750513e-03, 6.63944076e-04, 5.76978708e-04, 2.38492663e-04,
5.05538598e-05, 1.66069793e-04, 4.03823802e-05, 3.87078533e-04,
1.41060722e-04, 2.16061048e-03]), 'mean_score_time': array([0.00153669, 0.00136471, 0.00157507, 0.00111755, 0.00113511,
0.00096226, 0.00108926, 0.00086832, 0.00119472, 0.00104149,
0.00105238, 0.00092347, 0.00363111, 0.00112542, 0.00103553,
0.00180014, 0.0012358 , 0.0011042 , 0.00147541, 0.00095566,
0.00110571, 0.00097116, 0.00095518, 0.00094938, 0.00365416,
0.0009764 , 0.00234723, 0.00217104, 0.00228731, 0.00121005,
0.00101479, 0.00165081, 0.00321507, 0.0036397 , 0.00355387,
0.00132362, 0.00124248, 0.0020589 , 0.00142026, 0.00396721,
0.00619411, 0.00297944, 0.00392747, 0.00118558, 0.00099413,
0.00132791, 0.00111198, 0.00118407, 0.0011541 , 0.00232164]), 'std_score_time': array([3.25144051e-05, 9.62587271e-05, 9.087352
8.94837580e-06, 4.69440471e-05, 2.42432607e-05, 2.05095417e-05,
1.20718149e-04, 2.73882892e-05, 6.04299000e-05, 7.92739286e-06,
3.77944747e-03, 1.23151806e-04, 5.08438761e-05, 9.68730889e-04,
8.77365974e-05, 1.77272119e-04, 8.25218408e-05, 2.79213390e-05,
2.03797949e-04, 6.90059688e-05, 2.05891465e-05, 2.12295200e-05,
3.75232223e-03, 3.32587983e-05, 1.55457669e-03, 1.53476861e-03,
1.10835448e-03, 8.22608211e-05, 3.72957248e-05, 1.06415071e-03,
2.90691982e-03, 1.82736642e-03, 1.96850376e-03, 3.01201677e-04,
2.34260325e-05, 1.54305400e-03, 9.54403201e-05, 4.10423000e-03,
3.43919079e-03, 2.49006286e-03, 3.34125131e-03, 7.05744053e-06,
1.86356654e-05, 2.29775513e-04, 3.57675549e-05, 2.05767834e-04,
4.16504512e-05, 1.47037935e-03]), 'param_C': masked_array(data=[0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 10, 10, 10, 10, 10, 10, 10, 10,
10, 10, 100, 100, 100, 100, 100, 100, 100, 100, 100,
100, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000,
1000, 1000],
mask=[False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False],
fill_value='?',
dtype=object), 'param_gamma': masked_array(data=[1, 1, 0.1, 0.1, 0.01, 0.01, 0.001, 0.001, 0.0001
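
The raw cv_results_ dictionary is hard to read when printed in full. A minimal sketch of one way to tabulate it with pandas, using the same grid as 1.1; the random_state value and the choice of columns are assumptions made here for illustration, not part of the original lab.

```python
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Same grid as in 1.1
param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf', 'linear']}

iris = datasets.load_iris()
# random_state=0 is an assumption added so the sketch is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

search = GridSearchCV(SVC(), param_grid, cv=3, scoring='accuracy')
search.fit(X_train, y_train)

# cv_results_ is a dict of equal-length arrays: one row per parameter combination
results = pd.DataFrame(search.cv_results_)

# Keep the parameter columns and the test-score summary, best-ranked first
summary = (results[['param_C', 'param_gamma', 'param_kernel',
                    'mean_test_score', 'std_test_score', 'rank_test_score']]
           .sort_values('rank_test_score')
           .reset_index(drop=True))
print(summary.head(5))
```

Sorting by rank_test_score puts the combination reported by best_params_ (or one tied with it) in the first row, which is usually easier to inspect than the printed dictionary.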

1.2. Apply GridSearchCV for kNN to find the best hyperparameters using the following param_grid.

grid_params = { 'n_neighbors' : [5,7,9,11,13,15],
                'weights' : ['uniform','distance'],
                'metric' : ['minkowski','euclidean','manhattan']}

where

* **n_neighbors**: Decide the best k based on the values we have computed earlier.
* **weights**: Check whether adding weights to the data points is beneficial to the model or not. 'uniform' assigns no weight, while 'distance' w
* **metric**: The distance metric to be used when calculating the similarity.
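
The effect of weights and metric can be seen on a tiny hand-made dataset. This is only an illustrative sketch: the three training points and the query are made up, and it relies on scikit-learn's behavior that 'distance' weights each neighbor by the inverse of its distance and that 'minkowski' uses p=2 by default.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D data: one class-0 point close to the query, two class-1 points further away
X = np.array([[0.0], [3.0], [4.0]])
y = np.array([0, 1, 1])
query = np.array([[0.5]])

# With k=3 all three points vote
uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X, y)
distance = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X, y)

# uniform: class 1 wins 2 votes to 1
# distance: class 0 gets 1/0.5 = 2.0, class 1 gets 1/2.5 + 1/3.5 (about 0.69)
pred_uniform = uniform.predict(query)[0]
pred_distance = distance.predict(query)[0]
print(pred_uniform, pred_distance)

# 'minkowski' with the default p=2 is the Euclidean distance,
# so these two settings return the same neighbor distances
d_mink, _ = KNeighborsClassifier(n_neighbors=3, metric='minkowski').fit(X, y).kneighbors(query)
d_eucl, _ = KNeighborsClassifier(n_neighbors=3, metric='euclidean').fit(X, y).kneighbors(query)
print(np.allclose(d_mink, d_eucl))
```

Because 'minkowski' and 'euclidean' coincide at p=2, the corresponding rows of the grid are expected to produce identical cross-validation scores.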

grid_params = { 'n_neighbors' : [5,7,9,11,13,15],
                'weights' : ['uniform','distance'],
                'metric' : ['minkowski','euclidean','manhattan']}
iris = datasets.load_iris()
X = iris.data
Y = iris.target
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.3)
kNN = KNeighborsClassifier()
# named for the kNN search so it is not confused with the SVM search from 1.1
grid_search_knn = GridSearchCV(param_grid=grid_params,estimator=kNN,cv = 5,scoring='accuracy',n_jobs=2)
grid_search_knn.fit(X_train,y_train)
grid_search_knn.predict(X_test)
print(f"Best param: {grid_search_knn.best_params_}")
print(f"CV results: {grid_search_knn.cv_results_}")

Best param: {'metric': 'minkowski', 'n_neighbors': 9, 'weights': 'uniform'}


CV results: {'mean_fit_time': array([0.00550156, 0.00423613, 0.00319223, 0.00340414, 0.00123158,
0.00233359, 0.00220909, 0.0059979 , 0.00432267, 0.00295324,
0.0011517 , 0.00373769, 0.0021512 , 0.001092 , 0.0058701 ,
0.00315609, 0.00280704, 0.00500274, 0.01122355, 0.00118332,
0.00363441, 0.00232773, 0.00276389, 0.00675564, 0.00141692,

0.00549245, 0.00245981, 0.00316486, 0.00123477, 0.00546222,
0.00797062, 0.00397177, 0.00324492, 0.00137267, 0.00270839,
0.00217333]), 'std_fit_time': array([0.0040854 , 0.00385502, 0.0040699 , 0.00446546, 0.00021731,
0.00260024, 0.00157195, 0.00521458, 0.00353601, 0.00354001,
0.00023039, 0.00527138, 0.00213062, 0.00016684, 0.00575109,
0.00412882, 0.00327057, 0.00475698, 0.0102467 , 0.00018194,
0.00256553, 0.00237529, 0.00317934, 0.00309037, 0.00030895,
0.00527827, 0.00260494, 0.00400612, 0.00015444, 0.0063484 ,
0.00569277, 0.00335759, 0.00383773, 0.0001405 , 0.00201755,
0.00186806]), 'mean_score_time': array([0.0165205 , 0.00400467, 0.01517892, 0.00395317, 0.01917338,
0.00842667, 0.01479707, 0.00358806, 0.0150527 , 0.00678368,
0.01771269, 0.00766015, 0.01491094, 0.01212606, 0.01975141,
0.00746527, 0.02318592, 0.00455399, 0.01362185, 0.01224132,
0.01085978, 0.007938 , 0.0231586 , 0.00627565, 0.01746416,
0.0047296 , 0.01648173, 0.0059576 , 0.01424918, 0.00607514,
0.01670666, 0.0045033 , 0.01955967, 0.00687408, 0.01278129,
0.01136312]), 'std_score_time': array([0.00475718, 0.00400224, 0.00227941, 0.00406912, 0.00770729,
0.00542684, 0.00864613, 0.00296894, 0.00374499, 0.00601403,
0.00693388, 0.00473999, 0.00147884, 0.00956051, 0.00390904,
0.00725684, 0.00884027, 0.00398355, 0.00207084, 0.0065258 ,
0.00526554, 0.00407174, 0.00934129, 0.00609684, 0.00679873,
0.00349006, 0.00207935, 0.00467491, 0.00150784, 0.00397382,
0.00095733, 0.00303253, 0.00393672, 0.00405426, 0.00957903,
0.00309952]), 'param_metric': masked_array(data=['minkowski', 'minkowski', 'minkowski', 'minkowski',
'minkowski', 'minkowski', 'minkowski', 'minkowski',
'minkowski', 'minkowski', 'minkowski', 'minkowski',
'euclidean', 'euclidean', 'euclidean', 'euclidean',
'euclidean', 'euclidean', 'euclidean', 'euclidean',
'euclidean', 'euclidean', 'euclidean', 'euclidean',
'manhattan', 'manhattan', 'manhattan', 'manhattan',
'manhattan', 'manhattan', 'manhattan', 'manhattan',
'manhattan', 'manhattan', 'manhattan', 'manhattan'],
mask=[False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False],
fill_value='?',
dtype=object), 'param_n_neighbors': masked_array(data=[5, 5, 7, 7, 9, 9, 11, 11, 13, 13, 15, 15, 5, 5, 7, 7,
9, 9, 11, 11, 13, 13, 15, 15, 5, 5, 7, 7, 9, 9, 11, 11,
13, 13, 15, 15],
mask=[False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False,
False, False, False, False],
fill_value='?',
dtype=object), 'param_weights': masked_array(data=['uniform', 'distance', 'uniform', 'distance',
'uniform', 'distance', 'uniform', 'distance',
'uniform', 'distance', 'uniform', 'distance',
'uniform', 'distance', 'uniform', 'distance',

1.3. Apply GridSearchCV for Random Forest to find the best hyperparameters using the following param_grid.

param_grid = {
'n_estimators': [25, 50, 100, 150],
'max_features': ['sqrt', 'log2', None],
'max_depth': [3, 6, 9],
'max_leaf_nodes': [3, 6, 9],
}

param_grid = {
    'n_estimators': [25, 50, 100, 150],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [3, 6, 9],
    'max_leaf_nodes': [3, 6, 9],
}
iris = datasets.load_iris()
X = iris.data
Y = iris.target
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.3)
rf = RandomForestClassifier()
# named for the random forest search so it is not confused with the SVM search from 1.1
grid_search_rf = GridSearchCV(param_grid=param_grid,estimator=rf,cv = 2,scoring='accuracy',n_jobs=2)
grid_search_rf.fit(X_train,y_train)
grid_search_rf.predict(X_test)
print(f"Best param: {grid_search_rf.best_params_}")
print(f"CV results: {grid_search_rf.cv_results_}")

Best param: {'max_depth': 3, 'max_features': 'sqrt', 'max_leaf_nodes': 3, 'n_estimators': 50}


CV results: {'mean_fit_time': array([0.09626365, 0.19386959, 0.33848476, 0.39952242, 0.06788683,
0.12062657, 0.24843252, 0.40627825, 0.06324041, 0.13456333,
0.25947177, 0.39266622, 0.06654894, 0.13419437, 0.24994969,
0.35609734, 0.06352544, 0.12715888, 0.257774 , 0.39287281,
0.06294632, 0.1202997 , 0.25043786, 0.36444044, 0.05897141,
0.12308168, 0.2338872 , 0.37996411, 0.08199823, 0.12918973,
0.46896315, 1.02874255, 0.28192902, 0.46077454, 0.94185531,
1.52785122, 0.22577679, 0.38003027, 0.4804796 , 0.90849817,
0.07998693, 0.2358743 , 0.50397515, 0.55759704, 0.14304745,
0.24104714, 0.53048098, 0.60793018, 0.16695142, 0.29579484,
0.63079655, 0.7163173 , 0.07583296, 0.23173535, 1.1034236 ,
0.71914017, 0.09688044, 0.2289629 , 0.53386199, 1.02129507,
0.1929884 , 0.46223927, 1.07445621, 1.47580087, 0.19636631,
0.45616353, 0.69491649, 0.69370258, 0.09769869, 0.27946591,
0.5507021 , 0.74141169, 0.05704927, 0.20792031, 0.35604513,
0.58248448, 0.09743941, 0.26799977, 0.59253764, 0.60888839,
0.06344068, 0.22148716, 0.45078409, 0.60174286, 0.10135698,
0.16656816, 0.38578033, 0.7397728 , 0.09358644, 0.17752957,
0.38714671, 0.45944977, 0.0661217 , 0.2027812 , 0.38623881,
0.71230388, 0.08746386, 0.27263665, 0.47315311, 0.73286963,
0.13612664, 0.22056353, 0.47867465, 0.62718594, 0.09935689,
0.21530187, 0.48689878, 0.37506032]), 'std_fit_time': array([3.66985798e-02, 2.46651173e-02, 5.26280403e-02, 2.58672237e-03,
3.14760208e-03, 1.07419491e-03, 1.38986111e-03, 1.57178640e-02,
2.99370289e-03, 7.63809681e-03, 5.96177578e-03, 8.34143162e-03,
1.46877766e-03, 2.26142406e-02, 1.07953548e-02, 9.22644138e-03,
1.82056427e-03, 3.51476669e-03, 1.28937960e-02, 2.25250721e-02,
2.60710716e-03, 3.83555889e-03, 1.07344389e-02, 2.09908485e-02,
1.86681747e-04, 3.70478630e-03, 3.81231308e-03, 5.61475754e-03,
1.68405771e-02, 5.51342964e-03, 8.48400593e-02, 4.22458649e-01,
2.52189636e-02, 6.16244078e-02, 2.27940083e-03, 2.42227912e-01,
2.49289274e-02, 1.95847750e-02, 6.86343908e-02, 3.80941629e-02,
2.00568438e-02, 6.15977049e-02, 1.41236782e-01, 4.53673601e-02,
1.38007402e-02, 5.04703522e-02, 2.28638768e-01, 6.01632595e-02,
6.24537468e-03, 9.80269909e-03, 4.55105305e-03, 1.85681939e-01,
1.37783289e-02, 1.09553337e-04, 7.44407177e-02, 1.29102349e-01,
4.04217243e-02, 4.42497730e-02, 4.63163853e-03, 1.51729107e-01,
7.99472332e-02, 4.67195511e-02, 1.44925117e-02, 1.23915315e-01,
2.24790573e-02, 1.54620409e-02, 7.99157619e-02, 6.46895170e-02,
3.39543819e-02, 4.47223186e-02, 1.03324413e-01, 8.18197727e-02,
1.44958496e-03, 4.56249714e-02, 1.21725798e-02, 3.30686569e-03,
3.40625048e-02, 1.05935335e-02, 2.46868134e-02, 8.42628479e-02,
1.99198723e-04, 7.37153292e-02, 4.38448191e-02, 1.89301848e-01,
4.22897339e-02, 5.67680597e-02, 6.35635853e-02, 7.45415688e-02,
2.97818184e-02, 5.69701195e-02, 2.04999447e-02, 8.02578926e-02,
1.48260593e-03, 2.92479992e-02, 4.06510830e-02, 2.66563892e-02,
2.41703987e-02, 1.79657936e-02, 1.24981403e-02, 1.17967129e-01,
4.74393368e-03, 3.44985723e-02, 7.47213364e-02, 1.03959680e-01,
1.13527775e-02, 8.00049305e-03, 4.07377481e-02, 3.43456268e-02]), 'mean_score_time': array([0.00789189, 0.00917006, 0.00980103,
0.00586593, 0.01055908, 0.01444721, 0.00394166, 0.00607753,
0.01117384, 0.01728272, 0.00380838, 0.00577998, 0.01037383,
0.01441264, 0.00391507, 0.00594628, 0.0098151 , 0.01465213,
0.00350702, 0.00615108, 0.01049054, 0.01500177, 0.00347066,
0.00582671, 0.01007795, 0.01737535, 0.00812364, 0.00848341,
0.01422465, 0.01258302, 0.01682293, 0.01729858, 0.03477144,
0.05068195, 0.01304996, 0.01105046, 0.01730144, 0.02266669,
0.00541842, 0.01041937, 0.01020813, 0.04236603, 0.0066148 ,
...

1.4. Compare the best results obtained in 1.1 to 1.3 (use PrettyTable to display the results)

Task 2.
For the breast cancer dataset (https://tinyurl.com/3vme8hr3), which can be loaded from the datasets module in sklearn as follows:

# Import scikit-learn dataset library


from sklearn import datasets

# Load dataset
cancer = datasets.load_breast_cancer()

Apply GridSearchCV to different classification algorithms such as SVM, kNN, LogisticRegression, RandomForest.
Compare the results obtained by the best hyperparameters among classification algorithms.

2.1. Apply GridSearchCV to SVM

# code
param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf','linear']}
cancer = datasets.load_breast_cancer()
X = cancer.data
Y = cancer.target
# fixed seed: the comparison in 2.5 is only fair if every model's split uses the same random_state
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.3,random_state=42)
# name the estimator svc so the imported sklearn `svm` module is not overwritten
svc = SVC()
grid_search_svm = GridSearchCV(param_grid=param_grid,estimator=svc,cv = 3,scoring='accuracy',n_jobs=2)
grid_search_svm.fit(X_train,y_train)
grid_search_svm.predict(X_test)
print(f"Best param: {grid_search_svm.best_params_}")
# print(f"CV results: {grid_search_svm.cv_results_}")

Best param: {'C': 1, 'gamma': 1, 'kernel': 'linear'}

2.2. Apply GridSearchCV to kNN

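#code
grid_params = { 'n_neighbors' : [5,7,9,11,13,15,17,19],
                # 'custom' is not a valid weights option (every fit using it fails with
                # InvalidParameterError), so only the two built-in strategies are searched
                'weights' : ['uniform','distance'],
                'metric' : ['minkowski','euclidean','manhattan']}
cancer = datasets.load_breast_cancer()
X = cancer.data
Y = cancer.target
# fixed seed: the comparison in 2.5 is only fair if every model's split uses the same random_state
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.3,random_state=42)
kNN = KNeighborsClassifier()
grid_search_kNN = GridSearchCV(param_grid=grid_params,estimator=kNN,cv = 5,scoring='accuracy',n_jobs=2)
grid_search_kNN.fit(X_train,y_train)
grid_search_kNN.predict(X_test)
# report the kNN search here, not grid_search_svm
print(f"Best param: {grid_search_kNN.best_params_}")
# print(f"CV results: {grid_search_kNN.cv_results_}")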

/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py:378: FitFailedWarning:
120 fits failed out of a total of 360.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:


--------------------------------------------------------------------------------
61 fits failed with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/usr/local/lib/python3.10/dist-packages/sklearn/neighbors/_classification.py", line 213, in fit
self._validate_params()
File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 600, in _validate_params
validate_parameter_constraints(
File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_param_validation.py", line 97, in validate_parameter_constraints
raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'weights' parameter of KNeighborsClassifier must be a str among {'uniform', '

--------------------------------------------------------------------------------
59 fits failed with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/usr/local/lib/python3.10/dist-packages/sklearn/neighbors/_classification.py", line 213, in fit
self._validate_params()
File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 600, in _validate_params
validate_parameter_constraints(
File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_param_validation.py", line 97, in validate_parameter_constraints
raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'weights' parameter of KNeighborsClassifier must be a str among {'distance',

warnings.warn(some_fits_failed_message, FitFailedWarning)
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:952: UserWarning: One or more of the test scores are non-fini
0.92721519 0.92468354 nan 0.91712025 0.9221519 nan
0.91962025 0.9196519 nan 0.9296519 0.92218354 nan
0.92212025 0.9246519 nan 0.92462025 0.92212025 nan
0.92718354 0.92968354 nan 0.91968354 0.92218354 nan
0.92721519 0.92468354 nan 0.91712025 0.9221519 nan
0.91962025 0.9196519 nan 0.9296519 0.92218354 nan
0.92212025 0.9246519 nan 0.92462025 0.92212025 nan
0.93221519 0.93221519 nan 0.93221519 0.93724684 nan
0.92974684 0.92971519 nan 0.93474684 0.93724684 nan
0.9246519 0.93471519 nan 0.91958861 0.92462025 nan

0.92712025 0.9296519 nan 0.9296519 0.9246519 nan]
warnings.warn(
Best param: {'C': 1, 'gamma': 1, 'kernel': 'linear'}

2.3. Apply GridSearchCV to LogisticRegression

#code
param_grid = {
    'penalty': ['l1', 'l2'], # Regularization penalty ('l1' or 'l2')
    'C': [0.001, 0.01, 0.1, 1, 10, 100], # Inverse regularization strength (smaller values specify stronger regularization)
    'solver': ['liblinear', 'saga'], # Optimization algorithm ('liblinear' for small datasets, 'saga' for large datasets)
    'max_iter': [100, 200, 300], # Maximum number of iterations for the optimization algorithm
    'class_weight': [None, 'balanced'], # 'balanced' weights classes inversely proportional to their frequencies
    # 'multi_class': ['auto', 'ovr', 'multinomial'] # Multiclass strategy (uncomment this line for multiclass classification)
}
cancer = datasets.load_breast_cancer()
X = cancer.data
Y = cancer.target
# fixed seed: the comparison in 2.5 is only fair if every model's split uses the same random_state
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.3,random_state=42)
lr = LogisticRegression()
grid_search_lr = GridSearchCV(param_grid=param_grid,estimator=lr,cv = 2,scoring='accuracy',n_jobs=2)
grid_search_lr.fit(X_train,y_train)
print(f"Best param: {grid_search_lr.best_params_}")

2.4. Apply GridSearchCV to RandomForest

#code
param_grid = {
    'n_estimators': [25, 50, 100, 150,170,200],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [3, 6, 9,12,15,18],
    'max_leaf_nodes': [3, 6, 9,12,15,18]
}
cancer = datasets.load_breast_cancer()
X = cancer.data
Y = cancer.target
# fixed seed: the comparison in 2.5 is only fair if every model's split uses the same random_state
X_train,X_test,y_train,y_test = train_test_split(X,Y,test_size=0.3,random_state=42)
rfc = RandomForestClassifier()
grid_search_rfc = GridSearchCV(param_grid=param_grid,estimator=rfc,cv = 2,scoring='accuracy',n_jobs=2)
grid_search_rfc.fit(X_train,y_train)
print(f"Best param: {grid_search_rfc.best_params_}")

2.5. Compare the best results obtained among the classification algorithms (use PrettyTable to display the results)

#code
tableTask2 = PrettyTable(["grid search algorithms","Accuracy"])
# each row: the refitted best estimator and its accuracy on the held-out test set
tableTask2.add_row([grid_search_kNN.best_estimator_,metrics.accuracy_score(y_test,grid_search_kNN.predict(X_test))])
tableTask2.add_row([grid_search_svm.best_estimator_,metrics.accuracy_score(y_test,grid_search_svm.predict(X_test))])
tableTask2.add_row([grid_search_lr.best_estimator_,metrics.accuracy_score(y_test,grid_search_lr.predict(X_test))])
tableTask2.add_row([grid_search_rfc.best_estimator_,metrics.accuracy_score(y_test,grid_search_rfc.predict(X_test))])
print(tableTask2)

+------------------------------------------------------------------------+--------------------+
| grid search algorithms | Accuracy |
+------------------------------------------------------------------------+--------------------+
| KNeighborsClassifier(metric='manhattan', weights='distance') | 1.0 |
| SVC(C=10, gamma=1, kernel='linear') | 0.9473684210526315 |
| LogisticRegression(C=0.1, max_iter=1000) | 0.9385964912280702 |
| RandomForestClassifier(max_depth=6, max_leaf_nodes=6, n_estimators=50) | 0.9385964912280702 |
+------------------------------------------------------------------------+--------------------+

Task 3. With mobile price classification dataset


3.1. Apply GridSearchCV for SVM, kNN, RandomForest algorithms to find the best hyperparameters for each classification algorithm.
3.2. Compare the best results obtained among the classification algorithms (use PrettyTable to display the results)

# from google.colab import drive


# drive.mount('/content/gdrive')
%cd '/content/gdrive/MyDrive/Lab7/data'

/content/gdrive/MyDrive/Lab7/data

svm_param_grid = {'C': [0.1, 1, 10],
                  'gamma': [1, 0.1],}
# SVC is imported directly, so this does not rely on the name `svm` still pointing at the sklearn module
svm_grid_serach = GridSearchCV(estimator=SVC(),param_grid=svm_param_grid,n_jobs=-1)

mobile = pd.read_csv("mobile.csv")
X = mobile.drop(columns="price_range")
y = mobile[["price_range"]]
# keep the 5 features with the highest chi2 score
newX = SelectKBest(chi2,k=5).fit_transform(X,y)
X_train,X_test,y_train,y_test = train_test_split(newX,y,test_size=0.2)
svm_grid_serach.fit(X_train,y_train)
mobile_svm_best_estimator = svm_grid_serach.best_estimator_
mobile_svm_best_estimator

/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversion
y = column_or_1d(y, warn=True)
SVC(C=0.1, gamma=1)
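
The cells that follow call kNN_grid_serach, LR_grid_serach and random_forest_grid_serach, but the printout does not show where they are defined for this task. A minimal sketch of how they might be set up is given here. All three grids are assumptions, chosen only so that the best estimators reported in this section are reachable, and the synthetic data from make_classification is a hypothetical stand-in for mobile.csv.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Assumed grids (not taken from the notebook)
knn_grid = {'n_neighbors': [5, 7, 9, 11],
            'weights': ['uniform', 'distance']}
lr_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
rf_grid = {'n_estimators': [25, 50],
           'max_features': ['sqrt', 'log2'],
           'max_depth': [3, 6],
           'max_leaf_nodes': [3, 6]}

# Same variable names as used in the surrounding cells
kNN_grid_serach = GridSearchCV(estimator=KNeighborsClassifier(),
                               param_grid=knn_grid, n_jobs=-1)
LR_grid_serach = GridSearchCV(estimator=LogisticRegression(max_iter=1000),
                              param_grid=lr_grid, n_jobs=-1)
random_forest_grid_serach = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=rf_grid, n_jobs=-1)

# Hypothetical stand-in for the mobile data: 5 features, 4 classes
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0)

# Smoke test: each search can be fitted and exposes a best estimator
for search in (kNN_grid_serach, LR_grid_serach, random_forest_grid_serach):
    search.fit(X_demo, y_demo)
    print(search.best_estimator_)
```

Passing y as a 1-D array, as here, also avoids the DataConversionWarning that scikit-learn raises for a column-vector y.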

kNN_grid_serach.fit(X_train,y_train)
mobile_kNN_best_estimator = kNN_grid_serach.best_estimator_
mobile_kNN_best_estimator

/usr/local/lib/python3.10/dist-packages/sklearn/neighbors/_classification.py:215: DataConversionWarning: A column-vector y was passed wh


return self._fit(X, y)
KNeighborsClassifier(n_neighbors=11, weights='distance')

LR_grid_serach.fit(X_train,y_train)
mobile_LR_best_estimator = LR_grid_serach.best_estimator_
mobile_LR_best_estimator

/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversion
y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:458: Convergen
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
LogisticRegression(C=0.001, max_iter=1000)

random_forest_grid_serach.fit(X_train,y_train)
mobile_random_forest_best_estimator = random_forest_grid_serach.best_estimator_
mobile_random_forest_best_estimator

/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:909: DataConv
self.best_estimator_.fit(X, y, **fit_params)
RandomForestClassifier(max_depth=3, max_features='log2', max_leaf_nodes=6,
n_estimators=50)

tableTask3 = PrettyTable(["grid search algorithms","Accuracy"])
tableTask3.add_row([mobile_kNN_best_estimator,metrics.accuracy_score(y_test,kNN_grid_serach.predict(X_test))])
tableTask3.add_row([mobile_svm_best_estimator,metrics.accuracy_score(y_test,svm_grid_serach.predict(X_test))])
tableTask3.add_row([mobile_LR_best_estimator,metrics.accuracy_score(y_test,LR_grid_serach.predict(X_test))])
tableTask3.add_row([mobile_random_forest_best_estimator,metrics.accuracy_score(y_test,random_forest_grid_serach.predict(X_test))])
print(tableTask3)

+----------------------------------------------------------------------------+----------+
| grid search algorithms | Accuracy |
+----------------------------------------------------------------------------+----------+
| KNeighborsClassifier(n_neighbors=11, weights='distance') | 0.94 |
| SVC(C=0.1, gamma=1) | 0.2325 |
| LogisticRegression(C=0.001, max_iter=1000) | 0.975 |
| RandomForestClassifier(max_depth=3, max_features='log2', max_leaf_nodes=6, | 0.8275 |
| n_estimators=50) | |
+----------------------------------------------------------------------------+----------+

Task 4.
The dataset consists of 2000 user-created movie reviews archived on IMDb (the Internet Movie Database). The reviews are equally partitioned
into a positive set and a negative set (1000 + 1000). Each review consists of a plain text file (.txt) and a class label representing the overall user
opinion. The class attribute has only two values: pos (positive) or neg (negative).

4.1 Importing additional libraries

import nltk, random


nltk.download('movie_reviews')#download movie reviews dataset
from nltk.corpus import movie_reviews
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import cross_val_score
from collections import Counter
from sklearn.model_selection import train_test_split

[nltk_data] Downloading package movie_reviews to /root/nltk_data...


[nltk_data] Unzipping corpora/movie_reviews.zip.

4.2. Movie reviews information

#code
print(len(movie_reviews.fileids()))
print(movie_reviews.categories())
print(movie_reviews.words()[:100])
print(movie_reviews.fileids()[:10])

2000
['neg', 'pos']
['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]
['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt', 'neg/cv005_29357.txt

4.3. Create dataset from movie reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.seed(123)
random.shuffle(documents)

print('Number of Reviews/Documents: {}'.format(len(documents)))
print('Corpus Size (words): {}'.format(np.sum([len(d) for (d,l) in documents])))
print('Sample Text of Doc 1:')
print('-'*30)
print(' '.join(documents[0][0][:50])) # first 50 words of the first document

Number of Reviews/Documents: 2000


Corpus Size (words): 1583820
Sample Text of Doc 1:
------------------------------

most movies seem to release a third movie just so it can be called a trilogy . rocky iii seems to kind of fit in that category , but man

sentiment_distr = Counter([label for (words, label) in documents])


print(sentiment_distr)

Counter({'pos': 1000, 'neg': 1000})

4.4. Train test split

train, test = train_test_split(documents, test_size = 0.33, random_state=42)

## Sentiment Distribution for Train and Test


print(Counter([label for (words, label) in train]))
print(Counter([label for (words, label) in test]))

Counter({'neg': 674, 'pos': 666})


Counter({'pos': 334, 'neg': 326})

X_train = [' '.join(words) for (words, label) in train]


X_test = [' '.join(words) for (words, label) in test]
y_train = [label for (words, label) in train]
y_test = [label for (words, label) in test]

4.5. Text Vectorization

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

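tfidf_vec = TfidfVectorizer(min_df = 10, token_pattern = r'[a-zA-Z]+')
X_train_bow = tfidf_vec.fit_transform(X_train) # fit train
X_test_bow = tfidf_vec.transform(X_test) # transform test
# fit the chi2 selector on the training data only, then reuse it on the test data;
# fitting a second selector on the test set would pick a different set of columns
selector = SelectKBest(chi2,k=1000).fit(X_train_bow,y_train)
new_X_train_bow = selector.transform(X_train_bow)
new_X_test_bow = selector.transform(X_test_bow)
new_X_train_bow.shape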

(1340, 1000)
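
Vectorization and feature selection can also be chained with the classifier in a scikit-learn Pipeline, so that during cross-validation both steps are fitted on the training folds only. The sketch below is illustrative: the eight-sentence corpus is hypothetical and stands in for the movie reviews, and the step names and grid values are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (4 positive, 4 negative) used only to show the mechanics
texts = [
    "a great and moving film with a wonderful cast",
    "wonderful story and great acting throughout",
    "an enjoyable film with a moving and clever plot",
    "great direction and a clever wonderful script",
    "a dull and boring film with a terrible cast",
    "terrible story and awful acting throughout",
    "a boring film with a dull and predictable plot",
    "awful direction and a predictable terrible script",
]
labels = ['pos', 'pos', 'pos', 'pos', 'neg', 'neg', 'neg', 'neg']

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(token_pattern=r'[a-zA-Z]+')),
    ('select', SelectKBest(chi2)),           # k is tuned by the grid below
    ('clf', LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {'select__k': [3, 5],
              'clf__C': [0.1, 1]}

search = GridSearchCV(pipe, param_grid, cv=2, scoring='accuracy')
search.fit(texts, labels)

# The refitted pipeline applies the same vocabulary and selected columns to new text
best_k = search.best_params_['select__k']
n_selected = int(search.best_estimator_.named_steps['select'].get_support().sum())
print(search.best_params_, n_selected)
print(search.predict(["a wonderful and clever film"]))
```

With this layout the raw strings are passed to fit and predict directly, and there is no separate selector to keep in sync between the training and test sets.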

4.6. Apply SVM with GridSearchCV

svm_grid_serach.fit(new_X_train_bow,y_train)
reviews_svm_best_estimator = svm_grid_serach.best_estimator_
reviews_svm_best_estimator

SVC(C=10, gamma=1)

4.7. Apply RandomForest with GridSearchCV

random_forest_grid_serach.fit(new_X_train_bow,y_train)
reviews_random_forest_best_estimator = random_forest_grid_serach.best_estimator_
reviews_random_forest_best_estimator

RandomForestClassifier(max_depth=3, max_features='log2', max_leaf_nodes=3)

4.8. Apply kNN with GridSearchCV

Knn_grid_params = { 'n_neighbors' : [5,7,9],
                    'weights' : ['uniform','distance'],
}
kNN_grid_serach = GridSearchCV(estimator=KNeighborsClassifier(),param_grid=Knn_grid_params,n_jobs=-1)

kNN_grid_serach.fit(new_X_train_bow,y_train)
reviews_kNN_best_estimator = kNN_grid_serach.best_estimator_
reviews_kNN_best_estimator

KNeighborsClassifier(weights='distance')

4.9. Apply LogisticRegression with GridSearchCV

LR_grid_serach.fit(new_X_train_bow,y_train)
reviews_LR_best_estimator = LR_grid_serach.best_estimator_
reviews_LR_best_estimator

LogisticRegression(C=0.1, max_iter=1000)

4.10. Compare the best results obtained among the classification algorithms (use PrettyTable to display the results)

tableTask4 = PrettyTable(["grid search algorithms","Accuracy"])
# each row: the best estimator found by the search and its accuracy on the test reviews
tableTask4.add_row([reviews_kNN_best_estimator,metrics.accuracy_score(y_test,kNN_grid_serach.predict(new_X_test_bow))])
tableTask4.add_row([reviews_svm_best_estimator,metrics.accuracy_score(y_test,svm_grid_serach.predict(new_X_test_bow))])
tableTask4.add_row([reviews_LR_best_estimator,metrics.accuracy_score(y_test,LR_grid_serach.predict(new_X_test_bow))])
tableTask4.add_row([reviews_random_forest_best_estimator,metrics.accuracy_score(y_test,random_forest_grid_serach.predict(new_X_test_bow))])
print(tableTask4)

+----------------------------------------------------------------------------+---------------------+
| grid search algorithms | Accuracy |
+----------------------------------------------------------------------------+---------------------+
| KNeighborsClassifier(weights='distance') | 0.5181818181818182 |
| SVC(C=10, gamma=1) | 0.5196969696969697 |
| LogisticRegression(C=0.1, max_iter=1000) | 0.49393939393939396 |
| RandomForestClassifier(max_depth=3, max_features='log2', max_leaf_nodes=3) | 0.5 |
+----------------------------------------------------------------------------+---------------------+

Finally, save a copy in your GitHub repository. Remember to rename the notebook.
