Lab_7_21130616_TranhThanhVu.ipynb - Colab
This lab deals with GridSearchCV for tuning the hyper-parameters of an estimator and with
applying vectorization techniques to the movie reviews dataset for a classification task.
Mounted at /content/drive
[Output truncated: cv_results_ dictionary from the SVM grid search in 1.1 — mean/std fit and score times for each candidate, with param_C in {0.1, 1, 10, 100, 1000} and param_gamma in {1, 0.1, 0.01, 0.001, 0.0001}.]
1.2. Apply GridSearchCV for kNN to find the best hyperparameters using the following param_grid.
where
* **n_neighbors**: Decide the best k based on the values we have computed earlier.
* **weights**: Check whether adding weights to the data points is beneficial to the model or not. 'uniform' assigns no weight, while 'distance' weights points by the inverse of their distance.
* **metric**: The distance metric to be used when calculating the similarity.
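The param_grid cell for this step is not reproduced on this page. A minimal sketch of what the search could look like, with the grid values inferred from the printed cv_results_ below and the X_train/y_train split from 1.1 assumed to be in scope:

# Sketch of the 1.2 search; grid values inferred from the cv_results_ output below
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': [5, 7, 9, 11, 13, 15],
              'weights': ['uniform', 'distance'],
              'metric': ['minkowski', 'euclidean', 'manhattan']}

knn = KNeighborsClassifier()
grid_search_knn = GridSearchCV(estimator=knn, param_grid=param_grid,
                               scoring='accuracy', n_jobs=2)   # default 5-fold CV; the fold count actually used is not shown
grid_search_knn.fit(X_train, y_train)                          # X_train, y_train: the Iris split from 1.1 (assumed)
print(f"Best param: {grid_search_knn.best_params_}")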
[Output truncated: cv_results_ dictionary from the kNN grid search in 1.2 — mean/std fit and score times for each candidate, with param_metric in {'minkowski', 'euclidean', 'manhattan'}, param_n_neighbors in {5, 7, 9, 11, 13, 15} and param_weights in {'uniform', 'distance'}.]
1.3. Apply GridSearchCV for Random Forest to find the best hyperparameters using the following param_grid.
param_grid = {
'n_estimators': [25, 50, 100, 150],
'max_features': ['sqrt', 'log2', None],
'max_depth': [3, 6, 9],
'max_leaf_nodes': [3, 6, 9],
}
param_grid = {
'n_estimators': [25, 50, 100, 150],
'max_features': ['sqrt', 'log2', None],
'max_depth': [3, 6, 9],
'max_leaf_nodes': [3, 6, 9],
}
# Load the Iris dataset and tune a Random Forest with the param_grid above
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

rf = RandomForestClassifier()
grid_search_rf = GridSearchCV(estimator=rf, param_grid=param_grid, cv=2, scoring='accuracy', n_jobs=2)
grid_search_rf.fit(X_train, y_train)
grid_search_rf.predict(X_test)
print(f"Best param: {grid_search_rf.best_params_}")
print(f"CV results: {grid_search_rf.cv_results_}")
1.4 Compare the best obtained results from 1.1 to 1.3 (use PrettyTable to display the results).
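The comparison cell itself is not shown here. A minimal sketch of one way to build the table, assuming the fitted searches from 1.1–1.3 are named grid_search_svm, grid_search_knn and grid_search_rf and that X_test/y_test come from the Iris split (Task 3 below uses the same pattern verbatim):

# Hypothetical comparison table for 1.4; the search-object names are assumptions
from prettytable import PrettyTable
from sklearn import metrics

table = PrettyTable(["grid search algorithms", "Accuracy"])
for search in (grid_search_svm, grid_search_knn, grid_search_rf):
    table.add_row([search.best_estimator_,
                   metrics.accuracy_score(y_test, search.predict(X_test))])
print(table)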
Task 2.
For the breast cancer dataset (https://tinyurl.com/3vme8hr3), which can be loaded from sklearn's datasets module as follows:
# Load dataset
cancer = datasets.load_breast_cancer()
Apply GridSearchCV to different classification algorithms such as SVM, kNN, LogisticRegression, and RandomForest.
Compare the results obtained with the best hyperparameters across these classification algorithms.
# code
# Tune an SVM on the breast cancer dataset with GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf', 'linear']}

cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

svm = SVC()
grid_search_svm = GridSearchCV(estimator=svm, param_grid=param_grid, cv=3, scoring='accuracy', n_jobs=2)
grid_search_svm.fit(X_train, y_train)
grid_search_svm.predict(X_test)
print(f"Best param: {grid_search_svm.best_params_}")
# print(f"CV results: {grid_search_svm.cv_results_}")
#code
# Tune a kNN classifier on the breast cancer dataset with GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

grid_params = {'n_neighbors': [5, 7, 9, 11, 13, 15, 17, 19],
               # 'custom' is not a valid weights value: the 24 candidates that use it (8 n_neighbors x 3 metrics)
               # fail on all 5 folds, which accounts for the 120 failed fits reported in the warning below
               'weights': ['uniform', 'distance', 'custom'],
               'metric': ['minkowski', 'euclidean', 'manhattan']}

cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

kNN = KNeighborsClassifier()
grid_search_kNN = GridSearchCV(estimator=kNN, param_grid=grid_params, cv=5, scoring='accuracy', n_jobs=2)
grid_search_kNN.fit(X_train, y_train)
grid_search_kNN.predict(X_test)
# note: this print reuses the SVM search (copy-paste slip), so the output below shows SVC parameters
print(f"Best param: {grid_search_svm.best_params_}")
# print(f"CV results: {grid_search_kNN.cv_results_}")
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py:378: FitFailedWarning:
120 fits failed out of a total of 360.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
--------------------------------------------------------------------------------
59 fits failed with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/usr/local/lib/python3.10/dist-packages/sklearn/neighbors/_classification.py", line 213, in fit
self._validate_params()
File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 600, in _validate_params
validate_parameter_constraints(
File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_param_validation.py", line 97, in validate_parameter_constraints
raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'weights' parameter of KNeighborsClassifier must be a str among {'distance',
warnings.warn(some_fits_failed_message, FitFailedWarning)
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:952: UserWarning: One or more of the test scores are non-finite:
[array of mean test scores, roughly 0.917-0.937, with nan for the candidates whose fits failed on weights='custom'; output truncated]
  warnings.warn(
Best param: {'C': 1, 'gamma': 1, 'kernel': 'linear'}
#code
# Tune a Random Forest on the breast cancer dataset with GridSearchCV
param_grid = {
    'n_estimators': [25, 50, 100, 150, 170, 200],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [3, 6, 9, 12, 15, 18],
    'max_leaf_nodes': [3, 6, 9, 12, 15, 18]
}

cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

rfc = RandomForestClassifier()
grid_search_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=2, scoring='accuracy', n_jobs=2)
grid_search_rfc.fit(X_train, y_train)
grid_search_rfc.predict(X_test)
print(f"Best param: {grid_search_rfc.best_params_}")
#code
# Tune a Logistic Regression on the breast cancer dataset with GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {
    'penalty': ['l1', 'l2'],               # regularization penalty ('l1' or 'l2')
    'C': [0.001, 0.01, 0.1, 1, 10, 100],   # inverse regularization strength (smaller values = stronger regularization)
    'solver': ['liblinear', 'saga'],       # optimization algorithm ('liblinear' for small datasets, 'saga' for large datasets)
    'max_iter': [100, 200, 300],           # maximum number of iterations for the optimizer
    'class_weight': [None, 'balanced'],    # 'balanced' adjusts weights inversely proportional to class frequencies
    # 'multi_class': ['auto', 'ovr', 'multinomial']  # multiclass strategy (uncomment for multiclass classification)
}

cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

lr = LogisticRegression()
grid_search_lr = GridSearchCV(estimator=lr, param_grid=param_grid, cv=2, scoring='accuracy', n_jobs=2)
grid_search_lr.fit(X_train, y_train)
grid_search_lr.predict(X_test)
print(f"Best param: {grid_search_lr.best_params_}")
2.5. Compare the best obtained results among the classification algorithms (use PrettyTable to display the results).
#code
+------------------------------------------------------------------------+--------------------+
| grid search algorithms | Accuracy |
+------------------------------------------------------------------------+--------------------+
| KNeighborsClassifier(metric='manhattan', weights='distance') | 1.0 |
| SVC(C=10, gamma=1, kernel='linear') | 0.9473684210526315 |
| LogisticRegression(C=0.1, max_iter=1000) | 0.9385964912280702 |
| RandomForestClassifier(max_depth=6, max_leaf_nodes=6, n_estimators=50) | 0.9385964912280702 |
+------------------------------------------------------------------------+--------------------+
/content/gdrive/MyDrive/Lab7/data
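Task 3 works with the mobile price dataset (mobile.csv). The grid-search objects reused below (svm_grid_serach, kNN_grid_serach, LR_grid_serach, random_forest_grid_serach; note the 'serach' spelling) are defined in earlier cells that do not appear on this page. A minimal sketch of how they might be set up, assuming the same estimators and param grids as in Task 2 (the *_param_grid names here are placeholders):

# Hypothetical definitions of the reused searches; the *_param_grid variables are placeholders
svm_grid_serach = GridSearchCV(SVC(), param_grid=svm_param_grid, scoring='accuracy', n_jobs=2)
kNN_grid_serach = GridSearchCV(KNeighborsClassifier(), param_grid=knn_param_grid, scoring='accuracy', n_jobs=2)
LR_grid_serach = GridSearchCV(LogisticRegression(max_iter=1000), param_grid=lr_param_grid, scoring='accuracy', n_jobs=2)
random_forest_grid_serach = GridSearchCV(RandomForestClassifier(), param_grid=rf_param_grid, scoring='accuracy', n_jobs=2)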
# Task 3: load the mobile price dataset and keep the 5 features with the highest chi-squared scores
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

mobile = pd.read_csv("mobile.csv")
X = mobile.drop(columns="price_range")
y = mobile[["price_range"]]    # a one-column DataFrame; this triggers the DataConversionWarning shown below
newX = SelectKBest(chi2, k=5).fit_transform(X, y)
X_train, X_test, y_train, y_test = train_test_split(newX, y, test_size=0.2)

# Fit the reused SVM grid search on the selected features
svm_grid_serach.fit(X_train, y_train)
mobile_svm_best_estimator = svm_grid_serach.best_estimator_
mobile_svm_best_estimator
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversion
y = column_or_1d(y, warn=True)
SVC(C=0.1, gamma=1)
kNN_grid_serach.fit(X_train,y_train)
mobile_kNN_best_estimator = kNN_grid_serach.best_estimator_
mobile_kNN_best_estimator
LR_grid_serach.fit(X_train,y_train)
mobile_LR_best_estimator = LR_grid_serach.best_estimator_
mobile_LR_best_estimator
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversion
y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:458: Convergen
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
LogisticRegression(C=0.001, max_iter=1000)
random_forest_grid_serach.fit(X_train,y_train)
mobile_random_forest_best_estimator = random_forest_grid_serach.best_estimator_
mobile_random_forest_best_estimator
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:909: DataConv
self.best_estimator_.fit(X, y, **fit_params)
RandomForestClassifier(max_depth=3, max_features='log2', max_leaf_nodes=6,
n_estimators=50)
tableTask3 = PrettyTable(["grid search algorithms","Accuracy"])
tableTask3.add_row([mobile_kNN_best_estimator,metrics.accuracy_score(y_test,kNN_grid_serach.predict(X_test))])
tableTask3.add_row([mobile_svm_best_estimator,metrics.accuracy_score(y_test,svm_grid_serach.predict(X_test))])
tableTask3.add_row([mobile_LR_best_estimator,metrics.accuracy_score(y_test,LR_grid_serach.predict(X_test))])
tableTask3.add_row([mobile_random_forest_best_estimator,metrics.accuracy_score(y_test,random_forest_grid_serach.predict(X_test))])
print(tableTask3)
+----------------------------------------------------------------------------+----------+
| grid search algorithms | Accuracy |
+----------------------------------------------------------------------------+----------+
| KNeighborsClassifier(n_neighbors=11, weights='distance') | 0.94 |
| SVC(C=0.1, gamma=1) | 0.2325 |
| LogisticRegression(C=0.001, max_iter=1000) | 0.975 |
| RandomForestClassifier(max_depth=3, max_features='log2', max_leaf_nodes=6, | 0.8275 |
| n_estimators=50) | |
+----------------------------------------------------------------------------+----------+
Task 4.
The dataset consists of 2000 user-created movie reviews archived on IMDb (the Internet Movie Database). The reviews are equally partitioned
into a positive set and a negative set (1000+1000). Each review consists of a plain text file (.txt) and a class label representing the overall user
opinion. The class attribute has only two values: pos (positive) or neg (negative).
#code
# Explore the NLTK movie_reviews corpus (the corpus must be downloaded first with nltk.download('movie_reviews'))
from nltk.corpus import movie_reviews

print(len(movie_reviews.fileids()))
print(movie_reviews.categories())
print(movie_reviews.words()[:100])
print(movie_reviews.fileids()[:10])
2000
['neg', 'pos']
['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]
['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt', 'neg/cv005_29357.txt
most movies seem to release a third movie just so it can be called a trilogy . rocky iii seems to kind of fit in that category , but man
(1340, 1000)
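The cells that build the bag-of-words features used below do not appear on this page. A minimal sketch of one way to obtain a new_X_train_bow matrix with the shape printed above ((1340, 1000)), assuming a CountVectorizer bag-of-words representation followed by chi-squared selection of 1000 features; the preprocessing in the original notebook may differ:

# Hypothetical reconstruction of the bag-of-words features
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

docs = [movie_reviews.raw(fid) for fid in movie_reviews.fileids()]
labels = [movie_reviews.categories(fid)[0] for fid in movie_reviews.fileids()]
X_train, X_test, y_train, y_test = train_test_split(docs, labels, test_size=0.33)  # 2000 * 0.67 = 1340 training reviews

vectorizer = CountVectorizer()                 # bag-of-words term counts
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

selector = SelectKBest(chi2, k=1000)           # keep the 1000 most label-dependent terms
new_X_train_bow = selector.fit_transform(X_train_bow, y_train)
new_X_test_bow = selector.transform(X_test_bow)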
svm_grid_serach.fit(new_X_train_bow,y_train)
reviews_svm_best_estimator = svm_grid_serach.best_estimator_
reviews_svm_best_estimator
SVC(C=10, gamma=1)
random_forest_grid_serach.fit(new_X_train_bow,y_train)
reviews_random_forest_best_estimator = random_forest_grid_serach.best_estimator_
reviews_random_forest_best_estimator
RandomForestClassifier(max_depth=3, max_features='log2', max_leaf_nodes=3)
kNN_grid_serach.fit(new_X_train_bow,y_train)
reviews_kNN_best_estimator = kNN_grid_serach.best_estimator_
reviews_kNN_best_estimator
KNeighborsClassifier(weights='distance')
LR_grid_serach.fit(new_X_train_bow,y_train)
reviews_LR_best_estimator = LR_grid_serach.best_estimator_
reviews_LR_best_estimator
LogisticRegression(C=0.1, max_iter=1000)
4.10. Compare the best obtained results among the classification algorithms (use PrettyTable to display the results).
+----------------------------------------------------------------------------+---------------------+
| grid search algorithms | Accuracy |
+----------------------------------------------------------------------------+---------------------+
| KNeighborsClassifier(weights='distance') | 0.5181818181818182 |
| SVC(C=10, gamma=1) | 0.5196969696969697 |
| LogisticRegression(C=0.1, max_iter=1000) | 0.49393939393939396 |
| RandomForestClassifier(max_depth=3, max_features='log2', max_leaf_nodes=3) | 0.5 |
+----------------------------------------------------------------------------+---------------------+
Finally,