Lab_7_21130616_TranhThanhVu.ipynb - Colab
This lab deals with GridSearchCV for tuning the hyper-parameters of an estimator and with
applying vectorization techniques to the movie reviews dataset for a classification task.
Mounted at /content/drive
[Output truncated: cv_results_ dictionary from the SVM grid search in 1.1 — mean/std fit and score times for each candidate, with param_C in {0.1, 1, 10, 100, 1000} and param_gamma in {1, 0.1, 0.01, 0.001, 0.0001}.]
1.2. Apply GridSearchCV for kNN to find the best hyperparameters using the following param_grid.
where
* **n_neighbors**: Decide the best k based on the values we have computed earlier.
* **weights**: Check whether adding weights to the data points is beneficial to the model or not. 'uniform' assigns no weight, while 'distance' weights points by the inverse of their distance.
* **metric**: The distance metric to be used when calculating the similarity.
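The param_grid cell for this step is not reproduced on this page. A minimal sketch of what the search could look like, with the grid values inferred from the printed cv_results_ below and the X_train/y_train split from 1.1 assumed to be in scope:

# Sketch of the 1.2 search; grid values inferred from the cv_results_ output below
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': [5, 7, 9, 11, 13, 15],
              'weights': ['uniform', 'distance'],
              'metric': ['minkowski', 'euclidean', 'manhattan']}

knn = KNeighborsClassifier()
grid_search_knn = GridSearchCV(estimator=knn, param_grid=param_grid,
                               scoring='accuracy', n_jobs=2)   # default 5-fold CV; the fold count actually used is not shown
grid_search_knn.fit(X_train, y_train)                          # X_train, y_train: the Iris split from 1.1 (assumed)
print(f"Best param: {grid_search_knn.best_params_}")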
[Output truncated: cv_results_ dictionary from the kNN grid search in 1.2 — mean/std fit and score times for each candidate, with param_metric in {'minkowski', 'euclidean', 'manhattan'}, param_n_neighbors in {5, 7, 9, 11, 13, 15} and param_weights in {'uniform', 'distance'}.]
1.3. Apply GridSearchCV for Random Forest to find the best hyperparameters using the following param_grid.
param_grid = {
'n_estimators': [25, 50, 100, 150],
'max_features': ['sqrt', 'log2', None],
'max_depth': [3, 6, 9],
'max_leaf_nodes': [3, 6, 9],
}
param_grid = {
'n_estimators': [25, 50, 100, 150],
'max_features': ['sqrt', 'log2', None],
'max_depth': [3, 6, 9],
'max_leaf_nodes': [3, 6, 9],
}
# Load the Iris dataset and tune a Random Forest with the param_grid above
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

rf = RandomForestClassifier()
grid_search_rf = GridSearchCV(estimator=rf, param_grid=param_grid, cv=2, scoring='accuracy', n_jobs=2)
grid_search_rf.fit(X_train, y_train)
grid_search_rf.predict(X_test)
print(f"Best param: {grid_search_rf.best_params_}")
print(f"CV results: {grid_search_rf.cv_results_}")
1.4 Compare the best obtained results from 1.1 to 1.3 (use PrettyTable to display the results).
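The comparison cell itself is not shown here. A minimal sketch of one way to build the table, assuming the fitted searches from 1.1–1.3 are named grid_search_svm, grid_search_knn and grid_search_rf and that X_test/y_test come from the Iris split (Task 3 below uses the same pattern verbatim):

# Hypothetical comparison table for 1.4; the search-object names are assumptions
from prettytable import PrettyTable
from sklearn import metrics

table = PrettyTable(["grid search algorithms", "Accuracy"])
for search in (grid_search_svm, grid_search_knn, grid_search_rf):
    table.add_row([search.best_estimator_,
                   metrics.accuracy_score(y_test, search.predict(X_test))])
print(table)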
Task 2.
For the breast cancer dataset (https://tinyurl.com/3vme8hr3), which can be loaded from sklearn's datasets module as follows:
# Load dataset
cancer = datasets.load_breast_cancer()
Apply GridSearchCV to different classification algorithms such as SVM, kNN, LogisticRegression, and RandomForest.
Compare the results obtained with the best hyperparameters across these classification algorithms.
# code
# Tune an SVM on the breast cancer dataset with GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf', 'linear']}

cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

svm = SVC()
grid_search_svm = GridSearchCV(estimator=svm, param_grid=param_grid, cv=3, scoring='accuracy', n_jobs=2)
grid_search_svm.fit(X_train, y_train)
grid_search_svm.predict(X_test)
print(f"Best param: {grid_search_svm.best_params_}")
# print(f"CV results: {grid_search_svm.cv_results_}")
#code
# Tune a kNN classifier on the breast cancer dataset with GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

grid_params = {'n_neighbors': [5, 7, 9, 11, 13, 15, 17, 19],
               # 'custom' is not a valid weights value: the 24 candidates that use it (8 n_neighbors x 3 metrics)
               # fail on all 5 folds, which accounts for the 120 failed fits reported in the warning below
               'weights': ['uniform', 'distance', 'custom'],
               'metric': ['minkowski', 'euclidean', 'manhattan']}

cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

kNN = KNeighborsClassifier()
grid_search_kNN = GridSearchCV(estimator=kNN, param_grid=grid_params, cv=5, scoring='accuracy', n_jobs=2)
grid_search_kNN.fit(X_train, y_train)
grid_search_kNN.predict(X_test)
# note: this print reuses the SVM search (copy-paste slip), so the output below shows SVC parameters
print(f"Best param: {grid_search_svm.best_params_}")
# print(f"CV results: {grid_search_kNN.cv_results_}")
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py:378: FitFailedWarning:
120 fits failed out of a total of 360.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
--------------------------------------------------------------------------------
59 fits failed with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/usr/local/lib/python3.10/dist-packages/sklearn/neighbors/_classification.py", line 213, in fit
self._validate_params()
File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 600, in _validate_params
validate_parameter_constraints(
File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_param_validation.py", line 97, in validate_parameter_constraints
raise InvalidParameterError(
sklearn.utils._param_validation.InvalidParameterError: The 'weights' parameter of KNeighborsClassifier must be a str among {'distance',
warnings.warn(some_fits_failed_message, FitFailedWarning)
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:952: UserWarning: One or more of the test scores are non-finite:
[array of mean test scores, roughly 0.917-0.937, with nan for the candidates whose fits failed on weights='custom'; output truncated]
  warnings.warn(
Best param: {'C': 1, 'gamma': 1, 'kernel': 'linear'}
#code
# Tune a Random Forest on the breast cancer dataset with GridSearchCV
param_grid = {
    'n_estimators': [25, 50, 100, 150, 170, 200],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [3, 6, 9, 12, 15, 18],
    'max_leaf_nodes': [3, 6, 9, 12, 15, 18]
}

cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

rfc = RandomForestClassifier()
grid_search_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=2, scoring='accuracy', n_jobs=2)
grid_search_rfc.fit(X_train, y_train)
grid_search_rfc.predict(X_test)
print(f"Best param: {grid_search_rfc.best_params_}")
#code
# Tune a Logistic Regression on the breast cancer dataset with GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {
    'penalty': ['l1', 'l2'],               # regularization penalty ('l1' or 'l2')
    'C': [0.001, 0.01, 0.1, 1, 10, 100],   # inverse regularization strength (smaller values = stronger regularization)
    'solver': ['liblinear', 'saga'],       # optimization algorithm ('liblinear' for small datasets, 'saga' for large datasets)
    'max_iter': [100, 200, 300],           # maximum number of iterations for the optimizer
    'class_weight': [None, 'balanced'],    # 'balanced' adjusts weights inversely proportional to class frequencies
    # 'multi_class': ['auto', 'ovr', 'multinomial']  # multiclass strategy (uncomment for multiclass classification)
}

cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

lr = LogisticRegression()
grid_search_lr = GridSearchCV(estimator=lr, param_grid=param_grid, cv=2, scoring='accuracy', n_jobs=2)
grid_search_lr.fit(X_train, y_train)
grid_search_lr.predict(X_test)
print(f"Best param: {grid_search_lr.best_params_}")
2.5. Compare the best obtained results among the classification algorithms (use PrettyTable to display the results).
#code
+------------------------------------------------------------------------+--------------------+
| grid search algorithms | Accuracy |
+------------------------------------------------------------------------+--------------------+
| KNeighborsClassifier(metric='manhattan', weights='distance') | 1.0 |
| SVC(C=10, gamma=1, kernel='linear') | 0.9473684210526315 |
| LogisticRegression(C=0.1, max_iter=1000) | 0.9385964912280702 |
| RandomForestClassifier(max_depth=6, max_leaf_nodes=6, n_estimators=50) | 0.9385964912280702 |
+------------------------------------------------------------------------+--------------------+
/content/gdrive/MyDrive/Lab7/data
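Task 3 works with the mobile price dataset (mobile.csv). The grid-search objects reused below (svm_grid_serach, kNN_grid_serach, LR_grid_serach, random_forest_grid_serach; note the 'serach' spelling) are defined in earlier cells that do not appear on this page. A minimal sketch of how they might be set up, assuming the same estimators and param grids as in Task 2 (the *_param_grid names here are placeholders):

# Hypothetical definitions of the reused searches; the *_param_grid variables are placeholders
svm_grid_serach = GridSearchCV(SVC(), param_grid=svm_param_grid, scoring='accuracy', n_jobs=2)
kNN_grid_serach = GridSearchCV(KNeighborsClassifier(), param_grid=knn_param_grid, scoring='accuracy', n_jobs=2)
LR_grid_serach = GridSearchCV(LogisticRegression(max_iter=1000), param_grid=lr_param_grid, scoring='accuracy', n_jobs=2)
random_forest_grid_serach = GridSearchCV(RandomForestClassifier(), param_grid=rf_param_grid, scoring='accuracy', n_jobs=2)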
# Task 3: load the mobile price dataset and keep the 5 features with the highest chi-squared scores
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

mobile = pd.read_csv("mobile.csv")
X = mobile.drop(columns="price_range")
y = mobile[["price_range"]]    # a one-column DataFrame; this triggers the DataConversionWarning shown below
newX = SelectKBest(chi2, k=5).fit_transform(X, y)
X_train, X_test, y_train, y_test = train_test_split(newX, y, test_size=0.2)

# Fit the reused SVM grid search on the selected features
svm_grid_serach.fit(X_train, y_train)
mobile_svm_best_estimator = svm_grid_serach.best_estimator_
mobile_svm_best_estimator
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversion
y = column_or_1d(y, warn=True)
SVC(C=0.1, gamma=1)
kNN_grid_serach.fit(X_train,y_train)
mobile_kNN_best_estimator = kNN_grid_serach.best_estimator_
mobile_kNN_best_estimator
LR_grid_serach.fit(X_train,y_train)
mobile_LR_best_estimator = LR_grid_serach.best_estimator_
mobile_LR_best_estimator
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversion
y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py:458: Convergen
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
LogisticRegression(C=0.001, max_iter=1000)
random_forest_grid_serach.fit(X_train,y_train)
mobile_random_forest_best_estimator = random_forest_grid_serach.best_estimator_
mobile_random_forest_best_estimator
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:909: DataConv
self.best_estimator_.fit(X, y, **fit_params)
RandomForestClassifier(max_depth=3, max_features='log2', max_leaf_nodes=6,
n_estimators=50)
tableTask3 = PrettyTable(["grid search algorithms","Accuracy"])
tableTask3.add_row([mobile_kNN_best_estimator,metrics.accuracy_score(y_test,kNN_grid_serach.predict(X_test))])
tableTask3.add_row([mobile_svm_best_estimator,metrics.accuracy_score(y_test,svm_grid_serach.predict(X_test))])
tableTask3.add_row([mobile_LR_best_estimator,metrics.accuracy_score(y_test,LR_grid_serach.predict(X_test))])
tableTask3.add_row([mobile_random_forest_best_estimator,metrics.accuracy_score(y_test,random_forest_grid_serach.predict(X_test))])
print(tableTask3)
+----------------------------------------------------------------------------+----------+
| grid search algorithms | Accuracy |
+----------------------------------------------------------------------------+----------+
| KNeighborsClassifier(n_neighbors=11, weights='distance') | 0.94 |
| SVC(C=0.1, gamma=1) | 0.2325 |
| LogisticRegression(C=0.001, max_iter=1000) | 0.975 |
| RandomForestClassifier(max_depth=3, max_features='log2', max_leaf_nodes=6, | 0.8275 |
| n_estimators=50) | |
+----------------------------------------------------------------------------+----------+
Task 4.
The dataset consists of 2000 user-created movie reviews archived on IMDb (the Internet Movie Database). The reviews are equally partitioned
into a positive set and a negative set (1000+1000). Each review consists of a plain text file (.txt) and a class label representing the overall user
opinion. The class attribute has only two values: pos (positive) or neg (negative).
#code
# Explore the NLTK movie_reviews corpus (the corpus must be downloaded first with nltk.download('movie_reviews'))
from nltk.corpus import movie_reviews

print(len(movie_reviews.fileids()))
print(movie_reviews.categories())
print(movie_reviews.words()[:100])
print(movie_reviews.fileids()[:10])
2000
['neg', 'pos']
['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]
['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt', 'neg/cv005_29357.txt
most movies seem to release a third movie just so it can be called a trilogy . rocky iii seems to kind of fit in that category , but man
(1340, 1000)
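The cells that build the bag-of-words features used below do not appear on this page. A minimal sketch of one way to obtain a new_X_train_bow matrix with the shape printed above ((1340, 1000)), assuming a CountVectorizer bag-of-words representation followed by chi-squared selection of 1000 features; the preprocessing in the original notebook may differ:

# Hypothetical reconstruction of the bag-of-words features
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

docs = [movie_reviews.raw(fid) for fid in movie_reviews.fileids()]
labels = [movie_reviews.categories(fid)[0] for fid in movie_reviews.fileids()]
X_train, X_test, y_train, y_test = train_test_split(docs, labels, test_size=0.33)  # 2000 * 0.67 = 1340 training reviews

vectorizer = CountVectorizer()                 # bag-of-words term counts
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

selector = SelectKBest(chi2, k=1000)           # keep the 1000 most label-dependent terms
new_X_train_bow = selector.fit_transform(X_train_bow, y_train)
new_X_test_bow = selector.transform(X_test_bow)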
svm_grid_serach.fit(new_X_train_bow,y_train)
reviews_svm_best_estimator = svm_grid_serach.best_estimator_
reviews_svm_best_estimator
SVC(C=10, gamma=1)
random_forest_grid_serach.fit(new_X_train_bow,y_train)
reviews_random_forest_best_estimator = random_forest_grid_serach.best_estimator_
reviews_random_forest_best_estimator
RandomForestClassifier(max_depth=3, max_features='log2', max_leaf_nodes=3)
kNN_grid_serach.fit(new_X_train_bow,y_train)
reviews_kNN_best_estimator = kNN_grid_serach.best_estimator_
reviews_kNN_best_estimator
KNeighborsClassifier(weights='distance')
LR_grid_serach.fit(new_X_train_bow,y_train)
reviews_LR_best_estimator = LR_grid_serach.best_estimator_
reviews_LR_best_estimator
LogisticRegression(C=0.1, max_iter=1000)
4.10. Compare the best obtained results among the classification algorithms (use PrettyTable to display the results).
+----------------------------------------------------------------------------+---------------------+
| grid search algorithms | Accuracy |
+----------------------------------------------------------------------------+---------------------+
| KNeighborsClassifier(weights='distance') | 0.5181818181818182 |
| SVC(C=10, gamma=1) | 0.5196969696969697 |
| LogisticRegression(C=0.1, max_iter=1000) | 0.49393939393939396 |
| RandomForestClassifier(max_depth=3, max_features='log2', max_leaf_nodes=3) | 0.5 |
+----------------------------------------------------------------------------+---------------------+
Finally,