Better Data Science - Hyperparameter Tuning With GridSearch

● Library imports
● You'll use the Iris dataset for training and tuning
In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix

iris = pd.read_csv('https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv')
iris.head()

● Separate the dataset into features (X) and target (y)
● Make the train/test split
In [6]:
X = iris.drop('species', axis=1)
y = iris['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
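
● A note on reproducibility (not part of the original notebook): without a fixed seed, the split changes on every run, so the accuracies below may differ slightly between runs. A minimal sketch, assuming you want a reproducible, class-balanced split:

# Hypothetical variant: fix the random seed and stratify by the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42,  # any fixed integer makes the split reproducible
    stratify=y        # keep class proportions equal in train and test
)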

Baseline model

● Model with default hyperparameters


In [7]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
preds = model.predict(X_test)

print(f'Accuracy = {round(accuracy_score(y_test, preds), 2)}')
print()
print(confusion_matrix(y_test, preds))
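
● If you also want per-class precision, recall, and F1 alongside the confusion matrix, scikit-learn's classification_report is a handy optional addition (not part of the original notebook):

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 for the baseline model
print(classification_report(y_test, preds))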

Manual hyperparameter optimization - Method #1

● Declare parameter dictionaries beforehand
● Train and evaluate multiple models
● Can become really tedious really fast
● Not scalable
In [8]:
# 3 sets of hyperparameters
params_1 = {'criterion': 'gini', 'splitter': 'best', 'max_depth': 10}
params_2 = {'criterion': 'entropy', 'splitter': 'random', 'max_depth': 1000}
params_3 = {'criterion': 'gini', 'splitter': 'random', 'max_depth': 100}

# 3 separate models
model_1 = DecisionTreeClassifier(**params_1)
model_2 = DecisionTreeClassifier(**params_2)
model_3 = DecisionTreeClassifier(**params_3)

model_1.fit(X_train, y_train)
model_2.fit(X_train, y_train)
model_3.fit(X_train, y_train)

# 3 separate prediction sets
preds_1 = model_1.predict(X_test)
preds_2 = model_2.predict(X_test)
preds_3 = model_3.predict(X_test)

print(f'Accuracy on Model 1 = {round(accuracy_score(y_test, preds_1), 5)}')
print(f'Accuracy on Model 2 = {round(accuracy_score(y_test, preds_2), 5)}')
print(f'Accuracy on Model 3 = {round(accuracy_score(y_test, preds_3), 5)}')

Manual hyperparameter optimization - Method #2

● Better than the first method
● Still way too manual
● Nested for loops don't look nice (a flattened alternative is sketched after the results table below)
In [9]:
# Define parameter possibilities as lists
p_criterion = ['gini', 'entropy']
p_splitter = ['best', 'random']
p_max_depth = [1, 10, 100, 1000]
# The scores will go here
results = []

# Nested loops - we need to test all combinations
for criterion in p_criterion:
    for splitter in p_splitter:
        for max_depth in p_max_depth:
            # Train the model
            model = DecisionTreeClassifier(
                criterion=criterion,
                splitter=splitter,
                max_depth=max_depth
            )
            model.fit(X_train, y_train)
            preds = model.predict(X_test)
            # Append current results
            results.append({
                'Accuracy': round(accuracy_score(y_test, preds), 5),
                'P_Criterion': criterion,
                'P_Splitter': splitter,
                'P_MaxDepth': max_depth
            })

# Convert to a Pandas DataFrame and sort by accuracy in descending order
results = pd.DataFrame(results)
results = results.sort_values(by='Accuracy', ascending=False)
results
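
● As an aside (not in the original notebook), the nesting can be flattened with itertools.product, which yields every (criterion, splitter, max_depth) combination in a single loop. A minimal sketch reusing the parameter lists defined above:

from itertools import product

results = []
# One flat loop over all hyperparameter combinations
for criterion, splitter, max_depth in product(p_criterion, p_splitter, p_max_depth):
    model = DecisionTreeClassifier(
        criterion=criterion,
        splitter=splitter,
        max_depth=max_depth
    )
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results.append({
        'Accuracy': round(accuracy_score(y_test, preds), 5),
        'P_Criterion': criterion,
        'P_Splitter': splitter,
        'P_MaxDepth': max_depth
    })

results = pd.DataFrame(results).sort_values(by='Accuracy', ascending=False)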

Go-to approach: GridSearchCV

● Define the model and hyperparameter space beforehand
● Use GridSearchCV for optimization
● Also does the cross-validation for you
In [10]:
model = DecisionTreeClassifier()
params = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_depth': [1, 10, 100, 1000]
}

clf = GridSearchCV(
    estimator=model,
    param_grid=params,
    cv=10,     # 10-fold cross-validation
    n_jobs=-1  # run in parallel across all CPU cores
)
clf.fit(X_train, y_train)

● Convert the cross-validation results (cv_results_) to a Pandas DataFrame:


In [11]:
cv_results = pd.DataFrame(clf.cv_results_)
cv_results.head()

● Keep only what matters
● Sort by average test score in descending order
In [12]:
cv_results = cv_results[['mean_test_score', 'param_criterion', 'param_splitter', 'param_max_depth']]
cv_results.sort_values(by='mean_test_score', ascending=False)

● Get the best parameters


In [13]:
clf.best_params_
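
● Since GridSearchCV refits the best estimator on the full training set by default (refit=True), clf can be used directly for prediction. A short optional check (not in the original notebook) to compare the tuned model against the baseline on the held-out test set:

# Best mean cross-validated accuracy found during the search
print(f'Best CV accuracy = {round(clf.best_score_, 5)}')

# Predict with the refit best estimator and evaluate on the test set
best_preds = clf.predict(X_test)
print(f'Test accuracy = {round(accuracy_score(y_test, best_preds), 5)}')
print(confusion_matrix(y_test, best_preds))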
