Simple Guide To Optuna For Hyperparameters OptimizationTuning

Simple Guide to Optuna for Hyperparameters Optimization/Tuning
coderzcolumn.com/tutorials/machine-learning/simple-guide-to-optuna-for-hyperparameters-optimization-tuning
Machine learning is a branch of artificial intelligence that focuses on designing algorithms that can automate a task by learning from data or
from experience. Machine learning algorithms are nowadays used in the majority of fields like object detection, image classification, house
price prediction, email classification, and many more. Majority of machine learning algorithm has a bunch of parameters whose different
values need to be tried in order to get good results. These parameters of the algorithms are generally referred to as hyperparameters.
Hyperparameters are generally penalties for an algorithm (l1 or l2), a number of layers for neural networks, activation functions, learning rate,
optimization algorithms (SGD, adam, etc), etc.
When designing an ML algorithm, we try different combinations of these hyperparameters to get good results. We want a generalized
algorithm that can work well in different conditions. With an increase in complexity, people started creating algorithms with lots of different
hyperparameters. Trying many different combinations for all hyperparameters can take a lot of time (sometimes even days if there is a lot of
data) even on powerful computers. Libraries like scikit-learn provide an implementation for grid search which tries all different combinations
of a list of hyperparameters combination and for random search which tries a random list of combinations from all possible combinations. The
grid search algorithm can take a lot of time if there are many different combinations and the random search algorithm can ignore some
combinations which might have given good results. We need some ways to try only hyperparameters settings that are giving good results.
Optuna is a framework designed specifically for the purpose of hyperparameters optimization. Optuna helps us find the best
hyperparameters for our algorithm faster and it works with a majority of current famous ML libraries like scikit-learn, xgboost, PyTorch,
TensorFlow, skorch, lightgbm, Keras, fast-ai, etc.
Optuna Strategy for Optimization
Optuna overall uses the below strategy for finding the best hyperparameters combination.
1. Sampling Strategy - It uses a sampling algorithm for selecting the best parameter combination from a list of all possible combinations.
It concentrates on areas where hyperparameters are giving good results and ignores others resulting in time savings.
2. Pruning Strategy - It uses a pruning strategy that constantly checks for algorithm performance during training and prunes
(terminates) training for particular hyperparameters combination if it's not giving good results. This also results in time-saving.
As a part of this tutorial, we'll explain how we can use optuna to find the best hyperparameter settings faster. We'll be using scikit-learn and its
algorithms for explanation purposes. This tutorial will get you started with Optuna. We have tried to explain the usage of it with simple and
easy-to-understand examples.
Steps to use Optuna

Below we have listed steps that will be most commonly followed to use optuna.
1. Create an objective function.

This function will have logic for creating a model, training it, and evaluating it on the validation dataset. After evaluation, it should
return a single value which is generally the output of the evaluation metric (accuracy, MSE, etc.) and needs to be
minimized/maximized.
This function takes as input a single parameter which is an instance of Trial class. This object has details about one combination of
hyperparameters with which the ML algorithm will be executed.
2. Create Study object.
3. Call optimize() method on Study by giving objective function created in the first step to find best hyperparameters combination. It'll
execute the objective function more than once by giving different Trial instances each having different hyperparameters combinations.
Optuna is based on the concept of Study and Trial.
The trial is one combination of hyperparameters that will be tried with an algorithm.
The study is the process of trying different combinations of hyperparameters to find the one combination that gives the best results. The
study generally consists of many trials.
Sections of Tutorial
This ends our small introduction to Optuna. We'll now start explaining the usage with examples. Our tutorial consists of the below sections.
We'll start by importing optuna.
import optuna
print("Optuna Version : {}".format(optuna.__version__))
Optuna Version : 2.9.1
1/29
Minimize Simple Line Formula
As a part of this section, we'll introduce how we can use Optuna to find the best parameters that can minimize the output of the simple line
function. We'll be trying to minimize the line function 5x-21. We want to find the best value of x at which the output of function 5x-21 is 0.
This is a simple function and we can easily calculate the output but we'll let optuna suggest us values of parameter x which will minimize the
function.
We'll be following the steps which we had discussed earlier at the beginning of the tutorial.
We'll start by creating an objective function that takes as input instance of Trial and return the value which we want to minimize/maximize. In
our case, we want to minimize the value of line function 5x-21. We have wrapped the line formula in abs function because we want to
minimize the function to 0. If we don't use abs then it'll take negative values of the line formula as a minimum.
Our hyperparameter for this line formula is x. We want to find the value of x at which formula abs(5-21) is minimum. We'll be using methods
of Trial instance for suggesting values of hyperparameter x for this purpose.
Important Methods of Trial Instance
suggest_float(name,low,high,step=None,log=False) - This method takes as input hyperparameter name and it's low and high
values as input. It then suggests float values in the range of [low, high]. We can specify step value if we want to increase the value using
that step size. We can also set log parameter to True to follow a logarithmic pattern in suggesting values. Logarithmic suggestion slowly
increases the value of a parameter.
suggest_int(name,low,high,step=1,log=False) - This method works exactly like suggest_float() with only difference that it
suggests integer values instead.
suggest_uniform(name,low,high) - This method takes parameter name and low & high value of parameter. It then uniformly
suggest values of parameter.
suggest_categorical(name, choices) - This method takes as input parameter name and list of different values of that parameter that
we want to try. It's generally used for categorical variables of data.
We have declared hyperparameter x using suggest_float() method inside of objective function. This will make sure that values of x are
suggested as float and in the range 0-5. The new value of x will be suggested each time the objective function is called with Trial instance.
def objective(trial):
x = trial.suggest_float("x", 0, 5)
return abs(5*x - 21)
In this cell, we have created an instance of Study using create_study() method. This object will be used to try different combinations of
hyperparameter combinations. In our case, different values of x will be tried by this study object.
create_study(study_name=None,direction=None,sampler=None,pruner=None,storage=None,load_if_exists=None) -
This method creates an instance of Study which will be used for optimization. It has list of optional parameters.
The study_name parameter accepts string specifying the name of the study.
The direction accepts string 'minimize' if we want to minimize the output of objective function else 'maximize'. By default, the
objective function is minimized.
The sampler parameter accepts an instance of Sampler specifying which sampling strategy to use for selecting hyperparameter
combinations. Below is a list of samplers available with Optuna.
TPESampler - By default this sampler will be used if none is provided. It uses a Tree-structured Parzen Estimator algorithm
for selecting hyperparameter combinations.
CmaEsSampler - This sampler uses cmaes library for selecting hyperparameter combinations. It's implementation of
covariance matrix adaptation evolution strategy (CMA-ES) algorithm.
NSGAIISampler - This is a multi-objective sampler based on the NSGA-II algorithm.
MOTPESampler - This is a multi-objective sampler based on the MOTPE algorithm.
GridSampler - This is the same sampler as scikit-learn's grid search will try all combinations.
RandomSampler - This is the same sampler as scikit-learn's random search which will randomly select few combinations.
The pruner parameter accepts an instance of Pruner which will be used to prune a particular trial of objective function if it's not
giving good results before it completes. Below is a list of pruners available with Optuna.
MedianPruner - By default this pruner will be used if none is provided. It uses the median stopping rule to prune trials.
NopPruner - This pruner will not perform pruning.
PercentilePruner - This pruner will keep the specified percentile of trials from all possible trials.
SuccessiveHalvingPruner - This pruner uses an asynchronous successive halving algorithm for pruning trials.
HyperbandPruner - It uses a hyperband algorithm for pruning.
ThresholdPruner - It uses a certain threshold to prune trials.
PatientPruner - This pruner wraps another pruner with particular tolerance and prunes trial based on it.
We'll be using the default sampler and pruner available in our examples of this tutorial.
2/29
Once we have created a Study object, we can instruct it to try different values of hyperparameters and find the combination which gives us the
best result. In our case, it'll try to find the best value of x which minimizes the line formula. We also need to give it a number of trials to
perform.
Important Methods of Study Object
optimize(func,n_trials=None,storage=None,timeout=None,n_jobs=1,catch=(),show_progress_bar=False) - This
method takes as input objective function and tries different combination of hyperparameters with it. It has list of important parameters.
The n_trials parameter accepts integer values specifying the number of trials to execute.
The timeout parameter accepts float value specifying the number of seconds to wait before terminating the study. It'll try different
combinations until a specified amount of seconds is passed.
The n_jobs parameter accepts integer values specifying the number of cores/CPU to use on the computer. If we set it to -1 then it'll
use all cores of the computer.
The catch parameter accepts a list of exceptions. The study will continue with other trials if one of the exception specified in this
list happens during the execution of an objective function. By default, this list is empty which means that any exception that
happened in the objective function will result in a halt of study. We can provide exceptions that we want to avoid as a part of this
parameter.
The show_progress_bar parameter accepts a boolean value which if set to True will show the progress bar of study.
The storage parameter accepts database URL where trial results will be saved. This will be useful when we want to run trials in
parallel on many different computers. They will communicate and divide trials using this database.
Below we have instructed Study object to try the objective function 10 times so that it'll try 10 different values of parameter x and will keep
track of formula output for each try. We can then retrieve which value gave the best result using Study object attributes.
study1 = optuna.create_study(study_name="MinimizeFunction")
study1.optimize(objective, n_trials=10)
[I 2021-09-17 07:06:51,088] A new study created in memory with name: MinimizeFunction

[I 2021-09-17 07:06:51,094] Trial 0 finished with value: 13.841006675454917 and parameters: {'x': 1.4317986649090164}. Best is
trial 0 with value: 13.841006675454917.
trial 0 with value: 13.841006675454917.
trial 2 with value: 12.105439771280711.
[I 2021-09-17 07:06:51,102] Trial 3 finished with value: 1.359435784302029 and parameters: {'x': 4.471887156860406}. Best is trial
3 with value: 1.359435784302029.
3 with value: 1.359435784302029.
3 with value: 1.359435784302029.
3 with value: 1.359435784302029.
7 with value: 1.016466974164011.
7 with value: 1.016466974164011.
7 with value: 1.016466974164011.
Study object has a list of important attributes which can be used to find out the best parameter settings and result details once the study
completes executing all trials.
The best_params attribute has a dictionary specifying a value of each hyperparameter that gave the best results (minimum value of an
objective function in this case).
Below we have printed the best value of x which gave the minimum value for our line formula.
best_params = study1.best_params
best_params
{'x': 4.403293394832803}
found_x = best_params["x"]
print("Found x: {}, (5*x - 21): {}".format(found_x, (5*found_x - 21)))
Found x: 4.403293394832803, (5*x - 21): 1.016466974164011
The best_value attribute has a value of the best result. It'll be holding the minimum value of the line formula that we got after performing
different trials.
study1.best_value
3/29
1.016466974164011
The best_trial attribute has an instance of FrozenTrial which has details about a trial that gave the best results. It has information about a
trial state which is COMPLETE in this case. The trial state can be failed or pruned as well.
study1.best_trial
FrozenTrial(number=7, values=[1.016466974164011], datetime_start=datetime.datetime(2021, 9, 17, 7, 6, 51, 112203),

datetime_complete=datetime.datetime(2021, 9, 17, 7, 6, 51, 112935), params={'x': 4.403293394832803}, distributions={'x':
UniformDistribution(high=5.0, low=0.0)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=7,
state=TrialState.COMPLETE, value=None)
The trials attribute has a list of FrozenTrial instances holding information about each individual trial and their states.
print("Total Trials : {}".format(len(study1.trials)))
Total Trials : 10
The trials_dataframe() method returns pandas dataframe summarizing all trials of study.
study1.trials_dataframe()
number value datetime_start datetime_complete duration params_x state
0 13.841007 2021-09-17 2021-09-17 0 days 1.431799 COMPLETE

0 07:06:51.092090 07:06:51.093569 00:00:00.001479
1 19.763319 2021-09-17 2021-09-17 0 days 0.247336 COMPLETE

1 07:06:51.095844 07:06:51.096511 00:00:00.000667
2 12.105440 2021-09-17 2021-09-17 0 days 1.778912 COMPLETE

2 07:06:51.098360 07:06:51.099039 00:00:00.000679
3 1.359436 2021-09-17 2021-09-17 0 days 4.471887 COMPLETE

3 07:06:51.101022 07:06:51.101690 00:00:00.000668
4 2.072014 2021-09-17 2021-09-17 0 days 4.614403 COMPLETE

4 07:06:51.103664 07:06:51.104350 00:00:00.000686
5 2.623199 2021-09-17 2021-09-17 0 days 3.675360 COMPLETE

5 07:06:51.106561 07:06:51.107281 00:00:00.000720
6 15.832072 2021-09-17 2021-09-17 0 days 1.033586 COMPLETE

6 07:06:51.109402 07:06:51.110122 00:00:00.000720
7 1.016467 2021-09-17 2021-09-17 0 days 4.403293 COMPLETE

7 07:06:51.112203 07:06:51.112935 00:00:00.000732
8 1.504497 2021-09-17 2021-09-17 0 days 3.899101 COMPLETE

8 07:06:51.115067 07:06:51.115816 00:00:00.000749
9 9.916126 2021-09-17 2021-09-17 0 days 2.216775 COMPLETE

9 07:06:51.118023 07:06:51.118741 00:00:00.000718
We can continue our trials further by calling optimize() function. It'll try that many more trials. If we are not satisfied with the results of the
initial trials then we can call optimize() again so that it tries a few more trials to improve results further.
Below we have executed 15 more trials using optimize().
4/29
7 with value: 1.016466974164011.
7 with value: 1.016466974164011.
7 with value: 1.016466974164011.
trial 13 with value: 0.16666595745332557.
trial 13 with value: 0.16666595745332557.
13 with value: 0.16666595745332557.
trial 13 with value: 0.16666595745332557.
13 with value: 0.16666595745332557.
trial 13 with value: 0.16666595745332557.
trial 13 with value: 0.16666595745332557.
trial 13 with value: 0.16666595745332557.
trial 13 with value: 0.16666595745332557.
13 with value: 0.16666595745332557.
trial 13 with value: 0.16666595745332557.
trial 13 with value: 0.16666595745332557.
Below we have printed the best parameter and the best value after trying 15 more trials.
best_params = study1.best_params
best_params
{'x': 4.233333191490665}
found_x = best_params["x"]
print("Found x: {}, (5*x - 21): {}".format(found_x, (5*found_x - 21)))
Found x: 4.233333191490665, (5*x - 21): 0.16666595745332557

Total Trials : 25
Regression (Ridge)
As a part of this section, we'll explain how we can use optuna with scikit-learn estimators. We'll be working on a regression problem and try to
solve it using ridge regression. We'll be using the Boston housing dataset available from scikit-learn for our purpose. We'll start by importing
all necessary libraries and functions that we'll be using throughout this section. We'll also compare the results of optuna with results of grid
search and random search of scikit-learn.
import sklearn
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import Ridge, LogisticRegression

from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
Below we have loaded the Boston housing dataset available from scikit-learn. It has information about houses like the average number of
rooms per dwelling, property tax, the crime rate in the area, etc. We'll be predicting the median value of the house in 1000 dollars. We have
loaded the dataset and saved it in a data frame for display purposes.
We have stored 13 features of data into variable X and target value in variable Y.
5/29
boston = datasets.load_boston()
X,Y = boston.data, boston.target
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_df["HousePrice"] = boston.target
boston_df
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT HousePrice
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 0.06263 0.0 11.93 0.0 0.573 6.593 69.1 2.4786 1.0 273.0 21.0 391.99 9.67 22.4
502 0.04527 0.0 11.93 0.0 0.573 6.120 76.7 2.2875 1.0 273.0 21.0 396.90 9.08 20.6
503 0.06076 0.0 11.93 0.0 0.573 6.976 91.0 2.1675 1.0 273.0 21.0 396.90 5.64 23.9
504 0.10959 0.0 11.93 0.0 0.573 6.794 89.3 2.3889 1.0 273.0 21.0 393.45 6.48 22.0
505 0.04741 0.0 11.93 0.0 0.573 6.030 80.8 2.5050 1.0 273.0 21.0 396.90 7.88 11.9
506 rows × 14 columns
Below we have divided data into train (80%) and test (20%) sets using train_test_split() scikit-learn function.
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, train_size=0.80, random_state=123)
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
((404, 13), (102, 13), (404,), (102,))
Below we have declared the objective function that we'll be using for our purpose. We have declared 4 hyperparameters that we'll be
optimizing.
alpha
fit_intercept
tol
solver
We have used suggest_float() to suggest floating-point values for hyperparameters alpha and tol. We have used suggest_categorical() to
suggest categorical values for hyperparameters fit_intercept and solver. Values of hyperparameters will be selected from ranges suggested
by these methods during each trial of the optuna study.
We have then created a model with these parameters and fitted training data to it. At last, we have calculated the mean squared error (MSE) on
test data and returned the value of it. We'll be minimizing this MSE during the study.
alpha = trial.suggest_float("alpha", 0, 10)
intercept = trial.suggest_categorical("fit_intercept", [True, False])
tol = trial.suggest_float("tol", 0.001, 0.01, log=True)
solver = trial.suggest_categorical("solver", ["auto", "svd","cholesky", "lsqr", "saga", "sag"])
## Create Model
regressor = Ridge(alpha=alpha,fit_intercept=intercept,tol=tol,solver=solver)
## Fit Model
regressor.fit(X_train, Y_train)
return mean_squared_error(Y_test, regressor.predict(X_test))
Below we have created an instance of Study and run 15 trials of the objective function that we created in the previous cell. This will try to find
the best hyperparameter settings that minimize MSE on test data using TPESampler of optuna.
6/29
%%time
study2 = optuna.create_study(study_name="RidgeRegression")
[I 2021-09-17 07:06:52,922] A new study created in memory with name: RidgeRegression

[I 2021-09-17 07:06:53,137] Trial 0 finished with value: 32.64447951723032 and parameters: {'alpha': 1.867480452457675,
'fit_intercept': False, 'tol': 0.00556052962595254, 'solver': 'svd'}. Best is trial 0 with value: 32.64447951723032.
'fit_intercept': False, 'tol': 0.0010713988518477274, 'solver': 'auto'}. Best is trial 0 with value: 32.64447951723032.
'fit_intercept': False, 'tol': 0.0036028261758664004, 'solver': 'saga'}. Best is trial 0 with value: 32.64447951723032.
'fit_intercept': True, 'tol': 0.002148432418111609, 'solver': 'lsqr'}. Best is trial 3 with value: 28.73396591701217.
'fit_intercept': False, 'tol': 0.003023111779978088, 'solver': 'auto'}. Best is trial 3 with value: 28.73396591701217.
'fit_intercept': True, 'tol': 0.005942102414029408, 'solver': 'sag'}. Best is trial 3 with value: 28.73396591701217.
'fit_intercept': False, 'tol': 0.004563590007643111, 'solver': 'saga'}. Best is trial 3 with value: 28.73396591701217.
'fit_intercept': True, 'tol': 0.007069246250713737, 'solver': 'saga'}. Best is trial 3 with value: 28.73396591701217.
'fit_intercept': False, 'tol': 0.0037751409287642063, 'solver': 'lsqr'}. Best is trial 3 with value: 28.73396591701217.
'fit_intercept': True, 'tol': 0.00128362461455875, 'solver': 'sag'}. Best is trial 3 with value: 28.73396591701217.
'fit_intercept': True, 'tol': 0.0016580512349069821, 'solver': 'cholesky'}. Best is trial 3 with value: 28.73396591701217.
CPU times: user 163 ms, sys: 8.21 ms, total: 171 ms
Wall time: 382 ms
Below we have printed hyperparameters combination that gave the least MSE. We have then created a ridge regression model using the best
hyperparameters that we found out using optuna. We have evaluated the performance of the model on train and test set by evaluating MSE on
both.
print("Best Params : {}".format(study2.best_params))
print("\nBest MSE : {}".format(study2.best_value))
Best Params : {'alpha': 5.313845395415553, 'fit_intercept': True, 'tol': 0.002148432418111609, 'solver': 'lsqr'}
Best MSE : 28.73396591701217
ridge = Ridge(**study2.best_params)
ridge.fit(X_train, Y_train)
print("Ridge Regression MSE on Train Dataset : {}".format(mean_squared_error(Y_train, ridge.predict(X_train))))

print("Ridge Regression MSE on Test Dataset : {}".format(mean_squared_error(Y_test, ridge.predict(X_test))))
Ridge Regression MSE on Train Dataset : 26.041719809438128

Ridge Regression MSE on Test Dataset : 28.73396591701217
Here, we have created a ridge regression model with default parameters for comparison purposes. We have a default model with train data and
then evaluated it on both train and test sets.
ridge = Ridge()


Now we'll optimize the objective function again for 10 trials to check whether it's improving results further or not. We have printed the best
parameter settings and MSE after these 10 trials. This trial will work keeping 15 trials that we performed earlier as a part of this study. It'll
continue to search in a direction where it had got good results (least MSE) when it ran 15 trials earlier.
7/29
%%time
'fit_intercept': True, 'tol': 0.0028285638127246316, 'solver': 'cholesky'}. Best is trial 3 with value: 28.73396591701217.
'fit_intercept': True, 'tol': 0.0015349667709175423, 'solver': 'svd'}. Best is trial 3 with value: 28.73396591701217.
CPU times: user 83.6 ms, sys: 7.69 ms, total: 91.3 ms
Wall time: 89.6 ms
Below we have printed the best parameters and MSE for that model found out after another 10 trials. We have also trained the model using
these settings and evaluated it as well.
Best Params : {'alpha': 4.238734413985723, 'fit_intercept': True, 'tol': 0.003364829406180997, 'solver': 'lsqr'}
Best MSE : 28.732108565206946
ridge = Ridge(**study2.best_params)


Comparison with Grid Search
As a part of this section, we'll compare optuna with the grid search algorithm of scikit-learn. We'll be trying the same parameter settings that
we tried with optuna but we'll use grid search for training purposes. The grid search algorithm will try all possible combinations rather than
looking in the direction of good results.
If you are interested in learning about how to use grid search from scikit-learn then please feel free to check our tutorial on the same.
Scikit-Learn - Cross-Validation & Hyperparameter Tuning Using GridSearch
Grid Search without Parallelization
Below we are trying grid search without any kind of parallelization. We have performed a grid search on training data first. Later on, we have
created a model with the best parameter setting that grid search found. We have also evaluated the model on train and test set for verifying
results. We have chosen the same hyperparameter ranges that we had used when using optuna. Grid search in total below will try 3000
different combinations (25-alpha x 2-fit_intercept x 10-tol x 6-solver) of hyperparameters.
We can notice from the output that results are almost the same as that of optuna. The MSE is almost the same in both train and test sets but
the time taken by grid search is a lot more than that compared to optuna.
8/29
%%time
param_grid = {"alpha" : np.linspace(0, 10, 25),

"fit_intercept": [True, False],
"tol": np.linspace(0.001, 0.01,10),
"solver": ["auto", "svd","cholesky", "lsqr", "saga", "sag"]
}
grid = GridSearchCV(Ridge(), param_grid, cv=5)
grid.fit(X_train, Y_train)
grid.best_params_
CPU times: user 46.9 s, sys: 59.2 ms, total: 47 s

Wall time: 47 s
{'alpha': 0.0, 'fit_intercept': True, 'solver': 'svd', 'tol': 0.001}
ridge = Ridge(**grid.best_params_)


Grid Search with Parallelization
Below we have tried the same grid search algorithm that we tried in the previous step but this time we have used parallelization to check
whether there is any improvement in speed. The code has only one change compared to the previous section that we have set n_jobs
parameter to -1 instructing it to use all cores of the computer. We can notice from the output that the grid search now completes in almost half
of the time compared to the previous run but still even after parallelization it took a lot more time compared to optuna.
%%time

"tol": np.linspace(0.001, 0.01,10),
}
grid = GridSearchCV(Ridge(), param_grid, cv=5, n_jobs=-1)
grid.best_params_
CPU times: user 2.29 s, sys: 37.6 ms, total: 2.32 s

Wall time: 21.4 s
{'alpha': 0.0, 'fit_intercept': True, 'solver': 'svd', 'tol': 0.001}


Comparison with Random Search

As a part of this section, we'll be comparing the performance of optuna with that of the random search algorithm of scikit-learn for
hyperparameters optimization. We have covered details about how to use random search in the same tutorial in which we have discussed grid
search. We have given the link above for it.
Random Search without Parallelization
Below we have used random search with the same range of hyperparameters values which we had used with optuna. We have instructed the
random search algorithm to try 25 random iterations so that it'll try the algorithm with 25 different randomly chosen hyperparameters
settings. We chose 25 because we had run optuna earlier for 25 trials (15 first and then 10).
9/29
We can notice that random search completes quite faster compared to grid search because it tries only 25 hyperparameters combinations
whereas grid search was trying 3000. If we compare timing with optuna, the time is still quite more compared to optuna and results are almost
the same. Optuna can run still more fast if we had used parallelization with it by setting n_jobs to -1 when optimizing an objective function.
%%time

"tol": np.linspace(0.001, 0.01,10),
}
grid = RandomizedSearchCV(Ridge(), param_grid, cv=5, n_iter=25, random_state=123)
grid.best_params_
CPU times: user 386 ms, sys: 7 µs, total: 386 ms

Wall time: 384 ms
{'tol': 0.003, 'solver': 'svd', 'fit_intercept': True, 'alpha': 3.75}


Random Search with Parallelization
Below we have run a random search with parallelization by setting n_jobs to -1. It now runs faster compared to the non-parallelized version
but still slower compared to optuna. The results are almost the same or a little bad compared to optuna.
%%time

"tol": np.linspace(0.001, 0.01,10),
}
grid = RandomizedSearchCV(Ridge(), param_grid, cv=5, n_iter=25, n_jobs=-1, random_state=123)
grid.best_params_
CPU times: user 75 ms, sys: 7.9 ms, total: 82.9 ms

Wall time: 230 ms
{'tol': 0.003, 'solver': 'svd', 'fit_intercept': True, 'alpha': 3.75}


Classification (Logistic Regression)

As a part of this section, we'll explain how we can use Optuna for classification problems. We'll be using the wine dataset available from scikit-
learn for our purpose. It has information about various ingredients of wines for three different categories of wines. We'll be using logistic
regression for explanation purposes and try to find out the best hyperparameters combination that gives the best accuracy.
Below we have loaded the wine dataset from scikit-learn. We have created a pandas data frame from wine data for the display purpose of its
features and target (Wine Type) variables.
We have loaded information about wine features into a variable named X and information about wine type into a target variable named Y.
10/29
wine = datasets.load_wine()
X,Y = wine.data, wine.target
wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)
wine_df["WineType"] = wine.target
wine_df
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyan
0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29
1 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28
2 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81
3 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18
4 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82
... ... ... ... ... ... ... ... ... ...
173 13.71 5.65 2.45 20.5 95.0 1.68 0.61 0.52 1.06
174 13.40 3.91 2.48 23.0 102.0 1.80 0.75 0.43 1.41
175 13.27 4.28 2.26 20.0 120.0 1.59 0.69 0.43 1.35
176 13.17 2.59 2.37 20.0 120.0 1.65 0.68 0.53 1.46
177 14.13 4.10 2.74 24.5 96.0 2.05 0.76 0.56 1.35
Below we have divided the wine dataset into the train (80%) and test (20%) sets.
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, train_size=0.80, stratify=Y, random_state=123)
((142, 13), (36, 13), (142,), (36,))
Below we have created an objective function for our classification problem. We'll be optimizing 5 different hyperparameters of a logistic
regression model.
penalty
tol
C
fit_intercept
solver
We have used suggest_categorical() method of Trial instance for suggesting categorical values for hyperparameters penalty,
fit_intercept and solver. We have used suggest_float() method for suggesting float values for hyperparameters tol and C. Values of
hyperparameters will be selected from ranges suggested by these methods during each trial of the optuna study.
We have then created logistic regression models using these hyperparameter variables. We have then trained the model on test data and
evaluated it on test data. The evaluation will find out accuracy in this case which will tell us how many percent of test labels our model
predicted correctly.
11/29
penalty = trial.suggest_categorical("penalty", ["l1", "l2"])
tol = trial.suggest_float("tol", 0.0001, 0.01, log=True)
C = trial.suggest_float("C", 1.0, 10.0, log=True)
intercept = trial.suggest_categorical("fit_intercept", [True, False])
solver = trial.suggest_categorical("solver", ["liblinear", "saga"])
## Create Model
classifier = LogisticRegression(penalty=penalty,
tol=tol,
C=C,
fit_intercept=intercept,
solver=solver,
multi_class="auto",
)
## Fit Model
classifier.fit(X_train, Y_train)
return classifier.score(X_test, Y_test)
Below we have created an instance of Study with the name LogisticRegression. We have set direction to maximize this time because we
want to maximize accuracy which is the output of an objective function. The default value of parameter direction is minimize which will
minimize the output of an objective function. It was used during the regression section where we wanted to minimize MSE.
We have then used the study object and instructed it to run the objective function for 15 trials with different hyperparameter combinations. It'll
try 15 combinations and store information about each.
%%time
study3 = optuna.create_study(study_name="LogisticRegression", direction="maximize")

[I 2021-09-17 07:08:04,282] A new study created in memory with name: LogisticRegression

[I 2021-09-17 07:08:04,323] Trial 0 finished with value: 1.0 and parameters: {'penalty': 'l2', 'tol': 0.00044329282917357363, 'C':
4.227876357919739, 'fit_intercept': False, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
[I 2021-09-17 07:08:04,348] Trial 1 finished with value: 0.6944444444444444 and parameters: {'penalty': 'l2', 'tol':
0.00027839176368309095, 'C': 4.045876618765414, 'fit_intercept': False, 'solver': 'saga'}. Best is trial 0 with value: 1.0.
0.0017661636335698377, 'C': 3.5819218693443595, 'fit_intercept': True, 'solver': 'saga'}. Best is trial 0 with value: 1.0.
0.0011113283180062658, 'C': 2.8253108009422, 'fit_intercept': True, 'solver': 'saga'}. Best is trial 0 with value: 1.0.
6.581866524287295, 'fit_intercept': True, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
0.0005947383488025259, 'C': 2.4953987523032026, 'fit_intercept': True, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
0.0005102712159548399, 'C': 2.062790817916783, 'fit_intercept': True, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
CPU times: user 235 ms, sys: 8.18 ms, total: 243 ms
Wall time: 267 ms
Below we have printed the best hyperparameters settings which gave the best accuracy on train data.
print("\nBest Accuracy : {}".format(study3.best_value))
Best Params : {'penalty': 'l2', 'tol': 0.00044329282917357363, 'C': 4.227876357919739, 'fit_intercept': False, 'solver':
'liblinear'}
Best Accuracy : 1.0
Below we have created a logistic regression model with the best parameters that we found using optuna. We have then trained it and evaluated
it on train and test datasets.
12/29
We have also created a logistic regression model with default parameters to compare its performance with optuna. The results might look
almost the same because the dataset that we have used is small. But the result of optuna in the real world where there is a lot of data will easily
beat the models with default hyperparameter settings.
classifier = LogisticRegression(**study3.best_params, multi_class="auto")
print("Logistic Regression Accuracy on Train Dataset : {}".format(classifier.score(X_train, Y_train)))

print("Logistic Regression Accuracy on Test Dataset : {}".format(classifier.score(X_test, Y_test)))
Logistic Regression Accuracy on Train Dataset : 0.971830985915493

Logistic Regression Accuracy on Test Dataset : 1.0
classifier = LogisticRegression(multi_class="auto")


Below we have instructed the study object to optimize the objective function for 10 more trials to check whether it's improving the results
further or not.
%%time
0.00010380694985183597, 'C': 8.808456215349976, 'fit_intercept': False, 'solver': 'liblinear'}. Best is trial 0 with value: 1.0.
CPU times: user 177 ms, sys: 105 µs, total: 177 ms
Wall time: 172 ms
We have then printed the results after another 10 trials and the results are almost the same.
print("\nBest Accuracy : {}".format(study3.best_value))
Best Params : {'penalty': 'l2', 'tol': 0.00044329282917357363, 'C': 4.227876357919739, 'fit_intercept': False, 'solver':
'liblinear'}
Best Accuracy : 1.0
classifier = LogisticRegression(**study3.best_params, multi_class="auto")


Comparison with Grid Search

As a part of this section, we are comparing the grid search algorithm with optuna. We are trying the same ranges for each hyperparameter that
we tried when using optuna. We can notice after training and evaluation that results are almost the same but the time taken by grid search is a
lot more compared to optuna.
13/29
%%time
param_grid = {
"penalty": ["l1", "l2"],
"C" : np.linspace(1, 10.0, 25),
"tol": np.linspace(0.0001, 0.01,10),
"solver": ["liblinear", "saga"]
}
grid = GridSearchCV(LogisticRegression(multi_class="auto", max_iter=1000), param_grid, cv=5)
grid.best_params_
CPU times: user 1min 2s, sys: 13.5 ms, total: 1min 2s
Wall time: 1min 2s
{'C': 7.375,
'fit_intercept': True,
'penalty': 'l1',
'solver': 'liblinear',
'tol': 0.0034}
classifier = LogisticRegression(**grid.best_params_, multi_class="auto")


Comparison with Random Search

Below we have compared a random search algorithm with optuna. We have run a random search algorithm for 25 iterations which will try 25
different hyperparameters combinations on data. We can notice from the output that accuracy is almost the same as the one we got with
optuna. The random search runs a little faster compared to grid search because it does not try all possible combinations of hyperparameters
but is still it is slower compared to optuna.
We can come to the conclusion that optuna finds out the best hyperparameters combination in quite less time compared to random search and
grid search. This can increase the productivity of ml practitioners a lot as it'll save time which could have been wasted in trying all possible
settings rather than concentrating on ones that are giving good and ignoring others.
%%time
param_grid = {
"penalty": ["l1", "l2"],
"C" : np.linspace(1, 10.0, 25),
"tol": np.linspace(0.0001, 0.01,10),
"solver": ["liblinear", "saga"]
}
grid = RandomizedSearchCV(LogisticRegression(multi_class="auto", max_iter=1000), param_grid, cv=5, n_iter=25, random_state=123)
grid.best_params_
CPU times: user 879 ms, sys: 0 ns, total: 879 ms

Wall time: 878 ms
{'tol': 0.0067,
'solver': 'liblinear',
'penalty': 'l1',
'fit_intercept': True,
'C': 9.625}
classifier = LogisticRegression(**grid.best_params_, multi_class="auto")


14/29
Pruning Under Performing Hyperparameter Settings Earlier
As a part of this section, we'll explain how we can instruct Optuna to prune trials that are not performing well during the study process.
Typical machine learning algorithm deals with a lot of data in which case training does not complete in one go like we had explained in our
earlier examples that had quite less data that can fit in the main memory of the computer. Real-world problems generally have a lot of data and
the training process consists of going through batches of samples of data. It goes through the total data in batches to cover the total dataset.
Many neural networks even go through a dataset more than once during the training process.
When going through data in batches or even looping through the same data more than once during the particular trial of study, we can check
the performance of a model on set aside validation or test set. If it's not performing well then it can be pruned before it completes to save time
and resources for other trials of the study process. Whether to prune a particular trial or not is decided by the internal pruning algorithm of
Optuna.
We'll be using the California housing dataset available from scikit-learn as a part of this section. We'll be training a dataset in batches on a
multi-layer perceptron algorithm available from scikit-learn.
California housing dataset has information about houses (average bedrooms, the population of an area, house age, etc) in California and their
median house price. The median house price will be the target variable that our ML algorithm will be predicting. It'll be a regression problem.
Below we have loaded the California housing dataset which is available from scikit-learn. It's a big dataset compared to our previous datasets. It
has around 20k+ entries. We have stored the dataset in a pandas dataframe for display purposes. We have stored housing features in variable
X and our target variable (median house price) in variable Y.
calif_housing = datasets.fetch_california_housing()
X, Y = calif_housing.data, calif_housing.target
calif_housing_df = pd.DataFrame(calif_housing.data, columns=calif_housing.feature_names)
calif_housing_df["MedianHousePrice"] = calif_housing.target
calif_housing_df
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedianHousePrice
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422
... ... ... ... ... ... ... ... ... ...
20635 1.5603 25.0 5.045455 1.133333 845.0 2.560606 39.48 -121.09 0.781
20636 2.5568 18.0 6.114035 1.315789 356.0 3.122807 39.49 -121.21 0.771
20637 1.7000 17.0 5.205543 1.120092 1007.0 2.325635 39.43 -121.22 0.923
20638 1.8672 18.0 5.329513 1.171920 741.0 2.123209 39.43 -121.32 0.847
20639 2.3886 16.0 5.254717 1.162264 1387.0 2.616981 39.37 -121.24 0.894
We have then divided the dataset into training (90%) and test (10%) sets as usual.
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, train_size=0.90, random_state=123)
((18576, 8), (2064, 8), (18576,), (2064,))
Below we have reshaped our training and test dataset so that it has now entries in batches of samples. Each entry is a batch of 16 samples of
data. The ML model will loop through training data in batches where 16 samples will be fed to it for training each time.
X_train_batched, Y_train_batched = X_train.reshape(-1,16,8), Y_train.reshape(-1,16)
X_train_batched.shape, Y_train_batched.shape
((1161, 16, 8), (1161, 16))
15/29
Below we have created an objective function that we'll be using for our purpose. We'll be using the neural network algorithm available from
scikit-learn for our purpose. We'll be optimizing the values of the below hyperparameters of the model.
hidden_layer_sizes
activation
learning_rate
learning_rate_init
Our objective function uses suggest_categorical() method of Trial instance for suggesting categorical values for hyperparameters
hidden_layer_sizes, activation and learning_rate. We have used suggest_float() method for suggesting float values for
hyperparameter learning_rate_init. After that, we have initialized model with these parameters.
The training process this time consists of a loop where we go through training data and each time partially fit a model to a single batch of
training data. We calculate the mean squared error on test data during each batch as well.
In order to prune underperforming trials, we have introduced few extra lines of code. We are calling report() method of Trial instance which
takes as input value which we are optimizing and number inside of our training loop. Then we have put if condition which checks whether this
particular trial should be pruned using should_prune() method of Trial instance. If this method returns True then we raise
TrialPruned() which will raise an exception. This will inform Study instance that we should prune this trial and should not train it more.
This will result in saving of time and resources which would have been wasted behind this trial which would have resulted in underperformed
results.
The default pruning algorithm of the study instance is MedianPruner which decides whether to prune a particular trial or not. Based on the
decision taken by this algorithm, should_prune() method returns True or False. The MedianPruner algorithm takes decisions based on
MSE values that we reported through various calls of report() method during the training process.
from sklearn.neural_network import MLPRegressor
hidden_layers = trial.suggest_categorical("hidden_layer_sizes", [(50,100),(100,100),(50,75,100),(25,50,75,100)])
activation = trial.suggest_categorical("activation", ["relu", "identity"])
#solver = trial.suggest_categorical("solver", ["sgd", "adam"])
learning_rate = trial.suggest_categorical("learning_rate", ['constant', 'invscaling', 'adaptive'])
learning_rate_init = trial.suggest_float("learning_rate_init", 0.001, 0.01)
## Create Model
mlp_regressor = MLPRegressor(
hidden_layer_sizes=hidden_layers,
activation=activation,
#solver=solver,
learning_rate=learning_rate,
learning_rate_init=learning_rate_init,
#early_stopping=True
)
## Fit Model
for i, (X_batch, Y_batch) in enumerate(zip(X_train_batched,Y_train_batched)):
mlp_regressor.partial_fit(X_batch, Y_batch)
mse = mean_squared_error(Y_test, mlp_regressor.predict(X_test))
trial.report(mse, i+1)
if trial.should_prune():
raise optuna.TrialPruned()
return mse
Below we have created an instance of Study for trying various trials. We are then running 15 different trials to optimize the output (MSE on
test dataset) of the objective function.
This time, we can notice from the output that few of the trials are pruned by the algorithm during a study which it thinks would not have
resulted in a good performance.
%%time
study4 = optuna.create_study(study_name="MLPRegressor")
16/29
[I 2021-09-17 07:09:09,083] A new study created in memory with name: MLPRegressor
[I 2021-09-17 07:09:11,468] Trial 0 finished with value: 5.430809559222708 and parameters: {'hidden_layer_sizes': (100, 100),
'activation': 'identity', 'learning_rate': 'invscaling', 'learning_rate_init': 0.006063649495374317}. Best is trial 0 with value:
5.430809559222708.
[I 2021-09-17 07:09:16,623] Trial 1 finished with value: 2.6933340702103807 and parameters: {'hidden_layer_sizes': (25, 50, 75,
100), 'activation': 'relu', 'learning_rate': 'adaptive', 'learning_rate_init': 0.006593756750422365}. Best is trial 1 with value:
2.6933340702103807.
[I 2021-09-17 07:09:20,239] Trial 2 finished with value: 1.465392838316882 and parameters: {'hidden_layer_sizes': (100, 100),
'activation': 'relu', 'learning_rate': 'constant', 'learning_rate_init': 0.0035340788684059556}. Best is trial 2 with value:
1.465392838316882.
[I 2021-09-17 07:09:23,682] Trial 3 finished with value: 2.0013989256305473 and parameters: {'hidden_layer_sizes': (50, 75, 100),
'activation': 'relu', 'learning_rate': 'adaptive', 'learning_rate_init': 0.006840653092104957}. Best is trial 2 with value:
1.465392838316882.
[I 2021-09-17 07:09:31,306] Trial 4 finished with value: 1.6047483038968486 and parameters: {'hidden_layer_sizes': (25, 50, 75,
100), 'activation': 'identity', 'learning_rate': 'adaptive', 'learning_rate_init': 0.004993568961591413}. Best is trial 2 with
value: 1.465392838316882.
[I 2021-09-17 07:09:31,481] Trial 5 pruned.
[I 2021-09-17 07:09:31,510] Trial 6 pruned.
[I 2021-09-17 07:09:31,527] Trial 7 pruned.
[I 2021-09-17 07:09:31,551] Trial 8 pruned.
[I 2021-09-17 07:09:31,557] Trial 9 pruned.
[I 2021-09-17 07:09:31,582] Trial 10 pruned.
[I 2021-09-17 07:09:31,594] Trial 11 pruned.
[I 2021-09-17 07:09:31,612] Trial 12 pruned.
[I 2021-09-17 07:09:31,830] Trial 13 pruned.
[I 2021-09-17 07:09:31,859] Trial 14 pruned.
CPU times: user 54.8 s, sys: 27.1 s, total: 1min 21s

Wall time: 22.8 s
Below we have printed the best parameter settings and the least MSE that we got using those parameters.
Best Params : {'hidden_layer_sizes': (100, 100), 'activation': 'relu', 'learning_rate': 'constant', 'learning_rate_init':
0.0035340788684059556}
Best MSE : 1.465392838316882
Below we have printed the count of total trials, trials that were pruned, and trials that were completed successfully. We have used the state of
the trial to determine whether they completed or got pruned. We can notice that 10 trials were pruned out of a total of 15 trials.

print("Finished Trials : {}".format(len([t for t in study4.trials if t.state == optuna.trial.TrialState.COMPLETE])))
print("Prunned Trials : {}".format(len([t for t in study4.trials if t.state == optuna.trial.TrialState.PRUNED])))
Total Trials : 15
Finished Trials : 5
Prunned Trials : 10
Below we have trained multilayer perceptron with the best parameter settings that we got through optuna. We are then evaluating its
performance on the train and test dataset by calculating MSE on both.
We have also created a multi-layer perceptron with default parameter setting for comparison with optuna hyperparameters combination
trained model.
mlp_regressor = MLPRegressor(**study4.best_params, random_state=123)
mlp_regressor.fit(X_train, Y_train)
print("MLP Regression MSE on Train Dataset : {}".format(mean_squared_error(Y_train, mlp_regressor.predict(X_train))))

print("MLP Regression MSE on Test Dataset : {}".format(mean_squared_error(Y_test, mlp_regressor.predict(X_test))))
MLP Regression MSE on Train Dataset : 0.7297391637455242

MLP Regression MSE on Test Dataset : 0.7772869137936619
mlp_regressor = MLPRegressor(random_state=123)


Below we have instructed the study instance to try 5 more trials to optimize an objective function. We are doing this to check whether it's able
to find hyperparameters combination which can give better results than we got during our earlier 15 trials.
17/29
%%time
[I 2021-09-17 07:09:38,756] Trial 15 pruned.

[I 2021-09-17 07:09:39,356] Trial 16 pruned.
[I 2021-09-17 07:09:39,378] Trial 17 pruned.
[I 2021-09-17 07:09:39,401] Trial 18 pruned.
[I 2021-09-17 07:09:39,438] Trial 19 pruned.
CPU times: user 1.48 s, sys: 695 ms, total: 2.18 s

Wall time: 698 ms
Below we have printed the best hyperparameter settings as usual. Then we have again trained the model with the best hyperparameters
settings that we got through optuna for verification purposes.
Best Params : {'hidden_layer_sizes': (100, 100), 'activation': 'relu', 'learning_rate': 'constant', 'learning_rate_init':
0.0035340788684059556}
Best MSE : 1.465392838316882
mlp_regressor = MLPRegressor(**study4.best_params,random_state=123)


Visualizations
As a part of this section, we'll be exploring various visualizations available through Optuna which can help us make better decisions. It gives
us inside into various hyperparameters and their impact on model performance.
We'll start by checking whether visualization support is available or not using is_available() function. It checks whether proper versions of
plotly and matplotlib are available or not for creating visualizations.
optuna.visualization.is_available()
True
Optimization History Plot

The first chart that we'll introduce is the optimization history chart. It plots the number of trials that we tried for finding the best
hyperparameters combination on the Y-axis and an objective value that we got for each trial on the Y-axis.
We can use this chart to check whether hyperparameters optimization is going in the right direction or not. This means that in our case of
regression task the value (MSE) of objective function should be minimized over time and in the case of classification value (Accuracy) of
objective function should increase.
plot_optimization_history(study,target_name='Objective Value') - This function takes as input Study object and plots
optimization history chart using plotly. We can give name of objective value which we were trying to minimize/maximize as the value of
parameter target_name.
Below we have plotted the optimization history chart using the study object that we created during the classification section of this tutorial.
optuna.visualization.plot_optimization_history(study3, target_name="Accuracy")
18/29
Below we have plotted the optimization history chart using the study object that we created in the multi-layer perceptron section. We can
notice from the output that the value of MSE is decreasing with an increase in trials. This confirms that Optuna was looking for a
hyperparameters combination in the right direction.
optuna.visualization.plot_optimization_history(study4, target_name="MSE of Median House Prices")
Below we have plotted an optimization history plot using matplotlib. Optuna provides us a majority of charts with matplotlib as backend as
well.
optuna.visualization.matplotlib.plot_optimization_history(study4, target_name="MSE of Median House Prices");
19/29
Parameter Importance Plot
The second chart that we'll plot is a bar chart representing the hyperparameter importance of hyperparameters whose combinations were tried
during the study process. This can help us understand which hyperparameters are contributing more towards minimizing/maximizing
objective value.
plot_param_importances(study, target_name='Objective Value') - This function takes as input Study object and plots bar
chart of hyperparameters importance using it.
Below we have plotted hyperparameters importance chart using study object from multi-layer perceptron model section. We can notice that
Optuna thinks that learning_rate is most important parameter to optimize followed by learning_rate_init, activation and
hidden_layer_sizes.
optuna.visualization.plot_param_importances(study4, target_name="MSE of Median House Prices")
Below we have plotted hyperparameters importance chart using study object from the classification section. It seems that solver is the most
important hyperparameter to tune from a chart.
optuna.visualization.plot_param_importances(study3, target_name="Accuracy")
20/29
HyperParameters Relationship Contour Plot
As a part of this section, we'll introduce a contour chart of the relationship between hyperparameters. It shows a relationship between different
combinations of hyperparameters and objective value for those combinations as a contour plot.
plot_contour(study,params=None,target_name='Objective Value') - This function takes as input study object and returns
contour chart showing relationship between all combinations of hyperparameters. We can provide params parameter with a list of
hyperparameters between which we want to see the relationship.
Below we have plotted contour plot using study object from multi layer perceptron section. We have plotted relationship between
hyperparameters hidden_layer_sizes and activation. We can notice from the chart that value of objective function is less where
hidden_layer_sizes is set to (50,75,100) and activation is set to identity.
optuna.visualization.plot_contour(study4, params=["hidden_layer_sizes", "activation"],

target_name="MSE of Median House Prices"
)
21/29
Below we have plotted another contour chart showing the relationship between learning_rate and learning_rate_init.
optuna.visualization.plot_contour(study4, params=["learning_rate", "learning_rate_init"],

target_name="MSE of Median House Prices"
)
Below we have plotted our third contour chart using the study object from the classification section. We have used 3 hyperparameters (penalty,
C, and solver) this time. This will create a plot with 9 contour charts where each contour chart will be showing the relationship between 2
hyperparameters.
optuna.visualization.plot_contour(study3, params=["penalty", "C", "solver"],

target_name="Accuracy"
)
22/29
HyperParameters Combinations and Objective Value Relationship Parallel Coordinates Plot
As part of this section, we'll introduce parallel coordinates chart of parameter combinations that leads to a particular value of an objective
function. The parallel coordinates chart has a single vertical line for each hyperparameter that we have tried using optuna. The vertical lines
will have different values for those parameters. Then there will be lines connecting various values of these hyperparameters showing one
combinations of these hyperparameters. The color of the line will be based on colormap which represents an objective value that we get using
those combinations of hyperparameters. The first vertical line will be representing actual values of the objective function that we were trying to
minimize/maximize.
Optuna provides plot_parallel_coordinate() function for this purpose.
plot_parallel_coordinate(study, target_name='Objective Value') - This function takes as input Study instance and creates
parallel coordinates chart showing relationship between hyperparameters combination and objective value.
Below we have created parallel coordinates plot using our Study instance from the multi-layer perceptron section. We had minimized the MSE
of median house prices in that section. We can notice below in parallel coordinates chart showing different combinations of hyperparameters
and their relationship with MSE.
optuna.visualization.plot_parallel_coordinate(study4, target_name="MSE of Median House Prices")
23/29
Below we have created another parallel coordinates chart using Study object from the classification section.
optuna.visualization.plot_parallel_coordinate(study3, target_name="Accuracy")
HyperParameters Combination and Objective Value Relationship Slice Plot
As a part of this section, we'll introduce a slice chart that shows the relationship between hyperparameter value and objective value. It has a
hyperparameter value on X-axis and an objective value on Y-axis. We have then dots for different combinations and the opacity of that dot
represents a number of trials taken to reach that objective value with that hyperparameter value of the ML model.
Optuna provides plot_slice() function for this purpose.
24/29
plot_slice(study,params=None,target_name=None) - This function takes as input Study instance and list of hyperparameter
names and then creates slice plot from it. The slice plot consists of a list of charts where an individual chart represents the relationship
between one hyperparameter and the objective value.
Below we have created a slice plot from Study object of the multi-layer perceptron section. We have created a slice plot of hyperparameters
learning_rate and learning_rate_init. We can notice from dot opacity that how many trials it took for that value of hyperparameter to
reach the value of MSE on the Y-axis.
optuna.visualization.plot_slice(study4, params=["learning_rate", "learning_rate_init"],

target_name="MSE of Median House Prices")
Below we have created a slice plot using Study instance of the classification section. We have included hyperparameters penalty, C, and
solver in the chart. It shows how many trials were taken by a particular value of hyperparameter to get particular accuracy.
optuna.visualization.plot_slice(study3, params=["penalty", "C", "solver"],

target_name="Accuracy"
)
25/29
Intermediate Values of Trials
As a part of this section, we'll introduce a chart that shows the progress of all trials on the study process. This chart shows one line per trial
showing how objective value is progressing (increasing/decreasing) during the training process of that trial. This can be useful to analyze trial
progress and why a particular set of trials were pruned. Optuna provides a method named plot_intermediate_values() for the creation of
this chart.
plot_intermediate_values(study) - This method takes as input Study instance and plots chart of lines where each line represents
the progress of the individual trial of study. The x-axis of the chart represents a number of steps of the trial and the Y-axis represents the
objective value.
The chart will have lines decreasing where we are trying to minimize objective value (MSE) and increasing where we are trying to maximize
objective value (Accuracy) over time. It'll have an entry for some trials till the end of the steps and for some till in between. The reason behind
some of the lines not running all steps of training is because they were deemed underperforming by Optuna and pruned before completion.
Below we have created an intermediate objective values chart of trials using Study object from the multi-layer perceptron section.
fig = optuna.visualization.plot_intermediate_values(study4)
fig
26/29
Below we have recreated the previous chart using matplotlib.
optuna.visualization.matplotlib.plot_intermediate_values(study4);
Empirical Distribution Function Plot
As a part of this section, we'll introduce the empirical cumulative distribution function of objective value. The chart consists of a single-step
line. The value on the X-axis represents an objective value that we are trying to minimize/maximize and Y-axis represents cumulative
probability. The cumulative probability at any point on the line represents the percentage of trials whose objective value is less than the
objective value at that point.
To explain it with an example, let’s say we take a point on the line where cumulative probability is 0.80 and objective value is 2.7. Then of all
trials that we tried as a part of the study process, 80% will have an objective value less than 2.7.
plot_edf(study,target_name='Objective Value') - This method takes as input study instance and creates eCDF chart of objective
value.
Below we have created an eCDF chart from Study instance from the multi-layer perceptron section. We can notice that MSE ranges from 0-
5.0+.
optuna.visualization.plot_edf(study4, target_name="MSE of Median House Prices")
27/29
Below we have created eCDF chart of objective value using Study instance from the classification section. The objective value for the
classification section was accuracy hence X-axis value ranges in 0-1.
optuna.visualization.plot_edf(study3, target_name="Accuracy")
Optuna Logging
As a part of this section, we'll introduce few functions which can be used to handle logging messages generated by Optuna.
Optuna by default displays all logging messages of level INFO and above. We can modify this default logging level. Optuna provides two
functions for checking and modifying logging levels.
get_verbosity() - This method returns current set logging level.

set_verbosity(level) - This method sets new logging level given to it.
28/29
If you are interested in learning about logging in python then please feel free to check our tutorial on the same. It tries to explain the topic with
simple and easy-to-understand examples.
logging - An In-Depth Guide to Log Events in Python with Simple Examples
Below we have printed the logging level which is default by optuna. The default logging level is INFO for optuna as we said above.
optuna.logging.get_verbosity()
20
optuna.logging.INFO
20
Below we have modified the logging level to WARNING from INFO. This will now suppress all messages with level INFO and below. It'll now
only print all messages with level WARNING and above.
optuna.logging.set_verbosity(optuna.logging.WARNING)
Below we have run our study object from a multi-layer perceptron section for 5 more trials. We can notice now that the info messages about
individual trials which were getting displayed earlier are suppressed now.
%%time
CPU times: user 13.3 s, sys: 5.96 s, total: 19.2 s

Wall time: 5.62 s
optuna.logging.WARNING
30
This ends our small tutorial explaining how we can use Optuna with scikit-learn models. We also covered various visualizations provided by
Optuna as a part of this tutorial. Please feel free to let us know your views in the comments section.
References
Sunny Solanki
Want to Share Your Views? Have Any Suggestions?

If you want to
provide some suggestions on topic

share your views
include some details in tutorial
suggest some new topics on which we should create tutorials/blogs
Please feel free to contact us at coderzcolumn07@gmail.com. We appreciate and value your feedbacks. You can also support us with a
small contribution by clicking HERE.
29/29

Simple Guide To Optuna For Hyperparameters OptimizationTuning

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Simple Guide To Optuna For Hyperparameters OptimizationTuning

Uploaded by

Copyright:

Available Formats

Simple Guide to Optuna for Hyperparameters Optimization/Tuning

Optuna Strategy for Optimization

Steps to use Optuna

1. Create an objective function.

Optuna is based on the concept of Study and Trial.

We'll start by importing optuna.

print("Optuna Version : {}".format(optuna.__version__))

Optuna Version : 2.9.1

Important Methods of Trial Instance

Important Methods of Study Object

[I 2021-09-17 07:06:51,088] A new study created in memory with name: MinimizeFunction

Found x: 4.403293394832803, (5*x - 21): 1.016466974164011

FrozenTrial(number=7, values=[1.016466974164011], datetime_start=datetime.datetime(2021, 9, 17, 7, 6, 51, 112203),

print("Total Trials : {}".format(len(study1.trials)))

number value datetime_start datetime_complete duration params_x state

0 13.841007 2021-09-17 2021-09-17 0 days 1.431799 COMPLETE

1 19.763319 2021-09-17 2021-09-17 0 days 0.247336 COMPLETE

2 12.105440 2021-09-17 2021-09-17 0 days 1.778912 COMPLETE

3 1.359436 2021-09-17 2021-09-17 0 days 4.471887 COMPLETE

4 2.072014 2021-09-17 2021-09-17 0 days 4.614403 COMPLETE

5 2.623199 2021-09-17 2021-09-17 0 days 3.675360 COMPLETE

6 15.832072 2021-09-17 2021-09-17 0 days 1.033586 COMPLETE

7 1.016467 2021-09-17 2021-09-17 0 days 4.403293 COMPLETE

8 1.504497 2021-09-17 2021-09-17 0 days 3.899101 COMPLETE

9 9.916126 2021-09-17 2021-09-17 0 days 2.216775 COMPLETE

Below we have executed 15 more trials using optimize().

print("Total Trials : {}".format(len(study1.trials)))

Found x: 4.233333191490665, (5*x - 21): 0.16666595745332557

from sklearn.linear_model import Ridge, LogisticRegression

X,Y = boston.data, boston.target

boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)

506 rows × 14 columns

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, train_size=0.80, random_state=123)

X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((404, 13), (102, 13), (404,), (102,))

return mean_squared_error(Y_test, regressor.predict(X_test))

[I 2021-09-17 07:06:52,922] A new study created in memory with name: RidgeRegression

print("Best Params : {}".format(study2.best_params))

print("\nBest MSE : {}".format(study2.best_value))

Best MSE : 28.73396591701217

print("Ridge Regression MSE on Train Dataset : {}".format(mean_squared_error(Y_train, ridge.predict(X_train))))

Ridge Regression MSE on Train Dataset : 26.041719809438128

print("Ridge Regression MSE on Train Dataset : {}".format(mean_squared_error(Y_train, ridge.predict(X_train))))

Ridge Regression MSE on Train Dataset : 20.82386585083267

print("Best Params : {}".format(study2.best_params))

print("\nBest MSE : {}".format(study2.best_value))

Best MSE : 28.732108565206946

print("Ridge Regression MSE on Train Dataset : {}".format(mean_squared_error(Y_train, ridge.predict(X_train))))

Ridge Regression MSE on Train Dataset : 26.04165999021236

Comparison with Grid Search

Scikit-Learn - Cross-Validation & Hyperparameter Tuning Using GridSearch

Grid Search without Parallelization

param_grid = {"alpha" : np.linspace(0, 10, 25),

grid = GridSearchCV(Ridge(), param_grid, cv=5)

CPU times: user 46.9 s, sys: 59.2 ms, total: 47 s

{'alpha': 0.0, 'fit_intercept': True, 'solver': 'svd', 'tol': 0.001}

print("Ridge Regression MSE on Train Dataset : {}".format(mean_squared_error(Y_train, ridge.predict(X_train))))

Ridge Regression MSE on Train Dataset : 20.67710794781513

Grid Search with Parallelization

param_grid = {"alpha" : np.linspace(0, 10, 25),

grid = GridSearchCV(Ridge(), param_grid, cv=5, n_jobs=-1)

CPU times: user 2.29 s, sys: 37.6 ms, total: 2.32 s

print("Optuna Version : {}".format(optuna.version))