Bayesian Optimization For Accelerating Hyper-Parameter Tuning
I. INTRODUCTION
In many scientific and engineering applications, we need to deal with design choices, for example in algorithm hyper-parameter tuning [1] or engineering design [2], [3]. This design process can be seen as optimizing a black-box function which is expensive to evaluate for economic or computational reasons. This requires us to optimize a non-concave objective function using sequential and noisy observations. Critically, the objective functions are unknown and expensive to evaluate. The challenge is to find the maximum of such expensive objective functions in a few sequential queries, to minimize time and cost.

Bayesian optimization (BO) [4], [5], [6] is an approach to optimizing such expensive functions. BO finds a solution of an expensive black-box function, x∗ = arg max_{x∈X} f(x), by making a series of evaluations x_1, ..., x_T of f such that the optimum of f is found in the fewest iterations. BO is best suited to optimization over continuous domains of fewer than 20 dimensions, and it tolerates stochastic noise in function evaluations. It builds a surrogate for the objective and quantifies the uncertainty in that surrogate using a Gaussian process [7]. BO then uses an acquisition function defined from this surrogate to suggest the next sample.

In this paper, we present our research summary in BO, in which our contributions can be summarized as follows:
• Formulation of novel settings in Bayesian optimization, including budgeted batch and weakly specified space.
• Study of the theoretical properties and convergence analysis.
• Applications to advanced material optimization in aeronautical alloy design, heat-treatment design, short polymer fiber, 3D printing, and algorithmic assurance.
• Comprehensive comparison with existing techniques using benchmark functions and real applications.
• Identification of several future research directions.

Figure 1. Example of Gaussian process and Bayesian optimization, suggesting the next optimal point to evaluate given three initial observations.

II. BAYESIAN OPTIMIZATION

We use f to denote a black-box function for which we have no closed-form expression. Furthermore, this black-box function is expensive to evaluate. Formally, let f : X → R be a well-behaved function defined on a subset X ⊆ R^d. Our goal is to solve the following global optimization problem

    x∗ = arg max_{x∈X} f(x).    (1)

Bayesian optimization aims to find the global optimum of the black-box function f(x) by constructing a probabilistic model for f(x) and then exploiting this model to make decisions about where in X to evaluate the function next, while integrating out uncertainty. This results in a procedure that can find the optimum of difficult non-convex functions with relatively few evaluations, at the cost of performing more computation to determine the next point to try [8], [9], [6]. We summarize the illustration of BO in Fig. 1 and the routine in Alg. 1.

A. Gaussian process

Bayesian optimization reasons about f by building a Gaussian process [7] through evaluations. This flexible distribution allows us to associate a normally distributed random variable with every point in the continuous input space.
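The predictive equations that follow (Eqs. (2)–(3)) can be computed directly from the observed data. Below is a minimal NumPy sketch; the squared-exponential kernel and the small noise term are our illustrative choices, not choices fixed by this paper:

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel: k(a, b) = exp(-||a - b||^2 / (2 l^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior(X, y, X_new, noise=1e-6):
    """Predictive mean (Eq. 2) and variance (Eq. 3) of a GP at X_new."""
    K_inv = np.linalg.inv(sq_exp_kernel(X, X) + noise * np.eye(len(X)))  # K(X, X)^-1
    k_star = sq_exp_kernel(X_new, X)                                     # k(x, X)
    mu = k_star @ K_inv @ y
    # diag of k(x, x) - k(x, X) K(X, X)^-1 k(x, X)^T
    var = sq_exp_kernel(X_new, X_new).diagonal() - np.einsum(
        "ij,jk,ik->i", k_star, K_inv, k_star)
    return mu, var

# Three initial observations, as in Fig. 1
X = np.array([[0.1], [0.5], [0.9]])
y = np.sin(3.0 * X).ravel()
mu, var = gp_posterior(X, y, np.array([[0.5], [0.7]]))
# At an observed point the GP reproduces the data with near-zero variance.
```

In practice the inverse would be replaced by a Cholesky solve for numerical stability, but the direct form mirrors the equations one-to-one.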
We get the predictive distribution for a new observation x, which also follows a Gaussian distribution [7]; its mean and variance are given by

    μ(x) = k(x, X) K(X, X)^-1 y    (2)

    σ²(x) = k(x, x) − k(x, X) K(X, X)^-1 k(x, X)^T    (3)

where K(U, V) is a covariance matrix whose element (i, j) is calculated as k_{i,j} = k(x_i, x_j) with x_i ∈ U and x_j ∈ V.

Algorithm 1 Bayesian Optimization
Input: initial observations D_0, budget T
1: for t = 1, ..., T do
2:    Fit a GP to D_{t−1} and select x_t = arg max_{x∈X} α_t(x)
3:    Evaluate y_t = f(x_t) and D_t = D_{t−1} ∪ (x_t, y_t)
4: end for
Output: x_max, y_max

B. Acquisition functions

As the original function f(x) is expensive to evaluate, we build a cheaper function, called the acquisition function α(x), from the surrogate model to determine the next point to evaluate. Instead of maximizing the original function, we maximize the acquisition function to select the next point x_{t+1} = arg max_{x∈X} α_t(x). In this auxiliary maximization problem, the form of the acquisition function is known and can easily be optimized by standard numerical techniques. Acquisition functions are carefully designed to trade off exploration of the search space against exploitation of currently promising regions. Although many acquisition functions have been proposed [4], [10], [11], no single acquisition strategy provides the best performance over all problems.

III. SUMMARY OF RECENT RESEARCH

We now summarize our recent research in Bayesian optimization, including novel techniques and applications. In particular, we present the research problems investigated and a summary of the contributions made.

A. Budgeted batch BO with unknown batch sizes

In situations where the black-box function can be evaluated at multiple points simultaneously, batch Bayesian optimization is desirable. Current batch BO approaches are restrictive in that they fix the number of evaluations per batch, and this can be wasteful when the number of specified evaluations is larger than the number of real maxima in the underlying acquisition function. We propose budgeted batch Bayesian optimization (B3O) [12] to identify the appropriate batch size for each iteration in an elegant way. To make the batch size flexible, we use the infinite Gaussian mixture model (IGMM) to automatically identify the number of peaks in the underlying acquisition function. We solve the intractability of estimating the IGMM directly from the acquisition function by formulating batch generalized slice sampling to efficiently draw samples.

B. Insight analysis and stopping condition for BO

Expected improvement (EI) [13] is one of the most widely used acquisition functions for BO; it takes the expectation of the improvement function over the incumbent. The incumbent is usually selected as the best observed value so far, termed y_max (for a maximization problem). Recent work has studied the convergence rate of EI under mild assumptions or zero observation noise. In particular, [14] derived a sublinear regret bound for EI under stochastic noise. However, due to the difficulty of the stochastic noise setting, and to make the convergence proof feasible, they use an alternative choice of incumbent: the maximum of the Gaussian process predictive mean, μ_max. This modification makes the algorithm computationally inefficient, because it requires an additional global optimization step to estimate μ_max, which is costly and may be inaccurate.

To address this issue, we derive a sublinear convergence rate for EI using the commonly used y_max. Moreover, our analysis is the first to study a stopping criterion for EI to prevent unnecessary evaluations. Our analysis complements the results of [14] to theoretically cover both incumbent settings for EI. Finally, we demonstrate empirically that EI using y_max is both more computationally efficient and more accurate than EI using μ_max. We present our findings and experiments in [15].

We then propose another strategy for EI, based on the observation that expected improvement can over-exploit once it hits a local optimum. We propose a modification to EI that allows for increased early exploration while providing similar exploitation once the space has been suitably explored. We call our approach exploration enhanced expected improvement (E3I) [16]. In addition, we prove that our method has a sublinear convergence rate, and we test it on a range of functions to compare its performance against standard EI and other competing methods.

C. Filtering BO approach in weakly specified search space

Established Bayesian optimization approaches always require a user-defined space in which to perform optimization. This pre-defined space specifies the ranges of hyper-parameter values. In many situations, however, it can be difficult to prescribe such spaces, as prior knowledge is often unavailable. Setting these regions arbitrarily can lead to inefficient optimization: if a space is too large, we may miss the optimum with a limited budget; on the other hand, if a space is too small, it may not contain the optimum point that we want to find. The unknown search space problem is intractable to solve in practice. Therefore, we narrow our focus to the setting of a "weakly specified" search space for BO, presented in [17], [18]. By a weakly specified space, we mean that the pre-defined space is placed in a sufficiently good region so that the optimization can expand from it and reach the optimum. However, this pre-defined space need not include the global optimum.
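For reference, the EI acquisition discussed in Sec. III-B has the standard closed form EI(x) = (μ(x) − y_max)Φ(z) + σ(x)φ(z) with z = (μ(x) − y_max)/σ(x) [13]. A short SciPy sketch, using the best observed value y_max as the incumbent (the candidate values below are purely illustrative):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_max):
    """Closed-form EI with the best observed value y_max as the incumbent."""
    sigma = np.maximum(sigma, 1e-12)   # guard against zero predictive variance
    z = (mu - y_max) / sigma
    return (mu - y_max) * norm.cdf(z) + sigma * norm.pdf(z)

# GP predictions at three candidate points (illustrative numbers only)
mu = np.array([0.2, 0.8, 1.1])
sigma = np.array([0.5, 0.3, 0.05])
ei = expected_improvement(mu, sigma, y_max=1.0)
x_next = int(np.argmax(ei))   # candidate with the largest expected improvement
```

Note how the third candidate wins here despite its small variance, because its mean already exceeds the incumbent; EI trades off exactly this exploitation term against the exploration term σφ(z).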
However, these approaches are not yet ideal for optimizing less expensive functions, where the computational cost of BO can dominate the cost of evaluating the black-box function. Examples of such less expensive functions are cheap machine learning models and inexpensive physical experiments run through simulators. In this work, we consider a new batch BO setting for situations where function evaluations are less expensive. Our model, called UCB-DE, is based on a new exploration strategy using geometric distance that provides an alternative way to explore, selecting a point far from the observed locations. Using that intuition, we propose to use a Sobol sequence to guide exploration, which removes the need to run multiple global optimization steps as in previous works. Based on the proposed distance exploration, we present an efficient batch BO approach which outperforms other baselines and global optimization methods when the function evaluations are less expensive (see Fig. 4).

B. Knowing the what but not the where in BO

Bayesian optimization has demonstrated impressive success in finding the optimum location x∗ and value f∗ = f(x∗) = max_{x∈X} f(x) of the black-box function f. In some applications, however, the optimum value is known in advance, and the goal is to find the corresponding optimum location. Existing work in Bayesian optimization (BO) has not effectively exploited the knowledge of f∗ for optimization. In this work, we consider a new setting in BO in which knowledge of the optimum value is available. Our goal is to exploit the knowledge about f∗ to search for the location x∗ efficiently.

C. Efficient optimization with training curves

Many ML models require running an iterative training procedure for some number of iterations, such as stochastic gradient descent [24], [25] and (deep) reinforcement learning [26]. This iterative training process is expensive in terms of time. For example, it takes roughly 75 hours to train an agent to play the Atari Breakout game [27]. The training curves can bring useful information about the training process, which can be exploited to guide the search efficiently. We propose a new Bayesian optimization method that learns on the joint space of input parameters and training iterations, i.e., the number of episodes (in a reinforcement learning context). Using this joint optimization, our algorithm utilizes cheap evaluations at lower fidelity, training with fewer iterations, to gain information about the joint space. Then, our algorithm identifies and invests more computational resources in the promising hyper-parameters. Our approach finds the optimal hyper-parameters whilst requiring lower training time.

REFERENCES

[1] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, and R. Adams, "Scalable Bayesian optimization using deep neural networks," in Proceedings of the 32nd International Conference on Machine Learning, 2015, pp. 2171–2180.
[2] T. Dai Nguyen, S. Gupta, S. Rana, V. Nguyen, S. Venkatesh, K. J. Deane, and P. G. Sanders, "Cascade Bayesian optimization," in Australasian Joint Conference on Artificial Intelligence. Springer, 2016, pp. 268–280.
[3] P. V. Balachandran, D. Xue, J. Theiler, J. Hogden, and T. Lookman, "Adaptive strategies for materials design using uncertainties," Scientific Reports, vol. 6, 2016.
[4] P. Hennig and C. J. Schuler, "Entropy search for information-efficient global optimization," Journal of Machine Learning Research, vol. 13, pp. 1809–1837, 2012.
[5] S. Rana, C. Li, S. Gupta, V. Nguyen, and S. Venkatesh, "High dimensional Bayesian optimization with elastic Gaussian process," in Proceedings of the 34th International Conference on Machine Learning (ICML), 2017, pp. 2883–2891.
[6] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas, "Taking the human out of the loop: A review of Bayesian optimization," Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2016.
[7] C. E. Rasmussen, "Gaussian processes for machine learning," 2006.
[8] E. Brochu, V. M. Cora, and N. de Freitas, "A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning," arXiv preprint arXiv:1012.2599, 2010.
[9] J. Snoek, H. Larochelle, and R. P. Adams, "Practical Bayesian optimization of machine learning algorithms," in Advances in Neural Information Processing Systems, 2012, pp. 2951–2959.
[10] N. Srinivas, A. Krause, S. Kakade, and M. Seeger, "Gaussian process optimization in the bandit setting: No regret and experimental design," in Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 1015–1022.
[11] V. Nguyen, S. Gupta, S. Rana, C. Li, and S. Venkatesh, "Think globally, act locally: a local strategy for Bayesian optimization," in Workshop on Bayesian Optimization at Neural Information Processing Systems (NIPSW), 2016.
[12] V. Nguyen, S. Rana, S. K. Gupta, C. Li, and S. Venkatesh, "Budgeted batch Bayesian optimization," in 16th International Conference on Data Mining (ICDM), 2016, pp. 1107–1112.
[13] D. R. Jones, M. Schonlau, and W. J. Welch, "Efficient global optimization of expensive black-box functions," Journal of Global Optimization, vol. 13, no. 4, pp. 455–492, 1998.
[14] Z. Wang and N. de Freitas, "Theoretical analysis of Bayesian optimisation with unknown Gaussian process hyper-parameters," arXiv preprint arXiv:1406.7758, 2014.
[15] V. Nguyen, S. Gupta, S. Rana, C. Li, and S. Venkatesh, "Regret for expected improvement over the best-observed value and stopping condition," in Proceedings of the 9th Asian Conference on Machine Learning (ACML), 2017, pp. 279–294.
[16] J. Berk, V. Nguyen, S. Gupta, S. Rana, and S. Venkatesh, "Exploration enhanced expected improvement for Bayesian optimization," in Machine Learning and Knowledge Discovery in Databases. Springer, 2018.
[17] V. Nguyen, S. Gupta, S. Rana, C. Li, and S. Venkatesh, "Bayesian optimization in weakly specified search space," in IEEE 17th International Conference on Data Mining (ICDM), 2017.
[18] ——, "Filtering Bayesian optimization approach in weakly specified search space," Knowledge and Information Systems (KAIS), 2018.
[19] P. Lequeu, K. Smith, and A. Daniélou, "Aluminum-copper-lithium alloy 2050 developed for medium to thick plate," Journal of Materials Engineering and Performance, vol. 19, no. 6, pp. 841–847, 2010.
[20] R. Wagner, R. Kampmann, and P. W. Voorhees, "Homogeneous second-phase precipitation," Materials Science and Technology, 1991.
[21] C. Li, R. Santu, S. Gupta, V. Nguyen, S. Venkatesh, A. Sutti, D. R. D. C. Leal, T. Slezak, M. Height, M. Mohammed et al., "Accelerating experimental design by incorporating experimenter hunches," in IEEE International Conference on Data Mining (ICDM), 2018, pp. 257–266.
[22] S. Gopakumar, S. Gupta, S. Rana, V. Nguyen, and S. Venkatesh, "Algorithmic assurance: An active approach to algorithmic testing using Bayesian optimisation," in Advances in Neural Information Processing Systems, 2018, pp. 5465–5473.
[23] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, "The nonstochastic multiarmed bandit problem," SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, 2002.
[24] L. Bottou, "Stochastic gradient learning in neural networks," Proceedings of Neuro-Nîmes, vol. 91, no. 8, p. 12, 1991.
[25] T. Le, V. Nguyen, T. D. Nguyen, and D. Phung, "Nonparametric budgeted stochastic gradient descent," in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016, pp. 654–572.
[26] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[27] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.