
2019 IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering (AIKE)

Bayesian Optimization for Accelerating Hyper-parameter Tuning
Vu Nguyen
University of Oxford, United Kingdom
vu@robots.ox.ac.uk

Abstract—Bayesian optimization (BO) has recently emerged as a powerful and flexible tool for hyper-parameter tuning and, more generally, for the efficient global optimization of expensive black-box functions. Systems implementing BO have successfully solved difficult problems in automatic design choices and machine learning hyper-parameter tuning. Many recent advances in the methodologies and theories underlying Bayesian optimization have extended the framework to new applications and provided greater insights into the behavior of these algorithms. In this paper, we summarize our recent research in Bayesian optimization, highlight our contributions and present future research directions.

978-1-7281-1488-0/19/$31.00 ©2019 IEEE
DOI 10.1109/AIKE.2019.00060

I. INTRODUCTION
In many scientific and engineering applications we must make design choices, for example in algorithm hyper-parameter tuning [1] or engineering design [2], [3]. This design process can be seen as optimizing a black-box function that is expensive to evaluate for economic or computational reasons. It requires us to optimize a non-concave objective function using sequential and noisy observations. Critically, the objective function is unknown and expensive to evaluate. The challenge is to find the maximum of such an expensive objective function in a few sequential queries, minimizing time and cost.

Bayesian optimization (BO) [4], [5], [6] is an approach to optimizing such expensive functions. BO seeks a solution $x^* = \arg\max_{x \in \mathcal{X}} f(x)$ of an expensive black-box function by making a series of evaluations $x_1, ..., x_T$ of $f$ such that the optimum of $f$ is found in the fewest iterations. BO is best suited to optimization over continuous domains of fewer than 20 dimensions, and it tolerates stochastic noise in function evaluations. It builds a surrogate for the objective and quantifies the uncertainty in that surrogate using a Gaussian process [7]. BO then uses an acquisition function defined from this surrogate to suggest the next sample.

In this paper, we summarize our research in BO. Our contributions are as follows:
• Formulation of novel settings in Bayesian optimization, including budgeted batch and weakly specified search space.
• Study of theoretical properties and convergence analysis.
• Applications to advanced material optimization in aeronautical alloy design, heat-treatment design, short polymer fiber and 3D printing, and to algorithmic assurance.
• Comprehensive comparison with existing techniques using benchmark functions and real applications.
• Identification of several future research directions.

II. BAYESIAN OPTIMIZATION
We use $f$ to denote a black-box function for which we have no closed-form expression. Furthermore, this black-box function is expensive to evaluate. Formally, let $f : \mathcal{X} \to \mathbb{R}$ be a well-behaved function defined on a subset $\mathcal{X} \subseteq \mathbb{R}^d$. Our goal is to solve the following global optimization problem

$$x^* = \arg\max_{x \in \mathcal{X}} f(x). \qquad (1)$$

Bayesian optimization aims to find the global optimum of the black-box function $f(x)$ by constructing a probabilistic model for $f(x)$ and then exploiting this model to decide where in $\mathcal{X}$ to evaluate the function next, while integrating out uncertainty. This results in a procedure that can find the maximum of difficult non-concave functions with relatively few evaluations, at the cost of performing more computation to determine the next point to try [8], [9], [6]. We illustrate BO in Fig. 1 and summarize the routine in Alg. 1.

Figure 1. Example of a Gaussian process and Bayesian optimization, which suggests the next point to evaluate given three initial observations.

A. Gaussian process
Bayesian optimization reasons about $f$ by building a Gaussian process [7] from the evaluations. This flexible distribution associates a normally distributed random variable with every point in the continuous input space. The predictive distribution for a new observation $x'$ also follows a Gaussian distribution [7]; its mean and variance are given by

$$\mu(x') = k(x', X)\, K(X, X)^{-1} y \qquad (2)$$

$$\sigma^2(x') = k(x', x') - k(x', X)\, K(X, X)^{-1} k(x', X)^T \qquad (3)$$

where $K(U, V)$ is a covariance matrix whose element $(i, j)$ is calculated as $k_{i,j} = k(x_i, x_j)$ with $x_i \in U$ and $x_j \in V$.

B. Acquisition functions
As the original function $f(x)$ is expensive to evaluate, we build a cheaper function from the surrogate model, called the acquisition function $\alpha(x)$, to determine the next point to evaluate. Instead of maximizing the original function, we maximize the acquisition function to select the next point $x_{t+1} = \arg\max_{x \in \mathcal{X}} \alpha_t(x)$. In this auxiliary maximization problem, the form of the acquisition function is known and it can be optimized easily by standard numerical techniques. Acquisition functions are carefully designed to trade off exploration of the search space against exploitation of currently promising regions. Although many acquisition functions have been proposed [4], [10], [11], no single acquisition strategy provides the best performance over all problems.

Algorithm 1 Bayesian Optimization.
Input: initial data $D_0$, #iter $T$
1: for $t = 1$ to $T$ do
2:   Fit a GP from $D_{t-1}$ and obtain $x_t = \arg\max \alpha(x)$
3:   Evaluate $y_t = f(x_t)$ and $D_t = D_{t-1} \cup (x_t, y_t)$
4: end for
Output: $x_{max}$, $y_{max}$

III. SUMMARY OF RECENT RESEARCH
We summarize our recent research in Bayesian optimization, including novel techniques and applications. In particular, we present each research problem investigated and the contributions made.

A. Budgeted batch BO with unknown batch sizes
In situations where the black-box function can be evaluated at multiple points simultaneously, batch Bayesian optimization is desirable. Current batch BO approaches are restrictive in that they fix the number of evaluations per batch, which can be wasteful when the number of specified evaluations is larger than the number of real maxima in the underlying acquisition function. We propose budgeted batch Bayesian optimization (B3O) [12] to identify the appropriate batch size for each iteration. To make the batch size flexible, we use the infinite Gaussian mixture model (IGMM) to automatically identify the number of peaks in the underlying acquisition function. We solve the intractability of estimating the IGMM directly from the acquisition function by formulating batch generalized slice sampling to efficiently draw samples from the acquisition function. We show empirically that the proposed B3O outperforms existing fixed-batch BO approaches in finding the optimum whilst requiring fewer evaluations, thus saving cost and time.

B. Insight analysis and stopping condition for BO
Expected improvement (EI) [13] is one of the most widely used acquisition functions for BO; it computes the expectation of the improvement function over the incumbent. The incumbent is usually selected as the best value observed so far, termed $y_{max}$ (for a maximization problem). Recent work has studied the convergence rate of EI under some mild assumptions or with noise-free observations. In particular, the work of [14] derived a sublinear regret bound for EI under stochastic noise. However, because of the difficulty of the stochastic noise setting, and to make the convergence proof feasible, they use an alternative choice for the incumbent: the maximum of the Gaussian process predictive mean, $\mu_{max}$. This modification makes the algorithm computationally inefficient because it requires an additional global optimization step to estimate $\mu_{max}$, which is costly and may be inaccurate.

To address this issue, we derive a sublinear convergence rate for EI using the commonly used $y_{max}$. Moreover, our analysis is the first to study a stopping criterion for EI to prevent unnecessary evaluations. Our analysis complements the results of [14], so that the theory covers both incumbent settings for EI. Finally, we demonstrate empirically that EI using $y_{max}$ is both more computationally efficient and more accurate than EI using $\mu_{max}$. We present our findings and experiments in [15].

We then propose another strategy for EI, based on the observation that expected improvement can over-exploit when it hits a local optimum. We propose a modification to EI that allows increased early exploration while providing similar exploitation once the system has been suitably explored. We call our approach exploration enhanced expected improvement (E3I) [16]. In addition, we prove that our method has a sublinear convergence rate and test it on a range of functions to compare its performance against standard EI and other competing methods.

C. Filtering BO approach in weakly specified search space
Established Bayesian optimization approaches always require a user-defined space in which to perform optimization. This pre-defined space specifies the ranges of hyper-parameter values. In many situations, however, it can be difficult to prescribe such spaces, as prior knowledge is often unavailable. Setting these regions arbitrarily can lead to inefficient optimization: if the space is too large, we can miss the optimum with a limited budget; if the space is too small, it may not contain the optimum we seek. The unknown search space problem is intractable to solve in practice. Therefore, we narrow our focus to the setting of a “weakly specified” search space for BO, presented in [17], [18]. By weakly specified space, we mean that the pre-defined space is placed at a sufficiently good region so that the optimization can expand and reach the optimum. However, this pre-defined space need not include the global
optimum. We tackle this problem by proposing a filtering expansion strategy for Bayesian optimization. Our approach starts from the initial region and gradually expands the search space. We develop an efficient algorithm for this strategy and derive its regret analysis. We illustrate our setting in Fig. 2 and the space expansion behavior for alloy development in Fig. 3.

Figure 2. We start the optimization at the initial region, which need not contain the optimum (peak), and then gradually expand the search space to find the optimum. [Figure labels: Peak, Initial Region.]

Figure 3. The space expansion behavior in alloy design. We record the search space $X_t$ at iterations {0, 10, 25, 40} for four elements {Cu, Li, Mg, Zr}; the initial region $X_0$ is provided by our metallurgist collaborators. The y-axis indicates the upper and lower ranges of the space (in percent) for each element in the AA-2050 composition at the considered iterations (best viewed in color).

D. Real-world Applications
Our Bayesian optimization techniques have been applied to several problems in materials development, machine learning hyper-parameter tuning and algorithmic assurance.

1) Materials Development:
a) Low density alloy AA-2050 optimization: We consider the low-density alloy AA-2050 [19] used in the aeronautical industry. The alloy consists of 8 elements (Al, Cu, Li, Mg, Zr, Sc, Si and Fe). We aim to find the AA-2050 composition that achieves desired properties, such as low density and high corrosion resistance. The desired property of the alloy is defined by a utility score that combines maximizing the good phases with minimizing the bad phases at the equilibrium of a heat-treatment process. We refer interested readers to our papers [17], [18] for details.

b) Alloy heat-treatment optimization: We next optimize the hardening process of aluminum-scandium alloys [20], which consists of two stages: nucleation and precipitation coarsening. We aim to maximize the hardness of aluminum-scandium alloys by choosing appropriate times and temperatures for the two stages of the process. These experiments can be found in [12], [2].

c) Short polymer fiber and designing of porous scaffolding: With the development of advanced 3D printing processes, complex scaffolds are becoming a favorable feature in product design, with applications ranging from topology optimization to tissue engineering structures. Deriving precise solutions for the overall porosity of a resulting scaffold can be problematic and typically requires laborious trial-and-error approaches. We summarize our experiments in [21].

2) Algorithmic assurance: We introduce algorithmic assurance [22], the problem of testing whether machine learning algorithms conform to their intended design goals. We address this problem by proposing an efficient framework for algorithmic testing. To provide assurance, we need to efficiently discover scenarios where an algorithm's decision deviates maximally from its intended gold standard. We formulate this task mathematically as the optimization of an expensive black-box function and solve it with an active learning approach based on Bayesian optimization. We extend this framework to algorithms with vector-valued outputs by modifying Bayesian optimization via the EXP3 [23] algorithm. We analyze the convergence of our methods theoretically and demonstrate their efficiency on two real-world applications. The significance of our problem formulation and initial solutions is that they can serve as a foundation for assuring humans about machines making complex decisions.

Figure 4. Comparison of our proposed UCB-DE against the baselines, where D is the dimension and B is the batch size. Our approach outperforms the baselines on the time axis when the black-box evaluation is less expensive, and achieves comparable performance on the iteration axis when the black-box evaluation is expensive.

IV. FUTURE RESEARCH
We present three future research directions that continue the research summarized in the previous section.

A. Practical batch BO with less expensive black box functions
Traditional Bayesian optimization and its batch counterpart are successful for optimizing expensive black-box functions.
However, these approaches are not yet ideal for optimizing less expensive functions, where the computational cost of BO can dominate the cost of evaluating the black-box function. Examples of such less expensive functions are cheap machine learning models and inexpensive physical experiments run through simulators. In this work, we consider a new batch BO setting for situations where function evaluations are less expensive. Our model, called UCB-DE, is based on a new exploration strategy using geometric distance, which provides an alternative way to explore by selecting a point far from the observed locations. Using that intuition, we propose to use a Sobol sequence to guide exploration, which avoids running the multiple global optimization steps used in previous work. Based on the proposed distance exploration, we present an efficient batch BO approach that outperforms other baselines and global optimization methods when function evaluations are less expensive (see Fig. 4).

B. Knowing the what but not the where in BO
Bayesian optimization has demonstrated impressive success in finding the optimum location $x^*$ and value $f^* = f(x^*) = \max_{x \in \mathcal{X}} f(x)$ of the black-box function $f$. In some applications, however, the optimum value is known in advance and the goal is to find the corresponding optimum location. Existing work in Bayesian optimization has not effectively exploited knowledge of $f^*$ for optimization. In this work, we consider a new setting in BO in which the optimum value is known. Our goal is to exploit the knowledge of $f^*$ to search for the location $x^*$ efficiently.

C. Efficient optimization with training curves
Many ML models require running an iterative training procedure for some number of iterations, such as stochastic gradient descent [24], [25] and (deep) reinforcement learning [26]. This iterative training process is expensive in terms of time. For example, it takes roughly 75 hours to train an agent to play the Atari game Breakout [27]. The training curves carry useful information about the training process, which can be exploited to guide the search efficiently. We propose a new Bayesian optimization approach that learns on the joint space of input parameters and training iterations, i.e. the number of episodes in a reinforcement learning context. Using this joint optimization, our algorithm will use cheap evaluations at lower fidelity, obtained by training with fewer iterations, to gain information about the joint space. Our algorithm will then identify the promising hyper-parameters and invest more computational resources in them. Our approach will be able to find the optimal hyper-parameters whilst requiring less training time.

REFERENCES
[1] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, and R. Adams, “Scalable Bayesian optimization using deep neural networks,” in Proceedings of the 32nd International Conference on Machine Learning, 2015, pp. 2171–2180.
[2] T. Dai Nguyen, S. Gupta, S. Rana, V. Nguyen, S. Venkatesh, K. J. Deane, and P. G. Sanders, “Cascade Bayesian optimization,” in Australasian Joint Conference on Artificial Intelligence. Springer, 2016, pp. 268–280.
[3] P. V. Balachandran, D. Xue, J. Theiler, J. Hogden, and T. Lookman, “Adaptive strategies for materials design using uncertainties,” Scientific Reports, vol. 6, 2016.
[4] P. Hennig and C. J. Schuler, “Entropy search for information-efficient global optimization,” Journal of Machine Learning Research, vol. 13, pp. 1809–1837, 2012.
[5] S. Rana, C. Li, S. Gupta, V. Nguyen, and S. Venkatesh, “High dimensional Bayesian optimization with elastic Gaussian process,” in Proceedings of the 34th International Conference on Machine Learning (ICML), 2017, pp. 2883–2891.
[6] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas, “Taking the human out of the loop: A review of Bayesian optimization,” Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2016.
[7] C. E. Rasmussen, “Gaussian processes for machine learning,” 2006.
[8] E. Brochu, V. M. Cora, and N. de Freitas, “A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning,” arXiv preprint arXiv:1012.2599, 2010.
[9] J. Snoek, H. Larochelle, and R. P. Adams, “Practical Bayesian optimization of machine learning algorithms,” in Advances in Neural Information Processing Systems, 2012, pp. 2951–2959.
[10] N. Srinivas, A. Krause, S. Kakade, and M. Seeger, “Gaussian process optimization in the bandit setting: No regret and experimental design,” in Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 1015–1022.
[11] V. Nguyen, S. Gupta, S. Rana, C. Li, and S. Venkatesh, “Think globally, act locally: a local strategy for Bayesian optimization,” in Workshop on Bayesian Optimization at Neural Information Processing Systems (NIPSW), 2016.
[12] V. Nguyen, S. Rana, S. K. Gupta, C. Li, and S. Venkatesh, “Budgeted batch Bayesian optimization,” in 16th International Conference on Data Mining (ICDM), 2016, pp. 1107–1112.
[13] D. R. Jones, M. Schonlau, and W. J. Welch, “Efficient global optimization of expensive black-box functions,” Journal of Global Optimization, vol. 13, no. 4, pp. 455–492, 1998.
[14] Z. Wang and N. de Freitas, “Theoretical analysis of Bayesian optimisation with unknown Gaussian process hyper-parameters,” arXiv preprint arXiv:1406.7758, 2014.
[15] V. Nguyen, S. Gupta, S. Rana, C. Li, and S. Venkatesh, “Regret for expected improvement over the best-observed value and stopping condition,” in Proceedings of The 9th Asian Conference on Machine Learning (ACML), 2017, pp. 279–294.
[16] J. Berk, V. Nguyen, S. Gupta, S. Rana, and S. Venkatesh, “Exploration enhanced expected improvement for Bayesian optimization,” in Machine Learning and Knowledge Discovery in Databases. Springer, 2018.
[17] V. Nguyen, S. Gupta, S. Rana, C. Li, and S. Venkatesh, “Bayesian optimization in weakly specified search space,” in IEEE 17th International Conference on Data Mining (ICDM), 2017.
[18] V. Nguyen, S. Gupta, S. Rana, C. Li, and S. Venkatesh, “Filtering Bayesian optimization approach in weakly specified search space,” Knowledge and Information Systems (KAIS), 2018.
[19] P. Lequeu, K. Smith, and A. Daniélou, “Aluminum-copper-lithium alloy 2050 developed for medium to thick plate,” Journal of Materials Engineering and Performance, vol. 19, no. 6, pp. 841–847, 2010.
[20] R. Wagner, R. Kampmann, and P. W. Voorhees, “Homogeneous second-phase precipitation,” Materials Science and Technology, 1991.
[21] C. Li, R. Santu, S. Gupta, V. Nguyen, S. Venkatesh, A. Sutti, D. R. D. C. Leal, T. Slezak, M. Height, M. Mohammed et al., “Accelerating experimental design by incorporating experimenter hunches,” in IEEE International Conference on Data Mining (ICDM), 2018, pp. 257–266.
[22] S. Gopakumar, S. Gupta, S. Rana, V. Nguyen, and S. Venkatesh, “Algorithmic assurance: An active approach to algorithmic testing using Bayesian optimisation,” in Advances in Neural Information Processing Systems, 2018, pp. 5465–5473.
[23] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “The nonstochastic multiarmed bandit problem,” SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, 2002.
[24] L. Bottou, “Stochastic gradient learning in neural networks,” Proceedings of Neuro-Nîmes, vol. 91, no. 8, p. 12, 1991.
[25] T. Le, V. Nguyen, T. D. Nguyen, and D. Phung, “Nonparametric budgeted stochastic gradient descent,” in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016, pp. 654–572.
[26] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press Cambridge, 1998, vol. 1, no. 1.
[27] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
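
As an illustration of the routine in Alg. 1, the following minimal Python sketch combines the GP posterior of Eqs. (2) and (3) with the standard closed-form expected improvement over the best-observed value $y_{max}$. It is a sketch under stated assumptions rather than a reference implementation: the one-dimensional domain [0, 1], the squared-exponential kernel with length scale 0.2, the zero prior mean, the small diagonal jitter, the grid search over the acquisition function, and all function names are choices made only for this example.

```python
import math


def rbf(a, b, length=0.2):
    """Squared-exponential kernel k(a, b) for scalar inputs (assumed kernel)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            w = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= w * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][j] * z[j] for j in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def gp_posterior(x_new, X, y, jitter=1e-6):
    """Posterior mean and variance at x_new, following Eqs. (2) and (3).

    The jitter on the diagonal of K(X, X) is an assumption added for
    numerical stability; it is not part of Eqs. (2) and (3).
    """
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    k_vec = [rbf(x_new, a) for a in X]
    alpha = solve(K, y)        # K(X, X)^{-1} y
    v = solve(K, k_vec)        # K(X, X)^{-1} k(x', X)^T
    mean = sum(k * a for k, a in zip(k_vec, alpha))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_vec, v))
    return mean, max(var, 0.0)


def expected_improvement(mean, var, y_max):
    """Closed-form EI over the incumbent y_max (maximization problem)."""
    sigma = math.sqrt(var)
    if sigma < 1e-12:
        return 0.0
    z = (mean - y_max) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mean - y_max) * cdf + sigma * pdf


def bayes_opt(f, X0, T=10, grid_size=101):
    """Alg. 1 on [0, 1]; the acquisition is maximized over a fixed grid."""
    X = list(X0)
    y = [f(x) for x in X]
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    for _ in range(T):
        y_max = max(y)                       # incumbent: best-observed value
        scores = [expected_improvement(*gp_posterior(g, X, y), y_max)
                  for g in grid]
        x_t = grid[scores.index(max(scores))]  # x_t = arg max alpha(x)
        X.append(x_t)                          # D_t = D_{t-1} U (x_t, y_t)
        y.append(f(x_t))
    best = y.index(max(y))
    return X[best], y[best]
```

A call such as `bayes_opt(lambda x: -(x - 0.3) ** 2, [0.0, 0.5, 1.0])` returns the best location and value found after ten iterations. The grid search stands in for the "standard numerical techniques" mentioned in Sec. II-B; in more than one dimension a continuous optimizer with restarts would be the more usual choice.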
