Download as pdf or txt
Download as pdf or txt
You are on page 1of 69

Bayesian Optimization : Theory and

Practice Using Python Peng Liu


Visit to download the full and correct content document:
https://ebookmass.com/product/bayesian-optimization-theory-and-practice-using-pyth
on-peng-liu/
More products digital (pdf, epub, mobi) instant
download maybe you interests ...

Bayesian Optimization: Theory and Practice Using Python


1st Edition Peng Liu

https://ebookmass.com/product/bayesian-optimization-theory-and-
practice-using-python-1st-edition-peng-liu/

Quantitative Trading Strategies Using Python: Technical


Analysis, Statistical Testing, and Machine Learning
Peng Liu

https://ebookmass.com/product/quantitative-trading-strategies-
using-python-technical-analysis-statistical-testing-and-machine-
learning-peng-liu/

Machine learning: A Bayesian and optimization


perspective 2nd Edition Theodoridis S

https://ebookmass.com/product/machine-learning-a-bayesian-and-
optimization-perspective-2nd-edition-theodoridis-s/

Advanced Data Analytics Using Python : With


Architectural Patterns, Text and Image Classification,
and Optimization Techniques 2nd Edition Sayan
Mukhopadhyay
https://ebookmass.com/product/advanced-data-analytics-using-
python-with-architectural-patterns-text-and-image-classification-
and-optimization-techniques-2nd-edition-sayan-mukhopadhyay/
Magnetic Communications: Theory and Techniques Liu

https://ebookmass.com/product/magnetic-communications-theory-and-
techniques-liu/

eTextbook 978-0134379760 The Practice of Computing


Using Python (3rd Edition)

https://ebookmass.com/product/etextbook-978-0134379760-the-
practice-of-computing-using-python-3rd-edition/

Critical thinking in clinical research : applied theory


and practice using case studies Fregni

https://ebookmass.com/product/critical-thinking-in-clinical-
research-applied-theory-and-practice-using-case-studies-fregni/

Tensors for Data Processing. Theory, Methods, and


Applications Yipeng Liu

https://ebookmass.com/product/tensors-for-data-processing-theory-
methods-and-applications-yipeng-liu/

Implementing Cryptography Using Python Shannon Bray

https://ebookmass.com/product/implementing-cryptography-using-
python-shannon-bray/
Peng Liu

Bayesian Optimization
Theory and Practice Using Python
Peng Liu
Singapore, Singapore

ISBN 978-1-4842-9062-0 e-ISBN 978-1-4842-9063-7


https://doi.org/10.1007/978-1-4842-9063-7

© Peng Liu 2023

Apress Standard

The use of general descriptive names, registered names, trademarks,


service marks, etc. in this publication does not imply, even in the
absence of a specific statement, that such names are exempt from the
relevant protective laws and regulations and therefore free for general
use.

The publisher, the authors and the editors are safe to assume that the
advice and information in this book are believed to be true and accurate
at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, expressed or implied, with respect to the
material contained herein or for any errors or omissions that may have
been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Apress imprint is published by the registered company APress


Media, LLC, part of Springer Nature.
The registered company address is: 1 New York Plaza, New York, NY
10004, U.S.A.
For my wife Zheng and children Jiaxin, Jiaran, and Jiayu.
Introduction
Bayesian optimization provides a unified framework that solves the
problem of sequential decision-making under uncertainty. It includes
two key components: a surrogate model approximating the unknown
black-box function with uncertainty estimates and an acquisition
function that guides the sequential search. This book reviews both
components, covering both theoretical introduction and practical
implementation in Python, building on top of popular libraries such as
GPyTorch and BoTorch. Besides, the book also provides case studies on
using Bayesian optimization to seek a simulated function's global
optimum or locate the best hyperparameters (e.g., learning rate) when
training deep neural networks. The book assumes readers with a
minimal understanding of model development and machine learning
and targets the following audiences:
Students in the field of data science, machine learning, or
optimization-related fields
Practitioners such as data scientists, both early and middle in their
careers, who build machine learning models with good-performing
hyperparameters
Hobbyists who are interested in Bayesian optimization as a global
optimization technique to seek the optimal solution as fast as
possible
All source code used in this book can be downloaded from
github.com/apress/Bayesian-optimization.
Any source code or other supplementary material referenced by the
author in this book is available to readers on GitHub
(https://github.com/Apress). For more detailed information, please
visit http://www.apress.com/source-code.
Acknowledgments
This book summarizes my learning journey in Bayesian optimization
during my (part-time) Ph.D. study. It started as a personal interest in
exploring this area and gradually grew into a book combining theory
and practice. For that, I thank my supervisors, Teo Chung Piaw and
Chen Ying, for their continued support in my academic career.
Table of Contents
Chapter 1:​Bayesian Optimization Overview
Global Optimization
The Objective Function
The Observation Model
Bayesian Statistics
Bayesian Inference
Frequentist vs.​Bayesian Approach
Joint, Conditional, and Marginal Probabilities
Independence
Prior and Posterior Predictive Distributions
Bayesian Inference:​An Example
Bayesian Optimization Workflow
Gaussian Process
Acquisition Function
The Full Bayesian Optimization Loop
Summary
Chapter 2:​Gaussian Processes
Reviewing the Gaussian Basics
Understanding the Covariance Matrix
Marginal and Conditional Distribution of Multivariate
Gaussian
Sampling from a Gaussian Distribution
Gaussian Process Regression
The Kernel Function
Extending to Other Variables
Learning from Noisy Observations
Gaussian Process in Practice
Drawing from GP Prior
Obtaining GP Posterior with Noise-Free Observations
Working with Noisy Observations
Experimenting with Different Kernel Parameters
Hyperparameter Tuning
Summary
Chapter 3:​Bayesian Decision Theory and Expected Improvement
Optimization via the Sequential Decision-Making
Seeking the Optimal Policy
Utility-Driven Optimization
Multi-step Lookahead Policy
Bellman’s Principle of Optimality
Expected Improvement
Deriving the Closed-Form Expression
Implementing the Expected Improvement
Using Bayesian Optimization Libraries
Summary
Chapter 4:​Gaussian Process Regression with GPyTorch
Introducing GPyTorch
The Basics of PyTorch
Revisiting GP Regression
Building a GP Regression Model
Fine-Tuning the Length Scale of the Kernel Function
Fine-Tuning the Noise Variance
Delving into Kernel Functions
Combining Kernel Functions
Predicting Airline Passenger Counts
Summary
Chapter 5:​Monte Carlo Acquisition Function with Sobol Sequences
and Random Restart
Analytic Expected Improvement Using BoTorch
Introducing Hartmann Function
GP Surrogate with Optimized Hyperparameters
Introducing the Analytic EI
Optimization Using Analytic EI
Grokking the Inner Optimization Routine
Using MC Acquisition Function
Using Monte Carlo Expected Improvement
Summary
Chapter 6:​Knowledge Gradient:​Nested Optimization vs.​One-Shot
Learning
Introducing Knowledge Gradient
Monte Carlo Estimation
Optimizing Using Knowledge Gradient
One-Shot Knowledge Gradient
Sample Average Approximation
One-Shot Formulation of KG Using SAA
One-Shot KG in Practice
Optimizing the OKG Acquisition Function
Summary
Chapter 7:​Case Study:​Tuning CNN Learning Rate with BoTorch
Seeking Global Optimum of Hartmann
Generating Initial Conditions
Updating GP Posterior
Creating a Monte Carlo Acquisition Function
The Full BO Loop
Hyperparameter Optimization for Convolutional Neural
Network
Using MNIST
Defining CNN Architecture
Training CNN
Optimizing the Learning Rate
Entering the Full BO Loop
Summary
Index
About the Author
Peng Liu
is an assistant professor of quantitative
finance (practice) at Singapore
Management University and an adjunct
researcher at the National University of
Singapore. He holds a Ph.D. in Statistics
from the National University of
Singapore and has ten years of working
experience as a data scientist across the
banking, technology, and hospitality
industries.
About the Technical Reviewer
Jason Whitehorn
is an experienced entrepreneur and
software developer and has helped many
companies automate and enhance their
business solutions through data
synchronization, SaaS architecture, and
machine learning. Jason obtained his
Bachelor of Science in Computer Science
from Arkansas State University, but he
traces his passion for development back
many years before then, having first
taught himself to program BASIC on his
family’s computer while in middle
school. When he’s not mentoring and
helping his team at work, writing, or
pursuing one of his many side-projects,
Jason enjoys spending time with his wife and four children and living in
the Tulsa, Oklahoma, region. More information about Jason can be
found on his website: https://jason.whitehorn.us.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer
Nature 2023
P. Liu, Bayesian Optimization
https://doi.org/10.1007/978-1-4842-9063-7_1

1. Bayesian Optimization Overview


Peng Liu1
(1) Singapore, Singapore

As the name suggests, Bayesian optimization is an area that studies


optimization problems using the Bayesian approach. Optimization aims
at locating the optimal objective value (i.e., a global maximum or
minimum) of all possible values or the corresponding location of the
optimum in the environment (the search domain). The search process
starts at a specific initial location and follows a particular policy to
iteratively guide the following sampling locations, collect new
observations, and refresh the guiding policy.
As shown in Figure 1-1, the overall optimization process consists of
repeated interactions between the policy and the environment. The
policy is a mapping function that takes in a new input observation (plus
historical ones) and outputs the following sampling location in a
principled way. Here, we are constantly learning and improving the
policy, since a good policy guides our search toward the global
optimum more efficiently and effectively. In contrast, a good policy
would save the limited sampling budget on promising candidate
locations. On the other hand, the environment contains the unknown
objective function to be learned by the policy within a specific
boundary. When probing the functional value as requested by the
policy, the actual observation revealed by the environment to the policy
is often corrupted by noise, making learning even more challenging.
Thus, Bayesian optimization, a specific approach for global
optimization, would like to learn a policy that can help us efficiently and
effectively navigate to the global optimum of an unknown, noise-
corrupted environment as quickly as possible.

Figure 1-1 The overall Bayesian optimization process. The policy digests the
historical observations and proposes the new sampling location. The environment
governs how the (possibly noise-corrupted) observation at the newly proposed
location is revealed to the policy. Our goal is to learn an efficient and effective policy
that could navigate toward the global optimum as quickly as possible

Global Optimization
Optimization aims to locate the optimal set of parameters of interest
across the whole domain through carefully allocating limited resources.
For example, when searching for the car key at home before leaving for
work in two minutes, we would naturally start with the most promising
place where we would usually put the key. If it is not there, think for a
little while about the possible locations and go to the next most
promising place. This process iterates until the key is found. In this
example, the policy is digesting the available information on previous
searches and proposing the following promising location. The
environment is the house itself, revealing if the key is placed at the
proposed location upon each sampling.
This is considered an easy example since we are familiar with the
environment in terms of its structural design. However, imagine
locating an item in a totally new environment. The policy would need to
account for the uncertainty due to unfamiliarity with the environment
while sequentially determining the next sampling location. When the
sampling budget is limited, as is often the case in real-life searches in
terms of time and resources, the policy needs to argue carefully on the
utility of each candidate sampling location.
Let us formalize the sequential global optimization using
mathematical terms. We are dealing with an unknown scalar-valued
objective function f based on a specific domain Α. In other words, the
unknown subject of interest f is a function that maps a certain sample
in Α to a real number in ℝ, that is, f : Α → ℝ. We typically place no
specific assumption about the nature of the domain Α other than that it
should be a bounded, compact, and convex set.
Unless otherwise specified, we focus on the maximization setting
instead of minimization since maximizing the objective function is
equivalent to minimizing the negated objective, and vice versa. The
optimization procedure thus aims at locating the global maximum f∗ or
its corresponding location x∗ in a principled and systematic manner.
Mathematically, we wish to locate f∗ where

Or equivalently, we are interested in its location x∗ where

Figure 1-2 provides an example one-dimensional objective function


with its global maximum f ∗ and its location x∗ highlighted. The goal of
global optimization is thus to systematically reason about a series of
sampling decisions within the total search space Α, so as to locate the
global maximum as fast as possible, that is, sampling as few times as
possible.
Figure 1-2 An example objective function with the global maximum and its location
marked with star. The goal of global optimization is to systematically reason about a
series of sampling decisions so as to locate the global maximum as fast as possible
Note that this is a nonconvex function, as is often the case in real-life
functions we are optimizing. A nonconvex function means we could not
resort to first-order gradient-based methods to reliably search for the
global optimum since it will likely converge to a local optimum. This is
also one of the advantages of Bayesian optimization compared with
other gradient-based optimization procedures.

The Objective Function


There are different types of objective functions. For example, some
functions are wiggly shaped, while others are smooth; some are convex,
while others are nonconvex. An objective function is an unknown object
to us; the problem would be considered solved if we could access its
underlying mathematical form. Many complex functions are almost
impossible to be expressed using an explicit expression. For Bayesian
optimization, the specific type of objective function typically bears the
following attributes:
We do not have access to the explicit expression of the objective
function, making it a “black-box” function. This means that we can
only interact with the environment, that is, the objective function, to
perform a functional evaluation by sampling at a specific location.
The returned value by probing at a specific location is often
corrupted by noise and does not represent the exact true value of the
objective function at that location. Due to the indirect evaluation of
its actual value, we need to account for such noise embedded in the
actual observations from the environment.
Each functional evaluation is costly, thus ruling out the option for an
exhaustive probing. We need to have a sample-efficient method to
minimize the number of evaluations of the environment while trying
to locate its global optimum. In other words, the optimizer needs to
fully utilize the existing observations and systematically reason
about the next sampling decision so that the limited resource is well
spent on promising locations.
We do not have access to its gradient. When the functional evaluation
is relatively cheap and the functional form is smooth, it would be
very convenient to compute the gradient and optimize using the first-
order procedure such as gradient descent. Access to the gradient is
necessary for us to understand the adjacent curvature of a particular
evaluation point. With gradient evaluations, the follow-up direction
of travel is easier to determine.
The “black-box” function is challenging to optimize for the
preceding reasons. To further elaborate on the possible functional form
of the objective, we list three examples in Figure 1-3. On the left is a
convex function with only one global minimum; this is considered easy
for global optimization. In the middle is a nonconvex function with
multiple local optima; it is difficult to ascertain if the current local
optimum is also globally optimal. It is also difficult to identify whether
this is a flat region vs. a local optimum for a function with a flat region
full of saddle points. All three scenarios are in a minimization setting.
Figure 1-3 Three possible functional forms. On the left is a convex function whose
optimization is easy. In the middle is a nonconvex function with multiple local
minima, and on the right is also a nonconvex function with a wide flat region full of
saddle points. Optimization for the latter two cases takes a lot more work than for
the first case
Let us look at one example of hyperparameter tuning when training
machine learning models. A machine learning model is a function that
involves a set of parameters to be optimized given the input data. These
parameters are automatically tuned via a specific optimization
procedure, typically governed by a set of corresponding meta
parameters called hyperparameters, which are fixed before the model
training starts. For example, when training deep neural networks using
the gradient descent algorithm, a learning rate that determines the step
size of each parameter update needs to be manually selected in
advance. If the learning rate is too large, the model may diverge and
eventually fails to learn. If the learning rate is too small, the model may
converge very slowly as the weights are updated by only a small margin
in this iteration. See Figure 1-4 for a visual illustration.
Figure 1-4 Slow convergence due to a small learning rate on the left and divergence
due to a large learning rate on the right
Choosing a reasonable learning rate as a preset hyperparameter
thus plays a critical role in training a good machine learning model.
Locating the best learning rate and other hyperparameters is an
optimization problem that fits Bayesian optimization. In the case of
hyperparameter tuning, evaluating each learning rate is a time-
consuming exercise. The objective function would generally be the
model’s final test set loss (in a minimization setting) upon model
convergence. A model needs to be fully trained to obtain a single
evaluation, which typically involves hundreds of epochs to reach stable
convergence. Here, one epoch is a complete pass of the entire training
dataset. The book’s last chapter covers a case study on tuning the
learning rate using Bayesian optimization.
The functional form of the test set loss or accuracy may also be
highly nonconvex and multimodal for the hyperparameters. Upon
convergence, it is not easy to know whether we are in a local optimum,
a saddle point, or a global optimum. Besides, some hyperparameters
may be discrete, such as the number of nodes and layers when training
a deep neural network. We could not calculate its gradient in such a
case since it requires continuous support in the domain.
The Bayesian optimization approach is designed to tackle all these
challenges. It has been shown to deliver good performance in locating
the best hyperparameters under a limited budget (i.e., the number of
evaluations allowed). It is also widely and successfully used in other
fields, such as chemical engineering.
Next, we will delve into the various components of a typical
Bayesian optimization setup, including the observation model, the
optimization policy, and the Bayesian inference.

The Observation Model


Earlier, we mentioned that a functional evaluation would give an
observation about the true objective function, and the observation may
likely be different from the true objective value due to noise. The
observations gathered for the policy learning would thus be inexact and
corrupted by an additional noise term, which is often assumed to be
additive. The observation model is an approach to formalize the
relationship between the true objective function, the actual
observation, and the noise. It governs how the observations would be
revealed from the environment to the policy.
Figure 1-5 illustrates a list of observations of the underlying
objective function. These observations are dislocated from the objective
function due to additive random noises. These additive noises manifest
as the vertical shifts between the actual observations and the
underlying objective function. Due to these noise-induced deviations
inflicted on the observations, we need to account for such uncertainty
in the observation model. When learning a policy based on the actual
observations, the policy also needs to be robust enough to focus on the
objective function’s underlying pattern and not be distracted by the
noises. The model we use to approximate the objective function, while
accounting for uncertainty due to the additive noise, is typically a
Gaussian process. We will cover it briefly in this chapter and in more
detail in the next chapter.
Figure 1-5 Illustrating the actual observations (in dots) and the underlying
objective function (in dashed line). When sampling at a specific location, the
observation would be disrupted by an additive noise. The observation model thus
determines how the observation would be revealed to the policy, which needs to
account for the uncertainty due to noise perturbation
To make our discussion more precise, let us use f (x) to denote the
(unknown) objective function value at location x. We sometimes write f
(x) as f for simplicity. We use y to denote the actual observation at
location x, which will slightly differ from f due to noise perturbation. We
can thus express the observation model, which governs how the policy
sees the observation from the environment, as a probability
distribution of y based on a specific location x and true function value f:

Let us assume an additive noise term ε inflicted on f; the actual


observation y can thus be expressed as

Here, the noise term ε arises from measurement error or inaccurate


statistical approximation, although it may disappear in certain
computer simulations. A common practice is to treat the error as a
random variable that follows a Gaussian distribution with a zero mean
and fixed standard deviation σ, that is, ε~N(0, σ2). Note that it is
unnecessary to fix σ across the whole domain A; the Bayesian
optimization allows for both homoscedastic noise (i.e., fixed σ across A)
and heteroskedastic noise (i.e., different σ that depends on the specific
location in A).
Therefore, we can formulate a Gaussian observation model as
follows:

This means that for a specific location x, the actual observation y is


treated as a random variable that follows a Gaussian/normal
distribution with mean f and variance σ2. Figure 1-6 illustrates an
example probability distribution of y centered around f. Note that the
variance of the noise is often estimated by sampling a few initial
observations and is expected to be small, so that the overall observation
model still strongly depends on and stays close to f.

Figure 1-6 Assuming a normal probability distribution for the actual observation as
a random variable. The Gaussian distribution is centered around the objective
function f value evaluated at a given location x and spread by the variance of the
noise term
The following section introduces Bayesian statistics to lay the
theoretical foundation as we work with probability distributions along
the way.

Bayesian Statistics
Bayesian optimization is not a particular algorithm for global
optimization; it is a suite of algorithms based on the principles of
Bayesian inference. As the optimization proceeds in each iteration, the
policy needs to determine the next sampling decision or if the current
search needs to be terminated. Due to uncertainty in the objective
function and the observation model, the policy needs to cater to such
uncertainty upon deciding the following sampling location, which bears
both an immediate impact on follow-up decisions and a long-term
effect on all future decisions. The samples selected thus need to
reasonably contribute to the ultimate goal of global optimization and
justify the cost incurred due to sampling.
Using Bayesian statistics in optimization paves the way for us to
systematically and quantitatively reason about these uncertainties
using probabilities. For example, we would place a prior belief about
the characteristics of the objective function and quantify its
uncertainties by assigning high probability to specific ranges of values
and low probability to others. As more observations are collected, the
prior belief is gradually updated and calibrated toward the true
underlying distribution of the objective function in the form of a
posterior distribution.
We now cover the fundamental concepts and tools of Bayesian
statistics. Understanding these sections is essential to appreciate the
inner workings of Bayesian optimization.

Bayesian Inference
Bayesian inference essentially relies on the Bayesian formula (also
called Bayes’ rule) to reason about the interactions among three
components: the prior distribution p(θ) where θ represents the
parameter of interest, the likelihood p(data| θ) given a specific
parameter θ, and the posterior distribution p(θ| data). There is one
more component, the evidence of the data p(data), which is often not
computable. The Bayesian formula is as follows:

Let us look closely at this widely used, arguably the most important
formula in Bayesian statistics. Remember that any Bayesian inference
procedure aims to derive the posterior distribution p(θ| data) (or
calculate its marginal expectation) for the parameter of interest θ, in
the form of a probability density function. For example, we might end
up with a continuous posterior distribution as in Figure 1-7, where θ
varies from 0 to 1, and all the probabilities (i.e., area under the curve)
would sum to 1.

Figure 1-7 Illustrating a sample (continuous) posterior distribution for the


parameter of interest. The specific shape of the curve will change as new data are
being collected

We would need access to three components to obtain the posterior


distribution of θ. First, we need to derive the probability of seeing the
actual data given our choice of θ, that is, p(data| θ). This is also called
the likelihood term since we are assessing how likely it is to generate
the data after specifying a certain observation model for the data. The
likelihood can be calculated based on the assumed observation model
for data generation.
The second term p(θ) represents our prior belief about the
distribution of θ without observing any actual data; we encode our pre-
experimental knowledge of the parameter θ in this term. For example,
p(θ) could take the form of a uniform distribution that assigns an equal
probability to any value between 0 and 1. In other words, all values in
this range are equally likely, and this is a common prior belief we would
place on θ given that we do not have any information that suggests a
preference over specific values. However, as we collect more
observations and gather more data, the prior distribution will play a
decreasing role, and the subjective belief will gradually reduce in
support of the factual evidence in the data. As shown in Figure 1-8, the
distribution of θ will progressively approach a normal distribution
given that more data is being collected, thus forming a posterior
distribution that better approximates the true distribution of θ.

Figure 1-8 Updating the prior uniform distribution toward a posterior normal
distribution as more data is collected. The role of the prior distribution decreases as
more data is collected to support the approximation to the true underlying
distribution

The last term is the denominator p(data), also referred to as the


evidence, which represents the probability of obtaining the data over
all different choices of θ and serves as a normalizing constant
independent of θ in Bayes’ theorem. This is the most difficult part to
compute among all the components since we need to integrate over all
possible values of θ by taking an integration. For each given θ, the
likelihood is calculated based on the assumed observation model for
data generation, which is the same as how the likelihood term is
calculated. The difference is that the evidence considers every possible
value of θ and weights the resulting likelihood based on the probability
of observing a particular θ. Since the evidence is not connected to θ, it is
often ignored when analyzing the proportionate change in the
posterior. As a result, it focuses only on the likelihood and the prior
alone.
A relatively simple case is when the prior p(θ) and the likelihood
p(data| θ) are conjugate, making the resulting posterior p(θ| data)
analytic and thus easy to work with due to its closed-form expression.
Bayesian inference becomes much easier and less restrictive if we can
write down the explicit form and generate the exact shape of the
posterior p(θ| data) without resorting to sampling methods. The
posterior will follow the same distribution as the prior when the prior
is conjugate with the likelihood function. One example is when both the
prior and the likelihood functions follow a normal distribution, the
resulting posterior will also be normally distributed. However, when
the prior and the likelihood are not conjugate, we can still get more
insight on the posterior distribution via efficient sampling techniques
such as Gibbs sampling.

Frequentist vs. Bayesian Approach


The Bayesian approach is a systematic way of assigning probabilities to
possible values of θ and updating these probabilities based on the
observed data. However, sometimes we are only interested in the most
probable (expected) value of θ that gives rise to the data we observe.
This can be achieved using the frequentist approach, treating the
parameter of interest (i.e., θ) as a fixed quantity instead of a random
variable. This approach is often adopted in the machine learning
community, placing a strong focus on optimizing a specific objective
function to locate the optimal set of parameters.
More generally, we use the frequentist approach to find the correct
answer about θ. For example, we can locate the value of θ by
maximizing the joint probability of the actual data via maximum
likelihood estimation (MLE), where the resulting solution is
. There is no distribution involved with θ since
we treat it as a fixed quantity, which makes the calculation easier as we
only need to work with the probability distribution for the data. The
final solution using the frequentist approach is a specific value of θ. And
since we are working with samples that come from the underlying data-
generating distribution, different samples would vary from each other,
and the goal is to find the optimal parameter θ that best describes the
current sample we are observing.
On the other hand, the Bayesian approach takes on the extra
complexity by treating θ as a random variable with its own probability
distribution, which gets updated as more data is collected. This
approach offers a holistic view on all possible values of θ and the
corresponding probabilities instead of the most probable value of θ
alone. This is a different approach because the data is now treated as
fixed and the parameter θ as a random variable. The optimal
probability distribution for θ is then derived, given the observed fixed
sample. There is no right or wrong in the Bayesian approach, only
probabilities. The final solution is thus a probability distribution of θ
instead of one specific value. Figure 1-9 summarizes these two different
schools of thought.
Figure 1-9 Comparing the frequentist approach and the Bayesian approach
regarding the parameter of interest. The frequentist approach treats θ as a fixed
quantity that can be estimated via MLE, while the Bayesian approach employs a
probability distribution which gets refreshed as more data is collected

Joint, Conditional, and Marginal Probabilities


We have been characterizing the random variable θ using a
(continuous) probability distribution p(θ). A probability distribution is
a function that maps a specific value of θ to a probability, and the
probabilities of all values of θ sum to one, that is, ∫ p(θ)dθ = 1.
Things become more interesting when we work with multiple
(more than one) variables. Suppose we have two random variables x
and y, and we are interested in two events x = X and y = Y, where both X
and Y are specific values that x and y may assume, respectively. Also, we
assume the two random variables are dependent in some way. This
would lead us to three types of probabilities commonly used in modern
machine learning and Bayesian optimization literature: joint
probability, marginal probability, and conditional probability, which we
will look at now in more detail.
The joint probability of the two events refers to the probability of
them occurring simultaneously. It is also referred to as the joint
probability distribution since the probability now represents all
possible combinations of the two simultaneous events.
We can write the joint probability of the two events as p(X and
Y) = p(x = X ∩ y = Y) = p(X ∩ Y). Using the chain rule of probability, we
can further write p(X and Y) = p(X given Y) ∗ p(Y) = p(X| Y)p(Y), where
p(X| Y) denotes the probability of event x = X occurs given that the
event y = Y has occurred. It is thus referred to as conditional probability,
as the probability of the first event is now conditioned on the second
event. All conditional probabilities for a (continuous) random variable x
given a specific value of another random variable (i.e., y = Y) form the
conditional probability distribution p(x| y = Y). More generally, we can
write the joint probability distribution of random variables x and y as
p(x, y) and conditional probability distribution as p(x ∣ y).
The joint probability is also symmetrical, that is, p(X and Y) = p(Y
and X), which is a result of the exchangeability property of probability.
Plugging in the definition of joint probability using the chain rule gives
the following:

If you look at this equation more closely, it is not difficult to see that
it can lead to the Bayesian formula we introduced earlier, namely:

Understanding this connection gives us one more reason not to


memorize the Bayesian formula but to appreciate it. We can also
replace a single event x = X with the random variable x to get the
corresponding conditional probability distribution p(x| y = Y).
Lastly, we may only be interested in the probability of an event for
one random variable alone, disregarding the possible realizations of the
other random variable. That is, we would like to consider the
probability of the event x = X under all possible values of y. This is
called the marginal probability for the event x = X. The marginal
probability distribution for a (continuous) random variable x in the
presence of another (continuous) random variable y can be calculated
as follows:

The preceding definition essentially sums up possible values p(x| y)


weighted by the likelihood of occurrence p(y). The weighted sum
operation resolves the uncertainty in the random variable y and thus in
a way integrates it out of the original joint probability distribution,
keeping only one random variable. For example, the prior probability
p(θ) in Bayes’ rule is a marginal probability distribution of θ, which
integrates out other random variables, if any. The same goes for the
evidence term p(data) which is calculated by integrating over all
possible values of θ.
Similarly, we have the marginal probability distribution for random
variable y defined as follows:
Figure 1-10 summarizes the three common probability
distributions. Note that the joint probability distribution focuses on two
or more random variables, while both the conditional and marginal
probability distributions generally refer to a single random variable. In
the case of the conditional probability distribution, the other random
variable assumes a specific value and thus, in a way, “disappears” from
the joint distribution. In the case of the marginal probability
distribution, the other random variable is instead integrated out of the
joint distribution.

Figure 1-10 Three common probability distributions. The joint probability


distribution represents the probability distribution for two or more random
variables, while the conditional and marginal probability distributions generally
refer to the probability distribution for one random variable. The conditional
distribution represents the probabilities of a random variable by
assuming/conditioning a specific value for other variables, while the marginal
distribution converts a joint probability to a single random variable by integrating
out other variables

Let us revisit Bayes’ rule in the context of conditional and marginal


probabilities. Specifically, the likelihood term p(data| θ) can be treated
as the conditional probability of the data given the parameter θ, and the
evidence term p(data) is a marginal probability that needs to be
evaluated across all possible choices of θ. Based on the definition of
marginal probability, we can write the calculation of p(data) as a
weighted sum (assuming a continuous θ):

where we have a different likelihood conditioned by a specific


parameter θ, and these likelihood terms are weighted by the prior
probabilities. Thus, the evidence considers all the different ways we
could use to get to the particular data.

Independence
A special case that would impact the calculation of the three
probabilities mentioned earlier is independence, where the random
variables are now independent of each other. Let us look at the joint,
conditional, and marginal probabilities with independent random
variables.
When two random variables are independent of each other, the
event x = X would have nothing to do with the event y = Y, that is, the
conditional probability for x = X given y = Y becomes p(X| Y) = p(X). The
conditional probability distribution for two independent random
variables thus becomes p(x| y) = p(x). Their joint probability becomes
the multiplication of individual probabilities: p(X ∩ Y) = P(X|
Y)P(Y) = p(X)p(Y), and the joint probability distribution becomes a
product of individual probability distributions: p(x, y) = p(x)p(y). The
marginal probability of x is just its own probability distribution:

where we have used the fact that p(x) can be moved out of the
integration operation due to its independence with y, and the total area
under a probability distribution is one, that is, ∫ p(y)dy = 1.
We can also extend to conditional independence, where the random
variable x could be independent from y given another random variable
z. In other words, we have p(x, y| z) = p(x| z)p(y| z).

Prior and Posterior Predictive Distributions


Let us shift gear to focus on the actual predictions by quantifying the
uncertainties using Bayes’ rule. To facilitate the discussion, we will use
y to denote the data in Bayes’ formula or the actual observations as in
the Bayesian optimization setting. We are interested in its predictive
distribution p(y), that is, the possible values of y and the corresponding
probabilities. Our decision-making would be much more informed if we
had a good understanding of the predictive distribution of the future
unknown data, particularly in the Bayesian optimization framework
where one needs to decide the next sampling location carefully.
Before we collect any data, we would work with a prior predictive
distribution that considers all possible values of the underlying
parameter θ. That is, the prior predictive distribution for y is a marginal
probability distribution that could be calculated by integrating out all
dependencies on the parameter θ:

which is the exact definition of the evidence term in Bayes’ formula.


In a discrete world, we would take the prior probability for a specific
value of the parameter θ, multiply the likelihood of the resulting data
given the current θ, and sum across all weighted likelihoods.
Now let us look at the posterior predictive distribution for a new
data point y′ after observing a collection of data points collectively
denoted as We would like to assess how the future data would be
distributed and what value of y′ we would likely to observe if we were
to run the experiment and acquire another data point again, given that
we have observed some actual data. That is, we want to calculate the
posterior predictive distribution .
We can calculate the posterior predictive distribution by treating it
as a marginal distribution (conditioned on the collected dataset ) and
applying the same technique as before, namely:

where the second term is the posterior distribution of the


parameter θ that can be calculated by applying Bayes’ rule. However,
the first term is more involved. When assessing a new data
point after observing some existing data points, a common assumption
is that they are conditionally independent given a particular value of θ.
Such conditional independence implies that ,
which happens to be the likelihood term. Thus, we can simplify the
posterior predictive distribution as follows:

which follows the same pattern of calculation compared to the prior


predictive distribution. This would then give us the distribution of
observations we would expect for a new experiment (such as probing
the environment in the Bayesian optimization setting) given a set of
previously collected observations. The prior and posterior predictive
distributions are summarized in Figure 1-11.

Figure 1-11 Definition of the prior and posterior predictive distributions. Both are
calculated based on the same pattern of a weighted sum between the prior and the
likelihood

Let us look at an example of the prior predictive distribution under


a normal prior and likelihood function. Before the experiment starts,
we assume the observation model for the likelihood of the data y to
follow a normal distribution, that is, y~N(θ, σ2), or p(y| θ, σ2) = N(θ, σ2),
where θ is the underlying parameter and σ2 is a fixed variance. For
example, in the case of the observation model in the Bayesian
optimization setting introduced earlier, the parameter θ could
represent the true objective function, and the variance σ2 originates
from an additive Gaussian noise. The distribution of y is dependent on
θ, which itself is an uncertain quantity. We further assume the
parameter θ to follow a normal distribution as its prior, that is,
, or , where θ0 and are the mean
and variance of our prior normal distribution assumed before
collecting any data points. Since we have no knowledge of the
environment of interest, we would like to understand how the data
point (treated as a random variable) y could be distributed in this
unknown environment under different values of θ.
Understanding the distribution of y upon the start of any
experiment amounts to calculating its prior predictive distribution p(y).
Since we are working with a continuous θ, the marginalization needs to
consider all possible values of θ from negative to positive infinity in
order to integrate out the uncertainty due to θ:

The prior predictive distribution can thus be calculated by plugging


in the definition of normal likelihood term p(y| θ) and the normal prior
term p(θ). However, there is a simple trick we can use to avoid the
math, which would otherwise be pretty heavy if we were to plug in the
formula of the normal distribution directly.
Let us try directly working with the random variables. We will start
by noting that y = (y − θ) + θ. The first term y − θ takes θ away from y,
which decentralizes y by changing its mean to zero and removes the
dependence of y on θ. In other words, (y − θ)~N(0, σ2), which also
represents the distribution of the random noise in the observation
model of Bayesian optimization. Since the second term θ is also
normally distributed, we can derive the distribution of y as follows:

where we have used the fact that the addition of two independent
normally distributed random variables will also be normally
distributed, with the mean and variance calculated based on the sum of
individual means and variances.
Therefore, the marginal probability distribution of y becomes
. Intuitively, this form also makes sense. Before
we start to collect any observation about y, our best guess for its mean
would be θ0, the expected value of the underlying random variable θ. Its
variance is the sum of individual variances since we are considering
uncertainties due to both the prior and the likelihood; the marginal
distribution needs to absorb both variances, thus compounding the
resulting uncertainty. Figure 1-12 summarizes the derivation of the
prior predictive distributions under the normality assumption for the
likelihood and the prior for a continuous θ.

Figure 1-12 Derivation process of the prior predictive distribution for a new data
point before collecting any observations, assuming a normal distribution for both the
likelihood and the prior

We can follow the same line of reasoning for the case of posterior
predictive distribution for a new observation y′ after collecting some
data points under the normality assumption for the likelihood p(y′|
θ) and the posterior , where p(y′| θ) = N(θ, σ2) and
. We can see that the posterior distribution for θ
has an updated set of parameters θ′ and using Bayes’ rule as more
data is collected.
Now recall the definition of the posterior predictive distribution
with a continuous underlying parameter θ:

Again, plugging in the expression of two normally distributed


density functions and working with an integration operation would be
too tedious. We can instead write y′ = (y′ − θ) + θ, where (y′ − θ)~N(0,
σ2) and . Adding the two independent normal
distributions gives the following:

Figure 1-13 summarizes the derivation of the posterior predictive


distributions under normality assumption for the likelihood and the
prior for a continuous θ.

Figure 1-13 Derivation process of the posterior predictive distribution for a new
data point after collecting some observations, assuming a normal distribution for
both the likelihood and the prior

Bayesian Inference: An Example


After going through a quick and essential primer on Bayesian inference,
let us put the mathematics in perspective by going through a concrete
example. To start with, we will choose a probability density function for
the prior and the likelihood p(y| θ) = N(θ, σ2), both
of which are normally distributed. Note that the prior and the
likelihood are with respect to the random variables θ and y,
respectively, where θ represents the true underlying objective value
that is unknown, and y is the noise-corrupted actual observation that
follows an observation model y = θ + ε with an additive and normally
distributed random noise ε~N(0, σ2). We would like to infer the
distribution of θ based on actual observed realization of y.
The choice for the prior is based on our subjective experience with
the parameter of interest θ. Using the Bayes’ theorem, it helps jump-
start the learning toward its real probability distribution. The range of
possible values of θ spans across the full support of the prior normal
distribution. Upon observing an actual realization Y of y, two things will
happen: the probability of observing θ = Y will be calculated and
plugged in Bayes’ rule as the prior, and the likelihood function will be
instantiated as a conditional normal distribution p(y| θ = Y) = N(Y, σ2).
Figure 1-14 illustrates an example of the marginal prior distribution
and the conditional likelihood function (which is also a probability
distribution) along with the observation Y. We can see that both
distributions follow a normal curve, and the mean of the latter is
aligned to the actual observation Y due to the conditioning effect from
Y = θ. Also, the probability of observing Y is not very high based on the
prior distribution p(θ), which suggests a change needed for the prior in
the posterior update of the next iteration. We will need to change the
prior in order to improve such probability and conform the subjective
expectation to reality.
Figure 1-14 Illustrating the prior distribution and the likelihood function, both
following a normal distribution. The mean of the likelihood function is equal to the
actual observation due to the effect of conditioning
The prior distribution will then gradually get updated to
approximate the actual observations by invoking Bayes’ rule. This will
give the posterior distribution in solid line,
whose mean is slightly nudged from θ0 toward Y and updated to θ′, as
shown in Figure 1-15. The prior distribution and likelihood function
are displayed in dashed lines for reference. The posterior distribution
of θ is now more aligned with what is actually observed in reality.
Figure 1-15 Deriving the posterior distribution for θ using Bayes’ rule. The updated
mean θ′ is now between the prior mean θ0 and actual observation Y, suggesting an
alignment between subjective preference and reality
Finally, we would be interested in the predictive distribution of the
actual data point if we acquired a new observation. Treating it as a
random variable y′ enables us to express our uncertainty in the form of
an informed probability distribution, which benefits follow-up tasks
such as deciding where to sample next (more on this later). Based on
our previous discussion, the resulting probability distribution for y′ will
assume a normal distribution with the same mean as the posterior
distribution of θ and an inflated variance that absorbs uncertainties
from both θ and the observation model for y, as shown in Figure 1-16.
The prior and posterior distributions and the likelihood function are
now in the dashed line for reference.

Figure 1-16 Illustrating the posterior predictive distribution if we acquire another


observation from the system/environment under study. The posterior predictive
distribution shares the same mean as the posterior distribution of θ but now has a
larger spread due to uncertainty from both θ and the observation model

Bayesian Optimization Workflow


Having gone through the essentials in Bayesian statistics, you may
wonder how it connects to the Bayesian optimization setting. Recall
that the predictive posterior distribution quantifies the probabilities of
different observations if we were to probe the environment and receive
a new realization. This is a powerful tool when reasoning about the
utility of varying sampling choices x ∈ A in search of the global
maximum f ∗.
To put this in perspective, let us add the location x as an explicit
conditioning on the prior/posterior predictive distribution and use f
(the true objective value) to denote θ (the underlying parameter of
interest). For example, the prior predictive distribution p(y| x)
represents the conditional probability distribution of the actual
observation y at location x. Since we have many different locations
across the domain A, there will also be many prior predictive
distributions, each following the same class of probability distributions
when assuming specific properties about the underlying true objective
function before probing the environment.
Take a one-dimensional objective function f (x) ∈ ℝ, for example. For
any location x0 ∈ A, we will use a prior predictive distribution p(y| x0) to
characterize the possible values of the unknown true value f0 = f (x0).
These prior predictive distributions will jointly form our prior
probabilistic belief about the shape of the underlying objective
function f. Since there are infinitely many locations x0 and thus
infinitely many random variables y, these probability distributions of
the infinite collection of random variables are jointly used as a
stochastic process to characterize the true objective function. Here, the
stochastic process simplifies to our running example earlier when
limited to a single location.

Gaussian Process
A prevalent choice of stochastic process in Bayesian optimization is the
Gaussian process, which requires that these finite-dimensional
probability distributions are multivariate Gaussian distributions in a
continuous domain with infinite number of variables. It is a flexible
framework to model a broad family of functions and quantify their
uncertainties, thus being a powerful surrogate model used to
approximate the true underlying function. We will delve into the details
of the Gaussian process in the next chapter, but for now, let us look at a
few visual examples to see what it offers.
Figure 1-17 illustrates an example of a “flipped” prior probability
distribution for a single random variable selected from the prior belief
of the Gaussian process. Each point follows a normal distribution.
Plotting the mean (solid line) and 95% credible interval (dashed lines)
of all these prior distributions gives us the prior process for the
objective function regarding each location in the domain. The Gaussian
process thus employs an infinite number of normally distributed
random variables within a bounded range to model the underlying
objective function and quantify the associated uncertainty via a
probabilistic approach.

Figure 1-17 A sample prior belief of the Gaussian process represented by the mean
and 95% credible interval for each location in the domain. Every objective value is
modeled by a random variable that follows a normal prior predictive distribution.
Collecting the distributions of all random variables could help us quantify the
potential shape of the true underlying function and its probability

The prior process can thus serve as the surrogate data-generating


process to generate samples in the form of functions, an extension of
sampling single points from a probability distribution. For example, if
we were to repeatedly sample from the prior process earlier, we would
expect the majority (around 95%) of the samples to fall within the
credible interval and a minority outside this range. Figure 1-18
illustrates three functions sampled from the prior process.

Figure 1-18 Three example functions sampled from the prior process, where
majority of the functions fall within the 95% credible interval
In the Gaussian process, the uncertainty on the objective value of
each location is quantified using the credible interval. As we start to
collect observations and assume a noise-free and exact observation
model, the uncertainties at the collection locations will be resolved,
leading to zero variance and direct interpolation at these locations.
Besides, the variance increases as we move further away from the
observations, resulting from integrating the prior process with the
information provided by the actual observations. Figure 1-19 illustrates
the updated posterior process after collecting two observations. The
posterior process with updated knowledge based on the observations
will thus make a more accurate surrogate model and better estimate
the objective function.
Figure 1-19 Updated posterior process after incorporating two exact observations
in the Gaussian process. The posterior mean interpolates through the observations,
and the associated variance reduces as we move nearer the observations

Acquisition Function
The tools from Bayesian inference and the extension to the Gaussian
process provide principled reasoning on the distribution of the
objective function. However, we would still need to incorporate such
probabilistic information in our decision-making to search for the
global maximum. We need to build a policy that absorbs the most
updated information on the objective function and recommends the
following most promising sampling location in the face of uncertainties
across the domain. The optimization policy thus plays an essential role
in connecting the Gaussian process to the eventual goal of Bayesian
optimization. In particular, the posterior predictive distribution
provides an outlook on the objective value and associated uncertainty
for locations not explored yet, which could be used by the optimization
policy to quantify the utility of any alternative location within the
domain.
When converting the posterior knowledge about candidate
locations, that is, posterior parameters such as the mean and the
variance, to a single utility score, the acquisition function comes into
play. An acquisition function is a manually designed mechanism that
evaluates the relative potential of each candidate location in the form of
a scalar score, and the location with the maximum score will be used as
the recommendation for the next round of sampling. It is a function that
assesses how valuable a candidate location when we acquire/sample it.
The acquisition function is often cheap to evaluate as a side
computation since we need to evaluate it at every candidate location
and then locate the maximum utility score, another (inner)
optimization problem.
Many choices of acquisition function have been proposed in the
literature. In a later part of the book, we will cover the popular ones,
such as expected improvement (EI) and knowledge gradient (KG). Still,
it suffices, for now, to understand that it is a predesigned function that
needs to balance two opposing forces: exploration and exploitation.
Exploration encourages resolving the uncertainty across the domain by
sampling at unfamiliar and distant locations, since these areas may
bear a big surprise due to a high certainty. Exploitation recommends a
greedy move at promising regions where we expect the observation
value to be high. The exploration-exploitation trade-off is a common
topic in many optimization settings.
Another distinguishing feature is the short-term and long-term
trade-off. A short-term acquisition function only focuses on one step
ahead and assumes this is the last chance to sample from the
environment; thus, the recommendation is to maximize the immediate
utility. A long-term acquisition function employs a multi-step lookahead
approach by simulating potential evolutions/paths in the future and
making a final recommendation by maximizing the long-run utility. We
will cover both types of policies in the book.
There are many other emerging variations in the design of the
acquisition function, such as adding safety constraints to the system
under study. In any case, we would judge the quality of the policy using
a specific acquisition function based on how close we are to the location
of the global maximum upon exhausting our budget. The distance
between the current and optimal locations is often called instant regret
or simple regret. Alternatively, the cumulative regret (cumulative
distances between historical locations and the optimum location)
incurred throughout the sampling process can also be used.

The Full Bayesian Optimization Loop


Bayesian optimization is an iterative process between the
(uncontrolled) environment and the (controlled) policy. The policy
Another random document with
no related content on Scribd:
A W O R D to an

U N H A P P Y W O M A N.
1. HITHER are you going? To heaven or hell? Do you not know?
W Do you never think about it? Why do you not? Are you never
to die? Nay, it is appointed for all men to die. And what
comes after? Only heaven or hell.――Will the not thinking of death,
put it farther off? No; not a day: not one hour. Or will your not
thinking of hell, save you from it? O no: you know better. And you
know that every moment you are nearer hell, whether you are
thinking of it or no: that is, if you are not nearer heaven. You must be
nearer one or the other.

2. I intreat you, think a little on that plain question, Are you going
toward heaven or hell? To which of the two does this way lead? Is it
possible you should be ignorant? Did you never hear, that neither
adulterers nor fornicators, shall inherit the kingdom? That fornicators
and adulterers God will judge? And how dreadful will be their
sentence, “Depart ye cursed into everlasting fire, prepared for the
devil and his angels!”

3. Surely you do not mock at the word of God! You are not yet
sunk so low as this. Consider then that awful word, know ye not, that
ye are the temples of God? Was not you designed for the Spirit of
God to dwell in? Was not you devoted to God in baptism? But if any
man defile the temple of God, him shall God destroy. O do not
provoke him to it any longer. Tremble before the great, the holy God!
4. Know you not, that your body is, or ought to be, the temple of
the Holy Ghost which is in you? Know you not, that you are not your
own? For you are bought with a price. And, O how great a price! You
are not redeemed with corruptible things, as silver and gold: but with
the precious blood of Christ, as of a lamb without blemish and
without spot. O when will you glorify God, with your body and your
spirit, which are God’s!

5. Ah poor wretch! How far are you from this? How low are you
fallen? You yourself are ashamed of what you do. Are you not?
Conscience, speak in the sight of God? Does not your own heart
condemn you at this very hour? Do not you shudder at the condition
you are in?――Dare, for once, to lay your hand upon your breast,
and ask, “What am I doing? And what must the end of these things
be?” Destruction both of body and soul.

6. Destruction of body as well as of soul! Can it be otherwise?


Are you not plunging into misery in this world, as well as in the world
to come? What have you brought upon yourself already! What
infamy? What contempt? How could you now appear, among those
relations or friends, that were once so loved, and so loving to you?
What pangs have you given them? How do some of them still weep
for you in secret places? And will you not weep for yourself? When
you see nothing before you, but want, pain, diseases, death? O
spare yourself! Have pity upon your body, if not your soul. Stop!
Before you rot above ground and perish!

7. Do you ask, what shall I do? First, Sin no more. First of all,
secure this point. Now, this instant now, escape for your life. Stay
not. Look not behind you. Whatever you do, sin no more: starve, die,
rather than sin. Be more careful for your soul than your body. Take
care of that too: but of your poor soul first.
8. “But you have no friend: none, at least, that is able to help
you.” Indeed you have: one that is a present help in time of trouble.
You have a friend that has all power in heaven and earth, even
Jesus Christ the righteous. He loved sinners of old: and he does so
still. He then suffered the publicans and harlots to come unto him.
And one of them washed his feet with her tears, and wiped them with
the hairs of her head. I would to God you were in her place! Say,
Amen! Lift up your heart, and it shall be done. How soon will he say,
“Woman be of good chear! Thy sins which are many, are forgiven
thee.—Go in peace. Sin no more. Love much; for thou hast much
forgiven.”

9. Do you still ask, but what shall I do for ♦bread? For food to eat,
and raiment to put on? I answer, in the name of the Lord God, (and
mark well! His promise shall not fail) seek thou first the kingdom of
God and his righteousness, and all these things shall be added unto
thee.

♦ “read” replaced with “bread”

Settle it first in your heart, whatever I have or have not, I will not
have everlasting burnings. I will not sell my soul and body for bread:
better even starve on earth than burn in hell. Then ask help of God.
He is not slow to hear. He hath never failed them that seek him. He
who feeds the young ravens that call upon him, will not let you perish
for lack of sustenance. He will provide, in a way you thought not of, if
you seek him with your whole heart. O let your heart be toward him:
seek him from the heart. Fear sin, more than want, more than death.
And cry mightily to him who bore your sins, till you have bread to eat,
that the world knoweth not of; till you have angels food, even the
love of God, shed abroad in your heart: till you can say, now I know
that my Redeemer liveth, that he hath loved me and given himself for
me: and though after my skin worms destroy this body, yet in my
flesh shall I see God!
A W O R D to a

S M U G G L E R.
WHATgoods:
I. “
is smuggling?” It is the importing, selling, or buying of run
that is, those which have not paid the duty appointed
by law to be paid to the king.

1. Importing run goods. All smuggling vessels do this with an high


hand. It is the chief, if not the whole business of these, to bring
goods which have not paid duty.

2. Next to these are all sea captains, officers, sailors, or


passengers, who import any thing without paying the duty which the
law requires.

3. A third sort of smugglers are all those, who sell any thing which
has not paid the duty.

4. A fourth sort, those who buy tea, liquors, linen, handkerchiefs,


or any thing else which has not paid duty.

II. “But why should they not? What harm is there in it?”

1. I answer, open smuggling (such as was common a few years


ago, on the southern coasts especially) is robbing on the highway:
and as much harm as there is in this, just so much there is in
smuggling. A smuggler of this kind is no honester than an
highwayman. They may shake hands together.

2. Private smuggling is just the same with picking of pockets.


There is full as much harm in this as in that. A smuggler of this kind
is no honester than a pickpocket. These may shake hands together.
3. But open smugglers are worse than common highwaymen,
and private smugglers are worse than common pickpockets. For it is
undoubtedly worse to rob our father, than one we have no obligation
to. And it is worse still, far worse, to rob a good father, one who
sincerely loves us, and is at that very time doing all he can, to
provide for us, and to make us happy. Now this is exactly the present
case. King George is the father of all his subjects: and not only so,
but he is a good father. He shews his love to them on all occasions:
and is continually doing all that is in his power, to make his subjects
happy.

4. An honest man therefore would be ashamed to ask, where is


the harm in robbing such a father? His own reason, if he had any at
all, would give him a speedy answer. But you are a Christian: are
you not? You say, you believe the bible. Then I say to you, in the
name of God, and in the name of Christ, Thou shalt not steal. Thou
shalt not take what is not thine own, what is the right of another man.
But the duties appointed by law are the King’s right, as much as your
coat is your right. He has as good a right to them, as you have to
this: these are his property, as much as this is yours. Therefore you
are as much a thief if you take his duties, as a man is that takes your
coat.

5. If you believe the bible, I say to you, as our Saviour said to


them of old time, Render unto Cæsar the things that are Cæsar’s,
and unto God the things that are God’s. If then you mind our
Saviour’s words, be as careful to honour the King, as to fear God. Be
as exact in giving the king, what is due to the king, as in giving God
what is due to God. Upon no account whatever rob or defraud him of
the least thing which is his lawful property.

6. If you believe the bible, I say to you, as St. Paul said to the
ancient Christians, Render unto all their dues: in particular, Custom
to whom custom is due, tribute to whom tribute. Now custom is by
the laws of England due to the king. Therefore every one in England
is bound to pay it him. So that robbing the king herein, is abundantly
worse than common stealing, or common robbing on the highway.
7. And so it is, on another account also: for it is a general
robbery: it is, in effect, not only robbing the king, but robbing every
honest man in the nation. For the more the king’s duties are
diminished, the more the taxes must be increased. And these lie
upon us all: they are the burden not of some, but of all the people of
England. Therefore every smuggler is a thief-general, who picks the
pockets both of the king, and all his fellow-subjects. He wrongs them
all; and above all, the honest traders: many of whom he deprives of
their maintenance: constraining them either not to sell their goods at
all, or to sell them to no profit. Some of them are tempted hereby,
finding they cannot get bread for their families, to turn thieves too.
And then you are accountable for their sin as well as your own: you
bring their blood upon your own head. Calmly consider this, and you
will never more ask, “What harm there is in smuggling?”

III. 1. But for all this, cannot men find excuses for it? Yes,
abundance; such as they are. “I would not do this, says one; I would
not sell uncustomed goods: but I am under a necessity: I can’t live
without it.” I answer, may not the man who stops you on the highway,
say the very same? “I would not take your purse; but I am under a
necessity: I cannot live without it.” Suppose the case to be your own;
and will you accept of this excuse? Would not you tell him, “Let the
worst come to the worst, you had better be honest, though you
should starve.” But that need not be neither. Others who had no
more than you to begin with, yet find a way to live honestly. And
certainly so may you: however, settle it in your heart, “Live or die, I
will be an honest man.”

2. “Nay, says another, we do not wrong the king: for he loses


nothing by us. Yea, on the contrary, the king is rather a gainer,
namely by the seizures that are made.”
So you plunder the king, out of stark love and kindness! You rob
him, to make him rich! It is true, you take away his purse: but you put
an heavier in its place! Are you serious? Do you mean what you
say? Look me in the face and tell me so. You cannot. You know in
your own conscience, that what comes to the king, out of all seizures
made the year round, does not amount to the tenth, no not to the
hundredth part of what he is defrauded of.

But if he really gained more than he lost, that would not excuse
you. You are not to commit robbery, though the person robbed were
afterwards to gain by it. You are not to do evil, that good may come.
If you do, your damnation is just.

“But certainly, say some, the king is a gainer by it, or he might


easily suppress it.” Wilt you tell him, which way? By Custom-house
officers? But many of them have no desire to suppress it. They find
their account in its continuance: they come in for a share of the
plunder. But what if they had a desire to suppress it? They have not
the power. Some of them have lately made the experiment: and what
was the consequence? Why they lost a great part of their bread, and
were in danger of losing their lives.

♦8. Can the king suppress smuggling, by parties of soldiers? That


he cannot do. For all the soldiers he has are not enough, to watch
every port and every creek in Great-Britain. Besides, the soldiers
that are employed, will do little more than the Custom-house officers.
For there are ways and means to take off their edge too, and making
them as quiet as lambs.

♦ Paragraph number out of sequence.

“But many courtiers and great men, who know the king’s mind,
not only connive at smuggling, but practise it.” And what can we infer
from this? Only that those great men are great villains. They are
great highwaymen and pickpockets: and their greatness does not
excuse, but makes their crime tenfold more inexcusable.
But besides. Suppose the king were willing to be cheated, how
would this excuse your cheating his subjects? All your fellow-
subjects, every honest man, and in particular, every honest trader?
How would it excuse, your making it impossible for him to live,
unless he will turn knave as well as yourself?

3. “Well, but I am not convinced it is a sin: My conscience does


not condemn me for it.” No! Are you not convinced, that robbery is a
sin? Then I am sorry for you. And does not your conscience
condemn you for stealing? Then your conscience is asleep. I pray
God to smite you to the heart, and awaken it this day!

4. “Nay, but my soul is quite happy in the love of God: therefore I


cannot think it is wrong.” I answer, wrong it must be, if the bible is
right. Therefore either that love is a mere delusion, a fire of your own
kindling; or God may have hitherto winked at the times of ignorance.
But now you have the means of knowing better. Now light is offered
to you. And if you shut your eyes against the light, the love of God
cannot possibly continue.

5. “But I only buy a little brandy or tea now and then, just for my
own use.” That is, I only steal a little. God says, steal not at all.

6. “Nay, I do not buy any at all myself: I only send my child or


servant for it.” You receive it of them: Do you not? And the receiver is
as bad as the thief.

7. “Why I would not meddle with it, but I am forced, by my parent,


husband, or master.” If you are forced by your father or mother to
rob, you will be hanged nevertheless. This may lessen, but does not
take away the fault: for you ought to suffer rather than sin.

8. “But I do not know, that it was run.” No! Did not he that sold it,
tell you it was? If he sold it under the common price, he did. The
naming the price, was telling you, “This is run.”
9. “But I don’t know where to get tea which is not run.” I will tell
you where to get it. You may have it from those whose tea is duly
entered, and who make a conscience of it. But were it otherwise, if I
could get no wine, but what I knew to be stolen, I would drink water:
yea, though not only my health, but my life depended upon it: for it is
better to die, than to live by thieving.

10. “But if I could get what has paid duty, I am not able to pay the
price of it. And I can’t live without it.” I answer, 1. You can live without
it, as well as your grandmother did. But 2. If you could not live
without it, you ought to die, rather than steal. For death is a less evil
than sin.

11. “But my husband will buy it, whether I do or no. And I must
use what he provides, or have none.” Undoubtedly to have none is a
less evil, than to be partaker with a thief.

IV. Upon the whole then, I exhort all of you that fear God, and
desire to save your souls, without regarding what others do, resolve
at all hazards, to keep yourselves pure. Let your eye be fixed on the
word of God, not the examples of men. Our Lord says to every one
of you, What is that to thee? Follow thou me! Let no convenience, no
gain, no pleasure, no friend, draw you from following him. In spite of
all the persuasions, all the reasonings of men, keep to the word of
God. If all on the right-hand and the left will be knaves, be you an
honest man. Probably God will repay you (he certainly will, if this be
best for you) even with temporal blessings: there have not been
wanting remarkable instances of this. But if not, he will repay you
with what is far better: with the testimony of a good conscience
towards God; with joy in the Holy Ghost; with an hope full of
immortality; with the love of God shed abroad in your hearts. And the
peace of God, which passeth all understanding, shall keep your
hearts and minds in Christ Jesus!

London,
January 30, 1767.
A W O R D to a

Condemn’d Malefactor.
W HAT a condition are you in? The sentence is past: you are
condemned to die: and this sentence is to be executed
shortly. You have no way to escape; these fetters, these walls, these
gates and bars, these keepers cut off all hope. Therefore die you
must: but must you die like a beast, without thinking what it is to die?
You need not: you will not: you will think a little first: you will consider,
what is death? It is leaving this world, these houses, lands, and all
things under the sun; leaving all these things, never to return; your
place will know you no more. It is leaving these pleasures; for there
is no eating, drinking, gaming, no merriment in the grave. It is leaving
your acquaintance, companions, friends: your father, mother, wife,
children. You cannot stay with them, nor can they go with you: you
must part; perhaps for ever. It is leaving a part of yourself; leaving
this body which has accompanied you so long. Your soul must now
drop its old companion, to rot and moulder into dust. It must enter
upon a new, strange, unbodied state. It must stand naked before
God!
2. But O! how will you stand before God? The great, the holy, the
just, the terrible God? Is it not his own word, Without holiness no
man shall see the Lord? No man shall see him with joy: rather he will
call for the mountains to fall upon him and the rocks to cover him.
And what do you think holiness is? It is purity both of heart and life. It
is the mind that was in Christ, enabling us to walk as he also walked.
It is the loving God with all our heart, the loving our neighbour, every
man as ourselves, and the doing to all men, in every point, as we
would they should do unto us. The least part of holiness is, to do
good to all men, and to do no evil either in word or work. This is only
the outside of it. But this is more than you have. You are from it; far
as darkness from light. You have not the mind that was in Christ:
there was no pride, no malice in him: no hatred, no revenge, no
furious anger, no foolish or worldly desire. You have not walked as
Christ walked: no; rather as the devil would have walked, had he
been in a body; the works of the devil you have done, not the works
of God. You have not loved God with all your heart. You have not
loved him at all. You have not thought about him. You hardly knew or
cared, whether there was any God in the world. You have not done
to others as you would they should do to you; far, very far from it.
Have you done all the good you could to all men? If so, you had
never come to this place. You have done evil exceedingly: your sins
against God and man are more than the hairs of your head.
Insomuch that even the world cannot bear you; the world itself spues
you out. Even the men that know not God declare, you are not fit to
live upon the earth.
3. O repent, repent! Know yourself: see and feel what a sinner
you are. Think of the innumerable sins you have committed, even
from your youth up. How many wicked words have you spoken?
How many wicked actions have you done? Think of your inward sins!
Your pride, malice, hatred, anger, revenge, lust. Think of your sinful
nature, totally alienated from the life of God. How is your whole soul
prone to evil, void of good, corrupt, full of all abominations! Feel, that
your carnal mind is enmity against God. Well may the wrath of God
abide upon you. He is of purer eyes than to behold iniquity: he hath
said, The soul that sinneth, it shall die. It shall die eternally, shall be
punished with everlasting destruction, from the presence of the Lord
and from the glory of his power.

4. How then can you escape the damnation of hell? The lake of
fire burning with brimstone? Where the worm dieth not, and the fire
is not quenched? You can never redeem your own soul. You cannot
atone for the sins that are past. If you could leave off sin now, and
live unblamable for the time to come, that would be no atonement for
what is past. Nay, if you could live like an angel for a thousand years,
that would not atone for one sin. But neither can you do this: you
cannot leave off sin: it has the dominion over you. If all your past sins
were now to be forgiven, you would immediately sin again: that is,
unless your heart were cleansed; unless it were created anew. And
who can do this? Who can bring a clean thing out of an unclean?
Surely none but God. So you are utterly sinful, guilty, helpless! What
can you do to be saved?
5. One thing is needful: believe in the Lord Jesus Christ, and thou
shalt be saved! Believe (not as the devils only, but) with that faith
which is the gift of God, which is wrought in a poor, guilty, helpless
sinner, by the power of the Holy Ghost. See all thy sins on Jesus
laid. God laid on him the iniquities of us all. He suffered once the just
for the unjust. He bore our sins in his own body on the tree. He was
wounded for thy sins; he was bruised for thy iniquities. Behold the
Lamb of God, taking away the sin of the world! Taking away thy sins,
even thine, and reconciling thee unto God the Father! Look unto him
and be thou saved! If thou look unto him by faith, if thou cleave to
him with thy whole heart, if thou receive him both to atone, to teach
and to govern thee in all things, thou shalt be saved, thou art saved,
both from the guilt, the punishment, and all the power of sin. Thou
shalt have peace with God, and a peace in thy own soul, that
passeth all understanding. Thy soul shall magnify the Lord, and thy
Spirit rejoice in God thy Saviour. The love of God shall be shed
abroad in thy heart, enabling thee to trample sin under thy feet. And
thou wilt then have an hope full of immortality. Thou wilt no longer be
afraid to die, but rather long for the hour having a desire to depart,
and to be with Christ.
6. This is the faith that worketh by love, the way that leadeth to
the kingdom. Do you earnestly desire to walk therein? Then put
away all hindrances. Beware of company: At the peril of your soul,
keep from those who neither know nor seek God. Your old
acquaintance are no acquaintance for you, unless they too acquaint
themselves with God. Let them laugh at you, or say, you are running
mad. It is enough, if you have praise of God. Beware of strong drink.
Touch it not, lest you should not know when to stop. You have no
need of this to chear your spirits; but of the peace and the love of
God: beware of men that pretend to shew you the way to heaven,
and know it not themselves. There is no other name whereby you
can be saved, but the name of our Lord Jesus Christ. And there is no
other way whereby you can find the virtue of his name but by faith.
Beware of Satan transformed into an angel of light, and telling you, it
is presumption to believe in Christ, as your Lord and your God, your
wisdom and righteousness, sanctification and redemption. Believe in
him with your whole heart. Cast your whole soul upon his love. Trust
him alone: love him alone: fear him alone: and cleave to him alone:
Till he shall say to you (as to the dying malefactor of old,) This day
shalt thou be with me in paradise.
A W O R D in S E A S O N;

O R,

¹
Advice to an ENGLISHMAN.
¹ This was published at the beginning of the late rebellion.

1. O you ever think? Do you ever consider? If not, ’tis high time
D you should. Think a little, before it is too late. Consider what a
state you are in. And not you alone, but our whole nation. We
would have war. And we have it. And what is the fruit? Our armies
broken in pieces: And thousands of our men either killed on the spot
or made prisoners in one day. Nor is this all. We have now war at our
own doors: our own countrymen turning their swords against their
brethren. And have any hitherto been able to stand before them?
Have they not already seized upon one whole kingdom? Friend,
either think now, or sleep on and take your rest, till you drop into the
pit where you will sleep no more?

2. Think, what is likely to follow, if an army of French also, should


blow the trumpet in our land! What desolation may we not then
expect? What a wide-spread field of blood? And what can the end of
these things be? If they prevail, what but Popery and Slavery? Do
you know what the spirit of Popery is? Did you never hear of that in
queen Mary’s reign? And of the holy men who were then burnt alive
by the Papists, because they did not dare to do as they did? To
worship angels and saints; to pray to the virgin Mary; to bow down to
images, and the like. If we had a king of this spirit, whose life would
be safe? At least, what honest man’s? A knave indeed might turn
with the times. But what a dreadful thing would this be to a man of
conscience? “Either turn, or burn. Either go into that fire: or into the
fire that never shall be quenched.”
3. And can you dream that your property would be any safer than
your conscience? Nay, how should that be? Nothing is plainer than
that the Pretender cannot be king of England, unless it be by
conquest. But every conqueror may do what he will. The laws of the
land are no laws to him. And who can doubt, but one who should
conquer England by the assistance of France, would copy after the
French rules of government?

4. How dreadful then is the condition wherein we stand? On the


very brink of utter destruction! But why are we thus? I am afraid the
answer is too plain, to every considerate man. Because of our sins:
because we have well-nigh filled up the measure of our iniquities.
For, what wickedness is there under heaven, which is not found
among us at this day? Not to insist on the sabbath-breaking in every
corner of our land, the thefts, cheating, fraud, extortion; the injustice,
violence, oppression; the lying and dissimulating; the robberies,
sodomies and murders (which, with a thousand unnamed ♦villainies
are common to us and our neighbour Christians of Holland, France,
and Germany:) consider over and above, what a plentiful harvest we
have of wickedness almost peculiar to ourselves? For who can vie
with us, in the direction of courts of justice? In the management of
public charities? Or, in the accomplished, barefaced wickedness,
which so abounds in our prisons, and fleets, and armies? Who in
Europe can compare with the sloth, laziness, luxury and effeminacy
of the English gentry? Or with the drunkenness, and stupid,
senseless cursing and swearing, which are daily seen and heard in
our streets? One great inlet, no doubt, to that flood of perjury, which
so increases among us day by day: the like whereunto is not to be
found, in any other part of the habitable earth.

♦ “villanies” replaced with “villainies”


5. Add to all these (what is indeed the source as well as
completion of all) that open and profess’d Deism and rejection of the
Gospel, that public, avowed apostacy from the Christian faith, which
reigns among the rich and great, and hath spread from them to all
ranks and orders of men (the vulgar themselves not excepted) and
made us a people fitted for the destroyer of the Gentiles.

6. Because of these sins is this evil come upon us. For (whether
you are aware of it, or no) there is a God: a God, who tho’ he sits
upon the circle of the heavens, sees and knows all that is done upon
earth. And this God is holy; he does not love sin: he is just, rendering
to all their due. And he is strong; there is none able to withstand him:
he hath all power in heaven and in earth. He is patient indeed, and
suffers long; but he will at last repay the wicked to his face. He often
does so in this world; especially when a whole nation is openly and
insolently wicked. Then doth God arise and maintain his own cause;
then doth he terribly shew both his justice and power: that if these
will not repent, yet others may fear, and flee from the wrath to come.

7. There hath been among them that feared God, a general


expectation for many years, that the time was coming, when God
would thus arise, to be avenged on this sinful nation. At length the
time is come. The patience of God, long provoked, gives place to
justice. The windows of heaven begin to be opened, to rain down
judgments on the earth. And yet, with what tenderness does he
proceed? In the midst of wrath remembring mercy. By how slow
degrees does his vengeance move! Nor does his whole displeasure
yet arise.
8. Brethren, countrymen, Englishmen, What shall we do? To-day!
While it is called to-day! Before the season of mercy is quite expired,
and our destruction cometh as a whirlwind? Which way can we
remove the evils we feel? Which way prevent those we fear? Is there
any better way, than the making God our friend? The securing his
help against our enemies? Other helps are little worth. We see
armies may be destroyed, or even flee away from old men and
children. Fleets may be dashed to pieces in an hour, and sunk in the
depth of the sea. Allies may be treacherous, or slow, or foolish, or
weak, or cowardly. But God is a friend who cannot betray, and whom
none can either bribe or terrify. And who is wise, or swift, or strong
like him? Therefore, whatever we do, let us make God our friend. Let
us with all speed remove the cause of his anger. Let us cast away
our sins. Then shall his love have free course, and he will send us
help, sufficient help, against all our enemies.

9. Come; will you begin? Will you, by the grace of God, amend
one, and that without delay? First then, own those sins which have
long cried for vengeance in the ears of God. Confess, that we and all
(and you in particular) deserve for our inward and outward
abominations, not only to be swept from the face of the earth, but to
suffer the vengeance of eternal fire. Never aim at excusing either
yourself or others: Let your mouth be stopt. Plead guilty before God.
Above all, own that impudence of wickedness, that utter
♦carelessness, that pert stupidity, which is hardly to be found in any
part of the earth, (at least, not in such a degree) except in England.
Do you not know what I mean? You was not long since praying to
God for “damnation upon your own soul.” One who has heard you,
said, is that right? Does not God hear? “What if he takes you at your
word?” You replied, with equal impudence and ignorance, “What, Are
you a Methodist?”――What, if he is a Turk? Must thou therefore be a
Heathen?――God humble thy brutish, devilish spirit.

♦ “carlessness” replaced with “carelessness”


10. Lay thee in the dust, for this and for all thy sins. Let thy
laughter be turned into heaviness; thy joy into mourning; thy
senseless jollity and mirth, into sorrow and brokenness of heart. This
is no time to eat and drink and rise up to play; but to afflict thy soul
before the Lord. Desire of God a deep piercing sense of the
enormous sins of the nation, and of thy own. Remember that great
example: how when the king of Nineveh was warned of the near
approaching vengeance of God, he caused it to be proclaimed, Let
none taste any thing, let them not feed nor drink water. But let them
be covered with sackcloth, and cry mightily to God; yea let them turn
every one from his evil way; who can tell, if God will turn and repent,
and turn away from his fierce anger that we perish not. Jonah iii.

11. Let them turn every one from his evil way. Cease to do evil.
Learn to do well. And see that this reformation be universal: for there
is no serving God by halves. Avoid all evil, and do all good unto all
men; else you only deceive your own soul. See also, that it be from
the heart: lay the axe to the root of the tree. Cut up, by the grace of
God, evil desire, pride, anger, unbelief. Let this be your continual
prayer to God, the prayer of your heart, (as well as lips) “Lord, I
would believe: help thou mine unbelief! Give me the faith that
worketh by love. The life which I now live, let me live by faith in the
Son of God. Let me so believe, that I may love thee, with all my
heart, and mind, and soul, and strength! and that I may love every
child of man, even as thou hast loved us! Let me daily add to my
faith courage, knowledge, temperance, patience, brotherly kindness,
charity: that so an entrance may be ministered to me abundantly,
into the everlasting kingdom of our Lord and Saviour Jesus Christ.”

A n H Y M N.
R EGARD, thou righteous God and true,

Regard thy weeping people’s prayer,

Before the sword our land go through,

Before thy latest plague we bear,

Let all to thee their smiter turn,

Let all beneath thine anger mourn.

The sword, which first bereav’d abroad,

We now within our borders see:

We see, but slight thy nearer rod,

So oft so kindly warn’d by thee:

We still thy warning love despise,

And dare thine utmost wrath to rise.

Yet for the faithful remnants sake

Thine utmost wrath awhile defer,

If haply we at last may wake,

And trembling at destruction near

The cause of all our evils own,

And leave the sins for which we groan.

Or if the wicked will not mourn,

And ’scape the long-suspended blow,

Yet shall it to thy glory turn,

Yet shall they all thy patience know,

Thy slighted love and mercy clear,

And vindicate thy justice here.

You might also like