Castillo2020 Bayesian Predictive Optimization of Multiple and Profile Responses Systems in Industry

Keywords: High dimensional response optimization; Quality by design; Hierarchical linear models; Robust parameter design

Abstract: Bayesian statistical methods provide a sound mathematical framework to combine prior knowledge about controllable variables of importance in a process, if available, with the actual data, and have proved useful in several data analytic tasks in the process industry. We present a review and some extensions of bayesian predictive methods for process optimization based on experimental design data, an area that is critical in Quality by Design activities and where the bayesian perspective has received limited attention from the Chemometrics and process analytics communities. The goal of the methods is to maximize the probability of conformance of the predicted responses to their specification limits by varying the process operating conditions. Optimization of multiple response systems and of systems where the performance is given by a curve or "profile" are considered, as they are more challenging to model and optimize, yet are increasingly common in practice. We discuss the particular case of Robust Parameter Design, a technique due to G. Taguchi and popular in discrete manufacturing systems, and its implementation within a bayesian optimization framework. The usefulness of the models and methods is illustrated with three real-life chemical process examples. MATLAB code that implements all methods and reproduces all examples is made available.
1. Introduction: bayesian optimization and robust parameter design

The utility of the bayesian paradigm to model uncertainties in complex engineering systems has long been recognized. Bayesian "predictivism" centers on forecasting future observable variables (typically, a response of some system) in terms of the resulting posterior predictive distributions, and avoids traditional inferences on non-observable model parameters [4,62]. The focus of this paper is bayesian predictive techniques that deal with experimental data, obtained with the goal of optimizing or improving a process or a product.

A specific level of system complexity that can be analyzed in an effective manner using a bayesian predictive approach is when the process response is multivariate, including the high dimensional case, common in current industrial data-rich environments. Related to the multivariate response process case, but significantly different from it, is the case we focus on in this paper, that of a process whose performance is not given by a collection of correlated scalar responses but instead by the shape of a continuous curve, i.e., a "profile" response system, in which an ideal profile or function shape is assumed to exist and describes the best operation of the process [60]. In this type of process, the data collected at given operating conditions are a complete realization of a function of some index variable over a given range, what the Statistics literature calls Functional Data [56]. We consider only the optimization of a process based on models empirically derived from experimental data, and leave the case of mechanistic models, based on first principles, outside the scope of the paper.

The usefulness of the bayesian paradigm has recently been pointed out by Tabora et al. [67] in the area of pharmaceutical process development. These authors recommend the bayesian predictive approach developed by Peterson et al. [13,38,46–48,50] to consider the inherent uncertainties in process development and to quantify the risk in "Quality by Design" problems, as defined by the FDA. Even though the diffusion of bayesian methods to systems engineering, process analytics and chemometrics is still limited, a number of important problems have been studied. For instance, Nounou et al. developed bayesian latent variable model estimation methodologies [42,43]. Chen et al. [9] propose to study traditional chemometrics methods from a bayesian point of view. Mechanistic modeling has benefitted from bayesian formulations for the estimation of kinetic parameters from
* Corresponding author.
E-mail address: exd13@psu.edu (E. Castillo).
https://doi.org/10.1016/j.chemolab.2020.104121
Received 25 February 2020; Received in revised form 22 June 2020; Accepted 21 July 2020
Available online 25 August 2020
0169-7439/© 2020 Elsevier B.V. All rights reserved.
E. Castillo, M.S. Reis Chemometrics and Intelligent Laboratory Systems 206 (2020) 104121
Fig. 1. A schematic of a process as seen from the point of view of Taguchi’s Robust Parameter Design (RPD). The goal in RPD is to find settings for the controllable
factors (operating or design variables) that keep the responses of the process on the desired targets or goals in the presence of uncontrolled variability in the so-called
noise factors. RPD requires process models, often obtained via experimental tests and statistical modeling.
chemical reactions [29,44,52]. Applications to data rectification [2], state estimation [10] and sensor fusion [3,18,22,61,65] have also been developed for dynamical systems. More recently, several bayesian process monitoring methodologies have been developed, namely for multimode, non-linear and non-stationary processes [24,31,33,75]. Notably, the bayesian predictive approach is the basis of many contemporary machine learning (ML) methods, but explicit ML models and methods for the optimization of chemical processes are lacking.

In the optimization of a chemical process from data-based models obtained from experimental tests, it is of prime importance to compute the probability that a future experiment will reproduce the current result considering the different uncertainties involved [58]. There are uncertainties not only due to measurement errors but also in the model itself, or resulting from the type of experimental design used, that need to be considered during the optimization stage. Additional "common cause" sources of variability in the process need to be considered as well. In the discrete-parts industry (but not so much in the process industry) a popular type of process optimization approach was initiated in the late 80s by the Japanese engineer G. Taguchi [68], who introduced the key concept of a noise factor. These are process or product variables that can be manipulated in a carefully controlled experiment, but once the process operates regularly (or the product is in the marketplace) they cannot be controlled by the manufacturer anymore and, on the contrary, vary randomly. This situation leads to the Robust Parameter Design (RPD) problem (Taguchi referred to process variables as "parameters"), in which the goal is to find the settings of the controllable process/product variables that optimize the responses despite the uncontrolled variability introduced by the noise factors (see Fig. 1), i.e., to find an optimal solution that is robust with respect to noise factor variability. Both controllable and noise factors can be either continuous variables or categorical variables (taking values over a discrete set of levels). For an overview of Robust Parameter Design methods, see Refs. [1,13,40,74]. As will be shown below, all these sources of uncertainty can be handled with a bayesian predictive optimization approach.

In this paper, we present a review and some extensions of the bayesian predictive approach for the solution of optimization problems in the broader context of the process industries, including the RPD case. We focus on two types of empirical models: multiple response processes (perhaps with a high dimensional response) and processes where the performance is described by some continuous curve or profile. These models have not been sufficiently covered in the literature, but are becoming increasingly relevant as more and more applications involve these types of high-dimensional responses (rather than the more common case of a high-dimensional predictor space, as in machine learning applications). We illustrate the methods with three real-life applications in the process industry where the system response is a profile.
Fig. 2. Overlaid contour plot of the four fitted responses in a tire tread experiment in the x1–x2 plane (keeping x3 constant at 0.0). The unshaded region appears to be a "sweet spot" where to run the process, as it seems to satisfy the response constraints. However, these are models for the mean responses, and as such neglect both the process variation and the model uncertainty. See text.

The remainder of the paper is organized as follows. We first contrast optimization approaches based on models fitted with classical (frequentist) statistical methods with a bayesian predictive approach. Next, two specific types of models, multivariate regression for multiple response processes and hierarchical mixed effects models for processes with a profile or curve response, are presented, and their corresponding bayesian modeling and optimization is discussed, including the case of robust parameter design optimization. Here a hierarchical mixed effects model previously used by del Castillo et al. [15] is adapted to the optimization of high dimensional responses via principal component analysis. Next, we compare the hierarchical mixed effects model for a profile response to the increasingly popular probabilistic latent variable models used in Machine Learning, highlighting their different structure, and how the latter are not adequate to model high dimensional responses originated from experimental data where controllable factors are manipulated. Finally, three examples are presented that illustrate the methods as applied to industrial processes. MATLAB code is made available as
supplementary material that accompanies this paper, which implements all methods and examples presented here.

1.1. Limitations of process optimization based on classical statistical models

A common approach for process optimization based on experimental data is to fit regression models to the responses and overlay their contour plots to find a "sweet spot" where to run the process, a task facilitated by various popular statistical software packages. For a classical example, consider the experiment reported in Ref. [16] for the optimization of a car tire tread compound. The controllable factors were x1, hydrated silica level, x2, silane coupling agent level, and x3, sulfur level. The four responses to be optimized and their desired ranges were: PICO Abrasion index, y1, 120 < y1; 200% modulus, y2, 1000 < y2; elongation at break, y3, 400 < y3 < 600; and hardness, y4, 60 < y4 < 75. Quadratic polynomial models were fitted to each of these responses from experimental data. Fig. 2 shows overlaid contour plots (obtained with a popular statistical software package) which seem to imply it is safe to run the process within the "sweet spot" in the unshaded area.

However, it can be risky to follow this common practice. Suppose we fit a regression model to a process response of the form Ŷ = g(x1, x2, …, xk) from experimental data. Under the usual regression assumptions, the model fit is a prediction of the mean of Y, i.e., Ŷ = Ê[Y]. Assume, for instance, that we wish to minimize the response by finding operating conditions x* = (x1*, …, xk*) such that Ŷ(x*) = Ê[Y(x*)] ≤ u, where u is an acceptable upper bound for the response. It is important to note that all that E[Y(x*)] ≤ u guarantees for future times we observe the process at conditions x* is that

P(Y(x*) ≤ u) ≥ 0.5

under the assumption of a distribution symmetric around the mean, such as the normal. Likewise, in the case of two responses, if we find operating conditions x* such that E[Y1(x*)] ≤ u1 and E[Y2(x*)] ≤ u2, all that is guaranteed at these conditions is that

P(Y1 ≤ u1, Y2 ≤ u2) ≥ 0.25.

In general, suppose we have q responses in a process, and we fit corresponding regression models to E[Y1(x)], …, E[Yq(x)]. If we then find, either numerically or simply by overlapping contour plots of the responses, the process conditions x* such that E[Y1(x*)] ≤ u1, E[Y2(x*)] ≤ u2, …, E[Yq(x*)] ≤ uq, all that is guaranteed at x* is that

P(Y1(x*) ≤ u1, Y2(x*) ≤ u2, …, Yq(x*) ≤ uq) ≥ 0.5^q,

a probability that can indeed be very low. In practice, this probability bound is an overestimate, given that fitted models Ŷj(x) = Ê[Yj(x)] are used instead. Also, if the responses are positively correlated, the lower bound will be higher and it may be easier to jointly optimize the system. If the responses are negatively correlated, the opposite happens [48,49]. Peterson and Lief [49] document how, in six real industrial process optimization studies, including those in three published papers from the literature, the posterior probability of meeting the desired specifications was in some cases as low as 0.11, highlighting the danger of optimizing fitted functions for the mean response that neglect the uncertainties involved. In summary, optimizing regression models fitted to a set of responses and looking at overlaid contour plots:

- neglects model parameter uncertainty;
- cannot provide a probability of assurance or "reliability" about whether future responses at given process settings x will satisfy process specifications;
- neglects the covariance between the responses during the optimization step, even if models were fitted via classical multivariate regression, which does consider these covariances during the estimation step.

1.2. Advantages of the bayesian predictive approach for process optimization

Billheimer [4] has recently advocated the bayesian predictive approach for statistical inference. A natural way to optimize any process from a quality and reliability standpoint is to maximize the probability of conformance of the predicted responses to their specification limits. Bayesian predictive models and their use in industrial process optimization were pioneered by Peterson [46] in pharmaceutical applications (see also [13] for a detailed presentation). As Billheimer [4] indicates, "A scientist is interested in the probability that a future experiment will reproduce the current result". Likewise, in process optimization, an engineer is interested in the probability that the future operation of the process will result in the optimal performance that the settings deduced from past experiments seem to indicate.

In the pharmaceutical sector, the ICH Q8 and Q11 guidelines [19] for industry promoted the concept of a "design space", defined as "the combination of input variables and process parameters that have been demonstrated to provide assurance of quality". Using the methods presented here it is possible to find a design space for a process that with known probability is predicted to achieve particular goals if run inside the space (see Ref. [66], and for an instance, see example 2 below).

In summary, the advantages of the bayesian predictive optimization approach compared to optimizing models fitted classically are:

- it provides a probability estimate that future responses at given operating conditions x will satisfy the desired process specifications;
- it considers the uncertainty in the model parameters, not only in the measurements;
- it takes into account the correlation between the responses during the optimization (not only during model building);
- it permits the inclusion of additional sources of variability such as "noise factors" in the Taguchi sense;
- it can be extended to different types of responses, as reviewed below.

For further discussion of the disadvantages of the classical approach to response surface optimization and the advantages of a bayesian approach, see [13, chapter 12].

2. Bayesian inference for multiple response and profile response processes

2.1. Central role of the predictive density in bayesian process optimization

Regardless of the model that is most appropriate for a response Y, the object of interest for bayesian optimization is the posterior predictive density (or predictive density, for short) of the response, f(Ỹ | x, data), where "data" denotes all the experimental data available in an n-run experiment in k factors, including controllable and noise factors {x_ij}, i = 1, …, n, j = 1, …, k, and the corresponding observed response values {Y_i}, i = 1, …, n, and where the tilde on Y denotes a future response value not yet observed at covariates x. The predictive density contains all the relevant information, obtained from the observed data or a priori, about a response for which one wishes to make future inferences, and this includes the quality of the fit of the model as expressed in the dispersion of the density. Further model diagnostics are based on predictive checks based on this density. Given a statistical model with parameters θ, the posterior predictive density is defined as:
f(Ỹ | x, data) = ∫ p(Ỹ, θ | x, data) dθ
             = ∫ p(Ỹ | θ, x, data) p(θ | data) dθ
             = ∫ p(Ỹ | θ, x) p(θ | data) dθ    (1)

where the last step follows because the density of Ỹ at x is independent of the data given the parameters θ (in this case its density only depends on the density of the measurement error ε). Here, p(Ỹ, θ | x, data) is the likelihood function, which is marginalized over the model parameters θ; p(Ỹ | θ, x) is the density function of the future response Ỹ evaluated at covariates x, which depends on the statistical model assumed; and p(θ | data) is the posterior distribution of the model parameters. Depending on the model for the response Y, the integral in equation (1) may have a closed form expression (this is the case of the multiple response regression model, see below). In case it does not have a closed form, Markov Chain Monte Carlo (MCMC) techniques can be used to find the posterior of the parameters and, with it, the posterior predictive density. Either informative or non-informative priors on θ can be used, but informative priors are usually difficult to justify and it is usually better to work with non-informative priors, see Refs. [12–14,25].

For a specific illustration of a simple predictive density, consider a basic (univariate) linear regression model Y = β0 + β1 x1 + … + βk xk + ε, where ε ~ N(0, σ²), with p = k + 1 parameters. In this case, the posterior predictive density f(Ỹ | x̃, Y, X) (where the observed data = (Y, X)) is given by a Student t distribution with N − p degrees of freedom, mean x̃'β̂ (where β̂ = (X'X)⁻¹X'Y is the ordinary least squares estimate of β = (β0, …, βk)') and variance S²(1 + x̃'(X'X)⁻¹x̃). Here, S² is the sample variance estimator of σ² and X is an N × p matrix with one column per model term and one row per experimental test (thus, for this simple model in particular, X = [x_i']_{i=1,…,N} where x_i' = (1, x_i1, x_i2, …, x_ik)). From these results, it is easy to verify that the variance of the predictive density is a function of:

- the amount of data (N, as seen in matrix X and the sample variance S²);
- the inherent noise due to measurement error (estimated via S²);
- how well the model (as defined by the columns of X) fits;
- the model parameter uncertainty, determined in good part by the experimental design used, present in matrix X;
- the variability of the noise factors x_n, where x = (x_c, x_n) are all the factors in the experiment (controllable and noise factors, respectively);
- our ability to control the noise factor variability thanks to the presence of interaction terms in the model between x_c and x_n variables. If such control × noise interactions do not exist in the model, there is no way to solve the RPD problem [13].

2.2. Bayesian optimization of multi-response linear regression models

Let Y1, Y2, …, YJ be J responses of interest in a process. To investigate the input-output relations of the process, an experiment is designed and conducted by varying k controllable factors x1, …, xk over N different test conditions or "runs". Assume we fit a linear regression model to each of the J responses (the same model form for all responses is assumed in this section), each containing q parameters. If the number of runs N in the experiment satisfies the inequality

J < N − q + 1    (2)

then a standard multivariate linear regression model can be fitted, written in matrix notation as:

Y = XΓ + E    (3)

where Y is an N × J matrix containing the observations of the J responses in each of the N experimental runs (combinations of controllable and noise factors). It is important to note that the word "linear" refers to the model parameters Γ, which enter linearly in the response model, not to the covariates x_i, which can enter matrix X via a nonlinear basis, such as a polynomial or a spline basis. Hence, curvature or nonlinearities in each dependent variable or response (Y_j) can be modeled as a function of the independent variables x_i. Also, X is an N × q design matrix, with all the controllable and noise factors expanded in model form according to the same q-parameter model. Γ is a q × J matrix containing all the q parameters for each response j = 1, 2, …, J (note the same model form is assumed for all responses), and E = [ε_ij] is an N × J matrix of random errors, which are assumed to be probabilistically described as follows:

ε_i· ~ N_J(0, Σ)  for all i = 1, …, N
ε_·j ~ N_N(0, σ_j² I_N)  for all j = 1, …, J.

That is, the model assumes that errors between responses from the same experiment (a row of E) can be correlated, but it also assumes that the errors along a column of E (errors for the same response Y_j for different experimental runs i) are independent random variables. Under this model, a J × 1 vector y, containing a single, not yet observed, vector of responses for given levels of the controllable and noise factors x, is assumed to follow the model:

y = Γ'f(x) + ε    (4)

where f(x) is a q × 1 vector containing the values of the controllable and noise factors x = (x_c, x_n) at which the prediction is desired (expanded in model form, in the same form as used in the columns of X and Γ) and ε has the same distribution as a row of E, i.e., a N(0, Σ) distribution.

Fortunately, the posterior predictive density of model (4) is available in closed form. Here one can utilize non-informative priors, to avoid heavily weighted priors that are hard to justify. If condition (2) does not hold because the response is very high dimensional, the hierarchical mixed effects model described further below, originally proposed for the optimization of profile responses, can be used as well, as we explain in example 1.

Under the classical non-informative joint prior for Γ and Σ in equation (3), namely p(Γ, Σ) ∝ |Σ|^(−(J+1)/2), it is well known (see, e.g., Ref. [53], pp. 136, or [13]) that the bayesian predictive density for a new response vector ỹ is given by a J-dimensional Student t distribution with ν = N − q − J + 1 degrees of freedom:

f(ỹ | x, data) = [Γ((ν + J)/2) / ((πν)^(J/2) Γ(ν/2))] |H|^(1/2) [1 + (1/ν)(ỹ − Γ̂'x)' H (ỹ − Γ̂'x)]^(−(ν + J)/2)    (5)

This is denoted T_ν^J(a, b), where a is the mean vector and b is the variance matrix. That is, ỹ | x, data ~ T_ν^J(Γ̂'x, (ν/(ν − 2)) H⁻¹), where Γ̂ is the ordinary least squares (OLS) estimator of Γ, and H is given by:

H = [ν / (N − q)] · Σ̂⁻¹ / (1 + x'(X'X)⁻¹x)

where Σ̂ is the usual Maximum Likelihood Estimator (MLE) of Σ.

Footnote 1: The Machine Learning and Statistics literature actually calls "PCA" the model where Σ = σ²I, while it calls the model a FA model when Σ is diagonal but has different entries. Thus, PCA in Chemical Engineering is closer to FA in Statistics and Machine Learning.
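The closed-form predictive density (5) translates directly into a few lines of code. The following numpy sketch (the paper's companion code is MATLAB; the Python function names here are ours, not the paper's API) computes the parameters of the predictive t density at a model-expanded point x and draws samples from it:

```python
import numpy as np

def predictive_t_params(X, Y, x):
    """Parameters of the J-dimensional Student t posterior predictive density
    (eq. (5)) for the multivariate regression model Y = X Gamma + E under the
    non-informative prior p(Gamma, Sigma) propto |Sigma|^-(J+1)/2.
    Returns (mean vector, scale matrix H^-1, degrees of freedom nu)."""
    N, q = X.shape
    J = Y.shape[1]
    nu = N - q - J + 1
    assert nu >= 1, "condition (2) requires J < N - q + 1"
    XtX_inv = np.linalg.inv(X.T @ X)
    Gamma_hat = XtX_inv @ X.T @ Y            # OLS estimator of Gamma (q x J)
    resid = Y - X @ Gamma_hat
    Sigma_hat = resid.T @ resid / N          # MLE of Sigma (J x J)
    mu = Gamma_hat.T @ x                     # predictive mean, Gamma_hat' x
    c = (N - q) * (1.0 + x @ XtX_inv @ x) / nu
    return mu, c * Sigma_hat, nu             # scale matrix = H^-1 = c * Sigma_hat

def sample_predictive(mu, S, nu, n, rng):
    """Draw n samples from T_nu^J(mu, .) with scale matrix S = H^-1,
    using the standard normal/chi-square construction of a multivariate t."""
    z = rng.multivariate_normal(np.zeros(len(mu)), S, size=n)
    g = rng.chisquare(nu, size=n) / nu
    return mu + z / np.sqrt(g)[:, None]
```

Monte Carlo estimates of P(ỹ ∈ A | x, data) then reduce to counting the fraction of such draws that fall inside the specification region A.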
Based on the bayesian multivariate regression formulation above, Peterson [46] proposed a method to conduct multiple-response optimization that accounts for the uncertainty in the parameters of the model and for any correlation present between the responses. His method assumes no noise factors, hence x = x_c, and consists in solving

max p(x_c) = P(ỹ ∈ A | x_c, data)
subject to: x_c ∈ R_c    (6)

where A is a desired specification region for the J responses and R_c is a feasible region for the controllable factors. In its simplest form, both A and R_c are given by upper and lower bounds.

To obtain p(x_c) through solving (6), the predictive density f(ỹ | x_c, data) in (5) needs to be integrated (numerically) over the region A:

p(x_c) = ∫_A f(ỹ | x_c, data) dỹ.    (7)

For a multivariate regression model, this can easily be done by Monte Carlo simulation of a multivariate t distribution, as explained in Appendix A.

If noise factors are present in the system and affect the responses (the RPD case), a bayesian approach that provides solutions which are robust both to the noise factor variability and to the uncertainty in the model parameters was proposed by Miro et al. [38] as an extension of Peterson's method. It consists in solving the following optimization problem:

max p(x_c)_RPD = ∫ P(ỹ ∈ A | x, data) f(x_n) dx_n
subject to: x_c ∈ R_c.    (8)

That is, after obtaining p(x) for fixed x = (x_c, x_n), a second integration is performed over the (assumed known) distribution of the noise factors, f(x_n), to obtain the probability of conformance to specifications (also called the "reliability" of the process [46]) for the RPD problem, p(x_c)_RPD. Miro et al. [38] considered only normally distributed noise factors. We extend this below to the case of Bernoulli (p) and generalized Bernoulli (p1, p2, …, pL) noise factors, given that in many cases categorical noise factors, with L levels each, are uncontrolled and cannot be treated as continuous random variables. The MATLAB code provided (see supplementary materials) implements all the integrations and the optimization needed to solve problem (8). Since the function p(x_c)_RPD is in general not concave, a non-linear optimization algorithm must be run from several initial points within the region R_c. Also, evaluation of p(x_c)_RPD requires the integral to be computed numerically via Monte Carlo simulation.

If the optimal probability after solving (8) is high, then there is a high degree of assurance that future responses inside the specification region A will be achieved, despite the noise factors and other sources of variability. We would then have a robust solution to the RPD problem. The solution will also consider the uncertainty in the model parameters and measurement error, since it is based on the predictive density of the response. Model parameter uncertainty is linked to the specific type of experimental design used, which determines matrix X.

2.3. Bayesian predictive optimization based on other response models

The multivariate regression model presented earlier assumes all J responses follow the same exact model form, with q parameters each. This may constitute a limitation in practical applications. Therefore, this model and its use in process optimization have been extended in Ref. [50] to the case where each response can adopt a different linear model, the so-called "SUR" (seemingly unrelated regressions) case, which can easily be dealt with via Gibbs sampling [45].

The optimization depends on the models used. There are situations where more than one model fits the data reasonably well, but their corresponding optima lie in different regions of the controllable factor space. Rajagopal and del Castillo [54] expand the bayesian optimization problem to the case where a set of models {M_i} fit the response adequately, solving instead the problem:

max Σ_i [ ∫_A p(Ỹ | M_i, data, x) dỸ ] p(M_i | data)

where p(M_i | data) is the posterior probability of each model i. The model-averaged solution found is then robust with respect to variation in the true model describing the response. This methodology was extended by Ng [41] to the multiple response case, allowing a user to define alternative models for each response.

The bayesian models presented above assume normally distributed data. To robustify this assumption, Rajagopal et al. [55] considered instead the noise in the model as originating from a Student t distribution. The models above also assume a "steady state" input-output behavior of the process. It is possible to extend the bayesian optimization ideas to the dynamic case where some of the variables are lagged. Vanli and del Castillo [71] consider bayesian RPD optimization of a process based on a linear regression model in which the noise factors randomly vary according to a time series model, and frequent reoptimization is necessary to adapt to the varying noise factor state.

2.4. Bayesian optimization of response profiles

In cases where the performance of a process is given by a curve or "profile", as opposed to a set of scalar responses, a different model is necessary, and here we review a useful hierarchical mixed effects model for profile responses originally presented in Ref. [15]. Let us assume the response Y(s) can be observed at several fixed values of an auxiliary variable s, which we will refer to as the "locations" s1, s2, …, sJ (note that here J denotes the number of points per profile). For each experimental run i (i = 1, …, N), where controllable and noise factors x_i = (x_c, x_n)_i have been tried in a designed experiment, a complete profile or function is observed, consisting of J points along the curve Y(s). Thus, rather than observing a continuous curve Y_i(s | x_i), we observe the response at discrete locations s_j and model it accordingly:

Y_ij = g(x_i, s_j) + ε_i(s_j),  i = 1, …, N,  j = 1, …, J,    (9)

where g is some function to be specified/estimated and ε_i(s_j) is a random error, which can depend on the location s_j.

The multivariate regression approach could be used again, treating the response values at different locations s_j as different responses. If the number of experiments is large, such that condition (2) holds, this would be feasible. However, while this approach provides predictions at the predefined location values s_j, it is not possible to interpolate with this model, since the locations s_j are not used explicitly in the model. That is, it is not possible to make predictions at profile locations other than those observed during the experiment.

When the number of points per profile J is large (and hence J ≥ N − q + 1), the previous approach cannot be applied. An alternative then is to use informative priors and optimize the posterior predictive density, but very informative priors are difficult to justify in general. A more general approach is needed. Del Castillo et al. [15] propose to model a profile response using a hierarchical Bayes approach.

Rather than trying to define a model for g as a function of all x_i and s, a hierarchical model breaks the functional dependency in two stages. In a first stage, the curves or profiles are modeled as a regression function of the locations s:

Y_ij = g(s_j)'θ(x_i) + ε_ij.

In a second stage, the parameters θ = {θ^(k)} are then modeled as a

Footnote 2: A computational consideration for very large p is that the dense p × p matrix Σ needs to be inverted to find ẑ, which is an O(p³) operation.
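Returning to the optimization problems (6)-(8), the Monte Carlo integrations they require can be sketched compactly. The following is a schematic Python re-implementation, not the paper's MATLAB code; the `sample_pred` and `noise_sampler` interfaces are hypothetical names of ours:

```python
import numpy as np

def conformance_prob(sample_pred, x_c, A_low, A_upp,
                     noise_sampler=None, n_mc=20_000, rng=None):
    """Monte Carlo estimate of the conformance probability p(x_c) of eq. (7).
    `sample_pred(x, n, rng)` must return n draws from the posterior predictive
    density at factor setting x (e.g. the multivariate t of eq. (5)).
    If `noise_sampler(rng)` is supplied (drawing x_n ~ f(x_n)), the outer
    integral of the RPD problem (8) is estimated by averaging over noise
    draws."""
    if rng is None:
        rng = np.random.default_rng()
    if noise_sampler is None:
        y = sample_pred(x_c, n_mc, rng)
        return float(np.mean(np.all((y >= A_low) & (y <= A_upp), axis=1)))
    # RPD case: outer Monte Carlo average over the noise factor distribution
    n_outer = 50
    probs = []
    for _ in range(n_outer):
        x = np.concatenate([x_c, noise_sampler(rng)])
        y = sample_pred(x, n_mc // n_outer, rng)
        probs.append(np.mean(np.all((y >= A_low) & (y <= A_upp), axis=1)))
    return float(np.mean(probs))
```

Since p(x_c)_RPD is not concave in general, an estimator like this would be called from multiple starting points of a nonlinear optimizer over R_c, exactly as described in the text.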
Fig. 3. Other alternative model structures with latent variables. a) Principal components regression; b) Partial least squares; c) Canonical correlation analysis (adapted from a figure by Ref. [39]). In situations where the input data (x ∈ X) come from a designed experiment, these structures are not appropriate, as there is no latent space on the X space.
Table 1
Estimated predictive posterior probabilities of satisfying the joint bounds A and each interval A_i = (L_i, U_i). Here, p_i = p(L_i ≤ Y_i ≤ U_i | data), using a multivariate regression analysis.

p(Y(x_c) ∈ A | data)   p1      p2      p3      p4      p5      p6      p7      p8      p9
0.110                  0.279   0.557   0.496   0.537   0.524   0.663   0.678   0.936   1.000
[Fig. 5 image: response value (standardized) vs. Y (response number), responses 1–9]
Fig. 5. Predicted responses (blue stars) at the optimal settings with one std. deviation error bars, using the original experimental design, wine aroma experiment, N = 18 and ν = 1 degree of freedom. Optimization using the multivariate regression model. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)
[Fig. 6 image: predicted response values vs. Y (response number), responses 1–9]
Fig. 6. Preposterior analysis, predicted responses (blue stars) at the optimal settings in the wine aroma experiment, with one std. deviation error bars assuming the original design was duplicated, N = 36 and ν = 19 degrees of freedom (note different y-scale compared to Fig. 5). Optimization based on multivariate regression model. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)
Table 3
Estimated predictive posterior probabilities of satisfying the joint bounds A and each interval A_i = (L_i, U_i). Here, p_i = p(L_i ≤ Y_i ≤ U_i | data), using the mixed effects model based on a preliminary PCA of the response data.

p(Y(x_c) ∈ A | data)   p1      p2      p3      p4      p5      p6      p7      p8      p9
0.857                  0.998   0.991   0.997   0.992   0.970   0.935   0.935   0.951   0.986
[Fig. 7 image: predicted responses and intervals vs. Y (response number), responses 1–9]
Fig. 7. Optimal responses (stars) and 95% prediction intervals for the responses in the wine aroma experiment under the solution obtained using the mixed effects model based on the first 2 principal components of the 9 chromatographic responses. Dark lines at 0.5 and 3.5 are the specifications. The nine predicted responses, including their predicted variability, are inside the desired specifications.
responses highly correlated, effectively diffusing the probability over a space that was too high dimensional with respect to the subspace where the data really lay. Using a smaller number of PCs (2) better captures the underlying probability distribution. Note that the optimal solution x*_c did not change much from model to model. What was gained was a higher degree of assurance that the optimal process will meet its specifications.

4.2. Example 2: optimization of the tensile stiffness orientation profile in a paper machine

Reis and Saraiva [60] considered the prediction of a profile response in a paper manufacturing facility. The response of interest is the tensile stiffness orientation (TSO) angle profile across a paper sheet produced in a paper machine. The TSO orientation is closely related to the fiber orientation angle, which has a major effect on paper mechanical and dimensional properties. For instance, higher angles tend to favor diagonal curl modes, which can be very detrimental to printing processes, causing frequent jams, lost production and reduced operational efficiency. The TSO profile across the paper machine was obtained at 10 positions over the machine manifold and can be controlled by manipulating several process variables (experimental factors), such as x1 = VJ (jet velocity, ranging from 800 to 850 m/min), x2 = A (slice opening, ranging from 40 to 80%), and x3 = VR (manifold recirculation, ranging from 40 to 80%). The wire velocity (VW) of the machine was fixed at 750 m/min (this sets the production pace, which was assumed to be kept fixed). For operational reasons, |VW − VJ| > 50 is a condition that was always maintained during experimentation, as this is a machine requirement to balance paper properties (surface quality, dimensional …
[Figure image: TSO profile vs. position across paper sheet (1–10)]
Fig. 9. Dark lines: observed TSO profiles in each of the N = 20 experimental runs. Lighter, dashed lines: predicted TSO profiles (mean of the posterior distribution). The predictions are so close to the actual profiles that they are hard to distinguish.
Table 6
Estimated predictive posterior probabilities of a TSO profile satisfying all bounds A_i = (L_i, U_i) simultaneously and at each point s. Here, p_i = p(L_i ≤ Y(s_i) ≤ U_i | data).

p(Y(x_c) ∈ A | data)   p1      p2      p3      p4      p5      p6      p7      p8      p9
0.898                  0.983   0.997   0.939   1.000   0.977   0.994   0.939   0.935   0.962
[Fig. 10 image: four panels, TSO angle (degrees) vs. position across paper sheet (0–10)]
Fig. 10. Effect of each of the 3 controllable variables on the TSO response across the paper sheet. The difference between the two lines shows the effect on the curve or profile response that the factor has, as it changes from a low to a high setting. Lines with '+' correspond to the average response when the factor was equal to its highest experimental setting; dashed lines are the average TSO response when the factor was equal to its lowest experimental setting.
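The main-effect curves plotted in Fig. 10 are simply differences of profiles averaged at the high and low settings of each factor. A minimal sketch of that computation on synthetic data (the design, dimensions, and effect size are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the experiment: N runs, one coded factor in {-1, +1},
# a profile measured at 10 positions; the factor shifts the whole curve.
N, npos = 20, 10
levels = np.array([-1.0, 1.0] * (N // 2))
profiles = rng.normal(size=(N, npos)) + 2.0 * levels[:, None]

# Main-effect curve: average profile at the high setting minus the
# average profile at the low setting, position by position.
high_avg = profiles[levels == 1.0].mean(axis=0)
low_avg = profiles[levels == -1.0].mean(axis=0)
effect = high_avg - low_avg
```

The same per-position averaging, applied factor by factor, produces the pairs of curves shown in the figure.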
[Fig. 12 image: TSO angle (degrees) vs. position across paper sheet (1–10)]

Fig. 11. Contour plot of the predictive probability p(Y ∈ A | data, x) for x1 = 847.4 while varying x2 and x3 within their feasible region (40%–80%). A given contour can be used as the boundary defining the design space for this particular process.

Fig. 12. Blue line with dots: TSO profile from confirmation run at VJ = 847.4, A = 79.9 and VR = 45.35. Crossed dark lines: upper and lower specifications for the TSO profile. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)
is that if the latest profile (at stability time = 6 months), corresponding to the oldest stability time of the drug, satisfies the product specifications, then it can be assured that the rest of the profiles, at earlier stability times, will also satisfy them.

The response Y is therefore the drug release percentage curve at the last observed stability time (6 months), observed over 10 release time points s. Since the drug release vs. release time profiles are percentages, they do not follow a normal distribution, which is necessary in our approach. We therefore first normalize the data by applying the probit transformation:

Y' = Φ⁻¹(Y)

where Φ(·) is the cumulative distribution function of a standard normal random variable. The observed curves at each of the 25 experimental runs are shown in Fig. 13, together with the predicted profiles. From observation of these profiles, a cubic polynomial is assumed in stage 1 of the mixed effects model for Y'(s), requiring four parameters θ. In a second stage, quadratic polynomial models were fit with pure quadratic terms (x_l²) only in the numeric factors x1 to x3 (but not in the categorical factors, for which they would not make sense).

The experiment was an N = 25 D-optimal design which allows us to estimate all main effects and 2-factor interactions plus quadratic terms in the non-categorical factors (18 parameters). The optimal settings found (assuming both noise factors are randomly varying as Bernoulli(0.50) random variables) are shown in Table 7.

The optimal profile at the x*_c settings is shown in Fig. 14 together with 95% prediction intervals. The estimated maximum joint probability of having a future curve completely inside the bounds when running the process at settings x*_c is 0.4060. The joint probability is low because of the difficulty of maintaining the response within bounds at the highest release times, when it is more variable. Finally, Fig. 15 shows the main effect plots of factors RT, RH, DT, ME, and RS. The latter two are categorical factors. From the plots, to keep the release curve as high as possible, especially with a high increase initially, all 3 controllable
Fig. 13. Dark lines: observed 6-month drug release profiles in each of the N = 25 experimental runs (probit transformation applied). Lighter, dashed lines: predicted profiles (mean of the posterior distribution). The predictions are so close to the actual profiles that they are hard to distinguish.
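The probit normalization applied to the release percentages, and its inverse used to read results back on the % scale, can be sketched with the Python standard library; the release values below are hypothetical fractions in (0, 1):

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal

def probit(y_fraction):
    """Probit transform Y' = Phi^-1(Y) of a release value given as a fraction."""
    return nd.inv_cdf(y_fraction)

def inv_probit(yp):
    """Back-transform Phi(Y') to the original fraction scale."""
    return nd.cdf(yp)

released = [0.31, 0.62, 0.88]              # hypothetical release fractions
transformed = [probit(v) for v in released]
recovered = [inv_probit(v) for v in transformed]
```

The transform maps the bounded (0, 1) response onto the whole real line, where the normal-theory mixed model is applied; predictions are mapped back with the inverse transform.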
factors, RT, RH, and DT should be run at their high settings. This agrees with the optimal solution, obtained via numerical optimization, …

[Fig. 14 image: profiles vs. release time (minutes, 0–30); left axis on the probit scale, right axis drug release (%)]

Fig. 14. Light lines: observed drug release profiles after a 6 month stability period. A probit (Y' = Φ⁻¹(Y)) transformation was applied to the drug release percentage response before analysis. Actual drug release % can be read from the right hand axis. Dark line: optimal predicted drug release profile at 6 months; red lines: the 5 and 95% predicted percentiles of the optimal drug release posterior distribution; crossed dark lines: upper and lower specifications for the 6-month drug release profile. The predicted optimal profile response including its 95% variability is inside the desired specifications and only slightly exceeds them for large release times. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)

We have provided a review and some extensions of bayesian predictive optimization methods, with application to the process industries. The methods are based on predictors and response data from designed experiments, where the response is in the form of either a vector of correlated responses, possibly of high dimensionality, or a profile. Bayesian predictive methods are becoming increasingly relevant in practice not only for process optimization but also in QbD applications where it is of interest to find a design space where the future performance of the process is guaranteed with certain probability (as in pharmaceutical applications, see Refs. [47,49,66]). The bayesian predictive approach to process optimization provides a probability (or reliability measure) of satisfying the process goals at the optimal settings x*_c. Such probability statements are not possible to obtain with classical statistics. A frequentist "design space" can be obtained classically by bootstrapping, a possibility not reviewed here. The bootstrapping approach, however, can only provide confidence regions on the optimal operating conditions x*_c, and it is not possible to provide probability statements about whether the process will fall or not in the given region if in the future it is run at operating conditions x*_c, something easily done under the bayesian framework.

Extensions to the original bayesian process optimization methodology, based on regression models in Ref. [46], were presented. These included the use of a hierarchical mixed effects model that can be used either when the number of responses is too large for a noninformative …
[Fig. 15 image: three panels — Factor 1: RT; Factor 2: RH; Factor 3: DT — response vs. release time (0–30)]

Fig. 15. Effect of each of the 3 controllable variables on the drug release profiles at 6 month stability time. The difference between the two lines shows the effect on the curve or profile response that the factor has, as it changes from a low to a high setting. Lines with '+' correspond to the average response when the factor was equal to its highest experimental setting; dashed lines are the average response when the factor was equal to its lowest experimental setting. Response (% drug release) shown after probit transformation.
Table 8
Optimal settings for the drug release experiment when noise factors x4 = ME and x5 = RS are assumed categorical with Bernoulli distributions with probabilities p1 and p2 respectively. The optimal solutions for controllable factors x1, x2, x3 remain largely the same.

p1     p2     p(Y(x_c) ∈ A | data)   x1* = RT   x2* = RH   x3* = DT   x4 = ME   x5 = RS
0.5    0.5    0.406                  0.855      0.848      0.841      —         —
1      1      0.416                  0.814      0.998      0.999      1         1
1      0      0.496                  1.000      0.754      0.917      1         1
0      1      0.505                  0.889      0.830      0.999      1         1
0      0      0.311                  0.792      0.811      0.910      1         1
The following Monte Carlo procedure can be used to compute p(x_c)_RPD, the quantity maximized in problem (8) for the single stage, multiple response approach (see [38] for more details).

1. Set c = 0.
2. Simulate a value of the noise factors, x_n, from their (assumed known) distribution. The vector x_c is given and contains the values of the controllable factors. Make x = (x_c, x_n).
3. Simulate y | x, data ∼ T_ν^J( B′x, ((ν − 2)/ν) H⁻¹ ).
Step 3 can be performed efficiently in MATLAB using function mvtrnd in the statistics toolbox and in R using function rmvt in package mvtnorm.
For the hierarchical mixed model, step 3 is substituted by sampling from the simulated Markov Chain obtained via Gibbs sampling (after the burn-in
period) and substituting these simulated parameters in the likelihood, i.e., values of y are obtained from the predictive density by composition (see
Appendix B). As reported by Miro et al. [38], it is more effective to generate the required m noise factors once and use a common random numbers
strategy in the optimization whenever pðxc ÞRPD needs to be evaluated within the optimization process. Optimizing this function is a hard nonlinear
problem, and Miro et al. report how it is useful to start the optimizer from a point likely to contain enough “area” of the predictive density.
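Under the stated steps, the estimator can be sketched as follows; B_hat_x, H, the specification limits, and the Bernoulli noise factor below are illustrative stand-ins, not the paper's posterior quantities, and the multivariate t draw uses the standard normal/chi-square construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: J responses, posterior t with nu degrees of freedom,
# specification limits L, U (all invented for this sketch).
J, nu = 3, 10
L, U = np.full(J, -1.5), np.full(J, 1.5)
H = 0.2 * np.eye(J)
C = np.linalg.cholesky(H)

def B_hat_x(x):
    # stand-in for the posterior predictive mean at x = (xc, xn)
    return 0.2 * np.sum(x) * np.ones(J)

def p_rpd(xc, m=5000):
    """Monte Carlo estimate of p(xc)_RPD following steps 1-3."""
    # common random numbers: generate all noise-factor draws once
    xn_draws = rng.binomial(1, 0.5, size=(m, 1)).astype(float)
    count = 0
    for xn in xn_draws:
        x = np.concatenate([xc, xn])            # step 2
        g = rng.chisquare(nu) / nu              # step 3: multivariate t draw
        y = B_hat_x(x) + (C @ rng.standard_normal(J)) / np.sqrt(g)
        count += np.all((L <= y) & (y <= U))    # inside all specification limits?
    return count / m

prob = p_rpd(np.array([0.1, -0.2]))
```

Reusing the same `xn_draws` across calls to `p_rpd` during optimization implements the common-random-numbers strategy reported by Miro et al. [38], which makes the objective a deterministic function of x_c.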
Lange et al. [35] give the full conditionals for the parameters (β, {w_i}, {σ_i²}, Σ_w) in the linear mixed model (13), where the σ_i²'s allow different variances among the observed profiles (this is sometimes useful in longitudinal analysis models when it is not desired to make inferences on profiles other than the N observed). Chib and Carlin [11] considered the case we consider, where Σ = σ²I_J, and show how the Lange et al. procedure suffers
from slow convergence, and proposed two alternative algorithms, one (their algorithm 2) which is a pure Gibbs sampling approach, and another one
(their algorithm 3) which has a Metropolis step (a rejection sampling step). Since the convergence properties of these two algorithms appear to be about
the same, in particular for the β parameters, we choose Chib and Carlin's algorithm 2 (after correcting some errors in the full conditionals in their paper)
since it does not require a Metropolis step (i.e., all full conditionals are known distributions). We also provide the full conditionals of the other pa-
rameters since they were not explicitly given by these authors.
The priors we use were:

β ∼ N_pq(β₀, B₀), with β₀ = 0 and B₀ = 1000·I (noninformative for β);

σ² ∼ IG(λ₁, λ₂) (inverse-gamma distribution), with λ₁ = 0.001 and λ₂ = 5 (this gives E(σ²) = 2 and Var(σ²) = 63, relatively non-informative);

Σ_w⁻¹ ∼ Wishart((ν₀R₀)⁻¹, ν₀), with ν₀ = p (most non-informative choice) and R₀ = I_p (gives E(Σ_w) = I_p with as large variance as possible).

The Gibbs sampling scheme is:

1. Sample β from β | Y, σ², Σ_w:

N( (Σ_{i=1}^N X_i′V⁻¹X_i + B₀⁻¹)⁻¹ (Σ_{i=1}^N X_i′V⁻¹y_i + B₀⁻¹β₀), (Σ_{i=1}^N X_i′V⁻¹X_i + B₀⁻¹)⁻¹ )

where V = S*Σ_w S*′ + σ²I and X_i = x_i^(m)′ ⊗ S*.

2. Sample the random effects w_i, i = 1, …, N, from {w_i} | Y, β, σ², Σ_w:

N( (S*′S*/σ² + Σ_w⁻¹)⁻¹ S*′R_i^(w)/σ², (S*′S*/σ² + Σ_w⁻¹)⁻¹ )

where R_i^(w) = y_i − X_iβ.

3. Sample Σ_w⁻¹ from Σ_w⁻¹ | {w_i}:

Wishart( (Σ_{i=1}^N w_i w_i′ + ν₀R₀)⁻¹, N + ν₀ )

4. Sample σ² from σ² | Y, β, {w_i}, drawing 1/σ² from

Gamma( λ₁ + JN/2, (1/λ₂ + R^(σ²)′R^(σ²)/2)⁻¹ )

where R^(σ²) is the stacked vector of residuals y_i − X_iβ − S*w_i.
Gibbs sampling for mixed effect models can sometimes have very high MCMC autocorrelations, and the use of Hamiltonian MCMC (e.g. via Stan, [7])
can help as an alternative algorithm.
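A minimal sketch of a Gibbs scheme of this form on synthetic data follows; the dimensions, basis matrix S*, priors, and data-generating values are illustrative assumptions, and the Wishart draw uses the sum-of-outer-products construction valid for integer degrees of freedom:

```python
import numpy as np

rng = np.random.default_rng(0)

# ---- synthetic data from the mixed model y_i = X_i b + S* w_i + e_i
N, J, k, q = 15, 8, 2, 3
Sstar = rng.normal(size=(J, k))                        # assumed known basis
xs = rng.normal(size=(N, q))
X = [np.kron(x.reshape(1, -1), Sstar) for x in xs]     # X_i = x_i' (x) S*
beta_true = rng.normal(size=k * q)
Y = [Xi @ beta_true + Sstar @ rng.multivariate_normal(np.zeros(k), 0.3 * np.eye(k))
     + rng.normal(scale=np.sqrt(0.1), size=J) for Xi in X]

# ---- priors (values as illustrated in the appendix)
B0inv = np.eye(k * q) / 1000.0
beta0 = np.zeros(k * q)
lam1, lam2 = 0.001, 5.0
nu0, R0 = k, np.eye(k)

beta, sig2, Sw = np.zeros(k * q), 1.0, np.eye(k)
keep = []
for it in range(1500):
    # 1. beta | Y, sig2, Sw
    Vinv = np.linalg.inv(Sstar @ Sw @ Sstar.T + sig2 * np.eye(J))
    A = sum(Xi.T @ Vinv @ Xi for Xi in X) + B0inv
    b = sum(Xi.T @ Vinv @ yi for Xi, yi in zip(X, Y)) + B0inv @ beta0
    Acov = np.linalg.inv(A)
    beta = rng.multivariate_normal(Acov @ b, Acov)
    # 2. w_i | rest
    Wcov = np.linalg.inv(Sstar.T @ Sstar / sig2 + np.linalg.inv(Sw))
    w = [rng.multivariate_normal(Wcov @ Sstar.T @ (yi - Xi @ beta) / sig2, Wcov)
         for Xi, yi in zip(X, Y)]
    # 3. Sw^{-1} | {w_i}: Wishart drawn as a sum of df outer products
    scale = np.linalg.inv(sum(np.outer(wi, wi) for wi in w) + nu0 * R0)
    Z = rng.multivariate_normal(np.zeros(k), scale, size=N + nu0)
    Sw = np.linalg.inv(Z.T @ Z)
    # 4. sig2 | rest: inverse gamma via a Gamma draw for 1/sig2
    r = np.concatenate([yi - Xi @ beta - Sstar @ wi
                        for Xi, yi, wi in zip(X, Y, w)])
    sig2 = 1.0 / rng.gamma(lam1 + J * N / 2.0, 1.0 / (1.0 / lam2 + r @ r / 2.0))
    if it >= 500:                                      # discard burn-in
        keep.append(beta)

post_mean_beta = np.mean(keep, axis=0)
```

The post-burn-in draws stored in `keep` are exactly what the predictive-density composition step described next consumes.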
The MCMC sampling of the mixed effects model parameters need not be conducted within the optimization routine necessary to solve (8); otherwise this would imply a tremendous computational burden. The reason for this is given by the model written as in (12). Given realizations of the posterior of Θ = (β, {w_i}, Σ_w, σ²), p(Θ|data), we can simulate draws of the posterior predictive density by composition (Gelman et al. [25]):
p(y | data, x) = ∫ p(y, Θ | data, x) dΘ = ∫ p(y | data, Θ, x) p(Θ | data) dΘ
Thus, we conduct a simulation of the Gibbs sampling chain until convergence, approximating in this way pðΘjdataÞ and simply sample from it. We then
substitute the sampled values into the density pðyjdata; Θ; xÞ whenever a yjdata; x vector is needed in the optimization routine (this replaces step 3 in
Appendix A). Hence, the MCMC computations are only run once, before performing the optimization. For more information on MCMC methods, see
Gelman et al. [25] and for a more concise introduction, see Colosimo and Del Castillo [12].
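Sampling by composition can be sketched with a toy model standing in for p(Θ|data); the stored draws below are synthetic stand-ins for a converged MCMC chain:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are post-burn-in Gibbs draws of Theta = (mu, sigma)
# for a toy model y | Theta ~ N(mu, sigma^2).
theta_draws = np.column_stack([rng.normal(5.0, 0.1, 2000),
                               np.abs(rng.normal(1.0, 0.05, 2000))])

def draw_y():
    # composition: pick a stored posterior draw of Theta, then sample y | Theta
    mu, sigma = theta_draws[rng.integers(len(theta_draws))]
    return rng.normal(mu, sigma)

ys = np.array([draw_y() for _ in range(5000)])
```

Because the chain is simulated once, up front, each predictive draw inside the optimizer costs only an index lookup and one likelihood draw.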
In order to present the probabilistic version of PCA of Tipping and Bishop [70], we first contrast the closely related Maximum Likelihood PCA (MLPCA) models in the Chemometrics literature (e.g., as in [73]) and the Factor Analysis (FA) model in the Machine Learning and Statistics literature (e.g., as in [5,26]). We then discuss probabilistic PCA, as it was used in the estimation of the matrix S* in the linear mixed effects model (13).
Both MLPCA and FA models are based on the description:

Y = Wz + μ + ε

where Y is a p-dimensional observed vector and z is an unobserved (and therefore, hypothesized) k-dimensional vector, with k typically much smaller than p. W is called the "loadings" matrix and the entries in the z vector are called the latent or "factor" variables. Both MLPCA and Factor Analysis assume:

ε ∼ N(0, Σ).

If we redefine Y − μ to be the observed data (i.e., if we center the data with μ̂ = Ȳ), the parameter μ can be neglected.
MLPCA model characteristics.- The goal of MLPCA, according to [73], is to best estimate z given the observed random vectors Y, an estimate of W and a known covariance matrix Σ. In MLPCA it is assumed z is an error-free constant, so the model is a linear regression model, and the optimal solution is given by the generalized least squares estimator:

ẑ = (W′Σ⁻¹W)⁻¹ W′Σ⁻¹ Y

which yields predictions of the responses Ŷ = Wẑ + μ̂. This prediction equation is used to minimize the "reconstruction error"

S_Σ = Σ_{i=1}^N (Ŷ_i − Y_i)′ Σ⁻¹ (Ŷ_i − Y_i)
with respect to W. The MLPCA literature also discusses the case where Σ varies with each observation, so the covariance between the same elements of different Y_i's can differ with the observation number i. Covariances between elements in each Y_i are modeled via the entries of Σ_i, which is a dense (not diagonal) matrix. Note that in processes where a very large number of variables p are measured, the number of elements in Σ_i is high (p(p + 1)/2 distinct entries). These covariances are assumed known via information available for the measurement noise, although [73] mention that in practice they need to be estimated². The objective of MLPCA in Chemometrics then is to find relationships between the measured variables.
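The GLS estimator above can be sketched in a few lines; W, Σ, and the dimensions are invented for illustration, with W treated as known as in the MLPCA setting:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions; W and Sigma are treated as known
p, k = 6, 2
W = rng.normal(size=(p, k))
Sigma = np.diag(rng.uniform(0.05, 0.2, size=p))   # known measurement-error covariance
z_true = rng.normal(size=k)
Y = W @ z_true + rng.multivariate_normal(np.zeros(p), Sigma)

# generalized least squares:  z_hat = (W' Sigma^-1 W)^-1 W' Sigma^-1 Y
Si = np.linalg.inv(Sigma)
z_hat = np.linalg.solve(W.T @ Si @ W, W.T @ Si @ Y)
Y_hat = W @ z_hat          # reconstruction entering the S_Sigma criterion
```

Weighting by Σ⁻¹ downweights the noisier coordinates of Y, which is the point of the "maximum likelihood" qualifier relative to plain least-squares projection.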
FA model characteristics.- The FA model in the Machine Learning/Statistics literature, rather than considering the latent variables fixed constants, assumes instead they are random:

z ∼ N(0, Σ_z).

This is a "random effects" model with two sources of variability affecting the observations. The term containing the latent variables z models correlations between the elements of Y, while the error variables ε account for measurement error in each of these elements. In the Chemometrics field, Reis and Saraiva [59] use random latent variables (in contrast to Wentzell [73]), calling the model a heteroscedastic latent model, and use it for Statistical Process Control.

The goal when using the FA model is to estimate W and Σ under the additional assumption that Σ_z = I_k (the k-dimensional identity matrix) to best model the correlation in the entries of the observed Y vectors. This additional assumption does not lose generality because any correlation between the elements of Y can still be modeled given that

Cov(Y) = WW′ + Σ.   (21)
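A quick simulation can illustrate identity (21); W, Σ, and the dimensions are again invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the FA model Y = W z + eps, z ~ N(0, I_k), eps ~ N(0, Sigma diagonal)
p, k, n = 5, 2, 100000
W = rng.normal(size=(p, k))
sig_diag = rng.uniform(0.1, 0.5, size=p)
Z = rng.normal(size=(n, k))
E = rng.normal(size=(n, p)) * np.sqrt(sig_diag)
Y = Z @ W.T + E

# the sample covariance should approach W W' + Sigma as n grows
C_emp = np.cov(Y, rowvar=False)
C_model = W @ W.T + np.diag(sig_diag)
max_gap = np.abs(C_emp - C_model).max()
```

The off-diagonal structure of C_emp comes entirely from the WW′ term, since the error covariance here is diagonal.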
A particular case: Probabilistic PCA.- If Σ = σ²I (same measurement error variance for all elements in Y) the FA model is called a Probabilistic PCA (PPCA) model. The maximum likelihood estimates of W and σ² were shown by [70] to be:

Ŵ = V(L − σ²I)^{1/2} R

where V is a p × k matrix with columns equal to the k eigenvectors associated with the top k eigenvalues of
S = Σ_{i=1}^N (Y_i − Ȳ)(Y_i − Ȳ)′
L is a diagonal matrix with the k largest eigenvalues, and R is an arbitrary rotation matrix, which can be set equal to I. The MLE of σ² is

σ̂² = (1/(p − k)) Σ_{i=k+1}^p λ_i   (22)
(the variance associated with the discarded dimensions). The inverse mapping in PPCA, giving the latent vector associated with a given observation Y, is
the posterior distribution of z, which is
z | Y ∼ N( (W′W + σ²I)⁻¹ W′Y, σ²(W′W + σ²I)⁻¹ )
As can be seen, the posterior mean is a linear projection that can be interpreted as "ridge" regression [27]. Probabilistic PCA provides a "generative" model for PCA, but it is not a fully bayesian PCA approach, as the subspace dimension k is considered fixed and known and W and σ² are considered unknown constants. Minka [37] has provided a bayesian model selection version of Probabilistic PCA that also finds k and compares it with other methods for selecting the best subspace dimension, providing some empirical evidence that his method is more accurate.
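A sketch of the PPCA ML solution and the inverse mapping on synthetic data with assumed dimensions follows (note that np.cov normalizes by N − 1, whereas the S above is an unnormalized sum; the eigenvectors are unaffected by the scaling):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data from a PPCA model (illustrative dimensions and values)
p, k, n = 8, 3, 5000
W_true = rng.normal(size=(p, k))
sig2_true = 0.1
Y = (rng.normal(size=(n, k)) @ W_true.T
     + rng.normal(scale=np.sqrt(sig2_true), size=(n, p)))

# ML solution from the eigendecomposition of the sample covariance
evals, evecs = np.linalg.eigh(np.cov(Y, rowvar=False))
order = np.argsort(evals)[::-1]                 # eigh returns ascending order
evals, evecs = evals[order], evecs[:, order]
sig2_hat = evals[k:].mean()                     # Eq. (22): avg discarded eigenvalue
W_hat = evecs[:, :k] @ np.diag(np.sqrt(evals[:k] - sig2_hat))  # taking R = I

# posterior (inverse) mapping z | Y for the first observation
M = W_hat.T @ W_hat + sig2_hat * np.eye(k)
z_mean = np.linalg.solve(M, W_hat.T @ Y[0])
```

The ridge-like shrinkage is visible in M: the σ²I term pulls the latent coordinates toward zero relative to the plain least-squares projection.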
Supplementary materials
All datafiles and MATLAB code that implements the methods discussed here, including MCMC estimation and numerical optimization are provided.
Running the script scriptExamples.m reproduces all examples in the paper by calling the appropriate functions. In addition, function Find_PCA_Dim.m is
provided, which implements the exact ML solution for Probabilistic PCA. It also finds the best latent space dimension using a bayesian algorithm
developed by Minka [37]. For readers who prefer R or Python, the examples in the paper, in particular the latent variable models, can be replicated
using Bayesian computing languages such as JAGS [51] and Stan [7].
References

[1] ASI, Robust Designs Using Taguchi Methods, American Supplier Institute (ASI) Press, Livonia, MI, 1998.
[2] Bhavik R. Bakshi, Mohamed N. Nounou, Prem K. Goel, Xiaotong Shen, Multiscale bayesian rectification of data from linear steady-state and dynamic systems without accurate models, Ind. Eng. Chem. Res. 40 (1) (2001) 261–274.
[3] Yaakov Bar-Shalom, Leon Campo, The effect of the common process noise on the two-sensor fused-track covariance, IEEE Trans. Aero. Electron. Syst. 22 (6) (November 1986) 803–805.
[4] Dean Billheimer, Predictive inference and scientific reproducibility, Am. Statistician 73 (sup1) (2019) 291–295.
[5] Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer Science+Business Media, 2006.
[6] A.J. Burnham, John F. MacGregor, R. Viveros, Latent variable multivariate regression modeling, Chemometr. Intell. Lab. Syst. 48 (2) (1999) 167–180.
[7] Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, Allen Riddell, Stan: a probabilistic programming language, J. Stat. Software 76 (1) (2017).
[8] Hongshu Chen, Bhavik R. Bakshi, Prem K. Goel, Bayesian latent variable regression via Gibbs sampling: methodology and practical aspects, J. Chemometr. 21 (12) (2007) 578–591.
[9] Hongshu Chen, Bhavik R. Bakshi, Prem K. Goel, Toward bayesian chemometrics—a tutorial on some recent advances, Anal. Chim. Acta 602 (1) (2007) 1–16.
[10] Wen-shiang Chen, Bhavik R. Bakshi, Prem K. Goel, Sridhar Ungarala, Bayesian estimation of unconstrained nonlinear dynamic systems, IFAC Proceedings Volumes 37 (1) (2004) 263–268.
[11] S. Chib, B.P. Carlin, On MCMC sampling in hierarchical longitudinal models, Stat. Comput. 9 (1999) 17–26.
[12] B.M. Colosimo, E. Del Castillo, Modern numerical methods in bayesian computation, in: Bayesian Process Monitoring, Control and Optimization, CRC/Taylor and Francis, NY, 2007.
[13] E. Del Castillo, Process Optimization, a Statistical Approach, Springer, 2007.
[14] E. Del Castillo, B.M. Colosimo, An introduction to bayesian inference in process monitoring, control, and optimization, in: Bayesian Process Monitoring, Control and Optimization, CRC/Taylor and Francis, NY, 2007.
[15] Enrique Del Castillo, Bianca M. Colosimo, Hussam Alshraideh, Bayesian modeling and optimization of functional responses affected by noise factors, J. Qual. Technol. 44 (2) (2012) 117–135.
[16] George Derringer, Ronald Suich, Simultaneous optimization of several response variables, J. Qual. Technol. 12 (4) (1980) 214–219.
[17] William R. Dillon, Matthew Goldstein, Multivariate Analysis - Methods and Applications, Wiley, 1984.
[18] Alireza Fatehi, Biao Huang, Kalman filtering approach to multi-rate information fusion in the presence of irregular sampling rate and variable measurement delay, J. Process Contr. 53 (May 2017) 15–25.
[19] FDA, Guidance Document: Q11 Development and Manufacture of Drug Substances, 2012.
[20] G.M. Fitzmaurice, N.M. Laird, J.H. Ware, Applied Longitudinal Analysis, John Wiley & Sons, Hoboken, NJ, 2004.
[21] Python Software Foundation, Python Language Reference, 2020, version 3.8.3.
[22] J.B. Gao, C.J. Harris, Some remarks on Kalman filters for the multisensor fusion, Inf. Fusion 3 (3) (September 2002) 191–201.
[23] Zhiqiang Ge, Process data analytics via probabilistic latent variable models: a tutorial review, I&EC Research 57 (2018) 12646–12661.
[24] Zhiqiang Ge, Muguang Zhang, Zhihuan Song, Nonlinear process monitoring based on linear subspace and bayesian inference, J. Process Contr. 20 (5) (2010) 676–688.
[25] Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, Donald B. Rubin, Bayesian Data Analysis, Chapman and Hall/CRC, 2013.
[26] Zoubin Ghahramani, Geoffrey E. Hinton, et al., The EM algorithm for mixtures of factor analyzers, Technical Report CRG-TR-96-1, University of Toronto, 1996.
[27] Trevor Hastie, Robert Tibshirani, Martin Wainwright, Statistical Learning with Sparsity: the Lasso and Generalizations, Chapman and Hall/CRC, 2015.
[28] Agnar Höskuldsson, PLS regression methods, J. Chemometr. 2 (3) (1988) 211–228.
[29] Shuo-Huan Hsu, Stephen D. Stamatis, James M. Caruthers, W.N. Delgass, V. Venkatasubramanian, Gary E. Blau, M. Lasinski, S. Orcun, Bayesian framework for building kinetic models of catalytic systems, Ind. Eng. Chem. Res. 48 (10) (2009) 4768–4790.
[30] Ulf G. Indahl, Kristian H. Liland, Tormod Næs, Canonical partial least squares—a unified PLS approach to classification and regression problems, J. Chemometr. 23 (9) (2009) 495–504.
[31] Qingchao Jiang, Biao Huang, Xuefeng Yan, GMM and optimal principal components-based bayesian method for multimode fault diagnosis, Comput. Chem. Eng. 54 (2008) 338–349.
[32] R.A. Johnson, D.W. Wichern, Applied Multivariate Statistical Analysis, Prentice Hall, 2002.
[33] Hiromasa Kaneko, Kimito Funatsu, Adaptive soft sensor based on online support vector regression and bayesian ensemble learning for various states in chemical plants, Chemometr. Intell. Lab. Syst. 20 (2014) 57–66.
[34] N.M. Laird, J.H. Ware, Random effects models for longitudinal data, Biometrics 38 (4) (1982) 963–974.
[35] Nicholas Lange, Bradley P. Carlin, Alan E. Gelfand, Hierarchical Bayes models for the progression of HIV infection using longitudinal CD4 T-cell numbers, J. Am. Stat. Assoc. 87 (419) (1992) 615–626.
[36] MATLAB, The MathWorks, Inc., Natick, MA, 2020.
[37] Thomas P. Minka, Automatic choice of dimensionality for PCA, in: Advances in Neural Information Processing Systems, 2001, pp. 598–604.
[38] G. Miro, E. Del Castillo, J.J. Peterson, A bayesian approach for multiple response surface optimization in the presence of noise variables, J. Appl. Stat. 31 (3) (2004) 251–270.
[39] Kevin P. Murphy, Machine Learning: a Probabilistic Perspective, MIT Press, 2012.
[40] R.H. Myers, D.C. Montgomery, C. Anderson-Cook, Response Surface Methodology, 3rd ed., Wiley, NY, 2009.
[41] Szu Hui Ng, A bayesian model-averaging approach for multiple-response optimization, J. Qual. Technol. 42 (1) (2010) 52–68.
[42] Mohamed N. Nounou, Bhavik R. Bakshi, Prem K. Goel, Xiaotong Shen, Bayesian principal component analysis, J. Chemometr. 16 (9) (2002) 576–595.
[43] Mohamed N. Nounou, Bhavik R. Bakshi, Prem K. Goel, Xiaotong Shen, Process modeling by bayesian latent variable regression, AIChE J. 48 (8) (2002) 1775–1793.
[44] A.M. Overstall, D.C. Woods, K.J. Martin, Bayesian prediction for physical models with application to the optimization of the synthesis of pharmaceutical products using chemical kinetics, Comput. Stat. Data Anal. 132 (2018) 126–142.
[45] D.F. Percy, Prediction for seemingly unrelated regressions, J. Roy. Stat. Soc. B 54 (1992) 243–252.
[46] J.J. Peterson, A bayesian reliability approach to multiple response surface optimization, J. Qual. Technol. 36 (2) (2004) 139–153.
[47] J.J. Peterson, A bayesian approach to the ICH Q8 definition of design space, J. Biopharm. Stat. 18 (2008) 959–975.
[48] J.J. Peterson, E. Del Castillo, A posterior predictive approach to multiple response surface optimization, JSM, Salt Lake City, Utah, 2007.
[49] John J. Peterson, Kevin Lief, The ICH Q8 definition of design space: a comparison of the overlapping means and the bayesian predictive approaches, Stat. Biopharm. Res. 2 (2) (2010) 249–259.
[50] John J. Peterson, Guillermo Miro-Quesada, Enrique del Castillo, A bayesian reliability approach to multiple response optimization with seemingly unrelated regression models, Quality Technology & Quantitative Management 6 (4) (2009) 353–369.
[51] Martyn Plummer, JAGS: a program for analysis of bayesian graphical models using Gibbs sampling, in: Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003), March 20–22, 2003, Vienna, Austria.
[52] Alexey L. Pomerantsev, Successive bayesian estimation of reaction rate constants …
[57] … case study—solid-phase microextraction (SPME) applied to the quantification of analytes with impact on wine aroma, J. Chemometr. 33 (3) (2019), e3103.
[58] Marco S. Reis, Pedro M. Saraiva, Integration of data uncertainty in linear regression and process optimization, AIChE J. 51 (11) (2005) 3007–3019.
[59] Marco S. Reis, Pedro M. Saraiva, Heteroscedastic latent variable modelling with applications to multivariate statistical process control, Chemometr. Intell. Lab. Syst. 80 (1) (2006) 57–66.
[60] Marco S. Reis, Pedro M. Saraiva, Prediction of profiles in the process industries, Ind. Eng. Chem. Res. 51 (11) (2012) 4254–4266.
[61] Sajjad Safari, Faridoon Shabani, Dan Simon, Multirate multisensor data fusion for linear systems using Kalman filters and a neural network, Aero. Sci. Technol. 39 (December 2014) 465–471.
[62] Seymour Geisser, Predictive Inference, Chapman and Hall, New York, 1993.
[63] Ruijie Shi, John F. MacGregor, Modeling of dynamic systems using latent variable and subspace methods, J. Chemometr. 14 (5–6) (2000) 423–439.
[64] Branca M.A. Silva, Sílvia Vicente, Sofia Cunha, Jorge F.J. Coelho, Claudia Silva, Marco Seabra Reis, Sérgio Simões, Retrospective quality by design (rQbD) applied to the optimization of orodispersible films, Int. J. Pharm. 528 (1–2) (2017) 655–663.
[65] Andrew Smyth, Meiliang Wu, Multi-rate Kalman filtering for the data fusion of displacement and acceleration response measurements in dynamic system monitoring, Mech. Syst. Signal Process. 21 (2) (February 2007) 706–723.
[66] Gregory W. Stockdale, Aili Cheng, Finding design space and a reliable operating region using a multivariate bayesian approach with experimental design, Quality Technology & Quantitative Management 6 (4) (2009) 391–408.
[67] J.E. Tabora, F. Lora Gonzalez, J.W. Tom, Bayesian probabilistic modeling in pharmaceutical process development, AIChE J. 65 (11) (2019).
[68] G. Taguchi, System of Experimental Design, Unipub/Kraus International Publications, NY, 1987.
[69] R Core Team, The R Project for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2020.
[70] Michael E. Tipping, Christopher M. Bishop, Probabilistic principal component …
from spectral data, Chemometrics and Intelligent Laborarory Systems 66 (2) (2003) analysis, J. Roy. Stat. Soc. B 61 (3) (1999) 611–622.
127–139. [71] O Arda Vanli, Enrique Del Castillo, Bayesian approaches for on-line robust
[53] S.J. Press, Applied Multivariate Analysis: Using Bayesian and Frequentist Methods parameter design, IIE Trans. 41 (4) (2009) 359–371.
of Analysis, R. E. Krieger Pub. Co., Malabar, FL., 1982. [72] James H. Ware, Linear models for the analysis of longitudinal studies, Am.
[54] Ramkumar Rajagopal, Enrique Del Castillo, Model-robust process optimization Statistician 39 (2) (1985) 95–101.
using bayesian model averaging, Technometrics 47 (2) (2005) 152–163. [73] Peter D. Wentzell, Darren T. Andrews, David C. Hamilton, Klaas Faber, Bruce
[55] Ramkumar Rajagopal, Enrique Del Castillo, John J. Peterson, Model and R. Kowalski, Maximum likelihood principal component analysis, J. Chemometr.: A
distribution-robust process optimization with noise factors, J. Qual. Technol. 37 (3) Journal of the Chemometrics Society 11 (4) (1997) 339–366.
(2005) 210–222. [74] C.F. J Wu, M. Hamada, Experiments, Planning, Analysis and Optimization, second
[56] J.O. Ramsay, B.W. Silverman, Functional Data Analysis, Springer, NY, 2007. ed., Wiley, NY, 2011.
[57] Marco S. Reis, Ana C. Pereira, Jo~ao M. Leça, Pedro M. Rodrigues, Jose C. Marques, [75] Jie Yu, S. Joe Qin, Multimode process monitoring with bayesian inference-based
Multiresponse and multiobjective latent variable optimization of modern analytical finite Gaussian mixture models, AIChE J. 54 (7) (2008) 1811–1829.
instrumentation for the quantification of chemically related families of compounds: