Chemometrics and Intelligent Laboratory Systems 206 (2020) 104121

Bayesian predictive optimization of multiple and profile response systems in the process industry: A review and extensions

Enrique del Castillo a,*, Marco S. Reis b

a Dept. of Industrial Engineering and Dept. of Statistics, The Pennsylvania State University, University Park, PA, 16802, USA
b Univ Coimbra, CIEPQPF, Department of Chemical Engineering, Rua Sílvio Lima, Polo II – Pinhal de Marrocos, 3030-790 Coimbra, Portugal

Keywords: High dimensional response optimization; Quality by design; Hierarchical linear models; Robust parameter design

ABSTRACT

Bayesian statistical methods provide a sound mathematical framework to combine prior knowledge about the controllable variables of importance in a process, if available, with the actual data, and have proved useful in several data analytic tasks in the process industry. We present a review and some extensions of bayesian predictive methods for process optimization based on experimental design data, an area that is critical in Quality by Design activities and where the bayesian perspective has received limited attention from the chemometrics and process analytics communities. The goal of the methods is to maximize the probability of conformance of the predicted responses to their specification limits by varying the process operating conditions. Optimization of multiple response systems and of systems where the performance is given by a curve or "profile" is considered, as such systems are more challenging to model and optimize, yet are increasingly common in practice. We discuss the particular case of Robust Parameter Design, a technique due to G. Taguchi and popular in discrete manufacturing systems, and its implementation within a bayesian optimization framework. The usefulness of the models and methods is illustrated with three real-life chemical process examples. MATLAB code that implements all methods and reproduces all examples is made available.

1. Introduction: bayesian optimization and robust parameter design

The utility of the bayesian paradigm to model uncertainties in complex engineering systems has long been recognized. Bayesian "predictivism" centers on forecasting future observable variables (typically, a response of some system) in terms of the resulting predictive posterior distributions, and avoids traditional inferences on non-observable model parameters [4,62]. The focus of this paper is bayesian predictive techniques that deal with experimental data obtained with the goal of optimizing or improving a process or a product.

A specific level of system complexity that can be analyzed in an effective manner using a bayesian predictive approach is when the process response is multivariate, including the high dimensional case, common in current industrial data-rich environments. Related to the multivariate response process case, but significantly different from it, is the case we focus on in this paper: that of a process whose performance is given not by a collection of correlated scalar responses but instead by the shape of a continuous curve, i.e., a "profile" response system, in which an ideal profile or function shape is assumed to exist and describes the best operation of the process [60]. In this type of process, the data collected at given operating conditions are a complete realization of a function of some index variable over a given range, what in the Statistics literature is called Functional Data [56]. We consider only the optimization of a process based on models empirically derived from experimental data, and leave the case of mechanistic models, based on first principles, outside the scope of the paper.

The usefulness of the bayesian paradigm has recently been pointed out by Tabora et al. [67] in the area of pharmaceutical process development. These authors recommend the bayesian predictive approach developed by Peterson et al. [13,38,46–48,50] to consider the inherent uncertainties in process development and to quantify the risk in "Quality by Design" problems, as defined by the FDA. Even though the diffusion of bayesian methods to systems engineering, process analytics and chemometrics is still limited, a number of important problems have been studied. For instance, Nounou et al. developed bayesian latent variable model estimation methodologies [42,43]. Chen et al. [9] propose to study traditional chemometrics methods from a bayesian point of view.

* Corresponding author. E-mail address: exd13@psu.edu (E. Castillo).
https://doi.org/10.1016/j.chemolab.2020.104121
Received 25 February 2020; Received in revised form 22 June 2020; Accepted 21 July 2020; Available online 25 August 2020.
0169-7439/© 2020 Elsevier B.V. All rights reserved.

Fig. 1. A schematic of a process as seen from the point of view of Taguchi’s Robust Parameter Design (RPD). The goal in RPD is to find settings for the controllable
factors (operating or design variables) that keep the responses of the process on the desired targets or goals in the presence of uncontrolled variability in the so-called
noise factors. RPD requires process models, often obtained via experimental tests and statistical modeling.

Mechanistic modeling has benefitted from bayesian formulations for the estimation of kinetic parameters of chemical reactions [29,44,52]. Applications to data rectification [2], state estimation [10] and sensor fusion [3,18,22,61,65] have also been developed for dynamical systems. More recently, several bayesian process monitoring methodologies have also been developed, namely for multimode, non-linear and non-stationary processes [24,31,33,75]. In particular, the bayesian predictive approach is the basis of many contemporary machine learning (ML) methods, but explicit ML models and methods for the optimization of chemical processes are lacking.

In the optimization of a chemical process from data-based models obtained from experimental tests, it is of prime importance to compute the probability that a future experiment will reproduce the current result, considering the different uncertainties involved [58]. There are uncertainties not only due to measurement errors but also in the model itself, or resulting from the type of experimental design used, that need to be considered during the optimization stage. Additional "common cause" sources of variability in the process need to be considered as well. In the discrete-parts industry (but not so much in the process industry), a popular type of process optimization approach was initiated in the late 1980s by the Japanese engineer G. Taguchi [68], who introduced the key concept of a noise factor. Noise factors are process or product variables that can be manipulated in a carefully controlled experiment but that, once the process operates regularly (or the product is in the marketplace), cannot be controlled by the manufacturer anymore and, on the contrary, vary randomly. This situation leads to the Robust Parameter Design (RPD) problem (Taguchi referred to process variables as "parameters"), in which the goal is to find the settings of the controllable process/product variables that optimize the responses despite the uncontrolled variability introduced by the noise factors (see Fig. 1), i.e., to find an optimal solution that is robust with respect to noise factor variability. Both controllable and noise factors can be either continuous variables or categorical variables (taking values over a discrete set of levels). For an overview of Robust Parameter Design methods, see Refs. [1,13,40,74]. As will be shown below, all these sources of uncertainty can be handled with a bayesian predictive optimization approach.

In this paper, we present a review and some extensions of the bayesian predictive approach for the solution of optimization problems in the broader context of the process industries, including the RPD case. We focus on two types of empirical models: multiple response processes (perhaps with a high dimensional response) and processes where the performance is described by some continuous curve or profile. These models have not been sufficiently covered in the literature, but are becoming increasingly relevant as more and more applications involve this type of high-dimensional response (rather than the more common case of a high-dimensional predictor space, as in machine learning applications). We illustrate the methods with 3 real-life applications in the process industry where the system response is a profile.
The remainder of the paper is organized as follows. We first contrast optimization approaches based on models fitted with classical (frequentist) statistical methods with a bayesian predictive approach. Next, two specific types of models, multivariate regression for multiple response processes and hierarchical mixed effects models for processes with a profile or curve response, are presented, and their corresponding bayesian modeling and optimization is discussed, including the case of robust parameter design optimization. Here a hierarchical mixed effects model previously used by del Castillo et al. [15] is adapted to the optimization of high dimensional responses via principal component analysis. Next, we compare the hierarchical mixed effects model for a profile response to the increasingly popular probabilistic latent variable models used in Machine Learning, highlighting their different structure and how the latter are not adequate to model high dimensional responses originated from experimental data where controllable factors are manipulated. Finally, three examples are presented that illustrate the methods as applied to industrial processes. MATLAB code is made available as supplementary material that accompanies this paper, which implements all methods and examples presented here.

Fig. 2. Overlaid contour plot of the four fitted responses in a tire tread experiment in the x1–x2 plane (keeping x3 constant at 0.0). The unshaded region appears to be a "sweet spot" where to run the process, as it seems to satisfy the response constraints. However, these are models for the mean responses, and as such neglect both the process variation and the model uncertainty. See text.


1.1. Limitations of process optimization based on classical statistical models

A common approach for process optimization based on experimental data is to fit regression models to the responses and overlay their contour plots to find a "sweet spot" where to run the process, a task facilitated by various popular statistical software packages. For a classical example, consider the experiment reported in Ref. [16] for the optimization of a car tire tread compound. The controllable factors were x1, hydrated silica level; x2, silane coupling agent level; and x3, sulfur level. The four responses to be optimized and their desired ranges were: PICO Abrasion index, y1, 120 < y1; 200% modulus, y2, 1000 < y2; elongation at break, y3, 400 < y3 < 600; and hardness, y4, 60 < y4 < 75. Quadratic polynomial models were fitted to each of these responses from the experimental data. Fig. 2 shows the overlaid contour plots (obtained with a popular statistical software package), which seem to imply it is safe to run the process within the "sweet spot" in the unshaded area.

However, it can be risky to follow this common practice. Suppose we fit a regression model to a process response of the form Ŷ = g(x1, x2, …, xk) from experimental data. Under the usual regression assumptions, the model fit is a prediction of the mean of Y, i.e., Ŷ = Ê[Y]. Assume, for instance, that we wish to minimize the response by finding operating conditions x* = (x1*, …, xk*) such that Ŷ(x*) = Ê[Y(x*)] ≤ u, where u is an acceptable upper bound for the response. It is important to note that all that E[Y(x*)] ≤ u guarantees for future times we observe the process at conditions x* is that

    P(Y(x*) ≤ u) ≥ 0.5

under the assumption of a distribution symmetric around the mean, such as the normal. Likewise, in the case of two responses, if we find operating conditions x* such that E[Y1(x*)] ≤ u1 and E[Y2(x*)] ≤ u2, all that is guaranteed at these conditions is that

    P(Y1 ≤ u1, Y2 ≤ u2) > 0.25.

In general, suppose we have q responses in a process, and we fit corresponding regression models to E[Y1(x)], …, E[Yq(x)]. If we then find, either numerically or simply by overlapping contour plots of the responses, process conditions x* such that E[Y1(x*)] ≤ u1, E[Y2(x*)] ≤ u2, …, E[Yq(x*)] ≤ uq, all that is guaranteed at x* is that

    P( Y1(x*) ≤ u1, Y2(x*) ≤ u2, …, Yq(x*) ≤ uq ) ≥ 0.5^q

a probability that can indeed be very low.
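To make this bound concrete, the following MATLAB fragment (our illustration, not part of the paper's supplementary code; all numbers are hypothetical) simulates q = 4 independent responses whose means sit exactly at their upper bounds and estimates the joint probability of conformance, which collapses toward 0.5^4 ≈ 0.06 even though each marginal requirement is met:

    % Monte Carlo illustration: marginal mean-on-spec does not imply joint conformance.
    % Hypothetical example; assumes independent, symmetric (normal) responses.
    rng(1);                          % reproducibility
    q  = 4;                          % number of responses
    n  = 1e5;                        % Monte Carlo draws
    u  = zeros(1, q);                % upper specs; means placed exactly at the specs
    Y  = randn(n, q);                % centered responses => E[Yj] = uj for all j
    pMarginal = mean(Y <= u);        % each is ~0.5
    pJoint    = mean(all(Y <= u, 2));
    fprintf('marginal: %s   joint: %.3f   (0.5^q = %.3f)\n', ...
            mat2str(pMarginal, 2), pJoint, 0.5^q);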
In practice, this probability bound is an overestimate, given that the fitted models Ŷj(x) = Ê[Yj(x)] are used instead. Also, if the responses are positively correlated, then the lower bound will be higher and it may be easier to jointly optimize the system; if the responses are negatively correlated, the opposite happens [48,49]. Peterson and Lief [49] document how, in six real-data industrial process optimization studies, including those in three published papers from the literature, the posterior probability of meeting the desired specifications was as low as 0.11, highlighting the danger of optimizing fitted functions for the mean response that neglect the uncertainties involved. In summary, optimizing regression models fitted to a set of responses and looking at overlaid contour plots:

• neglects model parameter uncertainty;
• cannot provide a probability of assurance or "reliability" about whether future responses at given process settings x will satisfy process specifications;
• neglects the covariance between the responses during the optimization step, even if the models were fitted via classical multivariate regression, which does consider these covariances during the estimation step.

1.2. Advantages of the bayesian predictive approach for process optimization

Billheimer [4] has recently advocated the bayesian predictive approach for statistical inference. A natural way to optimize any process from a quality and reliability standpoint is to maximize the probability of conformance of the predicted responses to their specification limits. Bayesian predictive models and their use in industrial process optimization were pioneered by Peterson [46] in pharmaceutical applications (see also [13] for a detailed presentation). As Billheimer [4] indicates, "A scientist is interested in the probability that a future experiment will reproduce the current result". Likewise, in process optimization, an engineer is interested in the probability that the future operation of the process will result in the optimal performance that the settings deduced from past experiments seem to indicate.

In the pharmaceutical sector, the ICH Q8 and Q11 guidelines [19] for industry promoted the concept of a "design space", defined as "the combination of input variables and process parameters that have been demonstrated to provide assurance of quality". Using the methods presented here, it is possible to find a design space for a process that, with known probability, is predicted to achieve particular goals if run inside the space (see Ref. [66], and for an instance, see example 2 below).

In summary, the advantages of the bayesian predictive optimization approach compared to optimizing classically fitted models are:

• it provides a probability estimate that future responses at given operating conditions x will satisfy the desired process specifications;
• it considers the uncertainty in the model parameters, not only in the measurements;
• it takes into account the correlation between the responses during the optimization (not only during model building);
• it permits the inclusion of additional sources of variability, such as "noise factors" in the Taguchi sense;
• it can be extended to different types of responses, as reviewed below.

For further discussion of the disadvantages of the classical approach to response surface optimization and the advantages of a bayesian approach, see [13, chapter 12].

2. Bayesian inference for multiple response and profile response processes

2.1. Central role of the predictive density in bayesian process optimization

Regardless of the model that is most appropriate for a response Y, the object of interest for bayesian optimization is the posterior predictive density (or predictive density, for short) of the response, f(Ỹ | x, data), where "data" denotes all the experimental data available in an n-run experiment in k factors, including the controllable and noise factor settings {x_ij}, i = 1, …, n; j = 1, …, k, and the corresponding observed response values {Y_i}, i = 1, …, n; the tilde on Y denotes a future response value not yet observed at covariates x. The predictive density contains all the relevant information, obtained from the observed data or a priori, about a response for which one wishes to make future inferences, and this includes the quality of the fit of the model, as expressed in the dispersion of the density. Further model diagnostics are based on predictive checks based on this density. Given a statistical model with parameters θ, the posterior predictive density is defined as:


    f(\tilde{Y} \mid x, \text{data}) = \int p(\tilde{Y}, \theta \mid x, \text{data}) \, d\theta = \int p(\tilde{Y} \mid \theta, x, \text{data}) \, p(\theta \mid \text{data}) \, d\theta = \int p(\tilde{Y} \mid \theta, x) \, p(\theta \mid \text{data}) \, d\theta    (1)

where the last step follows because the density of Ỹ at x is independent of the data given the parameters θ (in this case its density only depends on the density of the measurement error ε). Here, p(Ỹ, θ | x, data) is the likelihood function, which is marginalized over the model parameters θ; p(Ỹ | θ, x) is the density function of the future response Ỹ evaluated at covariates x, which depends on the statistical model assumed; and p(θ | data) is the posterior distribution of the model parameters. Depending on the model for the response Y, the integral in equation (1) may have a closed form expression (this is the case for the multiple response regression model, see below). In case it does not have a closed form, Markov Chain Monte Carlo (MCMC) techniques can be used to find the posterior of the parameters and, with it, the posterior predictive density. Either informative or non-informative priors on θ can be used, but informative priors are usually difficult to justify and it is usually better to work with non-informative priors; see Refs. [12–14,25].

For a specific illustration of a simple predictive density, consider a basic (univariate) linear regression model Y = β0 + β1 x1 + … + βk xk + ε, where ε ~ N(0, σ²), with p = k + 1 parameters. In this case, the posterior predictive density f(Ỹ | x̃, Y, X) (where the observed data = (Y, X)) is given by a Student t distribution with N − p degrees of freedom, mean x̃'β̂ (where β̂ is the ordinary least squares estimate of β = (β0, …, βk)', i.e., β̂ = (X'X)⁻¹X'Y) and variance S²(1 + x̃'(X'X)⁻¹x̃). Here, S² is the sample variance estimator of σ², and X is an N × p matrix with one column per model term and one row per experimental test (thus, for this simple model in particular, X = [x_i']_{i=1,…,N}, where x_i' = (1, x_{i1}, x_{i2}, …, x_{ik})). From these results, it is easy to verify that the variance of the predictive density is a function of:

• the amount of data (N, as seen in matrix X and in the sample variance S²);
• the inherent noise due to measurement error (estimated via S²);
• how well the model (as defined by the columns of X) fits;
• the model parameter uncertainty, determined in good part by the experimental design used, present in matrix X;
• the variability of the noise factors x_n, where x = (x_c, x_n) are all the factors in the experiment (controllable and noise factors, respectively);
• our ability to control the noise factor variability thanks to the presence of interaction terms in the model between the x_c and x_n variables. If such control × noise interactions do not exist in the model, there is no way to solve the RPD problem [13].
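As a concrete sketch of these formulas (ours, with simulated data; not the paper's supplementary code, and all parameter values are hypothetical), the following MATLAB fragment computes the posterior predictive mean and variance for the univariate regression model just described, and the predictive probability that a future response falls below a bound u:

    % Posterior predictive density for univariate regression (non-informative prior):
    % Ytilde | xtilde, data ~ t_{N-p}( xtilde'*bhat, S2*(1 + xtilde'*inv(X'X)*xtilde) )
    rng(2);  N = 20;  k = 2;  p = k + 1;
    X    = [ones(N,1), randn(N,k)];           % model matrix, one row per run
    beta = [1; -2; 0.5];                      % (hypothetical) true coefficients
    Y    = X*beta + 0.3*randn(N,1);           % simulated experimental responses
    bhat = X \ Y;                             % OLS estimate
    S2   = sum((Y - X*bhat).^2) / (N - p);    % sample variance estimator of sigma^2
    xt   = [1; 0.5; -1];                      % new operating condition (model form)
    mu   = xt' * bhat;                        % predictive mean
    v    = S2 * (1 + xt' * ((X'*X) \ xt));    % predictive variance
    u    = 1.0;                               % upper bound of interest
    Pu   = tcdf((u - mu)/sqrt(v), N - p);     % P(Ytilde <= u | xt, data)
    fprintf('mean %.3f, var %.3f, P(Y<=u) = %.3f\n', mu, v, Pu);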
2.2. Bayesian optimization of multi-response linear regression models

Let Y1, Y2, …, YJ be J responses of interest in a process. To investigate the input-output relations of the process, an experiment is designed and conducted by varying k controllable factors x1, …, xk over N different test conditions or "runs". Assume we fit a linear regression model to each of the J responses (the same model form for all responses is assumed in this section), each containing q parameters. If the number of runs in the experiment N satisfies the inequality

    J < N − q + 1    (2)

then a standard multivariate linear regression model can be fitted, written in matrix notation as

    Y = XΓ + E    (3)

where Y is a N × J matrix containing the observations of the J responses in each of the N experimental runs (combinations of controllable and noise factors). It is important to note that the word "linear" refers to the model parameters Γ, which enter linearly in the response model, not to the covariates x_i, which can enter in matrix X via a nonlinear basis, such as a polynomial or a spline basis. Hence, curvature or nonlinearities in each dependent variable or response (Yj) can be modeled as a function of the independent variables x_i. Also, X is a N × q design matrix, with all the controllable and noise factors expanded in model form according to the same q-parameter model. Γ is a q × J matrix containing all the q parameters for each response j = 1, 2, …, J (note the same model form is assumed for all responses), and E ≡ [ε_ij] is a N × J matrix of random errors, which are assumed to be probabilistically described as follows:

    ε_{i·} ~ N_J(0, Σ)   for all i = 1, …, N
    ε_{·j} ~ N_N(0, σ_j² I_N)   for all j = 1, …, J.

That is, the model assumes that errors between responses from the same experiment (a row of E) can be correlated, but it also assumes that the errors along a column of E (errors for the same response Yj for different experimental runs i) are independent random variables. Under this model, a J × 1 vector y, containing a single, not yet observed, vector of responses for given levels of the controllable and noise factors x, is assumed to follow the model

    y = Γ'f(x) + ε    (4)

where f(x) is a q × 1 vector containing the values of the controllable and noise factors x = (x_c, x_n) at which the prediction is desired (expanded in model form, the same form as used in the columns of X and Γ), and ε has the same distribution as a row of E, i.e., a N(0, Σ) distribution.

Fortunately, the posterior predictive density of model (4) is available in closed form. Here one can utilize non-informative priors, to avoid heavily weighted priors that are hard to justify. If condition (2) does not hold because the response is very high dimensional, the hierarchical mixed effects model described further below, originally proposed for the optimization of profile responses, can be used as well, as we explain in example 1.

Under the classical non-informative joint prior for Γ and Σ in equation (3), namely p(Γ, Σ) ∝ |Σ|^{−(J+1)/2}, it is well known (see, e.g., Ref. [53], pp. 136, or [13]) that the bayesian predictive density for a new response vector ỹ is given by a J-dimensional Student t distribution with ν = N − q − J + 1 degrees of freedom:

    f(\tilde{y} \mid x, \text{data}) = \frac{\Gamma\!\left(\frac{\nu+J}{2}\right)}{(\pi\nu)^{J/2}\,\Gamma\!\left(\frac{\nu}{2}\right)} \, |H|^{1/2} \left[ 1 + \frac{1}{\nu} (\tilde{y} - \hat{\Gamma}'x)' \, H \, (\tilde{y} - \hat{\Gamma}'x) \right]^{-(\nu+J)/2}    (5)

This is denoted T_ν^J(a, b), where a is the mean vector and b is the variance matrix. That is,

    \tilde{y} \mid x, \text{data} \sim T_\nu^J\!\left( \hat{\Gamma}'x, \ \frac{\nu}{\nu-2} H^{-1} \right)

where Γ̂ is the ordinary least squares (OLS) estimator of Γ, and H is given by

    H = \frac{\nu}{N-q} \cdot \frac{\hat{\Sigma}^{-1}}{1 + x'(X'X)^{-1}x}

where Σ̂ is the usual Maximum Likelihood Estimator (MLE) of Σ.
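The quantities entering (5) are simple to compute from the experimental data; a minimal MATLAB sketch (ours, assuming the N × q model matrix X and the N × J response matrix Y are already in the workspace):

    % Ingredients of the predictive density (5) for the multivariate regression model.
    [N, J]   = size(Y);  q = size(X, 2);
    Ghat     = (X'*X) \ (X'*Y);          % q x J OLS estimator of Gamma
    Ehat     = Y - X*Ghat;               % residual matrix
    Sigmahat = (Ehat'*Ehat) / N;         % J x J MLE of Sigma
    nu       = N - q - J + 1;            % degrees of freedom of the t density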


Based on the bayesian multivariate regression formulation above, Peterson [46] proposed a method to conduct multiple-response optimization that accounts for the uncertainty in the parameters of the model and for any correlation present between the responses. His method assumes no noise factors, hence x = x_c, and consists in solving

    \max_{x_c \in R_c} \ p(x_c) = P(\tilde{y} \in A \mid x_c, \text{data})    (6)

where A is a desired specification region for the J responses and R_c is a feasible region for the controllable factors. In its simplest form, both A and R_c are given by upper and lower bounds.

To obtain p(x_c) by solving (6), the predictive density f(ỹ | x_c, data) in (5) needs to be integrated (numerically) over the region A:

    p(x_c) = \int_A f(\tilde{y} \mid x_c, \text{data}) \, d\tilde{y}    (7)

For a multivariate regression model, this can easily be done by Monte Carlo simulation of a multivariate t distribution, as explained in Appendix A.
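A minimal sketch of this Monte Carlo integration (ours; the variable names and toy dimensions are assumptions, and the complete implementation is in the paper's supplementary MATLAB code):

    % Estimate p(xc) = P(ytilde in A | xc, data) by sampling the multivariate t in (5).
    % Assumed available: Ghat (q x J), Sigmahat (J x J), N, q, nu = N - q - J + 1,
    % XtX = X'*X, fx = f(xc) expanded in model form (q x 1), spec bounds Lb, Ub (1 x J).
    nsim  = 1e5;
    mu    = (Ghat' * fx)';                              % 1 x J predictive mean
    scale = Sigmahat * (1 + fx' * (XtX \ fx)) ...
            * (N - q) / nu;                             % scale matrix H^{-1} of the t
    L     = chol(scale, 'lower');
    Z     = randn(J, nsim);                             % standard normal draws
    W     = chi2rnd(nu, 1, nsim) / nu;                  % chi-square mixing variables
    Ysim  = mu + (L * Z ./ sqrt(W))';                   % nsim x J multivariate-t draws
    inA   = all(Ysim >= Lb & Ysim <= Ub, 2);            % A = hyperbox [Lb, Ub]
    p_xc  = mean(inA);                                  % Monte Carlo estimate of (7)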
If noise factors are present in the system and affect the responses (the RPD case), a bayesian approach that provides solutions which are robust both to the noise factor variability and to the uncertainty in the model parameters was proposed by Miro et al. [38] as an extension of Peterson's method. It consists in solving the following optimization problem:

    \max_{x_c \in R_c} \ p(x_c)_{RPD} = \int P(\tilde{y} \in A \mid x, \text{data}) \, f(x_n) \, dx_n    (8)

That is, after obtaining p(x) for fixed x = (x_c, x_n), a second integration is performed over the (assumed known) distribution of the noise factors, f(x_n), to obtain the probability of conformance to specifications (also called the "reliability" of the process [46]) for the RPD problem, p(x_c)_{RPD}. Miro et al. [38] considered only normally distributed noise factors. We extend this below to the case of Bernoulli(p) and generalized Bernoulli (p1, p2, …, pL) noise factors, given that in many cases categorical noise factors, with L levels each, are uncontrolled and cannot be treated as continuous random variables. The MATLAB code provided (see supplementary materials) implements all the integrations and the optimization needed to solve problem (8). Since the function p(x_c)_{RPD} is in general not concave, a non-linear optimization algorithm must be run from several initial points within the region R_c. Also, evaluation of p(x_c)_{RPD} requires the integral to be computed numerically via Monte Carlo simulation.
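For a single categorical noise factor with a Bernoulli(p) distribution, as in example 1 below, the outer integral in (8) reduces to a weighted sum. A hedged sketch of this step (ours; p_conform is an assumed wrapper around the Monte Carlo evaluation of (7) at the full factor vector x = (x_c, x_n), and fmincon requires the Optimization Toolbox):

    % p(xc)_RPD for one Bernoulli(pn) noise factor taking coded levels -1 and +1.
    pn    = 0.5;                                    % P(xn = +1)
    p_rpd = @(xc) (1 - pn) * p_conform([xc, -1]) ...
               +       pn * p_conform([xc, +1]);
    % Maximize over the feasible box Rc = [-1,1]^4 from several random starts,
    % since p(xc)_RPD is in general not concave:
    best = -Inf;
    for s = 1:10
        x0 = 2*rand(1,4) - 1;
        [xs, fs] = fmincon(@(xc) -p_rpd(xc), x0, [], [], [], [], ...
                           -ones(1,4), ones(1,4));
        if -fs > best, best = -fs; xstar = xs; end
    end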
If the optimal probability after solving (8) is high, then there is a high degree of assurance that future responses inside the specification region A will be achieved, despite the noise factor and other sources of variability. We would then have a robust solution to the RPD problem. The solution will also consider the uncertainty in the model parameters and the measurement error, since it is based on the predictive density of the response. Model parameter uncertainty is linked to the specific type of experimental design used, which determines matrix X.

2.3. Bayesian predictive optimization based on other response models

The multivariate regression model presented earlier assumes all J responses follow the same exact model form with q parameters each. This may constitute a limitation in practical applications. Therefore, this model and its use in process optimization have been extended in Ref. [50] to the case where each response can adopt a different linear model, the so-called "SUR" (seemingly unrelated regressions) case, which can easily be dealt with via Gibbs sampling [45].

The optimization depends on the models used. There are situations where more than one model fits the data reasonably well, but their corresponding optima lie in different regions of the controllable factor space. Rajagopal and del Castillo [54] expand the bayesian optimization problem to the case where a set of models {M_i} fit the response adequately, solving instead the problem

    \max_{x_c \in R_c} \ \sum_i \left[ \int_A p(\tilde{Y} \mid M_i, \text{data}, x) \, d\tilde{Y} \right] p(M_i \mid \text{data})

where p(M_i | data) is the posterior probability of each model i. The model-averaged solution found is then robust with respect to variation in the true model describing the response. This methodology was extended by Ng [41] to the multiple response case, allowing a user to define alternative models for each response.

The bayesian models presented above assume normally distributed data. To robustify this assumption, Rajagopal et al. [55] considered instead the noise in the model as originating from a Student t distribution. The models above also assume a "steady state" input-output behavior of the process. It is possible to extend bayesian optimization ideas to the dynamic case, where some of the variables are lagged. Vanli and del Castillo [71] consider bayesian RPD optimization of a process based on a linear regression model in which the noise factors randomly vary according to a time series model, and frequent reoptimization is necessary to adapt to the varying noise factor state.

2.4. Bayesian optimization of response profiles

In cases where the performance of a process is given by a curve or "profile", as opposed to a set of scalar responses, a different model is necessary, and here we review a useful hierarchical mixed effects model for profile responses originally presented in Ref. [15]. Let us assume the response Y(s) can be observed at several fixed values of an auxiliary variable s, which we will refer to as the "locations" s1, s2, …, sJ (note that here J denotes the number of points per profile). For each experimental run i (i = 1, …, N), where controllable and noise factors x_i = (x_c, x_n)_i have been tried in a designed experiment, a complete profile or function is observed, consisting of J points along the curve Y(s). Thus, rather than observing a continuous curve Y_i(s | x_i), we observe the response at discrete locations s_j and model it accordingly:

    Y_{ij} = g(x_i, s_j) + ε_i(s_j),   i = 1, …, N;  j = 1, …, J,    (9)

where g is some function to be specified/estimated and ε_i(s_j) is a random error, which can depend on the location s_j.

The multivariate regression approach could be used again, treating the response values at the different locations s_j as different responses. If the number of experiments is large, such that condition (2) holds, this would be feasible. However, while this approach provides predictions at the predefined location values s_j, it is not possible to interpolate with this model, since the locations s_j are not used explicitly in the model. That is, it is not possible to make predictions at profile locations s_j other than those observed during the experiment.

When the number of points per profile J is large (and hence J ≥ N − q + 1), the previous approach cannot be applied. An alternative then is to use informative priors and optimize the posterior predictive density, but very informative priors are difficult to justify in general. A more general approach is needed. Del Castillo et al. [15] propose to model a profile response using a hierarchical Bayes approach.

Rather than trying to define a model for g as a function of all x_i and s, a hierarchical model breaks the functional dependency into two stages. In a first stage, the curves or profiles are modeled as a regression function of the locations s:

    Y_{ij} = g(s_j | θ(x_i)) + ε_{ij}.


Fig. 3. Other alternative model structures with latent variables: a) principal components regression; b) partial least squares; c) canonical correlation analysis (adapted from a figure in Ref. [39]). In situations where the input data (x ∈ X) come from a designed experiment, these structures are not appropriate, as there is no latent space in the X space.

In a second stage, the parameters θ = [{θ^(k)}] are then modeled as functions of the controllable and noise factors x = (x_c, x_n). Thus, modifying the controllable factors affects the first stage parameters, which in turn modifies the form of the curve response. Given enough data, selecting the form of the function g(s | θ) is usually not difficult, since for profile responses s lies in R¹ only, and this is illustrated in the examples.

Putting all J sampled points of each profile i in a vector, the model in Ref. [15] is

    y_i = Sθ_i + ε_i,   ε_i ~ N_J(0, Σ),    (10)

with the second stage model given by

    θ_i = B f(x_i) + w_i,   w_i ~ N_p(0, Σ_w)    (11)

for i = 1, …, N, where y_i is a J × 1 vector containing the observations along profile i, S is a J × p matrix of regressors for fitting the stage 1 model (here the regressors will be functions of the locations s), θ_i is a p × 1 vector of stage 1 parameters, and B is a p × q matrix of parameters (the stage 2 parameters) containing the effects of x_i = (x_c, x_n)_i', the experimental conditions in run i. The notation f(x_i) indicates, as before, that x_i is expanded in model form to include all q terms in stage 2. Similarly to previous sections, these functions are typically non-linear in the covariates x_i, and the parameters B enter linearly in (11). Thus, each element of the parameter vector θ_i is assumed to be modeled adequately by a model containing q parameters. The hierarchical model (10)–(11) can be written as the single linear model

    y_i = X_i β + S w_i + ε_i    (12)

where X_i = f(x_i)' ⊗ S is a J × qp matrix and β = vec(B) is a qp × 1 vector. This model is widely used in Biostatistics, in particular in longitudinal growth curve analysis (see Refs. [20,34]), where it is called a linear mixed effects model, since the term S w_i is stochastic but the term X_i β is not (i.e., it is a "fixed" effect). Notice how, using this model, we would fit pq parameters, compared to the Jq parameters that would be needed if the multi-response approach were used, considering each of the J locations a different response. Thus, if J is large compared to p, this represents a more parsimonious model. It is also an alternative for the case J > N − q + 1, when the single stage multi-response approach of the previous section cannot be applied. However, model (12) requires the estimation of a J × J covariance matrix, unless simplifying assumptions are made. Hierarchical approaches based on (10)–(11) explicitly utilize the information about the locations (contained in matrix S); this is not the case in the single stage multiple response approach presented earlier.
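A small simulation may clarify the two-stage structure (our illustration; the basis, dimensions and parameter values are made-up assumptions, and mvnrnd requires the Statistics Toolbox):

    % Simulate profiles from the hierarchical model (10)-(11).
    rng(3);
    J = 10;  s = (1:J)';                   % profile locations
    S = [ones(J,1), s, s.^2];              % stage-1 basis (p = 3, quadratic in s)
    N = 8;   xc = linspace(-1, 1, N)';     % one controllable factor, N runs
    F = [ones(N,1), xc];                   % stage-2 model form f(x) (q = 2)
    B = [5 2; 0.8 -0.5; -0.05 0.02];       % p x q stage-2 parameters (hypothetical)
    Sigw = diag([0.2 0.02 0.001]);         % covariance of the random effects w_i
    Y = zeros(J, N);
    for i = 1:N
        th = B*F(i,:)' + mvnrnd(zeros(3,1), Sigw)';   % stage 2: theta_i
        Y(:,i) = S*th + 0.3*randn(J,1);               % stage 1: profile i
    end
    plot(s, Y);  xlabel('location s');  ylabel('Y(s)');

Changing the controllable factor x_c shifts θ_i and hence the whole simulated curve, which is exactly the mechanism the optimization exploits.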
As discussed in Ref. [72], model (12) is unnecessarily restricted, since the design matrices for the fixed effects term and for the random effects term are linked. For this reason, del Castillo et al. [15] suggested using different matrices S and S* in (12), keeping the original S matrix with the location information for the fixed effects term but selecting S* to match the observed within-profile covariance. The model is therefore

    y_i = X_i β + S* w_i + ε_i    (13)

where we still have that X_i = f(x_i)' ⊗ S. Their method of selecting S*, however, frequently results in S* = S. A better approach, which we fully explain below (section 4.1.1), is to select S* using Probabilistic Principal Components Analysis (PPCA) [70].

Footnote 2: A computational consideration for very large p is that the dense p × p matrix Σ needs to be inverted to find ẑ, which is an O(p³) operation.

The parameters of model (13) are β, {w_i}, Σ_w, and σ². Bayesian inference for this model has been studied by several authors [11,15,35] and requires Markov Chain Monte Carlo (MCMC) sampling, since the joint posterior of these parameters is not a known distribution in closed form. Appendix B gives the full conditional distributions of each parameter needed in the Gibbs sampling algorithm that yields the joint posterior of the model parameters.

The probability of conformance to a set of specifications, p(x_c)_RPD, is maximized once the predictive density of the response along the profile, y | x, data, is obtained via the MCMC scheme in Appendix B. As described in that appendix, the Markov chain is run once until convergence and is sampled within the optimization whenever a value of y | x, data is needed. The MCMC sampling scheme and the optimization needed are implemented in the MATLAB code we provide (see supplementary materials).

3. Comparing the hierarchical model for profile responses to probabilistic latent variable models

Profile responses are increasingly common in the process industry [60], and their dimensionality, i.e., the number of observations per profile, J, can be high. A frequent approach in these high-dimensional settings is to adopt latent variable methods, where the goal is to find one or more lower dimensional (latent) spaces of variables that structure and simplify the analysis of the input-output data. In this section, we compare the structure of probabilistic latent variable methods with the hierarchical model used for profile responses (10)–(11), and show that the former do not provide the model structure necessary when a high dimensional response arises from experimental design data where the covariates that control the process were manipulated.

The latent variable approaches most commonly applied in the chemometrics/process analytics communities are Principal Components Regression (PCR), Partial Least Squares (PLS) and Canonical Correlation Analysis (CCA). They are classically presented as sample estimation procedures, without reference to an explicit stochastic model that represents the population from which the X and Y data were generated. In PCR one estimates the principal components of X and uses them as explanatory regressors for Y; PLS finds the latent variables common to X and Y by maximizing the covariance between these matrices; and CCA finds the linear combinations maximizing the correlation between X and Y (instead of the covariance, as in PLS) [6,17,28,30,32,63]. Until recently, this has been the prevailing perspective in the chemometrics/process analytics communities.


The Machine Learning community looks at these methods from a probabilistic perspective, where a stochastic model explains how the data were generated and based upon which maximum likelihood estimators of the different latent variables can be obtained, and from which bayesian approaches can be developed. Starting with the work by Tipping and Bishop [70] on probabilistic PCA (see Appendix C), various authors have developed probabilistic versions of PCR, PLS, and CCA. Ge [23] has recently reviewed probabilistic latent variable models and their application in chemometrics and, in general, in the process industries. We refer to Ge's paper and to Refs. [5,39] for fuller accounts; we only summarize these probabilistic models next, given that our goal is to contrast their structure with that of the hierarchical mixed effects model we used for profile response experiments.

Footnote 1: The Machine Learning and Statistics literature actually calls "PCA" the model where Σ = σ²I, while it calls the model an FA (factor analysis) model when Σ is diagonal but has different entries. Thus PCA in Chemical Engineering is closer to FA in Statistics and Machine Learning.

Probabilistic Principal Components Regression. In PCR (model a) in Fig. 3), it is assumed that both the response space Y and the controllable factor space X share a common latent space Z (where X, Y, and Z are vector spaces where the corresponding variables lie). The model (assuming centered input and output data, so their means are all zero) is then

    y | z = Wz + ε1,   ε1 ~ N(0, Σ)  (Σ diagonal)    (14)

    x | z = Az + ε2,   ε2 ~ N(0, σ²I)    (15)

where y ∈ Y, x ∈ X, and the latent variables z ∈ Z are of much lower dimension than dim(x) or dim(y). From these assumptions, the relation of most interest for process optimization, y | x, can be obtained [39]. Chen et al. [8] give an excellent presentation of Markov Chain Monte Carlo methods applied to the PCR model, with practical application in chemical processes.
Probabilistic Partial Least Squares. In PLS, in addition to a latent space Z_c that is common to both the Y and X spaces, there is a latent space unique to the input space, Z_x (see model b) in Fig. 3). The model (assuming centered input and output data) is:

    y | z = Wz_c + ε1,   ε1 ~ N(0, σ²I)    (16)

    x | z = Az_c + Bz_x + ε2,   ε2 ~ N(0, σ²I)    (17)

Probabilistic Canonical Correlation Analysis. In addition to the structure in PLS, in CCA there is also a latent space unique to Y, Z_y (model c) in Fig. 3). The model, under centered input and output data, is:

    y | z = Wz_c + Cz_y + ε1,   ε1 ~ N(0, σ²I)    (18)

    x | z = Az_c + Bz_x + ε2,   ε2 ~ N(0, σ²I)    (19)

In both PLS and CCA the induced distribution y | x can also be obtained, but numerical integration is necessary [39].
In contrast with the previous probabilistic latent variable models, in the process optimization setting we discuss in this paper the X space is usually well defined and not high dimensional, as it comes from an experimental design where the number of controllable factors is usually not large. Therefore, there is no need to find a latent structure in it, as there are no underlying unobserved variables generating the observed variability in X; on the contrary, such variation is induced by the measured factors. For the same reason, all of the latent space models above, which assume a common latent structure in the X and Y spaces, are unnecessary and do not correspond to the data structure of an industrial experiment. Dimensionality reduction may be needed in the response space only, where a latent variable structure may exist for which a lower dimensional space can be estimated. For instance, the multivariate linear regression model discussed earlier is applicable when J is not too large compared to N, because otherwise it would require weighting priors that are harder to justify. If the number of responses J is large, a principal component analysis can be conducted on the Y space prior to the analysis, and then a multivariate regression model is built from the X space to the Y latent variable space. The appropriate data structure then goes from X space to Z space to Y space (see Fig. 4).

Fig. 4. The model structure needed if a high dimensionality response is considered in an experimental optimization situation. A latent structure in the Y space only may be necessary if Y is very high dimensional and condition (2) does not hold.

The latent space structure depicted in Fig. 4 is made explicit in the hierarchical model we presented for profile responses, model (10)–(11). Note how the two-stage hierarchical model can be written as

    y | z = Wz + ε1,   ε1 ~ N(0, Σ)  (Σ diagonal)

    z | x = Bx + ε2,   ε2 ~ N(0, σ²I)

by considering the stage 1 model parameters, θ, the latent "features" (z) of the profile or curve response. Thus, the hierarchical mixed effects model implements a feature-based dimensionality reduction of the data. It can be used either for profile/curve response systems (where the main curve features are preselected by the user) or for high-dimensional Y-data via a previously conducted principal component analysis that provides the "design" matrix W above.
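A hedged MATLAB sketch of this X → Z → Y structure (ours; it assumes an N × q design matrix X and an N × J response matrix Y are in the workspace, and uses the Statistics Toolbox function pca):

    % Dimensionality reduction in the response space only (the Fig. 4 structure):
    % 1) PCA of Y alone; 2) regression from the design factors to the retained scores.
    [W, Zsc, lam] = pca(Y);                       % loadings, scores, eigenvalues
    K    = find(cumsum(lam)/sum(lam) > 0.95, 1);  % PCs explaining >95% of variance
    Bz   = X \ Zsc(:, 1:K);                       % regression from X space to Z space
    % Predicted mean response at a new condition xnew (1 x q row, in model form):
    yhat = mean(Y, 1)' + W(:, 1:K) * (xnew * Bz)';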
4. Bayesian predictive optimization in the process industry: some examples

The three examples in this section refer to continuous industrial processes where the response is either multivariate or in the form of a profile, and they were selected also because the experimental data were available to demonstrate the bayesian optimization methods we wish to illustrate. Hence, in the examples below we fit either the bayesian multivariate regression model or the bayesian hierarchical mixed effects model, depending on the type of response to be handled. The importance of sound experimental design and model building techniques, whose coverage is beyond the scope of this paper, cannot be overemphasized. In the examples below we adopt the experimental designs reported and the models fitted in their sources. In addition, each type of model (either multivariate regression or mixed effects) has a variety of model diagnostics that should be consulted. For diagnostics pertaining to the hierarchical mixed effects model, see Ref. [15], where several residual plots and normality tests are presented. These were all checked in the examples that follow, and are omitted for brevity.

4.1. Example 1: quantification of analytes impacting wine aroma

The following example illustrates the RPD optimization method using multivariate regression (the case when condition (2) holds, J < N − q + 1). The optimization of analytical instrumentation is a major activity in research and industrial laboratories, and it is fundamental to get the most out of the high capital and operational costs involved, including operator time. Quite often the measurement system must address multiple targets, becoming a multi-response process. Reis et al. [57] studied the optimization of a headspace solid-phase microextraction (HS-SPME) method, a state-of-the-art extraction methodology for the analysis of volatile chemical compounds in liquid samples. We consider their experimental data and fitted models and use them in our bayesian optimization approach. The goal is to optimize the quantification of analytes impacting wine aroma. The experimental design was an N = 18 run definitive screening design, which allows the estimation of second order polynomials, with 2-factor interactions partially aliased with each other.


Table 1
Estimated predictive posterior probabilities of satisfying the joint bounds A and each interval A_i = (L_i, U_i). Here, p_i = p(L_i ≤ Y_i ≤ U_i | data), using a multivariate regression analysis.

p(Y(x_c) ∈ A | data)   p1      p2      p3      p4      p5      p6      p7      p8      p9
0.110                  0.279   0.557   0.496   0.537   0.524   0.663   0.678   0.936   1.000

Table 2
Optimal settings for the wine aroma original experiment (N = 18, ν = 1), multivariate regression analysis. "Coded" covariates refer to variables scaled to the (−1, 1) numeric range, common in statistical experimental design and analysis. "Uncoded" values show each covariate in its original units of measurement.

                VOL = x1   ETI = x2   ETE = x3   EC = x4
x* (coded)      0.950      0.993      0.081      0.827
x* (uncoded)    9.87       39.61      47.63      16.83

The responses were the chromatographic responses of 9 compounds establishing wine flavor (the peak area of each compound in the chromatogram). These 9 compounds are: isobutyric acid, butyric acid, isovaleric acid, valeric acid, hexanoic acid, octanoic acid, nonanoic acid, decanoic acid and dodecanoic acid.

In order to apply bayesian optimization, we initially treat the 9 chromatographic responses as a 9-dimensional response vector. The goal of the HS-SPME extraction method is to detect the compounds, so the larger the peak areas the better. The responses were standardized and the controllable factors coded into the (−1, 1) convention. Out of the seven factors reported by Ref. [57] (fiber coating, pre-incubation time, extraction time, extraction temperature, headspace sample volume, agitation during extraction and ethanol content), two of them, pre-incubation time and agitation, seem not to have any effect on the responses and were eliminated from the study. The final set of factors was x1 = VOL (headspace sample volume, ranging from 5 to 10 ml), x2 = ETI (extraction time, ranging from 15 to 20 min), x3 = ETE (extraction temperature, ranging from 40 to 55 °C), x4 = EC (ethanol content, ranging from 4.5 to 18%) and x5 = F (type of fiber, a categorical factor with labels {L1-PA, L2-DVB}).

In the following analysis, we treat x5 as a noise factor with a Bernoulli(0.5) distribution, implying that we wish to find settings of the (x1, x2, x3, x4) controllable factors that make the chromatographic areas large with high probability, regardless of the type of fiber used.

Since the goal is to maximize the responses and they are standardized, we set the bounds L_i = 0.5 and U_i = 3 for all responses i = 1, …, 9. We define the tolerance or specification region A for the responses as the hyperbox defined by the intervals (L_i, U_i). We then wish to maximize the joint posterior predictive probability that the 9 chromatographic responses are all inside their bounds. The model fit to each response in Ref. [57] was a quadratic polynomial simplified after elimination of non-significant terms:

    Y = β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5 + β12 x1 x2 + β15 x1 x5 + β34 x3 x4 + β44 x4²    (20)

(note there is no intercept, as the Y's were standardized). The bayesian multivariate regression model (3) was fit and the RPD optimization problem (8) was solved. Tables 1 and 2 show the maximum probabilities of satisfying the given specifications and the resulting solution x_c*, respectively. Fig. 5 shows the point prediction of the optimal responses (mean vector of the posterior predictive density) and one-sigma standard errors. The errors are evidently too large compared to the specifications. The optimal solution (uncoded), x_c* = (9.87, 39.61, 47.63, 16.83), coincides closely with what was found by Ref. [57]. Note that there is no run similar to this solution in the original experimental design, so simply "picking the winner" from the list of experimental trials would not have resulted in an optimal solution.

The overall probability of satisfying the 9 bounds at x* is quite low, only 0.110. This indicates either the presence of large common-cause variability, problems with the amount of data available, issues arising from the experimental design used, or the possibility that the bounds (L, U) considered in the analysis were too strict.
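For reference, the model-form expansion used by (20) and the RPD weighting over the fiber-type noise factor can be written as follows (our sketch; p_conform is the same hypothetical evaluator of (7) used in the earlier fragments):

    % Model expansion f(x) for equation (20); x = [x1 x2 x3 x4 x5] in coded units.
    % (This is the f(x) that p_conform applies internally before using (5)-(7).)
    fx = @(x) [x(1); x(2); x(3); x(4); x(5); ...
               x(1)*x(2); x(1)*x(5); x(3)*x(4); x(4)^2];  % q = 9 terms, no intercept
    % RPD objective for this example: fiber type x5 ~ Bernoulli(0.5) on {-1, +1}.
    p_rpd = @(xc) 0.5 * p_conform([xc, -1]) + 0.5 * p_conform([xc, +1]);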


Fig. 5. Predicted responses (blue stars) at the optimal settings with one std. deviation error bars, using the original experimental design, wine aroma experiment, N ¼
18 and ν ¼ 1 degree of freedom. Optimization using the multivariate regression model. (For interpretation of the references to colour in this figure legend, the reader is
referred to the Web version of this article.)



Fig. 6. Preposterior analysis, predicted responses (blue stars) at the optimal settings in the wine aroma experiment, with one std. deviation error bars assuming the
original design was duplicated, N ¼ 36 and ν ¼ 19 degrees of freedom (note different y-scale compared to Fig. 5). Optimization based on multivariate regression
model. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)

We conducted a preposterior analysis [46] to discern whether more data would have resulted in better probabilities p(Y(x_c) ∈ A | data) or whether the low probability is due to unrealistic bounds. We repeated the optimization after duplicating the design and the response data (N = 36). The results are shown in Fig. 6 and, as can be seen, the standard errors are smaller. The optimal solution x_c* changed little, and the overall probability of conformance went up, but only to 0.326. This is likely due to the increase in the degrees of freedom, from ν = 1 to ν = 19. Moreover, additional replicates do not increase this probability significantly further, an indication that either the specification limits are unrealistic or too narrow relative to the common-cause variation (especially for response Y1), or the response information collected from the original experimental design is too limited, e.g., confounded (biased) effects that will not be resolved via replication of the same design.

Table 4
Optimal settings for the wine aroma original experiment (N = 18), mixed effects model based on a preliminary PCA of the response data for the fixed effects and a probabilistic PCA estimate for the random effects. "Coded" covariates refer to variables scaled to the (−1, 1) numeric range, common in statistical experimental design and analysis. "Uncoded" values show each covariate in its original units of measurement.

                VOL = x1   ETI = x2   ETE = x3   EC = x4
x* (coded)      0.571      0.995      0.091      0.9984
x* (uncoded)    8.92       39.93      48.18      17.98
4.1.1. Reanalyzing the wine aroma experiment using the hierarchical mixed effects model as a response dimensionality reduction method

The large standard errors of the predictive density and the low predictive probability of conformance to the specifications occur because in reality the 9 chromatographic responses are highly correlated, so that their effective dimension is much lower than 9. Therefore, we may follow the modeling approach proposed in Ref. [57] and first conduct a standard principal component analysis (PCA) of the 9 responses, which identifies the first two principal components (PCs) as explaining more than 95% of the variability in the responses. We then define the matrix S in the hierarchical model (10)–(11) by these two principal component loadings:

    S = [ 0.3304   0.3797
          0.2670   0.6036
          0.3486   0.2596
          0.3534   0.0873
          0.3361     …
          0.3424     …
          0.3331   0.3563
          0.3393   0.3385
          0.3219   0.1733 ]    (21)

Note that the first PC is proportional to the average of the nine responses, while the second PC is similar to a contrast between analytes with different chain lengths. The two parameters in the θ_i vectors were modeled as functions of the controllable factors following model (20). In this analysis based on the mixed effects model, as well as in the next two examples below, the design matrix S* used in model (13) was estimated using Probabilistic PCA (Appendix C) applied to the residuals r_i of the model

    Y_i = X_i β + r_i

where the first term on the right hand side is as in equation (12) and r_i = Ŷ_i − Y_i, such that r_i = Wz + ε, where we make W equal to S* in (13); see Appendix C.

We next solve the RPD optimization problem with the same bounds (for the 9 responses) as before. Tables 3 and 4 show the optimal probabilities of conforming to the same specifications presented before and the optimal x_c settings. The overall predictive probability of satisfying the response specifications is much higher, 0.857 (see Fig. 7 for a plot of the predicted optimal responses, which shows an optimal predictive density much more concentrated around its mean).
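A sketch of this S* estimation step (ours; it relies on the Statistics Toolbox function ppca and assumes the N × J residual matrix R from the fixed-effects fit is available):

    % Estimate S* in model (13) by probabilistic PCA of the fixed-effects residuals.
    % R(i,:) = (Yhat_i - Y_i)' for run i; K = number of latent features retained.
    K = 2;
    [Wppca, ~, ~, ~, v] = ppca(R, K);  % Wppca: J x K loadings, v: residual variance
    Sstar = Wppca;                     % used as the random-effects design matrix S*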

Table 3
Estimated predictive posterior probabilities of satisfying the joint bounds A and each interval A_i = (L_i, U_i). Here, p_i = p(L_i ≤ Y_i ≤ U_i | data), using the mixed effects model based on a preliminary PCA of the response data.

p(Y(x_c) ∈ A | data)   p1      p2      p3      p4      p5      p6      p7      p8      p9
0.857                  0.998   0.991   0.997   0.992   0.970   0.935   0.935   0.951   0.986



Fig. 7. Optimal responses (stars) and 95% prediction interval for the responses in the wine aroma experiment under the solution obtained using the mixed effects
model based on the first 2 principal components of the 9 chromatographic responses. Dark lines at 0.5 and 3.5 are the specifications. The nine predicted responses are
inside their specifications, with variability predicted to be inside the desired specifications.

The original problem had the responses highly correlated, effectively diffusing the probability over a space that was too high dimensional with respect to the subspace where the data really lie. Using a smaller number of PCs (2) better captures the underlying probability distribution. Note that the optimal solution x_c* did not change much from model to model. What was gained was a higher degree of assurance that the optimal process will meet its specifications.

4.2. Example 2: optimization of the tensile stiffness orientation profile in a paper machine

Reis and Saraiva [60] considered the prediction of a profile response in a paper manufacturing facility. The response of interest is the tensile stiffness orientation (TSO) angle profile across a paper sheet produced in a paper machine. The TSO angle is closely related to the fiber orientation angle, which has a major effect on the mechanical and dimensional properties of paper. For instance, higher angles tend to favor diagonal curl modes, which can be very detrimental to printing processes, causing frequent jams, lost production and reduced operational efficiency. The TSO profiles across the paper machine were obtained at 10 positions over the machine manifold and can be controlled by manipulating several process variables (experimental factors), such as x1 = VJ (jet velocity, ranging from 800 to 850 m/min), x2 = A (slice opening, ranging from 40 to 80%), and x3 = VR (manifold recirculation, ranging from 40 to 80%). The wire velocity (VW) of the machine was fixed at 750 m/min (this sets the production pace, which was assumed to be kept fixed). For operational reasons, |VW − VJ| > 50 is a condition that was always maintained during experimentation, as this is a machine requirement to balance paper properties (surface quality, dimensional and mechanical properties).

20 Fig. 8. Light lines: observed Tensile Stiffness Orien-


tation (TSO) angle profile responses over the 10 posi-
tions on the paper sheet at which they were measured.
15 Dark line: predicted TSO profile response at the opti-
mized process settings, red lines: the 5 and 95% pre-
dicted percentiles of the optimal TSO posterior
10
distribution. Crossed dark lines: upper and lowed
specifications for the TSO profile. The optimal TSO
5 profile response is predicted to be quite “flat” around
TSO angle (degrees)

zero degrees, with predicted variation inside the


response specifications, as desired. (For interpretation
0 of the references to colour in this figure legend, the
reader is referred to the Web version of this article.)

-5

-10

-15

-20
1 2 3 4 5 6 7 8 9 10
Position across paper sheet

10
E. Castillo, M.S. Reis Chemometrics and Intelligent Laboratory Systems 206 (2020) 104121

Run no: 1 2 3 4 5

10 10 10 10 10
0 0 0 0 0
-10 -10 -10 -10 -10

2 4 6 8 10 2 4 6 8 10 2 4 6 8 10 2 4 6 8 10 2 4 6 8 10
6 7 8 9 10

10 10 10 10 10
0 0 0 0 0
-10 -10 -10 -10 -10

2 4 6 8 10 2 4 6 8 10 2 4 6 8 10 2 4 6 8 10 2 4 6 8 10
11 12 13 14 15

10 10 10 10 10
0 0 0 0 0
-10 -10 -10 -10 -10

2 4 6 8 10 2 4 6 8 10 2 4 6 8 10 2 4 6 8 10 2 4 6 8 10
16 17 18 19 20

10 10 10 10 10
0 0 0 0 0
-10 -10 -10 -10 -10

2 4 6 8 10 2 4 6 8 10 2 4 6 8 10 2 4 6 8 10 2 4 6 8 10

Fig. 9. Dark lines: observed TSO profiles in each of the N ¼ 20 experimental runs. Lighter, dashed lines: predicted TSO profiles (mean of the posterior distribution).
The predictions are so close to the actual they are hard to see.

Table 5 and the corresponding probabilities are shown in Table 6.


Table 5
Clearly, this is a very good solution, as TSO variation across the paper
Optimal settings for Pulp and Paper experiment. “Coded” covariates refer to
sheet is very low.
scaled variables in the (1,1) numeric scale, common in statistical experimental
The profile main effects plot reveals how a low setting for the recir-
design and analysis. “Uncoded” variables show the covariate in its original units
of measurement. culation compensates with the average trend the profiles have, resulting
in a flatter TSO response (see Fig. 10).
VJ ¼ x1 A ¼ x2 VR ¼ x3
We can in addition find a “design space” for this process by evaluating
x* (coded) 0.898 0.995 0.732 the probability of meeting the specifications pðY 2 Ajdata; xÞ at a variety
x* (uncoded) 847.4 79.90 45.35 of x-points [49,66]. This is feasible of course when the number of
controllable factors is low. Fig. 11 shows a contour plot of the probability
and mechanical properties). function evaluated at different lip opening (x3 ) and recirculation ðx4 Þ
From observations of the recorded TSO angle profiles during the values keeping the jet velocity fixed. A confirmation run was made at
experiment (light lines in Fig. 8), it is not difficult to devise a parametric these settings. Fig. 12 shows the actual profile.
model of TSO as a function of the angle at each of the J ¼ 10 positions s in
each profile i (where s 2 f1; 2; …; Jg). There is clearly a sinusoidal fluc- 4.3. Example 3: optimization of a drug release stability profile
tuation along each profile and a general trend, and hence, in the hier-
archical mixed effects model (10–11) we use a first stage model given by: Silva et al. [64] applied Quality-by-Design techniques to find the root
  cause for the observed slower drug release in orodispersible films during

Yi ¼ θ0;i þ θ1;i si þ θ2;i sin si þ εi ; i ¼ 1; …:; N: storage. The response is the drug release time profile, which should
J follow a given reference profile within certain bounds and should also be
For the second stage model, each of the three parameters in θ’i ¼ ðθ0 ; stable along time upon repetition of the same drug release trial. The
θ1 ; θ2 Þ’i are simply modeled as a full quadratic polynomial fðxÞ in x’ ¼ following conditions were analyzed as experimental design factors to
ðx1 ; x2 ; x3 Þ (10 parameters β) to account for main effects, two-factor in- manipulate: x1 ¼ RT ¼ Room temperature, varying from 17 to 25 ⋄C, x2
teractions, and curvature. ¼ RH ¼ room humidity, varying from 30 to 62 %, x3 ¼ DT ¼ drying
A D-optimal, N ¼ 20 run designed experiment was conducted, which temperature, varying from 40 to 60 ⋄C, x4 ¼ ME ¼ mixing equipment, a
gives adequate degrees of freedom to fit the quadratic polynomial model categorical factor with levels fM,Dg and x5 ¼ DS addition, also cate-
in stage 2. The object of the experimentation is to find conditions on the gorical with levels fS,Pg. All factors were coded into the (1,1) scale. The
machine that make the TSO profiles as flat as possible. We therefore categorical factors were modeled each as a Bernoulli(0.5) random
define bounds at Uj ¼ 3 and Lj ¼  3;j ¼ 1;…;J ¼ 10. The 20 observed variable.
TSO profiles are displayed, together with the predicted profiles (from the The resulting Robust Parameter Design problem consists in finding
mean of the posterior predictive density) in Fig. 9. The optimal TSO RT, RH, and DT that with high probability gives a drug release stability
profile is displayed in Fig. 8. The optimal solution x* ¼ xc (no noise profile at 6 months of storage that does not decay with respect to the
factors were considered here) that maximizes the posterior predictive reference profile regardless of the mixing equipment and DS addition
probability that the profile jointly meets its specifications is shown in used. This is an instance of a process that generates profiles that change
in time (1–6 months, in this particular case). In this example, the rational

Table 6
Estimated predictive posterior probabilities of a TSO profile satisfying all bounds Ai ¼ ðLi ; Ui Þ simultaneously and at each point s. Here.pi ¼ pðLi  Yðsi Þ  Ui jdataÞ
pðYðxc Þ 2 AjdataÞ p1 p2 p3 p4 p5 p6 p7 p8 p9

0.898 0.983 0.997 0.939 1.000 0.977 0.994 0.939 0.935 0.962

11
E. Castillo, M.S. Reis Chemometrics and Intelligent Laboratory Systems 206 (2020) 104121

Constant Factor 1: Velocity of the jet


15 1.5

10 1

TSO angle (degrees)

TSO angle (degrees)


5 0.5

0 0

-5 -0.5

-10 -1

-15 -1.5
Position across paper sheet
0 2 4 6 8 10 0 2 4 6 8 10
Position across paper sheet

Factor 2: Lip aperture Factor 3: Recirculation


2 15

1.5
10
1
TSO angle (degrees)

TSO angle (degrees)


5
0.5

0 0

-0.5
-5
-1
-10
-1.5

-2 -15
0 2 4 6 8 10 0 2 4 6 8 10
Position across paper sheet Position across paper sheet

Fig. 10. Effect of each of the 3 controllable variables on the TSO response across the paper sheet. The difference between the two lines shows the effect on the curve or
profile response that the factor has, as it changes from a low to a high setting. Lines with ‘þ’ correspond to the average response when the factor was equal to its
highest experimental setting, dashed lines are the average TSO response when the factor was equal to its lowest experimental setting.

1
TSO angle (degrees)

-1

-2

-3

-4
1 2 3 4 5 6 7 8 9 10
Fig. 11. Contour plot of the predictive probability pðY 2 Ajdata; xÞ for x1 ¼ Position across paper sheet

847:4 while varying x2 and x3 within their feasible region (40%–80%). A given
Fig. 12. Blue line with dots: TSO profile from confirmation run at VJ ¼ 847:4,
contour can be used as the boundary defining the design space for this partic-
A ¼ 79:9 and VR ¼ 45:35. Crossed dark lines: upper and lowed specifications
ular process.
for the TSO profile. (For interpretation of the references to colour in this figure
legend, the reader is referred to the Web version of this article.)
is that if the latest profile (at stability time ¼ 6 months), corresponding to
the oldest stability time of drug, satisfies the product specifications, then
terms (xl2 ) only in the numeric factors x1 to x3 (but not in the categorical
it can be assured that the rest of the profiles at earlier stability times, will
factors, for which they would not make sense).
also satisfy them.
The experimental ran was a N ¼ 25 D-optimal design which allows us
The response Y is therefore the drug release percentage curve at the
to estimate all main effects and 2-factor interactions plus quadratic terms
last observed stability time (6 months) observed over 10 release time
in the non-categorical factors (18 parameters). The optimal settings
points s. Since the drug release vs. release time profiles are percentages,
found (assuming both noise factors are randomly varying as Ber-
they do not follow a normal distribution, necessary in our approach. We
noulli(0.50) random variables) are shown in Table 7.
therefore first normalize the data by applying the Probit transformation:
The optimal profile at the x*c settings is shown in Fig. 14 together with
Y’ ¼ ΦðYÞ1 95% prediction intervals. The estimated maximum joint probability of
having a future curve completely inside the bounds when running the
where Φð Þ is the cumulative distribution function of a standard normal process at settings x*c is 0.4060. The joint probability is low because of the
random variable. The observed curves at each of the 25 experimental difficulty of maintaining the response within bounds at the highest
runs are shown in Fig. 13, together with the predicted profiles. From release times, when it is more variable. Finally, Fig. 15 shows the main
observation of these profiles, a cubic polynomial is assumed in stage 1 of effect plots of factors RT, RH, DT, ME, and DS. The latter two are cate-
the mixed effects model for Y’ðsÞ, requiring four parameters θ. In a sec- gorical factors. From the plots, to keep the release curve as high as
ond stage, quadratic polynomial models were fit with pure quadratic possible, especially with a high increase initially, all 3 controllable

12
E. Castillo, M.S. Reis Chemometrics and Intelligent Laboratory Systems 206 (2020) 104121

Run: 1 2 3 4 5
2 2 2 2 2
0 0 0 0 0
-2 -2 -2 -2 -2
0 10 20 30 0 10 20 30 0 10 20 30 0 10 20 30 0 10 20 30
6 7 8 9 10
2 2 2 2 2
0 0 0 0 0
-2 -2 -2 -2 -2
0 10 20 30 0 10 20 30 0 10 20 30 0 10 20 30 0 10 20 30
11 12 13 14 15
2 2 2 2 2
0 0 0 0 0
-2 -2 -2 -2 -2
0 10 20 30 0 10 20 30 0 10 20 30 0 10 20 30 0 10 20 30
16 17 18 19 20
2 2 2 2 2
0 0 0 0 0
-2 -2 -2 -2 -2
0 10 20 30 0 10 20 30 0 10 20 30 0 10 20 30 0 10 20 30
21 22 23 24 25
2 2 2 2 2
1
0 0 0 0 0
-1
-2 -2 -2 -2
0 10 20 30 0 10 20 30 0 10 20 30 0 10 20 30 0 10 20 30

Fig. 13. Dark lines: observed 6-month drug release profiles in each of the N ¼ 25 experimental runs (Probit transformation applied). Lighter, dashed lines: predicted
profiles (mean of the posterior distribution). The predictions are so close to the actual they are hard to see.

displayed in Table 7. A confirmation run was made at these settings.


Table 7 Fig. 16 shows the actual drug release profiles at different stability times.
Optimal settings for Drug Release experiment. “Coded” covariates refer to scaled Finally, a simple type of optimization can be done over the categorical
variables in the (1,1) numeric scale, common in statistical experimental design factors if desired. This can be achieved by systematically varying the
and analysis. “Uncoded” variables show the covariate in its original units of
probabilities p1 and p2 of the Bernoulli(pi ) variables x4 and x5 , and solve
measurement.
the associated bayesian RPD problems again. Table 8 shows the results
RT ¼ x1 RH ¼ x2 DT ¼ x3 obtained in this analysis. A value of pi ¼ 1 is equivalent to say that the
*
x (coded) 0.855 0.848 0.841 corresponding factor was fixed at its low (1) setting. While the optimal
x* (uncoded) 24.42 59.57 58.48 settings of the controllable factors change little when ME and RS are
varied, it is clear that highest probabilities of conformance to specifica-
tions can be obtained when these two categorical factors have opposing
3 settings, either ðx1 ; x2 Þ ¼ ð1; 1Þ or (1,-1) corresponding to the (ME,
0.9
DS)¼(M,P) or (D,S) settings.
2
0.8
5. Conclusions
0.7
Probit( drug release %)

1
We have provided a review and some extensions of bayesian pre-
Drug release (%)

0.6
dictive optimization methods, with application to the process industries.
0 0.5
The methods are based on predictors and response data from designed
0.4 experiments, where the response is in the form of either a vector of
-1
0.3 correlated responses, possibly of high dimensionality, or a profile.
Bayesian predictive methods are becoming increasingly relevant in
0.2
-2 practice not only for process optimization but also in QbD applications
0.1
where it is of interest to find a design space where the future performance
-3 of the process is guaranteed with certain probability (as in pharmaceu-
0 5 10 15 20 25 30
release time (minutes) tical applications, see Ref. [47,49,66]). The bayesian predictive approach
to process optimization provides a probability (or reliability measure) of
Fig. 14. Light lines: observed drug release profiles after a 6 month stability
satisfying the process goals at the optimal settings x*c . Such probability
period. A probit (Y’ ¼ ΦðYÞ1 ) transformation was applied to the drug release
percentage response before analysis. Actual drug release % can be read from the statements are not possible to obtain with classical statistics. A fre-
right hand axis. Dark line: optimal predicted drug release profile at 6 months, quentist “design space” can be obtain classically by bootstrapping, a
red lines: the 5 and 95% predicted percentiles of the optimal drug release possibility not reviewed here. The bootstrapping approach, however, can
posterior distribution, crossed dark lines: upper and lowed specifications for the only provide confidence regions on the optimal operating conditions x*c
6-month drug release profile. The predicted optimal profile response including and it is not possible to provide probability statements about whether the
its 95% variability is inside the desired specifications and only slightly exceeds process will fall or not in the given region if in the future it is run at
them for large release times. (For interpretation of the references to colour in
operating conditions x*c , something easily done under the bayesian
this figure legend, the reader is referred to the Web version of this article.)
framework.
Extensions to the original bayesian process optimization methodol-
ogy, based on regression models in Ref. [46] were presented. These
factors, RT, RH, and DT should be run at their high settings. This agrees
included the use of a hierarchical mixed effects model that can be used
with the optimal solution, obtained via numerical optimization,
either when the number of responses is too large for a noninformative

13
E. Castillo, M.S. Reis Chemometrics and Intelligent Laboratory Systems 206 (2020) 104121

Factor 1: RT Factor 2: RH Factor 3: DT Fig. 15. Effect of each of the 3 controllable variables
0.5 of the drug release profiles at 6 month stability time.
0.2 0.2 The difference between the two lines shows the effect
on the curve or profile response that the factor has, as
it changes from a low to a high setting. Lines with ‘þ’
0 0 0
correspond to the average response when the factor
was equal to its highest experimental setting, dashed
-0.2 -0.2 lines are the average response when the factor was
-0.5 equal to its lowest experimental setting. Response (%
0 10 20 30 0 10 20 30 0 10 20 30 drug release) shown after Probit transformation.

Factor 4: ME (noise) Factor 5: DS (noise)


0.1 0.2

0 0

-0.1 -0.2
0 10 20 30 0 10 20 30

100
Table 8
Optimal settings for the drug release experiment when noise factors x4 ¼ ME and
90
x5 ¼ RS are assumed categorical with Bernoulli distributions with probabilities
80 p1 and p2 respectively. The optimal solutions for controllable factors x1 ; x2 ; x3
remain largely the same.
Drug release %

70
p1 p2 pðYðxc Þ 2 AjdataÞ x1* ¼ x2* ¼ x3* ¼ x4 ¼ x5 ¼
60
RT RH DT ME RS
50
0.5 0.5 0.406 0.855 0.848 0.841
40 1 1 0.416 0.814 0.998 0.999 1 1
1 0 0.496 1.000 0.754 0.917 1 1
30
0 1 0.505 0.889 0.830 0.999 1 1
20 0 0 0.311 0.792 0.811 0.910 1 1

10

0 The probabilistic latent variable models were shown to have a


0 5 10 15 20 25 30
structure that does not correspond to that of data from a process opti-
Release time (minutes)
mization experiment. We also discussed the case noise factors are present
Fig. 16. Light lines: observed drug release profiles at stability times 1, 2, 3, 4, 5, and a “Robust Parameter Design” optimization is desired, and an
and 6 months from confirmation run at RT ¼ 24:4, A ¼ 59:5 and VR ¼ 58:4 extension to the case when noise factors are categorical was discussed
(x1 ¼ 0:855;x2 ¼ 0:848;x3 ¼ 0:841). Categorical factors were randomly varied and illustrated. The MATLAB code provided (see supplementary mate-
as Bernoulli(0.5) random variables, thus the solution is robust with respect to rials) can be used to replicate the analyses in the case studies covered in
either setting of ME and DS. Crossed dark lines: upper and lowed specifications this paper, and it is hoped it will facilitate the comprehension of the
for the TSO profile. paper conceptual contents, as well as their application and future
adoption.
multivariate regression analysis, in such a way that a preliminary prob-
abilistic PCA provides the “features” to be modeled in a second stage Declaration of Competing Interest
model as a function of controllable factors, or to model profile responses,
where a first stage model defines a parametric model expressing the The authors declare that they have no known competing financial
shape of the curve that has been observed, and a second stage models the interests or personal relationships that could have appeared to influence
relation of these parameters with the controllable factors. In either case, a the work reported in this paper.
complete bayesian solution is available. Given that there are no gua-
rantees for convergence in bayesian analysis of models that require Acknowledgements
MCMC, as is the case in particular in complex predictive models, it is
important to emphasize users follow Bayesian MCMC diagnostic tools This research was funded by a scholarship grant to the first author by
[25]. Note how MATLAB [36], R [69], and Python [21] all have such the Fulbright Portugal foundation.
diagnostic tools available.

Appendix A. Calculating the probability of conformance to specifications

The following Monte Carlo procedure can be used to compute pðxc ÞRPD , the quantity maximized in problem (8) for the single stage, multiple response
approach (see [38] for more details).

1 Set c ¼ 0.
2 Simulate a value of the noise factors, xn , from their (assumed known) distribution. The vector xc is given and contains the values of the controllable
factors. Make x ¼ ðxc ; xn Þ.

14
E. Castillo, M.S. Reis Chemometrics and Intelligent Laboratory Systems 206 (2020) 104121

  0 

3 Simulate yx; data TνJ B x; ν2
ν H1 .

4 If y 2 A, make c ← c þ 1. Go to step 2 and repeat m times.


5 Return bp ðxc ÞRPD ¼ c=m.

Step 3 can be performed efficiently in MATLAB using function mvtrnd in the statistics toolbox and in R using function rmvt in package mvtnorm.
For the hierarchical mixed model, step 3 is substituted by sampling from the simulated Markov Chain obtained via Gibbs sampling (after the burn-in
period) and substituting these simulated parameters in the likelihood, i.e., values of y are obtained from the predictive density by composition (see
Appendix B). As reported by Miro et al. [38], it is more effective to generate the required m noise factors once and use a common random numbers
strategy in the optimization whenever pðxc ÞRPD needs to be evaluated within the optimization process. Optimizing this function is a hard nonlinear
problem, and Miro et al. report how it is useful to start the optimizer from a point likely to contain enough “area” of the predictive density.

Appendix B. Gibbs sampling for estimating the linear mixed model

Lange et al. [35] give the full conditionals for the parameters ðβ; fwi g; fσ 2i g; Σw Þ in the linear mixed model (13) where the σ 2i ’s allow different
variances among the observed profiles (this is sometimes useful in longitudinal analysis models when it is not desired to make inferences on profiles
other than the N observed). Chib and Carlin [11] considered the case we consider, where Σ ¼ σ 2 IJ , and show how the Lange et al. procedure suffers
from slow convergence, and proposed two alternative algorithms, one (their algorithm 2) which is a pure Gibbs sampling approach, and another one
(their algorithm 3) which has a Metropolis step (a rejection sampling step). Since the convergence properties of these two algorithms appear to be about
the same, in particular for the β parameters, we choose Chib and Carlin's algorithm 2 (after correcting some errors in the full conditionals in their paper)
since it does not require a Metropolis step (i.e., all full conditionals are known distributions). We also provide the full conditionals of the other pa-
rameters since they were not explicitly given by these authors.
The priors we use were:
β Npq ðβ0 ; B0 Þ, with β0 ¼ 0 and B0 ¼ 1000I (noninformative for β);
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
σ 2 IGðλ1 ; λ2 Þ (inverse-gamma distribution), with λ1 ¼ 0:001 and λ2 ¼ 5 (this gives Eðσ 2 Þ ¼ 2 and Varðσ 2 Þ ¼ 63, relatively non-informative);
Σ1 1
w Wishartðν0 R0 ; ν0 Þ with ν0 ¼ p (most non-informative choice) and R0 ¼ Ip (gives EðΣw Þ ¼ Ip with as large variance as possible);
The Gibbs sampling scheme is:

1. Sample β from βY; σ 2 ; Σw :
 X
N 1  X
N  X
N 1 
0 0 0
N X i V 1 X i þ B1
0 X i V 1 yi þ B1
0 β0 ; X i V 1 X i þ B1
0
i¼1 i¼1 i¼1

0
ðmÞ
where V ¼ S Σw S ' þ σ 2 I and X i ¼ xi
S* .

2. Sample the random effects wi , i ¼ 1; …; N from fwi gY; β; σ 2 ; Σw :
 ' 1 ðwÞ  ' 1 
S S S ' Ri S S
N þ Σ1 ; þ Σ1
σ2 w
σ2 σ2 w

ðwÞ
where Ri ¼ yi  X i β.

1 
3. Sample Σ1
w from Σw fwi g:

 X
N 1 
0
Wishart wi wi þ ν0 R1
0 ; N þ ν0
i¼1


4. Sample σ 2 from σ 2 Y; β; fwi g:
  1 
JN 1 1  ðσ2 Þ0 ðσ2 Þ 
Gamma λ1 þ ; þ R R
2 λ2 2

where Rðσ Þ ¼ Y  X β  vecðS w1 ; …; S wN Þ (a NJ  1 vector).


2

Gibbs sampling for mixed effect models can sometimes have very high MCMC autocorrelations, and the use of Hamiltonian MCMC (e.g. via Stan, [7])
can help as an alternative algorithm.
The MCMC sampling of the mixed effects model parameters needs not be conducted within the optimization routine necessary to solve (8), otherwise
this would imply a tremendous computational burden. The reason for this is given by the model written as in (12). Given realizations of the posterior of
Θ ¼ ðβ; fwi gΣw ; σ 2 Þ, pðΘjdataÞ, we can simulate draws of the posterior predictive density by composition (Gelman et al., 2004):

15
E. Castillo, M.S. Reis Chemometrics and Intelligent Laboratory Systems 206 (2020) 104121

 Z 
p yjdata; xÞ ¼ pðy; Θjdata; x dΘ
Z
¼ pðyjdata; Θ; xÞpðΘjdataÞdΘ

Thus, we conduct a simulation of the Gibbs sampling chain until convergence, approximating in this way pðΘjdataÞ and simply sample from it. We then
substitute the sampled values into the density pðyjdata; Θ; xÞ whenever a yjdata; x vector is needed in the optimization routine (this replaces step 3 in
Appendix A). Hence, the MCMC computations are only run once, before performing the optimization. For more information on MCMC methods, see
Gelman et al. [25] and for a more concise introduction, see Colosimo and Del Castillo [12].

Appendix C. Probabilistic Principal Components Analysis

In order to present the probabilistic version of PCA of Tipping and Bishop [70], we first contrast the closely related Maximum Likelihood PCA
(MLPCA) models in the Chemometrics literature (e.g., as in [73]) and the Factor Analysis (FA) model in the Machine Learning and Statistics literature
(e.g., as in [5, 26])1. We then discuss probabilistic PCA, as it was used in the estimation of matrix S in the linear mixed effects model (13).
Both MLPCA and FA models are based on the description:

Y ¼ Wz þ μ þ ε

where Y is a p-dimensional observed vector and z is an unobserved (and therefore, hypothesized) k dimensional vector, with k typically much smaller
than p. W is called the “loadings” matrix and the entries in the z vector are called the latent or “factor” variables. Both MLPCA and Factor Analysis
assume:

ε Nð0; ΣÞ:
If we redefine Y  μ to be observed data (i.e., if we center the data with b
μ ¼ Y), the parameter μ can be neglected.
MLPCA model characteristics.- The goal of MLPCA according to [73], is to best estimate z given the observed random vectors Y, an estimate of W
and a known covariance matrix Σ. In MLPCA it is assumed z is an error-free constant, so the model is a linear regression model, and the optimal solution
is given by the generalized least squares estimator:
 0 1 0
bz ¼ W Σ1 W W Σ1 Y

b ¼ Wbz þ b
which yields predictions of the responses Y μ . This prediction equation is used to minimize the “reconstruction error”

X
N
SΣ ¼ b  Yi Þ0 Σ1 ð Y
ðY b  Yi Þ
i¼1

with respect to W. The MLPCA literature also discusses the case where Σ varies with each observation, so the covariance between the same elements of
different Yi ’s can differ with the observation number i. Covariances between elements in each Yi are modeled via the entries of Σi , which is a dense (not
diagonal) matrix. Note that in processes where a very large number of variables p are measured the number of elements in Σi is high (pðp þ1Þ= 2 distinct
entries). These covariances are assumed known via information available for the measurement noise, although [73] mention that in practice they need
to be estimated2. The objective of MLPCA in Chemometrics then is to find relationships between the measured variables.
FA model characteristics.- The FA model in the Machine Learning/Statistics literature, rather than considering the latent variables fixed constants,
assumes instead they are random:

z Nð0; Σz Þ:

This is a “random effects” model with two sources of variability affecting the observations. The term containing the latent variables z models corre-
lations between the elements of Y, while the error variables ε account for measurement error in each of these elements. In the Chemometrics field, Reis
and Saraiva [59] use random latent variables (in contrast to Wenzell [73]), calling the model an heteroscedastic latent model, and use it for Statistical
Process Control.
The goal when using the FA model is to estimate W and Σ under the additional assumption that Σz ¼ Ik (the k dimensional identity matrix) to best
model the correlation in the entries of the observed Y vectors. This additional assumption does not lose generality because any correlation between the
elements of Y can still be modeled given that
0
CovðYÞ ¼ WW þ Σ: (21)

A particular case: Probabilistic PCA.- If Σ ¼ σ I (same measurement error variance for all elements in Y) the FA model is called a Probabilistic
2

PCA model (PPCA). The maximum likelihood estimates of W and σ 2 were shown by [70] to be:
 1=2
c
W ¼ V L  σ2 I R

where V is a p  k matrix with columns equal to the k eigenvectors associated with the top k eigenvalues of

16
E. Castillo, M.S. Reis Chemometrics and Intelligent Laboratory Systems 206 (2020) 104121

X
N
S¼ ðY b  Yi Þ0
b  Yi Þð Y
i¼1

L is a diagonal matrix with the k largest eigenvalues, and R is an arbitrary rotation matrix, which can be set equal to I. The MLE of σ 2 is

1 X
p
σ2 ¼
b λi (22)
p  k i¼kþ1

(the variance associated with the discarded dimensions). The inverse mapping in PPCA, giving the latent vector associated with a given observation Y, is
the posterior distribution of z, which is
  0 1 0  0 1 

zY N W W þ σ 2 I W Y; σ 2 W W þ σ 2 I

As it can be seen, the posterior mean is a linear projection that can be interpreted as “ridge” regression [27]. Probabilistic PCA provides a “generative”
model for PCA, but it is not a fully bayesian PCA approach as the subspace dimension k is considered fixed and known and W and σ 2 are considered
unknown constants. Minka [37] has provided a bayesian model selection version of Probabilistic PCA that also finds k and compares it with other
methods for selecting the best subspace dimension, providing some empirical evidence his method is more accurate.

Supplementary materials

All datafiles and MATLAB code that implements the methods discussed here, including MCMC estimation and numerical optimization are provided.
Running the script scriptExamples.m reproduces all examples in the paper by calling the appropriate functions. In addition, function Find_PCA_Dim.m is
provided, which implements the exact ML solution for Probabilistic PCA. It also finds the best latent space dimension using a bayesian algorithm
developed by Minka [37]. For readers who prefer R or Python, the examples in the paper, in particular the latent variable models, can be replicated
using Bayesian computing languages such as JAGS [51] and Stan [7].

References [19] Fda, Guidance Document: Q11 Development and Manufacture of Drug Substances,
2012.
[20] G.M. Fitzmaurice, N.M. Laird, J.H. Ware, Applied Longitudinal Analysis, John Wiley
[1] Asi, Robust Designs Using Taguchi Methods, American Supplier Institute (ASI)
& Sons, Hoboken, NJ, 2004.
Press, Livonia, MI., 1998.
[21] Python Software Foundation, Python Language Reference, 2020, version 3.8.3.
[2] Bhavik R. Bakshi, Mohamed N. Nounou, Prem K. Goel, Xiaotong Shen, Multiscale
[22] J.B. Gao, C.J. Harris, Some remarks on Kalman filters for the multisensor fusion, Inf.
bayesian rectification of data from linear steady-state and dynamic systems without
Fusion 3 (3) (September 2002) 191–201.
accurate models, Ind. Eng. Chem. Res. 40 (1) (2001) 261–274.
[23] Zhiqiang Ge, Process data analytics via probabilistic latent variable models: a
[3] Yaakov Bar-Shalom, Leon Campo, The effect of the common process noise on the
tutorial review, I&EC Research 57 (2018) 12646–12661.
two-sensor fused-track covariance, IEEE Trans. Aero. Electron. Syst. 22 (6)
[24] Zhiqiang Ge, Muguang Zhang, Zhihuan Song, Nonlinear process monitoring based
(November 1986) 803–805.
on linear subspace and bayesian inference, J. Process Contr. 20 (5) (2010) 676–688.
[4] Billheimer Dean, Predictive inference and scientific reproducibility, Am. Statistician
[25] Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Vehtari Aki, Donald
73 (sup1) (2019) 291–295.
B. Rubin, Bayesian Data Analysis, Chapman and Hall/CRC, 2013.
[5] Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer
[26] Zoubin Ghahramani, Geoffrey E. Hinton, et al., The EM algorithm for mixtures of
Scienceþ Business Media, 2006.
factor analyzers. Technical report, Technical Report CRG-TR-96-1, University of
[6] A.J. Burnham, John F. MacGregor, R. Viveros, Latent variable multivariate
Toronto, 1996.
regression modeling, Chemometr. Intell. Lab. Syst. 48 (2) (1999) 167–180.
[27] Trevor Hastie, Robert Tibshirani, Martin Wainwright, Statistical Learning with
[7] Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich,
Sparsity: the Lasso and Generalizations, Chapman and Hall/CRC, 2015.
Michael Betancourt, Brubaker Marcus, Jiqiang Guo, Peter Li, Allen Riddell Stan,
[28] Agnar H€ oskuldsson, PLS regression methods, J. Chemometr. 2 (3) (1988) 211–228.
A probabilistic programming language, J. Stat. Software 76 (1) (2017).
[29] Shuo-Huan Hsu, Stephen D. Stamatis, James M. Caruthers, W.N. Delgass,
[8] Hongshu Chen, Bhavik R. Bakshi, Prem K. Goel, Bayesian latent variable regression
V. Venkatasubramanian, Gary E. Blau, M. Lasinski, S. Orcun, Bayesian framework
via gibbs sampling: methodology and practical aspects, J. Chemometr.: A Journal of
for building kinetic models of catalytic systems, Ind. Eng. Chem. Res. 48 (10)
the Chemometrics Society 21 (12) (2007) 578–591.
(2009) 4768–4790.
[9] Hongshu Chen, Bhavik R. Bakshi, Prem K. Goel, Toward bayesian chemometrics—a
[30] Ulf G. Indahl, H. LilandKristian, Næs Tormod, Canonical partial least squares—a
tutorial on some recent advances, Anal. Chim. Acta 602 (1) (2007) 1–16.
unified PLS approach to classification and regression problems, J. Chemometr. 23
[10] Wen-shiang Chen, Bhavik R. Bakshi, Prem K. Goel, Ungarala Sridhar, Bayesian
(9) (2009) 495–504.
estimation of unconstrained nonlinear dynamic systems, IFAC Proceedings Volumes
[31] Qingchao Jian, Biao Huang, Xuefeng Yan, GMM and optimal principal components-
37 (1) (2004) 263–268.
based bayesian method for multimode fault diagnosis, Comput. Chem. Eng. 54
[11] S. Chib, B.P. Carlin, On mcmc sampling in hierarchical longitudinal models, Stat.
(2008) 338–349.
Comput. 9 (1999) 17–26.
[32] R.A. Johnson, D.W. Wichern, Applied Multivariate Statistical Analysis, Prentice
[12] B.M. Colosimo, E. Del Castillo, Modern numerical methods in bayesian
Hall, 2002.
computation, in: Bayesian Process Monitoring, Control and Optimization, CRC/
[33] Hiromasa Kaneko, Kimito Funatsu, Adaptive soft sensor based on online support
Taylor and Francis, NY, 2007.
vector regression and bayesian ensemble learning for various states in chemical
[13] E. Del Castillo, Process Optimization, a Statistical Approach, Springer, 2007.
plants, Chemometr. Intell. Lab. Syst. 20 (2014) 57–66.
[14] E. Del Castillo, B.M. Colosimo, An introduction to bayesian inference in process
[34] N.M. Laird, J.H. Ware, Random effects models for longitudinal data, Biometrics 38
monitoring, control, and optimization, in: Bayesian Process Monitoring, Control
(4) (1982) 963–974.
and Optimization, CRC/Taylor and Francis, NY, 2007.
[35] Nicholas Lange, Bradley P. Carlin, Alan E. Gelfand, Hierarchical Bayes models for
[15] Enrique Del Castillo, Bianca M. Colosimo, Hussam Alshraideh, Bayesian modeling
the progression of HIV infection using longitudinal cd4 t-cell numbers, J. Am. Stat.
and optimization of functional responses affected by noise factors, J. Qual. Technol.
Assoc. 87 (419) (1992) 615–626.
44 (2) (2012) 117–135.
[36] Matlab, The Mathworks, inc., Natick, Ma, 2020.
[16] Derringer George, Ronald Suich, Simultaneous optimization of several response
[37] Thomas P. Minka, Automatic choice of dimensionality for PCA, in: Advances in
variables, J. Qual. Technol. 12 (4) (1980) 214–219.
Neural Information Processing Systems, 2001, pp. 598–604.
[17] William R. Dillon, Matthew Goldstein, Multivatiate Analysis - Methods and
[38] G. Miro, E. Del Castillo, J.J. Peterson, A bayesian approach for multiple response
Applications, Wiley, 1984.
surface optimization in the presence of noise variables, J. Appl. Stat. 31 (3) (2004)
[18] Alireza Fatehi, Biao Huang, Kalman filtering approach to multi-rate information
251–270.
fusion in the presence of irregular sampling rate and variable measurement delay,
[39] Kevin P. Murphy, Machine Learning: a Probabilistic Perspective, MIT press, 2012.
J. Process Contr. 53 (15–25) (May 2017).

17
E. Castillo, M.S. Reis Chemometrics and Intelligent Laboratory Systems 206 (2020) 104121

[40] R.H. Myers, D.C. Montgomery, C. Anderson-Cook, Response Surface Methodology, case study—solid-phase microextraction (spme) applied to the quantification of
3rd. ed., Wiley, NY, 2009. analytes with impact on wine aroma, J. Chemometr. 33 (3) (2019), e3103.
[41] Szu Hui Ng, A bayesian model-averaging approach for multiple-response [58] Marco S. Reis, Pedro M. Saraiva, Integration of data uncertainty in linear regression
optimization, J. Qual. Technol. 42 (1) (2010) 52–68. and process optimization, AIChE J. 51 (11) (2005) 3007–3019.
[42] Mohamed N. Nounou, Bhavik R. Bakshi, Prem K. Goel, Xiaotong Shen, Bayesian [59] Marco S. Reis, Pedro M. Saraiva, Heteroscedastic latent variable modelling with
principal component analysis, J. Chemometr. 16 (9) (2002) 576–595. applications to multivariate statistical process control, Chemometr. Intell. Lab. Syst.
[43] Mohamed N. Nounou, Bhavik R. Bakshi, Prem K. Goel, Xiaotong Shen, Process 80 (1) (2006) 57–66.
modeling by bayesian latent variable regression, AIChE J. 48 (8) (2002) [60] Marco S. Reis, Pedro M. Saraiva, Prediction of profiles in the process industries, Ind.
1775–1793. Eng. Chem. Res. 51 (11) (2012) 4254–4266.
[44] A.M. Overstall, D.C. Woods, K.J. Martin, Bayesian prediction for physical models [61] Sajjad Safari, Faridoon Shabani, Dan Simon, Multirate multisensor data fusion for
with application to the optimization of the synthesis of pharmaceutical products linear systems using Kalman filters and a neural network, Aero. Sci. Technol. 39
using chemical kinetics, Comput. Stat. Data Anal. 132 (2018) 126–142. (December 2014) 465–471.
[45] D.F. Percy, Prediction for seemingly unrelated regressions, J. Roy. Stat. Soc. B 54 [62] Geisser Seymour, Predictive Inference, Chapman and Hall, New York, 1993.
(1992), 254–252. [63] Ruijie Shi, John F. MacGregor, Modeling of dynamic systems using latent variable
[46] J.J. Peterson, A bayesian reliability approach to multiple response surface and subspace methods, J. Chemometr. 14 (5–6) (2000) 423–439.
optimization, J. Qual. Technol. 36 (2) (2004) 139–153. [64] Branca MA. Silva, Sílvia Vicente, Sofia Cunha, Jorge FJ. Coelho, Claudia Silva,
[47] J.J. Peterson, A bayesian approach to the ich q8 definition of design space, Marco Seabra Reis, Sergio Sim~ oes, Retrospective quality by design (rqbd) applied to
J. Biopharm. Stat. 18 (2008) 959–975. the optimization of orodispersible films, Int. J. Pharm. 528 (1–2) (2017) 655–663.
[48] J.J. Peterson, E. Del Castillo, A posterior predictive approach to multiple response [65] Andrew Smyth, Meiliang Wu, Multi-rate Kalman filtering for the data fusion of
surface optimization, JSM, Salt Lake City, Utah, 2007. displacement and acceleration response measurements in dynamic system
[49] John J. Peterson, Lief Kevin, The ICH Q8 definition of design space: a comparison of monitoring, Mech. Syst. Signal Process. 21 (2) (February 2007) 706–723.
the overlapping means and the bayesian predictive approaches, Stat. Biopharm. [66] Gregory W. Stockdale, Aili Cheng, Finding design space and a reliable operating
Res. 2 (2) (2010) 249–259. region using a multivariate bayesian approach with experimental design, Quality
[50] John J. Peterson, Guillermo Miro-Quesada, Enrique del Castillo, A bayesian Technology & Quantitative Management 6 (4) (2009) 391–408.
reliability approach to multiple response optimization with seemingly unrelated [67] J.E. Tabora, F. Lora Gonzalez, J.W. Tom, Bayesian probabilistic modeling in
regression models, Quality Technology & Quantitative Management 6 (4) (2009) pharmaceutical process development, AIChE J. 65 (11) (2019).
353–369. [68] G. Taguchi, System of Experimental Design, Unipub/Kraus International
[51] Martyn Plummer Jags, A program for analysis of bayesian graphical models using Publications., NY, 1987.
gibbs sampling, in: Proceedings of the 3rd International Workshop on Distributed [69] R Core Team, The R Project for Statistical Computing, r foundation for statistical
Statistical Computing (DSC 2003), March 20-22 2003. Vienna, Austria. computing, vienna, austria, 2020.
[52] L. Alexey, Pomerantsev. Successive bayesian estimation of reaction rate constants [70] Michael E. Tipping, Christopher M. Bishop, Probabilistic principal component
from spectral data, Chemometrics and Intelligent Laborarory Systems 66 (2) (2003) analysis, J. Roy. Stat. Soc. B 61 (3) (1999) 611–622.
127–139. [71] O Arda Vanli, Enrique Del Castillo, Bayesian approaches for on-line robust
[53] S.J. Press, Applied Multivariate Analysis: Using Bayesian and Frequentist Methods parameter design, IIE Trans. 41 (4) (2009) 359–371.
of Analysis, R. E. Krieger Pub. Co., Malabar, FL., 1982. [72] James H. Ware, Linear models for the analysis of longitudinal studies, Am.
[54] Ramkumar Rajagopal, Enrique Del Castillo, Model-robust process optimization Statistician 39 (2) (1985) 95–101.
using bayesian model averaging, Technometrics 47 (2) (2005) 152–163. [73] Peter D. Wentzell, Darren T. Andrews, David C. Hamilton, Klaas Faber, Bruce
[55] Ramkumar Rajagopal, Enrique Del Castillo, John J. Peterson, Model and R. Kowalski, Maximum likelihood principal component analysis, J. Chemometr.: A
distribution-robust process optimization with noise factors, J. Qual. Technol. 37 (3) Journal of the Chemometrics Society 11 (4) (1997) 339–366.
(2005) 210–222. [74] C.F. J Wu, M. Hamada, Experiments, Planning, Analysis and Optimization, second
[56] J.O. Ramsay, B.W. Silverman, Functional Data Analysis, Springer, NY, 2007. ed., Wiley, NY, 2011.
[57] Marco S. Reis, Ana C. Pereira, Jo~ao M. Leça, Pedro M. Rodrigues, Jose C. Marques, [75] Jie Yu, S. Joe Qin, Multimode process monitoring with bayesian inference-based
Multiresponse and multiobjective latent variable optimization of modern analytical finite Gaussian mixture models, AIChE J. 54 (7) (2008) 1811–1829.
instrumentation for the quantification of chemically related families of compounds:

18

You might also like