Chemometrics and Intelligent Laboratory Systems 217 (2021) 104385


Developing semi-supervised variational autoencoder-generative adversarial network models to enhance quality prediction performance
Sai Kit Ooi a, Dave Tanny a, Junghui Chen a,*, Kai Wang b

a Department of Chemical Engineering, Chung Yuan Christian University, Chungli, Taoyuan, 32023, Taiwan, R.O.C.
b School of Automation, Central South University, Changsha, 410083, China

* Corresponding author. E-mail address: jason@wavenet.cycu.edu.tw (J. Chen).

https://doi.org/10.1016/j.chemolab.2021.104385
Received 13 October 2020; Received in revised form 14 June 2021; Accepted 6 July 2021; Available online 13 July 2021
0169-7439/© 2021 Elsevier B.V. All rights reserved.

A R T I C L E  I N F O

Keywords:
Generative adversarial network
Latent variable model
Semi-supervised variational autoencoder
Soft sensors

A B S T R A C T

One common serious issue in training a prediction model is that the process data significantly outnumber the quality data. Such a discrepancy exists because of the time lag in obtaining quality data. This paper proposes a semi-supervised variational autoencoder-generative adversarial network (S2-VAE/GAN) that is able to make use of all the data even when some quality data are missing. The key idea of S2-VAE/GAN is its capability of enhancing the performance of the decoder/generator in learning the true distribution of both process and quality data through a competition between the decoder/generator and the discriminator driven toward a Nash equilibrium, allowing the model to improve the quality of both the reconstructed and the predicted data. The S2-VAE/GAN model is also flexible enough to adjust itself automatically according to the input data. If the quality data are missing, the model can fill in the data through the trained prediction model, and the same network structure defined in the supervised case can still be re-used. With the probabilistic distribution format, the proposed method is also capable of capturing the nonlinear features of the process and representing the stochastic nature of operating plants. The results of the numerical case and the industrial case in this paper show that S2-VAE/GAN outperforms conventional methods in predicting the missing quality data.

1. Introduction

In chemical processes, soft sensors have grown in popularity for estimating product quality via the utilization of process variables that are obtainable from on-line sensors in the process. There are mainly two methods of constructing soft-sensor models: the first principles model and the data-driven model. The first principles model is based on the physical principles of the process, such as mass balance, energy balance, etc. Nevertheless, the main limitation of the first principles model is the requirement of complete knowledge of the process, and the assumptions of the model restrain its practical application in large and complex industrial processes.

With the abundance of industrial data, process modeling through data-driven models has become easier and more important for the development of soft sensors because of the complexity of industrial processes, which limits the application of first principles models. The commonly used and popular models include the regression form of principal component analysis, namely principal component regression (PCR), partial least squares (PLS) [1], artificial neural networks [2,3], kernel partial least squares [4], and support vector machines [5]. These methods require labeled datasets (both process variables and quality variables) for modeling, but in most cases, the quality data are only obtainable through laboratory tests after the samples are received. The severe time lag of laboratory tests leads to inconsistent sample numbers of process variables and quality variables. In most cases, the simplest solution to the aforementioned requirement is to reduce the whole dataset to the subset consisting of both process and quality data only. The main problem is the number of labeled samples with both process and quality data; that is, the process and quality data may be insufficient to train the model and capture the correlation between process and quality variables. The next problem lies in the remaining process data that are not used at all to improve the model. As the remaining unused unlabeled data may still contain important information about the process condition, discarding the unlabeled dataset is simply a waste of potential information. From this point onward, the data consisting of both process and quality data are referred to as labeled data, and the ones consisting of process data only are referred to as unlabeled data.

In order to solve the aforementioned problems, the solution is to use the unlabeled data to increase the number of training samples while those data provide a pathway to allow the usage of the original model even with the absence of quality data.


Thus, in order to apply unlabeled data to the model, a prediction value is obtained by training a model to fit the quality data from the available process data. This generally involves the separation of the training data based on the number of available labeled data and the remaining unlabeled data. The training procedure involves obtaining predictions for the unlabeled data to fill in the missing quality data. The process data and the quality data (either the real quality data or the predictions) are then used to train the model. The accuracy of the predicted quality data can be tested by evaluating the mean squared error between the prediction and the real quality data. For example, [6] proposed the semi-supervised PCR method with Bayesian regularization. Their objective function is based on the maximization of the sum of the probability between the available labeled dataset and the remaining unlabeled dataset through the expectation-maximization (EM) algorithm. Another method is to ignore the unlabeled dataset and use only labeled data as the training dataset, after which the trained model is used to predict the unlabeled dataset through the latent space. [7] proposed the Bayesian sparse PLS method. With an additional sparsity prior as a regularization parameter for sparse data, the mean and the variance of the variables are approximated by the parameters of the probabilistic distribution to allow robust prediction. [8] proposed the semi-supervised probabilistic latent variable model. There are two key differences from the past work. The first one is that the projections of process data and quality data are assigned to different latent variables as opposed to a single latent variable accounting for both. The second one is that the latent variable of the quality data also possesses a relation to the latent variable of the process data, with each variable approximated under a probability distribution. Afterward, the prediction of the quality data can be obtained by sampling the latent variables of the quality data. To avoid misunderstanding, the modeling method (S2-PCR) in reference [6] is regarded here as the same as S2-PLVR; the only difference is that S2-PLVR is applied to monitoring while S2-PCR is used for prediction. The similarity of both aforementioned methods lies in the assumption that the parameters are represented by probability distributions and that both models are trained with a complete dataset so that they can be more robust, while all the predictions are generated by directly sampling the latent variables and their corresponding parameters. However, the main limitation of the above methods is that they are assumed to be linear. As most chemical processes are highly nonlinear in nature, using linear models can cause large model biases. Therefore, a nonlinear model is highly needed to represent the complex chemical operations.

In recent years, the variational autoencoder (VAE) [9] has been gaining popularity because of its capability of representing nonlinear systems through two neural networks: the encoder (projection to latent variables) and the decoder (reconstruction of the original data). The use of the probability model in its latent variables can improve the capture of the data distribution. It is easy to train and fit VAE onto any given probability density function to mimic the real distribution of the system. In the computer science field, [10] proposed a semi-supervised method using the VAE model through combining two different models: supervised learning (M1) and unsupervised learning (M2). In M1, both image and label data are used to train the supervised VAE model, while only image data are used to train the model in M2. Like Kingma's model, an extension study was performed to specifically improve the prediction of the label data by taking the prediction network term as the objective function in [11,12]. Albeit the addition of the prediction network under the original supervised VAE model, label data and latent variables were assumed to be independent in all the methods mentioned above. However, in typical industrial processes, there is an inherent complex correlation between process variables and quality variables. In addition, quality data are mainly characterized by a continuous distribution instead of the multinomial distribution for label data that was used in the past work. Although VAE looks promising, its major problem is poor reconstruction accuracy. The most recent innovation in the probabilistic neural network field extends to the generative adversarial network (GAN), which was proposed by [13]. GAN aims to train a good model by a two-player competition. The two neural networks, namely the generator (which generates data) and the discriminator (which distinguishes the real data from the generated fake ones), compete with each other to meet the goal that the generator's reconstructed data are almost identical to the real data. Though this seems wonderful theoretically, it is quite difficult to train the model in reality because of the convergence problem arising from the minuscule gradient problem and its tendency to produce exactly the same reconstruction (a.k.a. mode collapse). Past papers proposed a lot of changes to the originally proposed loss function, while other past research chose to change the model or even combined GAN with other existing methods. They claimed that the training convergence difficulty of the original GAN loss function is attributed to the strong regularization exerted by the Kullback-Leibler (KL) divergence. [14] used a deep Wasserstein GAN to preprocess data through implementation of the Earth-Mover distance; it converges faster than the vanilla VAE objective function. But all of them are unsupervised learning methods.

On the other hand, [15] proposed using VAE and GAN together to address the weaknesses of the individual GAN and VAE algorithms. The combinational method produced better image quality results than the original vanilla VAE because of the additional GAN network, which allows the model to generate a better reconstruction image and to be trained more easily than the original GAN algorithm. [16,17] implemented a similar combination of VAE with GAN through adversarial training in the latent space to mimic the real distribution of the labeled data and to capture the latent feature of the image. Both methods aim to use adversarial training to train the encoded quality data (labeled data) to be like the real quality data, and the prediction of quality data (labeled data) would be performed from the latent space of the VAE structure. [18] proposed an adversarial training loss function to train both VAE and GAN structures instead of using the original VAE loss function template. [19] used an autoencoder with two discriminators to discriminate the attributes and classes of the generated labeled data from the autoencoder output against its real labeled data. All the methods discussed above are unsupervised training. They can be expanded to supervised and semi-supervised learning, and modified to allow the model to comply with the stochastic nature of chemical plants.

In this paper, a combination of a GAN model and a semi-supervised VAE (S2-VAE/GAN) is proposed. This combinational method is able to solve the lackluster reconstruction of S2-VAE through the GAN model by training the generator (the VAE decoders) and the prediction model to learn the real data distribution regardless of the number of quality data available in the dataset. The prediction result of the proposed method is shown to be improved with the addition of GAN rather than using the original S2-VAE alone, resulting in a better representation of the correlation between the process variables and the quality variables. The whole design is detailed in the following sections. Section 2 introduces S2-VAE. The loss function of S2-VAE is derived and the structure of the network as well as its training sequence are shown and explained. Section 3 gives the detailed formulation of the proposed S2-VAE/GAN. In Section 4, a numerical example and an industrial case are used to show the performance of the proposed method in comparison with other conventional data-driven algorithms. Finally, conclusions are stated with possible improvements on the proposed method.

2. Semi-supervised VAE process modeling

Given N data points, consisting of N_l labeled data and N_u unlabeled data, the labeled data are represented by (X^l, Y) = [(x_n^l, y_n)]_{n=1}^{N_l}, where x ∈ R^M is the process observation and y ∈ R^P is the quality measurement. The unlabeled data are given by (X^u) = [(x_n^u)]_{n=1}^{N_u}, where the superscript u represents unlabeled data, l indicates labeled data, and the subscript n indicates the index among the N samples. These data are usually contaminated by noises/disturbances in the process.


Thus, it is necessary to identify the variables that significantly affect the system in a direct manner. Those variables are usually referred to as latent variables z ∈ R^N. Given that the latent variables affect both process and quality variables directly, a simple mathematical representation is shown as follows:

x_n = f(z_n) + w
y_n = g(z_n) + v     (1)

The mathematical representation mentioned above reflects the general characteristics of the process variables and the quality variables in accordance with the latent variable, ignoring the presence of labeled and unlabeled data. There is a difference between labeled and unlabeled data because of the sampling rate of the quality data, which are measured with a large time lag. However, regardless of the sampling rate, the latent variables affect the quality data. The functions of the latent variables f(·): R^N → R^M and g(·): R^N → R^P can be nonlinear or linear, depending on the process characteristics. It is also assumed that the process disturbance w follows the zero-mean Gaussian w ~ N(0, Σ_w); similarly, v ~ N(0, Σ_v). To accurately learn the functions f(·) and g(·) and account for the presence of labeled and unlabeled data, S2-VAE can be trained by maximizing the log probability of both labeled and unlabeled data:

max J(X, Y) = max ln[ Π_{n=1}^{N_l} p(x_n^l, y_n) Π_{n=1}^{N_u} p(x_n^u) ]
            = max[ Σ_{n=1}^{N_l} ln p(x_n^l, y_n) + Σ_{n=1}^{N_u} ln p(x_n^u) ]     (2)

where the first sum is J(X^l, Y), the objective function of the available labeled data, and the second sum is J(X^u), the objective function of the unlabeled data. In most cases, p(x_n^l, y_n) and p(x_n^u) are too complex, and it is impossible to calculate the true posterior distribution as it would be intractable. Using the KL divergence between the true process posterior (p_θ) and the approximate posterior (q_φ), J(X^u) and J(X^l, Y) can be expressed by

J(X^u) = E_{q_φ(z,y|x)}[log p_θ(x^u|z)] − KL(q_φ(z,y|x^u) ‖ p_θ(y,z)) + KL(q_φ(z,y|x^u) ‖ p_θ(z,y|x^u))     (3)

J(X^l, Y) = E_{q_φ(z|x,y)}[log p_θ(x|z)] + E_{q_φ(z|x,y)}[log p_θ(y|z)] − KL(q_φ(z|x,y) ‖ p_θ(z)) + KL(q_φ(z|x,y) ‖ p_θ(z|x,y))     (4)

where the first two terms of Eq. (3) form the variational lower bound L(X^u) and the first three terms of Eq. (4) form L(X^l, Y). The complete derivation of S2-VAE is shown in Appendix A and B. In Eqs. (3) and (4), the KL divergence is always positive, and the variational lower bounds L(X^l, Y) and L(X^u) are always smaller than J(X^l, Y) and J(X^u). Maximizing the variational lower bounds L(X^l, Y) and L(X^u) would make the KL divergence become zero; therefore, the approximate posterior functions of the labeled data q_φ(z|x,y) and the unlabeled data q_φ(z,y|x^u) eventually become closer to the true posterior distribution functions p_θ(z|x,y) and p_θ(z,y|x^u) of the labeled and the unlabeled data respectively, simultaneously maximizing the true log probability J(X, Y). Therefore, the overall variational lower bound L(X, Y) can be used as an alternative to the original objective functions J(X^l, Y) and J(X^u) as follows:

max L(X, Y) = max[ L(X^l, Y) + L(X^u) ]     (5)

S2-VAE in Eq. (5) is composed of two parts: a supervised VAE model (L(X^l, Y)) and an unsupervised VAE model (L(X^u)). The loss function of the supervised VAE model L(X^l, Y) can be easily derived:

L(X^l, Y) = E_{q_φ(z|x,y)}[log p_θ(x|z)] + E_{q_φ(z|x,y)}[log p_θ(y|z)] − KL(q_φ(z|x,y) ‖ p_θ(z))     (6)

The whole structure of S2-VAE and its detailed training scheme are shown in Fig. 1. Based on the network structure in Fig. 1, the supervised VAE model takes both the process data (x) and the quality data (y) as inputs to extract the features, denotes them as the latent variables (z), and reconstructs them back into the process data and the quality data, respectively. Regarding L(X^u), the label Y is missing; y can be estimated by q_φ(y|x^u), which introduces the differential entropy H(q_φ(y_i|x_i^u)) into the bound, so L(X^u) can be reformulated as

L(X^u) = E_{q_φ(y|x^u)}[L(X^u, Y)] + H(q_φ(y|x^u))     (7)

where H(q_φ(y|x)) = −∫ q_φ(y|x) log q_φ(y|x) dy. The prediction network generates the corresponding quality data for the unlabeled dataset to allow full utilization of all the available data while the supervised VAE model is retrained. The prediction model is not used to generate quality data when the real quality data are available. All the networks (including the prediction model p_θ(y|x), the encoder q_φ(z|x,y), the x-decoder p_θ(x|z), and the y-decoder p_θ(y|z)) are modeled by the deep neural network (DNN) shown in Fig. 2 and output Gaussian distributions. Each individual network output consists of the mean and covariance of a Gaussian distribution calculated through the neuron layers in the DNN, which is presented as

p_θ(y|x) ~ N(h(x), V(x)),   q_φ(z|x,y) ~ N(μ(x,y), Λ(x,y)),
p_θ(x|z) ~ N(f(z), Σ_x),    p_θ(y|z) ~ N(g(z), Σ_y)     (8)

where the means of all the networks lie in {μ | μ ∈ R}. The covariance is a positive value in (0, ∞), which is attainable through the softplus activation function ζ(a) = ln(1 + e^a).

Fig. 2. Deep neural network structure.
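As an illustration of the Gaussian network output in Eq. (8), the following sketch shows one way such a block can be realized. It is not the authors' implementation; the use of PyTorch, the class name, and the layer sizes are assumptions (the sizes mirror the 3-hidden-layer, 40-node, tanh configuration reported later in the numerical example).

```python
import torch
import torch.nn as nn

class GaussianNet(nn.Module):
    """A DNN block returning a diagonal Gaussian N(mean, diag(var)).

    The variance is kept positive with the softplus function
    softplus(a) = ln(1 + exp(a)), as described around Eq. (8).
    Hidden sizes are illustrative only.
    """
    def __init__(self, in_dim, out_dim, hidden=40, n_hidden=3):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(n_hidden):
            layers += [nn.Linear(d, hidden), nn.Tanh()]
            d = hidden
        self.body = nn.Sequential(*layers)
        self.mean_head = nn.Linear(d, out_dim)   # mean in (-inf, inf)
        self.var_head = nn.Linear(d, out_dim)    # pre-softplus variance

    def forward(self, x):
        h = self.body(x)
        mean = self.mean_head(h)
        var = nn.functional.softplus(self.var_head(h)) + 1e-6  # (0, inf)
        return mean, var

# e.g. an encoder q_phi(z|x,y) for 2 process and 2 quality variables
# with 2 latent dimensions (the sizes used in the numerical example)
encoder = GaussianNet(in_dim=4, out_dim=2)
mean, var = encoder(torch.randn(8, 4))
print(mean.shape, var.shape)   # torch.Size([8, 2]) torch.Size([8, 2])
```

The same block can serve as the prediction model, the encoder, or either decoder by changing the input and output dimensions.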
 
As shown in Eq. (6), the variational lower bound of the labeled data is only used to train the supervised-VAE model without considering the training of the prediction network parameters. To train all the networks in the model, the overall variational lower bound L(X, Y) can be modified in such a way that the process data of all the labeled datasets are regarded as unlabeled datasets. All the unlabeled datasets are then used to generate quality data through the prediction model; that is,

max[ Σ_{i=1}^{N_l} L(X^l, Y) + Σ_{i=1}^{N_u} L(X^u) ] = max[ Σ_{i=1}^{N_l} ( L(X^l) + ln q_φ(y_i|x_i^l) ) + Σ_{i=1}^{N_u} L(X^u) ]
                                                      = max[ Σ_{i=1}^{N} L(X) + Σ_{i=1}^{N_l} ln q_φ(y_i|x_i^l) ]     (9)

where N = N_l + N_u. In this way, L(X) of the S2-VAE model simply trains all of the networks simultaneously no matter whether the data are labeled or unlabeled.


The lower bound L(X, Y) in Eq. (9) can be maximized by the EM algorithm. This algorithm is divided into two steps conducted in an iterative manner, namely the E-step and the M-step. In the E-step, the model parameters of each network calculated from the previous iteration are used to calculate the new expected output value of each network. The new values are then used to calculate the new variational lower bound in Eq. (9). In the M-step, the parameters of each network are updated with the new expected value of each individual network. These two steps are repeated iteratively until the parameters converge and the variational lower bound in Eq. (9) is maximized. The prediction of the quality data is made by the prediction network when the training is over.

Fig. 1. The whole structure of S2-VAE.
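The sketch below (an assumed PyTorch realization, not the original implementation) shows how the labeled bound of Eq. (6) and the unlabeled bound of Eq. (7) can be collapsed into the single objective of Eq. (9): when the quality data are missing, the prediction network fills them in through a reparameterized sample, and the extra ln q_φ(y|x) term is only added for labeled samples. The entropy term of Eq. (7) is omitted here for brevity.

```python
import torch
from torch.distributions import Normal, kl_divergence

def s2vae_elbo(x, y, pred_net, enc_net, dec_x, dec_y):
    """Variational lower bound of Eq. (9) for one mini-batch (a sketch).

    x: (B, M) process data; y: (B, P) quality data or None (unlabeled).
    Each *_net returns (mean, var) in the form of Eq. (8).
    """
    y_mean, y_var = pred_net(x)                       # q_phi(y|x)
    if y is None:                                     # unlabeled: fill y in
        y_used = Normal(y_mean, y_var.sqrt()).rsample()
    else:
        y_used = y

    z_mean, z_var = enc_net(torch.cat([x, y_used], dim=1))   # q_phi(z|x,y)
    qz = Normal(z_mean, z_var.sqrt())
    z = qz.rsample()                                  # reparameterized sample

    x_mean, x_var = dec_x(z)                          # p_theta(x|z)
    yz_mean, yz_var = dec_y(z)                        # p_theta(y|z)

    rec_x = Normal(x_mean, x_var.sqrt()).log_prob(x).sum(1)
    rec_y = Normal(yz_mean, yz_var.sqrt()).log_prob(y_used).sum(1)
    kl = kl_divergence(qz, Normal(torch.zeros_like(z),
                                  torch.ones_like(z))).sum(1)

    elbo = rec_x + rec_y - kl                         # Eq. (6)
    if y is not None:                                 # labeled extra term, Eq. (9)
        elbo = elbo + Normal(y_mean, y_var.sqrt()).log_prob(y).sum(1)
    return elbo.mean()
```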
3. Semi-supervised VAE/GAN process modeling

To check the eligibility of the generated data with respect to the real process data and the quality data, a new method is proposed in this section to enhance the performance of the overall combined model. Because the trained S2-VAE can make the approximate posterior distribution q_φ(z|x,y) close to the true posterior distribution p_θ(z|x,y) through the variational lower bound function, the important features can be extracted. A good approximation of the marginal distribution p_θ(x,y) can be obtained to reconstruct each measurement from the latent space. The main problem behind this loss function is that it does not effectively train the prediction model, which is not connected to the original objective of minimizing the divergence of distributions. Thus, the loss function is not suitable for prediction. Previous research on the VAE model also showed that the VAE reconstruction is quite blurry and inaccurate/unrealistic compared with the real data. With the inputs of the real data, a new solution is proposed by introducing competition on the reconstruction and prediction outputs to improve the original model performance and the prediction of the S2-VAE model.

GAN is a concept of competing on the performance of a model in representing the real data distribution. It is done through the discriminator, an additional network, so that the closeness of the generated data to its real counterpart can be judged. GAN trains the generator (or the decoder) and the discriminator successively to achieve a Nash equilibrium. The equilibrium state is usually obtained by a minmax training function. The purpose of using GAN is to fool a discriminator and make it unable to differentiate the real data from the fake/generated ones. In this way, the loss function can be expressed as the following relation:

J = E_{a∼p(a)}[Disc(a)] + E_{z∼p(z)}[1 − Disc(G(z))]     (10)

where p(a) is the real data distribution, Disc(·) represents the discriminator, and G(·) represents the generator/decoder. The derivation of Eq. (10) is detailed in Appendix C.
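As an aside, the adversarial game that Eq. (10) summarizes is often implemented with the logarithmic terms that also appear in Eqs. (11)-(13) below. The following sketch (an assumption for illustration, not the authors' code) shows that common form with a sigmoid-output discriminator.

```python
import torch

def discriminator_loss(disc, real_a, fake_a):
    """Push Disc(real) toward 1 and Disc(fake) toward 0.
    disc is assumed to output values in (0, 1)."""
    eps = 1e-8
    return -(torch.log(disc(real_a) + eps).mean()
             + torch.log(1.0 - disc(fake_a) + eps).mean())

def generator_loss(disc, fake_a):
    """The generator/decoder tries to fool the discriminator
    (non-saturating form)."""
    eps = 1e-8
    return -torch.log(disc(fake_a) + eps).mean()
```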
The proposed S2-VAE/GAN network structure is shown in Fig. 3. In the S2-VAE model, two discriminators are attached to the process data and the quality data. The discriminators are trained to distinguish the real data from their generated counterparts obtained from the z distribution through the whole S2-VAE model. Like S2-VAE, the S2-VAE/GAN model trains all the networks no matter whether the selected data are labeled or unlabeled, instead of changing the loss function of the networks on the basis of the input data. This allows the training of S2-VAE/GAN to be more flexible and specific to each network in the model. To train S2-VAE/GAN, the discriminators are trained first for a few iterations to learn the real data distribution before the rest of the models are trained. The process data generated from the prior distribution z in the original GAN model as well as the whole-model reconstruction of S2-VAE are considered separately in the loss function of the process data (x) discriminator. The objective function of the discriminator for the process data (x) is formulated as follows:

3.1. Process data (x)

L_GANx = E_{x∼p_θ(x)}[log(Disc_x(x))] + E_{x∼p_θ(x|z)} E_{z∼p_θ(z)}[log(1 − Disc_x(Dec_x(z)))]
       + E_{x∼p_θ(x|z)} E_{z∼q_φ(z|x,y)} E_{y∼p_θ(y|x)} E_{x∼p_θ(x)}[log(1 − Disc_x(Dec_x(Enc(x, Pred(x)))))]     (11)

where Disc_x(x) is the discriminator network for the process data, Dec_x is the decoder network for the process data, Enc is the encoder network, and Pred is the prediction network.

In the discriminator for the quality data (y), y from GAN is compared with the quality data generated by the decoder of S2-VAE. To handle available and unavailable quality data, the objective function related to the data input of the y discriminator is considered. The measured y is used directly; when the real quality data are missing, the prediction network result is used as a reference for the real quality data. The formulation of the objective function for each scenario is shown as follows.

3.2. Available quality data (y)

L_GANy^av = E_{y∼p(y)}[log(Disc_y(y))] + E_{y∼p_θ(y|x)} E_{x∼p_θ(x)}[log(1 − Disc_y(Pred(x)))]
          + E_{y∼p_θ(y|z)} E_{z∼p_θ(z)}[log(1 − Disc_y(Dec_y(z)))]
          + E_{y∼p_θ(y|z)} E_{z∼q_φ(z|x,y)} E_{y∼p_θ(y|x)} E_{x∼p_θ(x)}[log(1 − Disc_y(Dec_y(Enc(x, Pred(x)))))]     (12)


Fig. 3. The whole structure of S2-VAE/GAN.

3.3. Unavailable quality data (y)

L_GANy^un = E_{y∼p_θ(y|x)}[log(Disc_y(Pred(x)))] + E_{y∼p_θ(y|z)} E_{z∼p_θ(z)}[log(1 − Disc_y(Dec_y(z)))]
          + E_{y∼p_θ(y|z)} E_{z∼q_φ(z|x,y)} E_{y∼p_θ(y|x)} E_{x∼p_θ(x)}[log(1 − Disc_y(Dec_y(Enc(x, Pred(x)))))]     (13)

where Disc_y is the discriminator network for the quality data, Pred is the prediction network, Dec_y is the decoder network for the quality data, and Enc is the encoder network. (Detailed derivations of the above objective functions are shown in Appendix C.)
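To make the structure of Eqs. (11)-(13) concrete, the sketch below shows one possible Monte Carlo estimate of the discriminator objectives. It is illustrative only (PyTorch is assumed, the networks are treated as simple callables returning tensors, and the Gaussian heads of Eq. (8) are reduced to their means).

```python
import torch

def disc_x_objective(disc_x, enc, dec_x, pred, x, z_prior):
    """Eq. (11): real x vs. x reconstructed from the prior z and from the
    encoded pair (x, Pred(x)).  Maximized with respect to disc_x."""
    eps = 1e-8
    real_term = torch.log(disc_x(x) + eps).mean()
    fake_prior = torch.log(1 - disc_x(dec_x(z_prior)) + eps).mean()
    z_enc = enc(torch.cat([x, pred(x)], dim=1))
    fake_recon = torch.log(1 - disc_x(dec_x(z_enc)) + eps).mean()
    return real_term + fake_prior + fake_recon

def disc_y_objective(disc_y, enc, dec_y, pred, x, z_prior, y=None):
    """Eqs. (12)-(13): when y is available it is the real reference;
    otherwise the prediction network output Pred(x) takes its place."""
    eps = 1e-8
    y_ref = y if y is not None else pred(x)
    terms = [torch.log(disc_y(y_ref) + eps).mean(),
             torch.log(1 - disc_y(dec_y(z_prior)) + eps).mean(),
             torch.log(1 - disc_y(dec_y(enc(torch.cat([x, pred(x)], dim=1))))
                       + eps).mean()]
    if y is not None:   # Eq. (12) additionally scores Pred(x) as fake
        terms.append(torch.log(1 - disc_y(pred(x)) + eps).mean())
    return sum(terms)
```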
Because the proposed S2-VAE and GAN are integrated (especially in the case of the decoder/generator of both the x process data and the y quality data), the training of the proposed method has to be differentiated for each of the networks in the model. The S2-VAE model networks are all connected together, so a single loss function can be used to backpropagate the gradient to all of the networks. However, in GAN, the networks are distinctly separated by the discriminator network, so the backpropagation of the gradients needs to be specified according to how they are connected. The training of the discriminators for the process data (x) and the quality data (y) maximizes the derived GAN loss functions given in Eqs. (11)-(13) respectively, all of which are given by:

3.4. x-discriminator

θ_Disc_x = max L_GANx     (14)

3.5. y-discriminator

θ_Disc_y^av = max L_GANy^av,   θ_Disc_y^un = max L_GANy^un     (15)

In Fig. 3, the network connected to the discriminator is the decoder, i.e., the generator of the x and y data from the latent variable z. As it is connected to the discriminator, the sigmoidal activation function is used to represent the likelihood of reconstruction of the x and y data, given by

L_y,like^Disc_y = E_{q_φ(z|x_i,y_i)}[log p_θ(Disc_y(y_i)|z)]     (16)

L_x,like^Disc_x = E_{q_φ(z|x_i,y_i)}[log p_θ(Disc_x(x_i)|z)]     (17)

As the decoder in the S2-VAE structure is taken as the generator of GAN, the training of the decoder needs to be balanced between the reconstruction likelihood loss and the discriminator loss. This can be achieved by applying a weighting/penalizing factor through the parameter γ to control the ability of the decoder/generator to reconstruct and to fool the discriminator. The parameter γ indicates the trade-off between the accuracy reconstructed from the latent variable z and the representation of the distribution of the real data from the discriminator. The parameter is only used to update the parameters of each decoder.

θ_Dec_x ← −∇_{θ_Dec_x}( γ L_x,like^Disc_x − L_GANx )     (18)

θ_Dec_y ← −∇_{θ_Dec_y}( γ L_y,like^Disc_y − L_GANy )     (19)

The objective function of the decoders is based on the GAN loss functions for the process data L_GANx and the quality data L_GANy, both of which are stated in Eqs. (11)-(13), and the reconstruction losses L_x,like^Disc_x and L_y,like^Disc_y (Eqs. (16) and (17)). The decoder training differs depending on whether the discriminator is currently trained with unlabeled or labeled data. The gradient method is employed to train the decoders for the process and quality data to ensure that the trained decoder can reconstruct data similar to the real counterparts while making sure that the whole latent distribution is taken into account in the reconstruction process, just like the originally derived S2-VAE reconstruction loss.
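One possible realization of the weighted decoder update of Eqs. (18) and (19) is sketched below. It assumes, for illustration, that the decoder is trained to increase the reconstruction likelihood while decreasing the discriminator objective (i.e., fooling the discriminator); the function and variable names are not from the original implementation.

```python
import torch

def decoder_update(optimizer, rec_like, gan_obj, gamma=0.9):
    """One gradient step for a decoder/generator in the spirit of
    Eqs. (18)-(19): gamma trades the reconstruction likelihood
    (to be increased) against the discriminator objective (to be
    decreased).  gamma = 0.9 is the value used later in the
    industrial case; here it is illustrative."""
    loss = -(gamma * rec_like - gan_obj)   # minimize the negative weighted objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```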
In each encoder, the loss function of S2-VAE/GAN is similar to that of the original S2-VAE, as both share a similar structure and both trainings also depend on whether y is generated by the prediction network or comes from the real data. Like Eq. (9) in S2-VAE, whose objective is to allow the use of any form of data as an input into the model, the objective function of the encoder in S2-VAE/GAN is thus represented by

θ_enc^{y_real} = argmax[ L_x,like^Disc_x + L_y,like^Disc_y − KL(q_φ(z|x_i^l, y_i) ‖ p_θ(z)) + Σ_{i=1}^{N_l} ln q_φ(y_i|x_i^l) ]     (20)


θ_enc^{y_pred} = argmax[ E_{q_φ(y|x)}[ L_x,like^Disc_x + L_y,like^Disc_y − KL(q_φ(z|x_i^l, y_i) ‖ p_θ(z)) ] + H(p(y_i|x_i^u)) ]     (21)

The encoder is not part of the GAN structure. Instead, it is only part of the original S2-VAE structure. Training the encoder is therefore similar to the training specified in the previous section. It is based on the original objective function of the S2-VAE structure, which is to minimize the divergence between the approximate posterior distribution and the true posterior distribution function. The encoder can extract important information or features of the data in the latent space.

In S2-VAE, the poor prediction quality is attributed to the training of the prediction network, which relies heavily on the loss function of VAE. That loss function undermines the desired correlation between process data and quality data, which the prediction network is supposed to learn. Using the latent variables, the S2-VAE encoder captures the correlations among process variables, the correlations among quality variables, and the correlations between process and quality variables, but only the correlations between the process variables and the quality variables are desired. This can have a negative effect on the prediction model, as the important inter-correlation between process variables and quality variables is diminished. In addition, S2-VAE minimizes the KL divergence between the true posterior distribution and the approximate posterior distribution. The goal of the model is to mimic the true posterior distribution in a lower dimension through the dimensionality reduction method. The entire model focuses on the mapping of the input data onto the latent space without any consideration of the prediction quality. In order for the prediction model to capture the relation between the process data and the quality data, the loss function of the prediction network is modified as

θ_pred^{y_prediction} = argmax[ L_y,like^Disc_y − KL(q_φ(z|x,y) ‖ p(z)) + H(q_φ(y|x)) − L_GANy^un ]
θ_pred^{y_real}       = argmax[ L_y,like^Disc_y − KL(q_φ(z|x,y) ‖ p(z)) + ln(q_φ(y|x)) − L_GANy^av ]     (22)

As the prediction network is used in both the S2-VAE and GAN structures, the GAN loss function for the quality data is included in the training of the prediction network. However, in S2-VAE/GAN, the process data are not reconstructed in this objective, to prevent the prediction network from learning unnecessary relations; S2-VAE/GAN focuses on the correlations between process and quality data. The prediction networks of S2-VAE and S2-VAE/GAN are different no matter whether quality data are available or not. If predictions are estimated using the GAN loss function L_GANy^un and the differential entropy, they substitute for the unavailable quality data.
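The switch between the two branches of Eq. (22) can be written compactly as below. This is a hedged sketch (PyTorch assumed, names invented for illustration): the caller supplies the already computed likelihood, KL, and GAN terms, and the prediction network output as a Normal distribution.

```python
import torch
from torch.distributions import Normal

def prediction_net_objective(y_like, kl_z, q_y_given_x, y, gan_loss_y):
    """Sketch of Eq. (22).  y_like: reconstruction likelihood term,
    kl_z: KL(q(z|x,y) || p(z)), q_y_given_x: Normal for the prediction
    network output, gan_loss_y: the relevant GAN loss (un/av case).
    Labeled data use ln q_phi(y|x); unlabeled data use its entropy."""
    if y is not None:                                     # real quality data
        extra = q_y_given_x.log_prob(y).sum(-1).mean()    # ln q_phi(y|x)
    else:                                                 # missing quality data
        extra = q_y_given_x.entropy().sum(-1).mean()      # H(q_phi(y|x))
    return y_like - kl_z + extra - gan_loss_y             # to be maximized
```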
Like the S2-VAE model, all the data in the S2-VAE/GAN model are regarded as unlabeled data, and S2-VAE/GAN follows the same network pathway as the S2-VAE model up to the process variable decoders and the quality variable decoders. The main difference between the S2-VAE and S2-VAE/GAN models is that S2-VAE/GAN has a discriminator connected at the end of the decoders for the process and quality data respectively. The discriminators for the process and quality data are trained initially with the real data for a given number of iterations before the remaining networks are trained. The training of the quality variable discriminator depends on whether quality data (labeled data) are available. In case there are available quality data, they are used as the main point of reference for the loss function, which is represented in Eq. (12); if there are none, the generated quality data from the prediction network are used as the main point of reference with the loss function given in Eq. (13).

Fig. 4. S2-VAE/GAN flowchart diagram.


In Fig. 3, the sampling of the z prior distribution from N(0, I) is based on the original GAN formulation. It is also assumed that the posterior distribution from the encoder will eventually become closer to the prior distribution after many training iterations. As the posterior distribution becomes closer to the prior distribution, the reconstructed data resemble the real data very closely. Therefore, the discriminator can differentiate real data from the fake ones at each training iteration; then all the networks (from the decoder back to the prediction network) are trained in the S2-VAE model so that the reconstructed data can resemble the real data closely and the prediction network can learn the quality data distribution very closely through the backpropagation process.

The whole S2-VAE/GAN network structure is given in Fig. 4. All the networks, including the prediction model q_φ(y|x), the encoder q_φ(z|x,y), the x-decoder p_θ(x|z), and the y-decoder p_θ(y|z), are highlighted in light cyan in Fig. 4, and the discriminators for the process data (x-discriminator) and the quality data (y-discriminator) are highlighted in orange in Fig. 4. Each of these networks is modeled by the deep neural network (DNN). The light cyan neural networks output a Gaussian distribution while the orange-colored neural networks output through the sigmoidal activation function, whose value lies between 0 and 1. The S2-VAE/GAN training procedure starts by initially training both discriminators with the respective real data for a given number of iterations before the remaining networks are trained. The training of the y-discriminator network also depends on whether quality data (labeled data) are available. If the quality data are available, the real data are used to train the discriminator network; otherwise, the generated quality data from the prediction network are used as the main point of reference. The prior z distribution sampling is also used, as it is assumed that the posterior distribution from the encoder will eventually become closer to the prior distribution after many training iterations. This allows the discriminator to fully discern the features of the real data; that is, the discriminator is used to differentiate the real data from the generated data of either the whole model or the prior z distribution.

After the training of the discriminators, which are denoted in orange, the rest of the neural networks (in the light cyan boxes) are trained for a given number of iterations. This allows the discerning features between real and fake data to be learned and to serve as a benchmark for the fake data when the remaining neural networks are trained. The output of each light cyan neural network consists of mean and covariance values modeled with the Gaussian distribution calculated through the neuron layers in the DNN (Eq. (8)). To train the model, the gradient must be allowed to backpropagate through the networks, so it is required to approximate the distribution in the network, especially for the latent variable (z) in the supervised-VAE and the prediction model for the quality data (y). The backpropagation can be achieved by applying the reparameterization trick (highlighted in the green box in Fig. 4). The covariance is assumed to be a constant scaled by the unit Gaussian variable ε (highlighted in orange in Fig. 4).

y = V^{1/2}(x) · ε + h(x)
z = Λ^{1/2}(x, y) · ε + μ(x, y)     (23)

where ε ~ N(0, I). During training of the S2-VAE/GAN model, the best interest is to train all of the networks simultaneously no matter whether the data are labeled or unlabeled. The training function for each neural network is fully derived and stated in the previous section. Afterward, the training is repeated by training both discriminators, and then the training of the remaining networks follows. This training iteration is carried out until the parameters of all the networks converge.
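The reparameterization in Eq. (23) amounts to the following few lines (a generic sketch, not the authors' code), which keep the sampling step differentiable so that the gradient can flow back through the latent variable and the predicted quality data.

```python
import torch

def reparameterize(mean, var):
    """Eq. (23): draw y or z as var**0.5 * eps + mean with eps ~ N(0, I),
    so gradients can propagate through the sampling step."""
    eps = torch.randn_like(mean)
    return var.sqrt() * eps + mean

# usage with the Gaussian heads of Eq. (8):
# z = reparameterize(z_mean, z_var)    # latent variable
# y = reparameterize(y_mean, y_var)    # predicted quality data
```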
4. Case study

4.1. Numerical example

In this numerical example, two input variables (x1, x2) and two quality variables (y1, y2) are constructed from two latent variables (z1, z2) and noises (w1, w2, w3, w4) for each variable with the equations shown as

x1 = z2 sin(z1) + w1
x2 = z1 sin(z2) + w2
y1 = z2 cos(z1) − sin²(z2) + z1/(z1 + z2) + w3
y2 = z1 sin(z2) − cos²(z1) + z2/(z1 + z2) + w4     (24)

where each latent variable (z1, z2) and noise (w1, w2, w3, w4) are distributed under the Gaussian distribution with the means and variances listed as follows:

z1 ~ N(3, 0.5),  z2 ~ N(2, 0.3)
w1 ~ N(0, 0.02), w2 ~ N(0, 0.03), w3 ~ N(0, 0.05), w4 ~ N(0, 0.04)     (25)

Using Eqs. (24) and (25), 1200 labeled and 600 unlabeled samples are generated with noise contamination. The same numbers of labeled and unlabeled samples are also generated without noise. The numerical data for the noise-contaminated and noise-free datasets are shown in Fig. 5. Half of the labeled samples and the 600 unlabeled samples are used to train each model, and the remaining half of the labeled samples are used to test the trained models. Several traditional models, including PPCR, PPLS, KPCR, KPLS, BPLS, SVR and MLR, are used for comparison, but they can handle labeled data only. The kernel function of the conventional kernel methods is the radial basis function. Semi-supervised learning methods trained by the noise-contaminated labeled and unlabeled datasets are also applied, including semi-supervised probabilistic latent variable regression (S2-PLVR) and weighted semi-supervised orthogonal factor analysis (WS2-OFA) [20]. Both S2-PLVR and WS2-OFA are trained with the same numbers of labeled and unlabeled training samples as S2-VAE and S2-VAE/GAN.
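For reference, a minimal numpy sketch of the data-generating process of Eqs. (24)-(25) is given below. The seed, function name, and the split into labeled/unlabeled pools are assumptions; the second parameter of each Gaussian in Eq. (25) is interpreted as a variance, as stated in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_samples(n, with_noise=True):
    """Generate process (x1, x2) and quality (y1, y2) data per Eqs. (24)-(25)."""
    z1 = rng.normal(3.0, np.sqrt(0.5), n)
    z2 = rng.normal(2.0, np.sqrt(0.3), n)
    w = [rng.normal(0.0, np.sqrt(s), n) for s in (0.02, 0.03, 0.05, 0.04)]
    if not with_noise:
        w = [np.zeros(n)] * 4
    x1 = z2 * np.sin(z1) + w[0]
    x2 = z1 * np.sin(z2) + w[1]
    y1 = z2 * np.cos(z1) - np.sin(z2) ** 2 + z1 / (z1 + z2) + w[2]
    y2 = z1 * np.sin(z2) - np.cos(z1) ** 2 + z2 / (z1 + z2) + w[3]
    return np.column_stack([x1, x2]), np.column_stack([y1, y2])

X_lab, Y_lab = generate_samples(1200)    # labeled pool (half train, half test)
X_unlab, _ = generate_samples(600)       # unlabeled pool (quality data discarded)
```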
In S2-VAE, all of the networks consist of 3 hidden layers with 40 neural nodes in each layer with the tanh activation function, and the latent variable is designed to have 2 dimensions. Cross-validation is used to stop the training before convergence to avoid overfitting. In S2-VAE/GAN, the discriminator networks for the process variables and the quality variables follow the same structure as the networks in S2-VAE, but the difference lies in the sigmoid function used as the activation function of the output layer. S2-VAE/GAN has the same type of optimizer and the same learning rate as S2-VAE. The models are trained for 1000 iterations, with the discriminators trained 3 times and the rest of the networks trained 3 times on mini-batch training data. The training of each network is switched according to whether labeled or unlabeled data enter the model.

To verify the prediction performance, the result of the prediction network is compared with the noise-contaminated and noise-free labeled quality test data. The root-mean-square error (RMSE) is used as a metric to assess the accuracy of the model:

RMSE = sqrt( (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)² )     (26)

where N is the number of data, y_i is the noise-contaminated or noise-free quality data, and ŷ_i is the predicted variable of the model.
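Eq. (26) corresponds to the short helper below (illustrative only; the example numbers are arbitrary, not results from this study).

```python
import numpy as np

def rmse(y_true, y_pred):
    """Eq. (26): root-mean-square error between measured (or noise-free)
    quality data and model predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# e.g. rmse([0.40, 0.55], [0.38, 0.60]) -> about 0.038
```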
The RMSEs of the prediction results of each method are listed in Table 1, and graphical representations of S2-VAE/GAN and S2-VAE are shown in Fig. 6. A lower RMSE value between the prediction and the noise-free quality data means that the model can follow the quality data distribution very closely; it is more meaningful than the RMSE value between the prediction and the noise-contaminated quality data.

For the prediction of future quality data, the prediction result is obtained directly from the prediction network q_φ(y|x). With the deep nonlinear characteristic of the neural layers in each network, S2-VAE/GAN and S2-VAE show better prediction results than the other conventional latent variable models. They also have a better nonlinear representation than the kernel trick used in KPCR and KPLS.

Fig. 5. Process and quality data representation with and without noise in the numerical case study.

Table 1
The RMSE values of the prediction results obtained from various methods in the numerical case study.

Method     RMSE (data infested with noise)   RMSE (data without noise)
PPCR       0.9404                            0.9337
PPLS       0.7814                            0.7746
KPCR       0.571                             0.56
KPLS       0.7436                            0.7366
SVR        0.465                             0.453
MLR        0.496                             0.483
S2-PLVR    0.7901                            0.7836
WS2-OFA    0.7446                            0.7376
BPLS       0.8047                            0.7975
S2-VAE     0.411                             0.401

S2-VAE/GAN is also better than S2-VAE, as the prediction of the former is almost identical to the real data with less training. On the other hand, the latter concentrates on the most clustered part of the data and requires more training to capture the features of the data. The reason behind the better prediction result of S2-VAE/GAN can be attributed to the use of the GAN model, which allows the prediction network in S2-VAE/GAN to focus on the inter-correlation between process data and quality data without the need to consider the intra-correlations of the process data and the quality data. In the loss function of S2-VAE, the correlations of the process and quality data are learned through forward and backward propagation. That is, the prediction network is forced to learn the intra-correlation of the process data as well; but with the loss function of S2-VAE/GAN focusing on each individual network, the prediction network can be guided to focus on the inter-correlation between the process variables and the quality variables.

To show the robustness of the S2-VAE and S2-VAE-GAN models, various levels of noise and different ratios of labeled/unlabeled data points have been included in the numerical case. For the noise levels, the level of measurement noise is gradually magnified by 2.5 times, 5 times and 10 times. The results of S2-VAE with different noise levels are shown in Table 2 while the results of S2-VAE-GAN are presented in Table 3. Noise pollution is often overlooked in data collected for modeling. Although the model performances degrade as the noise level increases, the proposed S2-VAE and VAE-GAN are still acceptable when the noise level increases to 5 times the original one, because the RMSEs of the models with respect to the noise-free data are still small.

For the ratios of labeled/unlabeled data, the results of S2-VAE are shown in Table 4 while those of S2-VAE-GAN are presented in Table 5. Tables 4 and 5 show that increasing the amount of unlabeled data does indeed improve the prediction accuracy of the models. The RMSE decreases with the increased number of unlabeled data because the information about the process is enlarged by the useful information contained in the unlabeled data. However, when the number of unlabeled data increases beyond a critical quantity, negative transfer may occur.

Other than the soft sensor purpose, the proposed method can also be regarded as a special case of data imputation. The proposed method has been compared with multiple imputation by chained equations (MICE) with random forest regression in the numerical case. 1200 labeled and 600 unlabeled data are generated from Eq. (24). 600 labeled and 600 unlabeled data are used for training while the other 600 labeled data are used for testing. RMSE is used as a comparison criterion to assess the similarity between the imputed data and the real data. The comparison result is presented in Table 1 and Fig. 7. From the above comparison, it is found that the accuracy of data imputation of either S2-VAE or S2-VAE-GAN is higher than that of MICE with random forest.

4.2. Industrial case study

Data from an ammonia synthesis chemical plant are used to study the effectiveness of the proposed method. Ammonia is an essential component for a lot of applications, such as the key ingredient in the production of fertilizers. One of the most important parts in the synthesis process is the pre-decarburization process, in which the CO2 from the feed gas is absorbed into a solvent, and the absorbed gas in the solvent is extracted and used for the subsequent production process. The flowchart of this process is given in Fig. 8, where the process equipment consists of 4 major devices (the feed gas separator, the PG separator, the heat exchanger, and the absorption column). The absorption column is the main device responsible for capturing CO2 in the feed gas, with the main chemical reactions shown as follows:

RNH2 + CO2 → RNH+COO−     (27)

RNH2 + RNH+COO− → RNHCOO− + RNH3+     (28)

The process contains a total of 19 process variables. The quality variable of interest is the residual CO2 in the process gas. The goal of this study is to enhance the prediction of the quality, given the limited labeled samples and a large number of unlabeled data. There are 1800 data samples. The labeled data account for 80% of the data samples and the remainder are treated as unlabeled data. Training and testing data are selected randomly from each dataset; they account for half of the data samples contained in the labeled and the unlabeled data respectively. Half of the labeled and unlabeled data samples are used to train the semi-supervised methods while all the labeled datasets are used to train the remaining methods. The prediction networks in S2-VAE and S2-VAE/GAN are used to provide the prediction for the performance comparison of each model.

For the S2-VAE model, using the same structure as in the numerical case, the number of training iterations is set to be 6000. The S2-VAE/GAN model is trained for 2000 iterations, and its discriminator is trained for 10 iterations, with the rest of the networks being trained only once. The parameter γ is set to be 0.9.


Table 2
The RMSE values of S2-VAE for the collected data with various levels of noise.

Level      σ_w1   σ_w2    σ_w3    σ_w4   RMSE (data infested with noise)   RMSE (data without noise)
Original   0.02   0.03    0.05    0.04   0.411                             0.401
2.5 times  0.05   0.075   0.125   0.1    0.476                             0.443
5 times    0.1    0.15    0.25    0.2    0.587                             0.45
10 times   0.2    0.3     0.5     0.4    0.81                              0.5

Table 3
The RMSE values of S2-VAE-GAN for the collected data at various levels of noise.

Level      σ_w1   σ_w2    σ_w3    σ_w4   RMSE (data infested with noise)   RMSE (data without noise)
Original   0.02   0.03    0.05    0.04   0.371                             0.355
2.5 times  0.05   0.075   0.125   0.1    0.521                             0.465
5 times    0.1    0.15    0.25    0.2    0.75                              0.566
10 times   0.2    0.3     0.5     0.4    0.942                             0.618

Table 4
The RMSE values of S2-VAE at different ratios of labeled to unlabeled data.

Ratio (Labeled:Unlabeled)   RMSE (data infested with noise)   RMSE (data without noise)
600:150                     0.484                             0.475
600:300                     0.459                             0.453
600:450                     0.447                             0.439
600:600                     0.411                             0.401
600:750                     0.358                             0.347
600:900                     0.41                              0.394
600:1050                    0.458                             0.455

Table 5
The RMSE values of S2-VAE-GAN at different ratios of labeled to unlabeled data.

Ratio (Labeled:Unlabeled)   RMSE (data infested with noise)   RMSE (data without noise)
600:150                     0.56                              0.549
600:300                     0.476                             0.464
600:450                     0.433                             0.417
600:600                     0.371                             0.355
600:750                     0.986                             0.979
600:900                     0.987                             0.98
600:1050                    0.988                             0.981

The RMSE of the prediction result of each method using the industrial data is shown in Table 6. In Table 6, KPLS, S2-VAE, and S2-VAE/GAN are the top 3 methods as their RMSE values are lower than 1. Their prediction results are shown in Fig. 9. In Fig. 9, the KPLS result still deviates considerably from the real industrial data; the KPLS method cannot deal with the nonlinearity well because of the shallow nonlinear representation of the kernel function. The results of S2-VAE and S2-VAE/GAN are shown to be very close to the industrial data in both the training and the testing data, mainly because the deep nonlinear nature of the neural network allows both S2-VAE and S2-VAE/GAN to model the complex nonlinear characteristic of the decarburization process.

By comparing the results of S2-VAE and S2-VAE/GAN, it is found that S2-VAE/GAN has better prediction results than S2-VAE at each data point. The S2-VAE predictions fluctuate more than the S2-VAE/GAN predictions, as the points of S2-VAE are scattered more widely than those of S2-VAE/GAN.

Fig. 6. The prediction results of various conventional methods (PPCR, PPLS, KPCR, KPLS, SVR, MLR, S2-PLVR, WS2-OFA, and BPLS) vs. the proposed S2-VAE and S2-VAE-GAN, as well as the actual data with and without noise.
and proposed S2-VAE-GAN) as well as the actual data with and without noise. of fluctuations as the points of S2-VAE are scattered more widely than


The RMSE value of S2-VAE/GAN is also lower than that of S2-VAE. With fewer training iterations, S2-VAE/GAN can produce a more accurate prediction result than S2-VAE; its discriminator can train the rest of the networks to resemble the real data distribution faster than S2-VAE.

Table 6
The RMSE values of the prediction results obtained from various methods in the industrial case study.

Method        RMSE
PPCR          1.0358
PPLS          1.554
KPCR          2.01
KPLS          0.853
S2-PPLS       1.021
WS2-OFA       1.002
S2-VAE        0.379
S2-VAE/GAN    0.308

Fig. 7. The data imputation results of MICE with random forest regression vs. the proposed S2-VAE and S2-VAE-GAN.

Fig. 8. The decarburization step of the ammonia synthesis process.

Fig. 9. The prediction results of KPLS, S2-VAE, and S2-VAE/GAN in the industrial case study.

5. Conclusion

In this paper, a novel VAE/GAN hybrid model is developed by incorporating a semi-supervised learning strategy. It uses labeled and unlabeled data to enhance the prediction ability of the prediction network. Adjusting the loss function of the proposed method, in addition to the discriminator in GAN, enables consistent learning of the prediction network and allows the prediction network to generate quality data similar to the real ones. With the addition of unlabeled data, the proposed S2-VAE/GAN can train the model even when the labeled data samples alone are not sufficient to train it. It is compared with past methods, including PCR, PLS, PPCR, PPLS, KPCA, and KPLS, in a numerical and an industrial case. The results show that S2-VAE and S2-VAE/GAN are able to model the complex nonlinear characteristics of the systems in both the numerical and industrial cases while the past data-driven models with kernel trick mapping cannot. Finally, by incorporating the GAN model into S2-VAE, the proposed method requires fewer training iterations, and it can produce better prediction results than the original semi-VAE method.


As most processes are often operated at a fixed operating condition, the collected data can properly be assumed to follow a Gaussian distribution. Of course, in an operating plant, because of market changes, many manufacturers need to produce multiple products under different operating conditions constantly to meet customer needs and market demands. As the process is operated under different operating conditions, the collected data can be regarded as distributed according to a Gaussian mixture distribution. To handle data of the Gaussian mixture distribution, variational deep embedding and the Gaussian mixture prior variational autoencoder have been proposed ([21,22]), but they are unsupervised schemes. Extending the proposed methods to Gaussian-mixture-based S2-VAE and S2-VAE-GAN would be an interesting research topic in the future.

Author statement

Sai Kit Ooi: Visualization, Methodology, Software, Writing - Original draft preparation.
Dave Tanny: Methodology, Software, Writing - Original draft preparation.
Junghui Chen: Conceptualization, Methodology, Writing - Original draft preparation, Writing - Reviewing and Editing.
Kai Wang: Conceptualization, Methodology.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

The authors would like to gratefully acknowledge the Ministry of Science and Technology, Taiwan, R.O.C. (MOST 109-2221-E-033-013-MY3).

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.chemolab.2021.104385.

Appendix A. Derivations of the unlabeled objective function of S2-VAE

To define the loss function of the unlabeled process data J(X^u), the missing quality data are treated as another latent variable in addition to the defined latent variable z. Also, to make the derivation process smoother and easier, the sample index of each process and quality datum is omitted. Thus, the KL divergence between the true unlabeled process posterior and the approximate unlabeled posterior is defined as

KL(q_φ(z,y|x) ‖ p_θ(z,y|x)) = E_{q_φ(z,y|x)}[log q_φ(z,y|x) − log p_θ(z,y|x)]     (29)

Under the assumption that the quality variable y is influenced by the process variable x, the true posterior distribution p_θ(z,y|x) can be substituted by

p_θ(z,y|x) = p_θ(z|x,y) p_θ(y|x)     (30)

Thus, with the conditional probability property, the joint probability p_θ(x,y,z) can be rewritten.

p_θ(x,y,z) = p_θ(z|x,y) p_θ(y|x) p_θ(x)     (31)

Assume that only the latent variable z can be reconstructed back into the process variable x and the quality variable y. It is a reasonable assumption that, in the latent variable z, there are common relations between the process variable x and the quality variable y. Thus, the term on the right-hand side of Eq. (31) can be expressed as

p_θ(x,y,z) = p_θ(z|x,y) p_θ(y|x) p_θ(x) = p_θ(x|z) p_θ(y,z)     (32)

Then, rearrange the equation,

p_θ(z|x,y) p_θ(y|x) = p_θ(x|z) p_θ(y,z) / p_θ(x)     (33)

Also, p_θ(z,y|x) can be expressed by

p_θ(z,y|x) = p_θ(x|z) p_θ(y,z) / p_θ(x)     (34)

Substitute and rearrange the above true posterior into the KL divergence defined in Eq. (29):

KL(q_φ(z,y|x) ‖ p_θ(z,y|x)) = E_{q_φ(z,y|x)}[ log q_φ(z,y|x) − log ( p_θ(x|z) p_θ(y,z) / p_θ(x) ) ]
                            = E_{q_φ(z,y|x)}[ log q_φ(z,y|x) − log p_θ(x|z) − log p_θ(y,z) + log p_θ(x) ]     (35)

Thus, p_θ(x) can be obtained by

E_{q_φ(z,y|x)}[log p_θ(x)] = E_{q_φ(z,y|x)}[log p_θ(x|z)] + E_{q_φ(z,y|x)}[log p_θ(y,z)] − E_{q_φ(z,y|x)}[log q_φ(z,y|x)] + KL(q_φ(z,y|x) ‖ p_θ(z,y|x))
                           = E_{q_φ(z,y|x)}[log p_θ(x|z)] − KL(q_φ(z,y|x) ‖ p_θ(y,z)) + KL(q_φ(z,y|x) ‖ p_θ(z,y|x))     (36)


where the evidence lower bound of LðXu Þ is defined as

LðXu Þ ¼ Eqφ ðz;yjxÞ ½log pθ ðxjzÞ   KLðqφ ðz; yjxÞ kpθ ðy; zÞ Þ (37)

In order to fully observe the detailed transformation of the unlabeled objective function, the expectation operator is changed to an integral format.
$$\begin{aligned} \mathbb{E}_{q_\varphi(z,y|x)}[\log p_\theta(x|z)] &= \iint q_\varphi(y|x)\,q_\varphi(z|x,y)\log p_\theta(x|z)\,dz\,dy \\ &= \int q_\varphi(y|x)\,\mathbb{E}_{q_\varphi(z|x,y)}[\log p_\theta(x|z)]\,dy \\ &= \mathbb{E}_{q_\varphi(y|x)}\big[\mathbb{E}_{q_\varphi(z|x,y)}[\log p_\theta(x|z)]\big] \end{aligned} \tag{38}$$

where the approximate unlabeled posterior distribution can also be assumed to follow the same factorization as Eq. (30),

$$q_\varphi(z,y|x) = q_\varphi(z|x,y)\,q_\varphi(y|x) \tag{39}$$

Likewise, expand the KL divergence in Eq. (37) into an integral format,

$$\begin{aligned} KL(q_\varphi(z,y|x)\,\|\,p_\theta(y,z)) &= \iint q_\varphi(z,y|x)\log\frac{q_\varphi(z,y|x)}{p_\theta(y,z)}\,dz\,dy \\ &= \iint q_\varphi(y|x)\,q_\varphi(z|x,y)\log\frac{q_\varphi(y|x)\,q_\varphi(z|x,y)}{p_\theta(y|z)\,p_\theta(z)}\,dz\,dy \end{aligned}$$

Separating the logarithmic term in the above equation gives

$$KL(q_\varphi(z,y|x)\,\|\,p_\theta(y,z)) = \mathbb{E}_{q_\varphi(y|x)}\big[KL(q_\varphi(z|x,y)\,\|\,p_\theta(z))\big] - H(q_\varphi(y|x)) - \mathbb{E}_{q_\varphi(y|x)}\big[\mathbb{E}_{q_\varphi(z|x,y)}[\log p_\theta(y|z)]\big] \tag{40}$$

where $H(q_\varphi(y|x)) = -\int q_\varphi(y|x)\,\mathbb{E}_{q_\varphi(z|x,y)}[\log q_\varphi(y|x)]\,dy$ is the differential entropy. Substitute the derivations from Eq. (38) and Eq. (40) into Eq. (37),
    
$$\begin{aligned} L(X_u) &= \mathbb{E}_{q_\varphi(y|x)}\big[\mathbb{E}_{q_\varphi(z|x,y)}[\log p_\theta(x|z)]\big] - \mathbb{E}_{q_\varphi(y|x)}\big[KL(q_\varphi(z|x,y)\,\|\,p_\theta(z))\big] + H(q_\varphi(y|x)) + \mathbb{E}_{q_\varphi(y|x)}\big[\mathbb{E}_{q_\varphi(z|x,y)}[\log p_\theta(y|z)]\big] \\ &= \mathbb{E}_{q_\varphi(y|x)}\underbrace{\big[\mathbb{E}_{q_\varphi(z|x,y)}[\log p_\theta(x|z)] + \mathbb{E}_{q_\varphi(z|x,y)}[\log p_\theta(y|z)] - KL(q_\varphi(z|x,y)\,\|\,p_\theta(z))\big]}_{L(X_u,\,Y)} + H(q_\varphi(y|x)) \end{aligned}$$

$$L(X_u) = \mathbb{E}_{q_\varphi(y|x_u)}[L(X_u, Y)] + H(q_\varphi(y|x)) \tag{41}$$
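To make the bound in Eq. (41) concrete, the following sketch (not from the paper; a minimal PyTorch-style illustration assuming Gaussian prediction, encoder and decoder networks with the hypothetical handles pred_net, enc_net, dec_x and dec_y) evaluates the labeled bound inside the expectation, which is the same quantity as $L(x,y)$ derived in Appendix B, and then the unlabeled bound of Eq. (41) by sampling the missing quality datum from $q_\varphi(y|x)$ and adding the differential entropy term.

```python
import torch
from torch.distributions import Normal, kl_divergence

def labeled_elbo(x, y, enc_net, dec_x, dec_y):
    """L(x, y) of Eq. (46): reconstruct x and y, penalize KL(q(z|x,y) || p(z))."""
    z_mu, z_logvar = enc_net(x, y)                      # q_phi(z|x,y)
    q_z = Normal(z_mu, torch.exp(0.5 * z_logvar))
    z = q_z.rsample()                                   # reparameterized sample of z
    x_mu, x_logvar = dec_x(z)                           # p_theta(x|z)
    y_mu, y_logvar = dec_y(z)                           # p_theta(y|z)
    rec_x = Normal(x_mu, torch.exp(0.5 * x_logvar)).log_prob(x).sum(-1)
    rec_y = Normal(y_mu, torch.exp(0.5 * y_logvar)).log_prob(y).sum(-1)
    kl = kl_divergence(q_z, Normal(0.0, 1.0)).sum(-1)   # prior p(z) = N(0, I)
    return rec_x + rec_y - kl

def unlabeled_elbo(x, pred_net, enc_net, dec_x, dec_y):
    """L(X_u) of Eq. (41): E_{q(y|x)}[L(x, y)] + H(q(y|x)), one-sample Monte Carlo."""
    y_mu, y_logvar = pred_net(x)                        # prediction network q_phi(y|x)
    q_y = Normal(y_mu, torch.exp(0.5 * y_logvar))
    y_hat = q_y.rsample()                               # fill in the missing quality data
    entropy = q_y.entropy().sum(-1)                     # H(q_phi(y|x))
    return labeled_elbo(x, y_hat, enc_net, dec_x, dec_y) + entropy
```

In practice the expectation over $q_\varphi(y|x)$ would be averaged over several samples per data point; a single reparameterized sample is shown here for brevity.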

Appendix B. Derivations of the labeled dataset objective function of S2-VAE

In order to capture the true posterior distribution, an approximate function $q_\varphi(z|x,y)$ is trained to mimic the real posterior distribution $p_\theta(z|x,y)$; thus, the KL divergence can be defined as

$$KL(q_\varphi(z|x,y)\,\|\,p_\theta(z|x,y)) = \mathbb{E}_{q_\varphi(z|x,y)}[\log q_\varphi(z|x,y) - \log p_\theta(z|x,y)] \tag{42}$$
The posterior distribution $p_\theta(z|x,y)$ can be extended through the conditional probability property in relation to the joint probability $p_\theta(x,y,z)$, which is shown as

$$p_\theta(x,y,z) = p_\theta(z|x,y)\,p_\theta(x,y) = p_\theta(x,y|z)\,p_\theta(z) \tag{43}$$
pθ ðx; y; zÞ ¼ pθ ðzjx; yÞpθ ðx; yÞ ¼ pθ ðx; yjzÞpθ ðzÞ (43)


Like Appendix A, assume that only the latent variable z can be reconstructed back to the process variable x and the quality variable y. By rearranging
the above equation, the posterior distribution related to the remaining terms can be represented by

$$p_\theta(z|x,y) = \frac{p_\theta(x,y|z)\,p_\theta(z)}{p_\theta(x,y)} = \frac{p_\theta(x|z)\,p_\theta(y|z)\,p_\theta(z)}{p_\theta(x,y)} \tag{44}$$
Substituting the true posterior distribution $p_\theta(z|x,y)$ into the KL divergence equation in Eq. (42):

$$\begin{aligned} KL(q_\varphi(z|x,y)\,\|\,p_\theta(z|x,y)) &= \mathbb{E}_{q_\varphi(z|x,y)}\left[\log q_\varphi(z|x,y) - \log\frac{p_\theta(x|z)\,p_\theta(y|z)\,p_\theta(z)}{p_\theta(x,y)}\right] \\ &= \mathbb{E}_{q_\varphi(z|x,y)}[\log q_\varphi(z|x,y)] - \mathbb{E}_{q_\varphi(z|x,y)}[\log p_\theta(x|z)] - \mathbb{E}_{q_\varphi(z|x,y)}[\log p_\theta(y|z)] - \mathbb{E}_{q_\varphi(z|x,y)}[\log p_\theta(z)] + \mathbb{E}_{q_\varphi(z|x,y)}[\log p_\theta(x,y)] \end{aligned}$$

Rearranging the equation to represent the relation of the marginal likelihood $p_\theta(x,y)$ yields


$$\begin{aligned} \mathbb{E}_{q_\varphi(z|x,y)}[\log p_\theta(x,y)] &= \mathbb{E}_{q_\varphi(z|x,y)}[\log p_\theta(x|z)] + \mathbb{E}_{q_\varphi(z|x,y)}[\log p_\theta(y|z)] \\ &\quad - KL(q_\varphi(z|x,y)\,\|\,p_\theta(z)) + KL(q_\varphi(z|x,y)\,\|\,p_\theta(z|x,y)) \end{aligned} \tag{45}$$

where $KL(q_\varphi(z|x,y)\,\|\,p_\theta(z)) = \mathbb{E}_{q_\varphi(z|x,y)}[\log q_\varphi(z|x,y)] - \mathbb{E}_{q_\varphi(z|x,y)}[\log p_\theta(z)]$. The conditional probability terms $p_\theta(x|z)$ and $p_\theta(y|z)$ and the KL divergence between the approximate posterior and the prior distribution, $KL(q_\varphi(z|x,y)\,\|\,p_\theta(z))$, are combined and defined as the evidence lower bound $L(x,y)$, which is shown as follows,

$$L(x,y) = \mathbb{E}_{q_\varphi(z|x,y)}[\log p_\theta(x|z)] + \mathbb{E}_{q_\varphi(z|x,y)}[\log p_\theta(y|z)] - KL(q_\varphi(z|x,y)\,\|\,p_\theta(z)) \tag{46}$$

Thus, $\log p_\theta(x,y)$ can be obtained by

$$\mathbb{E}_{q_\varphi(z|x,y)}[\log p_\theta(x,y)] = L(x,y) + KL(q_\varphi(z|x,y)\,\|\,p_\theta(z|x,y)) \tag{47}$$
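As a sanity check on Eq. (47), the following self-contained numerical sketch (an illustrative toy linear-Gaussian model, not the process model of the paper) computes every term analytically and confirms that the evidence lower bound $L(x,y)$ plus the KL divergence between the approximate and the true posterior reproduces $\log p_\theta(x,y)$ exactly, for an arbitrarily chosen approximate posterior.

```python
import numpy as np

# Toy linear-Gaussian model: p(z) = N(0, 1), p(x|z) = N(z, sx2), p(y|z) = N(z, sy2).
sx2, sy2 = 0.5, 0.8
x, y = 1.3, -0.4

# Exact posterior p(z|x,y) is Gaussian by conjugacy.
post_prec = 1.0 + 1.0 / sx2 + 1.0 / sy2
post_var = 1.0 / post_prec
post_mean = post_var * (x / sx2 + y / sy2)

# Arbitrary approximate posterior q(z|x,y) = N(m, s2).
m, s2 = 0.2, 0.3

# ELBO of Eq. (46): E_q[log p(x|z)] + E_q[log p(y|z)] - KL(q || p(z)).
exp_log_px = -0.5 * np.log(2 * np.pi * sx2) - ((x - m) ** 2 + s2) / (2 * sx2)
exp_log_py = -0.5 * np.log(2 * np.pi * sy2) - ((y - m) ** 2 + s2) / (2 * sy2)
kl_q_prior = 0.5 * (s2 + m ** 2 - 1.0 - np.log(s2))
elbo = exp_log_px + exp_log_py - kl_q_prior

# KL divergence between the approximate and the true posterior.
kl_q_post = 0.5 * (np.log(post_var / s2) + (s2 + (m - post_mean) ** 2) / post_var - 1.0)

# Marginal likelihood log p(x, y): (x, y) is bivariate Gaussian with zero mean.
S = np.array([[1.0 + sx2, 1.0], [1.0, 1.0 + sy2]])
v = np.array([x, y])
log_pxy = (-np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(S))
           - 0.5 * v @ np.linalg.solve(S, v))

# Eq. (47): log p(x, y) = L(x, y) + KL(q(z|x,y) || p(z|x,y)).
print(log_pxy, elbo + kl_q_post)   # the two values agree to numerical precision
```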

Appendix C. Derivations of S2-VAE/GAN objective

The loss function of the GAN is expressed as the Jensen-Shannon divergence (JSD), which measures the dissimilarity between the real data distribution $p_\theta(a)$ and the generated data distribution $p_\theta(a|z)$. As the JSD symmetrizes and smooths the original KL divergence, the divergence of each of the two distributions ($p_\theta(a)$ and $p_\theta(a|z)$) from the average of the two distributions is taken into account:

$$D_{JS}(p_\theta(a)\,\|\,p_\theta(a|z)) = \frac{1}{2}\int p_\theta(a)\log\frac{2\,p_\theta(a)}{p_\theta(a)+p_\theta(a|z)}\,da + \frac{1}{2}\int p_\theta(a|z)\log\frac{2\,p_\theta(a|z)}{p_\theta(a)+p_\theta(a|z)}\,da$$
Separating the constant 2 from inside the two logarithmic functions, and noting that the activation function in the outermost layer of the GAN discriminator is a sigmoidal function whose output ranges from 0 to 1, the term can be further simplified as

$$D_{JS}(p_\theta(a)\,\|\,p_\theta(a|z)) = \frac{1}{2}\Bigg(2\log 2 + \underbrace{\int p_\theta(a)\log\frac{p_\theta(a)}{p_\theta(a)+p_\theta(a|z)}\,da}_{\mathbb{E}_{a\sim p(a)}[\log \mathrm{Disc}(a)]} + \underbrace{\int p_\theta(a|z)\log\frac{p_\theta(a|z)}{p_\theta(a)+p_\theta(a|z)}\,da}_{\mathbb{E}_{z\sim p(z)}[\log(1-\mathrm{Disc}(\tilde{a}))]}\Bigg) \tag{48}$$
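A quick numerical check of Eq. (48) (not part of the paper; two arbitrary one-dimensional Gaussians stand in for $p_\theta(a)$ and $p_\theta(a|z)$, and the optimal discriminator $\mathrm{Disc}^*(a) = p_\theta(a)/(p_\theta(a)+p_\theta(a|z))$ is assumed) shows that plugging the discriminator-based expectations into the right-hand side recovers the JSD computed directly from its definition.

```python
import numpy as np

# Two 1-D densities standing in for the real distribution p_theta(a)
# and the generated distribution p_theta(a|z).
a = np.linspace(-8, 8, 160001)
da = a[1] - a[0]
p = np.exp(-0.5 * a ** 2) / np.sqrt(2 * np.pi)                        # N(0, 1)
q = np.exp(-0.5 * (a - 1.0) ** 2 / 0.8) / np.sqrt(2 * np.pi * 0.8)    # N(1, 0.8)

# JSD from its definition: average KL of each density to the mixture.
m = 0.5 * (p + q)
jsd = 0.5 * np.sum(p * np.log(p / m)) * da + 0.5 * np.sum(q * np.log(q / m)) * da

# Optimal discriminator D*(a) = p(a) / (p(a) + p(a|z)); plug it into Eq. (48).
d_star = p / (p + q)
value = np.sum(p * np.log(d_star)) * da + np.sum(q * np.log(1 - d_star)) * da

print(jsd, 0.5 * (2 * np.log(2) + value))   # the two numbers agree to numerical precision
```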

The key principle behind the use of the sigmoidal function is to bound the identification of fake data (represented by 0) and real data (represented by 1). Initially, the fake data can easily be distinguished from their real counterparts, but after iterative learning the generated data can no longer be identified as fake by the discriminator. Because the output is bounded between 0 and 1 and represents the confidence level that a sample is real, the remaining confidence that a sample is fake is simply the highest possible value (i.e. one) minus the confidence level of the real data. Thus, the objective function of Eq. (48) can be simplified into the loss function as follows:

$$J = \mathbb{E}_{a\sim p(a)}[\log(\mathrm{Disc}(a))] + \mathbb{E}_{z\sim p(z)}[\log(1-\mathrm{Disc}(G(z)))] \tag{49}$$

where $\mathrm{Disc}$ is the discriminator and $G$ is the generator (GAN term)/decoder (VAE term). Both the discriminator and the generator are constructed by training neural networks. The first term in the objective function, $\mathbb{E}_{a\sim p(a)}[\log(\mathrm{Disc}(a))]$, signifies the confidence level of the real data, while the second term, $\mathbb{E}_{z\sim p(z)}[\log(1-\mathrm{Disc}(G(z)))]$, denotes the confidence level of the fake/generated data. The objective function is then optimized through a min-max game: the discriminator is trained to effectively distinguish the real data from the fake/generated ones, and the generator is trained to generate a data distribution similar to the distribution of the real data. The discriminator and the generator are trained iteratively, and the iteration does not stop until the objective function converges.
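As an illustration of this alternating scheme, the sketch below (a minimal PyTorch example, not the paper's implementation; the layer sizes, optimizers and the non-saturating generator update commonly used in practice are illustrative assumptions) performs one discriminator step and one generator step per mini-batch following Eq. (49).

```python
import torch
from torch import nn, optim

# Illustrative dimensions and networks; not the paper's architecture.
z_dim, a_dim = 8, 4
gen = nn.Sequential(nn.Linear(z_dim, 32), nn.ReLU(), nn.Linear(32, a_dim))
disc = nn.Sequential(nn.Linear(a_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_g = optim.Adam(gen.parameters(), lr=1e-3)
opt_d = optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCELoss()

def train_step(a_real):
    batch = a_real.shape[0]
    # Discriminator step: push Disc(a) toward 1 on real data and toward 0 on fakes.
    z = torch.randn(batch, z_dim)
    a_fake = gen(z).detach()                      # freeze the generator in this step
    d_loss = bce(disc(a_real), torch.ones(batch, 1)) + \
             bce(disc(a_fake), torch.zeros(batch, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # Generator step (non-saturating form): push Disc(G(z)) toward 1.
    z = torch.randn(batch, z_dim)
    g_loss = bce(disc(gen(z)), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

# Example: iterate until the losses stop changing appreciably.
for _ in range(100):
    train_step(torch.randn(64, a_dim))            # random stand-in for real process data
```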
It should be noted that this objective function is only applicable to unsupervised learning, and the prior samples $z \sim p(z)$ fed to the generator are usually assumed to follow $N(0, I)$. Therefore, with the consideration of the VAE structure connected to the discriminator for both process and quality data, a new objective function for supervised learning and semi-supervised learning can be derived. The derivation procedure is discussed as follows: the objective function of the process data discriminator is first obtained, for which the data are completely available; then the objective function of the quality data discriminator is derived, for which part of the quality data are available and part of them are missing.
Extending the derivation of the loss function to the VAE/GAN structure, the objective is that the real distribution of $x$ is as close as possible to the distribution of $x$ reconstructed from the sampled prior $z$ and, at the same time, as close as possible to the distribution of $x$ reconstructed through the whole VAE network. The derivation is shown as follows:


x-GAN component

$$\begin{aligned} &D_{JS}(p_\theta(x)\,\|\,p_\theta(x|z)p_\theta(z)) + D_{JS}(p_\theta(x)\,\|\,p_\theta(x|z)q_\varphi(z|x,y)p_\theta(y|x)p_\theta(x)) \\ &= \frac{1}{2}\Bigg[ KL\Big(p_\theta(x)\,\Big\|\,\frac{p_\theta(x)+p_\theta(x|z)p_\theta(z)}{2}\Big) + KL\Big(p_\theta(x|z)p_\theta(z)\,\Big\|\,\frac{p_\theta(x)+p_\theta(x|z)p_\theta(z)}{2}\Big) \\ &\qquad + KL\Big(p_\theta(x)\,\Big\|\,\frac{p_\theta(x)+p_\theta(x|z)q_\varphi(z|x,y)p_\theta(y|x)p_\theta(x)}{2}\Big) + KL\Big(p_\theta(x|z)q_\varphi(z|x,y)p_\theta(y|x)p_\theta(x)\,\Big\|\,\frac{p_\theta(x)+p_\theta(x|z)q_\varphi(z|x,y)p_\theta(y|x)p_\theta(x)}{2}\Big) \Bigg] \\ &= \mathbb{E}_{x\sim p_\theta(x)}[\log(\mathrm{Disc}_x(x))] + \mathbb{E}_{x\sim p_\theta(x|z)}\mathbb{E}_{z\sim p_\theta(z)}[\log(1-\mathrm{Disc}_x(\mathrm{Dec}_x(z)))] \\ &\qquad + \mathbb{E}_{x\sim p_\theta(x|z)}\mathbb{E}_{z\sim q_\varphi(z|x,y)}\mathbb{E}_{y\sim p_\theta(y|x)}\mathbb{E}_{x\sim p_\theta(x)}[\log(1-\mathrm{Disc}_x(\mathrm{Dec}_x(\mathrm{Enc}(x,\mathrm{Pred}(x)))))] \end{aligned} \tag{50}$$
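The last expression in Eq. (50) can be estimated per mini-batch with Monte Carlo samples. The sketch below (illustrative only; enc, dec_x, pred and disc_x are hypothetical network handles, and a single sample per expectation is used) assembles the three terms: the real process data, the decoding of a prior sample, and the decoding of a posterior sample obtained through the prediction and encoder networks.

```python
import torch

def x_gan_loss(x, enc, dec_x, pred, disc_x, z_dim):
    """Monte Carlo estimate of the three-term x-discriminator objective in Eq. (50)."""
    eps = 1e-7                                          # guards log(0)
    batch = x.shape[0]
    real_term = torch.log(disc_x(x) + eps).mean()       # E[log Disc_x(x)] on real data
    z_prior = torch.randn(batch, z_dim)                 # z ~ p(z) = N(0, I)
    prior_term = torch.log(1 - disc_x(dec_x(z_prior)) + eps).mean()
    y_hat = pred(x)                                     # fill the quality data via Pred(x)
    z_mu, z_logvar = enc(x, y_hat)                      # q_phi(z|x, Pred(x))
    z_post = z_mu + torch.exp(0.5 * z_logvar) * torch.randn_like(z_mu)
    post_term = torch.log(1 - disc_x(dec_x(z_post)) + eps).mean()  # full VAE reconstruction
    return real_term + prior_term + post_term
```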

For the component $y$, like S2-VAE (Section 3), the labeled and the unlabeled datasets in S2-VAE/GAN can be regarded as unlabeled datasets. Their difference lies in the differential entropy term $\sum_{i=1}^{N_l}\ln p(y_i\,|\,x_i^l)$ for the labeled data; when the $y$ data are unavailable, the differential entropy is taken over the prediction network instead. Based on the S2-VAE/GAN network pathway in Fig. 5, the quality data are used as the target real data for the y-discriminator when they are available. On the other hand, if the quality data are absent, the output of the prediction network is used as the target real data. Thus, the y-GAN component can be formulated as

$$\begin{aligned} &D_{JS}(p_\theta(y|x)\,\|\,p_\theta(y|z)p_\theta(z)) + D_{JS}(p_\theta(y|x)\,\|\,p_\theta(y|z)q_\varphi(z|x,y)p_\theta(y|x)p_\theta(x)) \\ &= \frac{1}{2}\Bigg[ KL\Big(p_\theta(y|x)\,\Big\|\,\frac{p_\theta(y|x)+p_\theta(y|z)p_\theta(z)}{2}\Big) + KL\Big(p_\theta(y|z)p_\theta(z)\,\Big\|\,\frac{p_\theta(y|x)+p_\theta(y|z)p_\theta(z)}{2}\Big) \\ &\qquad + KL\Big(p_\theta(y|x)\,\Big\|\,\frac{p_\theta(y|x)+p_\theta(y|z)q_\varphi(z|x,y)p_\theta(y|x)p_\theta(x)}{2}\Big) + KL\Big(p_\theta(y|z)q_\varphi(z|x,y)p_\theta(y|x)p_\theta(x)\,\Big\|\,\frac{p_\theta(y|x)+p_\theta(y|z)q_\varphi(z|x,y)p_\theta(y|x)p_\theta(x)}{2}\Big) \Bigg] \\ &= \mathbb{E}_{y\sim p_\theta(y|x)}[\log(\mathrm{Disc}_y(\mathrm{Pred}(x)))] + \mathbb{E}_{y\sim p_\theta(y|z)}\mathbb{E}_{z\sim p_\theta(z)}[\log(1-\mathrm{Disc}_y(\mathrm{Dec}_y(z)))] \\ &\qquad + \mathbb{E}_{y\sim p_\theta(y|z)}\mathbb{E}_{z\sim q_\varphi(z|x,y)}\mathbb{E}_{y\sim p_\theta(y|x)}\mathbb{E}_{x\sim p_\theta(x)}[\log(1-\mathrm{Disc}_y(\mathrm{Dec}_y(\mathrm{Enc}(x,\mathrm{Pred}(x)))))] \end{aligned} \tag{51}$$

For the labeled data, which contain quality data, the objective function is shown as follows:

$$D_{JS}(p_\theta(y)\,\|\,p_\theta(y|x)p_\theta(x)) + D_{JS}(p_\theta(y)\,\|\,p_\theta(y|z)p_\theta(z)) + D_{JS}(p_\theta(y)\,\|\,p_\theta(y|z)q_\varphi(z|x,y)p_\theta(y|x)p_\theta(x)) \tag{52}$$

Thus, the above equation is simplified into a more compact form,


     
$$\begin{aligned} L_{GAN_{y,av}} &= \mathbb{E}_{y\sim p(y)}[\log(\mathrm{Disc}_y(y))] + \mathbb{E}_{y\sim p(y|x)}\mathbb{E}_{x\sim p(x)}[\log(1-\mathrm{Disc}_y(\mathrm{Pred}(x)))] \\ &\quad + \mathbb{E}_{y\sim p(y|z)}\mathbb{E}_{z\sim p(z)}[\log(1-\mathrm{Disc}_y(\mathrm{Dec}_y(z)))] \\ &\quad + \mathbb{E}_{y\sim p(y|z)}\mathbb{E}_{z\sim q(z|x,y)}\mathbb{E}_{y\sim p(y|x)}\mathbb{E}_{x\sim p(x)}[\log(1-\mathrm{Disc}_y(\mathrm{Dec}_y(\mathrm{Enc}(x,\mathrm{Pred}(x)))))] \end{aligned} \tag{53}$$
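Analogously to the x-GAN sketch above, the y-discriminator objective of Eqs. (51)-(53) can be assembled per mini-batch. The sketch below (an illustration, not the paper's code; pred, enc, dec_y and disc_y are hypothetical network handles, y_observed is a Boolean mask marking which samples carry measured quality data, and the labeled and unlabeled cases are merged for brevity) switches the "real" target between the measured y and the prediction-network output, as described above.

```python
import torch

def y_gan_loss(x, y, y_observed, pred, enc, dec_y, disc_y, z_dim):
    """Monte Carlo estimate of the y-discriminator objective following Eqs. (51)-(53)."""
    eps = 1e-7                                           # guards log(0)
    batch = x.shape[0]
    y_hat = pred(x)                                      # prediction network output
    # Labeled samples use the measured y as the real target; unlabeled samples
    # (whose entries in y may be placeholders) use Pred(x) instead.
    y_real = torch.where(y_observed, y, y_hat)
    real_term = torch.log(disc_y(y_real) + eps).mean()
    pred_term = torch.log(1 - disc_y(y_hat) + eps).mean()            # E[log(1 - Disc_y(Pred(x)))]
    z_prior = torch.randn(batch, z_dim)                              # z ~ p(z) = N(0, I)
    prior_term = torch.log(1 - disc_y(dec_y(z_prior)) + eps).mean()  # Dec_y on prior samples
    z_mu, z_logvar = enc(x, y_hat)                                   # q_phi(z|x, Pred(x))
    z_post = z_mu + torch.exp(0.5 * z_logvar) * torch.randn_like(z_mu)
    post_term = torch.log(1 - disc_y(dec_y(z_post)) + eps).mean()    # Dec_y on encoder samples
    return real_term + pred_term + prior_term + post_term
```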
