
9th IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes
September 2-4, 2015. Arts et Métiers ParisTech, Paris, France
Available online at www.sciencedirect.com

ScienceDirect
IFAC-PapersOnLine 48-21 (2015) 844–851
Failure Prediction Methodology for Improved Proactive Maintenance using Bayesian Approach ⋆
A. Abu-Samah ∗ M.K. Shahzad ∗ E. Zamai ∗ A. Ben Said ∗∗

∗ Univ. Grenoble Alpes, G-SCOP, F-38000 Grenoble, France (e-mail: asma.abu-samah@grenoble-inp.fr; muhammadkashif.shahzad@grenoble-inp.fr; eric.zamai@grenoble-inp.fr).
∗∗ STMicroelectronics, 850 Rue Jean Monnet, 38926, Crolles, France (e-mail: anis.bensaid@st.com)
Abstract: Failure prediction is essential for predictive maintenance due to its ability to prevent failure occurrences and maintenance costs. At present, mathematical and statistical modeling are the prominent approaches used for failure prediction. These are based on equipment degradation physical models and machine learning methods, respectively. None of these approaches ensures failure predictions well before their occurrence to provide sufficient time to treat potential causes proactively. Therefore, in this paper, we present a Bayesian-based methodology to learn and associate failure signatures with potential failure occurrences. In this approach, event-driven maintenance data are used as symptoms, aggregated on discretized intervals. The failure probabilities predicted by the Bayesian network are plotted as a temporal evolution. This is further exploited to extract either rules or patterns as failure signatures and critical regions. These are then used to monitor and predict potential failure occurrences. The proposed methodology is tested on data collected from a well-reputed semiconductor manufacturer, with promising results.
© 2015, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.
Keywords: Failure prediction, predictive maintenance, Bayesian network, rules extraction.
⋆ The authors gratefully acknowledge STMicroelectronics for their support and provision of data for TT case study. The authors also acknowledge European project INTEGRATE and region RhoneAlpes for ongoing Research.
2405-8963 © 2015, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.
Peer review under responsibility of International Federation of Automatic Control.
10.1016/j.ifacol.2015.09.632

1. INTRODUCTION

In a highly competitive production environment, unscheduled equipment breakdowns cause disruptions in the production capacities. This requires improved response for failure diagnosis and repair times and eventually the capability to proactively handle these failure occurrences for optimized maintenance management (Yang et al. (2008); Haddad et al. (2012)). One of the promising approaches to address this challenge is online failure prediction, which requires the current state of a system to be monitored and evaluated to predict the occurrence of failures in the near future. The key contributions to this approach can be divided into methods that reevaluate temporal inputs and those that rely on maintenance logs.

With the development of sensor technology and real-time data collection, research on failure prediction can be placed in the former category (Lu et al. (2007); das Chagas Moura et al. (2011)). In this category, the equipment condition is modelled using structural time series, followed by forecasting of future deteriorated states using state-space modelling with regression methods. Apart from time series analysis, some methods use classifiers such as the Support Vector Machine (Susto et al. (2013)) to predict failures based on the distance from the decision boundary (faulty and non-faulty classes). Consequently, a failure probability is estimated for planning maintenance decisions. The accuracy of such failure prediction methods is limited because they take into account only physical degradation. Furthermore, in certain applications such as the Semiconductor Industry (SI), the occurrence of a failure is an event-based phenomenon which is difficult to model statistically for failure prediction using only temporal data, due to the imbalanced dimension of functional and dysfunctional data (Susto et al. (2012)). Due to this context, and in addition to the availability of large-scale equipment logs and contextual data for improving productivity and control purposes, failure prediction models using these data are becoming popular. Li et al. (2007) detect failure signatures based on frequent co-occurrences of failures and pair them with a time-to-failure survival model, while several methods have used Hidden Markov Models (HMM) to estimate sequences of hidden degradation states of a system before a failure occurs (Salfner (2005); Zhou et al. (2010); Vrignat et al. (2015)). The accuracy of these models depends on the quality of the temporal data, and they are not suitable for a large number of variables.

In addition to the above, failure occurrence can also be influenced by the complex environment of the manufacturing process. Accordingly, this paper integrates contextual information collected from product, process, equipment and maintenance data sources to predict a system failure, which in turn positions our contribution as event-based prediction that is not limited, as the former methods are, to historical equipment event data. The Bayesian Network (BN), a multi-state model able to evaluate several outputs in the same model, is better suited to the context of integrating numerous variables coming from distinct data sources to model the probabilities of several types of failures shared by the same system. It has the particularity of supporting both diagnosis and prognosis in one model. Its temporal counterpart, the Dynamic Bayesian Network (DBN), is used in Arroyo-Figueroa and Sucar (1999) for prediction of failures in industrial plants.
Static BNs are duplicated and temporal nodes are added between every two BNs to represent time dependence, while Przytula & Choi (2008) coherently integrate evidence on component usage, environmental conditions of operation, and component health history to generate sequential predictions of a component's future performance in an augmented DBN. While the DBN is a potential tool to predict failures ahead of time, such models have so far been applicable only to simple networks with a small number of variables. Moreover, the relationship between variables at two consecutive time steps is often modeled as a continuous degradation; however, in our particular context of using statistical and contextual information to predict failure, we are obliged to first find the temporal relationship for each variable, and the task is not simple in a complex environment.
This paper is an extension of a static diagnosis BN model (Samah et al. (2014)) which uses event-based contextual data/information to diagnose product quality drift and equipment failure, both at the equipment level and at its module level. To overcome the above two obstacles, the current paper extends this model by using the same type of data/information as predictors, but in discretized time intervals, to develop an event-driven temporal BN that assesses failure risk over time, even if at irregular instants. The failure probabilities, computed and plotted by this BN model, are then used to extract patterns and rules to identify a failure before it is detected by the automated Fault Detection and Classification (FDC) system. This predictive scheme is converted into a methodology which is presented in the next section. The case study, results and discussions are presented in Section 3, whereas Section 4 concludes this paper with conclusions and perspectives.

Fig. 1. Methodology for failure detection before its occurrence

2. FAILURE PREDICTION METHODOLOGY

The proposed 4-step methodology for failure occurrence prediction and the proposed dataset usage are presented in Figure 1 and Figure 2 respectively. The first two steps consist of building a Bayesian Network to infer failure probabilities at designated time intervals (subsection 2.1) and the final two steps focus on the usage of the failure probability distribution to predict the occurrence of a failure in advance, with its validation (subsection 2.2).

Fig. 2. Validation scheme using a single dataset

2.1 Bayesian Network for Failure Probability Inference

The Bayesian Network (BN) is a compact representation of joint probability distributions based on conditional probability theory. A BN has the ability for deductive and inter-causal reasoning (Kjaerulff & Madsen, 2006). Deductive causal reasoning takes into account causal links between variables from causes to effects using dynamic detection evolution, whereas inter-causal reasoning is an interesting and powerful ability of the BN where evidence on one possible cause discounts other possible causes. In addition to their ability to represent causal relationships, BNs can also learn efficiently in uncertain environments involving small amounts of data and short temporal changes of states. To build and use a basic BN, we use the concepts of learning and inference.
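The two reasoning modes described above can be illustrated on a toy network. The following is a minimal sketch, assuming a hypothetical three-node BN (two independent causes `c1`, `c2` and one failure node `f`) with illustrative probabilities that are not taken from the paper or its dataset. Inference is done by brute-force enumeration of the joint distribution, which is only practical for very small networks and is not the inference engine used by the authors.

```python
from itertools import product

# Hypothetical priors for two independent causes (illustrative numbers only).
P_C1 = {True: 0.10, False: 0.90}
P_C2 = {True: 0.20, False: 0.80}

# Hypothetical CPT: P(failure = True | c1, c2).
P_F = {
    (True, True): 0.95,
    (True, False): 0.80,
    (False, True): 0.70,
    (False, False): 0.01,
}


def joint(c1, c2, f):
    """Joint probability P(c1, c2, f) from the chain rule of the toy BN."""
    p_f = P_F[(c1, c2)]
    return P_C1[c1] * P_C2[c2] * (p_f if f else 1.0 - p_f)


def query(target, evidence):
    """Return P(target = True | evidence) by enumerating all worlds."""
    num = 0.0
    den = 0.0
    for c1, c2, f in product([True, False], repeat=3):
        world = {"c1": c1, "c2": c2, "f": f}
        # Skip worlds that contradict the observed evidence.
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(c1, c2, f)
        den += p
        if world[target]:
            num += p
    return num / den


# Deductive reasoning: from an observed cause to the failure.
p_deductive = query("f", {"c1": True})
# Diagnostic use of Bayes' theorem: from an observed failure back to a cause.
p_diagnostic = query("c1", {"f": True})
# Inter-causal reasoning: observing the other cause "explains away" c1.
p_explained = query("c1", {"f": True, "c2": True})

print(f"P(f | c1)      = {p_deductive:.3f}")
print(f"P(c1 | f)      = {p_diagnostic:.3f}")
print(f"P(c1 | f, c2)  = {p_explained:.3f}")
```

With these assumed numbers, observing the failure raises the belief in `c1` from its prior of 0.10 to about 0.38, and additionally observing `c2` lowers it again to about 0.13, which is the inter-causal effect the paragraph refers to.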


Learning a BN involves creating the qualitative part of the network, which is the causal structure between variables, commonly known as a Directed Acyclic Graph (DAG), and the quantitative part, which computes the set of conditional probability distributions of the variables, most often expressed as Conditional Probability Tables (CPT). The resulting network is used to perform probabilistic inference from multiple variables, such as calculating the value of P(failure|presence of all causes and/or symptoms) as well as P(a given cause|knowing the failure). The Bayes theorem is the heart of this computation (Margaritis (2003)). Since their introduction, BNs have been extended to cover many important problems. Kobbacy et al. (2011) discuss the various uses of BNs in manufacturing, with emphasis on their applicability when uncertainty is the key characteristic.

Creation of the BN is the heart of the first 2 steps, to present the causal and conditional dependencies between two types of nodes, paired with the critical part of the methodology, which is the handling of time: (i) Predictors, corresponding to the observable events and statistical information coming from multiple data sources, and (ii) Failure code, with 'no failure' included as the targeted equipment condition.

Step-1 comprises the identification of Predictors from the manufacturing process, quality inspection, maintenance and process control operations databases. It is one of the most difficult and complex tasks, as it requires multidisciplinary expertise from each domain. The first task is pursued with the definition of the time interval for the data collection, because in the database the targeted variables can be either event-based or continuous data, collected at irregular intervals. To overcome this challenge, we propose a time discretization with the objective of monitoring the failure probabilities systematically. The next task is to divide the historical dataset into two parts: BN learning with rules extraction (BNL&RE) and validation (V). In this paper, we distinguish the notions of test and validation. Test refers to the cross-validation task using BNL&RE sample data, while validation is the final step of the methodology. The two parts of the data are initially divided evenly as 50:50.

Step-2 is focused on learning and optimizing the BN structure and computing its CPT. The structure of a BN can be obtained either through expert knowledge or by learning from the data. In the methodology, it is proposed to learn it from data using a score-based unsupervised learning algorithm that uses Minimum Description Length (MDL) as an objective function, for its advantage of a trade-off between data fit and model complexity (Lam and Bacchus (1994)). The prediction and accuracy criteria are defined by the end user to validate the generated structure and its CPT as the choice of Bayesian inference model. If they are not fulfilled, the initial BN structure can be further optimized using other learning algorithm(s). Non-compliance with the user-defined criteria using the selected algorithms results in increasing and adjusting the ratio of the BNL&RE and V datasets. This task follows recursive relearning and optimization of the BN structure until the user-defined criteria are met or Size(V) is less than 0.25*Size(Complete Dataset). The last BN model is then used to plot failure probabilities upon the testing dataset on the discretized time intervals, and for each failure separately. These graphs are the input to the next step.

2.2 Rules Extraction for Real Time Detection of Failures Before Its Occurrence

At this stage of the proposed methodology, we are equipped with the probability graph for each failure over discrete time intervals. These probability distributions from the BNL&RE testing set are then analyzed in Step-3 to extract patterns for all types of failures separately. We assume that if the probability of a failure type exceeds a certain level, it defines the type of failure the system in question is in, and that the occurrence of failures can be predicted by identifying the consistency of the failure probabilities the system is experiencing. This assumption is based on the fact that dependencies among the chosen variables exist and are translated into conditional probabilities of target events. If no pattern exists, a Critical Region (CR) and a number of sufficient consecutive points are computed to find rule(s) for prediction. An existing pattern may also need a CR to refine the results. This step concludes with the prediction of failure using pattern/rule(s) on the training BNL&RE dataset and the computation of lead time (the time interval from the prediction to the failure occurrence) for each prediction. If none of the pattern/rule(s) extracted fulfills the user's criteria, then the step is repeated to identify new pattern/rule(s) by relaxing either the CR or the number of consecutive points. Prediction of failures with rules on the validation set is the final step, Step-4 of this methodology. We also compute the predictability index (PI) for all chosen rules as the average of prediction accuracy (Eq. 1), precision (Eq. 2) and lead time percentage (Eq. 3) 1 .

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{1} \]

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2} \]

\[ \text{Lead time \%} = \frac{\text{Lead time}_{TP}}{\text{Lead time}_{TP} + \text{Lead time}_{FP}} \tag{3} \]

1 TP=True Positive, TN=True Negative, FP=False Positive and FN=False Negative

3. CASE STUDY AND PROOF OF CONCEPT

This section introduces the case study and proof of concept for the methodology, tested on one of the two process reactors, one of the many modules of Thermal Treatment equipment (TASMI) from a reputed semiconductor manufacturer. It is used to grow and deposit oxide and nitride layers on the surface of silicon wafers. It is also used for annealing (heat treatment) after production steps to stabilize the crystalline structure of silicon wafers. The reactor module TASMI01 considered in this case study comprises: 1-Exterior chamber, 2-Inner chamber with quartz (liner), 3-Wafer support (boat), 4-Elevator boat rotation, 5-Watertight door for loading and unloading, 6-Heating elements, 7-Gas panel, 8-Temperature sensor, 9-Pressure gauge (manometer), 10-Pressure regulator. At present, the best preventive maintenance effort made so far in this manufacturing industry exists in the form of the Fault Detection and Classification (FDC) technique that


uses temporal sensor data to detect changing conditions within equipment and uses that knowledge to improve the process. In this approach, data from carefully selected sensors is monitored against user-defined rules to schedule equipment stoppages. The selection is necessary because a piece of equipment in the Semiconductor Industry (SI) carries approximately 5000+ sensors whose data is sampled every 10 milliseconds. Hence, it is not possible to monitor all sensors and construct monitoring rules for all equipment in a production line (Gallagher et al. (1997)). Therefore, equipment experts identify key sensors and define rules for each piece of equipment to be monitored during production. The failures generated by this FDC system are used to define the failure types that serve as our target events. The dataset used in this case study spans 10 months (from week 27 of 2013 to week 16 of 2014) and is collected across the product, process, equipment and maintenance databases for reactor TASMI01. These data provide the predictors and the failure type. The results are presented step by step (Subsections 3.1-3.4 respectively), followed by a discussion in Subsection 3.5.

Fig. 3. Step-1: Identification of Predictors as failure predictors and dataset pre-processing

3.1 Step-1: Identification of Failure Predictors and Data Pre-processing

The first task of the first step (Figure 3) has been elaborated in Samah et al. (2014), where it was carried out through brainstorming sessions with experts from each domain. Identification of predictors is done at equipment module level, where the impacting predictors and associated failure types are believed to be more specific and able to give more accurate predictions. This choice is also motivated by the fact that equipment in the SI is composed of modules and is modeled as a parent-child relationship. Therefore, a good prediction model at module level is essential as a basis for a whole-equipment model. The brainstorming sessions with the experts for reactor TASMI01 resulted in identifying 23 predictors, related to 3 significant types of failure: (a) ElevatorBoatRotation, (b) GazPanel and (c) Out of control (OC). These are presented in Table 1 and are organized under their respective data sources. They are classified along four axes: Product, Process, Equipment and Maintenance. The TT equipment is of batch cluster type, which processes multiple lots in a given step. Therefore, current/previous product combinations might influence the equipment condition. Wait time before process, out of control thickness and defect distribution from previous steps are also identified as key product predictors. It is also identified that not only the current recipe but also the previous recipe and their respective process steps could be strongly linked to equipment dependability. The FDC sensor signals from the equipment database are not directly considered; however, decisional information based on these signals is a good candidate for potential predictors. For equipment, the number of wafers processed, equipment capability (Cm), overall equipment efficiency (OEE), the productive time of equipment delivering satisfying product for the client (OEE Time), total up time (productive with idle time) and productive time are included. The counters, as additional predictors, are the meters associated with the reactor, used for triggering preventive maintenance actions. The last category of predictors is maintenance, where indicators such as mean time to repair (MTTR), number of rework, alarms and warnings are considered. Another indicator is the failure code, which locates the hardware block of the failure inside the equipment.

Table 1. List of identified Predictors and failures for TASMI01
Failure: Failure a; Failure b; Failure c; no Failure
Product: Wait time; Defect distribution; Current Product combination; Previous Product combination; Out Of Control thickness
Process: Current step; Previous step; Current recipe; Previous recipe
Equipment and Maintenance: Equipment state; PM Boat meter; MTTR; PM Trap meter; Failure code; Wafer processed; Productive time; Nb of Rework; Up time; Nb of alarm; OEE; Nb of warning; OEE time; Cm

Once the predictors have been identified, the next task is the definition of the time interval. As a proof of concept, we have chosen the time interval such that each interval contains at least one value for all predictors. Predictors with multiple values in a time interval must be aggregated by taking the mean for continuous values and the mode for discrete values. Chosen predictor values are updated either upon the occurrence of a failure (e.g. MTTR, Failure Code) or between two failures (e.g. wait time before processing, OEE). Those updated upon failure occurrence are kept constant between the respective failures for all time intervals. Besides the three identified failures, no failure is also added to the failure node to be predicted by the BN. Given historical data spanning 10 months, we were able to generate 6300 time intervals, covering 82 occurrences of failure. The intervals are not uniform and range from a few minutes to several hours. This newly constructed dataset is evenly split into 2 parts: Bayesian Network modeling & Rule(s) extractions (BNL&RE) and Validation Set (V). The pre-processed data is used to develop the BN and extract the rules.

3.2 Step-2: Bayesian Network Learning and Optimization

In the proposed methodology, this step (Figure 4) consists of learning the BN structure and its CPT. Using Minimum


Description Length (MDL), the structure of the network is learned and optimized with BayesiaLab 5.3. Based on previous experience (Samah et al. (2014)) with similar data, the criterion for model validation is that accuracy and precision must both be larger than 95%. The structure is first learned using the Equivalence Class (EQ) algorithm, a heuristic that searches for the highest scoring network in a reduced space of potential BN structures that have the same conditional independence relations (Munteanu and Bendou (2001); Chickering (2002)). Maximum Likelihood (ML) estimation of the CPT is executed every time a structure is obtained. The precision and accuracy are then computed and compared to the validation criteria. If the first attempt does not fulfill the criteria, the BN structure is further optimized using the Tabu and Tabu order algorithms (Glover (1986); Acid and de Campos (2003); Teyssier and Koller (2012)). The model with the lowest MDL score is accepted for further analysis. If the optimized structure still does not satisfy the criteria, the structure and CPT are relearned and optimized after adjusting the BNL&RE and V datasets. As a result, the BN model (Figure 5) was finally obtained with an average accuracy of 97.2% and a final ratio of BNL&RE to V equal to 62:38. The predictors in this model are color-coded by type of data source. The yellow, green, pink and light brown colors represent product, process, equipment and maintenance related predictors respectively, whereas failure type is the target node. The model is then used to plot failure probabilities on the testing dataset over the discretized time intervals for each failure. These graphs are the input to the next task.

Fig. 4. Step-2: Unsupervised learning and optimization of Bayesian Network.

Fig. 5. Bayesian Network model for failure inference

Fig. 6. Failure inferences in BNL&RE testing dataset, time interval [2835,4725].

Fig. 7. Failure inferences in BNL&RE testing dataset, time interval [3480,3490].

The resulting BN is employed to infer failure probabilities in the BNL&RE testing dataset (1890 time intervals with 24 failure occurrences). The probabilities for each failure are plotted in Figure 6. The occurrences can be distinguished as the sets of points with high probabilities. For a zoomed-in version, refer to the example in Figure 7, which shows an observation of failure (a) (blue curve) at time interval 3487. The failure is characterized by its high probability (0.64) compared to the average probability of failure (a), which is 0.18, whereas the no failure probability drops to 0.2.
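The Step-2 acceptance test (accuracy and precision both larger than 95%) can be sketched from the TP, TN, FP and FN counts defined earlier. This is only an illustration under assumptions: it uses the usual definitions accuracy = (TP+TN)/(TP+TN+FP+FN) and precision = TP/(TP+FP), which the paper does not spell out in this section, and the class name, function names and counts below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Hypothetical container for the four outcome counts of one failure class."""
    tp: int  # true positives
    tn: int  # true negatives
    fp: int  # false positives
    fn: int  # false negatives


def accuracy(c: ConfusionCounts) -> float:
    """Share of all time intervals that were classified correctly."""
    total = c.tp + c.tn + c.fp + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def precision(c: ConfusionCounts) -> float:
    """Share of predicted failures that were real failures."""
    predicted_positive = c.tp + c.fp
    return c.tp / predicted_positive if predicted_positive else 0.0


def meets_criteria(c: ConfusionCounts, threshold: float = 0.95) -> bool:
    """Accept the model only if both metrics exceed the threshold (95% in the text)."""
    return accuracy(c) > threshold and precision(c) > threshold


# Illustrative counts only; they are not taken from the case study.
good = ConfusionCounts(tp=96, tn=97, fp=3, fn=4)
weak = ConfusionCounts(tp=90, tn=90, fp=10, fn=10)
print(meets_criteria(good), meets_criteria(weak))
```

When the check fails, the methodology described above moves on to the Tabu-based optimization, and then to re-splitting the BNL&RE and V datasets.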


3.3 Step-3: Definition and Refinement of Rule(s) for Failure Prediction

The graphs plotted in Step-2 play a pivotal role in this step, which defines and refines the pattern/rule(s) for detecting a failure before its occurrence. For the application in this case study, we propose a scheme to find the pattern/rules (Figure 8). First, the probability distribution of each failure is inspected for an obvious pattern. If one exists, a Critical Region (CR) common to all occurrences of the respective failure is defined for pattern Pi. The limits of this region are selected as min(probability) and max(probability), observed among all failure occurrences. If there is no pattern, the CR is defined first, with its min and max limits set to max(mean, mode, median) and max(probabilities) respectively over all failure occurrences. This is followed by the computation of the min and max number of consecutive points as potential identifiers of the respective failures. The cardinality of [min,max] becomes the number of rule(s) extracted from the respective region for each failure. This step concludes with the prediction of failures using the rule(s) on the training BNL&RE dataset and the computation of the lead time for each prediction. The user-defined lead-time criterion is set as the minimum time needed to fix each type of failure. If none of the extracted rules fulfills this criterion, the scheme is repeated to identify new patterns by relaxing the CR.

Fig. 8. Step-3&4: Extraction and refinement of rules from each failure probability distribution

As a result, similar patterns have been detected for all occurrences of Failure b and Failure c. Failure b shows continuously increasing probabilities, and Failure c shows a W or M point-to-point pattern in which consecutive values are at least 5 times smaller or bigger. Next, critical regions common to all occurrences of each failure are defined. For Failure a, no pattern could be extracted, so a CR over all occurrences of this failure needs to be established first, before assigning the rules on the number of consecutive points. Table 2 summarizes our results for patterns/rules and their respective critical regions. The probability distribution for each failure in chosen time intervals is presented in Figure 9, with the count of occurrences in brackets. In the graphs, the horizontal lines show the upper and lower limits of the CR, whereas the brown boxes mark the patterns associated with the defined rules.

Table 2. Summary of results from rule(s) extraction
Failure | Base of Rule | Critical Region (CR) | Rule(s)
Failure a | Non-pattern | Lower limit = max(mean) = 0.175; Upper limit = max(probability observed for the failure) = 0.64 | 1) [Min=2; Max=22] consecutive points inside the CR.
Failure b | Pattern | Lower limit = min(probability of failure occurrence) = 0.08; Upper limit = max(probability observed for the failure) = 0.71 | 1) Sequentially increasing probabilities &, 2) [Min=3; Max=11] consecutive points inside the CR following rule 1.
Failure c | Pattern | Lower limit = min(probability of failure occurrence) = 0.04; Upper limit = max(probability observed for the failure) = 0.57 | 1) M/W Pattern &, 2) 2 consecutive points of the pattern, P(t) = factor of 5 * |P(t+1)|
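The non-pattern rule for Failure a ("N consecutive points inside the CR") and the lead-time computation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the CR limits [0.175, 0.64] come from Table 2, while the function names, the probability series and the timestamps are hypothetical.

```python
from typing import List


def detect_runs_in_cr(probs: List[float], lower: float, upper: float,
                      min_points: int) -> List[int]:
    """Return the indices at which a run of consecutive points inside
    [lower, upper] first reaches min_points in length."""
    detections = []
    run = 0
    for t, p in enumerate(probs):
        if lower <= p <= upper:
            run += 1
            if run == min_points:
                detections.append(t)
        else:
            run = 0  # leaving the CR resets the run
    return detections


def lead_times(detections: List[int], failure_index: int,
               timestamps: List[float]) -> List[float]:
    """Time between each detection and the later failure, in the timestamp unit."""
    return [timestamps[failure_index] - timestamps[t]
            for t in detections if t < failure_index]


# Hypothetical inferred probabilities for one failure type, one value per interval.
probs = [0.05, 0.20, 0.30, 0.10, 0.25, 0.40, 0.50, 0.64]
# Hypothetical interval start times in hours (intervals are not uniform).
hours = [0.0, 1.0, 2.0, 4.0, 5.0, 6.0, 8.0, 9.0]

hits = detect_runs_in_cr(probs, lower=0.175, upper=0.64, min_points=2)
print(hits, lead_times(hits, failure_index=7, timestamps=hours))
```

Raising `min_points` reduces the number of predictions at the cost of shorter lead times, which is the trade-off behind the [Min; Max] range of rules in Table 2.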


Fig. 9. Proof of concept for rule(s) extraction on failure

As the final task of this step, failures are predicted using the rule(s) on the BNL&RE training dataset, and the lead time of each prediction is computed and categorized as early and/or late detection. Every retrieved prediction is associated with its own lead time. A selection of results and examples of the information obtained are summarized in Table 3 below. Rules without late detection are those that appear only once between failure intervals. The Failure c rule made 15 predictions, but 2 missed failure occurrences are recorded. The loopback for refining patterns and rules could not be completed in this case study.

Table 3. Example of results and information for rule(s) testing
Failure | CR | Total Occ. | Rule(s) | Total Pred. | Earliest detection [min max] | Latest detection [min max]
Failure a | [0.175;0.64] | 7 | 2 cons. Points in CR | 47 | [7h28m 513h15m] | [0h02m 0h26m]
Failure a | [0.175;0.64] | 7 | 4 cons. Points in CR | 12 | [3h26m 478h37m] | [1h46m 10h30m]
Failure a | [0.175;0.64] | 7 | 10 cons. Points in CR | 6 | [1h10m 5h09m] | -
Failure a | [0.175;0.64] | 7 | 22 cons. Points in CR | 3 | [0h28m 2h56m] | -
Failure b | [0.08;0.71] | 10 | Rising & 3 cons. Points in CR | 13 | [35h08m 135h33m] | [4h22m 5h30m]
Failure b | [0.08;0.71] | 10 | Rising & 5 cons. Points in CR | 10 | [5h33m 125h37m] | [1h29m 3h45m]
Failure b | [0.08;0.71] | 10 | Rising & 11 cons. Points in CR | 7 | [0h39m 1h48] | -
Failure c | [0.04;0.57] | 13 | Min Factor 5 W/M Pattern in CR | 15 (+ missed prediction) | [7h32m 78h55] | [4h22m 5h30m]

3.4 Step-4: Prediction of Failures using Rules on Validation Dataset and Computation of Predictability Index (PI)

Step-4 consists of validating the rule(s) from Step-3 on the Validation (V) dataset and computing the PI. However, in this case study, the final task of validation and computation of the predictability index (PI) was executed without user opinion. Some results are presented for five selected rules in Figure 10. High indices are obtained, corresponding to high prediction accuracy, reliability and time gain.

Fig. 10. Predictability Index calculated for chosen rules

3.5 Discussions

Several observations from Table 3 call for explanation. For Failure a, the rule "2 consecutive points in CR" yields a high number of predictions and very early detections. These are not very significant: they result from choosing a lower limit of 2 consecutive points, a condition that is easily met. Increasing this lower limit is recommended. Another significant remark concerns the missed predictions for the Failure c pattern. Even though the number of detections exceeds the total number of failure occurrences, missed predictions still appear, and we argue that another refinement is required, especially because Failure c is related to an out of control situation (unknown failure). Refining rules is not presented in this paper, but it is an important task in our methodology.

The concept of early and late prediction is important for finding the balance between under and over engineering. On one hand, early maintenance and repair can eliminate


potential unscheduled time caused by equipment failure, because early diagnosis and appropriate action plans can be activated. On the other hand, early interventions can cause unnecessary maintenance and resource costs. Therefore, a proper definition of lead time is needed for our methodology to operate in the most effective way. This is left to the end user, who can better judge the type of failure and the associated repair duration, and ultimately decide the target lead time for early failure prediction.

4. CONCLUSION

This paper presents a methodology for failure prediction using a BN approach, complemented with the extraction of rules for failure prediction and the computation of lead time and predictability index. Its advantage is the use of predictors coming from multiple data sources in a single prediction model. It successfully uses event-driven predictors as temporal characteristics to predict potential failures. The promising results of this offline prediction in the case study motivate extending it to real-time predictions and using machine learning algorithms for pattern/rule(s) extraction.

There are some limitations and potentials in our work. First, rules are extracted separately for each failure; multiple failure detections in the same interval are not treated in the proposed methodology. Second, the PI is an average of averages computed from prediction accuracy and precision plus time gain percentage. A sensitivity analysis is required to assure online prediction reliability. Third, we have only used the BN model for failure inference, when in fact it can also be used as a fault diagnosis tool. We consider these potentials as our future perspectives.

REFERENCES

Acid, S. and de Campos, L.M. (2003). Searching for bayesian network structures in the space of restricted acyclic partially directed graphs. Journal of Artificial Intelligence Research, 445–490.
Arroyo-Figueroa, G. and Sucar, L.E. (1999). A temporal bayesian network for diagnosis and prediction. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, 13–20. Morgan Kaufmann Publishers Inc.
Chickering, D.M. (2002). Learning equivalence classes of bayesian-network structures. The Journal of Machine Learning Research, 2, 445–498.
das Chagas Moura, M., Zio, E., Lins, I.D., and Droguett, E. (2011). Failure and reliability prediction by support vector machines regression of time series data. Reliability Engineering & System Safety, 96(11), 1527–1534.
Gallagher, N.B., Wise, B.M., Butler, S.W., White, D., and Barna, G.G. (1997). Development and benchmarking of multivariate statistical process control tools for a semiconductor etch process: improving robustness through model updating. In Proc. ADCHEM, volume 97, 78–83.
Glover, F. (1986). Future paths for integer programming and links to artificial intelligence. Computers & operations research, 13(5), 533–549.
Haddad, G., Sandborn, P.A., and Pecht, M.G. (2012). An options approach for decision support of systems with prognostic capabilities. Reliability, IEEE Transactions on, 61(4), 872–883.
Kobbacy, K.A., Vadera, S., McNaught, K., and Chan, A. (2011). Bayesian networks in manufacturing. Journal of Manufacturing Technology Management, 22(6), 734–747.
Lam, W. and Bacchus, F. (1994). Learning bayesian belief networks: An approach based on the mdl principle. Computational intelligence, 10(3), 269–293.
Li, Z., Zhou, S., Choubey, S., and Sievenpiper, C. (2007). Failure event prediction using the cox proportional hazard model driven by frequent failure signatures. IIE transactions, 39(3), 303–315.
Lu, S., Tu, Y.C., and Lu, H. (2007). Predictive condition-based maintenance for continuously deteriorating systems. Quality and Reliability Engineering International, 23(1), 71–81.
Margaritis, D. (2003). Learning Bayesian network model structure from data. Ph.D. thesis, US Army.
Munteanu, P. and Bendou, M. (2001). The eq framework for learning equivalence classes of bayesian networks. In Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on, 417–424. IEEE.
Salfner, F. (2005). Predicting failures with hidden markov models. In Proceedings of 5th European Dependable Computing Conference (EDCC-5), 41–46.
Samah, A.A., Shahzad, M.K., Zamaï, E., and Hubac, S. (2014). Methodology for integrated failure-cause diagnosis with bayesian approach: Application to semiconductor manufacturing equipment. In Second European Conference of the Prognostics and Health Management Society 2014.
Susto, G.A., Pampuri, S., Schirru, A., De Nicolao, G., McLoone, S., and Beghi, A. (2012). Automatic control and machine learning for semiconductor manufacturing: Review and challenges. In Proceedings of the 10th European Workshop on Advanced Control and Diagnosis (ACD 2012).
Susto, G.A., Schirru, A., Pampuri, S., Pagano, D., McLoone, S., and Beghi, A. (2013). A predictive maintenance system for integral type faults based on support vector machines: An application to ion implantation. In Automation Science and Engineering (CASE), 2013 IEEE International Conference on, 195–200. IEEE.
Teyssier, M. and Koller, D. (2012). Ordering-based search: A simple and effective algorithm for learning bayesian networks. arXiv preprint arXiv:1207.1429.
Vrignat, P., Avila, M., Duculty, F., and Kratz, F. (2015). Failure event prediction using hidden markov model approaches.
Yang, Z.M., Djurdjanovic, D., and Ni, J. (2008). Maintenance scheduling in manufacturing systems based on predicted machine degradation. Journal of Intelligent Manufacturing, 19(1), 87–98.
Zhou, Z.J., Hu, C.H., Xu, D.L., Chen, M.Y., and Zhou, D.H. (2010). A model for real-time failure prognosis based on hidden markov model and belief rule base. European Journal of Operational Research, 207(1), 269–283.
