
IIE Transactions (2011) 43, 647–660
Copyright © 2011 "IIE"
ISSN: 0740-817X print / 1545-8830 online
DOI: 10.1080/0740817X.2010.546385

Event log modeling and analysis for system failure prediction

YUAN YUAN1, SHIYU ZHOU1,∗, CRISPIAN SIEVENPIPER2, KAMAL MANNAR2 and YIBIN ZHENG2

1 Department of Industrial and Systems Engineering, University of Wisconsin–Madison, Madison, WI 53706, USA
E-mail: szhou@engr.wisc.edu
2 GE Healthcare, Pewaukee, WI 53072, USA

Received August 2009 and accepted November 2010

Event logs, commonly available in modern mechatronic systems, contain rich information on the operating status and working
conditions of the system. This article proposes a method to build a statistical model using event logs for system failure prediction. To
achieve the best prediction performance, prescreening and statistical variable selection are adopted to select the best set of predictor
events, coded as covariates in the statistical model. In-depth discussion of the prediction power of the model in terms of false alarm
and misdetection probability is presented. Using a real-world example, the effectiveness of the proposed method is further confirmed.
Keywords: Event logs, Cox proportional hazard model, variable selection, prediction power

1. Introduction

The rapid developments that have occurred in information technology have created the ability to automatically collect data on the events that occur in a mechatronic system while the system is in use. For example, manufacturers have implemented data acquisition and transmission systems that collect system event logs from installed medical diagnostic imaging systems. The events recorded in the event logs are related to various machine activities, critical system failures, operator/user actions, task status, etc. Figure 1 illustrates a simplified event sequence from a Computed Tomography (CT) machine that contains 18 events of four different types. In this figure, K represents a system failure, such as the "scan abort" in CT machines, and A, B, and C represent other system event types. For example, event A could indicate that the temperature at a location in the machine is above a certain level and event B could indicate a communication error within the machine. The occurrences of events have been marked along a timeline. In practice, a large number of events are often recorded. For instance, the event log of a typical CT machine during a 1-month period could contain more than 1,000,000 events within 200 different types.

It is generally believed that event logs can act as a rich information source about the system's working conditions and that they can be used for condition monitoring, diagnosis, and maintenance decision making. For example, a faulty detector in a CT machine will eventually lead to a scan abort failure. However, before its total failure, a faulty detector can cause a series of other events such as an analog-to-digital converter error, communication error, or software error. By observing these preceding events (subsequently called predictor events), we can predict that the key failure event is about to occur. With accurate failure prediction, preventive maintenance can be conducted to reduce unexpected machine downtime and maintenance costs. Thus, it is highly desirable to develop a modeling and analysis methodology for event logs that enables the accurate (in a statistical sense) prediction of the occurrence of failure events.

In order to achieve this objective, a critical step is to establish a rigorous mathematical model to describe the relationships between the failure events and other events in the event log. Formally, an event sequence S is a triplet (Ts, Te, s) defined on a set of events E, where Ts and Te are the start and end times of the sequence, respectively, and s = <(E1, t1), (E2, t2), ..., (Em, tm)> is an ordered sequence of events such that Ei ∈ E for all i = 1, 2, ..., m and ti is the occurrence time of Ei, with Ts ≤ t1 ≤ ··· ≤ tm ≤ Te. The problem of predicting failure occurrences can then be formulated as follows: given an event sequence S containing, among others, occurrences of the failure event K, how do we construct a mathematical model to predict the occurrence of the failure event K?

Techniques to predict the failure event(s) based on the analysis of event sequence data have been proposed in the literature. These methods can be roughly classified into design-based methods and data-driven rule-based methods. In the design-based methods, the expected event sequence is obtained from the system design and it is compared with the observed event sequence. The system failure is identified

∗Corresponding author


Fig. 1. A simplified example of an event log.

or predicted based on the results of such a comparison. In design-based methods, untimed and timed automata (Sampath et al., 1995; Chen and Provan, 1997; Sampath et al., 1998; Contant et al., 2004, 2006), Petri net models (Cardoso et al., 1991, 1999; Srinivasan and Jafari, 1993; Holloway et al., 1997; Zhou and Venkatesh, 1999), and time template models (Holloway and Chand, 1994, 1996; Holloway, 1995; Das and Holloway, 2000) are used to describe the temporal relationships among events. However, in many cases, the occurrence of events is random and no predefined knowledge about temporal relationships is available. Thus, design-based methods cannot be applied to these cases. The data-driven rule-based methods do not require system design information. Instead, these methods use data-based objective measures, such as support (Mannila et al., 1997; Klemettinen, Mannila and Verkamo, 1999), confidence (Agrawal et al., 1993), recall (Lin et al., 2002), lift (Chamberlin, 1996), Laplace (Clark and Boswell, 1991), odds ratio (Tan and Kumar, 2000), information gain (Church and Hanks, 1989), conviction (Brin et al., 1997), and Piatetsky–Shapiro (Piatetsky–Shapiro and Matheus, 1991), to evaluate the temporal associations between events and then identify interesting events to build up association rules or classification rules to predict future events (Klemettinen, 1999; Klemettinen, Mannila and Toivonen, 1999; Xiao and Dunham, 2001; Dunham, 2003; Li et al., 2003). However, the rule-based methods also have certain limitations for use in event prediction. For example, it is difficult for these rule-based methods to incorporate information on the time interval between events. Moreover, adjusting the prediction results if multiple predictor events occur simultaneously in a given time period is also a very challenging task.

Recently, Li et al. (2007) proposed a method of using the Cox Proportional Hazard (PH) model to describe event logs. The Cox PH model is a survival model that describes the relationship between the system survival probability at a certain time (i.e., the probability of no failures) and the influential factors in the system. The key idea is to use past events as time-varying covariates (i.e., the predictors) in the Cox PH model to quantify the association between past events and a current failure event. The case study in their paper illustrates that this idea is quite promising. However, several important issues remain unsolved in their paper. First, the variable selection step, which is a critical step in model fitting for a large-scale data set, is done through a prescreening step that is based on a simple association measure of events. No systematic statistical variable selection procedure is used. Thus, the method might encounter difficulties when applied to a large-scale data set. Second, in their work, as in most of the existing literature on applications of the Cox model, the main focus is on how to fit the Cox model and how to evaluate the impact of covariate factors on system survival. Very few papers in the literature have investigated the prediction of an individual's survival time using the Cox model, which is a very relevant problem in maintenance policy making and terminal-stage patients' survival time analysis. In the existing literature, to the best of our knowledge, only Henderson et al. (2001) and Henderson and Keiding (2005) have investigated point and interval predictions for an individual's survival time of last-stage cancer patients, based on simple Cox models with time-independent covariates. However, the results of these papers cannot be applied to the case of the Cox model used to fit data from event logs because the covariates of this model are time dependent. Furthermore, the performance of the prediction, which is very important in determining the maintenance policy, has not been comprehensively studied.

Considering these research gaps, in this article we propose a systematic method to predict individual failures based on a Cox model fitted to event log data. In the model fitting, a heuristic measure, called the Adjusted Relative Frequency (ARF), is proposed to quantify the association between the failure event and the predictor events. ARF is more comprehensive than the association measure used by Li et al. (2007), since it incorporates measures of both "confidence" and "recall" of the association. The variables are first pre-screened using ARF, and then a statistical variable selection step is conducted using a recently developed optimization technique, a parallel genetic algorithm (Zhu and Chipman, 2006). With these two steps, the significant predictor events can be identified and the dimension of the problem can be significantly reduced. After building the Cox model for the event logs, and based on the specific needs of failure prediction in maintenance decision making, we propose to use the qth quantile residual life as the point prediction of the failure time. We also specify that the maintenance operations will occur in a period of time around this prediction. The probabilities that a failure happens outside this period are systematically evaluated, which will enable service engineers to choose optimal maintenance schemes.

The rest of this article is organized as follows. In Section 2, the model building procedure is introduced. Section 3 discusses the prediction scheme and the prediction power of the proposed method. We illustrate the effectiveness of the proposed method through a real-world example in Section 4. Finally, conclusions are drawn in Section 5.

2. Building a Cox PH model using event logs

2.1. Assumptions

For an event sequence S as described in the previous section, the event space E can be further divided into different sets: Failure Events (FEs) that record the system breakdowns, Predictor Events (PEs) that are significant indicators of system failure and thus should be included in the statistical models as covariates, and other events that do not belong to the above two sets. For the sake of simplicity, we make the following assumptions regarding event logs.

1. We assume there is only one event in the FE set, denoted as K, and the corresponding occurrence time is tK. This is not a restrictive assumption because, for a system with multiple types of failures, we can analyze different failure types individually and build a prediction model for each of them.
2. After each failure, the system is restored to an as-new state. Therefore, failures are considered to be independent of each other.
3. An event is considered to be closely related to the FE if it occurs with high probability within a short period of time (called a time window in this article) before the occurrence of the FE. This is a reasonable assumption for most physical systems.
4. An event sequence may contain zero or multiple FEs. Since failures are independent of each other, we can view an event sequence with multiple failures as multiple shorter sequences with only one failure at the end. Thus, in the following, we assume that all of the event sequences contain one failure or no failure. If an event sequence does not contain a failure, then we say it is censored.

In practice, the FE is usually given but the PEs need to be selected among a large number of non-failure events. If the set of PEs and the quantitative relationship between the PEs and the FE are known, we can predict the occurrences of the FE based on the occurrences of the PEs. Therefore, based on the event log data, we need to (i) identify the set of PEs; (ii) build a quantitative model between the events in the PE set and the FE; and (iii) predict the occurrences of failure based on the quantitative model. To achieve these goals, three steps are developed: (i) variable prescreening based on ARF; (ii) statistical variable selection; and (iii) model-based failure prediction. The technical details of the first two steps are presented in Section 2.2 and Section 2.3, respectively. The third step is discussed in detail in Section 3.

2.2. Prescreening of variables using ARF

Event logs often contain many event types. For example, a typical event log from a subsystem of a CT machine often consists of more than several hundred different types of events. It is difficult to directly apply conventional statistical variable selection schemes to this raw data set due to its high dimensionality. Thus, a prescreening step is needed to reduce the number of PEs. To achieve this goal, an appropriate association measure between the PEs and the FE is needed.

In the data mining literature, numerous association measures have been proposed. Most of these measures are expressed as a function of the probability of the occurrence of an event or of the conditional probability of the occurrence of an event. McGarry (2005) has presented a comprehensive survey on association measures in data mining, and Geng and Hamilton (2006) extended this work by including an analysis of the properties of these measures. Intuitively, for a PE A to be included in the prediction model for failure event K, it should satisfy the following properties: (i) P(K|A), i.e., the probability that if A occurs, then K occurs, should be high (this measure is also called confidence in the literature); (ii) P(A|K), i.e., the probability that if K occurs, then A occurs, should be high as well (this measure is also called recall in the literature); and (iii) to have enough samples in the model fitting later on, the occurrences of A should not be too rare in the data set.

Based on these properties, we propose a heuristic indicator, called the ARF, which combines the confidence and recall properties to measure the association between a given event and the FE. To clearly explain the concept of ARF, we introduce the symbols listed in Table 1 to represent different types of event sequences. In the table, the symbol E(Si) represents the set of all event types in Si, and Δt is the size of the time window before the FE. We assume Δt is given based on physical knowledge of the system. Based on the temporal relationship between the occurrence of the given event Ej (e.g., event "D" in the table) and the FE K, the event sequences are classified into five different sets, represented by F0j, F1j, F2j, G0j, and G1j, respectively.

Using the simple notations in Table 1, we can graphically represent the associations between the FE and a given event in multiple event sequences. Figure 2 illustrates 18 event sequences (11 with a failure at the end and seven without a failure) for four different given events A, B, C, and D. Please note that all the diagrams in Fig. 2 consider the same 18 event sequences. This graphical representation can provide us with insights into the association between a given event and the FE. For example, as shown in Fig. 2, event A has a low confidence but a high recall for the FE K; event B has
Table 1. Five types of event sequences for a given event Ej

Explanation                                     Set symbol
K ∈ E(Si) but D ∉ E(Si)                         F0j
K ∈ E(Si), D ∈ E(Si), but tD < tK − Δt          F1j
K ∈ E(Si), D ∈ E(Si), and tK > tD > tK − Δt     F2j
K ∉ E(Si) and D ∉ E(Si)                         G0j
K ∉ E(Si) but D ∈ E(Si)                         G1j

both a high confidence and a high recall; event C has both a low confidence and a low recall; and event D has a high confidence but a low recall. It is worth mentioning that most existing association measures, as summarized in Geng and Hamilton (2006), increase monotonically as the frequency of events increases and hence will miss type D events and might sometimes classify type A events as good indicators. To take both confidence and recall into consideration, we propose the following heuristic measure of the association between the FE and a given event Ej.

Definition 1. Given a set of event sequences and a non-failure event Ej, the ARF of the occurrences of Ej with respect to failure K is defined as

r(Ej, K) = ψ|F2j| / (|F2j| + |F1j| + |G1j|) + (1 − ψ) I(|F2j| / (|F0j| + |F1j| + |F2j|) > c),   (1)

where |·| gives the number of elements in a set; the sets F0j, F1j, F2j, and G1j are defined as in Table 1; 0 ≤ ψ ≤ 1; and I(·) is the indicator function.

The first term of the ARF measures the confidence of event Ej with respect to failure. Clearly, for any Ej, the first term lies in [0, 1]. For frequent events with a low prediction power, such as type A events in Fig. 2(a), this value will be low because the term |G1j| will be large in the denominator. The second term reflects the percentage of failure events that are "captured" by the event Ej within the time window, i.e., the recall of event Ej. Thus, if the recall is less than a threshold value c, the second term in the ARF will be zero. For a type D event, as long as its recall achieves a certain level, say c, its ARF value will be determined by its confidence and hence will remain high. The purpose of incorporating recall in the ARF through a threshold value is to ensure that very rare spontaneous events, which should not be included in the model fitting because of their small sample size, will have a low ARF value.
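The classification of Table 1 and the ARF of Equation (1) can be sketched in a few lines of code. The toy sequences, window size, and parameter values below are illustrative only, not data from the paper's case study:

```python
def classify(sequences, ej, dt, failure="K"):
    """Count the sequence sets of Table 1 (F0j, F1j, F2j, G1j) for event type ej.

    Each sequence is a list of (event_type, time) pairs; `failure` marks the FE.
    """
    f0 = f1 = f2 = g1 = 0
    for s in sequences:
        t_k = next((t for e, t in s if e == failure), None)
        ej_times = [t for e, t in s if e == ej]
        if t_k is not None:
            if not ej_times:
                f0 += 1                                   # failure, Ej absent
            elif any(t_k - dt < t < t_k for t in ej_times):
                f2 += 1                                   # Ej inside the window before K
            else:
                f1 += 1                                   # Ej present, outside the window
        elif ej_times:
            g1 += 1                                       # censored sequence containing Ej
    return f0, f1, f2, g1


def arf(sequences, ej, dt, psi=0.5, c=0.15):
    """Adjusted Relative Frequency r(Ej, K) of Equation (1)."""
    f0, f1, f2, g1 = classify(sequences, ej, dt)
    confidence = f2 / (f2 + f1 + g1) if (f2 + f1 + g1) else 0.0
    recall = f2 / (f0 + f1 + f2) if (f0 + f1 + f2) else 0.0
    return psi * confidence + (1 - psi) * (1.0 if recall > c else 0.0)


# Toy data: three sequences ending in failure K and one censored sequence; Δt = 5.
seqs = [
    [("A", 1), ("D", 8), ("K", 10)],   # D inside the window  -> F2j
    [("D", 1), ("K", 20)],             # D outside the window -> F1j
    [("A", 2), ("K", 9)],              # no D                 -> F0j
    [("A", 3), ("D", 4)],              # censored, D occurs   -> G1j
]
print(classify(seqs, "D", 5), round(arf(seqs, "D", 5), 3))  # → (1, 1, 1, 1) 0.667
```

The defaults psi=0.5 and c=0.15 follow the qualitative parameter guidelines given in this section; in practice both would be tuned to the data set at hand.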
Fig. 2. Associations between four types of events and the FE: (a) for event A; (b) for event B; (c) for event C; and (d) for event D.

From this discussion, it is clear that ARF satisfies the three desired properties of a good association measure listed at the beginning of this section. With ARF, we can

conduct a prescreening operation and select the PE set as

PE = {Ej | Ej ∈ E, r(Ej, K) > rc},   (2)

where rc is a pre-selected threshold value.

To implement the prescreening procedure based on ARF, three parameters, ψ, c, and rc, need to be determined. It is difficult to give a fixed rule for deciding the values of these parameters in general scenarios. As heuristic parameters, they are often determined based on the data set at hand. Based on our experience, we have created the following qualitative guidelines for the selection of these parameters.

1. Parameter c, with a range from zero to one, gives the minimum recall constraint for a good predictor. If c is very small, the second term I(|F2j|/(|F0j| + |F1j| + |F2j|) > c) will be one for most events, which means only confidence is considered. In this case, very rare spontaneous events could be mistakenly viewed as good PEs. For example, if an event occurs only once in the data set but happens to be within the time window before the FE, then its confidence will be one and thus it will be viewed as a good PE. On the other hand, if c is very large, only those events that capture a very large portion of the failures will have a non-zero value of the second term. In this case, events with a relatively low occurrence frequency but high confidence might be mistakenly eliminated. Usually, for a data set containing hundreds of event sequences, we choose c to be a moderate value, say 15%, in order to strike a balance between keeping events with good confidence but relatively low occurrence frequency and avoiding very rare spontaneous events.
2. Parameter ψ is the weight coefficient, with a range from zero to one. A low value of ψ will weaken the power of ARF in PE selection, whereas a high value of ψ will result in over-fitting. In practice, ψ can be set to 0.5 to place equal weights on both terms.
3. The threshold rc controls the total number of PEs that are chosen. Generally, the choice of rc is determined by the characteristics of the occurrence of events and the specific system requirements.

2.3. Statistical model fitting and variable selection

In this article, we adopt the Cox PH model to describe the temporal relationship between the PEs and the FE, as in Li et al. (2007). The basic idea is as follows. Denote the time to failure as a random variable T. Then the hazard function, also called the conditional failure rate function, is defined as

h(t) = lim_{Δt→0} Pr(t ≤ T < t + Δt | T ≥ t)/Δt = f(t)/S(t),   (3)

where f(t) is the probability density function of T and S(t) = Pr(T > t) is the survival function of T. The Cox PH model (Cox, 1972; Klein and Moeschberger, 2003) is a semi-parametric model that can be used to describe the relationship between the system hazard function and the covariates as

h(t | Z(t)) = h0(t) exp(γ^T Z(t)) = h0(t) exp(Σ_{k=1}^{p} γk Zk(t)).   (4)

The function h(t | Z(t)) is the hazard function and h0(t) is the baseline hazard rate function. The vector γ is the coefficient vector and Z(t) = (Z1(t), Z2(t), ..., Zp(t))^T is the covariate vector. With this hazard model, we can obtain the survival function as

S(t | Z(t)) = exp(−∫_0^t h0(u) exp(Σ_{k=1}^{p} γk Zk(u)) du).   (5)

In the event log modeling, a PE can be encoded in the Cox model as a covariate. For example, given the event sequence shown in Fig. 1, the event A can be coded as a time-dependent covariate as

ZA(t) = { 0, 0 ≤ t < the occurrence time of A,
        { 1, t ≥ the occurrence time of A.   (6)

In this way, the relationship between predictor event A and the occurrence of the key FE can be quantified. Since the issues associated with the fitting of the Cox PH model have been extensively discussed in Li et al. (2007), we omit the details here.

In the prescreening step, a much reduced PE set is selected. However, heuristic association measures cannot guarantee the statistical significance of the PEs. In addition, many events in the PE set may be correlated, which can impact the Cox model fitting and make the prediction unreliable. Therefore, a rigorous statistical variable selection step is necessary. Variable selection is a well-studied field. The variable selection techniques can be mainly classified into two types. One is based on comparisons of a series of models according to various criteria, including the Residual Sum of Squares (RSS), the Akaike Information Criterion (AIC), the Bayesian Information Criterion (Seber and Lee, 2003), Cp, prediction error, etc. The other is shrinkage-based variable selection, such as ridge regression (Seber and Lee, 2003), the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani, 1996), and Least Angle Regression (LARS) (Efron et al., 2004). Considering the computational efficiency and the preference for building a model with good prediction power, we select the AIC as the criterion in the variable selection step. Stone (1977) proved the asymptotic equivalence of model choice by AIC and by cross-validation (one type of estimate of prediction error). Other reasons for using AIC are that it requires no assumptions on the covariates and that it can be easily calculated from the partial likelihood of the Cox model estimation. It is known that AIC tends to select more variables and over-fit the data. To address this problem, we apply a recently developed method, the AIC-based Parallel Genetic Algorithm (PGA; Zhu and Chipman, 2006). The PGA has

been shown to avoid over-fitting and is able to identify the best subset of variables efficiently. The details of the PGA algorithm used in this article can be found in Appendix A. The real-world case study shows that the PGA is a quite efficient algorithm.

3. Model-based failure prediction and evaluation

Based on a given Cox PH model as in Equation (5) and a sequence of PEs encoded in Z(t), the conditional survival function and the mean residual life of the system can be estimated. The mean residual life is often used in the literature as a point prediction of the failure. However, in most maintenance services, the maintenance operations are conducted over a period of time. For example, it usually takes a week to perform a major maintenance service for a CT machine. In maintenance decision making, managers are very interested in the probability of failure relative to the maintenance period. Specifically, in order to evaluate the cost of a certain maintenance policy, they would like to evaluate the probability of a failure occurring before the maintenance period (which determines the cost of unexpected failure) and after the maintenance period (which determines the cost of unnecessary maintenance). Targeting these specific needs in failure prediction for maintenance decision making, we investigate the following failure prediction scheme. In this scheme, we predict that the failure will occur at the qth quantile residual life of the system, denoted as tq, and the maintenance occurs in a period denoted as [tq − tl, tq + th], where the period th + tl is the pre-specified maintenance period. In Section 3.1, the concept of tq is introduced. In Section 3.2, the quantitative evaluation of the misdetection rate (the probability of failure before the maintenance period) and the false alarm rate (the probability of failure after the maintenance period) of this maintenance scheme is presented.

3.1. Failure prediction

Let T be the random variable denoting the survival time, f(t) be the density function of T, and F(t) be the corresponding cumulative distribution function. Based on Equation (5), the survival function of the system can be written as

Pr(T > t | t∗, Z∗(t)) ≡ S(t | t∗, Z∗(t)) = S(t | Z∗(t)) / S(t∗ | Z∗(t))
  = exp(−∫_0^t h0(u) exp(γ^T Z∗(u)) du) / exp(−∫_0^{t∗} h0(u) exp(γ^T Z∗(u)) du).   (7)

In this expression, t∗ is the time instant at which the prediction is made and Z∗(t) is the covariate function that encodes the known PE sequence until t∗. Specifically, Z∗(t) is defined as Z∗(t) = Z(t) for t ≤ t∗ and Z∗(t) = Z(t∗) for t > t∗. Since Z∗(t) does not change after t∗, a straightforward derivation shows that the survival function in Equation (7) can be simplified to

S(t | t∗, Z∗(t)) = exp(−∫_{t∗}^t h0(u) exp(γ^T Z(t∗)) du) = (S0(t) / S0(t∗))^{exp(γ^T Z(t∗))},   (8)

where S0(t) = exp(−∫_0^t h0(u) du) is the baseline survival function.

The mean residual life of the system is

E(T − t∗ | T > t∗, Z∗(t)) = ∫_{t∗}^∞ (u − t∗) f(u | t∗, Z∗(t)) du,   (9)

where f(t | t∗, Z∗(t)) can be computed as −dS(t | t∗, Z∗(t))/dt. It should be noted that the survival function and the mean residual life given in Equation (8) and Equation (9), respectively, are conditioned on a known sequence of PEs that occurred before t∗.

Naturally, the mean residual life of the system can be used as a point prediction of the occurrence time of the system failure at a given time instant t∗. However, for a given survival function and a time instant, the mean residual life is fixed. In many practical scenarios, managers might want to have some flexibility in the prediction so that the prediction error can be adjusted. Furthermore, when the distribution of the survival time is highly skewed, the mean residual life will not be a good point prediction. Therefore, instead of using the mean residual life, we propose to use the qth quantile residual life as the point prediction of failure. Given a survival function S(t | t∗, Z∗(t)), the qth quantile residual life of the system at time t∗ (Klein and Moeschberger, 2003) is defined as

tq = inf{t : S(t | t∗, Z∗(t)) ≤ 1 − q}.   (10)

As shown in Fig. 3, tq is the time instant at which the survival function S(t | t∗, Z∗(t)) drops below 1 − q, which means that the probability that the system survives after tq is less than 1 − q. If the baseline survival function is monotonic, which is common in practice, then tq can be computed as follows:

tq = inf{t : (S0(t) / S0(t∗))^{exp(γ^T Z(t∗))} ≤ 1 − q}
   = inf{t : S0(t) ≤ (1 − q)^{exp(−γ^T Z(t∗))} × S0(t∗)}.   (11)

Since S0(t) is monotonically decreasing and its inverse function S0^{−1} exists, tq can be expressed as

tq = {t : S0(t) = (1 − q)^{exp(−γ^T Z(t∗))} × S0(t∗)},   (12)

and thus

tq = S0^{−1}(S0(t∗) × (1 − q)^{exp(−γ^T Z(t∗))}).   (13)
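Equation (13) yields tq in closed form once a baseline survival function is fixed. The sketch below assumes a Weibull baseline S0(t) = exp(−(t/λ)^k), so that S0 is invertible; the parameters λ, k and the covariate term γ^T Z(t∗) are illustrative assumptions, not values from the paper:

```python
import math

def t_q(q, t_star, gamma_z, lam=100.0, k=1.5):
    """qth quantile residual life via Equation (13), assuming a Weibull
    baseline S0(t) = exp(-(t/lam)**k) with inverse
    S0_inv(s) = lam * (-log(s))**(1/k) (illustrative choice)."""
    s0 = lambda t: math.exp(-(t / lam) ** k)
    s0_inv = lambda s: lam * (-math.log(s)) ** (1.0 / k)
    # Target baseline survival level from Equation (13):
    target = s0(t_star) * (1.0 - q) ** math.exp(-gamma_z)
    return s0_inv(target)

# Prediction made at t* = 50 with an active PE contributing gamma^T Z(t*) = 0.8.
tq = t_q(q=0.5, t_star=50.0, gamma_z=0.8)
print(round(tq, 2))
```

As a sanity check, plugging tq back into Equation (8) gives S(tq | t∗, Z∗(t)) = 1 − q, and a larger covariate term (a failure-accelerating PE) moves tq earlier.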

Fig. 3. The qth quantile residual life.

Several points need to be mentioned regarding tq . The β, is defined as


value of tq is a conditional value instead of a marginal
value. It is conditioned on no failure occurring before t ∗ , β = Pr(T < tq − tl ). (15)
the PE sequence before t ∗ is known, and Z∗ (t) for t > t ∗
is fixed. The last assumption is reasonable for conditional The intuition behind these two definitions is quite simple:
prediction tq and is referred to as the nonanticipativity rules, the value α is the probability that the system fails after
which means when we make decisions at t ∗ , we can only the valid maintenance time period, and thus the prediction
utilize the information known until t ∗ . On the other hand, generates a false alarm; the value of β is the probability that
if we consider the non-conditional values of tq with respect the system fails before the maintenance period, and thus
to Z(t) for t > t∗, i.e., we treat Z(t) as a random function, we need to get the joint distribution of all PEs, which is quite difficult in practice. Therefore, using a conditional tq is more tractable and, furthermore, tq is clearly a function of the q value and thus provides more flexibility in the prediction.

Based on the concept of quantile residual life and a given length tl + th, the maintenance period is [tq − tl, tq + th], where tq − tl and tq + th are the lower bound and the upper bound of the maintenance period, respectively. Clearly, the values of tl and th are not fixed if only the constraint on tl + th is given. In the following discussion, we leave tl and th as independent variables and the conclusions will thus be valid for any values of tl and th. Using the technique developed in the next section, managers can find the optimal values of tl and th that minimize the maintenance cost.

3.2. Evaluating the probability of a failure occurring outside the maintenance period

Based on the maintenance period, we define the false alarm rate and the misdetection rate for this period as follows:

Definition 2. Given the maintenance period [tq − tl, tq + th] and the system failure time T, the false alarm rate, denoted as α, is defined as

α = Pr(T > tq + th). (14)

Definition 3. Given the maintenance period [tq − tl, tq + th] and the system failure time T, the misdetection rate, denoted as β, is defined as

β = Pr(T < tq − tl). (15)

If the failure occurs before the lower bound of the maintenance period, we miss the failure in the prediction. If the system survival function is given, we could compute the False Alarm Rate (FAR) and MisDetection Rate (MDR) using the following lemma.

Lemma 1. If the system survival function S(t | t∗, Z∗(t)) is given as in Equation (7) and the qth quantile residual life tq is given as in Equation (13), then the FAR and MDR of the maintenance period [tq − tl, tq + th] given at time instant t∗ are

α∗ = [S0(tq + th)/S0(t∗)]^exp(γᵀZ(t∗)),
β∗ = 0, when tl ≥ tq,
β∗ = 1 − [S0(tq − tl)/S0(t∗)]^exp(γᵀZ(t∗)), otherwise. (16)

The proof of this lemma is straightforward and is therefore omitted.
Clearly, the presented FAR and MDR are conditional values, being conditioned on T > t∗ with Z∗(t) fixed. From this lemma, we can obtain the following insights about how the FAR and MDR are influenced by the parameters q, γ, tl, and th.

1. For fixed tl, th, and q (tl ≠ 0, th ≠ 0, q ≠ 0), when γ increases, α∗ decreases whereas β∗ increases. This is because an event with a larger γ accelerates the failure to a larger extent, and thus results in more misdetections and fewer false alarms.
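As a concrete numerical illustration of Lemma 1 (our own sketch, not part of the paper), suppose the baseline survival function is exponential, S0(t) = e^(−λt). Then both the quantile residual life and the conditional error rates of Equation (16) have closed forms; `risk` below stands for exp(γᵀZ(t∗)):

```python
import math

def t_q(t_star, q, lam, risk):
    """qth quantile residual life when S0(t) = exp(-lam*t):
    solve (S0(tq)/S0(t_star))**risk = 1 - q for tq."""
    return t_star - math.log(1.0 - q) / (lam * risk)

def far_mdr(t_star, q, lam, risk, t_l, t_h):
    """Conditional FAR and MDR of Equation (16), exponential baseline."""
    tq = t_q(t_star, q, lam, risk)
    S = lambda t: (math.exp(-lam * t) / math.exp(-lam * t_star)) ** risk
    alpha = S(tq + t_h)                          # Pr(T > tq + th | T > t*)
    beta = 0.0 if t_l >= tq else 1.0 - S(tq - t_l)
    return alpha, beta

# With tl = th = 0 the window collapses to the single point tq, so the
# conditional errors must come out as alpha* = 1 - q and beta* = q.
a, b = far_mdr(t_star=0.5, q=0.8, lam=1.0, risk=2.0, t_l=0.0, t_h=0.0)
print(a, b)  # 0.2 and 0.8 up to floating-point rounding
```

Widening the window (tl, th > 0) drives both rates down, which is insight 3 of the lemma.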
2. For fixed tl, th, and γ (tl ≠ 0, th ≠ 0, γ ≠ 0), when q increases, α∗ decreases, whereas β∗ increases. This is not surprising because a large q yields late predictions, which results in more misdetections and fewer false alarms, and vice versa.
3. For fixed γ and q (γ ≠ 0, q ≠ 0), when th increases, α∗ decreases. When th = 0, α∗ reaches its maximum value of 1 − q. Similarly, when tl increases, β∗ decreases. When tl = 0, β∗ reaches its maximum value of q; when tl > tq, β∗ = 0.

These properties can be obtained by observing Equation (16) and noting that S0(t) is a monotonically decreasing function.
A point that needs to be emphasized about Equation (16) is that α∗ and β∗ are conditioned on the values of Z∗(t) = Z(t∗) for t > t∗. However, in practical cases, some PEs may happen after t∗; the system survival function will therefore be different, and thus the prediction performance will change. Therefore, in order to describe the prediction performance more precisely, it is desirable to evaluate the average values of α and β under randomly occurring PEs. Unfortunately, it is very difficult, if not impossible, to get a closed-form expression for the average α and β for a general case with an arbitrary number of PEs. However, for some simplified cases, a closed-form expression can be derived and insights can be obtained. In the following, we present the results for the average α and β under a simplified case, where (i) there is only one PE, whose occurrence time is exponentially distributed; and (ii) the baseline survival function is a simple exponential function. Although simple, this case can reflect the basic characteristics of certain systems and the results can provide some general insights into the average prediction performance. Strictly speaking, by specifying the baseline survival function, the model is no longer a semi-parametric model. However, we still keep the PH assumptions and this simple case can be viewed as a special case of the Cox PH model.

Lemma 2. Given a system with a survival function as in Equation (7) and S0(t) = e^(−λt), suppose the occurrence time of the PE follows the exponential distribution µe^(−µt) and its coefficient in the Cox model is γ. If the maintenance period at t∗ is given as [tq − tl, tq + th], where tq is given in Equation (13), then the average α and β are

α = { [(λ − λθ) ρ^((λ+µ)/λ) e^(−(λ+µ)th) + µ(ρ^θ − ρ) e^(−λθ th)] e^(−(λ+µ)t∗) + µρ e^(−λθ(t∗ + th)) } / D, (17)

β = 0, when tl ≥ −log ρ / λ;
β = [µ + λ − λθ − µρ^θ e^(λθ tl) − λ(1 − θ) ρ^((λ+µ)/λ) e^((µ+λ)tl)] e^(−(λ+µ)t∗) / D, when −log ρ/(λθ) < tl < −log ρ/λ;
β = { [λ − λθ − µ(ρ^θ − ρ) e^(λθ tl) − λ(1 − θ) ρ^((λ+µ)/λ) e^((µ+λ)tl)] e^(−(λ+µ)t∗) + µ(1 − ρ e^(λθ tl)) e^(−λθ t∗) } / D, otherwise, (18)

where θ = e^γ, ρ = 1 − q, and D = (λ − λθ) e^(−(λ+µ)t∗) + µ e^(−λθ t∗). The proof of this lemma can be found in Appendix B.
It can be seen that even for this simple case, the expressions of α and β are complicated. A close examination of these expressions reveals the following influence of the parameters q, γ, tl, and th on α and β.

1. For fixed tl, th, t∗, and q (tl ≠ 0, th ≠ 0, q ≠ 0), when the coefficient γ increases, α decreases, whereas β increases. The reason for this behavior is that if the PE occurs after t∗, the prediction made at t∗ is more likely to be later than the true failure time, which results in a bigger MDR and a smaller FAR.
2. For fixed tl, th, and γ (tl ≠ 0, th ≠ 0, γ ≠ 0), when t∗ increases, both α and β decrease. As t∗ increases, the information incorporated in the prediction increases. Therefore, the prediction is more precise.
3. The impact of tl and th on α and β with all other parameters fixed is similar to their impacts on α∗ and β∗ as mentioned in Lemma 1.

Figure 4 illustrates the typical average errors under the above simple case, where we set λ = 1, µ = 1, t∗ = 0.5, tl = 0.2, th = 0.2 and vary q from zero to one. For both the FAR and MDR, we show the results for three different values of γ: 0.5, 2, and 5. These results are obtained through numerical simulations and are clearly consistent with the three insights listed above. It should be noted that these results on α and β are based on the assumption that only one PE is in the system (i.e., one covariate in the Cox model). For cases with multiple PEs, there are two possible scenarios. In the first scenario, the occurrences of the PEs are independent of each other. In this scenario, although the expressions of α and β will become much more involved, the basic characteristics of the prediction performance in terms of α and β will be similar to the single-PE case because, in principle, we can consider each PE individually. In the second scenario, the occurrences of the PEs are correlated. This is a much more complicated case. Numerical studies might be the only way to obtain insights into the prediction performance in this case.
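Lemma 2 can be sanity-checked by Monte Carlo simulation. The sketch below (our own illustration with hypothetical parameter values) draws the PE time from an Exp(µ) distribution, gives the failure time hazard λ before the PE and λθ after it, conditions on survival past t∗, predicts tq exactly as in the lemma's setting, and compares the empirical false alarm rate against the closed form of Equation (17):

```python
import math, random

def alpha_eq17(lam, mu, theta, rho, t_star, t_h):
    """Average FAR of Equation (17); theta = exp(gamma), rho = 1 - q."""
    num = ((lam - lam * theta) * rho ** ((lam + mu) / lam)
           * math.exp(-(lam + mu) * t_h)
           + mu * (rho ** theta - rho) * math.exp(-lam * theta * t_h)) \
          * math.exp(-(lam + mu) * t_star) \
          + mu * rho * math.exp(-lam * theta * (t_star + t_h))
    den = (lam - lam * theta) * math.exp(-(lam + mu) * t_star) \
          + mu * math.exp(-lam * theta * t_star)
    return num / den

def simulate_alpha(lam, mu, theta, rho, t_star, t_h, n, seed=1):
    """Empirical FAR: PE time ~ Exp(mu); the failure hazard is lam before
    the PE and lam*theta after it; condition on surviving past t_star."""
    rng = random.Random(seed)
    false_alarms = total = 0
    while total < n:
        te = rng.expovariate(mu)
        u = 1.0 - rng.random()                 # uniform in (0, 1]
        if u >= math.exp(-lam * te):           # failure occurs before the PE
            t_fail = -math.log(u) / lam
        else:                                  # failure occurs after the PE
            t_fail = te + (-math.log(u) - lam * te) / (lam * theta)
        if t_fail <= t_star:                   # condition on T > t*
            continue
        total += 1
        # Prediction at t*: the hazard is lam*theta if the PE was already
        # observed by t*, and the baseline hazard lam otherwise.
        rate = lam * theta if te < t_star else lam
        tq = t_star - math.log(rho) / rate     # qth quantile residual life
        if t_fail > tq + t_h:
            false_alarms += 1
    return false_alarms / total

lam, mu, gamma, q, t_star, t_h = 1.0, 1.0, 0.5, 0.5, 0.5, 0.2
theta, rho = math.exp(gamma), 1.0 - q
print(alpha_eq17(lam, mu, theta, rho, t_star, t_h))    # about 0.349
print(simulate_alpha(lam, mu, theta, rho, t_star, t_h, 100000))
```

The two printed values should agree up to Monte Carlo noise; the same construction, with the window lower bound, checks Equation (18).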
Fig. 4. α and β under different γ .

4. Case study

The event logs used in this case study were collected from a medical imaging system. The data set had the following characteristics.

1. There were 923 log files corresponding to 923 systems, among which 608 contained at least one FE. The remaining 315 log files did not contain failures and thus were censored. Once a failure occurs, the failed component is replaced with a new one. Thus, the multiple failures within a system were treated as being independent. In total, we had 852 independent samples that experienced a failure and 315 samples that were right-censored.
2. There were 1931 distinct non-failure events in the data set. Most of them had no bearing on the failure. Except for some regular routine events, the occurrences of non-failure events were random.
3. The system survival time varied from less than 1 day to more than 700 days. The number of events in each event log was in the order of 10^4.

4.1. Fitting the Cox PH model

To build an effective prediction model, we first applied the proposed variable prescreening method to the real data set. Based on field knowledge, c was set to 0.15 and the window size to 10 days. We gave equal weights to the two terms in the ARF; therefore, ψ was 0.5. Hence, the ARF was specified as

r(Ej, K) = 0.5|F2j| / (|F2j| + |F1j| + |G1j|) + 0.5 I(|F2j| / (|F0j| + |F1j| + |F2j|) > 0.15). (19)

After calculating the ARF values of all 1931 events, we chose the top 50 events as the PE set. The results for the first eight events are shown in Table 2.

Table 2. ARF values of the top eight events

Event ID   ARF      Rank in bubble plot
A          0.9602   1
C          0.9312   3
F          0.9281   6
G          0.8728   7
B          0.8719   2
D          0.8666   4
J          0.854    21
L          0.8536   20

Then, we used the PGA to select the best subset. For the initial population we used a Bernoulli random variable generator with a success probability of 0.3 to generate 50 individuals. The crossover position and the mutation position were generated as discrete uniform random numbers within the range from one to 50. The mutation rate was chosen as 0.1. The obtained bubble plots for the main effects and interactions are shown in Fig. 5.
Figure 5 suggests that we should choose events A, B, C, D, F, and G for the main effects, which also have very high ARF values, and CF, AC, and BD as interactions. However, after model fitting, it was found that the interaction term BD is not statistically significant. This is not surprising because the statistical significance is determined by the relative reduction of the RSS, whereas the variable selection criterion AIC is influenced by the absolute reduction in the RSS. A covariate might provide enough absolute reduction in the RSS but not enough relative reduction. To make the final model compact and reliable for prediction, we eliminated BD from the final model. The estimated coefficients of the final model are listed in Table 3.
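The prescreening score in Equation (19) is just a function of four counts per candidate event. A sketch (the sets F0j, F1j, F2j, and G1j are defined in the earlier part of the paper, outside this excerpt; the counts below are hypothetical):

```python
def arf(f0, f1, f2, g1, psi=0.5, c=0.15):
    """ARF score of Equation (19) for one candidate event Ej.

    f0, f1, f2, g1 stand for the cardinalities |F0j|, |F1j|, |F2j|, |G1j|;
    psi weighs the two terms and c is the indicator threshold (0.5 and
    0.15, respectively, in the case study).
    """
    ratio_term = psi * f2 / (f2 + f1 + g1)
    indicator = 1.0 if f2 / (f0 + f1 + f2) > c else 0.0
    return ratio_term + (1.0 - psi) * indicator

# Hypothetical counts for one candidate event:
print(arf(f0=100, f1=10, f2=30, g1=10))  # 0.5*30/50 + 0.5 = 0.8
```

Ranking all candidate events by this score and keeping the top 50 reproduces the prescreening step described above.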
Fig. 5. Bubble plots.
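The PGA subset selection used above (population of 50, Bernoulli(0.3) initialization, mutation rate 0.1) follows the pseudocode of Appendix A. Below is a condensed generic skeleton; the AIC of a fitted Cox model is replaced by a placeholder fitness function, since model fitting itself is outside the scope of the listing:

```python
import random

def pga(num_events, fitness, m=50, L=4, N=20, p_init=0.3, p_mut=0.1, seed=0):
    """Parallel genetic algorithm sketch following Appendix A.

    An individual is a 0/1 inclusion vector over the candidate events;
    `fitness` maps an individual to a score to MINIMIZE (AIC in the paper,
    a placeholder here). Returns per-event selection frequencies averaged
    over the L final generations.
    """
    rng = random.Random(seed)
    def new_individual():
        return [1 if rng.random() < p_init else 0 for _ in range(num_events)]

    finals = []
    for _ in range(L):                               # L independent paths
        pop = [new_individual() for _ in range(m)]
        for _ in range(N):
            pop.sort(key=fitness)                    # smallest score first
            survivors = pop[: m // 2]                # the "survival pool"
            children = list(survivors)
            while len(children) < m:
                mom, dad = rng.sample(survivors, 2)
                cut = rng.randrange(1, num_events)   # one-point crossover
                child = mom[:cut] + dad[cut:]
                if rng.random() < p_mut:             # mutation
                    pos = rng.randrange(num_events)
                    child[pos] ^= 1
                children.append(child)
            pop = children
        finals.extend(pop)
    n = len(finals)
    return [sum(ind[j] for ind in finals) / n for j in range(num_events)]

# Toy fitness: pretend events 0 and 1 are the true predictors, and every
# extra event costs one unit, so the optimum is the subset {0, 1}.
freq = pga(10, fitness=lambda ind: -(3 * ind[0] + 2 * ind[1]) + sum(ind))
print(freq)  # events 0 and 1 should come out with frequencies near one
```

In the paper the averaged frequencies over parallel paths are what the bubble plots of Fig. 5 visualize.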

The Cox–Snell pseudo R-square of the model was 0.534, which indicates that the model closely fits the data. The estimated baseline survival function (setting the covariates to zero) is shown as the middle curve in Fig. 6. The upper and lower curves provide the 95% confidence bounds for this estimation.

Table 3. Summary of estimated coefficients

Covariate event   γ        exp(γ)   se(γ)    p-Value
A                 3.068    21.49    0.1308   0
C                 1.751    5.759    0.1061   0
F                 1.307    3.694    0.1682   8.00 × 10^−15
G                 1.027    2.793    0.0836   0
B                 1.756    5.79     0.0887   0
D                 0.802    2.229    0.0787   0
CF                −0.804   0.448    0.2026   7.30 × 10^−5
AC                −0.411   0.663    0.1932   3.30 × 10^−2

Fig. 6. Estimated baseline survival function. (Survival probability, 0.70–1.00, versus t in days, 0–700.)

4.2. Model-based failure prediction

As discussed in Section 3, given t∗ and the baseline survival function S0(t) of the prediction model, assuming the event sequence before t∗ is known, we can make a prediction and set the maintenance period as [tq − tl, tq + th]. For example, given day 0, the conditional survival function, which equals the baseline survival function, is shown in Fig. 7. Suppose we choose q = 0.8, and hence tq is 4041 days. If tl and th are set to 10 days, the maintenance period at day 0 will be [4031, 4051].
In practice, a prediction can be made after each PE occurs. For example, we first make a prediction at day 0. At day 26.82, event A happens; its coefficient in the Cox model is 3.068. We make a second prediction based on the conditional survival function in Equation (20):

S(t | 26.82, Z(26.82) = (1, 0, 0, 0, 0, 0, 0, 0)) = [S0(t)/S0(26.82)]^exp(3.068). (20)

The value of tq is now 167.30 days, as shown in Fig. 7, and the maintenance period is updated to [157.30, 177.30], which is much earlier than the original period. This is because the estimated coefficient for event A is positive and the occurrence of this event indicates an acceleration of the occurrence of system failures. However, the true failure occurs at day 180.10, which is after day 177.30. Therefore, the prediction made at day 26.82 turns out to be a false alarm.
Similarly, we can follow this procedure, i.e., making predictions whenever a PE occurs and checking whether K occurs in the maintenance period, for all of the 852 event sequences with a failure in the data set. Based on the real historical data, the MDR and FAR can be estimated as

α̂ = (Number of false alarms) / (Total number of event sequences),
β̂ = (Number of misdetections) / (Total number of event sequences). (21)
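The prediction update of Equation (20) can be carried out numerically for any estimated baseline: raise the baseline survival ratio to exp(γᵀZ) and scan for the qth quantile residual life. A sketch with an exponential stand-in for the estimated S0 (the fitted curve itself is not reproducible from the text, so the rate λ below is hypothetical):

```python
import math

def conditional_survival(S0, t_star, risk):
    """S(t | t*, Z) = (S0(t)/S0(t*))**risk with risk = exp(gamma'Z(t*)),
    as in Equation (20)."""
    return lambda t: (S0(t) / S0(t_star)) ** risk

def quantile_residual_life(S0, t_star, risk, q, horizon=10000.0, step=0.01):
    """Smallest t > t* with S(t | t*) <= 1 - q, found by a simple scan."""
    S = conditional_survival(S0, t_star, risk)
    t = t_star
    while t < horizon:
        if S(t) <= 1.0 - q:
            return t
        t += step
    return math.inf

lam = 0.002                                   # hypothetical baseline rate
S0 = lambda t: math.exp(-lam * t)
# Second prediction of the case study: event A at day 26.82, gamma = 3.068
tq = quantile_residual_life(S0, t_star=26.82, risk=math.exp(3.068), q=0.8)
print(tq)   # close to t* - ln(0.2) / (lam * exp(3.068))
```

A large positive coefficient inflates the exponent and pulls tq sharply toward t∗, which is exactly the behavior seen in the case study when event A occurs.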
Fig. 7. Two predictions made at day 0 and day 26.82.
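The estimates in Equation (21) amount to replaying each historical event sequence and counting how often the failure lands outside the predicted window. A simplified sketch (one prediction per sequence and a toy predictor; the data and the interface are hypothetical):

```python
def evaluate(sequences, predict_tq, t_l=10.0, t_h=10.0):
    """Estimate FAR and MDR as in Equation (21) by replaying sequences.

    Each sequence is (pe_times, failure_time); `predict_tq` maps the PE
    history observed before the failure to a predicted tq. The paper
    re-predicts at every PE; here we keep one prediction per sequence.
    """
    false_alarms = misdetections = 0
    for pe_times, failure in sequences:
        seen = [t for t in pe_times if t < failure]
        tq = predict_tq(seen)
        if failure > tq + t_h:
            false_alarms += 1
        elif failure < tq - t_l:
            misdetections += 1
    n = len(sequences)
    return false_alarms / n, misdetections / n

# Toy predictor: failure expected 100 days after the last observed PE.
seqs = [([5.0], 110.0), ([20.0], 60.0), ([30.0], 250.0)]
far, mdr = evaluate(seqs, lambda seen: (seen[-1] if seen else 0.0) + 100.0)
print(far, mdr)  # one false alarm and one misdetection among three sequences
```

Sweeping q in the predictor and recording (α̂, β̂) pairs traces out the trade-off curve described next.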

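The naïve benchmark used in the comparison below predicts with the empirical failure-time quantile of Equation (22). A minimal sketch with hypothetical failure days:

```python
import math

def naive_tau_q(failure_times, q):
    """tau_q = inf{t : ecdf(t) >= q}, the qth empirical quantile (Eq. (22))."""
    xs = sorted(failure_times)
    k = math.ceil(q * len(xs))          # smallest k with k/n >= q
    return xs[max(k, 1) - 1]

times = [120.0, 45.5, 300.2, 88.1, 210.0]      # hypothetical failure days
tau = naive_tau_q(times, q=0.8)                # ecdf(210.0) = 4/5 = 0.8
maintenance = (tau - 10.0, tau + 10.0)         # [tau_q - 10, tau_q + 10]
print(tau, maintenance)
```

Because this predictor ignores the observed PEs entirely, it serves as the marginal-distribution baseline against which the Cox-model predictions are compared.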
As shown in Section 3, we know that increasing q results in the FAR decreasing and the MDR increasing. If q varies from zero to one, the estimated FAR and MDR can be plotted to show the trade-off between the FAR and MDR.
To check if the proposed method can improve the prediction performance, we also plot the α̂ versus β̂ curve of a naïve method, which is commonly used in reliability analysis as a non-parametric method for failure prediction. In the naïve method, the empirical cumulative distribution of the failure time, ecdf(t), is estimated based on historical data. The qth quantile of ecdf(t) is used to predict the failure time, as shown in Equation (22):

τq = inf{t : ecdf(t) ≥ q}. (22)

The corresponding maintenance period of the naïve method is specified as [τq − 10, τq + 10].
Since the empirical cumulative distribution reflects the marginal distribution of the failure time, whereas the Cox PH model describes the distribution of the failure time conditioned on the PEs, by comparing these two models we can further validate the significance of the selected PEs in terms of improving prediction performance. Figure 8 illustrates that the proposed prediction model performs uniformly better than the naïve method.

Fig. 8. Comparison of prediction performance.

5. Conclusion

In this article, we proposed a systematic methodology to utilize event logs for system failure prediction. A Cox model was adopted as the quantitative prediction model. The variable selection for model building and the prediction power in terms of the FAR and MDR were studied in detail. The methods developed for prediction performance evaluation could provide guidelines on the design of an optimal maintenance policy. For example, based on the values of the α and β errors, we could optimally determine the maintenance period by determining tq, tl, and th using a search algorithm. Finally, the results of a real-world case study demonstrated the effectiveness of the proposed methodology.
There are still a number of open questions in this study. First, it should be noted that the FAR and MDR we discussed in Section 3 are actually the "intrinsic" prediction errors of the Cox PH model. In the evaluation of these errors, we assume that the system degradation follows exactly the Cox PH model under consideration. In other words, we assume that the model under consideration is the true system model, and the discrepancy between this model and the actual true model is not considered. These intrinsic errors do provide useful bounds on the values of the prediction errors. It is in general quite challenging to evaluate the overall prediction error, including both the intrinsic error and the error due to model discrepancy, because we will never know the underlying true model in most practical scenarios. However, we could still obtain some insights into the overall error if the uncertainties in the model fitting are known. We will investigate this topic and report the findings in a future paper.
Second, in this article, several parameters in the variable selection procedure, including the ARF calculation and the PGA procedure, are chosen heuristically. It would be
interesting to see if some insights could be developed for the sensitivities of these procedures with regard to these parameters, thus allowing more concrete guidelines to be established for the setup of these parameters. In addition, it could be very interesting to consider using survival models other than the Cox model to describe the relationships between PEs and the FEs. We will continue in this direction and further investigate these issues.

Acknowledgements

The authors thank the editor and the referees for their valuable comments and suggestions. This research was supported by NSF grants 0757683 and 0545600 and GE Healthcare.

Appendix A: PGA algorithm

Inputs:
M: number of total events;
m: population size;
L: number of parallel single-path GAs;
N: number of generations to evolve for each single path.

Algorithm:
Step 1. For each path, initialize a random population of size m.
Step 2. While (t < N)
  2.1. For each path, fit the Cox model using all individuals in the current generation t and calculate their AIC values;
  2.2. For each path, select out the top m/2 individuals with the smallest AIC values to constitute the "survival pool" and generation (t + 1), respectively;
  2.3. Repeat for each path:
    2.3.1. Randomly select a father and a mother;
    2.3.2. Perform the cross-over operation and generate a child;
    2.3.3. Perform the mutation operation on this new child;
    2.3.4. Add this new child to generation (t + 1);
  until generation (t + 1) has m individuals;
  2.4. t = t + 1;
End While.
Step 3. Collect all the last generations, i.e., generation N, for the L paths. Average the results and rank all events.

Output:
The optimal subset of events.

Appendix B: Proof of Lemma 2

Assuming there is only one PE, we define the notations used in the following as:

- te: the occurrence time of the PE, with density function µe^(−µt);
- t: the occurrence time of the failure; the baseline survival function is e^(−λt);
- γ: the coefficient of the predictor event in the Cox model.

Given tl, th, and t > t∗, the average FAR α and MDR β can be expressed as

α = Pr(t > tq + th | te < t∗) × Pr(te < t∗ | t > t∗) + Pr(t > tq + th | te > t∗) × Pr(te > t∗ | t > t∗),
β = Pr(t < tq − tl | te < t∗) × Pr(te < t∗ | t > t∗) + Pr(t < tq − tl | te > t∗) × Pr(te > t∗ | t > t∗).

Since the FAR and MDR are calculated using the same procedures, in the following we only discuss the calculation of the false alarm rate for simplicity.

Case 1: te < t∗.
We can use the results in Lemma 1 to obtain

Pr(t > tq + th | te < t∗) = S(tq + th | t∗, te < t∗) = (1 − q) e^(−λ th exp(γ)). (A1)

Case 2: te > t∗.
The conditional survival function under this case is

S(t | t∗, te > t∗) = e^(λt∗ − λt), when t < te,
S(t | t∗, te > t∗) = e^(λt∗ − λte) × e^((λte − λt) exp(γ)), when t > te, (A2)

and tq is

tq = t∗ − log(1 − q)/λ. (A3)

Therefore,

Pr(t > tq + th | te > t∗) = ∫_{t∗}^{∞} S(tq + th | t∗, te > t∗) · µe^(µt∗ − µte) dte. (A4)

If te > tq + th:

S(tq + th | t∗, te > t∗) = e^(λt∗ − λ(tq + th)), (A5)

or if te < tq + th:

S(tq + th | t∗, te > t∗) = e^(λt∗ − λte) × e^((λte − λ(tq + th)) exp(γ)). (A6)

Equation (A4) can thus be split as

∫_{t∗}^{tq+th} S(tq + th | t∗, te > t∗) × µe^(µt∗ − µte) dte + ∫_{tq+th}^{∞} S(tq + th | t∗, te > t∗) × µe^(µt∗ − µte) dte.

Finally, we get

Pr(t > tq + th | te > t∗) = (1 − q)^((λ+µ)/λ) e^(−(λ+µ)th) + [µ(1 − q)^exp(γ) e^(−λ th exp(γ)) / (λ + µ − λ exp(γ))] × [1 − e^(−(λ+µ−λ exp(γ))(th − log(1−q)/λ))]. (A7)
To calculate Pr(te > t∗ | t > t∗), we use Bayes' theorem to get

Pr(te > t∗ | t > t∗) = [Pr(t > t∗ | te > t∗) × Pr(te > t∗)] / [Pr(t > t∗ | te > t∗) × Pr(te > t∗) + Pr(t > t∗ | te < t∗) × Pr(te < t∗)]. (A8)

Since

Pr(t > t∗ | te > t∗) × Pr(te > t∗) = e^(−λt∗) × e^(−µt∗), (A9)
Pr(t > t∗ | te < t∗) × Pr(te < t∗) = ∫_{0}^{t∗} e^(−λt) × e^((λt − λt∗) exp(γ)) × µe^(−µt) dt, (A10)

Equation (A8) can be expressed as

Pr(te > t∗ | t > t∗) = (µ + λ − λ exp(γ)) e^(−(µ+λ)t∗) / [(λ − λ exp(γ)) e^(−(µ+λ)t∗) + µ e^(−λt∗ exp(γ))]. (A11)

In all,

α = { [(λ − λθ) ρ^((λ+µ)/λ) e^(−(λ+µ)th) + µ(ρ^θ − ρ) e^(−λθ th)] e^(−(λ+µ)t∗) + µρ e^(−λθ(t∗ + th)) } / [(λ − λθ) e^(−(λ+µ)t∗) + µ e^(−λθ t∗)], (A12)

where θ = e^γ and ρ = 1 − q.

Biographies

Yuan Yuan is a Ph.D. candidate in Industrial and Systems Engineering at the University of Wisconsin–Madison. She holds a B.S. in Automatic Control from Tsinghua University, China. Her research interests are in the general area of mathematical model building for data clouds, with applications in reverse engineering, production and service systems, and CAD/CAE design. She also works with GE Healthcare on software and hardware failure prediction based on event logs.

Shiyu Zhou is an Associate Professor in the Department of Industrial and Systems Engineering at the University of Wisconsin–Madison. He received his B.S. and M.S. in Mechanical Engineering from the University of Science and Technology of China in 1993 and 1996, respectively, and his master's in Industrial Engineering and Ph.D. in Mechanical Engineering from the University of Michigan in 2000. His research interests include in-process quality and productivity improvement methodologies by integrating statistics, system and control theory, and engineering knowledge. His research is sponsored by the National Science Foundation, Department of Energy, Department of Commerce, and industries. He is a recipient of a CAREER Award from the National Science Foundation and the Best Application Paper award from IIE Transactions in 2006. He is a member of IIE, INFORMS, ASME, and SME.

Crispian Sievenpiper is an Applied Research Architect at GE Healthcare Global Service Technology. He received M.S. degrees in Computer Science and Statistical Science from the State University of New York and a Ph.D. from The University of Wisconsin. His research interests are in using telematics to improve service delivery of complex systems.

Kamal Mannar, Ph.D., is a lead engineer at General Electric Energy's Advanced Technology Operations (ATO), located in Atlanta, Georgia. In this role, he leads numerous programs related to remote monitoring, advanced reliability and risk modeling, and fleet analytics for power plants and wind turbines. He received his M.S. in Manufacturing Systems Engineering in 2005 and Ph.D. in Industrial and Systems Engineering from the University of Wisconsin–Madison in 2006. His research focus includes developing methodologies for prognostics and health management and asset optimization. In his previous role in GE Healthcare he developed algorithms and tools to detect anomalies, predict failures, and optimize the remaining equipment life of imaging equipment such as MRI and CT systems.

Yibin Zheng received a B.S. degree in Electronics from Zhongshan University, Guangzhou, China; an M.A. degree in Physics from the State University of New York at Buffalo; and a Ph.D. degree in Electrical and Computer Engineering from Purdue University, West Lafayette, Indiana. Between 1996 and 2000 he held a Senior Electrical Engineer position at the GE Global Research Center in Niskayuna, New York, where he received the 1999 Dushman Award, among many other recognitions, for his outstanding technical contributions. He left GE to pursue an academic career at the University of Virginia between 2000 and 2007. In 2007 he rejoined GE in the GE Healthcare Division, where he is currently a Principal Scientist in the Global Service Technologies unit. He has six U.S. patents and has published more than 30 journal papers and more than 50 conference papers. He is a Senior Member of the IEEE.