
Expert Systems with Applications 36 (2009) 4745–4752


Knowledge acquisition for diagnosis model in wireless networks


Raquel Barco a,*, Pedro Lázaro a, Volker Wille b, L. Díez a, Sagar Patel b

a Departamento Ingeniería de Comunicaciones, University of Málaga, E.T.S.I. Telecomunicación, Campus Universitario de Teatinos s/n, E-29071 Málaga, Spain
b Nokia Siemens Networks, Consulting and Systems Integration, Ermine Business Park, Huntingdon, Cambridge PE29 6YJ, UK
* Corresponding author. Tel.: +34 952137184; fax: +34 952132027. E-mail address: rbarco@uma.es (R. Barco).

Keywords: Automated management; Expert systems; Network operation; Diagnosis; Mobile communications; Probabilistic reasoning; Wireless networks; Bayesian networks; Self-healing; Fault management; Self-optimizing networks

Abstract

In the near future, several radio access technologies will coexist in Beyond 3G mobile networks (B3G) and they will be eventually transformed into one seamless global communication infrastructure. Self-managing systems (i.e. those that self-configure, self-protect, self-heal and self-optimize) are the solution to tackle the high complexity inherent to these networks. This paper proposes a probabilistic model for self-healing in the radio access network (RAN) of wireless systems. The main difficulty in model construction is that, contrary to other application domains, in wireless networks there are no databases of previously classified cases from which to learn the model parameters. Due to this reason, in this paper, a knowledge acquisition procedure is proposed to build the model from the knowledge of troubleshooting experts. In order to support the theoretical concepts, a model has been built and it has been tested in a live network, proving the feasibility of the proposed system. Additionally, a knowledge-based model has been compared to a data-based model, showing the benefits of the former when the number of training cases is scarce.

© 2008 Elsevier Ltd. All rights reserved.

1. Introduction

The mobile telecommunication industry has experienced significant changes in the recent past and it will continue to evolve in the foreseeable future. Beyond 3G mobile networks (B3G) (Jamalipour, Wada, & Yamazato, 2005) will comprise a set of interrelated and rapidly growing wireless networks, applications which will require increasing bandwidth, and users who will demand high quality of service at low cost, all within a limited spectrum allocation. In a few years, the highly complex and heterogeneous radio access network (RAN) will be composed of different technologies, such as GSM, UMTS and WLAN. As a result, the operation of the RAN will be a tough challenge that cellular network operators will have to address. Self-managing systems (i.e. those that self-configure, self-protect, self-heal and self-optimize) are the cost-effective solution to tackle the high complexity inherent to these networks. This paper is focused on self-healing (also called automatic troubleshooting) in the RAN of wireless systems, which, up to now, has received little attention in the existing literature, despite the huge interest of network operators and equipment manufacturers (Altman et al., 2006).

As the RAN consists of a high number of pieces of equipment distributed across the entire service area (e.g. a country), maintaining and operating this large and technically complex system is a difficult task that requires operations personnel around the clock in several regional offices. Even with reliable hardware and software, there are always faults which have to be rectified, as otherwise the end-user will either experience sub-optimal service levels or no service at all. As in most countries several operators are competing for subscribers, it is imperative to rectify such occurrences quickly, because otherwise users will naturally switch to competing network operators. Hence, fault management, also called troubleshooting (TS), is a key aspect of the operation of a wireless network in a competitive environment. As the RAN of cellular systems is by far the biggest part of the network, most TS activities are focused on this area.

Troubleshooting comprises three tasks: fault detection; cause diagnosis (i.e. identification of the problem's cause); and solution deployment (i.e. fixing the problem). Amongst them, diagnosis of the cause of faults is the most complex and time-consuming task. Diagnosis is currently a manual process accomplished by experts dedicated to daily analysis of the main key performance indicators (KPIs) and the alarms of the network, aiming to determine the cause of the problems.

In this paper, a system is proposed for automatic diagnosis of the RAN of wireless systems. Given the values of some KPIs and alarms, the system calculates the probabilities of the possible fault causes using a diagnosis model and Bayes' rule. There are two options to define the parameters of the model: either experts in the RAN elicit them (knowledge-based model) or they are learnt from training data (data-based model).

Currently, in mobile communication networks there are no historical collections of diagnosed cases. Furthermore, diagnosis of the RAN of cellular networks is not documented in the existing literature. Thus, the experience of troubleshooting experts is, in most cases, the only source of information to build a diagnosis model. For these reasons, in this paper a knowledge acquisition procedure is proposed to build the probabilistic model from the knowledge of experts in troubleshooting the RAN. The main advantage of the procedure is that the model can be easily built by troubleshooting experts, without the need to know anything about probabilistic models. As a consequence, domain experts can transfer their expertise using a language that they understand.

In order to support the theoretical concepts, a model has been built and tested in a live network, proving the feasibility of the proposed system.

This paper is organized as follows. In Section 2, previous work related to self-healing in the RAN of wireless networks is summarized. In Section 3, some concepts related to fault diagnosis are defined and the systems for self-healing and for automatic diagnosis are presented. In Section 4, the knowledge acquisition procedure is described. In Section 5, the system proposed in this paper is evaluated. In addition, a data-based model and a knowledge-based model are compared. Finally, conclusions are presented in Section 6.

2. Related work

First steps in self-healing in the RAN of wireless networks have focused on performance visualization (Lehtimäki & Raivio, 2005a) and on fault detection (Laiho, Raivio, Lehtimäki, Hätönen, & Simula, 2005; Lehtimäki & Raivio, 2005b). However, very few references can be found on diagnosis in the RAN. Automatic diagnosis has been extensively studied in other fields, such as diagnosis of diseases in medicine (Ng & Ong, 2000), troubleshooting of printer failures (Heckerman, Breese, & Rommelse, 1995) and diagnosis in the core of communication networks (Steinder & Sethi, 2004). Nevertheless, diagnosis in the RAN of cellular networks has some distinctive characteristics, such as the continuous nature of KPIs and the existence of logical faults not related to a physical piece of equipment (e.g. a wrong configuration). This makes automatic diagnosis techniques used in other application domains not directly applicable to cellular networks.

Research studies on automation of diagnosis in the RAN of cellular networks have traditionally focused on alarm correlation (Wietgrefe, 2002). Alarm correlation consists in the conceptual interpretation of multiple alarms, so that a single new meaning is assigned to the original alarms. Although alarm correlation can be considered a first step in the diagnosis of faults, it does not provide conclusive information to identify the cause of problems, especially if the possible causes are not only faults in pieces of equipment. Other categories of faults, such as interference or wrong configuration, are difficult to identify if KPIs are not considered.

In Barco, Wille, and Díez (2005) a diagnosis system for the RAN of wireless networks based on probabilistic methods was proposed. KPIs were modelled as continuous variables. The problem of this approach is that the results were very sensitive to an inaccurate definition of the parameters of the model or to a scarce number of training examples. In other application domains, discrete models have proven to be less sensitive to model parameters (Pearl, 1988) than continuous ones. This is the reason why in the following sections a discrete model has been adopted.

3. Problem formulation

In this section, some terminology used in diagnosis of wireless networks is first described. Then, a system for automatic troubleshooting in the RAN of wireless networks is presented. Subsequently, a probabilistic diagnosis system for automatic diagnosis is proposed, which comprises a method and a model.

3.1. Definitions

The first step in troubleshooting is detection, that is, the identification of the cells with problems. A problem is a situation occurring in a cell that has a degrading impact on the service. Every operator uses a different method to identify the problematic cells, which can be based on different performance indicators, e.g. dropped calls, access failures, congestion, etc. The most severe problem for mobile network operators is cells experiencing a high number of dropped calls, because a dropped call has a very negative impact on the service offered to the end-user. In that sense, the dropped call rate (DCR) is a good indicator of the quality of the cell.

Once the cells with problems are isolated, a diagnosis of the cause of the problems should be done for each problematic cell. A cause or fault is the defective behaviour of some logical or physical component in the cell that provokes failures and generates a high DCR, e.g. a bad parameter value, a hardware fault, etc. A symptom is a KPI or an alarm whose observed value might be used to identify a fault, e.g. the number of handovers due to interference. The aim of the diagnosis system is to identify the cause of a problem based on the values of some symptoms. Details about the most common causes and symptoms in wireless networks can be found in Barco et al. (2005) for GSM/GPRS and in Khanafer et al. (in press) for UMTS.

3.2. Automatic troubleshooting system

Fig. 1 shows the architecture of an automatic troubleshooting system (TSS) for the RAN of wireless networks. The fault detection subsystem (FDS) provides the automatic diagnosis subsystem (ADS) with the list of faulty cells to be diagnosed. The ADS requires a diagnosis model on which its reasoning mechanisms are based. The subsystem named model definition is in charge of building the diagnosis model to be used by the ADS. Diagnosis models can be built based either on the expertise of human troubleshooters or on statistics from the network. Those statistics are normally saved in the network management system (NMS). The inputs to the ADS are the symptoms, that is, alarms and KPIs for each of the faulty cells. The NMS contains historical databases with the values of all those inputs. The output of the ADS is a diagnosis of the fault that is causing the problems in each malfunctioning cell. In addition, the ADS proposes a list of actions, ranked by their efficiency (efficiency = probability of the action solving the problem / cost of the action), to be sequentially executed until the problem is solved. These actions may be just changing a configuration parameter from a remote terminal or may involve sending personnel to a site to replace a faulty piece of equipment. The TSS may even execute software-related repair actions. Normally, however, operators prefer that the TSS only proposes the actions and that the final decision is taken by a human expert (the TSS in this case acts as a so-called decision support system). Finally, the ADS is also intended to generate a report about the diagnosed cause and the steps carried out in order to recover from the fault (which may be integrated with the trouble ticket systems present in most communication networks).

Fig. 1. Automatic troubleshooting system (TSS).

The TSS can work independently from the NMS, but most of its benefits are achieved when it is an integrated part of it. This integrated solution will provide direct access to the information required in fault analysis as well as access to the operator's fault management system. An integrated solution is also beneficial in the case of multi-vendor networks and of multi-system networks (GSM, UMTS, WLAN). Hence, all relevant troubleshooting cases can be automatically directed to the TSS and, if it finds the solution, the case is cleared, reported and filed.

If the problem is not found by the expert system, it can be redirected to the specialists for further analysis and the final conclusions can be incorporated into the knowledge of the expert system.

This paper is focused on the ADS, which is the most complex subsystem and, up to now, has received little attention in the existing literature.

3.3. Bayesian modelling

Two components of the ADS are distinguished: the diagnosis model and the inference method. The diagnosis model represents the knowledge on how the identification of the fault cause is carried out. The elements of the model are causes and symptoms. The inference method is the algorithm that identifies the cause of the problems based on the values of the symptoms.

Defining the diagnosis model comprises two phases. Firstly, the qualitative model should be identified, that is, the causes and symptoms for diagnosis in a given technology (GSM, GPRS, UMTS, multi-technology networks, etc.). Causes can be modelled as discrete random variables with two states: {absent/present}. Two types of symptoms are considered: alarms and KPIs. Alarms can also be modelled as discrete random variables with two states: {off/on}. KPIs are inherently continuous, but they can be modelled either as continuous or as discrete random variables. In the latter case, the discretized KPI may have any discrete number of states, each representing a subset of the continuous range of the KPI, e.g. {normal/high/very high}.

Secondly, the quantitative model should be specified, that is, the parameters of the model. In a discrete model, the parameters are thresholds for the discretized KPIs and probabilities.

Once the qualitative and quantitative models have been defined, the inference method consists in calculating the probability of each possible cause. Given the value of the symptoms {S_1, ..., S_M}, the probability of cause c_i can be obtained as

P(c_i \mid E) = \frac{P(c_i) \prod_{j=1}^{M} P(S_j \mid c_i)}{\sum_{k=1}^{K} P(c_k) \prod_{j=1}^{M} P(S_j \mid c_k)}    (1)

where P(c_i) are the prior probabilities of the causes and P(S_j | c_i) are the probabilities of the symptoms given the causes.

Eq. (1) has been obtained by applying Bayes' rule and taking into account the following assumptions: (i) only one cause can be present at the same time; (ii) symptoms are independent given the causes. These assumptions are realistic in the RAN of wireless networks, but even if they were not, this model has proven to give very good results (Rish, 2001).

Let a case be the set composed of the values of the symptoms in a faulty cell (e.g. average values in a day) together with the actual cause of the problems. Cases may be used either to train the system, i.e. to calculate the parameters of the model, or to test the system, i.e. to calculate the diagnosis accuracy (percentage of cases in a test set correctly classified).

3.4. Model parameters

The parameters of the model are thresholds and probabilities. On the one hand, the thresholds are interval boundaries for the discretization of the continuous symptoms. That is, t_{j,k} is the k-th threshold for symptom S_j, which partitions it into states s_{j,k} and s_{j,k+1}. On the other hand, according to Eq. (1), the probabilities are the following:

- Prior probabilities of causes: P(c_i), i = 1, ..., K.
- Conditional probabilities of symptoms given causes: P(S_j = s_{j,k} | c_i), i = 1, ..., K, j = 1, ..., M, is the probability of the symptom S_j being in state s_{j,k} given that the cause is c_i.
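To make the inference step concrete, the following sketch (in Python) computes the posterior probabilities of Eq. (1) for a set of observed, already-discretized symptoms. The cause names, symptom names and probability values are purely illustrative placeholders, not the parameters elicited in this work.

```python
# Minimal sketch of the diagnosis step of Eq. (1): a naive Bayes classifier over
# mutually exclusive causes with discretized symptoms. All names and numbers
# below are illustrative placeholders.

priors = {"UL interference": 0.2, "Lack of coverage": 0.3, "Other causes": 0.5}

# P(symptom state | cause): one table per symptom, indexed by cause and state.
conditionals = {
    "DCR": {
        "UL interference": {"normal": 0.1, "high": 0.9},
        "Lack of coverage": {"normal": 0.3, "high": 0.7},
        "Other causes": {"normal": 0.6, "high": 0.4},
    },
    "% UL quality HOs": {
        "UL interference": {"low": 0.15, "high": 0.85},
        "Lack of coverage": {"low": 0.7, "high": 0.3},
        "Other causes": {"low": 0.8, "high": 0.2},
    },
}

def diagnose(evidence):
    """Return P(cause | evidence) for every cause, following Eq. (1).

    evidence maps symptom name -> observed (discretized) state.
    """
    scores = {}
    for cause, prior in priors.items():
        score = prior
        for symptom, state in evidence.items():
            score *= conditionals[symptom][cause][state]
        scores[cause] = score
    total = sum(scores.values())  # denominator of Eq. (1)
    return {cause: score / total for cause, score in scores.items()}

if __name__ == "__main__":
    observed = {"DCR": "high", "% UL quality HOs": "high"}
    for cause, posterior in sorted(diagnose(observed).items(), key=lambda x: -x[1]):
        print(f"{cause}: {posterior:.3f}")
```

In the actual system, each continuous KPI would first be discretized using the thresholds t_{j,k} described above before being used as evidence.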

4. Knowledge acquisition

Building the probabilistic model based on the knowledge from experts in the application domain, that is, knowledge acquisition, involves two phases. Firstly, knowledge gathering, that is, obtaining the knowledge from experts. Secondly, model construction, that is, defining the model based on the information previously provided by the experts.

4.1. Knowledge gathering

Knowledge gathering is composed of the phases presented in Fig. 2, which are explained below. Table 1 summarizes the qualitative information that the expert should provide, whereas the quantitative information can be found in Table 2:

1. Select fault category. Fault categories are the diverse problems that the RAN may suffer, such as "High DCR" or "Congestion". A different model is built for each fault category.

2. Define variables. There should be a database of causes and symptoms. The expert has the chance of either selecting a variable from the database or defining a new one, which should then be incorporated into the database. Firstly, the expert specifies the possible causes of the fault category, that is, the causes of the problem in the network for which the diagnosis model is being built (e.g. "High DCR"), {C_1, ..., C_K}. It is recommended to include a cause called "Other causes", in order to cover any possible cause of the problem not explicitly included in the defined causes. Secondly, the expert is asked to enumerate the symptoms that may help to identify the previously defined causes, {S_1, ..., S_M}. The states, s_{i,j}, of each symptom, S_i, should also be specified.

3. Define relations. In this phase, the user should define the causes, C^i_r = {C^i_{r1}, ..., C^i_{rR_i}}, related to each symptom S_i. The term "related" is used to qualify those variables which have a strong direct inter-dependency. For example, the cause "Lack of coverage" is related to the symptom "Percentage of UL samples with level < -100 dBm", whereas the cause "UL interference" is not related to that symptom. The explanation is that a lack of coverage reduces the received signal level in comparison to the average received signal level in a network without problems, whereas when the cause is interference, the received signal level is not significantly decreased in comparison to the level in a cell without problems. The causes not related to symptom S_i will be denoted as C^i_n = {C^i_{n1}, ..., C^i_{n(K-R_i)}}.

4. Specify thresholds. For each continuous symptom, S_i, the interval limits (i.e. thresholds), t_{i,j}, between the defined intervals should be requested from the user.

5. Specify probabilities. Verbal probability expressions are often suggested as a method of eliciting probabilistic information (Renooij & Witteman, 1999). The number of verbal expressions should be reduced in order to avoid misinterpretations. In addition, it is advisable to use a graphical scale with numbers on one side and words on the other. In our experiments with cellular network operators, experts were asked to choose one out of five levels of probability: "Almost certain", "Likely", "Fifty-fifty", "Improbable" and "Unlikely". Those levels are mapped to the probabilities 0.85, 0.7, 0.5, 0.3 and 0.1, respectively. The procedure to define the probabilities is as follows. Firstly, the expert is asked for the prior probabilities of each of the possible causes of the problem, P_{C_i}. As causes have only two states (absent/present), only the probability of the cause being present is requested. Secondly, probabilities for symptoms are requested. For a symptom S_i, the requested probabilities, P_{S_{i,j}|C_k}, should be those of each state of the symptom given that each of the related causes, C_k in C^i_r, is present and the other causes are absent. In addition, the probability of each state of the symptom given that none of the related causes is present, P_{S_{i,j}|C_0}, should be defined. In all cases, the expert should take into account that the sum of the probabilities over all the states of a given symptom should be 1. The expert should be warned if this is not the case.

6. Link symptoms to database. The last step is linking the symptoms in the model to the data in the NMS. Thus, symptoms should be related to a parameter (performance indicator, counter, etc.) available in the NMS or a combination of parameters.

Fig. 2. Phases in knowledge acquisition.

Table 1
Qualitative model defined by expert

Parameters | Range | Description | Example
F_i | i = 1, ..., A (A: number of fault categories) | Fault categories | F_1 = High DCR
C_i | i = 1, ..., K (K: number of causes) | Causes | C_1 = UL interf.
S_i | i = 1, ..., M (M: number of symptoms) | Symptoms | S_30 = % UL interf. HOs
s_{i,j} | i = 1, ..., M; j = 1, ..., Q_i (Q_i: number of states of symptom S_i) | Symptom states | s_{30,1} = low
C^i_r = {C^i_{r1}, ..., C^i_{rR_i}} | i = 1, ..., M (R_i: number of causes related to S_i) | Set of causes related to symptom S_i | C^1_r = {C_3, C_4}

Table 2
Quantitative model defined by expert

Parameters | Range | Description | Number of parameters
t_{i,j} | i = 1, ..., M; j = 1, ..., T_i (T_i: number of thresholds of symptom S_i) | Threshold j for symptom S_i | sum_{i=1..M} T_i
P_{C_i} | i = 1, ..., K | Probability of cause C_i being present | K
P_{S_{i,j}|C_k} | i = 1, ..., M; j = 1, ..., Q_i; for all C_k in C^i_r | Probability of symptom S_i = s_{i,j} given cause C_k = 1 and C_h = 0 for all h ≠ k | sum_{i=1..M} R_i Q_i
P_{S_{i,j}|C_0} | i = 1, ..., M; j = 1, ..., Q_i | Probability of symptom S_i = s_{i,j} given C_k = 0 for all C_k in C^i_r | sum_{i=1..M} Q_i
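As an illustration of how the knowledge gathered in steps 1-6 could be stored, the sketch below (Python) encodes the elicited qualitative and quantitative information, including the five-level verbal scale used in our experiments. The class names and the example cause, symptom, thresholds and NMS parameter name are hypothetical, not part of the deployed tool.

```python
# Sketch of a container for the knowledge gathered in the phases of Fig. 2.
# The verbal-to-numeric mapping follows the five levels quoted in the text;
# the example cause, symptom and threshold values are purely illustrative.
from dataclasses import dataclass, field

VERBAL_SCALE = {
    "Almost certain": 0.85,
    "Likely": 0.70,
    "Fifty-fifty": 0.50,
    "Improbable": 0.30,
    "Unlikely": 0.10,
}

@dataclass
class Symptom:
    name: str
    states: list             # e.g. ["normal", "high", "very high"]
    thresholds: list          # interval limits t_{i,j} (in %), one fewer than states
    related_causes: list      # C_r^i: causes with a strong direct dependency (step 3)
    nms_parameter: str = ""   # link to the KPI/counter in the NMS (step 6)

@dataclass
class FaultCategoryModel:
    fault_category: str        # e.g. "High DCR" (step 1)
    cause_priors_verbal: dict  # cause name -> verbal level (step 5, priors)
    symptoms: list = field(default_factory=list)
    symptom_probs_verbal: dict = field(default_factory=dict)  # (symptom, state, cause) -> verbal level

    def cause_priors(self):
        """Translate the elicited verbal priors into numeric probabilities."""
        return {c: VERBAL_SCALE[v] for c, v in self.cause_priors_verbal.items()}

# Example usage with hypothetical content:
model = FaultCategoryModel(
    fault_category="High DCR",
    cause_priors_verbal={"UL interference": "Improbable", "Lack of coverage": "Likely"},
)
model.symptoms.append(Symptom(
    name="% UL samples with level < -100 dBm",
    states=["low", "high"],
    thresholds=[20.0],
    related_causes=["Lack of coverage"],
    nms_parameter="ul_low_level_sample_ratio",
))
print(model.cause_priors())
```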

4.2. Model construction

According to Eq. (1), the required probabilities to build the model are the prior probabilities of causes, P(c_i), and the probabilities of symptoms given causes, P(S_j | c_i). Therefore, the data provided by experts, which are those in Tables 1 and 2, should be converted into the probabilities required in Eq. (1).

For causes, the probabilities elicited by the experts, P_{C_i}, are directly the probabilities of the causes, P(c_i). Taking into account that the model assumes that the causes are mutually exclusive, the sum of the probabilities of the causes should be 1. There are two ways of dealing with that constraint: either the expert is responsible for checking that the elicited probabilities are right, or the elicited probabilities are allowed not to comply with the constraint and are then automatically modified. The latter is done according to the following procedure:

- If the sum of the probabilities is higher than 1, each probability is normalized by the sum of all the probabilities of the causes.
- If the sum of the probabilities of the causes is lower than 1, a cause c_{K+1}, named "Others", is added. That cause stands for any other cause of the problem not considered by the expert. The probability of that cause is 1 minus the sum of the probabilities of the causes.

Regarding the probabilities of the symptoms, P(S_j = s_{j,k} | c_i), the probability of S_j conditioned on related causes has been explicitly elicited by the expert:

P(S_j = s_{j,k} \mid c_i) = P_{S_{j,k} \mid C_i}, \quad c_i \in C^j_r    (2)

The probability of the symptom conditioned on non-related causes, which is the same for all non-related causes, is

P(S_j = s_{j,k} \mid c_i) = P_{S_{j,k} \mid C_0}, \quad c_i \in C^j_n    (3)
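The conversion rules above can be illustrated with a short sketch (Python): it renormalizes the elicited priors, adds the "Others" cause when the priors sum to less than 1, and expands the elicited symptom probabilities into the full conditional tables required by Eq. (1), following Eqs. (2) and (3). Function names and example values are illustrative only.

```python
# Sketch of the model construction rules of Section 4.2 (illustrative names/values).

def build_priors(elicited):
    """Turn elicited prior probabilities into P(c_i) for mutually exclusive causes."""
    total = sum(elicited.values())
    if total > 1.0:
        # Rule 1: renormalize so that the priors sum to 1.
        return {c: p / total for c, p in elicited.items()}
    priors = dict(elicited)
    if total < 1.0:
        # Rule 2: add an "Others" cause absorbing the remaining probability mass.
        priors["Others"] = 1.0 - total
    return priors

def build_conditionals(causes, related, p_related, p_unrelated):
    """Build P(S_j = s | c_i) for one symptom.

    p_related[cause][state] -- elicited probabilities for related causes (Eq. (2))
    p_unrelated[state]      -- elicited probabilities shared by all non-related causes (Eq. (3))
    """
    table = {}
    for cause in causes:
        table[cause] = dict(p_related[cause]) if cause in related else dict(p_unrelated)
    # Sanity check: each conditional distribution should sum to 1.
    for cause, dist in table.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-6, f"probabilities for {cause} do not sum to 1"
    return table

# Hypothetical example for one symptom with two states:
priors = build_priors({"UL interference": 0.2, "Lack of coverage": 0.3})
conditionals = build_conditionals(
    causes=list(priors),
    related={"Lack of coverage"},
    p_related={"Lack of coverage": {"low": 0.15, "high": 0.85}},
    p_unrelated={"low": 0.8, "high": 0.2},
)
print(priors)
print(conditionals)
```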
5. Empirical evaluation

Firstly, this section presents the results of evaluating a knowledge-based model in a live network. Subsequently, the sensitivity of the model outcome to imprecision in its parameters is studied. Finally, a knowledge-based model is compared to a data-based model.

5.1. Knowledge-based model

Following the knowledge acquisition procedure explained in Section 4, a model for diagnosis in GSM/GPRS networks was developed based on the expertise of troubleshooters. The model comprised 29 faults and 51 symptoms. In addition, an automatic diagnosis system, according to the principles in Section 3, was deployed in a live network. The model was fine-tuned while it was tested to diagnose faulty cells. Table 3 shows some of the causes and symptoms in the model.

Table 3
Example of causes and symptoms for GSM/GPRS networks

Causes | Symptoms
Interference in uplink | Dropped call rate (DCR), %
Interference in downlink | % Uplink quality handovers
Bad target cell coverage | % Downlink quality handovers
Combiner fault | % Samples with downlink RXQUAL out of band 0
TRX fault | % Samples with uplink RXLEV < 10
Bad coverage (borders) | % A interface component of DCR
Bad coverage (holes) | % Samples on uplink idle channels out of band 1
A-bis interface fault | A-bis alarms

The trial in the live network had a duration of about six months. Results of the trial are studied in two phases: "Trial 1", which includes partial results at the middle of the trial period; and "Trial 2", which includes final results at the end of the trial period. The main evaluated aspects were the diagnosis accuracy and the time to perform troubleshooting.

Troubleshooting experts analyzed the faulty cells and their assessed diagnosis was compared with that provided by the ADS. Cases were classified by the experts into four groups:

- Unknown diagnosis
  - On going: the analysis was started by the experts, but it was not completed, i.e. the correct cause was not yet known.
  - Inconclusive: the real cause of the problem was unclear and the experts were not able to sort it out. Therefore, it was not possible to know whether the analysis done by the ADS was correct.
- Known diagnosis
  - Correct: the analysis done by the ADS gave the right cause, which was verified and confirmed by the experts.
  - Incorrect: the analysis done by the ADS provided the wrong cause, when the experts knew the correct answer.

Fig. 3 shows the causes proposed by the ADS, classified into the previous four categories. The figure only shows those causes which occurred in some cell during the trial. Results show that there were 23 cases out of 55 where the correct cause was known by the experts. The right cause was proposed 11 times (47.8%) by the ADS and in 12 cases the ADS gave an incorrect cause (52.2%).

Fig. 3. Trial results for first period.

Consequently, the model was fine-tuned during the following three months. The results at the end of that period are shown in Fig. 4. From these results, it can be observed that the proportion of cases where the cause was known by the experts increased from 42% to 66%. As in the results for the first period, there were some causes which were very well spotted by the ADS. However, almost no improvement was achieved for causes 9 (Bad coverage in the borders) and 25 (TRX fault). The total diagnosis accuracy increased from 48% to 61%.

The achieved diagnosis accuracy (61%) was considered quite promising by the troubleshooters. It was realized that fine-tuning of the model was essential, not only to adjust the parameters in the model, but also to identify new causes and symptoms.

Fig. 4. Trial results for second period.

However, the fine-tuning process was considered very time-consuming.

Another analyzed aspect was the time the ADS took to make the analysis. During the trial it was measured that the time the ADS took to carry out the diagnosis ranged from one to five minutes per cell, depending on the NMS load. This time was mainly spent retrieving the symptoms from the NMS, as the time to carry out the inference was negligible. The system significantly reduced troubleshooting time compared to regular human troubleshooting procedures, which can last several days for the most complex cases. Even with an incorrect cause suggestion by the ADS, the analysis results provided extremely valuable clues to the troubleshooting experts.

Finally, the trial made it possible to get a basic idea of the fault causes that could be further investigated. The ADS presented a summary of the abnormal symptoms, which helped to verify the diagnosed cause. Therefore, the automatic diagnosis system was recognized as very valuable and time-saving by all participants in the trial.

5.2. Sensitivity analysis

It is common that the parameters in the model are not those which would achieve the best diagnosis accuracy, either because only a scarce number of training cases is available in data-based models or because of imprecision in the elicited parameters in knowledge-based systems. This is the reason why it is so important to analyze how changes in the specified parameters would impact the diagnosis results.

Basically, there are two approaches to sensitivity analysis: theoretical and empirical. The theoretical approach expresses the posterior probability of interest in terms of the parameters under study. The empirical methods examine the effects of varying the parameters of the model on the diagnosis. In the latter case, the most frequent approach consists in adding random noise to the probabilities and examining the effects on the diagnosis results (Henrion et al., 1996; Kipersztok & Wang, 2001; Pradhan, Henrion, Provan, del Favero, & Huang, 1996).

In order to evaluate the sensitivity of the system proposed in this paper, the procedure has been the following. Random noise at increasing levels has been added to the parameters (i.e. thresholds and probabilities) and its effect on the performance has been evaluated. Both probabilities (expressed as percentages) and thresholds are in the range (0, 100)%. The reason for this in the case of the thresholds is that the symptoms in the model are expressed as percentages (e.g. % downlink quality handovers, % samples with uplink RXLEV < 10). Adding random noise directly to the parameters has two problems. Firstly, a large additive error may produce a parameter out of range (i.e. greater than 100 or lower than 0). Secondly, the same error value seems more serious in parameters near 0 or 100 than in those in the middle range, because the resulting parameter could be very close to one of the limits (leading to a probability equal to zero or one in the case of probabilities, and to a single state of the discretized symptom in the case of thresholds). In order to overcome those problems, it has been proposed to add noise to the log-odds rather than directly to the probabilities (Pradhan et al., 1996) or thresholds. The new parameter with added noise, T, in terms of the nominal parameter t and the noise \epsilon is

T = \frac{100}{1 + \left(\frac{100}{t} - 1\right) 10^{\epsilon}}, \quad \epsilon \sim \mathrm{Normal}(0, \sigma)    (4)

where t and T are expressed in % and \epsilon follows a normal distribution.

For example, Fig. 5 shows the log-odd normal distribution centered around the nominal parameters of 20% and 50% for various values of the standard deviation σ of the noise. Kipersztok and Wang (2001) argue that this distribution is only adequate for standard deviations lower than one. Effectively, in Fig. 5, it can be observed that for values of σ > 1 the distribution becomes bimodal, which is equivalent to considering an expert who assesses a threshold or probability near 0 but errs in judgment by such a margin that the best parameter may be close to 100. This is the reason why σ has been kept in the range (0, 1).

Fig. 5. Log-odd normal distribution centered around the nominal threshold 20 or 50.

A model whose parameters were obtained from 2000 training cases was taken as the starting point for the analysis of sensitivity to imprecision in model parameters. Those parameters were selected because they were considered to be the "optimum", whereas the added noise represented the imprecision in those parameters. In this analysis it is assumed that the probabilities are defined independently of the thresholds. Thus, the followed methodology consists in fixing the parameters of one type (either thresholds or probabilities) and adding noise to the other parameters. This method is useful for knowledge-based models, in which parameters are independently specified.

The sensitivity was separately analyzed for imprecision in the thresholds, in the prior probabilities of causes and in the conditional probabilities of symptoms. In addition, an analysis for imprecision in all the parameters at the same time was also performed. Fig. 6 shows the average diagnosis accuracy vs. the level of noise σ. It can be clearly appreciated that the system is almost insensitive to imprecision in the prior probabilities. On the contrary, the highest sensitivity is related to imprecision in thresholds. This sensitivity analysis proves the importance of setting accurate parameters for the diagnosis model.
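As a concrete illustration of Eq. (4), the following sketch (Python) perturbs a nominal parameter on the log-odd scale; the nominal value and the value of σ are arbitrary examples.

```python
# Sketch of the log-odd noise model of Eq. (4) (values are illustrative).
import random

def add_log_odd_noise(t, sigma):
    """Perturb a parameter t (in %, inside the open interval (0, 100)) as in Eq. (4)."""
    eps = random.gauss(0.0, sigma)  # noise following Normal(0, sigma)
    return 100.0 / (1.0 + (100.0 / t - 1.0) * 10.0 ** eps)

if __name__ == "__main__":
    random.seed(1)
    nominal = 20.0  # nominal threshold or probability, in %
    samples = [add_log_odd_noise(nominal, sigma=0.3) for _ in range(5)]
    print([round(s, 1) for s in samples])  # noisy values always stay inside (0, 100)
```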

Fig. 6. Sensitivity analysis.

5.3. Comparison with data-based model

The aim of the following experiments was to compare the results obtained from a knowledge-based system with those from a data-based system. With that purpose, a three-month trial was carried out in a GSM/GPRS network. The network was composed of about 25,000 cells and 10 NMSs. Every day some problematic cells were identified and their faults were manually diagnosed by experts in troubleshooting. The average values of the main KPIs on the day the fault occurred, together with the diagnosed cause, were saved in a database of classified cases. From those data, 1000 cases were used as a training set and 1000 cases were used as a test set.

Two data-based models were built, which differed in the size of the training set used to calculate their parameters: either a subset of 50 cases or the whole 1000 cases. In order to build both models, an Entropy Minimization Discretization algorithm (Fayyad & Irani, 1993) was applied to discretize the KPIs based on the training set. Once the thresholds were obtained, Maximum Likelihood Estimation was used to learn the probabilities from the training set. In parallel, a knowledge-based model was defined by troubleshooting experts.

Fig. 7 shows the results of evaluating the three models on the same test set. The figure presents the diagnosis accuracy depending on the size of the training set N. It can be observed that, logically, the accuracy obtained with the knowledge-based model does not depend on N. When the size of the training set is small (N = 50), the knowledge-based model outperforms the data-based one. However, when the number of training cases is large (N = 1000), the data-based model provides the best results.

Fig. 7. Comparison of the knowledge-based vs. data-based model.

An important factor to consider is the long time required to build a knowledge-based model (e.g. months), especially taking into account that normally the parameters should be fine-tuned after trials in a real network. However, in most cases, a knowledge-based model is the only option in the RAN of wireless networks, due to the lack of training cases and the difficulties in obtaining them. Thus, a feasible solution to implement in a live network would be to start with a knowledge-based model. Every day, the ADS would be run to help the troubleshooting experts. Then, every time a fault was solved, the symptoms of all faulty cells together with the actual cause would be saved. In this way, the database of training cases would grow with each new case and a data-based model could be ready in a reasonable time.
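For illustration, the sketch below (Python) shows how the probabilities of a data-based model could be estimated by maximum likelihood (relative frequencies) from classified cases whose KPIs have already been discretized. The training cases shown are hypothetical and the entropy-minimization discretization step is omitted.

```python
# Sketch of maximum likelihood estimation of the model parameters from
# classified training cases (symptoms already discretized). Illustrative data.
from collections import Counter, defaultdict

# Each training case: (dict of discretized symptom states, diagnosed cause).
cases = [
    ({"DCR": "high", "% UL quality HOs": "high"}, "UL interference"),
    ({"DCR": "high", "% UL quality HOs": "low"}, "Lack of coverage"),
    ({"DCR": "normal", "% UL quality HOs": "low"}, "Lack of coverage"),
]

def learn_parameters(cases):
    """Estimate P(c_i) and P(S_j = s | c_i) as relative frequencies."""
    cause_counts = Counter(cause for _, cause in cases)
    priors = {c: k / len(cases) for c, k in cause_counts.items()}

    state_counts = defaultdict(Counter)  # (symptom, cause) -> Counter of observed states
    for symptoms, cause in cases:
        for symptom, state in symptoms.items():
            state_counts[(symptom, cause)][state] += 1

    conditionals = {
        key: {state: k / sum(counter.values()) for state, k in counter.items()}
        for key, counter in state_counts.items()
    }
    return priors, conditionals

priors, conditionals = learn_parameters(cases)
print(priors)
print(conditionals[("DCR", "Lack of coverage")])
```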
6. Conclusions

This paper has presented an automatic diagnosis system for the RAN of wireless networks. A knowledge acquisition procedure has been proposed to build the diagnosis model required by the system. The techniques in this paper can be used to increase operational efficiency in current and future wireless networks (GSM, GPRS, UMTS, WLAN, B3G networks, ...).

Experimental results have shown the feasibility of the proposed methods. The knowledge acquisition procedure has made it possible to define a knowledge-based model and to use it in a live network with promising results. A sensitivity analysis has shown that a knowledge-based model is especially sensitive to inaccuracies in the definition of thresholds. Experiments have also proven that a knowledge-based model is the best option when the number of training cases is scarce or nonexistent. On the contrary, when a large database of training cases is available, data-based models should be preferred.

References

Altman, Z., Skehill, R., Barco, R., Moltsen, L., Brennan, R., Samhat, A., et al. (2006). The Celtic Gandalf framework. In Proceedings of the IEEE Mediterranean electrotechnical conference MELECON'06, Benalmádena, Spain.
Barco, R., Wille, V., & Díez, V. (2005). System for automated diagnosis in cellular networks based on performance indicators. European Transactions on Telecommunications, 16(5), 399–409.
Fayyad, U., & Irani, K. (1993). Multi-interval discretization of continuous valued attributes for classification learning. In Proceedings of the international joint conference on artificial intelligence, Chambéry, France.
Heckerman, D., Breese, J., & Rommelse, K. (1995). Decision-theoretic troubleshooting. Communications of the ACM, 38(3), 49–57.
Henrion, M., Pradhan, M., Favero, B. D., Huang, K., Provan, G., & O'Rorke, P. (1996). Why is diagnosis using belief networks insensitive to imprecision in probabilities? In Proceedings of the annual conference on uncertainty in artificial intelligence, Portland, Oregon.
Jamalipour, A., Wada, T., & Yamazato, T. (2005). A tutorial on multiple access technologies for beyond 3G mobile networks. IEEE Communications Magazine, 43(2), 110–117.
Khanafer, R., Solana, B., Triola, J., Barco, R., Nielsen, L., Altman, Z., et al. (in press). Automated diagnosis for UMTS networks using Bayesian network approach. IEEE Transactions on Vehicular Technology. doi:10.1109/TVT.2007.912610.
Kipersztok, O., & Wang, H. (2001). Another look at sensitivity of Bayesian networks to imprecise probabilities. In Proceedings of the international workshop on artificial intelligence and statistics, Florida, USA.
every time a fault was solved, the symptoms of all faulty cells to- artificial intelligence and statistics, Florida, USA.

Laiho, J., Raivio, K., Lehtimäki, P., Hätönen, K., & Simula, O. (2005). Advanced analysis methods for 3G cellular networks. IEEE Transactions on Wireless Communications, 4(3), 930–942.
Lehtimäki, P., & Raivio, K. (2005a). A knowledge-based model for analyzing GSM network performance. In Proceedings of the international conference on industrial and engineering applications of artificial intelligence and expert systems, Bari, Italy.
Lehtimäki, P., & Raivio, K. (2005b). A SOM based approach for visualization of GSM network performance data. In Proceedings of the international symposium on intelligent data analysis, Madrid, Spain.
Ng, G., & Ong, K. (2000). Using a qualitative probabilistic network to explain diagnostic reasoning in an expert system for chest pain diagnosis. In Computers in Cardiology (pp. 569–572). USA: IEEE.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco, California: Morgan Kaufmann.
Pradhan, M., Henrion, M., Provan, G., del Favero, B., & Huang, K. (1996). The sensitivity of belief networks to imprecise probabilities: An experimental investigation. Artificial Intelligence, 85(1–2), 363–397.
Renooij, S., & Witteman, C. (1999). Talking probabilities: Communicating probabilistic information with words and numbers. International Journal of Approximate Reasoning, 22(3), 169–194.
Rish, I. (2001). An empirical study of the naive Bayes classifier. In Proceedings of the international joint conference on artificial intelligence, Seattle, USA.
Steinder, M., & Sethi, A. (2004). Probabilistic fault localization in communication systems using belief networks. IEEE/ACM Transactions on Networking, 12(5), 809–822.
Wietgrefe, H. (2002). Investigation and practical assessment of alarm correlation methods for the use in GSM access networks. In Proceedings of the IEEE/IFIP network operations and management symposium, Florence, Italy.
