
IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, VOL. 57, NO. 4, JULY 2008 2451

Automated Diagnosis for UMTS Networks Using Bayesian Network Approach
Rana M. Khanafer, Beatriz Solana, Jordi Triola, Raquel Barco, Lars Moltsen,
Zwi Altman, Senior Member, IEEE, and Pedro Lázaro

Abstract: This paper presents an automated diagnosis in troubleshooting (TS) for Universal Mobile Telecommunications System (UMTS) networks using a Bayesian network (BN) approach. An automated diagnosis model is first described using the Naïve Bayesian Classifier. To increase the performance of the diagnosis model, the entropy minimization discretization (EMD) method is incorporated into the model to select optimal segments for the discretization of the input symptoms. In the first phase, the diagnosis model is constructed using a dynamic simulator. The simulator TS platform allows generation of a large amount of data required to study the relations between faults and symptoms. In the second phase, the diagnosis model is adapted to a real UMTS network using counters and key performance indicators (KPIs) recovered from an Operations and Maintenance Center (OMC). Results for the automated diagnosis using both network simulator and real UMTS network measurements illustrate the efficiency of the proposed TS approach and its importance to mobile network operators.

Index Terms: Automated diagnosis, Bayesian networks (BNs), entropy minimization discretization (EMD), faults, symptoms, troubleshooting (TS), Universal Mobile Telecommunications System (UMTS) network.

Manuscript received February 21, 2007; revised May 28, 2007, September 17, 2007, and October 3, 2007. The review of this paper was coordinated by Prof. Y.-B. Lin.
R. M. Khanafer, J. Triola, and Z. Altman are with France Telecom R&D, 92794 Issy les Moulineaux, France (e-mail: rana.khanafer@orange-ftgroup.com; jordi.triolabosch@orange-ftgroup.com; zwi.altman@orange-ftgroup.com).
B. Solana is with Telefónica I+D, 28043 Madrid, Spain (e-mail: solana@tid.es).
R. Barco and P. Lázaro are with ETSI Telecomunicación, University of Málaga, 29071 Málaga, Spain (e-mail: rbm@ic.uma.es; plazaro@ic.uma.es).
L. Moltsen was with Moltsen Intelligent Software, 9220 Aalborg, Denmark. He is now with Wirtek, 9220 Aalborg, Denmark (e-mail: lars.moltsen@Wirtek.com).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVT.2007.912610
0018-9545/$25.00 © 2008 IEEE

I. INTRODUCTION

THE MOBILE telecommunication industry has experienced significant changes in the recent past, and it will continue to evolve in the foreseeable future. The current scenario comprises a complex set of interrelated and rapidly growing wireless networks, applications that require increasing bandwidth, and users who demand high quality of service at low cost but with a limited spectrum. In a few years, the highly complex and heterogeneous Radio Access Network (RAN) will comprise different technologies, such as the Global System for Mobile Communications (GSM), the Universal Mobile Telecommunications System (UMTS), and the Wireless Local Area Network (WLAN). As a result, the operation of the RAN will be a tough challenge that the operators will have to tackle. In addition, cellular network operators have to find ways to reduce the cost of their services and improve their quality to counter the threat posed by emerging technologies, such as telephony based on the WLAN.

In a mature cellular network that has undergone most of its site roll out, the major cost is associated with the operation of the network. As the network consists of a high number of pieces of equipment that are distributed across the entire country, maintaining and operating this large and technically complicated system is a difficult task that requires operator personnel around the clock in several regional offices. Even with reliable hardware and software, there are always faults that have to be rectified as otherwise, the end user will either experience suboptimal service levels or no service at all. As in most countries, several operators are competing for subscribers, and it is imperative to quickly rectify such occurrences because otherwise, users will naturally switch to competing network operators. Hence, fault management, also called troubleshooting (TS), is a key aspect of the operation of a cellular system in a competitive environment. As the RAN of cellular systems is by far the biggest part of the network, most TS activities are focused on this area.

TS comprises the following three tasks: 1) fault detection (FD); 2) cause diagnosis (i.e., identification of the problems’ cause); and 3) solution deployment, namely fixing the problem. Among the TS tasks, the diagnosis of the cause of faults is the most complex and time-consuming one. A cause could be a hardware failure (like a broken base-band card in a node B) or a bad parameter value (i.e., transmission power, antenna tilt, or a control parameter such as a Radio Resource Management (RRM) parameter). The term symptom refers to indicators that may help to identify the fault cause. There are two types of symptoms, i.e., counters and/or key performance indicators (KPI), and alarms.

The first steps in the automation of the TS process in cellular networks have been focused on performance visualization [1] and FD [2]–[5]. Regarding automatic diagnosis, very few references can be found on the diagnosis in the RAN of cellular networks. However, automatic diagnosis has been extensively studied in other fields, such as diagnosis of diseases in medicine [6], TS of printer failures [7], and diagnosis in the core of communication networks [8]. However, diagnosis in the RAN of cellular networks has some distinctive characteristics, such as the continuous nature of performance indicators and the existence of logical faults, such as a wrong configuration, that are not related to a physical piece of equipment. This makes the automatic diagnosis techniques used in other application domains not directly applicable to cellular networks.
Research into the automation of diagnosis in the RAN of cellular networks has been traditionally focused on alarm correlation [9]. Alarm correlation consists of the conceptual interpretation of multiple alarms so that new meanings are assigned to the original alarms. It is a process that encompasses different tasks, e.g., reduction of multiple occurrences of an alarm into a single alarm, inhibition of low-priority alarms in the presence of higher priority alarms, substitution of a specific set of correlated alarms by a new one, etc. Although alarm correlation can be considered a first step in the diagnosis of faults, it does not provide conclusive information to identify the cause of problems, particularly if the possible causes are not only faults in pieces of equipment. Other categories of faults, such as interference or lack of coverage, are difficult to identify if performance indicators are not also considered.

Some automatic diagnosis systems have also been proposed for cellular networks, and they have been tested on GSM/General Packet Radio Service (GPRS) networks. In [10], an automatic diagnosis system for cellular networks was proposed, which identified the fault cause based on the values of performance indicators. The reasoning method, which used a Naïve Bayesian Classifier [11], can be applied to diagnosis in GSM/GPRS, third generation (3G), or multisystems networks. In addition, a diagnosis model for GSM/GPRS RANs was also described. In [12] and [13], Bayesian networks (BNs) [14] were proposed as the reasoning method for automatic diagnosis in cellular networks. In those papers, performance indicators were modeled as discrete random variables, and the trials were carried out in live GSM/GPRS networks. Expert systems built according to this Bayesian approach have many advantages when compared to other techniques used in other application domains for diagnosis. For example, BNs efficiently model the uncertainty inherent to human reasoning. In [12] and [13], the parameters of the diagnosis model (probabilities and thresholds for the discretization of performance indicators) were specified by diagnosis experts.

The aim of this paper is to propose an automated fault diagnosis system for UMTS networks. This paper provides a complete view of the diagnosis process, from the mathematical model to its implementation in a real UMTS network. To apply automated diagnosis techniques developed for GSM to UMTS RAN, the particularities of this technology have to be taken into account. A UMTS network is an interference-limited system in which users share the same frequency bandwidth. Parameter modification or faults in one sector could impact the performance of neighboring sectors. A typical example is the modification of macrodiversity (MD) hysteresis windows (i.e., add and drop windows) in one station that results in the creation or suppression of links in neighboring cells. Similarly, alarms in a given station could be triggered due to parameter modifications or faults occurring in a neighboring station. In addition, in this paper, some methods to learn the parameters of the model from data are proposed. Specifically, a method to calculate the probabilities in the BN and two algorithms to discretize the continuous symptoms are adopted.

This paper is organized as follows. First, in Section II, BNs will be briefly introduced, and their application to diagnosis modeling in cellular networks will be described. In Section III, the adaptation of the BN model for UMTS fault diagnosis is presented, taking into account the particularities of this RAN. In this section, the results obtained when testing the diagnosis model using a semidynamic simulator are also presented. In Section IV, the results obtained when using the model in a live UMTS network are described. Finally, conclusions and future lines of work are outlined in Section V.

II. DIAGNOSIS SYSTEM BASED ON BNS

A. Introduction to BNs

A BN [14] is a pair $(D, P)$ that allows efficient representation of a joint probability distribution over a set of random variables $U = \{X_1, \ldots, X_n\}$. $D$ is a directed acyclic graph (DAG) whose vertices correspond to the random variables $X_1, \ldots, X_n$ and whose edges represent direct dependencies between the variables. The second component $P$ is a set of conditional probability functions, one for each variable, $P = \{p(X_1 \mid \Pi_1), \ldots, p(X_n \mid \Pi_n)\}$, where $\Pi_i$ is the parent set of $X_i$ in $U$. If $D$ contains a link from $X_i$ to $X_j$, then $X_i$ is a parent of $X_j$, and thus, it belongs to $\Pi_j$. Similarly, $X_j$ is a child of $X_i$. For example, in Fig. 1, the four parent sets are $\Pi_A = \{C, D\}$, $\Pi_B = \{D\}$, and $\Pi_C = \Pi_D = \emptyset$.

Fig. 1. Example of the DAG of a BN.

The set $P$ defines a unique joint probability distribution over $U$ given by

$$p(U) = p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid \Pi_i). \tag{1}$$

Note that when the random variables $X_i$ have a finite set of states, the rightmost part of (1) is usually a much more compact representation of $p(U)$ than the more straightforward representation $p(X_1, \ldots, X_n)$, which exponentially grows with the number of random variables.

A BN can be used to infer $p(X_j = x \mid E)$, representing the probability of a variable $X_j$ being in a certain state $x$, given the available evidence $E = (y_1, \ldots, y_k)$, where $y_i$ is the observed state of the variable $Y_i$, and $\{Y_1, \ldots, Y_k\} \subset \{X_1, \ldots, X_n\}$. In the following sections, this shall be used to compute accurate probabilities of fault causes given the observed “symptoms.”

B. Diagnosis System

Two components of the automatic diagnosis system have been distinguished, i.e., the model and the inference method. The diagnosis model represents the knowledge on how the identification of fault causes is carried out. The elements of the
model are causes and symptoms. The inference method is the algorithm that identifies the cause of the problems based on the value of the symptoms.

Defining the diagnosis model comprises several phases. First, the causes and symptoms for diagnosis in a UMTS network have to be identified (see Section IV). Causes can be modeled as discrete random variables with two states (absent/present). We consider two types of symptoms, i.e., Alarms and KPIs. Alarms can also be modeled as discrete random variables with two states (off/on). KPIs can be modeled as discrete random variables with two, three, or more states, each representing a subset of the continuous range of the KPI.

The chosen BN structure is a so-called Naïve Bayes Model (NBM), which is also known as a Simple Bayes Model or Naïve Bayesian Classifier. The NBM (Fig. 2) consists of a parent node $C$, whose states are the possible fault causes, and children nodes $S_1, \ldots, S_N$, which represent the symptoms and may have any discrete number of states.

Fig. 2. NBM.

It should be pointed out that since a random variable can only be in one of its finite set of states, this model assumes that there can only be one fault happening at the same time. This is a simplification that is not always true in real life, but its impact on the diagnosis performance is considered minor, and the simplicity of this type of model is a clear benefit.

According to Section II-A, the required probabilities for this model are the following:
- $p(C = c_k)$ for each fault cause $c_k$;
- $p(S_i = s_{i,j} \mid C = c_k)$ for each state $s_{i,j}$ of $S_i$ and each fault cause $c_k$.

Thus, with the NBM structure, the desired conditional probabilities of the fault causes given the observed symptoms can be calculated as

$$p(C = c_k \mid E) = \frac{p(C = c_k) \prod_{i=1}^{N} p(S_i = s_{i,j} \mid C = c_k)}{p(E)}. \tag{2}$$

This equation assumes that the symptoms $S_i$ are independent given the fault cause $C$. According to the independency properties of a BN, this is exactly what the NBM structure indicates.

C. Specifying KPI Thresholds

The different probabilities in the right-hand side of (2) are calculated from data points corresponding to the analyzed cases stored in a database (i.e., fault causes with the corresponding symptoms). From the stored data, histograms are computed. To derive closed-form (continuous) probability density functions (pdfs), a very large database is required, which is seldom available [10]. When the number of points is relatively small (i.e., a couple of tens of points for each symptom and fault cause), probability tables lead to more accurate results of (2) than approximated continuous pdfs. The derivation of probability tables calls for discretization techniques in which thresholds for the symptoms (i.e., counters and KPIs) are specified. When mapping a continuous value to an interval, some information is lost. Thresholds should be set so that the resulting states throw away as little information as possible. This typically means that the state provides information about the presence or absence of a fault. As an example, a symptom could be defined for the “Dropped Call Ratio.” One state for this symptom could represent the “normal” situation, when no fault is present. This could cover the range from 0% to 2%. The next state could cover the “high” range from 2% to 10%, and finally, a third state could cover the extreme or “very high” range from 10% to 100%. Clearly, the two latter states somewhat indicate the presence of a problem.

The threshold can be either defined by diagnosis experts or learnt from training data. If enough data are available for training, the latter leads to a more accurate model than the former and, consequently, to a better performance of the diagnosis system. In this section, the methods used in the proposed diagnosis system for UMTS are presented.

Two different discretization methods for automatically specifying thresholds have been analyzed: an unsupervised method [the percentile-based discretization (PBD)] and a supervised one [the entropy minimization discretization (EMD)].

1) PBD: The PBD method is a very simple discretization method where the user first specifies a percentage $X$. Then, the $X$% percentile of the symptom values in a training set is the state threshold.

The PBD method is appropriate when the data contain no information about the presence of faults. For example, let us assume that some user has a database with symptom values for 1000 cells of a UMTS network. Then, for the “Dropped Call Ratio” KPI with three states (“normal,” “high,” and “very high”), the threshold between “normal” and “high” could be calculated as the 90% percentile, and the threshold between “high” and “very high” could be the 99% percentile. However, when using this method, the user should always have in mind where the used data set comes from. That is, the previous example would work when the data are mainly composed of well-performing cells. Nevertheless, if the data come from the top 100 worst performing cells, then the aforementioned percentiles would probably not make sense.

2) EMD: The EMD method presented here requires the data used for learning to contain not only the symptom values but also the observed fault causes.

In the machine learning literature, several discretization methods have been proposed. This paper is focused on the EMD method, which is based on entropy calculation. Entropy-based discretization methods have been shown to be efficient with advantages with respect to other techniques [15], thus justifying their usage for diagnosis in cellular networks.
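For comparison with the supervised procedure, the unsupervised PBD thresholds of 1) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the nearest-rank percentile definition, and the convention that a value equal to a threshold falls into the higher state are all assumptions made here for illustration.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100 (assumed definition)."""
    ordered = sorted(values)
    n = len(ordered)
    # Integer ceiling of pct * n / 100, kept in integer arithmetic on purpose.
    rank = max(1, (pct * n + 99) // 100)
    return ordered[rank - 1]


def pbd_thresholds(values, percentages):
    """One threshold per user-specified percentage X (hypothetical helper)."""
    return [percentile(values, pct) for pct in percentages]


def discretize(value, thresholds):
    """State index: 0 below the first threshold, 1 from the first to the second, ..."""
    return sum(value >= t for t in thresholds)


# Toy training set: one symptom value per cell (illustrative numbers only).
training_values = list(range(1, 101))
thresholds = pbd_thresholds(training_values, [90, 99])
states = ["normal", "high", "very high"]
label = states[discretize(95, thresholds)]
```

With the 90% and 99% percentiles, about 10% of the training cells end up outside the “normal” state whatever their actual condition, which is why the origin of the data set matters.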

Fig. 3. “Knowledge Builder” tool.

The EMD method is a minimal-entropy-based heuristic that was developed by Fayyad and Irani [16]. For each continuous-valued attribute $S_i$, the algorithm is defined as follows.
1) The cases of a training set $D$ are first sorted by increasing value of the symptom $S_i$. The midpoint $m_j$ between the symptom values in each successive pair of cases in the sorted sequence is considered as a potential threshold.
2) Each candidate threshold $m_j$ partitions the data set $D$ into two subsets $D_1$ and $D_2$. The class entropy of the partition is evaluated as described below.
3) The best threshold $t_i$ for $S_i$ is the candidate point that minimizes the class entropy of the partition.

In [16], boundary cut points were defined, which are values of $m_j$ between two cases with different causes in the sequence of sorted cases. It was proven that the evaluation of only the boundary cut points is sufficient to find the minimum class entropy, which, in general, greatly reduces the number of candidates to be evaluated.
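The candidate generation of step 1, restricted to boundary cut points as defined above, can be sketched in Python as follows. This is an illustrative sketch only: the function name and the `(value, cause)` pair representation are assumptions, and pairs of cases with identical symptom values are simply skipped, which sidesteps the more careful treatment of ties in [16].

```python
def boundary_cut_points(cases):
    """Midpoints between successive sorted cases whose causes differ.

    cases: list of (symptom_value, cause) pairs (assumed representation).
    """
    ordered = sorted(cases, key=lambda case: case[0])
    cuts = []
    for (v1, c1), (v2, c2) in zip(ordered, ordered[1:]):
        # Only a change of cause between neighbours yields a boundary point;
        # equal values cannot be separated by a threshold, so they are skipped.
        if c1 != c2 and v1 != v2:
            cuts.append((v1 + v2) / 2)
    return cuts


# Toy data (illustrative values): five cases, two causes.
toy_cases = [
    (8.0, "normal"),
    (1.0, "normal"),
    (4.0, "pilot+"),
    (2.0, "normal"),
    (6.0, "pilot+"),
]
toy_cuts = boundary_cut_points(toy_cases)
```

Here four midpoints exist between the five sorted cases, but only two of them separate cases with different causes, so only those two need to be evaluated.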
Let $|R|$ denote the number of cases in a subset $R$, and let $|R(c_k)|$ be used for the number of cases in $R$ with $C = c_k$. The class entropy of the subset $R$ is defined as

$$\mathrm{Ent}(R) = -\sum_{k=1}^{K} \frac{|R(c_k)|}{|R|} \cdot \log_2 \frac{|R(c_k)|}{|R|} \tag{3}$$

where $K$ is the number of causes.
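As a small worked illustration of (3), the following Python sketch computes the class entropy of a subset given only the list of causes of its cases. The function name is an assumption; causes that do not occur in the subset are treated as contributing zero, the usual convention for $0 \cdot \log_2 0$.

```python
import math
from collections import Counter


def class_entropy(causes):
    """Ent(R) of (3) for a subset R given as a list of cause labels."""
    total = len(causes)
    counts = Counter(causes)
    # Only causes present in R appear in counts, so log2 is never taken of 0.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


mixed = class_entropy(["normal", "normal", "pilot+", "pilot+"])
pure = class_entropy(["normal", "normal", "normal"])
```

A subset split evenly between two causes has an entropy of one bit, whereas a subset containing a single cause has zero entropy, which is what makes low-entropy partitions informative about the fault cause.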


Let $m_j$ be a boundary cut point of the set of $D$ cases, which partitions it into the subsets $D_1$ and $D_2$. The class information entropy of the partition induced by $m_j$ is the weighted average of the class entropies of the subsets

$$\mathrm{Ent}(D, m_j, S_i) = \frac{|D_1|}{|D|} \cdot \mathrm{Ent}(D_1) + \frac{|D_2|}{|D|} \cdot \mathrm{Ent}(D_2) \tag{4}$$

where $D_1$ is the subset of cases in $D$ with values of $S_i$ lower than $m_j$, and $D_2$ is the set of $D$ cases excluding those in the set $D_1$. Thus, a binary discretization for $D$ is determined by selecting the threshold $t_i$ for which $\mathrm{Ent}(D, m_j, S_i)$ is minimal among all the candidate cut points $m_j$. This method can then be recursively applied to both of the partitions induced by $t_i$ until some stopping condition is achieved, thus creating multiple intervals for the symptom $S_i$.

D. Specifying Probabilities

The prior probabilities of the causes can be easily defined by diagnosis experts, taking into account their experience on the frequency of occurrence of each fault type. However, experts find it much more difficult to elicit the conditional probabilities of symptoms given the causes. Thus, those probabilities can be learned from training data by means of Laplace’s law of succession. That is

$$p(S_i = s_{i,j} \mid C = c_k) = \frac{n^{k}_{i,j} + 1}{N^{k} + |S_i|} \tag{5}$$

where $n^{k}_{i,j}$ is the number of cases where the cause was $c_k$ and the symptom $S_i$ was in state $s_{i,j}$, $N^{k}$ is the number of cases where the cause was $c_k$, and $|S_i|$ is the number of states of the symptom $S_i$.

E. TS Toolset: TheCure

Within the Eureka Celtic Gandalf project [17], a toolset has been developed that supports the methodology presented in the previous sections. This toolset has been used to produce the results documented in Sections III and IV.

The toolset consists of a “Knowledge Builder” tool that enables the user to specify a model from expert knowledge and/or from available learning data and an execution tool known as “TheCure.”

Fig. 4. Execution tool known as “TheCure.”

Fig. 3 shows a screen dump of the “Knowledge Builder,” where a number of fault causes are shown on the upper left part, and a number of symptoms are shown on the lower left part. Fig. 4 shows “TheCure” with five cells loaded. On the lower right part, the evidence (i.e., observed KPI values) is shown for the cell selected in the left panel. Above the evidence is the actual diagnosis, that is, the computed probabilities of fault causes given the evidence.
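The diagnosis displayed by such an execution tool is the posterior of (2). The Python sketch below is not the implementation of TheCure, which the paper does not describe; it only illustrates, under stated assumptions, how conditional probability tables learned with (5) feed into (2). All names and data layouts are hypothetical, and $p(E)$ is obtained by summing the numerator of (2) over all causes, which relies on the single-fault assumption of the NBM.

```python
from collections import Counter, defaultdict


def learn_cpt(cases, num_states):
    """Laplace estimate (5) for one symptom.

    cases: list of (cause, state_index) pairs (assumed representation).
    Returns {cause: [p(state 0 | cause), p(state 1 | cause), ...]}.
    """
    counts_per_cause = defaultdict(Counter)
    for cause, state in cases:
        counts_per_cause[cause][state] += 1
    cpt = {}
    for cause, counts in counts_per_cause.items():
        n_k = sum(counts.values())  # N^k: number of cases with this cause
        cpt[cause] = [
            (counts[s] + 1) / (n_k + num_states) for s in range(num_states)
        ]
    return cpt


def posterior(priors, cpts, evidence):
    """Posterior (2): priors {cause: p}, one CPT and one observed state per symptom."""
    joint = {}
    for cause, prior in priors.items():
        p = prior
        for cpt, state in zip(cpts, evidence):
            p *= cpt[cause][state]
        joint[cause] = p
    p_evidence = sum(joint.values())  # p(E), summed over mutually exclusive causes
    return {cause: p / p_evidence for cause, p in joint.items()}


# Toy example: one two-state symptom (0 = "normal", 1 = "high"), made-up cases.
toy_cases = [("normal", 0), ("normal", 0), ("normal", 0), ("normal", 1),
             ("pilot+", 1), ("pilot+", 1), ("pilot+", 1), ("pilot+", 1)]
toy_cpt = learn_cpt(toy_cases, num_states=2)
toy_posterior = posterior({"normal": 0.5, "pilot+": 0.5}, [toy_cpt], [1])
```

Because of the added one in (5), a state never observed for a cause still receives a nonzero probability (1/6 in the toy data), so a single unseen symptom value cannot force a posterior to zero.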
III. AUTOMATED DIAGNOSIS FOR UMTS RAN USING A NETWORK SIMULATOR

Building the diagnosis model for a UMTS network requires determining appropriate causes and the available indicators (symptoms) to efficiently infer causes of faults. Data from the network are “expensive,” and constructing a significant statistical database requires precious time from radio experts. A dynamic network simulator makes it possible to produce a large amount of “cheap” data needed to adapt the BN diagnosis model to the new technology. It allows determination of the data requirements to construct an accurate diagnosis model, i.e., number of case studies, the symptoms associated to each cause, etc. A simulator can be used to study only a subset of problems that can occur in a real network: causes related to antenna and system parameters (i.e., common channel power, maximum base station (BS) transmitted power, neighboring list declaration) and causes related to different RRM functionalities, such as admission and congestion control or mobility. The simulator used in this paper is a semidynamic simulator based on correlated snapshots with time intervals on the order of 1 s [18].

A. Model Construction

Although one can benefit from the TS experience in GSM [12], [13], the particularities of the UMTS technology should be taken into account when developing the diagnosis model. For example, UMTS is an interference-limited system in which interference can produce coupling effects between neighboring cells. Hence, a faulty cell can considerably reduce the performance in a neighboring cell.

A reduced model has been built to prove the feasibility of the techniques proposed in this paper. The causes and symptoms in the model are the following ones.

1) Causes: Two types of fault causes are considered, i.e., hardware problems and bad parameter values, which result in poor quality indicators and alarms. A parameter value could be too big or too small (with respect to an optimal value), denoted, respectively, as Par+ and Par−. The faults considered by the simulator are the following.

Channel element breakdown: This cause is a hardware problem. One or several channel elements in a node B could be out of service.

Pilot power: A Common Pilot Channel (CPICH) power that is too high (Pilot+) will extend the service zone of the node B too much, so that the node B becomes overloaded. Conversely, a pilot power that is too low (Pilot−) will decrease the cell extent too much, reduce its load, and push traffic to neighboring cells.

Antenna tilt: As in the pilot case, tilt+/− affects the cell extent, its load, and that of its neighbors. An up-tilted antenna will create interference in neighboring cells and will deteriorate their performance, whereas down-tilted antennas will reduce the cell range and may cause coverage holes.

Mobility parameters: Mobility parameters are of particular importance in mobile networks. The hysteresis events 1A and 1B, or add and drop windows, respectively, for soft handover (HO), are considered here. The add window defines the threshold for adding a new link to the active set of a mobile, and the drop window defines the threshold for removing an existing link. For simplicity of notation, the add and drop window parameters with too high and too low values are denoted by RRM_MD+ and RRM_MD−, respectively. It is supposed that the difference between the add and drop windows is kept constant as typically implemented in real networks. The add and drop windows of a given cell have an impact on the creation and suppression of links of its neighboring cells. In other words, poor quality in one cell could be caused by a fault in another cell. Hence, one needs to keep track of certain symptoms of a BS and of its neighbors, making the TS problem more complex. In a similar manner, one could consider other RRM parameters, such as admission and congestion control parameters.

2) Symptoms: In the context of a simulator, alarms can be defined using both counters and KPIs. When a KPI value exceeds a predetermined threshold, an alarm can be triggered. In a real network, such an alarm would correspond to a “flag” raised when processing data from the Operations and Maintenance Center (OMC) or capture tool. The following symptoms are considered.
- blocked call rate (BCR);
- dropped call rate (DCR);
- MD blocking rate. If a request to establish a new (additional) link with a BS is denied, it is considered as MD blocking. The ratio between the number of MD blockings and the total number of requests to establish additional links defines the MD blocking rate. The MD blocking rate of a BS is calculated here as the average blocking rate of all the mobiles having this station as the best server;
- capacity/throughput. For real-time traffic, capacity is given in terms of the number of mobiles per service. For nonreal-time traffic, the downlink throughput is used as a capacity indicator;
- ping-pong. The ping-pong KPI is calculated as the frequency of active set updates.

B. Case Study

This section illustrates the application of the BN model for the automated diagnosis of UMTS networks using the semidynamic network simulator. Hence, data from the simulator are used to adapt and fine tune the reference model and to assess the amount of data necessary for accurate diagnosis.

1) Simulation Setup: The causes and symptoms used for constructing the diagnosis model are listed below.

Causes:
- channel element breakdown in a BS (hardware fault);
- bad settings of system and RRM parameters: CPICH power, antenna tilt, and mobility parameters (add and drop windows).

In the construction of the Bayesian model, parameters that are too high or too low are considered to be distinct fault causes. In addition, a “normal” state has been included in the Cause node, which stands for nonfaulty cells triggering an alarm.

Symptoms:
- BCR;
- DCR;
- MD blocking rate;
- capacity;
- ping-pong indicator.

To improve the model accuracy and to take into account the interference-related coupling effects between neighboring BSs [19], the KPIs of neighboring cells (from the list above) have also been included in the BN model.

A 66 trisectorial site network in a dense urban environment is considered (with a total of 196 sectors, with two sites having only two sectors). A buffer zone with 151 sectors is added to minimize the truncation effects and is used in all the simulations for network evaluation (see Fig. 5).

Fig. 5. One hundred ninety-six sector UMTS network for testing the automated diagnosis model. A buffer zone of 151 sectors is added to minimize truncation effects.

For each given cell and each selected KPI, a neighboring KPI has also been computed by averaging the values of that KPI over the four neighbors having the highest traffic flux with that cell. Each KPI (either neighboring or serving cell) has been averaged over 3000 time steps of 1 s in the simulator. The statistics for cells with faults have been calculated as follows. Fifteen simulations for each one of the seven faults have been carried out. In each one of these simulations, one single fault was introduced into 14 of the 196 cells. Thus, 210 (= 14 cells/simulation · 15 simulations) data points per fault have been stored in the data set. Finally, 210 nonfaulty cells (normal conditions) having triggered an alarm along the 105 different simulations have also been added in the whole data set.

In BN inference, the symptoms’ histograms for both normal and faulty cells are of particular interest and are illustrated below. Fig. 6 compares the BCR histograms for the normal and faulty cells in the case of a pilot value (denoted as Pilot+) of 38 dBm that is too high (33 dBm is considered as a normal value). A shift to the right of the Pilot+ histogram indicating quality degradation can be clearly noticed in the figure.

Fig. 6. BCR histograms for (black) nonfaulty and (white) faulty cells due to excess of pilot power.

The histogram for the MD blocking rate for the normal and faulty cells in the case of a pilot value that is too high is depicted in Fig. 7. As in the BCR case, the excess of pilot power results in shifting the histogram to the right.

Fig. 7. MD blocking call rate histograms for (black) nonfaulty and (white) faulty cells due to excess of pilot power.

2) Diagnosis Performance Using Two Different Discretization Methods: The discretization techniques and the performance evaluation are now considered. Two different discretization methods have been analyzed, i.e., an unsupervised method (the PBD) and a supervised one (the EMD). The threefold cross-validation statistical test [20] has been selected to compute the diagnosis accuracy of both discretization techniques. Thus, the entire diagnosis workflow (discretization, model probabilities estimation, and performance evaluation) has been repeated three times using at each iteration different couples of training and test sets. The final results are calculated by averaging the performances from three iterations.

For comparison purposes, both discretization techniques use a four-state segmentation of all the symptoms, having for each KPI the states labeled “low,” “normal,” “high,” and “very high.” Hence, for the EMD method, the algorithm described in Section II-C has been recursively applied twice, whereas for the PBD, the 70th, 80th, and 90th percentile of each symptom distribution have been considered to find the three thresholds. For both methods and for each iteration, a training set of 140 data points (cases) per fault has been used, namely two thirds of the whole data set of 210 cases.

Once the discretization process has been performed, the estimation of the BN parameters, i.e., the prior probabilities and the conditional ones in (4), is carried out by means of the method described in Section II-D.

Table I summarizes the discretization obtained for the BCR symptom using both methods and their associated conditional probabilities related to “normal” and Pilot+ faulty cells,
i.e., $P(S_i = s_{i,j} \mid \text{“normal”})$ and $P(S_i = s_{i,j} \mid \text{Pilot+})$, respectively, where in this case, $S_i$ is the BCR symptom. This table is derived from the histogram previously depicted in Fig. 6.

TABLE I
CONDITIONAL PROBABILITY TABLE FOR NORMAL AND FAULTY CELLS IN THE CASE OF EXCESS PILOT POWER PROBLEM, i.e., P(BCR|“normal”) AND P(BCR|Pilot+), RESPECTIVELY, USING (A) EMD AND (B) PBD

Fig. 8. Entropy versus boundary cut points for the neighboring ping-pong symptom.

TABLE II
DIAGNOSIS ACCURACY COMPARISON USING PBD AND EMD DISCRETIZATION TECHNIQUES

Fig. 9. Entropy versus boundary cut points for the DL throughput symptom.

TABLE III
DIAGNOSIS RESULTS: THE UPPER PART CORRESPONDS TO THE PERCENTAGE OF CORRECT DIAGNOSIS OF FAULTY CELLS, AND THE LOWER PART CORRESPONDS TO THE CORRECT DETECTION OF FAULTY TRIGGERED ALARMS
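The percentile-based discretization (PBD) and a conditional probability table of the kind summarized in Table I can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the three thresholds are taken as the 70th, 80th, and 90th percentiles of one pooled training sample, the four states are assumed to be ordered from “low” to “very high,” and the conditional probabilities are plain relative frequencies, which may differ from the estimation method of Section II-D. The function names (`pbd_thresholds`, `discretize`, `conditional_table`) are hypothetical and do not come from the paper.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles

# Assumed ordering of the four symptom states, from lowest to highest values.
STATES = ("low", "normal", "high", "very high")


def pbd_thresholds(values):
    """Return the 70th, 80th, and 90th percentiles of a symptom sample."""
    # quantiles(..., n=10) returns the nine decile cut points (10th..90th).
    deciles = quantiles(values, n=10, method="inclusive")
    return [deciles[6], deciles[7], deciles[8]]


def discretize(value, thresholds):
    """Map a continuous symptom value to one of the four states."""
    # bisect_right counts how many thresholds are <= value (0..3).
    return STATES[bisect_right(thresholds, value)]


def conditional_table(values, thresholds):
    """Estimate P(S = state | cause) as a relative frequency.

    `values` are the symptom samples observed for one cause
    (for example "normal" cells or Pilot+ faulty cells).
    """
    counts = Counter(discretize(v, thresholds) for v in values)
    return {state: counts[state] / len(values) for state in STATES}


if __name__ == "__main__":
    sample = list(range(1, 101))          # toy symptom values, not real data
    cuts = pbd_thresholds(sample)
    print(cuts)
    print(conditional_table(sample, cuts))
```

Because `bisect_right` is used, a value exactly equal to a threshold falls into the upper state; the paper does not say how ties are handled, so this is a design choice of the sketch.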
In the following step, the remaining 70 data points per fault (i.e., one third of the whole data set) are introduced into the Execution module to perform the diagnosis. The final diagnosis is obtained using (2), where the different probabilities are selected from the tables. The table entry is defined by the row with the interval to which the symptom belongs. By computing (2) for all possible causes, one obtains the conditional posterior probabilities of all possible fault causes given the set of symptoms and, in particular, the cause with the highest probability. Table II summarizes the obtained results, where fault diagnosis and false alarm detection illustrate the ability of the system to identify faulty and nonfaulty cells (and not only to diagnose faults), respectively. One can see how the discretization impacts the inference quality. The EMD outperforms the PBD method in terms of diagnosis accuracy for both fault diagnosis and false alarm detection.

3) Detailed Results Using the EMD Method: In this section, the results using the EMD method are presented in more detail. Figs. 8 and 9 show the entropy calculated using (4) as a function of the boundary cut point for two symptoms [i.e., neighboring ping-pong and downlink (DL) throughput] for each one of the two iterations of the algorithm. The continuous gray line represents the entropy evolution as a function of the boundary cut points for the first iteration, whereas the dotted black and dashed gray lines represent the entropy evolution for the first and second subsets of the second iteration of the algorithm, respectively. The minimum value for each curve determines the best threshold and is represented in the figures as a vertical line.

Table III gives the overall performance of the automated diagnosis using the EMD technique with a threefold cross-validation test for the first two causes with the highest probabilities. For each one of the threefold validation tests, 490 different faulty cells (70 points for each of the seven faults) have been introduced into the Execution module as the test set. On average (computed over the three different test sets), 344 among the 490 generated faults have been correctly diagnosed, i.e., the highest probability has been attributed to the correct fault, which gives a diagnosis accuracy of 70.2%. In 88 of the remaining 146 cases, the
probability given by the BN to the correct diagnosis has been the second highest probability (i.e., in 88.2% of the cases, the right fault has been the first or the second in the list of causes ranked by probability).

Next, the ability of the system to identify nonfaulty cells (and not only to diagnose faults) has been assessed. Together with the 490 faulty cells, 70 different false alarm cells have been present in each one of the three test sets. Among the different simulations, on average, 34 out of the 70 nonfaulty cells triggering alarms have been correctly identified as normal cells without faults, and 16 of the remaining cells have been identified as faulty but with a “normal” second diagnosed state (i.e., 71.4% of the cases have been identified as normal in the first or the second cause ranked by probability).

4) Convergence: The number of cases required in the BN learning phase (i.e., the number of data points for BSs triggering alarms due to faults and false alarms) is an important point. A convergence study has shown that diagnosis results converge from 85 data points onward. Hence, the choice of 140 data points per fault for training is adequate.

IV. TS IN REAL UMTS NETWORKS

This section describes the application of automated TS in a real UMTS network. First, the diagnosis model is derived following the three steps below (see Section II-B):
1) identification of the fault’s causes and their associated symptoms;
2) BN construction (structure and number of states);
3) model learning using expert knowledge and data extracted from an OMC (thresholds and probabilities).
Then, FD is performed using the learned model. It should be pointed out that although certain parts of the model creation are automatically performed, the role of the radio expert and the incorporation of the expert knowledge have proven to be essential.

A. Model Creation

The starting point for generating the UMTS TS model is the identification of appropriate symptoms (counters and KPIs) that could help in the FD. Counters and KPIs recovered from the OMC are the sole available information. Once the symptoms are identified, the faults that directly or indirectly have an impact on these counters are selected. Finally, the symptom–fault relation is determined. The proposed model considers a cell to be problematic if the DCR is high, namely higher than 1%.

The identified faults that can cause a high DCR are the following.
— lack of coverage;
— uplink interference;
— lack of 3G neighbors;
— soft HO problem;
— 2G neighbors’ problem
  • bad definition of 3G–2G neighbors;
  • congestion in 2G neighboring cell;
  • lack of 2G neighbors.
— Other: The symptoms’ values included in the model are expected to significantly change if one of the above faults is causing problems. The last cause corresponds to the rest of the faults that could originate a significant DCR, such as the following:
  • terminal problems;
  • hardware and software problems;
  • bad parameter definition.

If new causes are added, additional symptoms and alarms may be needed to achieve a correct diagnosis. The symptoms have been discretized according to thresholds learned using the PBD method (when this part of the work was performed, the EMD functionality was not yet available). For that purpose, a database with more than 500 cases has been used.

The symptoms utilized for the diagnosis model are listed below. It is noted that some of the drop call counters are associated with specific causes.
— DCR for speech calls;
— DCR due to missing neighbor relation;
— DCR during soft HO execution;
— DCR due to loss of synchronization with node B;
— DCR due to a different cause (other);
— establishment fail rate for radio resource control connections;
— establishment fail rate of radio access bearer connections for speech calls;
— soft HO fails;
— average number of radio links per cell;
— relocation failures during 3G–2G HO;
— total number of inter-Radio Access Technology (RAT) HO attempts;
— number of call drops during inter-RAT HO;
— received signal strength indicator;
— failure to add the cell to the active set. This indicator is calculated for the four neighboring cells with the highest number of HOs.

B. Learning Cases

The model construction has required manually performing fault diagnosis of several cells for which the correct diagnosis was not previously available. From the diagnosis data, the Knowledge Builder is able to train the model and calculate thresholds and probabilities. The data used to build the model comprises counters and KPIs recovered from the OMC with a daily resolution. Several trials have been carried out (for adjusting thresholds, determining relations between indicators and causes, etc.) until the model convergence has been achieved with 97% correct diagnoses for a learning set of 77 cases.

C. Fault Diagnosis

The last step consists of applying the diagnosis model to a test set of problematic cells to evaluate its performance. Forty-two faulty cells have been selected, and their corresponding symptoms, namely counters and KPIs, have been introduced into the Execution module. Among the 42 selected cells, six to
nine cases have been selected for each type of cause, depending on how representative it is in the entire sample space, namely in the real network. In 88.1% of the cases, the correct fault has been diagnosed. In the remaining cases, i.e., in five of the cells, the diagnosis has been wrong in the first option but correct in the second one. Another important issue is that among the 37 cells correctly diagnosed, for 24 (i.e., for nearly 65%), the right fault has been diagnosed with a probability higher than 90%. These encouraging results illustrate the effectiveness of the BN approach to automated TS. More detailed results for the diagnosis phase are presented and analyzed below. The distribution of faults for the testing set is listed in Table IV. The diagnosis accuracy per fault is depicted in Table V. Table VI presents a closer look at the five cells (denoted “CELL A” to “CELL E”) for which the diagnosis has been wrong for the first diagnosed cause (and correct for the second).

TABLE IV
DISTRIBUTION OF FAULTS FOR THE TESTING SET THAT COMPRISES 42 CASES

TABLE V
DIAGNOSIS ACCURACY PER FAULT

TABLE VI
FIVE CASES FOR WHICH DIAGNOSIS IS WRONG FOR THE FIRST CAUSE WITH THE HIGHEST PROBABILITY AND CORRECT FOR THE SECOND ONE

The cases of wrong diagnosis in Table VI hint that the main problems are related to the identification of 2G neighbors and Coverage faults. Further effort should be invested in verifying whether certain symptoms should be removed or whether additional ones could be added to improve the performance of the model. Finally, in the present model, only 3G statistics have been incorporated. The addition of certain counters and KPIs from the neighboring 2G cells to the model, such as load- or interference-related symptoms, could be a promising avenue for further improving the diagnosis model.

V. CONCLUSION

This paper has presented research on automated TS for UMTS networks carried out in the framework of the Eureka Celtic Gandalf project. The BN approach has been selected and adapted mainly based on previous work, and again, it has been found particularly effective for the automated diagnosis task. To further enhance automation and performance of the BN model, the automated learning of both KPI thresholds and model probabilities has been thoroughly investigated.

The PBD and EMD methods have been used for threshold setting. It has been shown that the EMD method achieves a discretization of the input symptoms that is closer to optimal than that of the PBD method, since the resulting model shows better performance. In a first phase, the diagnosis model has been studied on a semidynamic UMTS simulator. The simulator allows the generation of a large amount of “cheap” data required to learn the diagnosis model and to relate symptoms to faults. In a second phase, the automated TS model has been adapted to a real UMTS network utilizing counters and KPIs recovered from an OMC. In 88% of the considered cases, the correct fault has been diagnosed, and in the remaining cases, the diagnosis has been wrong in the first option but correct in the second one. These encouraging results illustrate the potential benefit of automated TS for a wireless network operator. Finally, the methodology presented in this paper can be extended to other RANs, including heterogeneous networks and core networks.

ACKNOWLEDGMENT

This work was carried out in the framework of the Eureka Celtic Gandalf project.

REFERENCES
[1] P. Lehtimäki and K. Raivio, “A knowledge-based model for analyzing GSM network performance,” in Proc. Int. Conf. Ind. Eng. Appl. Artif. Intell. Expert Syst., Bari, Italy, Jun. 2005.
[2] J. Laiho, M. Kylväjä, and A. Höglund, “Utilisation of advanced analysis methods in UMTS networks,” in Proc. IEEE Veh. Technol. Conf., Birmingham, AL, May 2002, pp. 726–730.
[3] J. Laiho, K. Raivio, P. Lehtimäki, K. Hätönen, and O. Simula, “Advanced analysis methods for 3G cellular networks,” IEEE Trans. Wireless Commun., vol. 4, no. 3, pp. 930–942, May 2005.
[4] A. J. Hoglund, K. Hatonen, and A. S. Sorvari, “A computer host-based user anomaly detection system using the self-organizing map,” in Proc. IEEE-INNS-ENNS Int. Joint Conf. Neural Netw., Como, Italy, Jul. 2000, vol. 5, pp. 411–416.
[5] P. Lehtimäki and K. Raivio, “A SOM based approach for visualization of GSM network performance data,” in Proc. Int. Symp. Intell. Data Anal., Madrid, Spain, Sep. 2005.
[6] G. Ng and K. Ong, “Using a qualitative probabilistic network to explain diagnostic reasoning in an expert system for chest pain diagnosis,” in Proc. Comput. Cardiol., Sep. 2000, pp. 569–572.
[7] D. Heckerman, J. Breese, and K. Rommelse, “Decision-theoretic troubleshooting,” Commun. ACM, vol. 38, no. 3, pp. 49–57, Mar. 1995.
[8] M. Steinder and A. Sethi, “Probabilistic fault localization in communication systems using belief networks,” IEEE/ACM Trans. Netw., vol. 12, no. 5, pp. 809–822, Oct. 2004.
[9] H. Wietgrefe, “Investigation and practical assessment of alarm correlation methods for the use in GSM access networks,” in Proc. IEEE/IFIP Netw. Operations Manage. Symp., Florence, Italy, Apr. 2002, pp. 391–403.
[10] R. Barco, V. Wille, and L. Díez, “System for automatic diagnosis in cellular networks based on performance indicators,” Eur. Trans. Telecommun., vol. 16, no. 5, pp. 399–409, Oct. 2005.
[11] P. Langley, W. Iba, and K. Thompson, “An analysis of Bayesian classifiers,” in Proc. 10th Nat. Conf. Artif. Intell., 1992, pp. 223–228.
[12] R. Barco, R. Guerrero, G. Hylander, L. Nielsen, M. Partanen, and S. Patel, “Automated troubleshooting of mobile networks using Bayesian networks,” in Proc. IASTED Int. Conf. CSN, Malaga, Spain, Sep. 2002, pp. 105–110.
[13] R. Barco, L. Nielsen, R. Guerrero, G. Hylander, and S. Patel, “Automated troubleshooting of a mobile communication network using Bayesian networks,” in Proc. IEEE Int. Workshop MWCN, Stockholm, Sweden, Sep. 2002, pp. 606–610.
[14] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, CA: Morgan Kaufmann, 1988.
[15] R. C. Holte, “Very simple classification rules perform well on most commonly used datasets,” Mach. Learn., vol. 11, no. 1, pp. 63–90, Apr. 1993.
[16] U. M. Fayyad and K. B. Irani, “Multi-interval discretization of continuous-valued attributes for classification learning,” in Proc. 13th Int. Joint Conf. Artif. Intell., 1993, pp. 1022–1027.
[17] Z. Altman, R. Skehill, R. Barco, L. Moltsen, R. Brennan, A. Samhat, R. Khanafer, H. Dubreil, M. Barry, and B. Solana, “The Celtic Gandalf framework,” in Proc. IEEE MELECON, Benalmádena, Spain, May 2006, pp. 595–598.
[18] A. Samhat, Z. Altman, M. Francisco, and B. Fourestié, “Semi-dynamic simulator for large scale heterogeneous wireless networks,” Int. J. Mob. Netw. Des. Innov., vol. 1, no. 3/4, pp. 269–278, 2006.
[19] S. B. Jamaa, H. Dubreil, Z. Altman, and A. Ortega, “Quality indicator matrices and their contribution to WCDMA network design,” IEEE Trans. Veh. Technol., vol. 54, no. 3, pp. 1114–1121, May 2005.
[20] R. Kohavi, “A study of cross-validation and bootstrap for accuracy estimation and model selection,” in Proc. 14th Int. Joint Conf. Artif. Intell., San Mateo, CA, 1995, pp. 1137–1143.

Rana M. Khanafer received the B.Sc. degree in telecommunication from Saint-Joseph University, Beirut, Lebanon, in 1999, the M.Sc. degree in computer science from the University of Paris 6, Paris, France, in 2001, and the Ph.D. degree in computer science from “Ecole Nationale Supérieure des Télécommunications,” Paris Cedex 13, France, in 2005.
She was an Assistant Professor from 2001 to 2004 and a Research Assistant from 2004 to 2005 with the University of Paris 6. Since 2005, she has been a Research Engineer with France Telecom R&D, Issy les Moulineaux, France, and has participated in different projects on wireless network engineering, performance evaluation, and design of traffic controls for multiservice networks. Her research interests include mobile communications, quality evaluation, and end-to-end QoS in multiservice networks.

Beatriz Solana received the Master Eng. degree in telecommunications engineering (radio communication area) from Madrid Polytechnic University, Madrid, Spain, in 2000.
Since March 2000, she has been with Telefónica I+D, Madrid, where she became an R&D Engineer with the Radio Communication Systems Group, taking part in projects related to planning/optimization tool development for 2G and 3G mobile radio systems. Likewise, she has collaborated in consultancy tasks and support radio planning activities with other companies within the Telefónica Group. She has worked on European projects under the Celtic initiative.

Jordi Triola was born in Figueres, Spain, in 1981. He received the M.Sc. degree in telecommunications engineering from Universitat Politècnica de Catalunya (UPC), Barcelona, Spain, in 2004, and a radio communications specialization degree from École Supérieure d’Électricité (Supélec), Gif-sur-Yvette, France.
After his training period with SONDRA (a joint laboratory between the National University of Singapore and Supélec), he joined France Telecom R&D, Issy les Moulineaux, France, in 2005. He has participated in different projects on wireless network engineering, including radio network simulators and troubleshooting tools. His research interests include radio mobile network optimization, quality evaluation, and automated troubleshooting techniques.

Raquel Barco received the M.Sc. degree in telecommunication engineering and the Ph.D. degree from the University of Málaga, Málaga, Spain, in 1997 and 2007, respectively.
From 1998 to 2000, she was with the European Space Agency, Darmstadt, Germany. From 2000 to 2003, she worked part-time for Nokia Networks. Since the end of 1999, she has been with the Communication Engineering Department, University of Málaga. Her research interests include satellite and mobile communications, mainly focusing on self-regulation of radio networks.

Lars Moltsen received the M.Sc. degree in computer science and mathematics from Aalborg University, Aalborg, Denmark, in 1996.
He is an experienced Entrepreneur, R&D Engineer, and Software Architect, technically specializing in artificial intelligence (Bayesian networks) and mobile technology (3GPP standards, in particular GSM, UMTS, and long-term evolution). From 1996 to 2000, he was with Hugin Expert, Denmark, developing software for Bayesian reasoning. From 2000 to 2003, he was a Research Specialist with Nokia Networks, working on UMTS RRM algorithms, contributing to one patent and a number of conference papers. From 2003 to 2007, he was the Managing Director of his own company, Moltsen Intelligent Software, specializing in wireless communication software and automation of processes. The company grew and was sold in February 2007 to Wirtek, Aalborg, which is a Danish/Romanian telecom software provider, where he is currently the Business Unit Manager of the software services business unit.

Zwi Altman (SM’98) received the B.Sc. and M.Sc. degrees in electrical engineering from Technion – Israel Institute of Technology, Haifa, Israel, in 1986 and 1989, respectively, and the Ph.D. degree in electronics from the Institut National Polytechnique de Toulouse, Toulouse, France, in 1994.
He was a Laureate of the Lavoisier scholarship from the French Foreign Ministry in 1994, and from 1994 to 1996, he was a Post-Doctoral Research Fellow with the University of Illinois at Urbana Champaign. Since 1996, he has been with France Telecom R&D, Issy les Moulineaux, France, and has participated in different projects on wireless network engineering. He was the Project Coordinator of the Eureka Celtic Gandalf project that dealt with the automation of management tasks in heterogeneous wireless networks. His research interests include mobile communications, autonomic networking, and data mining.
Dr. Altman was on the winning team of the 2003 Innovation Prize of France Telecom R&D and was the corecipient of the Wheeler Award for the Best Application Paper of the IEEE Antennas and Propagation Society in 2005.

Pedro Lázaro received the M.Sc. degree in telecommunication engineering from the University of Málaga, Málaga, Spain, in 1997.
From 1997 to 1999, he was with the European Space Agency, Darmstadt, Germany. From 1999 to 2001, he was with the International Telecommunication Union, Geneva, Switzerland. Since 2001, he has been with the Communication Engineering Department, University of Málaga. His research interests include satellite and mobile communications, mainly focusing on self-regulation of radio networks.
