

An Automatic Detection and Diagnosis Framework for Mobile Communication Systems

Péter Szilágyi and Szabolcs Nováczki

Abstract—As the complexity of commercial cellular networks grows, there is an increasing need for automated methods detecting and diagnosing cells not only in complete outage but with degraded performance as well. Root cause analysis of the detected anomalies can be tedious and is currently carried out mostly manually, if at all; in most practical cases, operators simply reset problematic cells. In this paper, a novel integrated detection and diagnosis framework is presented that can identify anomalies and find the most probable root cause of not only severe problems but even smaller degradations as well. Detecting an anomaly is based on monitoring radio measurements and other performance indicators and comparing them to their usual behavior captured by profiles, which are also built automatically without the need for thresholding or manual calibration. Diagnosis is based on reports of previous fault cases, identifying and learning their characteristic impact on different performance indicators. The designed framework has been evaluated with proof-of-concept simulations including artificial faults in an LTE system. Results show the feasibility of the framework for providing the correct root cause of anomalies and possibly ranking the problems by their severity.

Index Terms—Fault management, key performance indicator, network management automation, root cause analysis, self-healing.

Manuscript received June 17, 2011; revised January 17, 2012. The associate editor coordinating the review of this paper and approving it for publication was A. Sethi. The authors are with Nokia Siemens Networks, Budapest, Hungary (e-mail: peter.1.szilagyi@nsn.com). Digital Object Identifier 10.1109/TNSM.2012.031912.110155

I. INTRODUCTION

THE joint and rapid growth of mobile broadband data rates, offered services and user terminal capabilities challenges network operators in several ways. Smartphones with high-end multimedia applications require not only high bandwidth but also low call setup delay and round trip time comparable to fixed line access, demanding capacity boosts and QoS improvements via continuously upgrading existing radio access technologies (RAT) such as High Speed Packet Access (HSPA) and introducing new ones such as Long Term Evolution (LTE) and beyond. Deploying new cells on top of older technology leads to heterogeneous networks with co-existing radio access technologies (GSM, UMTS, (I-)HSPA, LTE) and different layers of the same technology (LTE macro-, pico- and femto-cells as well as relays) often sharing a large part of the same IP-based backhaul. As a consequence, not only is the number of network elements increasing at a fast pace but their interactions are also becoming more complex.

The increasing complexity of mobile networks has an impact on the management of these networks as well. Configuration, optimization and troubleshooting of network elements are not only more difficult but also impose higher costs in both capital investment and operational expenditures (CAPEX and OPEX). The solution to these problems is to a significant degree seen in network management automation, voiced by the operator community representative Next Generation Mobile Network (NGMN) Alliance [1] and adopted by the telecommunications standards body, the 3rd Generation Partnership Project (3GPP), which has started to develop standards covering self-configuration, self-optimization and self-healing for LTE, collectively referred to as Self-Organizing Networks (SON) [2], [3]. As the most mature of the three domains, self-configuration focuses on automatic connectivity and commissioning of network elements, implementing functions like plug-and-play LTE base stations [4], neighbor cell discovery [5], collision- and confusion-free assignment of physical cell identity [6] or relay node auto-connectivity [7]. Most of these functions are contained in 3GPP Release 9 standards, thus they are part of even the first LTE deployments. Self-optimization provides operational time functions like energy saving, load balancing [8] or automatic coverage optimization [9]. Self-healing aims at automated fault discovery, mainly focusing on the detection of problems like cell outage (i.e., a cell being completely unusable) [10], [11] and to a smaller extent also diagnosis [12], [13].

Study on self-healing lags behind the other two areas for obvious reasons. The need for self-configuration is clear as it deals with the earliest phase in the life-cycle of a network element, operates with relatively straightforward algorithms and provides cost savings already in initial deployments. The goal of self-optimization is also well understood, although its implementation is not that mature. Self-healing, however, is the most complex of the three domains, bearing difficulties arising from different functions, vendors, software versions and hardware types co-existing in a single network, each with their own specific fault cases. Tedious manual troubleshooting processes used by operators usually produce no fault records and statistics that could be analyzed later, rendering the automation of troubleshooting more difficult. For these reasons, previous self-healing studies have focused mainly on simple use cases like the detection of complete cell outages; however, root cause analysis of degradations would also bring tremendous benefits for operators, not only by making troubleshooting simpler but by saving a lot of cost as well.

In this paper, we contribute to self-healing by proposing an automated detection and diagnosis framework for mobile networks. In our work, diagnosis is approached as an intuitive process, applying the reasoning and thinking of human diagnosis to mobile networks. Diagnosis for humans and for networks are compared in order to understand and identify the key challenges of the latter.
Based on that, the requirements for a practically usable automated diagnosis framework are provided as well as the detailed design of a baseline solution created along these requirements. The proposed framework automatically generates profiles of performance indicators to capture the faultless behavior of a network and later uses them as a basis of comparison to identify significant deviations from the usual behavior. Diagnosis is made by learning the impacts of different faults on different performance indicators. We evaluated our work in a simulated LTE environment featuring artificial fault cases. Results indicate that the framework could be a feasible alternative for root cause analysis.

The rest of this paper is organized as follows: in Section II, the definitions of detection and diagnosis are given that will be used consistently throughout the paper and we outline the challenges of diagnosis in cellular networks. In Section III, the proposed automated detection and diagnosis framework is presented, including automated profile creation, detection based on these profiles and the diagnosis process. Section IV gives a study on using the framework in a simulated LTE environment for identifying different faults, with detailed simulation results and observations. Section V is dedicated to the management of the framework itself and to identifying synergies with other automated network management functions. In Section VI, related work on fault diagnosis automation is discussed. Finally, in Section VII, we conclude our paper and outline possible directions for future work.

II. OVERVIEW OF DETECTION AND DIAGNOSIS

Detection and diagnosis are not sharply separated in common speech. By the phrase "detecting a problem" one often means two things: first, the confirmation that there is a problem at all and, second, the verification of the nature or type of the problem itself. An example is saying that we detect that the car does not start because the plug is generating no spark. The correct terminology in this example would be to say that an unusual behavior has been detected (i.e., the car does not start) and it is diagnosed that, e.g., the cause is a cracked spark plug that has to be replaced. Detecting that the car will not start does not necessarily mean that there is a problem with the car itself; simply looking at the symptom level with this granularity, it is impossible to tell whether there is a serious problem with the car or it just has to be refueled. Therefore, if an unusual behavior is detected, a more thorough diagnosis has to be conducted in order to find out whether there is actually a problem and what the root cause behind it is. Since the terms "detection" and "diagnosis" often carry an implicit duality and thus appear overloaded in common speech, they have to be precisely defined before being used in an engineering system such as a cellular network.

Detection basically means to identify something unusual in the network. However, in the context of the proposed integrated framework, the role of the detection process is only to provide a common view of possible indicators (symptoms) to the diagnosis to facilitate their correlation; deciding whether there is a fault at all and what it is will be left to the diagnosis. The interface between the detection and the diagnosis part is discussed in Section III-A.

Diagnosis means to investigate the root cause that could have caused the detected symptoms. In the proposed framework, the input of the diagnosis is the output of the detection. The output of the diagnosis might as well be that there is in fact no problem at all. Usually, after the diagnosis of the root cause is done, certain corrective actions have to be performed in order to resolve the problem. Sometimes the root cause is harder to investigate than providing the action without knowing the underlying mechanisms; e.g., several failures can have a common corrective action (like restarting the cell) while the root cause remains unknown to the network operator. It is even possible that the associated action is not a direct correction of the fault but the recommended escalation (e.g., calling the vendor support line). Therefore, using the corrective action instead of the specific root cause is also acceptable. The root cause or the corrective action is what the diagnosis returns; they will be jointly referred to as the target of the diagnosis.

Based on the established terminology, human diagnosis is now taken as an example of a successful diagnosis system and it is compared to diagnosis for mobile networks in order to understand the challenges behind the latter.

A. Human diagnosis

In medical diagnosis cases, the patient usually has already detected the presence of a couple of symptoms (such as fever, headache or sore throat) but there are also symptoms detectable only by a doctor due to the required apparatus (e.g., ECG, X-ray, etc.). These symptoms correspond to Key Performance Indicators (KPI) in a mobile network, which would be examined by the operator's troubleshooter. From the anamnesis and further measurements, the doctor formulates the diagnosis, e.g., concluding that the patient's disease, i.e., the root cause, is flu. This diagnosis step is based on a good understanding of the underlying mechanisms of infections; it is well-known, widely studied and taught to specialists throughout their medical training. The corresponding treatment that solves the problem in the vast majority of cases is also available, e.g., the medicines to take in order to relieve the pain and reduce the fever. So the route taken by the doctor can be summarized as:

DETECTION (SYMPTOMS) → DIAGNOSIS → ACTION

Making a root cause diagnosis first and then having a corresponding treatment at hand is possible due to the following circumstances:
1) Human beings react to diseases the same way since they have the same KPIs to examine and the same root cause has impacts on the same KPIs across humans; thus, the results of studies made on one population are directly applicable to another.
2) Results are not deprecated by time since the evolution of the human genome is very slow compared to the pace of the evolution of technology. The knowledge acquired fifty years ago on the symptoms of flu is still valid today.
3) A huge amount of systematic and researchable data has been collected, stored and published.

Now let us examine the differences between the medical and the mobile network scenarios from the same perspective:
1) Mobile communication systems are versatile, have different KPIs and react differently to the same root cause; the results of studies made on one system are not applicable to another (not even within the same RAT). Also, the processes and practices of the network operator influence this behavior.
2) Results are deprecated quickly as technology evolves.
3) Very little data, if any, is recorded by operators and even that is seldom disclosed to equipment vendors or research institutes. Usually, operators only try a few basic actions to fix a seemingly faulty cell (e.g., resetting it) and then, instead of detailed root cause analysis, they rather call the technical support of the vendor of the network element. Measurement data from live networks do not contain any annotation regarding whether they were collected during a faultless state or while some faults were present.

In summary, automatic diagnosis in a mobile communication system seems to be more challenging than human diagnosis, so it is important to have realistic goals for such systems. Generally, if even the network operator is unable to deliver the correct diagnosis in a reliable manner, an automatic framework cannot be expected to learn from such an inconsistent input and outperform the human troubleshooter. In particular, defining unsuitable KPIs that provide the same symptoms for different faults is not a problem of the framework but of the selection of the KPIs.

B. Performance monitoring in mobile systems

Performance monitoring provides data by which the state of a mobile system can be evaluated. As already mentioned, the performance and status (i.e., faultless, degraded, etc.) of mobile networks are monitored through Key Performance Indicators (KPI). A KPI is a metric ranging from low level network measurements to higher level quality indicators through which a specific aspect of the network can be measured. Some KPIs are user behavior (and consequently time) dependent, such as downlink traffic in a cell; others, such as Channel Quality Indicator (CQI), Random Access Channel (RACH) attempts or call drop ratio, have lower correlation with user behavior as they do not depend on the application run by the user. The latter KPIs are usually better for evaluating the state of the system because the randomness introduced by users is not that apparent. Note that in this paper, the term KPI is used in a wider meaning (i.e., any observable piece of performance information available in the system) as opposed to the KPI definition found in [14].

A data set describing the usual (i.e., faultless) behavior of a KPI is called the profile of the KPI. A profile can be as simple as the average of the measured KPI values during fixed time slots (e.g., hourly aggregated and averaged channel quality reports in LTE) or it may be a statistically more advanced representation of the behavior of the KPI. Profiles of KPIs can be built per cell, per base station (BTS) or even on a wider aggregation layer, e.g., considering the traffic in a whole Radio Network Controller (RNC).

It needs to be identified whether a KPI behaves according to its profile or there is a statistically significant deviation from it. An established way to do this is to define thresholds on KPIs and divide the state of the KPIs into two substates based on whether there is a threshold violation or not. This can be combined with more advanced techniques, e.g., only considering the KPI if it has been violating the threshold continuously for some time or has crossed the threshold a given number of times. However, all these common techniques applied on top of thresholding have the same problem: they quantize the KPIs into a binary space, losing information and thus being unusable for reliable diagnosis. In the next section, we introduce a framework that overcomes this limitation and also utilizes the quantity of deviation from the profile, operating without any thresholds.

III. INTEGRATED DETECTION AND DIAGNOSIS FRAMEWORK

In the previous section, we defined what is meant by the terms detection and diagnosis, introduced KPIs and profiles, and showed not only the motivations but also the challenges of automated diagnosis in cellular networks. Now we turn to presenting our own work in detail, describing an automatic framework that integrates detection and diagnosis.

In our opinion, the principles along which a successful and practically feasible diagnosis framework should be built can be summarized in three requirements. First, the framework should have realistic prerequisites regarding the type and amount of data it requires from operators. Second, the framework has to integrate into the operator's network seamlessly, without being obtrusive or interfering with the operator's existing deployed workflows. Failure to do so would require the operator to perform major modifications of the existing infrastructure and established processes, which could result in the rejection of the framework. Finally, every deployed mobile network and even various parts of the same network are different in several ways: the set of available KPIs, the radio environment (e.g., rural vs. urban areas), equipment vendors, software versions; the framework has to be flexible enough to be able to adapt to these different environments.

The high-level operation of both the current manual troubleshooting process and the proposed automatic framework is shown in Fig. 1. In the current manual process, low-level HW/SW alarms are provided by the platform and additional measurement-based KPIs may be available as well; these can also be subject to thresholding and profiling via Operations Support Systems (OSS) tools that can raise alarms in case thresholds are violated. This approach can result in lots of uncorrelated alarms presented directly to the operator, who has to do the diagnosis manually, usually requiring looking at current KPI values directly. Alarm correlation [15] can help reduce the volume of the alarm flow; however, as the input for alarm correlation depends on the quality of the available alarms in the first place, useful troubleshooting information that could be inferred from KPIs but is not visible from alarms may still be missed.

The framework consists of two building blocks, detection and diagnosis, with automated functionalities. Detection operates on a per-KPI basis and its goal is to maintain a measure of how much the current value of the KPI deviates from its profile. It is also a wrapper around different kinds of KPIs, providing a uniform interface called the KPI level (to be explained shortly) towards the diagnosis, which is responsible for root cause analysis with the help of a database containing previous fault cases.
Instead of sending an alarm on each occasion the detection process finds a suspicious KPI (introducing false positives), only the output of the diagnosis is sent to the operator in case a fault has been diagnosed. In the rest of this section, the detection and diagnosis steps are described separately in detail, starting with the detection process.

Fig. 1. Overview of the troubleshooting process with the current manual troubleshooting workflow and the proposed automatic detection and diagnosis framework.

Fig. 2. Overview of the detection process and its input and output interfaces.

A. Detection

Traditional detection methods rely on predefined thresholds configured separately for each KPI; when a threshold is violated, an alarm is generated. However, there is a fundamental problem with this approach: threshold violations only facilitate binary indications about the state of the KPI. With a threshold set to x_0 for a certain KPI, a KPI value of x_0 − ε is still considered perfect whereas x_0 + ε is already the worst possible (ε → 0). The problem still exists if the threshold is not an absolute one but calculated from observing the statistical distribution of the KPI during a faultless period of the system.

Instead of the current process, we propose a novel unified KPI interface that returns a real number from the continuous range [0, 1] called the level of the KPI, denoted by ϕ, describing how well the current behavior of the KPI corresponds to the profile. Level 0 means perfect conformance to the profile and the level asymptotically approaches 1 as the deviation from the profile increases. As shown in Fig. 2, the level of each KPI is fed directly (without any thresholding) into the diagnosis process, which uses these levels in order to determine the root cause of the fault (if any) for the cell.

Since the current behavior of a KPI is to be compared to its profile, first the profile of the KPI has to be obtained as detailed in the following. Let the individual samples (measurements) of KPI K be x_i. Each n consecutive samples without overlapping are averaged to obtain samples a_i of the variable K̄ as follows:

$$\underbrace{x_1, \ldots, x_n}_{a_1},\ \underbrace{x_{n+1}, \ldots, x_{2n}}_{a_2},\ \ldots,\ \underbrace{x_{(k-1)n+1}, \ldots, x_{kn}}_{a_k} \quad (1)$$

The profile is built from the k samples of K̄, a_1, …, a_k, by calculating the sample mean μ_0(K̄) and sample variance σ_0²(K̄), from which the sample deviation is σ_0(K̄):

$$\mu_0(\bar{K}) = \frac{1}{k}\sum_{i=1}^{k} a_i; \qquad \sigma_0^2(\bar{K}) = \frac{1}{k}\sum_{i=1}^{k}\bigl(a_i - \mu_0(\bar{K})\bigr)^2 \quad (2)$$

According to the central limit theorem, if the x_i samples are independent and identically distributed (i.i.d.), the resulting a_i samples are normally distributed as n → ∞. If the x_i samples originate from different users or from the same user but are based on independent events (e.g., handovers or independent calls), the i.i.d. assumption is practically satisfied, thus K̄ converges in distribution to N(μ_0(K̄), σ_0²(K̄)). The parameters n and k are KPI dependent; as a rule of thumb, n > 40 is needed for independent x_i samples to make sure K̄ is normally distributed and k > 20 is needed to obtain a statistically significant population of a_i samples. In case there is serial correlation between the x_i samples (e.g., consecutive channel quality reports of the same user), n should be much higher. Fig. 3 shows the distribution of average CQI samples obtained from a real HSPA network and the corresponding Quantile-Quantile (Q-Q) plot that demonstrates the normality of the averaged samples. Similar graphs are given for LTE CQI data obtained from the simulator used for evaluation as presented later in this paper. Common normality tests such as Anderson–Darling, Shapiro–Wilk, etc. all accept the normality of both the real and simulated average CQI samples at level α = 0.05.

During the operational phase, i.e., after the profiling of K has been completed, the current average sample mean of K, denoted by X^(K), is continuously composed of the latest n samples of K with a sliding detection window as follows:

$$\ldots, x_{t-n},\ \underbrace{x_{t-n+1}, \ldots, x_t}_{X^{(K)}} \quad (3)$$

where x_t is the latest sample of K. The variable X^(K) can be transformed into a standard normal variable Z^(K) as follows:

$$Z^{(K)} = \bigl(X^{(K)} - \mu_0(\bar{K})\bigr) / \sigma_0(\bar{K}) \quad (4)$$
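To make the profiling and standardization steps (1)–(4) concrete, the following is a minimal C++ sketch: it builds a profile (μ_0, σ_0) from non-overlapping window averages and turns the latest n samples into the standardized deviation Z^(K). The class and function names are illustrative and not part of the framework implementation described in this paper.

```cpp
#include <cmath>
#include <cstddef>
#include <deque>
#include <numeric>
#include <stdexcept>
#include <vector>

// Profiling step, eqs. (1)-(2): average each non-overlapping window of n raw
// samples into one a_i, then compute the sample mean and deviation of the a_i.
struct Profile {
    double mu0;     // mu_0(K-bar)
    double sigma0;  // sigma_0(K-bar)
};

Profile buildProfile(const std::vector<double>& x, std::size_t n) {
    std::vector<double> a;  // window averages a_1 .. a_k
    for (std::size_t i = 0; i + n <= x.size(); i += n) {
        double sum = std::accumulate(x.begin() + i, x.begin() + i + n, 0.0);
        a.push_back(sum / n);
    }
    if (a.size() < 2) throw std::runtime_error("not enough samples to profile");
    double mu = std::accumulate(a.begin(), a.end(), 0.0) / a.size();
    double var = 0.0;
    for (double ai : a) var += (ai - mu) * (ai - mu);
    var /= a.size();  // eq. (2) uses 1/k
    return {mu, std::sqrt(var)};
}

// Sliding detection window, eqs. (3)-(4): keep the latest n samples, average
// them into X^(K) and standardize against the profile.
class Detector {
public:
    Detector(Profile p, std::size_t n) : profile_(p), n_(n) {}
    void addSample(double xt) {
        window_.push_back(xt);
        if (window_.size() > n_) window_.pop_front();
    }
    double z() const {  // Z^(K), eq. (4); call only after samples were added
        double X = std::accumulate(window_.begin(), window_.end(), 0.0)
                   / window_.size();
        return (X - profile_.mu0) / profile_.sigma0;
    }
private:
    Profile profile_;
    std::size_t n_;
    std::deque<double> window_;
};
```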

Fig. 3. (a) CQI histogram from a real HSPA cell showing the density and fitted theoretical normal distribution of average a_i samples obtained as hourly averages throughout 14 days (336 samples); (b) corresponding Q-Q plot to demonstrate normality; (c) similar CQI histogram of 100 samples from the LTE simulator used in this paper for evaluation; (d) corresponding Q-Q plot.

Fig. 4. Asymmetric level functions for KPI K with C = −4. The graph of the symmetric level function ϕ±(K) is composed of the graph of ϕ−(K) for Z^(K) < 0 and ϕ+(K) for Z^(K) ≥ 0.

Based on the above defined Z^(K) and Φ(·), the cumulative distribution function of N(0, 1), the level of K, generally denoted by ϕ(K), is defined in three cases as follows. If K denotes a quantity that should not increase (i.e., increasing is an anomaly) compared to the profiled value but decreasing is acceptable (e.g., call drop ratio), the level function is ϕ+(K) as given below. If K denotes a quantity where decreasing is a problem but increasing is not (e.g., handover success rate), the level function is ϕ−(K). Finally, if K should sustain a value same as or close to the profile (i.e., either more or less is undesirable) then the proper implementation is ϕ±(K). The first two level functions are called asymmetric level functions, the latter is called the symmetric level function. The level functions are shown in Fig. 4 and defined as follows:

$$\varphi^{+}(K) = \Phi\bigl(C + Z^{(K)}\bigr) \quad (5)$$

$$\varphi^{-}(K) = \Phi\bigl(C - Z^{(K)}\bigr) \quad (6)$$

$$\varphi^{\pm}(K) = \varphi^{+}(K) + \varphi^{-}(K) \quad (7)$$

where C is a constant that controls the horizontal shift of the Φ function; intuitively, −C gives the number of standard deviations by which the current average sample mean of K should deviate from the profile to result in level 0.5. The value is implementation dependent, with C = −4 recommended by the authors as this transforms the sensitive part of the level function to 2–6 standard deviations and gives a negligible ϕ±(K) ∼ 10⁻⁴ at Z^(K) = 0. Note that C is not a KPI specific threshold; it is rather a uniform property of the level function.

If there is no problem in the cell, X^(K) has the same distribution as K̄, thus the level of K would be close to 0 since X^(K) is close to μ_0(K̄). However, if there is a fault or an unusual circumstance that affects the KPI in question in a bad way, the level function moves towards 1. The advantage of the level function compared to raising an alarm at some threshold is that the level function gives a continuous feedback that is sensitive to the amount of deviation of the recent KPI samples from the profile. Also, the level function provides a uniform approach to any KPI without the need for the operator to fine-tune KPI specific parameters and algorithms.
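A possible C++ rendering of the level functions (5)–(7) is sketched below, assuming the standard normal CDF Φ is computed via erfc and using the recommended C = −4; the function names are illustrative.

```cpp
#include <cmath>

// Standard normal CDF: Phi(z) = 0.5 * erfc(-z / sqrt(2)).
double Phi(double z) { return 0.5 * std::erfc(-z / std::sqrt(2.0)); }

constexpr double C = -4.0;  // horizontal shift recommended by the authors

// Eq. (5): anomalous if the KPI increases above the profile (e.g., call drops).
double levelPlus(double z)  { return Phi(C + z); }

// Eq. (6): anomalous if the KPI decreases below the profile (e.g., HO success).
double levelMinus(double z) { return Phi(C - z); }

// Eq. (7): anomalous in either direction (symmetric level function).
double levelSym(double z)   { return levelPlus(z) + levelMinus(z); }
```

With C = −4, a deviation of Z^(K) = 4 standard deviations yields ϕ+ ≈ 0.5, while ϕ± stays around 10⁻⁴ at Z^(K) = 0, matching the description above.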
KPIs are different with regard to how well their usual behavior can be captured by a profile. The best case is if a KPI is a system invariant whose change can only be due to some anomaly. These KPIs can have single profiles taken during any faultless period of time and used afterwards for detection. This kind of KPI, however, is quite rare; most KPIs depend on the behavior of users (e.g., throughput depends on the application a subscriber chooses to use and the content downloaded or uploaded) and, therefore, more or less follow complex patterns. Since most aspects of a user's behavior are time dependent, these user behavior dependent KPIs are also time dependent. Constructing a single profile for these KPIs would yield a large σ_0(K̄) that may not be sensitive enough later in the detection phase to respond to a fault with an increased KPI level. Therefore, multiple profiles can be constructed from samples taken at different times (or, alternatively, not depending on time but on the application subscribers use). Computing the KPI level is then based on the appropriate profile (same time of day, same applications). Additionally, interpolation between the profiles is possible: if there are two profiled means μ_0(K̄, t_1) and μ_0(K̄, t_2) with t_1 and t_2 as their respective central times (i.e., the middle of the time interval during which the samples built into the profile were taken), the interpolated profile mean used for detection at t, where t_1 < t < t_2, can be

$$\mu_0(\bar{K}, t) = \frac{(t - t_1)\,\mu_0(\bar{K}, t_2) + (t_2 - t)\,\mu_0(\bar{K}, t_1)}{t_2 - t_1} \quad (8)$$

Time values t_1, t_2 and t denote times of the day, therefore they wrap over at 24h 00m each day.
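A small sketch of the interpolation in (8) follows, with a naive handling of the midnight wrap-around mentioned above; the wrap-around handling is our own illustrative choice and is not spelled out in the paper.

```cpp
// Linear interpolation of two profiled means, eq. (8); t, t1, t2 are times of
// day in hours. If the [t1, t2] interval crosses midnight, shift the times so
// the interpolation is done on a contiguous axis (illustrative assumption).
double interpolatedMu0(double mu1, double t1, double mu2, double t2, double t) {
    if (t2 < t1) {                      // interval wraps over 24h 00m
        t2 += 24.0;
        if (t < t1) t += 24.0;
    }
    return ((t - t1) * mu2 + (t2 - t) * mu1) / (t2 - t1);
}
```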
The integration of existing low level binary indicators (e.g., HW/SW alarms) into the KPI level interface can also be done in a straightforward way. Each alarm can be transformed into a special KPI with no conventional profile; instead, the detection should produce level 0 if the alarm or other indicator is off and 1 if it is on (formally, the detection window size is n = 1, i.e., the latest sample completely determines the KPI level).

Note that the presented way of computing the KPI level is a plugin in the sense that it is not the only possible way to map original KPI values to [0, 1]. The strengths of the proposed method are its simplicity, flexibility and low computational cost.
Other, statistically more advanced techniques could also be employed to compute a KPI level as long as they satisfy the KPI level interface, i.e., maintain the range of the mapping. Further study in this direction would be interesting but it is out of the scope of this particular paper.

B. Diagnosis

The diagnosis process operates directly on the KPI levels supplied by the detection process; there is no thresholding that classifies KPIs into two or more states, sparing the operator from having to calibrate each threshold manually. Now, the details of the diagnosis process are introduced; the symbols and notations used in the formal discussion are collected in Table I for reference.

TABLE I
FORMAL DEFINITIONS OF TYPES AND VARIABLES USED BY THE DIAGNOSIS PROCESS

K          set of all KPIs
K          a particular KPI, K ∈ K
T          set of diagnosis targets (actions and root causes)
|K|, |T|   cardinality of set K and T, respectively
R̂          list of all reports (the same report can be present more than once); see (9) for the formal definition of reports
κ(R)       KPI set of report R
τ(R)       target of report R
R̂_T        list containing all reports associated with a given target T, i.e., R̂_T = {R : τ(R) = T, ∀R ∈ R̂}
K̂_T        list containing the KPI subsets from all reports associated with a given target T, i.e., K̂_T = {κ(R) : ∀R ∈ R̂_T}
ϕ(K)       current measured level of KPI K
ℓ^T(K)     likelihood of KPI K, given target T

1) Diagnosis target: So far it has been understood that the target of the diagnosis means either a root cause or a corrective action. However, not all targets are required to identify an actual fault case or a real corrective action: it is possible that, based on the KPI levels, one would suspect a failure but it turns out that there is no fault, only an unusual but perfectly valid distribution of the KPI levels (e.g., the majority of the users are moving towards the cell edge, which results in worse channel conditions being reported although there is nothing wrong with the antenna equipment itself). These corner cases can also be made into specific targets by the operator to let the diagnosis process explicitly recognize those circumstances as non-faulty, thus increasing the accuracy of the diagnosis. Besides the targets specified by the operator, there is a special built-in target used by the diagnosis process called the null target and denoted by T_0, which has the semantics of "no problem" (if interpreted as a root cause) or "nothing to do" (if interpreted as a corrective action).

2) Expert knowledge: Any data requested from the operator in order for an automatic framework to work is commonly referred to as expert knowledge. In our case, expert knowledge is deliberately kept simple: it contains diagnosis targets associated with the KPIs that deviate from their usual behavior (i.e., profile) when the corresponding target root cause occurs or the KPIs that return to their usual behavior after performing the target corrective action. Each association is represented by a structure that is called a report. Reports can be obtained based on a priori expert knowledge on the target–KPI relation (not mandatory) or can be added based on analyzed previous fault cases from the fault history (if it exists) and continuously as faults happen during network operation. When the same fault with identical symptoms happens multiple times, the same report has to be added each time in order to model the relative occurrences of the target associated with the report so that the diagnosis can take into account the frequency of targets as well. Formally, a report R is a pair that consists of the KPI subset of the report, denoted by the operator κ(R), and the diagnosis target of the report, denoted by the operator τ(R):

$$R = (\kappa(R), \tau(R)) \quad (9)$$

where κ(R) ⊆ K and τ(R) ∈ T.

Table II shows a generic example of KPI subsets associated with a given target T. Each row corresponds to a different report. Since the target of the reports is the same, the target is not indicated in the table. The number of columns equals the number of KPIs, |K|. The number of reports associated with target T (i.e., the number of rows) is |R̂_T| = |K̂_T|.

TABLE II
KPI SUBSETS ASSOCIATED WITH A DIAGNOSIS TARGET T. EACH ROW REPRESENTS A DIFFERENT KPI SUBSET K_i ∈ K̂_T WITH FULL BULLETS MARKING EACH KPI PRESENT IN THE CORRESPONDING SUBSET.

KPI subset   K1   K2   K3   ···   K|K|
K1           •    •    ◦    ···   •
K2           •    ◦    •    ···   ◦
K3           ◦    ◦    •    ···   •
...          ...  ...  ...  ...   ...
K|K̂_T|       ◦    •    •    ···   •

A special report called the null report, denoted by R_0, is a built-in report that is always present. The null report is defined as follows:

$$R_0:\ \kappa(R_0) = \emptyset,\quad \tau(R_0) = T_0 \quad (10)$$

Apart from the null report, no other report R should have κ(R) = ∅. The null report is used to identify the state of the system when there is no problem at all, obviously associated with the empty KPI subset.
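The expert knowledge above can be held in very simple data structures; the sketch below shows one possible C++ representation of reports (9) and the built-in null report (10). Type and member names are ours, not from the paper.

```cpp
#include <set>
#include <string>
#include <vector>

using Kpi = std::string;     // a particular KPI K
using Target = std::string;  // a diagnosis target T (root cause or action)

// A report R = (kappa(R), tau(R)), eq. (9): the subset of KPIs that showed
// high levels together with the diagnosed target.
struct Report {
    std::set<Kpi> kpiSubset;  // kappa(R), subset of K
    Target target;            // tau(R), element of T
};

// Expert knowledge: the list of all reports R-hat, possibly containing the
// same report several times to reflect how often each fault occurred.
struct ExpertKnowledge {
    std::vector<Report> reports;

    ExpertKnowledge() {
        // Built-in null report R0, eq. (10): empty KPI subset, null target T0.
        reports.push_back(Report{{}, "T0"});
    }

    void addReport(const Report& r) { reports.push_back(r); }
};
```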
3) Scoring system: The diagnosis process monitors the KPI levels and, based on the expert knowledge that describes which KPIs are characteristic of which targets, lists the targets in descending order of relevance. At the top of the list is the winning target, the most relevant one at the time the diagnosis was run. The process is based on a scoring system that gives a score to each diagnosis target based on how well the reports associated with the target collectively match the KPIs currently showing significant deviation from their respective profiles, i.e., having a high KPI level. The more exact the match, the higher the score given to that target. Different reports for the same target do not necessarily have identical associated KPI subsets. Therefore, the target specific likelihood value is introduced for KPIs. For each target T and KPI K, it can be derived from the expert knowledge how likely it is for KPI K to have a high level (i.e., to be observed with high deviation from the profile) given that T is present in the system.
This conditional probability, denoted by ℓ^T(K), can be calculated as the relative frequency of occurrence of K under the condition that T is present:

$$\ell^{T}(K) = \frac{1}{|\hat{R}_T|} \sum_{\forall K_i \in \hat{K}_T} \mathbf{1}_{K_i}(K) \quad (11)$$

where 1_{K_i} is an indicator function:

$$\mathbf{1}_{K_i}(K) = \begin{cases} 1 & \text{if } K \in K_i \\ 0 & \text{if } K \notin K_i \end{cases} \quad (12)$$

Note that the likelihood function is target specific, since the relative frequency of occurrence of a given KPI can be different for each target.

The likelihood value is used to determine how consistently a KPI is associated with any given target. If KPI K is always present in (or missing from) the KPI subsets of the reports for target T, the association of K with T is fully consistent. It means that, upon observing the KPI level of K, it can be decided with high confidence whether the KPI behaves as if T was present or not: if K is included in all KPI subsets reported for T, the diagnosis process recognizes a high KPI level of K as a confirmation of the presence of T, and does it with high confidence, since this has happened in all reported cases so far when T was diagnosed. For measuring how consistently a KPI appears with a given target, the consistency value is introduced, which also takes its value from the range [0, 1]. With the previous example, the consistency value of the KPI would be 1, meaning the highest possible consistency. The other extremity would be if a KPI was present in exactly half of the reports associated with a given target; that corresponds to a consistency value of 0, which basically means that one can neither reason for nor against the presence of that target regardless of the current level of the KPI. The consistency value easily follows from the likelihood value through the following consistency function cf(·), shown in Fig. 5:

$$\mathrm{cf}: [0, 1] \to [0, 1]; \qquad \mathrm{cf}(x) = 2\left|x - \frac{1}{2}\right| \quad (13)$$

Note that the consistency function itself, contrary to the likelihood function, is not target dependent.

Fig. 5. The consistency function returns 0 at 0.5 (total inconsistency) and 1 at the edges of its definition domain (full consistency). In between, it is piecewise linear.

Fig. 6. Overview of the diagnosis process and its relation to the detection process; |K| and |T| denote the cardinality of set K and T, respectively.

The score S(T) given to target T is simply the sum of the so-called per-KPI scores, as visualized in Fig. 6. A per-KPI score, denoted by s(K), is based on the distance of the current level of K (at the time of running the diagnosis) from either 0 or 1, depending on whether the presence or the absence of the KPI is the more likely for the target:

$$S(T) = \sum_{\forall K \in \mathbf{K}} s(K) \quad (14)$$

$$s(K) = \mathrm{cf}\bigl(\ell^{T}(K)\bigr) \cdot d^{T}(K) \quad (15)$$

where d^T(K) ∈ [0, 1] is a continuous Hamming distance defined as

$$d^{T}(K) = \begin{cases} \varphi(K) & \text{if } \ell^{T}(K) \geq 1/2 \\ 1 - \varphi(K) & \text{if } \ell^{T}(K) < 1/2 \end{cases} \quad (16)$$

If ℓ^T(K) is closer to 1 than to 0 (i.e., the KPI is included in the KPI subset of at least half of the reports for target T), the expected level of the KPI in order to be a good match is high, thus a higher score should be given as the current level approaches 1. On the other hand, if ℓ^T(K) is closer to 0 than to 1 (i.e., the KPI is not included in the KPI subset of most of the reports for target T), the expected level of the KPI in order to provide a good match is low, thus a higher score should be given as the current level approaches 0.
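Putting (11)–(16) together, a compact and purely illustrative C++ implementation of the scoring could look as follows, reusing the Report and ExpertKnowledge types sketched earlier; the map of current KPI levels is assumed to come from the detection process.

```cpp
#include <cmath>
#include <cstddef>
#include <map>

// (Reuses Kpi, Target, Report, ExpertKnowledge from the earlier sketch.)

// Eq. (11): likelihood of KPI K given target T, i.e., the fraction of the
// reports for T whose KPI subset contains K.
double likelihood(const ExpertKnowledge& ek, const Target& T, const Kpi& K) {
    std::size_t total = 0, hits = 0;
    for (const Report& r : ek.reports) {
        if (r.target != T) continue;
        ++total;
        if (r.kpiSubset.count(K)) ++hits;  // indicator 1_{K_i}(K), eq. (12)
    }
    return total ? static_cast<double>(hits) / total : 0.0;
}

// Eq. (13): consistency function cf(x) = 2|x - 1/2|.
double cf(double x) { return 2.0 * std::fabs(x - 0.5); }

// Eq. (16): continuous Hamming distance between the current level of K and
// the level expected under target T.
double hamming(double likelihoodTK, double level) {
    return likelihoodTK >= 0.5 ? level : 1.0 - level;
}

// Eqs. (14)-(15): the score of target T is the sum of the per-KPI scores
// s(K) = cf(l^T(K)) * d^T(K) over all KPIs.
double score(const ExpertKnowledge& ek, const Target& T,
             const std::map<Kpi, double>& levels /* phi(K) per KPI */) {
    double S = 0.0;
    for (const auto& [K, phi] : levels) {
        double l = likelihood(ek, T, K);
        S += cf(l) * hamming(l, phi);
    }
    return S;
}
```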
The ordering of the targets is done by their respective score. If two targets have the same score, the one having the higher number of overall reports is considered more relevant as that is a more frequent fault. Let S_1 = S(T_1) and S_2 = S(T_2). Formally, the binary operator T_1 > T_2 is defined on T as follows:

$$T_1 > T_2 \iff S_1 > S_2 \lor \bigl(S_1 = S_2 \land |\hat{R}_{T_1}| > |\hat{R}_{T_2}|\bigr) \quad (17)$$

Note that there is no separate mechanism in the diagnosis process that decides whether there is a fault or not in the first place; instead, the null target is put into fair competition with the other targets and may come out as the most relevant one by the same measure as a faulty target would.

An optional normalization of the scores can be performed by using the following normalized score function:

$$S'(T) = S(T_0)^{-1} \cdot S(T) \quad (18)$$

The normalized score of the null target is by definition 1.0.
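The ranking in (17) and the normalization in (18) then reduce to a comparator and a single division; again an illustrative sketch that builds on the functions above.

```cpp
#include <cstddef>

// Number of reports |R-hat_T| stored for a target, used as tie-breaker in (17).
std::size_t reportCount(const ExpertKnowledge& ek, const Target& T) {
    std::size_t c = 0;
    for (const Report& r : ek.reports) if (r.target == T) ++c;
    return c;
}

// Eq. (17): T1 outranks T2 if its score is higher, or the scores are equal
// and T1 has more reports (i.e., it is a more frequent fault).
bool outranks(const ExpertKnowledge& ek,
              const Target& T1, double S1,
              const Target& T2, double S2) {
    return S1 > S2 || (S1 == S2 && reportCount(ek, T1) > reportCount(ek, T2));
}

// Eq. (18): normalized score S'(T) = S(T) / S(T0); the null target gets 1.0.
double normalizedScore(double scoreT, double scoreNull) {
    return scoreT / scoreNull;
}
```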
If the diagnosis is run on the cell layer (with per-cell profiles of KPIs), normalized scores are useful to rank the cells based on the severity of the faults. A target with a higher normalized score (i.e., more deviation from the normal behavior) is more urgent to fix. The framework can also be used to diagnose on other layers, e.g., on the base station or even the RNC layer, by using KPIs, profiles and expert knowledge on those layers. With normalized scores, the severity of problems on the same layer can be ranked.
TABLE III
PROFILES OF THE THREE KPIS; n IS THE NUMBER OF SAMPLES IN A WINDOW AND k IS THE NUMBER OF AVERAGED SAMPLES, ACCORDING TO (1); t IS THE TIME IT TOOK TO CREATE THE PROFILE.

K          n      k    t        μ0(K̄)   σ0(K̄)
CQI        10000  100  0h 59m   7.651    0.302
CD         100    100  21h 11m  0.016    0.013
HO-TADV    50     50   9h 59m   2.914    0.048

Due to the mechanism of likelihood and consistency values, the scoring system performs better in identifying the correct root cause or corrective action if the KPI subsets reported for the same target are similar. If this is not the case (i.e., contradictory reports are put in the database), it can result in close-to-zero consistency values for lots of KPIs and the diagnosis system would always give a low score to this target. Therefore, when adding a new report R for target T, it can be checked automatically how well the KPIs in the new report match those of the reports already existing for T. The check can be made, e.g., by constructing a virtual KPI level ϕ′ for each K ∈ K according to whether K ∈ κ(R) (ϕ′(K) = 1) or K ∉ κ(R) (ϕ′(K) = 0) and running the diagnosis on these virtual KPI levels as if they were coming from the detection process. If the winning target is T, the new report can be accepted as being consistent with the knowledge collected so far. If not, it can be a sign of an underlying fault case different from what triggered the previous reports for T, even if T is the same corrective action that the operator would perform in all cases. If that happens, the system can automatically advise the operator to introduce a new target or use a different existing one for R instead of T.

In order to have access to low-level fine granularity measurements, the detection (providing KPI levels) should preferably be located at the BTS. The diagnosis may also be run there on a per-cell level, in which case the KPI level interface is intra-node. In case some KPIs are retrieved from the OSS or the diagnosis is run in the OSS, care should be taken not to overload the OSS and the transport network with KPI or KPI level queries; for example, the diagnosis of different cells should be run in randomized time slots to balance the traffic and load on the OSS.

IV. EVALUATION

In the previous section, we presented the theoretical considerations behind the detection and diagnosis framework. In this section, we evaluate the feasibility of the work with a simulator capable of producing per-cell KPIs and per-cell profiles in an LTE radio access network.

A. Overview

The detection and diagnosis framework presented in the previous section was tested using an LTE simulator we developed for this purpose. The simulator was implemented fully in C++ on the Linux platform and its main goal was to generate air interface related KPIs and also to be able to model artificial fault cases. In order to fulfill these requirements, the simulator features a detailed LTE air interface implementation including the Okumura-Hata path loss model, per-BTS shadowing and fast fading (COST 207 channel model, Typical Urban, alternative 6-tap channel) assuming 20 dB penetration loss. The simulation scenario followed the reference system simulation scenario specified by 3GPP [16] with 19 LTE base stations placed at 500 m inter-site distance in a triple-ring layout (with 1, 6 and 12 base stations in each ring), each BTS powering 3-sectored cells with proper horizontal antenna characteristics, operating in SISO mode at 20 MHz bandwidth. There were 500 users in the system, randomly distributed, moving at 3 kmph according to the random way-point mobility model. The 12 outer base stations in the layout were used for interference generation only but no users were allowed to actually connect to any of them; if a user was about to do so (i.e., making a handover to one of those cells), the user was relocated to a random position instead. Handover was implemented between the cells accessible to users according to the standard A3 trigger mechanism with 3 dB handover offset, 0.5 dB hysteresis and 320 ms time to trigger interval.
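As a concrete illustration of the handover modeling, the sketch below checks a simplified A3-style entering condition (neighbor better than serving by the offset minus hysteresis) and requires it to hold for the whole time-to-trigger interval; this is our own simplified reading of the A3 event, not code from the simulator.

```cpp
// Simplified A3-style handover trigger: the neighbor cell must be better than
// the serving cell by (offset - hysteresis) dB continuously for the whole
// time-to-trigger period. Parameter values follow the simulation setup above.
struct A3Trigger {
    double offsetDb = 3.0;         // handover offset
    double hysteresisDb = 0.5;     // hysteresis
    double timeToTriggerMs = 320;  // time to trigger
    double betterSinceMs = -1;     // how long the condition has held (<0: not)

    // Call once per measurement period; returns true when the handover fires.
    bool update(double neighborDbm, double servingDbm, double nowMs) {
        bool entering = neighborDbm > servingDbm + offsetDb - hysteresisDb;
        if (!entering) { betterSinceMs = -1; return false; }
        if (betterSinceMs < 0) betterSinceMs = nowMs;
        return nowMs - betterSinceMs >= timeToTriggerMs;
    }
};
```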
The following three KPIs were modeled in the simulator: Channel Quality Indicator (CQI), Call Drop (CD) and Handover Timing Advance (HO-TADV). CQI was reported by active users, i.e., during test calls, every ms according to the LTE standard. Calls were modeled as Poisson processes (exponentially distributed call length and waiting time) with the drop probability modeled as a function of the average channel quality on a per-second basis. If the average SINR was below the threshold of the lowest usable CQI = 1 (≈ −7 dB) for one second, the call was dropped. Otherwise, the probability of dropping the call exponentially decreased towards higher SINR values so that at the SINR thresholds of CQI = 1 and CQI = 7 (≈ 4.7 dB) the drop probabilities were 0.05 and 10⁻⁵, respectively.
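The per-second drop decision described above can be reproduced with an exponential interpolation between the two anchor probabilities; the exact functional form below is our assumption, chosen so that p = 0.05 at the CQI = 1 SINR threshold (≈ −7 dB) and p = 10⁻⁵ at the CQI = 7 threshold (≈ 4.7 dB), with a hard drop below −7 dB.

```cpp
#include <cmath>
#include <random>

// Per-second call drop model as described in the text: hard drop below the
// CQI=1 SINR threshold, otherwise an exponentially decreasing drop probability
// anchored at p(-7 dB) = 0.05 and p(4.7 dB) = 1e-5 (interpolation form assumed).
bool callDropped(double avgSinrDb, std::mt19937& rng) {
    const double sinrCqi1 = -7.0, sinrCqi7 = 4.7;
    const double pCqi1 = 0.05, pCqi7 = 1e-5;
    if (avgSinrDb < sinrCqi1) return true;  // below lowest usable CQI
    double slope = std::log(pCqi7 / pCqi1) / (sinrCqi7 - sinrCqi1);
    double p = pCqi1 * std::exp(slope * (avgSinrDb - sinrCqi1));
    std::uniform_real_distribution<double> u(0.0, 1.0);
    return u(rng) < p;
}
```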
Finally, Handover Timing Advance is a KPI showing the LTE Timing Advance of a user according to its source cell before inter-site handovers. This is a KPI invented by us, not standardized or used by any prior art as is. Intuitively, it gives a measure of the cell radius: the higher the mean of the HO-TADV values measured in a cell, the larger the corresponding coverage area. Note that a handover only provides a sample of HO-TADV in the source cell; an incoming handover deliberately delivers no sample in the target cell. The reason is that, due to the handover offset and hysteresis, the mean of the timing advance would be different in a cell for incoming and outgoing handovers, and mixing the two would result in a profile with higher deviation. Choosing the outgoing handovers over the incoming ones delivers more samples, thus giving a faster reaction after a degradation when users are handing over away from the affected cell.

Although we modeled only three KPIs, in real mobile networks there could easily be as many as 50 KPIs or more, including radio and transport measurements, alarms, OAM states, etc., some of which may be available only in a specific network deployment or at a specific operator. This is, however, not a problem for the diagnosis framework; it is part of its flexible design that the KPI portfolio can be arbitrary; there are no mandatory KPIs required for the diagnosis to work.
TABLE IV
REPORTS FORMING THE EXPERT KNOWLEDGE OF THE SIMULATION. THE LAST ROW CONTAINS THE NULL REPORT WITH ITS EMPTY KPI SUBSET.

KPI subset          target   number of reports
{ CQI }             UE       3
{ HO-TADV, CQI }    TX       4
{ CD }              SH       4
{ CQI, CD }         SH       1
∅                   T0       1

TABLE V
TARGET-SPECIFIC LIKELIHOOD VALUES, AS PRE-COMPUTED BY THE DIAGNOSIS PROCESS AFTER THE REPORTS HAVE BEEN COLLECTED.

target   CQI    CD     HO-TADV
UE       1.0    0.0    0.0
TX       1.0    0.0    1.0
SH       0.2    1.0    0.0
T0       0.0    0.0    0.0

The level functions for the KPIs also have to be defined. Asymmetric level functions were used for each of the three KPIs. Level function (6) was used for both CQI and HO-TADV, where a decreasing tendency of KPI values is considered an anomaly. On the other hand, level function (5) was used for CD, which should be sensitive to increased KPI samples.

As the first step in preparing the simulation, a profile was created (as explained in Section III-A) for each KPI in each cell separately by collecting and processing the number of samples listed in Table III. Note that although CQI was reported by active users every ms, only every 100th of the reports from the same user was taken as a sample for the CQI KPI in a particular cell in order to decrease the serial correlation between consecutive CQI samples. Also, the size of the window for CQI was set relatively high compared to the other two KPIs, which are based on truly independent events. The simulation time needed to create the profile as well as the profile mean and deviation are also given in the table. Note that in the simulator it was enough to create one profile for each KPI due to the stable radio environment and the uniform user behavior; in a real deployment, it can be necessary to create multiple profiles (and possibly interpolate between them, as explained in Section III-A).

The diagnosis process has been tested in four different scenarios, constructed as follows. The first scenario, referred to as "UE", modeled an unusual (but valid) user behavior by forcing the users to the cell edge, not allowing them to approach the serving base station closer than 200 m. This is obviously a justified user behavior and not a consequence of a fault of any equipment or network element, but it can have an effect on certain KPIs (most likely on CQI) that could be mistaken for a symptom of a real fault case. Our intention with this scenario was to test whether this case can be reliably identified by the diagnosis process. The second scenario, denoted by "TX", employed reduced transmission power in a cell, from the normal 46 dBm level down to 34 dBm. In a real network deployment, this could happen for example due to damaged cabling (via increased power dissipation), but too much downtilt of an antenna has effectively the same symptom. The third scenario, called "SH", was a high shadowing case, when a relatively small part (about 5%) of a cell's coverage area was under high shadowing. This can happen due to a changed radio propagation environment (construction work, buildings, etc.). The fourth scenario, bearing the name of the null target T0, was the reference case without any fault or unusual behavior in the system. In scenarios UE, TX and SH, only one cell (the same in each scenario) was affected by the corresponding fault or anomaly at a time.

The diagnosis process in the faulty cell is expected to be able to differentiate between these four scenarios. In order to keep the number of symbols down, the above defined scenario names are also used as the diagnosis targets in the system. Although in the case of UE there is no actual fault in the network, these three cases will nonetheless be collectively referred to as faults to maintain simplicity.

Now that the scenarios and targets have been defined, the only thing left for the diagnosis process is the expert knowledge, i.e., the reports linking KPIs to diagnosis targets. This expert knowledge was collected by conducting shorter interactive simulations in the different scenarios, placing the faults in different cells, monitoring KPI levels through the detection process (without the diagnosis part), observing which KPIs are sensitive to the different fault cases (i.e., their level has increased above 0.5 in accordance with Section V-B) and submitting reports accordingly; this is the same process an operator would use in a real network deployment. The reports collected this way are summarized in Table IV. Instead of writing out multiple identical reports, the table simply indicates how many times they have been added. Note that there could in fact be several reports with the same KPI subset and diagnosis target, but, as demonstrated by the SH target, the same diagnosis target does not necessarily receive the same KPI subsets all the time. This is expected and that is exactly why the likelihood values have been introduced in the diagnosis procedure.

The expert knowledge collected is by no means surprising if one considers the KPIs in the environment. If the users move away from the serving base station, their channel quality becomes lower due to the increased path loss and interference; however, it has no impact on the call drop ratio or how far handovers are made in the affected cell. On the other hand, if some parts of the cell are under high shadowing, the call drop probability is expected to increase in the cell, but again, it does not make the cell shrink or deteriorate the average channel quality (this is due to the small area occupied by the shadowing compared to the whole coverage area of the cell). However, sometimes (e.g., when there are more users walking into the shadowing area) their obviously lower CQI reports can gain a majority in the sliding detection window (see (3)), resulting in a higher KPI level also for CQI. Finally, the reduction of the transmission power in a cell results in lower channel quality and also in handovers made closer to the base station, as the coverage areas of neighbor cells penetrate into the coverage area of the faulty cell.
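As a cross-check, the likelihood values in Table V follow directly from the report counts in Table IV via (11); for instance, target SH has five reports in total, of which all five contain CD, one contains CQI and none contains HO-TADV:

$$\ell^{SH}(\mathrm{CQI}) = \frac{1}{5} = 0.2, \qquad \ell^{SH}(\mathrm{CD}) = \frac{5}{5} = 1.0, \qquad \ell^{SH}(\mathrm{HO\text{-}TADV}) = \frac{0}{5} = 0.0$$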

TABLE VI
NORMALIZED SCORES OF THE TARGETS IN EACH SCENARIO IN THE MONITORED CELL AT 1h 00m SIMULATION TIME.

scenario T0     scenario TX     scenario SH     scenario UE
T0  1.00        TX  2.05        SH  1.20        UE  1.33
UE  0.75        UE  1.53        T0  1.00        TX  1.00
SH  0.65        T0  1.00        UE  0.67        T0  1.00
TX  0.50        SH  0.58        TX  0.33        SH  0.67

B. Simulation results

After the expert knowledge had been collected, the diagnosis process was activated in the system. As the first step, the target specific likelihood values of each KPI were pre-computed according to (11). The numerical results are given in Table V.

For each of the four scenarios outlined before, there was a simulation covering two hours of network operation. For the first 30 minutes, there was no fault in the system in any of the scenarios. Then, the fault according to the scenario was deployed in one of the cells (except for the reference scenario, of course) and the simulation proceeded with one hour of simulation time, during which the fault was present. After that, the fault was withdrawn (an action corresponding to deploying a solution in a real network) and the simulation continued for another 30 minutes before it terminated. For the reference case, the simulation was running without any change in the radio environment for the whole two hours of simulation time. During these simulations, the faults were deployed in different cells than in any of the previous interactive sessions used for collecting the expert knowledge. Also, the random seed for each simulation was altered uniquely.

The diagnosis process was run once every second in the monitored cell, i.e., the one affected by the fault or, in the reference case, one of the cells of the central BTS. The current KPI levels and the output of the diagnosis (each target with its normalized score) were logged each time. As an example of the output of the diagnosis, Table VI shows the normalized scores of each target in descending order. Snapshots were taken at 1h 00m simulation time, when the faults had already been deployed for 30 minutes. It is clear that the target with the highest score matches the scenario in each case, which means that the diagnosis was able to correctly identify the target corresponding to each scenario. Since these scores are normalized as per (18), the severity of the fault cases can also be ranked as TX > UE > SH (actually, since UE requires no action from the operator, it boils down to TX > SH). Hence, if these faults were to appear simultaneously in a network and the operator was to go after only one fault at a time, it would be better to take TX first.

The KPI levels of CQI, CD and HO-TADV were also recorded periodically during the simulations. The curves for scenario UE are shown in Fig. 7. After the fault deployment at 0h 30m, the level of CQI starts increasing rapidly and shortly approaches 1.0. During the fault deployment, it stays close

Fig. 7. KPI levels of the monitored cell in scenario UE (users are not allowed to approach the serving base station closer than 200 m).

base station. These are obviously reflected in the CQI level but the diagnosis maintained a steady outcome nonetheless. This is also due to a time to trigger mechanism applied on top of the diagnosis: if the winning target changes, it becomes official and is reported to the operator only if its number one position remains stable for at least t = 60 s. Although this brings a latency into the diagnosis, it also stabilizes it. Other (typically even higher) time to trigger values may be more suitable in real deployments, also depending on how frequently the diagnosis is run and how often KPI samples are collected. Alternatively, if the output of the diagnosis is one of the faulty targets, the frequency of running the diagnosis can be automatically increased to find out whether it is a stable situation.
time. During these simulations, the faults were deployed in After the fault is withdrawn, the CQI level soon falls back
different cells than in any of the previous interactive sessions near to zero. Again, the same time to trigger mechanism is
with the purpose of collecting the expert knowledge. Also, the applied here: the diagnosis process only reports the return of
random seed for each simulation was altered uniquely. the normal (faultless) state after the null target has been the
The diagnosis process was run once every second in the winning target for at least t seconds. One can observe that
monitored cell, i.e., the one affected by the fault or, in the the CD level slightly increases towards the end of the fault
reference case, one of the cells of the central BTS. The current deployment period; this can be due to the users continuously
KPI levels and the output of the diagnosis (each target with its having worse channel quality (as it can be seen from the CQI
normalized score) have been logged each time. As an example level sticking to 1.0) but its significance is negligible and
for the output of the diagnosis, Table VI shows the normalized does not bother the diagnosis at all. It is an advantage of
scores of each target in descending order. Snapshots were the diagnosis process that no exact match is required between
taken at 1h 00m simulation time, when the faults have already the KPIs with high level and the KPI subset of a target. In
been deployed for 30 minutes. It is clear that the target with traditional systems without diagnosis, the operator could get
the highest score matches the scenario in each case, which an alarm signaling high call drop level only to be canceled
means that the diagnosis was able to correctly identify the soon, just adding more noise to the anyway quite dynamically
target corresponding to each scenario. Since these scores are changing alarm list. Overall, the fault deployment has been
normalized as per (18), the severity of the fault cases can also in the system from 0h 30m to 1h 30m and the diagnosis was
be ranked as TX > UE > SH (actually, since UE requires no reporting the correct target from 0h 32m to 1h 33m . Considering
action from the operator, it boils down to TX > SH). Hence, the one minute latency introduced by the time to trigger
if these faults were to appear simultaneously in a network and mechanism, this result means that the diagnosis follows the
the operator was to go after one fault only at a time, it would state of the system quite closely.
be better to take TX first. Similarly to scenario UE, Fig. 8 shows the KPI level curves
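As a minimal illustration of this ranking step (the helper and data structures below are assumptions of the sketch, not part of the implemented framework; the example scores are taken from Table VI):

    def rank_targets(normalized_scores):
        # Illustrative only: sort diagnosis targets by descending normalized score.
        return sorted(normalized_scores.items(), key=lambda kv: kv[1], reverse=True)

    # Example: normalized scores of the monitored cell in scenario TX (Table VI)
    scores_tx = {"TX": 2.05, "UE": 1.53, "T0": 1.00, "SH": 0.58}
    ranking = rank_targets(scores_tx)   # [("TX", 2.05), ("UE", 1.53), ("T0", 1.00), ("SH", 0.58)]
    winner, winner_score = ranking[0]   # the reported diagnosis: TX
    # Comparing the winning scores across the affected cells (2.05, 1.33, 1.20)
    # yields the severity ordering TX > UE > SH mentioned above.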
The KPI levels of CQI, CD and HO-TADV were also recorded periodically during the simulations. The curves for scenario UE are shown in Fig. 7. After the fault deployment at 0h 30m, the level of CQI starts increasing rapidly and shortly approaches 1.0. During the fault deployment, it stays close to 1.0, occasionally dropping for short periods. This is due to the user movements and handovers; if a user moves to a position within the main lobe of the antenna, the received signal will still be good despite the increased distance from the base station. These are obviously reflected in the CQI level, but the diagnosis maintained a steady outcome nonetheless. This is also due to a time to trigger mechanism applied on top of the diagnosis: if the winning target changes, it becomes official and is reported to the operator only if its number one position remains stable for at least t = 60 s. Although this brings a latency in the diagnosis, it also stabilizes it. Other (typically even higher) time to trigger values may be more suitable in real deployments, also depending on how frequently the diagnosis is run and how often KPI samples are collected. Alternatively, if the output of the diagnosis is one of the faulty targets, the frequency of running the diagnosis can be automatically increased to find out whether it is a stable situation.
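A minimal sketch of such a time to trigger filter is given below; the class and attribute names are illustrative assumptions, only the 60 s default comes from the text:

    class TimeToTrigger:
        # Illustrative sketch: a new winning target becomes "official" only after it
        # has held the number one position for at least t_trigger seconds.
        def __init__(self, t_trigger=60.0):
            self.t_trigger = t_trigger
            self.reported = None    # target currently reported to the operator
            self.candidate = None   # latest raw winning target of the diagnosis
            self.since = None       # time at which the candidate became the winner

        def update(self, winning_target, now):
            # Feed the raw diagnosis output; return the stabilized, reported target.
            if winning_target != self.candidate:
                self.candidate, self.since = winning_target, now
            elif self.candidate != self.reported and now - self.since >= self.t_trigger:
                self.reported = self.candidate
            return self.reported

The same filter also delays reporting the return to the null target after a fault is withdrawn, matching the behavior described below.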
After the fault is withdrawn, the CQI level soon falls back to near zero. Again, the same time to trigger mechanism is applied here: the diagnosis process only reports the return of the normal (faultless) state after the null target has been the winning target for at least t seconds. One can observe that the CD level slightly increases towards the end of the fault deployment period; this can be due to the users continuously having worse channel quality (as can be seen from the CQI level sticking to 1.0), but its significance is negligible and does not disturb the diagnosis at all. It is an advantage of the diagnosis process that no exact match is required between the KPIs with high level and the KPI subset of a target. In traditional systems without diagnosis, the operator could get an alarm signaling a high call drop level only to be canceled soon, just adding more noise to the anyway quite dynamically changing alarm list. Overall, the fault was present in the system from 0h 30m to 1h 30m and the diagnosis was reporting the correct target from 0h 32m to 1h 33m. Considering the one minute latency introduced by the time to trigger mechanism, this result means that the diagnosis follows the state of the system quite closely.

Similarly to scenario UE, Fig. 8 shows the KPI level curves for scenario TX. This time both the CQI and HO-TADV levels jump up after reducing the transmission power of the cell. The CQI level is due to all users experiencing worse channel quality compared to the profile (since the power reduction affects everybody in the cell, no matter how far they are from the base station). This symptom alone could easily be mistaken for scenario UE; what indeed differentiates the two scenarios is the behavior of HO-TADV. Its rapid ascent is due to the relatively large number of users handing over to neighbor
cells after the power reduction; and what is more, these are not only the users who were already close to the cell edge but also some of those closer to the base station. They would have a lower timing advance before the handover, which yields an increased HO-TADV level (note that the level function of HO-TADV is the same as that of CQI, i.e., sensitive to decreased KPI samples). After the power was reset to its original value at 1h 30m, the channel quality reports soon put the CQI level back to zero. The HO-TADV level follows with some lag; the reason for this is that each handover updates HO-TADV in the source cell only, but the monitored cell is the target of those handovers that happen due to restoring the power. Therefore, the HO-TADV level catches up due to the handovers of users actually moving out of the monitored cell, which is a slower process. Nevertheless, the diagnosis has already picked up the correct target (in this case, the null target) sooner, since the increased level of HO-TADV alone already matches the null target best (according to the provided expert knowledge). The level of CD stays low through the whole simulation, as expected.

Fig. 8. KPI levels of the monitored cell in scenario TX (transmission power degradation from 46 dBm to 34 dBm). [Plot: CQI, Call drop and HO-TADV levels vs. simulation time (0h00m–2h00m); fault present 0h30m–1h30m; diagnosis reports TX from 0h32m to 1h35m.]

The KPI curves of scenario SH are shown in Fig. 9. In this case, it is the CD level that reacts to the high shadowing deployed at 0h 30m. The CQI level stays low since the high shadowing area is relatively small (5%) compared to the coverage area of the whole cell. Also, the HO-TADV level is low because the radius of the cell has not shrunk. However, the high shadowing was strong enough to increase the call drop ratio, because each user walking into it with an active call was likely to have it dropped.

Fig. 9. KPI levels of the monitored cell in scenario SH (high shadowing in some small spots of the coverage area). [Plot: CQI, Call drop and HO-TADV levels vs. simulation time (0h00m–2h00m); fault present 0h30m–1h30m; diagnosis reports SH from 0h33m to 1h36m.]

In the reference case, the levels of all three KPIs were close to zero all the time; the diagnosis reliably delivered the null target throughout the whole simulation. In summary, the KPI levels reacted quickly both to the faults deployed in the system and to their withdrawal. This was due to the large number of active users, the choice of KPIs and the high quality profiles. The latency of the detection is determined by the rate at which KPI samples are collected at a cell. In the case of KPIs based on radio measurements and mobility or traffic parameters, it depends on the number of active users, which can vary significantly in a real network within a day. Of the KPIs chosen in the simulator, CQI is by far the fastest and HO-TADV is the slowest, since handovers were the rarest events compared to calls and CQI reports.

V. NETWORK INTEGRATION

After the evaluation of the feasibility of the detection and diagnosis framework, we discuss practical and implementation issues and identify possibilities for the extension of the framework and key cooperation points with other network functions.

A. Active measurements

The assumption so far has been that KPIs in a system are measured regularly and their value is always up-to-date. However, in reality this is not the case for all KPIs (e.g., because they would generate too much logging if they were enabled all the time); therefore, additional steps are needed when these missing KPIs could indeed play a key role in delivering the correct diagnosis. The level of KPIs that are not regularly measured should be ignored by the diagnosis process; however, the impact of different faults on these KPIs should be part of the expert knowledge, so that measuring them when actually needed (i.e., when they can be the differentiator between targets) gives a result directly processable by the diagnosis process. An on-demand measurement for a KPI like this is called an active measurement.

Active measurements can be made for KPIs that are not measured regularly, for KPIs with missing measurements (due to errors in data transport or processing), or even for KPIs that, although measured and updated regularly, are more outdated compared to other KPIs (e.g., their regular measurement interval is one hour and there would still be 20 minutes until the next measurement at the time the diagnosis is taken). If the active measurement can be initiated automatically by the diagnosis process, there is no interruption in the automation. On the other hand, some KPIs cannot be measured without human assistance (e.g., they require switching on additional equipment or the measurement has to be taken at a specific location in the field). In these cases, the diagnosis process can send an alarm with the semantics "measurement required" to the operator, including the list of KPIs for which a measurement is needed.

In order to decide which KPI to take for active measurement, one can take into account the relevance of the KPI and the cost of the active measurement. From the relevance point of view, if targets T1 and T2 are close, it is better to select the KPI arg max_{K ∈ K} |T1(K) − T2(K)|, because it is expected to be the best tie-breaker by effectively pulling the scores of those targets apart in opposite directions. For the cost part, each KPI can have an associated cost reflecting the effort required
for its active measurement or the time it takes to conduct the measurement. KPIs that can easily or quickly be the subject of an active measurement should have a lower cost than those requiring human interaction or taking a long time to obtain.
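As a sketch of this selection rule (the relevance term follows the arg max |T1(K) − T2(K)| criterion above, while the way relevance is traded off against cost is an assumption of the sketch, not something prescribed by the framework):

    def select_active_kpi(t1_likelihood, t2_likelihood, cost, candidate_kpis):
        # Illustrative sketch: pick the KPI to measure actively when the two best
        # targets T1 and T2 are close. t1_likelihood/t2_likelihood map KPI -> target
        # specific likelihood; cost maps KPI -> effort of an active measurement.
        def utility(kpi):
            relevance = abs(t1_likelihood[kpi] - t2_likelihood[kpi])
            return relevance / cost.get(kpi, 1.0)  # assumed trade-off: relevance per unit cost
        return max(candidate_kpis, key=utility)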
There can be KPIs that are so specific to active measurements that they are never measured otherwise at all, e.g., making a test call to a specific cell. Similarly to low level alarms integrated into the detection process, these KPIs have no profile, enabling one sample to generate an up-to-date KPI level in the detection; their association with targets, however, should be included in the expert knowledge as with other KPIs. Section VI discusses active measurements in related work.

B. Management of expert knowledge

An advantage of the framework is that it can start off even with an empty expert knowledge base and have it populated gradually as faults happen and the operator supplies the root cause or the corrective action if the former is unknown.
Identifying the KPI subset for the reports can also be automated by running the detection process and taking those KPIs whose level is greater than, e.g., 0.5 at the time the fault was manually diagnosed. Note that this is only an example of semi-automatic report creation and it is not the only possibility to create a mapping between KPIs and targets; however, if such a threshold is indeed chosen to be used, it can be a common one for all KPIs due to the unified level concept, and it is by no means a reestablishment of the legacy KPI thresholding. As the number of reports grows, the diagnosis framework can be put into operation. First, the operator is likely to still do the diagnosis but also consult the automatic diagnosis and compare the results. If the framework has stabilized in supplying the expected diagnosis, it could be entrusted with doing the work alone.
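The semi-automatic report creation mentioned above could, for instance, look like the following sketch (the report structure is an assumption; only the 0.5 example threshold and the KPI names come from the text):

    LEVEL_THRESHOLD = 0.5  # example common threshold from the text

    def create_report(kpi_levels, target):
        # Illustrative sketch: when a fault is diagnosed manually, the KPIs whose
        # level exceeds the common threshold form the KPI subset of the new report;
        # the target is the root cause or, if unknown, the corrective action.
        kpi_subset = {kpi for kpi, level in kpi_levels.items() if level > LEVEL_THRESHOLD}
        return {"target": target, "kpi_subset": kpi_subset}

    # Example: levels observed when a transmission power fault was diagnosed manually
    report = create_report({"CQI": 0.95, "HO-TADV": 0.8, "CD": 0.1}, target="TX")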
Central management of the expert knowledge in the OSS system could be a practical implementation of the knowledge database, with regular synchronization to a local cache in each base station. The expert knowledge should be stored in a form that enables editing of existing reports, e.g., complementing a corrective action target with the corresponding root cause or correcting the KPI subset or the target of a report that was based on mistaken observations.

While the KPI levels can be continuously updated as new KPI samples arrive, the diagnosis does not need to be constantly active; it can be run periodically (e.g., hourly or even daily if it is considered to be enough) and send an alarm containing the ordered list of targets in case the winning target in a cell is different from the null target. Sending the complete list (or, alternatively, the top-five or top-ten targets) gives options to the operator if the diagnosis turns out to be wrong. Even if that happens, further targets are readily provided to the troubleshooter to check for validity.
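A possible shape of such a periodic run is sketched below; the alarm payload and the top-five cut are illustrative assumptions, and diagnose() stands for the target scoring of (11)–(18), which is not reproduced here:

    def periodic_diagnosis(cell, diagnose, null_target="T0", top_n=5):
        # Illustrative sketch: run the diagnosis for a cell and raise an alarm with
        # the ordered target list only if the winning target differs from the null target.
        # diagnose(cell) is assumed to return (target, normalized_score) pairs, best first.
        ranking = diagnose(cell)
        winner, _ = ranking[0]
        if winner != null_target:
            return {"cell": cell, "ordered_targets": ranking[:top_n]}  # alarm payload
        return None  # no alarm: the winning target is the null target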
C. SON coordination

The automated diagnosis will certainly not be the only automated function in a network, only one of many different SON and other automated functions. As a consequence, changes in the radio environment and network configuration can be due to deliberate actions carried out by other entities but may produce symptoms similar to fault cases. Care should be taken not to mistake one for the other. Therefore, coordination [17] between the diagnosis and other functions is important.

Deploying a new base station would introduce so much change in the radio environment of the new cells that it would render existing profiles unusable, necessitating re-profiling, during which automated diagnosis in the affected cells should be turned off. Switching off a cell for energy saving produces the same symptoms as a power failure. However, by adding the administrative state of the cell as a KPI (possibly to be checked as an active measurement if a faulty target comes up), the diagnosis process can automatically recognize the situation and avoid false alarms.

The diagnosis process is not always the one being instructed by other SON functions; it can also be the other way around: the diagnosis can itself trigger other functions in the network such as cell reset or automatic cell outage compensation.

VI. RELATED WORK

There have been several approaches published aiming to provide a solution to the requirement of automated diagnosis in systems with increasing complexity and, consequently, an increasing number of potential fault cases; this section gives an overview of these techniques.

Rule-Based Systems (RBS) have been used in several technology areas to address automated fault diagnosis [18]. These are expert systems that consist of a set of "IF condition THEN action" rules to store and maintain knowledge. Rule-based systems can effectively handle small, static and deterministic domains; however, the rule set grows exponentially with the domain size and it is hard to create rules in dynamic or non-deterministic fields, which makes rule management a challenging task. Therefore, RBSs have not been seriously considered for diagnosis automation in cellular networks.

Bayesian Networks (BN) can be effectively used in domains where RBSs fail due to the uncertain nature of expert knowledge. A BN is a probabilistic graphical model usually represented by a Directed Acyclic Graph (DAG). The nodes of the DAG are random variables with discrete states and each of its edges represents conditional (usually causal) dependence between the connected nodes. Each node is associated with a conditional probability table, which can be filled based on either expert knowledge [19] or the analysis of previous fault cases [20] (similarly to the reports in this paper). The table is used to compute the actual conditional probability of the nodes given the current state of their parents. When applied in the fault diagnosis domain, one set of nodes can represent symptom variables while another set can represent faults that have an effect on the symptoms. Thus, in a BN, uncertain relationships between faults and their symptoms can be effectively represented by the conditional probability tables. Methods exist that enable a BN to learn its parameters as well as its structure from observed data. In the work presented in [21], a naïve Bayesian classifier is introduced for automated diagnosis in UMTS networks. Using a Bayesian classifier with continuous KPIs has also been proposed, which demonstrates the capability of BNs to represent not only discrete variables but continuous ones as well. A study comparing continuous and discrete BN diagnosis systems [22] shows that the
continuous model produces better diagnosis accuracy and less sensitivity to imprecise model parameters if a large number of training cases are available. However, the discrete model is preferred in limited training set scenarios.

Several competitive Neural Network (NN) algorithms are studied and compared in [23] for fault detection and diagnosis of a simulated CDMA2000 network. The system is represented by one global and several local normality profiles (i.e., the expected distribution of quantization errors). The joint application of global and local profiles results in a more robust and efficient diagnostic tool. In [24], the authors present their results of applying another variant of NN, called Self-Organizing Maps, to cell performance data analysis to enhance the traditional manual troubleshooting process in 3G mobile networks. The method effectively visualizes and groups cells based on their performance, which helps experts be more effective in troubleshooting and parameter optimization.

In [25], the authors describe an anomaly detection and fault diagnosis framework using a bi-cycle Auto Regressive (AR) model and evidential reasoning based on Dempster–Shafer (DS) theory. First, a change detection algorithm is applied to all observed performance variables, where a dedicated AR model is used to determine the profile of each variable as well as the deviation of the actual observation from the model. Second, the pattern of deviations is fed into a classification engine based on DS reasoning to determine the most probable root cause of the actual anomaly. The authors applied their framework to IP networks, focusing on packet forwarding anomalies.

In [26], the authors present a novel approach to detect anomalies affecting several mobile users with dynamic profile identification. The proposal is based on the analysis of unidimensional distributions of certain features across individual mobile users using multiple timescales. An important conclusion is that while an automated detector can help recognize statistical anomalies (significant deviations from expected behavior), their semantic interpretation (diagnosis) is still up to human experts.

Another main approach to fault detection in complex systems is Case-Based Reasoning (CBR). The CBR approach is fundamentally different from RBSs, BNs and NNs in the way it addresses fault diagnosis and problem solving in general. The main differentiator is that CBR systems do not require a structured knowledge base that must be acquired from domain experts. There is no need to define any associations or general relations between symptoms and fault cases and use this knowledge to build a model. Instead, CBRs store specific knowledge extracted from previously observed fault cases to form a continuously updated knowledge base. In order to solve a given problem, a similar former case is retrieved from the case base and applied to the current situation. Another characteristic that makes CBR different from other expert systems is that the knowledge is continuously updated during operation, since every new case is incorporated into the database after it has been resolved. CBR has been applied to diagnosis in several problem domains (e.g., for software systems [27]) but there has been no proposal addressing cellular systems. The diagnosis framework described in this paper shares some basic concepts with CBR systems. The most important common point is that the proposed framework builds its knowledge base by learning from earlier fault cases observed and diagnosed, and it finds the most similar earlier case to diagnose a new fault (i.e., by summing the weighted continuous Hamming distances). Note, however, that unlike the traditional CBR approach, the proposed framework does not compare a new case directly to individual previous cases (i.e., reports) but to the cumulative KPI footprint of all reports associated with a common target, represented by the likelihood and consistency values.

The concept of active measurements appears in other proposals as part of diagnosis; however, none of these proposals focus on mobile communication systems. In [28], the authors integrate service-level monitoring with fault management functions such as event correlation and fault diagnosis for interconnected Ethernet networks. As part of this work, they proposed a method for root cause analysis based on Petri-nets that schedules so-called active investigation checks (e.g., specialized measurements, database queries, etc.) to narrow the set of potential root causes of the actual fault. The constructed Petri-net defines an active action plan (i.e., a sequence of measurements) that drives the diagnosis from a triggering alarm towards finding its root cause by acquiring the necessary information on the fly. An earlier work [29] combines Bayesian networks with decision theory to achieve the same goal; active investigations have also appeared together with incremental alarm correlation for fault diagnosis [30].

VII. CONCLUSION

In this paper, we presented an integrated detection and diagnosis framework where the detection process is based on automatically built profiles. Diagnosis follows a reasoning similar to the thinking of an operator, looking at previous fault cases and trying to find the best matching root cause. Detection and diagnosis are connected via a unified KPI level interface that integrates different kinds of KPIs and eliminates the need for manual adjustment and threshold calibration. Active measurements can help in tie resolution or in reassuring the output of the diagnosis. The framework is KPI and RAT agnostic; although its feasibility has been tested using KPIs in a simulated LTE system, it could be used in 2G, 3G or HSPA as well. Future work should be devoted to the further evaluation of the system with more KPIs and more complex fault cases as well as to comparison with other fault detection and diagnosis methods. Working on real network KPI data instead of simulations is also a priority. The investigation of the impact of a dynamic network environment on the expert knowledge is also needed.

REFERENCES

[1] NGMN Alliance, "NGMN Recommendation on SON and O&M Requirements," Requirement Specification, Dec. 2008. Available: http://www.ngmn.org/uploads/media/NGMN Recommendation on SON and O M Requirements.pdf
[2] 3GPP, "Telecommunication management; Self-Organizing Networks (SON); Concepts and requirements," 3rd Generation Partnership Project (3GPP), TS 32.500, Oct. 2010.
[3] 3GPP, "Evolved Universal Terrestrial Radio Access Network (E-UTRAN); Self-configuring and self-optimizing network (SON) use cases and solutions," 3rd Generation Partnership Project (3GPP), TR 36.902, Apr. 2011.
[4] H. Sanneck, Y. Bouwen, and E. Troch, "Context based configuration management of plug & play LTE base stations," in Proc. 2010 IEEE Network Operations and Management Symposium, pp. 946–949.
[5] 3GPP, "Telecommunication management; Automatic Neighbour Relation (ANR) management; Concepts and requirements," 3rd Generation Partnership Project (3GPP), TS 32.511, Apr. 2011.
[6] T. Bandh, G. Carle, H. Sanneck, L. Schmelz, R. Romeikat, and B. Bauer, "Optimized network configuration parameter assignment based on graph coloring," in Proc. 2010 IEEE Network Operations and Management Symposium, pp. 40–47.
[7] P. Szilágyi and H. Sanneck, "LTE relay node self-configuration," in Proc. 2011 IFIP/IEEE International Symposium on Integrated Network Management, N. Agoulmine, C. Bartolini, T. Pfeifer, and D. O. Sullivan, editors, pp. 841–855.
[8] A. Lobinger, S. Stefanski, T. Jansen, and I. Balan, "Load balancing in downlink LTE self-optimizing networks," in Proc. 2010 IEEE Vehicular Technology Conference – Spring, pp. 1–5.
[9] I. Siomina, P. Varbrand, and D. Yuan, "Automated optimization of service coverage and base station antenna configuration in UMTS networks," IEEE Wireless Commun., vol. 13, no. 6, pp. 16–25, Dec. 2006.
[10] 3GPP, "Telecommunication management; Self-Organizing Networks (SON); Self-healing concepts and requirements," 3rd Generation Partnership Project (3GPP), TS 32.541, Mar. 2011.
[11] P. Zanier, R. Guerzoni, and D. Soldani, "Detection of interference, dominance and coverage problems in WCDMA networks," in Proc. 2006 IEEE International Symposium on Personal, Indoor and Mobile Radio Communications, pp. 1–5.
[12] P. Szilágyi and S. Nováczki, "Radio channel degradation detection and diagnosis based on statistical analysis," VTC-2011 Spring IWSON Workshop, vol. 15.
[13] R. Barco, V. Wille, L. Díez, and P. Lázaro, "Comparison of probabilistic models used for diagnosis in cellular networks," in Proc. 2006 IEEE Vehicular Technology Conference – Spring, vol. 2, pp. 981–985.
[14] 3GPP, "Telecommunication management; Key Performance Indicators (KPI) for Evolved Universal Terrestrial Radio Access Network (E-UTRAN): Definitions," 3rd Generation Partnership Project (3GPP), TS 32.450, Apr. 2011.
[15] H. Wietgrefe, "Investigation and practical assessment of alarm correlation methods for the use in GSM access networks," in Proc. 2002 IEEE/IFIP Network Operations and Management Symposium, pp. 391–403.
[16] 3GPP, "Physical layer aspect for evolved Universal Terrestrial Radio Access (UTRA)," 3rd Generation Partnership Project (3GPP), TS 25.814, Oct. 2006.
[17] H. Sanneck, C. Schmelz, T. Bandh, R. Romeikat, G. Carle, and B. Bauer, "Policy-driven workflows for mobile network management automation," in Proc. 2010 International Wireless Communications and Mobile Computing Conference.
[18] C. Angeli and A. Chatzinikolaou, "On-line fault detection techniques for technical systems: a survey," International J. Computer Sci. & Appl., vol. 1, no. 1, pp. 12–30, 2004.
[19] R. Barco, P. Lázaro, V. Wille, L. Díez, and S. Patel, "Knowledge acquisition for diagnosis model in wireless networks," Expert Syst. Appl., vol. 36, no. 3, pp. 4745–4752, 2009.
[20] R. Barco, V. Wille, L. Díez, and M. Toril, "Learning of model parameters for fault diagnosis in wireless networks," Wireless Networks, vol. 16, pp. 255–271, Jan. 2010.
[21] R. Khanafer, B. Solana, J. Triola, R. Barco, L. Moltsen, Z. Altman, and P. Lázaro, "Automated diagnosis for UMTS networks using Bayesian network approach," IEEE Trans. Veh. Technol., vol. 57, no. 4, pp. 2451–2461, July 2008.
[22] R. Barco, P. Lázaro, L. Díez, and V. Wille, "Continuous versus discrete model in autodiagnosis systems for wireless networks," IEEE Trans. Mobile Comp., vol. 7, no. 6, pp. 673–681, June 2008.
[23] G. Barreto, J. Mota, L. Souza, R. Frota, and L. Aguayo, "Condition monitoring of 3G cellular networks through competitive neural models," IEEE Trans. Neural Networks, vol. 16, no. 5, pp. 1064–1075, Sep. 2005.
[24] J. Laiho, K. Raivio, P. Lehtimäki, K. Hätönen, and O. Simula, "Advanced analysis methods for 3G cellular networks," IEEE Trans. Wireless Commun., vol. 4, no. 3, pp. 930–942, May 2005.
[25] N. Samaan and A. Karmouch, "Network anomaly diagnosis via statistical analysis and evidential reasoning," IEEE Trans. Network and Service Management, vol. 5, no. 2, pp. 65–77, 2008.
[26] A. D'Alconzo, A. Coluccia, F. Ricciato, and P. Romirer-Maierhofer, "A distribution-based approach to anomaly detection and application to 3G mobile traffic," in Proc. 2009 IEEE Global Telecommunications Conference, pp. 1–8.
[27] S. Montani and C. Anglano, "Case-based reasoning for autonomous service failure diagnosis and remediation in software systems," in Proc. 2006 European Conference on Case-Based Reasoning (ECCBR), Lecture Notes in Artificial Intelligence 4106, pp. 489–503. Springer-Verlag, 2006.
[28] P. Varga and I. Moldován, "Integration of service-level monitoring with fault management for end-to-end multi-provider Ethernet services," IEEE Trans. Network and Service Management, vol. 4, no. 1, pp. 28–38, June 2007.
[29] D. Heckerman, J. S. Breese, and K. Rommelse, "Decision-theoretic troubleshooting," Commun. ACM, vol. 38, pp. 49–57, Mar. 1995.
[30] Y. Tang, E. Al-Shaer, and R. Boutaba, "Efficient fault diagnosis using incremental alarm correlation and active investigation for Internet and overlay networks," IEEE Trans. Network and Service Management, vol. 5, no. 1, pp. 36–49, 2008.

Péter Szilágyi received an M.Sc. degree in Software Engineering from the Budapest University of Technology and Economics (BUTE), Hungary, in 2009. Currently he is a research engineer at Nokia Siemens Networks, focusing on Self-Organizing Networks including self-configuration solutions for LTE-A Relay nodes and contributing to 3GPP RAN3 standardization. His interests include self-healing in 3G and LTE systems as well as developing proof-of-concept and performance simulations for LTE.

Szabolcs Nováczki received an M.Sc. degree in Electrical Engineering from the Budapest University of Technology and Economics (BUTE), Hungary, in 2006. Szabolcs is a research engineer at Nokia Siemens Networks, focusing on Self-Organizing Networks solutions, particularly self-healing aspects of cellular networks. He currently studies the applicability of machine learning techniques to achieve automated fault detection, diagnosis and prediction in 3G and LTE radio access networks.