DOI: 10.1109/ICENCO.2017.8289796


On the Use of Predictive Analytics Techniques for Network Elements Failure Prediction in Telecom Operators

Ahmed F. Fahmy, Ahmed H. Yousef, Hoda K. Mohamed
Computers & Systems Department, Faculty of Engineering, Ain Shams University, Cairo, Egypt.
Ahmed.fahmy@eng.asu.edu.eg, ahassan@eng.asu.edu.eg, hoda.korashy@eng.asu.edu.eg

Abstract— Reliability of the telecom operator's optical network infrastructure is one of the operator's most important competitive advantages for providing the required quality of service to its customers. Massive amounts of voice and data traffic can be lost due to network failure, especially failures of nodes (optical switches). In this paper, we apply predictive analytics techniques at one of the biggest telecom operators in the Middle East to predict node failures in advance, so that precautionary measures can be taken before a node fails.

Keywords— data mining; predictive analytics; telecom; network element; failure

I. INTRODUCTION

Significant amounts of data and voice traffic can be lost due to network failure. There may be cases in which only parts of a node fail while other parts remain functional. For example, only some of the links on an optical switch may cease to function because of a hardware failure in the switch. In such cases, the protection scheme may be able to treat them as individual link failures and take appropriate measures to recover them. However, determining the nature and extent of the failure may be difficult. There is another problem with allowing different failure modes for nodes: most protection schemes do not support simultaneous recovery from several failures. Therefore, we assume that none of the channels on a failed node remain active.

In this paper, we propose a methodology for building a Node Failure Prediction Model that helps predict node failures in advance, so that precautionary measures can be taken before a node fails. Predictive analytics [1] will be applied to achieve this objective. The term "predictive analytics" describes any data mining approach with four attributes: a focus on prediction more than on description, classification, or clustering; a short time to insight (measured in minutes or hours); a focus on the resulting insights and their relevance to the business; and a focus on usability by making the tools accessible to business users. Predictive analytics can bring clarity and consistency to any situation where the likely future behavior or condition is uncertain.

The paper is organized as follows: Section 2 describes the background and related work, Sections 3 and 4 demonstrate the methodology and the results from deploying it in a real environment, and Section 5 presents the conclusion.

II. BACKGROUND AND RELATED WORK

To the extent of our knowledge, predictive analytics algorithms have not previously been used for optical network equipment failure prediction with our proposed methodology. Existing work in this field focuses on the protection and restoration of failures in fiber-optic networks. For example, [2] and [3] presented efforts towards network protection algorithms, but these algorithms detect a failure only after it occurs. Similarly, in [4] the authors proposed four different protection schemes for a converged access network architecture in which a hybrid passive optical network is deployed to provide high bandwidth to end users; however, they did not consider how to forecast failures. In summary, some of the above researchers were able to handle the protection and restoration of failures in fiber-optic networks, but no method was designed to build a generic framework for predicting failures of nodes (optical switches) in telecom operators.

III. METHODOLOGY

To explain our methodology, we first define the research problem, the research challenges, and the solution overview.

A. Research Problem
To maintain the high quality of fiber-optic nodes, detection of failed or malfunctioning nodes is essential. A node fails either because of communication device failure or because of battery, environment, or sensor device-related problems. Checking failed nodes manually in such an environment is troublesome. In the current process, nodes are maintained either through a periodic maintenance schedule or as and when a failure is reported. This process is expensive, because there is no information on which nodes should be prioritized for maintenance, and because a failure is fixed only after the node has failed, which leads to a bad customer experience.

978-1-5386-4266-5/17/$31.00 ©2017 IEEE
Therefore, there is a need to be able to predict which nodes are more likely to fail and address them with priority. The research question is thus: "How can we use measurement, customer, call, and field technician information to predict coaxial node failure far enough ahead of time that the network operations team can fix a node before it fails, thereby reducing the cost of maintenance, improving customer experience, and improving node lifetime?"

B. Research Challenges
There are several complications involved in predicting a node failure. Below are the major ones which need to be addressed.
▪ Multiple reasons for failure – A coaxial cable node can fail for several reasons. Failure can occur because of non-controllable factors such as thunderstorms, cyclones, extreme weather conditions, accidents, etc. Other reasons can be controllable, such as technical problems, poor equipment specifications, poor infrastructure, stolen bandwidth, etc. It is a challenge to cleanly capture all these factors in a data model that can be used for predictive analytics.
▪ Actionability of prediction – It is of utmost importance to predict failing nodes far enough ahead of time for the operations/maintenance team to use the prediction effectively. This further adds to the complexity of the predictive model.
▪ Seasonality – The relationship between failures and their causes can change with time and may lead to decay in the accuracy of the predictive model.
▪ Data captured from different sources – It is a challenge to accurately and effectively map together data captured from different sources in one place. A fair amount of effort and business understanding is required to roll up data captured at different levels from different sources and prepare a single version of the truth.

C. Solution Overview
The proposed improvement to the existing node maintenance process is to create a predictive model. This model predicts which nodes have a higher propensity to fail than others, helping the maintenance team prioritize node maintenance and proactively maintain nodes that are likely to fail. The proposed process will help decrease the overall node failure rate, improve customer experience, and reduce the cost of visits made to node sites for maintenance. The proposed Node Failure predictive model to solve the research problem is shown in Fig. 1.

Fig. 1. The proposed Node Failure predictive model

IV. EXPERIMENT

The data mining activities were carried out using Rapid Miner v7.5 [5]. The experiment was done on an HP Envy desktop computer with an Intel(R) Core(TM) i7-6560U CPU @ 2.20 GHz, 16 GB of RAM, and a 64-bit operating system. The Rapid Miner license used is an educational license, which avoids the trial version's limitation of handling at most ten thousand rows.

A. Data Understanding and Preparation
Data pre-processing is often a critical and essential part of the prediction process [6]. Its primary purpose is to obtain a final, correct data set containing only those columns that will contribute well to the data mining algorithms and model. Data pre-processing helps to focus on [7] the most critical issues within the well-known knowledge discovery from data process [8], [9].

1) Defining Node Failure
There are multiple ways to define a node failure. The network maintenance or engineering team can define and record when a node failed and what the reason for the failure was. In our case, we have used an outage flag to determine node failure.

2) Identify and Define Predictor Variables
A set of predictor variables believed to have an impact on node performance is created using data from the 24 hours before a node failure, drawn from different sources: measurement data, customer data, call interaction data, and field technician data. Table I shows the attributes of these datasets. For developing any analytics model, the first step is to build one dataset with all the necessary data that goes into the model as input. To build a predictive Node Failure Model, we need to create various calculated fields/metrics derived from these data sources. All the available data is rolled up to the node-day level so that these metrics can be used directly for exploratory analysis and model training.

TABLE I. ATTRIBUTES OF DATA SETS USED IN THE PREDICTION.

Measurement/Technical Data: These variables capture technical parameters such as Signal-to-Noise Ratios (SNR), correctable and uncorrectable FEC (Forward Error Correction) counts, signal alarm counts, count of flux devices, etc.
Customer Data: These variables capture customer information such as lifestyle segments of customers, product profiles of customers, billing amounts of customers associated with the node, Video on Demand orders by the customers, etc.
Call Interaction Data: These variables capture the volume of, and increase in, repair/technical calls made by customers for cable boxes, phones, modems, etc.
Field Technician Data: These variables capture the count of field technician orders opened, completed, etc. The higher the number of field technician orders opened, the higher the chance of an issue at the node.
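The node-day roll-up described above can be sketched in plain Python. This is an illustrative sketch only: the record layout and names such as `usnr` and `node_id` are assumptions, not taken from the paper, and the flag variable mirrors the healthy-band rule used for Var 4 later in the text.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw measurement records: (node_id, day, usnr, signal_alarms).
records = [
    ("N1", "2017-05-01", 30.2, 0),
    ("N1", "2017-05-01", 27.5, 2),
    ("N2", "2017-05-01", 36.1, 1),
]

# Group all measurements belonging to the same node on the same day.
by_node_day = defaultdict(list)
for node, day, usnr, alarms in records:
    by_node_day[(node, day)].append((usnr, alarms))

# Roll up to one row per node-day: min/max/avg of USNR, a flag variable
# (1 if the max USNR lies in the healthy 28-35 band, else 0), alarm total.
rollup = {}
for key, vals in by_node_day.items():
    usnrs = [v[0] for v in vals]
    rollup[key] = {
        "usnr_min": min(usnrs),
        "usnr_max": max(usnrs),
        "usnr_avg": mean(usnrs),
        "usnr_flag": 1 if 28 <= max(usnrs) <= 35 else 0,
        "alarm_count": sum(v[1] for v in vals),
    }

print(rollup[("N1", "2017-05-01")]["usnr_flag"])  # 1 (max 30.2 is in 28-35)
```

In a production setting the same aggregation would be done over the full measurement, customer, call, and field technician feeds before model training.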

Table II below shows a sample of the variables created from the consolidated datasets.

TABLE II. SAMPLE OF THE VARIABLES CREATED FROM THE CONSOLIDATED DATASETS

Var 1: Installations done on the previous day.
Var 2: Total number of IVR calls about set-top boxes/modems/phones.
Var 3: (MAX) Signal alarm cable counts: the number of set-top boxes that show signal alarms in Unified.
Var 4: USNR: (MAX) average upstream signal-to-noise ratio for the node as polled from the CMTS port. This variable equals 1 if the USNR is between 28 and 35, else 0.
Var 5: Open field technician orders (72 hours): the number of field operations orders opened on the node in the last 72 hours. This represents the pending trouble calls.
Var 6: Environmental Temperature (ET). The node health tool updates it at 15-minute intervals.
Var 7: Number of connected CPEs (Customer Premises Equipment).

3) Variable Creation and Variable Selection
As discussed above, we use the outage flag to define node failure. One additional cleaning step is to remove consecutive outage days. In real scenarios, an outage may persist for several consecutive days; for modeling purposes, however, we try to predict only the first day of an outage. Therefore, in the modeling data we keep only the first day of every outage sequence on each node and delete the subsequent consecutive outage days. In this step, new variables are created to be used in the model. Since the model uses only information from the prior 24 hours, all variables are created using only the 24 hours before the actual day of the event. Data is created at the node-day level.

The different types of variables created are:
1. Min/max/average values of different technical variables rolled up at the node-day level.
2. Percent/proportion variables based on the total accounts attached to a node. For example, the percentage of set-top boxes with a signal alarm can be created as the ratio of the count of set-top boxes with a signal alarm to the total accounts.
3. Flag variables, defined from exploratory analysis of each variable, for those technical variables which show a step relationship with outage. For example, for USNR (upstream signal-to-noise ratio), the outage rate is high if this variable is either below 28 or above 35; therefore, we create a flag variable which takes the value 1 if the USNR is between 28 and 35, and 0 otherwise.
4. Variables such as the number of calls in the last X days (where X = 2, 3, 5, 7, 15, etc.), the number of calls per customer in the last X days, etc.

As noted above, data pre-processing is a critical part of the prediction process. It is well known that the data pre-processing or data preparation stage can take a considerable amount of processing and analysis time and effort [10], and we have a massive amount of data, which affects the mining process. It was therefore mandatory to apply some data reduction techniques. Once the different types of variables that can be used in the model have been created, the next step is to select the best variables for modeling. One of the essential techniques used for that purpose is feature selection (FS) [11], [12]. "A feature can be considered irrelevant if it is independent of the class labels" [13]. As per Hall [14]: "Feature selection FS is the process of identifying and removing as much irrelevant and redundant information as possible." The benefits of FS are detailed in the literature [15], [16]. Feature selection methods can be classified into two main categories: filter methods [17] and wrapper methods [16]. In the experiment, the Rapid Miner operator "Weight by Chi Squared Statistic" was used. This operator estimates the relevance of the attributes by calculating the chi-squared statistic for each attribute of the input data set with respect to the class attribute. The higher the weight of an attribute, the more relevant it is considered.

The chi-square statistic [18] is a nonparametric test used in statistics to determine whether an observed frequency distribution differs from the theoretically expected frequencies:

χ² = Σ (O − E)² / E

where χ² is the chi-square statistic, O is the observed frequency, and E is the expected frequency. The features were ranked in descending order according to their score, the ranking weights were normalized between 0 and 1, and, based on that, the top 7 variables were chosen for modeling.
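The chi-squared weighting idea behind the Rapid Miner operator can be sketched in plain Python for a binary attribute against the binary outage label. This is a minimal illustration of the statistic, not the operator's actual implementation, and the toy data is invented for the example.

```python
from collections import Counter

def chi_squared_weight(feature, label):
    """Chi-squared statistic of a discrete feature vs. a discrete class label.

    Compares the observed cell counts of the contingency table with the
    counts expected under independence: chi2 = sum((O - E)^2 / E).
    """
    n = len(feature)
    cells = Counter(zip(feature, label))   # observed frequencies O
    f_tot = Counter(feature)               # feature marginal totals
    l_tot = Counter(label)                 # label marginal totals
    chi2 = 0.0
    for f in f_tot:
        for l in l_tot:
            expected = f_tot[f] * l_tot[l] / n   # E under independence
            observed = cells.get((f, l), 0)
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# Toy data: a flag perfectly aligned with the outage gets a high weight,
# while an unrelated flag gets a weight near zero.
outage    = [1, 1, 1, 0, 0, 0]
usnr_flag = [0, 0, 0, 1, 1, 1]   # healthy-band flag: 0 whenever there is an outage
noise     = [1, 0, 1, 0, 1, 0]
print(chi_squared_weight(usnr_flag, outage) > chi_squared_weight(noise, outage))  # True
```

Ranking features by this weight (after normalizing the weights to [0, 1]) reproduces the selection step described above.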

B. Modeling
Predictive modeling is a set of machine learning techniques which search for patterns in data sets and use those patterns to create predictions for new situations. We then train the models, changing some parameters in the experiments while keeping the remaining classifiers' parameters set to the default values of Rapid Miner Studio 7. In this step, we try three different machine learning algorithms to train the model to predict node failure. Different learning algorithms make different assumptions about the data and have different rates of convergence. The one which works best, i.e., minimizes some cost function of interest (failure rate in our case), will be the one that makes assumptions consistent with the data and has sufficiently converged to its error rate. Table III shows a quick comparison of the three techniques.

Fig. 2. Stage of generating the prediction model using Rapid Miner and the Prediction Stage

TABLE III. A QUICK COMPARISON OF THE THREE USED MACHINE LEARNING TECHNIQUES.

Decision Trees
  Advantages: simplicity of results; no implicit assumption of a linear relationship.
  Disadvantages: no model scores, so it is difficult to fix cut-off scores; if the tree becomes very large, it becomes difficult to present.
Logistic Regression
  Advantages: gives probability scores as an output; specifically designed for predicting binary outcomes.
  Disadvantages: assumes a log-linear relationship; co-linear variables cannot be used.
Ensemble Model
  Advantages: ability to model extremely complex non-linear relationships which could go unnoticed with other techniques.
  Disadvantages: complexity of the results, which are difficult to interpret from a business point of view.

1- Train with Decision Tree [19]. A decision tree is a decision support tool that displays a tree-like graph of possibilities and their possible consequences, including chance event outcomes and resource costs. We have defined node failure as a chance outcome, and we use a decision tree to identify the variables which help us find nodes that are more likely to fail than others. The decision tree in Fig. 3 can be used to identify certain combinations of the key variables which lead to a higher node failure propensity. Tree nodes color-coded in red have a higher failure rate than the overall population; hence, the nodes which qualify under the criteria of these red-colored tree nodes should be prioritized for maintenance. The lift calculations for maintenance prioritized by the decision tree are given in the Validation section.

Fig. 3. Decision Tree diagram

2- Train with Ensemble Model [20]. This approach uses a vote classification technique based on the majority of the predictions of the inner learners. The Rapid Miner Vote operator contains a sub-process, which must hold at least two learners, called base learners (we used KNN and a deep learning ANN as base learners). This methodology has the benefit of not being based on a priori assumptions and of allowing the detection of links between factors that conventional techniques such as logistic regression may not be able to detect. This algorithm can model much more complex relationships between the target and the predictors.

3- Train with Logistic Regression [21]. The logistic regression technique is used for the prediction of a binary variable. We have defined node failure as a binary event (the node has failed / has not failed); therefore, we can apply logistic regression to predict whether a node will fail in the next 24 hours based on the set of predictor variables that we have created. Logistic regression measures the relationship between the outage and the predictor variables by using probability scores as the predicted values of the outage; that is, the model provides the probability that a node will fail in the next 24 hours. These probability scores can be used to prioritize the node maintenance schedule. Fig. 2 represents the second stage, which is responsible for applying the generated model to the validation data: along with the new entries, the generated model is provided as stage input and applied through the "Apply Model" operator in Rapid Miner.
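The probability-scoring step used to prioritize maintenance can be sketched as follows. The coefficients, bias, and feature names here are purely hypothetical illustrations; in the paper the model is fitted inside Rapid Miner.

```python
import math

def failure_probability(x, weights, bias):
    """Logistic model: P(node fails in next 24h) = sigmoid(w . x + b)."""
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for three illustrative predictors:
# open_orders_72h, usnr_flag (1 = healthy band), calls_last_7d.
weights = [1.8, -2.0, 0.04]
bias = -2.5

nodes = {
    "N1": [3, 0, 10],   # many open orders, USNR outside the healthy band
    "N2": [0, 1, 1],    # quiet node
}
scores = {n: failure_probability(x, weights, bias) for n, x in nodes.items()}

# Maintenance visits are prioritized by descending failure probability.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])  # N1
```

The same ranking-by-score idea underlies the lift charts discussed in the Validation section: nodes in the top score bins are visited first.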
C. Validation
To ensure that the predictive model we have built is as accurate as possible and actually fits the "true" relationship between the variables, rather than just the characteristics of the data sample, it must be validated through out-of-sample testing. Therefore, to validate the accuracy and robustness of the model, we keep a hold-out sample (the validation sample) of 20% of the records, which is not used for model training. Once the model is trained on the training sample, its performance is checked on the validation sample to ensure that the model performs as well there as on the training sample. The whole data is therefore divided into two parts: Training (80%) and Validation (20%).

Model training is done on the 80% sample, and the model is then validated on the 20% sample. This data partitioning is done using a simple random sampling procedure. To validate the model, we compare the model's performance on training and validation. If the performance does not deteriorate on the validation sample compared to the training sample, we can say that the model is stable. The performance of the model can be measured in terms of the area under the ROC curve.

1- Lift calculation for Decision Tree
% of population in red-colored tree nodes: 16.6%
% of total node failures covered: 31%
Lift: 88% [= (31% − 16.6%) / 16.6%]

2- Lift calculations for Ensemble Model
The whole population can be divided into 10 bins (10 percentiles each), as shown in Fig. 4, based on the model scores; the population in the top few bins can then be prioritized for maintenance, as these nodes are more likely to fail than others.
% of total population selected = 20%
% of total outages captured without using the model = 20%
% of total outages captured based on rank ordering by the model = 38%
Lift = (38% − 20%) / 20% = 90%
A lift of 90% means that the cost of the site visits needed to find and service this 38% of the failing nodes can be reduced to almost half of what it would cost without the help of the model.

Fig. 4. The Lift chart for Ensemble Model
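The lift figures above follow a simple formula and can be checked directly (a sketch; the rates are taken from the text):

```python
def lift(capture_rate, selected_rate):
    """Lift of a model over random selection.

    capture_rate: share of all failures found in the selected population;
    selected_rate: share of the population selected, which equals the
    expected capture rate without a model.
    """
    return (capture_rate - selected_rate) / selected_rate

# Ensemble/logistic figures from the text: the top 20% of nodes by model
# score capture 38% of the outages, versus 20% expected at random.
print(round(lift(0.38, 0.20) * 100))  # 90
```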
3- Lift calculations for Logistic Regression
The whole population can be divided into 20 bins (5 percentiles each), as shown in Fig. 5, based on the model scores; the population in the top few bins can then be prioritized for maintenance, as these nodes are more likely to fail than others.
% of total population selected = 20%
% of total outages captured without using the model = 20%
% of total outages captured based on rank ordering by the model = 38%
Lift = (38% − 20%) / 20% = 90%
A lift of 90% means that the cost of the site visits needed to find and service this 38% of the failing nodes can be reduced to almost half of what it would cost without the help of the model.

Fig. 5. The Lift chart for Logistic Regression

Table IV shows the comparison of the performance of the three machine learning algorithms used to train the model to predict node failure.

TABLE IV. THE COMPARISON OF THE PERFORMANCE OF THE THREE MACHINE LEARNING ALGORITHMS USED TO TRAIN THE MODEL TO PREDICT NODE FAILURE

Lift %: Decision Tree 88%; Ensemble Model 90%; Logistic Regression 90%

Selection of the best model depends primarily on factors such as the stability and robustness of the model, the simplicity of the model output, and the lift (business benefit) provided by the model. Based on these metrics, the logistic model proves better than the other two techniques for the data sample used to develop these models. Logistic regression shows a higher lift than the decision tree and offers the flexibility to select as much of the population as is operationally feasible, whereas the decision tree breaks the population into fixed segments and makes it difficult to fix the size of the population that the operations team may want to target. Compared to the ensemble model, both give the same lift; however, given the complexity of the output from ensemble models, the logistic model is preferred.

D. Controlling
In this phase, we define and implement controls to sustain the improvement, set up continuous mining of the data, and plan for the monitoring and maintenance of the model. As discussed in the Solution Overview section, the proposed process with the predictive model will help the maintenance team prioritize the technicians' visits to the nodes that are more likely to fail in the near future, hence reducing the cost of visits to the nodes as well as the time lag between a node failing and being fixed, leading to an improvement in customer experience. The figures below illustrate the existing reactive maintenance process (Fig. 6) versus the proposed predictive maintenance process (Fig. 7), which is based on the predictive model described in this paper. The model we have developed on the available data sample can capture 35%-40% of the failing nodes by monitoring just 20% of the total nodes, giving a lift of 90%-300%. This means that one will be able to visit 20%-45% of the failing nodes in almost half the number of visits which would

be required under the traditional process, reducing the cost of visits by almost 50%-75%.

Fig. 6. Existing reactive maintenance process

Fig. 7. Proposed predictive maintenance process

V. CONCLUSION

In this paper, we proposed a methodology for building a Node Failure Prediction Model that helps predict node failures in advance, so that precautionary measures can be taken before a node fails in an optical-based network. The model was able to capture 35%-40% of the failing nodes by monitoring just 20% of the total nodes, giving a lift of 90%-300%. This means that one will be able to visit 20%-45% of the failing nodes in almost half the number of visits required under the traditional process, reducing the cost of visits by almost 50%-75%, which will lead to an improvement in customer experience and more operational efficiency by avoiding data loss in the optical network.

ACKNOWLEDGMENT
The authors are grateful for the support provided by the management of the studied telecom operator.

REFERENCES
[1] M. Gupta, "Improving software maintenance using process mining and predictive analytics," IEEE International Conference on Software Maintenance and Evolution (ICSME), 2017.
[2] S. Ramamurthy, L. Sahasrabuddhe, and B. Mukherjee, "Survivable WDM mesh networks," J. Lightwave Technol. 21(4), 2003.
[3] X. Shao, Y. Bai, X. Cheng, Y.-K. Yeo, L. Zhou, and L. H. Ngoh, "Best effort SRLG failure protection for optical WDM networks," J. Opt. Commun. Netw. 3(9), 739-749, 2011.
[4] A. Shahid, C. Machuca, L. Wosinska, and J. Chen, "Comparative analysis of protection schemes for fixed mobile converged access networks based on hybrid PON," Conference of Telecommunication, Media and Internet Techno-Economics (CTTE), 2015.
[5] https://rapidminer.com [Aug. 19, 2017].
[6] L. Yan, R. H. Wolniewicz, and R. Dodier, "Predicting customer behavior in telecommunications," IEEE Intelligent Systems, 2004.
[7] S. García, J. Luengo, and F. Herrera, Data Preprocessing in Data Mining. Springer Publishing Company, Incorporated, 2015.
[8] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. Morgan Kaufmann Publishers Inc., 2012.
[9] M. J. Zaki and W. Meira, Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press, New York, NY, USA, 2014.
[10] V. N. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995.
[11] G. Chandrashekar and F. Sahin, "A survey on feature selection methods," Computers and Electrical Engineering, vol. 40, pp. 16-28, 2014.
[12] S. García, J. Luengo, and F. Herrera, "Tutorial on practical tips of the most influential data preprocessing algorithms in data mining," Knowledge-Based Systems, 2015.
[13] M. H. Law, M. A. T. Figueiredo, and A. K. Jain, "Simultaneous feature selection and clustering using mixture models," IEEE Trans. Pattern Anal. Mach. Intell., 2004.
[14] M. A. Hall, "Correlation-based feature selection for machine learning," Ph.D. thesis, Department of Computer Science, Waikato University, 1999.
[15] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[16] Y. Dhote, S. Agrawal, and A. Jayant Deen, "A survey on feature selection techniques for internet traffic classification," IEEE International Conference on Computational Intelligence and Communication Networks (CICN), 2015.
[17] C. Lazar, J. Taminau, S. Meganck, D. Steenhoff, A. Coletta, C. Molter, et al., "A survey on filter techniques for feature selection in gene expression microarray analysis," IEEE/ACM Trans. Comput. Biol. Bioinform., 2012.
[18] J. I. Serrano, J. P. Romero, M. Castillo, E. Rocon, E. D. Louis, and J. Benito-León, "A data mining approach using cortical thickness for diagnosis and characterization of essential tremor," Scientific Reports, 2017.
[19] M. Zhao, "The decision tree evaluation analysis for the relationship of the marketing and enterprise strategy," International Conference on Robots & Intelligent System (ICRIS), 2017.
[20] M. Fang, "A novel multiple classifiers integration algorithm with pruning function," Fifth International Conference on Fuzzy Systems and Knowledge Discovery, 2008.
[21] T. Masamha, E. Mnkandla, and A. Jaison, "Logistic regression analysis of information communication technology projects' critical success factors: A focus on computer networking projects," IEEE AFRICON, 2017.
