Data Mining 1-3


DATA MINING TECHNIQUES OF TELECOMMUNICATION COMPANIES IN NIGERIA

(CASE STUDY OF MTN NIGERIA)

BY

RAMLAT IBRAHIM

IMT/18D/2200

A PROJECT PROPOSAL SUBMITTED TO THE DEPARTMENT OF INFORMATION TECHNOLOGY,
MODIBBO ADAMA UNIVERSITY, YOLA, IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR
THE AWARD OF THE BACHELOR OF TECHNOLOGY (B.TECH) DEGREE IN INFORMATION TECHNOLOGY

MAY, 2023
APPROVAL PAGE
This project report entitled “Data Mining Techniques of Telecommunication Companies in Nigeria
(Case Study of MTNN)” meets the regulations governing the award of the B.Tech degree of the
Modibbo Adama University, Yola and is approved for its contribution to knowledge and literary
presentation.

-------------------------------------- ----------------------

Dr. Iliyasu Adamu Date

(Project Supervisor)

-------------------------------------- ----------------------

Name: Date

(External Examiner)

-------------------------------------- ----------------------

Dr. M. B. Ribadu Date

Table of Contents

Chapter One
1.1 Background of the Study
1.2 Statement of the Problem
1.3 Aim and Objectives of the Study
1.4 Research Questions
1.5 Significance of the Study
1.7 Scope of the Study
1.8 Definition of Terms

Chapter Two
2.1 Introduction
2.2 Review of Related Literature
2.3 Types of Telecommunication Data
2.4 Data Mining Applications
2.5 Empirical Review
2.6 Summary

Chapter Three
3.0 Introduction
3.1 Research Design
3.2 Study Area
3.3 Population of the Study
3.4 Sample and Sampling Techniques
3.5 Data Collection Method
3.6 Data Analysis

CHAPTER ONE

INTRODUCTION

1.1 Background of the Study


The telecommunications industry generates and stores an enormous amount of data (Han et al.,
2015). These data include call detail data, which describe the calls that traverse the
telecommunication networks; network data, which describe the state of the hardware and software
components within the network; and customer data, which describe the telecommunication
customers (Han et al., 2014). The quantity of data is so great that manual analysis is difficult,
if not impossible. The need to handle such large volumes of data led to the development of
knowledge-based expert systems. These automated systems performed important functions such as
identifying fraudulent phone calls and identifying network faults. The problem with this approach
is that it is time-consuming to obtain knowledge from human experts (the “knowledge acquisition
bottleneck”) and, in many cases, the experts do not have the requisite knowledge. The advent of
data mining technology promised solutions to these problems and, for this reason, the
telecommunications industry was an early adopter of data mining technology (Roset et al., 2014).
Telecommunication data pose several interesting issues for data mining. The first concerns
scale, since telecommunication databases may contain billions of records and are amongst the largest
in the world. A second issue is that the raw data is often not suitable for data mining. For example,
both call detail and network data are time-series data that represent individual events. Before this
data can be effectively mined, useful “summary” features must be identified and then the data must
be summarized using these features. A third issue is rarity: many data mining applications in the
telecommunications industry involve predicting very rare events, such as the failure of a network
element or an instance of telephone fraud. The fourth and final issue concerns real-time
performance, because many data mining applications, such as fraud detection, require that any
learned model or rules be applied in real time (Ezawa and Norton, 2013). Several techniques have
been applied in tackling these issues in telecommunication companies.
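The rarity issue mentioned above can be made concrete with a little arithmetic. The sketch below assumes a hypothetical fraud rate of 1 call in 1,000 (an illustrative figure, not MTN data) and evaluates a trivial "model" that labels every call as legitimate. It suggests why plain accuracy is a poor yardstick when the event of interest is rare.

```python
# Illustrative only: the 0.1% fraud rate is an assumed figure, not measured data.
total_calls = 100_000
fraud_calls = 100  # assumed: 1 fraudulent call per 1,000 calls

# A trivial "model" that labels every call as legitimate.
true_negatives = total_calls - fraud_calls
true_positives = 0

accuracy = (true_positives + true_negatives) / total_calls
recall = true_positives / fraud_calls  # share of fraudulent calls actually caught

print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
```

Under these assumptions the accuracy is very high even though no fraudulent call is detected, so measures focused on the rare class (such as recall) are usually more informative.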
Telecommunication networks are extremely complex configurations of equipment, comprised
of thousands of interconnected components. Each network element can generate error and status
messages, which leads to a tremendous amount of network data. This data must be stored and
analyzed to support network management functions, such as fault isolation (Nora, 2017). This data
will minimally include a timestamp, a string that uniquely identifies the hardware or software
component generating the message and a code that explains why the message is being generated. For
example, such a message might indicate that “controller 7 experienced a loss of power for 30
seconds starting at 10:03 pm on Monday, May 12.”
Galambos (2014) viewed the actual data mining task as the automatic or semi-automatic
analysis of large quantities of data to extract previously unknown interesting patterns such as groups
of data records (cluster analysis), unusual records (anomaly detection) and dependencies (association
rule mining). This usually involves using database techniques such as spatial indices. Owing to the
huge volume of network messages generated, skilled workers cannot handle every incoming and
outgoing message. For this reason, expert systems have been developed to analyze the messages
automatically and take the necessary action, involving skilled workers only when a problem cannot
be resolved automatically (Eze et al., 2016).
These patterns can then be seen as a kind of summary of the input data, and may be used in
further analysis or, for example, in machine learning and predictive analytics. For example, the data
mining step might identify multiple groups in the data, which can then be used to obtain more
accurate prediction results by a decision support system. Data collection, data preparation, and
result interpretation and reporting are not part of the data mining step, but they do belong to the
overall KDD (Knowledge Discovery in Databases) process as additional steps (Nicholson, 2014).

Since its launch in August 2001, MTN has steadily deployed its services across Nigeria. At its
annual general meeting on Monday, 7th June 2021, the company disclosed that it had achieved 89.9%
nationwide coverage in Nigeria. It now provides services in 223 cities and towns, more than 10,000
villages and communities and a growing number of highways across the country, spanning the 36
states of Nigeria and the Federal Capital Territory, Abuja. Many of these villages and communities
are being connected to the world of telecommunications for the first time ever.

1.2 Statement of the Problem
Fraud is a serious problem for telecommunication companies, leading to billions of dollars in lost
revenue each year. Fraud can be divided into two categories: subscription fraud and superimposition
fraud. Subscription fraud occurs when a customer opens an account with the intention of never
paying the account charges. Superimposition fraud involves a legitimate account with some
legitimate activity, but also includes some “superimposed” illegitimate activity by a person other
than the account holder (Kolajo and Adeyemo, 2012).

It is not feasible for people to analyze great amounts of data without the assistance of
appropriate computational tools. Therefore, the development of tools of an automatic and intelligent
nature becomes essential for analyzing, interpreting, and correlating data in order to develop and
select strategies in the context of each application. To serve this new context, the area of Knowledge
Discovery in Databases (KDD) came into existence, attracting great interest within the scientific,
industrial, and commercial communities. The popular expression "Data Mining" actually refers to
one of the stages of Knowledge Discovery in Databases. The term "KDD" was formally recognized
in 1989 in reference to the broad concept of procuring knowledge from databases.
Superimposition fraud poses a bigger problem for the telecommunications industry, and for
this reason data mining techniques are used for identifying this type of fraud (Bharati, 2017). These
applications should ideally operate in real time using the call detail records and, once fraud is
detected or suspected, should trigger some action. This action may be to immediately block the call
and/or deactivate the account, or may involve opening an investigation, which will result in a call to
the customer to verify the legitimacy of the account activity. It is against this background that the
current study seeks to examine the various data mining techniques of Mobile Telecommunication
Network in Nigeria (MTNN).
This research work therefore addresses the data mining techniques being used by MTN
Nigeria. This will facilitate better performance of telecommunication companies in data security
and mining.

1.3 Aim and Objectives of the Study


The general aim of this study is to analyze the data mining techniques of telecommunication
companies in Nigeria, using MTN Nigeria as a case study, in order to fully understand the concepts
and methods of data mining. Specifically, the objectives of this study are:

i. To provide an overview of data mining.
ii. To examine the various data mining techniques of MTN Nigeria.
iii. To identify the challenges of data mining faced by MTN Nigeria.
iv. To recommend ways of improving the data mining techniques being used in MTN Nigeria.

1.4 Research Questions


i. What is data mining?
ii. What are the various data mining techniques of telecommunication companies in Nigeria?
iii. What are the challenges of data mining faced by telecommunication companies in Nigeria?
iv. What are the ways to improve on the data mining techniques being used by MTN Nigeria?

1.5 Significance of the Study


The significance of this study is as follows:

i. The outcome of this study will educate readers on the data mining techniques of
telecommunication companies in Nigeria, the data mining applications and how they can be
used in fraud detection.
ii. This research will contribute to the body of literature on data mining techniques in
telecommunication companies, thereby constituting empirical literature for future research
in the subject area.

1.7 Scope of the Study


Data mining is the application of descriptive and predictive analytics to support the marketing, sales
and service functions. Although data mining can be performed on operational databases, it is more
commonly applied to the more stable datasets held in data marts or warehouses. This study will
cover the various data mining techniques used by telecommunication companies in Nigeria, bearing in
mind that there are many network providers in Nigeria and each of them employs one technique or
another that suits the nature of the data it handles. To fully achieve the aims of this study,
MTN Nigeria will be used as the case study.

1.8 Definition of terms


Data is defined as facts or figures, or information that's stored in or used by a computer.

Data mining is a process of extracting and discovering patterns in large data sets involving methods
at the intersection of machine learning, statistics, and database systems.

A network service provider (NSP) is a business or organization that sells bandwidth or network
access by providing direct Internet backbone access to internet service providers, and usually
access to its network access points (NAPs).

Telecommunications refers to sending and receiving messages using an electrical device. It
encompasses transmitting voice, video, data, internet and other communications.

Application software is software designed to perform specific tasks for the user, as distinct from
system software, which runs the computer itself.

Networking is the exchange of information and ideas among people with a common profession or
special interest, usually in an informal social setting. Networking often begins with a single point of
common ground.

Internet fraud is the use of Internet services or software with Internet access to defraud victims or
to otherwise take advantage of them.

CHAPTER TWO

LITERATURE REVIEW

2.1 Introduction
The telecommunications companies in Nigeria generate and store a massive amount of data. These
data include call detail data, which describe the calls that extend across the telecommunication
networks; network data, which describe the state of the hardware and software components in the
network; and customer data, which describe the telecommunication customers (Weiss et al., 2017).

Data mining, an interdisciplinary subfield of computer science, is the computational process of
discovering patterns in large data sets involving methods at the intersection of artificial
intelligence, machine learning, statistics, and database systems. The overall goal of the data mining
process is to extract information from a data set and transform it into an understandable structure
for further use. Aside from the raw analysis step, it involves database and data management aspects,
data pre-processing, model and inference considerations, interestingness metrics, complexity
considerations, post-processing of discovered structures, visualization, and online updating (Eze et
al., 2014).

The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to
extract previously unknown interesting patterns such as groups of data records (cluster analysis),
unusual records (anomaly detection) and dependencies (association rule mining). This usually
involves using database techniques such as spatial indices. These patterns can then be seen as a kind
of summary of the input data, and may be used in further analysis or, for example, in machine
learning and predictive analytics. For example, the data mining step might identify multiple groups
in the data, which can then be used to obtain more accurate prediction results by a decision support
system. Data collection, data preparation, and result interpretation and reporting are not part
of the data mining step, but they do belong to the overall KDD process as additional steps (Eze et
al., 2016).

A data mining algorithm is a set of heuristics and calculations that creates a data mining model from
data. To create a model, the algorithm first analyzes the data you provide, looking for specific types
of patterns or trends. The algorithm uses the results of this analysis to define the optimal parameters
for creating the mining model. These parameters are then applied across the entire data set to extract
actionable patterns and detailed statistics. The mining model that an algorithm creates from your data
can take various forms, including:

• A set of clusters that describe how the cases in a dataset are related.
• A decision tree that predicts an outcome and describes how different criteria affect that
outcome.
• A mathematical model that forecasts sales.
• A set of rules that describe how products are grouped together in a transaction, and the
probabilities that products are purchased together (Bharati, 2017).
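The last form of model, rules with purchase probabilities, can be illustrated with a minimal sketch. The transactions and product names below are hypothetical, and the code simply counts item pairs to obtain the support and confidence of each rule; it is not the algorithm of any particular mining tool.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: the product names are illustrative only.
transactions = [
    {"airtime", "data_bundle"},
    {"airtime", "data_bundle", "caller_tune"},
    {"airtime"},
    {"data_bundle", "caller_tune"},
]


def pair_rules(transactions):
    """Return {(a, b): (support, confidence)} for every rule a -> b."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # share of transactions containing both items
        rules[(a, b)] = (support, count / item_counts[a])  # P(b | a)
        rules[(b, a)] = (support, count / item_counts[b])  # P(a | b)
    return rules


rules = pair_rules(transactions)
support, confidence = rules[("airtime", "data_bundle")]
print(f"airtime -> data_bundle: support={support:.2f}, confidence={confidence:.2f}")
```

Here support is the fraction of all transactions that contain both products, and confidence is the fraction of transactions containing the first product that also contain the second.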

2.2 Review of Related Literature


Data mining, also known as knowledge discovery in databases, can be defined as the process
of analyzing large information repositories and of discovering implicit, but potentially useful
information (Han, Kamber, & Pei, 2011). Data mining has the capability to uncover hidden
relationships and to reveal unknown patterns and trends by digging into large amounts of data
(Sumathi & Sivanandam, 2006). The functions, or models, of data mining can be categorized
according to the task performed: association, classification, clustering, and regression (Hui & Jha,
2000; Kao, Chang, & Lin, 2003; Nicholson, 2006b). Data mining analysis is normally based on three
techniques: classical statistics, artificial intelligence, and machine learning (Girija & Srivatsa, 2006).
Classical statistics is mainly used for studying data and data relationships, as well as for
dealing with numeric data in large databases (Hand, 1998). Examples of classical statistics include
regression analysis, cluster analysis, and discriminant analysis. Artificial intelligence (AI) applies
“human-thought-like” processing to statistical problems (Girija & Srivatsa, 2006). AI uses several
techniques such as genetic algorithms, fuzzy logic, and neural computing. Finally, machine learning
is the combination of advanced statistical methods and AI heuristics, used for data analysis and
knowledge discovery (Kononenko & Kukar, 2007). Machine learning uses several classes of
techniques: neural networks, symbolic learning, genetic algorithms, and swarm optimization. Data
mining benefits from these technologies, but differs in the objective pursued: extracting patterns,
describing trends, and predicting behavior. A typical data mining process, as shown in Figure 2.1, is
an iterative sequence of steps that normally starts by integrating raw data from different data sources
and formats. These raw data are cleansed in order to remove noise and duplicated and inconsistent
data (Han et al., 2011). The cleansed data are then transformed into appropriate formats that can
be understood by other data mining tools, and filtration and aggregation techniques are applied to the
data in order to extract summarized data. Interesting knowledge is then extracted from the
transformed data. This information is analyzed in order to identify the truly interesting patterns.
Eventually, the knowledge is visualized for the user. More detailed information regarding the data
mining process can be found in Han et al. (2011).

Figure 2.1: Data mining process (Han et al., 2011)
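The early steps of the process described above (integration, cleansing, and aggregation) can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout is hypothetical, a missing value is treated as noise, and two records with identical fields are treated as duplicates, which a real system would need to decide with care.

```python
# Hypothetical raw records from two sources; None marks a missing value.
source_a = [{"customer": "C001", "minutes": 12.0}, {"customer": "C002", "minutes": None}]
source_b = [{"customer": "C001", "minutes": 12.0}, {"customer": "C001", "minutes": 8.0}]


def integrate(*sources):
    """Combine records from several sources into one list."""
    return [row for source in sources for row in source]


def cleanse(rows):
    """Drop records with missing values and exact duplicates (assumed rules)."""
    seen, clean = set(), []
    for row in rows:
        key = (row["customer"], row["minutes"])
        if row["minutes"] is None or key in seen:
            continue
        seen.add(key)
        clean.append(row)
    return clean


def aggregate(rows):
    """Summarize the cleansed records as total minutes per customer."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["minutes"]
    return totals


summary = aggregate(cleanse(integrate(source_a, source_b)))
print(summary)
```

The pattern extraction, evaluation, and visualization steps would then operate on the summarized output rather than on the raw records.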

Data mining techniques are applied in a wide range of domains where large amounts of data
are available for the identification of unknown or hidden information. In this sense, Girija and
Srivatsa (2006) indicate that data mining techniques used in WWW are called web mining, used in
text are called text mining, and used in libraries are called bibliomining. The term bibliomining, or
data mining for libraries, was first used by Nicholson and Stanton (2003) to describe the combination
of data warehousing, data mining and bibliometrics. This term is used to track patterns, behavior
changes, and trends of library systems transactions. Although the concept is not new, the term
bibliomining was created to facilitate the search of the terms “library” and “data mining” in the
context of libraries rather than in software libraries. Bibliomining is an important tool to discover
useful library information in historical data to support decision-making (Kao et al., 2003). However,
to provide a complete report of the library system, bibliomining needs to be applied iteratively
in combination with other measurement and evaluation methods; as strategic information is
discovered, more questions may be raised, and the process starts again (Nicholson, 2003b).
Bibliomining, as any knowledge extraction method, needs to follow a systematic procedure in
order to allow an appropriate knowledge discovery. The bibliomining process starts by determining
areas of focus and collecting data from internal and external sources (Nicholson, 2003b). Then, these
data are collected, cleansed, and anonymized into a data warehouse. To discover meaningful patterns
in the collected data, the bibliomining process includes the selection of appropriate analysis tools and
techniques from statistics, data mining, and bibliometrics (Nicholson, 2006a). Interesting patterns are
analyzed and visualized through reports. The mining process is iterated until the resulting
information is verified and proved by key users such as librarians and library managers (Shieh,
2010). The application of bibliomining tools is an emerging trend that can be used to understand
patterns of behavior among library users and staff, and patterns of information resource use
throughout the library (Nicholson & Stanton, 2006). Bibliomining is highly recommended for
providing useful and necessary information for library management, focusing on professional
librarianship issues, but it is highly dependent on database technology (Shieh, 2010). Bibliomining
can also be
used to provide a comprehensive overview of the library workflow in order to monitor staff
performance, determine areas of deficiency, and predict future user requirements (Prakash, Chand, &
Gohel, 2004).
The resulting information gives the possibility to perform scenario analysis of the library
system, where different situations that need to be taken into account during a decision-making
process are evaluated (Nicholson, 2006a). An additional application is to standardize structures and
reports in order to share data warehouses among groups of libraries, allowing libraries to benchmark
their information (Nicholson, 2006a). Therefore, in order to improve the interaction quality between
a library and its users, the application of data mining tools in libraries is worth pursuing (Chang &
Chen, 2006). The aim of this study is to investigate how far academic libraries are pragmatically
using data mining tools, and in which library aspects librarians are implementing them. To this end,
content and statistical analyses are used to examine articles that include case studies of academic
libraries implementing data mining tools.

2.3 Types of Telecommunication Data


2.3.1 Call Detail Data

Every time a call is placed on a telecommunications network, descriptive information about the
call is saved as a call detail record. The number of call detail records that are generated and stored is
huge. For example, AT&T long distance customers alone generate over 300 million call detail
records per day (Pregibon, 2013). Given that several months of call detail data is typically kept
online, this means that tens of billions of call detail records will need to be stored at any time. Call
detail records include sufficient information to describe the important characteristics of each call. At
a minimum, each call detail record will include the originating and terminating phone numbers, the
date and time of the call and the duration of the call. Call detail records are generated in real time
and therefore will be available almost immediately for data mining (Cortes, 2014). This can be
contrasted with billing data, which is typically made available only once per month. Call detail
records are not used directly for data mining, since the goal of data mining applications is to extract
knowledge at the customer level, not at the level of individual phone calls. Thus, the call detail
records associated with a customer must be summarized into a single record that describes the
customer’s calling behavior. The choice of summary variables (i.e., features) is critical in order to
obtain a useful description of the customer. Below is a list of features that one might use when
generating a summary description of a customer based on the calls they originate and receive over
some time period P:

i. average call duration
ii. % no-answer calls
iii. % calls to/from a different area code
iv. % of weekday calls (Monday – Friday)
v. % of daytime calls (9am – 5pm)
vi. average # calls received per day
vii. average # calls originated per day
viii. # unique area codes called during P

These eight features can be used to build a customer profile. Such a profile has many potential
applications. For example, it could be used to distinguish between business and residential customers
based on the percentage of weekday and daytime calls. Most of the eight features listed above were
generated in a straightforward manner from the underlying data, but some features, such as the
eighth feature, required a little more thought and creativity (Cortes, 2016). Because most people call
only a few area codes over a reasonably short period of time (e.g., a month), this feature can help
identify telemarketers, or telemarketing behavior, since telemarketers will call many different area
codes. The above example demonstrates that generating useful features, including summary features,
is a critical step within the data mining process. Should poor features be generated, data mining will
not be successful. Although the construction of these features may be guided by common sense and
expert knowledge, it should include exploratory data analysis (Kukar, 2007). For example, the use of
the time period 9am-5pm in the fifth feature is based on the commonsense knowledge that the typical
workday is 9 to 5 (and hence this feature may be useful in distinguishing between business and
residential calling patterns).
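The summarization of call detail records into the eight features listed above can be sketched as follows. The record layout (`CallRecord`), the sample numbers, and the rule that the first four digits of a number act as its "area code" are assumptions made for illustration; they are not taken from any operator's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CallRecord:
    caller: str
    callee: str
    start: datetime
    duration_s: int
    answered: bool


def area_code(number: str) -> str:
    # Assumption: the first four digits stand in for the area code.
    return number[:4]


def summarize(customer: str, records: list[CallRecord], days_in_period: int) -> dict:
    """Collapse one customer's call records into a single feature record."""
    calls = [r for r in records if customer in (r.caller, r.callee)]
    if not calls:
        raise ValueError("no calls for this customer in the period")
    originated = [r for r in calls if r.caller == customer]
    received = [r for r in calls if r.callee == customer]
    n = len(calls)
    home = area_code(customer)

    def pct(count: int) -> float:
        return 100.0 * count / n

    def other_party(r: CallRecord) -> str:
        return r.callee if r.caller == customer else r.caller

    return {
        "avg_duration_s": sum(r.duration_s for r in calls) / n,
        "pct_no_answer": pct(sum(not r.answered for r in calls)),
        "pct_other_area": pct(sum(area_code(other_party(r)) != home for r in calls)),
        "pct_weekday": pct(sum(r.start.weekday() < 5 for r in calls)),  # Mon-Fri
        "pct_daytime": pct(sum(9 <= r.start.hour < 17 for r in calls)),  # 9am-5pm
        "avg_received_per_day": len(received) / days_in_period,
        "avg_originated_per_day": len(originated) / days_in_period,
        "unique_area_codes_called": len({area_code(r.callee) for r in originated}),
    }


# Hypothetical records for one customer over a 7-day period P.
records = [
    CallRecord("08030000001", "08050000002", datetime(2023, 5, 1, 10, 0), 120, True),
    CallRecord("08030000003", "08030000001", datetime(2023, 5, 6, 20, 0), 60, True),
    CallRecord("08030000001", "08070000004", datetime(2023, 5, 2, 14, 30), 0, False),
    CallRecord("08030000001", "08050000005", datetime(2023, 5, 3, 18, 0), 60, True),
]
profile = summarize("08030000001", records, days_in_period=7)
print(profile)
```

Each customer ends up with one record of this kind, which is the level at which the mining techniques discussed in this chapter operate.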

2.3.2 Network Data

Telecommunication networks are extremely complex configurations of equipment, comprised
of thousands of interconnected components. Each network element is capable of generating error and
status messages, which leads to a tremendous amount of network data. This data must be stored and
analyzed in order to support network management functions, such as fault isolation (Han et al.,
2014).
This data will minimally include a timestamp, a string that uniquely identifies the hardware
or software component generating the message and a code that explains why the message is being
generated. For example, such a message might indicate that “controller 7 experienced a loss of power
for 30 seconds starting at 10:03 pm on Monday, May 12.” Due to the enormous number of network
messages generated, technicians cannot possibly handle every message. For this reason, expert
systems have been developed to automatically analyze these messages and take appropriate action,
only involving a technician when a problem cannot be automatically resolved (Weiss, 2018). As was
the case with the call detail data, network data is also generated in real time as a data stream and
must often be summarized in order to be useful for data mining. This is sometimes accomplished by
applying a time window to the data. For example, such a summary might indicate that a hardware
component experienced twelve instances of a power fluctuation in a 10-minute period.
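The time-window summary described above can be sketched as follows. The message layout mirrors the three minimal fields named in the text (timestamp, component identifier, code); the field names, the sample messages, and the use of fixed, non-overlapping 10-minute windows are assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime

WINDOW_MINUTES = 10  # assumed fixed, non-overlapping windows


@dataclass
class NetworkMessage:
    timestamp: datetime
    component_id: str
    code: str


def window_start(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 10-minute window."""
    return ts.replace(minute=ts.minute - ts.minute % WINDOW_MINUTES, second=0, microsecond=0)


def summarize(messages):
    """Count messages per (component, code, window start)."""
    return Counter(
        (m.component_id, m.code, window_start(m.timestamp)) for m in messages
    )


# Hypothetical message stream from one component.
messages = [
    NetworkMessage(datetime(2025, 5, 12, 22, 3), "controller-7", "POWER_FLUCTUATION"),
    NetworkMessage(datetime(2025, 5, 12, 22, 5), "controller-7", "POWER_FLUCTUATION"),
    NetworkMessage(datetime(2025, 5, 12, 22, 9), "controller-7", "POWER_FLUCTUATION"),
    NetworkMessage(datetime(2025, 5, 12, 22, 12), "controller-7", "POWER_FLUCTUATION"),
]
counts = summarize(messages)
for (component, code, start), count in sorted(counts.items()):
    print(f"{component} {code} window starting {start:%H:%M}: {count}")
```

A sliding window would be an alternative design; fixed windows are used here only because they are the simplest to show.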

2.3.3 Customer Data

Telecommunication companies, like other large businesses, may have millions of customers.
By necessity this means maintaining a database of information on these customers. This information
will include name and address information and may include other information such as service plan
and contract information, credit score, family income and payment history (Chang, 2003). This
information may be supplemented with data from external sources, such as credit reporting
agencies. The customer data maintained by telecommunication companies does not substantially
differ from that maintained in most other industries (Pei, 2011). However, customer data is often
used in conjunction with other data in order to improve results. For example, customer data is
typically used to supplement call detail data when trying to identify phone fraud.

2.4 Data Mining Applications


Two main factors determine where data mining and business intelligence (BI) applications can be
used: the existence of a problem that these technologies can address and solve, and the
availability of data for implementing them. The main reason for
the significance of data mining and BI applications in the telecommunications
industry is the availability of a tremendously large volume of data (Han et al., 2014).

2.4.1 Marketing and customer relationship management (CRM)

Telecommunication companies maintain a tremendous volume of data about their customers
and their call details. This information can be used to profile the customers and these profiles can be
used for marketing and forecasting purposes. The emphasis of marketing applications in the
telecommunication industry has moved from identifying new customers to measuring customer value
and then taking steps to retain the profitable customers. This shift has happened because it is more
expensive to acquire new customers than to retain existing ones (Hand, 2014). Numerous data
mining methods can be used to model customer lifetime value (the total net income a company
can expect from a customer over time) for telecommunication customers. The key
element of modeling the lifetime value of a telecommunication customer is to estimate how long
he or she will remain with the current network. This helps the company to predict when a customer is
likely to leave and to take proactive steps to retain the customer. One of the serious issues that the
telecommunications industry faces is customer churn (David, 1998).

The process of a customer leaving a company is referred to as churn, and churn analysis can
be done through numerous systems and methods.

Network Fault Isolation & Prediction

Telecommunication networks are comprised of highly complex configurations of hardware and
software. Since the industry requires optimum network efficiency and reliability, most of the
network elements are capable of self-diagnosis and of generating status and alarm messages.
Expert systems were developed to handle these alarms. Network fault isolation in the telecommunication
industry is quite a tedious task for two reasons (Han et al., 2014): the volume of data is huge, and
a single fault can generate many different, seemingly unrelated alarms. Hence alarm correlation has an important
role in predicting network faults. A proactive, rapid response is essential for maintaining
the reliability of the network. Data mining techniques such as classification, neural networks and
sequence analysis can be used for identifying network faults. The Telecommunication Alarm
Sequence Analyzer (TASA) is a data mining tool that supports fault identification by searching for
recurrent patterns of alarms. This information can be used to generate a rule-based alarm
correlation system, which can be used for identifying faults in real time. Genetic algorithms are another
method for predicting telecommunication switch failures. Timeweaver is a genetic-algorithm-based system that
is capable of operating directly on the raw network-level time-series data. It
identifies patterns that successfully predict the target event. Bayesian belief networks can also be
used to identify network faults. Standard classification tools can be used to generate rules to
predict future failures, but this approach has several drawbacks. The most important one is that some
information is lost in the process of reformulating the time-series data (Jha, 2014).

2.4.2 Marketing/Customer Profiling

Telecommunication companies maintain a great deal of data about their customers. In
addition to the general customer data that most businesses collect, telecommunication companies
also store call detail records, which precisely describe the calling behavior of each customer. This
information can be used to profile the customers and these profiles can then be used for marketing
and/or forecasting purposes. We begin with one of the most well-known and successful marketing
campaigns in the telecommunications industry: MCI’s Friends and Family promotion. This
promotion was initially launched in the United States in 1991 and, although now retired, was
responsible for significant growth in MCI’s customer base. The promotion offered reduced calling
fees when calls were placed to others in one’s calling circle. This promotion purportedly originated
when market researchers noticed small subgraphs in the call graph of network activity—which
suggested the possibility of adding entire calling circles rather than the costly approach of adding
individual subscribers (Han, Altman, Kumar, Mannila & Pregibon, 2002). It is worth noting that
MCI relied primarily on its customers to bring in members of their calling circle, even though MCI
could have utilized its call detail data to generate a list of the people in each calling circle.

The most likely reason for this is that MCI did not want to anger its customers by using
highly personal information (calling history). This demonstrates that privacy concerns are an issue
for data mining in the telecommunications industry, especially when call detail data is involved. The
MCI Friends and Family promotion relied on data mining to identify associations within data.
Another marketing application that relies on this technique is a data mining application for finding
the set of non-U.S. countries most often called together by U.S. telecommunication customers
(Cortes & Pregibon, 2001). One set of countries identified by this data mining application is Jamaica,
Antigua, Grenada and Dominica.
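
As a rough illustration of how sets of destinations that are "called together" might be found, the sketch below counts how many customers called each pair of countries and keeps the pairs that reach a minimum support. This simple pair counting is a stand-in chosen for brevity, not the method of the cited study; the function name `frequent_country_pairs`, the customer identifiers and the call data are invented for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_country_pairs(calls_by_customer, min_support):
    """Return country pairs called together by at least `min_support` customers."""
    pair_counts = Counter()
    for countries in calls_by_customer.values():
        # Each customer contributes at most once to every pair of countries called.
        for pair in combinations(sorted(set(countries)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Hypothetical data: destination countries called by each customer.
calls_by_customer = {
    "cust_a": ["Jamaica", "Antigua", "Grenada"],
    "cust_b": ["Jamaica", "Antigua", "Dominica"],
    "cust_c": ["Jamaica", "Antigua"],
    "cust_d": ["France", "Jamaica"],
}
pairs = frequent_country_pairs(calls_by_customer, min_support=3)
print(pairs)
```

Larger sets of countries could be built from the frequent pairs in the same way, which is the basic idea behind frequent-itemset algorithms.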

This information is useful for establishing and marketing international calling plans. A
serious issue with telecommunication companies is customer churn. Customer churn involves a
customer leaving one telecommunication company for another. Customer churn is a significant
problem because of the associated loss of revenue and the high cost of attracting new customers.
Some of the worst cases of customer churn occurred several years ago when competing long distance
companies offered special incentives, typically $50 or $100, for signing up with their company, a
practice which led to customers repeatedly switching carriers in order to earn the incentives.
Data mining techniques now allow companies to mine historical data in order to predict
when a customer is likely to leave. These techniques typically utilize billing data, call detail data,

subscription information (calling plan, features, contract expiration data) and customer information
(e.g., age).

Based on the induced model, the company can then take action, if desired. For example, a
wireless company might offer a customer a free phone for extending their contract. One such effort
utilized a neural network to estimate the probability h(t) of cancellation at a given time t in the future
(Datta, 2014). In the telecommunications industry, it is often useful to profile customers based on
their patterns of phone usage, which can be extracted from the call detail data. These customer
profiles can then be used for marketing purposes, or to better understand the customer, which in turn
may lead to better forecasting models. In order to effectively mine the call detail data, it must be
summarized to the customer level as described earlier in this chapter. Then, a classifier induction
program can be applied to a set of labeled training examples in order to build a classifier. This
approach has been used to identify fax lines (Orji, 2014) and to classify a phone line as belonging to
a business or residence (Cortes, 2018). Other applications have used this approach to identify phone
lines belonging to telemarketers and to classify a phone line as being used for voice, data, or fax.
Two sample rules for classifying a customer as being a business or residential customer are shown
below (using pseudo-code). These rules were generated using SAS Enterprise Miner, a sophisticated
data mining package that supports multiple data mining techniques. The rules shown below were
generated using a decision tree learner. However, a neural network was also used to predict the
probability of a customer being a business or residential customer, based solely on the distribution of
calls by time of day (i.e., the neural network had 24 inputs, one per hour of the day). The probability
estimate generated by the neural network was then used as an input (i.e. feature) to the decision tree
learner. Evaluation on a separate test set indicates that rule 1 is 88% accurate and rule 2 is 70%
accurate.

Rule 1: if < 43% of calls last 0-10 seconds and < 13.5% of calls occur during the weekend and the neural
network says that P(business) > 0.58 based on the time-of-day call distribution, then business customer.

Rule 2: if calls are received over a two-month period from at most 3 unique area codes and < 56.6% of
calls last 0-10 seconds, then residential customer.
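
The two rules can be written as executable code, as in the sketch below. The thresholds are taken from the rules as stated; the field names, the `LineSummary` container, the order in which the rules are tried and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LineSummary:
    """Customer-level summary features (field names are hypothetical)."""
    pct_calls_0_10s: float      # % of calls lasting 0-10 seconds
    pct_weekend_calls: float    # % of calls made during the weekend
    p_business_nn: float        # neural network estimate of P(business)
    unique_area_codes_2mo: int  # unique area codes of calls received in two months


def classify(line: LineSummary) -> Optional[str]:
    """Apply Rule 1, then Rule 2; return None if neither rule fires."""
    if (line.pct_calls_0_10s < 43.0
            and line.pct_weekend_calls < 13.5
            and line.p_business_nn > 0.58):
        return "business"
    if line.unique_area_codes_2mo <= 3 and line.pct_calls_0_10s < 56.6:
        return "residential"
    return None


# Hypothetical customers.
office = classify(LineSummary(30.0, 5.0, 0.90, 40))
home = classify(LineSummary(50.0, 30.0, 0.20, 2))
unknown = classify(LineSummary(70.0, 30.0, 0.20, 10))
print(office, home, unknown)
```

A customer matched by neither rule is left unclassified here; a full decision tree would cover every case.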
It is worth noting that because a telecommunications company generates a call detail record if the
calling (paying) party is its customer, the company will also have a sample of (received) calls for
non-customers. If a company has high overall market penetration, this sample may be large enough
for data mining. Thus, telecommunication companies have the technical ability to profile non-
customers as well as customers.

2.4.3 Fraud Detection

Fraud is the crime of using dishonest methods to take something valuable from another
person. It is a very serious issue for the telecommunication industry, since it leads to the loss
of billions of dollars in revenue. According to Gosset and Hyland (2014), telecommunication
fraud can be defined as any activity by which telecommunication service is obtained without
the intention of paying. Telecommunication fraud can be classified into two categories, namely:

i. Subscription fraud
ii. Superimposition fraud

Subscription fraud occurs when a customer opens an account with the intention of never
paying. Superimposition fraud, which telecommunication companies consider the more significant
problem, occurs when a perpetrator gains illegal access to the account of a legitimate
customer. Both subscription fraud and superimposition fraud should be detected immediately and
the affected customer account deactivated. Cellular cloning was a very serious issue in the 1990s. It
was eliminated by the introduction of authentication methods. Deviation detection and anomaly detection are the
most common techniques used for detecting superimposition fraud. The combined use of customer
signatures, dynamic clustering and pattern recognition is another approach that has recently been
applied in this area.

Absolute analysis and differential analysis are considered the two main sub-categories of
approaches for fraud detection. According to Saurkar et al. (2014), the techniques most often used for
fraud detection in telecommunication include statistical modeling, Bayesian rules, visualization
methods, clustering, rule discovery, neural networks and Markov models, as well as combinations of more
than one method. Customer data can also be used for detecting fraud. For example, price plan and
credit rating information can be incorporated into the fraud analysis. Another common method for
fraud detection is to create a profile of a customer’s calling behavior and compare recent activity against this
profile. The profile can be generated by summarizing the call detail records for a particular
customer. Fraud can be identified immediately after it happens only if the call detail records are
updated in real time. Fraud detection systems work at the customer level, not at the individual call
level. Fraud detection involves predicting a relatively rare event, so the class
distribution involved is highly skewed.
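
The idea of comparing recent activity against a customer's profile can be sketched as a simple deviation score. This is a deliberately minimal illustration and not a production fraud detector: the profile here is just the mean and standard deviation of daily call minutes, and the function names, the threshold of 3.0 and the figures are assumptions.

```python
from statistics import mean, stdev


def build_profile(daily_minutes):
    """Summarize a customer's history as a mean and a sample standard deviation."""
    return {"mean": mean(daily_minutes), "stdev": stdev(daily_minutes)}


def deviation_score(profile, observed):
    """Number of standard deviations between observed activity and the profile."""
    if profile["stdev"] == 0:
        return 0.0 if observed == profile["mean"] else float("inf")
    return abs(observed - profile["mean"]) / profile["stdev"]


THRESHOLD = 3.0  # assumed cut-off for flagging an account for review

# Hypothetical history of daily call minutes for one customer.
history = [32, 28, 35, 30, 31, 29, 33]
profile = build_profile(history)

normal_day = deviation_score(profile, 34)
suspicious_day = deviation_score(profile, 240)
print(normal_day > THRESHOLD, suspicious_day > THRESHOLD)
```

Because the score is computed per customer rather than per call, the same amount of usage can be normal for one account and suspicious for another.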

2.4.4 Network Fault Isolation

Telecommunication networks are extremely complex configurations of hardware and software. Most
of the network elements are capable of at least limited self-diagnosis, and these elements may
collectively generate millions of status and alarm messages each month. In order to effectively
manage the network, alarms must be analyzed automatically in order to identify network faults in a
timely manner or before they occur and degrade network performance. A proactive response is
essential to maintaining the reliability of the network. Because of the volume of the data, and
because a single fault may cause many different, seemingly unrelated, alarms to be generated, the
task of network fault isolation is quite difficult. Data mining has a role to play in generating rules for
identifying faults. The Telecommunication Alarm Sequence Analyzer (TASA) is one tool that helps
with the knowledge acquisition task for alarm correlation (Klemettinen, Mannila & Toivonen, 2011).
This tool automatically discovers recurrent patterns of alarms within the network data along with
their statistical properties, using a specialized data mining algorithm. Network specialists then use
this information to construct a rule-based alarm correlation system, which can then be used in real-
time to identify faults.

TASA is capable of finding episodic rules that depend on temporal relationships between the
alarms. For example, it may discover the following rule: “if alarms of type link alarm and link failure
occur within 5 seconds, then an alarm of type high fault rate occurs within 60 seconds with
probability 0.7.”

Before standard classification tools can be applied to the problem of network fault
isolation, the underlying time-series data must be represented as a set of classified examples. This
summarization, or aggregation, process typically involves using a fixed time window and
characterizing the behavior over this window. For example, if n unique alarms are possible, one
could describe the behavior of a device over this time window using a vector of length n. In this case
each field in the vector would contain a count of the number of times a specific alarm occurs. One
may then label the constructed example based on whether a fault occurs within some other time
frame, for example, within the following 5 minutes. Thus, two time windows are required. Once this
encoding is complete, standard classification tools can be used to generate “rules” to predict future
failures. Such an encoding scheme was used to identify chronic circuit problems (Sasisekharan,
Seshadri & Weiss, 1996). The problem of reformulating time-series network events so that
conventional classification-based data mining tools can be used to identify network faults has been
studied. Weiss & Hirsh (1998) view this task as an event prediction problem, while Fawcett &
Provost (1999) view it as an activity monitoring problem.

Transforming the time-series data so that
standard classification tools can be used has several drawbacks. The most significant one is that
some information will be lost in the reformulation process. For example, using the vector-based
representation just mentioned, all sequence information is lost. Timeweaver (Weiss & Hirsh, 1998)
is a genetic-algorithm-based data mining system that is capable of operating directly on the raw
network-level time-series data (as well as other time-series data), thereby making it unnecessary to
re-represent the network-level data. Given a sequence of timestamped events and a target event T,
Timeweaver will identify patterns that successfully predict T. Timeweaver essentially searches
through the space of possible patterns, which includes sequence and temporal relationships, to find
predictive patterns. The system is especially designed to perform well when the target event is rare,
which is critical since most network failures are rare. In the case studied, the target event is the
failure of components in the 4ESS switching system.
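
The encoding with two time windows can be sketched as below: alarms in an observation window are counted into a vector with one field per alarm type, and the example is labeled by whether a fault occurs in the prediction window that follows. The function name `encode_example`, the alarm names, the times and the window lengths are assumptions made for illustration.

```python
from collections import Counter


def encode_example(events, alarm_types, fault_times, t0, observe_s, predict_s):
    """Build one classified example from time-stamped alarms.

    events      -- list of (time_in_seconds, alarm_type) pairs
    alarm_types -- the n possible alarm types, fixing the vector layout
    fault_times -- times (in seconds) at which faults occurred
    """
    observe_end = t0 + observe_s
    # Feature vector: count of each alarm type inside the observation window.
    counts = Counter(a for t, a in events if t0 <= t < observe_end)
    features = [counts[a] for a in alarm_types]
    # Label: does any fault fall inside the prediction window that follows?
    label = any(observe_end <= f < observe_end + predict_s for f in fault_times)
    return features, label


# Hypothetical alarm stream for one device.
alarm_types = ["link_alarm", "link_failure", "high_fault_rate"]
events = [(5, "link_alarm"), (8, "link_failure"),
          (20, "link_alarm"), (70, "high_fault_rate")]
features, label = encode_example(events, alarm_types, fault_times=[400],
                                 t0=0, observe_s=300, predict_s=300)
print(features, label)
```

The vector records only how often each alarm occurred, not the order of the alarms, which is the loss of sequence information noted above.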

2.5 Empirical Review


Several studies have been conducted in the field of customer attrition and retention analysis in the banking
sector. Some studies reveal that the most important variables influencing customer choice are
effective and efficient customer service, speed and quality of service, the variety of services offered,
low e-service charges, online banking facilities, safety of funds, the availability of
technology-based services, low interest rates on loans, convenient branch location, the image of the bank,
good management and the overall bank environment [7-9]. Customers are the core of
a bank’s operations, so nurturing and retaining them is important for its success.
Many studies have examined customer retention and customer attrition. One study used lift
as a measure for attrition analysis and compared the lift of data mining models based on decision
trees, boosted naive Bayesian networks, selective Bayesian networks, neural networks and an ensemble
of these classifiers. Lift can
be calculated by looking at the cumulative targets captured up to p% as a percentage of all targets
and dividing by p%. A churn model with higher predictive performance in a newspaper
subscription context was constructed using support vector machines (Coussement & Van den Poel,
2008). The authors showed that support vector
machines show good generalization performance when applied to noisy marketing data. Their model
outperforms logistic regression only when the appropriate parameter-selection technique is applied,
and SVMs are surpassed by random forests.
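
The lift calculation described above can be sketched as follows. The function name `lift_at`, the scores and the labels are invented for the example, and rounding the top p% to a whole number of customers is an assumption.

```python
def lift_at(scores, labels, p):
    """Lift in the top fraction p (0 < p <= 1) of customers ranked by score.

    labels are 1 for targets (e.g. churners) and 0 otherwise.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(p * len(ranked)))
    captured = sum(label for _, label in ranked[:top_n])
    share_of_targets = captured / sum(labels)   # targets captured up to p
    share_of_customers = top_n / len(ranked)    # fraction of customers contacted
    return share_of_targets / share_of_customers


# Hypothetical model scores for ten customers, two of whom are churners.
scores = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

lift_top_20 = lift_at(scores, labels, 0.2)
lift_top_30 = lift_at(scores, labels, 0.3)
print(lift_top_20, lift_top_30)
```

A lift of 1 means the model does no better than contacting customers at random; the higher the lift in the top p%, the better the model concentrates the targets at the top of the ranking.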
Another study used Clementine software to analyze 300 customer records from the Iran Insurance
Company in the city of Anzali, Iran (Soeini & Rodpysh, 2012). The authors used demographic variables to determine the optimal
number of clusters in K-means clustering and evaluated binary classification methods (decision tree
QUEST, decision tree C5.0, decision tree CHAID, decision tree CART, Bayesian networks and neural
networks) to predict customer churn. Another study used decision trees and neural networks to develop a model to
predict churn. The models generated were evaluated using ROC curves and AUC values, and
cost-sensitive learning strategies were adopted to address imbalanced class labels and unequal
misclassification costs. He et al. (2014) discussed commercial bank customer churn prediction based on an SVM
model and used a random sampling method to improve the SVM model, considering the imbalanced
characteristics of customer data sets. A further study investigated determinants of customer churn in the
Korean mobile telecommunications service market based on customer transaction and billing data.
That study defines a change in a customer’s status from active use to non-use or suspended as partial
defection, and a change from active use to churn as total defection. Its results indicate that a customer’s status
change explains the relationship between churn determinants and the probability of churn. A neural
network (NN) based approach has also been proposed to predict customer churn in subscriptions to cellular wireless services.
The experimental results indicate that the neural network-based approach can predict customer churn
with an accuracy of more than 92%.
Ngai et al. (2009) reviewed an academic database of literature published between 2000 and 2006, covering 24 journals,
and proposed a classification scheme to classify the articles. Nine hundred articles were identified
and reviewed for their direct relevance to applying data mining techniques to CRM. They found that
customer retention received the most research attention, and that classification and
association models are the two most commonly used models for data mining in CRM. A critique of the
concept of data mining and customer relationship management in the organized banking and retail
industries has also been discussed. Most of these papers used existing customer data from a single
database, and some of them used only demographic data. In our system, by contrast, we used data from different
branches of a bank and merged these into a single database. We analyzed borrowers’
transactional data and focused on predicting prospective business sectors for loan disbursement in a retail
commercial bank.
2.6 Summary
This literature review discussed the most prevalent data mining techniques: machine learning
and cluster analysis. Machine-learning algorithms can realize different functions such as
classification, prediction and association. These functions, together with cluster analysis, can
outperform traditional methods in text mining and sentiment analysis, obtaining better accuracy
and tolerating larger data volumes. Data mining provides a variety of systems for
learning from vast datasets and an extensive range of methods for detecting useful knowledge, such as
patterns, trends and rules, in massive datasets. Different data mining methods have been used in
social network and telecommunication analysis, which is the focus of this review. The current
state of data mining analysis was discussed and reviewed from different
perspectives. Data mining techniques still face many challenges in this area,
which need to be resolved through continued improvement.

CHAPTER THREE
METHODOLOGY
3.1 Introduction
This chapter describes the methods that will be used in the research, as well as the population of the
study and the sampling techniques used in determining the sample size. How data will be
collected and analyzed is also discussed in this chapter.

The main objectives of this research will be achieved through quantitative methods, as inferential
statistics will be used to measure the level of accuracy and validate responses from the respondents in
accordance with the objectives of the research.

3.2 Research Design


The research design used for this study will be the descriptive research design. The researcher chose
this design because data characteristics will be described using frequencies and percentages, and no
manipulation of data or variables is necessary. The researcher discarded
alternatives such as the causal and explanatory research designs because accurate findings and data
analysis may not be achieved with them.

3.3 Study Area

Adamawa is a state in northeastern Nigeria, whose capital and largest city is Yola. In 1991,
when Taraba State was carved out from Gongola State, the geographical entity Gongola State was
renamed Adamawa State, with the following administrative divisions:
Adamawa, Michika, Ganye, Mubi and Numan. It is one of the thirty-six states that constitute the
Federal Republic of Nigeria. Adamawa is one of the largest states of Nigeria and occupies about
36,917 square kilometres. It is bordered by the states of Borno to the northwest, Gombe to the west
and Taraba to the southwest. Its eastern border forms the national eastern border with Cameroon.
Topographically, it is a mountainous land crossed by large river valleys: the Benue, Gongola and
Yedsarem. The valleys of Mount Cameroon, the Mandara Mountains and the Adamawa Plateau form
part of the landscape.

3.4 Population of the Study


The population for this study will be employees of MTN Nigeria in Yola Metropolis. The population
figure for the study is 632 respondents, comprising MTN staff from various departments such as
operations, finance, administration and marketing.

3.5 Sample and Sampling Techniques
A total sample size of 400 respondents will be randomly selected, using a margin of error of 5% and a
confidence level of 95%, from the total population of 632 MTN Nigeria workers in Yola
Metropolis. The sample size was determined from the population at a 5% error tolerance and a
95% degree of confidence, using Taro Yamane’s formula:

$$n = \frac{N}{1 + N e^{2}}$$

Where: n = sample size
N = population size (total number of MTN Nigeria employees in Yola Metropolis)
e = error tolerance (5%)
1 = theoretical constant
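
As a worked check, Taro Yamane’s formula can be evaluated for the figures given above (N = 632, e = 0.05). The function name is hypothetical and rounding up to the next whole respondent is an assumption. For these inputs the formula evaluates to roughly 245 respondents, which should be reconciled with the sample size of 400 stated above.

```python
import math


def yamane_sample_size(population: int, error: float) -> int:
    """Taro Yamane's formula n = N / (1 + N * e^2), rounded up (assumed)."""
    return math.ceil(population / (1 + population * error ** 2))


# N = 632 employees, e = 0.05: 632 / (1 + 632 * 0.0025) = 632 / 2.58
sample_size = yamane_sample_size(632, 0.05)
print(sample_size)
```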

3.6 Data Collection Method


Data for this study will be collected from the respondents through the use of questionnaires.
Questionnaires will be distributed to all 400 respondents, and their
responses will serve as the main source of primary data for this study.
Other information will be collected from textbooks, journals and other secondary sources of data.

3.7 Data Analysis


Various analytical tools and software, such as pie charts, bar charts, tables and the Statistical Package for the
Social Sciences (SPSS), will be used in analysing data for this study.

Data collected will be analyzed using frequencies and percentages. These frequencies and percentages
will enable the researcher to represent the data characteristics and findings clearly and
accurately. Interpretation and analysis of data will also be used to describe items in the tables and charts used
for this study.

REFERENCES

Aggarwal, C. (Ed.). (2017). Data Streams: Models and Algorithms; New York: Springer.

Aregbeyen (2011) The Determinants of Bank Selection Choices by Customers: Recent and
Extensive Evidence from Nigeria. International Journal of Business and Social Science. Vol.
2, No. 22, pp. 276-288.


Benlan He, Yong Shi, Qian Wan, Xi Zhao, (2014) Prediction of customer attrition of commercial banks
based on SVM model. 2nd International Conference on Information Technology and
Quantitative Management, ITQM 2014, Procedia Computer Science Vol. 31, pp. 423–430.

Cortes, C., Pregibon, D. (2001) Signature-based methods for data streams. Data Mining and
Knowledge Discovery; 5(3):167-182.

Cortes, C., Pregibon, D. (2018) Giga-mining. Proceedings of the Fourth International Conference on
Knowledge Discovery and Data Mining; 174-178, 1998 August 27-31; New York, NY:
AAAI Press.

Ezawa, K., Norton, S. (1995) Knowledge discovery in telecommunication services data using
Bayesian network models. Proceedings of the First International Conference on Knowledge
Discovery and Data Mining; 1995 August 20–21. Montreal, Canada. AAAI Press.

Ngai, E.W.T., Li Xiu, D.C.K. Chau, (2009) Application of data mining techniques in customer
relationship management: A literature review and classification. Expert Systems with
Applications. Vol. 36, pp. 2592–2602.

Fawcett, T., Provost, F. (1999) Activity monitoring: Noticing interesting changes in behavior.
Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining; 53-62. San Diego. ACM Press: New York, NY.

Gary Cokins, Ken King, "Managing Customer Profitability and Economic Value in the
Telecommunication Industry", SAS Institute White Paper.

Hangxia Ma, Min Qin, Jianxia Wang. (2009), "Analysis of the Business Customer Churn Based on
Decision Tree Method", The Ninth International Conference on Control and Automation,
Guangzhou, China.

Han, J., Altman, R. B., Kumar, V., Mannila, H., Pregibon, D. (2002) Emerging scientific
applications in data mining. Communications of the ACM; 45(8): 54-58.

Hafeez Ur Rehman and Saima Ahmed, (2008) An Empirical Analysis of the determinants of bank
selection in Pakistan; A customer view. Pakistan Economic and Social Review. Vol. 46, no.
2, pp. 147-160.

Kaplan, H., Strauss, M., Szegedy, M. (1999) Just the fax differentiating voice and fax phone lines
using call billing data. Proceedings of the Tenth Annual ACM-SIAM Symposium

Kazi Omar Siddiqi, (2011) Interrelations between Service Quality Attributes, Customer Satisfaction
and Customer Loyalty in the Retail Banking Sector in Bangladesh. International Journal of
Business and Management. Vol. 6, No. 3, pp. 12-36.

Kristof Coussement and Dirk Van den Poel (2008) Churn prediction in subscription services: An
application of support vector machines while comparing two parameter-selection
techniques. Expert Systems with Applications. Vol. 34, pp. 313–327.

Liebowitz, J. (1988). Expert System Applications to Telecommunications. New York, NY: John
Wiley.

Berry, M. and G. Linoff. (2000) Mastering Data Mining. John Wiley and Sons, New York, USA.

Fawcett, T., Provost, F. (1997) Adaptive fraud detection. Data Mining and
Knowledge Discovery; 1(3):291-316.

Mozer, M., Wolniewicz, R., Grimes, D., Johnson, E., & Kaushansky, H. (2000). Predicting subscriber
dissatisfaction and improving retention in the wireless telecommunication industry.

MO Zan, ZHOA Shan, LI Li, LIU Ai-Jun, (2007), "A predictive Model of Churn in
Telecommunications Base on Data Mining". IEEE International Conference on Control and
Automation, Guangzhou, China.

PAKDD 2006 Data Mining Competition, http://www3.ntu.edu.sg/SCE/pakdd2006/competition/overview.htm

Pareek, D.: Business Intelligence for Telecommunications. Auerbach Publications, Taylor & Francis
Group LLC.

Weiss, G. (2004). Mining with rarity: A unifying framework. SIGKDD Explorations.

Yossi Richter, Elad Yom-Tov, Noam Slonim, (2008) "Predicting Customer Churn in Mobile
Networks through Analysis of Social Groups". SIAM.

Reza Allahyari Soeini and Keyvan Vahidy Rodpysh, (2012) “Evaluations of Data Mining Methods in
Order to Provide the Optimum Method for Customer Churn Prediction: Case Study
Insurance Industry”, International Conference on Information and Computer Applications.
Vol. 24, pp. 290-297.

Madhavi, S., (2012) The Prediction of churn behaviour among Indian bank customer: An
application of Data Mining Techniques. International Journal of Marketing, Financial
Services & Management Research. Vol. 1, No. 2, pp. 11-19.

Oshini Goonetilleke T. L., and H. A. Caldera, (2013) Mining Life Insurance Data for Customer
Attrition Analysis. Journal of Industrial and Intelligent Information. Vol. 1, no. 1, pp. 52-58.

Xiaohua Hu, (2005) A Data Mining Approach for Retailing Bank Customer Attrition Analysis.
Applied Intelligence. Vol. 22, pp. 47–60.
