Machine Learning Techniques For SIM Box Fraud Detection


2019 International Conference on Communication Technologies (ComTech 2019)

Mhair Kashir
Department of Computing
Iqra University
Islamabad, Pakistan
muhammad.kashif@sco.gov.pk

Sajid Bashir
Department of Computer Engineering Technology
National University of Technology (NUTECH)
Islamabad, Pakistan
sajidbashir@nutech.edu.pk

Abstract—In today’s competitive environment, telecommunication operators and service providers need to generate revenue by designing and delivering innovative services to subscribers. At the same time, a prime consideration is to minimize cost and prevent revenue leakage. In this context, the industry faces numerous challenges and different types of fraud. There is a continuous effort to tackle this problem by improving the implementation methodology and the network protocols. However, in general, fraud detection is difficult and is currently addressed in a proactive manner. Fraudulent callers exploit the weaknesses of specific protocol-level solutions and avoid detection of the gray traffic. Our research work is inspired by classification algorithms used in machine learning and employed in different fields of science and engineering, e.g. image processing, speech recognition and spam email detection. We have applied these machine learning techniques (MLTs) to the classification of normal and fraudulent (SIM Box) subscribers. We have used the call detail records (CDRs) of normal and fraudulent subscribers as input to identify the important attributes, 25 for each customer. These attributes are used to classify normal and fraudulent subscribers using a Neural Network (NN) and a Support Vector Machine (SVM). A comparative performance analysis of both techniques is also presented using various evaluation parameters. SVM with the Polynomial, Radial and Sigmoid kernels shows the best performance, with an accuracy of 99.24%. The SVM Linear kernel shows the worst performance, with an accuracy of 95.18% and 0.19 regression. In the case of NN, the Bayesian Regularization and Resilient Back-Propagation algorithms show the best and worst performance, with accuracies of 99.87% and 99.53% respectively.

Index Terms—Gray Traffic, Interconnect Bypass, Machine Learning, Neural Network, SIM Box Fraud, Support Vector Machine, Telecommunication

978-1-5386-5106-3/19/$31.00 ©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

I. INTRODUCTION

The telecom industry is growing rapidly because of an increasing volume of communication among people, making the world a global village. The same holds in Pakistan, which was reported to be the world’s third fastest growing telecom industry in 2008. At the end of June 2013, the total number of mobile subscribers was approximately 128.93 million [1]. The extensive use of IP services on cellular networks is likely to inflate global mobile traffic by significant margins, approaching about 50 exabytes per month by 2021 [2]. In today’s increasingly competitive environment, all telecommunication operators and service providers need to protect and generate revenue by designing and delivering innovative services that attract subscribers. The primary objective of telecommunication operators is to maintain a healthy volume of subscribers by providing the best products and services. On the other hand, an increasing trend in the variety and intensity of telecommunication frauds has been observed, which leads to huge revenue losses.

With each passing year there is a large increase in telecom frauds [3]. A telecom fraud is defined as the use of any telecom service without paying the usage charges [4]. According to the survey of the Communications Fraud Control Association (CFCA) conducted in 2013, the estimated global loss to telecommunication fraud was about 46.3 billion US dollars, an estimated increase of 15% since 2011. A serious effort improved the statistics in this regard, and the figure observed in 2017 is 29.2 billion US dollars, an estimated 23% decrease over the last four years [5]. In November 2013, the Pakistan Telecom Authority (PTA) confiscated more than 700 illegal international call termination gateways in the country [6]. Thereafter, PTA implemented a SIM verification system in collaboration with the National Database and Registration Authority (NADRA) to control the illegal sale of SIMs and reduce SIM Box fraud [7].

International calling rates that are higher than those of domestic services attract fraudsters to terminate international calls on any local operator. Classified as SIM Box fraud, this involves utilizing illegal means to terminate the international operator’s traffic onto the intended receiver after re-initiating it as a domestic call. Since telecom operators must maintain the confidentiality of subscriber information, the data available for experimental research is limited, which leads to a small set of candidate solutions. Fraud management systems are generally specific to the fraud types and have little capability to detect emerging threats [8]. In addition, the huge repository maintained by the operators further complicates real-time decision making on user classification. Furthermore, SIM Box fraudulent subscribers change their usage behavior and also frequently change the SIM, so no historical data is available for analysis. SIM Box fraudulent subscribers pretend to be normal subscribers, and it is only after a detailed investigation that a classification is possible.
The research community has offered different approaches to detect telecom fraud. The authors in [9] applied a data mining technique to the detection of subscription fraud. In this approach the system maintained the usage profile of each subscription. This customer profile is matched against subscribers who have already committed subscription fraud. The authors of [10] proposed a rough fuzzy set based approach to detect fraud in 3G mobile telecommunication networks. They designed a rule-based system called Citi FMS to detect abnormalities and raise an alarm when an anomaly is detected. In [11], the authors used the statistical and probabilistic KL-divergence to find the dissimilarities between the characteristics of normal and fraudulent subscribers. The authors in [12] propose a cooperative workflow design for telecommunications fraud control and a network embedding based approach for fraud detection. Experimental data is used to demonstrate the effectiveness of the proposed method. A model is discussed in [13] that works on the behavioral sequences generated from consecutive behaviors in order to capture sequential patterns. This approach declares behaviours that deviate from the established pattern as fraudulent.

The objective of this research work is to study the performance of SIM Box fraud detection using machine learning techniques. We have applied machine learning techniques to the classification of normal and fraudulent (SIM Box) subscribers. CDRs are used to identify 25 attributes forming the feature set of each customer. These attributes are used as input to two well-known MLTs, i.e. Neural Network and Support Vector Machine, for classification as fraudulent and non-fraudulent subscribers. A comparative performance analysis of both techniques is also carried out using different evaluation parameters.

The remainder of the paper is organized as follows. Section II discusses the legitimate international call flow and the international call termination flow using the SIM Box. Section III discusses our research methodology. Section IV presents the methodology and results using the artificial NN, while classification using SVM and the respective results are described in Section V. Section VI summarizes the comparison of both techniques.

II. PROBLEM STATEMENT

In legitimate international call termination, telecom operators have an interconnect agreement with an international call carrier for termination of international calls between two countries. As shown in Fig. 1, a subscriber of country ‘A’ wants to call a subscriber of country ‘B’. When the call is made, the subscriber network in country ‘A’ routes the call through its core network to an international traffic carrier. The call is then handed over to the core network of the destination country, which is responsible for landing the call to the destination subscriber. In this legitimate scenario the core networks in the originating country and the destination country are involved, and accounting of such calls is a normal process. Each call made or received through this channel is accounted for not only by the two networks but also by the third party, i.e. the international carrier.

Fig. 1. Legitimate and Fraudulent Call Setup.

In the case of a fraudulent call, the fraudster has an agreement with the telecom network in country ‘A’ for termination of its international traffic for country ‘B’. In most cases, such arrangements are made to reduce the call termination charges; however, a loophole is used by the fraudster to avoid subscription charges. As shown with a dotted line in Fig. 1, a subscriber of country ‘A’ wants to call a subscriber of country ‘B’. When the call is made, the network of country ‘A’ hands over the call to the fraudulent party, which is responsible for landing the traffic at the destination. The fraudster takes the call over IP and has no agreement in the destination country for landing traffic. Landing is done by using a SIM box placed in the destination country. The SIM box holds multiple SIMs of operators working in the destination country. The call received is passed through the SIM box and landed directly on the designated subscriber as a local call. In doing so, the fraudster is not only tricking the network in the destination country by landing international traffic as local traffic but also depriving the network of its rightful share of revenue from international traffic.

III. PROPOSED METHODOLOGY

Our research work focuses on the performance evaluation of SIM Box fraud detection using machine learning techniques. The process consists of the following steps:

1) Collection of the normal and SIM Box fraud CDRs.
2) Pre-processing of the CDRs to extract the required input attributes for classification.
3) Division of the data into three different groups (training, testing and validation).
4) Performing the classification using the SVM and NN.
5) Comparison of the results for analysis.

In the data set, the normal and fraudulent subscribers are identified with a numerical flag:
• NN Flags

– Normal Subscriber: 1
– Fraudulent/SIM Box: 0
• SVM Flags
– Normal Subscriber: +1
– Fraudulent/SIM Box: -1

We use real data to perform the classification and compare the performance of the algorithms and their variants. A summary of the data used is given in Table I.

TABLE I
SUMMARY OF DATA USED FOR CLASSIFICATION USING NEURAL NETWORK AND SVM

S/No  Description                                     Quantity
1     Normal subscribers                              8,695
2     Call detail records of normal subscribers       4,333,822
3     Fraudulent numbers                              50
4     Call detail records of fraudulent subscribers   24,502
5     Duration of data collection (in days)           31

IV. ARTIFICIAL NEURAL NETWORK

A common architecture of an NN consists of neurons, connection links, a weight associated with each connection link and an activation function associated with each neuron, as shown in Fig. 2. Input data is processed at each neuron. Neurons, also called nodes, are of three types: input nodes, output nodes and hidden nodes. The input signal to a neuron undergoes scaling defined by the associated weight of the node. Optimal weights are calculated during the training phase of the NN; however, the process can continue during the testing and decision making phases.

Fig. 2. General structure of artificial neural network.

The input to the NN is a vector $x \in \mathbb{R}^{m+1}$ of $m$ attributes such that its first element is $x_1(n) = 1$, where $n$ denotes the time stamp. The prediction by the NN for the $k$th input vector is represented by $y_k \in \{+1, -1\}$ and the corresponding weight vector is $w_k \in \mathbb{R}^{m+1}$, where $w_{k1} = b_{k0}$ represents the bias term as shown in Fig. 2. The output of the summing function can be expressed as:

$$v_k = w_k^T x_k = \sum_{j=1}^{27} w_{kj} x_{kj} \tag{1}$$

The decision boundary is a hyperplane that can be written as:

$$w_{k1} x_{k1} + w_{k2} x_{k2} + \ldots + w_{km} x_{km} + b = 0 \tag{2}$$

The actual response is calculated using the signum function:

$$y_k = \operatorname{sgn}\left(w_k^T x_k\right) \tag{3}$$

where

$$\operatorname{sgn}(x) = \begin{cases} +1, & \text{if } x \ge 0 \\ -1, & \text{if } x < 0 \end{cases} \tag{4}$$

The weight adaptation can be written using the desired output $d$ as:

$$w(n+1) = w(n) + \eta \left[ d(n) - y(n) \right] x(n) \tag{5}$$

where $\eta \in (0, 1)$ represents the learning rate. In this experimental process, we have used 25 attributes of normal and fraudulent subscribers, which are provided as input to the network. The performance of the algorithms is evaluated using the following:

A. Confusion Matrix

In NN, a confusion matrix is used to visualize the performance and error of the algorithm. In supervised learning the performance of the NN is evaluated using a confusion matrix, and in unsupervised learning using a matching matrix. A confusion matrix is also referred to as a contingency table or an error matrix.

In Fig. 3, four confusion matrices are shown that describe the detection and error performance of the NN for the validation, training, test and complete data sets. The “All Confusion Matrix” plotted for the complete data set shows that 50 out of 50 SIM Box fraud subscribers (“0”) are classified correctly, a 100% rate, while for normal subscribers 8681 out of 8695 samples are identified correctly, with a 99.3% performance rate.

Fig. 3. Confusion matrix showing the performance of NN on the experimental data used for fraudulent detection.

B. Receiver Operating Characteristic

In NN, a receiver operating characteristic (ROC) is used to graphically plot the performance of the machine learning algorithm. The ROCs shown in Fig. 4 graphically plot the binary classification of the data. The graph is created by plotting the true positive rate against the false positive rate at various threshold settings.
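The ROC construction described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names `confusion_counts` and `roc_points`, the toy labels and scores, and the choice of label 1 (the NN flag for a normal subscriber) as the positive class are all assumptions made for the example.

```python
from typing import List, Tuple


def confusion_counts(labels: List[int], scores: List[float],
                     threshold: float) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for one threshold, with label 1 as positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_points(labels: List[int], scores: List[float]) -> List[Tuple[float, float]]:
    """Return one (FPR, TPR) point per distinct score used as threshold."""
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion_counts(labels, scores, t)
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        points.append((fpr, tpr))
    return points


# Hypothetical classifier outputs for six subscribers (1 = normal, 0 = SIM Box).
toy_labels = [1, 1, 1, 0, 1, 0]
toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.20]

for fpr, tpr in roc_points(toy_labels, toy_scores):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each threshold yields one confusion matrix, and the resulting (FPR, TPR) pairs trace the curve; a classifier whose curve stays close to the top-left corner separates the two classes well.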

Fig. 4. ROCs for the four data sets.

C. Performance Chart

In NN, the performance chart plots the mean squared error vs. epoch for the training, validation and test data sets. The mean squared error measures the difference between the observed values and the simulated values. A lower mean squared error indicates better performance. The performance is observed to be best for the validation data set, followed by the test and training data sets, as shown in Fig. 5.

Fig. 5. Performance curves for the NN on the experimental data used for fraudulent detection.

D. Variants of NN

We have used five variants of NN for classification of fraudulent and non-fraudulent subscribers. The summary of the performance observed through simulation results is given in Table II. The Bayesian Regularization algorithm shows the best performance in classification of the data set with an accuracy of 99.87%, whereas the Resilient Back-propagation algorithm shows the worst performance with an accuracy of 99.53%.

TABLE II
PERFORMANCE COMPARISON OF VARIANTS OF NEURAL NETWORK FOR CLASSIFICATION

S/No  Algorithm                    Accuracy (%)
1     Levenberg-Marquardt          99.84
2     Bayesian Regularization      99.87
3     BFGS Quasi-Newton            99.84
4     Resilient Back-propagation   99.53
5     Scaled Conjugate Gradient    99.76

V. SUPPORT VECTOR MACHINE

Like other supervised learning algorithms, SVM performs the (binary) classification of the input data on the basis of the parametric optimization during the training phase. The decision boundary that separates the two classes is given by a hyperplane in $\mathbb{R}^{m+1}$ as given by equation (2). The binary classification of an input vector $x_k$ is performed using the optimal weights as follows:

$$w^T x_k \begin{cases} \ge +1, & \text{Normal User} \\ \le -1, & \text{Fraudulent User} \end{cases} \tag{6}$$

If the data is not linearly separable, SVM maps the original data into a much higher dimensional space in which it becomes linearly separable. SVM introduces a kernel function $\kappa(x_i, x_j)$ for this purpose. However, selection of an appropriate kernel function needs care and can be made from the following choices:

Linear Kernel:
$$\kappa(x_i, x_j) = 1 + x_i^T x_j \tag{7}$$

Polynomial Kernel:
$$\kappa(x_i, x_j) = \left(1 + x_i^T x_j\right)^p \tag{8}$$

Radial Basis Kernel:
$$\kappa(x_i, x_j) = \exp\left\{ -\partial \left(1 + x_i^T x_j\right)^p \right\} \tag{9}$$

where $p$ is the degree of the polynomial and $\partial$ is the parameter of the radial basis function (RBF).

Figures 6 and 7 depict the performance of the SVM without and with a kernel, respectively. Using a kernel produces a more defined decision boundary, as shown in Fig. 7. The summary of the classification performance using different kernels in SVM is given in Table III. The Polynomial, Radial Basis and Sigmoid kernels show the best performance in classification of the data set, with 99.24% accuracy and 0.03 regression, while the Pre-Computed kernel has an accuracy of 95.78% with 0.17 regression and the Linear kernel shows the worst performance, with an accuracy of 95.18% and 0.19 regression.

Fig. 6. Classification of data using SVM without a kernel.

Fig. 7. Classification of data using SVM with linear kernel.

TABLE III
PERFORMANCE COMPARISON OF DIFFERENT SVM KERNELS

S/No  Kernel                Validation Accuracy (%)  Regression
1     Linear Kernel         95.18                    0.19
2     Polynomial Kernel     99.24                    0.03
3     Radial Basis Kernel   99.24                    0.03
4     Sigmoid Kernel        99.24                    0.03
5     Pre-Computed Kernel   95.78                    0.17

VI. CONCLUSION

The main objective of the research was to determine the best classification technique for SIM box fraud. Different variants of neural network and support vector machine are evaluated to draw a performance comparison. Simulation results show that the artificial NN with the Bayesian Regularization algorithm has the best accuracy as compared to SVM.
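As a supplement to the kernel discussion in Section V, the kernel functions can be written out directly. The sketch below is illustrative only and is not the code used for the reported results: `linear_kernel` and `polynomial_kernel` follow (7) and (8), while `rbf_kernel` uses the widely used Gaussian form exp(-gamma * ||xi - xj||^2), which is an assumption of this sketch and differs from (9) as printed. All function names and the parameters `p` and `gamma` are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def linear_kernel(xi: Vector, xj: Vector) -> float:
    """Kernel of equation (7): 1 + xi^T xj."""
    return 1.0 + dot(xi, xj)


def polynomial_kernel(xi: Vector, xj: Vector, p: int = 2) -> float:
    """Kernel of equation (8): (1 + xi^T xj)^p, with p the polynomial degree."""
    return (1.0 + dot(xi, xj)) ** p


def rbf_kernel(xi: Vector, xj: Vector, gamma: float = 0.5) -> float:
    """Common Gaussian RBF form (assumption; not equation (9) as printed)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * squared_distance)


def gram_matrix(samples: List[Vector],
                kernel: Callable[[Vector, Vector], float]) -> List[List[float]]:
    """Pairwise kernel values; the kind of input a pre-computed kernel SVM expects."""
    return [[kernel(a, b) for b in samples] for a in samples]


# Hypothetical two-attribute feature vectors, for illustration only.
toy_samples = [[1.0, 2.0], [3.0, 4.0], [0.0, -1.0]]
for row in gram_matrix(toy_samples, polynomial_kernel):
    print(row)
```

A kernel SVM only ever needs these pairwise values, which is why swapping the kernel function changes the decision boundary without changing the training procedure.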
REFERENCES
[1] Pakistan Telecommunication Authority, “Annual Report 2013”, http://www.pta.gov.pk/annual-reports/annreport2013 1.pdf.
[2] Huan Zhoud, Hui Wang, Xiuhua Li and Victor C. M. Leung, “A Survey on Mobile Data Offloading Technologies”, IEEE Access, 2018.
[3] Communication Fraud Control Association, “2013 Global Fraud Loss Survey”, http://www.cvidya.com/media/62059/global-fraud loss survey2013.pdf.
[4] Chunlai Zhou and Ziyan Lin, “Study on fraud detection of telecom industry based on rough set”, 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA.
[5] Communication Fraud Control Association, “2017 Global Fraud Loss Survey”, www.cfca.org/fraudlosssurvey/.
[6] The Nation, 20 Nov, 2013. http://www.nation.com.pk/business/20-Nov-2013/pta-confiscates-711-illegal-gateways
[7] Dawn News, 20 Dec, 2018. https://www.dawn.com/news/1452453.
[8] P. Burge, J. Shawe-Taylor, C. Cooke, Y. Moreau, B. Preneel, C. Stoermann, “Fraud Detection and management in Mobile telecommunication Network”, European Conference on Security and Detection, 28-30 April 1997, Conference Publication No. 437, © IEE, 1997.
[9] S. Wu, N. Kang, L. Yang, “Fraudulent Behavior Forecast in Telecom Industry Based on Data Mining Technology”, Communications of the IIMA, 2007.
[10] W. Xu, Y. Pang, J. Ma, S. Wang, G. Hao, S. Zeng, Y. Qain, “Fraud detection in telecommunication: a rough fuzzy set based approach”, International Conference of Machine Learning and Cybernetics, 1249-1253, 2008.
[11] D. Olszewski, “A probabilistic approach to fraud detection in telecommunications”, journal of Knowledge-Based Systems, Science Direct, 2011.
[12] Liu X. and Wang X., “A Network Embedding Based Approach for Telecommunications Fraud Detection”, Lecture Notes in Computer Science, vol 11151. Springer, 2018.
[13] J. Guo, G. Liu, Y. Zuo and J. Wu, “Learning Sequential Behavior Representations for Fraud Detection”, 2018 IEEE International Conference on Data Mining (ICDM), Singapore, 2018, pp. 127-136. doi: 10.1109/ICDM.2018.00028.
