Simbox Case Study Anonymised

Machine Learning Case Study
SIM Box/Bypass Fraud Detection
PROPRIETARY & CONFIDENTIAL - Do not adapt, duplicate or distribute without written consent of Neural Technologies Limited
Table of Contents
1 Machine Learning Case Study with SIM Box data
1.1 Model Training
1.2 Model Training Outputs
1.2.1 Cluster Description Table
1.2.2 Target Proportion Table
1.2.3 Fraud Clusters Analysis
1.2.4 Anomaly Clusters Analysis
1.2.5 Large Clusters Analysis
1.2.6 Classification Report
1.2.7 Anomaly Entities
1.2.8 Un-labelled Fraud Entities
1 Machine Learning Case Study with SIM Box data

Mobile operators worldwide suffer from interconnect bypass fraud using SIM boxes, which affects
interconnect revenues, call quality and customer experience. According to the CFCA Global Fraud Loss
Survey 2017, bypass fraud is one of the major fraud types affecting operators: in the fraud loss
estimations it is ranked 2nd, with an estimated loss of around $4.27 billion per year.
This Machine Learning Case Study describes how to apply and interpret the analysis output from the
Clustering ML model fed with SIM Box/Bypass fraud data.
The dataset used in this case study has 100,391 subscribers and 70 features that describe various
activities of each subscriber. Of these subscribers, 99,973 are non-fraud accounts and 418 are fraud
accounts (0.41%).
The features used in this case study contain summarised calling, billing and recharging activity for
each SIM. Bypass fraud is indicated by the ‘BYPASS’ column: ‘1’ is a confirmed bypass MSISDN, ‘0’ is
probably clean. Example features include:
FEATURES                   SOURCE            Description
countMoLocalCalls          Switch            Count of MO Local Calls (Call_Type = Voice MO)
countMoIntlCalls           Switch            Count of MO Calls (Call_Type = Voice MO and call_class = INTL)
countMoVASCalls            Switch            Count of MO Calls (Call_Type = Voice MO and call_class = VAS)
countMoFREECalls           Switch            Count of MO Calls (Call_Type = Voice MO and call_class = TOLL_FREE)
durMoLocalCalls            Switch            Duration of MO Local Calls (Call_Type = Voice MO)
durMoIntlCalls             Switch            Duration of MO INT Calls (Call_Type = Voice MO and call_class = INTL)
durMoVASCalls              Switch            Duration of MO VAS Calls (Call_Type = Voice MO and call_class = VAS)
…
countMtVASSMS              Switch            Count of MT VAS SMS (Call_Type = SMS MT and call_class = VAS)
countMtFREESMS             Switch            Count of MT FREE SMS (Call_Type = SMS MT and call_class = TOLL_FREE)
distinctLocalDestinations  Switch            Count of distinct destinations, restricted to Call_Type = Voice MO
…
getMedianTimeGap           Switch - derived  The median of the time gap between calls during a day (applicable to Call_Type = Voice MO)
G_countMoLocalCalls        Switch            Same as above but this will sum up all previous days in the OP + last day
G_countMoIntlCalls         Switch            Same as above but this will sum up all previous days in the OP + last day
…
G_valueChargedCalls        Billing System    Same as above but this will sum up all previous days in the OP + last day
G_totalChargedData         Billing System    Same as above but this will sum up all previous days in the OP + last day
G_totalUnchargedData       Billing System    Same as above but this will sum up all previous days in the OP + last day
…
G_countRecharges           Voucher Server    Same as above but this will sum up all previous days in the OP + last day
G_valueRecharges           Voucher Server    Same as above but this will sum up all previous days in the OP + last day
number_of_days                               Combined number of days where activity occurred based on the criteria taken (call_types mentioned above)
tariff                     CRM               Tariff profile of the subscriber
POS                        CRM               Point of Sale
FC_DATE                                      First call made by the subscriber
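As an illustration of how a few of the listed features could be derived, the following minimal sketch aggregates one day of call records for a single SIM. The record layout, the field names and the use of call_class = "LOCAL" to mark local calls are assumptions made for this example; the actual Switch extract format is not described in this case study.

```python
from statistics import median

# Hypothetical one-day call records for a single SIM; field names are illustrative only.
records = [
    {"call_type": "Voice MO", "call_class": "LOCAL", "destination": "111", "start": 0, "duration": 60},
    {"call_type": "Voice MO", "call_class": "LOCAL", "destination": "222", "start": 100, "duration": 30},
    {"call_type": "Voice MO", "call_class": "LOCAL", "destination": "333", "start": 400, "duration": 45},
    {"call_type": "Voice MO", "call_class": "INTL", "destination": "999", "start": 500, "duration": 120},
    {"call_type": "SMS MT", "call_class": "VAS", "destination": "555", "start": 600, "duration": 0},
]


def summarise_day(rows):
    """Aggregate one SIM's records for one day into a few of the listed features."""
    mo_voice = [r for r in rows if r["call_type"] == "Voice MO"]
    # Assumption: local calls are marked with call_class == "LOCAL".
    mo_local = [r for r in mo_voice if r["call_class"] == "LOCAL"]
    mo_intl = [r for r in mo_voice if r["call_class"] == "INTL"]
    mt_vas_sms = [r for r in rows if r["call_type"] == "SMS MT" and r["call_class"] == "VAS"]

    # Time gaps (in seconds) between consecutive MO voice calls.
    starts = sorted(r["start"] for r in mo_voice)
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]

    return {
        "countMoLocalCalls": len(mo_local),
        "countMoIntlCalls": len(mo_intl),
        "durMoLocalCalls": sum(r["duration"] for r in mo_local),
        "durMoIntlCalls": sum(r["duration"] for r in mo_intl),
        "countMtVASSMS": len(mt_vas_sms),
        "distinctLocalDestinations": len({r["destination"] for r in mo_local}),
        "getMedianTimeGap": median(gaps) if gaps else None,
    }


features = summarise_day(records)
print(features)
```

The G_ prefixed features would then be running sums of these daily values over the preceding days plus the last day.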
1.1 Model Training

Model training using the Nt ML Package is fully automated. The user provides the labelled data file
containing known SIM Box fraud and non-fraud subscribers, and specifies all of the possible input
data fields to be used plus the known target outcome field. The training process identifies the
relevance and appropriate contribution of each input data field; this is fed back as part of the model
training output (see below).
1.2 Model Training Outputs

The Nt ML Package can extract model information that helps to explain the data and the model.
Typically it returns five tables of information:
1. Cluster Description Table
2. Cluster Feature Centroids and Spread
3. Distinct Features Table
4. Classification Report
5. Target Proportion Description Table
Details describing the extracted clusters, in terms of explanatory distinctive features and a
comparison of centroids and spread, are illustrated using graphical plots.
1.2.1 Cluster Description Table

This table provides an overview of the clustering model output for the dataset, where we have
highlighted the large clusters in blue, anomaly clusters in green, fraud clusters in purple and the
clusters that are anomaly and fraud in red:
cluster_id  Population  mean    std    min   20% quantile  median  80% quantile  max
0           23,296      4.02    2.09   1.49  2.81          3.50    4.81          94.06
1           6,358       5.11    3.32   1.70  3.50          4.41    6.19          194.22
2           4,519       4.80    2.80   1.64  3.14          3.96    5.91          47.06
3           135         26.53   14.10  6.86  14.31         27.52   35.48         96.10
4           520         13.87   8.92   4.38  8.24          10.38   17.49         64.70
5           456         15.14   12.80  5.01  8.69          10.99   17.08         121.61
6           6,920       4.86    2.24   1.84  3.34          4.17    5.97          36.93
7           4,763       7.32    6.27   2.43  4.47          5.79    8.72          139.87
8           448         14.32   11.27  4.70  8.45          11.12   16.83         130.10
9           445         15.77   11.27  3.18  8.95          11.77   21.05         113.80
10          19,750      3.65    1.86   0.99  2.57          3.24    4.34          63.62
11          31,977      3.40    2.14   1.23  2.34          2.91    4.01          135.89
12          309         15.42   12.92  5.50  6.58          10.10   24.57         141.78
13          113         182.43  19.30  -     182.28        184.74  186.88        229.45
14          185         21.29   10.92  8.24  12.52         17.69   28.92         69.49
15          197         30.26   21.55  6.55  21.61         26.94   31.66         243.09
The meaning of each column is:

Column Name   Description
cluster_id    Cluster ID number
Population    Number of data records within that cluster
mean          Mean distance of data records to the cluster center
std           Standard deviation of distances of data records to the cluster center
min           Minimum distance of data records to the cluster center
20% quantile  20% quantile of distances of data records to the cluster center
median        Median of distances of data records to the cluster center
80% quantile  80% quantile of distances of data records to the cluster center
max           Maximum distance of data records to the cluster center
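The column definitions above can be sketched as follows for a single cluster. This is an illustration only: the distance metric (Euclidean), the use of the population standard deviation and the quantile interpolation method are assumptions, since the case study does not state which ones the Nt ML Package uses, and the points are made up.

```python
import math
from statistics import mean, median, pstdev, quantiles


def describe_cluster(points, centre):
    """Summarise the distances of a cluster's records to its centre."""
    distances = [math.dist(p, centre) for p in points]
    # n=5 yields the 20%, 40%, 60% and 80% cut points.
    cuts = quantiles(distances, n=5, method="inclusive")
    return {
        "Population": len(distances),
        "mean": mean(distances),
        "std": pstdev(distances),  # assumption: population standard deviation
        "min": min(distances),
        "20% quantile": cuts[0],
        "median": median(distances),
        "80% quantile": cuts[3],
        "max": max(distances),
    }


# Hypothetical two-feature records at distances 1, 2, 3, 4 and 5 from the centre.
centre = (0.0, 0.0)
points = [(1.0, 0.0), (0.0, 2.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0)]
stats = describe_cluster(points, centre)
print(stats)
```

A small mean and standard deviation relative to other clusters indicate a dense cluster, while a large maximum indicates at least one record far from the centre.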
Two important properties of the clusters can be inferred from the clustering model:
1. Cluster population (i.e. the number of data records in each cluster), and
2. Spread of data records within each cluster
From the table, one can observe that clusters ‘0’, ‘10’ and ‘11’ are the major clusters with the largest
populations, together making up about 75% of the total population. As is evident from the distance
columns, they also tend to be dense (i.e. data records are quite similar to each other within the same cluster).
At the opposite end of the spectrum, the model found several clusters with very few records each:
clusters ‘3’, ‘13’, ‘14’ and ‘15’. The data records in these clusters are considered anomalous by the
anomaly detection function (marked as anomaly case 1), as the clustering model found that these sets
of data records are not similar to the groups in the larger clusters.
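The observations about large and small clusters can be checked directly from the Population column of the Cluster Description Table. The short sketch below uses those populations; the cut-off of 200 records for a small cluster is taken from the Anomaly Clusters Analysis section, and the variable names are illustrative.

```python
# Populations copied from the Cluster Description Table.
populations = {
    0: 23296, 1: 6358, 2: 4519, 3: 135, 4: 520, 5: 456, 6: 6920, 7: 4763,
    8: 448, 9: 445, 10: 19750, 11: 31977, 12: 309, 13: 113, 14: 185, 15: 197,
}
SMALL_CLUSTER_MAX = 200  # clusters with fewer records are treated as anomaly clusters

total = sum(populations.values())

# The three largest clusters and their combined share of all records.
largest = sorted(populations, key=populations.get, reverse=True)[:3]
largest_share = sum(populations[c] for c in largest) / total

# Clusters small enough to be flagged as anomalous (anomaly case 1).
small = sorted(c for c, n in populations.items() if n < SMALL_CLUSTER_MAX)

print(f"total records: {total}")
print(f"largest clusters {sorted(largest)} hold {largest_share:.1%} of records")
print(f"small clusters: {small}")
```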
1.2.2 Target Proportion Table

cluster_id  target_class  Population  Pct within cluster
0           0             23,483      100.0%
0           1             4           0.0%
1           0             6,355       100.0%
1           1             2           0.0%
2           0             4,517       100.0%
2           1             1           0.0%
3           0             19          15.0%
3           1             108         85.0%
4           0             508         99.4%
4           1             3           0.6%
5           0             436         99.8%
5           1             1           0.2%
6           0             6,894       99.7%
6           1             24          0.3%
7           0             4,745       100.0%
7           1             2           0.0%
8           0             438         99.5%
8           1             2           0.5%
9           0             408         98.3%
9           1             7           1.7%
10          0             19,753      99.9%
10          1             26          0.1%
11          0             31,891      99.7%
11          1             86          0.3%
12          0             222         89.5%
12          1             26          10.5%
13          0             83          100.0%
13          1             -           0.0%
14          0             26          17.1%
14          1             126         82.9%
15          0             195         100.0%
15          1             -           0.0%
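The "Pct within cluster" column is the share of each target class among the labelled records of a cluster. A minimal sketch, using a few rows copied from the table above (the dictionary layout and the 60% threshold variable are illustrative):

```python
# (non-fraud count, fraud count) per cluster, copied from the Target Proportion Table.
counts = {
    3: (19, 108),
    11: (31891, 86),
    12: (222, 26),
    13: (83, 0),
    14: (26, 126),
}


def fraud_pct(non_fraud, fraud):
    """Percentage of records in the cluster that are labelled as fraud."""
    return 100.0 * fraud / (non_fraud + fraud)


pct = {cluster: fraud_pct(*pair) for cluster, pair in counts.items()}

# Clusters dominated by fraud (more than 60% of their labelled records).
HEAVY_FRAUD_PCT = 60.0
heavy_fraud = sorted(c for c, p in pct.items() if p > HEAVY_FRAUD_PCT)

for cluster in sorted(pct):
    print(f"cluster {cluster}: {pct[cluster]:.1f}% fraud")
print("heavy fraud clusters:", heavy_fraud)
```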
1.2.3 Fraud Clusters Analysis

This graph shows the clusters with at least 0.1% of known fraud accounts. It also shows the key
features explaining these clusters, and how these clusters relate to one another. In particular, we
observe that clusters ‘14’ and ‘3’ have a large percentage of fraud (>60%), and cluster ‘12’ has some
fraud. For cluster 12, the key distinctive features are associated with free MO SMS.
The Nt ML Package can generate explanatory plots describing “what are the distinctive features for
each cluster” and “why they are distinctive” in order to aid users in understanding the profile of a
cluster.
This is shown in the following sub-sections, first for the fraud clusters (3, 12 and 14), then in the
following sections for the anomaly clusters (13 and 15) and the large clusters (0, 10 and 11).
The plots list the features from most distinctive at the top to least distinctive; this refers to
how important each feature is in separating this cluster of subscribers from the general population
of all subscribers.
A colour bar is shown under each data feature giving the level of distinctiveness, with the
numeric value given underneath the data feature label; this maps to the colour scale shown on the
right side of each plot. The top horizontal axis also shows this numeric scale. The measure of
distinctiveness is global across all clusters, i.e. one cluster may not have features as strongly
distinguishing as those of another cluster. For example, Cluster 11 shows only yellow even for its
most distinctive feature, whereas Cluster 3 has stronger distinguishing features shown
with a red colour bar.
Cluster details are also shown for each data feature by the blue circle and line; the circle gives the
typical value of the data feature for the cluster and the blue lines show the typical lower and upper
values associated with the cluster. Note that the spread is typically not symmetric. The green arrow shows
how this cluster differs from the general data feature value for all subscribers. The direction of the arrow
indicates whether the cluster represents a higher than normal value (points to the right) or a lower one
(points to the left). Labels on the blue circle and associated bar, plus the green arrows, give original
data feature values.
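The case study does not define how the distinctiveness measure is calculated. Purely as an illustration of the idea, the sketch below scores each feature by how far the cluster's typical value sits from the population's typical value, scaled by the population spread, and reports the direction that the green arrow would point. The scoring formula, the function name and the sample values are all assumptions, not the Nt ML Package's method.

```python
from statistics import median, pstdev


def distinctiveness(cluster_values, population_values):
    """Return (score, direction) for one feature; an assumed, illustrative measure."""
    spread = pstdev(population_values)
    if spread == 0:
        return 0.0, "same"
    shift = (median(cluster_values) - median(population_values)) / spread
    direction = "higher" if shift > 0 else "lower" if shift < 0 else "same"
    return abs(shift), direction


# Hypothetical feature values for all subscribers and for one cluster.
population = {
    "distinctLocalDestinations": [2, 3, 4, 5, 6, 40, 50],
    "G_countRecharges": [1, 2, 3, 4, 5, 1, 0],
}
cluster = {
    "distinctLocalDestinations": [40, 50],
    "G_countRecharges": [1, 0],
}

# Rank features from most to least distinctive, as in the plots.
ranked = sorted(
    ((name, *distinctiveness(cluster[name], population[name])) for name in cluster),
    key=lambda item: item[1],
    reverse=True,
)
for name, score, direction in ranked:
    print(f"{name}: score={score:.2f}, cluster is {direction} than the population")
```

Because the same scale is applied to every cluster, a cluster that resembles the general population gets low scores on all features, which matches the behaviour described for Cluster 11.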
1.2.3.1 Fraud Cluster 3 (Pop. 135, Fraud 84.4%)
The above indicates that the SIMs in the cluster are most differentiated due to their high ratio of
unique destinations on local calls as well as high duration and count of MO international calls.
Below is a plot showing the fraud classification reasons analysis, highlighting the significant features
associated with fraud vs. non-fraud within the cluster. It identifies distinct local destinations
together with low values of distinct cells, recharge value and data charging.
1.2.3.2 Fraud Cluster 14 (Pop. 185, Fraud 69.7%)
Cluster 14 is distinguished by its count of MT free SMS services; this might indicate
camouflage activity performed by fraudsters to disguise the underlying SIM box activity.
The fraud reasons analysis, however, highlights the high ratio of unique destinations for local calls as
well as a low count of MO local calls and a low recharge value.
1.2.3.3 Fraud Cluster 12 (Pop. 309, Fraud 8.4%)
Similar to cluster 14, this cluster shows free SMS services, but in this case MO instead of MT. Again,
this would appear to be camouflage activity.
The fraud reasons analysis is showing a low duration and count of MT local calls and also of
uncharged calls together with a low ratio of MT or MO local call count.
1.2.4 Anomaly Clusters Analysis

This graph shows the clusters whose population is less than 200. It also shows the key features
explaining these clusters, and how these clusters relate to one another.
1.2.4.1 Anomaly Cluster 13 (Pop. 113, Fraud 0.0%)
This is showing that these SIMs are most distinguished by their count and duration of MO VAS calls,
and MO free calls.
1.2.4.2 Anomaly Cluster 15 (Pop. 197, Fraud 0.0%)
This is showing that these SIMs are most distinguished by their duration and count of MT international
calls.
1.2.5 Large Clusters Analysis

This graph shows the 3 largest clusters. It also shows the key features explaining these clusters, and
how these clusters relate to one another.
1.2.5.1 Large Cluster 11 (Pop. 31,977, Fraud 0.3%)
This large cluster represents SIMs with low duration of uncharged calls, count and duration of MT local
calls, etc. Overall, low usage.
The small amount of fraud found in this cluster is strongly identified based on the ratio of unique
destinations for local calls.
1.2.5.2 Large Cluster 0 (Pop. 23,296, Fraud 0.0%)
In this case, SIMs are distinguished by their lower than normal time gap between calls
(i.e. more frequent calling) and more movement around cell sites. In other words, these are frequent users.
1.2.5.3 Large Cluster 10 (Pop. 19,750, Fraud 0.1%)
This cluster also appears to show lower than normal usage.
1.2.6 Classification Report

To evaluate the performance of the semi-supervised function of the ML model, 20% of the original
dataset is held out from training. This portion of the data (i.e. the testing data) is not ‘seen’ by the
model during training. After the model is trained, it is evaluated on the testing dataset. Hence the
results of the evaluation give unbiased accuracy metrics that indicate likely production performance.
Index         precision  recall   f1-score  Support
0             0.99990    0.99985  0.99988   19995
1             0.96471    0.97619  0.97041   84
accuracy                          0.99975
macro avg     0.98230    0.98802  0.98515   20079
weighted avg  0.99975    0.99975  0.99975   20079
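To show how the figures in the report relate to one another, the sketch below recomputes the main metrics from a confusion matrix. The four counts are not quoted in the report; they are inferred here as the only whole-number counts consistent with the reported supports (19995 and 84) and the class 1 precision and recall, so treat them as an assumption.

```python
# Confusion-matrix counts inferred from the reported supports, precision and recall
# (an inference made for this example, not figures quoted in the report).
tn = 19992  # non-fraud correctly predicted as non-fraud
fp = 3      # non-fraud wrongly predicted as fraud
fn = 2      # fraud wrongly predicted as non-fraud
tp = 82     # fraud correctly predicted as fraud

# Class 1 (fraud) metrics.
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (non-fraud) precision and recall.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class 1: precision={precision_1:.5f} recall={recall_1:.5f} f1={f1_1:.5f}")
print(f"class 0: precision={precision_0:.5f} recall={recall_0:.5f}")
print(f"accuracy={accuracy:.5f}")
```

With such a strong class imbalance, accuracy is dominated by the non-fraud class, so the class 1 precision and recall are the more informative figures: under these inferred counts, 2 of the 84 held-out fraud SIMs are missed and 3 clean SIMs are flagged.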
1.2.7 Anomaly Entities

As well as clustering, the Nt ML Package can also identify entities whose behaviour deviates from the
norm of their own cluster. The following table shows the top 50 entities with such behaviour, sorted by
their standardized distance from their respective cluster centres.
Below the table, plots are given showing the top 5 entities with the largest standardized distance
(coloured in red).
Entity Cluster ID Standardized Distance
376743151 11 51.62
760718958 0 31.48
3133711707 11 28.06
1286347530 0 22.30
4219089884 0 22.06
1146251012 0 21.51
1999820679 10 20.32
933997323 0 19.63
505430694 10 18.61
776583152 7 18.37
1935702752 12 18.29
1101863067 11 17.97
322431738 7 17.82
1133969425 11 17.65
1305404162 11 17.62
1204125571 7 17.08
2872626540 11 16.21
1842302518 0 15.67
396180908 11 15.17
1023652607 7 15.06
1210718027 9 14.65
808044472 10 14.36
4174155735 0 14.33
2092715772 11 14.16
2930769892 2 14.05
919088656 7 13.93
4046865441 10 13.45
2490138536 10 13.20
1800017955 10 12.92
2565649299 7 12.88
3059733041 2 12.84
3579741937 10 12.74
4209657407 0 12.72
254643268 11 12.54
3598438238 11 12.38
2851577170 2 12.30
518764564 10 12.10
3690555592 11 11.98
4107830868 11 11.82
3026098905 11 11.65
450407584 0 11.54
3913846472 7 11.54
39413158 2 11.49
45360594 11 11.34
847378189 10 11.15
1750696192 0 11.06
454053221 0 11.06
3690343689 11 10.75
3255186323 2 10.75
3412639169 0 10.64
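The case study does not state how the distance is standardized. One common choice, assumed here purely for illustration, is a z-score of each entity's distance using the mean and standard deviation of distances within its own cluster; this assumption may not reproduce the values in the table above. The cluster statistics below are taken from the Cluster Description Table, while the entity names and distances are hypothetical.

```python
# (mean, std) of distances to the cluster centre, from the Cluster Description Table.
cluster_stats = {0: (4.02, 2.09), 11: (3.40, 2.14)}

# Hypothetical entities: name -> (cluster_id, distance to own cluster centre).
entities = {
    "sim-a": (11, 3.0),
    "sim-b": (11, 40.0),
    "sim-c": (0, 25.0),
}


def standardized_distance(cluster_id, distance):
    """Assumed definition: z-score of the distance within the entity's own cluster."""
    mean, std = cluster_stats[cluster_id]
    return (distance - mean) / std


# Rank entities from most to least anomalous within their own cluster.
ranked = sorted(
    ((name, cluster_id, standardized_distance(cluster_id, distance))
     for name, (cluster_id, distance) in entities.items()),
    key=lambda item: item[2],
    reverse=True,
)
for name, cluster_id, score in ranked:
    print(f"{name} (cluster {cluster_id}): {score:.2f}")
```

Standardizing per cluster means an entity is judged against the density of its own cluster, so the same raw distance counts as more anomalous in a dense cluster than in a widely spread one.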
1.2.7.1 Anomaly Entity 376743151

This anomalous SIM falls outside the large population cluster 11 and is distinguished based on a high
ratio of MT to MO local calling.
1.2.7.2 Anomaly Entity 760718958

Similar to the previous anomaly, this anomalous SIM falls outside the large population cluster 0 and is
again distinguished based on a high ratio of MT to MO local calling.
1.2.7.3 Anomaly Entity 3133711707

Again, similar to the previous anomaly, this anomalous SIM falls outside the large population cluster
11 and is again distinguished based on a high ratio of MT to MO local calling.
1.2.7.4 Anomaly Entity 1286347530

This anomalous SIM falls outside the large population cluster 0 and is this time distinguished based on
a high count of MO and MT VAS SMS activity which would also sometimes be free SMS.
1.2.7.5 Anomaly Entity 4219089884

This anomalous SIM falls outside the large population cluster 0 and is again distinguished based on a
high ratio of MT to MO local calling.
1.2.8 Un-labelled Fraud Entities

Below is the list of entities which are predicted as fraud but not labelled as fraud in the original data.
These would represent candidate SIMs for investigation as unknown SIM Box frauds.
Entity      Cluster ID  Standardized Distance  Anomaly Type  Prediction Score
2915107828  14          1.49                   1             100.0%
2563634431  6           4.46                   2             97.6%
1310938097  11          3.49                   0             96.9%
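A list like the one above can be produced by selecting entities that the model predicts as fraud but that carry a non-fraud label. In the minimal sketch below, the first three rows use the entities and prediction scores from the table; the last two rows, the field names and the 0.5 decision threshold are hypothetical additions for the example.

```python
# Scored entities: "label" is the original BYPASS value, "score" the predicted fraud probability.
scored = [
    {"entity": "2915107828", "label": 0, "score": 1.000},
    {"entity": "2563634431", "label": 0, "score": 0.976},
    {"entity": "1310938097", "label": 0, "score": 0.969},
    {"entity": "example-known-fraud", "label": 1, "score": 0.990},  # hypothetical row
    {"entity": "example-clean", "label": 0, "score": 0.020},        # hypothetical row
]
FRAUD_THRESHOLD = 0.5  # assumed decision threshold

# Predicted as fraud but not labelled as fraud: candidates for investigation.
candidates = [
    row["entity"]
    for row in sorted(scored, key=lambda r: r["score"], reverse=True)
    if row["label"] == 0 and row["score"] >= FRAUD_THRESHOLD
]
print(candidates)
```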