Big Data Analytics (2017 Regulation) : Insurance Fraud Detection

BIG DATA ANALYTICS (2017 REGULATION)

Insurance Fraud Detection


 Machine learning plays a critical role in fraud detection, with numerous applications in automobile, healthcare, and insurance fraud.
 Using historical data on fraudulent claims, new claims can be flagged based on their proximity to clusters that indicate fraudulent patterns.
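As a rough illustration of the proximity idea, the sketch below flags a new claim when its nearest centroid belongs to a cluster of historical fraudulent claims. The centroids, the two features (scaled claim amount and scaled policy age), and the function names are all hypothetical; in practice the centroids would come from clustering real historical claims.

```python
import math

# Hypothetical centroids obtained by clustering historical claims.
# Assumed features, both scaled to [0, 1]: (claim amount, policy age).
FRAUD_CENTROIDS = [(0.90, 0.05), (0.75, 0.10)]
LEGIT_CENTROIDS = [(0.20, 0.60), (0.35, 0.80)]


def nearest_distance(point, centroids):
    """Euclidean distance from point to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)


def looks_fraudulent(claim):
    """Flag a claim that lies closer to a fraud cluster than to a legitimate one."""
    return nearest_distance(claim, FRAUD_CENTROIDS) < nearest_distance(claim, LEGIT_CENTROIDS)


# Two made-up new claims: one near a fraud cluster, one near a legitimate cluster.
new_claims = [(0.85, 0.08), (0.25, 0.70)]
flags = [looks_fraudulent(claim) for claim in new_claims]
print(flags)
```

A flag produced this way is only a screening signal for further review, not a verdict on the claim.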
Rideshare Data Analysis
 The publicly available Uber ride information dataset provides a large amount of valuable data around traffic,
transit time, peak pickup localities, and more.
Cyber-Profiling Criminals
 Cyber-profiling is the process of collecting data from individuals and groups to identify significant correlations.
 The idea of cyber-profiling is derived from criminal profiling, which gives the investigation division information for classifying the types of criminals who were at the crime scene.
Call Detail Record Analysis
 A call detail record (CDR) is the information captured by telecom companies during the call, SMS, and
internet activity of a customer.
 Combined with customer demographics, this information provides greater insight into the customer’s needs.
Automatic Clustering of IT Alerts
 IT infrastructure components in large enterprises, such as network, storage, or database systems, generate large volumes of alert messages.
 Because alert messages potentially point to operational issues, they must be manually screened and prioritized for downstream processes.
Others: image segmentation, image compression, identifying cancerous data, search engines, etc.

Advantages:
 It is fast
 Easy to understand
 Robust
 Comparatively efficient
 Gives the best results when the data sets are distinct (well separated)
 Produces tighter clusters
 Cluster assignments are updated whenever the centroids are recomputed
 Flexible
 Easy to interpret
 Better computational cost
 Enhances accuracy

Disadvantages:
 Choosing the initial centroids randomly can sometimes give poor results
 Requires the number of cluster centers to be specified in advance
 Cannot distinguish two highly overlapping groups of data, so it cannot tell that there are two clusters
 Different representations of the data give different results
 Euclidean distance can weight the factors unequally
 Very large data sets may exhaust the computer's resources and crash it
 Prediction issues
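The point about Euclidean distance weighting factors unequally can be seen in a small example. The sketch below assumes hypothetical customer records (income in dollars, age in years) and uses z-score standardization as one common remedy; the slides do not prescribe a particular fix, so treat the remedy as an assumption.

```python
import math
import statistics

# Hypothetical customers: (annual income in dollars, age in years).
customers = [(30_000, 25), (31_000, 60), (60_000, 26)]


def standardize(rows):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


# Raw distances: the income axis dominates, so a 35-year age gap barely counts.
raw_age_gap = math.dist(customers[0], customers[1])     # pair differing mainly in age
raw_income_gap = math.dist(customers[0], customers[2])  # pair differing mainly in income

# After standardization both features contribute on a comparable scale.
scaled = standardize(customers)
scaled_age_gap = math.dist(scaled[0], scaled[1])
scaled_income_gap = math.dist(scaled[0], scaled[2])

print(f"raw:    age gap {raw_age_gap:.1f}, income gap {raw_income_gap:.1f}")
print(f"scaled: age gap {scaled_age_gap:.2f}, income gap {scaled_income_gap:.2f}")
```

On the raw values the income difference outweighs the age difference by a factor of about thirty, while after rescaling the two gaps are nearly equal. This also illustrates why different representations of the same data can lead to different clusters.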

Determining Optimal Clusters:


 When using k-means clustering, users need some way to determine whether they are using the right number
of clusters.
Methods:
1. Elbow Method
2. Average Silhouette Method
3. Gap Statistic Method

Cluster the observed data, varying the number of clusters from k = 1, …, k_max, and compute the corresponding total within-cluster variation W_k.
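For reference, one common way to write this quantity is shown below. It assumes squared Euclidean distance; the symbols C_r (the set of points in cluster r), n_r (its size), and μ_r (its centroid) are notation introduced here and do not appear on the slides.

```latex
W_k = \sum_{r=1}^{k} \sum_{x_i \in C_r} \lVert x_i - \mu_r \rVert^2,
\qquad
\mu_r = \frac{1}{n_r} \sum_{x_i \in C_r} x_i
```

Under this assumption W_k is the same quantity that the elbow method calls the total within-cluster sum of squares (WSS).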



Elbow Method:
1. Run the clustering algorithm (e.g., k-means clustering) for different values of k, for instance by varying k from 1 to 10 clusters.
2. For each k, calculate the total within-cluster sum of squares (WSS).
3. Plot the curve of WSS against the number of clusters k.
4. The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.
5. In the example plotted on the slide, the bend is at k = 4, so 4 is the optimal number of clusters.
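The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the slide's own example: the data are synthetic (four tight, made-up groups), and the k-means routine uses deterministic farthest-first seeding instead of random initial centroids so that the run is reproducible. Plotting is left out; the WSS values are printed instead.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first_init(points, k):
    """Deterministic seeding (an assumption of this sketch): start at the first
    point, then repeatedly add the point farthest from the chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns the final centroids."""
    centroids = farthest_first_init(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members (keep it if empty).
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def wss(points, centroids):
    """Total within-cluster sum of squares: each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


# Synthetic data: four tight groups of five points each (made up for illustration).
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
centers = [(0.0, 0.0), (20.0, 0.0), (0.0, 10.0), (20.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

# Steps 1 and 2: run k-means for k = 1..10 and record WSS for each k.
wss_by_k = {k: wss(points, kmeans(points, k)) for k in range(1, 11)}

# Step 3 (stand-in for the plot): list WSS against k.
for k, value in wss_by_k.items():
    print(k, round(value, 3))
```

On this synthetic data WSS falls steeply up to k = 4 and changes very little afterwards, which is the bend that step 4 asks the reader to look for. Real data rarely give such a sharp knee, so the plot is usually read with some judgment.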

Average Silhouette Method: (The average silhouette approach measures the quality of a clustering)
 For each data point i, compute the average distance to all other data points in the same cluster (a_i).
 Compute the average distance from point i to all data points in the closest neighboring cluster (b_i).
 The coefficient can take values in the interval [-1, 1].
 If it is 0 –> the sample is very close to the neighboring clusters.
 If it is 1 –> the sample is far away from the neighboring clusters.
 If it is -1 –> the sample is assigned to the wrong cluster, or the clusters overlap.
 A high average silhouette width indicates a good clustering.

Compute the coefficient:
$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$
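A minimal sketch of this computation, using the standard silhouette definition s_i = (b_i - a_i) / max(a_i, b_i), is given below. The four one-dimensional points and both labelings are made up for illustration, and the function name is hypothetical.

```python
import math
from statistics import mean


def silhouette_values(points, labels):
    """Return the silhouette coefficient of every point (needs at least 2 clusters)."""
    values = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Distances from point i to every other point, grouped by cluster label.
        by_cluster = {}
        for j, (q, label) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(label, []).append(math.dist(p, q))
        if own not in by_cluster:
            values.append(0.0)  # common convention for a single-point cluster
            continue
        a = mean(by_cluster[own])  # average distance within the point's own cluster
        b = min(mean(d) for label, d in by_cluster.items() if label != own)  # closest other cluster
        values.append((b - a) / max(a, b))
    return values


# Made-up data: two obvious groups on a line.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = mean(silhouette_values(points, [0, 0, 1, 1]))  # labels follow the groups
bad = mean(silhouette_values(points, [0, 1, 0, 1]))   # labels cut across the groups

print(f"natural grouping: {good:.3f}")
print(f"mixed grouping:   {bad:.3f}")
```

The labeling that follows the two groups scores about 0.90, while the labeling that cuts across them scores -0.45, matching the interpretation of values near 1 and below 0. To choose k, the same average would be computed for each candidate k and the largest one preferred.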



Gap Statistic Method:


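 The approach can be applied to any clustering method.
 The gap statistic compares the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data.
The gap statistic for a given k is defined as follows:
$$\mathrm{Gap}_n(k) = E_n^{*}[\log W_k] - \log W_k$$
where $E_n^{*}$ denotes the expectation under a sample of size n from the reference distribution.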
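The sketch below estimates the gap statistic for a small synthetic data set. Several choices are assumptions made for illustration: the reference distribution is uniform over the bounding box of the data, the expectation is approximated by averaging over 20 reference samples, k-means uses deterministic farthest-first seeding, and the final choice of k simply takes the largest gap (the published procedure uses a standard-error rule, omitted here).

```python
import math
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_wss(points, k, max_iter=100):
    """Total within-cluster sum of squares W_k after a simple k-means run."""
    # Deterministic farthest-first seeding (an assumption of this sketch).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda j: sq_dist(p, centroids[j]))].append(p)
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def gap_statistic(points, k_max, n_refs=20, seed=0):
    """Estimate Gap(k) = E*[log W_k] - log W_k for k = 1..k_max."""
    rng = random.Random(seed)
    lows = [min(axis) for axis in zip(*points)]
    highs = [max(axis) for axis in zip(*points)]
    # Reference data sets: uniform samples over the bounding box of the data.
    refs = [
        [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs)) for _ in points]
        for _ in range(n_refs)
    ]
    gaps = {}
    for k in range(1, k_max + 1):
        expected_log_w = sum(math.log(kmeans_wss(ref, k)) for ref in refs) / n_refs
        gaps[k] = expected_log_w - math.log(kmeans_wss(points, k))
    return gaps


# Synthetic data: two tight groups of five points each (made up for illustration).
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
points = [(cx + dx, cy + dy) for cx, cy in [(0.0, 0.0), (20.0, 0.0)] for dx, dy in offsets]

gaps = gap_statistic(points, k_max=4)
best_k = max(gaps, key=gaps.get)  # simplest rule: the k with the largest gap
for k, gap in gaps.items():
    print(k, round(gap, 3))
print("k with the largest gap:", best_k)
```

Because the two groups are far apart, W_k drops sharply from k = 1 to k = 2 while the reference data show no comparable drop, so the gap rises markedly at k = 2.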
In the slide's example, the gap statistic indicates that k = 2 is the optimal number of clusters in the data.
