Big Data Analytics (2017 Regulation) : Insurance Fraud Detection
Advantages:
It is fast
Easy to understand
Robust
Comparatively efficient
Gives the best results when the clusters in the data are distinct and well separated
Produces tighter clusters
Cluster assignments are updated each time the centroids are recomputed
Flexible
Easy to interpret
Better computational cost
Enhances Accuracy
Disadvantages:
Randomly chosen initial centroids can sometimes lead to poor results
The number of cluster centers must be specified in advance
Two highly overlapping groups of data cannot be distinguished, so the algorithm cannot tell that there are two clusters
Different representations of the data yield different results
Euclidean distance can weight the factors unequally
Very large data sets may exceed the memory or processing capacity of the computer
Prediction issues
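The point about Euclidean distance weighting factors unequally can be illustrated with a small sketch. The records below are hypothetical (a claim amount and a policyholder age, neither taken from the notes), and min-max scaling is only one possible remedy, assumed here for illustration.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def min_max_scale(points):
    """Rescale every feature (column) to the range [0, 1]."""
    cols = list(zip(*points))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(p, lows, highs))
        for p in points
    ]

# Hypothetical records: (claim amount, policyholder age)
records = [(1000.0, 25.0), (1200.0, 65.0), (5000.0, 26.0)]

# Records 0 and 1 differ mostly in age, records 0 and 2 mostly in amount.
raw_ab = euclidean(records[0], records[1])
raw_ac = euclidean(records[0], records[2])

scaled = min_max_scale(records)
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])

print("raw distances:   ", raw_ab, raw_ac)
print("scaled distances:", scaled_ab, scaled_ac)
```

On the raw values the claim amount dominates the distance simply because its numeric range is larger; after rescaling, both features contribute on a comparable footing.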
Cluster the observed data, varying the number of clusters from k = 1, …, kmax, and compute the corresponding
Elbow Method:
1. Run a clustering algorithm (e.g., k-means clustering) for different values of k, for instance by
varying k from 1 to 10 clusters.
2. For each k, calculate the total within-cluster sum of squares (WSS).
3. Plot the curve of WSS as a function of the number of clusters k.
4. The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number
of clusters.
5. In this example, the bend occurs at k = 4, so 4 is the optimal number of clusters.
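The elbow steps above can be sketched in plain Python. This is a minimal illustration, not the implementation used in the notes: the data are synthetic (four hypothetical blobs), and the deterministic farthest-first initialisation is an assumption made so the run is repeatable, in place of random centroid selection.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first(points, k):
    """Deterministic initialisation (an assumption of this sketch):
    start from the first point, then repeatedly add the point that is
    farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def kmeans_wss(points, k, max_iter=100):
    """Run k-means and return the total within-cluster sum of squares."""
    centroids = farthest_first(points, k)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(sq_dist(p, centroids[i])
               for i, cl in enumerate(clusters) for p in cl)

# Synthetic data: four hypothetical, well-separated blobs of 15 points each.
rng = random.Random(42)
centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
          for cx, cy in centers for _ in range(15)]

# Steps 1 and 2: compute WSS for k = 1 .. 10.
wss = {k: kmeans_wss(points, k) for k in range(1, 11)}

# Step 3: a text listing stands in for the plot.
for k, value in wss.items():
    print(f"k={k:2d}  WSS={value:10.2f}")
```

On data built like this, the WSS should fall sharply up to k = 4 and change little afterwards, which is the bend described in step 4. On real data the bend is often less pronounced and reading it remains a judgment call.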
Average Silhouette Method: (The average silhouette approach measures the quality of a clustering)
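For each data point i:
Compute the average distance from i to all other data points in the same cluster (a_i).
Compute the average distance from i to all data points in the closest neighboring cluster (b_i).
The silhouette coefficient s_i = (b_i - a_i) / max(a_i, b_i) can take values in the interval [-1, 1].
If it is close to 1 -> the sample is far away from the neighboring clusters.
If it is 0 -> the sample is very close to the neighboring clusters.
If it is negative -> the sample may have been assigned to the wrong cluster.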
A high avg. silhouette score indicates a good clustering.
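As a rough illustration of the silhouette computation, the sketch below evaluates the standard coefficient (b_i - a_i) / max(a_i, b_i) for every point and averages the result. The six points and both labelings are hypothetical, and the function name `silhouette_scores` is introduced here for illustration only.

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def silhouette_scores(points, labels):
    """Return the silhouette coefficient of every point.
    Assumes at least two clusters and that not all points coincide."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a_i: mean distance to the other points of the same cluster
        # (the distance from p to itself is 0, hence len(own) - 1).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b_i: smallest mean distance to the points of any other cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return scores

# Hypothetical data: two tight groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]   # matches the two groups
bad_labels = [0, 1, 0, 1, 0, 1]    # mixes the two groups

good = silhouette_scores(points, good_labels)
bad = silhouette_scores(points, bad_labels)
avg_good = sum(good) / len(good)
avg_bad = sum(bad) / len(bad)

print("average silhouette, matching labels:", avg_good)
print("average silhouette, mixed labels:   ", avg_bad)
```

The labeling that matches the two groups should score close to 1, while the mixed labeling scores much lower. To choose k, the same average can be computed for each candidate clustering and the k with the highest value retained.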
According to this observation, k = 2 is the optimal number of clusters in the data.