
Data Mining project

Problem 1: Clustering

A leading bank wants to develop a customer segmentation to give promotional offers to its customers. They
collected a sample that summarizes the activities of users during the past few months. You are given the task to
identify the segments based on credit card usage.

1.1 Read the data and do exploratory data analysis. Describe the data briefly.

# Importing all the necessary libraries for the analysis

# Load the csv file for doing analysis

# head of the data set is obtained as below. This will show you the top 5 rows and the content.

# Shape of the data set is obtained as below; the data set contains 210 observations of data and 7 variables.

# Checking for missing values. There are no null values in any of the columns.
# Summary of data is obtained
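
A minimal sketch of these loading and inspection steps is shown below; the file name bank_customers.csv is an assumption, the actual file used in the notebook may be named differently.

```python
import pandas as pd

# Load the csv file (file name assumed for illustration)
df = pd.read_csv("bank_customers.csv")

# Top 5 rows and their content
print(df.head())

# Shape: expected to be (210, 7) for this data set
print(df.shape)

# Missing-value check per column
print(df.isnull().sum())

# Summary statistics (mean, std, min, quartiles, max)
print(df.describe().T)
```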

1. The data looks legitimate, as all the statistics seem reasonable.

2. The mean and the median values are almost equal for all the variables.

3. There is no large difference between the 75% and the Max values.

4. By looking at the above two observations, we can say that there are no extreme values in the data set.

# Check for information of the data set

By looking at the information we can say that all the columns have float as their data type.

# Check for Duplicates

There are no duplicates in the data set.

# Check for outliers

By looking at the results we can see that a few outliers are present for min_payment_amt and probability_of_full_payment, while the remaining variables do not have any outliers.

# Univariate and multivariate analysis are given in the attached Python notebook.

By looking at the heatmap we can say that advance payments and credit limit are highly correlated with spending. Current balance vs spending and current balance vs payments are also highly correlated.
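
A sketch of the correlation heatmap used for the multivariate analysis, assuming seaborn and matplotlib are available and df is the data frame loaded above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations between the numeric variables
corr = df.corr()

# Heatmap with annotated correlation coefficients
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation heatmap of credit card usage variables")
plt.show()
```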

1.2 Do you think scaling is necessary for clustering in this case? Justify

By looking at the summary of the data we can see that the mean, median, min and max of all 7 variables are distinctly different from each other, i.e. the variables are on different scales. Since hierarchical clustering uses distance-based computation, unscaled data would let variables with larger ranges dominate the distances. Hence it is necessary to scale and normalise the data; this transformation is required before clustering in this case.
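
A minimal scaling sketch using z-score standardisation via StandardScaler (scipy.stats.zscore would work equally well); df is the raw data frame from the EDA step.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Standardise every variable to mean 0 and standard deviation 1
scaler = StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Means should now be ~0 and standard deviations ~1
print(scaled_df.describe().T.round(2))
```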

1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram
and briefly describe them

Clustering is a technique used for unsupervised learning. In hierarchical clustering, records are sequentially grouped to create clusters, based on distances between records and distances between clusters. We import dendrogram and linkage, which are used to identify the optimal number of clusters.

After performing hierarchical clustering on the scaled data, the following result is obtained using the Ward linkage method. Two clusters (green and red) are visible in the dendrogram, and we find that the majority of customers fall under the red cluster.

We have used the truncate option with p = 10 to get a clearer dendrogram.
If you look at the truncated output above and draw a horizontal line between 15 and 20, say at 18, it crosses 3 vertical lines. Using the maxclust and distance criteria, we can also see that 3 clusters are good enough.
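
A sketch of the hierarchical clustering step described above: Ward linkage on the scaled data, a truncated dendrogram with p = 10, and fcluster used to cut the tree into 3 clusters.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Ward linkage on the scaled data
wardlink = linkage(scaled_df, method="ward")

# Truncated dendrogram showing only the last 10 merged clusters
plt.figure(figsize=(10, 5))
dendrogram(wardlink, truncate_mode="lastp", p=10)
plt.title("Dendrogram (Ward linkage, truncated to p = 10)")
plt.show()

# Cut the tree into 3 clusters using the 'maxclust' criterion
clusters = fcluster(wardlink, 3, criterion="maxclust")

# Equivalent cut at a chosen height, e.g. the line drawn at 18
clusters_by_distance = fcluster(wardlink, 18, criterion="distance")
```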
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and
silhouette score.

We apply K-means clustering, which is a type of unsupervised learning. In this model we need to predetermine the number of clusters; it is a non-hierarchical model. First we need to build the WSS (within-sum-of-squares) plot in order to arrive at the optimal number of clusters.

Let’s choose an initial K value of 3, as shown below, and build the K-means model using KMeans.

We fit the scaled data to the K-means model and can then see the cluster label assigned to each observation:

• 0 – indicates cluster 1

• 1 – indicates cluster 2

• 2 – indicates cluster 3

The K-means inertia is nothing but the total WSS when K = 3.

We can try different numbers of clusters and find the inertia for each, as given below. The larger the drop in WSS, the better.

If the drop is not significant, the additional cluster is not useful for us.

From 1 cluster to 2 clusters we have a significant drop of close to 900 points.

From 2 clusters to 3 clusters we have a good drop of close to 240 points.

From 3 clusters to 4 clusters the drop is not significant, only about 50 points.

From 4 clusters to 5 clusters the drop is not significant, only about 50 points.


By looking at the drop we can say 3 is optimal for us.

Looking at the graph, it is also evident that the drop is significant for 1, 2 and 3 clusters; after that the drop is very minimal.
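
A sketch of the elbow computation: KMeans is fitted for K = 1 to 10 and the inertia (total WSS) recorded for each K; the random_state value is an assumption.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Total within-sum-of-squares (inertia) for K = 1..10
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=1)
    km.fit(scaled_df)
    wss.append(km.inertia_)

# Elbow curve: the bend around K = 3 marks the optimal cluster count
plt.plot(range(1, 11), wss, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("WSS (inertia)")
plt.show()
```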

Using the silhouette score and silhouette analysis, we can check whether the mapping of each observation to its cluster is appropriate.

The silhouette score is better for 3 clusters than for 4 clusters, so the final number of clusters will be 3.
We also compute the silhouette width of every observation.

The smallest silhouette width is 0.002 and all silhouette widths are positive, which indicates that no observations are wrongly mapped to a cluster.
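
A sketch of the silhouette check: silhouette_score gives the average score for a chosen K, and silhouette_samples gives the per-observation silhouette widths (the sil_width values).

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Final 3-cluster model on the scaled data
km3 = KMeans(n_clusters=3, random_state=1)
labels = km3.fit_predict(scaled_df)

# Average silhouette score for K = 3 (compare with K = 4 in the same way)
print("Silhouette score (K=3):", silhouette_score(scaled_df, labels))

# Per-observation silhouette widths; the minimum should stay positive
sil_width = silhouette_samples(scaled_df, labels)
print("Smallest silhouette width:", sil_width.min())
```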

Hence, we conclude 3 to be the optimal number of clusters.

1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different
clusters.

Two new columns have been added to the data set: Clus_Kmeans and sil_width.
All credit card users are now mapped to one of the three clusters which we have identified (clusters 0, 1 and 2).

By computing the per-cluster averages for all the customers, we can draw some conclusions (a sketch of this computation is shown after the cluster descriptions). Looking at this summary we can conclude that:

• Cluster 0: Medium spending group; however, the minimum payment amount is less compared to the least spending group.

• Cluster 1: Least spending group. The minimum payment amount made by people in this group is more than the others, while the averages of all other parameters are more or less the same. We can try increasing the credit limit of the customers who fall under cluster 1, since the average amount spent in a single shopping is more or less equal to cluster 0.

• Cluster 2: Premium group who spend more money. Credit limit and all other parameters are also relatively high compared to members of other clusters. The minimum payment amount is less compared to the least spending group; the bank can look into increasing the minimum payment amount for this premium group.
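
A sketch of how the cluster profiles above can be computed, attaching the K-means labels and silhouette widths from the earlier sketches to the original (unscaled) data and averaging per cluster:

```python
# Attach cluster labels and silhouette widths to the original data
df["Clus_Kmeans"] = labels
df["sil_width"] = sil_width

# Per-cluster averages used for the profiles above
cluster_profile = df.groupby("Clus_Kmeans").mean()
print(cluster_profile)

# Number of customers in each cluster
print(df["Clus_Kmeans"].value_counts())
```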

Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim frequency. The
management decides to collect data from the past few years. You are assigned the
task to make a model which predicts the claim status and provide recommendations
to management. Use CART, RF & ANN and compare the models' performances in
train and test sets.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it?
# Importing all the necessary libraries for the analysis

# Load the csv file for doing analysis

# head of the data set is obtained as below. This will show you the top 5 rows and the content.

# Shape of the data set is obtained as below; the data set contains 3000 observations of data and 10 variables.
# Checking for missing values. There are no null values in any of the columns.

# Summary of data is obtained

1. We can ignore the Age column.

2. There is a large difference between the 75% and the Max values, and also between the mean and the median.

3. By looking at the above two observations, we can say that there are extreme values (outliers) present.

Since Age is not going to give us any meaningful analysis for the current problem statement, we drop the variable and continue our analysis.

# Check for information of the data set

By looking at the information we can see that the variables have different data types. In order to build CART, RF and ANN models we need to convert the object columns into categorical codes to carry out the analysis.
# Check for Duplicates

There are duplicates in the data set. We need to clean up the duplicates.

Duplicates are now dropped.

# Check for outliers

As seen in the summary, outliers are present in all the variables. Random Forest and ANN can handle outliers, hence outliers are not treated for now.

Checking the pairwise distribution of the continuous variables.

By looking at the heatmap we can say sales and commission are highly correlated.

There are mostly positive correlations between variables.


Since several columns are of object data type, in order to do the analysis we need to convert them all into categorical codes.

Now we can see that all the object columns have been changed into categorical codes. The data set is now ready for analysis.
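
A sketch of the clean-up steps described in this section: dropping Age, removing duplicates and converting the object columns into categorical codes. The file name and the data frame name ins_df are assumptions.

```python
import pandas as pd

# Load the insurance data (file name assumed for illustration)
ins_df = pd.read_csv("insurance_part2_data.csv")

# Drop the Age column and remove duplicate rows
ins_df = ins_df.drop("Age", axis=1).drop_duplicates()

# Convert every object column (including the target "Claimed") into categorical codes
for col in ins_df.select_dtypes(include="object").columns:
    ins_df[col] = pd.Categorical(ins_df[col]).codes

print(ins_df.dtypes)
```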

2.2 Data Split: Split the data into test and train, build classification model CART,
Random Forest, Artificial Neural Network
The target or dependent variable in the data set is “Claimed” and all other variables are independent variables. If you look at the proportions of 1’s and 0’s of the target variable, there is no issue of class imbalance, as we have reasonable proportions in both classes.

# Capture the target column ("Claimed") into separate vectors for the training set and test set
# Split the data into training and test sets for the independent attributes

The shape or dimension of the train and test data is derived as below
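
A sketch of the split; the 70/30 ratio and random_state are assumptions, the exact values are in the notebook.

```python
from sklearn.model_selection import train_test_split

# Separate the target column from the independent attributes
X = ins_df.drop("Claimed", axis=1)
y = ins_df["Claimed"]

# 70/30 train-test split (ratio and random_state assumed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

print(X_train.shape, X_test.shape)
```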

Building classification model CART.

The Decision Tree Classifier is built and the parameters of the grid are given in the Jupyter notebook; the necessary libraries for Decision Tree, Random Forest and ANN are imported in the initial step of the code. Gini impurity measures the divergence between the probability distributions of the target attribute’s values, and the tree splits a node such that it gives the least amount of impurity.

A tree is generated and the same can be visualized using http://webgraphviz.com/

As we can see, the decision tree has been overgrown and too many branches have been grown. We need to prune the model.

test.dot

If you look at the graph, the tree is not regularized; its depth is quite large and it will not predict well. Using the grid search method, we find the best parameters and best estimator and generate the tree again. The new tree can be visualized in the same way, and you will find only a minimal number of leaves.

tree_regularized.dot
For the best parameters we obtain the best model (best grid) and use it on the training and test data sets.
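
A sketch of the regularised CART step: a DecisionTreeClassifier tuned with GridSearchCV; the parameter grid shown here is illustrative, the actual grid values are in the Jupyter notebook.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Illustrative grid; the actual values are given in the notebook
param_grid = {
    "max_depth": [4, 6, 8, 10],
    "min_samples_leaf": [10, 20, 50],
    "min_samples_split": [30, 60, 150],
}

dt = DecisionTreeClassifier(criterion="gini", random_state=1)
grid_search = GridSearchCV(dt, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best parameters and the regularised tree (best grid)
print(grid_search.best_params_)
best_grid = grid_search.best_estimator_

# Predictions with the regularised tree
ytrain_predict = best_grid.predict(X_train)
ytest_predict = best_grid.predict(X_test)
```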

After that, we can generate the classification report for the training and test data as below.

If you look at the classification reports for the test and train data sets, the accuracy is 75%, while the baseline accuracy is 68%. The model is performing better than the baseline accuracy.

Recall % and F1 % are higher for the train set than the test set. The test set performance is within 10% of the train set, so there is no overfitting or underfitting.
Building a Random Forest Classifier

Similar to the CART model, we use a classifier to generate the Random Forest, as given below.

In Random Forest, if we have too many arguments in the grid, the model will run for a longer time; we can change the arguments and rerun. Once the model run is completed, we fit the model and then get the best parameters and estimator.

Once the model is run, we need to predict on the test and train data.
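
A sketch of the Random Forest grid search and the subsequent predictions; the grid values here are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; a larger grid makes the search run much longer
rf_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 7, 9],
    "max_features": [3, 4, 5],
    "min_samples_leaf": [10, 20],
}

rf = RandomForestClassifier(random_state=1)
rf_search = GridSearchCV(rf, rf_grid, cv=5)
rf_search.fit(X_train, y_train)

# Best estimator and predictions on train and test data
best_rf = rf_search.best_estimator_
ytrain_predict_rf = best_rf.predict(X_train)
ytest_predict_rf = best_rf.predict(X_test)
```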

Now we can generate the classification report for both the test and train data sets.
Looking at the classification reports, the accuracy is 77%, while the baseline accuracy is 68%. The model is performing better than the baseline accuracy.

Recall % and F1 % are higher for the train set than the test set. The test set performance is within 10% of the train set, so there is no overfitting or underfitting.

Building a Neural Network Classifier


Similar to the other models, we create the MLP grid values and get the best parameter and best estimator values as below.

Now, by using the best grid parameters, we can predict on the test and train data.
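
A sketch of the neural network (MLPClassifier) grid search and predictions; the hidden layer sizes and other grid values are illustrative.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Illustrative grid for the multilayer perceptron
mlp_grid = {
    "hidden_layer_sizes": [(50,), (100,)],
    "activation": ["relu", "tanh"],
    "max_iter": [2500],
    "tol": [0.01],
}

mlp = MLPClassifier(random_state=1)
mlp_search = GridSearchCV(mlp, mlp_grid, cv=5)
mlp_search.fit(X_train, y_train)

# Best estimator and predictions on train and test data
best_mlp = mlp_search.best_estimator_
ytrain_predict_nn = best_mlp.predict(X_train)
ytest_predict_nn = best_mlp.predict(X_test)
```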

We generate the classification report for both the train and test models as below.
Looking at the classification reports for the test and train data sets, the accuracy is 73%, while the baseline accuracy is 68%. The model is performing better than the baseline accuracy.

Recall % and F1 % are higher for the train set than the test set. However, the recall is very low compared to the other models and is also less than 0.5.

2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model
CART:

The decision tree also tells you the variable importance.

AUC/ROC for the training data is given below; the ROC curve plots the TP rate against the FP rate.
AUC/ROC for the testing data is given below.

Confusion matrix for the training data is given below


By looking at the above results, we can say the number of false negatives is 263 for the training data.

Confusion matrix for the testing data is given below

By looking at the above results, we can say the number of false negatives is 116 for the testing data.
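
A sketch of how these metrics are obtained, shown for the CART classifier on the test set; the same pattern applies to the train set and to the other models.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)

# Classification report and confusion matrix for the test set
print(classification_report(y_test, ytest_predict))
print(confusion_matrix(y_test, ytest_predict))

# ROC curve and AUC need predicted probabilities of the positive class
probs = best_grid.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
print("AUC:", roc_auc_score(y_test, probs))

plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```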

CART Conclusion

By looking at the confusion matrix, AUC/ROC curve and classification report we can derive the below

Train Data: AUC: 80%, Accuracy: 77%, Precision: 67%, F1-Score: 59%, Recall: 53%

Test Data: AUC: 75%, Accuracy: 75%, Precision: 61%, F1-Score: 55%, Recall: 77%

Training and test set results are almost similar, and with the overall measures being moderate, the model is a good model. Recall is good for the testing data. The baseline accuracy on our data is 68%.

Random Forest

AUC/ROC for Training data is given below

AUC/ROC for testing data is given below


Confusion matrix for the training data

Confusion matrix for testing data

Random Forest Conclusion

Train Data: Accuracy: 77%, Precision: 69%, F1-Score: 57%, Recall: 49%

Test Data: Accuracy: 76%, Precision: 64%, F1-Score: 55%, Recall: 48%

Training and test set results are almost similar, with the overall measures being moderate. Recall % is lower compared to CART; this indicates that the RF is not using the best parameters.

Neural Networks

AUC/ROC for Training data is given below


AUC/ROC for Testing data is given below

Confusion matrix for training data set

Confusion matrix for testing data set

Neural Network Conclusion

Train Data: AUC: 76%, Accuracy: 73%, Precision: 68%, F1-Score: 40%, Recall: 29%

Test Data: AUC: 75%, Accuracy: 74%, Precision: 65%, F1-Score: 40%, Recall: 29%
Training and test set results are almost similar, with the overall measures being moderate. Recall and F1 % are very poor compared to the other models. The model is not suited for the current data set.

2.4 Final Model: Compare all the model and write an inference which model is
best/optimized.
Comparing the 3 models, the results on the training and testing data are given below.
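
A sketch of how the comparison can be assembled into a single table, shown here for the test set (the train-set table follows the same pattern); the model and prediction names follow the earlier sketches.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score

models = {
    "CART": (best_grid, ytest_predict),
    "Random Forest": (best_rf, ytest_predict_rf),
    "Neural Network": (best_mlp, ytest_predict_nn),
}

rows = []
for name, (model, y_pred) in models.items():
    probs = model.predict_proba(X_test)[:, 1]
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1": f1_score(y_test, y_pred),
        "AUC": roc_auc_score(y_test, probs),
    })

comparison = pd.DataFrame(rows).set_index("Model").round(2)
print(comparison)
```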
We are able to see that the Recall and F1 percentages are good for CART and RF compared to the Neural Network, and are also above 0.5. For ANN the value is less than 0.5, which is not good; ANN is not a good model for this data set.

When comparing CART and RF, we get a better result using a single decision tree than by generating multiple trees. Since we are getting good results with a minimal tree, we go for CART for the given data set and avoid Random Forest as well.

2.5 Inference: Basis on these predictions, what are the business insights and
recommendations

The CART model has relatively high Recall and F1 score compared to the other models, while Precision and AUC are higher for Random Forest. Recall is the preferred measure here overall; accuracy is a useful check when the target column is balanced. Hence, as a conclusion, CART seems to be the better model.

Channel and Type seem to be the major factors on which the business needs to concentrate. These two factors play a primary role in predicting future claims.
