
Data Mining Project

Anandarangan K
Data Science and Business Administration
Great Learning

Contents:

1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis).
1.2 Do you think scaling is necessary for clustering in this case? Justify.
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram and briefly describe them.
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters.
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters.
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis).
2.2 Data Split: Split the data into test and train, build classification models CART, Random Forest, Artificial Neural Network.
2.3 Performance Metrics: Comment and check the performance of predictions on train and test sets using Accuracy, Confusion Matrix, ROC curve, ROC_AUC score and classification reports for each model.
2.4 Final Model: Compare all the models and write an inference on which model is best/optimized.
2.5 Inference: Based on the whole analysis, what are the business insights and recommendations?

List of Figures

Summary of bank dataset
Univariate analysis of bank dataset
Bivariate analysis of bank dataset
Correlation chart – Bank data
Before scaling & After Scaling plot
Dendrogram – Hierarchical Clustering
Elbow curve – Hierarchical Clustering
Univariate Analysis of Insurance data
Bivariate Analysis of Insurance data
Correlation Chart – Insurance data
Decision Tree – Random Forest Model

List of Tables:

CART model output
Random Forest Model Output
Artificial Neural Network Model Output
Model Comparison Table

Problem 1: Clustering
A leading bank wants to develop a customer segmentation to give promotional offers
to its customers. They collected a sample that summarizes the activities of users
during the past few months. You are given the task to identify the segments based
on credit card usage.
1.1 Read the data and do exploratory data analysis (3 pts). Describe the data
briefly. Interpret the inferences for each (3 pts). Initial steps like head(), info(),
data types, etc. Null value check. Distribution plots (histogram) or similar
plots for the continuous columns. Box plots, correlation plots. Appropriate
plots for categorical variables. Inferences on each plot. Summary stats,
skewness and outlier proportion should be discussed, and inferences from the
above plots should be there. There is no restriction on how the learner
wishes to implement this, but the code should be able to represent the correct
output and the inferences should be logical and correct.

Solution

The figure below shows the descriptive summary of the dataset.

The data looks evenly distributed, with the spending variable having the maximum standard
deviation.

As per the dataset there are 7 variables, all of numerical datatype, with 210 entries and no null
values.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 spending 210 non-null float64
1 advance_payments 210 non-null float64
2 probability_of_full_payment 210 non-null float64
3 current_balance 210 non-null float64
4 credit_limit 210 non-null float64
5 min_payment_amt 210 non-null float64
6 max_spent_in_single_shopping 210 non-null float64
dtypes: float64(7)
memory usage: 11.6 KB
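
A minimal sketch of how this summary can be reproduced is given below; the file name bank_marketing_part1_Data.csv is an assumption and should be replaced with the actual dataset path.

import pandas as pd

# Assumed file name for the bank dataset.
df = pd.read_csv("bank_marketing_part1_Data.csv")

print(df.head())            # first five rows
df.info()                   # data types and non-null counts
print(df.describe().T)      # descriptive summary (count, mean, std, quartiles)
print(df.isnull().sum())    # null value check
print(df.skew())            # skewness of each variable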

Univariate analysis

We are using boxplot and displot for univariate analysis.
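
A sketch of how these plots can be generated with seaborn, assuming the dataframe df from the snippet above:

import matplotlib.pyplot as plt
import seaborn as sns

# One box plot and one distribution plot per numeric column.
for col in df.columns:
    fig, axes = plt.subplots(1, 2, figsize=(10, 3))
    sns.boxplot(x=df[col], ax=axes[0])
    sns.histplot(df[col], kde=True, ax=axes[1])   # axes-level equivalent of displot
    axes[0].set_title(f"Box plot of {col}")
    axes[1].set_title(f"Distribution of {col}")
    plt.tight_layout()
    plt.show()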

Observation
 Distribution is right-tailed for 6 variables.
 Probability_of_full_payment is left-tailed.
 We find outliers in probability_of_full_payment and min_payment_amt: 3 outliers in
probability_of_full_payment and 2 outliers in min_payment_amt.
 Since the proportion of outliers is very small, we can ignore them.
 The skewness of the different variables is given below.

probability_of_full_payment -0.537954
credit_limit 0.134378
advance_payments 0.386573
spending 0.399889
min_payment_amt 0.401667
current_balance 0.525482
max_spent_in_single_shopping 0.561897

Bivariate analysis

Bivariate analysis on all numeric data is performed through a pair plot.
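
A sketch of the pair plot and the correlation heatmap used below, again assuming the dataframe df:

import matplotlib.pyplot as plt
import seaborn as sns

# Pair plot of all numeric variables.
sns.pairplot(df, diag_kind="kde")
plt.show()

# Correlation heatmap.
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation chart - Bank data")
plt.show()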

We can see that the distribution is skewed towards the right tail for all the variables except
probability_of_full_payment, which has a left tail.

The above heatmap shows the correlation between all the numeric variables of the dataset.
We see strong positive correlation between the following pairs of variables.

o spending & advance_payments,

o advance_payments & current_balance,

o credit_limit & spending

o spending & current_balance

o credit_limit & advance_payments

o max_spent_in_single_shopping & current_balance

1.2 Do you think scaling is necessary for clustering in this case? Justify. The
learner is expected to check and comment on the difference in scale of
different features on the basis of appropriate measures, for example std dev,
variance, etc. Should justify whether there is a necessity for scaling and which
method he/she is using to do the scaling. Can also comment on how that
method works.

 As per the descriptive summary of the data, the values of the variables are in
different ranges.
 The variables spending and advance_payments have significantly higher values
compared to the other variables.
 Hence scaling is needed.
 Once scaling is done, the data will be ready for further analysis as all variables come
under the same range (-3 to +3).

Figure: Before scaling vs. after scaling plot

 We are using the z-score method to standardize the data.


 After standardization using z-score the data looks like this
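
A minimal sketch of the z-score standardization step, assuming the bank data is held in the dataframe df from the earlier snippet:

from scipy.stats import zscore

# z-score standardization: (x - mean) / std for every column.
scaled_df = df.apply(zscore)
print(scaled_df.describe().T)   # means ~0 and std ~1 after scaling

# Equivalent with scikit-learn:
# from sklearn.preprocessing import StandardScaler
# import pandas as pd
# scaled_df = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)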

1.3 Apply hierarchical clustering to scaled data (3 pts). Identify the number of
optimum clusters using Dendrogram and briefly describe them (4). Students
are expected to apply hierarchical clustering. It can be obtained via Fclusters
or Agglomerative Clustering. Report should talk about the used criterion,
affinity and linkage. Report must contain a Dendrogram and a logical reason
behind choosing the optimum number of clusters and Inferences on the
dendrogram. Customer segmentation can be visualized using limited features
or whole data but it should be clear, correct and logical. Use appropriate plots
to visualize the clusters.
 The necessary modules dendrogram and linkage are imported from the scipy.cluster.hierarchy
package.
 The linkage method is chosen as ward.
 The dendrogram is formed and truncated at p = 25.

 As per the dendrogram, the optimal number of clusters formed is 3.


 The cluster array is created using the fcluster module with criterion 'maxclust' and number
of clusters 3.
array([1, 3, 1, 2, 1, 2, 2, 3, 1, 2, 1, 3, 2, 1, 3, 2, 3, 2, 3, 2, 2, 2,
1, 2, 3, 1, 3, 2, 2, 2, 3, 2, 2, 3, 2, 2, 2, 2, 2, 1, 1, 3, 1, 1,
2, 2, 3, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 3, 2, 2, 3, 3, 1,
1, 3, 1, 2, 3, 2, 1, 1, 2, 1, 3, 2, 1, 3, 3, 3, 3, 1, 2, 3, 3, 1,
1, 2, 3, 1, 3, 2, 2, 1, 1, 1, 2, 1, 2, 1, 3, 1, 3, 1, 1, 2, 2, 1,
3, 3, 1, 2, 2, 1, 3, 3, 2, 1, 3, 2, 2, 2, 3, 3, 1, 2, 3, 3, 2, 3,
3, 1, 2, 1, 1, 2, 1, 3, 3, 3, 2, 2, 3, 2, 1, 2, 3, 2, 3, 2, 3, 3,
3, 3, 3, 2, 3, 1, 1, 2, 1, 1, 1, 2, 1, 3, 3, 3, 3, 2, 3, 1, 1, 1,
3, 3, 1, 2, 3, 3, 3, 3, 1, 1, 3, 3, 3, 2, 3, 3, 2, 1, 3, 1, 1, 2,
1, 2, 3, 1, 3, 2, 1, 3, 1, 3, 1, 3], dtype=int32)

 The number of observations under each cluster is given below.
3 73
1 70
2 67
Name: cluster, dtype: int64

These 3 clusters can be interpreted as a pattern of high, medium and low usage of credit
cards.
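
A sketch of the hierarchical clustering steps described above (Ward linkage, truncated dendrogram, fcluster with the maxclust criterion), assuming the scaled dataframe scaled_df:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Ward linkage on the scaled data.
wardlink = linkage(scaled_df, method="ward")

# Truncated dendrogram showing the last 25 merged clusters.
dendrogram(wardlink, truncate_mode="lastp", p=25)
plt.title("Dendrogram - Hierarchical Clustering")
plt.show()

# Cut the tree into 3 clusters and attach the labels to the original data.
df["cluster"] = fcluster(wardlink, t=3, criterion="maxclust")
print(df["cluster"].value_counts())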

1.4 Apply K-Means clustering on scaled data and determine optimum
clusters (2 pts). Apply elbow curve and silhouette score (3 pts). Interpret
the inferences from the model (2.5 pts). K-means clustering code
application with different number of clusters. Calculation of WSS(inertia for
each value of k) Elbow Method must be applied and visualized with
different values of K. Reasoning behind the selection of the optimal value
of K must be explained properly. Silhouette Score must be calculated for
the same values of K taken above and commented on. Report must contain
logical and correct explanations for choosing the optimum clusters using
both elbow method and silhouette scores. Append cluster labels obtained
from K-means clustering into the original data frame. Customer
Segmentation can be visualized using appropriate graphs.

 K-means clustering is performed on the scaled data with the help of the sklearn.cluster
package.
 K-means for different numbers of clusters is performed to find the corresponding inertia values.
 There is a significant drop in the inertia value as the number of clusters increases.
 Based on the within-cluster sum of squares (WSS/inertia) values, the optimal number of clusters is found
to be 3.

 Cluster labels are attached to the scaled dataframe.


 The silhouette score is calculated using the sklearn.metrics package.
 The silhouette coefficient drops as the number of clusters increases up to 4.

 Hence the optimal number of clusters should be 3 or 4.

 We are going ahead with 3 clusters since the three-cluster solution maps conveniently to the
pattern of high, medium and low spending.

 Silhouette width is calculated and appended to the scaled dataframe.


 Composition of different clusters in the dataframe is given below.
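
A sketch of the K-means run, elbow curve and silhouette score computation described above, assuming the scaled dataframe scaled_df:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

inertia, sil_scores = [], []
k_values = range(2, 11)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=1, n_init=10)
    labels = km.fit_predict(scaled_df)
    inertia.append(km.inertia_)                           # within-cluster sum of squares
    sil_scores.append(silhouette_score(scaled_df, labels))

# Elbow curve.
plt.plot(list(k_values), inertia, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WSS / inertia")
plt.title("Elbow curve - K-Means")
plt.show()

print(dict(zip(k_values, sil_scores)))                    # silhouette score per k

# Final model with the chosen k = 3; labels appended to the original dataframe.
km3 = KMeans(n_clusters=3, random_state=1, n_init=10)
df["kmeans_cluster"] = km3.fit_predict(scaled_df)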

1.5 Describe cluster profiles for the clusters defined (2.5 pts). Recommend
different promotional strategies for different clusters in the context of the
business problem in hand (2.5 pts). After adding the final clusters to the
original dataframe, do the cluster profiling. Divide the data into the finalized
groups and check their variable means. Explain each of the groups briefly. There should
be at least 3-4 recommendations. Recommendations should be easily
understandable and business specific; students should not give any technical
suggestions. Full marks will only be allotted if the recommendations are
correct and business specific. Students are expected to explain the profiles
and suggest a mechanism to approach each cluster. Any logical explanation is
acceptable.
Cluster profiles
o Cluster 1 : Medium Spending
o Cluster 2 : Low Spending
o Cluster 3 : High Spending
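
A sketch of the cluster profiling via group means, assuming the K-means labels were appended to the original dataframe as kmeans_cluster:

# Mean of every variable within each cluster, plus cluster sizes.
profile = df.groupby("kmeans_cluster").mean()
profile["count"] = df["kmeans_cluster"].value_counts()
print(profile.round(2))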

Recommendations
Cluster 3 – High Spending
1. Can be offered loan options as they have the maximum credit limit.
2. Discounts can be provided on their purchases as their spending is high.
3. Increase their loyalty to our card by providing personalised offers.
4. The possibility of them holding other credit cards is considerably high. Hence competitive
strategies for customer retention are needed.
5. Max_spent_in_single_shopping is high, so expensive and branded products should be
targeted with customized benefits.

Cluster 1 – Medium Spending


1. They could be middle-class customers who pay their regular bills with their card.
Hence providing them cashbacks and discounts would be valuable to maintain
continuous transactions with our card.
2. Grocery stores and shopping malls – point-of-sale tie-ups should be considered for providing
rewards to regular customers.
3. A credit-card-based EMI option for payment of insurance can be implemented to add
value to the business.
4. Reward points for fuel transactions.
5. Increasing the credit limit and reducing the interest rate.

Cluster 2 – Low Spending

1. Early payment should be encouraged through offers.
2. Remind them of their pending payments.
3. Discounts on common transactions to keep them active in using the card for day-to-day
transactions.

Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim
frequency. The management decides to collect data from the past few
years. You are assigned the task to make a model which predicts the claim
status and provide recommendations to management. Use CART, RF &
ANN and compare the models' performances in train and test sets.
2.1 Read the data and do exploratory data analysis (4 pts). Describe the data
briefly. Interpret the inferences for each (2 pts). Initial steps like head(), info(),
data types, etc. Null value check. Distribution plots (histogram) or similar
plots for the continuous columns. Box plots, correlation plots. Appropriate
plots for categorical variables. Inferences on each plot. Summary stats,
skewness and outlier proportion should be discussed, and inferences from the
above plots should be there. There is no restriction on how the learner
wishes to implement this, but the code should be able to represent the correct
output and the inferences should be logical and correct.
Outlier Detection
Continuous variables are checked for outliers and the output is given below.
Age
Lower outliers : -53.5
Upper outliers : 142.5
Number of outliers : 198
Number of outliers : 0
% of Outlier : 23 %
% of Outlier : 0 %

Commision
Lower outliers : -53.5
Upper outliers : 142.5
Number of outliers : 362
Number of outliers : 2478
% of Outlier : 12 %
% of Outlier : 0 %

Duration
Lower outliers : -53.5
Upper outliers : 142.5
Number of outliers : 382
Number of outliers : 2293
% of Outlier : 34 %
% of Outlier : 0 %

Sales
Lower outliers : -53.5
Upper outliers : 142.5
Number of outliers : 353
Number of outliers : 2050
% of Outlier : 38 %
% of Outlier : 0 %

The outliers are ignored (left untreated) from a business point of view, since the Sales and Commision
variables can take extreme values at times depending on the nature of the business.
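
A sketch of how such an IQR-based outlier check can be produced, assuming the insurance data is loaded into ins_df (the file name is an assumption):

import pandas as pd

ins_df = pd.read_csv("insurance_part2_data.csv")   # assumed file name

def outlier_summary(series):
    # IQR bounds and the number of points outside them.
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    below = (series < lower).sum()
    above = (series > upper).sum()
    return lower, upper, below, above

for col in ["Age", "Commision", "Duration", "Sales"]:
    lower, upper, below, above = outlier_summary(ins_df[col])
    print(f"{col}: bounds [{lower:.1f}, {upper:.1f}], "
          f"outliers below = {below}, above = {above} "
          f"({(below + above) / len(ins_df):.1%} of rows)")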
Univariate Analysis
Distribution plot and box plot for numeric variables are generated

Bivariate Analysis

The dataset contains the maximum number of Customised Plan products.

The median number of insurance products sold and claimed under the Gold and Silver plans is higher
than for the other plans.

The median sales numbers remain more or less constant among the different agencies.

o There is a steady increase in the commission as the sales increase.
o Age has no significant relationship with the commission, duration and sales variables.
o The correlation coefficients are also very low.
o The heat map explains the correlation between the various numeric variables.

2.2 Data Split: Split the data into test and train(1 pts), build classification
model CART (1.5 pts), Random Forest (1.5 pts), Artificial Neural Network(1.5
pts). Object data should be converted into categorical/numerical data to fit in
the models. (pd.categorical().codes(), pd.get_dummies(drop_first=True)) Data
split, ratio defined for the split, train-test split should be discussed. Any
reasonable split is acceptable. Use of random state is mandatory. Successful
implementation of each model. Logical reason behind the selection of different
values for the parameters involved in each model. Apply grid search for each
model and make models on best_params. Feature importance for each model.

Decision Tree
 Decision tree is a white box classifier as we can see the logic behind the classification.
 The data is first split into test and train data.

 The model is initialized and we use grid search to get the optimal parameters to build the
model.
 Here we give a list of values for each parameter (criterion, max_depth,
min_samples_leaf, min_samples_split) for the grid search.
 The best fit for the decision tree is determined using the GridSearchCV module and in this
case it is found as follows,

{'criterion': 'gini', 'max_depth': 4.85, 'min_samples_leaf': 44, 'min_samples_split': 260}

 Classification is performed based on the gini criterion.
 min_samples_leaf denotes how many observations need to be present in each terminal
node (leaf) of the tree.
 min_samples_split denotes the minimum number of records that should be present before a
split happens in the decision tree.
 The optimized classification model is generated and fit on the training data.
 In order to visualize the tree we create a dot file.
 The data generated in the dot file is entered into a dot-file processing website to get the
visualization of the model in decision tree format.
 The decision tree we obtain from the dot file is given below.
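
A sketch of the train-test split and the grid search over the decision tree parameters, assuming the insurance data in ins_df and a target column named Claimed (the column name and grid values are assumptions):

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Object columns converted to dummy variables; 'Claimed' assumed to be the target.
X = pd.get_dummies(ins_df.drop("Claimed", axis=1), drop_first=True)
y = ins_df["Claimed"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 4, 5, 6],
    "min_samples_leaf": [20, 44, 60],
    "min_samples_split": [60, 150, 260],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)

best_tree = grid.best_estimator_    # tuned CART model used for predictions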

Random Forest

 Applying a random forest typically improves on the accuracy of a single decision tree.

 We use the GridSearchCV module to tune the hyperparameters for better model accuracy.
 The grid is defined as param_grid_rfcl with parameters max_depth,
max_features, min_samples_leaf, min_samples_split and n_estimators.
 The random forest object is called within the GridSearchCV module imported from the
sklearn.model_selection package.
 The cross-validation value cv is set as 3.
 Grid search finds the optimal parameter values for accuracy, stored in best_params_.
{'max_depth': 6, 'max_features': 3, 'min_samples_leaf': 8,
'min_samples_split': 46, 'n_estimators': 350}

 The best_estimator_ attribute gives the random forest classifier with the best accuracy.
RandomForestClassifier(max_depth=6, max_features=3, min_samples_leaf=8,
min_samples_split=46, n_estimators=350,
random_state=1)

 Prediction is done on training data and testing data.


 Predictions can also be viewed as probabilities as given below.

 After prediction we can say that the random forest classifier model is built successfully.
 For model evaluation we use confusion matrix and classification report on training and
testing data.
 For a clearer picture of performance we use the ROC curve, and the area under the curve (AUC)
score summarizes the accuracy of the model.
 The significance of the different variables is evaluated through feature_importances_.
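
A sketch of the tuned random forest, its feature importances and the AUC calculation, assuming the train/test split from the CART sketch above (the grid values are assumptions):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

param_grid_rfcl = {
    "max_depth": [4, 6, 8],
    "max_features": [3, 4, 5],
    "min_samples_leaf": [8, 10],
    "min_samples_split": [30, 46, 60],
    "n_estimators": [100, 350],
}
grid_rf = GridSearchCV(RandomForestClassifier(random_state=1), param_grid_rfcl, cv=3)
grid_rf.fit(X_train, y_train)
best_rf = grid_rf.best_estimator_

# Feature importance of the tuned forest.
importances = pd.Series(best_rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))

# AUC on train and test sets; column 1 of predict_proba is the positive class.
print("Train AUC:", roc_auc_score(y_train, best_rf.predict_proba(X_train)[:, 1]))
print("Test AUC:", roc_auc_score(y_test, best_rf.predict_proba(X_test)[:, 1]))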

Artificial Neural Network
 Being a black-box algorithm, we cannot see which values are tried for the biases and weights
during the trial-and-error process of reaching the optimal values.
 The neural network model is built through the MLPClassifier function
(Multi-layer Perceptron Classifier).
 The MLPClassifier is tuned through the GridSearchCV function and fit on the training data
to get the optimal parameters.
MLPClassifier(hidden_layer_sizes=200, max_iter=2500, random_state=1,
solver='sgd', tol=0.01)
 Train and test data are used in the model for prediction.
 Prediction probabilities are calculated.
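
A sketch of the MLPClassifier tuning described above, assuming the same train/test split; inputs are standardized first since neural networks are sensitive to feature scale (the grid values are assumptions):

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Standardize the inputs for the neural network.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

param_grid_nn = {
    "hidden_layer_sizes": [100, 200],
    "max_iter": [2500],
    "solver": ["sgd", "adam"],
    "tol": [0.01],
}
grid_nn = GridSearchCV(MLPClassifier(random_state=1), param_grid_nn, cv=3)
grid_nn.fit(X_train_s, y_train)
best_nn = grid_nn.best_estimator_

# Predictions and predicted probabilities on train and test data.
train_pred = best_nn.predict(X_train_s)
test_pred = best_nn.predict(X_test_s)
test_proba = best_nn.predict_proba(X_test_s)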

2.3 Performance Metrics: Check the performance of Predictions on Train and
Test sets using Accuracy (1 pts), Confusion Matrix (2 pts), Plot ROC curve and
get ROC_AUC score for each model (2 pts), Make classification reports for
each model. Write inferences on each model (2 pts). Calculate Train and Test
Accuracies for each model. Comment on the validness of models (overfitting
or underfitting). Build a confusion matrix for each model. Comment on the
positive class in hand. Must clearly show obs/pred in row/col. Plot roc_curve
for each model. Calculate roc_auc_score for each model. Comment on the
above calculated scores and plots. Build classification reports for each model.
Comment on f1 score, precision and recall, and which one is important here.

Model Evaluation
It is done through the following techniques
1. Confusion Matrix – It evaluates the performance of the model in four blocks as a 2X2
matrix
2. ROC curve – Receiver Operating Characteristic curve – This is a graph generated
with True positive rate on y-axis and False positive rate on x-axis
3. AUC – Area Under the Curve – The higher the AUC score, better will be the
performance of the model.
4. Accuracy score – the rate of accuracy of the model calculated using test data and
predictions.
5. Classification report – it is a measure of the quality of the predictions.

The ways to check the validity of the predictions are:

o TN / True Negative: when a case was negative and predicted negative


o TP / True Positive: when a case was positive and predicted positive
o FN / False Negative: when a case was positive but predicted negative
o FP / False Positive: when a case was negative but predicted positive

- Recall value is the ability of the classifier to find all positive instances.
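
A sketch of how these metrics can be computed for any of the fitted models, assuming the tuned estimators from the earlier sketches (e.g. best_tree) and the same train/test split:

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_curve, roc_auc_score)

def evaluate(model, X, y, label):
    # Print accuracy, confusion matrix, AUC and the classification report,
    # and add the ROC curve for this split to the current plot.
    pred = model.predict(X)
    proba = model.predict_proba(X)[:, 1]            # probability of the positive class
    print(f"--- {label} ---")
    print("Accuracy:", accuracy_score(y, pred))
    print("Confusion matrix (rows = observed, columns = predicted):")
    print(confusion_matrix(y, pred))
    print("AUC:", roc_auc_score(y, proba))
    print(classification_report(y, pred))
    fpr, tpr, _ = roc_curve(y, proba, pos_label=model.classes_[1])
    plt.plot(fpr, tpr, label=label)

evaluate(best_tree, X_train, y_train, "CART - Train")
evaluate(best_tree, X_test, y_test, "CART - Test")
plt.plot([0, 1], [0, 1], "k--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()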

CART Model Output

ROC curve: (plots for train and test data)

AUC score
Train data: 0.823
Test data: 0.801

Confusion Matrix
Train data: array([[1309, 144],
                   [ 307, 340]], dtype=int64)
Test data:  array([[553,  70],
                   [136, 141]], dtype=int64)

Accuracy
Train data: 0.7852380952380953
Test data: 0.7711111111111111

Classification Report – Train data
              precision    recall  f1-score   support
           0       0.81      0.90      0.85      1453
           1       0.70      0.53      0.60       647
    accuracy                           0.79      2100
   macro avg       0.76      0.71      0.73      2100
weighted avg       0.78      0.79      0.78      2100

Classification Report – Test data
              precision    recall  f1-score   support
           0       0.80      0.89      0.84       623
           1       0.67      0.51      0.58       277
    accuracy                           0.77       900
   macro avg       0.74      0.70      0.71       900
weighted avg       0.76      0.77      0.76       900

Random Forest Model Output

ROC curve: (plots for train and test data)

AUC score
Train data: 0.8563713512840778
Test data: 0.8181994657271499

Confusion Matrix
Train data: array([[1297, 156],
                   [ 255, 392]], dtype=int64)
Test data:  array([[550,  73],
                   [121, 156]], dtype=int64)

Accuracy
Train data: 0.8042857142857143
Test data: 0.7844444444444445

Classification Report – Train data
              precision    recall  f1-score   support
           0       0.84      0.89      0.86      1453
           1       0.72      0.61      0.66       647
    accuracy                           0.80      2100
   macro avg       0.78      0.75      0.76      2100
weighted avg       0.80      0.80      0.80      2100

Classification Report – Test data
              precision    recall  f1-score   support
           0       0.82      0.88      0.85       623
           1       0.68      0.56      0.62       277
    accuracy                           0.78       900
   macro avg       0.75      0.72      0.73       900
weighted avg       0.78      0.78      0.78       900

Artificial Neural Network Model Output

ROC curve: (plots for train and test data)

AUC score
Train data: 0.7790697921796932
Test data: 0.7587891360657353

Confusion Matrix
Train data: array([[1340, 113],
                   [ 396, 251]], dtype=int64)
Test data:  array([[576,  47],
                   [171, 106]], dtype=int64)

Accuracy
Train data: 0.7576190476190476
Test data: 0.7577777777777778

Classification Report – Train data
              precision    recall  f1-score   support
           0       0.77      0.92      0.84      1453
           1       0.69      0.39      0.50       647
    accuracy                           0.76      2100
   macro avg       0.73      0.66      0.67      2100
weighted avg       0.75      0.76      0.73      2100

Classification Report – Test data
              precision    recall  f1-score   support
           0       0.77      0.92      0.84       623
           1       0.69      0.38      0.49       277
    accuracy                           0.76       900
   macro avg       0.73      0.65      0.67       900
weighted avg       0.75      0.76      0.73       900

2.4 Final Model - Compare all models on the basis of the performance metrics
in a structured tabular manner (2.5 pts). Describe on which model is
best/optimized (1.5 pts ). A table containing all the values of accuracies,
precision, recall, auc_roc_score, f1 score. Comparison between the different
models(final) on the basis of above table values. After comparison which
model suits the best for the problem in hand on the basis of different
measures. Comment on the final model.
Model Comparison Table

                    DTCL                 RFCL                 ANN
               Train    Test        Train    Test        Train    Test
Accuracy        78%      77%         80%      78%         75%      75%
Precision       70%      67%         72%      68%         69%      69%
Recall          53%      51%         61%      56%         39%      38%
f1 score        60%      58%         66%      62%         50%      49%
AUC score       82%      80%         86%      82%         77%      75%

 An ideal model should have high values for accuracy, precision, f1 score and area under the ROC
curve.

 Looking at the numbers we have arrived at after the Decision Tree, Random Forest and Artificial
Neural Network modelling, we can say that the Random Forest classifier model suits best for
representing the business and making predictions.

2.5 Based on your analysis and working on the business problem, detail out
appropriate insights and recommendations to help the management solve the
business objective. There should be at least 3-4 Recommendations and
insights in total. Recommendations should be easily understandable and
business specific, students should not give any technical suggestions. Full
marks should only be allotted if the recommendations are correct and
business specific.
The Random Forest model has better accuracy and f1 score. Hence I am
choosing the Random Forest classifier model to predict the claim status.

Recommendations
 More information is needed on online purchases of travel insurance made without touring-agent
commission; the insurance firm should consider these to reduce administrative costs on
sales and distribution.
 Purchasing power, the number of people travelling together, and the climate and season conditions of
the tours are needed for making more accurate predictions.
 The median sales of all agencies remain around the same numbers. Hence agencies must be
encouraged to market tour insurance by incentivizing them better with
agency commissions.
 Some agencies perform better, coming up with larger sales numbers at certain instances.
The reasons for that need to be found and implemented in other agencies to increase the
overall sales numbers for the product.
 Industry-specific risks such as business interruption and claim-handling costs must be
controlled through AI-powered service delivery.

----End-----

