Professional Documents
Culture Documents
Data Mining
Data Mining
Anandarangan K
Data Science and Business Administration
Great Learning
1|Page
Contents:
2|Page
List of Figures
3|Page
List of Tables:
4|Page
Problem 1: Clustering
A leading bank wants to develop a customer segmentation to give promotional offers
to its customers. They collected a sample that summarizes the activities of users
during the past few months. You are given the task to identify the segments based
on credit card usage.
1.1 Read the data and do exploratory data analysis (3 pts). Describe the data
briefly. Interpret the inferences for each (3 pts). Initial steps like head() .info(),
Data Types, etc . Null value check. Distribution plots(histogram) or similar
plots for the continuous columns. Box plots, Correlation plots. Appropriate
plots for categorical variables. Inferences on each plot. Summary stats,
Skewness, Outliers proportion should be discussed, and inferences from
above used plots should be there. There is no restriction on how the learner
wishes to implement this but the code should be able to represent the correct
output and inferences should be logical and correct.
Solution
The data looks evenly distributed with spending variable having the maximum standard
deviation.
As per the dataset there are 7 variables – all numerical datatype with 210 entries and no null
values.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 spending 210 non-null float64
1 advance_payments 210 non-null float64
2 probability_of_full_payment 210 non-null float64
3 current_balance 210 non-null float64
4 credit_limit 210 non-null float64
5 min_payment_amt 210 non-null float64
6 max_spent_in_single_shopping 210 non-null float64
dtypes: float64(7)
memory usage: 11.6 KB
5|Page
Univariate analysis
6|Page
7|Page
Observation
Distribution is right tailed for 6 variables.
Propability of full payment is left tailed.
We find outliers in propability_of_full_payment and min_payment_amt.
We find 3 outliers in probability_of_full_payment and 2 outliers in min_payment_amt
Since the proportion of outliers is very less we can ignore it.
The amount of skewness of the different variables are given below.
probability_of_full_payment -0.537954
credit_limit 0.134378
advance_payments 0.386573
spending 0.399889
min_payment_amt 0.401667
current_balance 0.525482
max_spent_in_single_shopping 0.561897
Bivariate analysis
8|Page
We could see that the distribution is skewed towards right tail for all the variables except
propability_of_full_payment variable having left tail.
9|Page
The above heatmap shows the correlation between all the numeric variables of the dataset.
We see strong positive correlation between the following pairs of variables.
10 | P a g e
1.2 Do you think scaling is necessary for clustering in this case? Justify The
learner is expected to check and comment about the difference in scale of
different features on the bases of appropriate measure for example std dev,
variance, etc. Should justify whether there is a necessity for scaling and which
method is he/she using to do the scaling. Can also comment on how that
method works.
As per the descriptive summary of the data, the values of the variables are at
different ranges.
Variables spending and advanced_payments have a significantly higher value
compared to other variables.
Hence scaling is needed.
Once scaling is done it will be ready for further analysis as all variables comes
under same range(-3 to +3).
11 | P a g e
1.3 Apply hierarchical clustering to scaled data (3 pts). Identify the number of
optimum clusters using Dendrogram and briefly describe them (4). Students
are expected to apply hierarchical clustering. It can be obtained via Fclusters
or Agglomerative Clustering. Report should talk about the used criterion,
affinity and linkage. Report must contain a Dendrogram and a logical reason
behind choosing the optimum number of clusters and Inferences on the
dendrogram. Customer segmentation can be visualized using limited features
or whole data but it should be clear, correct and logical. Use appropriate plots
to visualize the clusters.
Necessary modules dendrogram and linkage are imported from scipy.cluster.hierarchy
packages.
The linkage method is chosen as ward
The dendrogram is formed and cut with p value 25
The cluster number of the number of values under each cluster is given below
3 73
1 70
2 67
Name: cluster, dtype: int64
These kind of 3 clusters can be inferred as a pattern for high/medium/low usage of credit
cards.
12 | P a g e
1.4 Apply K-Means clustering on scaled data and determine optimum
clusters (2 pts). Apply elbow curve and silhouette score (3 pts). Interpret
the inferences from the model (2.5 pts). K-means clustering code
application with different number of clusters. Calculation of WSS(inertia for
each value of k) Elbow Method must be applied and visualized with
different values of K. Reasoning behind the selection of the optimal value
of K must be explained properly. Silhouette Score must be calculated for
the same values of K taken above and commented on. Report must contain
logical and correct explanations for choosing the optimum clusters using
both elbow method and silhouette scores. Append cluster labels obtained
from K-means clustering into the original data frame. Customer
Segmentation can be visualized using appropriate graphs.
13 | P a g e
We are going ahead with 3 clusters due to convenience of the pattern of spending
high, medium and low .
1.5 Describe cluster profiles for the clusters defined (2.5 pts). Recommend
different promotional strategies for different clusters in context to the
business problem in-hand (2.5 pts ). After adding the final clusters to the
original dataframe, do the cluster profiling. Divide the data in the finalyzed
groups and check their means. Explain each of the group briefly. There should
be at least 3-4 Recommendations. Recommendations should be easily
understandable and business specific, students should not give any technical
suggestions. Full marks will only be allotted if the recommendations are
correct and business specific. variable means. Students to explain the profiles
and suggest a mechanism to approach each cluster. Any logical explanation is
acceptable.
Cluster profiles
o Cluster 1 : Medium Spending
o Cluster 2 : Low Spending
o Cluster 3 : High Spending
14 | P a g e
Recommendations
Cluster 3 – High Spending
1. Can be offered with loan options as they have maximum credit limit.
2. Discounts can be provided on their purchases as their spending habits are high .
3. Increasing their business loyalty with our card by providing personalised offers.
4. Possibility of having other credit cards is considerably high. Hence considering
competitive strategies for customer retention is needed.
5. Max_spent_in_single_shopping is high. Expensive and branded products must be
targeted for customized benefits.
15 | P a g e
Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim
frequency. The management decides to collect data from the past few
years. You are assigned the task to make a model which predicts the claim
status and provide recommendations to management. Use CART, RF &
ANN and compare the models' performances in train and test sets.
2.1 Read the data and do exploratory data analysis (4 pts). Describe the data
briefly. Interpret the inferences for each (2 pts). Initial steps like head() .info(),
Data Types, etc . Null value check. Distribution plots(histogram) or similar
plots for the continuous columns. Box plots, Correlation plots. Appropriate
plots for categorical variables. Inferences on each plot. Summary stats,
Skewness, Outliers proportion should be discussed, and inferences from
above used plots should be there. There is no restriction on how the learner
wishes to implement this but the code should be able to represent the correct
output and inferences should be logical and correct.
Outlier Detection
Continous variables are checked for outliers and the output is given below
Age
Lower outliers : -53.5
Upper outliers : 142.5
Number of outliers : 198
Number of outliers : 0
% of Outlier : 23 %
% of Outlier : 0 %
Commision
Lower outliers : -53.5
Upper outliers : 142.5
Number of outliers : 362
Number of outliers : 2478
% of Outlier : 12 %
% of Outlier : 0 %
Duration
Lower outliers : -53.5
Upper outliers : 142.5
Number of outliers : 382
Number of outliers : 2293
% of Outlier : 34 %
% of Outlier : 0 %
Sales
Lower outliers : -53.5
Upper outliers : 142.5
Number of outliers : 353
Number of outliers : 2050
% of Outlier : 38 %
% of Outlier : 0 %
16 | P a g e
The outliers are ignored from a business point of view since Sales and Comission variables
can be extreme at times depending on nature of business.
Univariate Analysis
Distribution plot and box plot for numeric variables are generated
17 | P a g e
Bivariate Analysis
18 | P a g e
The median number of insurance products sold in Gold and Silver plan and claimed higher
than other plans
19 | P a g e
2.2 Data Split: Split the data into test and train(1 pts), build classification
model CART (1.5 pts), Random Forest (1.5 pts), Artificial Neural Network(1.5
pts). Object data should be converted into categorical/numerical data to fit in
the models. (pd.categorical().codes(), pd.get_dummies(drop_first=True)) Data
split, ratio defined for the split, train-test split should be discussed. Any
reasonable split is acceptable. Use of random state is mandatory. Successful
implementation of each model. Logical reason behind the selection of different
values for the parameters involved in each model. Apply grid search for each
model and make models on best_params. Feature importance for each model.
Decision Tree
Decision tree is a white box classifier as we can see the logic behind the classification.
The data is first split into test and train data.
20 | P a g e
The model is initialized and we use grid search to get the optimal parameters to build the
model.
Here we give a list of values under each parameters (criterion, max_depth,
min_sample_leaf,min_sample_split) for the grid search .
The best fit for the decision tree is determined using GridsearchCV module and in this
case it is found as follows,
Random Forest
21 | P a g e
We use gridsearch module to prune for better accuracy of the model.
Grid is defined as param_grid_rfcl with parameters max_depth,
max_features,min_samples_leaf, min_samples_split, n_estimators.
Random forest object called within GridSearchCV module imported from
sklearn.model_selection package.
Cross validation cv value is set as 3 .
Gridsearch finds the optimal parameter value for accuracy stored in best_params.
{'max_depth': 6, 'max_features': 3, 'min_samples_leaf': 8,
'min_samples_split': 46, 'n_estimators': 350}
best_estimator_ function shows the syntax for generating best accuracy model random
forest classifier.
RandomForestClassifier(max_depth=6, max_features=3, min_samples_leaf=8,
min_samples_split=46, n_estimators=350,
random_state=1)
After prediction we can say that the random forest classifier model is built successfully.
For model evaluation we use confusion matrix and classification report on training and
testing data.
For a more clear picture of accuracy we use the roc curve and the area under the curve
score explains the accuracy of the model.
The significance of different variables is evaluated through feature_importance_ .
22 | P a g e
Artificial Neural Network
Being a black box algorithm, we cant find what values are used in trial and error method
for bias and weights to get into the optimal value.
Building a neural network model is performed through MLPClassifier function
(Multilayer perception Classifier).
MLPClassifer model is fit into train and test data to get the optimal parameters through
GridSearchCV function.
MLPClassifier(hidden_layer_sizes=200, max_iter=2500, random_state=1,
solver='sgd', tol=0.01)
Train and Test data are used in the model for prediction.
Prediction propabilities are calculated.
23 | P a g e
or underfitting) Build confusion matrix for each model. Comment on the
positive class in hand. Must clearly show obs/pred in row/col Plot roc_curve
for each model. Calculate roc_auc_score for each model. Comment on the
above calculated scores and plots. Build classification reports for each model.
Comment on f1 score, precision and recall, which one is important here.
Model Evaluation
It is done through the following techniques
1. Confusion Matrix – It evaluates the performance of the model in four blocks as a 2X2
matrix
2. ROC curve – Receiver Operating Characteristic curve – This is a graph generated
with True positive rate on y-axis and False positive rate on x-axis
3. AUC – Area Under the Curve – The higher the AUC score, better will be the
performance of the model.
4. Accuracy score = rate of accuracy of the model calculated using test data and
predictions
5. Classification report – it is a measure for quality of The ways to check the validity of
the predictions are
- Recall value is the ability of the classifier to find all positive instances.
24 | P a g e
Confusion
Matrix array([[1309, 144], array([[553, 70],
[ 307, 340]], dtype=int64) [136, 141]], dtype=int64)
25 | P a g e
ROC curve
2.4 Final Model - Compare all models on the basis of the performance metrics
in a structured tabular manner (2.5 pts). Describe on which model is
best/optimized (1.5 pts ). A table containing all the values of accuracies,
precision, recall, auc_roc_score, f1 score. Comparison between the different
models(final) on the basis of above table values. After comparison which
model suits the best for the problem in hand on the basis of different
measures. Comment on the final model.
Model Comparison table
DTCL RFCL ANN
Train Data Test Data Train Data Test Data Train Data Test Data
Accuracy 78% 77% 80% 81% 75 75
Precision 70% 67% 72% 68% 69 69
Recall 53% 51% 61% 56% 39 38
f1 score 60% 58% 66% 62% 50 49
AUC score 82% 80% 86% 78% 77 75
An ideal model should have big numbers in accuracy, precision, f1 score and Area Under ROC
curve .
26 | P a g e
Looking at the numbers we have arrived at after Decision Tree , Random forest and Artificial
Neural Network Modelling we can say that Random Forest Classifier Model suits the best for
representing the business and making predictions.
2.5 Based on your analysis and working on the business problem, detail out
appropriate insights and recommendations to help the management solve the
business objective. There should be at least 3-4 Recommendations and
insights in total. Recommendations should be easily understandable and
business specific, students should not give any technical suggestions. Full
marks should only be allotted if the recommendations are correct and
business specific.
The Random forest model has better accuracy and f1 score. Hence I am
chosing the random forest classifier model to predict the claim status better.
Recommendations
We need more information on online purchases of travel insurance without touring agent
commission are to be considered by the insurance firm on reducing administrative cost on
sales and distribution.
Purchasing power, number of people travelling together, climate and season conditions of
the tours are needed for making more accurate predictions.
The median sales of all agencies remains around same numbers. Hence agencies must be
encouraged to run marketing on touring insurance by incentivizing them better with
agency commissions.
Some agencies perform better coming up with larger sales numbers at certain instances.
The reasons for that need to be found and implemented in other agencies to increase the
overall sales numbers for the product.
Industry specific risks such as business interruption, claim handling costs must be
controlled through AI powered service delivery.
----End-----
27 | P a g e
28 | P a g e