Data Mining
Business Report
PGP-DSBA
August 2022
CONTENTS
1. Objective
2. Problem 1: Clustering
   a) Assumptions
   b) Solution 1.1
   c) Solution 1.2
   d) Solution 1.3
   e) Solution 1.4
   f) Solution 1.5
3. Problem 2: CART-RF-ANN
   a) Assumptions
   b) Solution 2.1
   c) Solution 2.2
   d) Solution 2.3
   e) Solution 2.4
   f) Solution 2.5
OBJECTIVE
Problem 1: Clustering
A leading bank wants to develop a customer segmentation to give promotional offers to its
customers. They collected a sample that summarizes the activities of users during the past few
months. You are given the task to identify the segments based on credit card usage.
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-
variate, and multivariate analysis).
1.2 Do you think scaling is necessary for clustering in this case? Justify
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow
curve and silhouette score. Explain the results properly. Interpret and write inferences on the
finalized clusters.
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional
strategies for different clusters.
Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim frequency. The management
decides to collect data from the past few years. You are assigned the task to make a model which
predicts the claim status and provide recommendations to management. Use CART, RF & ANN
and compare the models' performances in train and test sets.
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-
variate, and multivariate analysis).
2.2 Data Split: Split the data into test and train, build classification model CART, Random
Forest, Artificial Neural Network
2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification
reports for each model.
2.4 Final Model: Compare all the models and write an inference which model is best/optimized.
2.5 Inference: Based on the whole Analysis, what are the business insights and recommendations
PROBLEM 1: CLUSTERING
ASSUMPTIONS
The dataset provided to us is stored as “bank_marketing_part1_Data.csv” and contains data on
210 customers across 7 variables, namely: spending, advance payments, probability of full payment,
current balance, credit limit, minimum payment amount (min_payment_amt) and maximum amount spent
in a single shopping trip.
IMPORTING PACKAGES
To import the dataset and perform exploratory data analysis on it, the required packages were
imported: pandas for data handling, matplotlib and seaborn for plotting, scipy for hierarchical
clustering, and scikit-learn for scaling and K-Means clustering.
SOLUTIONS
1.1 To perform exploratory data analysis on the dataset and describe it briefly.
The dataset is imported into the Jupyter notebook using the pd.read_csv() function and stored in df.
The top 5 rows of the dataset are viewed using the df.head() function.
Summary of the dataset
Missing or “NA” values need to be checked and, if present, dropped from the dataset, since null
values can cause errors or distort the results.
As computed from the command above, the dataset does not have any null or NA values.
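A minimal sketch of these initial steps, assuming standard pandas calls and the file name stated in the Assumptions section:

import pandas as pd

# Load the customer data (file name as given in the Assumptions section)
df = pd.read_csv("bank_marketing_part1_Data.csv")

print(df.head())            # first five rows
print(df.describe())        # statistical summary of the numeric variables
print(df.isnull().sum())    # missing-value count per column (expected to be all zero)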
UNIVARIATE ANALYSIS
Histograms, box plots (to check for outliers) and dist plots are plotted for all the numerical
variables. The results are shown below.
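A sketch of how these plots could be produced for every numeric column, assuming matplotlib and seaborn are available:

import matplotlib.pyplot as plt
import seaborn as sns

# One distribution plot and one box plot per numeric variable
for col in df.select_dtypes(include="number").columns:
    fig, axes = plt.subplots(1, 2, figsize=(10, 3))
    sns.histplot(df[col], kde=True, ax=axes[0])   # histogram / dist plot
    sns.boxplot(x=df[col], ax=axes[1])            # outlier check
    fig.suptitle(col)
    plt.show()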
SPENDING
ADVANCE PAYMENTS
PROBABILITY OF FULL PAYMENT
CREDIT LIMIT
MAX SPENT IN SINGLE SHOPPING
INFERENCE: After plotting the box plots for all the variables, we can conclude that a few outliers
are present in only one variable, min_payment_amt, which means that only a few customers have a
minimum payment amount that falls on the higher side on average. Since only one of the seven
variables has a small number of mild outliers, there is no need to treat them; these few values
will not make any difference to our analysis.
We can conclude from the above graphs that most of the customers in our data have a higher spending
capacity and a high current balance in their accounts, and that these customers spent a higher
amount during a single shopping event. The majority of customers have a high probability of making
full payment to the bank.
MULTIVARIATE ANALYSIS
This heat map helps us check for correlations between the different variables.
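A sketch of the correlation heat map, assuming seaborn:

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations between the numeric variables
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()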
INFERENCE: As per the heat map, several variables are highly correlated.
We can conclude that customers who spend very heavily also have a higher current balance and a
higher credit limit. Advance payments and the maximum amount spent in a single shopping trip are
likewise concentrated among customers who hold a high current balance in their bank accounts.
The probability of full payment is higher for customers who have a higher credit limit.
The minimum payment amount is not correlated with any of the other variables; hence it is not
affected by changes in a customer's current balance or credit limit.
1.2 Do you think scaling is necessary for clustering in this case? Justify
For the data given to us, scaling is required because the variables are expressed in different
units: spending in 1000s, advance payments in 100s and credit limit in 10000s, whereas probability
is expressed as a fraction or decimal value. The variables expressed in larger units would
otherwise outweigh the probabilities and distort the results, so it is important to scale the data
using StandardScaler, which standardises each variable to a mean of 0 and a standard deviation of 1.
Scaling is done by importing the StandardScaler class from sklearn; a sketch is shown below.
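A sketch of the scaling step using scikit-learn's StandardScaler:

import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Standardise every variable to mean 0 and standard deviation 1
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled_df.describe().round(2))   # means ~0, standard deviations ~1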
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them.
For this dataset we will use the Agglomerative Hierarchical Clustering method to create the optimum
clusters and categorise the customers on the basis of these clusters.
To create a dendrogram from the scaled data, we first import dendrogram and linkage from
scipy.cluster.hierarchy. Using these functions we create a dendrogram, which shows two clusters
very clearly. We then check the make-up of these clusters using the 'maxclust' and 'distance'
criteria. As can be seen above, we will take 2 clusters forward for our further analysis.
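A sketch of these steps; the 'ward' linkage method is an assumption, as the report does not state which linkage was used:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

# Build the linkage matrix on the scaled data and draw the dendrogram
Z = linkage(scaled_df, method="ward")          # 'ward' linkage assumed
dendrogram(Z, truncate_mode="lastp", p=10)     # show only the last 10 merges
plt.show()

# Cut the tree into a chosen number of clusters with the 'maxclust' criterion
# (cuts at 2 or 3 clusters can both be inspected here)
labels = fcluster(Z, t=2, criterion="maxclust")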
The graph above shows the last 10 links (merges) in the dendrogram. The array below lists the
cluster label assigned to each of the 210 customers:
array([1, 3, 1, 2, 1, 2, 2, 3, 1, 2, 1, 3, 2, 1, 3, 2, 3, 2, 3, 2, 2, 2,
1, 2, 3, 1, 3, 2, 2, 2, 3, 2, 2, 3, 2, 2, 2, 2, 2, 1, 1, 3, 1, 1,
2, 2, 3, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 3, 2, 2, 3, 3, 1,
1, 3, 1, 2, 3, 2, 1, 1, 2, 1, 3, 2, 1, 3, 3, 3, 3, 1, 2, 3, 3, 1,
1, 2, 3, 1, 3, 2, 2, 1, 1, 1, 2, 1, 2, 1, 3, 1, 3, 1, 1, 2, 2, 1,
3, 3, 1, 2, 2, 1, 3, 3, 2, 1, 3, 2, 2, 2, 3, 3, 1, 2, 3, 3, 2, 3,
3, 1, 2, 1, 1, 2, 1, 3, 3, 3, 2, 2, 3, 2, 1, 2, 3, 2, 3, 2, 3, 3,
3, 3, 3, 2, 3, 1, 1, 2, 1, 1, 1, 2, 1, 3, 3, 3, 3, 2, 3, 1, 1, 1,
3, 3, 1, 2, 3, 3, 3, 3, 1, 1, 3, 3, 3, 2, 3, 3, 2, 1, 3, 1, 1, 2,
1, 2, 3, 1, 3, 2, 1, 3, 1, 3, 1, 3], dtype=int32)
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and
silhouette score.
As per the within-sum-of-squares (WSS) plot, i.e. the elbow method, we can conclude that the
optimal number of clusters for K-Means clustering is 3, since the curve flattens out after k = 3.
The number of clusters taken for K-Means clustering is therefore 3. The average silhouette score
comes out to be 0.400 and the minimum silhouette score is 0.002. The silhouette score ranges from
-1 to +1, and the higher the score, the better the clustering.
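A sketch of the elbow (WSS) and silhouette computations using scikit-learn; the random_state value is an arbitrary assumption:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# Within-cluster sum of squares (inertia) for k = 1..10
wss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(scaled_df)
    wss.append(km.inertia_)
plt.plot(range(1, 11), wss, marker="o")   # the elbow appears near k = 3
plt.show()

# Silhouette scores for the final k = 3 solution
km3 = KMeans(n_clusters=3, n_init=10, random_state=42).fit(scaled_df)
print(silhouette_score(scaled_df, km3.labels_))           # average score (~0.40 above)
print(silhouette_samples(scaled_df, km3.labels_).min())   # minimum per-point score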
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies
for different clusters.
[Cluster-profile table: per-cluster means of Spending, Advance Payments, Probability of full
payment, Current balance, Credit Limit, Min Payment Amt and Max spent in single shopping.]
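A sketch of how such a cluster-profile table can be produced, assuming the K-Means labels from the previous step:

# Attach the cluster labels and average every variable within each cluster
df["cluster"] = km3.labels_
profile = df.groupby("cluster").mean().round(2)
profile["size"] = df["cluster"].value_counts().sort_index()
print(profile)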
PROBLEM 2: CART-RF-ANN
Assumptions
The dataset provided to us contains data of 3000 customers and 10 variables namely:
SOLUTIONS
2.1 Read the data, do the necessary initial steps, and exploratory data analysis
(Univariate, Bi-variate, and multivariate analysis).
Importing the dataset
The dataset is read using pd.read_csv(), and the top rows are viewed using the head() function.
Summary of the dataset
The above dataset does not have any null or N/A values.
UNIVARIATE ANALYSIS
Box plots are drawn for the numerical variables to check for outliers, along with histograms, and
bar plots are drawn for all the categorical variables.
AGE
COMMISSION
DURATION
SALES
BAR PLOTS - CHANNEL
BAR PLOTS – PRODUCT NAME
We can conclude from the above graphs that the majority of the customers making a claim in our
data belong to the 25-40 age group, with the agency type being Travel Agency, the channel being
Online, the product name being the Customised Plan and the destination being Asia.
MULTIVARIATE ANALYSIS
Heat Map
As interpreted from the above heat map, there is little or no correlation between the variables
given in the dataset.
2.2 Data Split: Split the data into test and train, build classification model CART,
Random Forest, Artificial Neural Network.
CART Model
Classification and Regression Trees (CART) are a type of decision tree used in data mining. CART is
a supervised learning technique in which the predicted outcome is either a discrete class or a
continuous numerical value.
Using the train dataset we will build the CART model and then test it on the test dataset; a sketch
of the split is shown below.
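A sketch of the split; the 70:30 ratio, the dataframe name df2 and the target column name 'Claimed' are assumptions for illustration:

import pandas as pd
from sklearn.model_selection import train_test_split

# df2: the insurance dataset loaded earlier (name assumed)
# Separate predictors and target ('Claimed' assumed to be the claim-status column)
X = df2.drop("Claimed", axis=1)
y = pd.Categorical(df2["Claimed"]).codes      # encode the target as 0/1

# Convert object (categorical) predictors to numeric codes before modelling
for col in X.select_dtypes(include="object").columns:
    X[col] = pd.Categorical(X[col]).codes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)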
Below are the variable-importance (feature-importance) values used to build the tree.
Using the GridSearchCV class from sklearn.model_selection we identify the best parameters for a
regularized decision tree. After a few iterations over the candidate values, the best parameters
for the decision tree are as follows:
These best grid parameters are henceforth used to build the regularized or pruned Decision Tree.
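A sketch of the grid search for the pruned tree; the parameter grid shown is illustrative, not the exact grid used in the report:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {                                  # illustrative values only
    "max_depth": [4, 6, 8],
    "min_samples_leaf": [10, 25, 50],
    "min_samples_split": [30, 60, 100],
}
grid_cart = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_cart.fit(X_train, y_train)
print(grid_cart.best_params_)                   # best parameters for the pruned tree
best_cart = grid_cart.best_estimator_

# Feature importances of the fitted tree
print(pd.Series(best_cart.feature_importances_, index=X_train.columns).sort_values(ascending=False))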
RANDOM FOREST
Random Forest is another supervised learning technique used in machine learning. It consists of
many decision trees, makes predictions with the individual trees and aggregates their outputs to
select the best result.
Using the best parameters found with GridSearchCV, a Random Forest model is created, which is then
used for model-performance evaluation.
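A similar sketch for the random forest, again with an illustrative parameter grid:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid_rf = {                               # illustrative values only
    "n_estimators": [100, 300],
    "max_depth": [5, 7, 10],
    "max_features": [3, 5],
}
grid_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_rf, cv=5)
grid_rf.fit(X_train, y_train)
best_rf = grid_rf.best_estimator_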
ARTIFICIAL NEURAL NETWORK
An Artificial Neural Network (ANN) is a computational model that consists of several processing
elements that receive inputs and deliver outputs based on their predefined activation functions.
Using the train dataset (X_train) and the test dataset (X_test), we create a neural network using
MLPClassifier from sklearn.neural_network.
Using the GridSearchCV class from sklearn.model_selection we identify the best parameters for the
Artificial Neural Network model (mlp). After a few iterations over the candidate values, the best
parameters for the ANN model are as follows:
Using the best parameters found with GridSearchCV, an Artificial Neural Network model is created,
which is then used for model-performance evaluation.
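A sketch of the grid search for the MLP model; the grid values are illustrative assumptions:

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

param_grid_mlp = {                              # illustrative values only
    "hidden_layer_sizes": [(50,), (100,), (200,)],
    "activation": ["relu", "tanh"],
    "tol": [0.01, 0.001],
}
grid_mlp = GridSearchCV(MLPClassifier(max_iter=5000, random_state=42), param_grid_mlp, cv=5)
grid_mlp.fit(X_train, y_train)
best_mlp = grid_mlp.best_estimator_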
2.3 Performance Metrics: Check the performance of predictions on the train and test sets using
accuracy, confusion matrix, ROC curve, ROC_AUC score and classification reports for each model.
To check the performance of the three models created above, the following model evaluators are
used: classification report, confusion matrix, ROC_AUC score and ROC plot. They are calculated
first for the train data and then for the test data.
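A sketch of the evaluation step for one fitted model (best_cart shown; the same calls apply to best_rf and best_mlp):

import matplotlib.pyplot as plt
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)

for X_part, y_part, name in [(X_train, y_train, "train"), (X_test, y_test, "test")]:
    y_pred = best_cart.predict(X_part)
    y_prob = best_cart.predict_proba(X_part)[:, 1]        # predicted claim probability
    print(name, "accuracy:", best_cart.score(X_part, y_part))
    print(confusion_matrix(y_part, y_pred))
    print(classification_report(y_part, y_pred))
    print(name, "AUC:", roc_auc_score(y_part, y_prob))
    fpr, tpr, _ = roc_curve(y_part, y_prob)
    plt.plot(fpr, tpr, label=name)                        # ROC curve
plt.legend()
plt.show()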
CART MODEL
Classification Report
Confusion Matrix
Random Forest report
Classification Report
Confusion Matrix
Artificial Neural Network Model
Classification Report
Confusion Matrix
2.4 Final Model: Compare all the models and write an inference which model is
best/optimized.
INSIGHTS: From the above table comparing the model-performance evaluators for the three models, it
is clear that the Random Forest model performs better than the other two: it has high precision on
both the training and testing data, and although the AUC score is the same for all three models on
the training data, it is the highest for the Random Forest model on the testing data. Choosing the
Random Forest model is the best option in this case, as it will exhibit much less variance than a
single decision tree or a multi-layered neural network.
2.5 Inference: Based on the whole Analysis, what are the business insights and
recommendations
For the business problem of an insurance firm providing tour insurance, we have built a few models
to predict claim probabilities: CART (Classification and Regression Trees), Random Forest and an
Artificial Neural Network (MLP). The three models were then evaluated on the training and testing
datasets and their performance scores were calculated.
The accuracy, precision and F1 score are computed from the classification report, while the
confusion matrix, ROC_AUC scores and ROC plots are computed for each model separately and compared.
All three models have performed well, but to increase our accuracy in predicting the claims made by
customers we choose the Random Forest model: instead of a single decision tree it builds multiple
decision trees and can therefore provide the best prediction of claim status from the data.
As seen from the above model-performance measures, all of the models, i.e. CART, Random Forest and
ANN, have performed exceptionally well. Hence we could choose any of them, but the Random Forest
model is the best option: even though the models exhibit similar accuracy, a Random Forest is
preferable to a single CART model because it has much less variance than a single decision tree.