Jupyter Notebook Project DM Nikita Chaturvedi 25.07.2021

25/07/2021 Project- Clustering - Jupyter Notebook
Problem 1: Clustering
A leading bank wants to develop a customer segmentation to give promotional offers to its customers. They
collected a sample that summarizes the activities of users during the past few months. You are given the task
to identify the segments based on credit card usage.
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate,
and multivariate analysis).
In [1]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import scipy.cluster.hierarchy as sch
from scipy.cluster.hierarchy import dendrogram, linkage,fcluster
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
sns.set(context="notebook", palette="Spectral", style = 'darkgrid' ,font_scale = 1.5
In [2]:
# Load Dataset
df=pd.read_csv("bank_marketing_part1_Data.csv")
In [3]:
# Data Information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 spending 210 non-null float64
1 advance_payments 210 non-null float64
2 probability_of_full_payment 210 non-null float64
3 current_balance 210 non-null float64
4 credit_limit 210 non-null float64
5 min_payment_amt 210 non-null float64
6 max_spent_in_single_shopping 210 non-null float64
dtypes: float64(7)
memory usage: 11.6 KB
localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project- Clustering.ipynb#1.4-Apply-K-Means-clustering-on-scaled-data-and-determine-optim… 1/42

In [4]:
# Value Counts of Data
df.dtypes.value_counts()
Out[4]:
float64 7
dtype: int64
In [5]:
df.isnull().sum()
Out[5]:
spending 0
advance_payments 0
probability_of_full_payment 0
current_balance 0
credit_limit 0
min_payment_amt 0
max_spent_in_single_shopping 0
dtype: int64
Observations:
Data consists of only numerical values .

No categorical data present.
7 numerical variables and 210 records.
No missing values based on intial analysis.
In [6]:
df.shape
Out[6]:
(210, 7)
In [7]:
print('The number of rows of the dataframe is',df.shape[0],'.')

print('The number of columns of the dataframe is',df.shape[1],'.')
The number of rows of the dataframe is 210 .
The number of columns of the dataframe is 7 .

In [8]:
# Head of Data
df.head()
Out[8]:
spending advance_payments probability_of_full_payment current_balance credit_limit min_pa
0 19.94 16.92 0.8752 6.675 3.763
1 15.99 14.89 0.9064 5.363 3.582
2 18.95 16.42 0.8829 6.248 3.755
3 10.83 12.96 0.8099 5.278 2.641
4 17.99 15.86 0.8992 5.890 3.694
In [9]:
df.tail()
Out[9]:
spending advance_payments probability_of_full_payment current_balance credit_limit min_
205 13.89 14.02 0.8880 5.439 3.199
206 16.77 15.62 0.8638 5.927 3.438
207 14.03 14.16 0.8796 5.438 3.201
208 16.12 15.00 0.9000 5.709 3.485
209 15.57 15.15 0.8527 5.920 3.231
Data Dictionary for Market Segmentation:
1. spending: Amount spent by the customer per month (in 1000s)

2. advance_payments: Amount paid by the customer in advance by cash (in 100s)
3. probability_of_full_payment: Probability of payment done in full by the customer to the bank
4. current_balance: Balance amount left in the account to make purchases (in 1000s)
5. credit_limit: Limit of the amount in credit card (10000s)
6. min_payment_amt : minimum paid by the customer while making payments for purchases made monthly
(in 100s)
7. max_spent_in_single_shopping: Maximum amount spent in one purchase (in 1000s)

In [10]:
# Descriptive Statistics
round(df.describe().T,2)
Out[10]:
count mean std min 25% 50% 75% max
spending 210.0 14.85 2.91 10.59 12.27 14.36 17.30 21.18
advance_payments 210.0 14.56 1.31 12.41 13.45 14.32 15.72 17.25
probability_of_full_payment 210.0 0.87 0.02 0.81 0.86 0.87 0.89 0.92
current_balance 210.0 5.63 0.44 4.90 5.26 5.52 5.98 6.68
credit_limit 210.0 3.26 0.38 2.63 2.94 3.24 3.56 4.03
min_payment_amt 210.0 3.70 1.50 0.77 2.56 3.60 4.77 8.46
max_spent_in_single_shopping 210.0 5.41 0.49 4.52 5.04 5.22 5.88 6.55
Inference:
Based on summary descriptive, the data looks good.For most of the variable, mean/medium are nearly equal.
1. Minimum Spending of a customer per month (in 1000s) is 10.59 and a maximum spending per month (in
1000s) is 21.18 .
2. On an average mimimum spending of a customer per month (in 1000s) is 14.85 .
3. Minimum amount paid by the customer in advance by cash (in 100s) is 12.41 and maximum is 17.25.
4. On an average customers are paying 14.56 icash in advance (in 100s) .
5. On an average approximately 87% of customers make full payments to the bank.
6. Minimum and maximum balance amount left in the account to make purchases (in 1000s) is 4.89 and 6.68
respectively.
7. On an average balance amount left in the account to make purchases (in 1000s) is 5.63 .
8. On an average limit of the amount in credit card (10000s) is 3.26 with a minimum limit of 2.63 and a
maximum limit of 4.03.
9. On an average minimum amount paid by the customer while making payments for purchases made
monthly 3.70 (in 100s).
10. On an average maximum amount spent in one purchase (in 1000s) 5.41.
11. Std Deviation is high for spending variable in comparison to other variables.

In [11]:
df.nunique()
Out[11]:
spending 193
advance_payments 170
probability_of_full_payment 186
current_balance 188
credit_limit 184
min_payment_amt 207
max_spent_in_single_shopping 148
dtype: int64
In [12]:
# Check for Duplicate Values
dups= df.duplicated().sum()
print("Number of duplicate rows = %d" % (dups.sum()))
Number of duplicate rows = 0

In [13]:
fig,axs = plt.subplots(nrows=2,ncols=1, figsize=(20,10))

sns.histplot(data=df, kde=True, ax=axs[0])
sns.boxplot(data=df, ax=axs[1])
plt.xticks(rotation=90)
plt.xlabel("Variables-Bank Marketing Data")
plt.title("Figure 1: Hist plot and Box plot of Bank Marketing Data")
Out[13]:
Text(0.5, 1.0, 'Figure 1: Hist plot and Box plot of Bank Marketing Dat
a')
We know that, Univariate and multivariate represent two approaches to statistical analysis. Univariate involves
the analysis of a single variable while multivariate analysis examines two or more variables. Although univariate
and multivariate differ in function and complexity, Univariate analysis acts as a precursor to multivariate
analysis and knowledge of the former is necessary for understanding the latter.
Univariate analysis is the simplest form of analysing data. It is descriptive and doesn’t deal with causes or
relationships. It takes data, summarizes that data and finds patterns in the data.
Multivariate analysis techniques are used to understand how the set of outcome variables as a combined
whole are influenced by other factors, how the outcome variables relate to each other, or what underlying
factors produce the results observed in the dependent variables.

In [14]:
def univariateAnalysis_numeric(column,nbins):
print("Description of " + column)
print("-------------------------------------------------------------------------
print(df[column].describe(),end=' ')
plt.figure()
print("Distribution of " + column)
print("-------------------------------------------------------------------------
sns.distplot(df[column], kde=False, color='g');
plt.show()
plt.figure()
print("BoxPlot of " + column)
print("-------------------------------------------------------------------------
ax = sns.boxplot(x=df[column])
plt.show()
Since, our data consists of only numerical data , we will perform a univariate analysis of numerical columns of
the data.
In [15]:
Numerical_column_list = list(df.columns.values)
Numerical_length=len(Numerical_column_list)
print("Length of Numerical columns is :",Numerical_length)
Length of Numerical columns is : 7
In [16]:
for x in Numerical_column_list:
univariateAnalysis_numeric(x,20)
BoxPlot of max_spent_in_single_shopping
----------------------------------------------------------------------
------

Insights of Univariate Analysis:
1. Minimum Spending of a customer per month (in 1000s) is 10.59 and a maximum spending per month (in
1000s) is 21.18 .
2. On an average mimimum spending of a customer per month (in 1000s) is 14.85 .
3. Minimum amount paid by the customer in advance by cash (in 100s) is 12.41 and maximum is 17.25.
4. On an average customers are paying 14.56 cash in advance (in 100s) .
5. On an average approximately 87% of customers make full payments to the bank.
6. Maximum of 91.8 % customers make full payments to the bank and a minimum of 80.8 % customers
make full payments.
7. Minimum and maximum balance amount left in the account to make purchases (in 1000s) is 4.89 and 6.68
respectively.
8. On an average balance amount left in the account to make purchases (in 1000s) is 5.63 .
9. On an average limit of the amount in credit card (10000s) is 3.26 with a minimum limit of 2.63 and a
maximum limit of 4.03.
10. On an average minimum amount paid by the customer while making payments for purchases made
monthly 3.70 (in 100s).
11. On an average maximum amount spent in one purchase (in 1000s) 5.41.
12. There may be 2 or more outliers when customers are making minimum payments for monthly purchases
but other than that there aren't any outliers to be seen in the data.
In [17]:
import seaborn as sns; sns.set(style="ticks", color_codes=True)

iris = sns.load_dataset("iris")

In [18]:
#Scatterplots of all possible variable pairs
sns.pairplot(df);

In [19]:
corr = df.corr(method='pearson')
mask = np.triu(np.ones_like(corr, dtype=np.bool))
fig = plt.subplots(figsize=(35, 15))
sns.heatmap(df.corr(), annot=True,fmt='.2f',mask=mask)
plt.title("Figure 4: Heatmap of Variables")

plt.show()
Insigjts from Bivariate and Multivariate Analysis:
Overall the categories in the data looks very well correlated.Listing down a few points below:
1. advance_payments: There is a very strong correlation between Amount spent by the customer per month
with amount paid by the customer in advance by cash,balance amount left in the account to make
purchases,maximum amount spent in one purchase and limit of the amount in credit card.
2. current_balance: We can also see strong correlation between balance amount left in the account to make
purchases with limit of the amount in credit card and maximum amount spent in one purchase.
3. credit_limit:: There is a strong correlation between limit of the amount in credit card with amount paid by
the customer in advance by cash and probability of payment done in full by the customer to the bank.
Butcorrelation is slightly stronger for probabilty of full payment done to the bank , though there is not much
of a difference.
4. max_spent_in_single_shopping: There is a strong correlation of maximum amount spent in one purchase
with amount spent by the customer per month.
5. probability_of_full_payment: There is a moderate correlation between probability of payment done in full by
the customer to the bank with the Amount spent by the customer per month and amount paid by the
customer in advance by cash.
localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project- Clustering.ipynb#1.4-Apply-K-Means-clustering-on-scaled-data-and-determine-opti… 10/42

6. min_payment_amt: We can see that there is a very strong negative correlation,when customers are making
minimum payments for monthly purchases with amount spent by the customer per month, amount paid by
the customer in advance by cash, probability of payment done in full by the customer to the bank, balance
amount left in the account to make purchases and limit of the amount in credit card.
Negative Correlation is an indication that mentioned variables move in the opposite direction whenever
customers are making monthly payments for the minimum amount spent. In general, -0.30 is considered as a
weak correlation and the correlation values are lesser than -0.3 indicating very weak correlation.This
observation by itself demonstrate a cause and effect relationship between the variables.
Summary:
Strong positive correlation between
- spending & advance_payments,
- advance_payments & current_balance,
- credit_limit & spending
- spending & current_balance
- credit_limit & advance_payments
- max_spent_in_single_shopping current_balance
Outliers:
From univariate analysis in Figure(), we can confirm the presence of outliers in variable "min_payment_amt" ,
i.e., when customers are making minimum payments for monthly purchases and for variable
"probability_of_full_payment", i.e., when there is a probability that customer will make a full payment.
To confirm our analysis , we will further detect outliers and decide how these outliers should be treated.
The interquartile range (IQR) is a measure of statistical dispersion and is calculated as the difference between
the 75th and 25th percentiles. It is represented by the formula IQR = Q3 − Q1.
In [20]:
def detect_outlier(col):
sorted(col)
Q1,Q3=np.percentile(col,[25,75])
IQR=Q3-Q1
lower_range= Q1-(1.5 * IQR)
upper_range= Q3+(1.5 * IQR)
return lower_range, upper_range

In [21]:
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
spending 5.035000
advance_payments 2.265000
probability_of_full_payment 0.030875
current_balance 0.717500
credit_limit 0.617750
min_payment_amt 2.207250
max_spent_in_single_shopping 0.832000
dtype: float64
For Variable "probability_of_full_payment" :
In [22]:
lr,ur=detect_outlier(df['probability_of_full_payment'])
print("Lower range in probability_of_full_payment is",lr)
print("Upper range in probability_of_full_payment is", ur)
Lower range in probability_of_full_payment is 0.8105875
Upper range in probability_of_full_payment is 0.9340875
In [23]:
rint('Number of outliers in probability_of_full_payment upper : ', df[df['probability

rint('Number of outliers in probability_of_full_payment lower : ', df[df['probability
rint('% of Outlier in probability_of_full_payment upper: ',round(df[df['probability_o
rint('% of Outlier in probability_of_full_payment lower: ',round(df[df['probability_o
Number of outliers in probability_of_full_payment upper : 0
Number of outliers in probability_of_full_payment lower : 3
% of Outlier in probability_of_full_payment upper: 0 %
% of Outlier in probability_of_full_payment lower: 1 %

In [24]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))
#boxplot
sns.boxplot(x='probability_of_full_payment',data=df,orient='v',ax=ax1,color='teal')
ax1.set_ylabel('probability_of_full_payment', fontsize=15)
ax1.set_title('Figure 5: Distribution of probability_of_full_payment', fontsize=15)
ax1.tick_params(labelsize=15)
#distplot
sns.distplot(df['probability_of_full_payment'],ax=ax2,color='teal')
ax2.set_xlabel('probability_of_full_payment', fontsize=15)
#histogram
ax3.hist(df['probability_of_full_payment'],color='teal')
ax3.set_ylabel('Density', fontsize=15)
ax3.set_xlabel('probability_of_full_payment', fontsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
/opt/anaconda3/lib/python3.8/site-packages/seaborn/_core.py:1319: User
Warning: Vertical orientation ignored with only `x` specified.
warnings.warn(single_var_warning.format("Vertical", "x"))
/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:25
57: FutureWarning: `distplot` is a deprecated function and will be rem
oved in a future version. Please adapt your code to use either `displo
t` (a figure-level function with similar flexibility) or `histplot` (a
n axes-level function for histograms).
warnings.warn(msg, FutureWarning)
For Variable "min_payment_amt":
In [25]:
lr,ur=detect_outlier(df['min_payment_amt'])
print("Lower range in min_payment_amt is",lr)
print("Upper range in min_payment_amt is", ur)
Lower range in min_payment_amt is -0.7493750000000006
Upper range in min_payment_amt is 8.079625

In [26]:
print('Number of outliers in min_payment_amt upper : ', df[df['min_payment_amt']>8.0

print('Number of outliers in min_payment_amt lower : ', df[df['min_payment_amt']<-0.
print('% of Outlier in min_payment_amt upper: ',round(df[df['min_payment_amt']>8.079
print('% of Outlier in min_payment_amt lower: ',round(df[df['min_payment_amt']<-0.74
Number of outliers in min_payment_amt upper : 2
Number of outliers in min_payment_amt lower : 0
% of Outlier in min_payment_amt upper: 1 %
% of Outlier in min_payment_amt lower: 0 %
In [27]:
fig, (ax1,ax2,ax3)=plt.subplots(1,3,figsize=(13,5))
#boxplot
sns.boxplot(x='min_payment_amt',data=df,orient='v',ax=ax1,color="green")
ax1.set_ylabel('min_payment_amt', fontsize=15)
ax1.set_title('Figure 6: Distribution of min_payment_amt', fontsize=15)
#distplot
sns.distplot(df['min_payment_amt'],ax=ax2,color="green")
ax2.set_xlabel('min_payment_amt', fontsize=15)
#histogram
ax3.hist(df['min_payment_amt'],color="green")
ax3.set_xlabel('min_payment_amt', fontsize=15)
ax3.set_ylabel('Density', fontsize=15)
plt.subplots_adjust(wspace=0.5)
plt.tight_layout()
/opt/anaconda3/lib/python3.8/site-packages/seaborn/_core.py:1319: User
Warning: Vertical orientation ignored with only `x` specified.
warnings.warn(single_var_warning.format("Vertical", "x"))
/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:25
57: FutureWarning: `distplot` is a deprecated function and will be rem
oved in a future version. Please adapt your code to use either `displo
t` (a figure-level function with similar flexibility) or `histplot` (a
n axes-level function for histograms).
warnings.warn(msg, FutureWarning)

We choose to remove attribute outlier values
In [28]:
clean_dataset=df.copy()
In [29]:
def remove_outlier(col):
sorted(col)
Q1,Q3=np.percentile(col,[25,75])
IQR=Q3-Q1
lower_range= Q1-(1.5 * IQR)
upper_range= Q3+(1.5 * IQR)
return lower_range, upper_range
In [30]:
lr,ur=remove_outlier(clean_dataset)
print("lower range",lr, "and upper range", ur)
lower range -10.634375 and upper range 25.980625
In [31]:
clean_dataset=np.where(clean_dataset>ur,ur,clean_dataset)
clean_dataset=np.where(clean_dataset<lr,lr,clean_dataset)

In [32]:
plt.figure(figsize=(18,10))
sns.boxplot(data=df)
plt.xlabel("Variables")
plt.ylabel("Density")
plt.title('Figure 7:Boxplot of variables after Outlier Treatment')
Out[32]:
Text(0.5, 1.0, 'Figure 7:Boxplot of variables after Outlier Treatmen

t')

In [33]:
plt.title('Figure 8: probability_of_full_payment')
sns.boxplot(df['probability_of_full_payment'],orient='horizondal')
/opt/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36:
FutureWarning: Pass the following variable as a keyword arg: x. From v
ersion 0.12, the only valid positional argument will be `data`, and pa
ssing other arguments without an explicit keyword will result in an er
ror or misinterpretation.
warnings.warn(
Out[33]:
<AxesSubplot:title={'center':'Figure 8: probability_of_full_payment'},
xlabel='probability_of_full_payment'>
Most of the outliers are treated, we still see one as per the boxplot, it is okay, as it is not extreme and is on the
lower band.

In [34]:
# checking distributions of Independent Attributes
df.hist(figsize=(15,16),layout=(4,2), color="blue");
plt.title("Figure 9:Distributionss of Independent Attributes")
plt.xlabel("Variables")
plt.show()

In [35]:
# Skewness of Data
df.skew(axis = 0, skipna = True).sort_values(ascending=False)
Out[35]:
max_spent_in_single_shopping 0.561897
current_balance 0.525482
min_payment_amt 0.401667
spending 0.399889
advance_payments 0.386573
credit_limit 0.134378
probability_of_full_payment -0.537954
dtype: float64
Insights on Skewness and Distribution using Histograms:
Skewness is the measure of how much the probability distribution of a random variable deviates from the
normal distribution, it explains the extent to which the data is normally distributed. The normal distribution is
the probability distribution without any skewness.
Ideally, the skewness value should be between -1 and +1, and any major deviation from this range indicates
the presence of extreme values.We also know that, the probability distribution with its tail on the right side is a
positively skewed distribution and the one with its tail on the left side is a negatively skewed distribution.
The rule of thumb is:
If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
If the skewness is between -1 and – 0.5 or between 0.5 and 1, the data are moderately skewed.
If the skewness is less than -1 or greater than 1, the data are highly skewed.
Now,looking at the above values we can state the following:
Negative skew refers to a longer or fatter tail on the left side of the distribution, while positive skew refers
to a longer or fatter tail on the right. The mean of positively skewed data will be greater than the median.
Distrubtion is skewed to right tail for all the variable execpt probability_of_full_payment variable, which has
left tail.
Also, since the skewness is ranging between -0.5 and 0.5 we can say that data is moderately skewed.
Type Markdown and LaTeX: 𝛼2

1.2 Do you think scaling is necessary for clustering in this case?
Justify The learner is expected to check and comment about the difference in scale of different features on the
bases of appropriate measure for example std dev, variance, etc. Should justify whether there is a necessity for
scaling and which method is he/she using to do the scaling. Can also comment on how that method works.
Scaling
Scaling is necessary for clustering in this case , since ,there is a vast difference in the range of data,few
variables our ranging in ten thousands,thousands and few ranging in the hundreds, and if we do not perform
scaling, data will make the underlying assumption that higher ranging numbers have superiority of some sort.

So these more significant number starts playing a more decisive role while training the model.
Thus feature scaling is needed to bring every feature in the same footing without any upfront
importance.Feature scaling is essential for machine learning algorithms that calculate distances between data.
Below are the few ways we can do feature scaling.
1) Min Max Scaler
2) Standard Scaler
3) Normalization
Min-Max Scaler & Normalization :
This methid transform features by scaling each feature to a given range. This estimator scales and translates
each feature individually such that it is in the given range on the training set, e.g., between zero and one. This
Scaler shrinks the data within the range of -1 to 1 if there are negative values.
It responds well if the standard deviation is small and when a distribution is not Gaussian and is sensitive to
outliers.
Normalization scales each input variable separately to the range 0-1, which is the range for floating-point
values where we have the most precision.
Normalization is a special case of min-max scaling.To normalize the data, the min-max scaling can be applied
to one or more feature column.Normalization is useful when the data is needed in the bounded intervals.
Standard Scaler:
Standardization scales each input variable separately by subtracting the mean (called centering) and dividing
by the standard deviation to shift the distribution to have a mean of zero and a standard deviation of one.
The Standard Scaler assumes data is normally distributed within each feature and scales them such that the
distribution centered around 0, with a standard deviation of 1.Centering and scaling happen independently on
each feature by computing the relevant statistics on the samples in the training set.
Unlike Normalization, standardization maintains useful information about outliers and makes the algorithm less
sensitive to them in contrast to min-max scaling, which scales the data to a limited range of values.
Therefore, we can summarize scaling with the below mentioned points:
1. Feature scaling is about transforming the value of features in the similar range like others for machine
learning algorithms to behave better resulting in optimal models.
2. Standardization and normalization are two most common techniques for feature scaling.
3. Normalization is about transforming the feature values to fall within the bounded intervals (min and max).
4. Standardization is about transforming the feature values to fall around mean as 0 with standard deviation
as 1.
5. Standardization maintains useful information about outliers and makes the algorithm less sensitive to them
in contrast to min-max scaling.
Also have shown below the plot of the data prior and after scaling.
Scaling will have all the values in the relative
same range.
We will use zscore to standarised the data to perform scaling (relative same scale -3 to +3).

In [36]:
# prior to scaling
plt.plot(clean_dataset)
plt.xlabel("Variable range")
plt.title("Figure 10: Plot of Variables Before Scaling")
plt.show()
1.3 Apply hierarchical clustering to scaled data (3 pts). Identify the number of optimum clusters using
Dendrogram and briefly describe them (4).
Students are expected to apply hierarchical clustering. It can be obtained via Fclusters or Agglomerative
Clustering. Report should talk about the used criterion, affinity and linkage. Report must contain a Dendrogram
and a logical reason behind choosing the optimum number of clusters and Inferences on the dendrogram.
Customer segmentation can be visualized using limited features or whole data but it should be clear, correct
and logical. Use appropriate plots to visualize the clusters.
In [37]:
from scipy.stats import zscore

df_Scaled=df.apply(zscore)
round(df_Scaled.head(),3)
Out[37]:
0 1.754 1.812 0.178 2.368 1.339
1 0.394 0.254 1.502 -0.601 0.858
2 1.413 1.428 0.505 1.401 1.317
3 -1.384 -1.228 -2.592 -0.793 -1.639
4 1.083 0.998 1.196 0.592 1.155

In [136]:
plt.plot(df_Scaled)
plt.xlabel("Variable range")
plt.title("Figure 11: Plot of Variables Before Scaling")
plt.show()
In [39]:
link_method = linkage(df_Scaled, method = 'average')

In [40]:
dend = dendrogram(link_method)
plt.xlabel("Indices of Data")
plt.ylabel("Distance")
plt.title("Figure 13: Customer Segmentation Dendogram- Average Linkage")
Out[40]:
Text(0.5, 1.0, 'Figure 13: Customer Segmentation Dendogram- Average Li

nkage')
In [41]:
dend = dendrogram(link_method,
truncate_mode='lastp',
p = 10)
plt.title("Figure 14: Customer Segmentation Truncated Dendogram with last p 10")
Out[41]:
Text(0.5, 1.0, 'Figure 14: Customer Segmentation Truncated Dendogram w

ith last p 10')

In [42]:
dend = dendrogram(link_method,
p = 25)
plt.title("Figure 14: Customer Segmentation Truncated Dendogram-Average Linkage(last
Out[42]:
Text(0.5, 1.0, 'Figure 14: Customer Segmentation Truncated Dendogram-A

verage Linkage(last p 25)')
In [43]:
from scipy.cluster.hierarchy import fcluster

In [44]:
# Set criterion as maxclust,then create 3 clusters, and store the result in another
clusters_3 = fcluster(link_method, 3, criterion='maxclust')

clusters_3
Out[44]:
array([1, 3, 1, 2, 1, 3, 2, 2, 1, 2, 1, 1, 2, 1, 3, 3, 3, 2, 2, 2, 2,
2,
1, 2, 3, 1, 3, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 1, 1, 3, 1,
1,
2, 2, 3, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 3, 2, 2, 1, 3,
1,
1, 3, 1, 2, 3, 2, 1, 1, 2, 1, 3, 2, 1, 3, 3, 3, 3, 1, 2, 1, 1,
1,
1, 3, 3, 1, 3, 2, 2, 1, 1, 1, 2, 1, 3, 1, 3, 1, 3, 1, 1, 2, 3,
1,
1, 3, 1, 2, 2, 1, 3, 3, 2, 1, 3, 2, 2, 2, 3, 3, 1, 2, 3, 3, 2,
3,
3, 1, 2, 1, 1, 2, 1, 3, 3, 3, 2, 2, 2, 2, 1, 2, 3, 2, 3, 2, 3,
1,
3, 3, 2, 2, 3, 1, 1, 2, 1, 1, 1, 2, 1, 3, 3, 2, 3, 2, 3, 1, 1,
1,
3, 2, 3, 2, 3, 2, 3, 3, 1, 1, 3, 1, 3, 2, 3, 3, 2, 1, 3, 1, 1,
2,
1, 2, 3, 3, 3, 2, 1, 3, 1, 3, 3, 1], dtype=int32)
In [45]:
cluster3_dataset=df.copy()
In [46]:
cluster3_dataset['clusters_3'] = clusters_3
In [47]:
cluster3_dataset.head()
Out[47]:
0 19.94 16.92 0.8752 6.675 3.763
1 15.99 14.89 0.9064 5.363 3.582
2 18.95 16.42 0.8829 6.248 3.755
3 10.83 12.96 0.8099 5.278 2.641
4 17.99 15.86 0.8992 5.890 3.694

In [48]:
# Cluster Frequency:
cluster3_dataset['clusters_3'].value_counts().sort_index()
Out[48]:
1 75
2 70
3 65
Name: clusters_3, dtype: int64
In [49]:
# Cluster Profiling
aggdata=cluster3_dataset.groupby('clusters_3').mean()
aggdata['Freq']=cluster3_dataset['clusters_3'].value_counts().sort_index()
aggdata
Out[49]:
spending advance_payments probability_of_full_payment current_balance credit_limi
clusters_3
1 18.129200 16.058000 0.881595 6.135747 3.648120
2 11.916857 13.291000 0.846766 5.258300 2.846000
3 14.217077 14.195846 0.884869 5.442000 3.253508
In [50]:
# Clustering Using -Ward Linkage Method
wardlink = linkage(df_Scaled, method = 'ward')

In [51]:
dend_wardlink = dendrogram(wardlink)
plt.title("Figure 15: Customer Segmentation-Dendrogram using Ward Linkage")
Out[51]:
Text(0.5, 1.0, 'Figure 15: Customer Segmentation-Dendrogram using Ward

Linkage')
In [52]:
dend_wardlink = dendrogram(wardlink,
p = 25,
)
plt.title("Figure 16: Customer Segmentation-Truncated Dendrogram ( Ward Linkage) wit
Out[52]:
Text(0.5, 1.0, 'Figure 16: Customer Segmentation-Truncated Dendrogram

( Ward Linkage) with last p 25')

In [53]:
clusters_ward_3 = fcluster(wardlink, 3, criterion='maxclust')

clusters_ward_3
Out[53]:
array([1, 3, 1, 2, 1, 2, 2, 3, 1, 2, 1, 3, 2, 1, 3, 2, 3, 2, 3, 2, 2,
2,
1, 2, 3, 1, 3, 2, 2, 2, 3, 2, 2, 3, 2, 2, 2, 2, 2, 1, 1, 3, 1,
1,
2, 2, 3, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 3, 2, 2, 3, 3,
1,
1, 3, 1, 2, 3, 2, 1, 1, 2, 1, 3, 2, 1, 3, 3, 3, 3, 1, 2, 3, 3,
1,
1, 2, 3, 1, 3, 2, 2, 1, 1, 1, 2, 1, 2, 1, 3, 1, 3, 1, 1, 2, 2,
1,
3, 3, 1, 2, 2, 1, 3, 3, 2, 1, 3, 2, 2, 2, 3, 3, 1, 2, 3, 3, 2,
3,
3, 1, 2, 1, 1, 2, 1, 3, 3, 3, 2, 2, 3, 2, 1, 2, 3, 2, 3, 2, 3,
3,
3, 3, 3, 2, 3, 1, 1, 2, 1, 1, 1, 2, 1, 3, 3, 3, 3, 2, 3, 1, 1,
1,
3, 3, 1, 2, 3, 3, 3, 3, 1, 1, 3, 3, 3, 2, 3, 3, 2, 1, 3, 1, 1,
2,
1, 2, 3, 1, 3, 2, 1, 3, 1, 3, 1, 3], dtype=int32)
In [54]:
cluster_ward_3_dataset=df.copy()
In [55]:
cluster_ward_3_dataset['clusters-3'] = clusters_ward_3
cluster_ward_3_dataset.head()
Out[55]:
0 19.94 16.92 0.8752 6.675 3.763
1 15.99 14.89 0.9064 5.363 3.582
2 18.95 16.42 0.8829 6.248 3.755
3 10.83 12.96 0.8099 5.278 2.641
4 17.99 15.86 0.8992 5.890 3.694

In [56]:
# Cluster Frequency - Ward Linkage

cluster_ward_3_dataset['clusters-3'].value_counts().sort_index()
Out[56]:
1 70
2 67
3 73
Name: clusters-3, dtype: int64
In [57]:
# Cluster Profiling - Ward Linkage

aggdata_ward=cluster_ward_3_dataset.groupby('clusters-3').mean()
aggdata_ward['Freq']=cluster_ward_3_dataset['clusters-3'].value_counts().sort_index(
round(aggdata_ward,3)
Out[57]:
spending advance_payments probability_of_full_payment current_balance credit_limit
clusters-
3
1 18.371 16.145 0.884 6.158 3.685
2 11.872 13.257 0.848 5.239 2.849
3 14.199 14.234 0.879 5.478 3.226
Observation:
Both the method are almost similer means with minor variations.
For cluster grouping based on the dendrogram, 3 or 4 looks good. Did the further analysis, and based on the
dataset had gone for 3 group cluster solution based on the hierarchical clustering
Also in real time, there colud
have been more variables value captured - tenure, BALANCE_FREQUENCY, balance, purchase, installment of
purchase, others.
And three group cluster solution gives a pattern based on high/medium/low spending with
max_spent_in_single_shopping (high value item) and probability_of_full_payment(payment made).
A dendrogram is a visual representation of cluster-making. On the x-axis are the item names or item numbers.
On the y-axis is the distance or height.The height of that horizontal line tells us about the distance at which this
label was merged into another label or cluster. The higher the level of combining, the distant the individual
items or clusters are.You can find that other cluster by following the other vertical line down again.We know
that by definition, hierarchical clustering combines the items to make one cluster.
Summarizing dendrogram:
1. horizontal lines are cluster merges.

2. vertical lines tells which clusters/labels were part of merge forming that new cluster.
3. heights of the horizontal lines tells about the distance that needed to be "bridged" to form the new cluster.
Generally, from distances > 25 up there's a huge jump of the distance to the final merge at a distance of
approx. 180.

There are no statistical techniques to decide the number of clusters in hierarchical clustering, unlike a K Means
algorithm that uses an elbow plot to determine the number of clusters. However, one common approach is to
analyze the dendrogram and look for groups that combine at a higher dendrogram distance.
Looking at the dendrogram ( Figure ) , the highest vertical distance that doesn’t intersect with any clusters is
the middle yellow one. Given that 3 vertical lines cross the threshold, the optimal number of clusters is 3.
We know that, if the number of clusters is large, the cluster size is small and the clusters are homogeneous and
if the number of clusters is small, each contains more items and hence clusters are more heterogeneous.
Now,After considering the dendrogram above, we have determined the optimum number of clusters as three.
Now , to truncate the dedrogram there are 2 modes:
1.lastp : Plot p leafs at the bottom of the plot
2.level : No more than p levels of the dendrogram tree are displayed
We will use the lastp method
For agglomerative clustering we need the below mentioned:
1. n_clusters : The number of clusters , which we have decided looking at the dendrogram.
2. Linkage : we will use the "ward linkage" method and to measure the distance between the points.This
method as described above will consider the analysis of variance method to determine the distance
between clusters.
3. Affinity: Euclidean Distance method, it calculates the distance between two real-valued vectors or we can
say it will calculate the proximity of clusters.
Cluster Profiling:
1.Cluster 2 has highest average customer spending per month and they are also able to make the timely
payments either in advance by cash or a full amount utilized due to which they have the highest credit limits
sanctioned.
We can also say that since customers are able to make timely payments these are the customers with
economic stability and high spending capacity.
2.On the contrary , cluster 1 has customers with high spendings as they are group of customers utilising
"minimum amount to be paid" facility the most, implying that they could be cutsomers with with low average
balance may be but high spending requiurements and may be not a stable economical background. Bank can
consider offering loans to such customers.
3.However, Cluster 3 have customers lowest monthly spendings and on an average making the payments on
time.However , there spending requirements are also on the lower side.
From business point of view , bank may want to target customers with Cluster 2 and cluster 1 followed by
cluster 3.
Customers in CLuster2 can be provided with promotional offers such as :

Welcome and renewal reward points etc can be offered to customer in Cluster 2 to attract
higher usage of the card
Complimentary Lounge access on high variant cards
Preferential
Foreign Currency mark up Can be considered for opening priority savings account customers
of the bank Waiver of annual credit card fee charges on opening savings account with the bank
with minimum average balance requirements etc
Annual Bonus if customers meet minimum
spend requirements on each anniversary years
Insurance on the credit card

Zero balance savings account to accquire new customers to the
bank
Interest Rate on Revolving Credit, Cash Advances and Overdue Amount
Can be offered
Personal Loan , Auto Loan etc.
Add on credit card facility for family memebers

Welcome and renewal reward points etc can be
offered to attract higher usage of the card
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and
silhouette score. Explain the results properly. Interpret and write inferences on the finalized
clusters.
K-Means CLustering
The main objective of the K-Means algorithm is to minimize the sum of distances between the points and their
respective cluster centroid.
Steps for K Means Clustering :
Step 1: Choose the number of clusters k
Step 2: Select k random points from the data as centroids
Step 3: Assign all the points to the closest cluster centroid
Step 4: Recompute the centroids of newly formed clusters
Step 5: Repeat steps 3 and 4
Stopping Criteria for K-Means Clustering:
There are essentially three stopping criteria that can be adopted to stop the K-means algorithm when:
1.Centroids of newly formed clusters do not change.Even after multiple iterations, if we are getting the same
centroids for all the clusters, we can say that the algorithm is not learning any new pattern and it is a sign to
stop the training.
2.Points remain in the same cluster even after training the algorithm for multiple iterations.
3.Maximum number of iterations are reached.Suppose if we have set the number of iterations as 100. The
process will repeat for 100 iterations before stopping.

For determination of an optimal number of clusters or k there is no closer formed solution. The choice is
somewhat subjective and graphical methods are often employed.
Objective of K Means clustering is to separate out the observations or units so that the ‘most’ similar items are
put together.
Two metrics that may give us some intuition about k:
Elbow method
Silhouette analysis
Elbow Method:
Elbow method gives us an idea on what a good k number of clusters would be based on the total within-
cluster sum of squares (WSS) between data points and their assigned clusters’ centroids. We pick k at the spot
where WSS starts to flatten out, forming an elbow.
That value of k is chosen to be optimum, where addition of one more cluster does not lower the value of total
WCSS appreciably.The Elbow method looks at the total WCSS as a function of the number of clusters.
In [58]:
from sklearn.cluster import KMeans
In [59]:
k_means = KMeans(n_clusters = 1)
k_means.fit(df_Scaled)
k_means.inertia_
Out[59]:
1469.9999999999998
In [60]:
k_means.inertia_
Out[60]:
659.171754487041
In [61]:
k_means.inertia_
Out[61]:
430.6589731513006

In [62]:
k_means.inertia_
Out[62]:
371.30172127754196
In [63]:
wss =[]
In [64]:
for i in range(1,11):
KM = KMeans(n_clusters=i)
KM.fit(df_Scaled)
wss.append(KM.inertia_)
In [105]:
wss
Out[105]:
[1469.9999999999998,
659.171754487041,
430.6589731513006,
371.1846125351018,
327.96082400790306,
289.3058777621541,
262.0598138222025,
239.0437899054871,
221.20567700702614,
207.76507400096355]

In [106]:
plt.plot(range(1,11), wss)
plt.xlabel("Clusters")
plt.ylabel("Inertia in the cluster")
plt.title("Figure 17:WSS plot")
plt.show()
In [109]:
k_means_3 = KMeans(n_clusters = 3)
k_means_3.fit(df_Scaled)
labels_3 = k_means_3.labels_
In [110]:
kmeans3_dataset=df.copy()
In [111]:
kmeans3_dataset["Clus_kmeans"] = labels_3
kmeans3_dataset.head(5)
Out[111]:
0 19.94 16.92 0.8752 6.675 3.763
1 15.99 14.89 0.9064 5.363 3.582
2 18.95 16.42 0.8829 6.248 3.755
3 10.83 12.96 0.8099 5.278 2.641
4 17.99 15.86 0.8992 5.890 3.694
Figure 17 indicates clear break in the elbow after k=3. Hence one option for optimum number of clusters is 3
and thereafter dip is visible for k=4 or 5.
Recollecting that hierarchical clustering of the same data suggested 3 clusters. There may be wide discrepancy
in the number of clusters depending on the procedure applied.
Silhouette Method
Silhouette Coefficient or silhouette score is a metric used to calculate the goodness of a clustering technique.
Its value ranges from -1 to 1.
1: Means clusters are well apart from each other and clearly distinguished.
0: Means clusters are indifferent, or we can say that the distance between clusters is not
significant.
-1: Means clusters are assigned in the wrong way.
Silhouette Score = (b-a)/max(a,b)
where,
a= average intra-cluster distance i.e the average distance between each point within a cluster.
b= average
inter-cluster distance i.e the average distance between all clusters.
This method measures how tightly the observations are clustered and the average distance between clusters.
For each observation a silhouette score is constructed which is a function of the average distance between the
point and all other points in the cluster to which it belongs, and the distance between the point and all other
points in all other clusters, that it does not belong to. The maximum value of the statistic indicates the optimum
value of k.
In [73]:
from sklearn.metrics import silhouette_samples, silhouette_score
In [112]:
silhouette_score(df_Scaled,labels_3)
Out[112]:
0.4007270552751299
In [76]:
from sklearn import metrics

In [113]:
scores = []
k_range = range(2, 11)
for k in k_range:
km = KMeans(n_clusters=k, random_state=1)
km.fit(df_Scaled)
scores.append(metrics.silhouette_score(df_Scaled, km.labels_))
scores
Out[113]:
[0.46577247686580914,
0.4007270552751299,
0.3276547677266193,
0.28273352373803834,
0.28859801403258994,
0.28190587466075073,
0.26644334449887014,
0.2583120167794957,
0.25230419288400546]
In [119]:
#plotting the sc scores

plt.plot(k_range,scores)
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette Coefficient")
plt.title("Figure 18:Silhouette Score plot ")
plt.show()
Insights:
From SC Score, the number of optimal clusters could be 3 or 4

In [123]:
pip install yellowbrick
Collecting yellowbrick
Downloading yellowbrick-1.3.post1-py3-none-any.whl (271 kB)
|████████████████████████████████| 271 kB 571 kB/s eta 0:00:01
Requirement already satisfied: numpy<1.20,>=1.16.0 in /opt/anaconda3/l

ib/python3.8/site-packages (from yellowbrick) (1.19.2)
Requirement already satisfied: cycler>=0.10.0 in /opt/anaconda3/lib/py

thon3.8/site-packages (from yellowbrick) (0.10.0)
Requirement already satisfied: matplotlib!=3.0.0,>=2.0.2 in /opt/anaco

nda3/lib/python3.8/site-packages (from yellowbrick) (3.3.4)
Requirement already satisfied: scikit-learn>=0.20 in /opt/anaconda3/li

b/python3.8/site-packages (from yellowbrick) (0.23.2)
Requirement already satisfied: scipy>=1.0.0 in /opt/anaconda3/lib/pyth

on3.8/site-packages (from yellowbrick) (1.5.2)
Requirement already satisfied: six in /opt/anaconda3/lib/python3.8/sit

e-packages (from cycler>=0.10.0->yellowbrick) (1.15.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.

3 in /opt/anaconda3/lib/python3.8/site-packages (from matplotlib!=3.0.
0,>=2.0.2->yellowbrick) (2.4.7)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/anaconda3/li

b/python3.8/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbric
k) (1.3.0)
Requirement already satisfied: pillow>=6.2.0 in /opt/anaconda3/lib/pyt

hon3.8/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbrick) (8.
0.1)
Requirement already satisfied: python-dateutil>=2.1 in /opt/anaconda3/

lib/python3.8/site-packages (from matplotlib!=3.0.0,>=2.0.2->yellowbri
ck) (2.8.1)
Requirement already satisfied: joblib>=0.11 in /opt/anaconda3/lib/pyth
on3.8/site-packages (from scikit-learn>=0.20->yellowbrick) (0.17.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/anaconda3/

lib/python3.8/site-packages (from scikit-learn>=0.20->yellowbrick) (2.
1.0)
Installing collected packages: yellowbrick
Successfully installed yellowbrick-1.3.post1
Note: you may need to restart the kernel to use updated packages.
In [124]:
from yellowbrick.cluster import SilhouetteVisualizer

In [132]:
fig, ax = plt.subplots(2, 2, figsize=(15,8))
for i in [2, 3, 4, 5]:

'''
Create KMeans instance for different number of clusters
'''
km = KMeans(n_clusters=i, init='k-means++', n_init=10, max_iter=100, random_stat
q, mod = divmod(i, 2)
'''
Create SilhouetteVisualizer instance with KMeans instance
Fit the visualizer
'''
visualizer = SilhouetteVisualizer(km, colors='yellowbrick', ax=ax[q-1][mod])
visualizer.fit(df_Scaled)
In [133]:
sil_width = silhouette_samples(df_Scaled,labels_3)

In [134]:
kmeans3_dataset["sil_width"] = sil_width
kmeans3_dataset.head(5)
Out[134]:
0 19.94 16.92 0.8752 6.675 3.763
1 15.99 14.89 0.9064 5.363 3.582
2 18.95 16.42 0.8829 6.248 3.755
3 10.83 12.96 0.8099 5.278 2.641
4 17.99 15.86 0.8992 5.890 3.694
In [118]:
silhouette_samples(df_Scaled,labels_3).min()
Out[118]:
0.002713089347678533
3 Cluster
In [83]:
km_3 = KMeans(n_clusters=3,random_state=1)
In [84]:
#fitting the Kmeans

km_3.fit(df_Scaled)
km_3.labels_
Out[84]:
array([2, 0, 2, 1, 2, 1, 1, 0, 2, 1, 2, 0, 1, 2, 0, 1, 0, 1, 1, 1, 1,
1,
2, 1, 0, 2, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 2, 2, 0, 2,
2,
1, 1, 0, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 2, 0, 1, 1, 0, 0,
2,
2, 0, 2, 1, 0, 1, 2, 2, 1, 2, 0, 1, 2, 0, 0, 0, 0, 2, 1, 0, 2,
0,
2, 1, 0, 2, 0, 1, 1, 2, 2, 2, 1, 2, 0, 2, 0, 2, 0, 2, 2, 1, 1,
2,
0, 0, 2, 1, 1, 2, 0, 0, 1, 2, 0, 1, 1, 1, 0, 0, 2, 1, 0, 0, 1,
0,
0, 2, 1, 2, 2, 1, 2, 0, 0, 0, 1, 1, 0, 1, 2, 1, 0, 1, 0, 1, 0,
0,
1, 0, 0, 1, 0, 2, 2, 1, 2, 2, 2, 1, 0, 0, 0, 1, 0, 1, 0, 2, 2,
2,
0, 1, 0, 1, 0, 0, 0, 0, 2, 2, 1, 0, 0, 1, 1, 0, 1, 2, 0, 2, 2,
1,
2, 1, 0, 2, 0, 1, 2, 0, 2, 0, 0, 0], dtype=int32)

In [85]:
#proportion of labels classified
pd.Series(km_3.labels_).value_counts()
Out[85]:
1 72
0 71
2 67
dtype: int64
K-Means Clustering & Cluster Information
In [87]:
kmeans1_dataset=df.copy()
In [88]:
# Fitting K-Means to the dataset
kmeans = KMeans(n_clusters = 3, init = 'k-means++', random_state = 1)

y_kmeans = kmeans.fit_predict(df_Scaled)
#beginning of the cluster numbering with 1 instead of 0
y_kmeans1=y_kmeans
y_kmeans1=y_kmeans+1
# New Dataframe called cluster
cluster = pd.DataFrame(y_kmeans1)
# Adding cluster to the Dataset1
kmeans1_dataset['cluster'] = cluster
#Mean of clusters
kmeans_mean_cluster = pd.DataFrame(round(kmeans1_dataset.groupby('cluster').mean(),1
kmeans_mean_cluster
Out[88]:
spending advance_payments probability_of_full_payment current_balance credit_limit m
cluster
1 14.4 14.3 0.9 5.5 3.3
2 11.9 13.2 0.8 5.2 2.8
3 18.5 16.2 0.9 6.2 3.7

In [89]:
def ClusterPercentage(datafr,name):
"""Common utility function to calculate the percentage and size of cluster"""
size = pd.Series(datafr[name].value_counts().sort_index())
percent = pd.Series(round(datafr[name].value_counts()/datafr.shape[0] * 100,2)).
size_df = pd.concat([size, percent],axis=1)

size_df.columns = ["Cluster_Size","Cluster_Percentage"]
return(size_df)
In [90]:
ClusterPercentage(kmeans1_dataset,"cluster")
Out[90]:
Cluster_Size Cluster_Percentage
1 71 33.81
2 72 34.29
3 67 31.90
In [91]:
#transposing the cluster

cluster_3_T = kmeans_mean_cluster.T
In [92]:
cluster_3_T
Out[92]:
cluster 1 2 3
spending 14.4 11.9 18.5
advance_payments 14.3 13.2 16.2
probability_of_full_payment 0.9 0.8 0.9
current_balance 5.5 5.2 6.2
credit_limit 3.3 2.8 3.7
min_payment_amt 2.7 4.7 3.6
max_spent_in_single_shopping 5.1 5.1 6.0
It is clear from Figure that the maximum value of average silhouette score is achieved for k = 3, which,
therefore, is considered to be the optimum number of clusters for this data.
However, there are a number of merits for using a smaller number of clusters. The objective of this particular
clustering effort is to devise a suitable recommendation system. It may not be practical to manage a very large
number of tailor made recommendations. Hence, the final decision regarding an appropriate number of clusters
must be taken after considering the within sum of squares and between sum of squares. Recall that within
cluster sum of squares is the squared average Euclidean distance of all the points within a cluster from the
cluster centroid and between cluster sum of squares is the average squared Euclidean distance between all
cluster centroids.
Let us now proceed with 3 clusters.
In [ ]:
In [ ]:

25/07/2021 Project-CART-RF-ANN - Jupyter Notebook
Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim frequency. The management decides to
collect data from the past few years. You are assigned the task to make a model which predicts the claim
status and provide recommendations to management. Use CART, RF & ANN and compare the models'
performances in train and test sets.
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate,
and multivariate analysis)
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score,roc_curve,classification_report,confusion_
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
# Import stats from scipy
from scipy import stats
In [2]:
df=pd.read_csv("insurance_part2_data.csv")
In [3]:
df.head()
Out[3]:
Product
Age Agency_Code Type Claimed Commision Channel Duration Sales De
Name
Customised
0 48 C2B Airlines No 0.70 Online 7 2.51
Plan
Travel Customised
1 36 EPX No 0.00 Online 34 20.00
Agency Plan
Travel Customised
2 39 CWT No 5.94 Online 3 9.90
Agency Plan
Travel Cancellation
3 36 EPX No 0.00 Online 4 26.00
Agency Plan
4 33 JZI Airlines No 6.30 Online 53 18.00 Bronze Plan
localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project-CART-RF-ANN.ipynb#Building-ANN-Model 1/41

In [4]:
df.tail()
Out[4]:
Product
Age Agency_Code Type Claimed Commision Channel Duration Sales
Name
Travel
2995 28 CWT Yes 166.53 Online 364 256.20 Gold Plan
Agency
2996 35 C2B Airlines No 13.50 Online 5 54.00 Gold Plan
Travel Customised
2997 36 EPX No 0.00 Online 54 28.00
Agency Plan
2998 34 C2B Airlines Yes 7.64 Online 39 30.55 Bronze Plan
2999 47 JZI Airlines No 11.55 Online 15 33.00 Bronze Plan
Attribute Information:
1. Target: Claim Status (Claimed)

2. Agency_Code: Code of tour firm
3. Type: Type of tour insurance firms
4. Channel: Distribution channel of tour insurance agencies
5. Product: Name of the tour insurance products
6. Duration: Duration of the tour
7. Destination: Destination of the tour
8. Sales: Amount of sales of tour insurance policies
9. Commission: The commission received for tour insurance firm
10. Age: Age of insured
In [5]:
df.info()
--- ------ -------------- -----
0 Age 3000 non-null int64
1 Agency_Code 3000 non-null object
2 Type 3000 non-null object
3 Claimed 3000 non-null object
4 Commision 3000 non-null float64
5 Channel 3000 non-null object
6 Duration 3000 non-null int64
7 Sales 3000 non-null float64
8 Product Name 3000 non-null object
9 Destination 3000 non-null object
dtypes: float64(2), int64(2), object(6)
memory usage: 234.5+ KB

In [6]:
df.dtypes.value_counts()
Out[6]:
object 6
float64 2
int64 2
dtype: int64
Data consists of both categorical and numerical values .
There are total of 3000 rows and 10 columns in the dataset.Out of 10, 6 columns are of object type, 2 columns
are of integer type and remaining two are of float type data.
10 variables
Age, Commision, Duration, Sales are numeric variable
rest are categorial variables
3000 records, no missing one
9 independant variable and one target variable - Clamied
In [7]:
df.isnull().sum()
Out[7]:
Age 0
Agency_Code 0
Type 0
Claimed 0
Commision 0
Channel 0
Duration 0
Sales 0
Product Name 0
Destination 0
dtype: int64
Data does not contain any missing values
In [8]:
round(df.describe().T,3)
Out[8]:
count mean std min 25% 50% 75% max
Age 3000.0 38.091 10.464 8.0 32.0 36.00 42.000 84.00
Commision 3000.0 14.529 25.481 0.0 0.0 4.63 17.235 210.21
Duration 3000.0 70.001 134.053 -1.0 11.0 26.50 63.000 4580.00
Sales 3000.0 60.250 70.734 0.0 20.0 33.00 69.000 539.00

Inference:
duration has negative value, it is not possible. Wrong entry.

Commision & Sales- mean and median varies signficantly
Minimum age of insured is 8 years and maximum age of insured is 84 years.Average group for insured
people is around 38.
Minimum comission an agent can earn is zero and a maximum commission is aprroximately 210.On an
average comiision earned is approximately 14.6.
Minimum amount of sales of tour insurance policies is zero and a maximum amount is 539.
On an average approximately 60.29 is amount of sales of tour insurance policies
Average duration of the tour is 70 and maximum is 4580.
In [9]:
df.shape
print('The number of rows of the dataframe is',df.shape[0],'.')
print('The number of columns of the dataframe is',df.shape[1],'.')
The number of rows of the dataframe is 3000 .
The number of columns of the dataframe is 10 .
Checking for unique Values

In [10]:
for column in df[['Agency_Code', 'Type', 'Claimed', 'Channel',

'Product Name', 'Destination']]:
print(column.upper(),': ',df[column].nunique())
print(df[column].value_counts().sort_values())
print('\n')
AGENCY_CODE : 4
JZI 239
CWT 472
C2B 924
EPX 1365
Name: Agency_Code, dtype: int64
TYPE : 2
Airlines 1163
Travel Agency 1837
Name: Type, dtype: int64
CLAIMED : 2
Yes 924
No 2076
Name: Claimed, dtype: int64
CHANNEL : 2
Offline 46
Online 2954
Name: Channel, dtype: int64
PRODUCT NAME : 5
Gold Plan 109
Silver Plan 427
Bronze Plan 650
Cancellation Plan 678
Customised Plan 1136
Name: Product Name, dtype: int64
DESTINATION : 3
EUROPE 215
Americas 320
ASIA 2465
Name: Destination, dtype: int64
Checking for Duplicate Values

In [11]:
dups = df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
df[dups]
Number of duplicate rows = 139
Out[11]:
Product
Age Agency_Code Type Claimed Commision Channel Duration Sales
Name
63 30 C2B Airlines Yes 15.0 Online 27 60.0 Bronze Plan
Travel Customised
329 36 EPX No 0.0 Online 5 20.0
Agency Plan
Travel Cancellation
407 36 EPX No 0.0 Online 11 19.0
Agency Plan
Travel Customised
411 35 EPX No 0.0 Online 2 20.0
Agency Plan
Travel Customised
422 36 EPX No 0.0 Online 5 20.0
Agency Plan
... ... ... ... ... ... ... ... ... ...
Travel Cancellation
2940 36 EPX No 0.0 Online 8 10.0
Agency Plan
Travel Customised
2947 36 EPX No 0.0 Online 10 28.0
Agency Plan
Travel Cancellation
2952 36 EPX No 0.0 Online 2 10.0
Agency Plan
Travel Customised
2962 36 EPX No 0.0 Online 4 20.0
Agency Plan
Travel Customised
2984 36 EPX No 0.0 Online 1 20.0
Agency Plan
139 rows × 10 columns
Though it shows there are 139 records, but it can be of different customers, there is no customer ID or any
unique identifier, hence,we will not drop them off.
Univariate Analysis

In [12]:
def univariateAnalysis_numeric(column,nbins):
print("Description of " + column)
print("-------------------------------------------------------------------------
print(df[column].describe(),end=' ')
plt.figure()
print("Distribution of " + column)
print("-------------------------------------------------------------------------
sns.distplot(df[column], kde=False, color='g');
plt.show()
plt.figure()
print("BoxPlot of " + column)
print("-------------------------------------------------------------------------
ax = sns.boxplot(x=df[column])
plt.show()
In [13]:
df_num = df.select_dtypes(include = ['float64', 'int64'])

df_cat=df.select_dtypes(["object"])
Categorical_column_list=list(df_cat.columns.values)
Numerical_column_list = list(df_num.columns.values)
Numerical_length=len(Numerical_column_list)
Categorical_length=len(Categorical_column_list)
print("Length of Numerical columns is :",Numerical_length)
print("Length of Categorical columns is :",Categorical_length)
Length of Numerical columns is : 4
Length of Categorical columns is : 6
In [14]:
df_cat.head()
Out[14]:
Agency_Code Type Claimed Channel Product Name Destination
0 C2B Airlines No Online Customised Plan ASIA
1 EPX Travel Agency No Online Customised Plan ASIA
2 CWT Travel Agency No Online Customised Plan Americas
3 EPX Travel Agency No Online Cancellation Plan ASIA
4 JZI Airlines No Online Bronze Plan ASIA

In [15]:
df_num.head()
Out[15]:
Age Commision Duration Sales
0 48 0.70 7 2.51
1 36 0.00 34 20.00
2 39 5.94 3 9.90
3 36 0.00 4 26.00
4 33 6.30 53 18.00
In [16]:
for x in Numerical_column_list:
univariateAnalysis_numeric(x,20)
BoxPlot of Commision
----------------------------------------------------------------------
------
Insights of Univariate Analysis of Numerical Variables:
For Age variable, Minimum age of insured is 8 years and maximum age of insured is 84 years.Average age
for insured people is around 38.
For Commision Variable, minimum commission earned is zero and a maximum commission that can be
earned is approximately 210.21, with an average earning of approximately 14.53 .
For Duration Variable, minimum duaration is a negtive value , which cannot be true , hence we now there is
atleast one wrong entry. Maximum duration of tour is 4580 and an average duration of tour is
approximately 70 .
For Sales Variable,Minimum and maximum amounts of sales of tour insurance policies are 0 and 539
respectively. On an average amount of sales is approximately 60.25 .

In [17]:
def univariateAnalysis_category(cat_column):
print("Details of " + cat_column)
print("----------------------------------------------------------------")
print(df_cat[cat_column].value_counts())
plt.figure()
df_cat[cat_column].value_counts().plot.bar(title="Frequency Distribution of " +
plt.show()
print(" ")
In [18]:
df_cat = df.select_dtypes(include = ['object'])

Categorical_column_list = list(df_cat.columns.values)
Categorical_column_list
Out[18]:
['Agency_Code', 'Type', 'Claimed', 'Channel', 'Product Name', 'Destina

tion']
Pairwise Distribution of Continuous variables

In [19]:
sns.pairplot(df[['Age', 'Commision',
'Duration', 'Sales']])
Out[19]:
<seaborn.axisgrid.PairGrid at 0x7ff4b63a7e20>

Heatmap of continuous variables
In [20]:
plt.title("Figure 3: Heatmap of Variables ")
sns.set(font_scale=1.2)
sns.heatmap(df[['Age', 'Commision',
'Duration', 'Sales']].corr(), annot=True)
Out[20]:
<AxesSubplot:title={'center':'Figure 3: Heatmap of Variables '}>
Insights:
There is strong positive correlation between Commission and Sales.

Sales and Duration are moderately correlated.
Commission and Duration have low correlation.

In [21]:
clean_dataset=df.copy()
In [22]:
def check_outliers(data):
vData_num = data.loc[:,data.columns != 'class']
Q1 = vData_num.quantile(0.25)
Q3 = vData_num.quantile(0.75)
IQR = Q3 - Q1
count = 0
# checking for outliers, True represents outlier
vData_num_mod = ((vData_num < (Q1 - 1.5 * IQR)) |(vData_num > (Q3 + 1.5 * IQR)))
#iterating over columns to check for no.of outliers in each of the numerical att
for col in vData_num_mod:
if(1 in vData_num_mod[col].value_counts().index):
print("No. of outliers in %s: %d" %( col, vData_num_mod[col].value_count
count += 1
print("\n\nNo of attributes with outliers are :", count)
check_outliers(df)
No. of outliers in Age: 204
No. of outliers in Commision: 362
No. of outliers in Duration: 382
No. of outliers in Sales: 353
No of attributes with outliers are : 4
There are outliers in all the variables, but the sales and commision can be a geneuine business value. Random
Forest and CART can handle the outliers. Hence, Outliers are not treated for now, we will keep the data as it is.
We will treat the outliers for the ANN model to compare the same after the all the steps just for comparsion.

In [23]:
df.hist(figsize=(15,16),layout=(4,2), color="blue");
plt.title("Figure 4:Distribution plot for Continuous Variables")
plt.show()
In [24]:
# Skewness of Data
df.skew(axis = 0, skipna = True).sort_values(ascending=False)
Out[24]:
Duration 13.784681
Commision 3.148858
Sales 2.381148
Age 1.149713
dtype: float64
2.2 Data Split: Split the data into test and train, build classification model CART, Random Forest,
Artificial Neural Network
Object data should be converted into categorical/numerical data to fit in the models. (pd.categorical().codes(),
pd.get_dummies(drop_first=True)) Data split, ratio defined for the split, train-test split should be discussed. Any
reasonable split is acceptable. Use of random state is mandatory. Successful implementation of each model.
Logical reason behind the selection of different values for the parameters involved in each model. Apply grid
search for each model and make models on best_params. Feature importance for each model.
Converting object data type to numerical

In [25]:
for feature in df.columns:

if df[feature].dtype == 'object':
print('\n')
print('feature:',feature)
print(pd.Categorical(df[feature].unique()))
print(pd.Categorical(df[feature].unique()).codes)
df[feature] = pd.Categorical(df[feature]).codes
feature: Agency_Code
['C2B', 'EPX', 'CWT', 'JZI']
Categories (4, object): ['C2B', 'CWT', 'EPX', 'JZI']
[0 2 1 3]
feature: Type
['Airlines', 'Travel Agency']
Categories (2, object): ['Airlines', 'Travel Agency']
[0 1]
feature: Claimed
['No', 'Yes']
Categories (2, object): ['No', 'Yes']
[0 1]
feature: Channel
['Online', 'Offline']
Categories (2, object): ['Offline', 'Online']
[1 0]
feature: Product Name
['Customised Plan', 'Cancellation Plan', 'Bronze Plan', 'Silver Plan',

'Gold Plan']
Categories (5, object): ['Bronze Plan', 'Cancellation Plan', 'Customis

ed Plan', 'Gold Plan', 'Silver Plan']
[2 1 0 4 3]
feature: Destination
['ASIA', 'Americas', 'EUROPE']
Categories (3, object): ['ASIA', 'Americas', 'EUROPE']
[0 1 2]

In [26]:
df.info()
--- ------ -------------- -----
0 Age 3000 non-null int64
1 Agency_Code 3000 non-null int8
2 Type 3000 non-null int8
3 Claimed 3000 non-null int8
4 Commision 3000 non-null float64
5 Channel 3000 non-null int8
6 Duration 3000 non-null int64
7 Sales 3000 non-null float64
8 Product Name 3000 non-null int8
9 Destination 3000 non-null int8
dtypes: float64(2), int64(2), int8(6)
memory usage: 111.5 KB
In [27]:
df.head()
Out[27]:
Product
Age Agency_Code Type Claimed Commision Channel Duration Sales Destinat
Name
0 48 0 0 0 0.70 1 7 2.51 2
1 36 2 1 0 0.00 1 34 20.00 2
2 39 1 1 0 5.94 1 3 9.90 2
3 36 2 1 0 0.00 1 4 26.00 1
4 33 3 0 0 6.30 1 53 18.00 0
Proportion of Target Variable
In [28]:
df.Claimed.value_counts(normalize=True)
Out[28]:
0 0.692
1 0.308
Name: Claimed, dtype: float64

In [29]:
# Check Counts in Target Variable
sns.countplot(df["Claimed"])
plt.title("Figure 5: Countplot of Target Variable-CLaimed")
plt.show()
/opt/anaconda3/lib/python3.8/site-packages/seaborn/_decorators.py:36:
FutureWarning: Pass the following variable as a keyword arg: x. From v
ersion 0.12, the only valid positional argument will be `data`, and pa
ssing other arguments without an explicit keyword will result in an er
ror or misinterpretation.
warnings.warn(
In [30]:
# Check % of counts in Tgt Var
print("Percentage of 0's",round(df["Claimed"].value_counts().values[0]/df["Claimed"]
print("Percentage of 1's",round(df["Claimed"].value_counts().values[1]/df["Claimed"]
Percentage of 0's 69.2 %
Percentage of 1's 30.8 %

In [31]:
df["Claimed"].value_counts().plot.pie(explode=[0,0.2],autopct='%1.1f%%',shadow=False
plt.title('Figure 6:Pi Chart of Target Variable-Claimed')
plt.show()
Extracting the target column into train and test data
In [32]:
X = df.drop("Claimed", axis=1)
y = df.pop("Claimed")
X.head()
Out[32]:
Product
Age Agency_Code Type Commision Channel Duration Sales Destination
Name
0 48 0 0 0.70 1 7 2.51 2 0
1 36 2 1 0.00 1 34 20.00 2 0
2 39 1 1 5.94 1 3 9.90 2 1
3 36 2 1 0.00 1 4 26.00 1 0
4 33 3 0 6.30 1 53 18.00 0 0

In [33]:
plt.plot(X)
plt.title("Figure:Independent Variable Plot Before Scaling")
plt.show()
In [34]:
y.head()
Out[34]:
0 0
1 0
2 0
3 0
4 0
Name: Claimed, dtype: int8
Feature Scaling
In [35]:
# Scaling the attributes.
from scipy.stats import zscore

X_scaled=X.apply(zscore)
round(X_scaled.head(),3)
Out[35]:
Product
Age Agency_Code Type Commision Channel Duration Sales Destination
Name
0 0.947 -1.314 -1.257 -0.543 0.125 -0.470 -0.816 0.269 -0.435
1 -0.200 0.698 0.796 -0.570 0.125 -0.269 -0.569 0.269 -0.435
2 0.087 -0.308 0.796 -0.337 0.125 -0.500 -0.712 0.269 1.304
3 -0.200 0.698 0.796 -0.570 0.125 -0.492 -0.484 -0.526 -0.435
4 -0.487 1.704 -1.257 -0.323 0.125 -0.127 -0.597 -1.320 -0.435

In [36]:
plt.plot(X_scaled)
plt.title("Figure:Independent Variable Plot Prior Scaling")
plt.show()
Train and Test Split
In [37]:
X_train, X_test, train_labels, test_labels = train_test_split(X_scaled, y, test_size
Checking Dimensions of Train and Test Data
In [38]:
print('X_train',X_train.shape)
print('X_test',X_test.shape)
print('train_labels',train_labels.shape)
print('test_labels',test_labels.shape)
X_train (2100, 9)
X_test (900, 9)
train_labels (2100,)
test_labels (900,)
Building Decision tree Classifier

In [39]:
param_grid_dtcl = {
'criterion': ['gini'],
'max_depth': [10,20,30,50],
'min_samples_leaf': [50,100,150],
'min_samples_split': [150,300,450],
}
dtcl = DecisionTreeClassifier(random_state=5)
grid_search_dtcl = GridSearchCV(estimator = dtcl, param_grid = param_grid_dtcl, cv =
In [ ]:
In [40]:
grid_search_dtcl.fit(X_train, train_labels)
print(grid_search_dtcl.best_params_)
best_grid_dtcl = grid_search_dtcl.best_estimator_
best_grid_dtcl
{'criterion': 'gini', 'max_depth': 10, 'min_samples_leaf': 50, 'min_sa

mples_split': 450}
Out[40]:
DecisionTreeClassifier(max_depth=10, min_samples_leaf=50, min_samples_

split=450,
random_state=5)
Generating Decision tree
In [41]:
from sklearn import tree

from sklearn.tree import DecisionTreeClassifier
In [42]:
train_char_label = ['no', 'yes']

tree_regularized = open('tree_regularized.dot','w')
dot_data = tree.export_graphviz(best_grid_dtcl, out_file= tree_regularized ,
feature_names = list(X_train),
class_names = list(train_char_label))
tree_regularized.close()
dot_data
http://webgraphviz.com/ (http://webgraphviz.com/)
Variable Importance - DTCL

In [43]:
print (pd.DataFrame(best_grid_dtcl.feature_importances_, columns = ["Imp"],

index = X_train.columns).sort_values('Imp',ascending=False))
Imp
Agency_Code 0.674494
Sales 0.222345
Product Name 0.092149
Commision 0.008008
Duration 0.003005
Age 0.000000
Type 0.000000
Channel 0.000000
Destination 0.000000
Predicting Train and Test model
In [44]:
ytrain_predict_dtcl = best_grid_dtcl.predict(X_train)
ytest_predict_dtcl = best_grid_dtcl.predict(X_test)
Getting Probabilities of predicted data
In [45]:
ytest_predict_dtcl
ytest_predict_prob_dtcl=best_grid_dtcl.predict_proba(X_test)
ytest_predict_prob_dtcl
pd.DataFrame(ytest_predict_prob_dtcl).head()
Out[45]:
0 1
0 0.656751 0.343249
1 0.979452 0.020548
2 0.921171 0.078829
3 0.656751 0.343249
4 0.921171 0.078829
Building a Random Forest Classifier

In [46]:
param_grid_rfcl = {
'max_depth': [4,5,6],#20,30,40
'max_features': [2,3,4,5],## 7,8,9
'min_samples_leaf': [8,9,11,15],## 50,100
'min_samples_split': [46,50,55], ## 60,70
'n_estimators': [290,350,400] ## 100,200
}
rfcl = RandomForestClassifier(random_state=5)
grid_search_rfcl = GridSearchCV(estimator = rfcl, param_grid = param_grid_rfcl, cv =
In [47]:
grid_search_rfcl.fit(X_train, train_labels)
Out[47]:
GridSearchCV(cv=10, estimator=RandomForestClassifier(random_state=5),
param_grid={'max_depth': [4, 5, 6], 'max_features': [2,

3, 4, 5],
'min_samples_leaf': [8, 9, 11, 15],
'min_samples_split': [46, 50, 55],
'n_estimators': [290, 350, 400]})
In [48]:
grid_search_rfcl.best_params_
Out[48]:
{'max_depth': 6,
'max_features': 3,
'min_samples_leaf': 9,
'min_samples_split': 50,
'n_estimators': 290}
In [49]:
best_grid_rfcl = grid_search_rfcl.best_estimator_
In [50]:
best_grid_rfcl
Out[50]:
RandomForestClassifier(max_depth=6, max_features=3, min_samples_leaf=

9,
min_samples_split=50, n_estimators=290, random_

state=5)
Using Best Parameters to predict Train & Test Data

In [51]:
ytrain_predict_rfcl = best_grid_rfcl.predict(X_train)
ytest_predict_rfcl = best_grid_rfcl.predict(X_test)
Getting probabilities of predicted data
In [52]:
ytest_predict_rfcl
ytest_predict_prob_rfcl=best_grid_rfcl.predict_proba(X_test)
ytest_predict_prob_rfcl
pd.DataFrame(ytest_predict_prob_rfcl).head()
Out[52]:
0 1
0 0.786094 0.213906
1 0.971485 0.028515
2 0.906544 0.093456
3 0.657028 0.342972
4 0.875002 0.124998
Variable Importance via Random Forest
In [53]:
# Variable Importance
print (pd.DataFrame(best_grid_rfcl.feature_importances_,
columns = ["Imp"],
index = X_train.columns).sort_values('Imp',ascending=False))
Imp
Agency_Code 0.279196
Product Name 0.235375
Sales 0.150871
Commision 0.146070
Duration 0.078847
Type 0.057515
Age 0.040628
Destination 0.008741
Channel 0.002758
Building ANN Model

In [54]:
param_grid_nncl = {
'hidden_layer_sizes': [50,100,200],
'max_iter': [2500,3000,4000],
'solver': ['adam'],
'tol': [0.01],
}
nncl = MLPClassifier(random_state=5)
grid_search_nncl = GridSearchCV(estimator = nncl, param_grid = param_grid_nncl, cv =
In [55]:
grid_search_nncl.fit(X_train, train_labels)
grid_search_nncl.best_params_
best_grid_nncl = grid_search_nncl.best_estimator_
best_grid_nncl
Out[55]:
MLPClassifier(hidden_layer_sizes=100, max_iter=2500, random_state=5, t

ol=0.01)
Using Best Parameters to predict Train & Test Data
In [56]:
ytrain_predict_nncl = best_grid_nncl.predict(X_train)
ytest_predict_nncl = best_grid_nncl.predict(X_test)
Getting probabilities of predicted data
In [57]:
ytest_predict_nncl
ytest_predict_prob_nncl=best_grid_nncl.predict_proba(X_test)
ytest_predict_prob_nncl
pd.DataFrame(ytest_predict_prob_nncl).head()
Out[57]:
0 1
0 0.838865 0.161135
1 0.926699 0.073301
2 0.914996 0.085004
3 0.657225 0.342775
4 0.909727 0.090273
2.3 Performance Metrics: Comment and Check the performance of Predictions

on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get
ROC_AUC score, classification reports for each model.

Comment on the validness of models (overfitting or underfitting) Build confusion matrix for each model.
Comment on the positive class in hand. Must clearly show obs/pred in row/col Plot roc_curve for each model.
Calculate roc_auc_score for each model. Comment on the above calculated scores and plots. Build
classification reports for each model. Comment on f1 score, precision and recall, which one is important here.
CART : AUC & ROC for Train Data
In [58]:
# predict probabilities
probs_cart = best_grid_dtcl.predict_proba(X_train)
# keep probabilities for the positive outcome only
probs_cart = probs_cart[:, 1]
# calculate AUC
cart_train_auc = roc_auc_score(train_labels, probs_cart)
print('AUC: %.3f' % cart_train_auc)
# calculate roc curve
cart_train_fpr, cart_train_tpr, cart_train_thresholds = roc_curve(train_labels, prob
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (FPR)")
plt.title("Figure 13: CART AUC-ROC for Train Data ")
# plot the roc curve for the model
plt.plot(cart_train_fpr, cart_train_tpr)
AUC: 0.812
Out[58]:
[<matplotlib.lines.Line2D at 0x7ff4ad794dc0>]
CART : AUC & ROC for Test Data

In [59]:
# predict probabilities
probs_cart = best_grid_dtcl.predict_proba(X_test)
# keep probabilities for the positive outcome only
probs_cart = probs_cart[:, 1]
# calculate AUC
cart_test_auc = roc_auc_score(test_labels, probs_cart)
print('AUC: %.3f' % cart_test_auc)
# calculate roc curve
cart_test_fpr, cart_test_tpr, cart_testthresholds = roc_curve(test_labels, probs_car
plt.title("Figure 14: CART AUC-ROC for Test Data ")
# plot the roc curve for the model
plt.plot(cart_test_fpr, cart_test_tpr)
AUC: 0.800
Out[59]:
[<matplotlib.lines.Line2D at 0x7ff4b7e163d0>]
CART Confusion Matrix and Classification Report for the training data
In [60]:
confusion_matrix(train_labels, ytrain_predict_dtcl)
Out[60]:
array([[1258, 195],
[ 268, 379]])

In [61]:
ax=sns.heatmap(confusion_matrix(train_labels, ytrain_predict_dtcl),annot=True, fmt='

plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.title('Figure 15: CART Confusion Matrix of Train Data')
plt.show()
In [62]:
#Train Data Accuracy

cart_train_acc=best_grid_dtcl.score(X_train,train_labels)
cart_train_acc
Out[62]:
0.7795238095238095
In [63]:
print(classification_report(train_labels, ytrain_predict_dtcl))
precision recall f1-score support
0 0.82 0.87 0.84 1453
1 0.66 0.59 0.62 647
accuracy 0.78 2100
macro avg 0.74 0.73 0.73 2100
weighted avg 0.77 0.78 0.78 2100

In [64]:
cart_metrics=classification_report(train_labels, ytrain_predict_dtcl,output_dict=Tru
df=pd.DataFrame(cart_metrics).transpose()
cart_train_f1=round(df.loc["1"][2],2)
cart_train_recall=round(df.loc["1"][1],2)
cart_train_precision=round(df.loc["1"][0],2)
print ('cart_train_precision ',cart_train_precision)
print ('cart_train_recall ',cart_train_recall)
print ('cart_train_f1 ',cart_train_f1)
cart_train_precision 0.66
cart_train_recall 0.59
cart_train_f1 0.62
CART Confusion Matrix and Classification Report for the testing data
In [65]:
confusion_matrix(test_labels, ytest_predict_dtcl)
Out[65]:
array([[536, 87],
[113, 164]])
In [66]:
ax=sns.heatmap(confusion_matrix(test_labels, ytest_predict_dtcl),annot=True, fmt='d'

plt.title('Figure 16: CART Confusion Matrix of Test Data')
plt.show()

In [67]:
#Test Data Accuracy

cart_test_acc=best_grid_dtcl.score(X_test,test_labels)
cart_test_acc
Out[67]:
0.7777777777777778
In [68]:
print(classification_report(test_labels, ytest_predict_dtcl))
0 0.83 0.86 0.84 623
1 0.65 0.59 0.62 277
accuracy 0.78 900
macro avg 0.74 0.73 0.73 900
weighted avg 0.77 0.78 0.77 900
In [69]:
cart_metrics=classification_report(test_labels, ytest_predict_dtcl,output_dict=True)
df=pd.DataFrame(cart_metrics).transpose()
cart_test_precision=round(df.loc["1"][0],2)
cart_test_recall=round(df.loc["1"][1],2)
cart_test_f1=round(df.loc["1"][2],2)
print ('cart_test_precision ',cart_test_precision)
print ('cart_test_recall ',cart_test_recall)
print ('cart_test_f1 ',cart_test_f1)
cart_test_precision 0.65
cart_test_recall 0.59
cart_test_f1 0.62
CART Conclusion:
Train Data:
AUC: 82%
Accuracy: 79%
Precision: 70%
f1-Score: 60%
Test Data:
AUC: 80%
Accuracy: 77%
Precision: 80%
f1-Score: 84%
Training and Test set results are almost similar, and with the overall measures high, the model is a good model.
Change is the most important variable for predicting diabetes

RF Model Performance Evaluation on Training data
In [70]:
confusion_matrix(train_labels,ytrain_predict_rfcl)
Out[70]:
array([[1296, 157],
[ 249, 398]])
In [71]:
ax=sns.heatmap(confusion_matrix(train_labels,ytrain_predict_rfcl),annot=True, fmt='d
plt.title('Figure 19: RF Confusion Matrix of Train Data')
plt.show()
In [72]:
rf_train_acc=best_grid_rfcl.score(X_train,train_labels)
rf_train_acc
Out[72]:
0.8066666666666666
In [73]:
print(classification_report(train_labels,ytrain_predict_rfcl))
0 0.84 0.89 0.86 1453
1 0.72 0.62 0.66 647
accuracy 0.81 2100
macro avg 0.78 0.75 0.76 2100
weighted avg 0.80 0.81 0.80 2100

In [74]:
rf_metrics=classification_report(train_labels, ytrain_predict_rfcl,output_dict=True)
df=pd.DataFrame(rf_metrics).transpose()
rf_train_precision=round(df.loc["1"][0],2)
rf_train_recall=round(df.loc["1"][1],2)
rf_train_f1=round(df.loc["1"][2],2)
print ('rf_train_precision ',rf_train_precision)
print ('rf_train_recall ',rf_train_recall)
print ('rf_train_f1 ',rf_train_f1)
rf_train_precision 0.72
rf_train_recall 0.62
rf_train_f1 0.66
In [75]:
rf_train_fpr, rf_train_tpr,_=roc_curve(train_labels,best_grid_rfcl.predict_proba(X_t
plt.plot(rf_train_fpr,rf_train_tpr,color='green')
plt.title("Figure 17: RF AUC-ROC for Train Data ")
rf_train_auc=roc_auc_score(train_labels,best_grid_rfcl.predict_proba(X_train)[:,1])
print('Area under Curve is', rf_train_auc)
Area under Curve is 0.854377395379809
RF Model Performance Evaluation on Test data
In [76]:
confusion_matrix(test_labels,ytest_predict_rfcl)
Out[76]:
array([[546, 77],
[120, 157]])

In [77]:
ax=sns.heatmap(confusion_matrix(test_labels,ytest_predict_rfcl),annot=True, fmt='d')
plt.title('Figure 20: RF Confusion Matrix of Test Data')
plt.show()
In [78]:
rf_test_acc=best_grid_rfcl.score(X_test,test_labels)
rf_test_acc
Out[78]:
0.7811111111111111
In [79]:
print(classification_report(test_labels,ytest_predict_rfcl))
0 0.82 0.88 0.85 623
1 0.67 0.57 0.61 277
accuracy 0.78 900
macro avg 0.75 0.72 0.73 900
weighted avg 0.77 0.78 0.78 900

In [80]:
rf_metrics=classification_report(test_labels, ytest_predict_rfcl,output_dict=True)
df=pd.DataFrame(rf_metrics).transpose()
rf_test_precision=round(df.loc["1"][0],2)
rf_test_recall=round(df.loc["1"][1],2)
rf_test_f1=round(df.loc["1"][2],2)
print ('rf_test_precision ',rf_test_precision)
print ('rf_test_recall ',rf_test_recall)
print ('rf_test_f1 ',rf_test_f1)
rf_test_precision 0.67
rf_test_recall 0.57
rf_test_f1 0.61
In [81]:
rf_test_fpr, rf_test_tpr,_=roc_curve(test_labels,best_grid_rfcl.predict_proba(X_test
plt.plot(rf_test_fpr,rf_test_tpr,color='green')
plt.title("Figure 18: RF AUC-ROC for Test Data ")
rf_test_auc=roc_auc_score(test_labels,best_grid_rfcl.predict_proba(X_test)[:,1])
print('Area under Curve is', rf_test_auc)
Random Forest Conclusion:
Train Data:
AUC: 86%
Accuracy: 80%
Precision: 72%
f1-Score: 66%
Test Data:
AUC: 82%
Accuracy: 78%
Precision: 68%
f1-Score: 62
Change is again the most important variable for predicting diabetes
NN Model Performance Evaluation on Training data
In [82]:
confusion_matrix(train_labels,ytrain_predict_nncl)
Out[82]:
array([[1292, 161],
[ 319, 328]])
In [83]:
ax=sns.heatmap(confusion_matrix(train_labels,ytrain_predict_nncl),annot=True, fmt='d
plt.title('Figure 23: ANN Confusion Matrix of Train Data')
plt.show()
In [84]:
nn_train_acc=best_grid_nncl.score(X_train,train_labels)
nn_train_acc
Out[84]:
0.7714285714285715

In [85]:
print(classification_report(train_labels,ytrain_predict_nncl))
0 0.80 0.89 0.84 1453
1 0.67 0.51 0.58 647
accuracy 0.77 2100
macro avg 0.74 0.70 0.71 2100
weighted avg 0.76 0.77 0.76 2100
In [86]:
nn_metrics=classification_report(train_labels, ytrain_predict_nncl,output_dict=True)
df=pd.DataFrame(nn_metrics).transpose()
nn_train_precision=round(df.loc["1"][0],2)
nn_train_recall=round(df.loc["1"][1],2)
nn_train_f1=round(df.loc["1"][2],2)
print ('nn_train_precision ',nn_train_precision)
print ('nn_train_recall ',nn_train_recall)
print ('nn_train_f1 ',nn_train_f1)
nn_train_precision 0.67
nn_train_recall 0.51
nn_train_f1 0.58
In [87]:
nn_train_fpr, nn_train_tpr,_=roc_curve(train_labels,best_grid_nncl.predict_proba(X_t
plt.plot(nn_train_fpr,nn_train_tpr,color='black')
plt.title("Figure 21: ANN AUC-ROC for Train Data ")
nn_train_auc=roc_auc_score(train_labels,best_grid_nncl.predict_proba(X_train)[:,1])
print('Area under Curve is', nn_train_auc)

NN Model Performance Evaluation on Test data
In [88]:
confusion_matrix(test_labels,ytest_predict_nncl)
Out[88]:
array([[550, 73],
[140, 137]])
In [89]:
ax=sns.heatmap(confusion_matrix(test_labels,ytest_predict_nncl),annot=True, fmt='d',
plt.title('Figure 24: ANN Confusion Matrix of Test Data')
plt.show()
In [90]:
nn_test_acc=best_grid_nncl.score(X_test,test_labels)
nn_test_acc
Out[90]:
0.7633333333333333
In [91]:
print(classification_report(test_labels,ytest_predict_nncl))
0 0.80 0.88 0.84 623
1 0.65 0.49 0.56 277
accuracy 0.76 900
macro avg 0.72 0.69 0.70 900
weighted avg 0.75 0.76 0.75 900

In [92]:
nn_metrics=classification_report(test_labels, ytest_predict_nncl,output_dict=True)
df=pd.DataFrame(nn_metrics).transpose()
nn_test_precision=round(df.loc["1"][0],2)
nn_test_recall=round(df.loc["1"][1],2)
nn_test_f1=round(df.loc["1"][2],2)
print ('nn_test_precision ',nn_test_precision)
print ('nn_test_recall ',nn_test_recall)
print ('nn_test_f1 ',nn_test_f1)
nn_test_precision 0.65
nn_test_recall 0.49
nn_test_f1 0.56
In [93]:
nn_test_fpr, nn_test_tpr,_=roc_curve(test_labels,best_grid_nncl.predict_proba(X_test
plt.plot(nn_test_fpr,nn_test_tpr,color='black')
plt.title("Figure 22: ANN AUC-ROC for Test Data ")
nn_test_auc=roc_auc_score(test_labels,best_grid_nncl.predict_proba(X_test)[:,1])
print('Area under Curve is', nn_test_auc)
Neural Network Conclusion:
Train Data:
AUC: 82%
Accuracy: 78%
Precision: 68%
f1-Score: 59
Test Data:
AUC: 80%
Accuracy: 77%
Precision: 67%
f1-Score: 57%
2.4 Final Model - Compare all models on the basis of the performance metrics in
a structured tabular manner (2.5 pts).
Describe on which model is best/optimized (1.5 pts ). A table containing all the values of accuracies, precision,
recall, auc_roc_score, f1 score. Comparison between the different models(final) on the basis of above table
values. After comparison which model suits the best for the problem in hand on the basis of different
measures. Comment on the final model.
Comparison of the performance metrics from the 3 models
In [94]:
index=['Accuracy', 'AUC', 'Recall','Precision','F1 Score']

data = pd.DataFrame({'CART Train':[cart_train_acc,cart_train_auc,cart_train_recall,ca
'CART Test':[cart_test_acc,cart_test_auc,cart_test_recall,cart_test_precision
'Random Forest Train':[rf_train_acc,rf_train_auc,rf_train_recall,rf_train_prec
'Random Forest Test':[rf_test_acc,rf_test_auc,rf_test_recall,rf_test_precisio
'Neural Network Train':[nn_train_acc,nn_train_auc,nn_train_recall,nn_train_pre
'Neural Network Test':[nn_test_acc,nn_test_auc,nn_test_recall,nn_test_precisi
round(data,2)
Out[94]:
CART CART Random Forest Random Forest Neural Network Neural Network
Train Test Train Test Train Test
Accuracy 0.78 0.78 0.81 0.78 0.77 0.76
AUC 0.81 0.80 0.85 0.82 0.81 0.80
Recall 0.59 0.59 0.62 0.57 0.51 0.49
Precision 0.66 0.65 0.72 0.67 0.67 0.65
F1 Score 0.62 0.62 0.66 0.61 0.58 0.56
ROC Curve for the 3 models on the Training data

In [98]:
plt.plot(cart_train_fpr, cart_train_tpr,color='red',label="CART")
plt.plot(rf_train_fpr,rf_train_tpr,color='green',label="RF")
plt.plot(nn_train_fpr,nn_train_tpr,color='black',label="NN")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Figure 25:ROC for 3 Models in Training Data')
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc='lower right')
Out[98]:
<matplotlib.legend.Legend at 0x7ff4b17210d0>

ROC Curve for the 3 models on the Test data
In [99]:
plt.plot(cart_test_fpr, cart_test_tpr,color='red',label="CART")
plt.plot(rf_test_fpr,rf_test_tpr,color='green',label="RF")
plt.plot(nn_test_fpr,nn_test_tpr,color='black',label="NN")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Figure 26:ROC for 3 Models in Test Data')
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc='lower right')
Out[99]:
<matplotlib.legend.Legend at 0x7ff4b88992e0>
RF model should be selected, as it has better accuracy, precsion, recall, f1 score better than other two CART &
NN.
2.5 Based on your analysis and working on the business problem, detail out appropriate insights and
recommendations to help the management solve the business objective.
p g j
There should be at least 3-4 Recommendations and insights in total. Recommendations should be easily
understandable and business specific, students should not give any technical suggestions. Full marks should
only be allotted if the recommendations are correct and business specific.
In [ ]:
In [ ]:
In [ ]:

Jupyter Notebook Project DM Nikita Chaturvedi 25.07.2021

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Jupyter Notebook Project DM Nikita Chaturvedi 25.07.2021

Uploaded by

Copyright:

Available Formats

25/07/2021 Project- Clustering - Jupyter Notebook

RangeIndex: 210 entries, 0 to 209

Data columns (total 7 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 spending 210 non-null float64

1 advance_payments 210 non-null float64

2 probability_of_full_payment 210 non-null float64

3 current_balance 210 non-null float64

4 credit_limit 210 non-null float64

5 min_payment_amt 210 non-null float64

6 max_spent_in_single_shopping 210 non-null float64

memory usage: 11.6 KB

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project- Clustering.ipynb#1.4-Apply-K-Means-clustering-on-scaled-data-and-determine-optim… 1/42

# Value Counts of Data

Data consists of only numerical values .

print('The number of rows of the dataframe is',df.shape[0],'.')

The number of rows of the dataframe is 210 .

The number of columns of the dataframe is 7 .

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project- Clustering.ipynb#1.4-Apply-K-Means-clustering-on-scaled-data-and-determine-optim… 2/42

spending advance_payments probability_of_full_payment current_balance credit_limit min_pa

0 19.94 16.92 0.8752 6.675 3.763

1 15.99 14.89 0.9064 5.363 3.582

2 18.95 16.42 0.8829 6.248 3.755

3 10.83 12.96 0.8099 5.278 2.641

4 17.99 15.86 0.8992 5.890 3.694

spending advance_payments probability_of_full_payment current_balance credit_limit min_

205 13.89 14.02 0.8880 5.439 3.199

206 16.77 15.62 0.8638 5.927 3.438

207 14.03 14.16 0.8796 5.438 3.201

208 16.12 15.00 0.9000 5.709 3.485

209 15.57 15.15 0.8527 5.920 3.231

Data Dictionary for Market Segmentation:

1. spending: Amount spent by the customer per month (in 1000s)

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project- Clustering.ipynb#1.4-Apply-K-Means-clustering-on-scaled-data-and-determine-optim… 3/42

count mean std min 25% 50% 75% max

spending 210.0 14.85 2.91 10.59 12.27 14.36 17.30 21.18

advance_payments 210.0 14.56 1.31 12.41 13.45 14.32 15.72 17.25

probability_of_full_payment 210.0 0.87 0.02 0.81 0.86 0.87 0.89 0.92

current_balance 210.0 5.63 0.44 4.90 5.26 5.52 5.98 6.68

credit_limit 210.0 3.26 0.38 2.63 2.94 3.24 3.56 4.03

min_payment_amt 210.0 3.70 1.50 0.77 2.56 3.60 4.77 8.46

max_spent_in_single_shopping 210.0 5.41 0.49 4.52 5.04 5.22 5.88 6.55

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project- Clustering.ipynb#1.4-Apply-K-Means-clustering-on-scaled-data-and-determine-optim… 4/42

# Check for Duplicate Values

Number of duplicate rows = 0

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project- Clustering.ipynb#1.4-Apply-K-Means-clustering-on-scaled-data-and-determine-optim… 5/42

fig,axs = plt.subplots(nrows=2,ncols=1, figsize=(20,10))

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project- Clustering.ipynb#1.4-Apply-K-Means-clustering-on-scaled-data-and-determine-optim… 6/42

print("Length of Numerical columns is :",Numerical_length)

Length of Numerical columns is : 7

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project- Clustering.ipynb#1.4-Apply-K-Means-clustering-on-scaled-data-and-determine-optim… 7/42

Insights of Univariate Analysis:

import seaborn as sns; sns.set(style="ticks", color_codes=True)

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project- Clustering.ipynb#1.4-Apply-K-Means-clustering-on-scaled-data-and-determine-optim… 8/42

#Scatterplots of all possible variable pairs

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project- Clustering.ipynb#1.4-Apply-K-Means-clustering-on-scaled-data-and-determine-optim… 9/42

plt.title("Figure 4: Heatmap of Variables")

Insigjts from Bivariate and Multivariate Analysis:

localhost:8888/notebooks/Downloads/Data Mining/Project- DM/Project- Clustering.ipynb#1.4-Apply-K-Means-clustering-on-scaled-data-and-determine-opti… 10/42

Strong positive correlation between

- spending & advance_payments,

- advance_payments & current_balance,