
PREDICTING THE SUCCESS OF BANK TELEMARKETING USING MACHINE LEARNING

IDS Project Report

Team Members:

• Tathya Sethi (20UCS210)
• Avish Vijay (20UCS044)
• Viral Bambori (20UCS230)
• Chiroshree Tiwari (20UCS058)

Submitted to:

• Dr. Sakthi Balan Muthiah

1. Objective:
This project will allow the bank to create a target customer profile
for upcoming marketing strategies, anticipate how clients will
react to its telemarketing campaign, and gain more detailed
insight into its customer base.
By examining customer features such as demographics and
transaction history, the bank will be able to estimate customers'
saving behavior and determine which types of customers are more
likely to make term deposits. The bank can then focus its
marketing efforts on such clients. As a result, the bank will be able
to secure deposits more effectively and improve customer
satisfaction by omitting ads that are unsuitable for individual
customers.

2. Problem Statement:
The data relates to the direct marketing campaigns (phone calls)
of a Portuguese banking institution. The classification goal is to
predict whether the client will subscribe to a term deposit
(variable y) or not.

3. Data Set Information:


The dataset that we used is publicly available in the UCI Machine
Learning Repository as the Bank Marketing Data Set
(bank-additional-full.csv). This dataset contains 20 input variables
and 1 output variable. It consists of 41188 instances. The information
about the attributes is as follows:
Input Variables
# bank client data:
1 - age (numeric)
2 - job: type of job (categorical: 'admin.', 'blue-collar',
'entrepreneur', 'housemaid', 'management', 'retired', 'self-
employed', 'services', 'student', 'technician', 'unemployed',
'unknown')
3 - marital: marital status (categorical: 'divorced', 'married',
'single', 'unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y', 'basic.6y', 'basic.9y',
'high.school', 'illiterate', 'professional.course', 'university.degree',
'unknown')
5 - default: has credit in default? (categorical: 'no', 'yes',
'unknown')
6 - housing: has housing loan? (categorical: 'no', 'yes',
'unknown')
7 - loan: has personal loan? (categorical: 'no', 'yes', 'unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular',
'telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb',
'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical:
'mon', 'tue', 'wed', 'thu', 'fri')
11 - duration: last contact duration, in seconds (numeric).
Important note: this attribute highly affects the output target (e.g.,
if duration=0 then y='no'). Yet, the duration is not known before a
call is performed. Also, after the end of the call y is obviously
known. Thus, this input should only be included for benchmark
purposes and should be discarded if the intention is to have a
realistic predictive model.

# other attributes:
12 - campaign: number of contacts performed during this
campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was
last contacted from a previous campaign (numeric; 999 means
client was not previously contacted)
14 - previous: number of contacts performed before this
campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign
(categorical: 'failure', 'nonexistent', 'success')

# social and economic context attributes:


16 - emp.var.rate: employment variation rate - quarterly indicator
(numeric)
17 - cons.price.idx: consumer price index - monthly indicator
(numeric)
18 - cons.conf.idx: consumer confidence index - monthly
indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator
(numeric)
Output Variables

21 - y: has the client subscribed a term deposit? (binary: 'yes',
'no')

4. Exploratory Data Analysis:


Importing the libraries
Pandas: Pandas is a software library written for the Python
programming language for data manipulation and analysis.
NumPy: NumPy is a library for the Python programming
language, adding support for large, multi-dimensional arrays and
matrices, along with a large collection of high-level mathematical
functions to operate on these arrays.
Matplotlib: Matplotlib is a plotting library for the Python
programming language and its numerical mathematics extension
NumPy.
Seaborn: Seaborn is a library that uses Matplotlib underneath to
plot graphs.
scikit-learn: scikit-learn is a free software machine learning
library for the Python programming language. It features various
classification, regression and clustering algorithms.

Importing the dataset
We use the read_csv() function of the pandas library for importing
our dataset.
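
The loading step described above might look like the following minimal sketch. The in-memory two-row excerpt is a hypothetical stand-in for bank-additional-full.csv, and the ';' separator is an assumption about how the UCI file is delimited; neither is taken from the original code.

```python
import io
import pandas as pd

# Hypothetical excerpt standing in for bank-additional-full.csv.
# Assumption: the real file uses ';' as its field separator.
sample_csv = io.StringIO(
    "age;job;marital;y\n"
    "56;housemaid;married;no\n"
    "37;services;married;yes\n"
)

# With the real file this would be:
#   df = pd.read_csv("bank-additional-full.csv", sep=";")
df = pd.read_csv(sample_csv, sep=";")

print(df.head())   # first rows of the dataset
print(df.shape)    # (number of instances, number of features)
```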

Output

Count of instances and features

Output

There are 41188 instances and 21 features in the dataset.

Checking for null values

Output

There are no null values in the dataset.

Checking for duplicate values

Output

There are 12 duplicate rows in our dataset. We need to drop
these duplicates.

Output

The 12 duplicates were dropped.
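
The null and duplicate checks above can be sketched as follows. The small DataFrame is made-up illustrative data, not rows from the bank dataset, and the variable names are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration; rows 0 and 1 are exact duplicates.
df = pd.DataFrame({
    "age": [30, 30, 45, 52],
    "job": ["admin.", "admin.", "technician", "retired"],
    "y": ["no", "no", "yes", "no"],
})

# Number of null values in each column.
null_counts = df.isnull().sum()

# Number of rows that repeat an earlier row exactly.
n_duplicates = int(df.duplicated().sum())

# Keep only the first occurrence of each duplicated row.
df_unique = df.drop_duplicates().reset_index(drop=True)

print(null_counts)
print(n_duplicates, len(df_unique))
```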

Dealing with missing values


Several categorical attributes contain missing values, all coded
with the 'unknown' label. We need to remove these values from
the dataset.
First, we replace 'unknown' with a null value.

Output

Now we drop the rows that contain these null values.

Output

So we are left with 30478 instances in our dataset.
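
A minimal sketch of the two steps above, replacing 'unknown' with a null value and then dropping the affected rows. The data is made up for illustration and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows; 'unknown' marks a missing categorical value.
df = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "retired"],
    "education": ["high.school", "basic.9y", "unknown", "university.degree"],
    "y": ["no", "yes", "no", "yes"],
})

# Step 1: turn the 'unknown' label into a proper null value.
df_nan = df.replace("unknown", np.nan)
missing_per_column = df_nan.isnull().sum()

# Step 2: drop every row that has at least one null value.
df_clean = df_nan.dropna().reset_index(drop=True)

print(missing_per_column)
print(len(df_clean))
```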

Checking for Class Imbalance


We check the distribution of target variable to see if the dataset is
balanced or not.

Output

Output

We can see from the above plot that the dataset is imbalanced.
We will use random upsampling to deal with this problem.
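
The class distribution check can be sketched as below. The 9:1 split is made up for illustration and does not reproduce the real class counts.

```python
import pandas as pd

# Made-up target column with a 9:1 class ratio.
y = pd.Series(["no"] * 9 + ["yes"] * 1, name="y")

# Absolute and relative frequency of each class.
counts = y.value_counts()
proportions = y.value_counts(normalize=True)

print(counts)
print(proportions)
```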

Looking at the datatypes of all the attributes present.

Output

5. Data Visualisation
We analyse the different types of attributes separately.

Analysis of categorical attributes

Inferences

• No customers have credit in default.
• Most of the customers have a personal loan.
• For most of the customers, the outcome of the previous marketing campaign does
not exist. This means that most of the customers are new.
• Most of the customers are married.
• Customers whose level of education is high school or university show the
highest rate of subscribing to a term deposit.
• Customers whose job is admin show the highest rate of subscribing to a
term deposit.
• Customers contacted by cellular phone are more likely to subscribe to a term
deposit.
• All days of the week have nearly the same distribution, so day_of_week is of
little relevance to the classification process.

Analysis of numerical attributes

Inferences

• For the age attribute, the box plots of the two classes overlap considerably, so
it cannot be considered a good attribute for classification.
• The features emp.var.rate, euribor3m and nr.employed show a clear difference in
median between the box plots of the two classes, so they can be considered
good attributes for classification.
• The duration of the call can be a useful attribute in predicting the target variable.
• For the pdays attribute, most values are 999, meaning that most of the customers
have not been previously contacted.

Correlation Matrix

We can see that the attribute emp.var.rate has very high correlation with attribute
euribor3m.
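
A correlation matrix like the one discussed above can be computed with pandas, as in this sketch. The five rows are made-up values chosen only to resemble the attribute ranges; they are not taken from the dataset.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "emp.var.rate": [1.1, 1.4, -0.1, -1.8, -2.9],
    "euribor3m": [4.86, 4.96, 4.02, 1.30, 1.00],
    "age": [56, 23, 41, 35, 60],
    "job": ["admin.", "student", "services", "retired", "technician"],
})

# Pearson correlation between the numeric columns only.
corr = df.select_dtypes(include="number").corr()

print(corr.round(2))
```

With seaborn installed, a matrix of this kind is commonly drawn with sns.heatmap(corr, annot=True).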

6. Data Preprocessing:
Splitting the Dataset
We split our dataset into testing and training samples.

Output

We can see that our training sample contains 24382 instances
and our testing sample contains 6096 instances.
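
The reported sizes are consistent with holding out 20% of the 30478 instances for testing. A minimal sketch of such a split follows; the test_size, random_state and stratify settings are assumptions, not values confirmed by the report, and the data is made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows with a 90:10 class ratio.
df = pd.DataFrame({"age": range(100), "y": ["no"] * 90 + ["yes"] * 10})

# Assumed settings: 20% test split, fixed seed, class ratio preserved.
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["y"]
)

print(len(train_df), len(test_df))
```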

Random Sampling
The training dataset is checked for the problem of imbalance.

We can clearly see that the training dataset is imbalanced.
We used random upsampling of the minority class to handle this
imbalance.

This resolves the imbalance problem. We also shuffle the data.
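
One common way to upsample and shuffle, sketched here under the assumption that sklearn.utils.resample is used (the report does not say which function was applied). The data and variable names are made up.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up imbalanced training data: 8 'no' rows and 2 'yes' rows.
train_df = pd.DataFrame({"age": range(10), "y": ["no"] * 8 + ["yes"] * 2})

majority = train_df[train_df["y"] == "no"]
minority = train_df[train_df["y"] == "yes"]

# Sample the minority class with replacement until it matches the majority.
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# Combine both classes and shuffle the rows.
balanced = (
    pd.concat([majority, minority_up])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced["y"].value_counts())
```

Only the training set is resampled here, so the test set keeps its original class distribution.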

Encoding of categorical attributes


Categorical data is converted into integer format so that it can be
provided to the different models. Both the training and test sets are
encoded using one-hot encoding. We used the get_dummies()
function of the pandas library.

The values after encoding are as follows:

We dropped the redundant target column created by the encoding.

Splitting into features and target


We split the training and test sets to separate the features (X) from
the target attribute (y).
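
The encoding and the feature/target separation described above can be sketched as follows. The data is made up, the column names y_no and y_yes are what get_dummies() produces for a 'yes'/'no' target, and the reindex step that aligns the test columns with the training columns is an added assumption rather than something the report states.

```python
import pandas as pd

# Made-up training and test rows.
train_df = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["married", "single", "divorced"],
    "y": ["no", "yes", "no"],
})
test_df = pd.DataFrame({
    "age": [41, 28],
    "marital": ["single", "married"],
    "y": ["yes", "no"],
})

# One-hot encode every categorical column, including the target.
train_enc = pd.get_dummies(train_df, dtype=int)
test_enc = pd.get_dummies(test_df, dtype=int)

# Assumption: give the test set the same columns as the training set,
# filling categories that never occur in the test rows with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

# y_no is the complement of y_yes, so one of the two is redundant.
train_enc = train_enc.drop(columns="y_no")
test_enc = test_enc.drop(columns="y_no")

# Separate the features (X) from the target (y).
X_train, y_train = train_enc.drop(columns="y_yes"), train_enc["y_yes"]
X_test, y_test = test_enc.drop(columns="y_yes"), test_enc["y_yes"]

print(X_train.columns.tolist())
```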

7. Prediction and Classification:


We use different machine learning models for classification and
compare them through various evaluation metrics.

Applying logistic regression


We use this algorithm because it is suited to cases in which
the target variable has only two possible classes. It works well
for categorical prediction.
We get an accuracy of approx. 86% from the logistic regression
model.
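
A minimal sketch of fitting and scoring a logistic regression model with scikit-learn. Synthetic data from make_classification stands in for the encoded bank features, and max_iter=1000 is an assumed setting, so the accuracy printed here is unrelated to the 86% reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary data (assumption: stands in for the encoded dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_test, y_pred)

print(accuracy)
print(cm)
```

The diagonal of the confusion matrix counts the correct predictions, so accuracy equals the trace divided by the total.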

Confusion Matrix

Applying decision tree


Decision trees are a popular choice for classification. They
do not require feature scaling because they are not based on
distance, so we decided to apply this algorithm.

We get an accuracy of approx. 82% by using decision tree
classifier based on gini index.
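
A Gini-based decision tree can be fitted as in this sketch. The data is synthetic and every setting other than criterion="gini" is an assumption, so the printed accuracy is not the 82% reported above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (assumption: stands in for the encoded dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Splits are chosen by Gini impurity; no feature scaling is applied.
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X_train, y_train)

tree_accuracy = accuracy_score(y_test, tree.predict(X_test))
print(tree_accuracy, tree.get_depth())
```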

Applying random forest


We chose this model because it is an ensemble method that
combines several decision tree classifiers, which typically
improves classification accuracy. It is also robust to outliers.

We get an accuracy of 90.6% which is the highest so far.
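
A random forest can be trained and evaluated in the same way, as sketched below. The data is synthetic and n_estimators=100 is an assumed setting, so the result does not reproduce the 90.6% reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary data (assumption: stands in for the encoded dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# An ensemble of 100 decision trees, each trained on a bootstrap sample.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
forest_pred = forest.predict(X_test)

forest_accuracy = accuracy_score(y_test, forest_pred)
forest_cm = confusion_matrix(y_test, forest_pred)

print(forest_accuracy)
print(forest_cm)
```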

Confusion Matrix

Applying K fold Cross-Validation
We use this technique to validate the performance of
the random forest classifier applied above.

We took the mean of the scores obtained in cross-validation as
our accuracy measure and obtained an accuracy of 96.3%, an
increase of approximately 6 percentage points over the earlier
random forest result. This technique also helps reduce the bias
in our estimate of the model's performance.
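
Averaging the cross-validation scores can be sketched with cross_val_score. The number of folds (10 here), the forest size and the synthetic data are all assumptions, since the report does not state the value of k.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary data (assumption: stands in for the encoded dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)

# Assumed k = 10: train on 9 folds, score on the remaining fold, 10 times.
scores = cross_val_score(forest, X, y, cv=10, scoring="accuracy")
mean_accuracy = scores.mean()

print(scores.round(3))
print(mean_accuracy)
```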

8. Conclusion:
First, we carried out EDA and visualised the numerical and
categorical attributes separately to gain better insight into the
data and to draw some useful inferences. As is typical of
marketing campaigns, our dataset was highly imbalanced. To deal
with this problem, random upsampling was used.
To predict whether a customer will subscribe to a term deposit
after the bank's telemarketing campaign, we applied several
machine learning classification algorithms. First, we trained a
logistic regression model, which is useful for binary classification
problems, and obtained an accuracy of about 86%.
Since the dataset has many categorical attributes, we then trained
a decision tree classifier, as decision trees are well suited to
handling such attributes. We obtained an accuracy of about 82%
with a decision tree based on the Gini index. To improve on this,
we used an ensemble of decision trees known as a random forest
classifier, which raised the accuracy to 90.6%.
Finally, we applied cross-validation, which helped us make better
use of our data. The mean cross-validation accuracy was 96.3%,
approximately 6 percentage points higher than the earlier random
forest result.

9. References:
• Scikit-learn tutorials
• Pandas documentation
• NumPy documentation
• UCI Machine Learning Repository

The complete code is available here.
