Quadexp IDS Project
Team Members:
Submitted to:
1. Objective:
This project will allow the bank to build a target customer profile
for upcoming marketing strategies, anticipate how clients will
react to its telemarketing campaign, and gain a more detailed
insight into its customer base.
By examining customer features such as demographics and
transaction history, the bank will be able to estimate customers'
saving behaviour and determine which types of customers are
more likely to make term deposits. The bank may then focus its
marketing efforts on those clients. As a result, the bank will be
able to secure deposits more effectively and improve customer
satisfaction by omitting ads that are unsuitable for individual
customers.
2. Problem Statement:
The data is related to the direct marketing campaigns (phone
calls) of a Portuguese banking institution. The classification goal
is to predict whether the client will subscribe to a term deposit
(variable y) or not.
The dataset (additional-full.csv) contains 20 input variables and 1
output variable and consists of 41188 instances. The information
about the attributes is as follows:
Input Variables
# bank client data:
1 - age (numeric)
2 - job: type of job (categorical: 'admin.', 'blue-collar',
'entrepreneur', 'housemaid', 'management', 'retired', 'self-
employed', 'services', 'student', 'technician', 'unemployed',
'unknown')
3 - marital: marital status (categorical: 'divorced', 'married',
'single', 'unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y', 'basic.6y', 'basic.9y',
'high.school', 'illiterate', 'professional.course', 'university.degree',
'unknown')
5 - default: has credit in default? (categorical: 'no', 'yes',
'unknown')
6 - housing: has housing loan? (categorical: 'no', 'yes',
'unknown')
7 - loan: has personal loan? (categorical: 'no', 'yes', 'unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular',
'telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb',
'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical:
'mon', 'tue', 'wed', 'thu', 'fri')
11 - duration: last contact duration, in seconds (numeric).
Important note: this attribute highly affects the output target (e.g.,
if duration=0 then y='no'). Yet, the duration is not known before a
call is performed. Also, after the end of the call y is obviously
known. Thus, this input should only be included for benchmark
purposes and should be discarded if the intention is to have a
realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this
campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was
last contacted from a previous campaign (numeric; 999 means
client was not previously contacted)
14 - previous: number of contacts performed before this
campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign
(categorical: 'failure', 'nonexistent', 'success')
# social and economic context attributes:
16 - emp.var.rate: employment variation rate - quarterly indicator
(numeric)
17 - cons.price.idx: consumer price index - monthly indicator
(numeric)
18 - cons.conf.idx: consumer confidence index - monthly
indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator
(numeric)
Output Variable (desired target):
21 - y: has the client subscribed a term deposit? (binary: 'yes',
'no')
Importing the dataset
We use the read_csv() function of the pandas library to import
our dataset.
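The import step can be sketched as follows. A tiny in-memory sample stands in for the real CSV here so the snippet is self-contained; the semicolon separator is an assumption based on the usual formatting of the UCI bank file, to be adapted if your copy differs.

```python
import io
import pandas as pd

# In-memory stand-in for the real CSV file, with a small subset of
# columns; pass the actual file path to read_csv() in practice.
sample = io.StringIO(
    "age;job;marital;y\n"
    "30;admin.;married;no\n"
    "45;technician;single;yes\n"
)
df = pd.read_csv(sample, sep=";")
print(df.shape)  # (2, 4)
```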
There are 12 duplicate values in our dataset. We need to drop
these duplicates.
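A minimal sketch of the duplicate check and removal, using a toy frame with one exact duplicate row in place of the real data (where 12 duplicates were found):

```python
import pandas as pd

# Toy frame with one duplicated row standing in for the real dataset.
df = pd.DataFrame({"age": [30, 45, 30], "job": ["admin.", "services", "admin."]})
print(df.duplicated().sum())  # 1 duplicate row

# Drop the duplicates and reindex.
df = df.drop_duplicates().reset_index(drop=True)
print(len(df))  # 2
```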
[Plot: class distribution of the target variable y]
We can see from the plot above that the dataset is imbalanced.
We will use random upsampling to deal with this problem.
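The imbalance can be quantified with a simple class count. The proportions below are illustrative, chosen to mirror the roughly 88/12 split of 'no' and 'yes' in the real data:

```python
import pandas as pd

# Illustrative target column: 'no' heavily outnumbers 'yes'.
y = pd.Series(["no"] * 88 + ["yes"] * 12, name="y")
counts = y.value_counts()
print(counts)            # no: 88, yes: 12
print(counts / len(y))   # class proportions
```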
5. Data Visualisation
We analyse the numerical and categorical attributes separately.
Inferences
For the age attribute, the box plots of the two classes overlap heavily, so age
cannot be considered a good attribute for classification.
The features emp.var.rate, euribor3m and nr.employed show a clear difference in
median between the box plots of the two classes, so they can be considered
good attributes for classification.
The duration of the last contact can be a useful attribute for predicting the
target variable.
For the pdays attribute, most values are 999, meaning that most customers
have never been contacted previously.
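The median comparison behind these inferences can be sketched as follows. The numbers below are synthetic stand-ins shaped like the real euribor3m/y relationship, not values from the dataset:

```python
import pandas as pd

# Synthetic stand-in: a numeric feature split by the target class.
df = pd.DataFrame({
    "euribor3m": [4.8, 4.9, 5.0, 1.3, 1.2, 4.7, 1.4, 4.9],
    "y":         ["no", "no", "no", "yes", "yes", "no", "yes", "no"],
})
medians = df.groupby("y")["euribor3m"].median()
print(medians)
# Side-by-side box plots, as shown in the report's figures:
# df.boxplot(column="euribor3m", by="y")
```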
Correlation Matrix
We can see that the attribute emp.var.rate has a very high
correlation with the attribute euribor3m.
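A correlation matrix of the numeric columns can be computed with pandas. The values below are a small synthetic sample constructed to exhibit the same pattern the report observes (emp.var.rate and euribor3m moving almost in lockstep):

```python
import pandas as pd

# Synthetic columns mimicking the observed relationship.
df = pd.DataFrame({
    "emp.var.rate": [1.1, 1.4, -0.1, -1.8, 1.1, -2.9],
    "euribor3m":    [4.86, 4.96, 4.19, 1.31, 4.86, 0.72],
    "age":          [30, 45, 22, 51, 38, 29],
})
corr = df.corr()
print(corr.round(2))
```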
6. Data Preprocessing:
Splitting the Dataset
We split our dataset into training and testing samples.
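The split can be sketched with scikit-learn's train_test_split. The 80/20 ratio and the stratification on y are assumptions here, chosen to preserve the class ratio in both samples:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the preprocessed dataset.
df = pd.DataFrame({"age": range(20), "y": ["no"] * 16 + ["yes"] * 4})
X, y = df[["age"]], df["y"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test))  # 16 4
```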
Random Sampling
We check the training dataset for class imbalance.
This resolves the imbalance problem. We shuffle the data as
well.
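The upsampling-and-shuffle step can be sketched with sklearn.utils.resample on a hypothetical imbalanced training frame; the column names and sizes below are illustrative:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced training data.
train = pd.DataFrame({"x": range(10), "y": ["no"] * 8 + ["yes"] * 2})
majority = train[train["y"] == "no"]
minority = train[train["y"] == "yes"]

# Upsample the minority class to match the majority, with replacement.
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# Recombine and shuffle.
balanced = (
    pd.concat([majority, minority_up])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
print(balanced["y"].value_counts())  # no: 8, yes: 8
```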
We dropped the redundant target variable.
Confusion Matrix
We get an accuracy of approximately 82% using a decision tree
classifier based on the Gini index.
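A minimal sketch of this step, with synthetic data standing in for the bank features (the real report trains on the preprocessed, upsampled training set, so the accuracy here will differ):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data in place of the bank dataset.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Decision tree split on Gini impurity, as in the report.
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(accuracy_score(y_te, pred))
print(confusion_matrix(y_te, pred))
```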
Confusion Matrix
Applying K-fold Cross-Validation
We use this technique to validate the performance of the random
forest classifier model that we used above.
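The validation step can be sketched with cross_val_score. The choice of k=5 and the synthetic data are assumptions for illustration; the report does not state its fold count:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the bank training set.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Score the random forest across 5 folds and average.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(rf, X, y, cv=5, scoring="accuracy")
print(scores.mean().round(3))
```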
8. Conclusion:
First we performed some EDA and visualisation of the numerical
and categorical attributes separately, to gain a better insight into
the data we are dealing with and to draw some useful inferences.
As is natural for most marketing campaigns, our dataset was
highly imbalanced; random upsampling was used to deal with this
problem.
To design our model for predicting whether a customer will
subscribe to a term deposit after the bank's telemarketing
campaign, we applied various machine learning classification
algorithms. First we trained our model using logistic regression,
which is well suited to binary classification problems, and
obtained an accuracy of 86%.
Since the dataset had many categorical attributes, we also
trained our model using decision tree classifiers, which are highly
suitable for handling these kinds of attributes. We obtained an
accuracy of 82% through decision trees based on the Gini index.
To improve this accuracy further we used an ensemble of
decision trees known as the random forest classifier, which
improved the accuracy of our model to 90%.
Finally we applied cross-validation, which helped us use our data
more effectively and increased the accuracy of our model by
approximately 6%, for a final accuracy of 96.3%.
9. References:
Scikit-learn tutorials
Pandas documentation
NumPy documentation
UCI Machine Learning Repository