CIS111-6 Assignment 2: Advanced Data Techniques for Data Mining



UNIT:

ASSIGNMENT:
UNIT COORDINATOR:

STUDENT NAME: DIKSHA


ID:
EMAIL: {YOUR.NAME}@STUDY.BEDS.AC.UK

ADVANCED DATA MINING TECHNIQUES
FOR DIRECT MARKETING CAMPAIGNS

1. INTRODUCTION

This task describes basic and advanced data mining techniques using the bank marketing dataset from Kaggle. The banking sector is evolving and innovating day by day. We chose this dataset for two reasons: first, it has been used in a Kaggle competition; second, banks store very large amounts of data, including customers' personal information and their complete history, and they market their products and offers with the help of this customer history. Targeting customers' demands through one-to-one meetings and media is called direct marketing. In this assignment we analyze several data mining techniques applied to this problem.

Business intelligence with data mining is very common nowadays, and many techniques and solutions for business improvement have been developed, particularly in the data science field, where modern organizations rely on data for their decisions. Analyzing previous data is the first step in solving an existing problem, while prediction about upcoming data is very useful. Many basic and advanced techniques exist, but in this task we use only some of them. The data mining methods and algorithms covered are: data cleaning, such as dealing with missing values and removing outliers; data visualization with multiple Python libraries; tracking patterns in the data with two types of analysis, univariate analysis over a single column and bivariate analysis over multiple columns, both shown as graphs produced with Python plotting libraries; and classification, the process of assigning records to classes across multiple columns in a simple format. There are many classification techniques, but we use a linear model with the provided dataset and, finally, a decision tree classifier built on the target column, since a decision tree classifier works with a single labelled target column. This paragraph should also state the objective of this task. As said earlier, business decisions can be made with data mining, and the decision tree classifier is a strong method for datasets that have a labelled target column. Other machine learning algorithms, such as XGBoost and SVM, are also well suited to analyzing this kind of dataset, but in this assignment we use only logistic regression and the decision tree classifier. The objective for this dataset is to identify the customers on whom direct marketing works, improve the techniques of direct marketing, and determine the features for which direct marketing is used most efficiently.

2. DESIGNING A SOLUTION

First we need to analyze the dataset and see how many features it has. The questions are: which features are useful for meeting the objective, which values are just outliers, and how many columns affect the dataset, and in which directions. As described earlier, many data mining techniques exist, but we use those with which we can find a solution.

DATASET:
The dataset is described with the help of Python. It includes more than 41k rows with 20 columns; later we extract the most important features from them. The picture below explains the basic structure of the dataset: the first column, age, describes the customer's age, and the next, job, explains his/her profession. The other features, such as education and campaign duration, also carry important information. Let's dive into the column values.

Fig 1: Describing the Dataset
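As a minimal sketch of how such a dataset is loaded and inspected, a tiny toy frame stands in here for the real CSV (the column names are a small, hypothetical subset of the actual 20):

```python
from io import StringIO
import pandas as pd

# Tiny stand-in for the Kaggle bank-marketing CSV; the real file has
# roughly 41k rows and 20 feature columns plus the target column "y".
csv = StringIO(
    "age,job,education,duration,y\n"
    "30,admin.,university.degree,210,no\n"
    "45,technician,basic.4y,95,yes\n"
    "62,retired,unknown,310,yes\n"
)
df = pd.read_csv(csv)

print(df.shape)       # (rows, columns) of the loaded frame
print(df.dtypes)      # per-column types, as summarised in Fig 1
print(df.describe())  # summary statistics for the numeric columns
```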



The picture below raises another important point: the counts of the target column are shown, and this yields two pieces of information.

A- The dataset has far more "no" values than "yes" values in the target column.

B- This column would behave differently if we used only half the dataset; the distribution is skewed towards the majority value.

Fig 2: Explains the target column
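The count behind Fig 2 comes down to a single `value_counts` call; a toy Series stands in here for the real target column:

```python
import pandas as pd

# In the real dataset "no" heavily outweighs "yes" in the target
# column; eight "no" to two "yes" mimics that imbalance.
y = pd.Series(["no"] * 8 + ["yes"] * 2, name="y")

counts = y.value_counts()
print(counts)           # raw count per class
print(counts / len(y))  # class proportions, exposing the skew
```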



The picture above covers the complete dataset; now, considering this column, we analyze at what ages customers have a deposit account and at what ages they do not, given the many "no" values in the target column.

Fig 3: Describing the count of the target column with respect to the age column.

The picture above shows that most records fall between ages 30 and 50, but the distribution of yes versus no differs by age. Although almost 90% of records are "no" overall, once we condition on age the distribution of whether customers have an account or not is not heavily skewed. This means there is little skewness in the distribution of the age feature itself.

Now we move to univariate analysis. This technique is used to analyze the values of a single column, that is, its distribution: how the column is skewed towards particular values and how the data is distributed within the dataset.

Fig 4: Explains the education column value counts

Fig 5: Explains the Job column.
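Univariate plots like Figs 4 and 5 are built from per-category counts; a sketch with a hypothetical slice of the education column:

```python
import pandas as pd

# Hypothetical values from the "education" column; the report plots
# these counts with a python plotting library, one bar per category.
education = pd.Series(
    ["basic.4y", "high.school", "university.degree",
     "basic.4y", "unknown", "basic.4y"]
)

counts = education.value_counts()
print(counts)  # the heights of the bars in a univariate count plot
```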



Now let's move to bivariate analysis, looking at column-to-column dependencies: how one column affects another column's values, and how they interact across multiple values of the distribution.

BIVARIATE ANALYSIS:

The first analysis is between the marital status column, age, and the target column of this dataset. Many graph types can present this kind of analysis, but we use a boxplot, which has the benefit of showing one column's values per group in different colors. This analysis shows that divorced and married customers have more "yes" target values than other marital statuses. The target column also shows that single people have fewer "yes" values, and age-wise they are younger than the other marital statuses. The picture below presents this analysis.

Fig 6: Bivariate analysis of age and marital status with the target column.
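The boxplot in Fig 6 summarises per-group age distributions; the same information can be sketched with a groupby over toy rows standing in for the marital, age, and target columns:

```python
import pandas as pd

# Toy rows standing in for the marital/age/target columns of Fig 6.
df = pd.DataFrame({
    "marital": ["single", "single", "married", "married", "divorced", "divorced"],
    "age":     [24,       29,       41,        55,        48,         52],
    "y":       ["no",     "no",     "yes",     "yes",     "yes",      "no"],
})

# One row per (marital, y) pair: the statistics behind each box.
summary = df.groupby(["marital", "y"])["age"].describe()
print(summary[["count", "mean", "min", "max"]])
```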



Now let's move to another bivariate analysis. This time we target the education column. As we know, the target column is heavily skewed towards "no" values, so we examine how this column affects the other columns. The picture below shows the main education values against age and the target feature. The question we are analyzing is how many educated and uneducated people have a deposit account, and how age matters to this.

Fig 7: Describing education on target column yes or no.



The picture above shows that those who have basic.4y education and are aged 60+ hold more deposit accounts than the others; a second result is that customers whose education is unknown also hold deposit accounts. Count-wise, basic.4y education has the highest number of "yes" records.

3. EXPERIMENTS

Now we apply advanced data mining techniques: classification, regression, and the decision tree classifier. First we need to extract features and split the dataset into two data streams, a training dataset and a testing dataset. The training dataset is used to train our model (a decision tree in this case), and we predict values with the help of the testing dataset. The split is very important for accuracy, and we also have to identify the main features needed for the evaluation procedure. First of all, we extract a column-to-column correlation matrix; this lets us decide which features matter most.
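A correlation matrix of the kind described above can be sketched as follows, with a small numeric frame standing in for the encoded bank columns:

```python
import pandas as pd

# Small numeric frame standing in for the encoded bank columns.
df = pd.DataFrame({
    "age":      [30, 45, 62, 25, 51],
    "duration": [210, 95, 310, 80, 260],
    "campaign": [1, 3, 2, 4, 1],
})

corr = df.corr()      # pairwise column-to-column Pearson correlation
print(corr.round(2))  # feature pairs near +/-1 are the candidates to
                      # keep (or drop as redundant) for the model
```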

We use the sklearn library for preprocessing the dataset. The LabelEncoder function transforms the dataset's categorical columns into numeric codes. For training models, all numeric columns should fall within a comparable range; that is, every feature column is scaled relative to the maximum and minimum values of its distribution. To put it another way: when column values follow very different distributions, it is difficult to train a model, so all feature columns are brought to the same distribution scale. StandardScaler from sklearn is used to transform the values to a standard distribution. After transforming, the dataset looks like this:

Fig 8: Dataset shape after transforming
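A minimal sketch of the two preprocessing steps described above, LabelEncoder for the categorical columns and StandardScaler for the rescaling, with toy data in place of the real frame:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.DataFrame({
    "job": ["admin.", "technician", "retired", "admin."],
    "age": [30, 45, 62, 25],
})

# LabelEncoder maps each category to an integer code (0, 1, 2, ...).
df["job"] = LabelEncoder().fit_transform(df["job"])

# StandardScaler rescales every column to zero mean and unit variance
# so no feature dominates training by its raw range alone.
scaled = StandardScaler().fit_transform(df)
print(scaled.mean(axis=0).round(6))  # ~0 for every column
```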

After transforming, the next step is to split the dataset into train and test sets. We use sklearn's train_test_split to produce two datasets. After the split, X_train.shape has 32950 rows and 19 columns; one column is removed from training because it serves as the y target for evaluation, and the y column likewise has 32950 rows. For testing, X_test.shape has 8238 rows with 19 columns, and y_test.shape also has 8238 rows.
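The split itself is one call; synthetic arrays stand in for the real features here, with test_size=0.2 reproducing the roughly 80/20 ratio of the 32950/8238 shapes above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic rows with 19 feature columns, mirroring the report's
# 19-column feature matrix at a much smaller scale.
X = np.random.rand(100, 19)
y = np.random.randint(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (80, 19) (20, 19)
```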

The first experiment is logistic regression, which reaches an accuracy of 90% on the test set, so this model still needs some improvement. Before moving on to the next experimental technique and the classification report, we first create a confusion matrix to assess the results.

The following parameters are calculated from the classification technique: the confusion matrix and the predictions for the classification problem.

The picture below, taken from the code, shows the confusion matrix, accuracy score, f1-score, and the other prediction parameters computed against the y target class. This means that whether we use logistic regression or another classification technique, the difference in results can be compared below.

Fig 9: Summary of classification results
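The numbers in Fig 9 come from fitting logistic regression and scoring its predictions; a runnable sketch on a synthetic imbalanced problem (the real dataset is not bundled here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary problem standing in for the bank target.
X, y = make_classification(n_samples=500, n_features=19,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)

print(confusion_matrix(y_test, pred))       # rows: actual, cols: predicted
print(accuracy_score(y_test, pred))         # overall accuracy
print(classification_report(y_test, pred))  # precision / recall / f1-score
```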



The picture above shows that our predictions in the confusion matrix are correct more often than wrong. Actual no predicted as no occurs 7191 times, while actual no predicted as yes occurs 103 times, meaning correct predictions far outnumber the wrongly predicted yeses. With a precision of 91%, the correct predictions carry the higher weight.

Fig 10: ROC curve of the FP rate

The ROC curve plots the true positive rate against the false positive rate. It shows whether the predicted calculations lean towards the positive or the negative class: false positives are records predicted positive whose actual value is negative, while true positives are records predicted positive whose actual value is also positive. The curve shows how well the algorithm separates these two rates, and so serves as a test of the algorithm. The graph above shows the ROC curve of the logistic regression model.
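A curve like Fig 10 can be reproduced from predicted probabilities; here is a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary problem in place of the bank data.
X, y = make_classification(n_samples=400, n_features=19, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # probability of class 1

fpr, tpr, _ = roc_curve(y_test, scores)     # the two axes of Fig 10
auc = roc_auc_score(y_test, scores)
print(round(auc, 3))                        # area under the ROC curve
```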

Now the decision tree classifier acts as a path finder for the results: the complete dataset is divided into paths that lead to an exact result.

Fig 11: Decision tree classifier of Bank Dataset
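A tree like the one in Fig 11 can be grown and printed as text; criterion="entropy" matches the entropy values quoted in the rules, though the data here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic rows in place of the encoded bank features.
X, y = make_classification(n_samples=300, n_features=5, random_state=2)

# max_depth keeps the printed tree small enough to read rule by rule.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                              random_state=2).fit(X, y)
print(export_text(tree))  # textual version of the plotted tree
```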



The tree can be read as follows:

1- When the entropy is greater than 0.9, the predicted class is always yes.

2- When the column value of nr.employed <= -1.099, the entropy is 0.5 and the predicted class is yes.

3- Given the nr.employed value, if cons.conf.idx <= -1.328 then the prediction is always yes, whichever other column value is added.

4- When checking three column values, nr.employed together with month and day of week, the prediction is no with an entropy of 0.9.

5- If we consider poutcome <= -1.5 together with day of week, comparing all entropy values of nr.employed leads to a no decision.

6- If we consider poutcome <= -1.5 together with cons.price.idx, comparing all entropy values of nr.employed leads to a no decision.

7- The best case is nr.employed with month in addition to a poutcome value <= 1.5, which leads to yes in the predicted class.

4. CONCLUSIONS

The bank's marketing strategy is affected by multiple data patterns. The results drawn from the decision tree are those that lead to a positive marketing strategy, and these patterns identify the best resulting marketing campaigns. Each resulting path ends in either a yes or a no value; the no paths indicate where no scheme is promoting a shift to yes, while the yes patterns need different attention for the business. Managers and other stakeholders can then make their choices according to the situation. Many patterns in the bivariate analysis resemble the older styles. Moreover, the classifier contributes much towards lifting the business higher.
