CIS111-6 Assignment 2: Advanced Data Techniques for Data Mining



UNIT:

ASSIGNMENT:
UNIT COORDINATOR:

STUDENT NAME: DIKSHA


ID:
EMAIL: {YOUR.NAME}@STUDY.BEDS.AC.UK

ADVANCED DATA MINING TECHNIQUES
FOR DIRECT MARKETING CAMPAIGNS

1. INTRODUCTION

This task describes basic and advanced data mining techniques using the bank marketing dataset from Kaggle. The banking sector is evolving and innovating day by day. We chose this dataset for two reasons: first, it has been used in a Kaggle competition; second, banks store very large amounts of data, including customers' personal information and their complete history, and they market their products and offers with the help of this customer history. Targeting customers' demands through one-to-one meetings and media is called direct marketing. In this assignment we analyze several data mining techniques applied to this problem.

Business intelligence with data mining is very common nowadays, and many techniques and solutions for business improvement have been developed, particularly in the data science field, where modern organizations rely on data for their decisions. Analyzing previous data is the first step in solving an existing problem, while prediction about upcoming data is very useful. Many basic and advanced techniques exist, but in this task we use only some of them. The data mining methods and algorithms covered are: data cleaning, such as dealing with missing values and removing outliers; data visualization with multiple Python libraries; tracking patterns in the data with two types of analysis, univariate analysis over a single column and bivariate analysis over multiple columns, both shown as graphs produced with Python plotting libraries; and classification, the process of assigning records to classes across multiple columns in a simple format. There are many classification techniques, but we use a linear model with the provided dataset and, finally, a decision tree classifier built on the target column, since a decision tree classifier works with a single labelled target column. This paragraph should also state the objective of this task. As said earlier, business decisions can be made with data mining, and the decision tree classifier is a strong method for datasets that have a labelled target column. Other machine learning algorithms, such as XGBoost and SVM, are also well suited to analyzing this kind of dataset, but in this assignment we use only logistic regression and the decision tree classifier. The objective for this dataset is to identify the customers on whom direct marketing works, improve the techniques of direct marketing, and determine the features for which direct marketing is used most efficiently.

2. DESIGNING A SOLUTION

First we need to analyze the dataset and see how many features it has. The questions are: which features are useful for meeting the objective, which values are just outliers, and how many columns affect the dataset, and in which directions. As described earlier, many data mining techniques exist, but we use those with which we can find a solution.

DATASET:
The dataset is described with the help of Python. It includes more than 41k rows with 20 columns; later we extract the most important features from them. The picture below explains the basic structure of the dataset: the first column, age, describes the customer's age, and the next, job, explains his/her profession. The other features, such as education and campaign duration, also carry important information. Let's dive into the column values.

Fig 1: Describing the Dataset
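As a minimal sketch of how such a dataset is loaded and inspected, a tiny toy frame stands in here for the real CSV (the column names are a small, hypothetical subset of the actual 20):

```python
from io import StringIO
import pandas as pd

# Tiny stand-in for the Kaggle bank-marketing CSV; the real file has
# roughly 41k rows and 20 feature columns plus the target column "y".
csv = StringIO(
    "age,job,education,duration,y\n"
    "30,admin.,university.degree,210,no\n"
    "45,technician,basic.4y,95,yes\n"
    "62,retired,unknown,310,yes\n"
)
df = pd.read_csv(csv)

print(df.shape)       # (rows, columns) of the loaded frame
print(df.dtypes)      # per-column types, as summarised in Fig 1
print(df.describe())  # summary statistics for the numeric columns
```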



The picture below raises another important point: the counts of the target column are shown, and this yields two pieces of information.

A- The dataset has far more "no" values than "yes" values in the target column.

B- This column would behave differently if we used only half the dataset; the distribution is skewed towards the majority value.

Fig 2: Explains the target column
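The count behind Fig 2 comes down to a single `value_counts` call; a toy Series stands in here for the real target column:

```python
import pandas as pd

# In the real dataset "no" heavily outweighs "yes" in the target
# column; eight "no" to two "yes" mimics that imbalance.
y = pd.Series(["no"] * 8 + ["yes"] * 2, name="y")

counts = y.value_counts()
print(counts)           # raw count per class
print(counts / len(y))  # class proportions, exposing the skew
```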



The picture above covers the complete dataset; now, considering this column, we analyze at what ages customers have a deposit account and at what ages they do not, given the many "no" values in the target column.

Fig 3: Describing the count of the target column with respect to the age column.

The picture above shows that most records fall between ages 30 and 50, but the distribution of yes versus no differs by age. Although almost 90% of records are "no" overall, once we condition on age the distribution of whether customers have an account or not is not heavily skewed. This means there is little skewness in the distribution of the age feature itself.

Now we move to univariate analysis. This technique is used to analyze the values of a single column, that is, its distribution: how the column is skewed towards particular values and how the data is distributed within the dataset.

Fig 4: Explains the education column value counts

Fig 5: Explains the Job column.
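Univariate plots like Figs 4 and 5 are built from per-category counts; a sketch with a hypothetical slice of the education column:

```python
import pandas as pd

# Hypothetical values from the "education" column; the report plots
# these counts with a python plotting library, one bar per category.
education = pd.Series(
    ["basic.4y", "high.school", "university.degree",
     "basic.4y", "unknown", "basic.4y"]
)

counts = education.value_counts()
print(counts)  # the heights of the bars in a univariate count plot
```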



Now let's move to bivariate analysis, looking at column-to-column dependencies: how one column affects another column's values, and how they interact across multiple values of the distribution.

BIVARIATE ANALYSIS:

The first analysis is between the marital status column, age, and the target column of this dataset. Many graph types can present this kind of analysis, but we use a boxplot, which has the benefit of showing one column's values per group in different colors. This analysis shows that divorced and married customers have more "yes" target values than other marital statuses. The target column also shows that single people have fewer "yes" values, and age-wise they are younger than the other marital statuses. The picture below presents this analysis.

Fig 6: Bivariate analysis of age and marital status with the target column.
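The boxplot in Fig 6 summarises per-group age distributions; the same information can be sketched with a groupby over toy rows standing in for the marital, age, and target columns:

```python
import pandas as pd

# Toy rows standing in for the marital/age/target columns of Fig 6.
df = pd.DataFrame({
    "marital": ["single", "single", "married", "married", "divorced", "divorced"],
    "age":     [24,       29,       41,        55,        48,         52],
    "y":       ["no",     "no",     "yes",     "yes",     "yes",      "no"],
})

# One row per (marital, y) pair: the statistics behind each box.
summary = df.groupby(["marital", "y"])["age"].describe()
print(summary[["count", "mean", "min", "max"]])
```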



Now let's move to another bivariate analysis. This time we target the education column. As we know, the target column is heavily skewed towards "no" values, so we examine how this column affects the other columns. The picture below shows the main education values against age and the target feature. The question we are analyzing is how many educated and uneducated people have a deposit account, and how age matters to this.

Fig 7: Describing education on target column yes or no.



The picture above shows that those who have basic.4y education and are aged 60+ hold more deposit accounts than the others; a second result is that customers whose education is unknown also hold deposit accounts. Count-wise, basic.4y education has the highest number of "yes" records.

3. EXPERIMENTS

Now we apply advanced data mining techniques: classification, regression, and the decision tree classifier. First we need to extract features and split the dataset into two data streams, a training dataset and a testing dataset. The training dataset is used to train our model (a decision tree in this case), and we predict values with the help of the testing dataset. The split is very important for accuracy, and we also have to identify the main features needed for the evaluation procedure. First of all, we extract a column-to-column correlation matrix; this lets us decide which features matter most.
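A correlation matrix of the kind described above can be sketched as follows, with a small numeric frame standing in for the encoded bank columns:

```python
import pandas as pd

# Small numeric frame standing in for the encoded bank columns.
df = pd.DataFrame({
    "age":      [30, 45, 62, 25, 51],
    "duration": [210, 95, 310, 80, 260],
    "campaign": [1, 3, 2, 4, 1],
})

corr = df.corr()      # pairwise column-to-column Pearson correlation
print(corr.round(2))  # feature pairs near +/-1 are the candidates to
                      # keep (or drop as redundant) for the model
```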

We use the sklearn library for preprocessing the dataset. The LabelEncoder function transforms the dataset's categorical columns into numeric codes. For training models, all numeric columns should fall within a comparable range; that is, every feature column is scaled relative to the maximum and minimum values of its distribution. To put it another way: when column values follow very different distributions, it is difficult to train a model, so all feature columns are brought to the same distribution scale. StandardScaler from sklearn is used to transform the values to a standard distribution. After transforming, the dataset looks like this:

Fig 8: Dataset shape after transforming
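A minimal sketch of the two preprocessing steps described above, LabelEncoder for the categorical columns and StandardScaler for the rescaling, with toy data in place of the real frame:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.DataFrame({
    "job": ["admin.", "technician", "retired", "admin."],
    "age": [30, 45, 62, 25],
})

# LabelEncoder maps each category to an integer code (0, 1, 2, ...).
df["job"] = LabelEncoder().fit_transform(df["job"])

# StandardScaler rescales every column to zero mean and unit variance
# so no feature dominates training by its raw range alone.
scaled = StandardScaler().fit_transform(df)
print(scaled.mean(axis=0).round(6))  # ~0 for every column
```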

After transforming, the next step is to split the dataset into train and test sets. We use sklearn's train_test_split to produce two datasets. After the split, X_train.shape has 32950 rows and 19 columns; one column is removed from training because it serves as the y target for evaluation, and the y column likewise has 32950 rows. For testing, X_test.shape has 8238 rows with 19 columns, and y_test.shape also has 8238 rows.
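The split itself is one call; synthetic arrays stand in for the real features here, with test_size=0.2 reproducing the roughly 80/20 ratio of the 32950/8238 shapes above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic rows with 19 feature columns, mirroring the report's
# 19-column feature matrix at a much smaller scale.
X = np.random.rand(100, 19)
y = np.random.randint(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (80, 19) (20, 19)
```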

The first experiment is logistic regression, which reaches an accuracy of 90% on the test set, so this model still needs some improvement. Before moving on to the next experimental technique and the classification report, we first create a confusion matrix to assess the results.

The following parameters are calculated from the classification technique: the confusion matrix and the predictions for the classification problem.

The picture below, taken from the code, shows the confusion matrix, accuracy score, f1-score, and the other prediction parameters computed against the y target class. This means that whether we use logistic regression or another classification technique, the difference in results can be compared below.

Fig 9: Summary of classification results
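The numbers in Fig 9 come from fitting logistic regression and scoring its predictions; a runnable sketch on a synthetic imbalanced problem (the real dataset is not bundled here):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced binary problem standing in for the bank target.
X, y = make_classification(n_samples=500, n_features=19,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)

print(confusion_matrix(y_test, pred))       # rows: actual, cols: predicted
print(accuracy_score(y_test, pred))         # overall accuracy
print(classification_report(y_test, pred))  # precision / recall / f1-score
```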



The picture above shows that our predictions in the confusion matrix are correct more often than wrong. Actual no predicted as no occurs 7191 times, while actual no predicted as yes occurs 103 times, meaning correct predictions far outnumber the wrongly predicted yeses. With a precision of 91%, the correct predictions carry the higher weight.

Fig 10: ROC curve of the FP rate

The ROC curve plots the true positive rate against the false positive rate. It shows whether the predicted calculations lean towards the positive or the negative class: false positives are records predicted positive whose actual value is negative, while true positives are records predicted positive whose actual value is also positive. The curve shows how well the algorithm separates these two rates, and so serves as a test of the algorithm. The graph above shows the ROC curve of the logistic regression model.
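A curve like Fig 10 can be reproduced from predicted probabilities; here is a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary problem in place of the bank data.
X, y = make_classification(n_samples=400, n_features=19, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # probability of class 1

fpr, tpr, _ = roc_curve(y_test, scores)     # the two axes of Fig 10
auc = roc_auc_score(y_test, scores)
print(round(auc, 3))                        # area under the ROC curve
```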

Now the decision tree classifier acts as a path finder for the results: the complete dataset is divided into paths that lead to an exact result.

Fig 11: Decision tree classifier of Bank Dataset
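A tree like the one in Fig 11 can be grown and printed as text; criterion="entropy" matches the entropy values quoted in the rules, though the data here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic rows in place of the encoded bank features.
X, y = make_classification(n_samples=300, n_features=5, random_state=2)

# max_depth keeps the printed tree small enough to read rule by rule.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                              random_state=2).fit(X, y)
print(export_text(tree))  # textual version of the plotted tree
```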



The tree can be read as follows:

1- When the entropy is greater than 0.9, the predicted class is always yes.

2- When the column value of nr.employed <= -1.099, the entropy is 0.5 and the predicted class is yes.

3- Given the nr.employed value, if cons.conf.idx <= -1.328 then the prediction is always yes, whichever other column value is added.

4- When checking three column values, nr.employed together with month and day of week, the prediction is no with an entropy of 0.9.

5- If we consider poutcome <= -1.5 together with day of week, comparing all entropy values of nr.employed leads to a no decision.

6- If we consider poutcome <= -1.5 together with cons.price.idx, comparing all entropy values of nr.employed leads to a no decision.

7- The best case is nr.employed with month in addition to a poutcome value <= 1.5, which leads to yes in the predicted class.

4. CONCLUSIONS

The bank's marketing strategy is affected by multiple data patterns. The results drawn from the decision tree are those that lead to a positive marketing strategy, and these patterns identify the best resulting marketing campaigns. Each resulting path ends in either a yes or a no value; the no paths indicate where no scheme is promoting a shift to yes, while the yes patterns need different attention for the business. Managers and other stakeholders can then make their choices according to the situation. Many patterns in the bivariate analysis resemble the older styles. Moreover, the classifier contributes much towards lifting the business higher.
