CIS111-6 Assignment 2: Advanced Data Techniques for Data Mining
ASSIGNMENT:
UNIT COORDINATOR:
ADVANCED DATA MINING TECHNIQUES
FOR DIRECT MARKETING CAMPAIGNS
1. INTRODUCTION
This task describes basic and advanced data mining techniques applied to the Bank Marketing dataset from Kaggle. The banking sector is innovating and evolving day by day. We chose this dataset for two reasons: first, it has been used in a Kaggle competition; second, banks store very large amounts of data, including customers' personal information and complete customer histories, and they use that history to market their products and offers. Targeting customers' demands through one-to-one meetings and media is called direct marketing. In this assignment we analyse several data mining techniques and approaches.
Business intelligence with data mining is very common nowadays; many techniques and solutions for business improvement have been developed, particularly in the data science field, on which the modern world now relies for its decisions. Learning from previous data is the first step in solving an existing problem, and prediction about upcoming data is very useful. Many basic and advanced techniques are in development, but in this task we use only some of them. The data mining methods and algorithms applied below are: data cleaning, such as dealing with missing values and removing outliers; data visualization with multiple Python libraries; and pattern tracking with two types of analysis, univariate analysis (one column at a time) and bivariate analysis (multiple columns together), both shown as graphs drawn with Python plotting libraries. After that comes classification, the process of assigning records to classes based on several columns; there are many classification techniques, but we use a linear model on the provided dataset. Finally we use a decision tree classifier trained on the target column; as we know, a decision tree classifier works with a single labelled target column. The objective of this task should be stated here: as said earlier, business decisions can be made with data mining, and for datasets that have a labelled target column the decision tree classifier is among the best of the available techniques. Other machine learning algorithms, such as XGBoost and SVM, are also well suited to analysing this dataset, but in this assignment we use only the decision tree classifier. The objective with this dataset is to identify the customers to target and thereby improve the direct marketing techniques.
2. DESIGNING A SOLUTION
First we need to analyse the dataset: how many features it has, which of them are useful for meeting the objective, which are just outliers, and how many columns affect the dataset in different directions. As described earlier, many data mining techniques exist, but we use only those with which we can find a solution.
DATASET:
The dataset is described with the help of Python. It includes more than 41k rows and 20 columns; later we extract the most important features from them. The picture below explains the basic structure of the dataset. The first column, age, describes the customer's age, and job explains his/her profession. The other features, such as education and campaign duration, also carry important information. Let's dive into the column values.
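As a sketch of how such a dataset is loaded and inspected with Python (the file name `bank-additional-full.csv` and the `;` separator are assumptions based on the usual distribution of this Kaggle/UCI dataset; a tiny inline sample stands in for the real ~41k rows):

```python
import io
import pandas as pd

# Real run (file name assumed from the Kaggle/UCI distribution):
# df = pd.read_csv("bank-additional-full.csv", sep=";")
# Tiny inline sample standing in for the ~41k-row dataset:
csv = """age;job;education;marital;duration;y
44;blue-collar;basic.4y;married;210;no
53;technician;unknown;married;138;no
28;management;university.degree;single;339;yes
39;services;high.school;divorced;185;no
"""
df = pd.read_csv(io.StringIO(csv), sep=";")

print(df.shape)    # number of (rows, columns)
print(df.dtypes)   # per-column types, as df.info() would also report
```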
The picture below raises another important point: the value counts of the target column. This column's distribution determines how the rest of the dataset must be treated; hence the distribution matters.
The picture above covers the complete dataset. Considering this column, we now analyse at which ages customers have a deposit account ('yes') and at which ages they do not, since the target column contains many more 'no' values.
Fig 3: Count of the target column with respect to the age column.
The picture shows that most records fall in the age range 30 to 50, but the distribution of yes and no differs: almost 90% of records have 'no' in the target column, yet once the age column is split by whether customers have an account or not, the distribution within each age group is not skewed so much.

Now we move to univariate analysis. This technique is used to analyse the values of a single column, i.e. the distribution of one column at a time; the question it answers is how the column is skewed.
Fig 4: Value counts of the education column.
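A univariate count like the one in Fig 4 can be produced with pandas' `value_counts`; the handful of education values below is an illustrative stand-in, not the real column:

```python
import pandas as pd

# Stand-in values for the education column (the real column has ~41k rows).
education = pd.Series(
    ["university.degree", "high.school", "basic.4y",
     "high.school", "basic.4y", "basic.4y", "unknown"]
)

# Univariate analysis: the distribution of one column.
counts = education.value_counts()
print(counts)                              # most frequent value first
print((counts / len(education)).round(2))  # normalised shares show the skew
```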
Now let's move to bivariate analysis, which looks at column-to-column dependencies: how one column affects another column's values, and how they interact across multiple value distributions.
BIVARIATE ANALYSIS:
The first analysis is between the marital status column, the age column, and the target column of this dataset. Many kinds of graph can express this type of analysis, but we use a boxplot, which has the benefit of showing one column's values split by another, with different colours.

This analysis shows that divorced and married customers have more 'yes' values in the target column than the other marital statuses. Single people have fewer 'yes' values and, age-wise, they are younger than the other marital statuses. The picture below illustrates this analysis.
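The boxplot in this figure would typically be drawn with `seaborn.boxplot(data=df, x="marital", y="age", hue="y")`; as a minimal sketch (with invented toy rows), the per-group medians that such boxes summarise can be computed like this:

```python
import pandas as pd

# Invented toy rows for the marital / age / target columns.
df = pd.DataFrame({
    "marital": ["single", "single", "married", "married", "divorced", "divorced"],
    "age":     [24, 29, 41, 55, 48, 60],
    "y":       ["no", "no", "yes", "yes", "yes", "no"],
})

# Per-group medians: the line inside each box of the boxplot.
medians = df.groupby("marital")["age"].median()
print(medians)   # single customers are the youngest group
```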
Now let's move to another bivariate analysis, this time targeting the education column and its relationship with the target column. As we know, the target column is heavily skewed towards 'no', so we examine how the education column interacts with the other columns. The picture below shows the main education values with respect to age and the target feature; the question we analyse is how many educated and uneducated people have a deposit account, and how age matters.

The picture shows that customers with basic.4y education and age 60+ have more deposit accounts than the others; a second result is that customers whose education is unknown also tend to have deposit accounts. Count-wise, basic.4y education has the highest number of 'yes' records.
3. EXPERIMENTS
Now we apply the advanced data mining techniques: classification, regression, and the decision tree classifier. First we need to extract features and split the dataset into two streams, a training dataset and a testing dataset. The training dataset is used to train our model, the decision tree in this case, and we predict values on the testing dataset. The split is very important for a fair accuracy estimate, and we also have to identify the main features needed for the evaluation procedure. First of all we extract a column-to-column correlation matrix.
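A correlation matrix of that kind can be obtained with pandas' `DataFrame.corr`; the three toy numeric columns below are assumptions standing in for the dataset's numeric features:

```python
import pandas as pd

# Toy numeric columns standing in for the dataset's numeric features.
df = pd.DataFrame({
    "age":      [25, 35, 45, 55, 65],
    "duration": [100, 180, 260, 340, 420],
    "campaign": [5, 4, 3, 2, 1],
})

# Pearson correlation of every column with every other column.
corr = df.corr()
print(corr.round(2))
```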
We use the sklearn library for preprocessing the dataset. The LabelEncoder transforms the dataset's categorical columns into numeric codes (one-hot encoding of those codes is a separate step). For training a model, all numeric columns should lie within a common boundary, meaning each feature is rescaled according to the maximum and minimum of its distribution. To explain it this way: once the columns have very different value ranges, it is very difficult to train a model, so all the feature columns are brought to the same distribution scale. StandardScaler from sklearn is used to transform the values to a standard distribution.
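A minimal sketch of these two preprocessing steps with sklearn (the example values are invented):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# LabelEncoder: categorical strings -> integer codes (classes sorted alphabetically).
le = LabelEncoder()
codes = le.fit_transform(["married", "single", "divorced", "married"])
print(list(le.classes_))   # ['divorced', 'married', 'single']
print(list(codes))         # [1, 2, 0, 1]

# StandardScaler: rescale a numeric column to zero mean, unit variance.
ages = np.array([[25.0], [35.0], [45.0], [55.0]])
scaled = StandardScaler().fit_transform(ages)
print(scaled.ravel())
```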
After transforming, the next step is to split the dataset into train and test sets; we use sklearn's train_test_split to create two different datasets. After the split, X_train has 32950 rows and 19 columns; one column is removed from the training features because it is used as the y target for evaluation, and the y column likewise has 32950 rows. For testing, X_test has 8238 rows with the same 19 columns.
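The split itself is one call; the sketch below uses 100 random toy rows with 19 features, and the same 80/20 ratio that turns the full 41188 rows into the 32950/8238 counts quoted above (`test_size=0.2` and `random_state` are assumptions, as the report does not state them):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy rows with 19 feature columns, mirroring the report's X.
rng = np.random.default_rng(0)
X = rng.random((100, 19))
y = rng.integers(0, 2, 100)

# An 80/20 split; on 41188 rows this yields 32950 train / 8238 test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)   # (80, 19) (20, 19)
```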
The first experiment is logistic regression, which achieves an accuracy of 90% on the test set, so this model still needs some improvement. We then move on to the next experimental technique, but before doing so we examine the evaluation metrics.
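A sketch of this first experiment on synthetic stand-in data (the real run would use the encoded, scaled bank features; `make_classification` here merely provides data of the same shape):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the same shape as the encoded bank features.
X, y = make_classification(n_samples=500, n_features=19, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = model.score(X_test, y_test)   # mean accuracy on the held-out set
print(round(acc, 3))
```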
The following parameters are calculated for the classification technique from the confusion matrix. The picture below, taken from the code, shows the confusion matrix, accuracy score, F1-score, and all the other prediction metrics for the y target class, whichever classifier we use.

The picture shows that our predictions in the confusion matrix are more often correct than wrong: actual no predicted as no is 7191, while actual no predicted as yes is 103, so correct predictions greatly outnumber the misclassified ones, with a precision of 91%.
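These metrics come from sklearn.metrics; the sketch below uses a few invented labels rather than the real 8238 test predictions (where the matrix contained 7191 true negatives and 103 false positives):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Invented labels standing in for the real test-set predictions.
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)   # rows = actual, columns = predicted
print(cm)                               # [[4 1] [1 2]]
print(accuracy_score(y_true, y_pred))   # 0.75
print(f1_score(y_true, y_pred))
```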
Fig 10: ROC curve (false positive rate against true positive rate).
The ROC curve plots the true positive rate against the false positive rate. It shows whether the predictions lean towards the positive or the negative class. The false positive rate covers cases predicted positive whose actual value is negative, while the true positive rate covers cases predicted positive whose actual value is 'yes'; the curve shows how well our algorithm differentiates the two ratios across thresholds, and so gives us a test of the algorithm.
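The fpr/tpr pairs behind such a curve, and the area under it, can be computed with sklearn (the scores below are invented probabilities of the positive class):

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Invented predicted probabilities of the positive class.
y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

# fpr/tpr pairs over all thresholds; plotting tpr against fpr gives the curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(round(auc, 3))
```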
Now the decision tree classifier acts as the path finder of the results: the complete dataset is divided into paths of decision rules.
The tree results can be read in this way:
1- When entropy is greater than 0.9, the predicted class is always yes.
2- When the column value nr.employed <= -1.099, the entropy is 0.5 and the result is yes.
3- With the nr.employed value, if we also check cons.conf.idx <= -1.328, then
4- When checking three column values, nr.employed with month and day of week,
5- If we consider poutcome <= -1.5 with day of week, comparing all entropy values,
7- The best case is nr.employed with month in addition to a poutcome value <= 1.5, which leads to
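A sketch of training such a tree and printing its paths as text (synthetic data; on the real dataset the split features would be nr.employed, cons.conf.idx, month, day of week, poutcome, etc., as listed above):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; feature names f0..f4 are placeholders.
X, y = make_classification(n_samples=200, n_features=5, random_state=1)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                              random_state=1).fit(X, y)

# Each printed path is a chain of threshold tests, like the numbered
# rules above.
rules = export_text(tree, feature_names=[f"f{i}" for i in range(5)])
print(rules)
print(tree.score(X, y))   # training accuracy of the shallow tree
```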
4. CONCLUSIONS
The bank's marketing strategy is affected by multiple data patterns. The results made from the decision tree are those which lead to a positive marketing strategy: the resulting patterns conclude which marketing campaigns work best. Each resulting path ends in either a no or a yes value; the no paths show schemes that should be promoted to shift towards yes, while the yes patterns need other attention for the business. Managers and other stakeholders can make their choices according to these situations, as many of the patterns resemble the trends already seen in the bivariate analysis. Moreover, the classifier takes much of the effort out of the decisions that push the business higher.