Data Mining


Introduction to Decision Tree

In general, decision tree analysis is a predictive modelling tool that can be applied across
many areas. Decision trees can be constructed by an algorithmic approach that splits
the dataset in different ways based on different conditions. Decision trees are among the most
powerful algorithms that fall under the category of supervised learning.
They can be used for both classification and regression tasks. The two main entities of
a tree are decision nodes, where the data is split, and leaves, where we get the outcome.
An example of a binary tree for predicting whether a person is fit or unfit, given
information such as age, eating habits and exercise habits, is shown below −

In the above decision tree, the questions are decision nodes and the final outcomes are
leaves. We have the following two types of decision trees −
• Classification decision trees − In this kind of decision tree, the decision variable is
categorical. The above decision tree is an example of a classification decision tree.
• Regression decision trees − In this kind of decision tree, the decision variable is
continuous.

Implementing Decision Tree Algorithm


Gini Index
It is the name of the cost function that is used to evaluate binary splits in the dataset
and works with a categorical target variable such as “Success” or “Failure”.
The lower the value of the Gini index, the higher the homogeneity. A perfect Gini index value is 0
and the worst is 0.5 (for a 2-class problem). The Gini index for a split can be calculated with the
help of the following steps −
• First, calculate the Gini index for each sub-node using the formula 1 − (p^2 + q^2), where
p^2 + q^2 is the sum of the squared probabilities of success and failure.
• Next, calculate the Gini index for the split using the weighted Gini score of each node of that
split, as sketched below.
The Classification and Regression Tree (CART) algorithm uses the Gini method to generate
binary splits.
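The following is a minimal sketch of how the two steps above could be computed in plain Python. The function name gini_index and the list-of-rows data layout (the class label stored as the last element of each row) are illustrative assumptions, not part of any particular library.

# A minimal sketch of computing the Gini index for a binary split.
# `groups` is a list of two groups (left, right), each a list of rows whose
# last element is the class label; `classes` lists the possible label values.
def gini_index(groups, classes):
    n_instances = float(sum(len(group) for group in groups))   # total rows across the split
    gini = 0.0
    for group in groups:
        size = float(len(group))
        if size == 0:
            continue                                            # avoid division by zero for an empty group
        score = 0.0
        for class_val in classes:
            p = [row[-1] for row in group].count(class_val) / size
            score += p * p                                      # p^2 + q^2 for a two-class problem
        gini += (1.0 - score) * (size / n_instances)            # weight the node by its relative size
    return gini

# A pure split gives 0.0, a 50/50 split gives 0.5 (worst case for 2 classes).
print(gini_index([[[1, 1], [1, 0]], [[1, 1], [1, 0]]], [0, 1]))   # 0.5
print(gini_index([[[1, 0], [1, 0]], [[1, 1], [1, 1]]], [0, 1]))   # 0.0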
Split Creation
A split basically comprises an attribute in the dataset and a value of that attribute. We can create
a split in the dataset with the help of the following three parts −
• Part 1: Calculating the Gini score − We have just discussed this part in the previous section.
• Part 2: Splitting a dataset − It may be defined as separating a dataset into two lists of rows,
given the index of an attribute and a split value for that attribute. After getting the two groups -
right and left - from the dataset, we can calculate the value of the split by using the Gini score
calculated in the first part. The split value decides in which group each row will reside.
• Part 3: Evaluating all splits − The next part after finding the Gini score and splitting the dataset is the
evaluation of all splits. For this purpose, we first check every value associated with
each attribute as a candidate split. Then we find the best possible split by evaluating
the cost of each split. The best split will be used as a node in the decision tree, as in the sketch below.
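Below is a hedged sketch of parts 2 and 3, assuming the gini_index helper from the previous sketch and the same list-of-rows layout; the names test_split and get_split are illustrative.

# Part 2: split a dataset (a list of rows) into two groups on an attribute index and value.
def test_split(index, value, dataset):
    left, right = [], []
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right

# Part 3: evaluate every candidate split and keep the one with the lowest Gini score.
def get_split(dataset):
    class_values = list(set(row[-1] for row in dataset))
    best_index, best_value, best_score, best_groups = None, None, float('inf'), None
    for index in range(len(dataset[0]) - 1):        # every attribute except the label
        for row in dataset:                         # every value of that attribute as a candidate
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < best_score:
                best_index, best_value, best_score, best_groups = index, row[index], gini, groups
    return {'index': best_index, 'value': best_value, 'groups': best_groups}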

Building a Tree
As we know, a tree has a root node and terminal nodes. After creating the root node,
we can build the tree by following two parts −
Part 1: Terminal node creation
While creating terminal nodes of the decision tree, one important point is deciding when to
stop growing the tree, i.e. when to stop creating further terminal nodes. This can be done by using two criteria,
namely maximum tree depth and minimum node records, as follows −
• Maximum Tree Depth − As the name suggests, this is the maximum depth of the tree, i.e. the
maximum number of levels of nodes below the root node. We must stop adding terminal nodes
once the tree has reached this maximum depth.
• Minimum Node Records − It may be defined as the minimum number of training patterns
that a given node is responsible for. We must stop adding terminal nodes once the tree reaches
this minimum number of records or falls below it.
A terminal node is used to make a final prediction.
Part 2: Recursive Splitting
Having understood when to create terminal nodes, we can now start building our
tree. Recursive splitting is a method to build the tree. In this method, once a node is
created, we can create the child nodes (nodes added to an existing node) recursively on
each group of data generated by splitting the dataset, by calling the same function again
and again, as in the sketch below.
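A minimal sketch of recursive splitting, continuing the illustrative helpers above (get_split); to_terminal, split and build_tree are assumed names, and max_depth / min_size correspond to maximum tree depth and minimum node records.

# Create a terminal node: predict the most common class in the group.
def to_terminal(group):
    outcomes = [row[-1] for row in group]
    return max(set(outcomes), key=outcomes.count)

# Recursively split a node, or turn its children into terminal nodes.
def split(node, max_depth, min_size, depth):
    left, right = node['groups']
    del node['groups']
    if not left or not right:                       # no real split happened
        node['left'] = node['right'] = to_terminal(left + right)
        return
    if depth >= max_depth:                          # maximum tree depth reached
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
        return
    for side, group in (('left', left), ('right', right)):
        if len(group) <= min_size:                  # minimum node records reached
            node[side] = to_terminal(group)
        else:
            node[side] = get_split(group)
            split(node[side], max_depth, min_size, depth + 1)

def build_tree(train, max_depth, min_size):
    root = get_split(train)                         # create the root node
    split(root, max_depth, min_size, 1)             # recursively grow the tree
    return root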
Prediction
After building a decision tree, we need to make predictions with it. Basically, prediction
involves navigating the decision tree with a specifically provided row of data.
We can make a prediction with the help of a recursive function, as above. The same
prediction routine is called again with the left or the right child node, as in the sketch below.
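A possible recursive prediction routine over the dictionary-based tree built in the sketches above; the name predict is illustrative.

# Navigate the tree recursively with one row of data.
def predict(node, row):
    if row[node['index']] < node['value']:
        branch = node['left']
    else:
        branch = node['right']
    if isinstance(branch, dict):        # internal node: keep descending
        return predict(branch, row)
    return branch                       # terminal node: the predicted class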
Assumptions
The following are some of the assumptions we make while creating a decision tree −
• While preparing the decision tree, the whole training set is treated as the root node.
• The decision tree classifier prefers the feature values to be categorical. If you want to
use continuous values, they must be discretized prior to model building.
• Based on the attributes’ values, the records are recursively distributed.
• A statistical approach is used to place attributes at any node position, i.e. as the root node or
an internal node.

Implementation in Python
Example
In the following example, we are going to implement a Decision Tree classifier on the Pima
Indians Diabetes dataset.
First, start with importing the necessary Python packages −
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
Next, load the Pima Indians Diabetes dataset from its CSV file as follows −
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi',
'pedigree', 'age', 'label']
pima = pd.read_csv(r"C:\pima-indians-diabetes.csv", header = None,
names = col_names)
pima.head()
   pregnant  glucose  bp  skin  insulin   bmi  pedigree  age  label
0         6      148  72    35        0  33.6     0.627   50      1
1         1       85  66    29        0  26.6     0.351   31      0
2         8      183  64     0        0  23.3     0.672   32      1
3         1       89  66    23       94  28.1     0.167   21      0
4         0      137  40    35      168  43.1     2.288   33      1

Now, split the dataset into features and target variable as follows −
feature_cols = ['pregnant', 'insulin', 'bmi',
'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable
Next, we will divide the data into train and test splits. The following code will split the
dataset into 70% training data and 30% testing data −
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)
Next, train the model with the help of the DecisionTreeClassifier class of sklearn as follows −
clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)
At last, we need to make predictions. It can be done with the help of the following script −
y_pred = clf.predict(X_test)
Next, we can get the accuracy score, confusion matrix and classification report as follows −
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:",)
print (result1)
result2 = accuracy_score(y_test,y_pred)
print("Accuracy:",result2)
Output
Confusion Matrix:
[[116 30]
[ 46 39]]
Classification Report:
precision recall f1-score support
0 0.72 0.79 0.75 146
1 0.57 0.46 0.51 85
micro avg 0.67 0.67 0.67 231
macro avg 0.64 0.63 0.63 231
weighted avg 0.66 0.67 0.66 231

Accuracy: 0.670995670995671
Visualizing Decision Tree
The decision tree trained above can be visualized with the help of the following code −
from sklearn.tree import export_graphviz
from io import StringIO   # sklearn.externals.six has been removed in recent scikit-learn versions
from IPython.display import Image
import pydotplus
dot_data = StringIO()
export_graphviz(clf, out_file = dot_data, filled = True, rounded = True,
   special_characters = True, feature_names = feature_cols, class_names = ['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('Pima_diabetes_Tree.png')
Image(graph.create_png())

What is Clustering?
A cluster is a group of objects that belong to the same class. In other words, similar
objects are grouped in one cluster and dissimilar objects are grouped in another
cluster.

Clustering is the process of grouping a set of abstract objects into classes of similar
objects.
Points to Remember
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups based on data
similarity and then assign labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to changes and
helps single out useful features that distinguish different groups.

Applications of Cluster Analysis


• Clustering analysis is broadly used in many applications such as market research, pattern
recognition, data analysis, and image processing.
• Clustering can also help marketers discover distinct groups in their customer base. And they
can characterize their customer groups based on the purchasing patterns.
• In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes
with similar functionalities and gain insight into structures inherent to populations.
• Clustering also helps in identification of areas of similar land use in an earth observation
database. It also helps in the identification of groups of houses in a city according to house
type, value, and geographic location.
• Clustering also helps in classifying documents on the web for information discovery.
• Clustering is also used in outlier detection applications such as detection of credit card fraud.
• As a data mining function, cluster analysis serves as a tool to gain insight into the distribution
of data to observe characteristics of each cluster.

Requirements of Clustering in Data Mining


The following points throw light on why clustering is required in data mining −
• Scalability − We need highly scalable clustering algorithms to deal with large databases.
• Ability to deal with different kinds of attributes − Algorithms should be capable of being
applied to any kind of data, such as interval-based (numerical), categorical, and binary
data.
• Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of
detecting clusters of arbitrary shape. It should not be bounded to distance measures
that tend to find only small spherical clusters.
• High dimensionality − The clustering algorithm should be able to handle not only low-
dimensional data but also high-dimensional spaces.
• Ability to deal with noisy data − Databases contain noisy, missing or erroneous data. Some
algorithms are sensitive to such data and may lead to poor quality clusters.
• Interpretability − The clustering results should be interpretable, comprehensible, and usable.

Clustering Methods
Clustering methods can be classified into the following categories −

• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning method constructs
‘k’ partitions of the data. Each partition will represent a cluster and k ≤ n. It means that the method
will classify the data into k groups, which satisfy the following requirements −
• Each group contains at least one object.
• Each object must belong to exactly one group.
Points to remember −
• For a given number of partitions (say k), the partitioning method will create an initial
partitioning.
• Then it uses an iterative relocation technique to improve the partitioning by moving objects
from one group to another, as in the sketch below.
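As one concrete example, k-means is a widely used partitioning method based on iterative relocation. The sketch below uses scikit-learn's KMeans on made-up blob data; the parameter values are illustrative.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: 300 points around 3 centres (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

# Partition the data into k = 3 clusters; k-means iteratively relocates
# points between clusters to minimise within-cluster variance.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1)
labels = kmeans.fit_predict(X)
print(labels[:10])               # cluster assignment of the first 10 points
print(kmeans.cluster_centers_)   # final cluster centres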
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We
can classify hierarchical methods on the basis of how the hierarchical decomposition is
formed. There are two approaches here −

• Agglomerative Approach
• Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we start with each object
forming a separate group. It keeps on merging the objects or groups that are close to
one another. It keeps on doing so until all of the groups are merged into one or until the
termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we start with all of the
objects in the same cluster. In each successive iteration, a cluster is split into smaller
clusters. This is done until each object is in its own cluster or the termination condition holds.
Hierarchical methods are rigid, i.e., once a merging or splitting is done, it can never be undone.
Approaches to Improve Quality of Hierarchical Clustering
Here are the two approaches that are used to improve the quality of hierarchical
clustering −
• Perform careful analysis of object linkages at each hierarchical partitioning.
• Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to
group objects into micro-clusters, and then performing macro-clustering on the micro-
clusters.
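A brief sketch of bottom-up (agglomerative) clustering using scikit-learn's AgglomerativeClustering; the data and parameter choices (Ward linkage, three clusters) are illustrative assumptions.

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=1)

# Bottom-up (agglomerative) clustering: each point starts as its own cluster
# and the two closest clusters are merged until only n_clusters remain.
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)
print(labels[:10])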
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing a
given cluster as long as the density in the neighborhood exceeds some threshold, i.e.,
for each data point within a given cluster, the neighborhood of a given radius has to contain at
least a minimum number of points.
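DBSCAN is one well-known density-based algorithm that follows this idea. The sketch below is illustrative; eps (the neighborhood radius) and min_samples (the minimum number of points) are assumed values, and the two-moons data is made up.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: clusters of arbitrary shape that a
# distance-to-centroid method would struggle with.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=1)

# eps is the neighborhood radius, min_samples the minimum number of points
# required for a region to be considered dense.
db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X)
print(set(labels))   # cluster labels; -1 marks noise points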
Grid-based Method
In this method, the objects together form a grid. The object space is quantized into a finite number
of cells that form a grid structure.
Advantages
• The major advantage of this method is fast processing time.
• It is dependent only on the number of cells in each dimension in the quantized space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the best fit of the data for a
given model. This method locates the clusters by clustering the density function, which
reflects the spatial distribution of the data points.
This method also provides a way to automatically determine the number of clusters
based on standard statistics, taking outliers or noise into account. It therefore yields robust
clustering methods.
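A Gaussian mixture model is one common model-based approach. The sketch below uses scikit-learn's GaussianMixture on made-up data; comparing an information criterion such as BIC across different component counts is one possible way to choose the number of clusters.

from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

# Fit a mixture of Gaussians; each component is the hypothesized model of one cluster.
gmm = GaussianMixture(n_components=3, random_state=1)
labels = gmm.fit_predict(X)

# BIC can be compared across different component counts to help choose
# the number of clusters while accounting for noise/outliers.
print(gmm.bic(X))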
Constraint-based Method
In this method, the clustering is performed by the incorporation of user or application-
oriented constraints. A constraint refers to the user expectation or the properties of
desired clustering results. Constraints provide us with an interactive way of
communication with the clustering process. Constraints can be specified by the user or
the application requirement.

What is Artificial Neural Network?


An Artificial Neural Network (ANN) is an efficient computing system whose central theme
is borrowed from the analogy of biological neural networks. ANNs are also named
“artificial neural systems,” “parallel distributed processing systems,” or “connectionist
systems.” An ANN consists of a large collection of units that are interconnected in some
pattern to allow communication between the units. These units, also referred to as nodes
or neurons, are simple processors which operate in parallel.
Every neuron is connected to other neurons through a connection link. Each connection
link is associated with a weight that has information about the input signal. This is the
most useful information for neurons to solve a particular problem, because the weight
usually excites or inhibits the signal that is being communicated. Each neuron has an
internal state, which is called an activation signal. Output signals, which are produced
after combining the input signals and the activation rule, may be sent to other units.

A Brief History of ANN


The history of ANN can be divided into the following three eras −
ANN during 1940s to 1960s
Some key developments of this era are as follows −
• 1943 − It has been assumed that the concept of neural network started with the work of
physiologist, Warren McCulloch, and mathematician, Walter Pitts, when in 1943 they
modeled a simple neural network using electrical circuits in order to describe how neurons in
the brain might work.
• 1949 − Donald Hebb’s book, The Organization of Behavior, put forth the fact that repeated
activation of one neuron by another increases its strength each time they are used.
• 1956 − An associative memory network was introduced by Taylor.
• 1958 − A learning method for McCulloch and Pitts neuron model named Perceptron was
invented by Rosenblatt.
• 1960 − Bernard Widrow and Marcian Hoff developed models called "ADALINE" and
“MADALINE.”
ANN during 1960s to 1980s
Some key developments of this era are as follows −
• 1961 − Rosenblatt made an unsuccessful attempt but proposed the “backpropagation”
scheme for multilayer networks.
• 1964 − Taylor constructed a winner-take-all circuit with inhibitions among output units.

• 1969 − The multilayer perceptron (MLP) was invented by Minsky and Papert.


• 1971 − Kohonen developed Associative memories.
• 1976 − Stephen Grossberg and Gail Carpenter developed Adaptive resonance theory.
ANN from 1980s till Present
Some key developments of this era are as follows −
• 1982 − The major development was Hopfield’s Energy approach.
• 1985 − Boltzmann machine was developed by Ackley, Hinton, and Sejnowski.
• 1986 − Rumelhart, Hinton, and Williams introduced Generalised Delta Rule.

• 1988 − Kosko developed the Binary Associative Memory (BAM) and also gave the concept
of Fuzzy Logic in ANN.
The historical review shows that significant progress has been made in this field. Neural
network based chips are emerging and applications to complex problems are being
developed. Surely, today is a period of transition for neural network technology.

Biological Neuron
A nerve cell (neuron) is a special biological cell that processes information.
According to an estimate, there is a huge number of neurons, approximately 10^11, with
numerous interconnections, approximately 10^15.
Schematic Diagram

Working of a Biological Neuron


As shown in the above diagram, a typical neuron consists of the following four parts with
the help of which we can explain its working −
• Dendrites − They are tree-like branches, responsible for receiving information from the other
neurons the neuron is connected to. In another sense, we can say that they are like the ears of the neuron.
• Soma − It is the cell body of the neuron and is responsible for processing the information
received from the dendrites.
• Axon − It is just like a cable through which the neuron sends the information.
• Synapses − They are the connections between the axon of one neuron and the dendrites of other neurons.
ANN versus BNN
Before taking a look at the differences between the Artificial Neural Network (ANN) and the
Biological Neural Network (BNN), let us take a look at the similarities based on the
terminology between these two.

Biological Neural Network (BNN) − Artificial Neural Network (ANN)
• Soma − Node
• Dendrites − Input
• Synapse − Weights or Interconnections
• Axon − Output

The following comparison shows the differences between ANN and BNN based on some
criteria.

Criteria − BNN − ANN
• Processing − BNN: massively parallel, slow, but superior to ANN. ANN: massively parallel,
fast, but inferior to BNN.
• Size − BNN: 10^11 neurons and 10^15 interconnections. ANN: 10^2 to 10^4 nodes (mainly
depends on the type of application and network designer).
• Learning − BNN: can tolerate ambiguity. ANN: very precise, structured and formatted data is
required to tolerate ambiguity.
• Fault tolerance − BNN: performance degrades with even partial damage. ANN: capable of
robust performance, hence has the potential to be fault tolerant.
• Storage capacity − BNN: stores the information in the synapse. ANN: stores the information
in continuous memory locations.
Model of Artificial Neural Network
The following diagram represents the general model of ANN followed by its processing.

For the above general model of an artificial neural network, the net input can be calculated
as follows −
y_in = x1.w1 + x2.w2 + x3.w3 + … + xm.wm
i.e., net input y_in = Σ (i = 1 to m) xi.wi
The output can be calculated by applying the activation function over the net input −
Y = F(y_in)
i.e., Output = function (net input calculated)
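A small numeric sketch of the net input and activation computation follows; the input values, weights and step threshold are made up, and the binary step function is just one possible choice of F.

import numpy as np

# Illustrative inputs and weights (made-up values).
x = np.array([0.5, 0.2, 0.8])       # input signals x1..xm
w = np.array([0.4, 0.7, 0.1])       # connection weights w1..wm

y_in = np.dot(x, w)                 # net input: sum of xi * wi

# A simple binary step activation function F (one possible choice).
def F(net, threshold=0.5):
    return 1 if net >= threshold else 0

Y = F(y_in)                         # output = F(net input)
print(y_in, Y)                      # 0.42 0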
Association Rules in Data Mining
Definition:
Association rules are if-then statements that help to show the probability of relationships
between data items within large data sets in various types of databases. Association rule
mining has a number of applications and is widely used to help discover sales correlations
in transactional data or in medical data sets.

History:
While the concepts behind association rules can be traced back earlier, association rule
mining was defined in the 1990s, when computer scientists Rakesh Agrawal, Tomasz
Imieliński and Arun Swami developed an algorithm-based way to find relationships
between items using point-of-sale (POS) systems. Applying the algorithms to
supermarkets, the scientists were able to discover links between different items
purchased, called association rules, and ultimately use that information to predict the
likelihood of different products being purchased together.

For retailers, association rule mining offered a way to better understand customer
purchase behaviors. Because of its retail origins, association rule mining is often referred
to as market basket analysis.

How association rules work:

Association rule mining, at a basic level, involves the use of machine learning models to
analyze data for patterns, or co-occurrence, in a database. It identifies frequent if-then
associations, which are called association rules. An association rule has two parts: an
antecedent (if) and a consequent (then). An antecedent is an item found within the data.
A consequent is an item found in combination with the antecedent.

Association rules are created by searching data for frequent if-then patterns and using
the criteria support and confidence to identify the most important relationships. Support is
an indication of how frequently the items appear in the data. Confidence indicates the
number of times the if-then statements are found true. A third metric, called lift, can be
used to compare confidence with expected confidence. Association rules are calculated
from item sets, which are made up of two or more items. If rules are built from analyzing
all the possible item sets, there could be so many rules that the rules hold little meaning.
With that, association rules are typically created from rules well-represented in data.
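The following sketch illustrates support, confidence and lift on a tiny made-up transaction list; the support helper and the item names are illustrative, not drawn from any real data set.

# Compute support, confidence and lift for a candidate rule A -> B
# over a small list of transactions (illustrative data).
transactions = [
    {'bread', 'milk'},
    {'bread', 'diapers', 'beer', 'eggs'},
    {'milk', 'diapers', 'beer', 'cola'},
    {'bread', 'milk', 'diapers', 'beer'},
    {'bread', 'milk', 'diapers', 'cola'},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

A, B = {'diapers'}, {'beer'}
supp_rule = support(A | B)              # how frequently A and B appear together
confidence = supp_rule / support(A)     # how often the if-then statement holds
lift = confidence / support(B)          # confidence compared with expected confidence

print(supp_rule, confidence, lift)      # approximately 0.6 0.75 1.25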

Association rule algorithms:

Popular algorithms that use association rules include AIS, SETM, Apriori and variations of
the latter. With the AIS algorithm, item sets are generated and counted as it scans the
data. In transaction data, the AIS algorithm determines which large item sets are contained in a
transaction, and new candidate item sets are created by extending the large item sets
with other items in the transaction data.

The SETM algorithm also generates candidate item sets as it scans a database, but this
algorithm accounts for the item sets at the end of its scan. New candidate item sets are
generated the same way as with the AIS algorithm, but the transaction ID of the
generating transaction is saved with the candidate item set in a sequential structure. At
the end of the pass, the support count of candidate item sets is created by aggregating
the sequential structure. The downside of both the AIS and SETM algorithms is that each
one can generate and count many small candidate item sets, according to published
materials from Dr. Saed Sayad, author of Real Time Data Mining. With the Apriori
algorithm, candidate item sets are generated using only the large item sets of the previous
pass. The large item sets of the previous pass are joined with themselves to generate all item sets
whose size is larger by one. Each generated item set that has a subset which is not large is
then deleted. The remaining item sets are the candidates (see the sketch below). The Apriori algorithm considers
any subset of a frequent item set to also be a frequent item set. With this approach, the
algorithm reduces the number of candidates being considered by only exploring the item
sets whose support count is greater than the minimum support count, according to Sayad.
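A rough sketch of the Apriori join-and-prune step described above, under the assumption that a previous pass produced the listed large 2-item sets; apriori_gen is an illustrative name and this is not a complete Apriori implementation.

from itertools import combinations

# One Apriori pass: join the large (frequent) item sets of the previous pass
# to build candidates one item larger, then prune candidates that have a
# subset which is not large.
def apriori_gen(prev_large):
    prev_large = [frozenset(s) for s in prev_large]
    k = len(prev_large[0]) + 1
    # Join step: union pairs of previous large item sets that differ by one item.
    candidates = {a | b for a in prev_large for b in prev_large if len(a | b) == k}
    # Prune step: every (k-1)-subset of a candidate must itself be large.
    return [c for c in candidates
            if all(frozenset(sub) in prev_large for sub in combinations(c, k - 1))]

# Example: large 2-item sets from a previous pass (illustrative).
L2 = [{'bread', 'milk'}, {'bread', 'diapers'}, {'milk', 'diapers'}, {'diapers', 'beer'}]
print(apriori_gen(L2))   # e.g. [frozenset({'bread', 'milk', 'diapers'})]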
Uses of association rules in data mining:

In data mining, association rules are useful for analyzing and predicting customer
behavior. They play an important part in customer analytics, market basket analysis,
product clustering, catalog design and store layout. Programmers use association rules
to build programs capable of machine learning. Machine learning is a type of artificial
intelligence (AI) that seeks to build programs with the ability to become more efficient
without being explicitly programmed.

Examples of association rules in data mining:

A classic example of association rule mining refers to a relationship between diapers and
beers. The example, which seems to be fictional, claims that men who go to a store to
buy diapers are also likely to buy beer. Data that would point to that might look like this:

A supermarket has 200,000 customer transactions. About 4,000 transactions, or about
2% of the total transactions, include the purchase of diapers. About 5,500 transactions
(2.75%) include the purchase of beer. Of those, about 3,500 transactions, 1.75%, include
both the purchase of diapers and beer. Based on the percentages, that number should
be much lower. However, the fact that about 87.5% of diaper purchases include the
purchase of beer indicates a link between diapers and beer.
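The figures above can be checked with the support, confidence and lift definitions from earlier; the short calculation below just reproduces the arithmetic.

# Worked numbers from the example above.
total = 200_000
diapers = 4_000          # support(diapers) = 2%
beer = 5_500             # support(beer) = 2.75%
both = 3_500             # support(diapers and beer) = 1.75%

confidence = both / diapers                 # 0.875  -> 87.5% of diaper purchases include beer
expected_confidence = beer / total          # 0.0275 -> what we would expect if the items were independent
lift = confidence / expected_confidence     # about 31.8, far above 1, indicating a strong link

print(confidence, expected_confidence, round(lift, 1))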
