
IV Distance and Rule Based Models

4.1 Distance Based Models

Distance-based learning models learn from input data by measuring similarity.

A distance measure summarizes the relative difference between two objects in a problem
domain. A similarity measure is a metric which determines how alike two objects are;
similarity values typically lie in the range 0 to 1.

If the distance between two data points is small, the similarity is high; if the distance is large,
the similarity is low. To classify a new sample, distance-based models take advantage of
similarity measures computed against the training data.

Many distance measures are available; here we will learn the most popular
distance measures used in machine learning:
a) Manhattan Distance
b) Euclidean Distance
c) Minkowski Distance
d) Hamming Distance
e) Cosine Similarity

a) Manhattan Distance

Figure 4.1.1 :- Manhattan Distance

The name comes from the grid-like street layout of Manhattan, a city borough in the US.

Manhattan distance is defined as the sum of the absolute differences of the Cartesian coordinates
in each direction (x and y in two dimensions).

So, the Manhattan distance in a 2-dimensional space is given as:

d(p, q) = |p1 − q1| + |p2 − q2|

The generalized formula for an n-dimensional space is given as:

d(p, q) = Σi=1..n |pi − qi|

Where,

• n = number of dimensions
• pi, qi = the i-th coordinates of data points p and q

Manhattan distance is also known as City Block distance.

In machine learning, Manhattan distance is advisable for high-dimensional data and when you
want to reduce the influence of outliers, since differences are not squared and the distance grows
only linearly with them.
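To make the formula concrete, here is a minimal Python sketch (the function name and sample points are illustrative, not from the text) that computes the Manhattan distance between two n-dimensional points:

    def manhattan_distance(p, q):
        """Sum of absolute coordinate differences between points p and q."""
        if len(p) != len(q):
            raise ValueError("points must have the same number of dimensions")
        return sum(abs(pi - qi) for pi, qi in zip(p, q))

    # Example: distance between (1, 2) and (4, 6) is |1 - 4| + |2 - 6| = 7
    print(manhattan_distance((1, 2), (4, 6)))  # 7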

b) Euclidean Distance

It is the distance between a pair of data points A and B, i.e. the length of the straight line
connecting the two points in n-dimensional feature space.

Figure 4.1.2:- Euclidean Distance

As shown in the figure, a straight line connects two data points A and B in 2D feature space.

The Euclidean distance formula in 2D space is given by:

d(p, q) = √((p1 − q1)² + (p2 − q2)²)

The generalized formula in n-dimensional space is given by:

d(p, q) = √( Σi=1..n (pi − qi)² )

Where,

• n = number of dimensions
• pi, qi = the i-th coordinates of data points p and q

c) Minkowski Distance

A vector space is a collection of objects (vectors) that can be added together and scaled.

A normed vector space assigns a positive length (norm) to each vector of real or complex numbers.

In n-dimensional real space, the Minkowski distance between two data points x and y generalizes
the distances above and is given by the following formula:

d(x, y) = ( Σi=1..n |xi − yi|^p )^(1/p)

If p = 1 then

Minkowski distance = Manhattan distance

If p = 2 then

Minkowski distance = Euclidean distance
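The two special cases can be checked with a short sketch (the function name and sample points are illustrative):

    def minkowski_distance(x, y, p):
        """Minkowski distance of order p between equal-length points x and y."""
        return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

    x, y = (1, 2), (4, 6)
    print(minkowski_distance(x, y, p=1))  # 7.0 -> Manhattan distance
    print(minkowski_distance(x, y, p=2))  # 5.0 -> Euclidean distance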

d) Hamming Distance

If “SPPU” and “BATU” are two strings of the same length, then the Hamming distance is the
number of positions at which the corresponding characters are different.
As the lengths of these strings are equal, we can find the differing characters by matching the
strings character by character. The first characters “S” and “B” are different; continuing in this
way, we find that 3 characters are different and 1 character (the final “U”) is the same. So here the
Hamming distance is 3.

The larger the Hamming distance between two strings, the more dissimilar they are, and vice versa.
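A minimal sketch of this character-by-character comparison (the function name is illustrative):

    def hamming_distance(s, t):
        """Number of positions at which two equal-length strings differ."""
        if len(s) != len(t):
            raise ValueError("strings must have the same length")
        return sum(1 for a, b in zip(s, t) if a != b)

    print(hamming_distance("SPPU", "BATU"))  # 3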

e) Cosine Similarity

Cosine similarity is the normalized dot product of two attribute vectors:

cos(θ) = (A · B) / (||A|| ||B||)

It basically measures the angle of separation between them.

It judges the orientation of the two input vectors rather than their magnitude.

Figure 4.1.3:- Cosine Similarity

If two input vectors point in the same direction, the angle between them is 0 and cos(0) = 1.
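A minimal sketch of the normalized dot product (the function name and sample vectors are illustrative):

    import math

    def cosine_similarity(a, b):
        """Cosine of the angle between vectors a and b: dot product over the product of norms."""
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    print(cosine_similarity((1, 2), (2, 4)))  # 1.0 (same direction)
    print(cosine_similarity((1, 0), (0, 1)))  # 0.0 (orthogonal vectors)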

4.1.1 Neighbors and Exemplars


Distance based models are based on two main concepts:
i. Identification of an object as an exemplar.
ii. The rules that determine a matching object's group membership.
Machine learning models use exemplars from the labeled training dataset to group or classify
other objects.
An exemplar can be:
1) Arithmetic Mean
The arithmetic mean is obtained by adding up all the numbers in a series and dividing by how
many numbers there are. For example, for the series 5, 10, 7:
5 + 10 + 7 = 22
22 / 3 ≈ 7.33
Arithmetic mean ≈ 7.33

Question:- Show that the arithmetic mean minimizes squared Euclidean distance.

The arithmetic mean μ of a set of data points D in a Euclidean space is the unique point that
minimises the sum of squared Euclidean distances to those data points.
Proof. We will show that

argmin_y Σx∈D ||x − y||² = 1/|D| Σx∈D x = μ

where ||·|| denotes the 2-norm. We find this minimum by taking the gradient (the vector of partial
derivatives with respect to the components of y) of the sum and setting it to the zero vector:

∇y Σx∈D ||x − y||² = −2 Σx∈D (x − y) = −2 Σx∈D x + 2|D| y = 0

from which we derive y = 1/|D| Σx∈D x = μ. Since |D| is a constant,
minimizing the sum of squared Euclidean distances of a given set of points is the same as
minimizing the average squared Euclidean distance.
2) Geometric Mean and Geometric Median:

The geometric mean is an average that indicates the central tendency or typical value of a set of
numbers by using the product of their values. The related exemplar used in distance-based models
is the geometric median: the point that minimises the sum of (unsquared) Euclidean distances to
the data points.
Univariate data means one variable or one type of data; for univariate data the geometric median
corresponds to the ordinary median value. However, for multivariate data there is no closed-form
expression for the geometric median, which needs to be calculated by successive approximation.
The fact that the arithmetic mean does have a closed form is the main computational reason why
distance-based methods tend to use squared Euclidean distance.
To calculate the median, arrange the data in ascending order and then take the middle element as
the median value.
Let's consider
Z = {1, 3, 3, 6, 7, 8, 9}

Median = 6
3) Medoids:
In certain situations we may need to restrict an exemplar to be one of the data points.
Finding a medoid requires us to calculate, for each data point, the total distance to all other data
points, in order to choose the point that minimises it.

Regardless of the distance metric used, this is an O(n²) operation for n points, so for medoids
there is no computational reason to prefer one distance metric over another.

4) Centroids:-

The centroid can be thought of as the multi-dimensional average of the cluster.

The centroid is the centre of mass of the cluster according to the chosen distance metric.

A centroid need not be a member of the given input instances.

Centroids, medoids, the arithmetic mean and the geometric median can all be used as exemplars
in distance-based models.

We learn or find an exemplar for each class and use the nearest-exemplar decision rule to classify
new data.

Examples of Distance based learning models are:


1. KNN Classification
2. K-means clustering
3. Hierarchical clustering

4.1.2 Nearest neighbor classification


A distance-based classifier can use each training instance as an exemplar. Consequently, ‘training’
this classifier requires nothing more than memorising the training data. This extremely simple
classifier is known as the nearest-neighbour classifier.
Properties of the nearest-neighbour classifier:-
a) The right exemplars must be chosen to find a good decision boundary.
b) The nearest-neighbour classifier needs to memorise all training instances.
c) The nearest-neighbour classifier has low bias, but also high variance, and there is a risk of
overfitting if the training data is limited, noisy or unrepresentative.
d) Storing n exemplars takes only O(n) time, so it is a fast algorithm to train. The downside is
that classifying a single instance also takes O(n) time, as the instance needs to be compared
with every exemplar to determine which one is the nearest.
e) In a high-dimensional instance space every data point is far away from every other data
point, so pairwise distance calculations may tend to be uninformative.
f) A solution is to perform dimensionality reduction using algorithms such as Principal
Component Analysis, or to use feature selection.
g) A good idea before applying nearest-neighbour classification is to plot a histogram of the
pairwise distances of a sample, to see whether they are sufficiently varied.
h) The nearest-neighbour method can easily be applied to regression problems with a
real-valued target variable.

4.1.2.1 KNN Classifier Algorithm:-

The K-nearest neighbors (KNN) algorithm is a supervised classification algorithm. As it is a
nearest-neighbour classification algorithm, we can use the Euclidean distance measure to find the
nearest objects or neighbours. Two properties of KNN are as follows:
• Lazy learning algorithm – KNN is called a lazy learning algorithm because it has no
explicit training phase and makes use of all the data while doing the classification.
• Non-parametric learning algorithm – KNN is a non-parametric learning algorithm
because it doesn't make any assumptions about the underlying data distribution.

So KNN can be a good choice of classification algorithm when there is little or no prior
knowledge about the distribution of the data.

KNN Algorithm:-
Step 1 – Input: load the labeled training dataset and the test dataset.
Let the input instances used for learning the model be (x1, y1), (x2, y2), (x3, y3), . . . , (xn, yn)
Where,
xi = d-dimensional input instance
yi = class to which xi belongs
Assume the test instance is xt.

Step 2 – Choose the value of k as any integer greater than the number of classes in the given
problem. Generally the value of k is an odd number.
Step 3 − For each point in the test data do the following:
• 3.1 – Calculate the Euclidean distance between the test instance xt and every training
instance xi, where i = 1 to n. Different distance measures can be used, such as Manhattan or
Hamming distance, but the most commonly used distance measure is Euclidean.
• 3.2 – Arrange all calculated distances in ascending order.
• 3.3 – Now, select the top k values from the sorted distances calculated in step 3.2.
If k = 3 then select the first 3 values from the sorted distances.
• 3.4 – The test instance is assigned to the most frequent class among these rows.
Step 4 – Stop.
Choosing the value of k:
• If k is too small, the classifier is sensitive to noise points.
• If k is too large, the neighbourhood may include points from other classes.
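A minimal pure-Python sketch of these steps, using the internal/external examiner data from the solved example later in this section (the function name is illustrative):

    import math
    from collections import Counter

    def knn_predict(train, test_point, k=3):
        """Classify test_point by majority vote among its k nearest training instances."""
        distances = []
        for features, label in train:
            d = math.sqrt(sum((f - t) ** 2 for f, t in zip(features, test_point)))  # step 3.1
            distances.append((d, label))
        distances.sort(key=lambda pair: pair[0])               # step 3.2: sort ascending
        top_k_labels = [label for _, label in distances[:k]]   # step 3.3: take the top k
        return Counter(top_k_labels).most_common(1)[0][0]      # step 3.4: majority class

    train = [((7, 7), "Pass"), ((7, 4), "Pass"), ((3, 4), "Fail"), ((1, 4), "Fail")]
    print(knn_predict(train, (3, 7), k=3))  # Fail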

Pros and Cons of KNN

Pros
• It is very simple to understand and easy to interpret.
• As the algorithm makes no assumptions about the data, it is useful for non-linear data.
• It can be used for classification as well as regression.
• Its accuracy tends to improve as the size of the dataset increases.
Cons
• It is computationally expensive, as it has to calculate the distance of the test instance to
each data point in the dataset.
• High memory storage is required compared to other supervised learning algorithms.
• The output of the KNN classifier may change according to the chosen value of k.
• It is very sensitive to the scale of the data as well as to irrelevant features.
• The prediction stage may be slow when n is large.

Applications of KNN:
1. Speech Recognition
2. Handwriting Detection
3. Image Recognition
4. Video Recognition.
5. Banking system
6. Calculating credit ratings

Solved Examples:--

Suppose that, on a scale of 1 to 10 (where 1 is lowest and 10 is highest), a student is evaluated by an
internal examiner and an external examiner, and accordingly the student's result can be Pass or Fail.
Sample data is collected for 4 students. If a new student is rated 3 by the internal examiner and 7 by
the external examiner (test instance), decide the new student's result.

Student No.   (Xi1) Rating by internal examiner   (Xi2) Rating by external examiner   (Y) Result
S1            7                                   7                                   Pass
S2            7                                   4                                   Pass
S3            3                                   4                                   Fail
S4            1                                   4                                   Fail
Snew          3                                   7                                   ?

Solution:-
Calculate the Euclidean distance of Snew = (3, 7) from each training instance:
d(Snew, S1) = √((3 − 7)² + (7 − 7)²) = √16 = 4
d(Snew, S2) = √((3 − 7)² + (7 − 4)²) = √25 = 5
d(Snew, S3) = √((3 − 3)² + (7 − 4)²) = √9 = 3
d(Snew, S4) = √((3 − 1)² + (7 − 4)²) = √13 ≈ 3.61
With k = 3, the three nearest neighbours are S3 (Fail), S4 (Fail) and S1 (Pass). The majority class is
Fail, so the new student's result is predicted as Fail.

4.1.3 Distance based clustering algorithms

In a distance-based context, unsupervised learning is usually taken to refer to clustering, and we
will now review a number of distance-based clustering methods.

There are two types of clustering methods:


1. Distance based clustering by using exemplars
For example:-K-Means clustering uses centroid as exemplar
2. Distance based clustering without using exemplars
For example:-Hierarchical clustering

The goal of Distance based clustering is to find clusters that are compact with respect to distance
metric.
This requires a notion of cluster compactness that can serve as our optimisation criterion.

Scatter Matrix:- Given a data matrix X with rows x1, . . . , xn, the scatter matrix is the matrix

S = Σi=1..n (xi − μ)ᵀ (xi − μ)

where μ is a row vector containing all column means of X. The scatter of X is defined as

Scat(X) = Σi=1..n ||xi − μ||²

which is equal to the trace of the scatter matrix.

Let us partition D into K subsets D1 ∪ . . . ∪ DK = D, and let μj denote the mean of Dj. Let S be the
scatter matrix of D, and Sj be the scatter matrix of Dj. These scatter matrices then have the
following relationship:

S = Σj=1..K Sj + B

Here, B is the scatter matrix that results from replacing each point in D with the corresponding μj.

Sj is the within-cluster scatter matrix and describes the compactness of the j-th cluster.

B is the between-cluster scatter matrix and describes the spread of the cluster centroids.

It follows that the traces of these matrices can be decomposed in the same way, which gives

Scat(D) = Σj=1..K Scat(Dj) + Scat(B)

so minimising the total scatter over all clusters is equivalent to maximising the (weighted) scatter of
the centroids.

The K-means problem is to find a partition that minimises the total within-cluster scatter.

Reducing scatter by partitioning data:-

Consider the following five points: (0,3), (3,3), (3,0), (−2,−4) and (−4,−2). These points are,
conveniently, centred around (0,0). The scatter matrix is

S = [ 38  25 ]
    [ 25  38 ]

with trace Scat(D) = 76. If we cluster the first two points together in one cluster and the
remaining three in another, then we obtain cluster means μ1 = (1.5, 3) and μ2 = (−1, −2) and
within-cluster scatter matrices

S1 = [ 4.5  0 ]        S2 = [ 26  10 ]
     [ 0    0 ]             [ 10   8 ]

with traces Scat(D1) = 4.5 and Scat(D2) = 34. Two copies of μ1 and three copies of μ2 have, by
definition, the same centre of gravity as the complete data set: (0,0) in this case. We thus
calculate the between-cluster scatter matrix as

B = [ 7.5  15 ]
    [ 15   30 ]

with trace 37.5.

Alternatively, if we treat the first three points as a cluster and put the other two
in a second cluster, then we obtain cluster means μ1 = (2, 2) and μ2 = (−3, −3), and within-cluster
scatter matrices

S1 = [ 6  −3 ]         S2 = [ 2  −2 ]
     [ −3  6 ]              [ −2  2 ]

with traces Scat(D1) = 12 and Scat(D2) = 4. The between-cluster scatter matrix is

B = [ 30  30 ]
    [ 30  30 ]

with trace 60. Clearly, the second clustering produces tighter clusters whose centroids are further
apart.
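The decomposition Scat(D) = Σj Scat(Dj) + Scat(B) can be checked numerically; a minimal NumPy sketch for the second clustering above (the function name is illustrative):

    import numpy as np

    def scatter(points):
        """Scatter matrix S = sum_i (x_i - mu)^T (x_i - mu) for the rows of `points`."""
        X = np.asarray(points, dtype=float)
        centred = X - X.mean(axis=0)
        return centred.T @ centred

    D  = [(0, 3), (3, 3), (3, 0), (-2, -4), (-4, -2)]
    D1 = [(0, 3), (3, 3), (3, 0)]
    D2 = [(-2, -4), (-4, -2)]
    # B replaces each point with its cluster mean
    B  = [np.mean(D1, axis=0)] * len(D1) + [np.mean(D2, axis=0)] * len(D2)

    print(np.trace(scatter(D)))   # 76.0
    print(np.trace(scatter(D1)))  # 12.0
    print(np.trace(scatter(D2)))  # 4.0
    print(np.trace(scatter(B)))   # 60.0 = 76 - 12 - 4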

4.1.3.1 K-Means Clustering-

K-means clustering is a distance-based clustering algorithm which uses a centroid as its exemplar.

K-means clustering is an unsupervised learning algorithm which works by iterative clustering.

It is also called Lloyd's algorithm, but it is popularly known as the K-means algorithm.

There is no efficient solution to find the global minimum of the K-means problem, so it is
referred to as an NP-complete problem.

The algorithm iterates between partitioning the data using the nearest-centroid decision rule and
recalculating the centroids from the partition.

K-Means(D, K) – K-means clustering using Euclidean distance Dis2.

K-Means Algorithm:-
We can summarize the K-Means algorithm as follows:
1. Define the number of clusters as K.
2. Randomly choose K of the data points in the instance space as the initial centroids
C1, C2, C3, . . . , CK of the K clusters.
3. Calculate the distance of each data point to the centre of each cluster. In the K-Means
algorithm the distance is calculated using the Euclidean distance measure.
4. Assign each data point from the instance space to the cluster whose centre is nearest
to the data point.
5. Form the new clusters and re-calculate the centre of each new cluster by taking the
mean of all data points belonging to that respective cluster.
6. Repeat the procedure from step 3 to step 5 until any of the stopping criteria is met:
• the maximum number of iterations has been reached;
• the same data points remain in the same clusters;
• the centres of the new clusters do not change.
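A minimal NumPy sketch of these steps (Lloyd's algorithm); the function name is illustrative, and the sample points are reused from the scatter example above:

    import numpy as np

    def k_means(X, k, max_iter=100, seed=0):
        """Alternate nearest-centroid assignment and centroid update until the centres settle."""
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 2: random initial centroids
        for _ in range(max_iter):
            # steps 3-4: assign each point to its nearest centroid (Euclidean distance)
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # step 5: recompute each centroid as the mean of its assigned points
            # (empty-cluster handling is omitted for brevity)
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):              # step 6: centres not changing
                break
            centroids = new_centroids
        return labels, centroids

    X = [(0, 3), (3, 3), (3, 0), (-2, -4), (-4, -2)]
    labels, centroids = k_means(X, k=2)
    print(labels, centroids)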

Elbow Method to select the best value of K :-

The Elbow method can be used to select the best value of K in the K-Means clustering algorithm.
Find the sum of squared distances (SSE) between the data points and their centroids for a range of
values of K.
Pick the value of K at the point where the SSE starts flattening out, leading to the formation of an
elbow as shown in the figure.

Figure 4.1.4:- Elbow Method

The above graph indicates that K = 2 is a good choice for the number of clusters based on the
elbow method, but for some datasets the graph of SSE may not form a clear elbow.
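A minimal sketch of the elbow computation, assuming scikit-learn is available (the synthetic two-blob data is illustrative); KMeans exposes the SSE as its inertia_ attribute:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Hypothetical data: two well-separated blobs, so the elbow should appear at K = 2
    X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(8, 1, size=(50, 2))])

    for k in range(1, 7):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, round(km.inertia_, 1))  # SSE drops sharply up to k = 2, then flattens out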
Advantages of K-Means Clustering:-
1. If the value of K is small, K-Means is computationally faster than hierarchical
clustering.
2. It is simple to implement.
3. Convergence is guaranteed.
4. It can generalize to clusters of different shapes and sizes, such as elliptical clusters.
Disadvantages of K-Means Clustering:-
1. It is difficult to select the value of K manually.
2. A change in the initial centroids leads to different final clusters.
3. K-Means is sensitive to outliers, which can distort the centroids or lead to the
formation of an extra cluster, so outliers should be removed before applying the
algorithm.
4. As the number of dimensions in the dataset increases, distance-based similarity
measures converge to a constant value between any given examples.

Applications of K-Means Clustering:-

1. Customers of banks, e-commerce, sports, sales and insurance businesses are grouped into
clusters, helping the owner to identify regular customers and offer them discounts.
2. Clustering helps a search engine group the search results for a query fired by the user.
3. In image segmentation, clustering can be used to group similar pixels.
4. Clustering is used in recommendation systems, e.g. a song recommendation system forms
clusters of songs similar to the one searched for by the user and suggests similar songs.


4.1.3.2 K-medoids Clustering Algorithm:

K-Medoid clustering uses a medoid as its exemplar.

Any distance metric can be used to calculate distances in K-Medoid clustering, whereas the
Euclidean distance metric is used in K-Means clustering.

The K-means algorithm is sensitive to outliers, since an object with an extremely large value may
substantially distort the distribution of the data.

Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be
used, which is the most centrally located object in the cluster.

Calculating the medoid of a cluster requires examining all pairs of points – whereas calculating
the mean requires just a single pass through the points – which can be prohibitive for large data
sets.

The instance selected as the medoid is the one whose sum of distances to all other instances is
minimal.

K-medoid clustering is also called the Partitioning Around Medoids (PAM) algorithm.

An important limitation of the clustering methods discussed in this section is that they represent
clusters only by means of exemplars.
The time complexity of the K-Medoid algorithm is O(k(n − k)²).
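A minimal sketch of selecting a medoid as the instance with minimal total distance to the others (the function name and sample points are illustrative):

    import math

    def medoid(points):
        """Return the point whose total Euclidean distance to all other points is minimal."""
        def total_distance(p):
            return sum(math.dist(p, q) for q in points)   # O(n) per point, O(n^2) overall
        return min(points, key=total_distance)

    print(medoid([(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]))  # (1, 1); the outlier is never chosen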

Advantages of the K-Medoid algorithm:-

1. It is easy to understand and easy to implement.
2. It is a fast algorithm and converges in a fixed number of steps.
3. It is less sensitive to outliers than K-Means.
Disadvantages of the K-Medoid algorithm:-
1. It is not suitable for arbitrarily shaped groups of objects.
2. It may give different results for different runs on the same dataset.

4.1.3.3 Hierarchical Clustering:-

• Hierarchical clustering is very common; when scientists look at the animal kingdom they
divide it hierarchically, as shown in the figure.
• Animals are vertebrates or invertebrates. Within vertebrates we find fish, reptiles, mammals
and so on, whereas among invertebrates we have worms, insects, etc. So hierarchical
clustering produces a nested sequence of clusters.
animal

vertebrate invertebrate

fish reptile amphib. mammal worm insect crustacean

Figure 4.1.5 :- Hierarchical Clustering

• In order to get a hierarchical clustering we can recursively use a partitioning algorithm.

• Clustering methods like K-Means and K-Medoids use exemplars such as centroids and
medoids, and so represent predictive clustering.
• Hierarchical clustering is represented using trees called dendrograms, which are
defined in terms of a distance measure.
• Dendrograms partition the given data rather than the entire instance space, representing
descriptive clustering.
• So K-Means and K-Medoids are examples of predictive distance-based learning models,
and hierarchical clustering is an example of a descriptive distance-based learning model.

Types of Hierarchical Clustering:-

• Agglomerative (bottom up) clustering: Agglomerative clustering uses a bottom-up
approach to build clusters.
• We assume that every object is a cluster of its own, then find the distance between each
pair of clusters and merge the pair that is closest to each other.
• So, it builds the dendrogram (tree) from the bottom level, and
– merges the most similar (or nearest) pair of clusters,
– stops when all the data points are merged into a single cluster (i.e., the root
cluster).

• Divisive (top down) clustering: It starts with all data points in one cluster, the root.
– It splits the root into a set of child clusters, and each child cluster is recursively
divided further.
– It stops when only singleton clusters of individual data points remain, i.e., each
cluster contains only a single point.

Agglomerative (bottom up) clustering:

Figure 4.1.6:- A Sample Dendrogram (six data points merged step by step; the vertical axis shows
the distance at which clusters are merged)

• Consider an input set S, where S contains different data points as shown in the figure.
• Each node in the figure represents a subset of S, defined as the union of its children,
and represents a cluster.
• The root of the dendrogram represents the whole input set S.
• The leaves are the individual elements of S.

Each level of the tree represents a partition of the input data into several (nested) clusters or
groups.

The tree may be cut at any level: each connected component then forms a cluster.

Agglomerative clustering Algorithm:-

1. Initially each data point forms a cluster.


2. Compute the distance matrix between the clusters.
3. Repeat
a. Merge the two closest clusters
b. Update the distance matrix
4. Until only a single cluster remains.
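A minimal sketch of this procedure using SciPy's hierarchical clustering routines (the sample points are illustrative, and SciPy is assumed to be available):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Hypothetical 2-D points forming two visually obvious groups
    X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]])

    Z = linkage(X, method='single')   # 'single', 'complete' or 'average' linkage
    print(Z)                          # each row records the two clusters merged and their distance
    # Cut the dendrogram into 2 clusters: the first three points end up together, as do the last three
    print(fcluster(Z, t=2, criterion='maxclust'))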

Different definitions of the distance lead to different algorithms.

1. Each individual point is taken as a cluster.


Figure 4.1.7:- Individual data points

2. Construct the distance/proximity matrix.

Figure 4.1.8 :- Distance/proximity matrix (rows and columns indexed by the points p1, . . . , p5)

3. Intermediate State
After some merging steps, we have some clusters

Distance/Proximity Matrix
Figure 4.1.9 :-Intermediate Step

4. Merge the two closest clusters (C2 and C5) and update the distance matrix as
shown below:

Distance/Proximity Matrix
Figure 4.1.10:- Merge and update distance matrix
5. After merging clusters update distance matrix

Figure 4.1.11:- Cluster Merging

To measure the distance between two clusters, some popular linkage measures are available, as
follows:

• Single-link
– Similarity of the most similar pair of points (single link).
– The single-link distance between clusters Ci and Cj is the minimum distance between
any object in Ci and any object in Cj; equivalently, in terms of similarity,
sim(Ci, Cj) = max { sim(x, y) : x ∈ Ci, y ∈ Cj }

A single-link example is shown in the figure.

Figure 4.1.12:-Single Link Example

It can result in “straggly” (long and thin) clusters due to the chaining effect.

• The distance is determined by one pair of points, i.e., by one link in the proximity graph.

Figure 4.1.12- Single link with sample data points 1, 2, 3, 4, 5

• Complete-link

– Similarity of the least similar points.
– The distance between two clusters is the distance between the two furthest data points in
the two clusters; in terms of similarity,
sim(Ci, Cj) = min { sim(x, y) : x ∈ Ci, y ∈ Cj }

– It makes “tighter”, spherical clusters that are typically preferable.
– It is sensitive to outliers because they are far away.
– The distance between clusters is determined by the two most distant points in the
different clusters.

Figure 4.1.13 :- Complete-link clustering: example (sample data points 1, 2, 3, 4, 5)

• Average-link
– Returns the average similarity (e.g., the average cosine similarity) between pairs of elements.
– The similarity of two clusters is the average similarity between any object in Ci and any
object in Cj:
sim(Ci, Cj) = 1 / (|Ci| |Cj|) Σx∈Ci Σy∈Cj sim(x, y)

• It is a compromise between single and complete link, and is less susceptible to noise and
outliers.
• Two options:
– Averaged across all ordered pairs in the merged cluster.
– Averaged over all pairs between the two original clusters.

Time Complexity:-
• All of the algorithms are at least O(n²), where n is the number of data points.
• Single link can be done in O(n²).
• Complete and average link can be done in O(n² log n).
• Due to this complexity, hierarchical clustering is hard to use for large data sets.

Table 4.1.1 Clustering Methods Comparison

                    Hierarchical                               K-means
Running time        naively, O(N³)                             fastest (each iteration is linear)
Assumptions         requires a similarity / distance measure   strong assumptions
Input parameters    none                                       K (number of clusters)
Clusters            subjective (only a tree is returned)       exactly K clusters

Problem on Single linkage Clustering:-

We have five objects and we put each object into its own cluster, so that each object is itself a
cluster. In the beginning we have 5 clusters; we now group these clusters so that at the end of the
iterations we have only a single cluster containing all 5 original objects.

Consider the table given below, which indicates the pairwise distances calculated for the clusters.

In each iteration we find the closest pair of clusters and merge them.

1. Initially, the closest distance is between cluster 1 and cluster 2, so merge them
as shown below.
The distance between cluster (1,2) and each of the remaining clusters 3, 4 and 5 is calculated as
given below, and the distance matrix is updated.

Figure 4.1.14 Distance matrix and Dendrogram after 1st iteration

2. Now, according to the updated distance matrix, cluster (1,2) is closest to cluster 3, so
merge cluster (1,2) with cluster 3 and update the distance matrix by calculating the
minimum distances as shown below:
Figure 4.1.15 Distance matrix and Dendrogram after 2nd iteration

3. Now only cluster 4 and cluster 5 remain, so calculate their distance and
merge them.

Figure 4.1.16:- Final dendrogram showing all clusters

Now all objects are under one cluster, so the computation is finished, and the result of the
computation is summarized as follows:

1. In the beginning we have 5 objects.
2. We merge cluster (1) and cluster (2) into cluster (1,2) at a distance of 2.
3. We merge cluster (1,2) with cluster (3) at a distance of 3.
4. Then we merge cluster (4) with cluster (5) at a distance of 4.
5. Finally we merge cluster (1,2,3) with cluster (4,5) at a distance of 5.
6. The last cluster contains all 5 objects, so the computation concludes.

4.2 Rule Based Models:-

Rule based models accept a set of instances and construct rules to achieve the goal of the machine
learning task.

They search for the best rule from the available set of candidate rules.

Rule based models can be descriptive or predictive. We will learn descriptive rule based models.

Descriptive rule mining:- It can be applied in supervised and unsupervised learning.

1) Supervised:- Adapt rules to perform subgroup discovery.

2) Unsupervised:- Learn frequent item sets and association rules.

When highly overlapping rules are converted into a decision tree, the decision tree becomes
exponential in size.

Therefore, rule based models can be preferred over tree based models.

4.2.1 Rule Learning for Subgroup Discovery:-

The concept of subgroup discovery was initially introduced by Kloesgen and Wrobel, and is
defined as:

“In subgroup discovery, we assume we are given a so-called population of individuals (objects,
customer, ...) and a property of those individuals we are interested in. The task of subgroup
discovery is then to discover the subgroups of the population that are statistically “most
interesting”, i.e. are as large as possible and have the most unusual statistical (distributional)
characteristics with respect to the property of interest.”

Subgroup discovery identifies relationships between variables with respect to a target variable,
and these relationships are expressed in the form of rules.

A rule containing a subgroup description is defined as follows:

R : Cond → Targetvalue

Targetvalue = the variable of interest, or target variable, for subgroup discovery.

Cond = a conjunction of features (a statistical description) with respect to the target variable.
Subgroup discovery can be done using SD algorithms, which extract rules from subsets of the data.

Subgroup discovery has mostly been applied in the medical domain.

Two types of induction:

1. Predictive induction:-
Its objective is to discover knowledge for classification or prediction.
Predictive induction techniques include classification and regression.
2. Descriptive induction:-
Its objective is to extract interesting knowledge (patterns) from the data. Descriptive induction
techniques include association rule mining and clustering.

The figure shows a rule for subgroup discovery for a target variable having the values Targetvalue =
x and Targetvalue = o.

Figure 4.2.1 :- Subgroup Discovery for Targetvalue = x

In the figure, the target value Targetvalue = x is of interest, and a high number of objects (x) is
covered by a single, simple condition (a circle). It can be observed that the subgroup does not cover
all examples where Targetvalue = x, but this form of description is simple and easy to interpret.
So we can say that this model is simple and gives a good true positive rate, more than 75%.

A classifier tries to predict the class of new examples, whereas subgroup discovery tries to describe
knowledge about the data.

Different Measures of Subgroup Discovery:

• Coverage:- It measures the fraction of examples covered by the rule:

Cov(R) = n(Cond) / ns

where ns is the total number of examples and n(Cond) is the number of examples which satisfy
the conditions determined by the antecedent part of the rule.

• Support:- It measures the frequency of correctly classified examples covered by the rule.

This can be computed as:

Sup(R) = n(Targetvalue ・ Cond) / ns

where n(Targetvalue ・ Cond) = TP is the number of examples which satisfy the conditions and
also belong to the value of the target variable in the rule.

• Confidence:- It measures the relative frequency of examples satisfying the complete rule
among those satisfying only the antecedent. This can be computed with different
expressions, e.g.

Cnf(R) = n(Targetvalue ・ Cond) / n(Cond)

This quality measure is also known as accuracy.


Example of Subgroup Discovery

Let’s consider,

D=Dataset with three variables,

Age={<25,25-60,>60}

Sex = {M, F}
Country={India, Finland, Italy, Bangladesh}

and Target variable is=Money={Rich, Normal, Poor}

Sample Rules generated are :

Rule1 : (Age = 25-60 AND Country = India) → Money = Rich

Rule2 : (Age = > 60 AND Sex = F) → Money = Normal

Rule1 discovers the subgroup of Indian people aged 25-60, whose probability of being rich is high
compared to the rest of the population.

Rule2 represents that women more than 60 years old are more likely to have a normal economy
than the rest of the population.

4.2.2 Association rule mining:-

Association rule mining is used in unsupervised learning and is one of the prominent methods in
data mining applications.

Most machine learning models work on numerical data, but association rule mining works on
categorical, non-numeric data.

Association rules are simple If/Then statements that discover relationships between seemingly
independent item sets.

The main focus of association rule mining is to:

i) find frequent patterns in a dataset,
ii) find correlations between items,
iii) find associations of items in transactions.

Association rule mining is also called affinity analysis, and the Apriori algorithm is its most
popular algorithm. The set of items in a transaction is called a market basket.

Association rule mining has two key concepts:

i) Antecedent (if):- an item/itemset which is present in the dataset.

ii) Consequent (then):- an item/itemset found in combination with the antecedent.

For example: - “If a customer buys a Laptop, he is 70% likely to buy a Headset.”

In this association rule the antecedent is Laptop and the consequent is Headset. The rule can be
interpreted as: if a person buys a laptop, then that person may also purchase a headset. Such
association rules are generated by a thorough analysis of a set of transactions, in order to improve
customer service and company revenue.

4.2.2.1 Confidence and Support parameters

An association rule can be generalized as:

“If A then B”

With respect to this, let us learn two important parameters:

a) Support (S):- It represents how frequently the if/then relationship appears in the transactions,
i.e. it measures the frequency of the association.

It is the percentage of transactions (T) that contain both A and B:

S = P(A ∪ B)

b) Confidence (C):- It represents how often the relationship is found to be true, i.e. the strength
of the association.

In a transaction set (T), C represents the percentage of transactions containing A that also
contain B:

C = P(B|A) = P(A ∩ B) / P(A)

4.2.2.2 Apriori Algorithm to find frequent Item-sets and Association Rules

The Apriori algorithm finds frequent item-sets and then generates association rules from them:
1. Fix a minimum support and a minimum confidence threshold.
2. Scan the transactions and keep the single items whose support is at least the minimum support
(the frequent 1-item-sets).
3. Repeatedly generate candidate k-item-sets by combining frequent (k−1)-item-sets, and keep
only those whose support is at least the minimum support.
4. From every frequent item-set, generate candidate rules of the form “If A then B” and keep the
rules whose confidence is at least the minimum confidence.

Solved Problem on Apriori Algorithm:-


For the following given transaction Data set generate rules using Apriori algorithm.
Consider the values as support=50% and Confidence=75%

Transaction id   Items Purchased

1                Bread, Cheese, Egg, Juice
2                Bread, Cheese, Juice
3                Bread, Milk, Yogurt
4                Bread, Juice, Milk
5                Cheese, Juice, Milk

Step 1:-Frequent Item Set

Items Frequency Support


Bread 4 4/5=80%
Cheese 3 3/5=60%
Egg 1 1/5=20%
Juice 4 4/5=80%
Milk 3 3/5=60%
Yogurt 1 1/5=20%

Minimum support given in problem is 50% so remove items having support less than
50%

Therefore, remove Egg and Yogurt from above table.

Step 2:- Make 2-items Candidate Set and write their frequency

Remaining items from step 1 are Bread,Cheese,Juice,Milk

Item-Pairs Frequency Support


Bread,Cheese 2 2/5=40%
Bread,Juice 3 3/5=60%
Bread,Milk 2 2/5=40%
Cheese,Juice 3 3/5=60%
Cheese,Milk 1 1/5=20%
Juice,Milk 2 2/5=40%

From the above table, only item-pairs with support > 50% are considered for further calculation.
Step 3:- So for the association rules let's consider only the (Bread, Juice) and (Cheese, Juice)
item-pairs, as their support is greater than 50%.

a) For the first item-pair (Bread, Juice) the rule can be

“If Bread then Juice” or “If Juice then Bread”

As per the formula, Confidence(A→B) = Support(A∪B) / Support(A)

i) Confidence(Bread→Juice) = Support(Bread∪Juice) / Support(Bread)

= (3/5) / (4/5)
= 3/4
= 75% (Valid rule, because the minimum confidence given in the problem is 75%)
ii) Similarly calculate the confidence for (Juice→Bread)

= (3/5) / (4/5)
= 3/4
= 75% (Valid rule, because the minimum confidence given in the problem is 75%)

b) For the second item-pair (Cheese, Juice) the rule can be

“If Cheese then Juice” or “If Juice then Cheese”

As per the formula, Confidence(A→B) = Support(A∪B) / Support(A)

i) Confidence(Cheese→Juice) = Support(Cheese∪Juice) / Support(Cheese)

= (3/5) / (3/5)
= 3/3
= 100% (Valid rule, because the minimum confidence given in the problem is 75%)

ii) Similarly calculate the confidence for (Juice→Cheese)

= (3/5) / (4/5)
= 3/4
= 75% (Valid rule, because the minimum confidence given in the problem is 75%)

Step 4:-Summary of final Rules generated after applying Apriori algorithm are

Rule 1:- If Juice then Bread with confidence value 75%


Rule 2:- If Bread then Juice with confidence value 75%
Rule 3:- If Cheese then Juice with confidence value 100%
Rule 4:- If Juice then Cheese with confidence value 75%
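The support and confidence values above can be checked with a short Python sketch (the helper names are illustrative):

    from itertools import combinations

    transactions = [
        {"Bread", "Cheese", "Egg", "Juice"},
        {"Bread", "Cheese", "Juice"},
        {"Bread", "Milk", "Yogurt"},
        {"Bread", "Juice", "Milk"},
        {"Cheese", "Juice", "Milk"},
    ]
    n = len(transactions)
    min_support, min_confidence = 0.5, 0.75

    def count(itemset):
        """Number of transactions containing every item in `itemset`."""
        return sum(itemset <= t for t in transactions)

    # Step 1: frequent single items (support >= 50%)
    items = sorted(set().union(*transactions))
    frequent = [i for i in items if count({i}) >= min_support * n]

    # Step 2: frequent 2-item sets
    pairs = [frozenset(p) for p in combinations(frequent, 2) if count(set(p)) >= min_support * n]

    # Step 3: rules A -> B with confidence = count(A u B) / count(A) >= 75%
    for pair in pairs:
        a, b = tuple(pair)
        for lhs, rhs in ((a, b), (b, a)):
            conf = count(pair) / count({lhs})
            if conf >= min_confidence:
                print(f"If {lhs} then {rhs}: {conf:.0%}")   # prints the four rules above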
Problem 2:- Find all association rules in the following database with minimum support = 2
and minimum confidence = 65%. [10] (Nov-Dec 2018)
Transactions   Data Items
T1             Milk, Bread, Cornflakes
T2             Bread, Jam
T3             Milk, Bread, Cornflakes, Jam
T4             Milk, Cornflakes
T5             Bread, Butter, Jam
T6             Bread, Butter
T7             Milk, Bread, Butter
Solution:-
Step 1:-Frequent Item Set

Items        Frequency   Support

Milk         4           4/7 = 57%
Bread        6           6/7 = 85.7%
Cornflakes   3           3/7 = 42.8%
Jam          3           3/7 = 42.8%
Butter       3           3/7 = 42.8%

The minimum support given in the problem is a count of 2, and every item occurs in at least 2
transactions, so no item is removed from the table.

Step 2:- Make 2-items Candidate Set and write their frequency

Remaining items from step 1 are Milk,Bread,Cornflakes,Jam,Butter

Item-Pairs           Frequency   Support

Milk, Bread          3           3/7 = 42.8%
Milk, Cornflakes     3           3/7 = 42.8%
Milk, Jam            1           1/7 = 14.2%
Milk, Butter         1           1/7 = 14.2%
Bread, Cornflakes    2           2/7 = 28.5%
Bread, Jam           3           3/7 = 42.8%
Bread, Butter        3           3/7 = 42.8%
Cornflakes, Jam      1           1/7 = 14.2%
Jam, Butter          1           1/7 = 14.2%

From the above table, the item-pairs occurring in at least 2 transactions (the minimum support) are
considered for further calculation.
Step 3:- So for the association rules let's consider only the (Milk, Bread), (Milk, Cornflakes),
(Bread, Cornflakes), (Bread, Jam) and (Bread, Butter) item-pairs, as their support counts are at
least 2.

a) For the first item-pair (Milk, Bread) the rule can be

“If Milk then Bread” or “If Bread then Milk”

As per the formula, Confidence(A→B) = Support(A∪B) / Support(A)

i) Confidence(Milk→Bread) = Support(Milk∪Bread) / Support(Milk)

= (3/7) / (4/7)
= 3/4
= 75% (Valid rule, because the minimum confidence given in the problem is 65%)
ii) Similarly calculate the confidence for (Bread→Milk)

= (3/7) / (6/7)
= 3/6
= 50% (Not a valid rule, because the minimum confidence given in the problem is
65%)

b) For the second item-pair (Milk, Cornflakes) the rule can be

“If Milk then Cornflakes” or “If Cornflakes then Milk”

As per the formula, Confidence(A→B) = Support(A∪B) / Support(A)

iii) Confidence(Milk→Cornflakes) = Support(Milk∪Cornflakes) / Support(Milk)

= (3/7) / (4/7)
= 3/4
= 75% (Valid rule, because the minimum confidence given in the problem is 65%)

iv) Similarly calculate the confidence for (Cornflakes→Milk)

= Support(Cornflakes∪Milk) / Support(Cornflakes)
= (3/7) / (3/7)
= 3/3
= 100% (Valid rule, because the minimum confidence given in the problem is 65%)

c) For the third item-pair (Bread, Cornflakes) the rule can be

“If Bread then Cornflakes” or “If Cornflakes then Bread”

As per the formula, Confidence(A→B) = Support(A∪B) / Support(A)

v) Confidence(Bread→Cornflakes) = Support(Bread∪Cornflakes) / Support(Bread)

= (2/7) / (6/7)
= 2/6
= 33% (Not a valid rule, because the minimum confidence given in the problem is
65%)

vi) Similarly calculate the confidence for (Cornflakes→Bread)

= Support(Cornflakes∪Bread) / Support(Cornflakes)
= (2/7) / (3/7)
= 2/3
= 66.6% (Valid rule, because the minimum confidence given in the problem is 65%)

d) For the fourth item-pair (Bread, Jam) the rule can be

“If Bread then Jam” or “If Jam then Bread”

As per the formula, Confidence(A→B) = Support(A∪B) / Support(A)

vii) Confidence(Bread→Jam) = Support(Bread∪Jam) / Support(Bread)

= (3/7) / (6/7)
= 3/6
= 50% (Not a valid rule, because the minimum confidence given in the problem is
65%)

viii) Similarly calculate the confidence for (Jam→Bread)

= Support(Jam∪Bread) / Support(Jam)
= (3/7) / (3/7)
= 3/3
= 100% (Valid rule, because the minimum confidence given in the problem is 65%)

e) For the last item-pair (Bread, Butter) the rule can be

“If Bread then Butter” or “If Butter then Bread”

As per the formula, Confidence(A→B) = Support(A∪B) / Support(A)

ix) Confidence(Bread→Butter) = Support(Bread∪Butter) / Support(Bread)
= (3/7) / (6/7)
= 3/6
= 50% (Not a valid rule, because the minimum confidence given in the problem is
65%)

x) Similarly calculate the confidence for (Butter→Bread)

= Support(Butter∪Bread) / Support(Butter)
= (3/7) / (3/7)
= 3/3
= 100% (Valid rule, because the minimum confidence given in the problem is 65%)

Step 4:- Summary of the final rules generated after applying the Apriori algorithm:

Rule 1:- If Milk then Bread, with confidence value 75%
Rule 2:- If Milk then Cornflakes, with confidence value 75%
Rule 3:- If Cornflakes then Milk, with confidence value 100%
Rule 4:- If Cornflakes then Bread, with confidence value 66.6%
Rule 5:- If Jam then Bread, with confidence value 100%
Rule 6:- If Butter then Bread, with confidence value 100%

Problem 3:--Apply Apriori algorithm for following set of transactions and find all the
association rules with min support = 1 and min confidence = 60%. [10] (May-June 2019)

Tr.ID Transactions
1 1,3,4
2 2,3,5
3 1,2,3,5
4 2,5

Solution:-
Step 1:-Frequent Item Set

Items Frequency Support


1 2 2/4=50%
2 3 3/4=75%
3 3 3/4=75%
4 1 1/4=25%
5 3 3/4=75%
Taking the minimum support threshold as 60%, remove the items having support less than
60%. So item 1 and item 4 are removed from the table.

Step 2:- Make 2-items Candidate Set and write their frequency

Remaining items from step 1 are 2,3,5

Item-Pairs   Frequency   Support

2, 3         2           2/4 = 50%
2, 5         3           3/4 = 75%
3, 5         2           2/4 = 50%

From the above table, only item-pairs with support > 60% are considered for further calculation.

Step 3:- So for the association rules let's consider only the (2, 5) item-pair, as its support is greater
than 60%.

a) For the item-pair (2, 5) the rule can be

“If 2 then 5” or “If 5 then 2”

As per the formula, Confidence(A→B) = Support(A∪B) / Support(A)

i) Confidence(2→5) = Support(2∪5) / Support(2)

= (3/4) / (3/4)
= 3/3
= 100% (Valid rule, because the minimum confidence given in the problem is 60%)
ii) Similarly calculate the confidence for (5→2)
= Support(5∪2) / Support(5)
= (3/4) / (3/4)
= 3/3
= 100% (Valid rule, because the minimum confidence given in the problem is 60%)

Step 4:-Summary of final Rules generated after applying Apriori algorithm are

Rule 1:- If item 2 then item 5 with confidence value 100%


Rule 2:- If item 5 then item 2 with confidence value 100%

4.3 Tree Based Models:-


• Tree based models are popular because they are easy to understand, expressive, and follow a
divide-and-conquer approach to solve the machine learning task.

4.3.1 Decision Tree:-


• A decision tree is a classifier in the form of a tree, and the tree has two types of nodes:
o Decision nodes:- They represent the partitioning of the feature space. We test
something, and that test may have more than one result. The test is performed on
the value of an attribute (feature) of the instance, and we follow the branch of the
decision tree whose condition matches that value.

o Leaf nodes:- They represent the classification of an example or the predicted value of
an example.

A decision tree can be used both for classification and regression; however, it is more popularly
used for classification.

The following figure shows a decision tree for a loan approve/reject process.

As shown in Figure 4.3.1, the first decision node tests whether the applicant is employed. If the
applicant is not employed, then we have another test: does the applicant have a high credit score
or not? If the credit score is high then approve the loan, else reject the loan. In the other case, if
the applicant is employed, then test the income: if the income is high then approve the loan, else
reject the loan.

Figure 4.3.1:- Decision tree for loan approve or reject

How do we select the best decision tree?

• Consider that some training examples are given and we have to generate a decision tree.
• It is possible that we may generate many decision trees which fit the training examples.
• Which decision tree is best?
• The solution is to prefer a smaller decision tree that fits the training data. Smaller trees are
trees with lower depth or with a smaller number of nodes.
• But finding the smallest decision tree that fits the training data is a computationally hard
problem.
• So we use a greedy algorithm to search for a small decision tree.
A2 = ?  [29+, 35-]
True: [18+, 33-]    False: [11+, 2-]

Figure 4.3.2:- Which attribute is best for the split?

We have a total of 64 training examples, of which 29 are positive (29+) and 35 are negative (35-).

If we split the tree on A1, a binary attribute with the values True and False, then when A1 is True
we get 21+ and 5- examples, and when A1 is False we get 8+ and 30- examples.

Similarly, if we consider a split on attribute A2, then when A2 is True we get 18+ and 33-
examples, and when A2 is False we get 11+ and 2- examples.

To decide which attribute to split on, we have the methods of entropy and information gain.

4.3.2 Impurity Measures:-

4.3.2.1 Entropy:-

Entropy is a measure of uncertainty, impurity or information content.

If, at a particular node, all examples have the positive class or all examples have the negative
class, i.e. all examples belong to the same class, then the set S is homogeneous, and for such a
homogeneous set the entropy is zero.

• Let S be a sample of training examples, where
– p+ is the proportion of positive examples in S
– p- is the proportion of negative examples in S

• Entropy of S: the average optimal number of bits needed to encode information about the
certainty/uncertainty of S:
Entropy(S) = p+(-log2 p+) + p-(-log2 p-)

= -p+ log2 p+ - p- log2 p-

Figure 4.3.3 :- Entropy

• S is a sample of training examples.
• p+ is the proportion of positive examples.
• p- is the proportion of negative examples.
• Entropy measures the impurity of S:

Entropy(S) = -p+ log2 p+ - p- log2 p-

The impurity should not change if we swap the positive and the negative class.

The figure shows the entropy curve, from which we can say:

• The entropy is 0 if the outcome is “certain”.
• The entropy is maximum if we have no knowledge of the system (i.e. every outcome is
equally possible).

4.3.2.2 Information Gain:-

• Information Gain measures how well a given attribute separates the training examples
according to their target classification.

• This measure is used to select among the candidate attributes at each step while growing
the tree.

• Gain is a measure of how much we can reduce uncertainty (its value lies between 0 and 1).
Gain(S, A): the expected reduction in entropy due to partitioning S on attribute A:

Gain(S, A) = Entropy(S) - Σv∈Values(A) (|Sv| / |S|) Entropy(Sv)

where Sv is the subset of S for which attribute A has value v.

Entropy([29+,35-]) = -29/64 log2 29/64 – 35/64 log2 35/64


= 0.99
A1 = ?  [29+, 35-]: True: [21+, 5-], False: [8+, 30-]
A2 = ?  [29+, 35-]: True: [18+, 33-], False: [11+, 2-]

Figure 4.3.4:- Split on the A1 and A2 attributes
The information gain is calculated for attribute A1:
Entropy([21+,5-]) = 0.71
Entropy([8+,30-]) = 0.74
Gain(S, A1) = Entropy(S)
- 26/64 * Entropy([21+,5-])
- 38/64 * Entropy([8+,30-])
= 0.27

The information gain is calculated for attribute A2:

Entropy([18+,33-]) = 0.94
Entropy([11+,2-]) = 0.62
Gain(S, A2) = Entropy(S)
- 51/64 * Entropy([18+,33-])
- 13/64 * Entropy([11+,2-])
= 0.12

The information gain for a split on attribute A1 is higher than for attribute A2, so in this example
we choose A1 as the splitting attribute for the decision tree (feature tree).
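These entropy and information gain values can be reproduced with a short sketch (the function names are illustrative):

    import math

    def entropy(pos, neg):
        """Entropy of a set with `pos` positive and `neg` negative examples."""
        total = pos + neg
        result = 0.0
        for count in (pos, neg):
            if count:
                p = count / total
                result -= p * math.log2(p)
        return result

    def gain(parent, children):
        """Information gain: parent entropy minus the weighted entropies of the children."""
        n = sum(p + q for p, q in children)
        return entropy(*parent) - sum((p + q) / n * entropy(p, q) for p, q in children)

    parent = (29, 35)
    print(round(gain(parent, [(21, 5), (8, 30)]), 2))   # Gain(S, A1) = 0.27
    print(round(gain(parent, [(18, 33), (11, 2)]), 2))  # Gain(S, A2) = 0.12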

4.3.2.3 Best Split Algorithm:-

Let D = a set of instances.

Label(D) = the class assigned to the instances in D when D is homogeneous.

D is split into D1 and D2.

Assume two classes, where D+ denotes the positive examples and D- the negative examples.

The best situation when assessing the efficacy of a feature, in terms of splitting the examples into
the positive and negative class, is where D1+ = D+ and D1- = ∅, or where D1+ = ∅ and D1- = D-.
This kind of split is called a pure split.

n+ represents the number of training examples belonging to the positive class.

n- represents the number of training examples belonging to the negative class.

The empirical probability (proportion) of the positive class is ṗ = n+ / (n+ + n-), and the proportion
of the negative class is 1 − ṗ = n- / (n+ + n-).

Best Split Algorithm:- for each candidate feature, split D into subsets according to that feature's
values, compute the (weighted average) impurity of the resulting split, and select the feature that
yields the lowest impurity.

Measures of impurity are given by:

1. Minority class: the proportion of misclassified examples if we label the leaf with its majority
class:

Minority class = min(ṗ, 1 − ṗ) = 1/2 − |ṗ − 1/2|

2. Gini index: the expected error if we label the examples in the leaf randomly, positive with
probability ṗ and negative with probability 1 − ṗ:

Gini index = 2ṗ(1 − ṗ)

3. Entropy = −ṗ log2 ṗ − (1 − ṗ) log2(1 − ṗ)

Denoting the impurity of a single leaf Dj as Imp(Dj), the impurity of a set of mutually exclusive
leaves {D1, . . . , Dl} is defined as the weighted average

Imp({D1, . . . , Dl}) = Σj=1..l (|Dj| / |D|) Imp(Dj)

where D = D1 ∪ . . . ∪ Dl.
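A short sketch comparing the three impurity measures over a range of values of ṗ (the function names are illustrative):

    import math

    def minority_class(p):
        """Proportion of misclassified examples when labelling with the majority class."""
        return min(p, 1 - p)                      # equals 1/2 - |p - 1/2|

    def gini_index(p):
        """Expected error when labelling randomly: positive with probability p, negative with 1 - p."""
        return 2 * p * (1 - p)

    def entropy(p):
        """Expected information content of the class label, in bits."""
        if p in (0.0, 1.0):
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    for p in (0.0, 0.1, 0.25, 0.5, 0.75, 1.0):
        # all three measures are 0 for a pure leaf and maximal at p = 0.5
        print(p, minority_class(p), round(gini_index(p), 3), round(entropy(p), 3))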
