IV Distance and Rule Based Models
4.1 Distance Based Models
A distance measure summarizes the relative difference between two objects in a problem
domain. A similarity measure quantifies how alike two objects are, and typically takes a value
in the range 0 to 1. If the distance between two data points is small, their similarity is high;
if the distance is large, their similarity is low. To classify a new sample, distance-based models
exploit the similarity between the sample and the training data.
Many distance measures are available; out of these we will study the most popular distance
measures used in machine learning.
a) Manhattan Distance
b) Euclidean Distance
c) Minkowski Distance
d) Hamming Distance
e) Cosine Similarity
a) Manhattan Distance
Manhattan distance is defined as the sum of the absolute differences of the Cartesian coordinates
of two points, taken along each dimension:

d(p, q) = Σ (i = 1 to n) |pi − qi|

Where,
n = number of dimensions
pi, qi = the i-th coordinates of data points p and q
In machine learning, Manhattan distance is advisable with high-dimensional data; because each
coordinate difference contributes only linearly, it places less emphasis on outliers than the
squared Euclidean distance.
b) Euclidean Distance
Euclidean distance is the straight-line distance between a pair of data points A and B, i.e. the
length of the line segment connecting the two points in n-dimensional feature space:

d(p, q) = sqrt( Σ (i = 1 to n) (pi − qi)² )

As shown in the figure, a straight line connects the two data points A and B in a 2D feature space.
Where,
n = number of dimensions
pi, qi = the i-th coordinates of data points p and q
c) Minkowski Distance
A normed vector space assigns a positive length (norm) to every vector with real or complex
components. In n-dimensional real space, the Minkowski distance between two data points is
given by the following formula:

d(p, q) = ( Σ (i = 1 to n) |pi − qi|^p )^(1/p)

If p = 1, the Minkowski distance reduces to the Manhattan distance.
If p = 2, it reduces to the Euclidean distance.
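As an illustration, here is a minimal Python sketch (NumPy assumed; the function name and the sample points are illustrative, not from the notes) showing that one Minkowski implementation reduces to the Manhattan distance for p = 1 and to the Euclidean distance for p = 2:

import numpy as np

def minkowski_distance(p_vec, q_vec, p=2):
    # General Minkowski distance: (sum_i |p_i - q_i|^p)^(1/p)
    diff = np.abs(np.asarray(p_vec, dtype=float) - np.asarray(q_vec, dtype=float))
    return np.sum(diff ** p) ** (1.0 / p)

a = [1.0, 2.0, 3.0]
b = [4.0, 0.0, 3.0]
print(minkowski_distance(a, b, p=1))   # Manhattan: |1-4| + |2-0| + |3-3| = 5.0
print(minkowski_distance(a, b, p=2))   # Euclidean: sqrt(9 + 4 + 0) ≈ 3.606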
d) Hamming Distance
If “SPPU” and “BATU” are two strings of the same length, then the Hamming distance is the
number of positions at which the corresponding characters differ.
Since the strings have the same length, we can compare them position by position. The first
characters, “S” and “B”, are different. Continuing character by character, we count how many
positions differ and how many match. In this example 3 characters are different and 1 character
is the same, so the Hamming distance is 3.
The larger the Hamming distance between two strings, the more dissimilar they are, and vice versa.
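A small Python sketch of the Hamming distance for the example above (the function name is illustrative):

def hamming_distance(s1, s2):
    # Number of positions at which the corresponding characters differ;
    # only defined for strings of equal length.
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("SPPU", "BATU"))   # 3 -> only the last character matches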
e) Cosine Similarity
It judges the orientation of two input vectors rather than their magnitude.
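A minimal sketch of cosine similarity (NumPy assumed; the example vectors are illustrative) showing that it depends only on orientation, not magnitude:

import numpy as np

def cosine_similarity(x, y):
    # Cosine of the angle between the two vectors.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine_similarity([1, 0], [10, 0]))   # 1.0 -> same direction, different magnitude
print(cosine_similarity([1, 0], [0, 1]))    # 0.0 -> orthogonal vectors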
1) Arithmetic Mean:
The arithmetic mean μ of a set of data points D in a Euclidean space is the unique point that
minimises the sum of squared Euclidean distances to those data points.
Proof. We will show that

argmin over y of Σ (x ∈ D) ||x − y||² = (1/|D|) Σ (x ∈ D) x = μ

where ||·|| denotes the 2-norm. We find this minimum by taking the gradient (the vector of partial
derivatives with respect to the components of y) of the sum and setting it to the zero vector:

∇y Σ (x ∈ D) ||x − y||² = −2 Σ (x ∈ D) (x − y) = −2 Σ (x ∈ D) x + 2 |D| y = 0

which gives y = (1/|D|) Σ (x ∈ D) x = μ. Since |D| is constant, minimising the sum of squared
Euclidean distances of a given set of points is the same as minimising the average squared
Euclidean distance.
2) Geometric Median:
The geometric median of a set of data points is the point that minimises the sum of (unsquared)
Euclidean distances to those points; like the mean, it indicates the central tendency or typical
value of the set. (It should not be confused with the geometric mean, which is the n-th root of
the product of the values.) Unlike a medoid, the geometric median need not be one of the input
instances.
For univariate data (a single variable) the geometric median corresponds to the ordinary median.
However, for multivariate data there is no closed-form expression for the geometric median,
which has to be calculated by successive approximation. The fact that the arithmetic mean does
have a closed form is the main computational reason why distance-based methods tend to use
squared Euclidean distance.
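Since the geometric median has no closed form, it is computed iteratively. A minimal sketch of one such successive-approximation scheme, Weiszfeld's algorithm (NumPy assumed; the iteration count and tolerance are illustrative choices):

import numpy as np

def geometric_median(points, n_iter=100, eps=1e-9):
    # Weiszfeld iteration: repeatedly take a distance-weighted average of the points.
    pts = np.asarray(points, dtype=float)
    y = pts.mean(axis=0)                           # start from the arithmetic mean
    for _ in range(n_iter):
        dists = np.linalg.norm(pts - y, axis=1)
        dists = np.where(dists < eps, eps, dists)  # avoid division by zero
        w = 1.0 / dists
        y_new = (w[:, None] * pts).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            break
        y = y_new
    return y

print(geometric_median([[0, 0], [1, 0], [0, 1], [10, 10]]))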
To calculate the median, arrange the data in ascending order and then take the middle element as
the median value.
Let’s consider
Z = {1, 3, 3, 6, 7, 8, 9}
Median = 6
3) Medoids:
In certain situations we may need to restrict an exemplar to be one of the given data points;
such an exemplar is called a medoid.
Finding a medoid requires us to calculate, for each data point, the total distance to all other data
points, in order to choose the point that minimises it.
Regardless of the distance metric used, this is an O(n²) operation for n points, so for medoids
there is no computational reason to prefer one distance metric over another.
4) Centroids:-
Centroids, medoids, the mean and the geometric median can all be used as exemplars in
distance-based models.
The idea is to learn or find an exemplar for each class and use the nearest-exemplar decision rule
to classify new data. In the limit, every training point can serve as an exemplar, which leads to
the k-nearest neighbours (KNN) classifier. KNN is a good choice of classification algorithm when
there is little or no prior knowledge about the distribution of the data.
KNN Algorithm:-
Step 1 – Input: Load the labelled training dataset and the test dataset.
Let the input instances used to train the model be (x1,y1), (x2,y2), (x3,y3), ……, (xn,yn)
Where,
xi = d-dimensional input instance
yi = class to which xi belongs
Assume the test instance is xt.
Step 2 – Choose the value of k, typically an integer greater than the number of classes in the
given problem. Generally the value of k is an odd number (to avoid ties).
Step 3 − For each point in the test data do the following −
3.1 – Calculate the Euclidean distance between the test instance xt and each training instance
xi, for i = 1 to n. Other distance measures such as Manhattan or Hamming distance can be
used, but Euclidean distance is the most common.
3.2 – Arrange all calculated distances in ascending order.
3.3 – Select the top k values from the sorted distances calculated in step 3.2.
For example, if k = 3 then select the first 3 values from the sorted distances.
3.4 – Assign the test instance to the most frequent class among these k nearest neighbours.
Step 4 – Stop.
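A minimal Python sketch of the steps above (NumPy assumed; the function name, the toy data and the class labels are illustrative):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    X_train = np.asarray(X_train, dtype=float)
    x_test = np.asarray(x_test, dtype=float)
    # Step 3.1: Euclidean distance from the test instance to every training instance
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # Steps 3.2-3.3: sort the distances and keep the indices of the k nearest
    nearest = np.argsort(distances)[:k]
    # Step 3.4: assign the most frequent class among the k neighbours
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X = [[1, 1], [1, 2], [2, 2], [6, 6], [7, 7], [8, 6]]
y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(X, y, [2, 1], k=3))   # "A" -> nearest to the first group
print(knn_predict(X, y, [7, 6], k=3))   # "B" -> nearest to the second group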
Choosing the value of k:
If k is too small, the classifier is sensitive to noise points.
If k is too large, the neighborhood may include points from other classes.
Pros
It is very simple to understand and easy to interpret.
It makes no assumptions about the data, so it is useful for non-linear data.
It can be used for classification as well as regression.
Its accuracy tends to improve as the size of the dataset increases.
Cons
It is computationally expensive, since it has to calculate the distance from the test instance
to every data point in the dataset.
It requires more memory than most other supervised learning algorithms.
The output of the KNN classifier may change with the chosen value of k.
It is very sensitive to the scale of the data as well as to irrelevant features.
The prediction stage can be slow when the number of training points N is large.
Applications of KNN:
1. Speech Recognition
2. Handwriting Detection
3. Image Recognition
4. Video Recognition.
5. Banking system
6. Calculating credit ratings
The goal of distance-based clustering is to find clusters that are compact with respect to the
distance metric. This requires a notion of cluster compactness that can serve as our optimisation
criterion.
Scatter Matrix:- Given a data matrix X, the scatter matrix is the matrix

S = Σ (x ∈ X) (x − μ)ᵀ (x − μ)

where μ is a row vector containing all column means of X. The scatter of X is defined as
Scat(X) = trace(S) = Σ (x ∈ X) ||x − μ||².
Let’s partition D into K subsets D1 ∪ … ∪ DK = D, and let μj denote the mean of Dj. Let S be the
scatter matrix of D, and Sj be the scatter matrices of the Dj. These scatter matrices then have the
following relationship:

S = Σ (j = 1 to K) Sj + B

Here, B is the scatter matrix that results from replacing each point in D with the corresponding μj.
Each Sj is a within-cluster scatter matrix and describes the compactness of the j-th cluster.
B is the between-cluster scatter matrix and describes the spread of the cluster centroids.
It follows that the traces of these matrices can be decomposed similarly, which gives

Scat(D) = Σ (j = 1 to K) Scat(Dj) + trace(B),   where trace(B) = Σ (j = 1 to K) |Dj| ||μj − μ||²

so minimising the total scatter over all clusters is equivalent to maximising the (weighted) scatter of
the centroids.
The K-means problem is to find a partition that minimises the total within-cluster scatter.
Consider the following five points: (0,3), (3,3), (3,0), (−2,−4) and (−4,−2). These points are,
conveniently, centred around (0,0). The scatter matrix is
[[38, 25], [25, 38]]
with trace Scat(D) = 76. If we cluster the first two points together in one cluster and the
remaining three in another, then we obtain cluster means μ1 = (1.5,3) and μ2 = (−1,−2) and
within-cluster scatter matrices
S1 = [[4.5, 0], [0, 0]] and S2 = [[26, 10], [10, 8]]
with traces Scat(D1) = 4.5 and Scat(D2) = 34. Two copies of μ1 and three copies of μ2 have, by
definition, the same centre of gravity as the complete data set: (0,0) in this case. We thus
calculate the between-cluster scatter matrix as
[[7.5, 15], [15, 30]]
with trace 37.5.
Alternatively, if we treat the first three points as a cluster and put the other two
in a second cluster, then we obtain cluster means μ1 = (2,2) and μ2 = (−3,−3), and within-cluster
scatter matrices
S1 = [[6, -3], [-3, 6]] and S2 = [[2, -2], [-2, 2]]
with traces Scat(D1) = 12 and Scat(D2) = 4. The between-cluster scatter matrix is
[[30, 30], [30, 30]]
with trace 60. Clearly, the second clustering produces tighter clusters whose centroids are further
apart.
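The decomposition Scat(D) = Σj Scat(Dj) + trace(B) in this example can be checked numerically. A small sketch (NumPy assumed), here for the second clustering:

import numpy as np

D = np.array([[0, 3], [3, 3], [3, 0], [-2, -4], [-4, -2]], dtype=float)

def scatter_matrix(X):
    # S = sum over rows x of (x - mu)^T (x - mu), with mu the vector of column means
    centred = X - X.mean(axis=0)
    return centred.T @ centred

S = scatter_matrix(D)
print(S, np.trace(S))                      # [[38 25] [25 38]], trace 76

D1, D2 = D[:3], D[3:]                      # second clustering from the example
S1, S2 = scatter_matrix(D1), scatter_matrix(D2)
mu, mu1, mu2 = D.mean(axis=0), D1.mean(axis=0), D2.mean(axis=0)
B = len(D1) * np.outer(mu1 - mu, mu1 - mu) + len(D2) * np.outer(mu2 - mu, mu2 - mu)
print(np.trace(S1), np.trace(S2), np.trace(B))   # 12.0, 4.0, 60.0 -> 12 + 4 + 60 = 76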
K-means clustering is a distance-based clustering algorithm that uses the centroid as its exemplar.
It is also known as Lloyd’s algorithm, but is most popular under the name K-means.
There is no known efficient algorithm that finds the global minimum of the K-means problem;
the problem is NP-hard.
Using the nearest-centroid decision rule, the algorithm therefore iterates between partitioning
the data and recalculating the centroids from the partition.
K-Means Algorithm:-
We can summarize the K-Means algorithm as follows:
1. Define the number of clusters as K.
2. Randomly select any K data points from the instance space as the initial centroids
C1, C2, C3, ….., CK of the K clusters.
3. Initially, each cluster therefore contains only its one randomly selected data point.
4. Calculate the distance of each data point to the centre of each cluster.
5. In the K-Means algorithm this distance is calculated using the Euclidean distance
measure.
6. Assign each data point from the instance space to the cluster whose centre is
nearest to it.
7. Form the new clusters and re-calculate the centre of each new cluster by taking the
mean of all data points belonging to that cluster.
8. Repeat steps 4 to 7 until any of the stopping criteria is met:
the maximum number of iterations is reached;
the same data points remain in the same clusters;
the centres of the new clusters no longer change.
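A minimal Python sketch of these steps (NumPy assumed; the random seed and the tiny data set are illustrative, and the sketch does not handle the rare case of an empty cluster):

import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # step 2: random initial centroids
    for _ in range(n_iter):
        # steps 4-6: Euclidean distance to every centroid, assign each point to the nearest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 7: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):               # step 8: stop when centres no longer change
            break
        centroids = new_centroids
    return labels, centroids

labels, centroids = k_means([[0, 3], [3, 3], [3, 0], [-2, -4], [-4, -2]], k=2)
print(labels, centroids)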
In k-medoid clustering any distance metric can be used to calculate distances, whereas K-Means
clustering uses the Euclidean distance metric.
The K-means algorithm is sensitive to outliers, since an object with an extremely large value may
substantially distort the distribution of the data.
Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be
used, which is the most centrally located object of the cluster.
Calculating the medoid of a cluster requires examining all pairs of points, whereas calculating
the mean requires just a single pass through the points; this can be prohibitive for large data sets.
An instance is selected as the medoid if its sum of distances to all other instances is minimal.
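A small sketch of that selection rule (NumPy assumed; the helper name find_medoid and the sample points are illustrative):

import numpy as np

def find_medoid(X):
    # Return the data point whose total Euclidean distance to all other points is minimal.
    X = np.asarray(X, dtype=float)
    pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # O(n^2) pairwise distances
    return X[pairwise.sum(axis=1).argmin()]

print(find_medoid([[0, 0], [1, 0], [0, 1], [10, 10]]))   # [1. 0.] -> smallest total distance to the rest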
An important limitation of the clustering methods discussed in this section is that they represent
clusters only by means of exemplars.
The time complexity of the K-medoid algorithm is O(k·(n − k)²).
Hierarchical clustering is very common; when scientists look at the animal kingdom they
divide it hierarchically, as shown in the figure. Animals are divided into vertebrates and
invertebrates. Within the vertebrates we find fish, reptiles, mammals and so on, whereas among
the invertebrates we have worms, insects, etc. Hierarchical clustering thus produces a nested
sequence of clusters.
[Figure: taxonomy tree with root “animal” and children “vertebrate” and “invertebrate”, alongside an example dendrogram over six points showing the merge heights.]
Consider an input set S containing the data points shown in the figure.
Each node in the dendrogram represents a subset of S, defined as the union of its children, and
corresponds to a cluster.
The root of the dendrogram represents the whole input set S.
The leaves are the individual elements of S.
Each level of the tree represents a partition of the input data into several (nested) clusters or
groups.
[Figure: initial state of agglomerative clustering – each point p1, …, p5 is its own cluster, together with the pairwise distance/proximity matrix over p1, …, p5.]
3. Intermediate state: after some merging steps, we have some clusters and the corresponding
distance/proximity matrix.
Figure 4.1.9:- Intermediate step
4. Merge the two closest clusters (C2 and C5) and update the distance matrix as shown below.
Figure 4.1.10:- Merge and update the distance matrix
5. After merging the clusters, update the distance matrix.
To measure the distance between two clusters, the following popular distance measures are
available:
• Single-link
– Similarity of the most similar pair (single link)
– The single-link distance between clusters Ci and Cj is the minimum distance between
any object in Ci and any object in Cj; equivalently, in terms of similarity:
– sim(Ci, Cj) = max over x ∈ Ci, y ∈ Cj of sim(x, y)
– It can result in “straggly” (long and thin) clusters due to the chaining effect.
• The distance is determined by one pair of points, i.e., by one link in the proximity graph.
Figure 4.1.12:- Single link with sample data points
• Complete-link
– Similarity of the least similar points
– The distance between two clusters is the distance between the two furthest data points in the
two clusters:
– sim(Ci, Cj) = min over x ∈ Ci, y ∈ Cj of sim(x, y)
– Makes “tighter”, spherical clusters that are typically preferable.
– It is sensitive to outliers because they are far away.
– The distance between clusters is determined by the two most distant points in the
different clusters.
• Average-link
– Similarity of two clusters = the average similarity between any object in Ci and any
object in Cj (e.g. the average cosine similarity between pairs of elements):
– sim(Ci, Cj) = (1 / (|Ci| · |Cj|)) Σ over x ∈ Ci Σ over y ∈ Cj of sim(x, y)
• It is a compromise between single and complete link, and is less susceptible to noise and outliers.
• Two options:
– averaged across all ordered pairs in the merged cluster;
– averaged over all pairs between the two original clusters.
Time Complexity:-
• All the algorithms are at least O(n²), where n is the number of data points.
• Single link can be done in O(n²).
• Complete and average links can be done in O(n² log n).
• Due to this complexity, hierarchical clustering is hard to use for large data sets.
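A minimal sketch using SciPy's agglomerative clustering routines (SciPy assumed; the method argument selects single, complete or average linkage, and the five sample points are illustrative):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 3], [3, 3], [3, 0], [-2, -4], [-4, -2]], dtype=float)

# method can be "single", "complete" or "average", matching the linkage criteria above
Z = linkage(X, method="single", metric="euclidean")
print(Z)                                        # each row: the two clusters merged and the merge distance

# cut the dendrogram to obtain a flat partition into 2 clusters
print(fcluster(Z, t=2, criterion="maxclust"))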
Hierarchical Clustering Example
We have five objects and we initially put each object into its own cluster, so each object is
itself a cluster. In the beginning we therefore have 5 clusters; we now repeatedly merge clusters
so that at the end of the iterations we have only a single cluster containing all of the original
5 objects.
Consider the table given below, which shows the pairwise distances calculated for the clusters.
Once all objects are under one cluster the computation is finished, and the result of the
computation is summarized as follows:
Rule-based models accept a set of instances and construct rules to achieve the goal of the machine
learning task.
Rule-based models can be descriptive or predictive. We will study descriptive rule-based models.
Descriptive rule mining:- It can be applied to supervised and unsupervised learning.
When a set of highly overlapping rules is converted into a decision tree, the resulting tree can be
exponential in size. In such cases rule-based models are preferred over tree-based models.
The concept of subgroup discovery was initially introduced by Kloesgen and Wrobel, who defined
it as follows:
“In subgroup discovery, we assume we are given a so-called population of individuals (objects,
customer, ...) and a property of those individuals we are interested in. The task of subgroup
discovery is then to discover the subgroups of the population that are statistically “most
interesting”, i.e. are as large as possible and have the most unusual statistical (distributional)
characteristics with respect to the property of interest.”
Subgroup discovery identifies relationships between variables with respect to a target variable,
and these relations are expressed as rules of the form Cond → Targetvalue, where:
Targetvalue = the variable of interest (the target variable) for subgroup discovery.
Cond = a conjunction of features (attribute–value conditions) that defines the subgroup with
respect to the target variable.
Subgroup discovery can be performed using SD algorithms, which extract such rules from subsets
of the data.
1. Predictive Induction:-
Its objective is to discover knowledge for classification or prediction.
Predictive induction tasks include classification and regression.
2. Descriptive Induction:-
Its objective is to extract interesting knowledge from the data.
Descriptive induction tasks include association rule mining and subgroup discovery.
The figure shows a rule for subgroup discovery for a target variable with the values
Targetvalue = x and Targetvalue = o.
A classifier tries to predict the class of new examples, whereas subgroup discovery tries to
describe knowledge contained in the data.
Support: it measures the fraction of examples covered by the rule that belong to the target value,

Sup(R) = n(Targetvalue · Cond) / ns

where ns is the total number of examples, n(Cond) is the number of examples which satisfy the
conditions determined by the antecedent part of the rule, and n(Targetvalue · Cond) = TP is the
number of examples which satisfy the conditions and also belong to the value of the target
variable in the rule.
Confidence: it measures the relative frequency of examples satisfying the complete rule
among those satisfying only the antecedent. This can be computed with different expressions, e.g.

Cnf(R) = n(Targetvalue · Cond) / n(Cond)
Let’s consider the attributes
Age = {<25, 25-60, >60}
Sex = {M, F}
Country = {India, Finland, Italy, Bangladesh}
Rule 1 (for example, IF Country = India AND Age = 25-60 THEN Economy = rich) discovers the
subgroup of Indian people aged 25-60 whose probability of being rich is high compared with the
rest of the population.
Rule 2 (for example, IF Sex = F AND Age > 60 THEN Economy = normal) represents that women
older than 60 years are more likely to have a normal economy than the rest of the population.
Association rule mining is used in unsupervised learning and is one of the prominent methods in
data mining applications.
Most machine learning models work on numerical data, but association rule mining works on
categorical, non-numeric data.
Association rules are simple If/Then statements that discover relationships between otherwise
independent item sets.
Association rule mining is also called affinity analysis or market basket analysis; the Apriori
algorithm is its best-known algorithm. The set of items in a transaction is called a market basket.
For example: “If a customer buys a Laptop, he is 70% likely to also buy a Headset.”
In this association rule the antecedent is Laptop and the consequent is Headset. The rule can be
interpreted as: if a person buys a laptop, then that person is also likely to purchase a headset.
Such association rules are generated through a thorough analysis of a set of transactions, in order
to improve customer service and company revenue.
“If A then B”
Support: S = P(A U B), the fraction of all transactions in the transaction set T that contain both
A and B.
Confidence: C = Support(A U B) / Support(A), the percentage of transactions containing A in
which B is also present.
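A small Python sketch of these two measures (the helper names and the transaction list are illustrative, not the tables of the solved problems below):

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Conf(A -> B) = Support(A and B together) / Support(A)
    return support(set(antecedent) | set(consequent), transactions) / support(antecedent, transactions)

transactions = [
    {"Bread", "Juice", "Cheese"},
    {"Bread", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice"},
]
print(support({"Bread", "Juice"}, transactions))        # 0.5
print(confidence({"Bread"}, {"Juice"}, transactions))   # 0.666... -> 2 of the 3 Bread transactions also contain Juice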
The minimum support given in the problem is 50%, so remove items having support less than 50%.
Step 2:- Make the 2-item candidate set and write down the frequency of each pair.
From the above table, only item-pairs with support > 50% have to be considered for further
calculation.
Step 3:- For the association rules, consider only the (Bread, Juice) and (Cheese, Juice) item-pairs,
as their support is greater than 50%.
i) (Bread → Juice): Confidence = Support(Bread U Juice) / Support(Bread)
= (3/5) / (4/5)
= 3/4
= 75% (Valid rule, because the minimum confidence given in the problem is 75%)
ii) Similarly, calculate the confidence for (Juice → Bread) = Support(Juice U Bread) / Support(Juice)
= (3/5) / (4/5)
= 3/4
= 75% (Valid rule, because the minimum confidence given in the problem is 75%)
iii) (Cheese → Juice): Confidence = Support(Cheese U Juice) / Support(Cheese)
= (3/5) / (3/5)
= 100% (Valid rule, because the minimum confidence given in the problem is 75%)
iv) (Juice → Cheese): Confidence = Support(Juice U Cheese) / Support(Juice)
= (3/5) / (4/5)
= 3/4
= 75% (Valid rule, because the minimum confidence given in the problem is 75%)
Step 4:- The final rules generated after applying the Apriori algorithm are summarized below.
The minimum support given in the problem is 20%, so remove items having support less than
20%. However, from the table no item is removed, as every item has a support count greater than
or equal to 2 (20% of the 7 transactions).
Step 2:- Make the 2-item candidate set and write down the frequency of each pair.
From the above table, only item-pairs with support > 20% have to be considered for further
calculation.
Step 3:- For the association rules, consider only the (Milk, Bread), (Milk, Cornflakes),
(Bread, Cornflakes), (Bread, Jam) and (Bread, Butter) item-pairs, as their support is greater than
20%.
i) (Milk → Bread): Confidence = Support(Milk U Bread) / Support(Milk)
= (3/7) / (4/7)
= 3/4
= 75% (Valid rule, because the minimum confidence given in the problem is 65%)
ii) Similarly, calculate the confidence for (Bread → Milk) = Support(Bread U Milk) / Support(Bread)
= (3/7) / (6/7)
= 3/6
= 50% (Not a valid rule, because the minimum confidence given in the problem is 65%)
iii) (Milk → Cornflakes): Confidence = Support(Milk U Cornflakes) / Support(Milk)
= (3/7) / (4/7)
= 3/4
= 75% (Valid rule, because the minimum confidence given in the problem is 65%)
v) (Bread → Cornflakes): Confidence = Support(Bread U Cornflakes) / Support(Bread)
= (2/7) / (6/7)
= 2/6
= 33% (Not a valid rule, because the minimum confidence given in the problem is 65%)
vii) (Bread → Jam): Confidence = Support(Bread U Jam) / Support(Bread)
= (3/7) / (6/7)
= 3/6
= 50% (Not a valid rule, because the minimum confidence given in the problem is 65%)
ix) (Bread → Butter): Confidence = Support(Bread U Butter) / Support(Bread)
= (3/7) / (6/7)
= 3/6
= 50% (Not a valid rule, because the minimum confidence given in the problem is 65%)
Step 4:- The final rules generated after applying the Apriori algorithm are summarized below.
Problem 3:- Apply the Apriori algorithm to the following set of transactions and find all the
association rules with min support = 1 and min confidence = 60%. [10] (May-June 2019)
Tr.ID Transactions
1 1,3,4
2 2,3,5
3 1,2,3,5
4 2,5
Solution:-
Step 1:- Frequent item set
Step 2:- Make the 2-item candidate set and write down the frequency of each pair.
From the above table, only item-pairs with support > 60% have to be considered for further
calculation.
Step 3:- For the association rules, consider only the (2, 5) item-pair, as its support is greater
than 60%.
i) (2 → 5): Confidence = Support(2 U 5) / Support(2)
= (3/4) / (3/4)
= 3/3
= 100% (Valid rule, because the minimum confidence given in the problem is 60%)
ii) Similarly, calculate the confidence for (5 → 2)
= Support(5 U 2) / Support(5)
= (3/4) / (3/4)
= 3/3
= 100% (Valid rule, because the minimum confidence given in the problem is 60%)
Step 4:- The final rules generated after applying the Apriori algorithm are summarized below.
Decision trees can be used both for classification and regression; however, they are more popularly
used for classification.
The following figure shows a decision tree for a loan approve/disapprove process.
As shown in the figure, the first decision node tests whether the applicant is employed or not. If
the applicant is not employed, we then test whether the applicant has a high credit score; if the
credit score is high the loan is approved, else the loan is rejected. If the applicant is employed,
we test the income: if the income is high the loan is approved, else the loan is rejected.
Figure 4.3.1:- Decision tree for loan approval or rejection
Consider that some training examples are given and we have to generate a decision tree.
It is possible that many different decision trees fit the training examples.
Which decision tree is best?
The solution is to prefer a smaller decision tree that fits the training data. Smaller trees are
trees with lower depth or with a smaller number of nodes.
However, finding the smallest decision tree that fits the training data is a computationally hard
problem, so we resort to a greedy algorithm that searches for a small decision tree.
[Figure: two candidate splits, on attribute A1 and on attribute A2, of a set of 64 training examples with class counts [29+, 35−]; each attribute takes the values True and False.]
We have a total of 64 training examples, of which 29 are positive (29+) and 35 are negative (35−).
If we split on A1, which is a binary attribute with the values True and False, then for A1 = True we
get 21+ and 5− examples, and for A1 = False we get 8+ and 30− examples.
Similarly, if we split on attribute A2, then for A2 = True we get 18+ and 33− examples, and for
A2 = False we get 11+ and 2− examples.
To decide which attribute to split on, we use entropy and information gain.
4.3.2.1 Entropy:-
If at a particular node all examples have the positive class, or all examples have the negative
class, i.e. all examples belong to the same class, then the set S is homogeneous and its entropy is
zero. In general,

Entropy(S) = −p+ log2(p+) − p− log2(p−)

where p+ and p− are the proportions of positive and negative examples in S.
Information Gain measures how well a given attribute separates the training examples
according to their target classification.
This measure is used to select among the candidate attributes at each step while growing
the tree.
Gain is a measure of how much we can reduce uncertainty (its value lies between 0 and 1).
Gain(S, A): the expected reduction in entropy due to partitioning S on attribute A,

Gain(S, A) = Entropy(S) − Σ over v ∈ Values(A) of (|Sv| / |S|) · Entropy(Sv)

where Sv is the subset of S for which attribute A has value v.
The information gain for the split on attribute A1 is higher than for attribute A2, so in this
example we choose A1 as the splitting attribute for the decision (feature) tree.
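These gains can be verified with a short Python sketch (the helper names are illustrative; the class counts are the ones from the example above):

import math

def entropy(pos, neg):
    # Entropy of a set with pos positive and neg negative examples.
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

def gain(pos, neg, branches):
    # Information gain of a split, with branches given as [(pos_i, neg_i), ...]
    total = pos + neg
    remainder = sum((p + n) / total * entropy(p, n) for p, n in branches)
    return entropy(pos, neg) - remainder

print(gain(29, 35, [(21, 5), (8, 30)]))    # split on A1 -> about 0.27
print(gain(29, 35, [(18, 33), (11, 2)]))   # split on A2 -> about 0.12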
The best situation for assessing the efficacy of a feature, in terms of splitting the examples into a
positive and a negative class, is where D1+ = D+ and D1− = ∅, or where D1+ = ∅ and D1− = D−.
This kind of split is called a pure split.
Writing the impurity of a single leaf Dj as Imp(Dj), the impurity of a set of mutually exclusive
leaves {D1, . . . , Dl} is defined as a weighted average

Imp({D1, . . . , Dl}) = Σ (j = 1 to l) (|Dj| / |D|) · Imp(Dj)

where D = D1 ∪ . . . ∪ Dl.