
LECTURE(3)

Similarity and Dissimilarity Measures


• Similarity measure
  • Numerical measure of how alike two data objects are.
  • Higher when objects are more alike.
  • Often falls in the range [0, 1].
• Dissimilarity measure
  • Numerical measure of how different two data objects are.
  • Lower when objects are more alike.
  • Minimum dissimilarity is often 0.
  • Upper limit varies.
• Proximity refers to either a similarity or a dissimilarity.
Similarity/Dissimilarity for Simple Attributes
The following table shows the similarity and dissimilarity
between two objects, x and y, with respect to a single, simple
attribute.
Euclidean Distance
• Euclidean Distance

  d(x, y) = sqrt( Σ k=1..n (xk − yk)² )

  where n is the number of dimensions (attributes) and xk and yk are,
  respectively, the kth attributes (components) of data objects x and y.
• Example: x = (3, 6, 0, 3, 6), y = (1, 2, 0, 1, 2)
  d(x, y) = sqrt(2² + 4² + 0² + 2² + 4²) = sqrt(40) ≈ 6.325
• Standardization is necessary if scales differ.
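The formula above can be checked with a minimal Python sketch (an illustration, not part of the lecture):

```python
import math

def euclidean(x, y):
    # Square root of the sum of squared component-wise differences.
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

x = (3, 6, 0, 3, 6)
y = (1, 2, 0, 1, 2)
print(euclidean(x, y))  # sqrt(40) ≈ 6.3246
```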


Euclidean Distance

  point   x   y
  p1      0   2
  p2      2   0
  p3      3   1
  p4      5   1

(The four points plotted in the plane, 0 ≤ x ≤ 6, 0 ≤ y ≤ 3.)

        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0
Distance Matrix
Minkowski Distance
• Minkowski Distance is a generalization of Euclidean Distance:

  d(x, y) = ( Σ k=1..n |xk − yk|^r )^(1/r)

  where r is a parameter, n is the number of dimensions (attributes), and xk and yk are,
  respectively, the kth attributes (components) of data objects x and y.


Minkowski Distance: Examples

• r = 1. City block (Manhattan, taxicab, L1 norm) distance.


• A common example of this for binary vectors is the Hamming
distance, which is just the number of bits that are different between
two binary vectors

• r = 2. Euclidean distance

• r → ∞. “supremum” (Lmax norm, L∞ norm) distance.


• This is the maximum difference between any component of the
vectors
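The three special cases can be captured in one function; a minimal sketch (not part of the lecture), using the points p1 = (0, 2) and p2 = (2, 0) from the distance-matrix slide:

```python
def minkowski(x, y, r):
    # r = 1: Manhattan, r = 2: Euclidean, r = inf: supremum (Lmax) distance.
    if r == float("inf"):
        return max(abs(xk - yk) for xk, yk in zip(x, y))
    return sum(abs(xk - yk) ** r for xk, yk in zip(x, y)) ** (1 / r)

p1, p2 = (0, 2), (2, 0)
print(minkowski(p1, p2, 1))             # 4.0
print(minkowski(p1, p2, 2))             # ≈ 2.828
print(minkowski(p1, p2, float("inf")))  # 2
```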
Hamming Distance
• Hamming distance is the number of positions in
  which bit-vectors differ.
• Example: p1 = 10101, p2 = 10011.
• d(p1, p2) = 2 because the bit-vectors differ in the 3rd and 4th
  positions.
• For binary vectors, the Hamming distance is the same as the L1 norm.
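Counting differing positions is a one-liner; a minimal sketch (not part of the lecture):

```python
def hamming(p, q):
    # Count the positions where the two bit-strings differ.
    return sum(a != b for a, b in zip(p, q))

print(hamming("10101", "10011"))  # 2
```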

Summary of the Formulas
Distances for real vectors
• Vectors x = (x1, …, xn) and y = (y1, …, yn)
• Lp norms or Minkowski distance:
  Lp(x, y) = ( Σ i |xi − yi|^p )^(1/p)
• L2 norm: Euclidean distance:
  L2(x, y) = sqrt( Σ i (xi − yi)² )
• L1 norm: Manhattan distance:
  L1(x, y) = Σ i |xi − yi|
• L∞ norm:
  L∞(x, y) = max i |xi − yi|
• The limit of Lp as p goes to infinity.


Example of Distances
• x = (5, 5), y = (9, 8)
• L2 norm: sqrt((9 − 5)² + (8 − 5)²) = sqrt(16 + 9) = 5
• L1 norm: |9 − 5| + |8 − 5| = 4 + 3 = 7
• L∞ norm: max(4, 3) = 4
Similarity Between Binary Vectors
• Common situation is that objects, x and y, have only binary
attributes

• Compute similarities using the following quantities


f01 = the number of attributes where x was 0 and y was 1
f10 = the number of attributes where x was 1 and y was 0
f00 = the number of attributes where x was 0 and y was 0
f11 = the number of attributes where x was 1 and y was 1

• Simple Matching and Jaccard Coefficients


SMC = number of matches / number of attributes
= (f11 + f00) / (f01 + f10 + f11 + f00)
J = number of 11 matches / number of non-zero attributes
= (f11) / (f01 + f10 + f11)
Minkowski Distance

  point   x   y
  p1      0   2
  p2      2   0
  p3      3   1
  p4      5   1

L1      p1      p2      p3      p4
p1      0       4       4       6
p2      4       0       2       4
p3      4       2       0       2
p4      6       4       2       0

L2      p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

L∞      p1      p2      p3      p4
p1      0       2       3       5
p2      2       0       1       3
p3      3       1       0       2
p4      5       3       2       0

Distance Matrix
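All three matrices can be reproduced with a short sketch over the same four points (an illustration, not part of the lecture):

```python
def minkowski(x, y, r):
    # General Lr distance; r = float("inf") gives the supremum norm.
    if r == float("inf"):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

points = [(0, 2), (2, 0), (3, 1), (5, 1)]  # p1..p4

for r in (1, 2, float("inf")):
    matrix = [[round(minkowski(p, q, r), 3) for q in points] for p in points]
    print(r, matrix)
```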
Common Properties of a Distance

• Distances, such as the Euclidean distance, have
  some well-known properties.
1. d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 if and only if
   x = y. (Positivity)
2. d(x, y) = d(y, x) for all x and y. (Symmetry)
3. d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.
   (Triangle Inequality)
where d(x, y) is the distance (dissimilarity) between points
(data objects) x and y.
• A distance that satisfies these properties is a metric.
Common Properties of a Similarity

• Similarities also have some well-known properties.
1. s(x, y) = 1 (or maximum similarity) only if x = y.
2. s(x, y) = s(y, x) for all x and y. (Symmetry)

where s(x, y) is the similarity between points (data
objects) x and y.
Quiz 1
• Calculate the Hamming distance between
  p1 = 11001 and p2 = 10001.
• Hamming distance = 1
• For the points p1 = (2, 0) and p2 = (3, 1), calculate the Euclidean
  distance, the Manhattan distance, and the Lmax norm.
• Euclidean distance = sqrt(1² + 1²) = sqrt(2) ≈ 1.414
• Manhattan distance = |3 − 2| + |1 − 0| = 2
• Lmax norm = max(1, 1) = 1

• d(x, y) = d(y, x) for all x and y is the asymmetry property (true / false)
• If similarity is high, then objects are more alike (true / false)
• Mention the data mining tasks.
• Mention the steps of KDD:
Data cleaning
Data integration from multiple sources
Data selection for data mining
Data transformation
Data mining
Patterns evaluation
Presentation of the mining results
Mention the properties of a pattern.

Define discretization and aggregation.

Discretization: the process of converting a continuous attribute into a categorical attribute.
Aggregation: combining two or more attributes (or objects) into a single attribute (or object).
SMC versus Jaccard: Example

x= 1000000000
y= 0000001001

f01 = 2 (the number of attributes where x was 0 and y was 1)


f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00)


= (0+7) / (2+1+0+7) = 0.7

J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
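The worked example above can be reproduced with a minimal Python sketch (an illustration, not part of the lecture):

```python
def smc_jaccard(x, y):
    # Tally the four co-occurrence counts over the binary attributes.
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    smc = (f11 + f00) / (f01 + f10 + f11 + f00)
    jac = f11 / (f01 + f10 + f11) if (f01 + f10 + f11) else 0.0
    return smc, jac

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_jaccard(x, y))  # (0.7, 0.0)
```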


Cosine Similarity

• If d1 and d2 are two document vectors, then

  cos(d1, d2) = <d1, d2> / (||d1|| ||d2||),

  where <d1, d2> indicates the inner product or vector dot product of the vectors d1 and d2, and ||d||
  is the length of vector d. Component-wise:

  cos(X, Y) = ( Σ i xi yi ) / ( sqrt(Σ i xi²) · sqrt(Σ i yi²) )

• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
<d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
|| d1 || = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
|| d2 || = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2 ) = 0.3150
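The same computation as a minimal Python sketch (an illustration, not part of the lecture):

```python
import math

def cosine_similarity(d1, d2):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(cosine_similarity(d1, d2))  # ≈ 0.3150
```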
Similarities into distances
• Jaccard distance: d(x, y) = 1 − J(x, y)

• Jaccard distance is a metric.

• Cosine distance: d(d1, d2) = 1 − cos(d1, d2)
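Both conversions subtract the similarity from 1; a minimal sketch (not part of the lecture), applied to the earlier example values:

```python
def jaccard_distance(j_sim):
    # Convert a Jaccard similarity in [0, 1] to a distance.
    return 1.0 - j_sim

def cosine_distance(cos_sim):
    # Convert a cosine similarity to a distance.
    return 1.0 - cos_sim

print(jaccard_distance(0.0))   # 1.0  (from the SMC vs Jaccard example)
print(cosine_distance(0.315))  # 1 − 0.315
```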
