Lecture 4: Text Similarity & Distance Between Strings
Which is closest?
• Kirachi
• Karachu
• Kerrach
• Kararachi
Resulting alignment:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
William Cohen CM
Computing Levenshtein distance (MED)
$$
D(i,j) = \min \begin{cases}
D(i-1,j-1) + d(s_i,t_j) & \text{(substitute/copy)} \\
D(i-1,j) + 1 & \text{(insert)} \\
D(i,j-1) + 1 & \text{(delete)}
\end{cases}
$$
      C  O  H  E  N
  M   1  2  3  4  5
  C   1  2  3  4  5
  C   2  2  3  4  5
  O   3  2  3  4  5
  H   4  3  2  3  4
  N   5  4  3  3  3   = D(s,t)
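The recurrence and table above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: it assumes unit costs (d(s_i, t_j) is 0 for a match and 1 otherwise) and the standard base cases D(i,0) = i and D(0,j) = j, which the slide does not show but which reproduce the table. The target word "Karachi" in the second half is also an assumption, since the "Which is closest?" slide does not name the intended word.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum edit distance between s and t, assuming unit costs."""
    # D[i][j] = distance between the first i characters of s
    # and the first j characters of t.
    D = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        D[i][0] = i  # assumed base case (row/column 0 is not on the slide)
    for j in range(1, len(t) + 1):
        D[0][j] = j  # assumed base case
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1  # d(s_i, t_j)
            D[i][j] = min(
                D[i - 1][j - 1] + cost,  # substitute / copy
                D[i - 1][j] + 1,         # "insert" in the slide's labelling
                D[i][j - 1] + 1,         # "delete" in the slide's labelling
            )
    return D[len(s)][len(t)]


# Bottom-right cell of the table: s = "MCCOHN" (rows), t = "COHEN" (columns).
print(levenshtein("MCCOHN", "COHEN"))  # 3

# Hypothetical use: rank the slide's candidates against an assumed target.
target = "Karachi"  # assumption, not stated on the slide
candidates = ["Kirachi", "Karachu", "Kerrach", "Kararachi"]
for word in sorted(candidates, key=lambda w: levenshtein(w, target)):
    print(word, levenshtein(word, target))
```

Under that assumed target, Kirachi and Karachu tie at distance 1, so edit distance alone may not single out one closest candidate.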
Sec. 6.3
Cosine for length-normalized vectors
• For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):
$$
\cos(\vec{q}, \vec{d}) = \vec{q} \cdot \vec{d} = \sum_{i=1}^{|V|} q_i d_i
$$
for q, d length-normalized.
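A small numeric check of this identity, as a sketch: the vectors q and d below are made-up term weights (not from the lecture), and the helper names are illustrative. The code computes the full cosine formula and compares it with the plain dot product of the L2-normalized vectors.

```python
import math


def dot(u, v):
    """Dot (scalar) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale v to unit length; assumes v is not the zero vector."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(u, v):
    """Full cosine similarity, dividing by both vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Hypothetical term-weight vectors over a three-word vocabulary.
q = [1.0, 2.0, 0.0]
d = [2.0, 1.0, 2.0]

full = cosine(q, d)
shortcut = dot(l2_normalize(q), l2_normalize(d))
print(math.isclose(full, shortcut))  # True
```

Normalizing once up front means each later comparison needs only a dot product, which is why the length-normalized form is convenient when one query is scored against many documents.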
Cosine similarity
A generic view
• text1 = "This is a foo bar sentence ."
• text2 = "This sentence is similar to a foo bar sentence ."
• vector1 = text_to_vector(text1)
• vector2 = text_to_vector(text2)
• cosine = get_cosine(vector1, vector2)
Examples will be covered after studying vector forms or word embeddings of text.
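The slide names text_to_vector and get_cosine without defining them. One possible reading, sketched here as an assumption rather than the lecture's implementation, treats each text as a bag of words: text_to_vector counts word occurrences, and get_cosine applies the cosine formula to those counts. The tokenizer (the regular expression `\w+`, case-sensitive) is also an assumption.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")  # assumed tokenizer: runs of word characters


def text_to_vector(text):
    """Bag-of-words vector: maps each word to its count in the text."""
    return Counter(WORD.findall(text))


def get_cosine(vec1, vec2):
    """Cosine similarity between two word-count vectors."""
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in shared)
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for empty texts
    return numerator / (norm1 * norm2)


text1 = "This is a foo bar sentence ."
text2 = "This sentence is similar to a foo bar sentence ."

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)
cosine = get_cosine(vector1, vector2)
print(round(cosine, 4))  # 0.8616
```

With these counts the dot product is 7 and the vector lengths are sqrt(6) and sqrt(11), so the similarity is 7 / sqrt(66). Word embeddings would replace the raw counts with dense vectors but leave get_cosine's formula unchanged.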
Summary
• Term-based distance metrics measure how similar two strings are
• Fuzzy string matching uses the Levenshtein (edit) distance
• Cosine similarity is another way to measure similarity; it compares the vector representations of terms or strings