
Lecture 4: Text Similarity & Distance between Strings


Lecture Objectives:

• Students will be able to understand NLP tasks
• Students will be able to understand the edit distance algorithm and its applications
• Students will be able to understand distance metrics for text
CSC-441: Natural Language Processing
What is NLP?
NLP is the branch of computer science focused on developing systems that allow computers to communicate with people using everyday language.
How similar are two strings?

• Spell correction
– The user typed “Karachi”. Which is closest?
• Kirachi
• Karachu
• Kerrach
• Kararachi

• Computational Biology
– Align two sequences of nucleotides:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
– Resulting alignment:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

• Also used for Machine Translation, Information Extraction, Speech Recognition
String distance metrics: term-based
• Advantages:
– Exploits frequency information
– Efficiency: Finding { t : sim(t,s)>k } is sublinear!
– Alternative word orderings ignored (William Cohen
vs Cohen, William)
• Disadvantages:
– Sensitive to spelling errors (Willliam Cohon)
– Sensitive to abbreviations (Univ. vs University)
– Alternative word orderings ignored, so distinct names can match (James Joyce vs Joyce James, City National Bank vs National City Bank)
Defining Min Edit Distance
• For two strings
– X of length n
– Y of length m
• We define D(i,j)
– the edit distance between X[1..i] and Y[1..j]
• i.e., the first i characters of X and the first j characters
of Y
– The edit distance between X and Y is thus D(n,m)
Edit Distance
• The minimum edit distance between two
strings
• Is the minimum number of editing
operations
– Insertion
– Deletion
– Substitution
• Needed to transform one into the other
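Computing Levenshtein distance - 1
D(i,j) = score of the best alignment of the first i characters of s1 with the first j characters of s2

\[
D(i,j) = \min
\begin{cases}
D(i-1,j-1) & \text{if } s_1[i] = s_2[j] \quad \text{(copy)} \\
D(i-1,j-1) + 1 & \text{if } s_1[i] \neq s_2[j] \quad \text{(substitute)} \\
D(i-1,j) + 1 & \text{(insert)} \\
D(i,j-1) + 1 & \text{(delete)}
\end{cases}
\]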
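The recurrence can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `levenshtein` is hypothetical, and the base cases D(i,0) = i and D(0,j) = j are the usual convention, assumed here because the slide does not state them.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and substitutions between s1 and s2."""
    # prev[j] holds D(i-1, j). Row 0 is the assumed base case D(0, j) = j.
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        # Assumed base case D(i, 0) = i.
        curr = [i]
        for j, b in enumerate(s2, start=1):
            diagonal = prev[j - 1] + (0 if a == b else 1)  # copy or substitute
            from_above = prev[j] + 1                       # D(i-1, j) + 1
            from_left = curr[j - 1] + 1                    # D(i, j-1) + 1
            curr.append(min(diagonal, from_above, from_left))
        prev = curr
    return prev[-1]


# Rank the spelling candidates for the typed word "Karachi".
typed = "Karachi"
for candidate in ["Kirachi", "Karachu", "Kerrach", "Kararachi"]:
    print(candidate, levenshtein(typed, candidate))
```

Only two rows of the table are kept in memory, which is enough when just the final distance D(n,m) is needed.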
Minimum Edit Distance
• Two strings and their alignment:
Examples
• Finding distance between Brief and drive
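One possible way to work this example is to fill the full D table and then trace back from D(n,m) to recover an alignment. The sketch below assumes unit cost for every insertion, deletion and substitution, lowercases both words, and uses a hypothetical helper named `align`. When several alignments are optimal it returns only one of them.

```python
def align(x: str, y: str) -> tuple[int, str, str]:
    """Return (edit distance, aligned x, aligned y), using '-' for gaps."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + cost, D[i - 1][j] + 1, D[i][j - 1] + 1)

    # Trace back from D[n][m], preferring the diagonal step when it is optimal.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        cost = 0 if i > 0 and j > 0 and x[i - 1] == y[j - 1] else 1
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + cost:
            top.append(x[i - 1])
            bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            top.append(x[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(y[j - 1])
            j -= 1
    return D[n][m], "".join(reversed(top)), "".join(reversed(bottom))


distance, top, bottom = align("brief", "drive")
print(distance)
print(top)
print(bottom)
```

Every column of the printed alignment where the two rows differ corresponds to one edit operation, so the number of differing columns equals the distance.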
Jaccard Distance
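S = {William, Cohen, CM, Univ, Pgh}
T = {Dr., William, Cohen, CM, University}

S ∪ T = {Dr., William, Cohen, CM, Univ, University, Pgh}
S ∩ T = {William, Cohen, CM}

Jaccard similarity = |S ∩ T| / |S ∪ T| = 3/7, so Jaccard distance = 1 − 3/7 = 4/7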
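The token sets above can be compared with a few lines of Python. This is a small sketch that assumes whitespace tokenization and case-sensitive matching; the function name `jaccard_similarity` is illustrative only.

```python
def jaccard_similarity(s: str, t: str) -> float:
    """Size of the token intersection divided by the size of the token union."""
    a, b = set(s.split()), set(t.split())
    if not a and not b:
        return 1.0  # assumption: two empty strings are treated as identical
    return len(a & b) / len(a | b)


s = "William Cohen CM Univ Pgh"
t = "Dr. William Cohen CM University"
similarity = jaccard_similarity(s, t)
print("similarity:", similarity)
print("distance:", 1 - similarity)
```

Because tokens are compared as whole words, "Univ" and "University" count as different terms, which is the sensitivity to abbreviations noted for term-based metrics.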
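Computing Levenshtein distance (MED)
\[
D(i,j) = \min
\begin{cases}
D(i-1,j-1) + d(s_i,t_j) & \text{(subst/copy)} \\
D(i-1,j) + 1 & \text{(insert)} \\
D(i,j-1) + 1 & \text{(delete)}
\end{cases}
\]
where d(si,tj) = 0 if si = tj and 1 otherwise.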

      C  O  H  E  N
  M   1  2  3  4  5
  C   1  2  3  4  5
  C   2  2  3  4  5
  O   3  2  3  4  5
  H   4  3  2  3  4
  N   5  4  3  3  3   = D(s,t)
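The table can be reproduced by keeping every cell of D rather than only the final value. The sketch below assumes s = "MCCOHN" labels the rows and t = "COHEN" labels the columns, with the usual base cases D(i,0) = i and D(0,j) = j; `edit_distance_table` is a hypothetical name.

```python
def edit_distance_table(s: str, t: str) -> list[list[int]]:
    """Full table D, where D[i][j] is the edit distance between s[:i] and t[:j]."""
    D = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        D[i][0] = i
    for j in range(1, len(t) + 1):
        D[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1  # d(si, tj)
            D[i][j] = min(D[i - 1][j - 1] + d, D[i - 1][j] + 1, D[i][j - 1] + 1)
    return D


table = edit_distance_table("MCCOHN", "COHEN")
print(" ", *"COHEN")
for ch, row in zip("MCCOHN", table[1:]):
    print(ch, *row[1:])  # skip the base-case column D(i, 0)
print("D(s,t) =", table[-1][-1])
```

The bottom-right cell is the distance between the two complete strings.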
Sec. 6.3

Why distance is a bad idea


The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
Sec. 6.3

Use angle instead of distance


• Thought experiment: take a document d and
append it to itself. Call this document d′.
• “Semantically” d and d′ have the same content
• The Euclidean distance between the two
documents can be quite large
• The angle between the two documents is 0,
corresponding to maximal similarity.
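The thought experiment can be checked numerically. The sketch below assumes documents are represented by raw term counts over lowercased, whitespace-separated words; the sample sentence and the helper names are made up for illustration.

```python
import math
from collections import Counter


def term_counts(text: str) -> Counter:
    """Bag-of-words vector: term -> number of occurrences."""
    return Counter(text.lower().split())


def euclidean(a: Counter, b: Counter) -> float:
    # A Counter returns 0 for terms it has not seen.
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in set(a) | set(b)))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norms = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norms if norms else 0.0


d = "the quick brown fox jumps over the lazy dog"
d_prime = d + " " + d  # d appended to itself

vec_d, vec_d_prime = term_counts(d), term_counts(d_prime)
dist = euclidean(vec_d, vec_d_prime)
cos = cosine(vec_d, vec_d_prime)
angle = math.degrees(math.acos(min(1.0, cos)))  # clamp against rounding error

print("Euclidean distance:", dist)
print("cosine similarity:", cos)
print("angle in degrees:", angle)
```

Doubling the document doubles every term count, so the vector gets longer but keeps its direction: the Euclidean distance grows with document length while the angle stays at zero, up to floating-point rounding.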

Cosine for length-normalized vectors
• For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

\[
\cos(\vec{q}, \vec{d}) = \vec{q} \cdot \vec{d} = \sum_{i=1}^{|V|} q_i d_i
\]

for q, d length-normalized.
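A quick numeric check of this identity, using two small made-up vectors (the values and helper names are illustrative assumptions, and both vectors are assumed to be non-zero):

```python
import math


def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def length(v: list[float]) -> float:
    return math.sqrt(dot(v, v))


def normalize(v: list[float]) -> list[float]:
    """Scale v to unit length (assumes v is not the zero vector)."""
    n = length(v)
    return [x / n for x in v]


q = [3.0, 4.0, 0.0]
d = [4.0, 3.0, 0.0]

cos_general = dot(q, d) / (length(q) * length(d))  # general cosine formula
cos_normalized = dot(normalize(q), normalize(d))   # plain dot product after normalizing

print(cos_general, cos_normalized)
```

Both values agree up to floating-point rounding, which is why vectors are often normalized once up front so that each later comparison is a single dot product.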

Cosine similarity
A generic View
• text1 = "This is a foo bar sentence ."
• text2 = "This sentence is similar to a foo bar sentence ."
• vector1 = text_to_vector(text1)
• vector2 = text_to_vector(text2)
• cosine = get_cosine(vector1, vector2)
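The slide does not define `text_to_vector` or `get_cosine`. One possible way to fill them in, assuming a simple word-count (bag-of-words) vector and a regular-expression tokenizer, is sketched below; other vector forms, such as word embeddings, would need a different `text_to_vector`.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")  # assumption: a token is a run of word characters


def text_to_vector(text: str) -> Counter:
    """Map each word in the text to its count."""
    return Counter(WORD.findall(text))


def get_cosine(vec1: Counter, vec2: Counter) -> float:
    """Cosine of the angle between two word-count vectors."""
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in shared)
    denominator = math.sqrt(sum(c * c for c in vec1.values())) * math.sqrt(
        sum(c * c for c in vec2.values())
    )
    return numerator / denominator if denominator else 0.0


text1 = "This is a foo bar sentence ."
text2 = "This sentence is similar to a foo bar sentence ."

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)
cosine = get_cosine(vector1, vector2)
print("Cosine:", cosine)
```

With this tokenizer the trailing "." is dropped and matching is case-sensitive, so "This" and "this" would count as different terms.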
Examples will be covered after
studying vector forms or word
embeddings of Text
Summary
• Term-based distance metrics measure the similarity between strings by comparing their tokens
• Fuzzy string matching uses Levenshtein (edit) distance
• Cosine similarity is another way of measuring similarity, computed on the term vectors of two texts
References
• wikipedia.org
• Prof. Jason Eisner (Natural Language Processing), Johns Hopkins University
• web.stanford.edu
• stackoverflow.com
