A Complete Beginner's Guide to Document Similarity Algorithms

Learn the code and math behind Euclidean Distance, Cosine Similarity and Pearson Correlation to power recommendation engines

GreekDataGuy · May 1, 2020 · 6 min read


Photo by Tyler Nix on Unsplash



I’ve deployed a few recommender systems into production. Not surprisingly, the simplest
turned out to be the most effective.

At the core of most recommender systems lies collaborative filtering. And at the core of
collaborative filtering is document similarity.

We’ll walk through 3 algorithms for calculating document similarity.

1) Euclidean Distance
2) Cosine Similarity
3) Pearson's Correlation Coefficient

Even a general intuition for how they work will help you pick the right tool for the job
and build a more intelligent engine.

What these algorithms have in common


Keep these caveats in the back of your mind while we discuss the algorithms.

1. Calculations are performed on vector representations of objects. Each object must first be converted to a numeric vector (see the sketch below).

2. Similarity/distance is calculated between a single pair of vectors at a time.

3. Regardless of the algorithm, feature selection will have a huge impact on your results.
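
On that first point, here's a minimal sketch of one common way to turn raw text into numeric vectors, using Sklearn's CountVectorizer. The two documents below are made up purely for illustration.

from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents to vectorize
docs = [
    "the restaurant reviewed the cooking",
    "cooking at the restaurant",
]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(docs).toarray()  # one count vector per document

print(vectorizer.vocabulary_)  # maps each word to its column index
print(vectors)                 # word counts per document

Each row of vectors can then be fed to any of the three algorithms below.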

Euclidean Distance
In a Nutshell
Euclidean distance is the distance between 2 points in a multidimensional space.


Closer points are more similar to each other. Farther points are more different from each other. So above, Mario and Carlos are more similar than Carlos and Jenny.

I've intentionally chosen 2 dimensions (a.k.a. features: [wealth, friends]) because it's easy to plot. We can still calculate distance beyond 2 dimensions, but a formula is required.

Intuitively this method makes sense as a distance measure. You plot your documents as
points and can literally measure the distance between them with a ruler.
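
For instance, here's a quick matplotlib sketch of that idea. The coordinates for Mario, Carlos and Jenny are invented, chosen only so the picture matches the description above.

import matplotlib.pyplot as plt

# Invented [wealth, friends] coordinates, for illustration only
people = {"Mario": (6, 7), "Carlos": (7, 8), "Jenny": (2, 3)}

for name, (wealth, friends) in people.items():
    plt.scatter(wealth, friends)
    plt.annotate(name, (wealth, friends))

plt.xlabel("wealth")
plt.ylabel("friends")
plt.show()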

Comparing Cities with Euclidean Distance


Let’s compare 3 cities: New York, Toronto and Paris.

Toronto = [3,7]

New York = [7,8]

Paris = [2,10]

The feature vector contains 2 features: [population, temperature]. Population is in millions. Temperature is in Celsius.

Now because we've again framed the problem as 2-dimensional, we could measure the distance between points with a ruler, but we're going to use the formula here instead.


The formula works whether there are 2 or 1000 dimensions.

Implement Euclidean Distance in Python


Nobody hates math notation more than me, but below is the formula for Euclidean distance between two n-dimensional points, a and b:

$$d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$

Let’s write a function that implements it and calculates the distance between 2 points.

from math import sqrt

def euclidean_dist(doc1, doc2):
    '''
    For every (x,y) pair, square the difference
    Then take the square root of the sum
    '''
    pre_square_sum = 0
    for idx, _ in enumerate(doc1):
        pre_square_sum += (doc1[idx] - doc2[idx]) ** 2
    return sqrt(pre_square_sum)

toronto = [3,7]
new_york = [7,8]

euclidean_dist(toronto, new_york)
#=> 4.123105625617661

Alright. The distance between Toronto and New York is 4.12.

Euclidean Distance with Sklearn


The function we wrote above is a little inefficient. Sklearn implements a faster version
using Numpy. In production we’d just use this.

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

toronto = [3,7]
new_york = [7,8]

t = np.array(toronto).reshape(1,-1)
n = np.array(new_york).reshape(1,-1)

euclidean_distances(t, n)[0][0]
#=> 4.123105625617661

Note that it requires arrays instead of lists as inputs, but we get the same result. Booya!
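
A side note: euclidean_distances also accepts a whole matrix of vectors and returns every pairwise distance at once, which comes in handy as soon as you compare more than two documents. A quick sketch with all three cities:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

cities = np.array([
    [3, 7],   # toronto
    [7, 8],   # new_york
    [2, 10],  # paris
])

# Entry (i, j) of the result is the distance between city i and city j
print(euclidean_distances(cities))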

Cosine Similarity
In a Nutshell
Cosine similarity is the cosine of the angle between 2 points in a multidimensional space. Points with smaller angles are more similar. Points with larger angles are more different.

While harder to wrap your head around, cosine similarity solves some problems with Euclidean distance. Namely, magnitude.

Number of times an article mentions the words “cooking” and “restaurant”

In the above drawing, we compare 3 documents based on how many times they contain
the words “cooking” and “restaurant”.

Euclidean distance tells us the blog and magazine are more similar than the blog and
newspaper. But I think that’s misleading.

The blog and newspaper could have similar content but are distant in a Euclidean sense
because the newspaper is longer and contains more words.

In reality, they both mention “restaurant” more than “cooking” and are probably more
similar to each other than not. Cosine similarity doesn’t fall into this trap.
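
To see that concretely, here's a small sketch. Scaling a vector (imagine a document 10x as long with the same word proportions) changes its Euclidean distance to other documents but leaves cosine similarity untouched.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

blog = np.array([[2, 10]])
longer_blog = blog * 10  # same word proportions, 10x the word counts

print(euclidean_distances(blog, longer_blog)[0][0])  # large: magnitude dominates
print(cosine_similarity(blog, longer_blog)[0][0])    # ≈ 1.0: same direction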

Comparing Documents with Cosine Similarity

Let's work through our above example. We'll compare documents based on the count of specific words.

magazine_article = [7,1]
blog_post = [2,10]
newspaper_article = [2,20]

Rather than taking the distance between each pair, we'll now take the cosine of the angle between them, measured from the origin. Even just eyeballing it, the blog and the newspaper look more similar.

Implementing Cosine Similarity in Python


Note that cosine similarity is not the angle itself, but the cosine of the angle. So a smaller
angle (sub 90 degrees) returns a larger similarity.

$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

https://en.wikipedia.org/wiki/Cosine_similarity

Let’s implement a function to calculate this ourselves.

import numpy as np
from math import sqrt

def my_cosine_similarity(A, B):
    numerator = np.dot(A, B)
    denominator = sqrt(A.dot(A)) * sqrt(B.dot(B))
    return numerator / denominator

magazine_article = [7,1]
blog_post = [2,10]
newspaper_article = [2,20]

m = np.array(magazine_article)
b = np.array(blog_post)
n = np.array(newspaper_article)

print( my_cosine_similarity(m,b) ) #=> 0.3328201177351375
print( my_cosine_similarity(b,n) ) #=> 0.9952285251199801
print( my_cosine_similarity(n,m) ) #=> 0.2392231652082992

Now we see that the blog and newspaper are indeed more similar to each other.

Cosine Similarity with Sklearn


In production, we’re better off just importing Sklearn’s more efficient implementation.

from sklearn.metrics.pairwise import cosine_similarity

m = np.array(magazine_article).reshape(1,-1)
b = np.array(blog_post).reshape(1,-1)
n = np.array(newspaper_article).reshape(1,-1)

print( cosine_similarity(m,b)[0,0] ) #=> 0.3328201177351375
print( cosine_similarity(b,n)[0,0] ) #=> 0.9952285251199801
print( cosine_similarity(n,m)[0,0] ) #=> 0.2392231652082992

Same values. Great!
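
A related note: if you ever need a distance rather than a similarity, Sklearn also provides cosine_distances, which is simply 1 minus the cosine similarity.

from sklearn.metrics.pairwise import cosine_distances

# Reusing m and b from above
print( cosine_distances(m, b)[0,0] )  # ≈ 0.667, i.e. 1 minus the 0.333 similarity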

Pearson Correlation
This typically quantifies the relationship between 2 variables. For example, education and income.


Entirely fictional data

But we can also use it to measure the similarity between 2 documents where we treat the
1st document’s vector as x and the 2nd document’s vector as y.

Because the Pearson correlation coefficient, r, returns a value between 1 and -1, Pearson distance can then be calculated as 1 - r to return a value between 0 and 2.
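
As a tiny illustration (the r value below is made up):

r = 0.96          # a made-up correlation coefficient in [-1, 1]
distance = 1 - r  # Pearson distance in [0, 2]; 0 means perfectly correlated
print(distance)   # ≈ 0.04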

Implementing Pearson Correlation Coefficient in Python


Let’s implement the formula ourselves to develop an understanding of how it works.

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

We’re going to generate some fake data representing a few people. We’ll compare how
similar they are based on a 3-feature vector.

emily = [1,2,5]
kartik = [1,3,5]
todd = [5,3,5]

Our implementation.

import numpy as np

def pearsons_correlation_coef(x, y):
    x = np.array(x)
    y = np.array(y)

    x_mean = x.mean()
    y_mean = y.mean()

    x_less_mean = x - x_mean
    y_less_mean = y - y_mean

    numerator = np.sum(x_less_mean * y_less_mean)
    denominator = np.sqrt(
        np.sum(x_less_mean**2) * np.sum(y_less_mean**2)
    )

    return numerator / denominator

pearsons_correlation_coef(emily, kartik)
#=> 0.9607689228305226

Cool. Emily and Kartik appear pretty similar. We’ll compare all 3 with Scipy in a second.

Pearson Correlation Coefficient with Scipy


Scipy implements a more efficient and robust calculation.

from scipy.stats import pearsonr

emily = [1,2,5]
kartik = [1,3,5]
todd = [5,3,5]

print( pearsonr(emily, kartik) )
#=> (0.9607689228305226, 0.1789123750220673)
print( pearsonr(kartik, todd) )
#=> (0.0, 1.0)
print( pearsonr(todd, emily) )
#=> (0.27735009811261446, 0.8210876249779328)

Although I've chosen random numbers as data points, we can see that Emily and Kartik are more similar than Emily and Todd. Note that pearsonr returns a pair: the correlation coefficient first, then the p-value.


Conclusion
While we've covered one piece of the puzzle, the road to a fully fledged recommender doesn't end here.

In the context of an e-commerce engine, we’d next build a matrix of similarity scores
between every pair of users. We could then use that to recommend products that similar
users purchased.
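
As a rough sketch of that idea (the users and purchase counts below are entirely made up), we could compute a user-to-user cosine similarity matrix and find each user's nearest neighbor:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up purchase counts: rows are users, columns are products
purchases = np.array([
    [5, 0, 1],  # user 0
    [4, 1, 0],  # user 1
    [0, 3, 4],  # user 2
])

sim = cosine_similarity(purchases)  # user-to-user similarity matrix
np.fill_diagonal(sim, -1)           # ignore each user's similarity to themselves

most_similar = sim.argmax(axis=1)
print(most_similar)  # for each user, the index of their most similar user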

That said, a good recommender might also incorporate domain-based rules and user preferences.
