A Complete Beginner's Guide to Document Similarity Algorithms

Learn the code and math behind Euclidean Distance, Cosine Similarity and Pearson Correlation to power recommendation engines

GreekDataGuy · May 1, 2020 · 6 min read


Photo by Tyler Nix on Unsplash



I’ve deployed a few recommender systems into production. Not surprisingly, the simplest
turned out to be the most effective.

At the core of most recommender systems lies collaborative filtering. And at the core of
collaborative filtering is document similarity.

We’ll walk through 3 algorithms for calculating document similarity.

1) Euclidean Distance
2) Cosine Similarity
3) Pearson's Correlation Coefficient

Even a general intuition for how they work will help you pick the right tool for the job
and build a more intelligent engine.

What these algorithms have in common


Keep these caveats in the back of your mind while we discuss the algorithms.

1. Calculations are performed on vector representations of objects. Each object must first be converted to a numeric vector (see the sketch below).

2. Similarity/distance is calculated between a single pair of vectors at a time.

3. Regardless of the algorithm, feature selection will have a huge impact on your results.
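
On that first point, here's a minimal sketch of one common way to turn raw text into numeric vectors, using Sklearn's CountVectorizer. The two documents below are made up purely for illustration.

from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents to vectorize
docs = [
    "the restaurant reviewed the cooking",
    "cooking at the restaurant",
]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(docs).toarray()  # one count vector per document

print(vectorizer.vocabulary_)  # maps each word to its column index
print(vectors)                 # word counts per document

Each row of vectors can then be fed to any of the three algorithms below.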

Euclidean Distance
In a Nutshell
Euclidean distance is the distance between 2 points in a multidimensional space.


Closer points are more similar to each other. Farther points are more different from each other. So above, Mario and Carlos are more similar than Carlos and Jenny.

I've intentionally chosen 2 dimensions (a.k.a. features: [wealth, friends]) because it's easy to plot. We can still calculate distance beyond 2 dimensions, but a formula is required.

Intuitively this method makes sense as a distance measure. You plot your documents as
points and can literally measure the distance between them with a ruler.
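
For instance, here's a quick matplotlib sketch of that idea. The coordinates for Mario, Carlos and Jenny are invented, chosen only so the picture matches the description above.

import matplotlib.pyplot as plt

# Invented [wealth, friends] coordinates, for illustration only
people = {"Mario": (6, 7), "Carlos": (7, 8), "Jenny": (2, 3)}

for name, (wealth, friends) in people.items():
    plt.scatter(wealth, friends)
    plt.annotate(name, (wealth, friends))

plt.xlabel("wealth")
plt.ylabel("friends")
plt.show()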

Comparing Cities with Euclidean Distance


Let’s compare 3 cities: New York, Toronto and Paris.

Toronto = [3,7]

New York = [7,8]

Paris = [2,10]

The feature vector contains 2 features: [population, temperature]. Population is in millions. Temperature is in Celsius.

Now because we've again framed the problem as 2-dimensional, we could measure the distance between points with a ruler, but we're going to use the formula here instead.


The formula works whether there are 2 or 1000 dimensions.

Implement Euclidean Distance in Python


Nobody hates math notation more than me, but below is the formula for Euclidean distance between two n-dimensional points, a and b:

$$d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$

Let’s write a function that implements it and calculates the distance between 2 points.

from math import sqrt

def euclidean_dist(doc1, doc2):
    '''
    For every (x,y) pair, square the difference
    Then take the square root of the sum
    '''
    pre_square_sum = 0
    for idx, _ in enumerate(doc1):
        pre_square_sum += (doc1[idx] - doc2[idx]) ** 2
    return sqrt(pre_square_sum)

toronto = [3,7]
new_york = [7,8]

euclidean_dist(toronto, new_york)
#=> 4.123105625617661

Alright. The distance between Toronto and New York is 4.12.

Euclidean Distance with Sklearn


The function we wrote above is a little inefficient. Sklearn implements a faster version
using Numpy. In production we’d just use this.

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

toronto = [3,7]
new_york = [7,8]

t = np.array(toronto).reshape(1,-1)
n = np.array(new_york).reshape(1,-1)

euclidean_distances(t, n)[0][0]
#=> 4.123105625617661

Note that it requires arrays instead of lists as inputs, but we get the same result. Booya!
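
A side note: euclidean_distances also accepts a whole matrix of vectors and returns every pairwise distance at once, which comes in handy as soon as you compare more than two documents. A quick sketch with all three cities:

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

cities = np.array([
    [3, 7],   # toronto
    [7, 8],   # new_york
    [2, 10],  # paris
])

# Entry (i, j) of the result is the distance between city i and city j
print(euclidean_distances(cities))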

Cosine Similarity
In a Nutshell
Cosine similarity is the cosine of the angle between 2 points in a multidimensional space. Points with smaller angles are more similar. Points with larger angles are more different.

While harder to wrap your head around, cosine similarity solves some problems with Euclidean distance. Namely, magnitude.

Number of times an article mentions the words “cooking” and “restaurant”

In the above drawing, we compare 3 documents based on how many times they contain
the words “cooking” and “restaurant”.

Euclidean distance tells us the blog and magazine are more similar than the blog and
newspaper. But I think that’s misleading.

The blog and newspaper could have similar content but are distant in a Euclidean sense
because the newspaper is longer and contains more words.

In reality, they both mention “restaurant” more than “cooking” and are probably more
similar to each other than not. Cosine similarity doesn’t fall into this trap.
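
To see that concretely, here's a small sketch. Scaling a vector (imagine a document 10x as long with the same word proportions) changes its Euclidean distance to other documents but leaves cosine similarity untouched.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

blog = np.array([[2, 10]])
longer_blog = blog * 10  # same word proportions, 10x the word counts

print(euclidean_distances(blog, longer_blog)[0][0])  # large: magnitude dominates
print(cosine_similarity(blog, longer_blog)[0][0])    # ≈ 1.0: same direction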

Comparing Documents with Cosine Similarity

Let's work through our above example. We'll compare documents based on the count of specific words.

magazine_article = [7,1]
blog_post = [2,10]
newspaper_article = [2,20]

Rather than taking the distance between each pair, we'll now take the cosine of the angle between them, measured from the origin. Even just eyeballing it, the blog and the newspaper look more similar.

Implementing Cosine Similarity in Python


Note that cosine similarity is not the angle itself, but the cosine of the angle. So a smaller
angle (sub 90 degrees) returns a larger similarity.

$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

https://en.wikipedia.org/wiki/Cosine_similarity

Let’s implement a function to calculate this ourselves.

import numpy as np
from math import sqrt

def my_cosine_similarity(A, B):
    numerator = np.dot(A, B)
    denominator = sqrt(A.dot(A)) * sqrt(B.dot(B))
    return numerator / denominator

magazine_article = [7,1]
blog_post = [2,10]
newspaper_article = [2,20]

m = np.array(magazine_article)
b = np.array(blog_post)
n = np.array(newspaper_article)

print( my_cosine_similarity(m,b) ) #=> 0.3328201177351375
print( my_cosine_similarity(b,n) ) #=> 0.9952285251199801
print( my_cosine_similarity(n,m) ) #=> 0.2392231652082992

Now we see that the blog and newspaper are indeed more similar to each other.

Cosine Similarity with Sklearn


In production, we’re better off just importing Sklearn’s more efficient implementation.

from sklearn.metrics.pairwise import cosine_similarity

m = np.array(magazine_article).reshape(1,-1)
b = np.array(blog_post).reshape(1,-1)
n = np.array(newspaper_article).reshape(1,-1)

print( cosine_similarity(m,b)[0,0] ) #=> 0.3328201177351375
print( cosine_similarity(b,n)[0,0] ) #=> 0.9952285251199801
print( cosine_similarity(n,m)[0,0] ) #=> 0.2392231652082992

Same values. Great!
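
A related note: if you ever need a distance rather than a similarity, Sklearn also provides cosine_distances, which is simply 1 minus the cosine similarity.

from sklearn.metrics.pairwise import cosine_distances

# Reusing m and b from above
print( cosine_distances(m, b)[0,0] )  # ≈ 0.667, i.e. 1 minus the 0.333 similarity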

Pearson Correlation
This typically quantifies the relationship between 2 variables. For example, education and income.


Entirely fictional data

But we can also use it to measure the similarity between 2 documents where we treat the
1st document’s vector as x and the 2nd document’s vector as y.

Because the Pearson correlation coefficient, r, returns a value between 1 and -1, Pearson distance can then be calculated as 1 - r to return a value between 0 and 2.
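
As a tiny illustration (the r value below is made up):

r = 0.96          # a made-up correlation coefficient in [-1, 1]
distance = 1 - r  # Pearson distance in [0, 2]; 0 means perfectly correlated
print(distance)   # ≈ 0.04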

Implementing Pearson Correlation Coefficient in Python


Let’s implement the formula ourselves to develop an understanding of how it works.

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

We’re going to generate some fake data representing a few people. We’ll compare how
similar they are based on a 3-feature vector.

emily = [1,2,5]
kartik = [1,3,5]
todd = [5,3,5]

Our implementation.

import numpy as np

def pearsons_correlation_coef(x, y):
    x = np.array(x)
    y = np.array(y)

    x_mean = x.mean()
    y_mean = y.mean()

    x_less_mean = x - x_mean
    y_less_mean = y - y_mean

    numerator = np.sum(x_less_mean * y_less_mean)
    denominator = np.sqrt(
        np.sum(x_less_mean**2) * np.sum(y_less_mean**2)
    )

    return numerator / denominator

pearsons_correlation_coef(emily, kartik)
#=> 0.9607689228305226

Cool. Emily and Kartik appear pretty similar. We’ll compare all 3 with Scipy in a second.

Pearson Correlation Coefficient with Scipy


Scipy implements a more efficient and robust calculation.

from scipy.stats import pearsonr

emily = [1,2,5]
kartik = [1,3,5]
todd = [5,3,5]

print( pearsonr(emily, kartik) )
#=> (0.9607689228305226, 0.1789123750220673)
print( pearsonr(kartik, todd) )
#=> (0.0, 1.0)
print( pearsonr(todd, emily) )
#=> (0.27735009811261446, 0.8210876249779328)

Although I've chosen random numbers as data points, we can see that Emily and Kartik are more similar than Emily and Todd. Note that pearsonr returns a pair: the correlation coefficient first, then the p-value.


Conclusion
While we've covered one piece of the puzzle, the road to a fully fledged recommender doesn't end here.

In the context of an e-commerce engine, we’d next build a matrix of similarity scores
between every pair of users. We could then use that to recommend products that similar
users purchased.
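
As a rough sketch of that idea (the users and purchase counts below are entirely made up), we could compute a user-to-user cosine similarity matrix and find each user's nearest neighbor:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up purchase counts: rows are users, columns are products
purchases = np.array([
    [5, 0, 1],  # user 0
    [4, 1, 0],  # user 1
    [0, 3, 4],  # user 2
])

sim = cosine_similarity(purchases)  # user-to-user similarity matrix
np.fill_diagonal(sim, -1)           # ignore each user's similarity to themselves

most_similar = sim.argmax(axis=1)
print(most_similar)  # for each user, the index of their most similar user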

That said, a good recommender might also incorporate domain-based rules and user preferences.
