A Complete Beginners Guide to Document Similarity Algorithms | by GreekDataGuy | Towards Data Science
https://towardsdatascience.com/a-complete-beginners-guide-to-document-similarity-algorithms-75c44035df90
I’ve deployed a few recommender systems into production. Not surprisingly, the simplest
turned out to be the most effective.
At the core of most recommender systems lies collaborative filtering. And at the core of
collaborative filtering is document similarity.
1) Euclidean Distance
2) Cosine Similarity
3) Pearson's Correlation Coefficient
Even a general intuition for how they work will help you pick the right tool for the job
and build a more intelligent engine.
Regardless of the algorithm, feature selection will have a huge impact on your results.
Euclidean Distance
In a Nutshell
Euclidean distance is the distance between 2 points in a multidimensional space.
Closer points are more similar to each other; more distant points are more different. So above, Mario and Carlos are more similar than Carlos and Jenny.
I've intentionally chosen 2 dimensions (a.k.a. features: [wealth, friends]) because they're easy to plot. We can still calculate distance beyond 2 dimensions, but a formula is required.
Intuitively this method makes sense as a distance measure. You plot your documents as
points and can literally measure the distance between them with a ruler.
Toronto = [3,7]
Paris = [2,10]
Now because we’ve again framed the problem as 2 dimensional, we could measure the
distance between points with a ruler, but we’re going to use the formula here instead.
Excuse my freehand
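For reference, the formula being sketched is the standard Euclidean distance between points p and q in n dimensions:

d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}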
Let’s write a function that implements it and calculates the distance between 2 points.
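A minimal implementation of the formula (this is a sketch; the article's original function body isn't shown in this excerpt, but it must behave like this to produce the output below):

```python
from math import sqrt

def euclidean_dist(p, q):
    # Sum the squared difference in each dimension, then take the square root
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```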
toronto = [3,7]
new_york = [7,8]
euclidean_dist(toronto, new_york)
#=> 4.123105625617661
toronto = [3,7]
new_york = [7,8]
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
t = np.array(toronto).reshape(1,-1)
n = np.array(new_york).reshape(1,-1)
euclidean_distances(t, n)[0][0]
#=> 4.123105625617661
Note that scikit-learn expects 2-D arrays as inputs (hence the reshape), but we get the same result. Booya!
Cosine Similarity
In a Nutshell
Cosine similarity is the cosine of the angle between 2 points in a multidimensional
space. Points with smaller angles are more similar. Points with larger angles are more
different.
While harder to wrap your head around, cosine similarity solves some problems with
Euclidean distance. Namely, magnitude.
In the above drawing, we compare 3 documents based on how many times they contain
the words “cooking” and “restaurant”.
Euclidean distance tells us the blog and magazine are more similar than the blog and
newspaper. But I think that’s misleading.
The blog and newspaper could have similar content but are distant in a Euclidean sense
because the newspaper is longer and contains more words.
In reality, they both mention “restaurant” more than “cooking” and are probably more
similar to each other than not. Cosine similarity doesn’t fall into this trap.
magazine_article = [7,1]
blog_post = [2,10]
newspaper_article = [2,20]
Rather than taking the distance between each, we’ll now take the cosine of the angle
between them from the point of origin. Now even just eyeballing it, the blog and the
newspaper look more similar.
https://en.wikipedia.org/wiki/Cosine_similarity
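From that page, the formula for two vectors A and B is:

\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2} \sqrt{\sum_{i} B_i^2}}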
import numpy as np
from math import sqrt
magazine_article = [7,1]
blog_post = [2,10]
newspaper_article = [2,20]
m = np.array(magazine_article)
b = np.array(blog_post)
n = np.array(newspaper_article)
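Following the formula linked above, a minimal hand-rolled version might look like this (the name `cosine_sim` is my own; the article's own function isn't reproduced here):

```python
from math import sqrt

def cosine_sim(x, y):
    # Dot product over the product of the vector magnitudes
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

magazine_article = [7, 1]
blog_post = [2, 10]
newspaper_article = [2, 20]

cosine_sim(blog_post, newspaper_article)   # ~0.995
cosine_sim(magazine_article, blog_post)    # ~0.333
```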
Now we see that the blog and newspaper are indeed more similar to each other.
m = np.array(magazine_article).reshape(1,-1)
b = np.array(blog_post).reshape(1,-1)
n = np.array(newspaper_article).reshape(1,-1)
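With the reshaped 2-D arrays, scikit-learn's `cosine_similarity` should return the same scores (shown here as a self-contained sketch):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

m = np.array([7, 1]).reshape(1, -1)   # magazine_article
b = np.array([2, 10]).reshape(1, -1)  # blog_post
n = np.array([2, 20]).reshape(1, -1)  # newspaper_article

cosine_similarity(b, n)[0][0]  # blog vs. newspaper, ~0.995
cosine_similarity(m, b)[0][0]  # magazine vs. blog, ~0.333
```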
Pearson Correlation
This typically quantifies the relationship between 2 variables. For example, education and income.
But we can also use it to measure the similarity between 2 documents where we treat the
1st document’s vector as x and the 2nd document’s vector as y.
Because the Pearson correlation coefficient, r, returns a value between 1 and -1, we can use it directly as a similarity score.
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
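From that page, the sample correlation coefficient for vectors x and y is:

r = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i} (x_i - \bar{x})^2} \sqrt{\sum_{i} (y_i - \bar{y})^2}}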
We’re going to generate some fake data representing a few people. We’ll compare how
similar they are based on a 3-feature vector.
emily = [1,2,5]
kartik = [1,3,5]
todd = [5,3,5]
Our implementation.
import numpy as np

def pearsons_correlation_coef(x, y):
    x, y = np.array(x), np.array(y)
    x_mean = x.mean()
    y_mean = y.mean()
    x_less_mean = x - x_mean
    y_less_mean = y - y_mean
    # Covariance term over the product of the standard-deviation terms
    numerator = np.sum(x_less_mean * y_less_mean)
    denominator = np.sqrt(np.sum(x_less_mean ** 2) * np.sum(y_less_mean ** 2))
    return numerator / denominator
pearsons_correlation_coef(emily,kartik)
#=> 0.9607689228305226
Cool. Emily and Kartik appear pretty similar. We'll compare all 3 with SciPy in a second.
emily = [1,2,5]
kartik = [1,3,5]
todd = [5,3,5]
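One plausible definition of the `pearsonr2` helper, assuming it wraps `scipy.stats.pearsonr` (which returns a (correlation, p-value) pair):

```python
from scipy.stats import pearsonr

def pearsonr2(x, y):
    # pearsonr returns (correlation, p-value); keep only the correlation
    return pearsonr(x, y)[0]
```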
pearsonr2(emily,kartik)
Although I've chosen random numbers as data points, we can see that Emily and Kartik are more similar than Emily and Todd.
Conclusion
While we’ve covered one piece of the puzzle, the road to a fully fledged recommender is
not complete.
In the context of an e-commerce engine, we’d next build a matrix of similarity scores
between every pair of users. We could then use that to recommend products that similar
users purchased.
That said, a good recommender might also incorporate domain-based rules and user preferences.