

A Survey of Similarity Measures
for Collaborative Filtering-Based
Recommender System

Gourav Jain, Tripti Mahara and Kuldeep Narayan Tripathi

Abstract Recommendation systems are one of the most valuable approaches to providing personalized services to users. They help users find relevant information, matched to their interests, within an enormous amount of data. To achieve this, a similarity measure is used that computes the similarity between two users or items. There are multifarious methods to compute the similarity between users/items, but each method has some limitations. In this paper, we calculate the similarity between users using the Pearson Correlation Coefficient (PCC), Cosine correlation, Constrained Pearson Correlation Coefficient (CPCC), Sigmoid function-based Pearson Correlation Coefficient (SPCC), Jaccard similarity, and Minkowski distance measures (Euclidean distance and City block distance). The results show that the Minkowski metric gives better results than the other similarity measures, as it is less affected by the size and sparsity of the data set.

Keywords Pearson correlation · Cosine correlation · Jaccard · Euclidean distance · City block distance · Constrained Pearson correlation · Sigmoid function-based Pearson correlation

1 Introduction

There is an enormous increase in the amount of information available on the Web. This tremendous growth in information, which cannot be processed easily, is termed information overload. With this overload, it becomes very difficult for users to find meaningful and relevant information. One possible solution to the information overload problem is a recommendation system [1], which helps users find relevant information according to their interests. Collaborative filtering (CF) is one of the most well-known and commonly used techniques for providing recommendations, based on the idea that people who shared common preferences in the past will also share them in the future. These common preferences are found

G. Jain (B) · T. Mahara · K. N. Tripathi


Indian Institute of Technology Roorkee, Roorkee, Uttarakhand 247667, India
e-mail: jaingourav3010@gmail.com

© Springer Nature Singapore Pte Ltd. 2020
M. Pant et al. (eds.), Soft Computing: Theories and Applications,
Advances in Intelligent Systems and Computing 1053,
https://doi.org/10.1007/978-981-15-0751-9_32

Table 1 Nomenclature used in similarity measures

| Notation | Description |
|---|---|
| $r_{u_a,i}$ | The rating given by user $u_a$ on item $i$ |
| $r_{u_b,i}$ | The rating given by user $u_b$ on item $i$ |
| $\bar{r}_{u_a}$ | The average rating value of user $u_a$ |
| $\bar{r}_{u_b}$ | The average rating value of user $u_b$ |
| $n$ | Total number of items |
| $i$ | Total number of co-rated items |
| $r_{\mathrm{med}}$ | Median value of the rating scale |
| $\lvert I_{u_a} \rvert$ | The cardinality of the set of items rated by user $u_a$ |
| $\lvert I_{u_b} \rvert$ | The cardinality of the set of items rated by user $u_b$ |

using various similarity techniques such as the Pearson Correlation Coefficient, Constrained Pearson Correlation, Cosine, Jaccard, etc.

2 Objective and Analysis

The objective of this paper is to analyze the efficiency of similarity measures between users at various sparsity levels, using the Pearson Correlation Coefficient, Cosine Correlation Coefficient, Jaccard, Constrained Pearson Correlation, Sigmoid function-based Pearson Correlation, Euclidean distance, and City block distance measures (Table 1).

3 Related Work

Collaborative filtering is widely used by e-commerce companies to recommend various products to consumers by making use of similarity measures. In this paper, we analyze the performance of the Pearson correlation, Cosine correlation, Jaccard, Constrained PCC, Sigmoid PCC, Euclidean distance, and City block distance measures at various sparsity levels. They are described as follows:

3.1 Pearson Correlation Coefficient

Pearson correlation coefficient (PCC) [2, 3] is one of the most popular and prominent traditional similarity measures [4]. It captures the linear correlation between users/items and is given by the ratio of the covariance of two users' ratings to the product of their standard deviations, considering only co-rated items. Similarity using PCC is calculated [5] by:

$$\mathrm{Sim}(u_a, u_b) = \frac{\sum_{i=1}^{n} \left(r_{u_a,i} - \bar{r}_{u_a}\right)\left(r_{u_b,i} - \bar{r}_{u_b}\right)}{\sqrt{\sum_{i=1}^{n} \left(r_{u_a,i} - \bar{r}_{u_a}\right)^2} \cdot \sqrt{\sum_{i=1}^{n} \left(r_{u_b,i} - \bar{r}_{u_b}\right)^2}} \tag{1}$$

PCC has some limitations [6, 7]; one of them is that it can report low similarity even when two users' ratings are similar, or high similarity despite large differences in their ratings.
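As a concrete illustration, Eq. (1) can be sketched in Python. This is our own minimal sketch, not code from the paper; representing each user's rating profile as a dict mapping item id to rating is an assumed encoding.

```python
import math

def pearson(a, b):
    """Eq. (1): Pearson correlation over the items co-rated by both users.

    a, b: dicts mapping item id -> rating; an absent key means "not rated".
    """
    common = set(a) & set(b)
    if not common:
        return 0.0
    # User averages, as in Table 1 (average rating value of each user).
    mean_a = sum(a.values()) / len(a)
    mean_b = sum(b.values()) / len(b)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = (math.sqrt(sum((a[i] - mean_a) ** 2 for i in common))
           * math.sqrt(sum((b[i] - mean_b) ** 2 for i in common)))
    return num / den if den else 0.0

# Users 1 and 4 from Table 2 (0% sparsity); the result matches the table.
u1 = {1: 4, 2: 4, 3: 1, 4: 4, 5: 3}
u4 = {1: 5, 2: 4, 3: 2, 4: 3, 5: 3}
print(round(pearson(u1, u4), 2))  # 0.77
```

Returning 0.0 when there are no co-rated items (or zero variance) is a common convention, since the correlation is undefined in those cases.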

3.2 Cosine Correlation Coefficient

The cosine similarity measure [4] calculates similarity as the cosine of the angle between the two rating vectors of the users. A smaller angle corresponds to higher similarity and vice versa. This measure does not center the data or adjust for users' preference scales; hence, it is called the uncentered cosine similarity [8]. Similarity using cosine is calculated as follows [5]:

$$\mathrm{Sim}(u_a, u_b) = \frac{\sum_{i=1}^{n} r_{u_a,i}\, r_{u_b,i}}{\sqrt{\sum_{i=1}^{n} r_{u_a,i}^2} \cdot \sqrt{\sum_{i=1}^{n} r_{u_b,i}^2}} \tag{2}$$

Several researchers have pointed out limitations of cosine similarity [3, 7]. The main one is that it can output high similarity in spite of significant differences in ratings [6], because cosine focuses on the direction of the rating vectors rather than their length.
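Eq. (2) can likewise be sketched in Python (our own illustration; an unrated item is simply absent from a user's dict, so it contributes 0 to the dot product):

```python
import math

def cosine(a, b):
    """Eq. (2): cosine of the angle between two rating vectors.

    a, b: dicts mapping item id -> rating; missing items count as 0,
    so the dot product only needs the commonly rated items.
    """
    common = set(a) & set(b)
    num = sum(a[i] * b[i] for i in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

# Users 1 and 4 from Table 2: high similarity, matching the 0.98 entry.
u1 = {1: 4, 2: 4, 3: 1, 4: 4, 5: 3}
u4 = {1: 5, 2: 4, 3: 2, 4: 3, 5: 3}
print(round(cosine(u1, u4), 2))  # 0.98
```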

3.3 Jaccard Similarity Coefficient

The Jaccard similarity coefficient is one of the popular methods for calculating the similarity between users/items and is also called the Tanimoto coefficient similarity measure [8]. In similarity calculation, it only considers the number of common ratings between the two users, so the benefit of using this method is greatest when the number of common ratings is large. Similarity in Jaccard is calculated [5] by:

$$\mathrm{Sim}(u_a, u_b) = \frac{\left|I_{u_a} \cap I_{u_b}\right|}{\left|I_{u_a} \cup I_{u_b}\right|} \tag{3}$$

The main drawback [7] of this method is that it does not take the absolute rating values into account and is extremely sensitive to small data sets with missing observations.
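Because Eq. (3) ignores rating values entirely, a sketch needs only the sets of rated items (again our own illustration, not code from the paper):

```python
def jaccard(a, b):
    """Eq. (3): ratio of co-rated items to all items rated by either user.

    a, b: dicts (or any iterables of item ids); rating values are ignored,
    which is exactly the drawback noted in the text.
    """
    ia, ib = set(a), set(b)
    union = ia | ib
    return len(ia & ib) / len(union) if union else 0.0

# Users 1 and 2 from Table 3 (20% sparsity): rated item sets
# {1, 3, 4, 5} and {1, 2, 3, 5} -> 3 common items out of 5 total.
u1 = {1: 4, 3: 1, 4: 4, 5: 3}
u2 = {1: 2, 2: 1, 3: 4, 5: 5}
print(jaccard(u1, u2))  # 0.6
```

This matches the uniform 0.60 column in Table 3, illustrating why Jaccard cannot distinguish pairs whose rated-item sets overlap equally.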

3.4 Constrained Pearson Correlation and Sigmoid Function-Based Pearson Correlation

The Constrained Pearson Correlation Coefficient (CPCC) is a variant of PCC that is more suitable when negative ratings are present in the data set [3]. In CPCC, the user's average rating value is replaced with the median value of the rating scale, which is 3 for the rating scale 1–5. Similarity in CPCC is calculated by:

$$\mathrm{Sim}(u_a, u_b) = \frac{\sum_{i=1}^{n} \left(r_{u_a,i} - r_{\mathrm{med}}\right)\left(r_{u_b,i} - r_{\mathrm{med}}\right)}{\sqrt{\sum_{i=1}^{n} \left(r_{u_a,i} - r_{\mathrm{med}}\right)^2} \cdot \sqrt{\sum_{i=1}^{n} \left(r_{u_b,i} - r_{\mathrm{med}}\right)^2}} \tag{4}$$

The main drawback of this technique is its poor performance on sparse data sets. Another Pearson correlation-based method is the Sigmoid function-based Pearson Correlation Coefficient (SPCC). This similarity measure works best when there is a large number of co-rated items between users. Similarity in SPCC is calculated as follows:

$$\mathrm{Sim}(u_a, u_b) = \mathrm{Sim}(u_a, u_b)_{\mathrm{PCC}} \cdot \frac{1}{1 + \exp\left(-\frac{i}{2}\right)} \tag{5}$$

where $i$ is the number of co-rated items (Table 1).
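Both variants can be sketched in Python (our own illustration; the dict encoding and the zero-denominator convention are assumptions, and SPCC here simply damps a locally computed PCC by the sigmoid factor of Eq. (5)):

```python
import math

def cpcc(a, b, r_med=3.0):
    """Eq. (4): PCC with each user's mean replaced by the scale median."""
    common = set(a) & set(b)
    num = sum((a[i] - r_med) * (b[i] - r_med) for i in common)
    den = (math.sqrt(sum((a[i] - r_med) ** 2 for i in common))
           * math.sqrt(sum((b[i] - r_med) ** 2 for i in common)))
    return num / den if den else 0.0

def spcc(a, b):
    """Eq. (5): PCC damped by a sigmoid in the number of co-rated items."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    mean_a = sum(a.values()) / len(a)
    mean_b = sum(b.values()) / len(b)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = (math.sqrt(sum((a[i] - mean_a) ** 2 for i in common))
           * math.sqrt(sum((b[i] - mean_b) ** 2 for i in common)))
    pcc = num / den if den else 0.0
    # Few co-rated items -> factor near 0.5; many -> factor near 1.
    return pcc / (1.0 + math.exp(-len(common) / 2.0))

# Users 1 and 4 from Table 2 on the 1-5 scale (median 3).
u1 = {1: 4, 2: 4, 3: 1, 4: 4, 5: 3}
u4 = {1: 5, 2: 4, 3: 2, 4: 3, 5: 3}
print(round(cpcc(u1, u4), 2))  # 0.77
```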

3.5 Euclidean Distance Measure and City Block Distance Measure

The Euclidean distance and City block measures are distance-based similarity measures and are special cases of the Minkowski distance metric [9]. The Minkowski metric is a p-metric distance between n-dimensional points and can be defined as:

$$\mathrm{Dis}(u_a, u_b) = \left(\sum_{i=1}^{n} \left|r_{u_a,i} - r_{u_b,i}\right|^p\right)^{1/p} \tag{6}$$

$\mathrm{Dis}(u_a, u_b)$ is the distance between the two users $u_a$ and $u_b$. In this metric, $p = 2$ gives the Euclidean distance and $p = 1$ gives the City block distance. After calculating the distance, similarity is calculated as follows:

$$\mathrm{Sim}(u_a, u_b) = \frac{1}{1 + \mathrm{Dis}(u_a, u_b)} \tag{7}$$

The Euclidean distance measure [8] is distance based: to calculate similarity, the distance between users is first calculated (by Eq. 8), and the similarity is then computed (by Eq. 7). Similar to the cosine similarity measure, this measure [10] never gives negative similarity values, and as the distance decreases, the similarity between users increases. Distance is calculated as follows:

$$\mathrm{Dis}(u_a, u_b) = \sqrt{\sum_{i=1}^{n} \left(r_{u_a,i} - r_{u_b,i}\right)^2} \tag{8}$$

This technique is less accurate in sparse environments [8]. Another variant of the Minkowski metric is the City Block (CB), or Manhattan, distance measure [9]. The distance between two users is determined (by Eq. 9), and the similarity is then calculated (by Eq. 7). The similarity value in City block is always greater than or equal to zero.

$$\mathrm{Dis}(u_a, u_b) = \sum_{i=1}^{n} \left|r_{u_a,i} - r_{u_b,i}\right| \tag{9}$$

An advantage of using the City block distance measure for similarity computation is that it works well for sparse data sets; it thus overcomes the main limitation of traditional similarity measures.
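Eqs. (6)–(9) collapse into one parameterized sketch (our own illustration; summing over co-rated items only is an assumption consistent with the nomenclature of Table 1):

```python
def minkowski_sim(a, b, p):
    """Eqs. (6)-(7): Minkowski distance over co-rated items, turned into
    a similarity via 1 / (1 + distance).  p = 2 gives the Euclidean
    distance (Eq. 8); p = 1 gives the City block distance (Eq. 9).

    a, b: dicts mapping item id -> rating.
    """
    common = set(a) & set(b)
    dist = sum(abs(a[i] - b[i]) ** p for i in common) ** (1.0 / p)
    return 1.0 / (1.0 + dist)

# Users 1 and 4 from Table 2: City block and Euclidean similarities
# match the CB (0.25) and Euclidean (0.37) entries in the table.
u1 = {1: 4, 2: 4, 3: 1, 4: 4, 5: 3}
u4 = {1: 5, 2: 4, 3: 2, 4: 3, 5: 3}
print(round(minkowski_sim(u1, u4, 1), 2))  # 0.25
print(round(minkowski_sim(u1, u4, 2), 2))  # 0.37
```

Note that the similarity is always in (0, 1], consistent with the remark that these measures never produce negative values.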

4 Results and Discussion

To compare the performance of the similarity measures, experiments are conducted on a data set consisting of a user–item rating matrix. The data set consists of 4 users and 5 items, with ratings provided on a scale of 1–5. In order to evaluate performance, similarity is calculated on the data set at sparsity levels varying from 0 to 80% (in intervals of 20%). In Tables 2, 3, 4, 5, and 6, part (a) presents the data set for each sparsity level and part (b) presents the similarity calculated by the different techniques for the data set given in part (a).
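To make the setup concrete, the following sketch (our own illustration; encoding "not rated" as 0 follows the tables' convention) reproduces the City block column of Table 2 for all six user pairs:

```python
from itertools import combinations

# Rating matrix of Table 2(a) (0% sparsity); a 0 entry would mean "not rated".
R = {
    1: [4, 4, 1, 4, 3],
    2: [2, 1, 4, 2, 5],
    3: [3, 1, 3, 2, 1],
    4: [5, 4, 2, 3, 3],
}

def as_profile(row):
    """Drop unrated (zero) entries, keeping item index -> rating."""
    return {i: r for i, r in enumerate(row) if r != 0}

def city_block_sim(a, b):
    """Eqs. (7) and (9) over the co-rated items."""
    common = set(a) & set(b)
    dist = sum(abs(a[i] - b[i]) for i in common)
    return 1.0 / (1.0 + dist)

# Prints the six pair similarities, matching the CB column of Table 2(b).
for ua, ub in combinations(sorted(R), 2):
    sim = city_block_sim(as_profile(R[ua]), as_profile(R[ub]))
    print(f"pair {ua}{ub}: {sim:.2f}")
```

Raising a table's sparsity amounts to zeroing more entries of `R` before running the same loop, which is how the comparison below varies from 0 to 80%.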
From the experiments, it is concluded that:
1. The Pearson correlation coefficient gives incorrect results in some cases. For example, PCC computes zero similarity between user pairs 23 and 34 in Table 2, but from the data set it is clear that some similarity exists between these user pairs. In Table 5, it renders zero similarity for all user pairs, which is wrong.
2. Similar to PCC, cosine correlation gives incorrect results in some cases. For instance, it shows low similarity for user pair 14 in Table 4, while the other similarity measures except Jaccard show maximum similarity for the same pair. As evident from Table 3, the cosine measure computes maximum similarity for user pair 13, but it is obvious from the data set that the similarity should be maximum for user pair 12.

Table 2 a Sparsity 0% b Similarity

(a)

| | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 |
|---|---|---|---|---|---|
| User 1 | 4 | 4 | 1 | 4 | 3 |
| User 2 | 2 | 1 | 4 | 2 | 5 |
| User 3 | 3 | 1 | 3 | 2 | 1 |
| User 4 | 5 | 4 | 2 | 3 | 3 |

(b)

| User pair | PCC | Cosine | Jaccard | CPCC | SPCC | Euclidean | CB |
|---|---|---|---|---|---|---|---|
| 12 | −0.68 | 0.72 | 1.00 | −0.68 | −0.63 | 0.15 | 0.08 |
| 13 | −0.38 | 0.80 | 1.00 | −0.38 | −0.35 | 0.18 | 0.09 |
| 14 | 0.77 | 0.98 | 1.00 | 0.77 | 0.72 | 0.37 | 0.25 |
| 23 | 0.00 | 0.81 | 1.00 | 0.10 | 0.00 | 0.19 | 0.14 |
| 24 | −0.61 | 0.77 | 1.00 | −0.62 | −0.57 | 0.16 | 0.08 |
| 34 | 0.00 | 0.87 | 1.00 | −0.27 | 0.00 | 0.19 | 0.10 |

Table 3 a Sparsity 20% b Similarity

(a)

| | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 |
|---|---|---|---|---|---|
| User 1 | 4 | 0 | 1 | 4 | 3 |
| User 2 | 2 | 1 | 4 | 0 | 5 |
| User 3 | 3 | 1 | 0 | 2 | 1 |
| User 4 | 0 | 4 | 2 | 3 | 3 |

(b)

| User pair | PCC | Cosine | Jaccard | CPCC | SPCC | Euclidean | CB |
|---|---|---|---|---|---|---|---|
| 12 | −0.55 | 0.61 | 0.60 | −0.55 | −0.45 | 0.20 | 0.13 |
| 13 | 0.72 | 0.92 | 0.60 | −0.32 | 0.59 | 0.25 | 0.17 |
| 14 | 0.89 | 0.58 | 0.60 | 0.89 | 0.73 | 0.41 | 0.33 |
| 23 | −0.25 | 0.46 | 0.60 | 0.00 | −0.21 | 0.20 | 0.17 |
| 24 | −0.71 | 0.65 | 0.60 | −0.71 | −0.58 | 0.20 | 0.13 |
| 34 | −0.69 | 0.54 | 0.60 | −0.67 | −0.56 | 0.21 | 0.14 |

Table 4 a Sparsity 40% b Similarity

(a)

| | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 |
|---|---|---|---|---|---|
| User 1 | 4 | 0 | 1 | 4 | 0 |
| User 2 | 2 | 1 | 4 | 0 | 5 |
| User 3 | 0 | 1 | 0 | 2 | 1 |
| User 4 | 0 | 4 | 2 | 0 | 0 |

(b)

| User pair | PCC | Cosine | Jaccard | CPCC | SPCC | Euclidean | CB |
|---|---|---|---|---|---|---|---|
| 12 | −0.95 | 0.31 | 0.40 | −0.95 | −0.69 | 0.22 | 0.17 |
| 13 | 1.00 | 0.57 | 0.20 | −1.00 | 0.62 | 0.33 | 0.33 |
| 14 | 1.00 | 0.08 | 0.25 | 1.00 | 0.62 | 0.50 | 0.50 |
| 23 | 0.00 | 0.36 | 0.40 | 0.00 | 0.00 | 0.20 | 0.20 |
| 24 | −0.95 | 0.40 | 0.50 | −0.95 | −0.69 | 0.22 | 0.17 |
| 34 | −1.00 | 0.37 | 0.25 | −1.00 | −0.62 | 0.25 | 0.25 |

Table 5 a Sparsity 60% b Similarity

(a)

| | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 |
|---|---|---|---|---|---|
| User 1 | 4 | 0 | 0 | 4 | 0 |
| User 2 | 2 | 0 | 4 | 0 | 0 |
| User 3 | 0 | 1 | 0 | 2 | 1 |
| User 4 | 0 | 0 | 2 | 0 | 0 |

(b)

| User pair | PCC | Cosine | Jaccard | CPCC | SPCC | Euclidean | CB |
|---|---|---|---|---|---|---|---|
| 12 | 0.00 | 0.32 | 0.33 | −1.00 | 0.00 | 0.33 | 0.33 |
| 13 | 0.00 | 0.58 | 0.25 | −1.00 | 0.00 | 0.33 | 0.33 |
| 14 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 23 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 24 | 0.00 | 0.89 | 0.50 | −1.00 | 0.00 | 0.33 | 0.33 |
| 34 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |

3. Jaccard similarity only counts common ratings. This method fails when many users have the same number of common ratings: all user pairs show the same similarity, as evident in Tables 2 and 3, which is erroneous.
4. Similar to PCC, CPCC also computes wrong results in many cases, as observed in Table 4. CPCC computes a similarity of 0 for user pair 23, which is incorrect since users 2 and 3 both gave ratings to items 2 and 5. As we know, SPCC is based

Table 6 a Sparsity 80% b Similarity

(a)

| | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 |
|---|---|---|---|---|---|
| User 1 | 4 | 0 | 0 | 0 | 0 |
| User 2 | 2 | 0 | 0 | 0 | 0 |
| User 3 | 0 | 0 | 0 | 2 | 0 |
| User 4 | 0 | 0 | 2 | 0 | 0 |

(b)

| User pair | PCC | Cosine | Jaccard | CPCC | SPCC | Euclidean | CB |
|---|---|---|---|---|---|---|---|
| 12 | 0.00 | 1.00 | 1.00 | −1.00 | 0.00 | 0.33 | 0.33 |
| 13 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 14 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 23 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 24 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 34 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |

upon PCC, an incorrect result is likewise computed for user pair 24 in Table 5: both users rated item 3, but the calculated similarity between them is zero.
5. Euclidean and City block are special cases of the Minkowski distance measure. Both similarity measures give broadly similar results for highly sparse data sets, but at low sparsity, City block gives better results than Euclidean. For example, in Table 5, Euclidean distance gives lower similarity for user pair 13 (0.190) than for user pairs 24 (0.333) and 12 (0.5), while the rating difference for pair 12 is 1, for pair 24 is 2, and for pair 13 is 6.
Based on the above experiments, it is concluded that the City block distance measure is more efficient than the other similarity measures, even as the sparsity of the rating matrix increases, as depicted in Fig. 1a–e.

5 Conclusion

Similarity measures are central to systems that suggest items to users on the basis of their interests. This paper compares existing traditional similarity measures, namely Pearson correlation, Cosine correlation, Jaccard, Constrained PCC, Sigmoid PCC, Euclidean distance, and City block distance, under varying sparsity levels to check their accuracy. Experimental results show that the City block distance measure outperforms the other similarity measures on highly sparse data sets.

Fig. 1 Similarity computed by PCC, Cosine, Jaccard, CPCC, SPCC, Euclidean, and City block for each user pair at sparsity a 0% b 20% c 40% d 60% e 80%

References

1. Jain, G., Mishra, N., Sharma, S.: CRLRM: category based recommendation using linear regression model. In: 2013 Third International Conference on Advances in Computing and Communications, pp. 17–20. IEEE (2013)
2. Dharaneeshwaran, Nithya, S., Srinivasan, A., Senthilkumar, M.: Calculating the user-item similarity using Pearsons and cosine correlation. In: 2017 International Conference on Trends in Electronics and Informatics (ICEI), pp. 1000–1004. IEEE (2017)
3. Liu, H., Hu, Z., Mian, A., Tian, H., Zhu, X.: A new user similarity model to improve the accuracy of collaborative filtering. Knowl.-Based Syst. 56, 156–166 (2014)
4. Wang, Y., Deng, J., Gao, J., Zhang, P.: A hybrid user similarity model for collaborative filtering. Inf. Sci. 418–419, 102–118 (2017)
5. https://www.youtube.com/watch?v=h9gpufJFF-0. Accessed 08 Oct 2018
6. Tan, Z., He, L.: An efficient similarity measure for user-based collaborative filtering recommender systems inspired by the physical resonance principle. IEEE Access 5, 27211–27228 (2017)
7. Patra, B.K., Launonen, R., Ollikainen, V., Nandi, S.: A new similarity measure using Bhattacharyya coefficient for collaborative filtering in sparse data. Knowl.-Based Syst. 82, 163–177 (2015)
8. Arsan, T., Koksal, E., Bozkus, Z.: Comparison of collaborative filtering algorithms with various similarity measures for movie recommendation. Int. J. Comput. Sci. Eng. Appl. 6, 1–20 (2016)
9. Distance Measures. https://www.umass.edu/landeco/teaching/multivariate/readings/McCune.and.Grace.2002.chapter6.pdf. Accessed 08 Oct 2018
10. Shimodaira, H.: Similarity and Recommender Systems. http://www.inf.ed.ac.uk/teaching/courses/inf2b-learn-note02-2up.pdf. Accessed 08 Oct 2018

