A Survey of Similarity Measures For Collaborative Filtering-Based Recommender System
All content following this page was uploaded by Gourav Jain on 03 March 2020.
1 Introduction
The objective of this paper is to analyze the efficiency of user-user similarity
measures at various sparsity levels. The measures compared are the Pearson
Correlation Coefficient, Cosine similarity, Jaccard, Constrained Pearson Correlation,
Sigmoid function-based Pearson Correlation, Euclidean distance, and City block
distance (Table 1).
3 Related Work
Pearson correlation coefficient (PCC) [2, 3] is one of the most popular traditional
similarity measures [4]. It measures the linear correlation between users/items and
is defined as the ratio of the covariance of two users' ratings to the product of
their standard deviations, considering only co-rated items. Similarity using PCC is
calculated [5] by:
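$$\mathrm{Sim}(u_a, u_b) = \frac{\sum_{i=1}^{n} (r_{u_a,i} - \bar{r}_{u_a})(r_{u_b,i} - \bar{r}_{u_b})}{\sqrt{\sum_{i=1}^{n} (r_{u_a,i} - \bar{r}_{u_a})^2} \cdot \sqrt{\sum_{i=1}^{n} (r_{u_b,i} - \bar{r}_{u_b})^2}} \tag{1}$$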
PCC has some limitations [6, 7]. One of them is that it can report low similarity
for users whose ratings are similar, and high similarity for users whose ratings differ.
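The limitation above can be seen in a minimal Python sketch of Eq. (1). The function name `pcc`, the dictionary representation of a user's ratings (item id to rating), and the choice to compute each user's mean over co-rated items only are assumptions made for illustration, not details taken from the paper.

```python
from math import sqrt

def pcc(ratings_a, ratings_b):
    """Pearson correlation over co-rated items (sketch of Eq. 1).

    ratings_a, ratings_b: dicts mapping item id -> rating (hypothetical layout).
    """
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0  # assumption: no co-rated items means no evidence of similarity

    # Assumption: means are taken over the co-rated items only.
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)

    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    den_a = sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common))
    den_b = sqrt(sum((ratings_b[i] - mean_b) ** 2 for i in common))

    if den_a == 0 or den_b == 0:
        return 0.0  # correlation is undefined for constant ratings; 0 by convention here
    return num / (den_a * den_b)

# Two users whose ratings always differ by 2 still correlate almost perfectly.
low_rater = {1: 1, 2: 2, 3: 3}
high_rater = {1: 3, 2: 4, 3: 5}
print(pcc(low_rater, high_rater))  # very close to 1.0
```

Because PCC subtracts each user's mean, a constant offset between two users' ratings does not lower the score, which is one way the "high similarity despite different ratings" behaviour arises.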
Cosine similarity [4] measures the cosine of the angle between the two users' rating
vectors. A smaller angle indicates higher similarity and vice versa. This measure
does not center the data or adjust the preference values; hence, it is called the
uncentered cosine similarity [8]. Similarity using Cosine is calculated as
follows [5]:
$$\mathrm{Sim}(u_a, u_b) = \frac{\sum_{i=1}^{n} r_{u_a,i}\, r_{u_b,i}}{\sqrt{\sum_{i=1}^{n} r_{u_a,i}^2} \cdot \sqrt{\sum_{i=1}^{n} r_{u_b,i}^2}} \tag{2}$$
Several researchers have pointed out limitations of cosine similarity [3, 7]. The
main one is that it can report high similarity despite a significant difference in
ratings [6], because cosine depends on the direction of the rating vectors rather
than on their length.
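A short Python sketch of Eq. (2) illustrates this direction-versus-length behaviour. The name `cosine_similarity`, the dictionary layout, and the restriction of the sums to co-rated items are assumptions for this example.

```python
from math import sqrt

def cosine_similarity(ratings_a, ratings_b):
    """Uncentered cosine similarity (sketch of Eq. 2).

    Assumption: sums run over items rated by both users.
    """
    common = set(ratings_a) & set(ratings_b)
    num = sum(ratings_a[i] * ratings_b[i] for i in common)
    norm_a = sqrt(sum(ratings_a[i] ** 2 for i in common))
    norm_b = sqrt(sum(ratings_b[i] ** 2 for i in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: no overlap or all-zero ratings gives 0
    return num / (norm_a * norm_b)

# One user rates everything 1, the other rates everything 5: the vectors
# point in the same direction, so the score is close to 1.0.
harsh = {1: 1, 2: 1}
generous = {1: 5, 2: 5}
print(cosine_similarity(harsh, generous))
```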
Jaccard similarity coefficient is one of the popular methods for calculating the
similarity between users/items and is also called the Tanimoto coefficient [8].
In the similarity calculation, it considers only the number of common ratings between
the two users. This method is most useful when the number of common ratings is
large. Similarity is calculated in Jaccard [5] by:
$$\mathrm{Sim}(u_a, u_b) = \frac{|I_{u_a} \cap I_{u_b}|}{|I_{u_a} \cup I_{u_b}|} \tag{3}$$
The main drawback [7] of this method is that it does not take the absolute rating
values into account and is extremely sensitive to small data sizes with missing
observations.
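The following Python sketch of Eq. (3) shows why the rating values play no role. The function name `jaccard` and the dictionary layout (item id to rating) are assumptions for illustration.

```python
def jaccard(ratings_a, ratings_b):
    """Jaccard similarity on the sets of rated items (sketch of Eq. 3)."""
    items_a, items_b = set(ratings_a), set(ratings_b)
    union = items_a | items_b
    if not union:
        return 0.0  # assumption: two users with no ratings get similarity 0
    return len(items_a & items_b) / len(union)

# The users disagree strongly on items 2 and 3, but only the overlap counts:
# 2 shared items out of 4 distinct items.
user_a = {1: 5, 2: 5, 3: 1}
user_b = {2: 1, 3: 5, 4: 4}
print(jaccard(user_a, user_b))  # 0.5
```

Swapping any rating value in `user_a` or `user_b` leaves the result unchanged, since only the item ids enter the computation.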
The main drawback of this technique is its poor performance on sparse data
sets. Another Pearson correlation-based method is the Sigmoid function-based
Pearson Correlation Coefficient (SPCC). This similarity measure works best when
there is a large number of co-rated items between users. Similarity in SPCC is
calculated as follows:
$$\mathrm{Sim}(u_a, u_b) = \mathrm{Sim}(u_a, u_b)^{PCC} \cdot \frac{1}{1 + \exp\left(-\frac{|I|}{2}\right)} \tag{5}$$
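A minimal Python sketch of the SPCC weighting follows. It assumes that the damping factor is 1 / (1 + exp(-|I| / 2)) and that |I| is the number of items co-rated by the two users; both the function name `spcc` and its parameters are hypothetical.

```python
from math import exp

def spcc(pcc_value, n_common):
    """Scale a PCC score by a sigmoid of the co-rated item count (sketch of Eq. 5).

    pcc_value: a precomputed Pearson correlation for the user pair.
    n_common:  number of co-rated items (assumed meaning of |I|).
    """
    weight = 1.0 / (1.0 + exp(-n_common / 2.0))
    return pcc_value * weight

# The same PCC score is trusted more when it rests on more co-rated items.
print(spcc(1.0, 2))   # about 0.73
print(spcc(1.0, 10))  # about 0.99
```

Under these assumptions the factor approaches 1 as the overlap grows, which matches the statement that the measure works best with many co-rated items.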
Euclidean distance and City block are distance-based similarity measures and are
special cases of the Minkowski distance metric [9]. The Minkowski metric is a
p-metric distance between n-dimensional points and can be defined as:
$$\mathrm{Dis}(u_a, u_b) = \left( \sum_{i=1}^{n} \left| r_{u_a,i} - r_{u_b,i} \right|^{p} \right)^{1/p} \tag{6}$$
Dis(u_a, u_b) is the distance between two users, u_a and u_b. In this metric,
p = 1 gives the City block distance, and p = 2 gives the Euclidean distance.
After calculating the distance, similarity is calculated as follows:
$$\mathrm{Sim}(u_a, u_b) = \frac{1}{1 + \mathrm{Dis}(u_a, u_b)} \tag{7}$$
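The Minkowski distance and its conversion to a similarity can be sketched in Python as below, with p = 1 giving the City block distance and p = 2 the Euclidean distance. The function names, the dictionary layout, the restriction to co-rated items, and the use of an infinite distance when there is no overlap are assumptions for this example.

```python
def minkowski_distance(ratings_a, ratings_b, p):
    """Minkowski distance over co-rated items (sketch of Eq. 6)."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return float("inf")  # assumption: no overlap is treated as infinitely far
    total = sum(abs(ratings_a[i] - ratings_b[i]) ** p for i in common)
    return total ** (1.0 / p)

def distance_to_similarity(distance):
    """Map a distance in [0, inf] to a similarity in [0, 1] (sketch of Eq. 7)."""
    return 1.0 / (1.0 + distance)

user_a = {1: 4, 2: 3}
user_b = {1: 1, 2: 7}  # rating differences of 3 and 4

city_block = minkowski_distance(user_a, user_b, p=1)  # 3 + 4 = 7
euclidean = minkowski_distance(user_a, user_b, p=2)   # sqrt(9 + 16) = 5

print(distance_to_similarity(city_block))  # 0.125
print(distance_to_similarity(euclidean))
```

Since the distance is never negative, the resulting similarity always lies between 0 and 1, and it grows as the distance shrinks.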
The Euclidean distance measure [8] is distance based: the distance between users
is calculated first (by Eq. 8), and the similarity is then computed from it (by
Eq. 7). Like the cosine similarity measure, this measure [10] never gives negative
similarity values, and as the distance decreases, the similarity between users
increases. Distance is calculated as follows:
$$\mathrm{Dis}(u_a, u_b) = \sqrt{\sum_{i=1}^{n} \left( r_{u_a,i} - r_{u_b,i} \right)^{2}} \tag{8}$$
3. Jaccard similarity only counts common ratings. This method fails when many
users have the same common ratings: all user pairs then show the same similarity,
as evident in Tables 2 and 3, which is erroneous.
4. Similar to PCC, CPCC also computes wrong results in many cases, as observed
in Table 4. CPCC computed a similarity of 0 for user pair 23, which is incorrect
because users 2 and 3 both gave ratings to items 2 and 4. Since SPCC is based
upon PCC, a wrong result is also computed for user pair 25 in Table 6: both users
gave ratings for item 3, but the calculated similarity between them is zero.
5. Euclidean and City block are special cases of the Minkowski distance measure.
Both similarity measures give more or less similar results for high-sparsity data
sets, but for low-sparsity data sets, City block gives better results than Euclidean.
For example, in Table 5, Euclidean distance gives more similarity for user pair
13 (0.190) in comparison with user pairs 24 (0.333) and 12 (0.5), while the rating
difference is 1 for pair 12, 2 for pair 24, and 6 for pair 13.
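As a quick arithmetic check on item 5, Eq. (7) can be applied directly to the quoted rating differences. This assumes that those differences (1, 2, and 6) are the City block distances for pairs 12, 24, and 13; the underlying table is not reproduced here, so this is only a sketch:

$$\frac{1}{1 + 1} = 0.5, \qquad \frac{1}{1 + 2} \approx 0.333, \qquad \frac{1}{1 + 6} \approx 0.143$$

Under that assumption, the values for pairs 12 and 24 coincide with those quoted in item 5, while pair 13 would score about 0.143, below the 0.190 quoted for Euclidean distance.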
Based on the above experiments, it is concluded that the City block distance measure
is more efficient than the other similarity measures, even as the sparsity of the
rating matrix increases, as depicted in Fig. 1a–e.
5 Conclusion
Similarity measures help a recommender system suggest items to users on the basis
of their interests. This paper compares existing traditional similarity measures,
namely Pearson correlation, Cosine, Jaccard, Constrained PCC, Sigmoid PCC,
Euclidean distance, and City block, under varying sparsity levels to check their
accuracy. Experimental results show that the City block distance measure
outperforms the other similarity measures on highly sparse data sets.
Fig. 1 (continued) [Panels (a)–(e): similarity value (y-axis) for user pairs 12, 13, 14, 23, 24, and 34 (x-axis). Legend: PCC, Cosine, Jaccard, CPCC, SPCC, Euclidean, City block.]
References
1. Jain, G., Mishra, N., Sharma, S.: CRLRM: category based recommendation using linear regression model. In: 2013 Third International Conference on Advances in Computing and Communications, pp. 17–20. IEEE (2013)
2. Dharaneeshwaran, Nithya, S., Srinivasan, A., Senthilkumar, M.: Calculating the user-item similarity using Pearsons and cosine correlation. In: 2017 International Conference on Trends in Electronics and Informatics (ICEI), pp. 1000–1004. IEEE (2017)
3. Liu, H., Hu, Z., Mian, A., Tian, H., Zhu, X.: A new user similarity model to improve the accuracy of collaborative filtering. Knowl.-Based Syst. 56, 156–166 (2014)
4. Wang, Y., Deng, J., Gao, J., Zhang, P.: A hybrid user similarity model for collaborative filtering. Inf. Sci. 418–419, 102–118 (2017)
5. https://www.youtube.com/watch?v=h9gpufJFF-0. 08 Oct 2018
6. Tan, Z., He, L.: An efficient similarity measure for user-based collaborative filtering recommender systems inspired by the physical resonance principle. IEEE Access 5, 27211–27228 (2017)
7. Patra, B.K., Launonen, R., Ollikainen, V., Nandi, S.: A new similarity measure using Bhattacharyya coefficient for collaborative filtering in sparse data. Knowl.-Based Syst. 82, 163–177 (2015)
8. Arsan, T., Koksal, E., Bozkus, Z.: Comparison of collaborative filtering algorithms with various similarity measures for movie recommendation. Int. J. Comput. Sci. Eng. Appl. 6, 1–20 (2016)
9. Distance Measures. umass.edu. https://www.umass.edu/landeco/teaching/multivariate/readings/McCune.and.Grace.2002.chapter6.pdf. 08 Oct 2018
10. Shimodaira, H.: Similarity and Recommender Systems. http://www.inf.ed.ac.uk/teaching/courses/inf2b-learn-note02-2up.pdf. 08 Oct 2018