

A Survey of Similarity Measures
for Collaborative Filtering-Based
Recommender System

Gourav Jain, Tripti Mahara and Kuldeep Narayan Tripathi

Abstract Recommendation systems are one of the most valuable approaches to providing personalized services to users. They help users find relevant information, matched to their interests, within an enormous amount of data. To achieve this, a similarity measure is used that computes the similarity between two users or items. There are multifarious methods to compute the similarity between users/items, but each method has some limitations. In this paper, we calculate the similarity between users using the Pearson Correlation Coefficient (PCC), Cosine correlation, Constrained Pearson Correlation Coefficient (CPCC), Sigmoid function-based Pearson Correlation Coefficient (SPCC), Jaccard similarity, and Minkowski distance measures (Euclidean distance and City block distance). The results show that the Minkowski metric gives better results than the other similarity measures, as it is less affected by the size and sparsity of the data set.

Keywords Pearson correlation · Cosine correlation · Jaccard · Euclidean distance · City block distance · Constrained Pearson correlation · Sigmoid function-based Pearson correlation

1 Introduction

There is an enormous increase in the amount of information available on the Web. This tremendous growth in information, which cannot be processed easily, is termed information overload. With this overload, it becomes very difficult for users to find meaningful and relevant information. One possible solution to the information overload problem is a recommendation system [1], which helps users find relevant information according to their interests. Collaborative filtering (CF) is one of the most well-known and commonly used techniques for providing recommendations, based on the idea that people who shared common preferences in the past will also share them in the future. These common preferences are found

G. Jain (B) · T. Mahara · K. N. Tripathi


Indian Institute of Technology Roorkee, Roorkee, Uttarakhand 247667, India
e-mail: jaingourav3010@gmail.com

© Springer Nature Singapore Pte Ltd. 2020
M. Pant et al. (eds.), Soft Computing: Theories and Applications,
Advances in Intelligent Systems and Computing 1053,
https://doi.org/10.1007/978-981-15-0751-9_32

Table 1 Nomenclature used in similarity measures

| Notation | Description |
|---|---|
| $r_{u_a,i}$ | The rating given by user $u_a$ on item $i$ |
| $r_{u_b,i}$ | The rating given by user $u_b$ on item $i$ |
| $\bar{r}_{u_a}$ | The average rating value of user $u_a$ |
| $\bar{r}_{u_b}$ | The average rating value of user $u_b$ |
| $n$ | Total number of items |
| $i$ | Total number of co-rated items |
| $r_{\mathrm{med}}$ | Median value of the rating scale |
| $\lvert I_{u_a} \rvert$ | The cardinality of the set of items rated by user $u_a$ |
| $\lvert I_{u_b} \rvert$ | The cardinality of the set of items rated by user $u_b$ |

using various similarity techniques such as the Pearson Correlation Coefficient, Constrained Pearson Correlation, Cosine, Jaccard, etc.

2 Objective and Analysis

The objective of this paper is to analyze the efficiency of similarity measures between users at various sparsity levels, using the Pearson Correlation Coefficient, Cosine Correlation Coefficient, Jaccard, Constrained Pearson Correlation, Sigmoid function-based Pearson Correlation, Euclidean distance, and City block distance measures (Table 1).

3 Related Work

Collaborative filtering is widely used by e-commerce companies to recommend various products to consumers by making use of similarity measures. In this paper, we analyze the performance of the Pearson correlation, Cosine correlation, Jaccard, Constrained PCC, Sigmoid PCC, Euclidean distance, and City block distance measures at various sparsity levels. They are described as follows:

3.1 Pearson Correlation Coefficient

Pearson correlation coefficient (PCC) [2, 3] is one of the most popular and prominent traditional similarity measures [4]. It captures the linear correlation between users/items and is given by the ratio of the covariance of two users' ratings to the product of their standard deviations, considering only co-rated items. Similarity using PCC is calculated [5] by:

$$\mathrm{Sim}(u_a, u_b) = \frac{\sum_{i=1}^{n} \left(r_{u_a,i} - \bar{r}_{u_a}\right)\left(r_{u_b,i} - \bar{r}_{u_b}\right)}{\sqrt{\sum_{i=1}^{n} \left(r_{u_a,i} - \bar{r}_{u_a}\right)^2} \cdot \sqrt{\sum_{i=1}^{n} \left(r_{u_b,i} - \bar{r}_{u_b}\right)^2}} \tag{1}$$

PCC has some limitations [6, 7]; one of them is that it can report low similarity even when two users' ratings are similar, or high similarity despite large differences in their ratings.
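As a concrete illustration, Eq. (1) can be sketched in Python. This is our own minimal sketch, not code from the paper; representing each user's rating profile as a dict mapping item id to rating is an assumed encoding.

```python
import math

def pearson(a, b):
    """Eq. (1): Pearson correlation over the items co-rated by both users.

    a, b: dicts mapping item id -> rating; an absent key means "not rated".
    """
    common = set(a) & set(b)
    if not common:
        return 0.0
    # User averages, as in Table 1 (average rating value of each user).
    mean_a = sum(a.values()) / len(a)
    mean_b = sum(b.values()) / len(b)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = (math.sqrt(sum((a[i] - mean_a) ** 2 for i in common))
           * math.sqrt(sum((b[i] - mean_b) ** 2 for i in common)))
    return num / den if den else 0.0

# Users 1 and 4 from Table 2 (0% sparsity); the result matches the table.
u1 = {1: 4, 2: 4, 3: 1, 4: 4, 5: 3}
u4 = {1: 5, 2: 4, 3: 2, 4: 3, 5: 3}
print(round(pearson(u1, u4), 2))  # 0.77
```

Returning 0.0 when there are no co-rated items (or zero variance) is a common convention, since the correlation is undefined in those cases.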

3.2 Cosine Correlation Coefficient

The cosine similarity measure [4] calculates similarity as the cosine of the angle between the two rating vectors of the users. A smaller angle corresponds to higher similarity and vice versa. This measure does not center the data or adjust for users' preference scales; hence, it is called the uncentered cosine similarity [8]. Similarity using cosine is calculated as follows [5]:

$$\mathrm{Sim}(u_a, u_b) = \frac{\sum_{i=1}^{n} r_{u_a,i}\, r_{u_b,i}}{\sqrt{\sum_{i=1}^{n} r_{u_a,i}^2} \cdot \sqrt{\sum_{i=1}^{n} r_{u_b,i}^2}} \tag{2}$$

Several researchers have pointed out limitations of cosine similarity [3, 7]. The main one is that it can output high similarity in spite of significant differences in ratings [6], because cosine focuses on the direction of the rating vectors rather than their length.
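Eq. (2) can likewise be sketched in Python (our own illustration; an unrated item is simply absent from a user's dict, so it contributes 0 to the dot product):

```python
import math

def cosine(a, b):
    """Eq. (2): cosine of the angle between two rating vectors.

    a, b: dicts mapping item id -> rating; missing items count as 0,
    so the dot product only needs the commonly rated items.
    """
    common = set(a) & set(b)
    num = sum(a[i] * b[i] for i in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

# Users 1 and 4 from Table 2: high similarity, matching the 0.98 entry.
u1 = {1: 4, 2: 4, 3: 1, 4: 4, 5: 3}
u4 = {1: 5, 2: 4, 3: 2, 4: 3, 5: 3}
print(round(cosine(u1, u4), 2))  # 0.98
```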

3.3 Jaccard Similarity Coefficient

The Jaccard similarity coefficient is one of the popular methods for calculating the similarity between users/items and is also called the Tanimoto coefficient similarity measure [8]. In similarity calculation, it only considers the number of common ratings between the two users, so the benefit of using this method is greatest when the number of common ratings is large. Similarity in Jaccard is calculated [5] by:

$$\mathrm{Sim}(u_a, u_b) = \frac{\left|I_{u_a} \cap I_{u_b}\right|}{\left|I_{u_a} \cup I_{u_b}\right|} \tag{3}$$

The main drawback [7] of this method is that it does not take the absolute rating values into account and is extremely sensitive to small data sets with missing observations.
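Because Eq. (3) ignores rating values entirely, a sketch needs only the sets of rated items (again our own illustration, not code from the paper):

```python
def jaccard(a, b):
    """Eq. (3): ratio of co-rated items to all items rated by either user.

    a, b: dicts (or any iterables of item ids); rating values are ignored,
    which is exactly the drawback noted in the text.
    """
    ia, ib = set(a), set(b)
    union = ia | ib
    return len(ia & ib) / len(union) if union else 0.0

# Users 1 and 2 from Table 3 (20% sparsity): rated item sets
# {1, 3, 4, 5} and {1, 2, 3, 5} -> 3 common items out of 5 total.
u1 = {1: 4, 3: 1, 4: 4, 5: 3}
u2 = {1: 2, 2: 1, 3: 4, 5: 5}
print(jaccard(u1, u2))  # 0.6
```

This matches the uniform 0.60 column in Table 3, illustrating why Jaccard cannot distinguish pairs whose rated-item sets overlap equally.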

3.4 Constrained Pearson Correlation and Sigmoid Function-Based Pearson Correlation

The Constrained Pearson Correlation Coefficient (CPCC) is a variant of PCC that is more suitable when negative ratings are present in the data set [3]. In CPCC, the user's average rating value is replaced with the median value of the rating scale, which is 3 for the rating scale 1–5. Similarity in CPCC is calculated by:

$$\mathrm{Sim}(u_a, u_b) = \frac{\sum_{i=1}^{n} \left(r_{u_a,i} - r_{\mathrm{med}}\right)\left(r_{u_b,i} - r_{\mathrm{med}}\right)}{\sqrt{\sum_{i=1}^{n} \left(r_{u_a,i} - r_{\mathrm{med}}\right)^2} \cdot \sqrt{\sum_{i=1}^{n} \left(r_{u_b,i} - r_{\mathrm{med}}\right)^2}} \tag{4}$$

The main drawback of this technique is its poor performance on sparse data sets. Another Pearson correlation-based method is the Sigmoid function-based Pearson Correlation Coefficient (SPCC). This similarity measure works best when there is a large number of co-rated items between users. Similarity in SPCC is calculated as follows:

$$\mathrm{Sim}(u_a, u_b) = \mathrm{Sim}(u_a, u_b)_{\mathrm{PCC}} \cdot \frac{1}{1 + \exp\left(-\frac{i}{2}\right)} \tag{5}$$

where $i$ is the number of co-rated items (Table 1).
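Both variants can be sketched in Python (our own illustration; the dict encoding and the zero-denominator convention are assumptions, and SPCC here simply damps a locally computed PCC by the sigmoid factor of Eq. (5)):

```python
import math

def cpcc(a, b, r_med=3.0):
    """Eq. (4): PCC with each user's mean replaced by the scale median."""
    common = set(a) & set(b)
    num = sum((a[i] - r_med) * (b[i] - r_med) for i in common)
    den = (math.sqrt(sum((a[i] - r_med) ** 2 for i in common))
           * math.sqrt(sum((b[i] - r_med) ** 2 for i in common)))
    return num / den if den else 0.0

def spcc(a, b):
    """Eq. (5): PCC damped by a sigmoid in the number of co-rated items."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    mean_a = sum(a.values()) / len(a)
    mean_b = sum(b.values()) / len(b)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = (math.sqrt(sum((a[i] - mean_a) ** 2 for i in common))
           * math.sqrt(sum((b[i] - mean_b) ** 2 for i in common)))
    pcc = num / den if den else 0.0
    # Few co-rated items -> factor near 0.5; many -> factor near 1.
    return pcc / (1.0 + math.exp(-len(common) / 2.0))

# Users 1 and 4 from Table 2 on the 1-5 scale (median 3).
u1 = {1: 4, 2: 4, 3: 1, 4: 4, 5: 3}
u4 = {1: 5, 2: 4, 3: 2, 4: 3, 5: 3}
print(round(cpcc(u1, u4), 2))  # 0.77
```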

3.5 Euclidean Distance Measure and City Block Distance Measure

The Euclidean distance and City block measures are distance-based similarity measures and are special cases of the Minkowski distance metric [9]. The Minkowski metric is a p-metric distance between n-dimensional points and can be defined as:

$$\mathrm{Dis}(u_a, u_b) = \left(\sum_{i=1}^{n} \left|r_{u_a,i} - r_{u_b,i}\right|^p\right)^{1/p} \tag{6}$$

$\mathrm{Dis}(u_a, u_b)$ is the distance between the two users $u_a$ and $u_b$. In this metric, $p = 2$ gives the Euclidean distance and $p = 1$ gives the City block distance. After calculating the distance, similarity is calculated as follows:

$$\mathrm{Sim}(u_a, u_b) = \frac{1}{1 + \mathrm{Dis}(u_a, u_b)} \tag{7}$$

The Euclidean distance measure [8] is distance based: to calculate similarity, the distance between users is first calculated (by Eq. 8), and the similarity is then computed (by Eq. 7). Similar to the cosine similarity measure, this measure [10] never gives negative similarity values, and as the distance decreases, the similarity between users increases. Distance is calculated as follows:

$$\mathrm{Dis}(u_a, u_b) = \sqrt{\sum_{i=1}^{n} \left(r_{u_a,i} - r_{u_b,i}\right)^2} \tag{8}$$

This technique is less accurate in sparse environments [8]. Another variant of the Minkowski metric is the City Block (CB), or Manhattan, distance measure [9]. The distance between two users is determined (by Eq. 9), and the similarity is then calculated (by Eq. 7). The similarity value in City block is always greater than or equal to zero.

$$\mathrm{Dis}(u_a, u_b) = \sum_{i=1}^{n} \left|r_{u_a,i} - r_{u_b,i}\right| \tag{9}$$

An advantage of using the City block distance measure for similarity computation is that it works well for sparse data sets; it thus overcomes the main limitation of traditional similarity measures.
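Eqs. (6)–(9) collapse into one parameterized sketch (our own illustration; summing over co-rated items only is an assumption consistent with the nomenclature of Table 1):

```python
def minkowski_sim(a, b, p):
    """Eqs. (6)-(7): Minkowski distance over co-rated items, turned into
    a similarity via 1 / (1 + distance).  p = 2 gives the Euclidean
    distance (Eq. 8); p = 1 gives the City block distance (Eq. 9).

    a, b: dicts mapping item id -> rating.
    """
    common = set(a) & set(b)
    dist = sum(abs(a[i] - b[i]) ** p for i in common) ** (1.0 / p)
    return 1.0 / (1.0 + dist)

# Users 1 and 4 from Table 2: City block and Euclidean similarities
# match the CB (0.25) and Euclidean (0.37) entries in the table.
u1 = {1: 4, 2: 4, 3: 1, 4: 4, 5: 3}
u4 = {1: 5, 2: 4, 3: 2, 4: 3, 5: 3}
print(round(minkowski_sim(u1, u4, 1), 2))  # 0.25
print(round(minkowski_sim(u1, u4, 2), 2))  # 0.37
```

Note that the similarity is always in (0, 1], consistent with the remark that these measures never produce negative values.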

4 Results and Discussion

To compare the performance of the similarity measures, experiments are conducted on a data set consisting of a user–item rating matrix. The data set consists of 4 users and 5 items, with ratings provided on a scale of 1–5. In order to evaluate performance, similarity is calculated on the data set at sparsity levels varying from 0 to 80% (in intervals of 20%). In Tables 2, 3, 4, 5, and 6, part (a) presents the data set for each sparsity level and part (b) presents the similarity calculated by the different techniques for the data set given in part (a).
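To make the setup concrete, the following sketch (our own illustration; encoding "not rated" as 0 follows the tables' convention) reproduces the City block column of Table 2 for all six user pairs:

```python
from itertools import combinations

# Rating matrix of Table 2(a) (0% sparsity); a 0 entry would mean "not rated".
R = {
    1: [4, 4, 1, 4, 3],
    2: [2, 1, 4, 2, 5],
    3: [3, 1, 3, 2, 1],
    4: [5, 4, 2, 3, 3],
}

def as_profile(row):
    """Drop unrated (zero) entries, keeping item index -> rating."""
    return {i: r for i, r in enumerate(row) if r != 0}

def city_block_sim(a, b):
    """Eqs. (7) and (9) over the co-rated items."""
    common = set(a) & set(b)
    dist = sum(abs(a[i] - b[i]) for i in common)
    return 1.0 / (1.0 + dist)

# Prints the six pair similarities, matching the CB column of Table 2(b).
for ua, ub in combinations(sorted(R), 2):
    sim = city_block_sim(as_profile(R[ua]), as_profile(R[ub]))
    print(f"pair {ua}{ub}: {sim:.2f}")
```

Raising a table's sparsity amounts to zeroing more entries of `R` before running the same loop, which is how the comparison below varies from 0 to 80%.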
From the experiments, it is concluded that:
1. The Pearson correlation coefficient gives incorrect results in some cases. For example, PCC computes zero similarity between user pairs 23 and 34 in Table 2, but from the data set it is clear that some similarity exists between these user pairs. In Table 5, it renders zero similarity for all user pairs, which is wrong.
2. Similar to PCC, cosine correlation gives incorrect results in some cases. For instance, it shows low similarity for user pair 14 in Table 4, while the other similarity measures except Jaccard show maximum similarity for the same pair. As evident from Table 3, the cosine measure computes maximum similarity for user pair 13, but it is obvious from the data set that the similarity should be maximum for user pair 12.

Table 2 a Sparsity 0% b Similarity

(a)

| | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 |
|---|---|---|---|---|---|
| User 1 | 4 | 4 | 1 | 4 | 3 |
| User 2 | 2 | 1 | 4 | 2 | 5 |
| User 3 | 3 | 1 | 3 | 2 | 1 |
| User 4 | 5 | 4 | 2 | 3 | 3 |

(b)

| User pair | PCC | Cosine | Jaccard | CPCC | SPCC | Euclidean | CB |
|---|---|---|---|---|---|---|---|
| 12 | −0.68 | 0.72 | 1.00 | −0.68 | −0.63 | 0.15 | 0.08 |
| 13 | −0.38 | 0.80 | 1.00 | −0.38 | −0.35 | 0.18 | 0.09 |
| 14 | 0.77 | 0.98 | 1.00 | 0.77 | 0.72 | 0.37 | 0.25 |
| 23 | 0.00 | 0.81 | 1.00 | 0.10 | 0.00 | 0.19 | 0.14 |
| 24 | −0.61 | 0.77 | 1.00 | −0.62 | −0.57 | 0.16 | 0.08 |
| 34 | 0.00 | 0.87 | 1.00 | −0.27 | 0.00 | 0.19 | 0.10 |

Table 3 a Sparsity 20% b Similarity

(a)

| | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 |
|---|---|---|---|---|---|
| User 1 | 4 | 0 | 1 | 4 | 3 |
| User 2 | 2 | 1 | 4 | 0 | 5 |
| User 3 | 3 | 1 | 0 | 2 | 1 |
| User 4 | 0 | 4 | 2 | 3 | 3 |

(b)

| User pair | PCC | Cosine | Jaccard | CPCC | SPCC | Euclidean | CB |
|---|---|---|---|---|---|---|---|
| 12 | −0.55 | 0.61 | 0.60 | −0.55 | −0.45 | 0.20 | 0.13 |
| 13 | 0.72 | 0.92 | 0.60 | −0.32 | 0.59 | 0.25 | 0.17 |
| 14 | 0.89 | 0.58 | 0.60 | 0.89 | 0.73 | 0.41 | 0.33 |
| 23 | −0.25 | 0.46 | 0.60 | 0.00 | −0.21 | 0.20 | 0.17 |
| 24 | −0.71 | 0.65 | 0.60 | −0.71 | −0.58 | 0.20 | 0.13 |
| 34 | −0.69 | 0.54 | 0.60 | −0.67 | −0.56 | 0.21 | 0.14 |

Table 4 a Sparsity 40% b Similarity

(a)

| | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 |
|---|---|---|---|---|---|
| User 1 | 4 | 0 | 1 | 4 | 0 |
| User 2 | 2 | 1 | 4 | 0 | 5 |
| User 3 | 0 | 1 | 0 | 2 | 1 |
| User 4 | 0 | 4 | 2 | 0 | 0 |

(b)

| User pair | PCC | Cosine | Jaccard | CPCC | SPCC | Euclidean | CB |
|---|---|---|---|---|---|---|---|
| 12 | −0.95 | 0.31 | 0.40 | −0.95 | −0.69 | 0.22 | 0.17 |
| 13 | 1.00 | 0.57 | 0.20 | −1.00 | 0.62 | 0.33 | 0.33 |
| 14 | 1.00 | 0.08 | 0.25 | 1.00 | 0.62 | 0.50 | 0.50 |
| 23 | 0.00 | 0.36 | 0.40 | 0.00 | 0.00 | 0.20 | 0.20 |
| 24 | −0.95 | 0.40 | 0.50 | −0.95 | −0.69 | 0.22 | 0.17 |
| 34 | −1.00 | 0.37 | 0.25 | −1.00 | −0.62 | 0.25 | 0.25 |

Table 5 a Sparsity 60% b Similarity

(a)

| | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 |
|---|---|---|---|---|---|
| User 1 | 4 | 0 | 0 | 4 | 0 |
| User 2 | 2 | 0 | 4 | 0 | 0 |
| User 3 | 0 | 1 | 0 | 2 | 1 |
| User 4 | 0 | 0 | 2 | 0 | 0 |

(b)

| User pair | PCC | Cosine | Jaccard | CPCC | SPCC | Euclidean | CB |
|---|---|---|---|---|---|---|---|
| 12 | 0.00 | 0.32 | 0.33 | −1.00 | 0.00 | 0.33 | 0.33 |
| 13 | 0.00 | 0.58 | 0.25 | −1.00 | 0.00 | 0.33 | 0.33 |
| 14 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 23 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 24 | 0.00 | 0.89 | 0.50 | −1.00 | 0.00 | 0.33 | 0.33 |
| 34 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |

3. Jaccard similarity only counts common ratings. This method fails when many users have the same number of common ratings: all user pairs show the same similarity, as evident in Tables 2 and 3, which is erroneous.
4. Similar to PCC, CPCC also computes wrong results in many cases, as observed in Table 4. CPCC computes a similarity of 0 for user pair 23, which is incorrect since users 2 and 3 both gave ratings to items 2 and 5. As we know, SPCC is based

Table 6 a Sparsity 80% b Similarity

(a)

| | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 |
|---|---|---|---|---|---|
| User 1 | 4 | 0 | 0 | 0 | 0 |
| User 2 | 2 | 0 | 0 | 0 | 0 |
| User 3 | 0 | 0 | 0 | 2 | 0 |
| User 4 | 0 | 0 | 2 | 0 | 0 |

(b)

| User pair | PCC | Cosine | Jaccard | CPCC | SPCC | Euclidean | CB |
|---|---|---|---|---|---|---|---|
| 12 | 0.00 | 1.00 | 1.00 | −1.00 | 0.00 | 0.33 | 0.33 |
| 13 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 14 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 23 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 24 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 34 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |

upon PCC, an incorrect result is likewise computed for user pair 24 in Table 5: both users rated item 3, but the calculated similarity between them is zero.
5. Euclidean and City block are special cases of the Minkowski distance measure. Both similarity measures give broadly similar results for highly sparse data sets, but at low sparsity, City block gives better results than Euclidean. For example, in Table 5, Euclidean distance gives lower similarity for user pair 13 (0.190) than for user pairs 24 (0.333) and 12 (0.5), while the rating difference for pair 12 is 1, for pair 24 is 2, and for pair 13 is 6.
Based on the above experiments, it is concluded that the City block distance measure is more efficient than the other similarity measures, even as the sparsity of the rating matrix increases, as depicted in Fig. 1a–e.

5 Conclusion

Similarity measures are central to systems that suggest items to users on the basis of their interests. This paper compares existing traditional similarity measures, namely Pearson correlation, Cosine correlation, Jaccard, Constrained PCC, Sigmoid PCC, Euclidean distance, and City block distance, under varying sparsity levels to check their accuracy. Experimental results show that the City block distance measure outperforms the other similarity measures on highly sparse data sets.

Fig. 1 Similarity computed by PCC, Cosine, Jaccard, CPCC, SPCC, Euclidean, and City block for each user pair at sparsity a 0% b 20% c 40% d 60% e 80%

References

1. Jain, G., Mishra, N., Sharma, S.: CRLRM: category based recommendation using linear regression model. In: 2013 Third International Conference on Advances in Computing and Communications, pp. 17–20. IEEE (2013)
2. Dharaneeshwaran, Nithya, S., Srinivasan, A., Senthilkumar, M.: Calculating the user-item similarity using Pearsons and cosine correlation. In: 2017 International Conference on Trends in Electronics and Informatics (ICEI), pp. 1000–1004. IEEE (2017)
3. Liu, H., Hu, Z., Mian, A., Tian, H., Zhu, X.: A new user similarity model to improve the accuracy of collaborative filtering. Knowl.-Based Syst. 56, 156–166 (2014)
4. Wang, Y., Deng, J., Gao, J., Zhang, P.: A hybrid user similarity model for collaborative filtering. Inf. Sci. 418–419, 102–118 (2017)
5. https://www.youtube.com/watch?v=h9gpufJFF-0. Accessed 08 Oct 2018
6. Tan, Z., He, L.: An efficient similarity measure for user-based collaborative filtering recommender systems inspired by the physical resonance principle. IEEE Access 5, 27211–27228 (2017)
7. Patra, B.K., Launonen, R., Ollikainen, V., Nandi, S.: A new similarity measure using Bhattacharyya coefficient for collaborative filtering in sparse data. Knowl.-Based Syst. 82, 163–177 (2015)
8. Arsan, T., Koksal, E., Bozkus, Z.: Comparison of collaborative filtering algorithms with various similarity measures for movie recommendation. Int. J. Comput. Sci. Eng. Appl. 6, 1–20 (2016)
9. Distance Measures. https://www.umass.edu/landeco/teaching/multivariate/readings/McCune.and.Grace.2002.chapter6.pdf. Accessed 08 Oct 2018
10. Shimodaira, H.: Similarity and Recommender Systems. http://www.inf.ed.ac.uk/teaching/courses/inf2b-learn-note02-2up.pdf. Accessed 08 Oct 2018

