Deduplication of Data in The Cloud-Based Server For Chat/File Transfer Application
LITERATURE SURVEY
In recent times, recommender systems have been extensively studied and improved. There are numerous techniques and methods for implementing such systems. In this section we review some of the existing work related to the proposed method, limiting our scope to advances made in collaborative filtering techniques.
Collaborative recommendation systems recommend an item to a user if similar users liked that item. Examples of this technique include nearest neighbour modelling [1], matrix completion [2], restricted Boltzmann machines [3], Bayesian matrix factorization [4], etc.
Types of collaborative filtering based on
the implementation are:
-Memory based CF:
Pearson correlation measures the extent to which two variables linearly relate with each other [10]. Other correlation-based similarities include: constrained Pearson correlation, a variation of Pearson correlation that uses a midpoint instead of the mean rate; Spearman rank correlation, similar to Pearson correlation except that the ratings are ranks; and Kendall's τ correlation, similar to the Spearman rank correlation, but instead of using the ranks themselves, only the relative ranks are used to calculate the correlation [11,12].
-Model based CF:
Unlike memory-based CF, the model-based approach does not use the whole data set to compute a prediction. Instead, it builds a model of the data based on a training set and uses that model to predict future ratings. For example, a clustering-based CF method builds a model of the data set as clusters of users, and then uses the ratings of users within the cluster to predict. A very successful model-based method is Singular Value Decomposition (SVD) [19], which represents the data by a set of vectors, one for each item and user, such that the dot product of the user vector and the movie vector is the best approximation for the training set. The model building process is computationally expensive and memory intensive. After model construction, predictions are made very fast with a small memory requirement. It achieves less accurate predictions than memory-based methods on dense data sets, where a large fraction of user-item values are available in the training set, but performs better on sparse data sets.
-Simple Bayesian CF Algorithm:
The simple Bayesian CF algorithm uses a naive Bayes (NB) strategy to make predictions for CF tasks. Assuming the features are independent given the class, the probability of a certain class given all of the features can be computed, and the class with the highest probability is taken as the predicted class [20].
-Other Bayesian CF Algorithms:
Bayesian belief nets with decision trees at each node: this model has a decision tree at each node of the BNs, where a node corresponds to each item in the domain and the states of each node correspond to the possible ratings for each item [21]. Results show that this model has similar prediction performance to Pearson correlation-based CF methods, and better performance than Bayesian-clustering and vector cosine memory-based CF algorithms. The baseline Bayesian model uses a Bayesian belief net with no arcs (baseline model) for collaborative filtering and recommends items based on their overall popularity [22].
-Hybrid Recommender:
Most hybrid recommenders are built by combining a CF algorithm with a content-based algorithm [13]. However, a hybrid algorithm can also be implemented with two CF algorithms. One way to do it, as [14] suggests (and the approach this project has chosen), is to weight the predictions of the two algorithms to form the final prediction according to r̂ui = αr̂ui1 + βr̂ui2, where r̂ui1 is the predicted rating from the memory-based algorithm, r̂ui2 is the predicted rating from the model-based algorithm, and r̂ui is the hybrid prediction. α is the weight of the memory prediction and β is the weight of the model prediction. Furthermore, the requirement is α + β = 1 where 0 ≤ (α, β) ≤ 1. The values of α and β are determined numerically. This simple approach intuitively results in a running time equal to that of both algorithms combined.
Hybrid recommenders can also incorporate CF and content-based features. The content-boosted CF algorithm uses naive Bayes as the content classifier; it then fills in the missing values of the rating matrix with the predictions of the content predictor to form a pseudo rating matrix, in which observed ratings are kept untouched and missing ratings are replaced by the predictions of the content predictor. Predictions are then made over the resulting pseudo rating matrix using a weighted Pearson correlation-based CF algorithm, which gives a higher weight to items that more users rated, and a higher weight to the active user [18].
Other recommender systems include demographic-based recommender systems, which use user profile information such as gender, postcode, occupation, and so forth [15]; and utility-based and knowledge-based recommender systems, both of which require knowledge about how a particular object satisfies the user's needs [16, 17]. We will not discuss these systems in detail in this work.
Collaborative filtering can further be divided into two subcategories: user based and item based.
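The weighted hybrid combination described above can be sketched in a few lines. The prediction arrays and the value of α below are illustrative assumptions, not values from this work; only the rule r̂ui = αr̂ui1 + βr̂ui2 with α + β = 1 comes from the text:

```python
import numpy as np

def hybrid_predict(r_memory, r_model, alpha):
    """Combine two CF predictions: r_hat = alpha*r1 + beta*r2, with beta = 1 - alpha."""
    beta = 1.0 - alpha  # enforces the constraint alpha + beta = 1
    return alpha * np.asarray(r_memory) + beta * np.asarray(r_model)

# Hypothetical predicted ratings for three user-item pairs.
r1 = [4.0, 2.5, 3.0]   # memory-based predictions
r2 = [3.0, 3.5, 5.0]   # model-based predictions
print(hybrid_predict(r1, r2, alpha=0.6))   # [3.6 2.9 3.8]
```

In practice α would be tuned numerically on held-out data, as the text notes.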
The user-based approach focuses on the "nearest neighbour" method for recommendations: it looks at the rating patterns of other users and finds the "nearest neighbours", i.e. users whose ratings are closest to yours. The algorithm then gives you recommendations based on the ratings of these neighbours [5].
Amazon developed item-based collaborative filtering in 1998 [6]. Unlike user-based collaborative filtering, item-based filtering looks at the similarity between items, and performs this task by noting how many users that bought item A also bought item B. If the correlation is high enough, a similarity can be considered to exist between the two items, and they can be seen as similar to one another. Item B will from there on be recommended to users who bought item A, and vice versa [6]. Similarity can be computed in a number of ways: using the user ratings, using some product description, or using the co-occurrence of items in a bag or in the set of a user's past purchased products.
The item-based methodology will be used in the proposed model, so we will look more closely at the work performed using it.
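The co-occurrence idea ("users who bought item A also bought item B") can be sketched as follows. The purchase baskets and the Jaccard-style normalization are illustrative choices for this sketch, not the exact measure used by Amazon:

```python
from itertools import combinations
from collections import Counter

def cooccurrence_similarity(baskets):
    """Count how often two items appear in the same user's purchase set,
    normalized by the number of users who bought either item (Jaccard)."""
    item_users = Counter()   # item -> number of users who bought it
    pair_users = Counter()   # (item_a, item_b) -> number of users who bought both
    for basket in baskets:
        items = set(basket)
        for it in items:
            item_users[it] += 1
        for a, b in combinations(sorted(items), 2):
            pair_users[(a, b)] += 1
    sims = {}
    for (a, b), both in pair_users.items():
        union = item_users[a] + item_users[b] - both
        sims[(a, b)] = both / union
    return sims

# Hypothetical purchase histories: users who bought A often bought B too.
baskets = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B"}]
sims = cooccurrence_similarity(baskets)
print(sims[("A", "B")])   # 2 common buyers / 4 total buyers -> 0.5
```

Items whose similarity exceeds a chosen threshold would then be recommended to each other's buyers.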
Bayesian networks create a model based
on a training set with a decision tree at
each node and edges representing user
information. The model can be built off-line
over a matter of hours or days. The
resulting model is very small, very fast,
and essentially as accurate as nearest
neighbour methods [4].
Clustering techniques work by identifying
groups of users who appear to have
similar preferences. Once the clusters are
created, predictions for an individual can
be made by averaging the opinions of the
other users in that cluster. Some
clustering techniques represent each user
with partial participation in several
clusters. The prediction is then an average
across the clusters, weighted by degree of
participation. Clustering techniques usually
produce less-personal recommendations
than other methods, and in some cases, the
clusters have worse accuracy than nearest neighbour algorithms [7].
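The partial-participation idea above amounts to a membership-weighted average. The membership weights and cluster means below are invented for illustration:

```python
import numpy as np

def soft_cluster_predict(memberships, cluster_means):
    """Predict a user's rating as the average of cluster opinions,
    weighted by the user's degree of participation in each cluster."""
    w = np.asarray(memberships, dtype=float)
    w = w / w.sum()                        # normalize participation weights
    return float(w @ np.asarray(cluster_means))

# Hypothetical: the user belongs 70% to cluster 0 and 30% to cluster 1;
# those clusters rate the target item 4.0 and 2.0 on average.
print(soft_cluster_predict([0.7, 0.3], [4.0, 2.0]))   # 3.4
```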
Horting is a graph-based technique in
which nodes are users, and edges
between nodes indicate degree of
similarity between two users. Predictions
are produced by walking the graph to nearby nodes and combining the opinions of the nearby users. Horting differs from nearest neighbour as the graph may be walked through other users who have
not rated the item in question, thus
exploring transitive relationships that
nearest neighbour algorithms do not
consider. In one study using synthetic data,
Horting produced better predictions than a
nearest neighbour algorithm [8].
The item-based approach performs poorly
for datasets with browsing or
entertainment related items such as
MovieLens, where the recommendations
it gives out seem very obvious to the target
users. Such datasets see better results
with matrix factorization techniques. Matrix
factorization can be seen as breaking
down a large matrix into a product of
smaller ones. This is similar to the
factorization of integers, where 12 can be
written as 6 x 2 or 4 x 3. In the case of
matrices, a matrix A with dimensions m x
n can be reduced to a product of two
matrices X and Y with dimensions m x p
and p x n respectively.
There are a number of ways to implement matrix factorization that improve the performance of item-based recommender systems. One of the popular algorithms to factorize a matrix
is the singular value decomposition (SVD)
algorithm. SVD came into the limelight
when matrix factorization was seen
performing well in the Netflix prize
competition. Other algorithms
include PCA and its variations, NMF, and so
on. Auto encoders can also be used for
dimensionality reduction in case you want to
use Neural Networks.
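The factorization described above can be sketched with NumPy's SVD routine. The toy rating matrix and the choice of p = 2 retained factors are illustrative assumptions:

```python
import numpy as np

# Toy 4x3 user-item rating matrix (rows: users, columns: items).
A = np.array([[5., 3., 0.],
              [4., 0., 0.],
              [1., 1., 5.],
              [0., 1., 4.]])

# SVD: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the top-p singular values: A ~ X @ Y with X (m x p) and Y (p x n).
p = 2
X = U[:, :p] * s[:p]   # m x p user-factor matrix
Y = Vt[:p, :]          # p x n factor-item matrix
A_hat = X @ Y          # rank-p approximation of A

print(A_hat.shape)                  # (4, 3)
print(np.linalg.norm(A - A_hat))    # small reconstruction error
```

The rows of X and columns of Y play the role of the latent user and item vectors whose dot products approximate the ratings.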
We aim to use the output of matrix
factorization as the input to the deep
neural network for recommendations.
PROBLEM FORMULATION
User based CF: User-Based Collaborative Filtering is a method for predicting items for a user on the basis of the ratings given to those items by other users who have similar taste to the target user.
Many websites use collaborative filtering for
building their recommendation system.
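A minimal sketch of the user-based prediction just described, using cosine similarity over rating rows; the tiny rating matrix is invented, and 0 marks a missing rating:

```python
import numpy as np

def predict_user_based(R, user, item, k=2):
    """Predict R[user, item] from the k most similar users (cosine
    similarity) who have actually rated the item."""
    target = R[user]
    sims = []
    for other in range(R.shape[0]):
        if other == user or R[other, item] == 0:
            continue
        denom = np.linalg.norm(target) * np.linalg.norm(R[other])
        sims.append((np.dot(target, R[other]) / denom, other))
    sims.sort(reverse=True)            # most similar neighbours first
    top = sims[:k]
    num = sum(s * R[o, item] for s, o in top)
    den = sum(s for s, _ in top)
    return num / den if den else 0.0   # similarity-weighted average rating

# Rows = users, columns = items; 0 means "not rated".
R = np.array([[5., 4., 0.],
              [4., 5., 3.],
              [1., 2., 5.],
              [5., 5., 4.]])
print(predict_user_based(R, user=0, item=2))
```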
Drawbacks:
Sparsity: The percentage of users rating items is very low.
Scalability: The more K neighbors we consider (under a certain threshold), the better the classification should be. Nevertheless, the more users there are in the system, the greater the cost of finding the nearest K neighbors will be.
Cold-start: New users don't have much information about their taste to be compared with other users.
New item: Newly added items will lack the ratings needed to create a solid ranking.
Item based CF: The similarities between different items in the dataset are calculated using one of a number of similarity measures, and then these similarity values are used to predict ratings for user-item pairs not present in the dataset. Some of the problems faced with this method arise when there is only one common user between movies; these problems are due to the sparsity of the dataset itself.
SVD: Singular value decomposition (SVD) is a collaborative filtering method for movie recommendation. SVD generates correlations and features from the user-item matrix. For example, if the items are movies, SVD would describe them by generating factors like action vs comedy, Hollywood vs Bollywood, or Marvel vs Disney. We will mainly focus on the latent factor model for the Singular Value Decomposition (SVD) approach. The main drawback of SVD is that there is not much explanation of why an item is recommended to a user. This turns out to be a huge problem when a user wants to know why a particular item is recommended to them.
The advantages of using deep neural networks to assist representation learning are two-fold: (1) they reduce the effort of hand-crafted feature design, and (2) they enable the recommendation system to take in a wide variety of content such as text, images, audio, and even video.
EXPERIMENT
Dataset Description:
MovieLens is a recommender system that recommends movies on the basis of users' preferences using a collaborative filtering approach. MovieLens was created in 1997 by GroupLens Research, a research lab in the Department of Computer Science and Engineering at the University of Minnesota, in order to gather research data on personalized recommendations.
We evaluated the proposed neural network model on the MovieLens 1M dataset. As shown in Table 1, the MovieLens 1M dataset contains
1,000,000 ratings for 3952 movies by 6040 users. Each rating is an integer between 1 and 5 (worst to best). To measure the performance of the online prediction method, we shuffled the rating matrix. We use about 80% of the data as the training and validation set, and the remainder as the testing set. For evaluation, we use root mean square error (RMSE), which places more emphasis on predictions with larger errors.
Table 3 shows the RMSE values of some basic models and neural-network models. We referred to the results of the experiments in [26] for the other conventional CF algorithms. Our proposed neural-network model achieves much better performance than traditional methods such as user-based CF, item-based CF, and the SVD method. The results of our method are also superior to the autoencoder-based model and the restricted Boltzmann machine, which are based on deep learning.
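The evaluation protocol described above (shuffle, roughly 80/20 split, RMSE) can be sketched as follows; the random ratings are a stand-in for the real MovieLens data, and the mean predictor is only a placeholder baseline:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error; squaring emphasizes larger errors."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

rng = np.random.default_rng(0)
ratings = rng.integers(1, 6, size=1000).astype(float)  # stand-in for the 1M ratings

# Shuffle, then hold out 80% for training/validation and 20% for testing.
rng.shuffle(ratings)
split = int(0.8 * len(ratings))
train, test = ratings[:split], ratings[split:]

# A trivial baseline predictor: always predict the training mean.
baseline = np.full_like(test, train.mean())
print(len(train), len(test))        # 800 200
print(rmse(test, baseline))
```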
Architecture:
MovieLens dataset -> user embedding layer (u) -> item embedding layer (m) -> dot product of u and m -> 6 fully connected dense neural layers -> output layer
For the MovieLens 1M dataset, we use a network with a (6040+3952) rating vector as the input, and 6 fully-connected hidden layers with activations each. Fig. (a) shows the RMSE value for the test data versus the number of training epochs. We further compare our algorithm with the existing ones using the MovieLens 1M dataset.
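A NumPy-only sketch of a single forward pass through this pipeline. The weights are random (untrained), the hidden width of 64 and the ReLU activation are assumptions; only the embedding size n=50, the six dense layers, and the 6040-user/3952-movie counts come from the text, and training details such as dropout and the Adam optimiser are omitted from this inference-only sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
n_users, n_items, n_factors = 6040, 3952, 50

# Embedding tables: one 50-d vector per user and per movie, randomly initialized.
user_emb = rng.normal(scale=0.1, size=(n_users, n_factors))
item_emb = rng.normal(scale=0.1, size=(n_items, n_factors))

def relu(x):
    return np.maximum(0.0, x)

def forward(user_id, item_id, hidden=64, n_layers=6):
    """Dot product of the two embeddings, fed through six dense layers."""
    u, m = user_emb[user_id], item_emb[item_id]
    x = np.array([np.dot(u, m)])        # dot product of user and movie vectors
    for _ in range(n_layers):           # six fully connected dense layers
        W = rng.normal(scale=0.1, size=(x.size, hidden))
        b = np.zeros(hidden)
        x = relu(x @ W + b)
    w_out = rng.normal(scale=0.1, size=hidden)
    return float(x @ w_out)             # scalar predicted rating (untrained)

print(forward(user_id=0, item_id=10))
```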
DETAILS
We use embeddings to represent each user and each movie in the data. These embeddings are vectors (of size n=50 factors) that start as random numbers but are fit by the model to capture the essential qualities of each user/movie. We accomplish this by computing the dot product between a user vector and a movie vector to get a predicted rating. The resulting dot product is fed into the deep neural network. The deep neural network consists of six fully connected dense layers, and dropout layers with a probability of 0.4 are used for regularization after each neural layer. The Adam optimiser is used with a learning rate of 0.001.
CONCLUSION
We conclude that the proposed method has outperformed the traditional methods. With more research on fine-tuning the hyper-parameters and building deeper network architectures, we can achieve even better results for the validation error.