
Deduplication of Data in the Cloud-based Server for Chat/File Transfer Application

Ankit Rai, Arpit Diwan, Harsh Tripathi

ABSTRACT

With the enormous use and collection of data, cloud storage is gaining popularity among computer users, but most of the data held in cloud storage is redundant. Cloud storage providers use deduplication to keep only one copy of each file, thereby minimizing the storage and management overheads for data. However, deduplication raises many security concerns and challenges. This paper addresses techniques to provide secure deduplication of data by taking the popularity of data items into account, assuming that data items require different levels of security based on their popularity. We propose a mechanism to ensure secure data deduplication that leverages the advantages of dynamic perfect hashing techniques.

Keywords
Cloud Storage, Convergent Key Encryption, Dynamic Hashing, Data Deduplication

INTRODUCTION
With the increasing demand for data rates throughout the world, cloud storage systems are becoming a necessity for computer users. Users upload their data to cloud storage and access the remotely stored data on demand [18]. However, about 70% of the data stored in cloud storage is redundant [10], which prevents the storage space from being utilized efficiently. Data deduplication is the process of eliminating redundant items from cloud storage. Deduplication can take place in several ways. In client-side deduplication, the user verifies the redundancy of the data prior to uploading it to the cloud. In server-side deduplication, the server checks for redundant data items after they have been uploaded and removes them. Deduplication can also happen at several granularities, i.e. file level and block level.

Storing data in remote storage locations, i.e. cloud storage, raises several security challenges and concerns [1][2]. The general solution for ensuring confidentiality of the data is to encrypt it prior to uploading it to cloud storage. But uploading an encrypted copy of the data does not support eliminating redundant data in the cloud, because the cloud storage provider cannot identify whether two different ciphertext files correspond to the same file [9]. This problem can be solved with convergent encryption, in which the encryption key is obtained by applying a hash function to the file; the resulting hash code is used as the key for encrypting the file. Since the properties of hash functions ensure that no two distinct files produce the same hash code, two users who encrypt the same file with the resulting convergent key produce the same ciphertext. This helps the cloud service provider eliminate redundant copies of files. Convergent encryption was introduced to make data deduplication and data confidentiality compatible, but it is vulnerable to well-known problems such as brute-force and dictionary attacks [17][11].

This paper proposes a mechanism that provides data deduplication while considering how popular a data item is. In general, a data item owned by several users, such as a popular audio track, requires less security; a data item that does not belong to many users, such as a research proposal, requires better security. Deduplication is applied only to popular files. If a file is not popular, it is not deduplicated; it is instead encrypted using conventional symmetric-key encryption. The popularity of a file is thus treated as the deciding factor for deduplication.
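To make the two encryption paths concrete, the following minimal Python sketch contrasts convergent encryption (for popular, deduplicable files) with conventional random-key encryption (for unpopular files). The SHA-256 hash, the AES-GCM cipher, and the popularity threshold are our own illustrative assumptions; the paper does not prescribe specific primitives.

```python
import hashlib
import os
from collections import defaultdict
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

POPULARITY_THRESHOLD = 5          # hypothetical cutoff for calling a file "popular"
upload_counts = defaultdict(int)  # file fingerprint -> number of owners seen

def convergent_encrypt(plaintext: bytes) -> bytes:
    """Encrypt so that identical plaintexts yield identical ciphertexts."""
    key = hashlib.sha256(plaintext).digest()            # convergent key = H(file)
    # A nonce derived deterministically from the key keeps encryption
    # deterministic: the very property that enables deduplication, but
    # also exposes the scheme to brute-force/dictionary attacks [17][11].
    nonce = hashlib.sha256(b"nonce" + key).digest()[:12]
    return AESGCM(key).encrypt(nonce, plaintext, None)

def store(plaintext: bytes) -> bytes:
    """Popularity-aware policy: deduplicate popular files only."""
    fingerprint = hashlib.sha256(plaintext).digest()
    upload_counts[fingerprint] += 1
    if upload_counts[fingerprint] >= POPULARITY_THRESHOLD:
        return convergent_encrypt(plaintext)            # deduplicable ciphertext
    key, nonce = os.urandom(32), os.urandom(12)         # fresh random key
    return AESGCM(key).encrypt(nonce, plaintext, None)  # confidential, not deduplicable

# Two owners of the same popular file produce byte-identical ciphertexts,
# so the provider can detect and eliminate the redundant copy:
assert convergent_encrypt(b"popular track") == convergent_encrypt(b"popular track")
```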
2. Data deduplication techniques
Various data deduplication techniques have been developed by researchers to help users and cloud storage service providers utilize storage space effectively. Most of these works did not consider security as a concern, yet data deduplication without security is of little use. Recently, Harnik et al. [1] identified several attacks that lead to the leakage of data in client-side deduplication systems. To overcome these attacks, the proof-of-ownership concept was introduced [3][10][5]. But this mechanism does not provide confidentiality for the stored data when the cloud service provider is honest-but-curious.

Because convergent key encryption is not able to provide semantic security, it is susceptible to various means of guessing the content. Bellare et al. [8][9] developed a mechanism named message-locked encryption and showed that it provides confidentiality only for messages that are not predictable. Later, Bellare et al. presented DupLESS [4], a server-aided encryption scheme for ensuring deduplication in cloud storage. This scheme, however, offers only server-side deduplication, whereas over the years client-side deduplication has proved to produce better results than server-side deduplication. Convergent encryption also suffers from various attacks such as the poison attack and the Sybil attack [17].

Armknecht et al. [7] presented a transparent encryption scheme for deduplicated storage with built-in proof of ownership. It attests users in utilizing the cloud storage space efficiently: a user pays only if his or her contents are actually stored in the cloud.

Jin Li et al. [6] developed an efficient mechanism to manage convergent keys in deduplicated storage. When every user has to generate and store convergent keys, key management becomes an overhead task for the users; this mechanism greatly reduces the key-management overhead in client-side, block-level deduplication.

Pierre Meye et al. [20] presented a two-phase data deduplication technique that addresses the weaknesses of convergent key encryption and ensures proof of ownership, making inter-user deduplication transparent.

LITERATURE SURVEY
In recent times, recommender systems have been extensively studied and improved, and there are numerous techniques and methods to implement such a system. In this section we review some of the existing work related to the proposed method, limiting our scope to advancements made in collaborative filtering techniques.

Collaborative recommendation systems recommend an item to a user if similar users liked that item. Examples of this technique include nearest neighbour modelling [1], matrix completion [2], restricted Boltzmann machines [3], Bayesian matrix factorization [4], etc.

The types of collaborative filtering, based on their implementation, are:
-Memory based CF:
Memory-based CF algorithms use the entire user-item database, or a sample of it, to generate a prediction. Users are grouped on the basis of similar interests. By identifying the so-called neighbours of a new user (or active user), a prediction of the user's preferences on new items can be produced. In the neighbourhood-based CF algorithm, a prevalent memory-based CF algorithm, the following steps take place: calculate the similarity or weight, w[i,j], which reflects the distance, correlation, or weight between two users or two items i and j; then produce a prediction for the active user by taking the weighted average of all the ratings of the users or items on a certain item or user, or by using a simple weighted average [9].
-Similarity Computation:
Similarity computation between items or users is a critical step in memory-based collaborative filtering algorithms. For item-based CF algorithms, similarity computation between item i and item j first identifies the users who have rated both items and then applies a similarity measure to determine the similarity, w[i,j], between the two co-rated items [9].
-Correlation-Based Similarity:
In this case, the similarity w[u,v] between two users u and v, or w[i,j] between two items i and j, is measured by computing the Pearson correlation or other correlation-based similarities. Pearson correlation measures the extent to which two variables linearly relate with each other [10]. Other correlation-based similarities include: constrained Pearson correlation, a variation of Pearson correlation that uses the rating-scale midpoint instead of the mean rate; Spearman rank correlation, similar to Pearson correlation except that the ratings are ranks; and Kendall's τ correlation, similar to Spearman rank correlation but using only the relative ranks, rather than the ranks themselves, to calculate the correlation [11,12].
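A minimal Python sketch of these two steps, assuming ratings are kept as per-user dictionaries (the data layout and the toy ratings are our own illustration):

```python
import numpy as np

def pearson(ratings_u: dict, ratings_v: dict) -> float:
    """Similarity w[u,v]: Pearson correlation over co-rated items only."""
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0                      # too little overlap to correlate
    x = np.array([ratings_u[i] for i in common], dtype=float)
    y = np.array([ratings_v[i] for i in common], dtype=float)
    x, y = x - x.mean(), y - y.mean()
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom else 0.0

def predict(active: dict, neighbours: list[dict], item) -> float:
    """Prediction step: similarity-weighted average of neighbours' ratings."""
    pairs = [(pearson(active, n), n[item]) for n in neighbours if item in n]
    total = sum(abs(w) for w, _ in pairs)
    if total == 0:
        return float(np.mean(list(active.values())))  # fall back to user's mean
    return sum(w * r for w, r in pairs) / total

active = {1: 5, 2: 3, 3: 4}
neighbours = [{1: 4, 2: 2, 3: 5, 4: 1}, {1: 5, 2: 4, 4: 4}]
print(predict(active, neighbours, item=4))  # weighted average of ratings on item 4
```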
-Model based CF:
Unlike memory-based CF, the model-based approach does not use the whole data set to compute a prediction. Instead, it builds a model of the data based on a training set and uses that model to predict future ratings. For example, the clustering-based CF method builds a model of the data set as clusters of users, and then uses the ratings of the users within a cluster to predict. A very successful model-based method is Singular Value Decomposition (SVD) [19], which represents the data by a set of vectors, one for each item and user, such that the dot product of the user vector and the movie vector is the best approximation for the training set. The model-building process is computationally expensive and memory intensive; after model construction, however, predictions are made very fast with a small memory requirement. SVD achieves less accurate predictions than memory-based methods on dense data sets, where a large fraction of the user-item values are available in the training set, but performs better on sparse data sets.
-Simple Bayesian CF Algorithm:
The simple Bayesian CF algorithm uses a naive Bayes (NB) strategy to make predictions for CF tasks. Assuming the features are independent given the class, the probability of a certain class given all of the features can be computed, and the class with the highest probability is then taken as the predicted class [20].
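A minimal sketch of this naive Bayes idea on a toy rating matrix, where 0 marks a missing rating (the matrix, the Laplace smoothing, and the 1-5 rating classes are our own illustrative assumptions):

```python
import numpy as np

def naive_bayes_predict(R: np.ndarray, user: int, item: int,
                        classes=(1, 2, 3, 4, 5)) -> int:
    """Predict R[user, item] as the rating class with the highest probability.

    Prior:      how often each class was given to `item`.
    Likelihood: P(user's rating on item j | class), estimated from users
                who rated both `item` and j, with Laplace smoothing.
    """
    rated = np.flatnonzero(R[user])
    rated = rated[rated != item]              # the active user's other ratings
    log_scores = {}
    for c in classes:
        supporters = R[:, item] == c          # users who gave `item` class c
        score = np.log((supporters.sum() + 1) / (len(R) + len(classes)))
        for j in rated:
            match = np.sum(supporters & (R[:, j] == R[user, j]))
            score += np.log((match + 1) / (supporters.sum() + len(classes)))
        log_scores[c] = score
    return max(log_scores, key=log_scores.get)

R = np.array([[5, 3, 0, 1],
              [4, 0, 4, 1],
              [1, 1, 5, 5],
              [5, 4, 0, 0]])
print(naive_bayes_predict(R, user=3, item=2))  # fill user 3's missing rating
```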
-Other Bayesian CF Algorithms:
One approach uses Bayesian belief nets with a decision tree at each node: a node corresponds to each item in the domain, and the states of each node correspond to the possible ratings for that item [21]. Results show that this model has prediction performance similar to Pearson correlation-based CF methods, and better performance than Bayesian-clustering and vector-cosine memory-based CF algorithms. The baseline Bayesian model uses a Bayesian belief net with no arcs (a baseline model) for collaborative filtering and recommends items based on their overall popularity [22].
-Hybrid Recommender:
Most hybrid recommenders are built by combining a CF algorithm with a content-based algorithm [13]. However, a hybrid algorithm can also be implemented with two CF algorithms. One way to do this, as [14] suggests (and the approach this project has chosen), is to weight the predictions of the two algorithms to form the final prediction according to rˆui = α·rˆui1 + β·rˆui2, where rˆui1 is the predicted rating from the memory-based algorithm, rˆui2 is the predicted rating from the model-based algorithm, and rˆui is the hybrid prediction. Here α is the weight of the memory prediction and β is the weight of the model prediction; furthermore, the requirement is α + β = 1 with 0 ≤ α, β ≤ 1. The values of α and β are determined numerically. This simple approach intuitively results in a running time equal to that of the two algorithms combined.
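A small sketch of this weighted hybrid, with the grid search over α being our own illustrative way of "determining the values numerically":

```python
import numpy as np

def hybrid(r_memory: np.ndarray, r_model: np.ndarray, alpha: float) -> np.ndarray:
    """r^_ui = alpha * r^_ui1 + beta * r^_ui2, with beta = 1 - alpha."""
    return alpha * r_memory + (1.0 - alpha) * r_model

# Determine alpha numerically: pick the weight with the lowest validation error.
r_mem = np.array([4.2, 3.1, 2.5])   # memory-based predictions (illustrative)
r_mod = np.array([3.6, 3.4, 2.0])   # model-based predictions (illustrative)
truth = np.array([4.0, 3.0, 2.0])   # held-out validation ratings
best = min(np.linspace(0, 1, 11),
           key=lambda a: np.mean((hybrid(r_mem, r_mod, a) - truth) ** 2))
print(best, hybrid(r_mem, r_mod, best))
```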
Hybrid recommenders can also incorporate CF and content-based features. The content-boosted CF algorithm uses naive Bayes as the content classifier and then fills in the missing values of the rating matrix with the predictions of the content predictor to form a pseudo rating matrix, in which observed ratings are kept untouched and missing ratings are replaced by the content predictor's predictions. Predictions are then made over the resulting pseudo rating matrix using a weighted Pearson correlation-based CF algorithm, which gives a higher weight to items that more users have rated and a higher weight to the active user [18].

Other recommender systems include demographic-based recommender systems, which use user profile information such as gender, postcode, occupation, and so forth [15], as well as utility-based and knowledge-based recommender systems, both of which require knowledge about how a particular object satisfies the user's needs [16, 17]. We will not discuss these systems in detail in this work.

Collaborative filtering can further be divided into two subcategories: user-based and item-based.

In user-based CF, the focus is on the "nearest neighbour" approach to recommendations, which looks at the rating patterns of other users and finds the "nearest neighbours", i.e. users whose ratings are closest to yours. The algorithm then gives you recommendations based on the ratings of these neighbours [5]. Amazon developed item-based collaborative filtering in 1998 [6]. Unlike user-based collaborative filtering, item-based filtering looks at the similarity between items, performing this task by noting how many users who bought item A also bought item B. If the correlation is high enough, a similarity can be considered to exist between the two items, and they can be seen as similar to one another. Item B will from then on be recommended to users who bought item A, and vice versa [6]. Similarity can be computed in a number of ways: using the user ratings, using some product description, or using the co-occurrence of items in a basket or in the set of a user's past purchased products.

The item-based methodology will be used in the proposed model, so we will look more closely at the work performed using it.
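As a minimal sketch of the co-occurrence flavour of item-item similarity described above (the toy baskets are our own illustration):

```python
def cooccurrence_similarity(baskets: list[set], a: str, b: str) -> float:
    """Of the users who bought item a, the fraction that also bought item b."""
    bought_a = [basket for basket in baskets if a in basket]
    if not bought_a:
        return 0.0
    return sum(b in basket for basket in bought_a) / len(bought_a)

baskets = [{"A", "B"}, {"A", "B", "C"}, {"A"}, {"B", "C"}]
print(cooccurrence_similarity(baskets, "A", "B"))  # 2 of 3 buyers of A also bought B
```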
Bayesian networks create a model based on a training set, with a decision tree at each node and edges representing user information. The model can be built off-line over a matter of hours or days. The resulting model is very small, very fast, and essentially as accurate as nearest neighbour methods [4].

Clustering techniques work by identifying groups of users who appear to have similar preferences. Once the clusters are created, predictions for an individual can be made by averaging the opinions of the other users in that cluster. Some clustering techniques represent each user with partial participation in several clusters; the prediction is then an average across the clusters, weighted by degree of participation. Clustering techniques usually produce less-personal recommendations than other methods, and in some cases the clusters have worse accuracy than nearest neighbour algorithms [7].

Horting is a graph-based technique in which nodes are users and edges between nodes indicate the degree of similarity between two users. Predictions are produced by walking the graph to nearby nodes and combining the opinions of the nearby users. Horting differs from nearest neighbour in that the graph may be walked through other users who have not rated the item in question, thus exploring transitive relationships that nearest neighbour algorithms do not consider. In one study using synthetic data, Horting produced better predictions than a nearest neighbour algorithm [8].
The item-based approach performs poorly for datasets with browsing- or entertainment-related items, such as MovieLens, where the recommendations it gives out seem very obvious to the target users. Such datasets see better results with matrix factorization techniques. Matrix factorization can be seen as breaking down a large matrix into a product of smaller ones. This is similar to the factorization of integers, where 12 can be written as 6 x 2 or 4 x 3. In the case of matrices, a matrix A with dimensions m x n can be reduced to a product of two matrices X and Y with dimensions m x p and p x n respectively.

There are a number of ways to implement matrix factorization that improve the performance of an item-based recommender system; one of the popular algorithms for factorizing a matrix is the singular value decomposition (SVD) algorithm. SVD came into the limelight when matrix factorization was seen performing well in the Netflix Prize competition. Other algorithms include PCA and its variations, NMF, and so on. Autoencoders can also be used for dimensionality reduction if you want to use neural networks.

We aim to use the output of matrix factorization as the input to a deep neural network for recommendations.
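A minimal numpy sketch of the m x n to (m x p)(p x n) reduction just described, using truncated SVD (the toy matrix and the rank p = 2 are our own illustration):

```python
import numpy as np

def factorize(A: np.ndarray, p: int) -> tuple[np.ndarray, np.ndarray]:
    """Reduce A (m x n) to X (m x p) @ Y (p x n) via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    X = U[:, :p] * s[:p]   # fold the top-p singular values into the row factors
    Y = Vt[:p, :]
    return X, Y

A = np.array([[5.0, 3.0, 4.0, 1.0],
              [4.0, 3.0, 4.0, 1.0],
              [1.0, 1.0, 2.0, 5.0],
              [1.0, 2.0, 1.0, 4.0]])
X, Y = factorize(A, p=2)
print(np.round(X @ Y, 2))  # best rank-2 approximation of the original matrix
```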

PROBLEM FORMULATION
User-based CF: User-based collaborative filtering is a method for predicting items for a user on the basis of the ratings given to those items by other users who have tastes similar to the target user's. Many websites use collaborative filtering to build their recommendation systems.
Drawbacks:
Sparsity: The percentage of users rating items is very low.
Scalability: The more K neighbours we consider (under a certain threshold), the better the classification should be. Nevertheless, the more users there are in the system, the greater the cost of finding the nearest K neighbours will be.
Cold-start: New users do not have enough information about their tastes to be compared with other users.
New item: Newly added items lack the ratings needed to create a solid ranking.

Item-based CF: The similarities between different items in the dataset are calculated using one of a number of similarity measures, and these similarity values are then used to predict ratings for user-item pairs not present in the dataset. Some of the problems faced in this method arise when there is only one common user between two movies; these problems are due to the sparsity of the dataset itself.

SVD: Singular value decomposition (SVD) is a collaborative filtering method for movie recommendation. SVD generates correlations and features from the user-item matrix. For example, if the items are movies, then SVD would capture them by generating latent factors like action vs comedy, Hollywood vs Bollywood, or Marvel vs Disney. We will mainly focus on the latent factor model for the SVD approach. The main drawback of SVD is that there is little explanation of why an item is recommended to a user; this turns out to be a significant problem when a user wants to know why a particular item was recommended to them.

The advantages of using deep neural networks to assist representation learning are two-fold: (1) they reduce the effort of hand-crafted feature design, and (2) they help the recommendation system take in a wide variety of content such as text, images, audio, and even video.
EXPERIMENT
Dataset Description:
MovieLens is a recommender system that recommends movies on the basis of the user's preferences using a collaborative filtering approach. MovieLens was created in 1997 by GroupLens Research, a research lab in the Department of Computer Science and Engineering at the University of Minnesota, in order to gather research data on personalized recommendations.
We evaluated the proposed neural network model on the MovieLens 1M dataset. As shown in Table 1, the MovieLens 1M dataset contains 1,000,000 ratings for 3952 movies by 6040 users. Each rating is an integer between 1 (worst) and 5 (best). To measure the performance of the online prediction method, we shuffled the rating matrix. We use about 80% of the data as the training and validation set and the remainder as the testing set. For evaluation, we use the root mean square error (RMSE), which places more emphasis on predictions with larger error.
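For reference, a short numpy sketch of the RMSE metric used here (the toy arrays are our own illustration):

```python
import numpy as np

def rmse(predicted: np.ndarray, actual: np.ndarray) -> float:
    """Root mean square error: squaring weights large errors more heavily."""
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

print(rmse(np.array([3.5, 4.0, 2.0]), np.array([4.0, 4.0, 1.0])))  # ~0.65
```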

Architecture:
MovieLens dataset -> user embedding layer (u) -> item embedding layer (m) -> dot product of u and m -> 6 fully connected dense neural layers -> output layer

For the MovieLens 1M dataset, we use a network with a (6040+3952) rating vector as the input and 6 fully-connected hidden layers, each with an activation. Fig. (a) shows the RMSE value on the test data versus the number of training epochs. We further compare our algorithm against existing ones using the MovieLens 1M dataset.
DETAILS
We use embeddings to represent each user and each movie in the data. These embeddings are vectors (of size n = 50 factors) that start as random numbers but are fit by the model to capture the essential qualities of each user/movie. We accomplish this by computing the dot product between a user vector and a movie vector to get a predicted rating; the resulting dot product is fed into the deep neural network. The deep neural network consists of six fully connected dense layers, and a dropout layer with probability 0.4 is used after each neural layer for the purpose of regularization. The Adam optimizer with a learning rate of 0.001 is used to optimize the algorithm.
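A minimal Keras sketch of this architecture, under our own assumptions for the pieces the text leaves unstated (the hidden-layer width of 64 and the ReLU activations are not specified above):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

N_USERS, N_MOVIES, N_FACTORS = 6040, 3952, 50   # sizes stated in the paper

user_in = layers.Input(shape=(1,), name="user_id")
movie_in = layers.Input(shape=(1,), name="movie_id")

# Embeddings start as random vectors and are fit during training to
# capture the essential qualities of each user/movie.
u = layers.Flatten()(layers.Embedding(N_USERS, N_FACTORS)(user_in))
m = layers.Flatten()(layers.Embedding(N_MOVIES, N_FACTORS)(movie_in))

# The dot product of the two embeddings is fed into the deep network.
x = layers.Dot(axes=1)([u, m])
for _ in range(6):                                # six fully connected layers
    x = layers.Dense(64, activation="relu")(x)    # width/activation: our assumption
    x = layers.Dropout(0.4)(x)                    # dropout 0.4 after each layer
output = layers.Dense(1, name="predicted_rating")(x)

model = Model([user_in, movie_in], output)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="mse")                         # RMSE is the root of this loss
model.summary()
```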
Table 3 shows the RMSE values of some basic models and neural-network models; for the other conventional CF algorithms we referred to the experimental results in [26]. Our proposed neural-network model achieves much better performance than traditional methods such as user-based CF, item-based CF, and the SVD method. The results of our method are also superior to the autoencoder-based model and the restricted Boltzmann machine, which are based on deep learning.

CONCLUSION
We conclude that the proposed method has outperformed the traditional methods. With more research on fine-tuning the hyper-parameters and building deeper network architectures, we can achieve even better results for the validation error.

REFERENCES

[1] R. M. Bell and Y. Koren, "Improved neighborhood-based collaborative filtering," in KDD Cup and Workshop, 2007.
[2] J. D. M. Rennie and N. Srebro, "Fast maximum margin matrix factorization for collaborative prediction," in ICML '05, pp. 713-719.
[3] R. Salakhutdinov, A. Mnih, and G. Hinton, "Restricted Boltzmann machines for collaborative filtering," in ICML '07, pp. 791-798.
[4] R. Salakhutdinov and A. Mnih, "Bayesian probabilistic matrix factorization using Markov chain Monte Carlo," in ICML '08, pp. 880-887.
[5] R. Devooght and H. Bersini, "Collaborative filtering with recurrent neural networks," 2017 (accessed 15-02-2017).
[6] https://www.google.com/patents/US6266649 (patent of item-based collaborative filtering).
[7] J. S. Breese, D. Heckerman, and C. Kadie, "Empirical analysis of predictive algorithms for collaborative filtering," in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, 1998, pp. 43-52.
[8] C. C. Aggarwal, J. L. Wolf, K. Wu, and P. S. Yu, "Horting hatches an egg: a new graph-theoretic approach to collaborative filtering," in Proceedings of ACM KDD '99, San Diego, CA, 1999, pp. 201-212.
[9] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl, "Item-based collaborative filtering recommendation algorithms," in Proceedings of the 10th International Conference on World Wide Web (WWW '01), pp. 285-295, May 2001.
[10] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, "GroupLens: an open architecture for collaborative filtering of netnews," in Proceedings of the ACM Conference on Computer Supported Cooperative Work, pp. 175-186, New York, NY, USA, 1994.
[11] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins, "Eigentaste: a constant time collaborative filtering algorithm," Information Retrieval, vol. 4, no. 2, pp. 133-151, 2001.
[12] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl, "Evaluating collaborative filtering recommender systems," ACM Transactions on Information Systems, vol. 22, no. 1, pp. 5-53, 2004.
[13] R. F. A. Elkhleifi and F. B. Kharrat, "Improving collaborative filtering algorithms," in 2016 12th International Conference on Semantics, Knowledge and Grids (SKG), Aug 2016, pp. 109-114.
[14] G. Badaro, H. Hajj, W. El-Hajj, and L. Nachman, "A hybrid approach with collaborative filtering for recommender systems," in 2013 9th International Wireless Communications and Mobile Computing Conference (IWCMC), July 2013, pp. 349-354.
[15] B. Krulwich, "Lifestyle finder: intelligent user profiling using large-scale demographic data," Artificial Intelligence Magazine, vol. 18, no. 2, pp. 37-45, 1997.
[16] R. Burke, "Hybrid recommender systems: survey and experiments," User Modelling and User-Adapted Interaction, vol. 12, no. 4, pp. 331-370, 2002.
[17] R. H. Guttman, "Merchant differentiation through integrative negotiation in agent-mediated electronic commerce," M.S. thesis, School of Architecture and Planning, MIT, 1998.
[18] P. Melville, R. J. Mooney, and R. Nagarajan, "Content-boosted collaborative filtering for improved recommendations," in Proceedings of the 18th National Conference on Artificial Intelligence (AAAI '02), pp. 187-192, Edmonton, Canada, 2002.
[19] G. Linden, B. Smith, and J. York, "Amazon.com recommendations: item-to-item collaborative filtering," IEEE Internet Computing, vol. 7, no. 1, pp. 76-80, 2003.
[20] K. Miyahara and M. J. Pazzani, "Improvement of collaborative filtering with the simple Bayesian classifier," Information Processing Society of Japan, vol. 43, no. 11, 2002.
[21] J. Breese, D. Heckerman, and C. Kadie, "Empirical analysis of predictive algorithms for collaborative filtering," in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI '98), 1998.
[22] D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie, "Dependency networks for inference, collaborative filtering, and data visualization," Journal of Machine Learning Research, vol. 1, no. 1, pp. 49-75, 2001.
[23] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry, "Using collaborative filtering to weave an information tapestry," Communications of the ACM, vol. 35, no. 12, pp. 61-70, 1992.
[24] P. Resnick and H. R. Varian, "Recommender systems," Communications of the ACM, vol. 40, no. 3, pp. 56-58, 1997.
[25] H. Lee and J. Lee, "Scalable deep learning-based recommendation systems," Seoul National University, Seoul, Republic of Korea.
[26] O. Yuanxin, R. Wenge, and X. Zhang, "Autoencoder collaborative filtering," pp. 284-29.
