Deduplication Using Hadoop and HBase
Under the Guidance of
Prof. P. B. Sahane
Introduction
Need for De-Duplication?
Literature Survey
Sr no | Paper Name | Author Name | Approach | Advantage | Disadvantage
1 | Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality | M. Lillibridge, K. Eshghi | Content-based segmentation, sampling, sparse indexing | Excellent deduplication throughput, little RAM | Small loss of deduplication; HP product
2 | Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup | D. Bhagwat, K. Eshghi | Chunk based (hash) | Parallelizable, exploits file similarity | Restoration and storage require a larger number of random seeks
Literature Survey
Sr no | Paper Name | Author Name | Approach | Advantage | Disadvantage
3 | DeDu: Building a Deduplication Storage System over Cloud Computing | Z. Sun, J. Shen | Cloud based, sparse index | High throughput | Deduplication does not occur at the file level, and the results of deduplication are not accurate
4 | | Q. Sean and D. Sean | Chunk based (hash) | Enforces a write-once policy to avoid damage of data | Not suitable to deal with mass data, and the system is not scalable
Problem Statement
To develop a reliable and efficient client-side deduplication system using efficient hash-based techniques, Hadoop, and HBase. It will help in offloading the processing power requirements from the target to the client nodes, reducing the amount of data that has to be sent over the network.
Objective
To understand the limitations of client-side deduplication technology for big data and to evaluate the best prediction time.
Existing System
Advantages
It is easy to compute the hash value for a given file (see the small example after this list).
Unique hash for different messages.
Larger datasets are analysed faster using Hadoop MapReduce.
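A minimal sketch of the first two points, using the standard java.security.MessageDigest API; the message strings are purely illustrative:

    import java.security.MessageDigest;

    public class HashExample {
        // Hex-encode an MD5 digest so it can be compared or stored as a row key.
        static String md5Hex(String message) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5").digest(message.getBytes("UTF-8"));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        }

        public static void main(String[] args) throws Exception {
            // Different messages yield different 128-bit fingerprints;
            // identical messages always yield the same one.
            System.out.println(md5Hex("backup-block-A"));
            System.out.println(md5Hex("backup-block-B"));
            System.out.println(md5Hex("backup-block-A"));
        }
    }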
Disadvantages
The security of the MD5 hash function is severely compromised.
Collision complexity of MD5 is 2^64, due to its 128-bit digest (see the worked bound after this list).
Pig and Hive run batch processes on Hadoop; they are not databases.
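For context, the 2^64 figure is the generic birthday bound rather than the cost of the best known MD5 attack; a short worked form:

    % Birthday bound: for an n-bit hash, a collision is expected after
    % about 2^(n/2) evaluations.
    \[
      n = 128 \quad\Rightarrow\quad 2^{n/2} = 2^{64}
    \]
    % Dedicated MD5 collision attacks cost far less than this generic bound,
    % which is why MD5's security is considered compromised.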
Proposed System
Advantages
HBase provides random, real-time read/write access to our data and a flexible data model.
Space is saved on the storage resource, as redundant data is compressed away.
Fingerprints + HBase achieve high lookup efficiency with high security.
Implementation: System Architecture

[Architecture diagram] Client side: input files are processed by a MapReduce job with an MD5 generator, and each hash key is looked up in HBase to check for an existing entry. Only the non-matched values (unique files) are passed to the server side, where they are stored in HDFS.
Algorithm
1) Data is written onto HDFS:
   FileSystem fs = FileSystem.get(new Configuration());
2) In the map() function, Hadoop by default splits the input file into 64 MB blocks:
   FileSplit fileSplit = (FileSplit) context.getInputSplit();
3) Generating the hash value of the file:
   String hxVal = MD5.toHexString(MD5.computeMD5(inputVal.getBytes()));
4) Connecting to the HBase table that stores the hash values:
   HTable hTable = new HTable(config, "Hash");
Algorithm
5) Hash values are read from HBase and compared with the current value:
   Get g = new Get(Bytes.toBytes(hxVal));
   Result result = hTable.get(g);
5.1) If the hash is new, the current hash value is inserted into HBase and the input file is transferred to the backup server:
   Put p = new Put(Bytes.toBytes(hxVal));
   p.add(Bytes.toBytes("File"), Bytes.toBytes("name"), Bytes.toBytes(fileName)); // the slide truncates here; qualifier and value are illustrative
   hTable.put(p);
   FileSystem fs = FileSystem.get(conf);
   Path filenamePath = new Path("/user/shilpa/final/");
   FSDataOutputStream out = fs.create(filenamePath);
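Read together, steps 2 to 5.1 amount to a single map() routine. The following is a minimal, self-contained sketch under that reading; the class name DedupMapper, the stand-in MD5 helpers, and the column qualifier/value are assumptions, and it targets the classic org.apache.hadoop.hbase.client API used on these slides:

    import java.io.IOException;
    import java.security.MessageDigest;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper combining steps 2-5.1: hash each record, look the
    // fingerprint up in the "Hash" table, and keep it only if it is new.
    public class DedupMapper extends Mapper<LongWritable, Text, Text, Text> {

        private HTable hTable;

        @Override
        protected void setup(Context context) throws IOException {
            Configuration config = HBaseConfiguration.create();
            hTable = new HTable(config, "Hash");   // step 4: table of known fingerprints
        }

        @Override
        protected void map(LongWritable key, Text inputVal, Context context)
                throws IOException, InterruptedException {
            // Step 3: MD5 fingerprint of the current block (hex-encoded).
            String hxVal = toHexString(computeMD5(inputVal.toString().getBytes()));

            // Step 5: look the fingerprint up in HBase.
            Result result = hTable.get(new Get(Bytes.toBytes(hxVal)));

            if (result.isEmpty()) {
                // Step 5.1: new fingerprint -> record it and emit the block as unique.
                Put p = new Put(Bytes.toBytes(hxVal));
                p.add(Bytes.toBytes("File"), Bytes.toBytes("name"), Bytes.toBytes("block")); // illustrative metadata
                hTable.put(p);
                context.write(new Text(hxVal), inputVal);
            }
            // Duplicate blocks are simply dropped.
        }

        @Override
        protected void cleanup(Context context) throws IOException {
            hTable.close();
        }

        // Minimal stand-ins for the MD5 helper used on the slides.
        private static byte[] computeMD5(byte[] data) throws IOException {
            try {
                return MessageDigest.getInstance("MD5").digest(data);
            } catch (Exception e) {
                throw new IOException(e);
            }
        }

        private static String toHexString(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) sb.append(String.format("%02x", b));
            return sb.toString();
        }
    }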
Algorithm
Steps:
i. Initialization module
ii. G(X,Y,Z) = (X AND Z) OR (Y AND (NOT Z))
Algorithm
iv. Bit extending module
Special extending function: K(X,Y,Z) = (X AND Y) OR (X AND Z) OR (Y AND Z), with 40-bit input and output.
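For reference, the two Boolean functions named above are straightforward to express in code. A small illustrative sketch follows, written over Java's 32-bit int words (the slide mentions 40-bit input and output, so the word width here is an assumption):

    public class BitFunctions {
        // G(X,Y,Z) = (X AND Z) OR (Y AND NOT Z), the MD5 round-2 function from step ii.
        static int g(int x, int y, int z) {
            return (x & z) | (y & ~z);
        }

        // K(X,Y,Z) = (X AND Y) OR (X AND Z) OR (Y AND Z), the "special extending
        // function" from step iv (a bitwise majority of the three inputs).
        static int k(int x, int y, int z) {
            return (x & y) | (x & z) | (y & z);
        }

        public static void main(String[] args) {
            int x = 0xFFFF0000, y = 0x00FF00FF, z = 0x0F0F0F0F;
            System.out.printf("G(X,Y,Z) = %08x%n", g(x, y, z));
            System.out.printf("K(X,Y,Z) = %08x%n", k(x, y, z));
        }
    }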
Data Tables and Discussions
Algorithmic Comparison
HDFS lacks random read and write access. This is where HBase comes into the picture: it is a distributed, scalable, big-data store that keeps data as key/value pairs.
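As a concrete illustration of that key/value model, the "Hash" fingerprint table used earlier could be created once with the classic HBase admin API. A minimal sketch, assuming only the table and column-family names from the previous slides:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateHashTable {
        public static void main(String[] args) throws Exception {
            Configuration config = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(config);

            // Row key = MD5 fingerprint (hex string); one column family "File"
            // holds metadata about the file that produced the fingerprint.
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("Hash"));
            desc.addFamily(new HColumnDescriptor("File"));

            if (!admin.tableExists("Hash")) {
                admin.createTable(desc);
            }
            admin.close();
        }
    }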
Thank You!