
Client Side Data Duplication Detector
using Hadoop Framework

Ms. Shilpa D. Kanhurkar
Exam No: 9375

Under Guidance of
Prof. P. B. Sahane

Department of Computer Engineering

P K Technical Campus, Chakan

Introduction

What is Big Data?

Need for De-Duplication?

Literature Survey

1. Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality
   Authors: M. Lillibridge, K. Eshghi
   Approach: Content-based segmentation, sampling, sparse indexing
   Advantage: Excellent deduplication throughput, little RAM
   Disadvantage: Small loss of deduplication; HP product

2. Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup
   Authors: D. Bhagwat, K. Eshghi
   Approach: Chunk based (hash)
   Advantage: Parallelizable, exploits file similarity
   Disadvantage: Restoration and storage require a larger number of random seeks

Literature Survey

3. DeDu: Building a Deduplication Storage System over Cloud Computing
   Authors: Z. Sun, J. Shen
   Approach: Cloud based, sparse index
   Advantage: High throughput
   Disadvantage: Deduplication does not occur at the file level, and the results of deduplication are not accurate

4. Venti: A New Approach to Archival Data Storage
   Authors: S. Quinlan, S. Dorward
   Approach: Chunk based (hash)
   Advantage: Enforces a write-once policy to avoid damage of data
   Disadvantage: Not suitable to deal with mass data, and the system is not scalable

Problem Statement
To develop a reliable, efficient client side deduplication system using efficient hash based techniques, Hadoop and HBase. It will help in offloading the processing power requirements of the target to the client nodes, reducing the amount of data that has to be sent over the network.

Objective
To learn the technologies of de-duplication techniques for big data.

To understand the limitations of client side de-duplication technology for big data and to evaluate the expected processing time.

To identify the basic requirements of client side de-duplication, also considering security issues.

To identify the parameters which affect reliable and efficient processing time of duplication detection for big data.

Existing System

Hash based duplication detection method

MD5 and SHA-1 algorithms (see the fingerprinting sketch below)
Data storage and analysis using HDFS with Pig and Hive
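
For reference, a minimal sketch (not the project's code) of how such a file fingerprint can be computed with the standard java.security.MessageDigest API; the file path and buffer size below are illustrative:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileFingerprint {

    // Computes the hex digest of a file with the given algorithm ("MD5" or "SHA-1").
    public static String digestOf(String path, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (InputStream in = new FileInputStream(path)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);        // stream the file through the digest
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b)); // convert digest bytes to hex
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Two files with identical content yield identical fingerprints,
        // which is the property the duplicate detector relies on.
        System.out.println(digestOf(args[0], "MD5"));
        System.out.println(digestOf(args[0], "SHA-1"));
    }
}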

Advantages
It is easy to compute the hash value for a given file.
Unique hash for different messages.
Larger datasets are analysed faster using Hadoop MapReduce.

Disadvantages
The security of the MD5 hash function is severely compromised.
Collision complexity of MD5 is 2^64, due to its 128-bit digest.
Pig and Hive run batch processes on Hadoop; they are not databases.

Proposed System

Client Side Data duplication detector using Hadoop

Hash based de-duplication technique with the Hadoop framework.
One improved hash algorithm (MD5plus).
HBase, a NoSQL database used on top of Hadoop, for storing and fast analysis of big datasets (a table-setup sketch follows below).
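
For illustration only, a sketch of how the fingerprint table could be created with the classic HBaseAdmin API that matches the HTable calls used later; the table name "Hash" comes from the algorithm slides, while the column family "File" and the code structure are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateHashTable {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(config);            // older API, matching new HTable(config, "Hash") used below
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("Hash"));
        desc.addFamily(new HColumnDescriptor("File"));        // column family for file metadata (assumed name)
        if (!admin.tableExists("Hash")) {
            admin.createTable(desc);                          // one row per unique fingerprint
        }
        admin.close();
    }
}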

Advantages
HBase provides random, real-time read/write access to our data and a flexible data model.
Space is saved on the storage resource because redundant data is not transferred or stored again.
Fingerprints + HBase achieve high look-up efficiency with high security.

Implementation: System Architecture

[Architecture diagram] On the client side, files are passed to an MD5 generator and the resulting hash key is looked up in HBase for an existing entry; only the non-matched values are passed on through MapReduce. On the server side, the unique files are stored in HDFS.

Algorithm
1) Data is written onto HDFS:
   FileSystem fs = FileSystem.get(new Configuration());
2) In the map() function, Hadoop by default splits the input file into 64 MB blocks:
   FileSplit fileSplit = (FileSplit) context.getInputSplit();
3) Generating the hash value of the file:
   String hxVal = MD5.toHexString(MD5.computeMD5(inputVal.getBytes()));
4) Opening the HBase "Hash" table:
   HTable hTable = new HTable(config, "Hash");

Algorithm
5) Hash values are read from HBase and compared with the current value:
   Get g = new Get(Bytes.toBytes(hxVal));
   Result result = hTable.get(g);
5.1) If the hash is new, the current hash value is inserted into HBase and the input file is transferred to the backup server:
   Put p = new Put(Bytes.toBytes(hxVal));
   p.add(Bytes.toBytes("File"), Bytes.toBytes("name"), Bytes.toBytes(fileName)); // qualifier/value assumed; the original slide truncates here
   hTable.put(p);                                                                // commit the new fingerprint row
   FileSystem fs = FileSystem.get(conf);
   Path filenamePath = new Path("/user/shilpa/final/");
   FSDataOutputStream out = fs.create(filenamePath);
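
Pulling the fragments above together, a minimal sketch of how the map() step might look end to end. The class name DedupMapper, the output path suffix, the "File"/"name" column names and the use of commons-codec DigestUtils in place of the project's MD5 helper are assumptions for illustration, not the project's actual code:

import java.io.IOException;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class DedupMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text inputVal, Context context)
            throws IOException, InterruptedException {
        // Assumes an input format that delivers the whole file content as one value.
        Configuration conf = context.getConfiguration();
        FileSplit fileSplit = (FileSplit) context.getInputSplit();   // step 2: which input file/block
        String fileName = fileSplit.getPath().getName();

        // Step 3: fingerprint of the content (DigestUtils stands in for the slides' MD5 helper).
        String hxVal = DigestUtils.md5Hex(inputVal.toString().getBytes());

        // Steps 4-5: look the fingerprint up in the HBase "Hash" table.
        HTable hTable = new HTable(HBaseConfiguration.create(conf), "Hash");
        Result result = hTable.get(new Get(Bytes.toBytes(hxVal)));

        if (result.isEmpty()) {
            // Step 5.1: new fingerprint -> record it and copy the file to the backup store.
            Put p = new Put(Bytes.toBytes(hxVal));
            p.add(Bytes.toBytes("File"), Bytes.toBytes("name"), Bytes.toBytes(fileName));
            hTable.put(p);

            FileSystem fs = FileSystem.get(conf);
            Path target = new Path("/user/shilpa/final/" + fileName);
            FSDataOutputStream out = fs.create(target);              // step 1: write onto HDFS
            out.write(inputVal.toString().getBytes());
            out.close();
            context.write(new Text(hxVal), new Text(fileName));      // emit unique files only
        }
        hTable.close();
    }
}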

Algorithm

One improved hash algorithm: MD5plus, based on MD5 and SHA-1

Steps:
i.   Information filling module
ii.  Initialization module
iii. Hash value calculation module
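
For context, a sketch of what the information filling (padding) and initialization modules do in plain MD5, which MD5plus builds on according to the description above; this is an interpretation, not the MD5plus reference code:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public final class Md5Padding {
    // i. Information filling: append 0x80, pad with zeros to 56 mod 64 bytes,
    //    then append the original bit length as a 64-bit little-endian value.
    static byte[] pad(byte[] message) {
        long bitLen = (long) message.length * 8;
        int paddedLen = ((message.length + 8) / 64 + 1) * 64;
        byte[] padded = Arrays.copyOf(message, paddedLen);
        padded[message.length] = (byte) 0x80;
        ByteBuffer.wrap(padded, paddedLen - 8, 8)
                  .order(ByteOrder.LITTLE_ENDIAN)
                  .putLong(bitLen);
        return padded;
    }

    // ii. Initialization: the four 32-bit chaining registers of plain MD5.
    static int[] initRegisters() {
        return new int[] { 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476 };
    }
}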


Round functions:
F(X,Y,Z) = XY v not(X)Z
G(X,Y,Z) = XZ v Y not(Z)
H(X,Y,Z) = X xor Y xor Z
I(X,Y,Z) = Y xor (X v not(Z))

Operational functions: each process has 4 rounds and each round has 16 steps:
FF(a,b,c,d,Mj,s,ti) means: a = b + ((a + F(b,c,d) + Mj + t[i]) <<< s)
GG(a,b,c,d,Mj,s,ti) means: a = b + ((a + G(b,c,d) + Mj + t[i]) <<< s)
HH(a,b,c,d,Mj,s,ti) means: a = b + ((a + H(b,c,d) + Mj + t[i]) <<< s)
II(a,b,c,d,Mj,s,ti) means: a = b + ((a + I(b,c,d) + Mj + t[i]) <<< s)
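
These are the standard MD5 round primitives. As a reference, a small Java sketch of F, G, H, I and one FF-type step (the message schedule and the MD5plus-specific constants are not reproduced here):

public final class Md5Rounds {
    // Standard MD5 round functions, matching F, G, H and I above.
    static int F(int x, int y, int z) { return (x & y) | (~x & z); }
    static int G(int x, int y, int z) { return (x & z) | (y & ~z); }
    static int H(int x, int y, int z) { return x ^ y ^ z; }
    static int I(int x, int y, int z) { return y ^ (x | ~z); }

    // One FF-type step: a = b + ((a + F(b,c,d) + Mj + t[i]) <<< s).
    static int FF(int a, int b, int c, int d, int mj, int s, int ti) {
        return b + Integer.rotateLeft(a + F(b, c, d) + mj + ti, s);
    }
}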

Algorithm
iv. Bit extending module
Special extending function:
K(X,Y,Z) = (X AND Y) OR (X AND Z) OR (Y AND Z), with 40-bit input and output.

We only have to prepend eight 0 bits to the results and then save them to 40-bit registers AA, BB, CC and DD.

KK(a,b,c,d,Mj,s,ti) means: a = b + ((a + K(b,c,d) + Mj + t[i]) <<< s)

Output: AA, BB, CC and DD

The MD5plus algorithm is based on MD5 and absorbs some excellent functions from SHA-1. The hash length of MD5plus is increased to 160 bits.
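
A sketch of the extending function as described above, handling the 40-bit width with a mask on Java longs; this is an interpretation of the slide, not the reference MD5plus implementation:

public final class BitExtend {
    private static final long MASK_40 = 0xFFFFFFFFFFL;   // keep results to 40 bits

    // K(X,Y,Z) = (X AND Y) OR (X AND Z) OR (Y AND Z), the majority function.
    static long K(long x, long y, long z) {
        return ((x & y) | (x & z) | (y & z)) & MASK_40;
    }

    // Extend a 32-bit register to 40 bits by prefixing eight zero bits,
    // i.e. store the unsigned 32-bit value in a 40-bit register AA/BB/CC/DD.
    static long extendTo40Bits(int register) {
        return register & 0xFFFFFFFFL;
    }
}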

Data Tables and Discussions

Algorithmic comparison

HDFS lacks random read and write access; this is where HBase comes into the picture. HBase is a distributed, scalable big data store that stores data as key/value pairs.

Conclusion and Future Scope

We leveraged the Hadoop framework to design and develop a duplication detection system that identifies multiple copies of the same data at the file level, eliminating duplicate/redundant files in their entirety before transmission, i.e. at the client (server) end. This helps in selective elimination and in controlling the number of unnecessary replicas, which can thereafter be managed as per requirements. Using hash based duplication techniques, duplication is detected faster. Hadoop MapReduce and HBase handle big data files in a simple, parallel, scalable manner, performing computation and storage in a parallel, distributed way.

Future scope is to add lightweight compression, which proves to be an efficient technique for enhancing the performance of our duplication detector.

References and Bibliography

[1] Kapil Bakshi, "Considerations for Big Data: Architecture and Approach," in Aerospace Conference, 2012 IEEE, Big Sky, MT, 3-10 March 2012, pp. 1-7.
[2] P. Malik, "Governing Big Data: Principles and Practices," IBM Journal of Research and Development, vol. 57, pp. 1:1-1:13, 2013.
[3] D. Geer, "Reducing the Storage Burden via Data Deduplication," in Computer, the flagship publication of the IEEE Computer Society, vol. 41, pp. 15-17, 2008.
[4] M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Camble, "Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality," in 7th USENIX Conference on File and Storage Technologies, San Francisco, California.
[5] B. Zhu, K. Li, and H. Patterson, "Avoiding the Disk Bottleneck in the Data Domain Deduplication File System," in Proceedings of the 6th USENIX Conference on File and Storage Technologies, San Jose, California, 2008, pp. 269-282.

References and Bibliography

[6] S. Quinlan and S. Dorward, "Venti: A New Approach to Archival Data Storage," in Proceedings of the 1st USENIX Conference on File and Storage Technologies, Monterey, CA: USENIX Association, 2002.
[7] Z. Sun, J. Shen and J. Yong, "DeDu: Building a Deduplication Storage System over Cloud Computing," in 2011 15th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Lausanne, 2011, pp. 348-355.
[8] D. Cezary, G. Leszek, H. Lukasz, K. Michal, K. Wojciech, S. Przemyslaw, S. Jerzy, U. Cristian, and W. Michal, "HYDRAstor: A Scalable Secondary Storage," in Proceedings of the 7th Conference on File and Storage Technologies, San Francisco, California, 2009, pp. 197-210.
[9] D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge, "Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup," in 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS).

Thank You!
