
Client Side Data Duplication Detector
using Hadoop Framework

Ms. Shilpa D. Kanhurkar
Exam No: 9375

Under Guidance of
Prof. P. B. Sahane

Department of Computer Engineering

P K Technical Campus, Chakan

Introduction

What is Big Data?

Need for De-Duplication?

Literature Survey

1. Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality
   Authors: M. Lillibridge, K. Eshghi
   Approach: Content-based segmentation, sampling, sparse indexing
   Advantage: Excellent deduplication throughput, little RAM
   Disadvantage: Small loss of deduplication; HP product

2. Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup
   Authors: D. Bhagwat, K. Eshghi
   Approach: Chunk based (hash)
   Advantage: Parallelizable, exploits file similarity
   Disadvantage: Restoration and storage require a larger number of random seeks

Literature Survey

3. DeDu: Building a Deduplication Storage System over Cloud Computing
   Authors: Z. Sun, J. Shen
   Approach: Cloud based, sparse index
   Advantage: High throughput
   Disadvantage: Deduplication does not occur at the file level, and the results of deduplication are not accurate

4. Venti: A New Approach to Archival Data Storage
   Authors: S. Quinlan, S. Dorward
   Approach: Chunk based (hash)
   Advantage: Enforces a write-once policy to avoid damage of data
   Disadvantage: Not suitable to deal with mass data, and the system is not scalable

Problem Statement
To develop a reliable, efficient client side deduplication system using efficient hash based techniques, Hadoop and HBase. It will help in offloading the processing power requirements of the target to the client nodes, reducing the amount of data that has to be sent over the network.

Objective
To learn the technologies of de-duplication techniques for big data.

To understand the limitations of client side de-duplication technology for big data and to evaluate the expected processing time.

To identify the basic requirements of client side de-duplication, also considering security issues.

To identify the parameters which affect reliable and efficient processing time of duplication detection for big data.

Existing System

Hash based duplication detection method

MD5 and SHA-1 algorithms (see the fingerprinting sketch below)
Data storage and analysis using HDFS with Pig and Hive
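
For reference, a minimal sketch (not the project's code) of how such a file fingerprint can be computed with the standard java.security.MessageDigest API; the file path and buffer size below are illustrative:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileFingerprint {

    // Computes the hex digest of a file with the given algorithm ("MD5" or "SHA-1").
    public static String digestOf(String path, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (InputStream in = new FileInputStream(path)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);        // stream the file through the digest
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b)); // convert digest bytes to hex
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Two files with identical content yield identical fingerprints,
        // which is the property the duplicate detector relies on.
        System.out.println(digestOf(args[0], "MD5"));
        System.out.println(digestOf(args[0], "SHA-1"));
    }
}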

Advantages
It is easy to compute the hash value for a given file.
Unique hash for different messages.
Larger datasets are analysed faster using Hadoop MapReduce.

Disadvantages
The security of the MD5 hash function is severely compromised.
Collision complexity of MD5 is 2^64, due to its 128-bit digest.
Pig and Hive run batch processes on Hadoop; they are not databases.

Proposed System

Client Side Data duplication detector using Hadoop

Hash based de-duplication technique with the Hadoop framework.
One improved hash algorithm (MD5plus).
HBase, a NoSQL database used on top of Hadoop, for storing and fast analysis of big datasets (a table-setup sketch follows below).
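
For illustration only, a sketch of how the fingerprint table could be created with the classic HBaseAdmin API that matches the HTable calls used later; the table name "Hash" comes from the algorithm slides, while the column family "File" and the code structure are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateHashTable {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(config);            // older API, matching new HTable(config, "Hash") used below
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("Hash"));
        desc.addFamily(new HColumnDescriptor("File"));        // column family for file metadata (assumed name)
        if (!admin.tableExists("Hash")) {
            admin.createTable(desc);                          // one row per unique fingerprint
        }
        admin.close();
    }
}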

Advantages
HBase provides random, real-time read/write access to our data and a flexible data model.
Space is saved on the storage resource because redundant data is not transferred or stored again.
Fingerprints + HBase achieve high look-up efficiency with high security.

Implementation: System Architecture

[Architecture diagram] On the client side, files are passed to an MD5 generator and the resulting hash key is looked up in HBase for an existing entry; only the non-matched values are passed on through MapReduce. On the server side, the unique files are stored in HDFS.

Algorithm
1) Data is written onto HDFS:
   FileSystem fs = FileSystem.get(new Configuration());
2) In the map() function, Hadoop by default splits the input file into 64 MB blocks:
   FileSplit fileSplit = (FileSplit) context.getInputSplit();
3) Generating the hash value of the file:
   String hxVal = MD5.toHexString(MD5.computeMD5(inputVal.getBytes()));
4) Opening the HBase "Hash" table:
   HTable hTable = new HTable(config, "Hash");

Algorithm
5) Hash values are read from HBase and compared with the current value:
   Get g = new Get(Bytes.toBytes(hxVal));
   Result result = hTable.get(g);
5.1) If the hash is new, the current hash value is inserted into HBase and the input file is transferred to the backup server:
   Put p = new Put(Bytes.toBytes(hxVal));
   p.add(Bytes.toBytes("File"), Bytes.toBytes("name"), Bytes.toBytes(fileName)); // qualifier/value assumed; the original slide truncates here
   hTable.put(p);                                                                // commit the new fingerprint row
   FileSystem fs = FileSystem.get(conf);
   Path filenamePath = new Path("/user/shilpa/final/");
   FSDataOutputStream out = fs.create(filenamePath);
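
Pulling the fragments above together, a minimal sketch of how the map() step might look end to end. The class name DedupMapper, the output path suffix, the "File"/"name" column names and the use of commons-codec DigestUtils in place of the project's MD5 helper are assumptions for illustration, not the project's actual code:

import java.io.IOException;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class DedupMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text inputVal, Context context)
            throws IOException, InterruptedException {
        // Assumes an input format that delivers the whole file content as one value.
        Configuration conf = context.getConfiguration();
        FileSplit fileSplit = (FileSplit) context.getInputSplit();   // step 2: which input file/block
        String fileName = fileSplit.getPath().getName();

        // Step 3: fingerprint of the content (DigestUtils stands in for the slides' MD5 helper).
        String hxVal = DigestUtils.md5Hex(inputVal.toString().getBytes());

        // Steps 4-5: look the fingerprint up in the HBase "Hash" table.
        HTable hTable = new HTable(HBaseConfiguration.create(conf), "Hash");
        Result result = hTable.get(new Get(Bytes.toBytes(hxVal)));

        if (result.isEmpty()) {
            // Step 5.1: new fingerprint -> record it and copy the file to the backup store.
            Put p = new Put(Bytes.toBytes(hxVal));
            p.add(Bytes.toBytes("File"), Bytes.toBytes("name"), Bytes.toBytes(fileName));
            hTable.put(p);

            FileSystem fs = FileSystem.get(conf);
            Path target = new Path("/user/shilpa/final/" + fileName);
            FSDataOutputStream out = fs.create(target);              // step 1: write onto HDFS
            out.write(inputVal.toString().getBytes());
            out.close();
            context.write(new Text(hxVal), new Text(fileName));      // emit unique files only
        }
        hTable.close();
    }
}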

Algorithm

One improved hash algorithm: MD5plus, based on MD5 and SHA-1

Steps:
i.   Information filling module
ii.  Initialization module
iii. Hash value calculation module
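
For context, a sketch of what the information filling (padding) and initialization modules do in plain MD5, which MD5plus builds on according to the description above; this is an interpretation, not the MD5plus reference code:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public final class Md5Padding {
    // i. Information filling: append 0x80, pad with zeros to 56 mod 64 bytes,
    //    then append the original bit length as a 64-bit little-endian value.
    static byte[] pad(byte[] message) {
        long bitLen = (long) message.length * 8;
        int paddedLen = ((message.length + 8) / 64 + 1) * 64;
        byte[] padded = Arrays.copyOf(message, paddedLen);
        padded[message.length] = (byte) 0x80;
        ByteBuffer.wrap(padded, paddedLen - 8, 8)
                  .order(ByteOrder.LITTLE_ENDIAN)
                  .putLong(bitLen);
        return padded;
    }

    // ii. Initialization: the four 32-bit chaining registers of plain MD5.
    static int[] initRegisters() {
        return new int[] { 0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476 };
    }
}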


Round functions:
F(X,Y,Z) = XY v not(X)Z
G(X,Y,Z) = XZ v Y not(Z)
H(X,Y,Z) = X xor Y xor Z
I(X,Y,Z) = Y xor (X v not(Z))

Operational functions: each process has 4 rounds and each round has 16 steps:
FF(a,b,c,d,Mj,s,ti) means: a = b + ((a + F(b,c,d) + Mj + t[i]) <<< s)
GG(a,b,c,d,Mj,s,ti) means: a = b + ((a + G(b,c,d) + Mj + t[i]) <<< s)
HH(a,b,c,d,Mj,s,ti) means: a = b + ((a + H(b,c,d) + Mj + t[i]) <<< s)
II(a,b,c,d,Mj,s,ti) means: a = b + ((a + I(b,c,d) + Mj + t[i]) <<< s)
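
These are the standard MD5 round primitives. As a reference, a small Java sketch of F, G, H, I and one FF-type step (the message schedule and the MD5plus-specific constants are not reproduced here):

public final class Md5Rounds {
    // Standard MD5 round functions, matching F, G, H and I above.
    static int F(int x, int y, int z) { return (x & y) | (~x & z); }
    static int G(int x, int y, int z) { return (x & z) | (y & ~z); }
    static int H(int x, int y, int z) { return x ^ y ^ z; }
    static int I(int x, int y, int z) { return y ^ (x | ~z); }

    // One FF-type step: a = b + ((a + F(b,c,d) + Mj + t[i]) <<< s).
    static int FF(int a, int b, int c, int d, int mj, int s, int ti) {
        return b + Integer.rotateLeft(a + F(b, c, d) + mj + ti, s);
    }
}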

Algorithm
iv. Bit extending module
Special extending function:
K(X,Y,Z) = (X AND Y) OR (X AND Z) OR (Y AND Z), with 40-bit input and output.

We only have to prepend eight 0 bits to the results and then save them to 40-bit registers AA, BB, CC and DD.

KK(a,b,c,d,Mj,s,ti) means: a = b + ((a + K(b,c,d) + Mj + t[i]) <<< s)

Output: AA, BB, CC and DD

The MD5plus algorithm is based on MD5 and absorbs some excellent functions from SHA-1. The hash length of MD5plus is increased to 160 bits.
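
A sketch of the extending function as described above, handling the 40-bit width with a mask on Java longs; this is an interpretation of the slide, not the reference MD5plus implementation:

public final class BitExtend {
    private static final long MASK_40 = 0xFFFFFFFFFFL;   // keep results to 40 bits

    // K(X,Y,Z) = (X AND Y) OR (X AND Z) OR (Y AND Z), the majority function.
    static long K(long x, long y, long z) {
        return ((x & y) | (x & z) | (y & z)) & MASK_40;
    }

    // Extend a 32-bit register to 40 bits by prefixing eight zero bits,
    // i.e. store the unsigned 32-bit value in a 40-bit register AA/BB/CC/DD.
    static long extendTo40Bits(int register) {
        return register & 0xFFFFFFFFL;
    }
}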

Data Tables and Discussions

Algorithmic comparison

HDFS lacks random read and write access; this is where HBase comes into the picture. HBase is a distributed, scalable big data store that stores data as key/value pairs.

Conclusion and Future Scope

We leveraged the Hadoop framework to design and develop a duplication detection system that identifies multiple copies of the same data at the file level, eliminating duplicate/redundant files in their entirety before transmission, i.e. at the client (server) end. This helps in selective elimination and in controlling the number of unnecessary replicas, which can thereafter be managed as per requirements. Using hash based duplication techniques, duplication is detected faster. Hadoop MapReduce and HBase handle big data files in a simple, parallel, scalable manner, performing computation and storage in a parallel, distributed way.

Future scope is to add lightweight compression, which proves to be an efficient technique for enhancing the performance of our duplication detector.

References and Bibliography

[1] Kapil Bakshi, "Considerations for Big Data: Architecture and Approach," in Aerospace Conference, 2012 IEEE, Big Sky, MT, 3-10 March 2012, pp. 1-7.
[2] P. Malik, "Governing Big Data: Principles and Practices," IBM Journal of Research and Development, vol. 57, pp. 1:1-1:13, 2013.
[3] D. Geer, "Reducing the Storage Burden via Data Deduplication," in Computer, the flagship publication of the IEEE Computer Society, vol. 41, pp. 15-17, 2008.
[4] M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Camble, "Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality," in 7th USENIX Conference on File and Storage Technologies, San Francisco, California.
[5] B. Zhu, K. Li, and H. Patterson, "Avoiding the Disk Bottleneck in the Data Domain Deduplication File System," in Proceedings of the 6th USENIX Conference on File and Storage Technologies, San Jose, California, 2008, pp. 269-282.

References and Bibliography

[6] S. Quinlan and S. Dorward, "Venti: A New Approach to Archival Data Storage," in Proceedings of the 1st USENIX Conference on File and Storage Technologies, Monterey, CA: USENIX Association, 2002.
[7] Z. Sun, J. Shen and J. Yong, "DeDu: Building a Deduplication Storage System over Cloud Computing," in 2011 15th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Lausanne, 2011, pp. 348-355.
[8] D. Cezary, G. Leszek, H. Lukasz, K. Michal, K. Wojciech, S. Przemyslaw, S. Jerzy, U. Cristian, and W. Michal, "HYDRAstor: A Scalable Secondary Storage," in Proceedings of the 7th Conference on File and Storage Technologies, San Francisco, California, 2009, pp. 197-210.
[9] D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge, "Extreme Binning: Scalable, Parallel Deduplication for Chunk-based File Backup," in 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS).

Thank You!
