
Republic of Iraq

Imam Ja'far al-Sadiq University (AS)

College: Information Technology

Department: Computer Technology Engineering

Stage: Third

Report title:

((Duplication of Data))

Student name:

((Al-Hasan Salah Hasan Sahn))

Subject instructor:

((Zina Sattar))


Introduction

In computing, data deduplication is a technique for eliminating duplicate copies of repeating data. A related and somewhat synonymous term is single-instance (data) storage. This technique is used to improve storage utilization and can also be applied to network data transfers to reduce the number of bytes that must be sent. In the deduplication process, unique chunks of data, or byte patterns, are identified and stored during a process of analysis. As the analysis continues, other chunks are compared to the stored copy and whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency is dependent on the chunk size), the amount of data that must be stored or transferred can be greatly reduced.
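To make the chunk-and-reference idea concrete, the following is a minimal sketch in Python. The fixed 4 KB chunk size and the use of SHA-256 as the chunk identifier are assumptions for illustration; real systems choose their own chunk sizes and hash functions.

import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size in bytes

def deduplicate(data: bytes):
    """Split the input into fixed-size chunks, keep one copy of each unique
    chunk, and represent the stream as a list of references (chunk hashes)."""
    store = {}       # digest -> the single stored copy of that chunk
    references = []  # the original stream, expressed as references
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            store[digest] = chunk      # first occurrence: keep the bytes
        references.append(digest)      # every occurrence: keep only a reference
    return store, references

def reassemble(store, references) -> bytes:
    """Rebuild the original stream by following each reference."""
    return b"".join(store[d] for d in references)

data = b"ABCD" * 10_000                # highly repetitive input (40,000 bytes)
store, refs = deduplicate(data)
assert reassemble(store, refs) == data
print(len(data), "bytes reduced to", sum(len(c) for c in store.values()), "bytes of unique chunks")

On this highly repetitive input only two unique chunks remain in the store; every repetition is reduced to a short reference.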

Deduplication is different from data compression algorithms, such as LZ77 and LZ78. Whereas compression algorithms identify redundant data inside individual files and encode this redundant data more efficiently, the intent of deduplication is to inspect large volumes of data and identify large sections – such as entire files or large sections of files – that are identical, and replace them with a shared copy. For example, a typical email system might contain 100 instances of the same 1 MB (megabyte) file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; the subsequent instances are referenced back to the saved copy, for a deduplication ratio of roughly 100 to 1. Deduplication is often paired with data compression for additional storage savings: deduplication is first used to eliminate large chunks of repetitive data, and compression is then used to efficiently encode each of the stored chunks.
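The attachment example above can be expressed as a short worked calculation; the figures are simply the ones used in the text, not measurements:

# A worked version of the email attachment example from the text.
instances = 100          # copies of the same attachment in the mail store
attachment_mb = 1        # size of the attachment in MB
logical_size = instances * attachment_mb   # 100 MB backed up without deduplication
physical_size = 1 * attachment_mb          # only one instance is actually stored
ratio = logical_size / physical_size       # deduplication ratio of roughly 100 to 1
print(f"{logical_size} MB logical -> {physical_size} MB physical, ratio {ratio:.0f}:1")

Compression applied afterwards shrinks the single stored copy further, which is why the two techniques are usually combined.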
Post-process versus in-line deduplication

Deduplication may occur "in-line", as data is flowing, or "post-process", after it has been written.

With post-process deduplication, new data is first stored on the storage device, and a process at a later time will analyze the data looking for duplication. The benefit is that there is no need to wait for the hash calculations and lookup to be completed before storing the data, thereby ensuring that storage performance is not degraded. Implementations offering policy-based operation can give users the ability to defer optimization on "active" files, or to process files based on type and location. One potential drawback is that duplicate data may be unnecessarily stored for a short time, which can be problematic if the system is nearing full capacity.
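A minimal sketch of the post-process approach, assuming fixed 4 KB blocks and an in-memory block map (the class and method names are illustrative): writes are accepted without any hashing, and a later pass collapses duplicates into references.

import hashlib

BLOCK_SIZE = 4096  # assumed block size

class PostProcessStore:
    def __init__(self):
        self.blocks = {}   # block id -> raw bytes, or an int reference to another block
        self.next_id = 0

    def write(self, data: bytes) -> list:
        """Ingest path: store every block as-is, with no hashing on the write path."""
        ids = []
        for i in range(0, len(data), BLOCK_SIZE):
            self.blocks[self.next_id] = data[i:i + BLOCK_SIZE]
            ids.append(self.next_id)
            self.next_id += 1
        return ids

    def deduplicate_pass(self):
        """Background pass: hash the stored blocks and collapse duplicates into references."""
        seen = {}  # digest -> id of the first (kept) copy
        for block_id, block in list(self.blocks.items()):
            if isinstance(block, int):     # already just a reference
                continue
            digest = hashlib.sha256(block).hexdigest()
            if digest in seen:
                self.blocks[block_id] = seen[digest]   # duplicate: replace bytes with a reference
            else:
                seen[digest] = block_id
        # Until this pass runs, duplicate blocks occupy real space on the device.

    def read(self, block_id: int) -> bytes:
        block = self.blocks[block_id]
        return self.blocks[block] if isinstance(block, int) else block

store = PostProcessStore()
ids = store.write(b"ABCD" * 2048)   # two identical 4 KB blocks land in the store
store.deduplicate_pass()            # later, the second block becomes a reference
assert store.read(ids[0]) == store.read(ids[1])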

Alternatively, deduplication hash calculations can be done in-line: synchronized as data enters the target device. If the storage system identifies a block which it has already stored, only a reference to the existing block is stored, rather than the whole new block.
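For contrast with the post-process sketch above, here is a minimal in-line variant under the same assumptions: the hash lookup happens on the write path, so a duplicate block is never stored at all.

import hashlib

class InlineStore:
    def __init__(self):
        self.chunks = {}   # digest -> bytes; each unique block is stored exactly once

    def write_block(self, block: bytes) -> str:
        """Return a reference; the block itself is stored only if it is new."""
        digest = hashlib.sha256(block).hexdigest()
        if digest not in self.chunks:   # lookup happens before the write completes
            self.chunks[digest] = block
        return digest                   # a duplicate block costs only this reference

    def read_block(self, digest: str) -> bytes:
        return self.chunks[digest]

store = InlineStore()
ref1 = store.write_block(b"x" * 4096)
ref2 = store.write_block(b"x" * 4096)   # duplicate: nothing new is written
assert ref1 == ref2 and len(store.chunks) == 1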

The advantage of in-line deduplication over post-process deduplication is that it requires less storage and network traffic, since duplicate data is never stored or transferred. On the negative side, hash calculations may be computationally expensive, thereby reducing the storage throughput. However, certain vendors with in-line deduplication have demonstrated equipment which is able to perform in-line deduplication at high rates.

Post-process and in-line deduplication methods are often heavily debated.
Data formats

The SNIA Dictionary identifies two methods:

content-agnostic data deduplication - a data deduplication method that does not require awareness of specific application data formats.

content-aware data deduplication - a data deduplication method that leverages knowledge of specific application data formats.

Source versus target deduplication

Another way to classify data deduplication methods is according to where they occur. Deduplication occurring close to where the data is created is referred to as "source deduplication". When it occurs near where the data is stored, it is called "target deduplication".

Source deduplication ensures that data on the data source is deduplicated. This generally takes place directly within a file system. The file system periodically scans new files, creating hashes, and compares them to the hashes of existing files. When files with the same hashes are found, the duplicate copy is removed and the new file points to the old file. Unlike hard links, however, duplicated files are considered to be separate entities: if one of the duplicated files is later modified, a copy of that changed file or block is created using a mechanism called copy-on-write. The deduplication process is transparent to the users and backup applications. Backing up a deduplicated file system will often cause duplication to occur, resulting in backups that are bigger than the source data.
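The sharing and copy-on-write behaviour can be illustrated with a toy in-memory model; a real file system would do the hashing in a periodic scan and manage blocks rather than whole files, and all names here are made up for the sketch.

import hashlib

class DedupFS:
    """Toy model: files whose contents hash identically share one stored copy."""
    def __init__(self):
        self.files = {}    # path -> digest of the file's content
        self.content = {}  # digest -> bytes shared by every file with that hash

    def write_file(self, path: str, data: bytes):
        digest = hashlib.sha256(data).hexdigest()
        self.content.setdefault(digest, data)   # identical content is stored once
        self.files[path] = digest

    def read_file(self, path: str) -> bytes:
        return self.content[self.files[path]]

    def modify_file(self, path: str, data: bytes):
        """Copy-on-write: the changed file gets its own copy; files that shared
        the old content keep reading the original bytes."""
        self.write_file(path, data)

fs = DedupFS()
fs.write_file("/a.txt", b"same content")
fs.write_file("/b.txt", b"same content")       # stored only once
fs.modify_file("/b.txt", b"changed content")   # /a.txt is unaffected
assert fs.read_file("/a.txt") == b"same content"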

Source deduplication can be declared explicitly for copying operations, as no calculation is needed to know that the copied data is in need of deduplication. This leads to a new form of "linking" on file systems, called the reflink (Linux) or clonefile (macOS), where one or more inodes (file information entries) are made to share some or all of their data. It is named analogously to hard links, which work at the inode level, and symbolic links, which work at the filename level. The individual entries have a copy-on-write behavior that is non-aliasing, i.e. changing one copy afterwards will not affect other copies. Microsoft's ReFS also supports this operation.
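As a usage sketch, on Linux a reflink copy can be requested through GNU coreutils' cp with the --reflink=always option. This requires a reflink-capable file system such as Btrfs or XFS, and the paths below are placeholders.

import subprocess

def reflink_copy(src: str, dst: str):
    """Ask cp to share the source file's data blocks instead of duplicating them.
    The result is copy-on-write: changing either file later does not affect the other."""
    subprocess.run(["cp", "--reflink=always", src, dst], check=True)

# Example (placeholder paths):
# reflink_copy("/data/original.img", "/data/clone.img")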

Deduplication methods

One of the most common forms of data deduplication implementations
works by comparing chunks of data to detect duplicates. For that to
happen, each chunk of data is assigned an identification, calculated by
the software, typically using cryptographic hash functions. In many
implementations, the assumption is made that if the identification is
identical, the data is identical, even though this cannot be true in all
cases due to the pigeonhole principle; other implementations do not
assume that two blocks of data with the same identifier are identical,
but actually verify that data with the same identification is identical.[11]
Depending on the implementation, the software then either assumes that a chunk whose identification already exists in the deduplication namespace is a duplicate, or actually verifies that the two blocks of data are identical; in either case, it replaces the duplicate chunk with a link.

Once the data has been deduplicated, upon read back of the file,
wherever a link is found, the system simply replaces that link with the
referenced data chunk. The deduplication process is intended to be
transparent to end users and applications.
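A minimal sketch of the "verify on match" variant described above, assuming SHA-256 identifiers; the helper names are illustrative, and a real system would persist the store rather than keep it in memory.

import hashlib

store = {}   # digest -> list of distinct chunks that happen to share that digest

def store_chunk(chunk: bytes):
    """Return a link (digest, index) identifying the single stored copy of this chunk."""
    digest = hashlib.sha256(chunk).hexdigest()
    candidates = store.setdefault(digest, [])
    for index, existing in enumerate(candidates):
        if existing == chunk:       # verification: the bytes really are identical
            return digest, index    # duplicate -> the caller keeps only this link
    candidates.append(chunk)        # new chunk (or a genuine hash collision)
    return digest, len(candidates) - 1

def read_chunk(link) -> bytes:
    """On read-back, a link is simply replaced by the referenced data chunk."""
    digest, index = link
    return store[digest][index]

link = store_chunk(b"hello world")
assert store_chunk(b"hello world") == link   # a second copy resolves to the same link
assert read_chunk(link) == b"hello world"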

Commercial deduplication implementations differ by their chunking methods and architectures.

Chunking. In some systems, chunks are defined by physical-layer constraints (e.g. the 4 KB block size in WAFL). In some systems only complete files are compared, which is called single-instance storage or SIS. The most intelligent (but CPU-intensive) method of chunking is generally considered to be sliding-block. In sliding-block chunking, a window is passed along the file stream to seek out more naturally occurring internal file boundaries.
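A minimal sketch of sliding-block (content-defined) chunking, using a simple Rabin-style rolling hash; the window size, mask and constants are arbitrary illustrative choices, and real implementations also enforce minimum and maximum chunk sizes.

WINDOW = 48             # bytes in the sliding window
MASK = (1 << 12) - 1    # cut on average every ~4 KiB of input
BASE = 257              # rolling-hash base
MOD = 1 << 32           # keep the hash bounded

def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks, cutting wherever the
    rolling hash of the current window matches a fixed bit pattern."""
    start = 0
    rolling = 0
    top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            rolling = (rolling - data[i - WINDOW] * top) % MOD   # drop the oldest byte
        rolling = (rolling * BASE + byte) % MOD                  # add the newest byte
        if i >= WINDOW and (rolling & MASK) == 0:                # boundary found
            yield start, i + 1
            start = i + 1
    if start < len(data):
        yield start, len(data)

data = (b"some repeated payload " * 500) + b"tail"
chunks = [data[s:e] for s, e in chunk_boundaries(data)]
assert b"".join(chunks) == data   # chunking is lossless

Because the boundaries depend on the content inside the window rather than on byte offsets, inserting data near the start of a file shifts only the nearby boundaries, so unchanged regions still produce matching chunks.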
