
Data Domain, Deduplication and More

DEDUPLICATION
Total Cost of Ownership
With any backup solution, the total cost of ownership must encompass every element the
solution needs in order to function. These elements include, but are not limited to, the:

• cost of the backup software licensing;
• number of media servers required, including the:
o number and performance of the CPUs,
o amount of memory,
o performance and capacity of the server storage, and
o throughput and type of the networking interfaces;
• deduplication method used, including the:
o cost and implementation of the hardware-based appliance,
o cost and implementation of the software-based option, including the required
storage, and
o reliability and integrity of the deduplication method to perform fast backup and
recovery operations;
• flexibility of the solution to adapt to the department’s future direction; and
• effort to administer and maintain the backup software.

Snapshots provide a fast point-in-time copy of the data. However, it is recommended that
selected snapshot-based copies be rolled over to an external storage appliance, such as Data
Domain. The risk of relying solely on snapshot copies for recovery is directly linked to the
integrity of the primary snapshot: if the primary snapshot becomes corrupted, all subsequent
snapshots are likely to be unavailable for data recovery operations.

When recovery of data is required, the department must be 100% confident that the backup
solution will recover the data. This is why EMC has placed great importance on the Data
Invulnerability Architecture (DIA) with Data Domain.

EMC Confidential 1

DATA DOMAIN OVERVIEW


Data Domain Appliance
EMC Data Domain is a purpose-built deduplication storage appliance that integrates easily with
existing backup software applications and can be used seamlessly with a variety of data movers
across both backup and archive workloads.
Data Domain appliances provide fast, reliable backup and recovery operations through the
following key features.
• Variable Length Deduplication - the Data Domain appliance analyses variable-length
segments and performs inline deduplication before writing unique segments to a storage
platform that can scale to meet the required workload and data retention requirements.
The deduplication technology reduces disk capacity requirements and overhead while
increasing accessibility and reliability, which makes Data Domain appliances a cost-effective
alternative to tape and allows them to be used for both backup and archive workloads.
• Data Invulnerability Architecture (DIA) - enables Data Domain to provide the industry’s
best defence against the data integrity issues that plague physical tape backups. This
capability is exclusive to Data Domain: inline write and read verification detects data
integrity issues and automatically recovers from them during data ingest and retrieval.
Capturing and correcting I/O errors inline during the backup process eliminates the need
to repeat backup jobs. Unlike traditional enterprise storage arrays or file systems, Data
Domain systems apply continuous fault detection and self-healing to protect data throughout
its lifecycle. The Data Domain deduplication storage appliance treats data integrity and
recoverability as its most important goals: continuous fault detection, healing and write
verification ensure that backup and archive data are accurately stored, available and
recoverable. There are four critical areas of focus:
1. End-to-End Verification at Backup Time - data is read back as it is written to verify
that it is the correct data and that it is reachable through the file system to disk. Most
restores happen within a day or two of the backup, so systems that verify and correct data
integrity slowly over time will be too late for most recoveries;
2. Fault Avoidance and Containment - new data never overwrites good data. Data
Domain systems use fewer complex data structures and non-volatile RAM (NVRAM)
for fast, safe restart. No partial stripe writes are allowed;
3. Continuous Fault Detection and Healing - Data Domain RAID-6 provides double
disk failure protection and read error correction, on-the-fly error detection and
correction, and scrubbing to find and repair grown defects on the disk before they can
become a problem; and
4. File System Recoverability - data is written in a self-describing format. If necessary,
the file system can be recreated by scanning the log and rebuilding it from the metadata
stored with the data.
• Stream-Informed Segment Layout (SISL) - makes Data Domain deduplication dependent
on CPU performance rather than disk I/O capability, so as CPU speeds increase, so does
Data Domain’s ability to deduplicate data. SISL identifies 99 percent of duplicate,
variable-length data segments in RAM before storing unique segments on disk. Because
deduplication is performed using CPU and memory resources, Data Domain systems can
distribute some of the deduplication segment processing to backup servers or clients that
use a backup software application supporting integration with Data Domain Boost. The
effect of distributed deduplication is that less backup data is transferred over the IP
network infrastructure than with traditional CIFS or NFS protocols.
• Easy Integration - because of their extensive compatibility with backup software and
archive applications, Data Domain systems integrate easily into existing backup environments.
Disk-based backup systems offer similar performance characteristics with a significant
reliability advantage over traditional tape-based backups. Physical tape libraries have
a single point of failure at the robotic arm, and a significant amount of manual effort is
required to manage tape operations.
• Replication - is supported between sites in a peer relationship, a cascaded relationship,
a one-to-many relationship, or a many-to-one relationship such as that found when smaller
regional data centres replicate back to a larger central site. A large Data Domain appliance
can support a replication fan-in from up to 270 remote sites. Cross-site deduplication
minimises the required bandwidth between all sites, since only the first instance of data is
transferred across any of the WAN segments. The volume of data transferred is reduced
by up to 99 percent, making replication very efficient.
• Scalability - Data Domain appliances provide fast inline deduplication with up to
31TB/hour of throughput when using Data Domain Boost, and the largest appliance
provides up to 2PB of usable capacity with Data Domain Extended Retention. This allows
a single Data Domain appliance to store up to 100PB of logical data for long-term backup
storage.
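The end-to-end verification idea described under DIA above (read the data back as it is written and confirm it is the correct data) can be illustrated with a minimal sketch. This is not Data Domain code: the function name `write_verified`, the use of SHA-256 and the file-based storage are all assumptions chosen for illustration, and a read-back through an ordinary file system may be served from the operating system cache rather than from disk.

```python
import hashlib
import os
import tempfile


def write_verified(path, payload):
    """Write payload to path, read it back, and compare checksums.

    Hypothetical helper for illustration only; it is not part of any
    Data Domain interface.
    """
    expected = hashlib.sha256(payload).hexdigest()
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to stable storage
    with open(path, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    if actual != expected:
        # Detected at backup time, so the write can be retried immediately
        raise IOError("verification failed for " + path)
    return expected


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "segment.bin")
    digest = write_verified(target, b"backup segment contents")
    print("verified", digest[:16])
```

The point of the sketch is the timing: the check happens as part of the write, so a fault is found while the source data is still available to be written again.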
Figure 1 provides an overview of the EMC Data Domain 5.5 family.

Figure 1 - EMC Data Domain 5.5 Family Overview

Why Data Domain Deduplication and Boost Matter


EMC Data Domain is a purpose-built deduplication storage appliance, designed from the ground
up to perform inline, variable-length deduplication, with 99 percent of duplicate segments
identified in CPU and memory. It is the world’s first deduplication appliance to support both
backup and archive workloads.
Data Domain Boost is made up of two components: a Data Domain Boost plug-in that runs on the
backup server or client, and a Data Domain Boost component that runs on the Data Domain system.
All connectivity between the components uses industry-standard Ethernet or Fibre Channel. Data
Domain Boost software enables tight integration with backup and enterprise applications using an
optimized transport.


Data Domain Boost includes three main features:


• Distributed segment processing - distributes parts of the deduplication process from the
Data Domain system to the backup server or client, increasing backup application
performance by up to 50 percent.
• Managed file replication - allows backup applications to manage Data Domain replication
with full catalog awareness.
• Advanced load balancing and link failover - provides link aggregation for load balancing
and link failover, which eliminates the need for network layer aggregation.
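The idea behind distributed segment processing can be sketched as follows: the client splits the data into segments, fingerprints them, asks the store which fingerprints it has not seen, and sends only those segments. This is a generic illustration under stated assumptions, not the Data Domain Boost protocol or API; the class `DedupStore`, the function `backup`, the SHA-256 fingerprints and the fixed 8KB segments (used only for brevity) are all hypothetical.

```python
import hashlib


class DedupStore:
    """Hypothetical stand-in for the deduplicating storage system."""

    def __init__(self):
        self.segments = {}  # fingerprint -> segment bytes

    def missing(self, fingerprints):
        # Report which fingerprints the store has not seen before
        return [fp for fp in fingerprints if fp not in self.segments]

    def put(self, fingerprint, segment):
        self.segments[fingerprint] = segment


def backup(store, data, segment_size=8192):
    """Client-side step: fingerprint locally, send only unknown segments."""
    segments = [data[i:i + segment_size] for i in range(0, len(data), segment_size)]
    recipe = [hashlib.sha256(s).hexdigest() for s in segments]
    by_fingerprint = dict(zip(recipe, segments))
    bytes_sent = 0
    for fp in store.missing(list(by_fingerprint)):
        store.put(fp, by_fingerprint[fp])
        bytes_sent += len(by_fingerprint[fp])
    return recipe, bytes_sent


def restore(store, recipe):
    return b"".join(store.segments[fp] for fp in recipe)


store = DedupStore()
first = b"".join(bytes([i]) * 8192 for i in range(4))      # four distinct segments
second = first[:8192] + b"\xff" * 8192 + first[16384:]     # one segment changed

recipe1, sent1 = backup(store, first)
recipe2, sent2 = backup(store, second)
print("first backup sent ", sent1, "bytes")
print("second backup sent", sent2, "bytes")
```

In this sketch only the changed segment crosses the network on the second backup, which is the effect the distributed approach aims for when compared with sending every byte over CIFS or NFS.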
An EMC whitepaper that describes the business value of Data Domain Boost in detail is located
at http://www.emc.com/collateral/white-papers/h11755-business-value-dd-boost-wp.pdf. A copy of
this whitepaper will accompany this proposal.
Over the last few years, leading backup applications have adopted and integrated the Data Domain
Boost plug-in into their applications. A complete list of backup applications which support Boost is
provided in Figure 2.

Figure 2 - Backup Applications Supporting Data Domain Boost Plug-in

Typically, when application owners wish to control their own backup and recovery processes, IT
departments end up creating silos of backup repository storage. To eliminate this, EMC has worked
with other vendors to improve the backup speed and reliability of business-critical applications
by leveraging the native backup interface and giving the application owner full control.
As of Data Domain Operating System version 5.5, the applications shown in Figure 3 are
supported for use with Data Domain. This eliminates silos of storage, because the enterprise
backup solution and the supported applications store their backup data in a single, globally
deduplicated Data Domain appliance.


Figure 3 - Applications Supporting Data Domain Boost

Data Domain is not only built for storing backup data; it has also been designed to support a
large ecosystem of archiving applications. With the release of Data Domain Operating System
version 5.5, Data Domain supports up to 1 billion small archive files. The current list of
supported archive applications is shown in Figure 4.

Figure 4 - Archive Applications Supported by Data Domain

With this range of backup, enterprise and archive applications, Data Domain is designed to
integrate easily into an environment and to be used by a variety of applications.

Deduplication – Fixed versus Variable Considerations


Deduplication technology analyses data in either fixed-length or variable-length segments. As
mentioned previously, Data Domain performs inline, variable-length segment deduplication, with
segments ranging from 4KB to 12KB. The net effect is that data is analysed more effectively
using variable-length segments than fixed-length ones, which maximizes the effective storage
capacity of the deduplication system.
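The difference between the two approaches can be seen in a small experiment: insert a single byte at the front of a data stream and count how many segments of the modified stream were already stored. The sketch below is an illustration under stated assumptions, not Data Domain's algorithm. It uses a generic gear-style rolling hash for content-defined boundaries; the 4KB and 12KB limits come from the text, while the function names, the boundary mask, the 8KB fixed size and the random test data are assumptions.

```python
import hashlib
import random

_rng = random.Random(42)
GEAR = [_rng.getrandbits(32) for _ in range(256)]  # per-byte random values


def fixed_chunks(data, size=8192):
    """Cut the stream at fixed offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data, min_size=4096, max_size=12288, mask=0x0FFF):
    """Cut where a rolling hash of recent bytes matches a pattern.

    Boundaries depend on content rather than on offsets, so they move
    with the data when bytes are inserted or removed.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_fraction(chunker, original, modified):
    """Fraction of the modified stream's chunks already present in the original."""
    known = {hashlib.sha256(c).digest() for c in chunker(original)}
    new_chunks = chunker(modified)
    hits = sum(hashlib.sha256(c).digest() in known for c in new_chunks)
    return hits / len(new_chunks)


data_rng = random.Random(0)
original = bytes(data_rng.getrandbits(8) for _ in range(256 * 1024))
modified = b"X" + original  # one byte inserted at the front

print("fixed   :", shared_fraction(fixed_chunks, original, modified))
print("variable:", shared_fraction(variable_chunks, original, modified))
```

With fixed offsets, the inserted byte shifts every later segment, so almost nothing matches what was stored before. With content-defined boundaries, the cut points realign shortly after the insertion and most segments deduplicate again.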
For example, consider a 50TB environment in which 40TB of file data and 10TB of database data
need to be protected. File data is protected by weekly full and daily incremental
backups, while the databases are protected by daily full backups. The backup data is retained
for one month with nominal change rates, resulting in 460TB of logical data being stored.
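The document does not show how the 460TB figure is derived. One reading that reproduces it, offered here as an assumption rather than the author's stated method, is four weekly file fulls plus thirty daily database fulls over the month, with incremental data treated as negligible at nominal change rates. The function and constant names below are illustrative.

```python
FILE_TB = 40             # from the example
DB_TB = 10               # from the example
WEEKLY_FILE_FULLS = 4    # assumption: four weekly fulls retained in one month
DAILY_DB_FULLS = 30      # assumption: thirty daily fulls retained in one month


def logical_tb(file_tb, db_tb, file_fulls, db_fulls, incremental_tb=0):
    """Total logical (pre-deduplication) data held in the backup store."""
    return file_tb * file_fulls + db_tb * db_fulls + incremental_tb


total = logical_tb(FILE_TB, DB_TB, WEEKLY_FILE_FULLS, DAILY_DB_FULLS)
print(total, "TB of logical data")  # 40*4 + 10*30 = 460
```

Any real incremental volume would be added on top through the `incremental_tb` parameter, so this figure is best read as a lower bound under these assumptions.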
The deduplicated data that needs to be stored under fixed versus variable deduplication is
summarized in Table 1.


Table 1 – Fixed versus Variable Deduplication Summary

A summary of the storage required for a range of deduplication ratios is provided in Table 2.

Table 2 - Summary of Deduplication Ratios and Commonality

When comparing deduplication ratios, a few percentage points of difference in commonality may
not appear to be of any great significance, but the difference in the required backend storage
is substantial.
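A short calculation shows why. The sketch below assumes that commonality means the fraction of logical data that is duplicate, so that the deduplication ratio is 1 / (1 - commonality), and it ignores any additional local compression. The commonality values are illustrative and are not taken from Table 2; the 460TB figure is the logical data volume from the earlier example.

```python
def dedup_ratio(commonality):
    """Deduplication ratio implied by a commonality fraction (assumed definition)."""
    return 1.0 / (1.0 - commonality)


def stored_tb(logical_tb, commonality):
    """Backend storage needed once duplicate data has been removed."""
    return logical_tb * (1.0 - commonality)


LOGICAL_TB = 460  # logical data from the earlier example

for commonality in (0.90, 0.95, 0.97, 0.99):
    print(
        f"commonality {commonality:.0%}: "
        f"ratio {dedup_ratio(commonality):.1f}:1, "
        f"stored {stored_tb(LOGICAL_TB, commonality):.1f} TB"
    )
```

Under these assumptions, moving from 95 percent to 97 percent commonality reduces the stored data from about 23TB to about 13.8TB, a reduction of roughly 40 percent in backend capacity for a two-point change.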
