Module 11 - Capacity and Throughput Planning and Monitoring
Introduction
The main goal in capacity planning is to design your system with a model and
configuration that can store the required data for the required retention period.
When planning for throughput requirements, the goal is to ensure that the
bandwidth is sufficient to perform daily and weekly backups within the allotted
backup window. Effective throughput planning considers network bandwidth
sharing and allows adequate timeframes, or windows, for backup and system
housekeeping.
Introduction
This lesson covers the process used to determine the capacity requirements of a
Data Domain system, including collecting information about the backup
environment and calculating capacity needs.
Dell EMC Sales uses detailed software tools and formulas when working with its
customers to identify backup environment capacity and throughput needs. Such
tools help systems architects recommend systems with appropriate capacities and
correct throughput to meet those needs. This lesson discusses the most basic
considerations for capacity and throughput planning.
Using information collected about the backup system, you calculate capacity needs
by understanding the amount of data (data size) to be backed up, the types of data,
the size of a full (complete) backup, the number of copies of the data backed up,
and the expected data reduction rates (deduplication).
Data Domain system internal indices and other product components use extra,
variable amounts of storage, depending on the type of data and the sizes of files. If
you send different datasets to otherwise identical systems, one system may, over
time, have room for more or less actual backup data than another.
Data reduction factors depend on the type of data being backed up. Some
challenging (deduplication-unfriendly) datatypes include:
• Compressed data (multimedia, .mp3, .zip, and .jpg files)
• Encrypted data
The reduction factors that are listed in this slide are examples of how changing
retention periods can improve the amount of data reduction over time.
A daily full backup that is retained only for one week on a Data Domain system
may result in a compression factor of only 5x. However, retaining weekly backups
plus daily incrementals for up to 90 days may result in 20x or higher reduction.
Data reduction rates depend on several variables including datatypes, the amount
of similar data, and the length of storage. It is difficult to determine exactly what
rates to expect from any given system. The highest rates are achieved when many
full backups are stored.
When planning capacity, use average rates as a starting point for your
calculations and refine them after real data is available.
Calculate the required capacity by adding up the space required in this manner:
• First Full backup +
• Incremental backups (the number of days incrementals are run—typically 4 to
6) +
• Weekly cycle (one weekly full and 4 to 6 incrementals) times the number of
weeks data is retained
As subsequent full backups run, the backups are likely to yield a higher data
reduction rate. In this example, 25x is estimated for subsequent full backups,
so 1 TB of data compresses to 40 GB.
Four daily incremental backups requiring 10 GB each, plus one weekly full
backup needing 40 GB, yield a burn rate of 80 GB per week. Running the 80-GB
weekly burn rate out over the full 8-week retention period means that an
estimated 640 GB is needed to store the daily incremental backups and the
weekly full backups.
Adding this to the initial full backup gives a total of 840 GB needed. On a Data
Domain system with 1 TB of usable capacity, this means that the unit operates at
about 84% of capacity. A 16% buffer may be adequate for current needs, but you
might want to consider a system with larger capacity, or one that can have
extra storage added, to compensate for data growth.
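As a rough sketch, the arithmetic above can be captured in a few lines (the 5x first-full and 25x subsequent reduction factors are this example's assumptions, not guaranteed rates):

```python
# Capacity estimate following the worked example above.
# Assumed reduction: 5x on the first full backup, 25x thereafter.
data_gb = 1000                     # 1 TB of primary data, in GB
first_full = data_gb / 5           # 200 GB stored for the initial full
weekly_full = data_gb / 25         # 40 GB per subsequent weekly full
daily_incr = 10                    # GB stored per daily incremental
incrs_per_week = 4
retention_weeks = 8

weekly_burn = weekly_full + daily_incr * incrs_per_week   # 80 GB/week
total = first_full + weekly_burn * retention_weeks        # 840 GB

usable = 1000                      # 1 TB of usable system capacity
print(f"Estimated need: {total:.0f} GB "
      f"({100 * total / usable:.0f}% of usable capacity)")
```

With 840 GB against 1 TB of usable capacity, the remaining 16% buffer matches the figure discussed above.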
Again, these calculations are for estimation purposes only. Before determining true
capacity, use the analysis of real data that is gathered from your system as a part
of a Dell EMC BRS sizing evaluation.
While capacity is one part of the sizing calculation, it is important not to neglect the
throughput of the data during backups.
Assume, for example, that the greatest backup need is to process a full 200-GB
backup within a 10-hour backup window. Incremental backups should require less
time to complete, and you can safely presume that they would finish within the
backup window. If sustained throughput fell below the rate that the backup
window requires, the amount of time required to complete the backup would
increase considerably.
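The window requirement translates directly into a minimum sustained rate (a sketch using the example's 200-GB, 10-hour figures):

```python
# Minimum sustained throughput to finish the example full backup in its window.
full_backup_gb = 200
window_hours = 10

required_gb_per_hr = full_backup_gb / window_hours        # 20 GB/hr
required_mb_per_s = required_gb_per_hr * 1000 / 3600      # for comparison
print(f"Required: {required_gb_per_hr:.0f} GB/hr "
      f"(about {required_mb_per_s:.1f} MB/s sustained)")
```

A network test such as iperf should show comfortably more than this rate end to end before the plan can be considered feasible.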
It is important to note the effective throughput of both the Data Domain system and
the network on which it runs. Both points in data transfer determine whether the
required speeds are reliably feasible. Feasibility can be assessed by running
network testing software such as iperf.
Introduction
This lesson applies the formulae from the previous two lessons to selecting the
best Data Domain system to fit specific capacity and throughput requirements.
The system capacity numbers of a Data Domain system assume a mix of typical
enterprise backup data, such as file systems, databases, mail, and developer
files. How often data is backed up determines the low and high ends of the
range.
The maximum capacity for each Data Domain model assumes the maximum
number of drives (either internal or external) supported for that model.
Maximum throughput for each Data Domain model is dependent mostly on the
number and speed capability of the network interfaces being used to transfer data.
Some Data Domain systems have more and faster processors so they can process
incoming data faster.
Advertised capacity and throughput ratings for Data Domain products are based on
tests that are conducted in laboratory conditions. Your throughput varies depending
on your network conditions.
The number of network streams you may expect to use depends on your hardware
model. To learn specific maximum supported stream counts, see the specific model
Data Domain system guide.
External and internal factors affect Data Domain system performance in backup
environments. External factors in the backup environment often gate how fast
data is sent to the Data Domain system; external factor bottlenecks do not
affect the potential throughput of the Data Domain system itself. Internal
factors reduce the potential throughput of the Data Domain system, and internal
factor bottlenecks must be addressed to reach the system's potential sustained
performance.
Throughput increases as more streams are used for Data Domain system models
and protocols to a point where peak performance occurs. After that, adding more
streams usually reduces performance. Data Domain systems are designed so that
performance does not drop below 85% of the peak throughput when more streams
are used (up to the maximum number supported for the model and protocol).
Running more streams than the maximum supported value is not recommended; such
configurations are untested and can reduce system performance.
The number of disks in a Data Domain system is an important factor for the level of
performance the system can achieve.
Other internal factors include initial dataset backup speeds, compression, high
replication load, and RAID rebuilds. Ensure that you follow the most current
best practices for each protocol.
Selecting a Model
If the capacity or throughput for a particular model does not provide at least
a 20% buffer, calculate the capacity and throughput for a Data Domain model of
the next higher capacity. For example, if the capacity calculation for a DD6300
yields a buffer of less than 20%, repeat the calculation for the next model up.
Sometimes one model provides adequate capacity but not enough throughput, or
vice versa. The model selection must accommodate both throughput and capacity
requirements with an appropriate buffer.
Selecting a Data Domain model with capacity, performance, and buffer all taken
into consideration is the best practice to follow when implementing a system.
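The selection rule described above, requiring both capacity and throughput to fit with a 20% buffer, can be sketched as follows. The per-model figures here are illustrative placeholders, not published specifications:

```python
# Pick the smallest model whose capacity AND throughput leave a 20% buffer.
# The (capacity TB, throughput TB/hr) figures below are illustrative only;
# consult current Data Domain specification sheets for real values.
MODELS = [
    ("DD6300", 178, 4.2),
    ("DD6800", 288, 8.0),
    ("DD9300", 720, 19.0),
]
BUFFER = 0.20

def select_model(needed_tb, needed_tb_per_hr):
    for name, cap_tb, tput_tb_hr in MODELS:
        if (needed_tb <= cap_tb * (1 - BUFFER)
                and needed_tb_per_hr <= tput_tb_hr * (1 - BUFFER)):
            return name
    return None   # nothing fits with a buffer; look at larger systems

print(select_model(70, 3.25))   # fits the smallest listed model
```

If one requirement forces a step up, the larger model must be checked against both requirements again, since the rule is a logical AND.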
Discussion
Question/Discussion Topic:
Given the current Data Domain hardware offerings, determine which model is best
suited for the needs of the customer.
1. A customer estimates that they require 70-TB usable storage for backups over
the next 5 years. They require at least 3.25 TB/hour throughput to ensure that
all data is backed up within their backup window.
2. A customer estimates that they require 275-TB usable storage for backups over
the next 5 years. They require at least 15 TB/hour throughput to ensure that all
data is backed up within their backup window.
3. A customer estimates that they require 575-TB usable storage for backups over
the next 5 years. They require at least 25 TB/hour throughput to ensure that all
data is backed up within their backup window.
Discussion:
Accept answers, and discuss. Ensure that the following points are covered:
1. The customer could use the DD3300 if Cloud Tier is used. Otherwise, the
DD6300 would be the better choice.
2. The customer could use the DD6800 if both DD Boost and Cloud Tier are used.
Otherwise the DD9300 would be the better choice.
3. The customer could use the DD9300 if DD Boost and Cloud Tier are used.
Otherwise, the DD9800 would be the better choice.
Introduction
This lesson covers basic throughput monitoring and tuning on a Data Domain
System.
Throughput Bottlenecks
Integrating Data Domain systems into an existing backup architecture can change
the responsiveness of the backup system. Bottlenecks can restrict the flow of data
being backed up.
As demand shifts among system resources, such as the backup host, client,
network, and the Data Domain system itself, the source of a bottleneck can
shift as well. Monitoring is the first step in identifying what causes
performance bottlenecks and removing them. Using DD Management Center to
perform daily monitoring helps prevent serious problems. Monitoring capacity is
also important, because capacity problems sometimes cause performance issues.
The Capacity Thresholds widget displays systems that have crossed warning or
critical storage capacity levels.
The Capacity Used widget lets you monitor aggregate totals of storage levels for all
the Data Domain systems it is configured to manage. This widget monitors the total
storage capacity of all systems (for space that is used and available) or a selected
group if a filter is set.
Some commands and tools can be used for evaluating customer data and
performance in a Data Domain environment:
• replication show status checks the status of MTree/directory replication
contexts.
• The Replication Data Transferred over 24 hr report shows replication data
over a 24-hour period.
• Replication History/Replication Detailed History shows an hourly breakdown
of 24-hour replication activity.
• Replication Throttle checks the Data Domain replication throttle setting.
If you notice backups running slower than expected, it is useful to review system
performance metrics.
From the command line, use the command system show performance.
The command syntax is: system show performance [duration {hr | min | sec}
[interval {hr | min | sec}]]
Servicing a file system request consists of three steps: receiving the request over
the network, processing the request, and sending a reply to the request.
An important section of the system show performance output is the CPU and disk
utilization.
CPU avg/max: The average and maximum CPU utilization. The CPU ID of the
most-loaded CPU is shown in the brackets.
Disk max: Maximum disk utilization over all disks. The disk ID of the most-loaded
disk is shown in the brackets.
If CPU utilization is 80% or greater, or disk utilization is 60% or greater,
for an extended period, the Data Domain system is likely reaching its CPU or
disk processing limits. Check that no cleaning or disk reconstruction is in
progress; both appear in the State section of the system show performance
report.
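A monitoring script applying those thresholds might look like the following sketch; the "extended period" is approximated here as three consecutive samples, which is an assumption, not a documented value:

```python
# Flag sustained high utilization per the thresholds described above:
# CPU >= 80% or disk >= 60% for an extended period warrants investigation.
def needs_attention(samples, cpu_limit=80, disk_limit=60, min_run=3):
    """samples: list of (cpu_pct, disk_pct) readings over time.
    Returns True if either metric stays at/over its limit for
    min_run consecutive samples (a stand-in for 'extended period')."""
    run = 0
    for cpu, disk in samples:
        run = run + 1 if (cpu >= cpu_limit or disk >= disk_limit) else 0
        if run >= min_run:
            return True
    return False

print(needs_attention([(85, 40), (90, 45), (88, 50)]))   # True
print(needs_attention([(85, 40), (30, 20), (88, 50)]))   # False
```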
The following is a list of states and their meaning that is indicated in the system
show performance output:
• C – Cleaning
• D – Disk reconstruction
• B – GDA (also known as multinode cluster [MNC] balancing)
• V – Verification (used in the deduplication process)
• M – Fingerprint merge (used in the deduplication process)
• F – Archive data movement (active to archive)
• S – Summary vector checkpoint (used in the deduplication process)
• I – Data integrity
Typically, the processes that are listed in the State section of the system
show performance report affect the amount of CPU available for handling backup
and replication activity.
Throughput Monitoring
Besides watching disk utilization, you should monitor the rate at which data is being
received and processed. These throughput statistics are measured at several
points in the system to assist with analyzing the performance to identify
bottlenecks.
If slow performance is happening in real time, you can also run the system show
stats interval [interval in seconds] command. For example: system
show stats interval 2 produces a new line of data every two seconds.
The system show stats command reports CPU activity and disk read/write
amounts.
In the example report shown, you can see a high and steady amount of data
inbound on the network interface, which indicates that the backup host is writing
data. It is backup traffic, not replication traffic, because the Repl column
reports no activity.
Low disk-write rates relative to steady inbound network activity are likely because
many of the incoming data segments are duplicates of segments that are already
stored on disk. The Data Domain system is identifying the duplicates in real time as
they arrive and writing only those new segments it detects.
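The behavior described, writing only segments whose fingerprints are not already stored, can be illustrated with a toy index (a simplification: real Data Domain segmentation is variable-length, not the fixed-size blocks used here):

```python
import hashlib

def ingest(stream, index, segment_size=4096):
    """Toy inline deduplication: fingerprint fixed-size segments and
    write only those whose fingerprints are not already in the index.
    Returns the number of bytes actually written to disk."""
    written = 0
    for i in range(0, len(stream), segment_size):
        segment = stream[i:i + segment_size]
        fp = hashlib.sha256(segment).hexdigest()
        if fp not in index:
            index[fp] = segment        # "write" the new segment
            written += len(segment)
    return written

index = {}
first = ingest(b"A" * 8192 + b"B" * 4096, index)    # all content new
second = ingest(b"A" * 8192 + b"C" * 4096, index)   # only "C" segment new
print(first, second)
```

On the second ingest, only the changed segment consumes disk space, mirroring the low disk-write rates seen alongside heavy inbound network traffic.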
Tuning Solutions
If you experience system performance concerns, for example, you are exceeding
your backup window, or if throughput is slower than expected, consider the
following:
Check that CPU utilization is not unusually high. If you see CPU utilization at
or above 80%, the CPU may be underpowered for the load it is required to
process.
Check the State output of the system show performance command. Confirm that
there is no cleaning (C) or disk reconstruction (D) in progress.
Check the output of the replication show performance all command. Confirm that
there is no replication in progress. If there is no replication activity, the output
reports zeros. Press Ctrl + c to stop the command. If replication is occurring during
data ingestion and causing slower-than-expected performance, you might want to
separate these two activities in your backup schedule.
If CPU utilization is unusually high for any extended period, and you are unable to
determine the cause, contact Data Domain Support for further assistance.
When you are identifying performance problems, document the time when poor
performance was observed to know where to look in the system show
performance output.
Introduction
This lesson covers how to monitor Data Domain file system space usage.
The factors affecting how fast data on a disk grows on a Data Domain system
include:
• The size and number of datasets being backed up: an increase in the number of
backups, or in the amount of data being backed up and retained, causes space
usage to increase.
• The compressibility of the data being backed up: data that compresses and
deduplicates poorly consumes more space.
• The retention period: the longer the retention period, the larger the amount
of space required.
If any of these factors increase above the original sizing plan, your backup system
could overrun its capacity.
There are several ways to monitor the space usage on a Data Domain system to
help prevent system full conditions.
Data Management > File System > Summary displays current space usage and
availability. It also provides an up-to-the-minute indication of the compression
factor.
The Space Usage section shows two panes. The first pane shows the amount of
disk space available based on the last cleaning.
Used: The physical space that is used for compressed data. Warning messages go
to the system log and an email alert is generated when the use reaches 90%, 95%,
and 100%. At 100%, the Data Domain system accepts no more data from backup
hosts.
Available: The total amount of space available for data storage. This number can
change because an internal index may expand as the Data Domain system fills
with data. The index expansion takes space from the Available amount.
Space Usage
Data Management > File System > Charts displays graphs depicting space
usage and consumption on the Data Domain system.
The Space Usage view contains a graph that displays a visual representation of
data usage for the system. The Date Range choices are one week, one month,
three months, one year, and All. Custom date ranges can also be entered.
Pre-comp Used (blue)—The total amount of data that is sent to the Data Domain
system by backup servers. Pre-compressed data is what a backup server sees as
the total uncompressed data held by the Data Domain system as a storage unit.
Shown with the Space Used (left) vertical axis of the graph.
Post-comp Used (red)—The total amount of disk storage in use on the Data
Domain system. Shown with the Space Used (left) vertical axis of the graph.
Comp Factor (green)—The amount of compression the Data Domain system has
performed with the data it received (compression ratio). Shown with the
Compression Factor (right) vertical axis of the graph.
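The three graphed quantities are related by a simple ratio, sketched here with example figures:

```python
# Comp Factor is Pre-Comp Used divided by Post-Comp Used.
pre_comp_gb = 5000     # example: data sent by backup servers
post_comp_gb = 250     # example: physical disk space consumed

comp_factor = pre_comp_gb / post_comp_gb
print(f"Compression factor: {comp_factor:.0f}x")
```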
The Consumption view contains a graph that displays the space that is used over
time, which is shown in relation to total system capacity.
It displays Post-Comp in red, Comp Factor in green, Cleaning in yellow and Data
Movement in purple.
Data Movement refers to the amount of disk space that is moved to the archiving
storage area.
With the Capacity option disabled, as shown on the slide, the scale is adjusted to
present a clear view of space used.
This view is useful to see trends in space availability on the Data Domain system,
such as changes in space availability and compression in relation to cleaning
processes.
Consumption (Capacity)
The Consumption view with the Capacity option enabled, as shown on the slide,
displays the total amount of disk storage available on the Data Domain system.
The amount is shown with the Space Used (left) vertical axis of the graph.
Clicking the Capacity checkbox switches this line on and off. The scale now
displays Space Used relative to the total capacity of the system, with a blue
Capacity line indicating the storage limit.
This view also displays cleaning start and stop data points. This graph is set for
one week and displays one cleaning event. The cleaning schedule on this Data
Domain system is at the default of one day per week.
For more information about using Physical Capacity Measurement (PCM) from the
command line, see the Dell EMC Data Domain Operating System Command Reference
Guide.
At a system level, shared data is calculated only once. Shared data is reported to
each namespace that is sharing the data subset along with their unique data.
Physical Capacity Measurement can answer questions such as: How much physical
space is each subset using? How much total compression is each subset
reporting? How does physical space utilization for a subset grow and shrink
over time? How can one tell whether a subset has reached its physical capacity
quota? What proportion of the data is unique, and what proportion is shared
with other subsets?
The Data Domain System Manager can configure and run physical capacity
measurement operations at the MTree level only.
You add physical capacity measurement schedules in the Data Management >
MTree window by selecting an MTree and then clicking the Manage Schedules
button.
Click the plus, pencil, or X button to add, edit, or delete a schedule, respectively.
When a measurement job completes, the results are graphed and are viewed
under the selected MTree in the Space Usage tab.
The Data Domain Management Center version 1.4 and later is enhanced to
perform all physical capacity measurement operations except defining pathsets.
Daily Written
The Daily Written view contains a graph that displays a visual representation of
data that is written daily to the system over time. The data amounts are shown over
time for pre- and post-compression amounts.
It is useful to see data ingestion and compression factor results over a selected
duration. You may notice trends in compression factor and ingestion rates.
Local-Comp Factor refers to the compression of files as they are written to
disk. The default local compression algorithm, lz, gives the best throughput,
and Data Domain recommends the lz option.
gzfast is a zip-style compression that uses less space for compressed data, but
more CPU cycles (two times more than lz). gzfast is the recommended alternative
for sites that want more compression at the cost of lower performance.
gz is a zip-style compression that uses the least amount of space for data
storage (10% to 20% less than lz on average, though some datasets achieve
higher compression). gz also uses the most CPU cycles (up to five times more
than lz).
The gz compression type is commonly used for nearline storage applications in
which performance requirements are low.
For more detailed information about these compression types, see the Data
Domain Operating System Administration Guide.
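The space-versus-CPU tradeoff among these algorithms can be demonstrated with Python's zlib, using a low compression level as a rough stand-in for lz and a high level for gz (zlib is not the actual Data Domain implementation):

```python
import zlib

# Low level ~ lz (favors throughput); high level ~ gz (favors density).
# Illustrative analogy only, not the Data Domain algorithms themselves.
data = b"customer backup record 0000\n" * 10000

fast = zlib.compress(data, level=1)    # fastest, larger output
small = zlib.compress(data, level=9)   # slowest, smallest output

print(len(data), len(fast), len(small))
```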
Introduction
This lesson covers an introduction to file system cleaning and its operation.
When your backup application expires data, the Data Domain system marks the
data for deletion. The data is not deleted immediately; it is removed during a
cleaning operation. The file system remains available during the cleaning
operation for all normal operations, including backup (write) and restore
(read).
Depending on the amount of space the file system must clean, file system cleaning
can take from several hours to several days to complete.
Cleaning Process
Data invulnerability requires that data be written only into new, empty containers –
data that is already written in existing containers cannot be overwritten. This
requirement also applies to file system cleaning. During file system cleaning, the
system reclaims space that is used by expired data so you can use it for new data.
The example on this slide refers to dead and valid segments. Dead segments are
segments in containers that the system no longer needs, for example, segments
whose only claim came from a file that has since been deleted. Valid segments
contain unexpired data that is used to store backup-related files.
When files in a backup are expired, pointers to the related file segments are
removed. Dead segments are not permitted to be overwritten with new data since
this could put valid data at risk of corruption. Instead, valid segments are copied
forward into free containers to group the remaining valid segments together. When
the data is safe and reorganized, the original containers are released back to
the available pool of disk space.
Since the Data Domain system uses a log structured file system, space that was
deleted must be reclaimed. The reclamation process runs automatically as a part of
file system cleaning.
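The copy-forward step can be sketched with a toy model (containers here are simple lists of segment ids; real containers are fixed on-disk structures):

```python
# Toy sketch of copy-forward cleaning: valid segments are copied into
# fresh containers, then the original containers are freed for reuse.
def clean(containers, live_refs):
    """containers: list of lists of segment ids.
    live_refs: set of segment ids still referenced by any file or snapshot.
    Returns (new_containers, freed_count)."""
    survivors = [seg for c in containers for seg in c if seg in live_refs]
    # Repack survivors into new containers (capacity 4 segments here).
    new_containers = [survivors[i:i + 4] for i in range(0, len(survivors), 4)]
    freed = len(containers)            # originals returned to free space
    return new_containers, freed

old = [["a", "b", "x"], ["y", "c"], ["z"]]
live = {"a", "b", "c"}                 # x, y, z are dead segments
packed, freed = clean(old, live)
print(packed, freed)
```

Note that dead segments are never overwritten in place; space is reclaimed only after the valid data has been safely copied forward, matching the data invulnerability requirement above.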
Cleaning requires enough free capacity to store the cleanable containers until they
are verified.
During the cleaning process, a Data Domain system is available for all normal
operations, including accepting data from backup systems.
Using the Data Domain System Manager, go to Data Management > File System
> View Status of File System Services to see the Active Tier Cleaning Status. This
page displays the time when the last cleaning finished. To begin an immediate
cleaning session, select Start.
Access the Clean Schedule section by selecting Settings > Cleaning. This page
displays the current cleaning schedule and throttle setting. In this example, you can
see the default schedule - every Tuesday at 6 a.m. and 50% throttle. The schedule
can be edited.
Cleaning Considerations
Schedule cleaning for times when system traffic is lowest. Cleaning is a file system
operation that impacts overall system performance.
Adjusting the cleaning throttle higher than 50% consumes more system resources
during the cleaning operation and can potentially slow down other system
processes.
Data Domain recommends running a cleaning operation after the first full backup to
a Data Domain system. The initial local compression on a full backup is generally a
factor of 1.5 to 2.5. An immediate cleaning operation gives extra compression by
another factor of 1.15 to 1.2 and reclaims a corresponding amount of disk space.
Any operation that shuts down the Data Domain file system or turns off the device
(a system power-off, reboot, or filesys disable command) stops the clean
operation. File system cleaning does not continue when the Data Domain system
or file system restarts.
Expiring files from your backup does not guarantee that space will be freed after
cleaning. If active pointers exist to any segments related to the data you expire,
such as snapshots or fast copies, those data segments are still considered valid
and remain on the system until all references to those segments are removed.
Daily file system cleaning is not recommended as frequent cleaning can lead to
increased file fragmentation. File fragmentation can result in poor data locality and,
among other things, higher-than-normal disk utilization.
If the retention period of your backups is short, you might be able to run cleaning
more often than once weekly. The more frequently the data expires, the more
frequently file system cleaning can operate. Work with Dell EMC Data Domain
Support to determine the best cleaning frequency under unusual circumstances.
When the cleaning operation finishes, a message is sent to the system log giving
the percentage of storage space that was reclaimed.
Introduction
In this lab, you configure and run file system cleaning on a Data Domain system
using the System Manager and note the effect file system cleaning has on deleted
MTrees.
Summary