Module 9-Business Continuity - Participant Guide
Table of Contents
Concepts in Practice
Module Objectives
Business Continuity
Business continuity (BC) is a process that prepares for, responds to, and
recovers from a system outage that can adversely affect business operations.
Information Availability
[Figure labels: impact of information unavailability]
• Number of employees impacted x hours out x hourly rate
• Direct loss, compensatory payments, lost future revenue, billing losses, investment losses
• Customers, suppliers, financial markets, banks, business partners
• Revenue, cash flow, lost discounts (A/P), payment, credit rating, stock price
• Other Expenses: temporary employees, equipment rentals, overtime costs, extra shipping costs, travel expenses, and so on.
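The productivity term in the figure labels above (number of employees impacted x hours out x hourly rate) can be worked through as simple arithmetic. The following is a minimal sketch; the function name and the sample figures are illustrative assumptions, not values from this guide.

```python
def lost_productivity(employees_impacted: int, hours_out: float, hourly_rate: float) -> float:
    """Estimate lost productivity: employees impacted x hours out x hourly rate."""
    return employees_impacted * hours_out * hourly_rate


# Hypothetical outage: 200 employees idle for 4 hours at an hourly rate of 50.
loss = lost_productivity(employees_impacted=200, hours_out=4, hourly_rate=50.0)
print(loss)  # 40000.0
```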
MTBF: Average time available for a system or component to perform its normal
operations between failures
• MTBF = Total uptime / Number of failures
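As a worked example of the MTBF formula above, the following sketch divides total uptime by the number of failures. The function name and the sample numbers are assumptions chosen for illustration.

```python
def mtbf(total_uptime_hours: float, number_of_failures: int) -> float:
    """MTBF = total uptime / number of failures."""
    if number_of_failures <= 0:
        raise ValueError("MTBF is undefined without at least one failure")
    return total_uptime_hours / number_of_failures


# Hypothetical component: 8,700 hours of uptime with 3 failures in that period.
print(mtbf(8700, 3))  # 2900.0
```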
Disaster Recovery
Replication
Organizations should build a resilient IT infrastructure with the goal of meeting the
required information and service availability. Building resilient IT infrastructure
requires the following high availability and data protection solutions:
will trigger the DR process. This process typically involves both manual and
automated procedures to reactivate the service (application) at a functioning data
center. This reactivation of the service requires the transfer of application users,
VMs, data, and services to the new data center. This process involves the use of
redundant infrastructure across different geographic locations, live migration,
backup, and replication solutions.
Knowledge Check
1. Which defines the amount of data loss that a business can endure?
a. RTO
b. RPO
c. MTBF
d. MTTR
Fault tolerance ensures that a single fault or failure does not make an entire system
or a service unavailable. It protects an IT system or a service against various types
of unavailability.
Permanent Unavailability
A fault-tolerant IT infrastructure meets two key requirements: fault isolation and
the elimination of single points of failure (SPOF).
Fault Isolation
Fault isolation contains the scope of a fault so that the other areas of a system are
not impacted by the fault. It does not prevent failure of a component but ensures
that the failure does not impact the overall system.
[Figure labels: Isolated Dead Path; Compute System]
The example represents two I/O paths between a compute system and a storage
system. The compute system uses both paths to send I/O requests to the storage
system. If an error or fault occurs on a path causing a path failure, the fault isolation
mechanism present in the environment automatically detects the failed path. The
failed path is marked as a dead path to avoid sending pending I/Os through it. All
pending I/Os are redirected to the live path. This fault isolation avoids time-out and
the retry delays.
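The behavior described above (mark the failed path dead, then redirect its pending I/Os to the live path) can be sketched as follows. This is a simplified illustration, not how any particular multipathing product works; the `Path` class, the `isolate_failed_path` function, and the I/O names are hypothetical.

```python
from collections import deque


class Path:
    """One I/O path between the compute system and the storage system."""

    def __init__(self, name: str):
        self.name = name
        self.alive = True
        self.pending = deque()  # I/O requests queued on this path


def isolate_failed_path(failed: Path, live: Path) -> None:
    """Mark the failed path dead and move its pending I/Os to the live path."""
    failed.alive = False
    while failed.pending:
        live.pending.append(failed.pending.popleft())


path1, path2 = Path("path1"), Path("path2")
path1.pending.extend(["io-1", "io-3"])
path2.pending.extend(["io-2"])

isolate_failed_path(failed=path1, live=path2)
print(path1.alive, list(path2.pending))  # False ['io-2', 'io-1', 'io-3']
```

Because the pending I/Os are moved immediately, none of them waits for a time-out on the dead path.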
[Figure labels: SPOF at Storage-level; SPOF at Network-level; Data Center]
Compute Clustering
[Figure labels: Service Failover; Compute Cluster; Heartbeat Signal]
− Active/active
− Active/passive
Link Aggregation
[Figure: link aggregation between two FC switches]
• Combines links between two switches and also between a switch and a node.
• Enables network traffic failover in the event of a link failure in the aggregation.
NIC Teaming
[Figure labels: Teaming Software; Physical NIC]
• Groups network interface cards (NICs) so that they appear as a single logical
NIC to the operating system or hypervisor.
• Provides network traffic failover in the event of a NIC/link failure.
• Distributes network traffic across NICs.
Multipathing
[Figure labels: HBA 1, HBA 2; Path 1, Path 2; two FC switches; SC1, SC2 (SC: storage controller); Storage System]
• Enables a compute system to use multiple paths for transferring data to a LUN.
• Enables failover by redirecting I/O from a failed path to another active path.
• Performs load balancing by distributing I/O across active paths.
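The load-balancing and failover points above can be illustrated with a round-robin path selector. This is a minimal sketch under the assumption of a simple round-robin policy; real multipathing software supports other policies, and the `Multipath` class and path names are hypothetical.

```python
class Multipath:
    """Round-robin I/O distribution across the active paths to a LUN."""

    def __init__(self, paths):
        self.active = list(paths)
        self._next = 0

    def fail(self, path: str) -> None:
        """Remove a failed path so that no further I/O is sent through it."""
        if path in self.active:
            self.active.remove(path)

    def pick(self) -> str:
        """Return the path to use for the next I/O request."""
        if not self.active:
            raise RuntimeError("no active path to the LUN")
        path = self.active[self._next % len(self.active)]
        self._next += 1
        return path


mp = Multipath(["path1", "path2"])
before = [mp.pick() for _ in range(4)]  # I/O alternates between both paths
mp.fail("path1")
after = [mp.pick() for _ in range(2)]   # all I/O now uses the surviving path
print(before, after)
```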
[Figure labels: Clients; Load Balancer; Data Center A; Data Center B]
Data centers comprise storage systems with a large number of disk drives and
solid state drives. The failure of these drives could result in data loss and
information unavailability.
RAID
• RAID is a technique that combines multiple drives into a logical unit that is
called a RAID set. Nearly all RAID implementation models provide data
protection against drive failures.
• The illustration provides an example of RAID 6 (dual distributed parity), where
data is protected against two disk failures. The data is striped across the disks
with dual parity. For example, the data "A" is striped across four disks: A1 and
A2 hold data, whereas Ap and Aq hold parity.
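To make the dual-parity idea concrete, the sketch below computes two parity strips for the data strips A1 and A2 and then rebuilds both data strips after a simulated double failure. It assumes one common construction (XOR parity for Ap and a Reed-Solomon-style parity over GF(2^8) for Aq); actual RAID 6 implementations may differ, and all function names are hypothetical.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) using the reduction polynomial 0x11D."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse in GF(2^8), found by exhaustive search."""
    return next(x for x in range(1, 256) if gf_mul(a, x) == 1)


def parity(a1: bytes, a2: bytes):
    """Return (Ap, Aq): Ap = A1 xor A2, Aq = A1 xor (2 * A2) in GF(2^8)."""
    ap = bytes(x ^ y for x, y in zip(a1, a2))
    aq = bytes(x ^ gf_mul(2, y) for x, y in zip(a1, a2))
    return ap, aq


def recover_both(ap: bytes, aq: bytes):
    """Rebuild A1 and A2 from the parity strips alone (both data disks lost)."""
    inv3 = gf_inv(3)  # Ap xor Aq = 3 * A2, so A2 = (Ap xor Aq) / 3
    a2 = bytes(gf_mul(inv3, p ^ q) for p, q in zip(ap, aq))
    a1 = bytes(p ^ y for p, y in zip(ap, a2))
    return a1, a2


a1, a2 = b"DATA-A1!", b"data-a2?"
ap, aq = parity(a1, a2)
print(recover_both(ap, aq) == (a1, a2))  # True
```

With a single parity strip, only one lost strip could be rebuilt; the second, independent parity is what allows recovery from two failures.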
Erasure Coding
[Figure labels: Data Write; 9 Fragments; Encoded Fragments; k = 3, m = 9]
− A set of n disks is divided into m disks to hold data and k disks to hold
coding information.
− Coding information is calculated from data.
The image illustrates an example of dividing data into nine data fragments (m = 9)
and three coding fragments (k = 3). The maximum number of drive failures
tolerated in this example is three. Erasure coding offers higher fault tolerance
(it tolerates k faults) than replication, at a lower storage cost.
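The storage-cost comparison can be checked with simple arithmetic. The sketch below assumes that erasure coding stores m + k fragments for every m fragments of data, and that replication needs one extra full copy per fault tolerated; the function names are illustrative.

```python
def erasure_overhead(m: int, k: int) -> float:
    """Raw capacity consumed per unit of data with m data and k coding fragments."""
    return (m + k) / m


def replication_overhead(faults_tolerated: int) -> int:
    """Full copies needed so that data survives the given number of failures."""
    return faults_tolerated + 1


# The m = 9, k = 3 example tolerates three drive failures.
print(round(erasure_overhead(9, 3), 2))  # 1.33
print(replication_overhead(3))           # 4
```

Under these assumptions, tolerating three failures costs about 1.33 times the data size with erasure coding, compared with four times the data size with replication.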
• Automatically replaces a failed drive with a spare drive to protect against data
loss.
Any kind of cache failure will cause loss of data that is not yet committed to the
storage drive. This risk of losing uncommitted data held in cache can be mitigated
using cache mirroring.
[Figure labels: Service Failover; Availability Zone C]
• An availability zone is a location with its own set of resources that is isolated from
other zones.
• Availability zones, although isolated from each other, are connected through
low-latency network links.
• In the event of a zone outage, services can fail over to another zone.
Knowledge Check
Concepts in Practice
Dell PowerPath
Scenario
Deliverables
Solutions
Business continuity (BC) is a set of processes that includes all activities that a
business must perform to mitigate the impact of planned and unplanned downtime.
BC entails preparing for, responding to, and recovering from a system outage that
adversely affects business operations. It describes the processes and procedures
an organization establishes to ensure that essential functions can continue during
and after a disaster.
In a modern data center, policy-based services can be created that include data
protection through the self-service portal. Consumers can select the class of
service that best meets their performance, cost, and protection requirements on
demand. Once the service is activated, the underlying data protection solutions that
are required to support the service are automatically invoked to provide the required
data protection.
Data center failure due to disaster (natural or man-made disasters such as flood,
fire, earthquake, and so on) is not the only cause of information unavailability. Poor
application design or resource configuration errors can lead to information
unavailability. For example, if the database server is down for some reason, then
the data is inaccessible to the consumers, which leads to an IT service outage.
Unavailability of data caused by factors such as data corruption and human
error also leads to an outage. In addition, the IT department is routinely required
to perform activities such as refreshing the data center infrastructure, migration,
running routine maintenance, or even relocating to a new data center. Any of these
activities can have a significant negative impact on information availability.
Note: In general, outages can be broadly categorized into planned and
unplanned outages.
Both RPO and RTO are measured in minutes, hours, or days and are directly related
to the criticality of the IT service and data. Usually, the lower the RTO and RPO,
the higher the cost of a BC solution or technology.
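As an illustration of how these targets might be checked, the sketch below compares a protection setup against an RPO and an RTO. It assumes periodic backups, where the worst-case data loss equals the interval between backups, and an estimated recovery time supplied by the planner; the function names and sample values are hypothetical.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """With periodic backups, worst-case data loss is one full backup interval."""
    return backup_interval <= rpo


def meets_rto(estimated_recovery_time: timedelta, rto: timedelta) -> bool:
    """The service must be restored within the recovery time objective."""
    return estimated_recovery_time <= rto


rpo, rto = timedelta(hours=4), timedelta(hours=2)

print(meets_rpo(timedelta(hours=24), rpo))    # False: nightly backup risks 24 hours of data
print(meets_rpo(timedelta(minutes=15), rpo))  # True: 15-minute copies stay within the RPO
print(meets_rto(timedelta(hours=1), rto))     # True: recovery fits within the RTO
```

Shortening the interval or the recovery time generally requires more frequent copies or more standby infrastructure, which is why lower targets cost more.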
Organizations often keep their DR site ready to restart business operations if there
is an outage at the primary data center. This may require the maintenance of a
complete set of IT resources at the DR site that matches the IT resources at the
primary site. Organizations can either build their own DR site or build one in the
cloud.
Organizations may also create multiple availability zones to avoid single points of
failure at the data center level. Usually, each zone is isolated from others, so that the
failure of one zone would not impact the other zones. It is important to have high
availability mechanisms that enable automated application/service failover within
and across the zones if there is a component failure or disaster.