The Benefits of Data Replication in HACMP/PowerHA Cluster
Management Implementations

WHITE PAPER

Executive Summary

In today's fast-paced world, businesses both large and small face increasing internal
and external demands for data protection and efficient, uninterrupted operations. Even
a brief interruption in services and processes can have potentially disastrous results that
businesses cannot afford to risk. IT departments are being tasked with accommodating
these requirements while also being expected to do more with less in a diminishing
economy.

Fortunately, technological advances in AIX high availability, clustering, disaster recovery,
and continuous operations have continually risen to meet these challenges, ensuring that
both planned outages due to maintenance and upgrades and unplanned outages due to
environmental conditions, operator error, or software bugs result in minimal data loss.

The HACMP high availability solution, which clusters multiple servers around shared storage,
offers automatic recovery of applications and system resources if a failure occurs with the
primary server, thereby maintaining the highest levels of data currency in that scenario.
Nonetheless, clustering is only part of the equation of a truly resilient IT infrastructure
because should the shared storage become damaged or otherwise unusable, significant
disruption of business critical applications will still occur. That is why the other essential
component of a truly resilient AIX environment is data replication technology which protects
the database by maintaining a storage clone in an offsite location. This way, both the servers
and the storage are redundant. Still, not all replication solutions are alike.

Ensuring Protection for the Server


With the advent of globalization and of business demand for increased service-level
agreements (SLAs) that require the highest level of availability of business-critical services
and servers, high availability solutions became critical components in information
systems—not just for large enterprises, but also for medium-sized and small businesses
that, in many ways, are even more vulnerable to system outages. While a larger enterprise
may have the human, technical, and financial resources to cope with and survive an
unplanned outage, smaller businesses that lack similar resources can easily be put out of
business if a core IT function becomes unavailable even for a short period of time.

High availability, also sometimes referred to as fault resilience, refers to technology with
which servers and business services can achieve availability characteristics in the range of
99.99%–99.999%. High availability systems should be designed for businesses that can
endure short periods of downtime; in contrast, fault tolerant systems are designed to achieve
virtually continuous operation, albeit that level of availability requires fully redundant hardware
and software components, resulting in higher solution cost.
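
To make the availability range above concrete, the following minimal sketch converts an availability percentage into the downtime it permits per year. The function name and the choice of a 365-day year are illustrative assumptions, not part of any product.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year (assumption)


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime implied by an availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


# The two ends of the high availability range quoted in the text.
for target in (99.99, 99.999):
    print(f"{target}% availability allows about "
          f"{downtime_minutes_per_year(target):.2f} minutes of downtime per year")
```

Under these assumptions, the range corresponds to roughly 53 minutes down to roughly 5 minutes of downtime per year.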

High availability for the AIX operating system is accomplished by cost-efficiently utilizing
redundant hardware and software components as well as clustering software that manages
the system and is responsible for monitoring system health and performing the necessary
recovery actions should a failure occur.

Since there is typically sufficient system capacity available to temporarily host services—
either readily or via AIX’s Capacity on Demand facility—high availability clustering will
improve service availability not only during unplanned events, but also during scheduled
maintenance. As high availability clustering products enable the administrator to handle
system resources in groups, they can significantly improve system administration and
change management practices, thereby contributing to higher achievable service-level
agreements and reducing administration labor expenses.

In order to provide high availability for AIX servers on the System p/AIX platform, IBM’s high
availability solution for AIX, High Availability Cluster Multi-Processing (or HACMP, now called
PowerHA) has traditionally been utilized. HACMP, in its base offering, is a high availability
solution that provides capabilities to assist with monitoring the cluster, automatically
recovering applications and system resources if a failure occurs, and easing system and
cluster administration and maintenance via its Single System Image–like capabilities.

HACMP provides a very mature, robust, and feature-rich environment for high availability,
with built-in capabilities to support 32 nodes, complex multi-tier business application
environments, and AIX’s superior virtualization features. It provides protection for network
resources, applications, logical volume manager (LVM) resources, and other resources that
may be less commonly used but are equally important for certain environments. HACMP’s
base offering is typically used in a shared-disk environment; protection against disk or disk
array failure typically has to be achieved by disk mirroring technologies (RAID solutions, LVM
mirroring, etc).

A typical local HACMP cluster is depicted in the following diagram:

[Figure: typical local HACMP cluster. Labels: Ethernet (IP) network (LAN); heartbeat
monitor; two AIX nodes; shared data store.]

High availability in the realm of a local data center was, for many years, the typical business
continuity solution for AIX environments. While there were solutions available for AIX that
addressed the need for disaster recovery (another major component of business continuity),
those solutions were expensive, challenging to install, and difficult to administer.

The past decade, however, witnessed a surge in interest in proper disaster recovery
practices and solutions driven by large, as well as SMB-type, businesses. This is because,
for any size company, a failure of a data center, be it an unexpected power outage, a
scheduled site maintenance project, or an environmental disaster, must not jeopardize
service-level agreements; must not place unacceptable risk on those businesses’ ability
to continue serving their global market; and must not risk their ability to satisfy stringent
regulatory compliance requirements such as HIPAA, Sarbanes-Oxley, or Basel II, or other
country-specific regulations.

It is important to mention the third and final, yet equally important, component of business
continuity: continuous operations. Any high availability and disaster recovery solution has to
be relatively easy to use for the current IT staff and should introduce the least disruption to
existing IT processes; otherwise, the solution itself may become the source of disruptions to
business.

Considerations for Disaster Recovery Solutions


In order to satisfy the emerging business need for disaster recovery for AIX environments
and servers, solutions targeted toward disaster recovery needs have been introduced
and have gained wider acceptance over the last decade. HACMP’s disaster recovery
family of products, branded as HACMP/XD, provides various options for disaster recovery.

These solutions typically provide automated failover capabilities for the data (controlled by
HACMP’s cluster management functionality) and rely on HACMP to provide availability of all
the other resources necessary for the business service, such as applications and network
resources.

Disaster recovery solutions typically replicate data to a geographically distant remote location
either synchronously or asynchronously, via either an IP-based or a proprietary connection
(ESCON link, etc).

Synchronous data replication solutions replicate data to the remote location in a
synchronous manner; that is, the application’s write request is only considered finished
once both servers have written the data to their respective disks. While some business
services do require the characteristics of synchronous replication, the downside is that the
application may be slowed down because writing over the WAN takes significantly longer
than writing to the SAN or to direct-attached disks, and network bandwidth between the
two sites has to be sized to handle the peak data load. These solutions are also sensitive to
unexpected network use. Therefore, they typically require very careful network bandwidth
sizing and ongoing management, as well as quasi-dedicated networks, which ensure that
an unexpected network load (due to increased user workload, a network backup operation,
etc.) does not interfere with the performance of critical business applications.

Asynchronous replication solutions, on the other hand, buffer data on the local site, and,
as soon as data is written to the disk, the application’s write request is complete. The
advantage of these solutions is that the application’s performance is not affected noticeably,
and network bandwidth can be sized to the average data load. Any excess data that
cannot be replicated to the disaster recovery site because of network bandwidth limitation
is buffered up and is later replicated to the disaster recovery site as bandwidth allows.
Asynchronous replication solutions also typically cope with network outages better. For the
majority of business applications and businesses, asynchronous replication solutions provide
superior performance and cost efficiency while maintaining satisfactory recovery point
objectives (RPOs) and recovery time objectives (RTOs).
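
The sizing difference between the two approaches can be sketched with a small model. This is only an illustration under stated assumptions: the write volumes are hypothetical, time is divided into equal intervals, and the function names are invented for this example rather than taken from any replication product.

```python
from typing import List


def sync_required_bandwidth(loads: List[float]) -> float:
    """Synchronous replication must keep up with the busiest interval."""
    return max(loads)


def async_backlog(loads: List[float], link_capacity: float) -> List[float]:
    """Return the buffered (not yet replicated) data after each interval."""
    backlog = 0.0
    history = []
    for written in loads:
        pending = backlog + written          # old backlog plus new writes
        sent = min(pending, link_capacity)   # the link drains what it can
        backlog = pending - sent             # the rest stays buffered locally
        history.append(backlog)
    return history


# Hypothetical write volumes in MB per interval, with one burst.
loads = [10.0, 50.0, 10.0, 10.0]
average_load = sum(loads) / len(loads)

print("sync link must carry:", sync_required_bandwidth(loads), "MB per interval")
print("async backlog on an average-sized link:", async_backlog(loads, average_load))
```

In this model the buffered backlog is the data that would be missing at the recovery site if the production site failed at that moment, which is why the buffer behavior matters when setting an RPO.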

One important requirement for replication solutions is write-order fidelity, which means that
writes on the recovery site have to occur in the exact order they occur on the production
site; otherwise, applications could not reliably be recovered. If the replication solution
experiences a shorter or longer period of network outage, it has to be able to provide a
consistent image of the data with write-order fidelity.
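
One common way to preserve write order, shown here as a minimal sketch and not as a description of how any particular product works, is to tag each write with a sequence number on the production side and have the recovery side hold back anything that arrives early. The class and field names are hypothetical.

```python
from typing import Dict, List, Tuple


class OrderedApplier:
    """Applies replicated writes strictly in production sequence order."""

    def __init__(self) -> None:
        self.next_seq = 1                              # next sequence number to apply
        self.held: Dict[int, Tuple[str, bytes]] = {}   # writes that arrived early
        self.disk: Dict[str, bytes] = {}               # block id -> contents
        self.applied: List[int] = []                   # order in which writes were applied

    def receive(self, seq: int, block: str, data: bytes) -> None:
        """Accept a write from the network; apply it only when its turn comes."""
        self.held[seq] = (block, data)
        while self.next_seq in self.held:
            blk, payload = self.held.pop(self.next_seq)
            self.disk[blk] = payload
            self.applied.append(self.next_seq)
            self.next_seq += 1


replica = OrderedApplier()
replica.receive(2, "blk-7", b"second")   # arrives early, is held back
replica.receive(1, "blk-7", b"first")    # unblocks write 1, then write 2
replica.receive(3, "blk-9", b"third")
print(replica.applied, replica.disk["blk-7"])
```

Because write 2 is never applied before write 1, the replica's copy of the block ends with the same contents as production, even though the network delivered the writes out of order.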

When compared with local high availability solutions, disaster recovery solutions have unique
design challenges that stem from the geographic distance and the presence of two or
more copies of the data. One such consideration is what was discussed above: the choice
between synchronous and asynchronous replication. It is also important to choose between
automated and non-automated failover options. It is relatively straightforward to detect
failure of network and other components of the IT infrastructure in a local data center
environment, and an unnecessary failover typically has limited effects on the business.

These challenges become much more complex in the case of a disaster recovery cluster.
WAN outages can occur much more easily; providing redundant networks to distinguish
between network failures and site failures becomes more difficult; and a failover to the
disaster recovery site can have significant effects on the business: the disaster recovery
site’s environment may be less powerful; having the users reach the application service
via the WAN may introduce slower application response times; certain client and router
reconfigurations may be required in order to allow the clients to connect directly to the
disaster recovery site; and, when the business is ready to move production back to the main
site, the site fallback will introduce another outage. For these reasons, while it may seem
tempting to automatically fail over the service, customers typically opt for manual disaster
recovery failover.

Rolling Disasters
One additional scenario that may more easily happen in a disaster recovery configuration is
the occurrence of a failure condition the industry refers to as “rolling disasters.” In this case,
certain components of the system start to fail gradually, while replication is still occurring.
This leads to corrupt transactions being added to the database while replication is still
occurring, which results in data corruption not only at the production site, but also in the
disaster recovery server’s data image. Eventually, the entire site fails, but by that point, the
image of the data on the disaster recovery server is unusable for business purposes.

With traditional replication solutions, the only available image of the data is either the current
replica (as with HACMP/XD, HAGEO or GLVM) or a point-in-time (PIT) copy of the data,
which would have to be taken at predetermined time periods (e.g., Veritas Volume Replicator
with FlashSnap), which affects RPO targets. If none of the available copies are suitable for
business purposes (e.g., the latest image got corrupted five minutes ago, but the latest PIT
copy is from an hour ago), the business has to decide whether to revert back to the last
good copy (which may be last night’s backup in many instances) or attempt to repair the
data image available, both of which significantly degrade RPO and/or RTO.

In essence, the result of rolling disasters is a combination of physical and logical disasters
in the sense that the production site experiences a physical disaster and the data becomes
corrupt prior to the complete failure of the production site.

Disaster recovery solutions should not only provide protection against physical and logical
disasters, but should also be effective for mitigating extensive downtime due to scheduled
maintenance. Such maintenance procedures include upgrading applications, the operating
system, or hardware; moving servers from one location to another; ensuring power
maintenance at the production site; and maintaining the network infrastructure, to mention
just a subset of common scenarios.

Additional Data Protection with EchoStream for AIX


Traditional data replication solutions, both general solutions and the ones available for AIX,
typically replicate data to the disaster recovery site. In conjunction with a clustering solution,
they provide automated failover capabilities, thereby achieving a tier-7 disaster recovery
solution, which means it has addressed the requirement that data can be replicated to the
disaster recovery site and can be used to start up business services should the production
site experience a failure condition.

The disadvantage of traditional disaster recovery and data replication solutions is fourfold:

• Either they are unable to protect against logical as well as rolling disasters or they rely on
predetermined snapshot points to provide some level of protection, with degraded levels of
recovery point and recovery time objectives. Because of this, they place a lot of burden on
the administration staff to ensure proper operating procedures that result in an acceptable
balance among recovery point objectives, ongoing replication performance, and recovery
time objectives.
• Solutions that have been available for the AIX platform are either difficult and expensive
to configure and maintain, or do not include functionality needed by the majority of businesses,
such as asynchronous replication and manual failover capability.
• If a data replication solution relies on sector-by-sector storage hardware replication or
predetermined snapshot points, it is difficult for IT to efficiently use this second set of data
for other workloads, such as reporting, business intelligence, and data warehousing. In
addition, backup tapes cannot typically be made from the data on the target system,
which means that process must still be conducted on the production system, with planned
downtime being required to do so.
• If the solution does not offer capabilities to easily create snapshots without affecting
replication, then disaster recovery testing (an important component of an overall business
continuity plan) either is difficult and cannot therefore be performed sufficiently frequently
or may cause degradation to recovery point and recovery time objectives if replication is
affected.

Going Beyond Traditional Recovery


EchoStream for AIX from Vision Solutions is an innovative disaster recovery solution that
addresses the disadvantages discussed above. It is an asynchronous, IP-based disaster
recovery solution; hence, it is able to utilize network bandwidth efficiently, without noticeably
impacting application performance. Due to its unique continuous data protection (CDP)
capabilities, it is able to assist not only with unplanned physical disasters, but also with the
far more common logical disasters, as well as rolling disasters. Its unique virtual snapshot
capability assists in ensuring disaster recovery readiness, and its easy disaster recovery
capability enables a quick manual switch to the disaster recovery site.

Protecting the Data as Well as the Server with Real-Time Data Replication

While Vision Solutions offers a tier-7 disaster recovery solution by giving customers the
option to combine clustering and replication products, many enterprise-level customers have
chosen to leverage EchoStream’s unique replication and disaster recovery characteristics
with HACMP’s feature-rich environment.

The benefits of this combined system architecture are many:

• HACMP’s mature, feature-rich, robust capabilities are utilized for automated high availability
within the main data center. Should a localized failure condition occur, HACMP can recover
critical system resources.

• HACMP’s Single System Image (SSI)–type capabilities greatly reduce system administration
time and resource requirements.
• EchoStream provides data replication and disaster recovery capabilities to one or more
disaster recovery sites. Due to EchoStream’s flexible replication capabilities, cascaded or
star-like replication topologies can also be configured.
• EchoStream replicates data asynchronously to the disaster recovery site through an
IP-based connection either with or without data compression. Only changes are replicated,
thereby ensuring efficient bandwidth utilization. By compressing the communication flow,
customers typically achieve five to six times the effective throughput on the same network link.
• EchoStream provides single-click manual failover capability to the disaster recovery site. As
discussed above, under most circumstances, disaster recovery failover is not an automated
process, and automating it would introduce unacceptable risks.
• After disaster recovery failover due to either planned or unplanned outage, EchoStream
allows for a very network-efficient resynchronization process, requiring only the changes
that occurred after the failover to be synchronized back to the original production site.
Once the business is ready for failing back the production application to the original
production site, the failback procedure is similarly simple.
• EchoStream’s unique true CDP feature, which not only replicates changes to the disaster
recovery site, but also tracks and stores each change in buffers as it occurs, can be
utilized to quickly restore data and to recover from logical disasters. In other words, you
can recover objects from any point in time should an object become deleted or otherwise
corrupted.
• EchoStream’s virtual snapshot capability can be utilized for a wide variety of business uses,
ranging from offloading backup procedures, through data retrieval, to business-report-generation
purposes. More will be discussed about the benefits of this capability later in
this paper.
• Since EchoStream is a software-based replication solution that runs on the AIX server, it
does not require that major changes be made to existing data center operating procedures.
It can run on any underlying storage solution, either in a heterogeneous or homogeneous
storage environment, thereby allowing for a gradual introduction and the full
utilization of existing capital investments.
• Similar and dissimilar storage options are accommodated. Businesses can leverage their
existing storage and SAN investment, including hardware, software, and staff knowledge.
This allows businesses to grow their storage size and performance with business needs
and helps them to avoid vendor lock-in.

Replication for HACMP System Architectures


In its simplest case, this configuration consists of two HACMP nodes at the production site
and one server at the disaster recovery site, as depicted in the diagram below. The servers
could be either standalone AIX servers or logical partitions (virtual servers) running on the
same physical AIX server. The HACMP cluster can be either an existing one, in which case
EchoStream would be added for disaster recovery purposes, or a new cluster that requires
superior local high availability and disaster recovery characteristics.

[Figure: Local HA with Remote Replication. Labels: HACMP; three AIX servers, one of
them offsite; EchoStream for AIX.]

In the configuration depicted above, HACMP is utilized in a shared disk configuration on
two nodes to make business services highly available. HACMP can be used either in a
hot standby or in a mutual takeover configuration. An EchoStream context (EchoStream’s
replication “group” that can contain several logical volumes among which write-order fidelity
is maintained by EchoStream) is made part of the HACMP resource group.

During regular operations, one of the HACMP nodes hosts the resource group, and
replication occurs from that server to the recovery server at the disaster recovery site. If there
is a failure condition on the server hosting the resource group or if the administrator initiates
a resource group movement to the other HACMP node, the resource group is taken over.
As part of the failover procedure, EchoStream is stopped on the server originally hosting
the resource group and then is started on the server to host the resource group. From the
recovery server’s perspective, this is merely a short suspension in replication.

If the site fails or the administrator moves the production service to the disaster recovery site,
the disaster recovery failover is achieved by issuing an EchoStream command that will bring
up EchoStream and the file system on the disaster recovery site, after which the application
can be started up, resulting in a short RTO.

Extending Replication to Include True Continuous Data Protection (CDP)

As discussed above, EchoStream provides its unique data replication and disaster recovery
capabilities by utilizing what is referred to as "true continuous data protection" (true CDP)
technology. In essence, this technology is similar to transaction logging in that each write
IO is buffered in the form of redo and undo logs and can later be used to reconstruct earlier
images of the data, allowing for advanced any-point-in-time data recovery.
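
The undo/redo idea can be illustrated with a toy journal. This is a sketch under stated assumptions and does not describe EchoStream's internal design: the class names are hypothetical, timestamps are simple integers that never decrease, and a "volume" is just a dictionary of blocks.

```python
from typing import Dict, List, NamedTuple, Optional


class JournalEntry(NamedTuple):
    time: int               # logical timestamp of the write
    block: str              # which block was written
    undo: Optional[bytes]   # contents before the write (None = block did not exist)
    redo: bytes             # contents after the write


class CdpVolume:
    """A block store that journals every write with undo and redo data."""

    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}
        self.journal: List[JournalEntry] = []

    def write(self, time: int, block: str, data: bytes) -> None:
        self.journal.append(JournalEntry(time, block, self.blocks.get(block), data))
        self.blocks[block] = data

    def image_at(self, time: int) -> Dict[str, bytes]:
        """Rebuild the volume as of `time` by undoing later writes, newest first."""
        image = dict(self.blocks)
        for entry in reversed(self.journal):
            if entry.time <= time:
                break                          # everything older is already in place
            if entry.undo is None:
                image.pop(entry.block, None)   # the block did not exist yet
            else:
                image[entry.block] = entry.undo
        return image


vol = CdpVolume()
vol.write(1, "orders", b"v1")
vol.write(2, "orders", b"v2")
vol.write(3, "orders", b"corrupt")   # a logical error lands at time 3
print(vol.image_at(2))               # image from just before the error
```

The live volume is never modified by `image_at`; the earlier image is built on a copy, which mirrors the idea of choosing a recovery point after the problem has already occurred.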

Since changes are continuously tracked and buffered as they occur, a significant advantage
EchoStream has over other, more traditional replication products is that the recovery point
can be chosen after a problem occurs, rather than having to rely on placing snapshot
points before each major operation. This greatly improves recovery point and recovery time
objectives not only during physical disasters, but also during logical disasters. Rather than
having to restore a previous night’s backup in order to restore an accidentally deleted file,
a logical, virtual image of the data can be reconstructed within minutes, from which the
deleted file can then be recovered. This same process can be extended to databases and
deleted records and tables.

During physical disasters, the recovery point objective (RPO) essentially becomes a
continuum. Rather than having only the latest image of the data available, which may be
unusable from a business perspective even if it is consistent at the file system and database
levels, the business can now evaluate whether an earlier recovery point might be more
suitable to recover to. Criteria for that decision might include the necessity to bring the
AIX server in sync with auxiliary systems that have worse RPO characteristics; business
process–related significant recovery points (e.g., recovering to the middle of end-of-day
processing may not be desirable); or the occurrence of a rolling disaster that corrupted the
latest recovery point.

The process of evaluating this is easily achievable with virtual snapshots, which is discussed
in greater detail below.

Why Is the Inclusion of True CDP Superior to Replication Without CDP?

As discussed above, EchoStream's true CDP functionality adds unparalleled capabilities
to all aspects of both disaster recovery and business continuity, capabilities that are
necessary in order to satisfactorily address unique challenges that occur in disaster recovery
environments.

Perhaps EchoStream’s most-utilized feature is its unique snapshot capability, which lets
companies achieve better utilization of their recovery server. EchoStream replicates data to
the disaster recovery server while journaling data updates as they occur. The journals are
buffered on the recovery server and can be archived to tape media to provide an expanded
recovery window (which may be mandated by regulatory compliance requirements).
EchoStream’s snapshot capability occurs on the recovery server, thereby mitigating any
risk that snapshots could impose on the production server. On the recovery server, the
administrator can create either read-only or read/write-capable virtual snapshots, which can
then be utilized for a variety of purposes (listed below).

EchoStream uses Copy-On-Write snapshot technology, which allows for very disk space–
efficient creation of snapshots, with very low disk-use overhead (typically on the order of a
few percent of the protected data set’s size).
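
The space efficiency of copy-on-write follows from the fact that a snapshot stores only the original contents of blocks that change after it is taken. The sketch below illustrates that general technique; the class names are hypothetical and it is not a description of EchoStream's implementation.

```python
from typing import Dict, List, Optional


class CowSnapshot:
    """A read-only view of a volume as it was when the snapshot was taken."""

    def __init__(self, volume: "Volume") -> None:
        self.volume = volume
        self.saved: Dict[str, Optional[bytes]] = {}  # originals of changed blocks only

    def read(self, block: str) -> Optional[bytes]:
        if block in self.saved:
            return self.saved[block]           # block changed since the snapshot
        return self.volume.blocks.get(block)   # unchanged, share the live copy


class Volume:
    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}
        self.snapshots: List[CowSnapshot] = []

    def snapshot(self) -> CowSnapshot:
        snap = CowSnapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, block: str, data: bytes) -> None:
        # Copy the old contents aside the first time a block changes after a snapshot.
        for snap in self.snapshots:
            if block not in snap.saved:
                snap.saved[block] = self.blocks.get(block)
        self.blocks[block] = data


vol = Volume()
vol.write("a", b"a1")
vol.write("b", b"b1")
snap = vol.snapshot()
vol.write("a", b"a2")                      # incoming updates keep changing the volume
print(snap.read("a"), vol.blocks["a"])     # the snapshot still sees the old contents
print(len(snap.saved), "block(s) copied")  # only the changed block uses extra space
```

Here the snapshot consumes space for one block out of two, and the unchanged block is read straight from the live volume, which is why overhead tends to track the rate of change instead of the size of the data set.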

It is important to note that, while snapshots are in use on the recovery server, replication
continues uninterrupted, so the business is not exposed to degraded recovery point
or recovery time objectives.

Snapshots then can be used for a variety of purposes:

1. Most importantly, the combination of snapshots and CDP allows you to easily and
resource-efficiently recover data without impacting production. If there is data corruption
(caused by users, administration, or an application), you can easily reverse that by
creating a snapshot of an earlier point in time, performing any necessary investigation and
validation, and then reapplying the data onto the production server, in most cases without
affecting production and the users of the system.
2. One of the most common ways to use snapshots is to offload the tape backup procedure
from the production server to the recovery server, thereby completely eliminating the need
for a "backup window." A snapshot is created on the recovery server, and the tape backup is
taken from it.
3. A snapshot is an effective way to manage your reporting, business intelligence, and data
mining requirements.
4. With a snapshot, you can perform non-intrusive disaster recovery readiness testing to
ensure that service-level agreements are met.
5. When you need to test application upgrades or software patches before rolling them out
to production, a snapshot ensures the success of the process by providing a place to
return to if anything goes awry.
6. A snapshot is useful for creating an isolated "sandbox" training system for new
employees, which can both minimize employee ramp-up time and ensure that practice
activities and mistakes do not impact actual production systems.

Summary
In today's competitive economy, data and service availability is crucial to a business's
survival. Inefficiency cannot be tolerated. Hardware has become increasingly dependable,
but unplanned outages caused by physical disasters, logical disasters, and rolling disasters
still happen. And even planned outages for maintenance and upgrades can have a negative
effect on business performance.

In the case of unplanned outages, HACMP software for System p/AIX provides monitoring,
failure detection, and automated application recovery to help protect business-critical
applications—and the businesses that rely on them—from failing.

And during planned outages, the HACMP solution can transfer applications and data to
backup systems so that users still have access.

A reliable, cost-effective IT infrastructure that keeps a business running 24x7 is no longer a
luxury. It's a necessity.

Easy. Affordable. Innovative. Vision Solutions.


Vision Solutions, Inc. is the world’s leading provider of high availability, disaster recovery, and
data management solutions for the IBM System i and System p markets. With a portfolio
that spans the industry’s most innovative and trusted HA brands, Vision’s iTERA™, MIMIX®,
and OMS/ODS™ keep business-critical information continuously protected and available.
Complementing Vision’s availability offerings, Vision Director™ delivers a highly integrated set
of applications that proactively monitors, manages and optimizes System i servers, databases
and application environments to help ensure the continued health of System i servers.

Affordable and easy to use, Vision products help to ensure business continuity, increase
productivity, reduce operating costs, and satisfy compliance requirements. Vision also offers
advanced cluster management, data management, and systems management solutions,
and provides support for i5/OS®, Windows® and AIX® operating environments. As IBM’s
largest high availability Premier Business Partner, Vision Solutions oversees a global network
of business partners and services and certified support professionals to help our customers
achieve their business goals. Privately held by Thoma Cressey Bravo, Inc., Vision Solutions is
headquartered in Irvine, California with offices worldwide.

For more information call 801-799-0300 or toll free at 800-957-4511, or visit
visionsolutions.com.


15300 Barranca Parkway
Irvine, California 92618
800-957-4511
801-799-0300
visionsolutions.com

© Copyright 2010, Vision Solutions. IBM and System i are trademarks of International Business
Machines Corporation. WP_ReplicationHACMP_E_1005
