High Availability on System p

HACMP & POWER6 Innovations

August 2007 | by Ken Milberg

Figure 1

Many people are already aware of the exciting innovations of the new POWER6
architecture, including live-partition mobility, which allows system administrators (or even
operators) to move partitions from one server to another while the partition is actually
running a production workload. While exciting and notable, is this really high availability?
Does this (or will it) replace IBM’s flagship availability product on the midrange, IBM’s High
Availability Cluster Multi-Processing (HACMP)?

The answer is no. It’s important to distinguish between partition mobility (or application
mobility, soon to be available with AIX V6.1) and true high availability (HA). Partition
mobility is a feature that can eliminate downtime during planned outages. It won’t increase
your system’s availability during unplanned outages, such as a hardware crash. That’s where
IBM’s HACMP comes in. It’s worthwhile to discuss HACMP in the context of the new
availability innovations coming out later this year. While HACMP V5.4 offers several new
elements, we should first dispel the notion that either partition mobility or live application
mobility is somehow intended to replace HACMP.

What is HACMP?
HACMP is IBM’s HA product for UNIX (and now Linux), first released in the early 1990s.
It’s a very mature product, yet it can be very challenging to understand and deploy. There are
really two purposes of HACMP:

1. To build a cluster of nodes to provide an environment that is highly available.
2. To build a cluster of nodes to improve overall application performance.

For the purposes of this article, we’ll focus on the first purpose, as most users of HACMP use
the product for availability purposes. One of the obvious benefits of HACMP is that it
provides availability during planned system outages. As most systems administrators know,
planned outages (thankfully) come much more frequently than unplanned outages. Planned
outages include those taken for system software updates, application updates/upgrades
and firmware upgrades. As partition mobility will soon provide the ability to account for
planned outages, HACMP will be needed primarily for unplanned outages going forward.

HACMP is certainly not for everybody. While it’s a mature, proven product, even the most
experienced system administrators will tell you that configuring and maintaining HACMP
clusters isn’t for the faint of heart. While much cheaper than competing products such as
VERITAS Cluster Server (also available for AIX and System p), HACMP still comes with
a cost, much of which actually lies outside of the licensing and associated software costs. This
cost includes the funds necessary to train IT staff in HACMP, probable consulting costs
incurred during cluster installation and configuration and other related maintenance costs.

Further, HACMP is only really necessary in environments that must have continuous
availability. When deciding whether to use HACMP, you must consider the cost of deploying
it versus the cost of having your systems down for four to eight hours while your hardware is
being fixed. Other important considerations include the actual cost of failure to your
environment, and what applications must be highly available. The bottom line is, if you can
afford the downtime, then you don’t really need HACMP. If your application absolutely
can’t afford to be down, then you may not be able to live without it. (We should also note
that many applications, including Oracle, also provide availability solutions at the application
layer.)
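
To make that trade-off concrete, the comparison can be reduced to simple arithmetic. The sketch below is illustrative only: the function names and every input (outage frequency, hourly cost of downtime, yearly cluster cost) are hypothetical assumptions, not figures from IBM. The only value taken from the text is the four-to-eight-hour repair window, represented here by a six-hour midpoint.

```python
def expected_downtime_cost(outages_per_year, hours_per_outage, cost_per_hour):
    """Expected yearly cost of unplanned outages on a standalone server."""
    return outages_per_year * hours_per_outage * cost_per_hour


def cluster_is_justified(annual_cluster_cost, outages_per_year,
                         hours_per_outage, cost_per_hour):
    """True when expected downtime costs more than running the cluster.

    annual_cluster_cost should cover licensing plus training, consulting
    and maintenance, since much of the cost lies outside the license.
    """
    downtime = expected_downtime_cost(outages_per_year, hours_per_outage,
                                      cost_per_hour)
    return downtime > annual_cluster_cost


if __name__ == "__main__":
    # All inputs below are made-up illustration values.
    hours = 6             # midpoint of the four-to-eight-hour repair window
    outages = 0.5         # assume one hardware failure every two years
    cluster_cost = 40000  # assumed yearly cost of owning the cluster

    for cost_per_hour in (2000, 20000):
        cost = expected_downtime_cost(outages, hours, cost_per_hour)
        verdict = cluster_is_justified(cluster_cost, outages, hours, cost_per_hour)
        print(cost_per_hour, cost, verdict)
```

This deliberately ignores the brief interruption that still occurs while a failover completes, so it slightly overstates the benefit of clustering.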

How does HACMP work?

While HACMP can support up to 32 nodes (8 for Linux), the vast majority of configurations
are two-node clusters, in which one node functions as the failover node for the primary node.
In HACMP lingo, that means one active and one standby node are running, both using the
same shared disk. See Figure 1.

This figure illustrates a two-node IBM AIX HACMP environment running Oracle, consisting
of an active and a standby server. Mutual takeover configurations, where both nodes are
running applications and backing each other up, aren’t as common, though they certainly also
work well. When configuring your cluster, you must account and plan for your applications,
cluster topology, network connectivity, shared storage devices, shared LVM components,
resource groups, cluster event processing and ultimately the clients themselves.

Each resource is defined as part of a resource group, which is then configured to have a
relationship with the cluster nodes. Depending on this relationship, resource groups can be
defined in four different ways: cascading, cascading without fallback, rotating or concurrent
access. When the primary server goes down, the HACMP software on the standby system
recognizes the failure and starts to take action, usually taking over the service IP address
of the primary server. The HACMP software also mounts the shared filesystem on the standby
system and starts up its applications.
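
The takeover sequence just described can be pictured with a small model. This is a conceptual sketch, not HACMP code or its API: the class names, the fail_over function, the IP address and the filesystem path are all hypothetical, and the only thing taken from the text is the order of actions (service IP, shared filesystem, applications).

```python
from dataclasses import dataclass, field


@dataclass
class ResourceGroup:
    name: str
    service_ip: str
    filesystems: list
    applications: list


@dataclass
class Node:
    name: str
    up: bool = True
    log: list = field(default_factory=list)

    def acquire(self, group):
        """Bring a resource group online in the order the text describes."""
        self.log.append(f"take over service IP {group.service_ip}")
        for fs in group.filesystems:
            self.log.append(f"mount {fs}")
        for app in group.applications:
            self.log.append(f"start {app}")


def fail_over(primary, standby, group):
    """Move the group to the standby once the primary is seen as down."""
    if primary.up:
        raise RuntimeError("primary is still up; nothing to take over")
    if not standby.up:
        raise RuntimeError("no healthy standby node available")
    standby.acquire(group)
    return standby


if __name__ == "__main__":
    # Hypothetical two-node cluster like the one in Figure 1.
    group = ResourceGroup("oracle_rg", "10.0.0.50", ["/oradata"], ["oracle"])
    primary, standby = Node("node_a"), Node("node_b")
    primary.up = False  # simulate a hardware crash on the primary
    owner = fail_over(primary, standby, group)
    print(owner.name, owner.log)
```

Because clients connect to the service IP address rather than to a particular host, they reach the application on whichever node currently owns the resource group.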

In a typical cascade relationship, the standby server remains operational until the HACMP
software on the standby system recognizes that the primary system is operational and falls
back. In this relationship, you would plan when to run the application again on
the primary server. While you can theoretically have both hosts function as logical partitions
on one System p frame, that would defeat the purpose of having hardware availability, so
let’s assume these partitions are on two separate frames. It’s also important to note that in the
event that the standby server is configured to back up several primary servers, the failover
node must be configured to be able to service all available workloads. Note that because the
disk is shared, HACMP doesn’t provide for disk availability. Your storage subsystem must
provide for that level of redundancy.
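
Two of the planning decisions in this paragraph lend themselves to a quick sketch: where a resource group runs once the primary returns, and whether one standby is large enough for every primary it protects. The function names, policy strings and capacity figures below are hypothetical illustrations rather than HACMP settings, and only the two cascading policies are modeled.

```python
def node_after_primary_returns(policy, current_node, primary_node):
    """Where the resource group runs once the primary is healthy again."""
    if policy == "cascading":
        return primary_node   # the group falls back to the primary
    if policy == "cascading without fallback":
        return current_node   # the group stays put until someone moves it
    raise ValueError(f"policy not modeled in this sketch: {policy}")


def standby_can_cover(standby_capacity, primary_workloads):
    """True if the standby could host every primary workload at once."""
    needed = {}
    for workload in primary_workloads:
        for resource, amount in workload.items():
            needed[resource] = needed.get(resource, 0) + amount
    return all(standby_capacity.get(resource, 0) >= amount
               for resource, amount in needed.items())


if __name__ == "__main__":
    print(node_after_primary_returns("cascading", "node_b", "node_a"))

    # Made-up sizing: one standby backing up two primary servers.
    standby = {"cpus": 8, "memory_gb": 64}
    primaries = [{"cpus": 4, "memory_gb": 32}, {"cpus": 4, "memory_gb": 48}]
    print(standby_can_cover(standby, primaries))  # memory falls short here
```

Summing the workloads reflects the worst case in which every protected primary fails at the same time, which is the case the standby has to be sized for.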

Testing your HACMP cluster is one of the most important parts of an HACMP deployment.
Before deploying HACMP in production, every possible scenario should be tested to ensure
that the cluster works the way it was designed to work. When validating that testing, it’s not
enough for your UNIX administrators to say that it works. Functional applications teams must
be part of the HACMP validation process, or else it really hasn’t been tested adequately. The
purpose of HACMP isn’t so much to ensure that filesystems or processes have started on
another box, but to ensure that every application you’ve identified continues to work in the
event of a failover, without any manual intervention.
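
One simple way to keep that validation honest is to track sign-off per scenario and per application, so nothing counts as tested until the owning application team has confirmed it. The sketch below assumes a hypothetical list of scenarios and application names; it is a bookkeeping aid, not part of HACMP or its verification tools.

```python
def missing_signoffs(scenarios, applications, signoffs):
    """Return (scenario, application) pairs no application team has validated."""
    return [(scenario, app)
            for scenario in scenarios
            for app in applications
            if (scenario, app) not in signoffs]


if __name__ == "__main__":
    # Hypothetical scenarios and applications for a two-node cluster.
    scenarios = ["primary node crash", "network adapter failure",
                 "fallback to primary"]
    applications = ["oracle", "billing"]

    # Sign-offs recorded by the application teams, not by the administrators.
    signoffs = {
        ("primary node crash", "oracle"),
        ("primary node crash", "billing"),
        ("network adapter failure", "oracle"),
    }

    for scenario, app in missing_signoffs(scenarios, applications, signoffs):
        print(f"not yet validated: {app} during {scenario}")
```

The cluster is ready for production only when this list comes back empty.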

What about HACMP V5.4?

The most important enhancements of HACMP V5.4 include non-disruptive startup and fast
failure detection. With non-disruptive startup, one doesn’t need to take down the application
when installing, upgrading or performing maintenance on HACMP. It’s difficult to configure
anything in a production environment, so this was IBM’s answer to providing flexibility
around service-level agreements. Fast failure detection lets users detect node failures much
more quickly than ever before.
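
The article doesn’t describe how HACMP detects a failed node, so the following is only a generic heartbeat-timeout sketch of the idea behind failure detection, not HACMP’s mechanism. The class name, the interval and the missed-heartbeat threshold are all assumptions chosen for illustration.

```python
class HeartbeatMonitor:
    """Declare a node failed after several consecutive missed heartbeats."""

    def __init__(self, interval_s, max_missed):
        self.interval_s = interval_s
        self.max_missed = max_missed
        self.last_seen = {}

    def detection_window_s(self):
        """Longest silence tolerated before a node is considered down."""
        return self.interval_s * self.max_missed

    def record(self, node, now_s):
        """Note that a heartbeat from this node arrived at time now_s."""
        self.last_seen[node] = now_s

    def failed_nodes(self, now_s):
        """Nodes whose last heartbeat is older than the detection window."""
        window = self.detection_window_s()
        return sorted(node for node, seen in self.last_seen.items()
                      if now_s - seen > window)


if __name__ == "__main__":
    # Assumed settings: one heartbeat per second, ten misses allowed.
    monitor = HeartbeatMonitor(interval_s=1.0, max_missed=10)
    monitor.record("node_a", now_s=0.0)
    monitor.record("node_b", now_s=8.0)
    print(monitor.failed_nodes(now_s=10.5))
```

In a scheme like this, a shorter interval or a lower threshold detects failures sooner, at the price of a greater risk of declaring a busy but healthy node dead.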

Other improvements include:

- New smart assists for DB2, Oracle and WebSphere, and a two-node assist to create a
  cluster
- WebSMIT is now easier to configure and also provides a GUI
- Improved cluster verification tools
- Reintroduction of forced stop to help avoid resource conflicts by putting resource
  groups into an unmanaged state
- HACMP/XD GLVM Multi-Link feature for improved data mirroring protection and
  performance
- Concurrent mode access for simultaneous applications execution at a local site
- Support for Linux on System p for the first time

What HACMP Can Do For You

What does all this mean for you? Clearly, IBM’s recent POWER6 innovations show the
importance that IBM places on availability. At the same time, IBM continues to
innovate and enhance HACMP, its flagship HA product. If you need help installing and
configuring your cluster, contact your IBM Business Partner or IBM, which provides its own
High Availability Services, a fee-based service offering. Your Business Partner can also
usually sell you this offering at a discounted price.

As a user of System p systems, it’s important that you understand the purpose behind the new
features and also how best to implement HA in your environment. Understand also that
HACMP is not fault tolerance. It’s a step down from that, as it will take a few moments for
the standby server to start up. HA may not be for everyone, but it’s an essential part of any
mission-critical application running on a System p platform. If you can’t live with your
systems being down, don’t leave home without it.

IBM Systems Magazine is a trademark of International Business Machines Corporation. The
editorial content of IBM Systems Magazine is placed on this website by MSP TechMedia
under license from International Business Machines Corporation.

©2010 MSP Communications, Inc. All rights reserved.
