Configure GPFS for Reliability
Introduction
I receive many questions on how to configure GPFS for
reliability. Questions like "what is quorum" and "why do I need
it", or "what are failure groups" and "how do I use them". This
article is an attempt to bring all of these topics into one place.
This paper discusses the options you have when configuring
GPFS for high availability.
Quorum
The worst type of failure in a cluster is called split brain. Split
brain happens when multiple nodes in a cluster continue
operations independently, with no way to communicate with
each other. This situation cannot be allowed to happen in a
cluster file system because, without coordination, your file
system could become corrupted. Coordination between the
nodes is essential to maintain data integrity. To keep the file
system consistent, a lone node cannot be permitted to continue
writing data to the file system without coordinating with the
other nodes in the cluster. When a network failure occurs,
some node has to stop writing. Who continues and who stops is
determined in GPFS using a mechanism called quorum.
Node Quorum
So how many quorum nodes do you need? There is no exact
answer. Choose the number of quorum nodes based on your
cluster configuration and availability requirements.
Tiebreaker disks are not special NSDs; you can use any NSD as a
tiebreaker disk. You can choose one from three different file
systems, or from different storage controllers, for additional
protection.

1 Yes, I know that (5/2)+1 is 3.5, but you cannot have ½ of a quorum node;
the quorum requirement rounds down to 3, which is still more than half of
the quorum nodes.
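A minimal sketch of enabling tiebreaker disks, assuming hypothetical NSD
names; on older GPFS releases the daemon must be down on all nodes when
this setting is changed:

# Stop GPFS on all nodes (may be required before changing quorum semantics)
mmshutdown -a

# Use up to three NSDs as tiebreaker disks (placeholder NSD names)
mmchconfig tiebreakerDisks="nsd1;nsd2;nsd3"

# Restart GPFS on all nodes
mmstartup -a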
Replication
In GPFS you can replicate (mirror) a single file, a set of files, or
the entire file system, and you can change the replication status
of a file at any time using a policy or command. You can
replicate metadata (file inode information), file data, or both.
In reality, though, if you do any replication you need to
replicate metadata: without replicated metadata, if there is a
failure, you cannot mount the file system to access the
replicated data anyway.
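As a sketch of what this looks like in practice (the file system name,
stanza file, and file path are placeholders, and exact mmcrfs syntax
varies by GPFS release):

# Create a file system that keeps two copies of both data and metadata
# (-m/-r set the defaults, -M/-R set the maximums)
mmcrfs fs1 -F nsd.stanza -T /gpfs/fs1 -m 2 -M 2 -r 2 -R 2

# Change the replication of a single existing file at any time
mmchattr -m 2 -r 2 /gpfs/fs1/somefile

# After raising the defaults with mmchfs, re-replicate existing files
mmrestripefs fs1 -R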
When data and metadata are replicated across failure groups,
any single failure group can fail and the data remains online.
Typically this third failure group contains a single small NSD that
is defined with a usage type of descriptor only (descOnly). The
descOnly designation means that this disk does not contain any file
metadata or data; it is only there to hold one of the official copies
of the file system descriptor. The descOnly disk does not need
to be high performance and only needs to be 20MB or more in
size, so this is one case where a local partition on a node is
often used for this NSD. To create a descOnly NSD on a node, you can
use a partition from a local LUN and define that node as the
NSD server for that LUN so other nodes in the file system can
see it.
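A minimal sketch of such an NSD definition, using the stanza format of
newer GPFS releases; the device path, NSD name, and server name are
placeholders:

# desc.stanza: a small local partition that holds only a file system descriptor
%nsd:
  device=/dev/sdb1
  nsd=descnsd1
  servers=nodeC
  usage=descOnly
  failureGroup=3

# Create the NSD from the stanza file
mmcrnsd -F desc.stanza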
A local block device means that the path to the disk is through a
block special device; on Linux, for example, that would be a
/dev/sd* device, and on AIX a /dev/hdisk device. GPFS does not
do any further determination, so if disks at two separate sites
are connected using a long-distance SAN connection, GPFS
cannot distinguish which copy is local. So to use this option,
connect the sites using the NSD protocol over TCP/IP or
InfiniBand Verbs (Linux only).
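The read-locality behavior described here is controlled by the
readReplicaPolicy configuration parameter; a minimal sketch:

# Prefer the replica that is reachable as a local block device
mmchconfig readReplicaPolicy=local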
[Figure: Racked compute clusters with System x3650 servers at Location 2 and Location 3]
Data Replication Scenarios
Most clusters do not need the highest availability, with replicated data spread
across three sites. For most, multiple nodes and RAID-protected storage are
sufficient. This section examines various GPFS cluster configurations, looking at
what each configuration means for data reliability. It starts with the most
common cluster architectures and works toward the most highly available
configuration.
For example, if a failure takes out the node serving as the primary cluster
configuration server, you can move that role to a surviving node and bring
the cluster back up:

mmshutdown -N surviving-nodes
mmchcluster -p survivingNode
mmstartup -N surviving-nodes
[Figure: Two System x3650 servers connected over TCP/IP, each SAN-attached to EXP3512 storage, with a local descOnly drive as failure group 3]
These configurations are based on the GPFS use of quorum: node quorum (or
node quorum with tiebreaker disks) and file system descriptor quorum. In the
event of a site failure, your GPFS cluster needs to maintain both node quorum
and file system descriptor quorum.
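For example, you can confirm that node quorum is still satisfied after a
failure with mmgetstate; the -L flag adds quorum details to the output:

# Show the GPFS state of all nodes, including quorum information
mmgetstate -aL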
[Figure: A System x3650 with SAN-attached EXP3512 storage at Location 1 (failure group 1) and a System p5 with SAN-attached EXP3512 storage at Location 2 (failure group 2), connected over TCP/IP, with a third site at Location 3]
[Figure: System x3650 servers at Location 1 and Location 2, each with SAN-attached EXP3512 storage and cross-site SAN connections, plus a third site at Location 3]
Conclusion
GPFS can be configured for anything from basic availability using RAID-protected
data all the way to multi-site configurations with GPFS-replicated metadata and
data. Which configuration you choose depends on your requirements and budget.
© IBM Corporation 2012
IBM Corporation
Marketing Communications
Systems Group
Route 100
Somers, New York 10589