Configure GPFS for Reliability
Introduction
I receive many questions on how to configure GPFS for
reliability. Questions like "what is quorum" and "why do I need
it", or "what are failure groups" and "how do I use them". This
article is an attempt to bring all of these topics into one place.
This paper discusses the options you have when configuring
GPFS for high availability.
Quorum
The worst type of failure in a cluster is called split brain. Split
brain happens when multiple nodes in a cluster continue
operations independently, with no way to communicate with
each other. This situation cannot be allowed to happen in a
cluster file system because, without coordination, your file
system could become corrupted. Coordination between the
nodes is essential to maintain data integrity. To keep the file
system consistent, a lone node cannot be permitted to continue
writing data to the file system without coordinating with the
other nodes in the cluster. When a network failure occurs,
some node has to stop writing. Who continues and who stops is
determined in GPFS using a mechanism called quorum.
Node Quorum
So how many quorum nodes do you need? There is no exact
answer. Choose the number of quorum nodes based on your
cluster configuration and availability requirements.
Tiebreaker disks are not special NSDs; you can use any NSD as a
tiebreaker disk. You can choose one from three different file
systems, or from different storage controllers, for additional
protection.

1 Yes, I know that (5/2)+1 is 3.5, but you cannot have ½ of a quorum node;
the quorum requirement rounds down to 3, which is still more than half of
the quorum nodes.
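A minimal sketch of enabling tiebreaker disks, assuming hypothetical NSD
names; on older GPFS releases the daemon must be down on all nodes when
this setting is changed:

# Stop GPFS on all nodes (may be required before changing quorum semantics)
mmshutdown -a

# Use up to three NSDs as tiebreaker disks (placeholder NSD names)
mmchconfig tiebreakerDisks="nsd1;nsd2;nsd3"

# Restart GPFS on all nodes
mmstartup -a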
Replication
In GPFS you can replicate (mirror) a single file, a set of files, or
the entire file system, and you can change the replication status
of a file at any time using a policy or command. You can
replicate metadata (file inode information), file data, or both.
In reality, though, if you do any replication you need to
replicate metadata: without replicated metadata, if there is a
failure, you cannot mount the file system to access the
replicated data anyway.
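As a sketch of what this looks like in practice (the file system name,
stanza file, and file path are placeholders, and exact mmcrfs syntax
varies by GPFS release):

# Create a file system that keeps two copies of both data and metadata
# (-m/-r set the defaults, -M/-R set the maximums)
mmcrfs fs1 -F nsd.stanza -T /gpfs/fs1 -m 2 -M 2 -r 2 -R 2

# Change the replication of a single existing file at any time
mmchattr -m 2 -r 2 /gpfs/fs1/somefile

# After raising the defaults with mmchfs, re-replicate existing files
mmrestripefs fs1 -R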
When data and metadata are replicated across failure groups,
any single failure group can fail and the data remains online.
Typically this third failure group contains a single small NSD that
is defined with a usage type of descriptor only (descOnly). The
descOnly designation means that this disk does not contain any file
metadata or data; it is only there to hold one of the official copies
of the file system descriptor. The descOnly disk does not need
to be high performance and only needs to be 20MB or more in
size, so this is one case where a local partition on a node is
often used for this NSD. To create a descOnly NSD on a node, you can
use a partition from a local LUN and define that node as the
NSD server for that LUN so other nodes in the file system can
see it.
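A minimal sketch of such an NSD definition, using the stanza format of
newer GPFS releases; the device path, NSD name, and server name are
placeholders:

# desc.stanza: a small local partition that holds only a file system descriptor
%nsd:
  device=/dev/sdb1
  nsd=descnsd1
  servers=nodeC
  usage=descOnly
  failureGroup=3

# Create the NSD from the stanza file
mmcrnsd -F desc.stanza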
A local block device means that the path to the disk is through a
block special device; on Linux, for example, that would be a
/dev/sd* device, and on AIX a /dev/hdisk device. GPFS does not
do any further determination, so if disks at two separate sites
are connected using a long-distance SAN connection, GPFS
cannot distinguish which copy is local. So to use this option,
connect the sites using the NSD protocol over TCP/IP or
InfiniBand Verbs (Linux only).
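The read-locality behavior described here is controlled by the
readReplicaPolicy configuration parameter; a minimal sketch:

# Prefer the replica that is reachable as a local block device
mmchconfig readReplicaPolicy=local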
[Figure: Racked compute clusters with System x3650 servers at Location 2 and Location 3]
Data Replication Scenarios
Most clusters do not need the highest availability, with replicated data spread
across three sites. For most, multiple nodes and RAID-protected storage are
sufficient. This section examines various GPFS cluster configurations, looking at
what each configuration means for data reliability. It starts with the most
common cluster architectures and works toward the most highly available
configuration.
For example, if a failure takes out the node serving as the primary cluster
configuration server, you can move that role to a surviving node and bring
the cluster back up:

mmshutdown -N surviving-nodes
mmchcluster -p survivingNode
mmstartup -N surviving-nodes
[Figure: Two System x3650 servers connected over TCP/IP, each SAN-attached to EXP3512 storage, with a local descOnly drive as failure group 3]
These configurations are based on the GPFS use of quorum: node quorum (or
node quorum with tiebreaker disks) and file system descriptor quorum. In the
event of a site failure, your GPFS cluster needs to maintain both node quorum
and file system descriptor quorum.
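For example, you can confirm that node quorum is still satisfied after a
failure with mmgetstate; the -L flag adds quorum details to the output:

# Show the GPFS state of all nodes, including quorum information
mmgetstate -aL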
[Figure: A System x3650 with SAN-attached EXP3512 storage at Location 1 (failure group 1) and a System p5 with SAN-attached EXP3512 storage at Location 2 (failure group 2), connected over TCP/IP, with a third site at Location 3]
[Figure: System x3650 servers at Location 1 and Location 2, each with SAN-attached EXP3512 storage and cross-site SAN connections, plus a third site at Location 3]
Conclusion
GPFS can be configured for anything from basic availability using RAID-protected
data all the way to multi-site configurations with GPFS-replicated metadata and
data. Which configuration you choose depends on your requirements and budget.
© IBM Corporation 2012
IBM Corporation
Marketing Communications
Systems Group
Route 100
Somers, New York 10589