Professional Documents
Culture Documents
2000 - WMQ Ha DR
2000 - WMQ Ha DR
2000 - WMQ Ha DR
Mark Taylor
marke_taylor@uk.ibm.com
IBM Hursley
Session 13929
Introduction
• You can have the best technology in the world, but you have to manage it
correctly
1
What is DR – Wikipedia Version
What is DR
• It is not about High Availability although often the two are related and share
design and implementation choices
‒ “HA is having 2, DR is having them a long way apart”
‒ More seriously, HA is about keeping things running, while DR is about recovering
when HA has failed.
2
Disaster Recovery vs High Availability
• Designs for HA typically involve a single site for each component of the
overall architecture
• Designs for DR typically involve separate sites
HIGH AVAILABILITY
3
Single Points of Failure
WebSphere MQ HA technologies
• Queue-sharing groups
• HA clusters
• Client reconnection
4
Queue Manager Clusters
• Sharing cluster queues on
multiple queue managers
prevents a queue from being a
SPOF
Queue-Sharing Groups
• On z/OS, queue managers can be
members of a queue-sharing group
• Benefits:
Shared
‒ Messages remain available even if a
queues
queue manager fails
‒ Pull workload balancing
‒ Apps can connect to the group Queue
manager
Private
queues
5
Introduction to Failover and MQ
Failover considerations
6
MULTI-INSTANCE
QUEUE MANAGERS
• Instances are the SAME queue manager – only one set of data files
‒ Queue manager data is held in networked storage
7
Multi-instance Queue Managers
1. Normal MQ MQ
execution Client Client
network
168.0.0.1 168.0.0.2
Machine A Machine B
QM1 QM1
Active can fail-over Standby
instance instance
QM1
networked storage
Owns the queue manager data
2. Disaster MQ MQ
strikes Client Client
Client network
connections
broken
IPA 168.0.0.2
Machine A Machine B
QM1 QM1
Active locks freed Standby
instance instance
QM1
networked storage
8
Multi-instance Queue Managers
3. FAILOVER MQ MQ
Client Client
Standby
Client
becomes connection
active network still broken
168.0.0.2
Machine B
QM1
Active
instance
QM1
networked storage
Owns the queue manager data
4. Recovery MQ MQ
complete Client Client
Client
network connections
reconnect
168.0.0.2
Machine B
QM1
Active
instance
QM1
networked storage
Owns the queue manager data
9
Multi-instance Queue Managers
10
Administering Multi-instance QMgrs
MQ Explorer
automatically
switches to
the active
instance
11
HA CLUSTERS
HA clusters
• HA clusters can:
‒ Coordinate multiple resources such as application server, database
‒ Consist of more than two machines
‒ Failover more than once without operator intervention
‒ Takeover IP address as part of failover
‒ Likely to be more resilient in cases of MQ and OS defects
12
HA clusters
• In HA clusters, queue manager data and logs are placed on a shared disk
‒ Disk is switched between machines during failover
MQ in an HA cluster – Active/active
Normal MQ MQ
execution Client Client
network
HA cluster
168.0.0.1 168.0.0.2
Machine A Machine B
QM1 QM2
Active Active
QM1 instance
instance
data
and logs
shared disk
QM2
data
and logs
13
MQ in an HA cluster – Active/active
FAILOVER MQ MQ
Client Client
IP address
takeover
network
HA cluster
168.0.0.1 168.0.0.2
QM1
Active
Machine A Machine
instance B
QM2
Active Queue
QM1
instance manager
data
and logs
restarted
shared disk
Shared disk
switched QM2
data
and logs
Multi-instance QM or HA cluster?
• HA cluster
Capable of handling a wider range of failures
Failover historically rather slow, but some HA clusters are improving
Capable of more flexible configurations (eg N+1)
Required MC91 SupportPac or equivalent configuration
Extra product purchase and skills required
• Storage distinction
• Multi-instance queue manager typically uses NAS
• HA clustered queue manager typically uses SAN
14
Creating QM for failover
Virtual Systems
15
Comparison of Technologies
Shared Queues,
HP NonStop Server continuous continuous
MQ
Clusters none continuous
automatic continuous
HA Clustering,
Multi-instance automatic automatic
No special
support none none
APPLICATIONS AND
AUTO-RECONNECTION
16
HA applications – MQ connectivity
‒ End abnormally
17
Automatic client reconnection
• Requires:
‒ Threaded client
‒ 7.0.1 server – including z/OS
‒ Full-duplex client communications (SHARECNV >= 1)
18
Client Configurations for Availability
• http://www.ibm.com/developerworks/websphere/library/techarticles/1303_b
roadhurst/1303_broadhurst.html
19
Disaster Recovery
/var/mqm/log/QMGR
Registry Registry
/var/mqm/qmgrs/QMGR
Obj Definiitions
Security
Cluster
QFiles etc
System QMgr
20
Backups
CP
ARCHIVED LOGS CP ACTIVE CP LOGS LOG
Tapes ... ... VSAM Linear
DS
Shared Messages
Shared Objects
Group Objects
DB2 CF
21
What makes up a Queue Manager?
• Pagesets
‒ Updates made “lazily” and brought “up to date” from logs during restart
‒ Start up with an old pageset (restored backup) is not really any different from start up after
queue manager failure
‒ Backup needs to copy page 0 of pageset first (don’t do volume backup!)
22
Remote Recovery
Topologies
• Sometimes 2 data centres are in daily use; back each other up for disasters
‒ Normal workload distributed to the 2 sites
‒ These sites are probably geographically distant
23
Queue Manager Connections
Disk replication
• Disk replication cannot be used between the active and standby instances
of a multi-instance queue manager
‒ Could be used to replicate to a DR site in addition though
24
Combining HA and DR
QM1 QM1
Replication
shared storage
Site 1 Site 2
HA Pair
Replication
QM1
HA Pair
QM2 QM2
Replication
© 2013 IBM Corporation
25
Backup Queue Manager - Objective
26
Integration with other products
• Only way for guaranteed consistency is disk replication where all logs are
in same group
‒ Otherwise transactional state might be out of sync
27
Planning for Recovery
• Write a DR plan
‒ Document everything – to tedious levels of detail
‒ Include actual commands, not just a description of the operation
Not “Stop MQ”, but “as mqm, run /usr/local/bin/stopmq.sh US.PROD.01”
• Remember that the person executing the plan in a real emergency might
be under-skilled and over-pressured
‒ Plan for no access to phones, email, online docs …
• May have different plans for different disaster scenarios © 2013 IBM Corporation
• If Hursley machine room was taken out by a plane missing its landing at
Southampton airport
‒ Could we carry on developing the WMQ product
‒ Source code libraries, build machines, test machines …
‒ Could fixes be produced
• (A common one) Someone hit emergency power-off button © 2013 IBM Corporation
28
Networking Considerations
• LOCLADDR configuration
‒ Not normally used by MQ, but firewalls, IPT and channel exits may inspect it
‒ May need modification if a machine changes address
• Didn’t have monitoring in place that picked this up until users complained
about lack of response
29
Other Resources
• And afterwards:
‒ Update your plan for the next time
30
Summary
13929 – MQ HA and DR
31