PowerHA

PD and Daily Maintenance
PowerHA Implementation Expert-Level Course
© Copyright IBM Corporation 2010

Problem Determination Overview
• Identify the problem
  - What is indicating a problem?
  - Where are the resource groups?
  - Check the AIX environment
  - Are the cluster processes running?
  - Is the cluster stable?
• Check the log files
• Identify the source of the problem
• Fix the problem
• Verify the cluster is running correctly
What Can Cause HACMP Problems?
• Common reasons why HACMP fails:
  - A poor cluster design and lack of thorough planning
  - Basic TCP/IP and LVM configuration problems
  - HACMP event script augmentation bugs
  - General user-supplied script bugs
  - HACMP cluster topology and resource configuration problems
  - Absence of change management discipline in a running cluster
  - Failure to use cluster-aware administration commands
  - Unsuccessful migration to a new version of HACMP
      Carefully read the manuals and the release notes!
  - Updates to AIX / RSCT
      Including emergency fixes, PTFs, SPs, TLs
  - Lack of adequate and thorough testing
Recommendations To Reduce Problems!
• Once the cluster has been built, it is imperative to test it thoroughly against every failure you can imagine:
  - RG fallovers / fallbacks
  - Node / network / NIC / cluster process failures
  - Pull cables from the network and the storage unit
  - Application monitoring to detect failure and performance problems
• During cluster design and implementation, keep changes to the environment to a minimum, and make sure any change:
  - Follows a change management process
  - Is made to the cluster through the supplied cluster-aware C-SPOC tools
  - Is ideally made first on an identical test environment
Business Impact
• Your first responsibility is to the business
  - Can the applications be "forced" to operate?
  - Is downtime acceptable?
  - Is HACMP a part of the problem?
  - Can the application be started without HACMP?
  - Can the application be started on a standby node?
  - Do not spend too much time repairing your cluster before the application is running again
• It may be necessary to continue without HACMP if the problem cannot be fixed in the time allowed
Business Impact
• At the first opportunity, investigate the problem:
  - Get the status
  - Identify the log files that relate to the failure
  - Determine the time of the initial failure
  - Locate the specific failure data
  - Determine what caused the failure
  - Fix the problem
  - How will restarting HACMP affect the applications?
  - Do you need to schedule a maintenance window to re-integrate HACMP?
  - Restart HACMP and test the cluster carefully to make sure the problem won't return
Problem Scenarios: DMS Timeout
• Dead man switch (DMS) kernel extension
  - Designed to ensure that if a node causes a fallover because it was too busy, it won't later resume normal processing and try to access shared resources
  - Topology Services resets the DMS timer every second
  - The DMS time-out is set to the failure detection time of the slowest network
  - If the DMS is not reset and times out, it causes a panic
• Checking the dead man switch:
  - A TS_DMS_WARNING_ST error appears in the errlog when the DMS gets close to timing out
      You could create an error notify method to warn you.
  - /usr/sbin/rsct/bin/hatsdmsinfo
      Provides statistics on the DMS: how many times it was reset, how many times it was "close" to tripping, etc.
Problem Scenarios: DMS Timeout
• Steps to avoid DMS time-out problems:
  - Isolate and fix the cause of excessive I/O or TCP/IP traffic
  - Reduce the failure detection rate for the slowest network (a slower rate means a longer DMS time-out)
      The DMS time-out is based on the failure detection rate of the slowest network
      DMS time-out = hbrate * cycle * 2
  - Increase the frequency of syncd
  - Turn on I/O pacing
  - Buy a bigger machine
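The time-out formula above can be worked through numerically. The following is a minimal sketch, assuming hbrate is the heartbeat interval in seconds and cycle is the number of missed heartbeats tolerated; the function name and the sample values are illustrative and do not come from a real cluster.

```python
def dms_timeout_seconds(hbrate: float, cycle: int) -> float:
    """Apply the slide's formula: DMS time-out = hbrate * cycle * 2."""
    if hbrate <= 0 or cycle <= 0:
        raise ValueError("hbrate and cycle must be positive")
    return hbrate * cycle * 2


# Hypothetical settings for the slowest network (not from a real cluster).
for hbrate, cycle in [(1, 10), (2, 12)]:
    timeout = dms_timeout_seconds(hbrate, cycle)
    print(f"hbrate={hbrate}s cycle={cycle} -> DMS time-out {timeout}s")
```

A larger hbrate or cycle gives a busy node more time to reset the DMS, at the cost of slower failure detection.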
Problem Scenarios: SRC Halts A Node
• What happens when clstrmgrES exits?
  - If possible, clstrmgrES writes its exit status to /usr/es/sbin/cluster/.clstrmgr.exit
  - When a subsystem exits, SRC runs the SRC notify method (if one exists)
      For clstrmgrES and clinfoES, the notify method is clexit.rc
  - clexit.rc checks the .clstrmgr.exit file
      If cluster services are just being stopped: clexit.rc just restarts clstrmgrES
      If the exit status is abnormal or the .clstrmgr.exit file doesn't exist (clstrmgrES exited abnormally, so we want to prevent any cluster problem):
        Run /usr/es/sbin/cluster/etc/hacmp.term, if it is executable
        Otherwise: halt the node
• Proving that SRC halted a node:
  - Check the AIX error log
      Look for abnormal termination of the clstrmgrES daemon
• Steps to avoid SRC halts:
  - Don't give untrained staff access to the root password
  - Consider modifying hacmp.term
      Reboot instead of halt
      Stop RSCT and clean up
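The clexit.rc decision described on this slide can be summarized as a small decision function. This is only a sketch of the logic as the slide states it, not the actual clexit.rc script; the function, its boolean parameters, and the Action names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    RESTART_CLSTRMGR = "restart clstrmgrES"
    RUN_HACMP_TERM = "run hacmp.term"
    HALT_NODE = "halt the node"


def clexit_decision(exit_file_exists: bool,
                    normal_stop: bool,
                    hacmp_term_executable: bool) -> Action:
    """Mirror the slide's description of what clexit.rc decides.

    exit_file_exists: .clstrmgr.exit was written by clstrmgrES.
    normal_stop: the recorded status shows cluster services were just stopped.
    hacmp_term_executable: hacmp.term exists and is executable.
    """
    if exit_file_exists and normal_stop:
        return Action.RESTART_CLSTRMGR
    # Abnormal exit status, or no exit file at all.
    if hacmp_term_executable:
        return Action.RUN_HACMP_TERM
    return Action.HALT_NODE


print(clexit_decision(True, True, False).value)
print(clexit_decision(False, False, False).value)
```

The sketch makes the default visible: unless hacmp.term is present and executable, any abnormal exit ends in a halt.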
Problem Scenarios: Partitioned Cluster
• Node isolation (partitioned cluster)
  - If all communication is lost, each side of the partition assumes the other side is down and tries to take over resources
  - This can result in both sides accessing shared storage, with the potential for data corruption
• Communication is restored after a cluster partition
  - How is it detected?
      Heartbeats are received from a node that was marked as failed
      The HACMP ODM configuration on a joining node is not the same as on the nodes already active in the cluster
      Two clusters with the same ID appear in the same logical network
  - What happens?
      One partition is chosen to survive
      The partition with the most nodes survives
      If the partitions have an equal number of nodes, the partition with the lowest node number survives
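The survivor rule above (most nodes wins, ties go to the lowest node number) can be expressed in a few lines. This is a sketch under the assumption that "lowest node number" means the partition containing the lowest-numbered node; the function name and the node numbers are illustrative.

```python
from typing import List


def surviving_partition(partitions: List[List[int]]) -> List[int]:
    """Pick the partition that survives a merge.

    Each partition is a non-empty list of node numbers. The largest
    partition wins; on a tie, the one holding the lowest node number wins.
    """
    if not partitions or any(not p for p in partitions):
        raise ValueError("need at least one non-empty partition")
    # Sort key: more nodes first, then the smaller minimum node number.
    return max(partitions, key=lambda p: (len(p), -min(p)))


print(surviving_partition([[1, 2], [3]]))
print(surviving_partition([[2, 4], [1, 3]]))
```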
Problem Scenarios: Partitioned Cluster
• Node(s) in the other partition are sent a GS_DOM_MERGE_ER message
  - grpsvcs and clstrmgr exit
  - clexit.rc runs (node halts by default)
• Proving that node isolation caused the problem, on the node(s) that died:
  - /tmp/clstrmgr.debug log file
  - AIX error log entry: GS_DOM_MERGE_ER
• Steps to avoid node isolation:
  - Configure and test one or more non-IP network(s)
Problem Scenarios: Event Script Problems
• Event fails
  - Non-recoverable
      Causes an event_error event on all nodes
      The node that had the failing event goes to the ST_RP_FAILED state
      Other nodes typically go to the ST_BARRIER state
      Event processing stops on all nodes until the user performs Recover From HACMP Script Failure
  - Recoverable
      Some script failures do not cause an event failure in HACMP (e.g., start_server)
      Failure to acquire resources: if HACMP is unable to acquire all the resources for an RG, it tries to run the RG on another node. If that is not possible, the RG goes to the ERROR state
  - Event hangs or takes longer than expected
      Event processing stops until the script completes or is killed
      If an event exceeds the Time Until Warning, the config_too_long event occurs
Problem Scenarios: Event Script Problems
• Recovery
  - Locate the problem
      cluster.log and hacmp.out are usually the most helpful
  - If it is a config_too_long:
      A. Fix the problem
      B. Complete any steps that did not complete in the script that failed or hung
      C. Kill the hung script (if needed)
  - If event_error ran (an actual HACMP event failure, ST_RP_FAILED), run Recover From HACMP Script Failure
  - Verify the cluster
PowerHA Status Command and Cluster Process Flow
PowerHA And SNMP
• The PowerHA MIB is defined in the hacmp.defs and hacmp.my files
• The clstrmgrES daemon maintains the current values of the MIB objects and provides them to snmpd
• Many programs can get PowerHA status from snmpd using the SNMP protocol, on the same system or across the network
Useful AIX Commands
Useful HACMP Commands
SMIT Problem Determination Menu
SMIT Log Viewing and Management
SMIT View Detailed HACMP Log Files
Summary of HACMP Log Files
Log File Maintenance: clcycle
• Saves 7 archive copies of targeted log files (logfile.1-logfile.7)
• At boot time, clcycle is run and rotates only the clstrmgr.debug file
  - Note: clstrmgr.debug is also cycled when you stop cluster services
• By default, clcycle is run daily at midnight from cron
• It can also be called from the command line
• When run from the command line or from cron, clcycle rotates:
  - Files which are always rotated:
  - Files which are rotated if greater than 1 MB in size or if specified on the command line:
  - Files only rotated if no files are specified on the command line (default cron entry) or if explicitly specified on the command line:
  - Files only rotated if specified as an argument to clcycle:
  - Never rotated:
  - Note: If desired, you can modify root's crontab file so that additional log files are rotated on a regular basis
      For example: clcycle cluster.log to rotate hacmp.out and cluster.log daily
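The "7 archive copies" scheme (logfile.1 through logfile.7) is a conventional numbered rotation. The sketch below shows the general technique only; it is not clcycle's implementation, and the function name and the demo file name are hypothetical.

```python
import os
import tempfile


def rotate(path: str, keep: int = 7) -> None:
    """Numbered rotation: path -> path.1 -> ... -> path.<keep>.

    The oldest copy (path.<keep>) is discarded, every other copy moves up
    by one, and a fresh empty log is created at path.
    """
    oldest = f"{path}.{keep}"
    if os.path.exists(oldest):
        os.remove(oldest)
    for n in range(keep - 1, 0, -1):
        src = f"{path}.{n}"
        if os.path.exists(src):
            os.replace(src, f"{path}.{n + 1}")
    if os.path.exists(path):
        os.replace(path, f"{path}.1")
    open(path, "w").close()


# Demo in a scratch directory: rotate more often than there are slots.
with tempfile.TemporaryDirectory() as scratch:
    log = os.path.join(scratch, "demo.log")
    for run in range(9):
        with open(log, "w") as handle:
            handle.write(f"run {run}\n")
        rotate(log)
    print(sorted(os.listdir(scratch)))
```

After more than seven rotations only the live file and seven numbered copies remain, which is why logs that matter for a problem report should be saved before cluster services are restarted repeatedly.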
More Log File Maintenance
• hacmp.out
  - Rotated nightly by clcycle (default)
• cl_event_summaries.txt
  - Event summaries are copied from hacmp.out by clcycle
  - No automatic maintenance
  - Maintain using the SMIT View/Save/Delete HACMP Event Summaries menu
• clstrmgr.debug
  - Rotated by clcycle when the node is booted
  - Rotated when cluster services are stopped on the node
• cluster.log
  - Not rotated unless specified on the command line to clcycle
More Log File Maintenance
• clverify.log and autoverify.log
  - Rotated every time clver or cl_auto_versync is run; 9 copies
• cluster.mmddyyyy
  - Created by event scripts on each day an event occurs
  - No automatic maintenance
• clcomd.log
  - Rotated by clcomd when > 1 MB; 1 copy
• clcomddiag.log
  - Rotated by clcomd when > 10 MB; 1 copy
• RSCT logs
  - Maintained by the RSCT daemons
Saving Log Files & Configuration: clsnap
• clsnap saves the HACMP log files and configuration
  - clsnap creates and compresses a PAX archive for each node
  - By default: /tmp/ibmsupt/hacmp/nodename.pax.Z
  - smitty hacmp -> Problem Determination Tools -> HACMP Log Viewing and Management -> Collect Cluster log files for Problem Reporting
  - snap -e (calls clsnap)
• Since some log files are rotated when starting/stopping cluster services, it's a good idea to run clsnap when you begin troubleshooting
  - This also saves a picture of the cluster configuration before you begin making changes
• clsnap runs in two passes:
  - The first pass estimates the disk space needed
  - The second pass creates the archives
  - Run pass one to see if you need more disk space
Using HACMP Log Files
• For event problems
  - Start with /var/hacmp/adm/cluster.log
  - Locate the earliest error or failure indication
  - Use the time or text of the failure to search /var/hacmp/log/hacmp.out
  - Search backwards in hacmp.out to find the command that failed
• For other problems (verification, C-SPOC, DARE, etc.)
  - Check cluster.log
  - Use the associated log file(s)
Step 1: Find the Error in cluster.log
• cluster.log: HACMP Event and Daemon Log
  - High-level daemon activity
  - Start and stop information for every cluster event generated in a running cluster
  - Rotated by clcycle only if specified on the command line: cluster.log.1-7
• Look for the earliest error or failure associated with the problem
  - This usually indicates the problem source
• You'll use the time or text of the earliest error as an index into hacmp.out
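Finding the earliest failure indication is a simple forward scan. The sketch below assumes that failures show up as lines containing EVENT FAILED, event_error or config_too_long; that marker list is an assumption to adjust for your own logs. The sample lines are made up for illustration and are not copied from a real cluster.log.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed failure markers; extend or replace to suit your logs.
FAILURE = re.compile(r"EVENT FAILED|event_error|config_too_long")


def earliest_failure(lines: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Return (line number, text) of the first line matching a failure marker."""
    for number, line in enumerate(lines, start=1):
        if FAILURE.search(line):
            return number, line.rstrip("\n")
    return None


# Made-up sample in a simplified layout (hypothetical node name and times).
SAMPLE = """\
Jul 12 10:01:02 nodeA EVENT START: node_up nodeA
Jul 12 10:01:05 nodeA EVENT FAILED: node_up nodeA
Jul 12 10:01:05 nodeA EVENT START: event_error nodeA
Jul 12 10:04:05 nodeA EVENT START: config_too_long node_up
"""

hit = earliest_failure(SAMPLE.splitlines())
if hit is not None:
    print(f"line {hit[0]}: {hit[1]}")
```

The time stamp on the matching line is what you would then search for in hacmp.out.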
cluster.log (1 of 4)
• Action: restart cluster services on the node that was forced down (the node_up script was edited to exit with an error, RC=42)
  - The node with the failure runs event_error
  - On the node with the failure, the internal clstrmgrES state is ST_RP_FAILED
  - Three minutes later, the node runs config_too_long
• The administrator runs Recover From HACMP Script Failure
  - clstrmgrES continues processing
  - The internal clstrmgrES state is ST_STABLE
Step 2: Get the Details from hacmp.out
• When an error is identified in cluster.log, the next step is to look in /var/hacmp/log/hacmp.out
  - This file is very long and very detailed
  - It can be overwhelming
  - You must develop the ability to read and understand this file
• hacmp.out: HACMP Event Script Log
  - Line-by-line record of every command executed by the event scripts (if Debug Level is set to high, the default)
  - Includes the values of all the arguments to each command
  - Event summaries appear at the end of each event's details to make it easier to check for errors (if Formatting Option is set to Standard, the default)
  - Rotated daily by clcycle: hacmp.out.1-7
SMIT hacmp.out Debug Level Change
SMIT hacmp.out Formatting Options Change
hacmp.out Syntax
• There are four basic types of entries:
  - EVENT START
      Time: EVENT START: Event_name [parameters]
  - Output from event scripts
      [RG_name]:Script_name[Line#]:command_with_args
  - EVENT COMPLETED | FAILED
      Time: EVENT COMPLETED: Event_name [parameters] RC
  - Event Summary
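The EVENT START and EVENT COMPLETED layouts above are regular enough to parse. The following is a sketch based only on the layouts shown on this slide; the Entry type, the function name and the sample lines are hypothetical, and the field order of EVENT FAILED lines is not documented here, so those are left unsplit beyond the event keyword.

```python
import re
from typing import List, NamedTuple, Optional

ENTRY = re.compile(
    r"^(?P<time>.*?)\s*EVENT (?P<kind>START|COMPLETED|FAILED):\s*(?P<rest>.*)$"
)


class Entry(NamedTuple):
    time: str
    kind: str
    event: str
    params: List[str]
    rc: Optional[int]


def parse_entry(line: str) -> Optional[Entry]:
    """Parse one EVENT line; return None for any other kind of line."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    tokens = match.group("rest").split()
    rc = None
    # Per the slide, only COMPLETED lines end with the return code.
    if match.group("kind") == "COMPLETED" and tokens and tokens[-1].lstrip("-").isdigit():
        rc = int(tokens.pop())
    event = tokens[0] if tokens else ""
    return Entry(match.group("time"), match.group("kind"), event, tokens[1:], rc)


# Made-up sample lines (hypothetical node name and times).
for sample in [
    "Jul 12 10:01:02 EVENT START: node_up nodeA",
    "Jul 12 10:01:09 EVENT COMPLETED: node_up nodeA 0",
]:
    print(parse_entry(sample))
```

Pairing each START with its COMPLETED entry and checking for a non-zero RC is a quick way to narrow down which event to read in detail.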
hacmp.out Example
Event Summaries
• Can be viewed from hacmp.out
• Or view using SMIT
  - Problem Determination Tools -> HACMP Log Viewing and Management -> View/Save/Delete HACMP Event Summaries
• Event summaries are saved to the event summaries log by clcycle (daily by default)
  - /var/hacmp/log/cl_event_summaries.txt
  - This file must be maintained manually
Other HACMP Log Files
• /var/hacmp/adm/history/cluster.mmddyyyy: Cluster History Logs
  - Like cluster.log, but only EVENT START and EVENT COMPLETED | FAILED entries
  - Entries are made by event scripts as the event occurs
  - New file every day (if there are events)
  - No automatic maintenance mechanism
  - Recommended use: view a daily summary of events; view cluster events over the long term
• /var/hacmp/log/clstrmgr.debug: Cluster Manager Log. Contains time-stamped, formatted messages generated by the clstrmgrES daemon
  - clstrmgrES exit messages
  - rg_move enqueued events
  - Resource group available on recover from error
  - Dynamic Node Priority (DNP) data
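Because the history logs are named cluster.mmddyyyy, the file for a given day can be derived from the date. A minimal sketch, assuming the default directory shown on the slide; the function name is hypothetical.

```python
from datetime import date


def history_log_path(day: date, directory: str = "/var/hacmp/adm/history") -> str:
    """Build the cluster history log path for one day (cluster.mmddyyyy)."""
    return f"{directory}/cluster.{day:%m%d%Y}"


print(history_log_path(date(2010, 7, 4)))
```

Note that the mmddyyyy order does not sort chronologically by file name, so sort by parsed date when listing several days.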
• /var/hacmp/log/cspoc.log: C-SPOC Log
  - Contains time-stamped, formatted messages generated by C-SPOC commands
  - The cspoc.log file resides on the node that invokes the C-SPOC command
  - Rotated by clcycle only if specified on the command line: cspoc.log.1-7
  - Recommended use: use the C-SPOC log file when tracing a C-SPOC command's execution on cluster nodes or to detect a C-SPOC failure
• /var/hacmp/log/autoverify.log: Autoverify Log
  - Contains the output of cl_auto_versync, the utility that performs verification and synchronization when you start cluster services (cl_auto_versync calls clver, which logs to clverify.log)
  - Rotated every time you start cluster services: autoverify.log.1-9
  - Recommended use: use autoverify.log when debugging a problem with verification and synchronization during cluster services start
• /var/hacmp/clverify/clverify.log: Cluster Verification Log
  - Contains the verbose messages output by the cluster verification utility (clver)
  - clver runs when:
      You perform Verify and Synchronize HACMP Configuration (standard path) or Extended Verification and Synchronization (extended path)
      You perform HACMP Verification (PD Tools menu)
      You start cluster services (called by cl_auto_versync)
      The Automated Cluster Configuration Monitoring runs
  - The messages indicate the node(s), devices, command, and so on, in which any verification error occurred
  - Rotated every time clver runs: clverify.log.1-9
  - Recommended use: use clverify.log when debugging a cluster verification problem
• /var/hacmp/clcomd/clcomd.log: clcomd Log
  - Contains time-stamped, formatted messages generated by Cluster Communications daemon (clcomd) activity
  - The log shows information about incoming and outgoing connections, both successful and unsuccessful
  - Rotated by clcomd when the file is > 1 MB: clcomd.log.0
  - Recommended use: use information in this file to troubleshoot inter-node communications, and to obtain information about attempted connections to the daemon
• /var/hacmp/clcomd/clcomddiag.log: clcomd Diagnostic Log
  - Contains time-stamped, formatted, diagnostic messages generated by clcomd
  - Rotated by clcomd when the file is > 10 MB: clcomddiag.log.0
  - Recommended use: debugging clcomd problems; probably only useful if you understand clcomd internals
• /var/hacmp/log/clutils.log: Cluster Utilities Log
  - Contains information about automatic cluster configuration monitoring: date, time, results, and which node performed the verification
  - Also contains high-level information for the file collections utility, the two-node cluster configuration assistant, the cluster test tool and the OLPW conversion tool
  - Rotated by clcycle if > 1 MB, or if specified on the command line: clutils.log.1-7
  - Recommended use: useful for viewing high level
• /var/hacmp/log/clconfigassist.log: Two-Node Assist Log
  - Contains debugging information for the Two-Node Cluster Configuration Assistant
  - Rotated by the Assistant: clconfigassist.log.1-9
  - Recommended use: debug errors from the Two-Node Cluster Configuration Assistant
• /var/hacmp/log/cl_testtool.log: Cluster Test Tool Log
  - Detailed output from the cluster test tool
  - Rotated by the cluster test tool: cl_testtool.log.1-3
  - Recommended use: view the results following automated cluster testing
• /var/adm/clavan.log: Application Availability Log
  - Contains the state transitions of applications managed by HACMP
  - For example, when each application managed by HACMP is started or stopped, and when the node on which an application is running stops
  - Each node has its own instance of the file. Each record in the clavan.log file consists of a single line. Each line contains a fixed and a variable section
  - Only rotated by clcycle if specified on the command line: clavan.log.1-7
  - Recommended use: used by the Application Availability Analysis tool to provide reports on application availability. Only useful if you have configured application monitors:
      smitty hacmp -> System Management (C-SPOC) -> HACMP Resource Group and Application Management -> Application Availability Analysis
Automatic Cluster Configuration Monitoring
• smitty hacmp -> Problem Determination Tools -> HACMP Verification -> Automatic Cluster Configuration Monitoring
RSCT Log Files: Overview
• RSCT logs cannot be moved and reside in /var/ha/log/
• Rotated by the RSCT daemons, either based on size or when the daemon is restarted
  - Rotation can be controlled via a SMIT menu for some files
• Topology Services logs
  - /var/ha/log/topsvcs.default
  - /var/ha/log/topsvcs.dd.hhmmss.lang
  - /var/ha/log/topsvcs.dd.hhmmss
  - /var/ha/log/nim.topsvcs.IF.cluster
  - /var/ha/log/nmDiag.nim.topsvcs.IF.cluster
• Group Services logs
  - /var/ha/log/grpsvcs.default.node#_instance
  - /var/ha/log/grpsvcs_node#_instance.cluster
  - /var/ha/log/grpsvcs_node#_instance.cluster.log
RSCT Log Files: Heartbeat Activity Log
• /var/ha/log/nim.topsvcs.IF.cluster: NIM Heartbeat Activity Log
  - IF = interface name
      Separate file for each interface (en0, en1, rhdisk1, etc.)
  - cluster = cluster name
  - Contains the output from the Network Interface Module (/usr/sbin/rsct/bin/hats_nim, hats_diskhb_nim, etc.)
  - Includes:
      Connection is established or closed
      TS daemon has sent a command to start or stop heartbeating
      TS daemon has sent a command to start or stop monitoring heartbeats
      Local adapter goes up or down
      Message is sent or received
      Heartbeat from the remote adapter has been missed
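Since each interface gets its own nim.topsvcs.IF.cluster file, the file name alone tells you which interface and cluster a log belongs to. The sketch below splits such a name, assuming the interface name itself contains no dot; the function, the NimLog type and the cluster name mycluster are hypothetical.

```python
from typing import NamedTuple, Optional

PREFIX = "nim.topsvcs."


class NimLog(NamedTuple):
    interface: str
    cluster: str


def parse_nim_log_name(filename: str) -> Optional[NimLog]:
    """Split 'nim.topsvcs.IF.cluster' into its interface and cluster parts."""
    if not filename.startswith(PREFIX):
        return None
    interface, dot, cluster = filename[len(PREFIX):].partition(".")
    if not dot or not interface or not cluster:
        return None
    return NimLog(interface, cluster)


# Hypothetical file names for a cluster called "mycluster".
for name in ["nim.topsvcs.en0.mycluster", "nim.topsvcs.rhdisk1.mycluster"]:
    print(parse_nim_log_name(name))
```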
Problem Determination Summary
• Identify the problem
  - Gather facts so you have a clear understanding of the symptoms
• Check the log files
  - Find the time and basic details of the problem in the cluster log files
• Identify the source of the problem
  - Use the detailed logs and the HACMP and AIX commands to discover the source of the problem
  - What was the problem?
  - What actions did HACMP take?
  - What is the current status?
• Fix the problem
  - Use SMIT, HACMP commands or AIX commands to correct the problem
  - Kill the hung script (if needed)
  - Run Recover From HACMP Script Failure (if needed)
• Verify the cluster is running correctly
  - RGs on the correct node, applications, service addresses, shared storage
Thank You!
He Bing (何兵) hebing@cn.ibm.com
