PowerHA - 5 - PD and Daily Maintenance

PowerHA

PD and Daily Maintenance

PowerHA Implementation Expert Course

© Copyright IBM Corporation 2010


Problem Determination Overview

- Identify the problem
  - What is indicating a problem?
  - Where are the resource groups?
  - Check the AIX environment
  - Are the cluster processes running?
  - Is the cluster stable?
- Check the log files
- Identify the source of the problem
- Fix the problem
- Verify the cluster is running correctly

What Can Cause HACMP Problems?

- Common reasons why HACMP fails:
  - Poor cluster design and lack of thorough planning
  - Basic TCP/IP and LVM configuration problems
  - Bugs in customizations added to HACMP event scripts
  - Bugs in user-supplied scripts in general
  - HACMP cluster topology and resource configuration problems
  - Absence of change management discipline in a running cluster
  - Failure to use cluster-aware administration commands
  - Unsuccessful migration to a new version of HACMP
    - Carefully read the manuals and the release notes!
  - Updates to AIX / RSCT
    - Including emergency fixes, PTFs, SPs, TLs
  - Lack of adequate and thorough testing

Recommendations To Reduce Problems!

- Once the cluster has been built, it must be tested thoroughly against every failure you can imagine:
  - RG fallovers / fallbacks
  - Node / network / NIC / cluster process failures
  - Pull cables from the network and the storage unit
  - Application monitoring to detect failure and performance problems
- During cluster design and implementation, keep changes to the environment to a minimum, and make sure that any change:
  - Follows a change management process
  - Is made to the cluster through the supplied cluster-aware C-SPOC tools
  - Is ideally made first on an identical test environment

Business Impact

- Your first responsibility is to the business
  - Can the applications be "forced" to operate?
  - Is downtime acceptable?
  - Is HACMP a part of the problem?
  - Can the application be started without HACMP?
  - Can the application be started on a standby node?
  - Do not spend too much time repairing the cluster before getting the application started
- It may be necessary to continue without HACMP if the problem cannot be resolved in the time allowed

Business Impact

- At the first opportunity, investigate the problem:
  - Get the status
  - Identify the log files that relate to the failure
  - Determine the time of the initial failure
  - Locate the specific failure data
  - Determine what caused the failure
  - Fix the problem
  - How will restarting HACMP affect the applications?
  - Do you need to schedule a maintenance window to reintegrate HACMP?
  - Restart HACMP and test the cluster carefully to make sure the problem does not return

Problem Scenarios: DMS Timeout

- Dead man switch (DMS) kernel extension
  - Designed to ensure that a node which caused a fallover by being too busy does not later resume normal processing and try to access shared resources
  - Topology Services resets the DMS timer every second
  - The DMS time-out is set to the failure detection time of the slowest network
  - If the DMS is not reset and times out, the node panics
- Checking the dead man switch:
  - A TS_DMS_WARNING_ST error appears in the AIX error log when the DMS gets close to timing out
    - You could create an error notify method to warn you
  - /usr/sbin/rsct/bin/hatsdmsinfo
    - Provides statistics on the DMS: how many times it was reset, how many times it was "close" to tripping, etc.

Problem Scenarios: DMS Timeout

- Steps to avoid DMS time-out problems:
  - Isolate and fix the cause of excessive I/O or TCP/IP traffic
  - Reduce the failure detection rate (that is, lengthen the failure detection time) for the slowest network
    - The DMS time-out is based on the failure detection rate of the slowest network
    - DMS time-out = hbrate x cycle x 2
  - Increase the frequency of syncd
  - Turn on I/O pacing
  - Buy a bigger machine
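
The time-out formula above can be worked through with a small calculation. This is only a sketch: it assumes hbrate is a heartbeat interval in seconds and cycle is a count of missed heartbeats, and the network names and tuning values are hypothetical, not taken from a real cluster.

```python
def dms_timeout_seconds(hbrate: float, cycle: int) -> float:
    """Return hbrate x cycle x 2, the formula quoted on the slide."""
    if hbrate <= 0 or cycle <= 0:
        raise ValueError("hbrate and cycle must be positive")
    return hbrate * cycle * 2


def cluster_dms_timeout(networks: dict[str, tuple[float, int]]) -> float:
    """The slowest network (largest detection time) sets the DMS time-out."""
    return max(dms_timeout_seconds(h, c) for h, c in networks.values())


# Hypothetical networks: name -> (hbrate in seconds, cycle)
example = {"net_ether_01": (1.0, 10), "net_diskhb_01": (2.0, 12)}
print(cluster_dms_timeout(example))  # 48.0
```

With these made-up values the disk heartbeat network is the slowest, so it determines the result; lengthening hbrate or cycle on that network is what gives a busy node more time before the DMS trips.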

Problem Scenarios: SRC Halts A Node

- What happens when clstrmgrES exits?
  - If possible, clstrmgrES writes its exit status to /usr/es/sbin/cluster/.clstrmgr.exit
  - When a subsystem exits, SRC runs the SRC notify method (if one exists)
    - For clstrmgrES and clinfoES, the notify method is clexit.rc
  - clexit.rc checks the .clstrmgr.exit file
    - If cluster services are just being stopped: clexit.rc simply restarts clstrmgrES
    - If the exit status is abnormal or the .clstrmgr.exit file does not exist (clstrmgrES exited abnormally, so the goal is to prevent any further cluster problem):
      - Run /usr/es/sbin/cluster/etc/hacmp.term, if it is executable
      - Otherwise, halt the node
- Proving that SRC halted a node:
  - Check the AIX error log
    - Look for abnormal termination of the clstrmgrES daemon
- Steps to avoid SRC halts:
  - Don't give untrained staff access to the root password
  - Consider modifying hacmp.term
    - Reboot instead of halt
    - Stop RSCT and clean up
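
The clexit.rc decision described above can be summarized as a small decision function. This is an illustrative model only: the real clexit.rc is a shell script, and the function name, its parameters, and the returned strings are hypothetical.

```python
def clexit_action(exit_file_exists: bool, normal_stop: bool,
                  term_executable: bool) -> str:
    """Model of the decision the slide attributes to clexit.rc.

    exit_file_exists: .clstrmgr.exit was written by clstrmgrES
    normal_stop:      the recorded status says cluster services were stopped
    term_executable:  hacmp.term exists and is executable
    """
    if exit_file_exists and normal_stop:
        return "restart clstrmgrES"
    # Abnormal exit status, or no exit file at all.
    if term_executable:
        return "run hacmp.term"
    return "halt node"


print(clexit_action(True, True, False))    # restart clstrmgrES
print(clexit_action(False, False, True))   # run hacmp.term
print(clexit_action(True, False, False))   # halt node
```

The last case is why an unexpected kill of clstrmgrES takes the whole node down unless hacmp.term has been customized.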

Problem Scenarios: Partitioned Cluster

- Node isolation (partitioned cluster)
  - If all communication is lost, each side of the partition assumes the other side is down and tries to take over resources
  - This can result in both sides accessing shared storage, with potential for data corruption
- When communication is restored after a cluster partition
  - How is it detected?
    - Heartbeats are received from a node that was marked as failed
    - The HACMP ODM configuration on a joining node differs from that on nodes already active in the cluster
    - Two clusters with the same ID appear in the same logical network
  - What happens?
    - One partition is chosen to survive
    - The partition with the most nodes survives
    - If the partitions have an equal number of nodes, the partition with the lowest node number survives
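
The survival rule above can be expressed as a short selection function. This is a sketch of the rule as stated on the slide, not of the Group Services implementation; it assumes each partition is given as a set of node numbers and that "lowest node number" means the partition containing the smallest node number.

```python
def surviving_partition(partitions: list[set[int]]) -> set[int]:
    """Pick the survivor: most nodes first, then lowest node number."""
    if not partitions or any(not p for p in partitions):
        raise ValueError("need at least one non-empty partition")
    # Larger partitions sort first (negated size); ties go to the
    # partition that contains the smallest node number.
    return min(partitions, key=lambda p: (-len(p), min(p)))


print(sorted(surviving_partition([{3, 4}, {1, 2}])))  # [1, 2]
print(sorted(surviving_partition([{1}, {2, 3}])))     # [2, 3]
```

In the first call the partitions are the same size, so node 1 decides the outcome; in the second the larger partition wins even though it does not contain node 1.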

Problem Scenarios: Partitioned Cluster

- Node(s) in the other partition are sent a GS_DOM_MERGE_ER message
  - grpsvcs and clstrmgr exit
  - clexit.rc runs (the node halts by default)
- Proving that node isolation caused the problem, on the node(s) that died:
  - /tmp/clstrmgr.debug log file
  - AIX error log entry: GS_DOM_MERGE_ER
- Steps to avoid node isolation:
  - Configure and test one or more non-IP networks

Problem Scenarios: Event Script Problems

- Event fails
  - Non-recoverable
    - Causes an event_error event on all nodes
    - The node that had the failing event goes to the ST_RP_FAILED state
    - Other nodes typically go to the ST_BARRIER state
    - Event processing stops on all nodes until the user performs Recover From HACMP Script Failure
  - Recoverable
    - Some script failures do not cause an event failure in HACMP (e.g., start_server)
    - Failure to acquire resources: if HACMP cannot acquire all the resources for an RG, it tries to run the RG on another node. If that is not possible, the RG goes to the ERROR state
  - Event hangs or takes longer than expected
    - Event processing stops until the script completes or is killed
    - If an event exceeds the Time Until Warning, the config_too_long event occurs

Problem Scenarios: Event Script Problems

- Recovery
  - Locate the problem
    - cluster.log and hacmp.out are usually the most helpful
  - If it is a config_too_long:
    - A. Fix the problem
    - B. Complete any steps that did not complete in the script that failed or hung
    - C. Kill the hung script (if needed)
  - If event_error ran (an actual HACMP event failure, state ST_RP_FAILED), run Recover From HACMP Script Failure
  - Verify the cluster

PowerHA Status Command and Cluster Process Flow

PowerHA And SNMP

- The PowerHA MIB is defined in the hacmp.defs and hacmp.my files
- The clstrmgrES daemon maintains the current values of the MIB objects and provides them to snmpd
- Many programs can get PowerHA status from snmpd using the SNMP protocol, either on the same system or across the network

Useful AIX Commands

Useful HACMP Commands

SMIT Problem Determination Menu

SMIT Log Viewing and Management

SMIT View Detailed HACMP Log Files

Summary of HACMP Log Files

Log File Maintenance: clcycle

- Saves 7 archive copies of the targeted log files (logfile.1 through logfile.7)
- At boot time clcycle is run and rotates only the clstrmgr.debug file
  - Note: clstrmgr.debug is also cycled when you stop cluster services
- By default clcycle is run daily at midnight from cron
- It can also be called from the command line
- When run from the command line or from cron, clcycle rotates:
  - Files which are always rotated:
  - Files which are rotated if greater than 1 MB in size or if specified on the command line:
  - Files rotated only if no files are specified on the command line (the default cron entry) or if explicitly specified on the command line:
  - Files rotated only if specified as an argument to clcycle:
  - Never rotated:
  - Note: If desired, you can modify root's crontab file so that additional log files are rotated on a regular basis
- For example: use clcycle cluster.log to rotate hacmp.out and cluster.log daily
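
The "7 archive copies" scheme described above can be sketched as a numbered-suffix rotation. This illustrates the general technique only and is not the clcycle source: the function name is hypothetical, and the sketch simply moves the live file to .1 without recreating it.

```python
import os
import tempfile


def rotate(path: str, keep: int = 7) -> None:
    """Shift path.1..path.(keep-1) up by one, then move path to path.1."""
    oldest = f"{path}.{keep}"
    if os.path.exists(oldest):
        os.remove(oldest)  # drop the oldest copy so at most `keep` remain
    for n in range(keep - 1, 0, -1):
        src = f"{path}.{n}"
        if os.path.exists(src):
            os.replace(src, f"{path}.{n + 1}")
    if os.path.exists(path):
        os.replace(path, f"{path}.1")


# Demo in a scratch directory: nine "days" of logs leave seven archives.
with tempfile.TemporaryDirectory() as scratch:
    log = os.path.join(scratch, "hacmp.out")
    for day in range(9):
        with open(log, "w") as f:
            f.write(f"day {day}\n")
        rotate(log)
    print(sorted(os.listdir(scratch)))
```

After the loop the newest data is in hacmp.out.1 and the two oldest days have been discarded, which is why logs should be collected soon after a failure.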

More Log File Maintenance

- hacmp.out
  - Rotated nightly by clcycle (default)
- cl_event_summaries.txt
  - Event summaries are copied from hacmp.out by clcycle
  - No automatic maintenance
  - Maintain using the SMIT View/Save/Delete HACMP Event Summaries menu
- clstrmgr.debug
  - Rotated by clcycle when the node is booted
  - Rotated when cluster services are stopped on the node
- cluster.log
  - Not rotated unless specified on the command line to clcycle

More Log File Maintenance

- clverify.log and autoverify.log
  - Rotated every time clver or cl_auto_versync is run; 9 copies
- cluster.mmddyyyy
  - Created by event scripts on each day an event occurs
  - No automatic maintenance
- clcomd.log
  - Rotated by clcomd when > 1 MB; 1 copy
- clcomddiag.log
  - Rotated by clcomd when > 10 MB; 1 copy
- RSCT logs
  - Maintained by the RSCT daemons

Saving Log Files & Configuration: clsnap

- clsnap saves the HACMP log files and configuration
  - clsnap creates and compresses a PAX archive for each node
  - By default: /tmp/ibmsupt/hacmp/nodename.pax.Z
  - smitty hacmp -> Problem Determination Tools -> HACMP Log Viewing and Management -> Collect Cluster log files for Problem Reporting
  - snap -e (calls clsnap)
- Because some log files are rotated when cluster services start or stop, it is a good idea to run clsnap when you begin troubleshooting
  - This also saves a picture of the cluster configuration before you begin making changes
- clsnap runs in two passes:
  - The first pass estimates the disk space needed
  - The second pass creates the archives
  - Run pass one to see if you need more disk space
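
The "estimate first, archive second" idea can be illustrated with a generic space check. This is a sketch of the technique, not of how clsnap computes its estimate; the function names and the safety margin parameter are hypothetical.

```python
import os
import shutil
import tempfile


def estimate_bytes(paths) -> int:
    """Pass one: total size of the regular files that would be archived."""
    return sum(os.path.getsize(p) for p in paths if os.path.isfile(p))


def enough_space(paths, target_dir: str, margin: float = 1.0) -> bool:
    """Compare the estimate (times a safety margin) with free space."""
    needed = estimate_bytes(paths) * margin
    return needed <= shutil.disk_usage(target_dir).free


# Demo against a scratch file rather than real cluster logs.
with tempfile.TemporaryDirectory() as scratch:
    sample = os.path.join(scratch, "sample.log")
    with open(sample, "wb") as f:
        f.write(b"x" * 4096)
    print(estimate_bytes([sample]))  # 4096
    print(enough_space([sample], scratch))
```

Running only the estimate before collecting anything avoids filling the target file system in the middle of troubleshooting.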

Using HACMP Log Files

- For event problems
  - Start with /var/hacmp/adm/cluster.log
  - Locate the earliest error or failure indication
  - Use the time or text of the failure to search /var/hacmp/log/hacmp.out
  - Search backwards in hacmp.out to find the command that failed
- For other problems (verification, C-SPOC, DARE, etc.)
  - Check cluster.log
  - Use the associated log file(s)

Step 1: Find the Error in cluster.log

- cluster.log: HACMP Event and Daemon Log
  - High-level daemon activity
  - Start and stop information for every cluster event generated in a running cluster
  - Rotated by clcycle only if specified on the command line: cluster.log.1-7
- Look for the earliest error or failure associated with the problem
  - This usually indicates the problem source
- You will use the time or text of the earliest error as an index into hacmp.out
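
Finding the earliest failure can be sketched as a forward scan that stops at the first matching line. The sample lines below are hypothetical and only approximate a syslog-style layout; the marker strings are the event names mentioned in this course, and real cluster.log entries may look different.

```python
from typing import Iterable, Optional

# Strings that suggest a failure; adjust for the problem being chased.
FAILURE_MARKERS = ("EVENT FAILED", "event_error", "config_too_long")


def first_failure(lines: Iterable[str],
                  markers=FAILURE_MARKERS) -> Optional[str]:
    """Return the first line containing any marker, or None."""
    for line in lines:  # the log is chronological, so first match is earliest
        if any(m in line for m in markers):
            return line.rstrip("\n")
    return None


# Hypothetical cluster.log excerpt (layout assumed for illustration).
sample = [
    "Jul 12 10:01:02 node1 HACMP for AIX: EVENT START: node_up node1",
    "Jul 12 10:01:05 node1 HACMP for AIX: EVENT FAILED: 42: node_up node1 42",
    "Jul 12 10:04:05 node1 HACMP for AIX: EVENT START: config_too_long 180 node_up",
]
hit = first_failure(sample)
print(hit)
print(hit[:15])  # time stamp prefix to search for in hacmp.out
```

Stopping at the first hit matters: the later config_too_long line is a consequence of the failure, not its cause.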

cluster.log (1 of 4)

cluster.log (2 of 4)

cluster.log (3 of 4)
- Action: Restart cluster services on the node that was forced down. (The node_up script was edited to exit with an error, RC=42.)
  - The node with the failure runs event_error
  - On the node with the failure, the internal clstrmgrES state is ST_RP_FAILED
  - Three minutes later, the node runs config_too_long

cluster.log (4 of 4)
- The administrator runs Recover From HACMP Script Failure
  - clstrmgrES continues processing
  - The internal clstrmgrES state is ST_STABLE

Step 2: Get the Details from hacmp.out

- When an error is identified in cluster.log, the next step is to look in /var/hacmp/log/hacmp.out
  - This file is very long and very detailed
  - It can be overwhelming
  - You must develop the ability to read and understand this file
- hacmp.out: HACMP Event Script Log
  - Line-by-line record of every command executed by the event scripts (if Debug Level is set to high, the default)
  - Includes the values of all the arguments to each command
  - Event summaries appear at the end of each event's details to make it easier to check for errors (if Formatting Option is set to Standard, the default)
  - Rotated daily by clcycle: hacmp.out.1-7

SMIT hacmp.out Debug Level Change

SMIT hacmp.out Formatting Options Change

hacmp.out Syntax

- There are four basic types of entries:
  - EVENT START
    - Time: EVENT START: Event_name [parameters]
  - Output from event scripts
    - [RG_name]:Script_name[Line#]:command_with_args
  - EVENT COMPLETED | FAILED
    - Time: EVENT COMPLETED: Event_name [parameters] RC
  - Event Summary
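
The EVENT START and EVENT COMPLETED | FAILED entry types above can be picked out of hacmp.out with a small parser. This is a sketch under assumptions: the time stamp layout and the test lines are hypothetical, FAILED lines are assumed to follow the same layout as COMPLETED lines, and the record type and function name are made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class EventRecord(NamedTuple):
    time: str
    kind: str            # "START", "COMPLETED" or "FAILED"
    name: str
    args: list
    rc: Optional[int]    # return code; None for START lines


EVENT_RE = re.compile(
    r"^(?P<time>.+?)\s+EVENT (?P<kind>START|COMPLETED|FAILED):\s*(?P<rest>.*)$"
)


def parse_event_line(line: str) -> Optional[EventRecord]:
    """Parse one event marker line; return None for any other line."""
    m = EVENT_RE.match(line.strip())
    if m is None:
        return None  # script output or event summary text
    fields = m.group("rest").split()
    if not fields:
        return None
    kind = m.group("kind")
    rc = None
    # On COMPLETED/FAILED lines the trailing number is taken as the RC.
    if kind != "START" and len(fields) > 1 and fields[-1].lstrip("-").isdigit():
        rc = int(fields.pop())
    return EventRecord(m.group("time"), kind, fields[0], fields[1:], rc)


print(parse_event_line("Jul 12 10:01:09 EVENT COMPLETED: node_up node1 0"))
```

Pairing each START record with its COMPLETED or FAILED record by event name gives a quick outline of the file before reading the script output in between.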

hacmp.out Example

Event Summaries

- Can be viewed from hacmp.out
- Or view using SMIT
  - Problem Determination Tools -> HACMP Log Viewing and Management -> View/Save/Delete HACMP Event Summaries
- Event summaries are saved to the event summaries log by clcycle (daily by default)
  - /var/hacmp/log/cl_event_summaries.txt
  - This file must be maintained manually

Other HACMP Log Files

- /var/hacmp/adm/history/cluster.mmddyyyy: Cluster History Logs
  - Like cluster.log, but only EVENT START and EVENT COMPLETED | FAILED entries
  - Entries are made by event scripts as the event occurs
  - New file every day (if there are events)
  - No automatic maintenance mechanism
  - Recommended use: view a daily summary of events; view cluster events over the long term
- /var/hacmp/log/clstrmgr.debug: Cluster Manager Log. Contains time-stamped, formatted messages generated by the clstrmgrES daemon
  - clstrmgrES exit messages
  - rg_move enqueued events
  - Resource group available on recover from error
  - Dynamic Node Priority (DNP) data
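
Because the cluster history logs are named by date, the file for a given day can be derived directly. The sketch below assumes that mmddyyyy means a zero-padded month and day followed by a four-digit year; the function name is hypothetical.

```python
from datetime import date

HISTORY_DIR = "/var/hacmp/adm/history"


def history_log_path(day: date, directory: str = HISTORY_DIR) -> str:
    """Build the cluster.mmddyyyy path for one calendar day."""
    return f"{directory}/cluster.{day:%m%d%Y}"


print(history_log_path(date(2010, 7, 12)))
# /var/hacmp/adm/history/cluster.07122010
```

Since these files have no automatic maintenance, the same naming rule can be used to decide which dated files are old enough to archive or delete.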

Other HACMP Log Files

- /var/hacmp/log/cspoc.log: C-SPOC Log
  - Contains time-stamped, formatted messages generated by C-SPOC commands
  - The cspoc.log file resides on the node that invokes the C-SPOC command
  - Rotated by clcycle only if specified on the command line: cspoc.log.1-7
  - Recommended use: use the C-SPOC log file when tracing a C-SPOC command's execution on cluster nodes or to detect a C-SPOC failure
- /var/hacmp/log/autoverify.log: Autoverify Log
  - Contains the output of cl_auto_versync, the utility that performs verification and synchronization when you start cluster services (cl_auto_versync calls clver, which logs to clverify.log)
  - Rotated every time you start cluster services: autoverify.log.1-9
  - Recommended use: use autoverify.log when debugging a problem with verification and synchronization during cluster services start

Other HACMP Log Files

- /var/hacmp/clverify/clverify.log: Cluster Verification Log
  - Contains the verbose messages output by the cluster verification utility (clver)
  - clver runs when:
    - You perform Verify and Synchronize HACMP Configuration (standard path) or Extended Verification and Synchronization (extended path)
    - You perform HACMP Verification (PD Tools menu)
    - You start cluster services (called by cl_auto_versync)
    - The Automatic Cluster Configuration Monitoring runs
  - The messages indicate the node(s), devices, command, and so on, in which any verification error occurred
  - Rotated every time clver runs: clverify.log.1-9
  - Recommended use: use clverify.log when debugging a cluster verification problem

Other HACMP Log Files

- /var/hacmp/clcomd/clcomd.log: clcomd Log
  - Contains time-stamped, formatted messages generated by Cluster Communications daemon (clcomd) activity
  - Shows information about incoming and outgoing connections, both successful and unsuccessful
  - Rotated by clcomd when the file is > 1 MB: clcomd.log.0
  - Recommended use: use the information in this file to troubleshoot inter-node communications and to obtain information about attempted connections to the daemon
- /var/hacmp/clcomd/clcomddiag.log: clcomd Diagnostic Log
  - Contains time-stamped, formatted diagnostic messages generated by clcomd
  - Rotated by clcomd when the file is > 10 MB: clcomddiag.log.0
  - Recommended use: debugging clcomd problems; probably only useful if you understand clcomd internals

Other HACMP Log Files

- /var/hacmp/log/clutils.log: Cluster Utilities Log
  - Contains information about automatic cluster configuration monitoring: date, time, results, and which node performed the verification
  - Also contains high-level information for the file collections utility, the two-node cluster configuration assistant, the cluster test tool, and the OLPW conversion tool
  - Rotated by clcycle if > 1 MB, or if specified on the command line: clutils.log.1-7
  - Recommended use: useful for viewing high level
- /var/hacmp/log/clconfigassist.log: Two-Node Assist Log
  - Contains debugging information for the Two-Node Cluster Configuration Assistant
  - Rotated by the Assistant: clconfigassist.log.1-9
  - Recommended use: debug errors from the Two-Node Cluster Configuration Assistant

Other HACMP Log Files

- /var/hacmp/log/cl_testtool.log: Cluster Test Tool Log
  - Detailed output from the cluster test tool
  - Rotated by the cluster test tool: cl_testtool.log.1-3
  - Recommended use: view the results following automated cluster testing
- /var/adm/clavan.log: Application Availability Log
  - Contains the state transitions of applications managed by HACMP
  - For example, when each application managed by HACMP is started or stopped, and when the node on which an application is running stops
  - Each node has its own instance of the file. Each record in the clavan.log file consists of a single line. Each line contains a fixed and a variable section
  - Rotated by clcycle only if specified on the command line: clavan.log.1-7
  - Recommended use: used by the Application Availability Analysis tool to provide reports on application availability. Only useful if you have configured application monitors:
  - smitty hacmp -> System Management (C-SPOC) -> HACMP Resource Group and Application Management -> Application Availability Analysis

Automatic Cluster Configuration Monitoring

- smitty hacmp -> Problem Determination Tools -> HACMP Verification -> Automatic Cluster Configuration Monitoring

RSCT Log Files: Overview

- RSCT logs cannot be moved and reside in /var/ha/log/
- Rotated by the RSCT daemons, either based on size or when the daemon is restarted
  - Rotation can be controlled via a SMIT menu for some files
- Topology Services logs
  - /var/ha/log/topsvcs.default
  - /var/ha/log/topsvcs.dd.hhmmss.lang
  - /var/ha/log/topsvcs.dd.hhmmss
  - /var/ha/log/nim.topsvcs.IF.cluster
  - /var/ha/log/nmDiag.nim.topsvcs.IF.cluster
- Group Services logs
  - /var/ha/log/grpsvcs.default.node#_instance
  - /var/ha/log/grpsvcs_node#_instance.cluster
  - /var/ha/log/grpsvcs_node#_instance.cluster.log

RSCT Log Files: Heartbeat Activity Log

- /var/ha/log/nim.topsvcs.IF.cluster: NIM Heartbeat Activity Log
  - IF = interface name
    - Separate file for each interface (en0, en1, rhdisk1, etc.)
  - cluster = cluster name
  - Contains the output from the Network Interface Module
    - (/usr/sbin/rsct/bin/hats_nim, hats_diskhb_nim, etc.)
  - Includes records of when:
    - A connection is established or closed
    - The TS daemon has sent a command to start or stop heartbeating
    - The TS daemon has sent a command to start or stop monitoring heartbeats
    - The local adapter goes up or down
    - A message is sent or received
    - A heartbeat from the remote adapter has been missed

Problem Determination Summary

- Identify the problem
  - Gather facts so you have a clear understanding of the symptoms
- Check the log files
  - Find the time and basic details of the problem in the cluster log files
- Identify the source of the problem
  - Use the detailed logs and HACMP and AIX commands to discover the source of the problem
  - What was the problem?
  - What actions did HACMP take?
  - What is the current status?
- Fix the problem
  - Use SMIT, HACMP commands, or AIX commands to correct the problem
  - Kill the hung script (if needed)
  - Run Recover From HACMP Script Failure (if needed)
- Verify the cluster is running correctly
  - RGs on the correct node, applications, service addresses, shared storage

Thank You!
何兵 hebing@cn.ibm.com
