PowerHA - 5 - PD and Daily Maintenance
PowerHA Implementation Expert-Level Course
Business Impact
• Do not spend too much time repairing the cluster while the application is not running
  - If the problem cannot be fixed in the time allowed, it may be necessary to continue without HACMP
• Restart HACMP and test the cluster carefully to make sure the problem won't return
• The dead man's switch (DMS) time-out is set to the failure detection time of the slowest network
• If the DMS is not reset before it times out, the node panics
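As a rough illustration of the rule above, the sketch below takes a failure detection time for each network and picks the slowest one as the DMS time-out. The network names and the numbers are hypothetical and chosen only for the example; real values come from the cluster's own network settings.

```python
# Hypothetical failure detection times, in seconds, per cluster network.
# Names and values are illustrative only, not taken from a real cluster.
failure_detection_time = {
    "net_ether_01": 20,
    "net_diskhb_01": 30,
}

def dms_timeout(detection_times):
    """Return the DMS time-out: the failure detection time of the slowest network."""
    if not detection_times:
        raise ValueError("at least one network is required")
    return max(detection_times.values())

def node_panics(seconds_since_last_reset, detection_times):
    """Return True if the DMS was not reset before its time-out expired."""
    return seconds_since_last_reset >= dms_timeout(detection_times)

print(dms_timeout(failure_detection_time))
```

With these assumed values, the slowest network determines the time-out, so tuning a faster network alone would not change it.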
• Proving that node isolation caused the problem: on the node(s) that died, check:
  - The /tmp/clstrmgr.debug log file
  - The AIX error log entry GS_DOM_MERGE_ER
• Steps to avoid node isolation:
  - Configure and test one or more non-IP networks
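The error log check above can be sketched as a simple text scan. This assumes the AIX error log has already been dumped to text (for example with errpt -a) and passed in as a string. The helper name and the sample lines are hypothetical and heavily simplified; they are not real error log output.

```python
def find_domain_merge_entries(error_log_text, label="GS_DOM_MERGE_ER"):
    """Return the lines of an error log dump that mention the given label."""
    return [line for line in error_log_text.splitlines() if label in line]

# Hypothetical, simplified sample (not real error log output).
sample_dump = "\n".join([
    "LABEL: SOME_OTHER_ENTRY",
    "LABEL: GS_DOM_MERGE_ER",
    "LABEL: ANOTHER_ENTRY",
])

matches = find_domain_merge_entries(sample_dump)
print(len(matches), "matching line(s)")
```

An empty result does not by itself rule out node isolation; it only means this label was not found in the text that was scanned.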
• Event fails
  - Non-recoverable
    * Causes the event_error event on all nodes
    * The node that had the failing event goes to the ST_RP_FAILED state
    * Other nodes typically go to the ST_BARRIER state
    * Event processing stops on all nodes until the user performs Recover From HACMP Script Failure
  - Recoverable
    * Some script failures do not cause an event failure in HACMP (e.g. start_server)
    * Failure to acquire resources: if HACMP cannot acquire all the resources for a resource group (RG), it tries to run the RG on another node. If that is not possible, the RG goes to the ERROR state
  - Event hangs or takes longer than expected
    * Event processing stops until the script completes or is killed
    * If an event exceeds the Time Until Warning, the config_too_long event occurs
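The hang case above comes down to a comparison against a threshold. The sketch below is a simplified model, not HACMP's implementation: the function name, its parameters, and the 360-second value are assumptions made for the example.

```python
# Assumed Time Until Warning, in seconds (illustrative value only).
TIME_UNTIL_WARNING = 360

def check_event(elapsed_seconds, time_until_warning, script_finished):
    """Classify a running event script in this simplified model."""
    if script_finished:
        return "completed"
    if elapsed_seconds > time_until_warning:
        # The event is still running past the threshold.
        return "config_too_long"
    return "running"

print(check_event(400, TIME_UNTIL_WARNING, script_finished=False))
```

In this model config_too_long is only a warning condition: the script is still running, which matches the point above that event processing stays stopped until the script completes or is killed.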
• Recovery
  - Locate the problem
    * cluster.log and hacmp.out are usually the most helpful
  - If it is a config_too_long:
    A. Fix the problem
    B. Complete any steps that the failed or hung script did not complete
    C. Kill the hung script (if needed)
  - If event_error ran (an actual HACMP event failure, state ST_RP_FAILED), run Recover From HACMP Script Failure
  - Verify the cluster
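The recovery procedure above can be written down as a small checklist builder, which makes the branching explicit. The function name and the symptom strings are hypothetical; the steps are the ones listed above.

```python
def recovery_steps(symptom):
    """Return the ordered recovery checklist for a given symptom."""
    steps = ["Locate the problem (cluster.log and hacmp.out)"]
    if symptom == "config_too_long":
        steps += [
            "Fix the problem",
            "Complete any steps the failed or hung script did not complete",
            "Kill the hung script (if needed)",
        ]
    elif symptom == "event_error":
        steps.append("Run Recover From HACMP Script Failure")
    else:
        raise ValueError("unknown symptom: " + symptom)
    steps.append("Verify the cluster")
    return steps

for step in recovery_steps("event_error"):
    print(step)
```

Both branches start by locating the problem and end with a cluster verification; only the middle of the list differs.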
• hacmp.out
  - Rotated nightly by clcycle (default)
• cl_event_summaries.txt
  - Event summaries are copied from hacmp.out by clcycle
  - No automatic maintenance
• smitty hacmp -> Problem Determination Tools -> HACMP Log Viewing and Management -> Collect Cluster log files for Problem Reporting
• snap -e (calls clsnap)
• Look for the earliest error or failure associated with the problem
  - This usually indicates the source of the problem
• Use the time or text of the earliest error as an index into hacmp.out
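The approach above can be sketched as two small steps: find the earliest failing line in one log, then jump to the first line at or after that time in the other. The log lines below are hypothetical and simplified, and the sketch assumes each line starts with a "Mon DD HH:MM:SS" timestamp. Real cluster.log and hacmp.out entries carry more detail.

```python
from datetime import datetime

TS_FORMAT = "%b %d %H:%M:%S"  # assumed timestamp layout, e.g. "Jul 14 10:21:33"
TS_LENGTH = 15

def parse_ts(line):
    """Parse the leading timestamp of a line, or return None if there is none."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        return None

def earliest_error(lines, keywords=("FAILED", "ERROR")):
    """Return (timestamp, line) for the earliest line containing a keyword."""
    hits = []
    for line in lines:
        ts = parse_ts(line)
        if ts is not None and any(k in line for k in keywords):
            hits.append((ts, line))
    return min(hits) if hits else None

def first_index_at_or_after(lines, ts):
    """Return the index of the first line stamped at or after ts, else None."""
    for i, line in enumerate(lines):
        line_ts = parse_ts(line)
        if line_ts is not None and line_ts >= ts:
            return i
    return None

# Hypothetical, simplified log excerpts.
cluster_log = [
    "Jul 14 10:21:30 nodeA EVENT START: node_up nodeA",
    "Jul 14 10:21:33 nodeA EVENT FAILED: node_up nodeA",
    "Jul 14 10:21:34 nodeA EVENT START: event_error",
]
hacmp_out = [
    "Jul 14 10:21:30 EVENT START: node_up nodeA",
    "Jul 14 10:21:33 EVENT FAILED: node_up nodeA",
    "Jul 14 10:21:34 EVENT START: event_error",
]

ts, line = earliest_error(cluster_log)
print(line)
print(first_index_at_or_after(hacmp_out, ts))
```

Because the assumed timestamps carry no year, this sketch only compares correctly within a single calendar year.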
cluster.log (1 of 4)
cluster.log (2 of 4)
cluster.log (3 of 4)
• Action: Restart cluster services on the node that was forced down. (The node_up script was edited to exit with an error, RC=42.)
  - The node with the failure runs event_error
  - On the node with the failure, the internal clstrmgrES state is ST_RP_FAILED
cluster.log (4 of 4)
• Administrator runs Recover From HACMP Script Failure
  - clstrmgrES continues processing
  - The internal clstrmgrES state is ST_STABLE
• You must develop the ability to read and understand this file
hacmp.out Syntax
hacmp.out Example
Event Summaries
Thank You!
何兵 hebing@cn.ibm.com