PDF pt203 Sos Nutanix Troubleshooting


Agenda

I. MANAGING NUTANIX ENVIRONMENTS
• Cluster Monitoring
• NCC overview
• Prism Analysis (and Prism Central)

II. TROUBLESHOOTING NUTANIX ENVIRONMENTS
• General Troubleshooting
• Troubleshooting Scenarios
• Engaging support best practices
• Additional Resources

III. Q/A

CONFERENCE
Monitoring

Pulse

Email

SNMP

Syslog

Prism Alerts

Prism Alerts and Pulse

INSIGHTS

Pulse
• Hourly Cluster Reports
• Deep Analytics and Inventory
• Automatic Case Generation
• Cluster Phone Home

Prism Alerts
• Health Alerts

Auto-case Generation
Example:

Description
Block Serial Number:
alert_time: Tue Mar 22 2016 18:54:51 GMT-0700 (PDT)
alert_type: PowerSupplyDown
alert_msg: A1046: Bottom power supply is down on block
cluster_id:
alert_body: No Alert Body Available

New Alerts Appended

Block Serial Number:
alert_time: Tue Mar 22 2016 21:46:25 GMT-0700 (PDT)
alert_type: PowerSupplyDown
alert_msg: A1046: Top power supply is down on block
cluster_id:
alert_body: No Alert Body Available

Resolution: Scheduled Maintenance. As advised by customer
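
The alert fields in an auto-generated case description are plain "name: value" lines, so they are easy to pull apart in a script. The following is a minimal Python sketch, assuming underscore field names such as alert_type and alert_msg. The function name parse_alert_block is hypothetical and is not part of any Nutanix tool.

```python
def parse_alert_block(text):
    """Parse 'name: value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons themselves.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        fields[key.strip()] = value.strip()
    return fields


# Sample text copied from the example alert; field names are assumed to use underscores.
sample = """\
alert_time: Tue Mar 22 2016 18:54:51 GMT-0700 (PDT)
alert_type: PowerSupplyDown
alert_msg: A1046: Bottom power supply is down on block
"""

alert = parse_alert_block(sample)
print(alert["alert_type"])
```

Splitting on the first colon only keeps timestamps and messages such as "A1046: ..." intact in the value.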
Auto-case Generation

THESE ALERTS WILL AUTO GENERATE SUPPORT CASES:


• Stargate process is down for more than 3 hours
(StargateTemporarilyDown)
• Curator scan fails (CuratorScanFailure)
• Running out of space on the cluster
• Running out of space on CVMs
• Hardware Clock Failure (HardwareClockFailure)
• Faulty RAM module (RAMFault)
• Power Supply failure (PowerSupplyDown)

For up-to-date information, see KB 1959 on the portal:
http://portal.nutanix.com/kb/1959
For customers running on our partners' hardware platforms, we will
generate software-based alerts, which trigger automatic support cases.
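
As a quick way to triage an alert feed, the named alert types on this slide can be kept in a lookup set. This is only a sketch: the set below holds just the five alert type names the slide spells out (the two out-of-space conditions are listed without a type name, so they are omitted), the helper name opens_support_case is hypothetical, and KB 1959 remains the authoritative list.

```python
# Alert types the slide names as auto-generating support cases.
# Not exhaustive: the out-of-space conditions have no type name on the slide.
AUTO_CASE_ALERT_TYPES = {
    "StargateTemporarilyDown",
    "CuratorScanFailure",
    "HardwareClockFailure",
    "RAMFault",
    "PowerSupplyDown",
}


def opens_support_case(alert_type):
    """Return True if alert_type is one of the named auto-case alert types."""
    return alert_type in AUTO_CASE_ALERT_TYPES


print(opens_support_case("PowerSupplyDown"))
```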
Working with Prism Alerts

Working with Prism Central Alerts Dashboard

NCC Health Checks
CLI (ncc health_checks run_all) / PRISM (AOS 5.X)

[Screenshot: Summary of Cluster Checks, with Passed and Total counts and a table of check names and affected CVMs]
• NCC is a framework of scripts to automatically diagnose cluster health
• Default health checks are non-disruptive
• KB article for each NCC check
• Helps get a baseline
• NCC can be upgraded with no impact to cluster operation
• Troubleshooting with relevant information (including KB)

• PASS: The tested aspect of the cluster is healthy and no further action is required

cannot be evaluated as PASS/FAIL


Entity & Metric Charts

Troubleshooting Nutanix Environments: A Framework

• Problem Isolation

• Fixes and Mitigations

• Root Cause Analysis

• Product Improvement

Troubleshooting by Layers
APPLICATION
• SQL, VDI, Oracle RAC, etc.
CVM
• Stargate, Curator, Cassandra, etc.
HYPERVISOR
• AHV, ESXi, Hyper-V, XenServer
HARDWARE
• NVMe, SSD, HDD, Memory, NIC, Processor, etc.
NETWORK
• OVS, vSwitch, Physical Switch, etc.

Troubleshooting: Problem Isolation
• Rapidly reduce the failure domain scope to achieve faster resolution.
• Any recent changes in the environment?

IMPACT
• Is storage available?
• Are there performance issues?
• Can you reach Prism?

USE BUILT-IN REPORTING

• Prism Alerts
• Cluster Health
• NCC
• Cluster logs
• User Reports
Troubleshooting: Problem Isolation - Cluster States

Helpful additional commands

• cluster status | grep -v UP (shows a condensed version)
• genesis status (shows only local services/processes)
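
Filtering out every line that contains UP leaves only the lines that need attention. The same inverted match can be sketched in Python for use in a monitoring script. The function name drop_lines_containing is hypothetical, and the sample input is made up for illustration; it is not the real cluster status output format.

```python
def drop_lines_containing(text, token="UP"):
    """Return the lines of text that do not contain token (an inverted match)."""
    return [line for line in text.splitlines() if token not in line]


# Made-up input for illustration only; NOT real `cluster status` output.
sample = "service_a UP\nservice_b DOWN\nservice_c UP"
print(drop_lines_containing(sample))
```

As with the shell filter, an empty result means every line reported UP.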
Troubleshooting: Problem Isolation - allssh, hostssh, NCC & Logging
• allssh

• NCC

• /home/nutanix/data/logs and sysstats


• INFO, WARN, ERROR, FATAL
• allssh "ls -ltr data/logs/*.FATAL"
• If FATALs are actively occurring and you're experiencing issues, they may be related.
• hostssh "vmware -vl" instead of allssh 'ssh -l root 192.168.5.1 "vmware -vl"'
• If you’re seeing an error, check the Nutanix Knowledge Base!
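
The FATAL listing above sorts files by modification time so the most recent failures appear last. A minimal single-node Python sketch of the same idea follows, assuming it runs locally on a CVM with the log directory named on this slide. The function name fatal_logs_oldest_first is hypothetical, and the script does not fan out across nodes the way allssh does.

```python
import glob
import os


def fatal_logs_oldest_first(log_dir):
    """Return *.FATAL paths in log_dir sorted by modification time, oldest first."""
    paths = glob.glob(os.path.join(log_dir, "*.FATAL"))
    return sorted(paths, key=os.path.getmtime)


if __name__ == "__main__":
    # Log directory taken from the slide; prints nothing if no FATAL files exist.
    for path in fatal_logs_oldest_first("/home/nutanix/data/logs"):
        print(path)
```

Comparing the newest file's timestamp with the time the problem started is a quick way to judge whether a FATAL is related.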

Problem Isolation - Data Resiliency States

OK

Rebuild capacity available

• ncli cluster get-domain-fault-tolerance-status


Typical Troubleshooting Scenarios
UPGRADE IS NOT PROGRESSING
• Logging: genesis.out, host_upgrade.out, firmware_upgrade.out
• upgrade_status
• host_upgrade_status
• firmware_upgrade_status

STORAGE UNAVAILABLE

• Do all CVMs have connectivity to each other and to the hypervisor?
• Recent Stargate FATALs?
• Cassandra status?
REPLICATION, SNAPSHOTTING, AND METRO RELATED ISSUES
• Logging: Cerebro logs
NCC / HEALTH CHECKS FAILING
• Running NCC should indicate the nature of the issue and give a KB describing
how to resolve the issue.
Root Cause Analysis - Log Collection

Logs will be collected for all the nodes and components. Once the
task completes the bundle will be available for download.

[Screenshot: Run Checks and Log Collector actions, with a "Collect Logs" dialog and a cluster job status of Succeeded]

Best Practices for Engaging Support
• Update your break/fix contact via My Nutanix Portal
• Upgrade to the latest NCC and start a health check
• Clear problem description
• What steps have you already taken?
• Keep components on the recommended version levels
• Press the Escalate Button in portal for immediate attention
• Provide feedback after case closure. Surveys matter!

Additional Resources
The Nutanix Bible - Architecture details
portal.nutanix.com - Nutanix Support Portal, KBs, Documentation, Software, etc.
portal.nutanix.com/ h/4530 — Additional troubleshooting details for Acropolis File Services

IF YOU LIKED THIS SESSION, YOU MAY ALSO LIKE:

• Nutanix Architecture Deep Dive and the Deep Dive Super Session
• Getting the Network Right (The First Time)
• Fail Fast and Never Again
• AHV — Virtualization You Always Wanted
