5 ACI Monitoring and Tshoot PDF



Operations Planning & Best Practices: Cisco ACI

Ruben Del Monte
Cisco Customer Experience
March 2022
What you’ll learn today to help you on your Cisco ACI Day 2 operations journey
• Monitoring
• Troubleshooting
• Infrastructure and Services

How can you get more value from Cisco ACI Day 2

© 2020 Cisco and/or its affiliates. All rights reserved. Cisco Confidential
© 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public ATX: Prepare to Implement Cisco ACI
Monitor
1 10-minute daily routine
2 Navigation shortcuts
3 Health Scores
4 Faults, audits, and events

First 10 Minutes – Daily Routine
• Alert List
• Health Scores
• System
• Node
• APIC Controller

• Faults

Alert List
Shows critical warnings and
error information to inform
user to take action

Alert List (Continued)

• Alert to detect if OSPF connectivity is down (MPoD configuration)
• Alert to detect process crash and acknowledge old crashes
Check Health Score

Looks like we had an issue!

Health Score
Health Scores are based on Faults and events

Check APIC Cluster Status

APIC Cluster Status

“Fully Fit” – All APICs are in sync

Check Faults

Faults indicate a misconfiguration or another issue on the ACI Fabric
※ This is a lab setup. In production, try to clear each Fault as soon as it is raised.

First 10 Minutes – Daily Routine (contd)
• Fabric inventory
• Topology Summary
• Physical switches
• Global EndPoints
• Interfaces’ status
• Policies-to-interfaces association
• Duplicate IPs
• Disabled interfaces
• DOM values

Topology Summary: Physical Switches

Global Endpoints

EPG Level

Interface and Policies

Duplicate IP/Disabled Interfaces

Duplicate IPs

Disabled interfaces on fabric

Understand DOM values (inventory > leaf > physical interfaces)
• dBm – logarithmic scale
• A healthy value sits between the low and high warning thresholds

• Optics used must support DOM
• A Fabric Node Controls policy with DOM enabled must be created first and associated with switch profiles
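Because dBm is logarithmic, a small change in dBm is a large change in optical power. The sketch below converts dBm to milliwatts and checks a reading against a warning window. The threshold and reading values are hypothetical, not taken from any specific optic. Real thresholds come from the transceiver itself.

```python
import math


def dbm_to_mw(dbm: float) -> float:
    """Convert optical power from dBm to milliwatts: P(mW) = 10 ** (dBm / 10)."""
    return 10 ** (dbm / 10)


def mw_to_dbm(mw: float) -> float:
    """Convert optical power from milliwatts back to dBm."""
    return 10 * math.log10(mw)


def within_warning_range(value_dbm: float, low_warn_dbm: float, high_warn_dbm: float) -> bool:
    """True when the reading sits between the low and high warning thresholds."""
    return low_warn_dbm <= value_dbm <= high_warn_dbm


# Hypothetical Rx reading and thresholds, for illustration only.
rx_dbm = -2.5
print(within_warning_range(rx_dbm, low_warn_dbm=-9.9, high_warn_dbm=0.5))
print(dbm_to_mw(0.0))   # 0 dBm is exactly 1 mW
print(dbm_to_mw(-3.0))  # roughly half a milliwatt
```

A drop of 3 dB halves the received power, which is why a reading that drifts toward the low warning threshold deserves attention before it crosses it.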
First 10 Minutes – Daily Routine (contd)
• Tenant statistics
• Flows stats
• Drops (operational tab)

Operations: Packets

Health Scores
1 Health score overview
2 Health score impact
3 Drill downs
4 Evaluation policy

Health Score overview
• Health Score provides a quick overview of the health of the system/module
• It is based on the Faults generated in the Fabric
• Range: 0 to 100 (100 is perfect Health Score)
• Each Fault reduces the Health Score based on the severity of the Fault
• Health Score is propagated to container and related MOs
• Health Score policies can control the penalty values, propagation, healthRecords
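Health Scores can also be read through the REST API. The sketch below assumes a class query on fabricHealthTotal (for example GET /api/class/fabricHealthTotal.json) and parses a trimmed, illustrative response. The sample payload, the threshold of 95, and the helper names are assumptions made for this example.

```python
def health_scores(payload: dict) -> dict:
    """Map each fabricHealthTotal DN to its current Health Score (0 to 100)."""
    scores = {}
    for item in payload.get("imdata", []):
        attrs = item.get("fabricHealthTotal", {}).get("attributes")
        if attrs:
            scores[attrs["dn"]] = int(attrs["cur"])
    return scores


def below_threshold(scores: dict, threshold: int) -> list:
    """Return the DNs whose Health Score is below the given threshold."""
    return sorted(dn for dn, cur in scores.items() if cur < threshold)


# Illustrative response body, trimmed to the attributes used here.
SAMPLE = {
    "imdata": [
        {"fabricHealthTotal": {"attributes": {"dn": "topology/health", "cur": "97"}}},
        {"fabricHealthTotal": {"attributes": {"dn": "topology/pod-1/health", "cur": "92"}}},
    ]
}

scores = health_scores(SAMPLE)
print(scores)
print(below_threshold(scores, 95))
```

A check like this fits the daily routine: anything returned by below_threshold is a candidate for a drill down into the Faults behind it.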

Health Score Views
• System — Aggregation of system-wide health, including Pod Health Scores, Tenant Health
Scores, system Fault counts by domain and type, and the APIC Cluster health state
• Pod — Aggregation of Health Scores for a Pod (a group of Spine and Leaf switches), and Pod-
wide Fault counts by domain and type
• Tenant — Aggregation of Health Scores for a Tenant, including performance data for objects
such as applications and EPGs that are specific to a Tenant, and Tenant-wide Fault counts by
domain and type
• Managed Object — Health Score policies for managed objects (MOs), which includes their
dependent and related MOs. These policies can be customized by an administrator

Health Score: Impact Example
• In this example, a hardware Fault impacts the
Health Score of an application component
• The Health Score is propagated to the
following MOs:
• Parent MO
• MOs that have a relation pointing to the
current MO

Health Score

Drill Down

Health Score Evaluation Policy

• Currently, the policy can modify the Health Score penalty per Fault severity level, or ignore acknowledged Faults

1 Statistics overview

2 GUI counters
3 Capacity dashboard

• Helps quantify the data with respect to application traffic
• Statistics contain counters
• Sampled into various granularities (5min, 15min, 1hr, etc.)
• History retained for each granularity
• Statistics are related to observable overlay objects
• Tenants/VRF/BD/EPG

Counter values in GUI

Stats tab to see stats counter values

Traffic is aggregated by bytes

Capacity Dashboard – Fabric Capacity

Capacity Dashboard – Leaf Capacity

Faults/Events/Audit
1 Faults
2 Events
3 Audit logs

• Faults, events and audit logs are essential tools for monitoring the administrative
and operational state of an ACI Fabric as well as troubleshooting current and past
• They are the first thing to check when something is not behaving as expected

• When a Fault occurs in an MO, a Fault instance MO (fault:Inst) is created under that MO
• It can be queried by DN and class

• For every Fault, a Fault record object (fault:Record) is created in the Fault log
• Severity:
• Critical, Major, Minor, Warning, Info, Cleared

• Types:
• Generic, Equipment, Configuration, Connectivity,
Environmental, Management, Network
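As a rough illustration of querying Faults by class and by DN, the sketch below builds REST URLs for the faultInst class filtered by severity, and for the Faults under a given MO. The host name and tenant DN are placeholders, and the helper names are made up for this example.

```python
from urllib.parse import quote

SEVERITIES = ("critical", "major", "minor", "warning", "info", "cleared")


def faults_by_severity_url(apic: str, severity: str) -> str:
    """Class-level query: every faultInst with the given severity."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    flt = f'eq(faultInst.severity,"{severity}")'
    # Keep the filter readable; only the double quotes get percent-encoded.
    return (
        f"https://{apic}/api/class/faultInst.json"
        f"?query-target-filter={quote(flt, safe='(),.')}"
    )


def faults_under_mo_url(apic: str, dn: str) -> str:
    """DN-level query: the MO plus the Faults raised in its subtree."""
    return f"https://{apic}/api/mo/{dn}.json?rsp-subtree-include=faults"


# Placeholder host and DN, for illustration only.
print(faults_by_severity_url("apic.example.com", "critical"))
print(faults_under_mo_url("apic.example.com", "uni/tn-common"))
```

Both URLs need an authenticated session before the APIC will answer them.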

Fault Types
Type            Description
Generic         The system has detected a generic issue
Equipment       The system has detected that a physical component is inoperable or has another functional issue
Configuration   The system is unable to successfully configure a component
Connectivity    The system has detected a connectivity issue, such as an unreachable adapter
Environmental   The system has detected a power issue, thermal issue, voltage issue, or a loss of CMOS settings
Management      The system has detected a serious management issue, such as one of the following:
                • Critical services could not be started
                • Components in the instance include incompatible firmware versions
Network         The system has detected a network issue, such as a link down
Operational     The system has detected an operational issue, such as a log capacity limit or a failed component discovery

Fault Triggers
• Four types of Fault triggers:
1. Specific conditions described in the model by Fault rules
2. Counters crossing thresholds specified in user-programmable policies
3. Task or FSM failures
4. Object resolution failures

• Faults are raised and managed on the node (switch or controller) where the condition is detected

• Faults are raised and cleared automatically by the system

• User cannot define new Faults

Fault Lifecycle and Acknowledgement
• Faults can be “acknowledged”, meaning “mark as
• Acknowledging a Fault in “retaining” state causes
the Fault to be deleted immediately (without
waiting for expiration or retention timer)
• Faults in Acknowledged state will not affect the Health Score

Fault Policies

Sample fault

• Documentation
• https://<APIC IP>/doc/html/
• Or go directly to a Fault code:
• https://<APIC IP>/doc/html/FAULT-<Fault Code>.html

• Cisco APIC Faults, Events, and System Messages Management Guide

• https://www.cisco.com/c/en/us/td/docs/switches/datacenter/aci/apic/sw/1-
• ACI System Messages:
• https://www.cisco.com/c/en/us/td/docs/switches/datacenter/aci/apic/sw/2-

• An event is a specific condition that occurs at a certain point in time (for example “link went
from down to up”)
• As they are part of the normal system workflow, they do not necessarily require user
• Useful for monitoring and debugging issues
• Similar to an entry in a log file: once created, they are never modified
• Deleted once the maximum number specified in retention is reached
• Events are triggered by “event rules”

• Configuration or state change creates an event

• Logs:
• Audit log: user-initiated actions
• Health Score log: changes in Health Scores
• Event log: other system-generated events

• Viewing
• Via GUI (“History” tab in-context)
• Via CLI, API, Syslog, SNMP, Cisco Call
Home, subscriptions

Audit Logs
• A mechanism to track user-initiated configuration changes
• When a user creates/modifies/deletes an MO, we create an “audit record” containing affected MO DN,
user name, timestamp and change details
• System also creates logs for log-in/log-out to controllers and nodes
• Similar to an entry in a log file: once created, modification log records are never modified
• Configuration change logs are MOs of class aaaModLR
• Login/logout logs are MOs of class aaaSessionLR
• Accounting logs get deleted only when a maximum number specified in a retention policy is hit
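Audit records can be pulled over the REST API as well. The sketch below builds a class query on aaaModLR filtered by user and creation date, plus a plain class query on aaaSessionLR. The host, user name, and date are placeholders, and the helper names are made up for this example.

```python
from urllib.parse import quote


def audit_log_url(apic: str, user: str, created_after: str) -> str:
    """Configuration changes made by one user after a given date."""
    flt = (
        f'and(eq(aaaModLR.user,"{user}"),'
        f'gt(aaaModLR.created,"{created_after}"))'
    )
    return (
        f"https://{apic}/api/class/aaaModLR.json"
        f"?query-target-filter={quote(flt, safe='(),.')}"
    )


def session_log_url(apic: str) -> str:
    """Login/logout records for controllers and nodes."""
    return f"https://{apic}/api/class/aaaSessionLR.json"


# Placeholder values, for illustration only.
print(audit_log_url("apic.example.com", "admin", "2020-01-01"))
print(session_log_url("apic.example.com"))
```

Each returned aaaModLR record carries the affected MO DN, user name, timestamp, and change details described above.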

Audit Logs
All fields are filterable and searchable

Audit Logs Search - Date

Audit Logs Search - User

Audit Logs Search - Action

Faults with Event Correlation

Show relevant
events before


2 Endpoint Tracker (“operations” tab)
3 Troubleshooting Wizard

4 CLI commands

ACI SPAN Feature Overview

              Tenant SPAN   Fabric SPAN                    Access SPAN
Source        EPG           Fabric Links (leaf or spine)   Leaf downlinks
Destination   ERSPAN        ERSPAN                         ERSPAN, Leaf downlink port
Filter        None          Bridge Domain, VRF             EPG, Routed Outside (L3out)


• ACI allows for SPAN of an EPG
• The ERSPAN Destination must be an IP EP learnt in ACI
• The EP can run Wireshark or Tshark

SPAN Source   SPAN Destination
EPG           ERSPAN
Port          ERSPAN/Local Port

Leaf101# show monitor session all
session 1
description : Span session 1
type : erspan
version : 2
oper version : 1
state : up (active)
erspan-id : 1
granularity :
vrf-name : CiscoLive:VRF1
acl-name :
ip-ttl : 64
ip-dscp : ip-dscp not
destination-ip
mode : access
source VLANs :
rx : 100
tx : 100
both : 100
filter VLANs : filter not

Access SPAN Overview
• Access SPAN in ACI is used to configure local SPAN on leaf switches
• Access SPAN is the only SPAN option that supports SPAN destination to physical port. Access
SPAN also supports ERSPAN destination
• SPAN can be ingress, egress, or both directions
• SPAN sources can be physical port, port-channels, VPC port-channels, or VPC component
• Access SPAN currently supports EPG filters. An EPG filter is translated to a VLAN filter when
the SPAN session is programmed on the leaf

SPAN Enhancements
• ACI 4.1 introduced 3 key functionalities for SPAN

• SPAN on drop
• SPAN based on filter (5 tuple)
• SPAN with destination as Port-channel

Traffic Map (“operations” tab > “visualization”)

• Find out quickly whether your network is dropping traffic and
• Find out quickly whether you have any hot spots
• Find out latency between


EP Tracker

“We had a
problem at

events are logged
for each EP

Was IP Moving?
Troubleshooting Wizard - Faults

Shows Faults
in the Path

Builds Topology of Flow

Troubleshooting Wizard – Drop Stats

Shows Drops on every hop. Green arrows indicate no Drops
NOTE: Some Drops are expected.
Look for Drops like “Buffer” and “Error”

Recommended Content! – Understanding Drop Faults in ACI


Troubleshooting Wizard - Contracts

Shows Contracts for Flows

Implicit Deny Allow SSH

Troubleshooting Wizard – Atomic Counters

No Drops!

Troubleshooting Wizard – SPAN
Ability to SPAN to APIC or other devices
attached to the Fabric

User can select which ports to SPAN

• iPing works by sending an ICMP frame to the destination endpoint, from the source leaf on behalf of the
source endpoint
• The reply is relayed back to the source leaf over the infra VRF to ensure the source leaf recognizes the
response without disrupting existing traffic
• Setting the source IP address explicitly is recommended when troubleshooting

Leaf101 # show vrf

VRF-Name VRF-ID State Reason
black-hole 3 Up --
management 2 Up --
overlay-1 4 Up --
rcdn:vrf 5 Up --

leaf101 # iping -h
iping: option requires an argument -- 'h'

usage: iping [-V vrf] [-c count] [-i wait] [-p pattern] [-s packetsize] [-t timeout] [-S source
ip/interface] host
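To keep iping invocations consistent across a team, a small helper can assemble the command line from the flags shown in the usage text above. This is only a sketch: the IP addresses are hypothetical, and the helper builds a string to paste or send to a leaf rather than running anything itself.

```python
import shlex
from typing import Optional


def build_iping(host: str, vrf: str, source: Optional[str] = None,
                count: Optional[int] = None, timeout: Optional[int] = None) -> str:
    """Assemble an iping command using only flags listed in the usage output."""
    args = ["iping", "-V", vrf]
    if count is not None:
        args += ["-c", str(count)]
    if timeout is not None:
        args += ["-t", str(timeout)]
    if source is not None:
        # Explicit source IP/interface, as recommended for troubleshooting.
        args += ["-S", source]
    args.append(host)
    return shlex.join(args)


# VRF name taken from the "show vrf" output above; addresses are made up.
print(build_iping("10.0.0.20", "rcdn:vrf", source="10.0.0.1", count=3))
```

The VRF name must match the tenant:VRF form shown by show vrf on the leaf.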

CLI Commands
show version
acidiag avread (show controller)
acidiag fnvread (show switch)
moquery (CLI base MO browser)
show audit
show event
show faults

show all tenants and children objects
moquery -c fvTenant -x 'query-target=subtree'

find EPG with pcTag between 1 and 16389

moquery -c fvAEPg -f 'fv.AEPg.pcTag~"(1,16389)"'

find active faults

moquery -c faultInst | egrep -e "^descr" | sort | uniq -c

find out changes made by user on date

moquery -c aaaModLR -f 'aaa.ModLR.created>"2020-01-01" and
switch CLI commands
fabric <nodeID> <command> - run cmd from APIC

show lldp neighbor

show interface
show port-channel summary
show vpc extended
show vlan extended
show endpoint (apic and switch)
show zoning-rule
CURL commands
• use Postman to generate code

OUT=$(curl -s -X POST -k https://$APIC/api/aaaLogin.json \
  -d '{ "aaaUser" : { "attributes" : { "name" : "admin", "pwd" : "password" } } }' \
  -c cookie.txt)
curl -b cookie.txt -X GET -k | jq
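The same login can be prepared from Python. The sketch below only builds the aaaLogin request and extracts the token from a response body, so it can be paired with whichever HTTP client is in use. The host, credentials, and sample response are placeholders. On later requests the token is normally sent back as the APIC-cookie cookie.

```python
import json


def login_request(apic: str, user: str, password: str):
    """Return the URL and JSON body for an aaaLogin POST."""
    url = f"https://{apic}/api/aaaLogin.json"
    body = json.dumps({"aaaUser": {"attributes": {"name": user, "pwd": password}}})
    return url, body


def extract_token(response_text: str) -> str:
    """Pull the session token out of an aaaLogin response body."""
    data = json.loads(response_text)
    return data["imdata"][0]["aaaLogin"]["attributes"]["token"]


# Illustrative response, trimmed to the one attribute used here.
SAMPLE_RESPONSE = '{"imdata": [{"aaaLogin": {"attributes": {"token": "abc123"}}}]}'

url, body = login_request("apic.example.com", "admin", "password")
print(url)
print(body)
print(extract_token(SAMPLE_RESPONSE))
```

Keeping request building separate from the HTTP call makes the payload easy to check before it touches a production APIC.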
Automating Operations
• Cover in Automation Accelerators
• Automation Tools
• Save/Edit/Post XML/JSON
• API inspector
• Visore
• Curl
• Postman
• Postman Runner
• Ansible
• Terraform
• Python
• UCS Director
• CloudCenter

Other Operational Support Products

• Cisco Business Critical Service – Insights (AS service)

• Cisco Network Assurance Engine
• ACI Apps
• Network Insight – Resource
• Network Insight – Adviser
• Cisco NAE explorer
• UCS Integration App (provision FI vlans automatically) – 4.1

Troubleshooting Review Questions:
• We notice a slight decrease in System health, how do we find out what happened?
• How to check if Fabric is running out of resources?
• Server team is reporting connectivity issues between two servers. How do I check
if Fabric is in good shape on datapath between two endpoints?
• Server team just connected a new server, gave me only IP and asking if I see their
server on the network?
• I see a new server, but can I ping it from the leaf?

Infrastructure and Services
1 Backups
2 SNMP
3 Syslog

*NTP, DNS and management already covered in previous sessions
Infrastructure and Services
Backups – Configuration Export
• The current Fabric configuration/policy in JSON/XML
• Best practice for DISASTER RECOVERY

Enabled -> Encrypted password (best practice-must)

Disabled -> No password

Infrastructure and Services
Backups - Snapshots

Creates a Config Backup that is stored on the APIC by default

Run on a Per Fabric or Tenant Basis

Infrastructure and Services
Backups - Snapshots

• Rollback feature allows config rollback between 2 snapshots
• Can also compare differences between a previous snapshot
[Screenshot labels: Changed From, Changed To]

SNMP Support
• http://www.cisco.com/c/dam/en/us/td/docs/switches/datacenter/aci/apic/sw/1-

SNMP Configuration

Open Question: SNMP or REST API?
• SNMP support for any information needs to be explicitly implemented
• Therefore SNMP will never cover 100% of all info available
• The REST API covers 100% of all info available
• It is easy to figure out which REST call to make to poll a specific counter
• Most tools out there support REST-based polling (mostly indirectly through the help of additional scripts):
Cacti, Nagios, Graphite, etc.

Syslog Integration
• Forwards to syslog server
• Can forward events, audit
and faults

Syslog Integration
• Verify SYSLOG configuration

• apic1# logit severity 1 dest-grp SyslogServer server "THIS IS A TEST"

• [root@syslogserver]# tail -f /var/log/messages
Aug 11 09:19:10 Aug 11 11:28:32.743 apic2 %LOG_-6-SYSTEM_MSG [login,session][info][subj-
[uni/userext/user-admin]/sess-8590507914] From-
Aug 11 09:19:17 Aug 11 11:28:39.572 apic1 %LOG_-1-SYSTEM_MSG [E4210472][transition][info][sys] sent
user message to syslog group:SyslogServer:THIS IS A TEST
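When verifying syslog delivery in bulk, it can help to parse the received lines rather than read them by eye. The sketch below extracts the host, severity, and message body from the test line shown above. The regular expression is an assumption fitted to this one sample format, not a full grammar for ACI system messages.

```python
import re

# Matches "<host> %<FACILITY>-<severity>-<MNEMONIC> <body>" as seen in the sample.
PATTERN = re.compile(
    r"(?P<host>\S+) %(?P<facility>[A-Z_]+)-(?P<severity>\d)-(?P<mnemonic>[A-Z_]+) (?P<body>.*)"
)


def parse_apic_syslog(line: str):
    """Return host, severity, mnemonic and body, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "host": match.group("host"),
        "severity": int(match.group("severity")),
        "mnemonic": match.group("mnemonic"),
        "body": match.group("body"),
    }


# The test message from the slide, with the wrapped line joined.
LINE = (
    "Aug 11 09:19:17 Aug 11 11:28:39.572 apic1 %LOG_-1-SYSTEM_MSG "
    "[E4210472][transition][info][sys] sent user message to syslog "
    "group:SyslogServer:THIS IS A TEST"
)

parsed = parse_apic_syslog(LINE)
print(parsed["host"], parsed["severity"])
print(parsed["body"])
```

Checking that the parsed severity matches the value passed to logit confirms that the destination group forwards messages at that level.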

Syslog configuration – Why 3 locations?


Interlock and communications

Next Steps
Additional services

Additional Accelerators and Ask the Expert sessions

Accelerator close
ACI Troubleshooting Guide
ACI Upgrade/Downgrade Matrix
Continue the conversation in our ACI community
