
K12531: Troubleshooting health monitors

https://my.f5.com/manage/s/article/K12531
Published Date: Oct 31, 2018 UTC Updated Date: Aug 18, 2023 UTC

Issue
A monitor is a BIG-IP feature that verifies connections to pool members or nodes. A health monitor is
designed to report the status of a pool, pool member, or node on an ongoing basis, at a set interval. When a
health monitor marks a pool, pool member, or node as down, the BIG-IP system stops sending traffic to the
device.

A failing or misconfigured health monitor may cause traffic management issues similar to, but not limited to,
the following:

Connections to the virtual server are interrupted or fail.
Web pages or applications fail to load or run.
Certain pool members or nodes receive more connections than others.

Any of these symptoms may indicate that a health monitor is marking a pool, pool member, or node as
indefinitely down or that a monitor is repeatedly marking a pool member or node as down and then as back
up (often called "bouncing"). For example, if a misconfigured health monitor repeatedly marks pool members
as down and then as back up, connections to the virtual server may be interrupted or fail altogether. If this
occurs, you need to determine whether the monitor is misconfigured, the device or application is failing, or
some other factor, such as a network-related issue, is causing the monitor to fail. The troubleshooting steps
you take depend on the monitor type and the symptoms you observe.

You can use the following procedures to troubleshoot health monitor issues:

Identifying a failing health monitor
Verifying monitor settings
Troubleshooting monitor types
Troubleshooting daemons related to health monitoring
Using tcpdump to capture the monitor traffic
Verifying connectivity between the BIG-IP system and pool members

Identifying a failing health monitor

You can use the Configuration utility, command line utilities, logs, or SNMP to help identify when a health
monitor marks a pool, pool member, or node as down.

Configuration utility

The following table lists Configuration utility pages where you can check the status of pools, pool members,
and nodes.

Configuration utility page | Description | Location

Network map | Summary of pools, pool members, and nodes | Local Traffic > Network Map
Pools | Current status of pools | Local Traffic > Pools > Statistics
Pool members | Current status of pool members | Local Traffic > Pools > Statistics
Nodes | Current status of nodes | Local Traffic > Nodes > Statistics

Command line utilities

The following table lists command line utilities that you can use to monitor the status of pools, pool
members, and nodes.

Command line utility | Description | Example commands

TMOS Shell (tmsh) (BIG-IP 10.x and later) | Statistical information about pools, pool members, and nodes | tmsh show /ltm pool <pool_name>, tmsh show /ltm node <node_IP>
bigtop | Live statistics for pool members and nodes | bigtop -n
bigpipe (BIG-IP 10.x) | Statistical information about pools, pool members, and nodes | bigpipe pool show, bigpipe node show
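
For example, to quickly check member availability for a single pool from tmsh, you can filter the show
output; the pool name my_http_pool is a placeholder, and the exact output format varies by version:

# my_http_pool is a placeholder pool name
tmsh show /ltm pool my_http_pool members | grep -i 'availability'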

Logs

The BIG-IP system logs messages related to health monitors to the /var/log/ltm file. You can review log files
to determine the frequency with which the system marks pool members and nodes as down.

Pools

When a health monitor marks all members of a pool as down or up, the BIG-IP system logs messages
to the /var/log/ltm file which appear similar to the following example:

tmm err tmm[4779]: 01010028:3: No members available for pool <Pool_name>


tmm err tmm[4779]: 01010221:3: Pool <Pool_name> now has available members

Pool members

When a health monitor marks pool members as down or up, the BIG-IP system logs messages to the
/var/log/ltm file which appear similar to the following example:

notice mcpd[2964]: 01070638:5: Pool <Pool_name> member <ServerIP_port> monitor status down [
<MonitorA_name>: down, <MonitorB_name>: down ] [ was up for <#>hrs:<#>mins:<#>sec ]
notice mcpd[2964]: 01070727:5: Pool <Pool_name> member <ServerIP_port> monitor status up. [ <
MonitorA_name>: down, <MonitorB_name>: up ] [ was down for <#>hrs:<#>mins:<#>sec ]

When a pool member is forced offline by the administrator, the BIG-IP system logs messages to the
/var/log/ltm file which appear similar to the following example:

notice mcpd[5897]: 01070638:5: Pool <Pool_name> member <ServerIP_port> monitor status forced
down. [ <MonitorA_name>: down, <MonitorB_name>: up ] [ was up for <#>hrs:<#>mins:<#>sec ]

Nodes
When a health monitor marks a node as down or up, the BIG-IP system logs messages to the
/var/log/ltm file which appear similar to the following example:

notice mcpd[2964]: 01070640:5: Node <ServerIP> monitor status down.


notice mcpd[2964]: 01070728:5: Node <ServerIP> monitor status up.

Monitor logging

In BIG-IP 11.5.0 and later, the Monitor Logging option allows the system to log more verbose monitor
messages for individual pool members and nodes. The BIG-IP system stores the log for each pool member or
node in the /var/log/monitors/ directory. The system does not save the Monitor Logging option setting in the
system configuration; instead, it disables the option when the configuration loads. Additionally, the BIG-IP
system does not include the Monitor Logging option in configuration synchronization operations.

The log file has the following file naming format:

<MonitorPartition>_<MonitorName>-<NodePartition>_<NodeName>-<port>.log

For example, if the Gateway_ICMP monitor is set to monitor pool member 10.10.12.200 and the Monitor
Logging option is set to Enabled, the BIG-IP system generates the following log file for the pool member:

/var/log/monitors/Common_gateway_icmp-Common_10.10.12.200-0.log
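
Once monitor logging is enabled (see the following procedures), you can watch the resulting log in real time
from the BIG-IP command line; for example:

tail -f /var/log/monitors/Common_gateway_icmp-Common_10.10.12.200-0.log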

Enabling monitor logging for a pool member

Impact of procedure: The /var/log directory may become full if you leave monitor logging enabled for a
long period of time. Be sure to disable monitor logging after troubleshooting.

1. Log in to the Configuration utility.
2. Go to Local Traffic > Pools > Pool List.
3. Select the name of the pool that contains the pool member for which you want to enable monitor
logging.
4. Select the Members tab.
5. In the Current Members list, select the name of the pool member for which you want to enable monitor
logging.
6. For Monitor Logging, select the Enable check box.
7. Select Update.

Enabling monitor logging for a node

Impact of procedure: The /var/log directory may become full if you leave monitor logging enabled for a
long period of time. Be sure to disable monitor logging after troubleshooting.

1. Log in to the Configuration utility.
2. Go to Local Traffic > Nodes > Node List.
3. Select the name of the node for which you want to enable monitor logging.
4. For Monitor Logging, select the Enable check box.
5. Select Update.
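
As an alternative to the Configuration utility steps above, you can enable monitor logging from tmsh. The
following is a minimal sketch; the pool, member, and node names are placeholders, and the logging property
name may vary by BIG-IP version:

# Placeholder objects; the 'logging' property is assumed to control monitor logging
tmsh modify /ltm pool my_http_pool members modify { 10.10.12.200:80 { logging enabled } }
tmsh modify /ltm node 10.10.12.200 logging enabled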

SNMP

When you configure the BIG-IP system to send SNMP traps and a health monitor marks a pool member or
node as down or up, the system sends the following traps:
Pool members

alert BIGIP_MCPD_MCPDERR_POOL_MEMBER_MON_STATUS {
snmptrap OID=".1.3.6.1.4.1.3375.2.4.0.10"
}
alert BIGIP_MCPD_MCPDERR_POOL_MEMBER_MON_STATUS_UP {
snmptrap OID=".1.3.6.1.4.1.3375.2.4.0.11"
}

Nodes

alert BIGIP_MCPD_MCPDERR_NODE_ADDRESS_MON_STATUS {
snmptrap OID=".1.3.6.1.4.1.3375.2.4.0.12"
}
alert BIGIP_MCPD_MCPDERR_NODE_ADDRESS_MON_STATUS_UP {
snmptrap OID=".1.3.6.1.4.1.3375.2.4.0.13"
}
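
These alert definitions are part of the system alert configuration. Assuming the standard
/etc/alertd/alert.conf location, you can confirm that they are present by typing a command similar to the
following:

grep -A 2 'MON_STATUS' /etc/alertd/alert.conf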

Verifying monitor settings

You must verify that monitor settings are properly defined for your environment. F5 recommends that, in
most cases, the timeout value be equal to three times the interval value, plus one. For example, the
default interval/timeout ratio is 5/16 (three times 5 plus one equals 16). This setting prevents the monitor
from marking the resource as down before the final check has had a chance to complete.
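
For example, if you increase the interval to 10 seconds, the recommended timeout is 31 seconds (three times
10 plus one). The following is a minimal sketch for reviewing and adjusting these values from tmsh, assuming
a custom HTTP monitor named http_custom:

# http_custom is a placeholder monitor name
tmsh list /ltm monitor http http_custom
tmsh modify /ltm monitor http http_custom interval 10 timeout 31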

Simple monitors

You can use a simple monitor to verify the status of a destination node (or the path to the node through a
transparent device). Simple monitors only monitor the node address itself, not individual protocols, services,
or applications on a node. The BIG-IP system provides the following pre-configured simple monitor types:
gateway_icmp, icmp, tcp_echo, tcp_half_open. If you determine that a simple monitor is marking a node as
down, you can verify the following settings:

Note: There are other monitor settings that can be defined for simple monitors. For more information, refer to
the Configuration Guide for BIG-IP Local Traffic Management. For information about how to locate F5
product manuals, refer to K98133564: Tips for searching AskF5 and finding product documentation.

Interval/timeout ratio

You must configure an appropriate interval/timeout ratio for simple monitors. In most cases, the
timeout value should be equal to three times the interval value, plus one. For example, the default ratio
is 5/16 (three times 5 plus one equals 16). Verify that the ratio is properly defined.

Transparent

A transparent monitor uses a path through the associated node to monitor the aliased destination.
Verify that the destination target device is reachable and configured properly for the monitor.
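
For illustration only, the following sketch creates a transparent TCP Half Open monitor that checks a
destination behind the node; the monitor name, destination address, and port are placeholders, and the
available options vary by monitor type and BIG-IP version:

# Placeholder name and destination; verify the options for your monitor type and version
tmsh create /ltm monitor tcp-half-open tcp_half_open_transparent defaults-from tcp_half_open destination 10.10.10.1:80 transparent enabled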

Extended content verification (ECV) monitors

ECV monitors use Send and Receive string settings to retrieve content from pool members or nodes. The
BIG-IP system provides the following pre-configured ECV monitor types: tcp, http, https, and https_443. If
you determine that an ECV monitor is marking a pool member or node as down, you can verify the following
settings:

Note: There are other monitor settings that can be defined for ECV monitors. For more information, refer to
the Configuration Guide for BIG-IP Local Traffic Management. For information about how to locate F5
product manuals, refer to K98133564: Tips for searching AskF5 and finding product documentation.

Note: HTTPS monitors use OpenSSL for cipher negotiations.
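
If an HTTPS monitor fails while the equivalent HTTP monitor succeeds, you can test the SSL handshake to the
pool member directly from the BIG-IP command line; for example (10.10.65.1 is a placeholder address):

openssl s_client -connect 10.10.65.1:443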

Interval/timeout ratio

As with simple monitors, you need to properly set the interval/timeout ratio for ECV monitors. In most
cases, the timeout value should be equal to three times the interval value, plus one. For example, the
default ratio is 5/16 (three times 5 plus one equals 16). Verify that the ratio is properly defined.

Send string

The Send string is a text string that the monitor sends to the pool member. The default setting is GET /
, which retrieves a default HTML file for a website. If the Send string is not properly constructed, the
server may send an unexpected response and be subsequently marked as down by the monitor. For
example, if the server requires the monitor request to be HTTP/1.1 compliant, you must adjust the
monitor’s Send string.
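
For example, a Send string for a server that requires an HTTP/1.1 request might appear similar to the
following; the host name is a placeholder, and the exact syntax your version expects is described in K2167:

GET / HTTP/1.1\r\nHost: www.example.com\r\nConnection: Close\r\n\r\n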

Note: For information about modifying HTTP requests for use with HTTP or HTTPS application
health monitors, refer to the following articles:

K2167: Constructing HTTP requests for use with the HTTP or HTTPS application health
monitor
K3224: HTTP health checks may fail even though the node is responding correctly
K10655: CR/LF characters appended to the HTTP monitor Send string
K16526: Configuring the SSL cipher strength for a custom HTTPS health monitor
Receive string

The Receive string is the regular expression representing the text string that the monitor looks for in
the returned resource. ECV monitor requests may fail and mark the pool member as down if the
Receive string is not configured properly. For example, if the Receive string appears too late in the
server response, or the server responds with a redirect, the monitor marks the pool member as down.
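
For example, a Receive string that matches a successful status line near the start of the response, rather
than text that appears late in the body, might appear similar to the following:

HTTP/1\.(0|1) 200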

Note: For information about modifying the monitor to issue a request to a redirection target, refer to
K3224: HTTP health checks may fail even though the node is responding correctly.

User name and password

ECV monitors have User Name and Password fields, which can be used for resources that require
authentication. Verify whether the pool member requires authentication and ensure that these fields
contain valid credentials.

Troubleshooting monitor types

Simple monitors
If you determine that a simple monitor is marking a node as down (or if the node is bouncing), you can use
the following steps to troubleshoot:

1. Determine the IP address of the nodes being marked as down.

You can determine the IP addresses of the nodes that the monitor is marking as down by using the
Configuration utility, command line utilities, or log files. You can quickly search the /var/log/ltm file
for node status messages by typing the following command:

# grep 'Node' /var/log/ltm |grep 'status'

Output will appear similar to the following example:

Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070640:5: Node 10.10.65.1 monitor status down.
Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070640:5: Node 172.24.64.4 monitor status down.
Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 10.1.0.200 monitor status down.
Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 10.10.65.122 monitor status down.
Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 10.1.0.100 monitor status
unchecked.
Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 11.1.1.1 monitor status down.
Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 172.16.65.3 monitor status down.
Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070640:5: Node 172.16.65.229 monitor status
down.

Note: If a large number of nodes are being marked as down (or bouncing), you can sort the results by
IP addresses by typing the following command.

grep 'Node' /var/log/ltm |grep 'status' | sort -t . -k 3,3n -k 4,4n

2. Check connectivity to the node.

If there are occurrences of node addresses being marked as down and not back up, or of nodes
bouncing, use commands such as ping and traceroute (BIG-IP 10.x and 11.x) to check the
connectivity to the nodes from the BIG-IP system. For example, if you determine that a simple monitor
is marking the node address 10.10.65.1 as down, you can attempt to ping the resource from the BIG-IP
system, as shown in the following example:

# ping -c 4 10.10.65.1
PING 10.10.65.1 (10.10.65.1) 56(84) bytes of data.
64 bytes from 10.10.65.1: icmp_seq=1 ttl=64 time=11.32 ms
64 bytes from 10.10.65.1: icmp_seq=2 ttl=64 time=8.989 ms
64 bytes from 10.10.65.1: icmp_seq=3 ttl=64 time=10.981 ms
64 bytes from 10.10.65.1: icmp_seq=4 ttl=64 time=9.985 ms

Note: The ping output in the previous example shows high round-trip times, which may indicate a
network issue or a slowly responding node.

In addition, make sure that the node is configured to respond to the simple monitor. For example,
tcp_echo is a simple monitor type that requires that you enable TCP echo service on the monitored
nodes. The BIG-IP system sends a SYN segment with information that the receiving device echoes.
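
The TCP echo service listens on TCP port 7, so as a quick check you can attempt a connection to that port
from the BIG-IP command line (10.10.65.1 is a placeholder address):

telnet 10.10.65.1 7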

3. Check the monitor settings.

Use the Configuration utility or command line utilities to verify that the monitor settings (such as the
interval/timeout ratio) are appropriate for the node.

Type the following tmsh command to list the configuration for the icmp_new monitor:

tmsh list /ltm monitor icmp icmp_new

4. Create a custom monitor (if needed).

If you are using a default monitor and have determined that the settings are not appropriate for your
environment, consider creating and testing a new monitor with custom settings.
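
For example, the following minimal sketch creates a custom ICMP monitor named icmp_new with a 10/31
interval/timeout ratio and assigns it to a node; the monitor name and node address are placeholders:

# Placeholder monitor name and node address
tmsh create /ltm monitor icmp icmp_new defaults-from icmp interval 10 timeout 31
tmsh modify /ltm node 10.10.65.1 monitor icmp_new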

ECV monitors

If you determine that an ECV monitor is marking a pool member as down (or if the pool member is
bouncing), you can use the following steps to troubleshoot the issue:

1. Determine the IP address of the pool members that the monitor is marking as down by using the
Configuration utility, command line utilities, or log files.

For example, you can search the /var/log/ltm file for pool member status messages by typing the
following command:

# grep -i 'pool member' /var/log/ltm | grep 'status'

Output appears similar to the following example:

Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070638:5: Pool member 10.10.65.1:21 monitor
status node down.
Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070638:5: Pool member 10.10.65.1:80 monitor
status node down.
Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070638:5: Pool member 10.10.65.1:80 monitor
status node down.
Jan 21 15:04:34 local/3400a notice mcpd[2964]: 01070638:5: Pool member 10.10.65.1:80 monitor
status node down.
Jan 21 15:04:51 local/3400a notice mcpd[2964]: 01070638:5: Pool member 172.16.65.3:80 monitor
status node down.
Jan 21 15:05:05 local/3400a notice mcpd[2964]: 01070638:5: Pool member 172.16.65.3:80 monitor
status unchecked.

2. Check connectivity to the pool member.

Check the connectivity to the pool members from the BIG-IP system using the ping or traceroute
commands.

3. Check the ECV monitor settings.

Use the Configuration utility or command line utilities to verify that the monitor settings (such as the
interval/timeout ratio) are appropriate for the pool members.

The following tmsh command lists the configuration for the http_new monitor:

tmsh list /ltm monitor http http_new


4. Create a custom monitor (if needed).

If you are using a default monitor and have determined that the settings are not appropriate for your
environment, consider creating and testing a new monitor with custom settings.

5. Test the response from the application.

Use a command line utility on the BIG-IP system to test the response from the web application. For
example, the following curl and time command attempts to transfer data from the web server while
timing the response:

# time curl http://10.10.65.1

Output syntax appears similar to the following example:

<html>
<head>
---
</body>
</html>
real 0m18.032s
user 0m0.030s
sys 0m0.060s

Note: If you want to test a specific HTTP request, including HTTP headers, you can use the telnet
command to connect to the pool member.

For example:

telnet <serverIP> <serverPort>

At the prompt, enter an appropriate HTTP request line and HTTP headers, pressing Enter once after
each line.

For example:

GET / HTTP/1.1 <enter>
Host: www.yoursite.com <enter>
Connection: close <enter>
<enter>

Troubleshooting daemons related to health monitoring

The bigd process manages health checking for pool members, nodes, and services on the BIG-IP LTM
system. The bigd process collects health checking status and communicates the status information to the
mcpd process, which stores the data in shared memory so that the Traffic Management Microkernel (TMM)
can read it. If you are having monitoring issues, you can check the memory utilization of the bigd process. If
the %MEM is unusually high, or continually increases, the process may be leaking memory.

For example, to check the current memory utilization of bigd, type the ps command:

# ps aux |grep bigd


Output appears similar to the following example:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 3020 0.0 0.6 28208 10488 ? S 2010 5:08 /usr/bin/bigd

Note: If the bigd process fails, the health check status of pool members, nodes, and services remains
unchanged until the bigd process restarts. For more information, refer to K6967: When the BIG-IP LTM
bigd daemon fails, the health check status of pool members, nodes, and services remain unchanged until the
bigd daemon restarts.
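
You can also confirm whether the bigd process is running, and restart it if necessary, by using the bigstart
utility. Health monitoring resumes once the process restarts:

bigstart status bigd
bigstart restart bigd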

Additionally, you can run the bigd process in debug mode. Debug logging for the bigd process is extremely
verbose as it logs multiple messages for every monitor attempt. For information about running bigd in debug
mode, contact F5 Technical Support.

Using tcpdump to capture the monitor traffic

If you are unable to determine the cause of a failing health monitor, you may need to perform packet captures
on the BIG-IP system. To use the tcpdump command to capture monitor traffic, perform the following steps:

Impact of procedure: You should only run tcpdump packet captures during active troubleshooting sessions.

1. Log in to the BIG-IP command line.

2. Use the following command syntax to determine the self IP address that the BIG-IP system uses for
health monitoring:

ip route get <server ip address>

Note: Replace <server ip address> with the IP address of the destination server.

Output appears similar to the following example, which uses the destination server address 10.20.4.100:

ip route get 10.20.4.100


10.20.4.100 dev internal_vlan src 10.20.4.3
cache

Note: In the example, the server 10.20.4.100 is associated with VLAN internal_vlan and the self IP
address for health monitoring is 10.20.4.3.

3. Use the following tcpdump syntax to capture monitor traffic.

tcpdump -nnvi <internal_vlan_name>:nnn -s0 -w /var/tmp/<filename>.pcap host <self-ip address>

For example:

tcpdump -nnvi internal_vlan:nnn -s0 -w /var/tmp/monitortraffic.pcap host 10.20.4.3

4. When you have captured the appropriate amount of monitor traffic, press Ctrl+C to terminate the
tcpdump capture.
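
You can then review the capture directly on the BIG-IP system, or copy it off the system for analysis in a
tool such as Wireshark. For example:

tcpdump -nnr /var/tmp/monitortraffic.pcap | head -20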

Note: For more information about running tcpdump, refer to K411: Overview of packet tracing with the
tcpdump utility.
Verifying connectivity between the BIG-IP system and pool members

1. Send a ping from the BIG-IP system to a pool member.
2. Identify the intermediate device between the BIG-IP system and the pool member, and ping that device's
IP address from the BIG-IP system.
3. If the intermediate device is a switch, check for an ARP entry in the BIG-IP ARP table by using the
arp -a command.
4. Verify the VLAN and VLAN tagging configuration on the BIG-IP system and the connected switch or L3 switch.
5. If ping is blocked, perform a telnet test to the pool member's service port (see the example commands
after this list).
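
The following example commands illustrate these checks; the pool member address 10.10.65.1, service port 80,
and gateway address 10.10.65.254 are placeholders:

# Placeholder addresses and port
ping -c 4 10.10.65.1
ping -c 4 10.10.65.254
arp -a | grep 10.10.65
telnet 10.10.65.1 80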

Related Content

K33598123: Pool member / Node stays down erroneously, due to bigd and mcpd is out-of-sync.
K83316932: Overview of the Manual Resume feature for BIG-IP LTM monitors
K14403: Maintaining disk space on the BIG-IP system (11.x - 13.x)
K15530: Debug logging and BIG-IP system resource utilization
K3451: Content length limits for HTTP and HTTPS health monitors
K16008: Overview of BIG-IP pool status (11.x - 15.x)
K10516: Overview of BIG-IP pool status (9.x - 10.x)
K13898: Determining which monitor triggered a change in the availability of a node or pool member
(11.x)
K10966: Determining which monitor triggered a change in the availability of a node or pool member
(9.x - 10.x)
K15408: Troubleshooting BIG-IP GTM monitors
ringdump on DevCentral
Run tcpdump on event on DevCentral
For more information about the bigtop utility, refer to K7318: Overview of the bigtop utility
For more information about tmsh, refer to the Traffic Management Shell (tmsh) Reference Guide

Note: For information about how to locate F5 product manuals, refer to K98133564: Tips for searching
AskF5 and finding product documentation.
