LCM 2.

Life Cycle Manager Guide


September 1, 2020
Contents

1.  Overview...................................................................................................................... 3
Life Cycle Manager................................................................................................................................................... 3
Prism Central and Prism Element.......................................................................................................................4

2.  Opening LCM............................................................................................................ 5


Accessing the Life Cycle Manager from Prism Element............................................................................ 5
Accessing the Life Cycle Manager from Prism Central.............................................................................. 6

3.  LCM Inventory.......................................................................................................... 8


Performing Inventory With the Life Cycle Manager....................................................................................8

4.  LCM Updates...........................................................................................................10


Performing Updates with the Life Cycle Manager..................................................................................... 10
Effects of Updates on the Cluster.................................................................................................................... 13

5.  LCM Prechecks....................................................................................................... 16


Life Cycle Manager Pre-Checks......................................................................................................................... 16

6.  LCM Log Collection.............................................................................................. 19


Collecting Life Cycle Manager Logs.................................................................................................................19

7.  LCM Glossary.......................................................................................................... 20


Glossary.......................................................................................................................................................................20

Copyright...................................................................................................................22
License......................................................................................................................................................................... 22
Conventions............................................................................................................................................................... 22
Default Cluster Credentials..................................................................................................................................22
Version..........................................................................................................................................................................23
1
OVERVIEW
Life Cycle Manager
The Life Cycle Manager (LCM) tracks software and firmware versions of all entities in the
cluster.

LCM Structure
LCM consists of a framework and a set of modules for inventory and update.
This document assumes you are using LCM at a site that has Internet access. To use LCM at a
location without internet access, see the Life Cycle Manager Dark Site Guide.
LCM supports software updates for all platforms that use Nutanix software.
LCM supports firmware updates only for the following platforms.

Table 1: Platforms That Support Firmware Updates Through LCM

• Nutanix (NX)

• Dell XC / XC Core

• Lenovo HX / HX Ready

• HPE DX

• HPE DL (G10)

• Fujitsu XF

• Intel DCB

• Inspur InMerge

The LCM framework is accessible through the Prism interface. It acts as a download manager
for LCM modules, validating and downloading module content. All communication between the
cluster and LCM modules goes through the LCM framework.
LCM modules are independent of AOS. They contain libraries and images, as well as metadata
and checksums for security. Currently Nutanix supplies all modules.
The LCM framework targets a configurable URL to download content from the LCM modules.
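The validation step described above can be sketched as a checksum comparison. This is an illustrative sketch under assumed details, not the actual LCM implementation; LCM module metadata formats are not documented here.

```python
import hashlib

def verify_module(payload: bytes, expected_sha256: str) -> bool:
    """Return True if downloaded module content matches its published checksum."""
    digest = hashlib.sha256(payload).hexdigest()
    return digest == expected_sha256

# Example: a (hypothetical) module payload and the checksum its metadata advertises.
payload = b"lcm-module-content"
expected = hashlib.sha256(payload).hexdigest()
assert verify_module(payload, expected)          # valid download
assert not verify_module(b"tampered", expected)  # corrupted or tampered download
```

A framework that validates content this way can reject a corrupted or tampered module before anything on the cluster consumes it.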

LCM Support
LCM supports both Prism Element and Prism Central.
Nutanix recommends that you use LCM to update your cluster to the most recent version of
Nutanix Foundation before performing any other LCM updates.
Nutanix recommends that you perform all updates through LCM, where available. However,
your platform vendor may recommend updates that are not yet available through LCM. The lag
is caused by the time it takes Nutanix to revalidate the LCM payload after incorporating new
updates from vendors.

LCM Operation
LCM performs two functions: taking inventory of the cluster and performing updates on the
cluster. LCM updates are not reversible.

LCM |  Overview | 3
Before performing an update, LCM runs a set of pre-checks to verify the state of the cluster. If
any checks fail, LCM stops the update.
LCM writes all operations to output logs:

• genesis.out

• lcm_ops.out

• lcm_ops.trace

• lcm_wget.log
The log files record all operations, including successes and failures. If an operation fails, LCM
suspends the operation and waits for mitigation. Contact Nutanix Support for assistance with
any LCM failure.
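When troubleshooting a suspended operation, a first step is often to filter failure lines out of a log such as lcm_ops.out. The sketch below is a hypothetical illustration; the line format is an assumption, and on a CVM these logs typically live under /home/nutanix/data/logs.

```python
def failed_operations(log_lines):
    """Return log lines that look like LCM operation failures (format assumed)."""
    return [line for line in log_lines if "ERROR" in line or "failed" in line.lower()]

sample = [
    "2020-09-01 12:00:01 INFO  lcm_ops: starting update on node A",
    "2020-09-01 12:05:42 ERROR lcm_ops: update failed on node A",
    "2020-09-01 12:06:00 INFO  lcm_ops: operation suspended, waiting for mitigation",
]
print(failed_operations(sample))
```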
The LCM framework can also update itself when necessary. Although connected to AOS, the
framework is not part of the AOS release cycle.

Limitations

• LCM cannot perform firmware upgrades on single-node clusters. Firmware updates usually
require services to stop, and on a single-node cluster there is no other node to take over the
workload. This limitation does not apply to software updates.
• LCM is a separate process from Prism one-click upgrades. CLI commands specific to the
one-click upgrade workflow, such as firmware_upgrade_status, do not apply to LCM. For
update status, use lcm_update_status.

• LCM cannot operate when you have VLAN traffic filtering (such as VLAN trunking or virtual
guest tagging) deployed on the dvPortGroup connected to the Controller VM eth0 interface.

Prism Central and Prism Element


Information on using LCM with Prism Central and Prism Element.

Prism Central
Some entities are local to Prism Central, such as Calm, Epsilon, Karbon, and Objects. To manage
these entities, you must log on to Prism Central.
Other entities, such as component firmware, are local to Prism Element. To manage these
entities, you must log on to Prism Element.
In general, follow these guidelines for using LCM with Prism Element and Prism Central:

• Make sure that Prism Central and Prism Element use the same LCM URL. (At a dark site,
make sure that both Prism Central and Prism Element use the same dark site bundle.)
• Before you access Prism Element from Prism Central, perform an LCM inventory on Prism
Central, using cluster quick access.
• Every time you register a new Prism Element instance to Prism Central, perform an LCM
inventory on Prism Central.
• Nutanix recommends that you enable auto-inventory for LCM on Prism Central, to make sure
that the UI stays up to date on all Prism Element instances.
2
OPENING LCM
Accessing the Life Cycle Manager from Prism Element
Open LCM from Prism Element.

About this task


Different versions of AOS use different methods of opening entities in Prism.

Procedure
Open LCM with the method for your version of AOS.

» For AOS 5.11 and later:


In Prism, open the drop-down menu on the upper left and select LCM.



» For AOS 5.10:
In Prism, click the gear icon to open the Settings page and select Life Cycle
Management in the sidebar.

» For AOS 5.5:


In Prism, click the gear icon and select Life Cycle Management from the drop-down
menu.

Accessing the Life Cycle Manager from Prism Central


Open LCM from Prism Central.

About this task


Different versions of AOS use different methods of opening entities in Prism.

Procedure
Open LCM with the method for your version of AOS.



» For AOS 5.11 and later:
In Prism Central, open the drop-down menu on the upper left and select Administration >
LCM.

» For AOS 5.10:


In Prism Central, click the gear icon to open the Settings page and select Life Cycle
Management in the sidebar.



3
LCM INVENTORY
Performing Inventory With the Life Cycle Manager
Use LCM to display software and firmware versions of entities in the cluster.

About this task


Inventory information for a given node is persistent as long as the node remains in the chassis.
When you remove a node from a chassis, LCM does not retain inventory information for that
node. When you return the node to the chassis, you must perform the inventory operation
again to restore the inventory information.

Procedure

1. Open LCM according to the procedure for the version of AOS you are using, as described in
Accessing the Life Cycle Manager from Prism Element on page 5 or Accessing the Life Cycle
Manager from Prism Central on page 6.

2. Select Inventory.



3. To take an inventory, click Perform Inventory.
If auto-update is not enabled and a new version of the LCM framework is available, LCM
displays a warning message.

4. Click OK.
The new inventory appears on the Inventory page.

5. Use the Focus button to switch between a general display and a component-by-component
display.

6. Click Export to export the inventory as a spreadsheet.

7. To enable auto-inventory, click Settings and select the Enable LCM Auto Inventory checkbox
in the dialog box that appears.

8. To return to Prism, open the drop-down menu on the upper left and select Home.



4
LCM UPDATES
Performing Updates with the Life Cycle Manager
Use LCM to perform software or firmware updates for components in a cluster.

Before you begin


Make sure that your system is able to perform LCM updates:
Foundation
Nutanix recommends that you update your cluster to the most recent version of Nutanix
Foundation before performing any LCM updates.
Firewall
Configure rules in your external firewall to allow LCM updates. For details, see the Prism
Web Console Guide: Firewall Requirements.
Affinity rules
When LCM takes down a node prior to performing an update, AOS uses Acropolis
Dynamic Scheduling (ADS) to migrate virtual machines to other nodes in the cluster. If you
have VM affinity rules configured on your cluster, particularly affinity rules that restrict
which nodes the VM can migrate to, then LCM can run slowly while waiting for ADS to
migrate the VMs.
If you have configured affinity rules that do not allow VMs to migrate, then the LCM
precheck test_esx_entering_mm_pinned_vms fails and LCM updates cannot continue.
To avoid update failures or delays due to affinity rules, disable any VM affinity rules
before beginning an LCM update and reenable the rules after performing the update.

Procedure

1. Open LCM according to the procedure for the version of AOS you are using, as described in
Accessing the Life Cycle Manager from Prism Element on page 5 or Accessing the Life Cycle
Manager from Prism Central on page 6.



2. In the LCM toolbar, select Software Updates or Firmware Updates.

3. Select the updates you want to perform.

a. Select the checkbox for the node you want to update, or select All to update the entire
cluster.
b. Select the components you want to update. When you select a node, LCM selects
the checkboxes for all updateable components by default. Clear the checkbox of any
component you do not want to update.

4. Click NCC Check.

5. In the dialog box that appears, specify which prechecks you want LCM to run before
updating and click Run.

6. When the prechecks are complete, click Update.



7. Specify how you want LCM to handle pinned VMs.

a. ESXi: LCM cannot place ESXi hosts with active pinned VMs into maintenance mode.
Before you can proceed with LCM updates, you must choose one of the following.

• Select Automatically shut down non-migratable VMs through LCM.


• Uncheck Automatically shut down non-migratable VMs through LCM and handle
pinned VMs manually, either by migrating them or shutting them down.
• If you have a system configuration that can automatically handle pinned VMs, uncheck
Automatically shut down non-migratable VMs through LCM and leave your system to
handle them.

Figure 1: Applying updates on ESXi hosts


b. AHV: LCM automatically shuts down any running non-migratable VMs before updating
their host, and starts them again after completing the update.

Note: Non-migratable VMs include VMs with affinity set to a single host; VMs with GPUs;
and agent VMs.

8. Click Apply Updates.


LCM updates the selected components.

9. To return to Prism, open the drop-down menu on the upper left and select Home.



Effects of Updates on the Cluster
Effects on the cluster while LCM updates components.

LCM Workflow
LCM updates the cluster one node at a time: it brings a node down, performs updates, brings
the node up again, and then moves on to the next node. If LCM encounters a problem during an
update, it waits until you resolve the problem before moving on to the next node.
During an LCM update, there is never more than one node down at the same time.
For hosts running ESXi: LCM restarts the host after it migrates the guest VMs but before it
enters maintenance mode, so a "maintenance" status alert never appears in vCenter. This is
expected behavior.
All LCM updates follow this general procedure.

1. If updates for the LCM framework are available, LCM auto-updates its own framework, then
continues with the operation.
2. After updating itself, if necessary, LCM runs the series of pre-checks described in Life Cycle
Manager Pre-Checks on page 16.
3. When the pre-checks are complete, LCM looks at the available component updates and
batches them according to dependencies. Batching reduces cluster downtime: LCM performs
the pre-update and post-update actions only once per batch. For example, on NX platforms,
BIOS updates depend on BMC updates, so LCM batches them so that the BMC always
updates before the BIOS on each node.
4. Next, LCM chooses a node and performs any necessary pre-update actions.
5. Next, LCM performs the update. What actually happens during an update varies by
component; see the following sections for lists of update actions for supported components.



6. LCM performs any necessary post-update actions and brings the node back up.
7. When cluster data resiliency is back to Normal, LCM moves on to the next node.
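The batching step in the procedure above can be sketched as dependency-aware grouping: components with no unmet dependencies go into the current batch, and dependents wait for a later batch. This is a simplified illustration, not LCM's actual scheduler; the dependency map is an assumption based on the BMC-before-BIOS example.

```python
def batch_updates(components, depends_on):
    """Group updates so a component is always scheduled after its dependencies."""
    batches, remaining = [], set(components)
    while remaining:
        done = set(components) - remaining
        # A component is ready once everything it depends on has been batched.
        ready = {c for c in remaining if depends_on.get(c, set()) <= done}
        if not ready:
            raise ValueError("dependency cycle")
        batches.append(sorted(ready))
        remaining -= ready
    return batches

# BIOS depends on BMC, so BMC lands in an earlier batch.
print(batch_updates(["BIOS", "BMC", "SATA DOM"], {"BIOS": {"BMC"}}))
# → [['BMC', 'SATA DOM'], ['BIOS']]
```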

Update Actions for Each Component


BIOS
What happens during the update:
1. Put the CVM in maintenance mode.
2. The CVM entering maintenance mode automatically triggers the host to migrate all
guest VMs to another node.
3. Restart the node into the Phoenix ISO.
4. Perform the first stage of the update.
5. Warm-reset Phoenix. (A warm reset does not power-cycle the node.)
6. Perform the second stage of the update.
7. Cold-reset Phoenix. (A cold reset power-cycles the node.)
8. Perform the third stage of the update.
9. Restart out of Phoenix and bring the CVM out of maintenance mode.

Note: Because of requirements of Intel microcode, updating the BIOS requires several
restarts, so BIOS updates take longer than updates for other components.

BMC
What happens during the update:
1. Put the CVM in maintenance mode.
2. The CVM entering maintenance mode automatically triggers the host to migrate all
guest VMs to another node.
3. Restart the node into the Phoenix ISO.
4. Perform the update.
5. Restart out of Phoenix and bring the CVM out of maintenance mode.
Data Drives and HBA Controllers
What happens during the update:
1. Check disk health on all CVMs, to make sure that taking down one drive does not
cause any data loss.
2. Move all storage traffic out of the Controller VM.
3. Put the CVM in maintenance mode.
4. The CVM entering maintenance mode automatically triggers the host to migrate all
guest VMs to another node.
5. Stop services on the CVM.
6. Restart the node into the Phoenix ISO.
7. Perform the firmware update.
8. Restart the node out of Phoenix.
9. Bring the CVM out of maintenance mode, automatically returning guest VMs.
10. Restart services on the CVM.
11. Return storage traffic to the CVM.
12. Recheck all data drives.
SATA DOM
What happens during the update:
1. Put the CVM in maintenance mode.
2. The CVM entering maintenance mode automatically triggers the host to migrate all
guest VMs to another node.



3. Restart the node into the Phoenix ISO.
4. Perform the update.
5. Restart out of Phoenix and bring the CVM out of maintenance mode.
M.2 Drive
What happens during the update:
1. Put the CVM in maintenance mode.
2. The CVM entering maintenance mode automatically triggers the host to migrate all
guest VMs to another node.
3. Restart the node into the Phoenix ISO.
4. Perform the update.
5. Restart out of Phoenix and bring the CVM out of maintenance mode.
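All of the component flows above share the same skeleton: enter maintenance mode, migrate guest VMs, boot into Phoenix, apply one or more update stages, then reverse those steps. A generic sketch of that ordering (a hypothetical illustration, with the component-specific stages passed in):

```python
def update_sequence(stages):
    """Wrap component-specific update stages in the common per-node sequence."""
    return (["enter CVM maintenance mode", "migrate guest VMs", "boot into Phoenix"]
            + list(stages)
            + ["boot out of Phoenix", "exit CVM maintenance mode"])

# A BIOS-style three-stage update with resets between stages.
print(update_sequence(["stage 1", "warm reset", "stage 2", "cold reset", "stage 3"]))
```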



5
LCM PRECHECKS
Life Cycle Manager Pre-Checks
List of pre-checks that LCM performs before inventory and update operations.
LCM runs a series of pre-checks before beginning an inventory or update operation. If a check
fails, the operation does not run.
There are also prerequisites for certain components that fall outside the scope of these tests. If
your pre-check fails, make sure that you have met all these prerequisites:

• For all LCM operations, disable Common Access Card (CAC) authentication on your cluster.
• For all LCM updates on ESXi, disable admission control in vCenter.

Note: For information on configuring vCenter, see the vSphere Administration Guide for
Acropolis: vCenter Configuration.
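The fail-fast behavior described above can be sketched as a simple gate: run every check, and do not start the operation if any check fails. The harness below is a hypothetical illustration; only the check names come from this guide.

```python
def run_prechecks(checks):
    """Run each named check; return the names of checks that failed."""
    return [name for name, check in checks.items() if not check()]

checks = {
    "test_cluster_status": lambda: True,
    "test_network": lambda: True,
    "test_under_replication": lambda: False,  # simulate a failing check
}
failures = run_prechecks(checks)
if failures:
    print("Operation blocked by failed pre-checks:", failures)
```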

LCM Pre-Checks
test_aos_afs_compatibility
Verifies that AOS and Nutanix Files are compatible.
test_cluster_status
Verifies that the cluster status command returns a healthy state.
test_cassandra_status
Checks zeus_config_proto (run zeus_config_printer for the equivalent output) for the
following:

• Node cassandra_status is Normal.

• No nodes are in maintenance mode.
• The NCC check described in KB 4301 returns a PASS.
test_nutanix_partition_space
Checks that all CVMs have sufficient space on /home/nutanix.
test_network
Verifies that CVM IPs are reachable.
test_hypervisor_tests
Verifies that SSH can connect to the hypervisor.

• AHV: verifies libvirtd on AHV hosts.

• ESXi: checks if there is a VmkNic for all ESXi hosts in the same subnet as eth0/eth2.
Verifies pyVim connectivity to ESXi hypervisor.



test_zk_status
Verifies that, on clusters of three or more nodes, at least three Zookeeper (zk) hosts are
available.
test_replication_factor
Verifies that no containers in the cluster have RF=1.
test_under_replication
Checks for under-replication in the cluster.
test_prism_central_prism_element_compatibility
Checks that the registered Prism Central is compatible with Prism Element by reading
prism_central_compatibility.json.
test_prism_central_ova_compatibility
Checks that Prism Central has boot and data partitions on different disks.
test_degraded_node
Verifies that no nodes are degraded.
test_vlan_detection
Verifies that LCM can fetch VLAN information from all nodes using NodeManager.
test_upgrade_in_progress
Checks to see if a non-LCM upgrade is in progress.
test_foundation_service_status_in_cluster
Checks to see if the Foundation service is running on any node in the cluster.
test_vmotion_support/test_live_migration_support
Checks that an ESXi or Hyper-V host has vMotion enabled.
test_cluster_config
Verifies that vSphere HA is enabled in the vCenter cluster mapped to the Nutanix cluster.
(Make sure that you disable admission control.)
test_hosts_managed_by_same_mgmt_server
Verifies that the same management server (vCenter) manages all hosts.
test_hypervisor_config

• All hosts: verifies that the host has sufficient space.


• ESXi:

• Verifies that the host does not have APD/VMCP enabled. (If yes, disable it.)
• Verifies that no CD-ROMs (ISO image stored in local storage) or other physical
devices are attached to the VM. (If yes, remove the physical device from UVM.)
• AHV: verifies that the host can enter maintenance mode.
• Hyper-V: none.
test_check_revoke_shutdown_token
Checks that the CVM can revoke the shutdown token after an LCM update completes on
a node in a cluster.
test_ntp_server_reachability
Checks that the configured NTP servers are reachable before starting LCM updates.



test_prism_central_minimum_resource_check
Checks that Prism Central is configured with the correct license and has enough memory
to support all licensed features.
test_two_node_cluster_checks
Checks that the configuration on a two-node cluster allows LCM updates:

• Both nodes are up and the cluster is functional.

• The cluster has selected an LCM leader.
• The cluster has a witness VM configured.
• At least two metadata disks are available on each CVM.
test_zk_migration
Checks whether a Zookeeper migration is in progress.
test_catalog_workflows
Checks that the catalog workflows most commonly used by LCM can currently be
performed.
test_foundation_workflows
Checks whether the current version of Foundation on the cluster is on the LCM blacklist.
test_uniform_foundation_across_cluster
Checks that every node in the cluster is running the same version of Foundation.
test_esxi_bootbank_space
Checks that the ESXi bootbank has enough space for LCM upgrades that involve
restarting into Phoenix.
test_esx_entering_mm_pinned_vms
Checks whether any node in the cluster contains a VM that cannot be live-migrated to
another node.

Note: For inventory scans, LCM skips the aos_afs_compatibility and under_replication tests,
because inventory scans are not disruptive.
6
LCM LOG COLLECTION
Collecting Life Cycle Manager Logs
Run the LCM log collector utility.

About this task


The LCM log collector utility is a script, independent of NCC, that collects only the files
relevant to the LCM workflow. It completes more quickly and returns a smaller log bundle
than the NCC log_collector command.
The utility is independent of the state of LCM, so it can collect logs even when the LCM
framework is down.
The log collector utility performs the following actions:

• Collects LCM log files


• Detects if the node is stuck in Phoenix and collects kernel logs
• Collects logs for the services with which LCM interacts, such as Foundation and Prism
• Parses LCM configurations from zookeeper nodes.

Note: The collector cannot pull logs from a node booted into Phoenix when the node IP address
is not reachable over the network. In that case, apply the CVM IP address to the Phoenix instance
using IPMI, following the procedure in KB 5346, before running the collector.

Procedure
From any CVM, run the collector command.
nutanix@cvm$ python /home/nutanix/cluster/bin/lcm/lcm_log_collector.py
LCM creates a log bundle in the /home/nutanix directory on the node where you ran the script.
The log bundle format is lcm_logs_lcm-leader_timestamp.tar.gz.



7
LCM GLOSSARY
Glossary
Glossary of LCM-related terms.

Hardware Terms
BMC
baseboard management controller, the microcontroller that manages the motherboard.
BIOS
basic input/output system, the firmware that initializes the motherboard hardware and
provides runtime services on startup.
HBA
host bus adapter, a device that manages communication between storage media and other
system components.
SATA DOM
SATA disk on a module, the hypervisor boot drive for Nutanix platforms up to and including
G5.
M.2
a small-form-factor expansion slot for storage devices (formerly known as NGFF), used for
the hypervisor boot drive on Nutanix platforms G6 and later.

Dell-Specific Terms
iDRAC
integrated Dell remote access controller, a software tool that lets you administer a server
without needing physical access.
iSM
iDRAC service module, a module that integrates iDRAC with an operating system.
PTAgent
Power Tools agent, the software entity responsible for configuring the iSM.
Firmware Entities
A collective term for all Dell firmware not related to iDRAC.

Lenovo-Specific Terms
IMM2
integrated management module 2, a service processor that manages the motherboard
and allows remote access for server administration. Present on Lenovo generation-5
platforms; replaced by XCC in later generations.



UEFI
unified extensible firmware interface, the firmware that initializes the motherboard hardware
and provides runtime services on startup. A replacement for the BIOS.
XCC
XClarity controller, a service processor that manages the motherboard and allows remote
access for server administration. Present on Lenovo platforms that use the Intel "Purley"
motherboard, replacing the IMM2 from earlier generations.



COPYRIGHT
Copyright 2020 Nutanix, Inc.
Nutanix, Inc.
1740 Technology Drive, Suite 150
San Jose, CA 95110
All rights reserved. This product is protected by U.S. and international copyright and intellectual
property laws. Nutanix and the Nutanix logo are registered trademarks of Nutanix, Inc. in the
United States and/or other jurisdictions. All other brand and product names mentioned herein
are for identification purposes only and may be trademarks of their respective holders.

License
The provision of this software to you does not grant any licenses or other rights under any
Microsoft patents with respect to anything other than the file server implementation portion of
the binaries for this software, including no licenses or any other rights in any hardware or any
devices or software that are used to communicate with or in connection with this software.

Conventions
Convention Description

variable_value
The action depends on a value that is unique to your environment.
ncli> command
The commands are executed in the Nutanix nCLI.
user@host$ command
The commands are executed as a non-privileged user (such as nutanix) in the system
shell.
root@host# command
The commands are executed as the root user in the vSphere or Acropolis host shell.
> command
The commands are executed in the Hyper-V host shell.
output
The information is displayed as output from a command or in a log file.

Default Cluster Credentials


Interface                       Target                                         Username       Password
Nutanix web console             Nutanix Controller VM                          admin          Nutanix/4u
vSphere Web Client              ESXi host                                      root           nutanix/4u
vSphere Client                  ESXi host                                      root           nutanix/4u
SSH client or console           ESXi host                                      root           nutanix/4u
SSH client or console           AHV host                                       root           nutanix/4u
SSH client or console           Hyper-V host                                   Administrator  nutanix/4u
SSH client                      Nutanix Controller VM                          nutanix        nutanix/4u
SSH client                      Nutanix Controller VM                          admin          Nutanix/4u
IPMI web interface or ipmitool  Nutanix node                                   ADMIN          ADMIN
SSH client or console           Acropolis OpenStack Services VM (Nutanix OVM)  root           admin
SSH client or console           Xtract VM                                      nutanix        nutanix/4u
SSH client or console           Xplorer VM                                     nutanix        nutanix/4u

Version
Last modified: September 1, 2020 (2020-09-01T12:55:05-07:00)
