
Metro Availability

Nutanix Best Practices

Version 3.5 • June 2020 • BP-2009



Copyright
Copyright 2020 Nutanix, Inc.
Nutanix, Inc.
1740 Technology Drive, Suite 150
San Jose, CA 95110
All rights reserved. This product is protected by U.S. and international copyright and intellectual
property laws.
Nutanix is a trademark of Nutanix, Inc. in the United States and/or other jurisdictions. All other
marks and names mentioned herein may be trademarks of their respective companies.


Contents

1. Executive Summary.................................................................................5

2. Introduction.............................................................................................. 7
2.1. Audience.........................................................................................................................7
2.2. Purpose.......................................................................................................................... 7

3. Nutanix Enterprise Cloud Overview...................................................... 8


3.1. Nutanix HCI Architecture............................................................................................... 9
3.2. Metro Availability Overview............................................................................................9

4. VMware Component Overview............................................................. 19


4.1. VMware High Availability (HA) Cluster........................................................................ 19
4.2. VMware Distributed Resource Scheduler (DRS).........................................................20
4.3. VMware vMotion.......................................................................................................... 21
4.4. VMware vCenter Server...............................................................................................22

5. Operational Scenarios...........................................................................24
5.1. Establishing Metro Availability..................................................................................... 24
5.2. Planned Failover.......................................................................................................... 25
5.3. Network Outage Between the Nutanix Clusters.......................................................... 28
5.4. Site Failure................................................................................................................... 31
5.5. Site Recovery...............................................................................................................35
5.6. Metro Availability Witness-Specific Workflows.............................................................38
5.7. Operational Scenarios Summary................................................................................. 41
5.8. Asynchronous Snapshot Workflows.............................................................................43

6. Metro Availability Best Practices Checklist........................................ 50


6.1. Requirements............................................................................................................... 50
6.2. Interoperability.............................................................................................................. 50
6.3. Limitations.................................................................................................................... 51
6.4. Nutanix Recommendations.......................................................................................... 51
6.5. VMware Recommendations......................................................................................... 52


7. Metro Availability Best Practices (Detail)............................................54


7.1. Nutanix Platform Guidance.......................................................................................... 54
7.2. VMware Guidance........................................................................................................56

8. Utilities and Alerting..............................................................................58


8.1. REST API.....................................................................................................................58
8.2. nCLI Commands.......................................................................................................... 58
8.3. PowerShell Commands................................................................................................59
8.4. Nutanix Cluster Check (NCC)......................................................................................60
8.5. Alerts............................................................................................................................ 60

9. Conclusion..............................................................................................62

Appendix..........................................................................................................................63
About the Author................................................................................................................. 63
About Nutanix...................................................................................................................... 63

List of Figures................................................................................................................ 64

List of Tables.................................................................................................................. 66


1. Executive Summary
IT departments now regularly face demanding service-level agreements (SLAs) that necessitate
building resiliency into all aspects of the datacenter. While traditional solutions provide
redundancy at the hardware layer, their administrators must still manage downtime due to
infrastructure failures, site maintenance, or natural disasters.
Most solutions that exist today are expensive, complex, and built on legacy principles and
infrastructure. As a result, many application workloads are not protected—or are underprotected
—and vulnerable to outage. To help address these concerns, Nutanix has developed a
continuous availability solution called Metro Availability. Metro Availability creates a global file
system namespace across Nutanix clusters and uses synchronous replication. Combining the
Nutanix hyperconverged infrastructure with a continuous availability solution limits downtime
and preserves all data, even during a complete site failure. Further, Metro Availability enables
workload mobility for disaster avoidance or planned maintenance scenarios.
Metro Availability allows administrators to leverage hypervisor clustering technologies across
datacenters. We call this type of configuration a stretched cluster, and it helps to minimize
downtime during unplanned outages. Metro Availability also supports the migration of virtual
machines (VMs) across sites, using technologies such as vMotion, which means that you have
zero downtime while transitioning workloads between datacenters.
With Metro Availability, Nutanix now delivers an entire spectrum of solutions to provide the
data and application protection you need to fulfill your SLAs and meet your recovery point and
recovery time objectives (RPOs and RTOs). Nutanix has built safeguards into the platform for
everything from minor events, such as individual VM deletion, to major ones, including unplanned
datacenter failure. Setup and management are simple, intuitive, and controlled from the Prism UI.
With Prism, enterprises have, for the first time, a consumer-grade management experience for
handling disaster recovery and high availability.
The following diagram shows the Nutanix features as they align with RPO and RTO requirements
for minor or large incidents.


Figure 1: Nutanix Data Protection Spectrum, Including Metro Availability


2. Introduction

2.1. Audience
We wrote this best practices document for those responsible for architecting, designing,
managing, and supporting Nutanix infrastructures. Readers should already be familiar with
VMware vSphere and the Nutanix enterprise cloud software, which includes Acropolis and Prism.
We have organized this document to address key items for enabling successful design,
implementation, and transition to operation.

2.2. Purpose
This document presents an overview of Nutanix and the Metro Availability feature, and we
discuss deployment considerations and general best practices around Metro Availability
functionality in a VMware vSphere environment. At the conclusion of this document, the reader
should be comfortable architecting and deploying a Metro Availability–based solution on Nutanix.

Table 1: Document Version History

Version Number    Published         Notes
1.0               February 2015     Original publication.
2.0               April 2016        Updated for AOS 4.6.
3.0               January 2017      Updated for AOS 5.0.
3.1               March 2018        Updated platform overview.
3.2               May 2018          Updated Nutanix overview and deduplication recommendations.
3.3               September 2018    Updated container sizing and network configuration guidance.
3.4               August 2019       Updated for AOS 5.11.
3.5               June 2020         Updated for AOS 5.15 and AOS 5.17.


3. Nutanix Enterprise Cloud Overview


Nutanix delivers a web-scale, hyperconverged infrastructure solution purpose-built for
virtualization as well as containerized and private cloud environments. This solution brings the
scale, resilience, and economic benefits of web-scale architecture to the enterprise through the
Nutanix enterprise cloud platform, which combines the core HCI product families—Nutanix AOS
and Nutanix Prism management—along with other software products that automate, secure, and
back up cost-optimized infrastructure.
Available attributes of the Nutanix enterprise cloud OS stack include:
• Optimized for storage and compute resources.
• Machine learning to plan for and adapt to changing conditions automatically.
• Intrinsic security features and functions for data protection and cyberthreat defense.
• Self-healing to tolerate and adjust to component failures.
• API-based automation and rich analytics.
• Simplified one-click upgrades and software life cycle management.
• Native file services for user and application data.
• Native backup and disaster recovery solutions.
• Powerful and feature-rich virtualization.
• Flexible virtual networking for visualization, automation, and security.
• Cloud automation and life cycle management.
The Nutanix platform can be broken down into three main components: an HCI-based
distributed storage fabric, management and operational intelligence from Prism, and AHV
virtualization. Nutanix Prism furnishes one-click infrastructure management for
virtual environments running on AOS. AOS is hypervisor agnostic, supporting two third-party
hypervisors—VMware ESXi and Microsoft Hyper-V—in addition to the native Nutanix hypervisor,
AHV.


Figure 2: Nutanix Enterprise Cloud OS Stack

3.1. Nutanix HCI Architecture


Nutanix does not rely on traditional SAN or network-attached storage (NAS) or expensive storage
network interconnects. It combines highly dense storage and server compute (CPU and RAM)
into a single platform building block. Each building block delivers a unified, scale-out, shared-
nothing architecture with no single points of failure.
The Nutanix solution requires no SAN constructs, such as LUNs, RAID groups, or expensive
storage switches. All storage management is VM-centric, and I/O is optimized at the VM virtual
disk level. The software solution runs on nodes from a variety of manufacturers that are either
entirely solid-state storage with NVMe for optimal performance or a hybrid combination of SSD
and HDD storage that balances performance with additional capacity. The
storage fabric automatically tiers data across the cluster to different classes of storage devices
using intelligent data placement algorithms. For best performance, algorithms make sure the
most frequently used data is available in memory or in flash on the node local to the VM.
To learn more about Nutanix enterprise cloud software, visit the Nutanix Bible and Nutanix.com.

3.2. Metro Availability Overview


Metro Availability is part of a suite of data protection features offered on the Nutanix platform. A
continuous availability solution, Metro Availability provides a global file system namespace across
a container “stretched” between Nutanix clusters. Synchronous storage replication supports
the stretched container across independent Nutanix clusters, using the “protection domain” and
“remote site” constructs. The platform enables synchronous replication at the container level and
replicates all VMs and files stored in that container synchronously to another Nutanix cluster.
Each protection domain maps to one container. Administrators can create multiple protection
domains to enable different policies, including bidirectional replication, where each Nutanix
cluster replicates synchronously to one or more clusters. Containers have two primary roles
while enabled for Metro Availability: active and standby. As shown in the following figure, active
containers replicate data synchronously to standby containers.
Administrators create a protection domain by specifying an active container on the local cluster
and a standby container of the same name on the remote cluster. The active and standby
containers mount to their respective hypervisor hosts using the same datastore name, which
effectively spans the datastore across both clusters and sites. With a datastore stretched across
both Nutanix clusters, you can create a single hypervisor cluster and use common clustering
features, like VMware vMotion and VMware High Availability, to manage the environment. Hosts
presenting standby containers can run VMs targeted for the standby container; however, standby
containers are not available for direct VM traffic. The system forwards all I/O targeted for a
standby container to the active site.


Figure 3: Metro Availability Overview

The Nutanix platform supports Metro Availability in conjunction with other data management
features, including compression, deduplication, and tiering. Metro Availability also allows you
to enable compression for the synchronous replication traffic between the Nutanix clusters.
When you configure the remote site, enabling replication traffic compression reduces the total
bandwidth required to maintain the synchronous relationship.

Active Containers
Active containers are fully readable and writable and process all VM I/O operations. Local read
operations for a VM running against an active container function similarly to non-Metro
environments. All local write operations, including both random and sequential writes, go
through the oplog of the CVM with Metro Availability enabled. When local writes occur, data
replicates locally based on the container’s replication factor. In parallel, the system sends remote
writes to the oplog of a CVM in the cluster maintaining the standby container. The standby
container’s replication factor determines the number of writes required to complete the remote
write operation against the peer cluster. After all replication factor writes are processed for both
the active and standby containers, the VM receives acknowledgement that the write is complete.
The following diagram depicts this process.

Figure 4: VM Write to an Active Container with Replication Factor 2 at Both Sites

Standby Containers
Standby containers receive synchronous write updates from the active container. Standby
containers mount to the appropriate hypervisor instances in the Nutanix cluster, so you can run
VMs locally to the hypervisor nodes that own the container and datastore. While the standby
container appears available, it does not process VM I/O directly. The system forwards all VM
read or write I/O targeted for a standby container to a CVM in the cluster that owns the active
container before returning data or acknowledgement to the VM. The following figures depict the
behavior of these remote read and write operations.


Figure 5: VM Write to a Standby Container with Replication Factor 2 at Both Sites


Figure 6: VM Read from a Standby Container

Metro Availability Witness


The Metro Availability witness allows Nutanix clusters to automate management of storage
and replication states, including the promotion of standby protection domains, so VMs can
automatically fail over during site failures. The witness is a VM you can install in both Nutanix
and non-Nutanix environments. It resides in a site and failure domain separate from the Nutanix
clusters forming the Metro relationship, enabling failover decisions that avoid conditions that
could lead to a split-brain scenario, where storage is active on both sites.
Nutanix clusters register with the witness, and each protection domain represents a unique
relationship. This design permits multiple sets of clusters to register with one witness and
supports both active and standby protection domains. The Nutanix clusters must be running
a minimum of AOS 5.0 to use the witness-based functionality. The following sections provide
additional information about this feature.


Metro Availability Replication States


When you first enable Metro Availability, a full copy of the data residing in the active container
replicates between the clusters. Nutanix uses a snapshot for this initial full copy and maintains
it as a reference point. The relationship is in a synchronizing state until the copy completes. Any
writes that occur following the initial snapshot replicate synchronously. Once fully synchronized,
the protection domain enters an “enabled” state, which indicates that the standby container is
consistent.

Failure Handling
A network failure between Nutanix clusters in a Metro relationship first halts any writes
against the active container, and the replication state reports “remote unreachable.” A standby
container becomes unavailable and inactive to the hypervisor host while the active container is
unreachable. The failure handling setting offers three options for disabling Metro Availability to
allow writes to continue against the active container: manual, automatic resume, or witness.
With failure handling set to manual, VM writes do not resume until either network connectivity
is restored between the clusters or an administrator manually disables Metro Availability. Use
this option in environments that require strict synchronous replication at all times between sites.
If network connectivity is restored, the replication state immediately returns to enabled and
synchronous replication continues between the clusters; otherwise, writes are held indefinitely.
If an administrator manually disables replication, a “disabled” state is reported and writes
resume, but they only propagate to the active container. Following a manual disable operation,
replication to the standby container does not resume until network connectivity is restored and
the administrator manually performs a reenable operation.
With failure handling set to automatic resume, if network connectivity is restored within the
timeout specified under the automatic replication break setting, the replication state immediately
returns to enabled and writes continue to replicate synchronously between the clusters. The
default automatic replication break timeout is 10 seconds. If the timeout period expires, the
system automatically disables Metro Availability and writes resume, but they only propagate to
the active container. Using this option, temporary network outages between sites do not impact
applications running against the active container. Following the automatic replication break,
replication to the standby container does not resume until network connectivity is restored and
the administrator manually performs a reenable operation.
The Metro Availability witness provides an automatic mechanism for disabling Metro Availability
for the primary protection domain and promoting the standby protection domain. With failure
handling set to witness, when network connectivity between the Nutanix clusters is interrupted,
both sites attempt to obtain a lock against the witness VM for each protection domain. A cluster
attempts to obtain the witness lock against primary protection domains after 10 seconds and
against standby protection domains after 120 seconds. If the primary protection domain obtains
the lock, the system automatically disables Metro Availability and writes resume, but they only
propagate to the active container. If the primary protection domain fails to obtain the lock, all I/O
is held for its respective container until either network connectivity is restored or an administrator
disables the protection domain. If the standby protection domain obtains the lock, the system
automatically promotes it to active. If the standby protection domain fails to get the lock, it
becomes unavailable and inactive to the hypervisor host until either network connectivity is
restored or an administrator manually performs the promotion. The Operational Scenarios section
contains more details regarding automatic failure handling when using the witness.
Once you have set the failure handling option during the protection domain creation process, you
can modify it from the Nutanix command line interface (nCLI) or the Prism UI. The Utilities and
Alerting section provides an example of how to modify this setting using the nCLI.
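As a quick illustration, switching a protection domain to automatic resume mode from the nCLI
might look like the following. This sketch reuses the update-failure-handling syntax shown later in
the Primary Site and Witness Loss section, so verify the exact parameter names and values
against the nCLI reference for your AOS release.
ncli pd update-failure-handling name=<PDName> failure-handling=Automatic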

Reenable Replication
When replication has been disabled, either manually by an administrator or automatically based
on the failure handling setting, once network connectivity is restored, you can issue a reenable
command from the original active cluster to resume replication. When you select “reenable,” the
system creates a snapshot that references the last snapshot taken as a part of a previous enable
or reenable operation. While enabled, Metro Availability takes a snapshot automatically (once
every four hours or once every six hours with AOS 5.17 and later) to use as a reference point.
The data that represents the differences between the current and the last reference snapshot
then replicates to the standby container. The amount of data replicated may be more than the
incremental amount of change since the network or site outage and could even represent all data
if the containers were empty at the time of the last reference snapshot.
You can also perform the reenable option against the standby cluster to carry out planned
failovers or to reestablish Metro Availability following an unplanned event. The direction of
replication after choosing reenable depends on the cluster from which you selected it: Whichever
cluster you use to issue the reenable command becomes the active site and the source for
replication. If you use the standby cluster to reenable Metro Availability, replication for the
container is essentially reversed; the previous standby container is now active, and the original
active container is now standby.
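As an illustrative sketch only, resuming replication from the cluster that should become (or
remain) the active site might look like the following; the metro-avail-enable subcommand and
re-enable option are assumptions about the nCLI syntax, so confirm them in the nCLI Commands
section or the Prism Web Console Guide before use.
ncli pd metro-avail-enable name=<PDName> re-enable=true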
The following flowcharts provide an overview of the replication states from the perspective of the
active protection domain in the next figure, and the standby protection domain in the subsequent
figure. Please note that these flowcharts do not include details about deleting a protection
domain, which is an additional option and affects states. The Operational Scenarios section
contains additional information on which steps to perform in each event.


Figure 7: Active Protection Domain Flowchart


Figure 8: Standby Protection Domain Flowchart


4. VMware Component Overview


To deploy and operate a Metro Availability–based stretch cluster successfully, you must
understand specific VMware components. The following VMware components impact the Metro
Availability solution and are integral to a robust environment.

4.1. VMware High Availability (HA) Cluster


One of the goals of Metro Availability is to simplify VM recovery management. Because Metro
Availability presents a stretched datastore between two Nutanix clusters, you can form a single
VMware cluster. With that single stretched VMware cluster, you can use VMware HA to provide
speed and automation to VM recovery.
Each host in a VMware HA cluster operates as either a leader host or a worker host. A cluster
has one leader host, with all other hosts operating as workers. The leader performs several
functions, including monitoring worker hosts, monitoring the power state of protected VMs,
keeping an inventory of cluster hosts and protected VMs, and communicating cluster health state
to the vCenter Server.
When a server fails, the VMware HA leader automates restarting the affected VMs onto other
servers in the cluster. By default, VMware HA attempts to restart VMs five times over roughly
a 30-minute time span. When you use it in combination with the VMware Distributed Resource
Scheduler (DRS) affinity rules, you can configure VMware HA to target nodes in the cluster onto
which specific VMs should restart. VMware DRS, in combination with VMware HA, controls which
servers and, by extension, which Nutanix cluster, takes ownership of a given set of VMs.

VMware HA Host Isolation


VMware HA uses a process called host isolation to respond to the loss of network connectivity
between nodes in a cluster. Host isolation occurs when hosts are still running but can neither
communicate with other hosts nor ping the configured isolation addresses. Additionally, VMware
HA uses datastore heartbeating to help determine the nature of the host failure. If a host has
stopped issuing heartbeats to the configured datastores, the system considers it to have failed,
and it restarts its VMs on other hosts in the cluster.
When the isolated host remains online, there are three response options you can use to manage
its VM states:
• Disabled or leave powered on (depending on vSphere version): This is the default response; it
leaves the VMs running on an isolated host powered on.
• Power off: Powers off all VMs on the isolated host; a hard stop.
• Shut down: Shuts down all VMs on the isolated host, assuming VMware Tools is installed. If a
shutdown is not successful, a power-off command is issued within five minutes.

VMware HA Compatibility List (Matrix)


VMware HA monitors which hosts in a cluster have the required resources (such as network port
group or datastore) for restarting VMs. This feature is sometimes referred to as the VMware HA
compatibility list or compatibility matrix. VMware HA only attempts to restart VMs against hosts in
the cluster that contain the required network and storage resources.
When you are using Metro Availability and a network partition or site failure occurs, the VMware
HA cluster sees standby containers go into an inactive state. This inactive state prevents VMware
HA from attempting to restart VMs against the standby containers. VMware HA continues to
check the state of the containers every minute to update the compatibility list. When the standby
containers are promoted, they become locally read and write capable and no longer report as
inactive. VMware HA updates its compatibility list and then attempts to restart the VMs against
the now-active container. VMware HA retry counts and retry intervals are then in effect.

VMware HA Admission Control


Admission control ensures that sufficient resources exist in the VMware cluster to support VM
resource reservations. In the context of Metro Availability, you can use admission control in
coordination with VMware HA to ensure that enough resources exist between the two Nutanix
clusters to allow VMs to restart in the event of a site failure. Admission control helps guarantee
that all VMs can restart against a single Nutanix cluster in the remote site.

4.2. VMware Distributed Resource Scheduler (DRS)


VMware DRS helps balance resource utilization across nodes in a cluster. VMware DRS
monitors the CPU and memory resources for all hosts in the cluster and can recommend or
automatically run VM migrations for load balancing. In Metro Availability configurations it may be
undesirable for VMs to automatically move to nodes in a cluster that own a datastore in a standby
state. You can create and use VMware DRS affinity rules to ensure that VMs do not move to
specific servers for either load balancing or VMware HA restart purposes.

DRS Affinity Rules


DRS affinity rules control the placement of VMs on hosts in a cluster. DRS supports affinity or
antiaffinity between groups of VMs and hosts, or between individual VMs. Affinity between VMs
prompts DRS to attempt to keep the specified VMs running on the same hosts. Antiaffinity for
VMs prompts DRS to attempt to keep VMs running on different hosts.
Affinity between a group of VMs and a group of hosts allows DRS to place members of a VM
DRS group on members of a host DRS group. Affinity is enforced based on “must run on” or
“should run on” rules. Must run on rules force VMs to always reside on a member of a specific
host group. Should run on rules attempt to place VMs on members of the specified host DRS
group, but they can be overridden based on certain conditions, including host failure.
Antiaffinity between a group of VMs and a group of hosts relies on “must not run on” and “should
not run on” rules. Must not run on rules prevent VMs from ever running on members of a
specified host group. Should not run on rules attempt to prevent VMs from running on members
of a specified host group, but can be overridden based on certain conditions, including host
failure.
Metro Availability offers flexibility when deciding whether to use must or should affinity or
antiaffinity rules. Using should rules can help automate restarting VMs across Nutanix clusters
during a site failure. If you don’t want automated restart across Nutanix clusters, you can use
must rules instead.
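As an example of the should-rule approach, the following PowerCLI sketch creates a host group
for the nodes of the Nutanix cluster that owns the active container, a VM group for the VMs that
belong on that site, and a "should run on" rule tying them together. The cluster, group, host, and
VM names are placeholders, and the New-DrsClusterGroup and New-DrsVMHostRule cmdlets
assume a recent PowerCLI release; adapt the sketch to your environment.
# Host group for the nodes of the Nutanix cluster that owns the active container
New-DrsClusterGroup -Name "SiteA-Hosts" -Cluster "Metro-Cluster" -VMHost "esx-a1","esx-a2","esx-a3"
# VM group for the VMs that should normally run against the active container
New-DrsClusterGroup -Name "SiteA-VMs" -Cluster "Metro-Cluster" -VM "app-vm-01","app-vm-02"
# "Should run on" rule keeps these VMs on Site A hosts but still allows HA restart on Site B
New-DrsVMHostRule -Name "SiteA-VMs-to-SiteA-Hosts" -Cluster "Metro-Cluster" -VMGroup "SiteA-VMs" -VMHostGroup "SiteA-Hosts" -Type "ShouldRunOn"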

4.3. VMware vMotion


VMware vMotion allows the live migration of VMs between physical servers, enabling such tasks
as performance optimization or hardware maintenance without disruption. With Metro Availability,
you can live-migrate a VM between a Nutanix cluster with an active container and a Nutanix
cluster with the corresponding standby container. vMotion with Metro Availability can, without
interruption, help relieve memory or CPU pressure against ESXi hosts in the active cluster. You
can also use it to prepare for site maintenance or, in some cases, disaster avoidance. While
running against the standby container, the VM’s I/O redirects to the active site, as described in
the Standby Containers section. Should you need to service VM I/O locally following a vMotion,
you must promote the standby container. The Operational Scenarios section offers more detail
concerning the use of vMotion with Metro Availability. See the following figure for a representation
of how Metro Availability operates in conjunction with vMotion.


Figure 9: Metro Availability Overview with vMotion

4.4. VMware vCenter Server


VMware vCenter Server is the centralized management application for VMware environments.
VMware vCenter enables the creation of clusters, the use of vMotion, and the configuration of
VMware HA and DRS. Metro Availability typically involves a single VMware cluster; a single
vCenter Server instance then manages that cluster. Ensuring that the vCenter Server is highly
available is critical in stretched cluster environments. While VMware HA is not dependent on the
vCenter Server once it has been enabled, operations such as vMotion and changes to DRS rules
are not available when the vCenter Server is offline.


An ideal Metro Availability configuration places the vCenter Server into a fault domain (generally
a third site) separate from the active and standby Nutanix clusters. This structure allows
environment management if either the active or standby site fails.
Alternatively, you can place the vCenter Server in a container used for Metro Availability and
replicated to the standby site. In such a configuration, you can protect and recover the vCenter
Server with VMware HA or as a part of disaster recovery procedures between the Nutanix
clusters.


5. Operational Scenarios
Metro Availability provides several methods for managing replication states, including the Prism
UI, REST API, the nCLI, and PowerShell. The following sections outline nondisruptive planned
workload migration, manual unplanned failover, and automated unplanned failover. The Prism
Web Console Guide contains additional details on how to manage Metro Availability from Prism.

5.1. Establishing Metro Availability


Initial configuration of Metro Availability involves creating remote sites and containers in both
Nutanix clusters. The containers created in each Nutanix cluster mount to their respective ESXi
hosts. The containers must have identical names between the clusters, and the round-trip
time (RTT) latency between the remote sites cannot exceed 5 ms. Administrators then create
protection domains for each container targeted for replication.

Figure 10: Prism: Protection Domain Example
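Before enabling Metro Availability, it is worth confirming that the inter-site link meets the 5 ms
RTT requirement. As a simple sketch, a standard ping from a CVM at one site to a CVM or cluster
virtual IP at the other site reports round-trip times; the address shown is a placeholder, and a
dedicated network assessment gives a more complete picture of latency under load.
ping -c 20 <remote-site-CVM-or-cluster-IP>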

For scenarios including the Metro Availability witness, you must install the witness in a
failure domain separate from the Nutanix clusters, then register the witness with the clusters
participating in the Metro relationship.
An administrator can form a single VMware cluster that contains all the nodes from both Nutanix
clusters. You can configure VMware HA and DRS rules to manage VM placement policies.


Nutanix recommends keeping VMs running locally to active containers. You can use VMware
DRS affinity rules to ensure that VMs are running against cluster nodes that own the active
containers. Should run on rules keep VMs running against a specific group of hosts but also
allow failover if a site fails. You can use must run on rules to prevent VMs from restarting against
certain cluster nodes even in the event of a site failure.
We provide specific steps for creating containers, remote sites, and Metro Availability protection
domains in the Prism Web Console Guide. The following figure summarizes the general steps.

Figure 11: Initial Metro Availability Configuration

5.2. Planned Failover


Planned failover involves moving VM operations between sites for disaster recovery testing,
planned maintenance, and disaster avoidance. There are two general methods for planned
failover with Metro Availability: either vMotion or a cold migration of VMs. When using vMotion,
you can migrate VMs nondisruptively between Nutanix clusters. This capability includes
nondisruptively promoting standby containers and reenabling replication.
As a part of the overall procedure, reenabling replication can take place either following the
promotion of the containers or not at all for certain test scenarios. Reenabling Metro Availability
from the standby site acts as a reversal of the replication relationship. The process marks the
original active container as standby and overwrites its contents with any new writes that occur.
When compared to a starting state, as shown in the Metro Availability Overview figure above, the
end state looks like the following figure.


Figure 12: Planned Failover with Replication Reversal

Reenabling replication following a promotion provides the shortest window for resuming the
synchronous relationship. Waiting to reenable replication allows you to validate VM operations
prior to marking the previous active container as standby. Additionally, if the scenario is simply to
test VM failover and not persist the data in the standby site, you can skip reenabling replication
from the standby site.

Planned Failover with vMotion


Using vMotion to move VMs to the standby site provides nondisruptive workload migration during
planned failovers. Operationally, using vMotion takes more time than a cold migration, as the VM
memory state must be transferred across sites.


Planned failover requires a force-promote of the targeted standby container. The force-promote
allows a standby container to become active and enables reads and writes to occur against the
local cluster. All VMs running against a given container must migrate to the secondary cluster
prior to promoting the standby container. The force-promote is nondisruptive to VMs running in
the standby container. When you force-promote a standby container, the current active container
goes into a decoupled state, making the datastore read-only.
Following promotion, if failure handling is configured for manual or automatic resume, the
formerly active site needs to be disabled. Following the disable operation, reenabling Metro
Availability from the standby site reverses the replication relationship. With AOS 5.11 and later,
witness failure handling does not require you to disable the formerly active site. While the
formerly active site is decoupled, you can reenable it directly from the promoted standby site.
The original active container is marked as standby and any new writes that occur overwrite its
contents.
The next figure outlines the planned failover process with vMotion when you have configured the
failure handling setting for either manual or automatic resume mode.
If you are using asynchronous snapshots in combination with Metro, you may need to suspend
those schedules prior to issuing a disable command:
ncli pd suspend-schedules name=<PDName>

Figure 13: Nondisruptive Migration with Manual or Automatic Resume Failure Handling

Be sure to resume any suspended schedules when you are done:


ncli pd resume-schedules name=<PDName>

The next figure outlines the planned failover process with vMotion when you have configured
the failure handling setting for witness mode. Manage any asynchronous snapshot schedules as
previously noted.


Figure 14: Nondisruptive Migration with Witness Failure Handling
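To tie the preceding figures together, the following is a rough nCLI sketch of a planned failover
with vMotion when failure handling is set to manual or automatic resume. The promote-to-active,
metro-avail-disable, and metro-avail-enable subcommand names are assumptions based on
common AOS releases and are not taken from this document, so confirm them against the nCLI
reference for your environment; Prism exposes the same operations.
# 1. vMotion the VMs in the container to hosts on the standby site (via vCenter).
# 2. On the standby site, force-promote the standby protection domain (assumed subcommand name):
ncli pd promote-to-active name=<PDName>
# 3. On the formerly active site (manual or automatic resume modes only), disable Metro Availability:
ncli pd metro-avail-disable name=<PDName>
# 4. On the newly active site, reenable replication, which reverses its direction:
ncli pd metro-avail-enable name=<PDName> re-enable=true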

Planned Failover with Cold Migration


Cold migration provides the fastest operational time for performing a planned failover, as it cuts
out the time that vMotion takes. VMs, however, incur downtime, as cold migration from the active
to the standby site is an offline operation. The following figure outlines the planned failover
process with cold migration.

Tip: When using AOS 5.11 or later with witness failure handling, you don't need to
perform the disable operation during a planned failover.

Figure 15: Planned Failover Using Cold Migration

5.3. Network Outage Between the Nutanix Clusters


Given the converged networking architecture of the Nutanix platform, a loss of communication
between sites generally causes a communication failure between the VMware hosts as well as
between the Nutanix clusters, as shown in the next figure. When VMware host communication is
lost, the VMware cluster is partitioned along site boundaries. VMware HA then runs an election in
the partition that has lost communication with the VMware HA leader, so both sites end up with a
VMware HA leader and the ability to restart VMs they perceive to have failed.
During the network outage, Metro Availability replication becomes degraded (remote
unreachable), and the standby datastores become unavailable and enter an inactive state. The
lack of site communication also causes datastore heartbeating to fail for partitioned servers
from the perspective of the active datastores. With both host communication and datastore
heartbeating failed, the leader VMware HA agent attempts to restart VMs.
Before attempting to restart failed VMs, VMware HA ensures that the cluster has resources
available to support the VMs, such as network port groups and datastores. When standby
datastores have an inactive state, VMware HA can’t attempt to restart VMs against those
containers. VMware HA retry counters don’t apply, as the servers are no longer on the VMware
HA compatibility list. Thus, VMware HA continually looks to update its compatibility list so it can
restart the VMs once the standby containers become available again.
Any VM running against a standby container when this kind of network outage occurs loses
access to its datastore (one of the reasons Nutanix recommends running VMs locally to their
active containers). You can restart these failed VMs on the opposite site, against the active
container, as a part of the VMware HA failure detection and VM restart process.
While VMware HA does not require VMware vCenter to be available to restart VMs, certain
management operations, such as modifying VMware HA or DRS settings (including affinity rules),
are affected when the vCenter Server is unavailable. Depending on the location of the vCenter
Server instance, options for managing the cluster in a degraded state may be limited.


Figure 16: Network Failure Between Nutanix Clusters

VMs running against active containers continue to operate but the network failure may affect
them, depending on the Metro Availability failure handling setting and the duration of the outage.

Network Failure Handling: Manual Mode


With failure handling set to manual, VM writes to the active container do not resume until either
network connectivity is restored between the clusters or an administrator manually disables Metro
Availability. The benefit of this setting is that it enables strict synchronous replication between the
sites. The downside is that applications within the VMs can time out while writes are held.


Network Failure Handling: Automatic Resume Mode


With failure handling set to automatic resume mode, if network connectivity is restored within the
timeout period (10 seconds by default), the replication state immediately returns to enabled and
writes continue to replicate synchronously between the clusters. If the timeout period expires,
Metro Availability is disabled automatically and writes resume, but they only propagate to the
active container. When Metro Availability has been disabled, replication to the standby container
does not occur until both network connectivity is restored and you manually perform a reenable
operation. The next figure shows the general process.

Figure 17: Network Failure General Workflow

Network Failure Handling: Witness Mode


With failure handling set to witness, if network connectivity is restored within the timeout period
(10 seconds by default), the replication state immediately returns to enabled and writes continue
to replicate synchronously between the clusters. If the timeout period expires, the cluster that
owns the active protection domain attempts to acquire the witness lock. If the active protection
domain successfully acquires the lock, Metro Availability is disabled automatically and writes
resume, but they only propagate to the active container.
During this time, the cluster that owns the standby protection domain also detects the network
interruption. After 120 seconds (by default), it attempts to obtain the witness lock. Because the
active site already acquired the lock, the standby site fails to obtain it and becomes inactive.
When witness coordination has automatically disabled Metro Availability as described above,
replication to the standby container stops until network connectivity is restored and you manually
perform a reenable operation. The process is similar to that shown in the previous diagram.

5.4. Site Failure


The surviving site perceives site failure similarly to a complete network outage. Both
communications and datastore heartbeating fail, and the vSphere HA agent reports a “host failed”
state (see the following figure). If required, VMware HA reelects an HA leader that attempts to
restart VMs that are offline because of the site outage.
Metro Availability replication enters a degraded state (remote unreachable), and the standby
datastores become unavailable and enter an inactive state. The inactive state prevents VMware
HA from attempting to restart VMs against those containers. VMware HA retry counters do not
apply, as the servers are no longer on the VMware HA compatibility list. This means that VMware
HA continually looks to update its compatibility list so it can restart the VMs when the standby
containers become available again.
When the remote site has failed, you can promote any standby containers in the surviving
site. Once promoted, the containers become active again to the VMware HA cluster. VMware
HA updates its compatibility list and powers on VMs that reside in that container. VMware HA
automatically overrides VMware DRS should run on affinity rules, and VMs governed by those
rules can restart in the surviving site. Must run on affinity rules are enforced, and you must
update them to allow VMs covered by those rules to restart in the surviving site.
Any VM running against a standby container when this kind of network outage occurs loses
access to its datastore (one of the reasons Nutanix recommends running VMs locally to their
active containers). As the opposite site has failed, these VMs can only resume when the standby
container is promoted.
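If you manage this step from the nCLI rather than Prism, promoting the surviving site's standby
protection domain might look like the following sketch; the promote-to-active subcommand name
is an assumption to verify against the nCLI reference for your AOS release.
ncli pd promote-to-active name=<PDName>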
While VMware HA doesn’t require VMware vCenter to be available to restart VMs, certain
management operations, such as the modification of VMware HA or DRS settings (including
affinity rules), are affected when the vCenter Server is unavailable. Depending on the location of
the vCenter Server instance, options for managing the cluster in a degraded state may be limited.


Figure 18: Site Failure

VMs running against active containers continue to operate, but the site failure may affect them
if the Metro Availability failure handling setting is set to manual, as we discussed in an earlier
section.

Site Failure Handling: Automatic Resume Mode


With failure handling set to automatic resume mode, site loss causes the break replication
timeout to expire in the remaining cluster. Metro Availability is disabled automatically for active
protection domains and writes resume against those containers. Containers in standby protection
domains become inactive and must be manually promoted to allow VMs to restart. Any DRS
rules must be manually updated as required, either to allow VM restart if using must rules or to
prevent migration attempts if the failed site and cluster return to service. The next figure shows
the general process.

Figure 19: Site Failure General Workflow

Site Failure Handling: Witness Mode


With failure handling set to witness mode, the surviving cluster obtains the witness lock for the
active and standby protection domains. Active protection domains are disabled automatically,
which allows VM I/O to resume to those containers. Standby protection domains are
automatically promoted, making those containers active for I/O. VMware HA then automatically
restarts VMs. You must manually update any DRS rules as required to either allow VM restart if
using must rules, or prevent migration attempts if the failed site and cluster return to service. The
process is similar to that shown in the previous diagram.

Site Failure Handling: Witness Mode, Storage Only


A storage-only outage is defined as a condition where the ESXi hosts are up but have lost
connectivity to the storage. Because the hosts are up, an HA event is not triggered. In the event
of a storage-only outage on the active site, the witness promotes the storage on the standby site.
This causes the storage on the active site to go into a read-only state because the protection
domain is decoupled.
Metro (with a witness) relies on a VMware HA event to power up the VMs on the standby
site. VM Component Protection (VMCP) helps protect vSphere environments from storage
connectivity loss. When a host loses a storage device, VMCP marks it in one of the following
states:
• PDL (Permanent Device Loss): A device is marked as permanently lost if the storage array
responds with a SCSI sense code marking the device as unavailable.
• APD (All Paths Down): If the PDL SCSI sense code is not returned from a device, the device is
marked as all paths down (APD), and the ESXi host continues to send I/O requests until it
receives a response.


Starting with the AOS 5.15 release, Metro uses the VMCP APD implementation to handle storage-
only failures. In the case of a storage-only failure on the primary site, Metro Availability detects the
APD condition and automatically fails over the VMs on the affected site to the secondary site
after the storage is promoted on the secondary site. This scenario makes the storage on the
primary (affected) site unavailable for reads or writes.

Tip: You must enable APD on the ESXi hosts for automatic VM failover to work in
a Metro Availability configuration. For more information on how to configure VMCP,
refer to VM Component Protection in VMware’s documentation.
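One way to check whether APD handling is enabled at the host level is to inspect the
Misc.APDHandlingEnable advanced setting on each ESXi host, as in the following sketch (a value
of 1 means enabled). The VMCP response itself, such as powering off and restarting VMs on
APD, is configured in the cluster's vSphere HA settings rather than through esxcli.
esxcli system settings advanced list -o /Misc/APDHandlingEnable
# Enable APD handling if the current value is 0:
esxcli system settings advanced set -o /Misc/APDHandlingEnable -i 1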

5.5. Site Recovery


Site recovery involves returning the environment to its original state by recovering the cluster
and reenabling replication between the sites. The exact process depends on whether you are
recovering an existing configuration or forming a new cluster pairing.

Original Cluster Pairing


In some site failure scenarios, like power loss, you can recover the original cluster and restart
the Metro Availability environment using the previous remote site and cluster pairings. The
procedures outlined in this section assume you have recovered the original cluster in the failed
site with the original configuration and data that was available at the time of the failure (see the
following figure). This section also assumes that you promoted all standby containers in the
surviving site, as described in the Site Failure section.
When you recover a cluster following an outage, Metro Availability validates the state of the
protection domain to see if the remote site version has diverged. When a standby container
is promoted while the active container and cluster are offline, the protection domain on the
recovered cluster can report as either disabled or decoupled on recovery. An active container
reports as disabled if the failure handling setting, when set to automatic, triggered a disable
command before cluster failure. If the relationship was not automatically disabled prior to failure,
the cluster reports a decoupled state. When the Metro relationship reports as disabled, the
container can read and write to the VMware cluster. When the Metro relationship reports as
decoupled, the container is in a read-only state.
The Metro Availability witness helps ensure that protection domains in failed sites are unavailable
upon recovery. Protection domains come back online in either the standby or decoupled state,
preventing VM operation in that cluster.


Figure 20: Initial Cluster State Following Site Recovery with Existing Configuration

Prior to recovering a cluster in a Metro Availability relationship that has been offline, ensure that
DRS settings, including affinity rules, do not cause unwanted VM movement. If you recover a
site and cluster, it is possible for legacy DRS settings to attempt to move VMs to unwanted hosts
that have stale data. To prevent this unwanted movement, set DRS to manual or update rules to
enforce must not run on requirements temporarily while replication is reenabled.
To restart Metro Availability once you have recovered a site, you must disable active protection
domains in the recovered cluster if you use manual or automatic resume mode failure handling.
You can then issue a reenable command. When you use AOS 5.11 or later and witness failure
handling, you do not need to disable the active protection domains in the recovered cluster in
order to reenable them from the promoted site. Ensure that you issue the reenable operation
from the appropriate Nutanix cluster, as the site chosen becomes the sole active copy and source
of replication. For the example given in the previous figure, the reenable command comes from
Site B. The following figure outlines the general process.
If you are using asynchronous snapshots in combination with Metro, you may need to suspend
those schedules before you issue a disable command:
ncli pd suspend-schedules name=<PDName>

Figure 21: Site Recovery with Original Clusters General Workflow

Be sure to resume any suspended schedules when you are done:


ncli pd resume-schedules name=<PDName>

New Cluster Pairing


When you completely lose a remote cluster, you need to form a new remote site and protection
domain relationship. The steps for this process resemble those for creating a new Metro
Availability configuration from scratch. Before you enable the new relationship, remove the
previous Nutanix cluster nodes from the VMware cluster in vCenter and delete any protection
domains and remote sites. Once you have configured the new cluster intended for Metro
Availability, you can establish the new pairing. You can use this same process to relocate a Metro
Availability relationship between new sites and clusters for migration. We recommend following
the general process outlined in the next figure.


Figure 22: Moving Metro Availability to a New Cluster Pairing

5.6. Metro Availability Witness-Specific Workflows


The previous section detailed workflows that are similar whether failure handling is set to manual,
automatic resume, or witness. A few additional scenarios exist that are witness-specific; we
outline these in the following sections.

Witness Failure
In the context of the Metro relationship, the witness is passive; it isn’t required for cluster
availability or replication when Metro is in a healthy state. A witness can fail, or sites can lose
communication with the witness, and Metro replication continues without interruption. Similarly, if
a failed witness returns to operation, you don’t need to take any additional steps to recover the
environment.
The witness only queries lock status and determines whether to allow an automatic disable or
promotion of protection domains when communication between the Nutanix clusters fails. If the
witness is unavailable at this time, the automatic decision making fails, affecting VM availability in
much the same way as manual failure handling.
If the witness is permanently lost, you must either associate the protection domains with a new
witness or control protection domain states manually as needed. With a healthy Metro Availability relationship,
you can change the failure handling of the protection domain from witness to another option,
such as automatic resume, to accomplish this “unwitness” operation.

Primary Site and Witness Loss


Although very unlikely, it is possible to lose both a primary site and access to the witness. This
scenario prevents automatic disabling and promotion of protection domains in the remaining
cluster, as witness communication is unavailable. You can, however, still manage protection
domains in the remaining cluster to allow VMs to continue operation. Metro Availability allows
you to locally unwitness protection domains. A local unwitness operation disassociates protection
domains in that cluster from the existing witness, allowing you to manually disable active
protection domains and promote standby protection domains. Use the nCLI to perform the local
unwitness function:
ncli pd update-failure-handling name=<PDName> failure-handling=Automatic local-only=true

Only perform a local unwitness operation following site loss if you expect the existing witness to
be unavailable for an extended period. The next figure depicts the general process for recovery
where both Site A and the witness site fail.

Figure 23: Primary Site and Witness Loss Recovery
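The following sketch shows the commands behind this workflow, run from the surviving cluster; the protection domain names are placeholders, and whether a given domain needs to be disabled or promoted depends on its role before the failure:
# Locally unwitness the protection domains in the surviving cluster
ncli pd update-failure-handling name=metropd failure-handling=Automatic local-only=true
# Disable any protection domain still marked active in this cluster
ncli pd metro-avail-disable name=metropd-active
# Promote any standby protection domain so its container can serve I/O
ncli pd promote-to-active name=metropd-standby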

Primary Site Complete Network Loss


If the primary site completely loses connectivity, a Nutanix cluster is still running in that site but
can no longer communicate with its peer Metro cluster or the witness. From the perspective
of the opposite site, this situation appears as a complete site loss. In this scenario, using our
previous Site A, Site B, and Site C example, the following occurs:
1. Site A attempts to contact the witness and fails.
2. All active and standby containers in the Site A cluster become unavailable for I/O.
a. Running VMs stop responding.
3. Site B attempts to contact the witness and obtain the lock. This attempt succeeds.
4. In Site B, active protection domains are disabled and standby protection domains are
promoted to active, allowing VMs to recover via HA.
The following diagram captures the resulting state.

Figure 24: Primary Site Complete Network Loss

Recovering from Complete Network Loss in the Primary Site


Once network connectivity is restored, recover the Metro Availability relationship by first shutting
down any VMs hung against the cluster in Site A. A normal power-off operation generally does
not work while the datastore is inactive, but you can use a few options to shut down these
unresponsive VMs: reset the hosts, disable the protection domain (making the container active
so that a normal power-off operation can proceed), or kill the active VM process with esxcli (a
sketch follows the diagram below). Once the VMs are powered off, the vSphere cluster
automatically unregisters them to resolve the conflict. The remaining process is then similar to
recovering from a site loss, as you can see in the following diagram.

Figure 25: Complete Network Loss Recovery
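If you choose the esxcli option, the following sketch shows the standard process commands on the affected ESXi host; the world ID is an example, and you should try a soft kill before escalating to hard or force:
# List running VM processes and note the World ID of the unresponsive VM
esxcli vm process list
# Attempt a soft kill first; escalate to --type=hard or --type=force only if needed
esxcli vm process kill --type=soft --world-id=123456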

Rolling Failure
A rolling failure entails communication loss first between Site A and Site B followed by the
complete loss of Site A. During the initial failure between Site A and Site B, the cluster that owns
the active protection domain attempts to acquire the witness lock. If the active protection domain
successfully acquires the lock, Metro Availability is disabled automatically and I/O resumes.
During this time, the cluster that owns the standby protection domain also detects the network
interruption and, after 120 seconds (by default), attempts to obtain the witness lock. As the
lock was already acquired by the active site, the standby site fails to obtain it, and the standby
container becomes inactive.
The rolling failure continues and Site A is then lost. Because Site B was initially unable to obtain
the witness lock, the standby protection domains remain in an inactive state, unable to serve
VMs. You must locally unwitness any standby protection domains in order to promote them to
active. Once you have promoted these domains, VMs can restart via the HA recovery process.
Perform the local unwitness operation as described in the Primary Site and Witness Loss section.

5.7. Operational Scenarios Summary


The following table gives a quick overview of possible scenarios based on the failure handling
setting. The table assumes a basic starting configuration where Site A contains only active
protection domains and Site B only standby protection domains.

Table 2: Failure Scenario Summary

• Site A outage or complete network failure in Site A
⁃ Witness Mode: Automatic protection domain promotion in Site B. VMs automatically restart in Site B.
⁃ Automatic Resume Mode: An administrator must promote protection domains in Site B for VMs to restart.
⁃ Manual Mode: An administrator must promote protection domains in Site B for VMs to restart.
• Site B outage or complete network failure in Site B
⁃ Witness Mode: VMs continue to run on Site A following an automatic disable of protection domains.
⁃ Automatic Resume Mode: VMs continue to run on Site A following an automatic disable of protection domains.
⁃ Manual Mode: VMs are paused and an administrator must disable protection domains in Site A.
• Connection loss between Site A and Site B
⁃ Witness Mode: VMs continue to run on Site A following an automatic disable of protection domains.
⁃ Automatic Resume Mode: VMs continue to run on Site A following an automatic disable of protection domains.
⁃ Manual Mode: VMs are paused and an administrator must disable protection domains in Site A.
• Witness failure
⁃ Witness Mode: No impact while the Metro relationship is healthy.
⁃ Automatic Resume Mode: N/A
⁃ Manual Mode: N/A
• Connection loss between the witness and Site A
⁃ Witness Mode: No impact while the Metro relationship is healthy.
⁃ Automatic Resume Mode: N/A
⁃ Manual Mode: N/A
• Connection loss between the witness and Site B
⁃ Witness Mode: No impact while the Metro relationship is healthy.
⁃ Automatic Resume Mode: N/A
⁃ Manual Mode: N/A
• Connection loss between the witness and both Site A and Site B
⁃ Witness Mode: No impact while the Metro relationship is healthy.
⁃ Automatic Resume Mode: N/A
⁃ Manual Mode: N/A
• Connection loss between Site A and Site B and between the witness and Site A
⁃ Witness Mode: VMs on Site A are paused. Automatic protection domain promotion in Site B. VMs automatically restart in Site B.
⁃ Automatic Resume Mode: VMs continue to run on Site A following an automatic disable of protection domains.
⁃ Manual Mode: VMs are paused and an administrator must disable protection domains in Site A.
• Connection loss between all sites, including the witness (Site A recovery)
⁃ Witness Mode: VMs on Site A are paused. An administrator must unwitness in Site A and disable protection domains to resume VMs.
⁃ Automatic Resume Mode: VMs continue to run on Site A following an automatic disable of protection domains.
⁃ Manual Mode: VMs are paused and an administrator must disable protection domains in Site A.
• Connection loss between all sites, including the witness (Site B recovery)
⁃ Witness Mode: An administrator must unwitness in Site B and promote protection domains in Site B for VMs to restart.
⁃ Automatic Resume Mode: An administrator must promote protection domains in Site B for VMs to restart.
⁃ Manual Mode: An administrator must promote protection domains in Site B for VMs to restart.
• Rolling failure: Metro failure followed by Site A failure
⁃ Witness Mode: An administrator must unwitness in Site B and promote protection domains in Site B for VMs to restart.
⁃ Automatic Resume Mode: An administrator must promote protection domains in Site B for VMs to restart.
⁃ Manual Mode: An administrator must promote protection domains in Site B for VMs to restart.
• Storage-only outage on Site A
⁃ Witness Mode: The witness promotes the storage on Site B. The storage on Site A becomes inaccessible. If you enabled VMCP for APD, VMs fail over to Site B.
⁃ Automatic Resume Mode: All VMs pause until you fix the storage outage or manually promote the storage on Site B. After you promote storage on Site B, manually migrate VMs by generating a VMware HA event.
⁃ Manual Mode: All VMs pause until you fix the storage outage or manually promote the storage on Site B. After you promote storage on Site B, manually migrate VMs by generating a VMware HA event.

5.8. Asynchronous Snapshot Workflows


Nutanix supports the use of local and remote snapshots in combination with Metro Availability.
Administrators can schedule snapshots between the clusters used for Metro Availability as well
as to a third cluster outside of the Metro relationship, as shown in the following figure.

Figure 26: Metro Availability Three-Site Configuration

Snapshots for Metro Availability–enabled protection domains occur at the container level, which
means that each snapshot operation captures all files in the container. The system takes these
snapshots automatically to create checkpoints used for incremental resynchronization if the
Metro relationship becomes disabled. Prior to AOS 5.17, the default snapshot interval is four
hours; with AOS 5.17 and later, it is six hours.

Snapshot Scheduling
Administrators can take snapshots intended for backup and restoration manually or configure
them within a schedule. Schedules are configured against the cluster that has the Metro
Availability protection domain in the active state. You can use Prism to establish schedules that
create local snapshots in both the source and target Metro clusters. Optionally, you can select a
third remote site to retain snapshots outside of the Metro relationship. The following figure shows
these options.

Figure 27: Snapshot Schedule for a Metro Protection Domain

Snapshot Restore
Snapshots are restored either within the active Metro protection domain, or, in three-site
scenarios, against the asynchronous protection domain. You cannot restore snapshots within
the cluster that hosts the standby Metro protection domain. A snapshot contains all files in the
protection domain. Because the restoration process recovers all files, use a redirected restore.
A redirected restore recovers the files to a subdirectory in a container, preventing you from
overwriting any active files.
Perform restores from the nCLI using the following steps:
• Obtain the snapshot ID. For example, to get the snapshot ID for a protection domain called
metropd (using a bash shell):
ncli pd ls-snaps name=metropd | grep "Id\|Create Time"

• Restore the snapshot with redirection:
ncli pd restore-snapshot name=metropd snap-id=15113205 path-prefix=/temp

Note: If the path-prefix parameter is omitted, a folder with a starting name of Nutanix-Clone is
created automatically.

• Following a restore using the above example, a /temp directory is available in the metropd
datastore in the ESXi cluster. You can copy any folders and files required for recovery to other
locations if necessary.
• Register the restored VMs by adding the configuration files to inventory (see the sketch after this list).
• If you restore snapshots and register the VMs to a new location, ensure that the VMDK paths
for the VM point to the path specified as part of the restore operation.
• Delete all remaining folders and files that were not needed as a part of the restore.
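For example, the following sketch registers a restored VM directly on an ESXi host; the VM folder and .vmx name are illustrative, and you can also register VMs through the vSphere Client:
# Register a VM from the redirected restore location in the metropd datastore
vim-cmd solo/registervm /vmfs/volumes/metropd/temp/app01/app01.vmx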

Planned Third Site Failover


In three-site configurations you can migrate the entire protection domain to the asynchronous
remote site. Move the VMs to the third site using the following steps:
• Optional: Perform an incremental replication to shorten migration time.
pd add-one-time-snapshot name=metropd remote-sites=SiteC retention-time=86400

Note: Retention time is in seconds.

• Shut down the VMs.


• Optional: Unregister the VMs.
• Disable Metro Availability.
• Migrate the protection domain to the remote site.
⁃ Use the nCLI from the active Metro cluster:
ncli pd migrate name=metropd remote-site=SiteC

• Register the VMs in the third site.

Planned Third Site Failback


To perform a planned migration from the third site back to the active Metro site:
• Optional: Perform an incremental replication to shorten migration time.
pd add-one-time-snapshot name=metropd remote-sites=SiteC retention-time=86400

Note: Retention time is in seconds.

• Shut down the VMs in the third site.


• Optional: Unregister the VMs in the third site.
• Migrate the protection domain to the active Metro cluster.
⁃ Use the nCLI from the third site cluster:
ncli pd migrate name=metropd remote-site=SiteA

• Manage VM registration and power state.


⁃ If VMs remained registered during the planned failover, power them on in the active Metro
cluster. You may be prompted to confirm whether the VM was copied or moved. Selecting
moved maintains the same UUID for the VM.
⁃ If the VMs were unregistered during the planned failover, register them and power them
on in the active Metro cluster. You may need to reenable Metro first so both sides of the
relationship have the same view of the datastore.
• Reenable Metro Availability if you did not already do so in the previous step.

Unplanned Third-Site Failover


A worst-case scenario entails performing an unplanned failover to the third site. You can
accomplish this task using the following steps:
• Use the nCLI to activate the asynchronous protection domain in the third-site cluster:
ncli pd activate name=metropd

• Register the VMs in the third site.

Planned Third-Site Failback Following an Unplanned Failover: Original Cluster


The following procedure is for moving back to an existing Metro configuration following an
unplanned failover to the third site.
• If needed, shut down the VMs in the Metro site.
• Optional: Unregister the VMs in the Metro site.
• Ensure that Metro Availability is disabled.

• Deactivate the active Metro protection domain:


ncli pd deactivate-and-destroy-vms name=metropd

• Optional: Because Metro clusters have been offline, synchronize changes to shorten migration
time.
pd add-one-time-snapshot name=metropd remote-sites=SiteA retention-time=86400

Note: Retention time is in seconds.

• Shut down the VMs in the third site.


• Optional: Unregister the VMs in the third site.
• Migrate the protection domain to the active Metro cluster.
⁃ Use the nCLI from the third-site cluster:
ncli pd migrate name=metropd remote-site=SiteA

• Manage VM registration and power state.


⁃ If the VMs remained registered during this procedure, power them on in the active Metro
cluster. You may be prompted to confirm whether the VM was copied or moved. Selecting
moved maintains the same UUID for the VM.
⁃ If the VMs were unregistered during this procedure, register them and power them on in the
active Metro cluster. You may need to reenable Metro first so both sides of the relationship
have the same view of the datastore.
• Reenable Metro Availability if you did not already do so in the previous step.

Planned Third-Site Failback Following an Unplanned Failover: New Cluster


The following procedure is for moving back to a new Metro configuration following an unplanned
failover to the third site.
• Recreate the Remote Sites and Metro Availability relationship in the new clusters. Be sure to
use the same container name and protection domain name as in the third site.
• Disable Metro Availability.
• Deactivate the active Metro protection domain:
ncli pd deactivate-and-destroy-vms name=metropd

• Optional but recommended: Perform the initial full replication to shorten migration time.
pd add-one-time-snapshot name=metropd remote-sites=SiteA retention-time=86400

Note: Retention time is in seconds.

• Shut down the VMs in the third site.

• Optional: Unregister the VMs in the third site.


• Migrate the protection domain to the active Metro cluster.
⁃ Use the nCLI from the third-site cluster:
ncli pd migrate name=metropd remote-site=SiteA

• Reenable Metro Availability.


• Register the VMs and power them on in the active Metro cluster.

Site Recovery Manager (SRM) Third Site Support


The Nutanix Storage Replication Adapter (SRA) for SRM supports replication, test, and failover
automation between the active Metro cluster and an asynchronous third site. The Nutanix
Storage Replication Adapter for Site Recovery Manager Administration Guide outlines the
required configuration and operational steps with the following considerations:
• You do not need to issue a vStore protect command in any site.
• You must configure a vStore mapping between the cluster with the active protection domain
and the cluster at the third site. Do not configure a vStore mapping between the cluster with
the standby protection domain and the third site.
• To continue third-site replication on failover, add the vStore mapping after promoting a standby
protection domain to active and remove the mapping from the previous active side. Also
reconfigure the protection domain replication schedules.
• SRM test workflows are supported while Metro is enabled.
• SRM planned recovery workflows require that you disable Metro. These workflows include:
⁃ Failover between the active Metro cluster and the third-site cluster.
⁃ Replication from the third-site cluster to the active Metro cluster.
⁃ Failback from the third-site cluster to the active Metro cluster.
• Ensure that you reenable Metro following failback operations.
We recommend performing SRM recovery workflows without promoting the standby Metro
cluster.

6. Metro Availability Best Practices Checklist

6.1. Requirements
• 5 ms round-trip time latency maximum between active and standby Nutanix clusters.
⁃ Validated by the system when configuring Metro Availability protection domains.
• Active and standby container names must be identical between clusters.
⁃ Validated by the system when configuring Metro Availability protection domains.
• Datastore names must be the same as the container name.
• Ensure that all virtual disks associated with a given VM reside on a container enabled for
Metro Availability.

6.2. Interoperability
• The NX-2000 series is not supported.
• You can create local and remote snapshots between the Metro Availability–enabled
containers. You can configure a third Nutanix cluster to be the target for remote asynchronous
replication.
• Cluster configuration:
⁃ Cluster models can be different between active and standby sites.
⁃ vMotion may require Enhanced vMotion Compatibility (EVC).
⁃ Cluster resource sizes can be different.
⁃ Redundancy factor can be different.
• Container configuration settings:
⁃ Container-specific settings, such as compression and replication factor, can be different.
• Do not enable a proxy on remote sites that you use with a Metro Availability protection
domain.
• When linked clones are in a Metro Availability container, the gold image must reside in the
same container.

6.3. Limitations
Prior to AOS 5.17, Metro Availability performed a snapshot every four hours, which is supported
in nodes with up to 80 TB HDD tiers. With AOS 5.17 and later, the default snapshot interval is six
hours, which enables support for nodes with up to 120 TB HDD tiers. If you use a version before
5.17, you can modify the default snapshot schedule if needed to support denser nodes.
Before you enable Metro Availability on a container with VMs in an async DR protection domain,
delete the async DR protection domain or remove the VMs associated with the Metro container
from the protection domain.
Latency on vDisks might increase during the synchronization between two clusters.
If you disable Metro Availability, I/O from the standby site can’t be forwarded to the primary site.
Commands run on the hypervisors on the standby site may take additional time to run because
the hypervisors can’t access the underlying storage.
Restoring snapshots of a Metro protection domain with overwrite is not supported.
Symbolic links and hard links are not supported.
You can’t host VMs on the secondary cluster during the Metro enable operation for the same
container.
You can’t host VMs on the primary cluster during promotion of the secondary site for the same
container.

6.4. Nutanix Recommendations


• Redundant remote replication links between clusters.
• Similar performance characteristics between the clusters used for Metro Availability:
⁃ Similar number of nodes.
⁃ Similar server memory configuration.
⁃ Similarly sized oplog.
⁃ Similar drive count for oplog draining to the extent store.
• Adequate bandwidth to support peak write workloads.
• VMs should run locally to the active container where their data resides.
• Place no more than 3,600 files into a Metro Availability–enabled container. See the Alerts
section for more detail.
• Use the Nutanix Metro Availability witness where AOS is version 5.0 or greater.

• Create no more than 50 protection domains in a Metro Availability cluster pair configured for
witness failure handling.
• 200 ms round-trip latency or less between the witness and the Nutanix clusters participating in
Metro Availability.
• Use Nutanix alerts to monitor Metro Availability replication health.
• Enable remote site compression in bandwidth-limited environments.
• Manually take a snapshot of the protection domain before you disable Metro Availability to
ensure the most efficient synchronization when you reenable Metro Availability.
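For example, a minimal sketch using the one-time snapshot command shown in the operational scenarios; the protection domain name and retention are placeholders, and omitting the remote-sites parameter is assumed to create a local snapshot:
# Take a one-time snapshot (retention in seconds) before disabling Metro Availability
ncli pd add-one-time-snapshot name=metropd retention-time=86400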

6.5. VMware Recommendations


• Most Metro Availability configurations assume a layer 2 network across the two sites, but layer
2 is not a requirement.
⁃ Layer 3 is sufficient to enable Metro Availability, VMware vMotion, and VMware HA restart
capabilities. For example, the Nutanix nodes in site 1 could be on network 192.168.10.0/24,
while the Nutanix nodes in site 2 could be on 172.16.10.0/24.
⁃ Layer 2 helps enable seamless network failover for VMs with static assignments.
• VMware network port group names should be identical between the VMware hosts in each
Nutanix cluster.
• Use a single ESXi cluster spanning the two Nutanix clusters in each site.
• Use a single vCenter Server instance to manage the VMware clusters between sites.
⁃ Ideally, maintain the vCenter Server in a third site to allow cluster management regardless
of site failure.
⁃ Metro Availability can also protect and replicate the vCenter Server if a third site is not
available.
• Configure DRS affinity rules such that VMs are not migrated to hosts on the cluster that owns
the standby container.
⁃ Use should affinity rules to allow automated VM restart on nodes that own the standby
datastore.
⁃ Use must affinity rules if you do not want automated VM restart against the standby
datastore.
⁃ Manually modify must rules to allow VMs to restart against blocked hosts. Modifying rules
assumes the availability of the vCenter Server.
• Configure VM restart priorities as appropriate for the applications running in the environment.

• Configure VMware HA with the following settings:


⁃ Change the VM restart priority of all CVMs to disabled.
⁃ Change the host isolation response of all Controller VMs to either leave powered on or
disabled, depending on the version of vSphere you are using.
⁃ Modify user VMs as appropriate. General recommendation for user VMs is to choose
shutdown.
⁃ Change the VM monitoring setting for all Controller VMs to disabled.
• Configure datastore monitoring:
⁃ Choose Select only from my preferred datastores and select the datastores used for
Metro Availability.
⁃ If the VMware cluster has only one datastore, add the advanced option:
das.ignoreInsufficientHbDatastore=true

• Configure VMware HA admission control based on the cluster configuration.


⁃ Assuming a balanced configuration, one option is to set Percentage of cluster resources
reserved as failover spare capacity to 50 percent or lower. This setting ensures that
enough resources exist in a single Nutanix cluster to support both sites.

7. Metro Availability Best Practices (Detail)

7.1. Nutanix Platform Guidance


Acropolis Upgrade
Nutanix supports upgrading Metro Availability–enabled clusters online without having to pause
either steady-state replication or resynchronization traffic. We recommend upgrading one cluster
in the Metro Availability relationship at a time; after you have upgraded the first cluster, the
second can follow.

Networking Performance and Availability


Because Metro Availability is a synchronous replication solution, VM write performance typically
falls below that observed in standalone environments. The level of performance impact depends
on the round-trip latency that exists between the two Nutanix clusters, along with the total
bandwidth available for performing the replication between sites.
Metro Availability has been qualified and is supported in environments that have latency up to 5
ms RTT between the Nutanix clusters. Bandwidth requirements depend on the total concurrent
write workload of the active containers used with Metro Availability. If a customer chooses to run
VMs against standby containers, the total read workload for those VMs also impacts the total
bandwidth requirement.
Nutanix recommends enabling compression for the synchronous remote write operations to
reduce the total bandwidth required to maintain the synchronous relationship. Compression
can also reduce the bandwidth required to perform synchronizations when establishing new
protection domains or during other operational procedures.
For both performance and redundancy reasons, Nutanix recommends having at least two
network circuits between the Nutanix clusters. We also recommend having a layer 2 network
between the sites to stretch the subnets between the two clusters. A stretched subnet allows
seamless VM failover when using VMware HA and the existing IP addresses for each VM.

Cluster Hardware Configuration and Sizing


Metro Availability is a flexible solution that does not require identical hardware between the
two Nutanix clusters. While Metro Availability does not enforce exact configurations, the best
practice is to ensure a balanced configuration between the two sites. In the context of an active-
passive cluster, where VMs are only running in one cluster at a time, a balanced configuration
helps ensure that replication performs consistently, wherever the VMs reside. A balanced
configuration also makes the same CPU, memory, and storage resources available in both sites,
so VM performance is consistent when running in either cluster. Additionally, a slower standby
cluster (with a smaller oplog or fewer drives for oplog draining compared to the active site) could
adversely affect the write performance of VMs writing to the active container.
A balanced configuration simplifies resource constraint management, including VMware HA
admission control configuration. Further, when both clusters offer the same available space
between their respective storage pools, free-space monitoring and maintenance are easier.
Ideally, a balanced configuration means that containers between both sites have the same
capacity optimization settings so that space consumption is similar between the two clusters.
For environments in which VMs are running in both sites and replicated between the two clusters,
in addition to maintaining a balanced configuration, Nutanix recommends sizing to ensure that
all VMs can run in a single site and cluster. This sizing approach means that each of the Nutanix
clusters participating in the Metro Availability relationship has the CPU, memory, SSD, and HDD
resources to run all replicated VMs satisfactorily. Otherwise, during site maintenance or failure
events that require a single cluster to support the whole environment, performance degrades and
the cluster may operate at higher oversubscription ratios than desired.

Virtual Machine Locality


The platform achieves its best performance when running VMs locally to their respective
active containers. When VMs run locally, read performance is similar to non–Metro Availability
environments, and writes, as previously mentioned, are impacted by a single round trip between
the Nutanix clusters.
If a VM runs against a standby container, writes are forwarded to the active container and
additional round-trip latencies occur for the operation. The active site also services reads against
the standby container, which causes an additional round trip to service I/O. Reads are not
cached in CVMs hosting the standby container once completed, so any rereads must retrieve
the data from the active site again. Because of this additional performance overhead, Nutanix
recommends running VMs locally to the active containers.
In addition to performance considerations, should a temporary network outage between sites
occur, VMs running against standby containers can have their datastores become inactive
and unavailable, which causes them to fail and go offline. This possibility further reinforces our
recommendation to run VMs locally against their active containers.
You can automate adding VMs to specific DRS groups upon deployment using the vSphere
API or with tools such as the vSphere PowerCLI. The VMware community website has several
examples.
To account for temporary network outages, Nutanix recommends configuring failure handling so
that replication is disabled automatically (automatic resume or witness failure handling).
Otherwise, an extended network outage between sites causes running VMs to fail against their
active containers. The timeout used prior to the automatic disable should be less than any
application timeouts for the services running in the replicated VMs.
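For example, the following sketch sets a 10-second timeout with the nCLI command covered in the Utilities and Alerting section; the protection domain name and value are illustrative:
# Set the break-replication timeout (in seconds) that precedes an automatic disable
ncli pd update-break-replication-timeout name=metropd timeout=10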

Metro Availability replication to the standby site does not maintain VM-specific data locality
to a particular node in the standby cluster. When failing over a VM to the standby cluster and
promoting the standby container, any node in that cluster can service local reads. Over time, data
becomes resident on the node that owns the running VM. Therefore, you do not have to target a
specific node for a VM to use on failover to a promoted standby cluster.

7.2. VMware Guidance


VMware HA Cluster
Nutanix recommends configuring a single VMware HA cluster across two Nutanix clusters for
use with Metro Availability. One of the prime reasons for using a stretched-cluster configuration is
for the operational speed of recovery during unplanned events. The speed at which VMware HA
can restart VMs upon container promotion should generally outpace solutions that require you to
register and restart VMs across two different VMware clusters.
You can use a separate VMware cluster for each individual Nutanix cluster in combination with
the synchronous replication capabilities of Metro Availability. Such a configuration is possible and
is officially supported, but it is outside of the scope of this document.

Restart Priorities
You can configure VM restart priorities to control the order in which VMware HA restarts VMs.
Setting these priorities can help in correctly restarting multitiered applications that run across
multiple VMs. It is important to understand that the restart priority is based on the task of
powering on the VM and is not linked to application availability within the VM. Because of this
basis, it is possible for lower-priority VMs to be operational before higher-priority VMs.

Isolation Response
The isolation response setting for a Metro Availability environment should only be relevant when
individual nodes fail, not when a cluster is partitioned. When a cluster is partitioned (as with a
network outage between the sites), the nodes local to a site can communicate with each other
and respond to VMware HA election traffic. In this case, host isolation does not occur.
Host isolation does occur when an individual node loses its network connectivity and its
management traffic fails, including responses to election traffic and to the VMware HA leader.
Datastore heartbeating is likely to fail as well, given the converged networking of the Nutanix
platform. Because of this behavior, we generally recommend
configuring the isolation response to shutdown for any user VMs running on that server. Always
set Nutanix CVMs to leave powered on or disabled, depending on the version of vSphere you
use.

Datastore Heartbeating
We recommend configuring datastore heartbeating against containers that mount to all nodes
in a VMware HA cluster. When using Metro Availability, this means you configure datastore
heartbeating against the stretched containers enabled for replication. Given that the single
VMware HA leader in the cluster could be operating in either site, a perceived host failure in
the opposite site could be validated against the replicated container configured for datastore
heartbeating. The additional remote replication traffic caused by enabling datastore heartbeating
is minimal and should not be a performance concern.

vCenter Server Availability


When using a single VMware HA cluster as recommended, you only need one vCenter Server
instance. The vCenter Server is integral to managing the VMware HA cluster; therefore, its
placement and availability are of the utmost importance. We recommend making the vCenter
Server available in a failure domain separate from the Metro Availability clusters. A separate
failure domain generally means a third site with separate connectivity to the two sites that
represent the stretched cluster. This design allows the vCenter Server to remain online and be
available immediately if either site fails.
If a third site is not available, you can host the vCenter Server in the Metro Availability
environment and replicate it between the sites. You can configure the vCenter Server with should
affinity rules to allow automated restart if the site fails.

VMware DRS Affinity Rules


As detailed in the Virtual Machine Locality section, you should target VMs to run locally against
active containers. VMware DRS affinity rules can enforce this requirement. Nutanix recommends
using should run on rules to keep VMs local to active containers, while also allowing VMs to fail
over automatically to the standby site in the event of a larger failure. Using should run on rules
automates failover and restart once you have promoted standby containers, so these processes
do not depend on vCenter Server availability.
You can use must run on rules, but you need to modify them before you can restart VMs in a site-
failure scenario. vCenter Server availability impacts your ability to modify the DRS rules, so take
it into account when designing the solution.
Update DRS affinity rules in the event of a site failure, either immediately following the failure or
before you recover an existing environment. Updating the affinity rules helps ensure that VMs do
not move to a cluster unexpectedly during site recovery procedures to reestablish replication.
The Operational Scenarios and VMware Distributed Resource Scheduler (DRS) sections contain
additional information regarding DRS affinity rules.

8. Utilities and Alerting

8.1. REST API


The Nutanix REST API allows you to create scripts that run system administration commands
against the Nutanix cluster. With the API, you can use HTTP requests to get information about
the cluster and make changes to the configuration. Output from the commands is returned in
JSON format.
The REST API Explorer section of the Nutanix web console (Prism) contains a complete list of
REST API functions and parameters, including those related to managing Metro Availability.

Figure 28: REST API Explorer Access
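As a minimal sketch, the following request lists protection domains, including their Metro Availability status; the cluster address and credentials are placeholders, and you should confirm the exact endpoint path in the REST API Explorer for your AOS version:
# List protection domains through the v2.0 REST API
curl -k -u admin:password "https://cluster-vip:9440/PrismGateway/services/rest/v2.0/protection_domains/"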

8.2. nCLI Commands


You can download the nCLI from Prism and use it to manage the Nutanix cluster. The following
list contains the commands relevant to Metro Availability.
• protection-domain | pd
⁃ list | ls
⁃ list-replication-status | ls-repl-status
⁃ create | add
⁃ remove | rm
⁃ metro-avail-enable (includes reenable option)
⁃ metro-avail-disable
⁃ promote-to-active
⁃ update-break-replication-timeout
⁃ update-failure-handling
Example:
protection-domain update-break-replication-timeout name=DSHB_PD timeout=10
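A few additional examples from the same shell, shown as a sketch using the protection domain name from the example above; output fields vary by AOS version:
protection-domain ls-repl-status
protection-domain metro-avail-disable name=DSHB_PD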

8.3. PowerShell Commands


You can download PowerShell commands (referred to as “Cmdlets Installer”) from Prism and use
them to manage the Nutanix cluster. The following list contains the commands relevant to Metro
Availability.
• Get-NTNXProtectionDomain
⁃ Example:
$myPD=Get-NTNXProtectionDomain -name DSHB_PD
$myPD.metroAvail | ft -AutoSize
role remoteSite container status timeout
---- ---------- --------- ------ -------
Active POC01b DSHB Enabled 10

• Promote-NTNXProtectionDomainStretchCluster
• Start-NTNXProtectionDomainStretchCluster
⁃ Maps to enable
⁃ Includes reenable option
• Stop-NTNXProtectionDomainStretchCluster
⁃ Maps to disable
• Update-NTNXProtectionDomainStretchTimeout
• Update-NTNXStretchFailureHandling

8.4. Nutanix Cluster Check (NCC)


You can perform the following checks when running NCC on a Metro Availability–enabled cluster.
• Health_check >> Data_protection_checks >> Remote_site_checks >>
Remote_site_connectivity_check
⁃ Checks that the remote sites are reachable.
• Health check >> stretch_cluster_checks >> backup_snapshots_on_metro_secondary_check
⁃ Checks for snapshots on Metro secondary containers that are candidates for deletion and
space reclamation.
• Health check >> stretch_cluster_checks >> data_locality_check
⁃ Checks whether VMs are operating against data locally.
• Health check >> stretch_cluster_checks >> secondary_metro_pd_in_sync_check
⁃ Checks if the secondary protection domain site is synchronized with the primary.
• Health check >> stretch_cluster_checks >> stale_state_of_secondary_check
⁃ Checks if the secondary site configuration is stale.
• Health check >> stretch_cluster_checks >> unsupported_vm_config_check
⁃ Checks to ensure that all VM files are on the same stretched container.
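As a sketch, you can run the full suite or an individual check from a CVM; the invocation below assumes the standard ncc health_checks syntax with the module and check names listed above:
# Run the complete NCC health check suite
ncc health_checks run_all
# Run a single Metro-related check, for example the data locality check
ncc health_checks stretch_cluster_checks data_locality_check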

8.5. Alerts
The following Prism alerts are generated with Metro Availability.
• Information
⁃ If a reenable operation requires full resynchronization.
• Warning
⁃ When latency between sites is more than 5 ms for 10 seconds.
⁃ When the number of entities in a Metro Availability–protected container exceeds 3,600.
⁃ When the total number of entities in live containers and associated remote snapshots
exceeds 50,000.

Note: Metro Availability snapshots automatically fail when a container reaches this threshold.
Exceeding 3,600 entities also prevents you from enabling Metro Availability. Existing
synchronous replication would continue.

Note: Exceeding 50,000 entities, including snapshots, also prevents you from
enabling Metro Availability.

• Critical
⁃ When a protection domain is in a remote unreachable state.
⁃ When a protection domain is in a decoupled state.
⁃ When a protection domain fails to automatically disable.
• Licensing
⁃ Metro Availability requires the Ultimate software edition. If the Ultimate license is not
enabled, you see a license warning while configuring Metro Availability and in the Prism UI
after configuration.

9. Conclusion
Metro Availability represents two industry firsts for hyperconverged platforms: the first continuous
availability solution and the first synchronous storage replication solution. Like all Nutanix
features, management of Metro Availability is simple, intuitive, and built directly into the
software included with Acropolis. The simplified management and operation of Metro Availability
stands out compared to other more complex solutions offered with legacy three-tiered storage
architectures.


Appendix

About the Author


Mike McGhee has responsibilities around Nutanix Files, Nutanix Volumes, Metro Availability, Era,
and the Microsoft ecosystem of products including Hyper-V at Nutanix. Follow Mike on Twitter
@mcghem.

About Nutanix
Nutanix makes infrastructure invisible, elevating IT to focus on the applications and services that
power their business. The Nutanix enterprise cloud software leverages web-scale engineering
and consumer-grade design to natively converge compute, virtualization, and storage into
a resilient, software-defined solution with rich machine intelligence. The result is predictable
performance, cloud-like infrastructure consumption, robust security, and seamless application
mobility for a broad range of enterprise applications. Learn more at www.nutanix.com or follow us
on Twitter @nutanix.


List of Figures
Figure 1: Nutanix Data Protection Spectrum, Including Metro Availability......................... 6

Figure 2: Nutanix Enterprise Cloud OS Stack................................................................... 9

Figure 3: Metro Availability Overview.............................................................................. 11

Figure 4: VM Write to an Active Container with Replication Factor 2 at Both Sites......... 12

Figure 5: VM Write to a Standby Container with Replication Factor 2 at Both Sites........ 13

Figure 6: VM Read from a Standby Container................................................................ 14

Figure 7: Active Protection Domain Flowchart................................................................ 17

Figure 8: Standby Protection Domain Flowchart............................................................. 18

Figure 9: Metro Availability Overview with vMotion......................................................... 22

Figure 10: Prism: Protection Domain Example................................................................24

Figure 11: Initial Metro Availability Configuration.............................................................25

Figure 12: Planned Failover with Replication Reversal................................................... 26

Figure 13: Nondisruptive Migration with Manual or Automatic Resume Failure Handling. 27

Figure 14: Nondisruptive Migration with Witness Failure Handling................................. 28

Figure 15: Planned Failover Using Cold Migration.......................................................... 28

Figure 16: Network Failure Between Nutanix Clusters....................................................30

Figure 17: Network Failure General Workflow.................................................................31

Figure 18: Site Failure..................................................................................................... 33

Figure 19: Site Failure General Workflow........................................................................34

Figure 20: Initial Cluster State Following Site Recovery with Existing Configuration....... 36

Figure 21: Site Recovery with Original Clusters General Workflow.................................37


Figure 22: Moving Metro Availability to a New Cluster Pairing........................................38

Figure 23: Primary Site and Witness Loss Recovery...................................................... 39

Figure 24: Primary Site Complete Network Loss............................................................ 40

Figure 25: Complete Network Loss Recovery................................................................. 41

Figure 26: Metro Availability Three-Site Configuration.................................................... 44

Figure 27: Snapshot Schedule for a Metro Protection Domain....................................... 45

Figure 28: REST API Explorer Access............................................................................58


List of Tables
Table 1: Document Version History................................................................................... 7

Table 2: Failure Scenario Summary................................................................................ 42
