TB XC Appliance System Scalability 092514

Technical Brief: System Scalability

Dell XC Web-scale Converged Appliance

Powered by Nutanix software
Dell XC Series Web-scale Converged Appliances
integrate Nutanix web-scale software and Dell's proven
storage and x86 server platform to provide enterprise-class
features for virtualized environments. As a highly
differentiated converged infrastructure solution, the
XC Series consolidates compute and storage into a single
appliance, enabling application and virtualization teams to
quickly and simply deploy new workloads. This solution
enables data center capacity to be easily expanded one
node at a time, delivering linear and predictable scale-out
with pay-as-you-grow flexibility.

Data management in distributed systems

A key tenet in the design of massively scalable systems
is ensuring that each participating node manages only a
bounded amount of state, independent of cluster size.
Accomplishing this requires that there be no master node
responsible for maintaining all data and metadata in the
clustered system. This concept is paramount in truly
scalable architectures, and one that is very difficult to
retrofit into legacy architectures.
The Nutanix Distributed File System (NDFS) efficiently manages
all types of data in order to scale out capacity linearly for
both small and large-size clusters, and without loss of
node or cluster performance.

Configuration Data
NDFS stores cluster configuration data using a very
small in-memory database backed by solid-state drives
(SSDs). Three copies of this configuration database are
maintained in the cluster at all times. Importantly, there
are strict upper bounds that are honored to ensure that
this database never exceeds a few megabytes in size.
Even for a hypothetical million node cluster, the
database holding configuration data (such as identity of
participating nodes, health status of services, and so on)
for the entire cluster would only be a few megabytes in
size. In the event that one of the three participating nodes
fails or becomes unavailable, any other node in the cluster
can be seamlessly converted to a configuration node.

The most important and complex part of a file system is its
metadata. In a scalable file system, the amount of metadata
can potentially get very large. Further complicating the task,
it is not possible to hold the metadata centrally in a few
designated nodes or in memory.
NDFS employs multiple NoSQL concepts to scale the
storage and management of metadata. For example, the
system implements a NoSQL database called Cassandra
to maintain key-value pairs, where the key is the offset in a
particular virtual disk and the value represents the physical
locations of the replicas of that data in the cluster.
When a key needs to be stored, a consistent hash is used
to calculate the locations where the key and value will
be stored in the cluster. The consistent hash function is
responsible for uniformly distributing the load of storing
keys in the cluster. As the cluster grows or shrinks, the
ring self-heals and rebalances key storage responsibility
among the participating nodes. This ensures that every
node will be responsible for managing roughly the same
amount of metadata.
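
The placement scheme described above can be illustrated with a minimal sketch. It is not NDFS or Cassandra code: the HashRing class, the use of MD5, the virtual-node count, the replica count of three, and the "<vdisk id>:<offset>" key format are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def _hash(text: str) -> int:
    """Map a string to a point on the ring (MD5 is an arbitrary choice)."""
    return int(hashlib.md5(text.encode()).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring; not the NDFS implementation."""

    def __init__(self, nodes, vnodes=64, replicas=3):
        self.vnodes = vnodes      # ring points per node, to even out the load
        self.replicas = replicas  # distinct nodes that store each key
        self._ring = []           # sorted list of (point, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def nodes_for(self, key):
        """Walk clockwise from the key's point and collect distinct nodes."""
        owners = []
        if not self._ring:
            return owners
        distinct = len({n for _, n in self._ring})
        want = min(self.replicas, distinct)
        i = bisect.bisect(self._ring, (_hash(key), ""))
        while len(owners) < want:
            node = self._ring[i % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
            i += 1
        return owners


ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
key = "vdisk-17:1048576"      # hypothetical "<vdisk id>:<offset>" key
before = ring.nodes_for(key)
ring.remove_node(before[0])   # simulate losing the first owner
after = ring.nodes_for(key)
print(before, after)
```

In this sketch, removing a node changes only the keys that node owned: the two surviving owners of the sample key keep their roles and one new node joins them, which is the property that lets key responsibility rebalance without reshuffling the whole ring.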

Virtual machine data and I/O

Every XC converged appliance includes a Controller Virtual
Machine (CVM) to handle all data I/O operations for the
local hypervisor and guest VMs, and to serve as a gateway
to NDFS. This n-way controller model means the number
of CVMs scales evenly with the number of XC converged
appliances in the cluster, thus eliminating the possibility of
controller bottlenecks that occur in traditional arrays.
The NDFS architecture ensures that the amount of data
stored on a cluster node is directly proportional to the
amount of storage space on that node including both
SSD and HDD capacities. This by definition is bounded
and does not depend on the size of the cluster. A system
component called Curator, which is responsible for keeping
the cluster running smoothly, runs background map-reduce
tasks periodically to check for uneven disk utilization on the
nodes in the cluster in a distributed manner.
When a VM creates data, NDFS keeps one copy resident
on the local node for optimal performance, and distributes
a redundant copy across other nodes in the cluster.
Distributed replication enables quick recovery in the
event of a disk or node failure. All remote XC converged
appliances participate in the replication. This enables
higher I/O throughput for larger clusters, as there are more nodes
handling the replication.
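
The write placement described above can be sketched as follows. This is an illustration under stated assumptions, not NDFS code: the place_replicas function is hypothetical, a replication factor of two is assumed from the text's "one copy resident" plus "a redundant copy", and random selection of the remote node is an assumed policy (the source does not say how remote nodes are chosen).

```python
import random
from collections import Counter


def place_replicas(local_node, cluster_nodes, replication_factor=2, rng=random):
    """Return the nodes that hold a new piece of data: local copy first."""
    remotes = [n for n in cluster_nodes if n != local_node]
    needed = replication_factor - 1
    if len(remotes) < needed:
        raise ValueError("not enough remote nodes for the replication factor")
    # Assumed policy: pick remote replica holders uniformly at random.
    return [local_node] + rng.sample(remotes, needed)


rng = random.Random(0)  # seeded only so the sketch is repeatable
nodes = [f"node-{i}" for i in range(1, 6)]
remote_load = Counter()
for _ in range(1000):
    placement = place_replicas("node-1", nodes, rng=rng)
    remote_load[placement[1]] += 1
print(remote_load)
```

Because every remote node receives a share of the redundant copies, adding nodes adds replication targets, which matches the brief's point that larger clusters have more nodes handling the replication.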
If a VM moves to another node in response to a High
Availability (HA) or VM movement event, Curator
automatically migrates hot data to the node where the
VM is running. NDFS enables all of the cluster's storage
resources to be available to any host or any VM, but
without requiring all data to be local to that node.

Scalable vDisk-level locks

In traditional, non-converged architectures, I/O from a
single VM may arrive via multiple interfaces due to network
multi-pathing and controller load balancing. This forces
legacy file systems to use fine-grained locks to avoid
nonstop transitions of lock ownership between controllers
on writes, and massive chatter of invalidations on the
network. Such use of fine-grained locks in architectures
employing n-way controllers imposes scalability
bottlenecks that are nearly impossible to overcome.
NDFS eliminates this impediment to system scalability by
implementing locks sparingly, and only at a VM's file level
(vDisk on NDFS). NDFS queries the hypervisor and gathers
information on all the VMs running on the host, as well as
the files backing the virtual disks of the VMs. Each virtual disk,
or any other large file, is converted into a Nutanix vDisk that
is managed as a first-class citizen in the file system.
I/O for a particular VM is served by the local Controller
VM that is running on the host. That local controller VM
acquires the lock for all of the virtual disks backing the VM.
Because virtual disks are not typically shared with other
hosts, NDFS simply uses a vDisk-level lock. As such, there
are no invalidations or cache coherency issues. Even with
a very large number of VMs, NDFS effectively manages
locks with minimal overhead to ensure that the system can
still scale linearly.
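
A minimal sketch of vDisk-level locking, assuming a simple ownership table, may help make the contrast with fine-grained locks concrete. The VDiskLockTable class and the identifiers used below are hypothetical; the source does not describe how NDFS stores or negotiates lock ownership.

```python
class VDiskLockTable:
    """Hypothetical table with one coarse lock per vDisk, held by one CVM."""

    def __init__(self):
        self._owner = {}  # vdisk_id -> cvm_id

    def acquire(self, vdisk_id, cvm_id):
        """Grant the lock if it is free or already held by this CVM."""
        owner = self._owner.setdefault(vdisk_id, cvm_id)
        return owner == cvm_id

    def release(self, vdisk_id, cvm_id):
        """Release the lock only if this CVM actually holds it."""
        if self._owner.get(vdisk_id) == cvm_id:
            del self._owner[vdisk_id]

    def lock_count(self):
        return len(self._owner)


locks = VDiskLockTable()
vm_disks = ["vm42-disk0", "vm42-disk1"]  # vDisks backing one VM
granted = [locks.acquire(d, "cvm-host1") for d in vm_disks]
contended = locks.acquire("vm42-disk0", "cvm-host2")  # another host's CVM
print(granted, contended, locks.lock_count())
```

The number of locks in this sketch grows with the number of vDisks, not with the number of blocks written, so a VM issuing millions of writes still costs the table only one entry per virtual disk.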

Scalable alerting, monitoring and reporting

One of the more common oversights in designing scalable
systems is failing to build troubleshooting facilities that scale
with the system. When cluster services become unstable, it is
even more important to have an alert and event recording
system that does not buckle due to non-scalability.
NDFS implements a completely scalable alert system
supported by a service running on every node. All events
and alerts are recorded into a strictly consistent distributed
NoSQL database that is accessible via any node in the
cluster. Alerts and events are indexed at scale within
the NoSQL database and are easily accessed either via
the GUI or through a standards-based REST API.

Statistics and visualization

All distributed systems require reliable, live statistical
monitoring and reporting. NDFS scales the statistics-gathering
component to provide near real-time insights
to the cluster administrator. This is accomplished by
implementing a scale-out statistics database that leverages
the NoSQL key-value store.
Each host in the Dell XC cluster runs an agent that gathers
local statistics and periodically updates the NoSQL store.
When the GUI requests data in a sorted fashion (like the
top 10 CPU consuming VMs), the request is sent using a
map-reduce framework to all hosts in the cluster to get
live information in a scalable fashion. Each host is
responsible only for serving requests for its local statistics.
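
The sorted-query pattern described above can be sketched as a two-step map-reduce. The function names and the sample CPU figures are illustrative assumptions, not Nutanix APIs or measured data.

```python
import heapq


def local_top_n(vm_cpu, n):
    """Map step, run on each host: rank only that host's own VMs."""
    return heapq.nlargest(n, vm_cpu.items(), key=lambda kv: kv[1])


def cluster_top_n(per_host_results, n):
    """Reduce step: merge the short per-host lists into one ranking."""
    merged = (item for result in per_host_results for item in result)
    return heapq.nlargest(n, merged, key=lambda kv: kv[1])


# Hypothetical per-host CPU percentages keyed by VM name.
hosts = {
    "host-1": {"vm-a": 72.0, "vm-b": 15.5},
    "host-2": {"vm-c": 91.2, "vm-d": 40.1, "vm-e": 3.0},
    "host-3": {"vm-f": 55.0},
}
partials = [local_top_n(stats, 3) for stats in hosts.values()]
top3 = cluster_top_n(partials, 3)
print(top3)
```

Each host returns at most n entries no matter how many VMs it runs, so the work done by the node that merges the results is bounded by n times the number of hosts rather than by the total VM count.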
Nutanix software also provides cluster-wide statistics,
which are handled by a dynamically elected leader in the
cluster. Strict limits are enforced on the number of such
cluster-wide statistics to ensure overall scalability.

Avoiding single points of failure

A key NDFS design principle is to not require any fixed
special nodes to maintain cluster operation and services.
There are a few operations in the clustered environment,
however, which embody the notion of a leader. It is
necessary that this leader not be statically assigned,
otherwise that node reduces to a special node and
introduces a potential single-point-of-failure, which
inhibits scalability.
To overcome these drawbacks, Nutanix software
implements a dynamic leader election scheme. For all
functions and administrative roles in the cluster, services
on all hosts volunteer to be elected as leader. A leader
is elected efficiently using the distributed configuration
service implemented in the system.
It is not necessary, however, for the leaders of all
functions to be co-located on any given node. In fact, the
leaders are randomly distributed in order to spread the
leadership load among cluster nodes. When an elected
leader fails, either due to the failure of the service or the
host itself, a new leader is elected automatically from
among the healthy nodes in the cluster. This occurs in
a sub-second timeframe for clusters of any size. In other
words, there is no correlation between the time duration
to elect a new leader upon failure and the number of
nodes in a cluster. The newly elected leader automatically
assumes the responsibilities of the previous leader.
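
One common way to build dynamic leader election on top of a configuration service is for each volunteer to register with an increasing sequence number, with the lowest live number holding the role. The sketch below assumes that approach; the source does not say which election algorithm Nutanix software uses, and the ElectionRegistry class and role names are hypothetical.

```python
import itertools


class ElectionRegistry:
    """Hypothetical stand-in for the distributed configuration service."""

    def __init__(self):
        self._seq = itertools.count()
        self._candidates = {}  # role -> {node: sequence number}

    def volunteer(self, role, node):
        """A service on a node offers to lead the given role."""
        self._candidates.setdefault(role, {})[node] = next(self._seq)

    def fail(self, node):
        """Drop a failed node from every role it volunteered for."""
        for members in self._candidates.values():
            members.pop(node, None)

    def leader(self, role):
        """The live volunteer with the lowest sequence number leads."""
        members = self._candidates.get(role, {})
        return min(members, key=members.get) if members else None


registry = ElectionRegistry()
for node in ["node-a", "node-b", "node-c"]:
    registry.volunteer("curator-coordinator", node)
for node in ["node-c", "node-a", "node-b"]:
    registry.volunteer("stats-leader", node)

first = registry.leader("curator-coordinator")
registry.fail("node-a")  # the elected leader's host goes down
second = registry.leader("curator-coordinator")
print(first, second, registry.leader("stats-leader"))
```

Because volunteers for different roles register in different orders, the leaders for those roles land on different nodes, and a failure promotes the next volunteer without any statically assigned node.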

Strict consistency at scale using Paxos

Most NoSQL implementations sacrifice strict consistency
to gain better availability. For example, Facebook's
Cassandra and Amazon's Dynamo provide only eventual
consistency, which is not a viable option for true file
systems. Eventual consistency implies that if a piece
of data is written to one node in a cluster, the data will
become visible to another node in the cluster only
eventually. There is no guarantee that the data will be
immediately visible.

Paxos is a widely used protocol that builds consensus
among nodes in clustered systems. In the NDFS metadata
store, for each key there is a set of three nodes that
might have the latest value of the data. NDFS runs the
Paxos algorithm to obtain consensus on the latest value
for the key being requested. Paxos guarantees that the
most recent version of the data will get consensus. Strict
consistency is guaranteed even though the underlying
data store is based on NoSQL. For any key, the consensus
needs to be formed between only three nodes regardless
of cluster size.
The ability to achieve strict consistency using a scalable
NoSQL database enables NDFS to be a linearly scalable
file system.
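
To show why a three-node group is enough for consensus on a key, here is a minimal single-decree Paxos sketch with three acceptors. It is a textbook simplification, not the NDFS implementation: the Acceptor class, the propose function, the integer proposal numbers, and the sample values are assumptions, and the source does not describe how NDFS sequences successive updates to the same key.

```python
class Acceptor:
    """One of the three replicas responsible for a key (hypothetical)."""

    def __init__(self):
        self.alive = True
        self.promised = -1          # highest proposal number promised
        self.accepted_n = -1        # number of the accepted proposal, if any
        self.accepted_value = None

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if self.alive and n > self.promised:
            self.promised = n
            return (self.accepted_n, self.accepted_value)
        return None

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered promise was made."""
        if self.alive and n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors, n, value):
    """Run one proposal; return the chosen value, or None if no majority."""
    majority = len(acceptors) // 2 + 1
    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) < majority:
        return None
    # If any acceptor already accepted a value, adopt the newest one.
    prior_n, prior_value = max(promises, key=lambda p: p[0])
    if prior_n >= 0:
        value = prior_value
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


replicas = [Acceptor(), Acceptor(), Acceptor()]
chosen = propose(replicas, 1, "extent@node-a,node-c")
replicas[0].alive = False                 # one of the three nodes fails
later = propose(replicas, 2, "extent@node-b,node-d")
print(chosen, later)
```

With one replica down, the second proposer still reaches a majority of two and learns the value that was already chosen instead of overwriting it. The message count depends only on the three acceptors, which is consistent with the brief's point that consensus cost is independent of cluster size.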

Scalable map-reduce framework

For relatively mundane work that must be performed
by a typical file system, it is far more efficient and
scalable to complete common tasks in the background.
User performance benefits by offloading tasks from
the active I/O path to background processes. Typical
examples include disk scrubbing, disk balancing, offline
compression, metadata scrubbing and calculation of
free space.
NDFS implements a map-reduce framework, called
Curator, which performs these cluster-wide operations
at scale without impacting scalability. Since all nodes
participate and each handles a part of the Curator
responsibility, performance scales linearly as the size of
the cluster grows.
Each node is responsible for a bounded number of
map-reduce tasks, which are processed by all cluster nodes
in phases. A coordinating node is randomly elected and
performs only the lightweight task of coordinating the nodes
in the cluster.
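
A background task such as disk balancing can be sketched in map-reduce form. The example below is an assumption-laden illustration, not Curator code: the function names, the 10 percent tolerance, and the disk figures are all hypothetical.

```python
from statistics import mean


def map_disk_usage(node_name, disks):
    """Map step, run on each node: summarize that node's own disks only."""
    used = sum(d["used_gb"] for d in disks)
    capacity = sum(d["capacity_gb"] for d in disks)
    return node_name, used / capacity


def reduce_imbalance(mapped, tolerance=0.10):
    """Reduce step: flag nodes whose utilization strays from the mean."""
    avg = mean(u for _, u in mapped)
    return sorted(name for name, u in mapped if abs(u - avg) > tolerance)


# Hypothetical cluster: one SSD and one HDD per node.
cluster = {
    "node-1": [{"used_gb": 300, "capacity_gb": 400},
               {"used_gb": 500, "capacity_gb": 600}],
    "node-2": [{"used_gb": 100, "capacity_gb": 400},
               {"used_gb": 300, "capacity_gb": 600}],
    "node-3": [{"used_gb": 200, "capacity_gb": 400},
               {"used_gb": 400, "capacity_gb": 600}],
}
mapped = [map_disk_usage(name, disks) for name, disks in cluster.items()]
unbalanced = reduce_imbalance(mapped)
print(mapped, unbalanced)
```

Each node's map step touches only its own disks, so per-node work stays bounded as the cluster grows, and the coordinator's reduce step handles one small record per node.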

While eventual consistency works well for some systems, it will cause
corruption in storage file systems. It may appear as a
natural consequence that NoSQL systems are ill suited for
building scalable file systems. NDFS, however, achieves
strict consistency on top of NoSQL by implementing a
distributed version of the Paxos algorithm.

Learn More at Dell.com/XCconverged.

© 2015 Dell Inc. All rights reserved. Dell, the DELL logo, and the DELL badge are trademarks of Dell Inc. Other trademarks and trade names
may be used in this document to refer to either the entities claiming the marks and names or their products. Dell disclaims proprietary
interest in the marks and names of others. This document is for informational purposes only. Dell reserves the right to make changes
without further notice to any products herein. The content provided is "as is" and without express or implied warranties of any kind.
